COVID-19 & Machine Learning

Original article can be found here (source): Deep Learning on Medium

1: What are the main challenges in controlling the spread and death toll of COVID-19 in India and other countries?

— The problem is:

Lack of focused testing!!!

1.1: Why is testing important?

  • Testing allows infected people to know that they are infected. This can help them receive the care they need, and it can help them take measures to reduce the probability of infecting others. People who don’t know they are infected might not stay at home and thereby risk infecting others.
  • Testing is also crucial for an appropriate response to the pandemic. It allows us to understand the spread of the disease and to take evidence-based measures to slow down the spread of the disease.
  • Unfortunately, the capacity for COVID-19 testing is still low in many countries around the world, including India. For this reason, we still do not have a good understanding of the spread of the pandemic.

1.2: So what should the testing threshold be?

To answer this, we need to look at the visualization. The link to the entire IPython notebook is attached further down.

Data source:

2 Y-axis bar plot per million
  • Countries like China (Wuhan), South Korea, and Australia have so far been successful in flattening the COVID-19 curve because they increased their testing_per_million in proportion to their population_per_million.

Clearly, from the above visualization, we can conclude that to control the spread of COVID-19 we need testing_per_million ≥ population_per_million * 100, which simply means that a population of 1 million needs a capacity of at least 100 tests.
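The rule above can be checked with a few lines of arithmetic. This is only a minimal sketch: the function name `meets_testing_threshold`, the `factor` parameter, and the sample figures are hypothetical and are not taken from the notebook or from real country data. It assumes population_per_million means the population expressed in millions.

```python
def meets_testing_threshold(testing_per_million: float,
                            population_per_million: float,
                            factor: float = 100.0) -> bool:
    """Return True if testing capacity satisfies the article's rule of thumb:
    testing_per_million >= population_per_million * factor."""
    return testing_per_million >= population_per_million * factor


# Made-up figures purely for illustration (not real country data).
examples = {
    "country_a": (6000.0, 50.0),    # 6000 tests per million, 50 million people
    "country_b": (150.0, 1300.0),   # 150 tests per million, 1300 million people
}

for name, (tests, population) in examples.items():
    print(name, meets_testing_threshold(tests, population))
```

Keeping `factor` as a parameter makes it easy to try a different multiple if domain experts suggest one.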

Countries like India, the USA, the UK, Spain, and Italy have not been able to meet the above testing criterion and are still struggling with the COVID-19 pandemic, even with testing_per_million > population_per_million.

1.3: What are the challenges in increasing testing_per_million for India and other countries? And why is it needed in multiples?

  • India and other affected countries have limited health resources. Currently, India has eight doctors per 10,000 people, compared to 41 in Italy and 71 in South Korea.
  • In the early stage of the disease (the first 1–3 days), the viral load is too low to be detected. This results in false negatives, so each individual requires multiple tests. That is why testing_per_million should be at least 100 times population_per_million, which is what we can learn from the data of other successful countries.
  • Testing is a manual process, which results in handling and human errors.

2: My proposed idea/solution

So, if the problem is a lack of focused testing, then the solution to the COVID-19 pandemic is nothing but:

“Test Test Test & then Quarantine”

But, as I have already mentioned, the challenges in increasing the # of tests are not trivial, especially when India and other countries, even the USA, UK, Italy, and Iran, have very limited health resources and face the same problem.

In Computer Science, there is a very popular saying: “The best solutions are the simplest ones!”. We need to increase the # of tests per million in very little time, and here I am proposing my idea/solution to this problem:

We should prioritize testing with a Priority-based Automated Testing System (PbATS) using ML/AI. This not only helps in prioritizing testing but also helps in determining the priority for allocating limited healthcare services.

Using PbATS, we will categorize the population based on their input. Let’s say we categorize the population into 3 categories as follows:

Category 1 (Self-Quarantined): Those who are either not affected, or affected but not yet showing symptoms (the false negatives). This category mainly contains the unaffected plus some false-negative cases. Hence, they will need to undergo PbATS periodically for a certain duration.

Category 2 (Test-Priority): This category contains some false negatives who have started showing symptoms and hence need manual testing.

Category 3 (Healthcare services-Priority): This category consists of people who are highly likely to have been affected by COVID-19. Hence, they should be strictly quarantined and given priority for hospital services.

PbATS Workflow

Note: This PbATS mechanism will be an AI/ML approach with which we assign people to these 3 categories based on the historical data of all past COVID-19 patients. And, as the saying goes, “All Machine Learning models are wrong and only some are useful”. With this in mind, my solution includes multiple attempts for category 1 and category 3, as category 2 already has manual testing in place.

Also, the # of categories and the periodicity can be decided more accurately with the help of domain experts/COVID-19 experts.
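To make the workflow a little more concrete, here is a minimal sketch of how each category could be mapped to a next action. Everything in it is an assumption for illustration: the names `Category` and `next_action`, the 3-day re-check interval, and the cap of 5 re-checks are hypothetical placeholders that domain experts would need to set.

```python
from enum import Enum


class Category(Enum):
    SELF_QUARANTINED = 1      # Category 1
    TEST_PRIORITY = 2         # Category 2
    HEALTHCARE_PRIORITY = 3   # Category 3


# Hypothetical values; the real periodicity should come from domain experts.
RECHECK_INTERVAL_DAYS = 3
MAX_RECHECKS = 5


def next_action(category: Category, rechecks_done: int = 0) -> str:
    """Map a PbATS category to the next step in the workflow."""
    if category is Category.HEALTHCARE_PRIORITY:
        return "quarantine strictly and prioritize hospital services"
    if category is Category.TEST_PRIORITY:
        return "schedule a manual test"
    # Category 1: repeat the PbATS screening periodically for a limited duration.
    if rechecks_done < MAX_RECHECKS:
        return f"self-quarantine and repeat PbATS in {RECHECK_INTERVAL_DAYS} days"
    return "end periodic screening"


print(next_action(Category.SELF_QUARANTINED, rechecks_done=1))
```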

3: ML problem formulation & solution addressing the above challenges

Now, for the above solution to work, we need a robust PbATS. I will now demonstrate a potential approach to building this PbATS using Machine Learning techniques.

ML Problem formulation: We need to categorize the population based on their features (age, sex), symptoms, and past travel history.

Solution workflow: Once this PbATS is ready, each person needs to go through it by simply filling in their details digitally or via a volunteer (to minimize errors). Doctors can then easily prioritize testing/healthcare services and select whom to quarantine. This helps doctors and the government stop COVID-19 with minimal health resources.

Eventually, this PbATS mechanism will indirectly help the government and healthcare providers increase testing_per_million in line with population_per_million, as needed to stop the spread of COVID-19.

4: A potential ML approach/solution

4.1: Big question: data? For this, I collected and scraped data (using text mining techniques) from several sources and concatenated them. The data sources are as follows:

Here is what the raw data looks like:

sample raw data

4.2: After data cleaning:


4.3: Then, I did some feature engineering as follows:

  • Converted the age range (60–65) to its mean value (62.5)
  • Added date_past_onset_symptoms (dpos), which is the difference in the # of days b/w date_confirmation & date_onset_symptoms
  • Added a travel_history (th) flag based on an AND operation between travel_history_dates & travel_history_locations
  • Finally, merged all the features of each data point into one master-symptom/mixed-symptom. Why I have done this will become clearer later
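The feature engineering steps above can be sketched in plain Python. This is an illustration under assumptions, not the notebook's actual code: the helper names, the `dd.mm.yyyy` date format, the `age`, `sex`, and `symptoms` keys, and the exact layout of the master-symptom string are all hypothetical. Only the column names date_confirmation, date_onset_symptoms, travel_history_dates, and travel_history_locations come from the list above. Missing values are not handled here.

```python
from datetime import datetime

DATE_FMT = "%d.%m.%Y"  # assumed date format; adjust to the real data


def age_to_number(age: str) -> float:
    """Convert an age or age range to a number: '60-65' -> 62.5, '34' -> 34.0."""
    parts = [float(p) for p in age.replace("–", "-").split("-") if p.strip()]
    return sum(parts) / len(parts)


def days_past_onset(date_confirmation: str, date_onset_symptoms: str) -> int:
    """dpos: number of days between symptom onset and confirmation."""
    confirmed = datetime.strptime(date_confirmation, DATE_FMT)
    onset = datetime.strptime(date_onset_symptoms, DATE_FMT)
    return (confirmed - onset).days


def travel_flag(travel_history_dates: str, travel_history_locations: str) -> int:
    """th: 1 only if both travel dates AND travel locations are present."""
    return int(bool(travel_history_dates.strip())
               and bool(travel_history_locations.strip()))


def master_symptom(record: dict) -> str:
    """Merge all features of one data point into a single text string."""
    age = age_to_number(record["age"])
    dpos = days_past_onset(record["date_confirmation"],
                           record["date_onset_symptoms"])
    th = travel_flag(record["travel_history_dates"],
                     record["travel_history_locations"])
    return f"age-{age:g} {record['sex']} {record['symptoms']} dpos-{dpos} th-{th}"


# Made-up record for illustration only.
record = {
    "age": "60-65",
    "sex": "male",
    "symptoms": "fever, cough",
    "date_onset_symptoms": "20.01.2020",
    "date_confirmation": "25.01.2020",
    "travel_history_dates": "18.01.2020",
    "travel_history_locations": "Wuhan",
}
print(master_symptom(record))
```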

4.4: Final dataset:

final dataset

4.5: Natural Language Processing (NLP) on master symptom

Then, I applied standard NLP techniques to vectorize the master-symptom, using BoW+W2V (Word2Vec from gensim). I am using W2V because I need to cluster the symptoms based on their relationship with each other (not just similarity & counts), which helps in the clustering process.

I have used BoW instead of TF-IDF because our dataset does not have many rarely occurring words that need more importance.
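For readers unfamiliar with BoW, here is a minimal standard-library sketch of the idea: each master-symptom string becomes a vector of raw token counts over a shared vocabulary. The tokenizer regex and the helper names are assumptions for illustration. The Word2Vec part is not shown, since it depends on gensim and on training choices that are not described here.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and split into tokens, keeping compound tokens like 'dpos-5' or 'th-1'."""
    return re.findall(r"[a-z0-9]+(?:[-.][a-z0-9]+)*", text.lower())


def build_vocabulary(docs: list) -> dict:
    """Map every distinct token to a column index (sorted for reproducibility)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(doc: str, vocab: dict) -> list:
    """Raw count of each vocabulary token in the document (no TF-IDF weighting)."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocab]


# Made-up master-symptom strings for illustration only.
docs = [
    "fever cough dpos-5 th-1",
    "fever chills dpos-2 th-0",
]
vocab = build_vocabulary(docs)
vectors = [bow_vector(doc, vocab) for doc in docs]
print(list(vocab))
print(vectors)
```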

4.6: ML Clustering:

Finally, I did the clustering using KMeans++ (the most general-purpose choice). The no. of clusters we get using the elbow method is also 3 (coincidentally), but it can be changed with more domain knowledge!
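As a rough illustration of this step, the sketch below implements k-means with k-means++ seeding in plain Python and prints the inertia (within-cluster sum of squared distances) for several values of k, which is the quantity the elbow method inspects. It is not the notebook's code: the function names, the toy 2-D points, and the seed are assumptions, and in practice a library implementation would normally be used on the vectorized master-symptoms. The seeding assumes the points are not all identical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm with k-means++ seeding. Returns (labels, centers, inertia)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, inertia


# Toy 2-D points forming three loose groups (illustration only).
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

# Elbow method: look for the k after which inertia stops dropping sharply.
for k in range(1, 6):
    _, _, inertia = kmeans(points, k)
    print(k, round(inertia, 2))
```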

4.7: Results/Word Cloud:

3-clusters based master-symptoms, dpos: days-past-onset-symptoms, th: travel-history

Final Categories: As you can see from the above word clouds of master-symptoms, we can approximately categorize the at-risk population as:

Category 3 (Healthcare service-Priority): [Age between 30–75] + [fever, cough, respiratory, runny nose, sore throat, pneumonia, headache, chest tightness with dpos > 4] + [th-1 active travel histories mainly]

Category 2 (Test-Priority): [Age between 0–75] + [fever, cough, malaise, pneumonia, stiffness, joint, muscular soreness with dpos between 1–3] + [th-0/1 active or inactive travel histories]

Category 1 (Self-Quarantined): [Age between 0–75] + [fever, cough, weakness, diarrhea, dizziness, chills with dpos between 1–3] + [th-0 mostly inactive travel histories with some active travel histories also]

Learning: As we can see, the symptom patterns (mixed symptoms) in COVID-19 depend mainly on age and get more severe with dpos. My effort was only to detect these patterns in order to prioritize testing using PbATS.

Jupyter notebook

5: Is this a permanent solution?

Simply, no! The above solution only helps India and other countries stop the spread of COVID-19 with limited healthcare services, and hence helps in the containment of SARS-CoV-2.

The permanent solution is vaccines, which require weeks or even months to be ready. I believe that this gap can also be filled using ML/AI, for example by finding a vaccine from a combination of existing virus vaccines (Sars-1, Spanish flu, etc.) using ML/AI techniques.

Lastly, yes, the above approach addresses the problem of limited healthcare resources, but countries like India and others still need healthcare resources (beds/ventilators) for at least 30%-40% of the population per million for category 3.

Hi, I am Burhanuddin Bhopalwala. This blog is my very small contribution to the battle against COVID-19, especially for my home country, India.

Submitted to:

You can contact me directly on my email id: