COVID-19 Analysis

Analyse the COVID-19 data from Johns Hopkins University by yourself with Python (pandas, holoviews, matplotlib)

In this post, you will see how to access the COVID-19 dataset provided by Johns Hopkins University, and how to analyse it with Python.

You will learn how to:

  • preprocess the dataset with pandas
  • create interactive plots with holoviews
  • fit a simple exponential model to the data to predict the number of cases in the near future

The COVID-19 Dataset from Johns Hopkins University

The dataset can be found in this GitHub repository. To get it, just clone the repository:

git clone https://github.com/CSSEGISandData/COVID-19.git

cd COVID-19

It consists of three CSV files that are updated daily.

Installation

To run this tutorial, first install Anaconda (see Install Anaconda for Machine Learning and Data Science in Python).

Then, create a conda environment:

conda create -n covid19

Activate it:

conda activate covid19

And install holoviz (you'll also get pandas, matplotlib, and everything else you need):

conda install -c pyviz holoviz

Finally, start your jupyter notebook:

jupyter notebook
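Before going further, it can be worth checking from the first notebook cell that the environment actually contains what the tutorial needs. The snippet below is only a sketch: `missing_packages` is a hypothetical helper written for this post, and the package list is my reading of what the tutorial imports (bokeh is included because it is the holoviews backend used later).

```python
import importlib.util

# Packages used in this tutorial, by import name (an assumption, adjust as needed)
required = ['pandas', 'holoviews', 'bokeh', 'matplotlib']

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print('Missing packages:', missing_packages(required) or 'none')
```

If anything is listed as missing, the notebook is probably not running in the covid19 environment.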

Preprocessing the COVID-19 Dataset from Johns Hopkins University

The dataset consists of three CSV files that are updated daily:

  • time_series_19-covid-Confirmed.csv
  • time_series_19-covid-Deaths.csv
  • time_series_19-covid-Recovered.csv

We create three dataframes, one for each CSV file, and we store them in a dictionary:

In [1]:
import pandas as pd
datatemplate = 'csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-{}.csv'

fields = ['Confirmed', 'Deaths', 'Recovered']
dfs = dict()
for field in fields: 
    dfs[field] = pd.read_csv(datatemplate.format(field))
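`pd.read_csv` fails with a `FileNotFoundError` if the notebook was not started from the COVID-19 directory, or if the file layout in your clone differs from the one used here. A small guard like the following can make the cause easier to spot. It is a sketch: `find_missing` is a hypothetical helper, and the template and field names are simply copied from the cell above.

```python
from pathlib import Path

datatemplate = 'csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-{}.csv'
fields = ['Confirmed', 'Deaths', 'Recovered']

def find_missing(template, names, root='.'):
    """Return the expected CSV paths that do not exist under root."""
    paths = [Path(root) / template.format(name) for name in names]
    return [path for path in paths if not path.is_file()]

missing = find_missing(datatemplate, fields)
if missing:
    print('Missing files (are you in the COVID-19 directory?):')
    for path in missing:
        print('  ', path)
```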

Here is what we have for one of them:

In [2]:
dfs['Confirmed'].head()
Out[2]:
   Province/State  Country/Region      Lat      Long  1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  ...  3/2/20  3/3/20  3/4/20  3/5/20  3/6/20  3/7/20  3/8/20  3/9/20  3/10/20  3/11/20
0             NaN        Thailand  15.0000  101.0000        2        3        5        7        8        8  ...      43      43      43      47      48      50      50      50       53       59
1             NaN           Japan  36.0000  138.0000        2        1        2        2        4        4  ...     274     293     331     360     420     461     502     511      581      639
2             NaN       Singapore   1.2833  103.8333        0        1        3        3        4        5  ...     108     110     110     117     130     138     150     150      160      178
3             NaN           Nepal  28.1667   84.2500        0        0        0        1        1        1  ...       1       1       1       1       1       1       1       1        1        1
4             NaN        Malaysia   2.5000  112.5000        0        0        0        3        4        4  ...      29      36      50      50      83      93      99     117      129      149

5 rows × 54 columns

We need to do a bit of preprocessing before we can analyze these data. Indeed:

  • The case counts are stored in one column per date, which is not practical for analysis and display. We'd rather have one line per measurement.
  • The numbers of confirmed cases, deaths, and recovered patients are currently stored in different dataframes. We would prefer to have all the information in a single dataframe.
  • The column names are too long and painful to type.

Here is the code:

In [4]:
# loop on the dataframe dictionary
for field, df in dfs.items():
    # the first four columns are metadata, the others are dates
    date_cols = list(df.columns[4:])
    # group by country: sum the counts over states and average
    # the coordinates (summing latitudes would be meaningless)
    agg = {'Lat': 'mean', 'Long': 'mean'}
    agg.update({col: 'sum' for col in date_cols})
    df = df.groupby('Country/Region', as_index=False).agg(agg)
    # turn each measurement column into a separate line,
    # and store the results in a new dataframe
    df = df.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                 var_name='date', value_name='counts')
    # keep track of the quantity that is measured,
    # either Confirmed, Deaths, or Recovered
    df['quantity'] = field
    # change column names
    df.columns = ['country', 'lat', 'lon', 'date', 'counts', 'quantity']
    # replace the dataframe in the dictionary
    dfs[field] = df
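If the groupby and melt steps feel a bit abstract, it may help to run them on a tiny made-up table first. The frame below is a toy with invented values and only two date columns; it mimics the wide layout of the real files but is not taken from the dataset.

```python
import pandas as pd

# Toy data in the same wide layout as the real files (values are made up)
wide = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B'],
    '1/22/20': [1, 2, 0],
    '1/23/20': [3, 4, 1],
})

# one row per country, summing over the two "states" of country A
per_country = wide.groupby('Country/Region', as_index=False).sum()

# one row per (country, date) measurement
tidy = per_country.melt(id_vars=['Country/Region'],
                        var_name='date', value_name='counts')
print(tidy)
```

Each date column has become a set of rows, with the column name stored in `date` and the cell value in `counts`.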

Now, we can concatenate the three dataframes and look at the results:

In [5]:
dfall = pd.concat(dfs.values(), ignore_index=True)
dfall['date'] = pd.to_datetime(dfall['date'], format='%m/%d/%y')
dfall.head()
Out[5]:
       country       lat       lon        date  counts   quantity
0  Afghanistan   33.0000   65.0000  2020-01-22       0  Confirmed
1      Albania   41.1533   20.1683  2020-01-22       0  Confirmed
2      Algeria   28.0339    1.6596  2020-01-22       0  Confirmed
3      Andorra   42.5063    1.5218  2020-01-22       0  Confirmed
4    Argentina  -38.4161  -63.6167  2020-01-22       0  Confirmed
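With one line per measurement, selecting a country and comparing quantities side by side takes only a couple of calls. Here is a minimal sketch on a toy frame with the same column names as above; the country "A" and its counts are invented for illustration.

```python
import pandas as pd

# Toy frame with the same columns as the concatenated data (values are made up)
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A'],
    'date': pd.to_datetime(['2020-01-22', '2020-01-23'] * 2),
    'counts': [1, 3, 0, 1],
    'quantity': ['Confirmed', 'Confirmed', 'Deaths', 'Deaths'],
})

# select one country, then put each quantity in its own column
sel = toy[toy['country'] == 'A']
table = sel.pivot(index='date', columns='quantity', values='counts')
print(table)
```

The same two lines should work on the full dataframe by replacing `toy` with `dfall` and `'A'` with a real country name.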

Displaying COVID-19 Cases vs Time

For the display, I decided to use holoviews, which is a fast way to get nice interactive plots, with bokeh as a powerful backend.

First, we initialize holoviews:

In [6]:
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')