Analyse the COVID-19 data from Johns Hopkins University by yourself with Python (pandas, holoviews, matplotlib)
In this post, you will see how to access the COVID-19 dataset provided by Johns Hopkins University, and how to analyse it with Python.
You will learn how to:
The dataset can be found in this GitHub repository. To get it, just clone the repository:
git clone https://github.com/CSSEGISandData/COVID-19.git
cd COVID-19
It consists of three CSV files that are updated daily.
To run this tutorial, first install Anaconda as described in "Install Anaconda for Machine Learning and Data Science in Python".
Then, create a conda environment:
conda create -n covid19
Activate it:
conda activate covid19
And install holoviz (you'll also get pandas, matplotlib, and everything else you need):
conda install -c pyviz holoviz
Finally, start your jupyter notebook:
jupyter notebook
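Before going further, it can help to confirm from the first notebook cell that the environment really contains the packages used below. The following is only a minimal sketch: the helper name `missing_packages` and the list `REQUIRED` are made up for this illustration and are not part of holoviz or conda.

```python
import importlib.util

# Packages imported in this tutorial (hypothetical checklist for illustration).
REQUIRED = ["pandas", "matplotlib", "holoviews", "bokeh"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

If the list is not empty, the notebook was probably started outside of the covid19 environment.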
The dataset consists of three CSV files that are updated daily, one each for confirmed cases, deaths, and recoveries.
We create three dataframes, one for each CSV file, and we store them in a dictionary:
import pandas as pd
datatemplate = 'csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-{}.csv'
fields = ['Confirmed', 'Deaths', 'Recovered']
dfs = dict()
for field in fields:
    dfs[field] = pd.read_csv(datatemplate.format(field))
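The paths above are relative, so the notebook must run from the root of the cloned repository. As a rough sketch, assuming the same template and field names as in the code above, one could check that the three files are where they are expected before reading them. The helper `missing_files` is hypothetical and only serves this illustration.

```python
from pathlib import Path

# Same template and fields as in the tutorial code.
datatemplate = 'csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-{}.csv'
fields = ['Confirmed', 'Deaths', 'Recovered']


def missing_files(root='.'):
    """Return the expected CSV paths that do not exist under root."""
    paths = [Path(root) / datatemplate.format(field) for field in fields]
    return [str(path) for path in paths if not path.is_file()]


# An empty list means all three files were found.
print(missing_files())
```

A non-empty result usually means that the notebook was started from another directory, or that the files have been moved or renamed in the repository.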
Here is what we have for one of them:
dfs['Confirmed'].head()
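We need to do a bit of preprocessing before we can analyze these data. Indeed, the counts are split by state or province within a country, each date is stored in its own column, and the three quantities sit in separate dataframes.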
Here is the code:
# loop on the dataframe dictionary
for field, df in dfs.items():
    # group by country, to sum on states
    # (numeric_only=True skips text columns such as the state names)
    df = df.groupby('Country/Region', as_index=False).sum(numeric_only=True)
    # turn each measurement column into a separate line,
    # and store the results in a new dataframe
    df = df.melt(id_vars=['Country/Region', 'Lat', 'Long'],
                 value_name='counts')
    # keep track of the quantity that is measured
    # either Confirmed, Deaths, or Recovered
    df['quantity'] = field
    # change column names
    df.columns = ['country', 'lat', 'lon', 'date', 'counts', 'quantity']
    # replace the dataframe in the dictionary
    dfs[field] = df
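To see what the grouping and melting steps do, here is a small self-contained sketch on a toy dataframe. The numbers are made up, and the layout only mimics the one assumed for the real files (one row per state, one column per date).

```python
import pandas as pd

# Made-up counts in a wide layout: one row per state, one column per date.
wide = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", None],
    "Country/Region": ["China", "China", "Italy"],
    "Lat": [30.97, 40.18, 43.0],
    "Long": [112.27, 116.41, 12.0],
    "1/22/20": [10, 2, 0],
    "1/23/20": [15, 3, 1],
})

# Sum the numeric columns over the states of each country.
# Note that Lat and Long get summed too, so they are no longer
# meaningful for countries with several states.
by_country = wide.groupby("Country/Region", as_index=False).sum(numeric_only=True)

# Turn each date column into its own row (wide format to long format).
long = by_country.melt(id_vars=["Country/Region", "Lat", "Long"],
                       var_name="date", value_name="counts")
print(long)
```

The two rows for China collapse into one row per date, and the two date columns become four rows in total, two per country.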
Now, we can concatenate the three dataframes and look at the results:
dfall = pd.concat(dfs.values())
dfall['date'] = pd.to_datetime(dfall['date'])
dfall.head()
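Once everything is in a single long dataframe, selecting one time series is a matter of filtering on the country and quantity columns. The sketch below illustrates this on two tiny dataframes with made-up counts. It passes `ignore_index=True` and an explicit date format, which the code above does not do; both are choices made for this illustration only.

```python
import pandas as pd

# Two tiny long-format dataframes with made-up counts,
# shaped like the output of the preprocessing loop.
confirmed = pd.DataFrame({
    "country": ["Italy", "Italy"],
    "date": ["1/22/20", "1/23/20"],
    "counts": [0, 1],
    "quantity": "Confirmed",
})
deaths = pd.DataFrame({
    "country": ["Italy", "Italy"],
    "date": ["1/22/20", "1/23/20"],
    "counts": [0, 0],
    "quantity": "Deaths",
})

# Stack the dataframes; ignore_index avoids duplicated row labels.
toy = pd.concat([confirmed, deaths], ignore_index=True)

# Parse the month/day/year strings into real timestamps.
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%y")

# Select the confirmed cases for one country, indexed by date.
selection = toy[(toy["country"] == "Italy") & (toy["quantity"] == "Confirmed")]
series = selection.set_index("date")["counts"]
print(series)
```

With real timestamps in the date column, the selection sorts and plots chronologically rather than alphabetically.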
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')