COVID-19

Exploratory data analysis on COVID-19 pandemic

Posted on April 14, 2020

The coronavirus pandemic is an ongoing pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The outbreak started in Wuhan, China. The World Health Organization declared the outbreak to be a public health emergency of international concern and recognized it as a pandemic on the 11th of March 2020.

The pandemic has led to one of the biggest economic, political and social crises mankind has faced in decades. Effective measures were slow to be put into practice because so much about the virus was unknown or uncertain. Early information was of very mixed quality, and a lot of incorrect and false news was spread.

The purpose of this notebook is to build an understanding of the virus through data. By no means should this analysis be used for scientific purposes, as that would require accounting for a large number of variables (quarantines, quality of the medical resources deployed, environmental measures, etc.).

1. First look at the datasets

We will initially be using the following datasets throughout the analysis:

  • time_series_covid19_confirmed_global.csv
  • time_series_covid19_deaths_global.csv
  • time_series_covid19_recovered_global.csv

These come from Johns Hopkins University's repository on GitHub and have kindly been made available to the public for educational and academic research purposes. We will first import all the libraries used in this study and then load the data.

In [1]:
import itertools
from datetime import date
from datetime import timedelta
import numpy as np
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
_ = sns.set()
In [2]:
confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
confirmed_df.head()
Out[2]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/24/20 2/25/20 2/26/20 ... 3/8/20 3/9/20 3/10/20 3/11/20 3/12/20 3/13/20 3/14/20 3/15/20 3/16/20 3/17/20 3/18/20 3/19/20 3/20/20 3/21/20 3/22/20 3/23/20 3/24/20 3/25/20 3/26/20 3/27/20 3/28/20 3/29/20 3/30/20 3/31/20 4/1/20 4/2/20 4/3/20 4/4/20 4/5/20 4/6/20 4/7/20 4/8/20 4/9/20 4/10/20 4/11/20 4/12/20 4/13/20 4/14/20 4/15/20 4/16/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 ... 4 4 5 7 7 7 11 16 21 22 22 22 24 24 40 40 74 84 94 110 110 120 170 174 237 273 281 299 349 367 423 444 484 521 555 607 665 714 784 840
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 2 10 12 23 33 38 42 51 55 59 64 70 76 89 104 123 146 174 186 197 212 223 243 259 277 304 333 361 377 383 400 409 416 433 446 467 475 494 518
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 ... 19 20 20 20 24 26 37 48 54 60 74 87 90 139 201 230 264 302 367 409 454 511 584 716 847 986 1171 1251 1320 1423 1468 1572 1666 1761 1825 1914 1983 2070 2160 2268
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 2 39 39 53 75 88 113 133 164 188 224 267 308 334 370 376 390 428 439 466 501 525 545 564 583 601 601 638 646 659 673 673
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 3 3 4 4 5 7 7 7 8 8 8 10 14 16 17 19 19 19 19 19 19 19 19 19

5 rows × 90 columns

In [3]:
fatalities_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
fatalities_df.head()
Out[3]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/24/20 2/25/20 2/26/20 ... 3/8/20 3/9/20 3/10/20 3/11/20 3/12/20 3/13/20 3/14/20 3/15/20 3/16/20 3/17/20 3/18/20 3/19/20 3/20/20 3/21/20 3/22/20 3/23/20 3/24/20 3/25/20 3/26/20 3/27/20 3/28/20 3/29/20 3/30/20 3/31/20 4/1/20 4/2/20 4/3/20 4/4/20 4/5/20 4/6/20 4/7/20 4/8/20 4/9/20 4/10/20 4/11/20 4/12/20 4/13/20 4/14/20 4/15/20 4/16/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 2 4 4 4 4 4 4 4 6 6 7 7 11 14 14 15 15 18 18 21 23 25 30
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 4 5 5 6 8 10 10 11 15 15 16 17 20 20 21 22 22 23 23 23 23 23 24 25 26
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 2 3 4 4 4 7 9 11 15 17 17 19 21 25 26 29 31 35 44 58 86 105 130 152 173 193 205 235 256 275 293 313 326 336 348
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 3 3 3 6 8 12 14 15 16 17 18 21 22 23 25 26 26 29 29 31 33 33
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

5 rows × 90 columns

In [4]:
recovered_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
recovered_df.head()
Out[4]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/24/20 2/25/20 2/26/20 ... 3/8/20 3/9/20 3/10/20 3/11/20 3/12/20 3/13/20 3/14/20 3/15/20 3/16/20 3/17/20 3/18/20 3/19/20 3/20/20 3/21/20 3/22/20 3/23/20 3/24/20 3/25/20 3/26/20 3/27/20 3/28/20 3/29/20 3/30/20 3/31/20 4/1/20 4/2/20 4/3/20 4/4/20 4/5/20 4/6/20 4/7/20 4/8/20 4/9/20 4/10/20 4/11/20 4/12/20 4/13/20 4/14/20 4/15/20 4/16/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 5 5 10 10 10 15 18 18 29 32 32 32 32 32 40 43 54
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 10 17 17 31 31 33 44 52 67 76 89 99 104 116 131 154 165 182 197 217 232 248 251 277
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 8 8 12 12 12 12 12 32 32 32 65 65 24 65 29 29 31 31 37 46 61 61 62 90 90 90 113 237 347 405 460 591 601 691 708 783
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 10 10 10 10 16 21 26 31 39 52 58 71 71 128 128 128 169 169
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 2 2 2 2 2 2 2 4 4 4 5 5 5

5 rows × 90 columns

The datasets contain the daily cumulative number of confirmed cases, fatalities and recoveries respectively. Some countries are also broken down by province or state.

1. 1. Merging the data

The structure of the data is not well suited to analysis. Let's create a function to change that.

In [ ]:
def reshape_data(df):
    to_keep = ["Country/Region", "Province/State", "Lat", "Long"]
    to_change = set(df.columns) - set(to_keep)
    df = pd.melt(df, 
                 id_vars=to_keep, 
                 value_vars=to_change, 
                 var_name='date',
                 value_name='total')
    return df

The function reshape_data() melts the DataFrame in order to remove the dates from the columns. We can apply it to the three DataFrames and then merge them together.
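
To make the effect of melting concrete, here is a minimal sketch on a toy table. The column layout mimics the JHU files, but the two countries and their values are made up for illustration.

```python
import pandas as pd

# Toy wide-format table: one column per date (values are invented).
wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

# After melting there is one row per (country, date) pair
# instead of one column per date.
long = pd.melt(wide, id_vars=["Country/Region"],
               var_name="date", value_name="total")
print(long)
```

Each date column becomes a block of rows, so a table with 2 countries and 2 dates turns into 4 rows.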

In [6]:
def merge_dfs(df1, df2, df3):
    merged = pd.merge(reshape_data(df1), reshape_data(df2), how='inner', on=['Country/Region', 'Province/State', 'Lat', 'Long', 'date'])
    merged = pd.merge(merged, reshape_data(df3), how='inner', on=['Country/Region', 'Province/State', 'Lat', 'Long', 'date'])

    merged.columns = ['country', 'province', 'lat', 'long', 'date', 'confirmed', 'fatalities', 'recovered']
    merged['date'] = pd.to_datetime(merged['date'])
    merged = merged.sort_values(['country', 'province', 'date'])

    return merged

merged = merge_dfs(confirmed_df, fatalities_df, recovered_df)
merged.head(3)
Out[6]:
country province lat long date confirmed fatalities recovered
12792 Afghanistan NaN 33.0 65.0 2020-01-22 0 0 0
15498 Afghanistan NaN 33.0 65.0 2020-01-23 0 0 0
19188 Afghanistan NaN 33.0 65.0 2020-01-24 0 0 0

Great! Before moving on to some descriptive statistics we will also apply a few extra changes.

1. 2. Editing features

In [7]:
print("Countries with Province/State informed: ", merged[merged['province'].notna()]['country'].unique())
Countries with Province/State informed:  ['Australia' 'China' 'Denmark' 'France' 'Netherlands' 'United Kingdom']

We can see that a province or state is given for only 6 countries, so we will aggregate it away for now.

We will come back to it if we analyse those countries separately.

In [8]:
def drop_province(df):
    # Sum the provinces of each country; keep the first lat/long as a representative location.
    df = df.groupby(['country', 'date']).agg({'lat': 'first',
                                              'long': 'first',
                                              'confirmed': 'sum',
                                              'fatalities': 'sum',
                                              'recovered': 'sum'}).reset_index()
    return df

df = drop_province(merged)
df.head()
Out[8]:
country date lat long confirmed fatalities recovered
0 Afghanistan 2020-01-22 33.0 65.0 0 0 0
1 Afghanistan 2020-01-23 33.0 65.0 0 0 0
2 Afghanistan 2020-01-24 33.0 65.0 0 0 0
3 Afghanistan 2020-01-25 33.0 65.0 0 0 0
4 Afghanistan 2020-01-26 33.0 65.0 0 0 0
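
As a small illustration of why the provinces are summed rather than simply dropped, here is a sketch on invented figures: country X is reported in two provinces, country Y as a whole.

```python
import pandas as pd

# Invented figures: X is split into two provinces, Y is not.
rows = pd.DataFrame({
    "country": ["X", "X", "Y"],
    "province": ["p1", "p2", None],
    "date": pd.to_datetime(["2020-01-22"] * 3),
    "confirmed": [5, 7, 3],
})

# Summing per (country, date) folds the provinces into a national total.
per_country = rows.groupby(["country", "date"], as_index=False)["confirmed"].sum()
print(per_country)
```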

We will also add some extra features that will be useful for our EDA: the daily number of new confirmed cases, fatalities and recoveries.

In [ ]:
def edit_features(df):
    df['daily_confirmed'] = df.groupby(['country'])['confirmed'].diff()
    df['daily_fatalities'] = df.groupby('country')['fatalities'].diff()
    df['daily_recovered'] = df.groupby('country')['recovered'].diff()

    mask = (df['date'] == df['date'].min())  # first day has no previous value, so use the cumulative total
    df.loc[df.index[mask], 'daily_confirmed'] = df.loc[df.index[mask], 'confirmed']
    df.loc[df.index[mask], 'daily_fatalities'] = df.loc[df.index[mask], 'fatalities']
    df.loc[df.index[mask], 'daily_recovered'] = df.loc[df.index[mask], 'recovered']
    return df

df = edit_features(df)
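
The idea behind the daily features can be seen on a toy example. This is only a sketch with made-up numbers; it uses fillna as a compact alternative to the explicit mask, which is an assumption of this example and not what the function above does.

```python
import pandas as pd

# Made-up cumulative totals for two countries.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "confirmed": [1, 4, 9, 0, 2, 2],
})

# diff() is computed within each country, so B's first row is not
# compared with A's last row. The first row of each country has no
# previous day (NaN), so we fall back to its cumulative value.
toy["daily_confirmed"] = (
    toy.groupby("country")["confirmed"].diff().fillna(toy["confirmed"])
)
print(toy)
```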

Nice, now that the data is ready we can start with some descriptive statistics and EDA.

In [10]:
print("Number of countries: ", df['country'].nunique())
print("Dates: from", min(df['date']).date(), "to", max(df['date']).date(), "(total of", df['date'].nunique(), "days)")
Number of countries:  181
Dates: from 2020-01-22 to 2020-04-16 (total of 86 days)
In [11]:
df.describe()
Out[11]:
lat long confirmed fatalities recovered daily_confirmed daily_fatalities daily_recovered
count 15566.000000 15566.000000 15566.000000 15566.000000 15566.000000 15385.000000 15385.000000 15385.00000
mean 19.646433 15.894324 2234.376333 122.492098 550.744443 137.874618 9.263828 34.60338
std 23.140321 57.398398 18642.130279 1138.447470 4963.797603 1215.120885 88.864624 301.74638
min -40.900600 -102.552800 0.000000 0.000000 0.000000 -15.000000 -31.000000 -322.00000
25% 5.152100 -9.429500 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000
50% 17.607800 19.145100 1.000000 0.000000 0.000000 0.000000 0.000000 0.00000
75% 40.000000 44.000000 70.000000 1.000000 3.000000 6.000000 0.000000 0.00000
max 64.963100 178.065000 667801.000000 32916.000000 78401.000000 35098.000000 4591.000000 10980.00000

There seem to be some incorrect entries, as we get negative values in daily_confirmed, daily_fatalities and daily_recovered.

These datasets rely upon publicly available data from multiple sources that do not always agree. I do not yet know what the cause is but, as we can see from the following records for Algeria, the number of recovered drops on the 24th of March and then goes back to the original value the following day.

In [12]:
df[df['country']=='Algeria'].set_index('date').loc['2020-03-23':'2020-03-25']
Out[12]:
country lat long confirmed fatalities recovered daily_confirmed daily_fatalities daily_recovered
date
2020-03-23 Algeria 28.0339 1.6596 230 17 65 29.0 0.0 0.0
2020-03-24 Algeria 28.0339 1.6596 264 19 24 34.0 2.0 -41.0
2020-03-25 Algeria 28.0339 1.6596 302 21 65 38.0 2.0 41.0
In [13]:
incorrect_entries = len(df[(df['daily_confirmed']<0) | (df['daily_fatalities']<0) | (df['daily_recovered']<0)])
print('Number of incorrect entries:', incorrect_entries, '(', round(incorrect_entries/len(df)*100, 2), '% of total data)')
Number of incorrect entries: 45 ( 0.29 % of total data)

As the incorrect entries make up only a small portion of the data, I will be removing them for now.

In [ ]:
def get_incorrect_entries(df):

    mask = ((df['confirmed'].diff() < 0) | 
            (df['fatalities'].diff() < 0) | 
            (df['recovered'].diff() < 0)) & (df['date'] != min(df['date']))  

    return df[mask].index
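
The function above only returns the index of the suspicious rows. As a rough sketch of how such rows could then be dropped, here is a toy version with figures modelled on the Algeria example; the per-country diff and the variable names are assumptions of this sketch.

```python
import pandas as pd

# Toy series modelled on the Algeria records: the cumulative number
# of recovered drops on one day and then returns to its old value.
toy = pd.DataFrame({
    "country": ["A"] * 4,
    "date": pd.date_range("2020-03-22", periods=4),
    "recovered": [65, 65, 24, 65],
})

# A cumulative total should never decrease, so flag the rows where it does.
decreased = toy.groupby("country")["recovered"].diff() < 0

# Drop the flagged rows by index.
cleaned = toy.drop(toy[decreased].index)
print(cleaned)
```

Note that only the day with the drop is flagged; the following day's jump back up is positive and is therefore kept.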

Great! Let's start plotting.

2. EDA

We will start the exploratory data analysis by studying the worldwide situation. Although this gives us a collective picture of the pandemic, it is important to note that the virus did not reach every part of the world at the same time. This strongly affects the results, so we will also look at a few cases independently.

2. 1. Worldwide cases

The first plots will focus on the global situation, all countries in the dataset will be taken into consideration.

2. 1. 1. Total and daily cases

In [15]:
by_date = df.groupby('date').sum()

def plot_total_and_daily(df):

  fig, ax = plt.subplots(1, 2, figsize=(18,8))

  ax[0].plot(df.daily_confirmed, label='Confirmed', color='orange')
  ax[0].plot(df.daily_fatalities, label='Fatalities', color='r')
  ax[0].plot(df.daily_recovered, label='Recovered', color='g')
  ax[0].legend()

  ax[1].plot(df.confirmed, color='orange')
  ax[1].plot(df.fatalities, color='r')
  ax[1].plot(df.recovered, color='g')
  ax[1].fill_between(df.index, df.confirmed, df.recovered, label='Confirmed', color='orange')
  ax[1].fill_between(df.index, df.recovered, df.fatalities, label='Recovered', color='g')
  ax[1].fill_between(df.index, df.fatalities, 0, label='Fatalities', color='r')
  ax[1].ticklabel_format(axis='y', style='plain')
  ax[1].legend()

  return fig, ax

fig, ax = plot_total_and_daily(by_date)
ax[0].set_title('Daily new cases worldwide')
ax[1].set_title('Cumulative number of cases worldwide')
plt.show()

Observations: The global curve shows a rich fine structure, but these numbers are strongly affected by China, the country where the outbreak began. Given that COVID-19 started there, there was no reliable information about the real number of infections during the initial expansion of the virus. In fact, the criteria for counting a case as infected were modified around the 11th of February 2020, which strongly perturbed the curve, as we can see from the figure.

In [16]:
by_date_no_china = df[df['country']!='China'].groupby('date').sum()

fig, ax = plot_total_and_daily(by_date_no_china)
ax[0].set_title('Daily new cases worldwide excluding China')
ax[1].set_title('Cumulative number of cases worldwide excluding China')
plt.show()

Observations: In this case the general behavior looks cleaner, and in fact the curve resembles a typical epidemiological model such as SIR. SIR models show a steep increase in the number of infections that, once the peak of contagion is reached, decreases with a gentler slope.
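
To make the comparison with SIR more concrete, here is a minimal sketch of a discrete-time SIR simulation. The function name and all parameter values (beta, gamma, population size, initial infections) are arbitrary assumptions chosen for illustration; nothing here is fitted to the data in this notebook.

```python
import numpy as np

def simulate_sir(beta=0.3, gamma=0.1, n=1_000_000, i0=10, days=300):
    """Daily-step SIR model; returns the number of infected people per day."""
    s, i, r = n - i0, i0, 0.0
    infected = []
    for _ in range(days):
        new_infections = beta * s * i / n   # contacts between S and I
        new_recoveries = gamma * i          # a fixed share of I recovers
        s = s - new_infections
        i = i + new_infections - new_recoveries
        r = r + new_recoveries
        infected.append(i)
    return np.array(infected)

infected = simulate_sir()
peak_day = int(infected.argmax())
print("Infections peak on day", peak_day)
```

The number of infected rises while each infected person causes more than one new infection and falls once the pool of susceptible people is depleted.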

2. 1. 2. Confirmed, recovered and fatalities on a daily basis

In [17]:
PALETTE = itertools.cycle(sns.color_palette())

def plot_in_cols(df, to_plot=[]):

    n_plots = len(to_plot)

    fig, ax = plt.subplots(1, n_plots, figsize=(n_plots*6,8))

    for i, col in enumerate(to_plot):
        ax[i].plot(df[col], label=col.replace('_', ' '), color=next(PALETTE))
        ax[i].set_title('Number of ' + col.replace('_', ' '))
        ax[i].xaxis.set_major_locator(mdates.MonthLocator())
        ax[i].ticklabel_format(axis='y', style='plain')


    return fig, ax

fig, ax = plot_in_cols(by_date, to_plot=['daily_confirmed', 'daily_fatalities', 'daily_recovered'])

Observations: The plots above show a clear upward trend.

2. 1. 3. Growth Factor

The growth factor is the factor by which a quantity multiplies itself over time. The formula used here is each day's new cases / new cases on the previous day. For example, a quantity growing by 7% every period (in this case daily) has a growth factor of 1.07.

A growth factor above 1 indicates an increase, whereas one that stays between 0 and 1 is a sign of decline, with the quantity eventually becoming zero. A growth factor that stays constantly above 1 could signal exponential growth.
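
As a small worked example of the definition, here is a sketch with invented daily counts: new cases grow by 7% for two days and then fall. Dividing by shift(1) pairs each day with the previous day's value.

```python
import pandas as pd

# Invented daily new cases: +7%, +7%, then a drop.
daily_new = pd.Series(
    [100, 107, 114.49, 100],
    index=pd.date_range("2020-03-01", periods=4),
)

# shift(1) moves every value one day forward, so each day is
# divided by the previous day's new cases.
growth_factor = daily_new / daily_new.shift(1)
print(growth_factor)
```

The first day has no previous value, so its growth factor is undefined (NaN); the next two days give 1.07 and the last day gives a value below 1.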

In [18]:
def plot_growth_factor(df):
    
    fig, ax = plt.subplots(figsize=(9, 9))

    growth_factor = df['daily_confirmed'] / df['daily_confirmed'].shift(1)  # shift(1) pairs each day with the previous day

    ax.plot(df.index, growth_factor, marker='o')
    ax.axhline(0)
    ax.axhline(1, linewidth=2, color='r')
    ax.set_ylabel('New cases / new cases on previous day')

    return fig, ax

fig, ax = plot_growth_factor(by_date)
ax.set_title('Global growth factor')
plt.show()

Observations: The curve seems to be tending to one, which suggests that the number of daily new cases is levelling off rather than growing exponentially (that would require a growth factor staying above one). There are quite a few peaks in February; this could be due to the fact that during the initial expansion of the virus there was no reliable information about the real number of infections. Some of the incorrect records we saw in the dataset might also be influencing the plot.

2. 1. 4. Active cases by date

By subtracting fatalities and recoveries from the total cases, we get the "currently infected cases" or "active cases" (cases still awaiting an outcome).

In [19]:
def get_active_cases(df):
    active_cases = df['confirmed'] - df['recovered'] - df['fatalities']
    return active_cases

def plot_active_cases(df):
    
    fig, ax = plt.subplots(figsize=(14, 6))

    ax.plot(get_active_cases(df))
    ax.ticklabel_format(axis='y', style='plain')

    return fig, ax

fig, ax = plot_active_cases(by_date)
ax.set_title('Active cases by date worldwide')
plt.show()

Observations: It is important to bear in mind that not all recoveries are being reported, so the number of active cases is likely overstated.

2. 1. 5. Top countries for total confirmed cases

In [20]:
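# Use the latest cumulative totals; summing over every date would count the same cases repeatedly.
latest = df[df['date'] == df['date'].max()]
by_country = latest.groupby('country')[['confirmed', 'fatalities', 'recovered']].sum()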

def plot_top_countries(df, n_countries=10):
    top = df.sort_values('confirmed', ascending=False).head(n_countries)

    fig, ax = plt.subplots(figsize=(12, 8))

    ax.bar(top.index, top['fatalities'], label='fatalities')
    ax.bar(top.index, top['confirmed'], bottom=top['fatalities'], label='confirmed')
    ax.bar(top.index, top['recovered'], bottom=top['confirmed'] + top['fatalities'], label='recovered')

    ax.tick_params(axis='x', rotation=90)  # rotate the country labels
    ax.ticklabel_format(axis='y', style='plain')
    ax.legend()

    return fig, ax

fig, ax = plot_top_countries(by_country)
ax.set_title('Top countries by number of confirmed cases')
plt.show()

Observations: Here we can see how the relationship between the number of fatalities and recoveries changes amongst the countries most afflicted by the coronavirus. In China more than half the people recovered, and the number of fatalities has remained relatively low compared to the US and European countries. The same cannot be said for Italy and Spain, although these countries are still at a different stage of the pandemic.

2. 1. 6. New cases for countries with most cases

Next I will be plotting the number of confirmed cases per country. I will not use all countries for the following plots; only the countries with the highest number of cases are taken into consideration.
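
Because the confirmed column is cumulative, the countries with the most cases can be picked from the latest date alone. Here is a minimal sketch on invented data; the variable names are hypothetical.

```python
import pandas as pd

# Invented cumulative totals for three countries over two days.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "date": pd.to_datetime(["2020-04-15", "2020-04-16"] * 3),
    "confirmed": [10, 20, 50, 55, 1, 2],
})

# Keep only the most recent day, then take the largest totals.
latest = toy[toy["date"] == toy["date"].max()]
top = latest.nlargest(2, "confirmed")["country"].tolist()
print(top)
```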

In [21]:
def plot_subplots_for_country_cases(df, column, n_countries=5, **kwargs):

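    # Rank by the highest cumulative total; summing cumulative values over dates would distort the ranking.
    top = df.groupby('country')['confirmed'].max().sort_values(ascending=False).head(n_countries)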

    fig, ax = plt.subplots(len(top), 1, **kwargs)

    for i, country in enumerate(top.index):
        country_df = df[df['country']==country]
        grouped_by_date = country_df.groupby('date')[column].sum()

        ax[i].plot(grouped_by_date, color=next(PALETTE))
        ax[i].set_title(country)
        ax[i].xaxis.set_ticks(grouped_by_date.index[::14])
        ax[i].ticklabel_format(axis='y', style='plain')

    fig.tight_layout(pad=3.0)

    fig.suptitle('Number of ' + column.replace('_', ' '))
        
    return fig, ax

fig, ax = plot_subplots_for_country_cases(df, n_countries=6, column='daily_confirmed', figsize=(12, 12))

Observations: We can see how China's curve differs from those of the other countries. Besides China being the first country to deal with the coronavirus outbreak, its measures seem to have worked well, as we can see a downward trend in new cases after slightly more than one month.

2. 1. 7. Active cases by country

In [22]:
def plot_active_cases_countries(df, n_countries=5):

    active_cases = get_active_cases(df).sort_values(ascending=False)  # use the argument, not the global

    out = active_cases.iloc[:n_countries].copy()  # copy so that adding 'Other' does not touch active_cases
    out.loc['Other'] = active_cases.iloc[n_countries:].sum()

    fig, ax = plt.subplots(figsize=(7,7))

    labels = out.index 
    sizes = [ x/out.sum() for x in out ]
    explode = np.zeros(len(out))
    explode[-1] = 0.1

    ax.pie(sizes, labels=labels, autopct='%1.1f%%',
           shadow=True, startangle=90, explode=explode)
    ax.axis('equal')  
    return fig, ax

fig, ax = plot_active_cases_countries(by_country, n_countries=5)
ax.set_title('Active cases by country')
plt.show()

2. 1. 8. Outcome of cases (recovery or death)

The outcome of cases is the share of cumulative recoveries and of cumulative deaths over the cumulative number of closed cases (recoveries plus deaths).
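
A quick numeric sketch of this definition, on invented totals, also shows what happens before any case has closed: the ratio is 0/0 and pandas returns NaN.

```python
import pandas as pd

# Invented cumulative totals on three consecutive days.
totals = pd.DataFrame({
    "recovered": [0, 30, 90],
    "fatalities": [0, 10, 10],
})

closed = totals["recovered"] + totals["fatalities"]

# On the first day no case has closed yet, so both shares are NaN.
pct_recovered = 100 * totals["recovered"] / closed
pct_deaths = 100 * totals["fatalities"] / closed
print(pd.DataFrame({"recovered %": pct_recovered, "deaths %": pct_deaths}))
```

On the days where both shares are defined they always add up to 100%, which is why the two curves in the plot mirror each other.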

In [23]:
def plot_outcome_of_cases(df):
    fig, ax = plt.subplots(figsize=(14, 6))

    closed_cases = df['recovered'] + df['fatalities']
    perc_recovered = df['recovered'] / closed_cases * 100
    perc_deaths = df['fatalities'] / closed_cases * 100

    ax.plot(perc_recovered, marker='o', color='g', label='recovered')
    ax.plot(perc_deaths, marker='o', color='r', label='fatalities')
    ax.set_ylabel('Percent (%)')
    ax.set_title('Outcome of total closed cases (recovery rate vs death rate)')
    ax.legend(loc='upper left')

    return fig, ax

fig, ax = plot_outcome_of_cases(by_date)
plt.show()

Observations: Because countries are at different stages of the outbreak, this graph is not very useful. We will see a more informative outcome of total closed cases when we analyse the countries separately.

2. 2. China

In [ ]:
by_date_china = df[df.country=='China'].groupby('date').sum()

2. 2. 1. Total and daily cases

In [25]:
fig, ax = plot_total_and_daily(by_date_china)
ax[0].set_title('Daily new cases in China')
ax[1].set_title('Cumulative number of cases in China')
plt.show()

Observations: China seems to be at the later stages of the outbreak, with the graphs showing encouraging behaviour.

2. 2. 2. Growth factor

In [26]:
fig, ax = plot_growth_factor(by_date_china.loc['2020-02-23':])
ax.set_title("China's growth factor")
plt.show()

Observations: Although the growth factor seems to be centred around one, we are only working with a very small number of daily new cases compared to other countries, as we can see below.

In [27]:
# Average over the last 7 days available in the data rather than relative to today's date.
new_cases_past_week = by_date_china['daily_confirmed'].tail(7).mean()
print('There was an average of', int(new_cases_past_week), 'daily new cases in China last week')
There was an average of 74 daily new cases in China last week

2. 2. 3. Active cases

In [28]:
fig, ax = plot_active_cases(by_date_china)
ax.set_title('Active cases in China')
plt.show()

Observations: The number of active cases has dropped drastically. As we will see from the following plot, most of these people have recovered.

2. 2. 4. Outcome of cases

In [29]:
fig, ax = plot_outcome_of_cases(by_date_china)
plt.show()

Observations: China looks to have handled the pandemic well, with very few deaths after the initial outbreak.

2. 3. Europe

We will be analysing Europe as a whole. Although different measures were taken across European countries, we take the geographical context into consideration: the total EU area is smaller than that of the US or China.

In [ ]:
european_countries = ['Austria', 
                      'Belgium',
                      'Bulgaria',
                      'Croatia',
                      'Cyprus',
                      'Czechia',  # the JHU data lists the Czech Republic as 'Czechia'
                      'Denmark',
                      'Estonia',
                      'Finland',
                      'France',
                      'Germany',
                      'Greece',
                      'Hungary',
                      'Ireland',
                      'Italy',
                      'Latvia',
                      'Lithuania',
                      'Luxembourg',
                      'Malta',
                      'Netherlands',
                      'Poland',
                      'Portugal',
                      'Romania',
                      'Slovakia',
                      'Slovenia',
                      'Spain',
                      'Sweden']

by_date_eu = df[df.country.isin(european_countries)].groupby('date').sum()

2. 3. 1. Total and daily cases

In [31]:
fig, ax = plot_total_and_daily(by_date_eu)
ax[0].set_title('Daily new cases in the EU')
ax[1].set_title('Cumulative number of cases in the EU')
plt.show()

Observations: We can see that new cases are starting to follow a downward trend. On the other hand, the daily number of people recovered seems to be rising, while the fatalities are slightly increasing.

2. 3. 2. Growth factor

In [32]:
fig, ax = plot_growth_factor(by_date_eu)
ax.set_title('Growth factor in EU')
plt.show()

Observations: We can see some inconsistency in some of the initial data reported. From around the 15th of March most data points are below one, suggesting a decrease in new cases.

2. 3. 3. Active cases

In [33]:
fig, ax = plot_active_cases(by_date_eu)
ax.set_title('Active cases in EU')
plt.show()

Observations:

2. 3. 4. Outcome of cases

In [34]:
fig, ax = plot_outcome_of_cases(by_date_eu)
ax.set_title('Outcome of cases in EU')
plt.show()

Observations: In the later stages there seems to be a steady increase in the number of people recovered.

2. 4. USA

In [ ]:
by_date_usa = df[df.country=='US'].groupby('date').sum()

2. 4. 1. Total and daily cases

In [36]:
fig, ax = plot_total_and_daily(by_date_usa)
ax[0].set_title('Daily new cases in USA')
ax[1].set_title('Cumulative number of cases in USA')
plt.show()

Observations: The USA seems to be in the initial stages of the pandemic, with no positive signs yet. The number of deaths is also higher than the number of people recovered so far.

2. 4. 2. Growth factor

In [37]:
fig, ax = plot_growth_factor(by_date_usa)
ax.set_title("USA's growth factor")
plt.show()

Observations: As the data points are all very close to one, the number of daily new cases is not yet declining. A growth factor that stays above one would mean that new cases are still increasing exponentially.

2. 4. 3. Active cases

In [38]:
fig, ax = plot_active_cases(by_date_usa)
ax.set_title('Active cases in USA')
plt.show()

Observations: The number of active cases is also increasing steeply.

2. 4. 4. Outcome of cases

In [39]:
fig, ax = plot_outcome_of_cases(by_date_usa)
ax.set_title('Outcome of cases in USA')
plt.show()

Observations: The USA's outcome of cases highlights issues with the handling of the coronavirus. Although it is still in the initial stages, it is the only one of the regions analysed that shows a higher percentage of deaths than recoveries.

3. Disclaimer

  • Because new data is added on a daily basis, some sections might not reflect the most recent cases.

  • The objective of this notebook is to provide some insights into COVID-19 transmission from a data-centric perspective in a simple, didactic way. The results should not be taken in any way as a statement of what will happen in the future. Observations obtained from data exploration are personal opinions.