The coronavirus pandemic is an ongoing pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The outbreak started in Wuhan, China. The World Health Organization declared the outbreak to be a public health emergency of international concern and recognized it as a pandemic on the 11th of March 2020.

The pandemic has led to one of the biggest economical, political and social crisis mankind has faced in decades. Effective measures were slow to be put in practice due to the unknown and uncertainty related to the virus. Initial information has been of all different sorts, with many incorrect and false news being spread.

The purpose of this notebook is to have an understanding of the virus through data. By any means this analysis should not be taken into consideration for scientific purposes as there is a large number of variables that need to be considered for this (quarantines, quality of the medical resources deployed, environmental measures, etc).

TABLE OF CONTENTS

First look at the datasets
1. Merging the data
2. Editing features
EDA
Disclaimer

1. First look at the datasets ¶

We will be initially using the following datasets throughout the analysis:

These are from Johns Hopkins University's repo on github and have kindly been provided to the public for educational and academic research purposes. We will be importing first all the libraries used for this study and start loading the data.

In [1]:

import itertools
from datetime import date
from datetime import timedelta
import numpy as np
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
_ = sns.set()

/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

In [2]:

confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
confirmed_df.head()

Out[2]:

	Province/State	Country/Region	Lat	Long	2/24/20	2/25/20	2/26/20	...	3/8/20	3/9/20	3/10/20	3/11/20	3/12/20	3/13/20	3/14/20	3/15/20	3/16/20	3/17/20	3/18/20	3/19/20	3/20/20	3/21/20	3/22/20	3/23/20	3/24/20	3/25/20	3/26/20	3/27/20	3/28/20	3/29/20	3/30/20	3/31/20	4/1/20	4/2/20	4/3/20	4/4/20	4/5/20	4/6/20	4/7/20	4/8/20	4/9/20	4/10/20	4/11/20	4/12/20	4/13/20	4/14/20	4/15/20	4/16/20
0	NaN	Afghanistan	33.0000	65.0000	1	1	1	...	4	4	5	7	7	7	11	16	21	22	22	22	24	24	40	40	74	84	94	110	110	120	170	174	237	273	281	299	349	367	423	444	484	521	555	607	665	714	784	840
1	NaN	Albania	41.1533	20.1683	0	0	0	...	0	2	10	12	23	33	38	42	51	55	59	64	70	76	89	104	123	146	174	186	197	212	223	243	259	277	304	333	361	377	383	400	409	416	433	446	467	475	494	518
2	NaN	Algeria	28.0339	1.6596	0	1	1	...	19	20	20	20	24	26	37	48	54	60	74	87	90	139	201	230	264	302	367	409	454	511	584	716	847	986	1171	1251	1320	1423	1468	1572	1666	1761	1825	1914	1983	2070	2160	2268
3	NaN	Andorra	42.5063	1.5218	0	0	0	...	1	1	1	1	1	1	1	1	2	39	39	53	75	88	113	133	164	188	224	267	308	334	370	376	390	428	439	466	501	525	545	564	583	601	601	638	646	659	673	673
4	NaN	Angola	-11.2027	17.8739	0	0	0	...	0	0	0	0	0	0	0	0	0	0	0	0	1	2	2	3	3	3	4	4	5	7	7	7	8	8	8	10	14	16	17	19	19	19	19	19	19	19	19	19

5 rows × 90 columns

In [3]:

fatalities_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
fatalities_df.head()

Out[3]:

	Province/State	Country/Region	Lat	Long	...	3/11/20	3/12/20	3/13/20	3/14/20	3/15/20	3/16/20	3/17/20	3/18/20	3/19/20	3/20/20	3/21/20	3/22/20	3/23/20	3/24/20	3/25/20	3/26/20	3/27/20	3/28/20	3/29/20	3/30/20	3/31/20	4/1/20	4/2/20	4/3/20	4/4/20	4/5/20	4/6/20	4/7/20	4/8/20	4/9/20	4/10/20	4/11/20	4/12/20	4/13/20	4/14/20	4/15/20	4/16/20
0	NaN	Afghanistan	33.0000	65.0000	...	0	0	0	0	0	0	0	0	0	0	0	1	1	1	2	4	4	4	4	4	4	4	6	6	7	7	11	14	14	15	15	18	18	21	23	25	30
1	NaN	Albania	41.1533	20.1683	...	1	1	1	1	1	1	1	2	2	2	2	2	4	5	5	6	8	10	10	11	15	15	16	17	20	20	21	22	22	23	23	23	23	23	24	25	26
2	NaN	Algeria	28.0339	1.6596	...	0	1	2	3	4	4	4	7	9	11	15	17	17	19	21	25	26	29	31	35	44	58	86	105	130	152	173	193	205	235	256	275	293	313	326	336	348
3	NaN	Andorra	42.5063	1.5218	...	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	3	3	3	6	8	12	14	15	16	17	18	21	22	23	25	26	26	29	29	31	33	33
4	NaN	Angola	-11.2027	17.8739	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2

5 rows × 90 columns

In [4]:

recovered_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
recovered_df.head()

Out[4]:

	Province/State	Country/Region	Lat	Long	...	3/12/20	3/13/20	3/14/20	3/15/20	3/16/20	3/17/20	3/18/20	3/19/20	3/20/20	3/21/20	3/22/20	3/23/20	3/24/20	3/25/20	3/26/20	3/27/20	3/28/20	3/29/20	3/30/20	3/31/20	4/1/20	4/2/20	4/3/20	4/4/20	4/5/20	4/6/20	4/7/20	4/8/20	4/9/20	4/10/20	4/11/20	4/12/20	4/13/20	4/14/20	4/15/20	4/16/20
0	NaN	Afghanistan	33.0000	65.0000	...	0	0	0	0	1	1	1	1	1	1	1	1	1	2	2	2	2	2	2	5	5	10	10	10	15	18	18	29	32	32	32	32	32	40	43	54
1	NaN	Albania	41.1533	20.1683	...	0	0	0	0	0	0	0	0	0	2	2	2	10	17	17	31	31	33	44	52	67	76	89	99	104	116	131	154	165	182	197	217	232	248	251	277
2	NaN	Algeria	28.0339	1.6596	...	8	8	12	12	12	12	12	32	32	32	65	65	24	65	29	29	31	31	37	46	61	61	62	90	90	90	113	237	347	405	460	591	601	691	708	783
3	NaN	Andorra	42.5063	1.5218	...	1	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	10	10	10	10	16	21	26	31	39	52	58	71	71	128	128	128	169	169
4	NaN	Angola	-11.2027	17.8739	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	2	2	2	2	2	2	2	4	4	4	5	5	5

5 rows × 90 columns

The datasets contain daily information on total number of confirmed cases, fatalities and recovered respectively. Some countries also have a region informed.

1. 1. Merging the data ¶

The structure of the data is not bestly suited for an analysis. Let's create a function to change that.

In [ ]:

def reshape_data(df):
    to_keep = ["Country/Region", "Province/State", "Lat", "Long"]
    to_change = set(df.columns) - set(to_keep)
    df = pd.melt(df, 
                 id_vars=to_keep, 
                 value_vars=to_change, 
                 var_name='date',
                 value_name='total')
    return df

The function reshape_data() melts the DataFrame in order to remove the dates from the columns. We can apply it to the three DataFrames and then merge them together.

In [6]:

def merge_dfs(df1, df2, df3):
    merged = pd.merge(reshape_data(df1), reshape_data(df2), how='inner', on=['Country/Region', 'Province/State', 'Lat', 'Long', 'date'])
    merged = pd.merge(merged, reshape_data(df3), how='inner', on=['Country/Region', 'Province/State', 'Lat', 'Long', 'date'])

    merged.columns = ['country', 'province', 'lat', 'long', 'date', 'confirmed', 'fatalities', 'recovered']
    merged['date'] = pd.to_datetime(merged['date'])
    merged = merged.sort_values(['country', 'province', 'date'])

    return merged

merged = merge_dfs(confirmed_df, fatalities_df, recovered_df)
merged.head(3)

Out[6]:

	country	province	lat	long	date
12792	Afghanistan	NaN	33.0	65.0	2020-01-22
15498	Afghanistan	NaN	33.0	65.0	2020-01-23
19188	Afghanistan	NaN	33.0	65.0	2020-01-24

Great! Before moving on to some descriptive statistics we will also apply a few extra changes.

1. 2. Editing features ¶

In [7]:

print("Countries with Province/State informed: ", merged[merged['province'].isna()==False]['country'].unique())

Countries with Province/State informed:  ['Australia' 'China' 'Denmark' 'France' 'Netherlands' 'United Kingdom']

We can see that the region is stated for 6 countries only, we will be removing the column for now.

It will be taken into consideration if we will be analysing those countries separately.

In [8]:

def drop_province(df):
    df = merged.groupby(['country', 'date']).agg({'lat': 'first',
                                                  'long': 'first',
                                                  'confirmed': 'sum',
                                                  'fatalities': 'sum',
                                                  'recovered': 'sum'}).reset_index()
    return df

df = drop_province(merged)
df.head()

Out[8]:

	country	date	lat	long
0	Afghanistan	2020-01-22	33.0	65.0
1	Afghanistan	2020-01-23	33.0	65.0
2	Afghanistan	2020-01-24	33.0	65.0
3	Afghanistan	2020-01-25	33.0	65.0
4	Afghanistan	2020-01-26	33.0	65.0

We will also be adding some extra feature that will be useful for our EDA. These are the number of confirmed, fatalities and recovered cases on a daily basis.

In [ ]:

def edit_features(df):
    df['daily_confirmed'] = df.groupby(['country'])['confirmed'].diff()
    df['daily_fatalities'] = df.groupby('country')['fatalities'].diff()
    df['daily_recovered'] = df.groupby('country')['recovered'].diff()

    mask = (df['date']==min(df['date']).date())
    df.loc[df.index[mask], 'daily_confirmed'] = df.loc[df.index[mask], 'confirmed']
    df.loc[df.index[mask], 'daily_fatalities'] = df.loc[df.index[mask], 'fatalities']
    df.loc[df.index[mask], 'daily_recovered'] = df.loc[df.index[mask], 'recovered']
    return df

df = edit_features(df)

Nice, now that the data is ready we can start with some descriptive statistics and EDA.

In [10]:

print("Number of countries: ", df['country'].nunique())
print("Dates: from", min(df['date']).date(), "to", max(df['date']).date(), "(total of", df['date'].nunique(), "days)")

Number of countries:  181
Dates: from 2020-01-22 to 2020-04-16 (total of 86 days)

In [11]:

df.describe()

Out[11]:

	lat	long	confirmed	fatalities	recovered	daily_confirmed	daily_fatalities	daily_recovered
count	15566.000000	15566.000000	15566.000000	15566.000000	15566.000000	15385.000000	15385.000000	15385.00000
mean	19.646433	15.894324	2234.376333	122.492098	550.744443	137.874618	9.263828	34.60338
std	23.140321	57.398398	18642.130279	1138.447470	4963.797603	1215.120885	88.864624	301.74638
min	-40.900600	-102.552800	0.000000	0.000000	0.000000	-15.000000	-31.000000	-322.00000
25%	5.152100	-9.429500	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
50%	17.607800	19.145100	1.000000	0.000000	0.000000	0.000000	0.000000	0.00000
75%	40.000000	44.000000	70.000000	1.000000	3.000000	6.000000	0.000000	0.00000
max	64.963100	178.065000	667801.000000	32916.000000	78401.000000	35098.000000	4591.000000	10980.00000

There seem to be some incorrect entries as we get some negative values in the daily_confirmed, daily_fatalities and daily_recovered.

These datasets rely upon publicly available data from multiple sources, that do not always agree. I do not know yet what the cause is but as we can see from the following records regarding Algeria for example, the number of recovered changes on the 24th of March to the go back to the original value the following day.

In [12]:

df[df['country']=='Algeria'].set_index('date')['23-03-2020':'25-03-2020']

Out[12]:

	country	lat	long	confirmed	fatalities	recovered	daily_confirmed	daily_fatalities	daily_recovered
date
2020-03-23	Algeria	28.0339	1.6596	230	17	65	29.0	0.0	0.0
2020-03-24	Algeria	28.0339	1.6596	264	19	24	34.0	2.0	-41.0
2020-03-25	Algeria	28.0339	1.6596	302	21	65	38.0	2.0	41.0

In [13]:

incorrect_entries = len(df[(df['daily_confirmed']<0) | (df['daily_fatalities']<0) | (df['daily_recovered']<0)])
print('Number of incorrect entries:', incorrect_entries, '(', round(incorrect_entries/len(df)*100, 2), '% of total data)')

Number of incorrect entries: 45 ( 0.29 % of total data)

~~As the number of incorrect entries is a small portion of the data I will be removing them for now.~~

In [ ]:

def get_incorrect_entries(df):

    mask = ((df['confirmed'].diff() < 0) | 
            (df['fatalities'].diff() < 0) | 
            (df['recovered'].diff() < 0)) & (df['date'] != min(df['date']))  

    return df[mask].index

Great! Let's start plotting.

2. EDA ¶

We will start the exploratory data analysis by studying the worldwide situation. Although it will give us a collective picture of the pandemic it is important to notice that the virus did not spread throughout the world altogether. This will strongly interfere with the results, thus we will also be working independently on a few cases.

2. 1. Worldwide cases ¶

The first plots will focus on the global situation, all countries in the dataset will be taken into consideration.

2. 1. 1. Total and daily cases ¶

In [15]:

by_date = df.groupby('date').sum()

def plot_total_and_daily(df):

  fig, ax = plt.subplots(1, 2, figsize=(18,8))

  ax[0].plot(df.daily_confirmed, label='Confirmed', color='orange')
  ax[0].plot(df.daily_fatalities, label='Fatalities', color='r')
  ax[0].plot(df.daily_recovered, label='Recovered', color='g')
  ax[0].legend()

  ax[1].plot(df.confirmed, color='orange')
  ax[1].plot(df.fatalities, color='r')
  ax[1].plot(df.recovered, color='g')
  ax[1].fill_between(df.index, df.confirmed, df.recovered, label='Confirmed', color='orange')
  ax[1].fill_between(df.index, df.recovered, df.fatalities, label='Recovered', color='g')
  ax[1].fill_between(df.index, df.fatalities, 0, label='Fatalities', color='r')
  ax[1].ticklabel_format(axis='y', style='plain')
  ax[1].legend()

  return fig, ax

fig, ax = plot_total_and_daily(by_date)
ax[0].set_title('Daily new cases worldwide')
ax[1].set_title('Cumulative number of cases worldwide')
plt.show()

Observations: The global curve shows a rich fine structure, but these numbers are strongly affected by the vector zero country, China. Given that COVID-19 started there, during the initial expansion of the virus there was no reliable information about the real infected cases. In fact, the criteria to consider infection cases was modified around 11th February 2020, which strongly perturbed the curve as we can see from the figure.

In [16]:

by_date_no_china = df[df['country']!='China'].groupby('date').sum()

fig, ax = plot_total_and_daily(by_date_no_china)
ax[0].set_title('Daily new cases worldwide excluding China')
ax[1].set_title('Cumulative number of cases worldwide excluding China')
plt.show()

Observations: In this case the general behavior looks cleaner, and in fact the curve resembles a typical epidemiology model like SIR. SIR models present a large increasing in the number of infections that, once it reaches the maximum of the contagion, decreases with a lower slope.

2. 1. 2. Confirmed, recovered and fatalities on a daily basis ¶

In [17]:

PALETTE = itertools.cycle(sns.color_palette())

def plot_in_cols(df, to_plot=[]):

    n_plots = len(to_plot)

    fig, ax = plt.subplots(1, n_plots, figsize=(n_plots*6,8))

    for i, col in enumerate(to_plot):
        ax[i].plot(df[col], label=col.replace('_', ' '), color=next(PALETTE))
        ax[i].set_title('Number of ' + col.replace('_', ' '))
        ax[i].xaxis.set_major_locator(mdates.MonthLocator())
        ax[i].ticklabel_format(axis='y', style='plain')


    return fig, ax

fig, ax = plot_in_cols(by_date, to_plot=['daily_confirmed', 'daily_fatalities', 'daily_recovered'])

Observations: The following plots show a clear upward trend.

2. 1. 3. Growth Factor ¶

Growth factor is the factor by which a quantity multiplies itself over time. The formula used is every day's new cases / new cases on the previous day. For example, a quantity growing by 7% every period (in this case daily) has a growth factor of 1.07.

A growth factor above 1 indicates an increase, whereas one which remains between 0 and 1 it is a sign of decline, with the quantity eventually becoming zero, whereas a growth factor constantly above 1 could signal exponential growth

In [18]:

def plot_growth_factor(df):
    
    fig, ax = plt.subplots(figsize=(9, 9))

    growth_factor = df['daily_confirmed'] / df['daily_confirmed'].shift(-1)

    ax.plot(df.index, growth_factor, marker='o')
    ax.axhline(0)
    ax.axhline(1, linewidth=2, color='r')
    ax.set_ylabel('New cases / new cases on previous day')

    return fig, ax

fig, ax = plot_growth_factor(by_date)
ax.set_title('Global growth factor')
plt.show()

Observations: The curve seems to be tending to one, suggesting that it is increasing exponentially. There are quite a few peaks in February, this could be due to the fact that during the initial expansion of the virus there was no reliable information about the real infected cases. Some of the incorrect records we saw in the dataset might also be influencing the plot.

2. 1. 4. Active cases by date ¶

By removing fatalities and recovered from total cases, we get "currently infected cases" or "active cases" (cases still awaiting for an outcome).

In [19]:

def get_active_cases(df):
    active_cases = df['confirmed'] - df['recovered'] - df['fatalities']
    return active_cases

def plot_active_cases(df):
    
    fig, ax = plt.subplots(figsize=(14, 6))

    ax.plot(get_active_cases(df))
    ax.ticklabel_format(axis='y', style='plain')

    return fig, ax

fig, ax = plot_active_cases(by_date)
ax.set_title('Active cases by date worldwide')
plt.show()

Observations: It is important to know that not all recoveries are being accounted for.

2. 1. 5. Top countries for total confirmed cases ¶

In [20]:

by_country = merged.groupby('country').sum()

def plot_top_countries(df, n_countries=10):
    top = df.sort_values('confirmed', ascending=False).head(n_countries)

    fig, ax = plt.subplots(figsize=(12, 8))

    ax.bar(top.index, top['fatalities'], label='fatalities')
    ax.bar(top.index, top['confirmed'], bottom=top['fatalities'], label='confirmed')
    ax.bar(top.index, top['recovered'], bottom=top['confirmed'] + top['fatalities'], label='recovered')

    ax.set_xticklabels(top.index, rotation=90)
    ax.ticklabel_format(axis='y', style='plain')
    ax.legend()

    return fig, ax

fig, ax = plot_top_countries(by_country)
ax.set_title('Top conutries per number of confirmed cases (ordered by confirmed cases)')
plt.show()

Observations: Here we can see how the relationship between number of fatalities and recovered changes amongst the countries most afflicted by the coronavirus. In China more than half the people recovered and the number of fatalities has kept relatively low compared to US and other european countries. The same cannot be said for Italy and Spain, although these countries are still at a different stage of the pandemic.

2. 1. 6. New cases for countries with most cases ¶

Next I will be plotting the number of confirmed cases per country. I will not be using all countries for the following plots, the countries taken in consideration are the ones with the most number of cases.

In [21]:

def plot_subplots_for_country_cases(df, column, n_countries=5, **kwargs):

    top = df.groupby('country').sum().sort_values('confirmed', ascending=False).head(n_countries)

    fig, ax = plt.subplots(len(top), 1, **kwargs)

    for i, country in enumerate(top.index):
        country_df = df[df['country']==country]
        grouped_by_date = country_df.groupby('date')[column].sum()

        ax[i].plot(grouped_by_date, color=next(PALETTE))
        ax[i].set_title(country)
        ax[i].xaxis.set_ticks(grouped_by_date.index[::14])
        ax[i].ticklabel_format(axis='y', style='plain')

    fig.tight_layout(pad=3.0)

    fig.suptitle('Number of ' + column.replace('_', ' '))
        
    return fig, ax

fig, ax = plot_subplots_for_country_cases(df, n_countries=6, column='daily_confirmed', figsize=(12, 12))

Observations: We can see how China's curve is different compared to the other countries. Besides the fact that they were the first country to deal with the coronavirus outbreak the measures seem to have worked well as we can see a downwards trend in new cases after slightly more than one month.

2. 1. 7. Active cases by country ¶

In [22]:

def plot_active_cases_countries(df, n_countries=5):

    active_cases = get_active_cases(by_country).sort_values(ascending=False)

    out = active_cases.iloc[:n_countries]
    out.loc['Other'] = active_cases.iloc[n_countries:].sum()

    fig, ax = plt.subplots(figsize=(7,7))

    labels = out.index 
    sizes = [ x/out.sum() for x in out ]
    explode = np.zeros(len(out))
    explode[-1] = 0.1

    ax.pie(sizes, labels=labels, autopct='%1.1f%%',
           shadow=True, startangle=90, explode=explode)
    ax.axis('equal')  
    return fig, ax

fig, ax = plot_active_cases_countries(by_country, n_countries=5)
ax.set_title('Active cases by country')
plt.show()

2. 1. 8. Outcome of cases (recovery or death) ¶

The outcome of cases is the cumulative total deaths and recoveries over cumulative number of closed cases.

In [23]:

def plot_outcome_of_cases(df):
    fig, ax = plt.subplots(figsize=(14, 6))

    closed_cases = df['recovered'] + df['fatalities']
    perc_recovered = df['recovered'] / closed_cases * 100
    perc_deaths = df['fatalities'] / closed_cases * 100

    ax.plot(perc_recovered, marker='o', color='g', label='recovered')
    ax.plot(perc_deaths, marker='o', color='r', label='fatalities')
    ax.set_ylabel('Percent (%)')
    ax.set_title('Outcome of total closed cases (recovery rate vs death rate)')
    ax.legend(loc='upper left')

    return fig, ax

fig, ax = plot_outcome_of_cases(by_date)
plt.show()

Observations: The difference in stages between countries does not make this graph any useful. We will see again a more informative outcome of total closed cases when we will be analysing the countries separately.

2. 2. China ¶

In [ ]:

by_date_china = df[df.country=='China'].groupby('date').sum()

2. 2. 1. Total and daily cases ¶

In [25]:

fig, ax = plot_total_and_daily(by_date_china)
ax[0].set_title('Cumulative number of cases in China')
ax[1].set_title('Daily new cases in China')
plt.show()

Observations: China seems to be at later stages of the outbreak with positive behaviour being shown in the graphs.

2. 2. 2. Growth factor ¶

In [26]:

yesterday = date.today() - timedelta(days = 1) 

fig, ax = plot_growth_factor(by_date_china['23-02-2020':])
ax.set_title("China's growth factor")
plt.show()

Observations: Although the growth factor seems to be centering around one we are only working with a very little number of daily new cases compared to other countries as we can see below.

In [27]:

past_week = date.today() - timedelta(days = 7) 
new_cases_past_week = by_date_china[past_week:]['daily_confirmed'].mean()
print('There was an average of', int(new_cases_past_week), 'daily new cases in China last week')

There was an average of 74 daily new cases in China last week

2. 2. 3. Active cases ¶

In [28]:

fig, ax = plot_active_cases(by_date_china)
ax.set_title('Active cases in China')
plt.show()

Observations: The number of active cases has reduced drastically. As we will see from the following plot, in most cases the people have recovered.

2. 2. 4. Outcome of cases ¶

In [29]:

fig, ax = plot_outcome_of_cases(by_date_china)
plt.show()

Observations: China looks to have handeled the pandemic well with very few deaths after the initial outbreak.

2. 3. Europe ¶

We will be analysing Europe as whole. Though it is true that different measures were taken amongst european countries, we will be taking into consideration the geographical context, with the total EU area being smaller than US and China.

In [ ]:

european_countries = ['Austria', 
                      'Belgium',
                      'Bulgaria',
                      'Croatia',
                      'Cyprus',
                      'Czech Republic',
                      'Denmark',
                      'Estonia',
                      'Finland',
                      'France',
                      'Germany',
                      'Greece',
                      'Hungary',
                      'Ireland',
                      'Italy',
                      'Latvia',
                      'Lithuania',
                      'Luxembourg',
                      'Malta',
                      'Netherlands',
                      'Poland',
                      'Portugal',
                      'Romania',
                      'Slovakia',
                      'Slovenia',
                      'Spain',
                      'Sweden']

by_date_eu = df[df.country.isin(european_countries)].groupby('date').sum()

2. 3. 1. Total and daily cases ¶

In [31]:

fig, ax = plot_total_and_daily(by_date_eu)
ax[0].set_title('Daily new cases in the EU')
ax[1].set_title('Cumulative number of cases in the EU')
plt.show()

Observations: We can see how the new cases are starting to follow a downwards trend, on the other side the daily number of people recovered seems to follow a positive one while the fatalities are slightly increasing.

2. 3. 2. Growth factor ¶

In [32]:

fig, ax = plot_growth_factor(by_date_eu)
ax.set_title('Growth factor in EU')
plt.show()

Observations: We can see some inconsistency with some of the initial data being reported. Starting from around 15th of March we can see that most datapoints are below one, suggesting a decrease in new cases.

2. 3. 3. Active cases ¶

In [33]:

fig, ax = plot_active_cases(by_date_eu)
ax.set_title('Active cases in EU')
plt.show()

Observations:

2. 3. 4. Outcome of cases ¶

In [34]:

fig, ax = plot_outcome_of_cases(by_date_eu)
ax.set_title('Outcome of cases in EU')
plt.show()

Observations: In the later stages there seems to be a steady increase in the number of people recovered.

2. 4. USA ¶

In [ ]:

by_date_usa = df[df.country=='US'].groupby('date').sum()

2. 4. 1. Total and daily cases ¶

In [36]:

fig, ax = plot_total_and_daily(by_date_usa)
ax[0].set_title('Daily new cases in USA')
ax[1].set_title('Cumulative number of cases in USA')
plt.show()

Observations: USA seems to be at the initial stages of the pandemic with no positive signs being shown yet. The number of deaths is also higher than the number of people recovered so far.

2. 4. 2. Growth factor ¶

In [37]:

fig, ax = plot_growth_factor(by_date_usa)
ax.set_title("USA's growth factor")
plt.show()

Observations: As the datapoints are all very close to one we are still at an exponential increase in new cases.

2. 4. 3. Active cases ¶

In [38]:

fig, ax = plot_active_cases(by_date_usa)
ax.set_title('Active cases in USA')
plt.show()

Observations: Also the number of active cases is increasing steeply.

2. 4. 4. Outcome of cases ¶

In [39]:

fig, ax = plot_outcome_of_cases(by_date_usa)
ax.set_title('Outcome of cases in USA')

Out[39]:

Text(0.5, 1.0, 'Outcome of cases in USA')

Observations: USA's outcome of cases is highlighting issues with the handling of coronavirus. Though it is still at initial stages it is the only country analysed that shows a higher percentage of deaths over recovery cases.

3. Disclaimer¶

With new data being added on a daily basis, in order to include the most recent cases, some sections might not be up to date.
The objective of this notebook is to provide some insights about the COVID-19 transmission from a data-centric perspective in a didactical and simple way. Results should not be considered in any way an affirmation of what will happen in the future. Observations obtained from data exploration are personal opinions.

1. First look at the datasets ¶

1. 1. Merging the data ¶

1. 2. Editing features ¶

2. EDA ¶

2. 1. Worldwide cases ¶

2. 1. 1. Total and daily cases ¶

2. 1. 2. Confirmed, recovered and fatalities on a daily basis ¶

2. 1. 3. Growth Factor ¶

2. 1. 4. Active cases by date¶

2. 1. 5. Top countries for total confirmed cases¶

2. 1. 6. New cases for countries with most cases ¶

2. 1. 7. Active cases by country ¶

2. 1. 8. Outcome of cases (recovery or death) ¶

2. 2. China ¶

2. 2. 1. Total and daily cases ¶

2. 2. 2. Growth factor ¶

2. 2. 3. Active cases ¶

2. 2. 4. Outcome of cases ¶

2. 3. Europe ¶

2. 3. 1. Total and daily cases ¶

2. 3. 2. Growth factor ¶

2. 3. 3. Active cases ¶

2. 3. 4. Outcome of cases ¶

2. 4. USA ¶

2. 4. 1. Total and daily cases ¶

2. 4. 2. Growth factor ¶

2. 4. 3. Active cases ¶

2. 4. 4. Outcome of cases ¶

3. Disclaimer¶

2. 1. 4. Active cases by date ¶

2. 1. 5. Top countries for total confirmed cases ¶