The coronavirus pandemic is an ongoing pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The outbreak started in Wuhan, China. The World Health Organization declared the outbreak to be a public health emergency of international concern and recognized it as a pandemic on the 11th of March 2020.
The pandemic has led to one of the biggest economic, political and social crises humankind has faced in decades. Effective measures were slow to be put into practice because of the unknowns and uncertainty surrounding the virus. Early information was of mixed quality, with a lot of incorrect and false news being spread.
The purpose of this notebook is to gain an understanding of the virus through data. By no means should this analysis be used for scientific purposes, as a large number of variables would need to be considered for that (quarantines, quality of the medical resources deployed, environmental measures, etc.).
1. First look at the datasets ¶
We will initially be using the following datasets throughout the analysis: the global time series of confirmed cases, deaths and recovered cases.
These are from Johns Hopkins University's repository on GitHub and have kindly been provided to the public for educational and academic research purposes. We will first import all the libraries used for this study and then load the data.
import itertools
from datetime import date
from datetime import timedelta
import numpy as np
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
_ = sns.set()
confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
confirmed_df.head()
fatalities_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
fatalities_df.head()
recovered_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
recovered_df.head()
The datasets contain daily cumulative totals of confirmed cases, fatalities and recovered cases respectively. Some countries are also broken down by province or state.
1. 1. Merging the data ¶
The structure of the data is not well suited for analysis. Let's create a function to change that.
def reshape_data(df):
    # Columns that identify a location; every other column is a date
    to_keep = ["Country/Region", "Province/State", "Lat", "Long"]
    # Use a list (not a set) so the date columns keep their original order
    to_change = [col for col in df.columns if col not in to_keep]
    df = pd.melt(df,
                 id_vars=to_keep,
                 value_vars=to_change,
                 var_name='date',
                 value_name='total')
    return df
The function reshape_data() melts the DataFrame in order to remove the dates from the columns. We can apply it to the three DataFrames and then merge them together.
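To make the wide-to-long reshaping concrete, here is a minimal sketch on a made-up two-country table. The values and the names `toy_wide`, `id_cols` and `date_cols` are illustrative assumptions, not part of the Johns Hopkins data; only the column layout mirrors the source files.

```python
import pandas as pd

# Hypothetical wide table laid out like the source CSVs: one column per date
toy_wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Province/State": [None, None],
    "Lat": [0.0, 1.0],
    "Long": [0.0, 1.0],
    "1/22/20": [1, 0],
    "1/23/20": [3, 2],
})

id_cols = ["Country/Region", "Province/State", "Lat", "Long"]
date_cols = [c for c in toy_wide.columns if c not in id_cols]

# After melting there is one row per (location, date) instead of one column per date
toy_long = pd.melt(toy_wide, id_vars=id_cols, value_vars=date_cols,
                   var_name="date", value_name="total")
toy_long["date"] = pd.to_datetime(toy_long["date"], format="%m/%d/%y")
print(toy_long[["Country/Region", "date", "total"]])
```

The two date columns turn into four rows, which is the shape that the merge on country, province, coordinates and date relies on.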
def merge_dfs(df1, df2, df3):
    keys = ['Country/Region', 'Province/State', 'Lat', 'Long', 'date']
    merged = pd.merge(reshape_data(df1), reshape_data(df2), how='inner', on=keys)
    merged = pd.merge(merged, reshape_data(df3), how='inner', on=keys)
    merged.columns = ['country', 'province', 'lat', 'long', 'date', 'confirmed', 'fatalities', 'recovered']
    merged['date'] = pd.to_datetime(merged['date'])
    merged = merged.sort_values(['country', 'province', 'date'])
    return merged
merged = merge_dfs(confirmed_df, fatalities_df, recovered_df)
merged.head(3)
Great! Before moving on to some descriptive statistics we will also apply a few extra changes.
1. 2. Editing features ¶
print("Countries with Province/State reported: ", merged[merged['province'].notna()]['country'].unique())
We can see that the region is stated for only 6 countries, so we will remove the column for now.
It will be taken into consideration if we analyse those countries separately.
def drop_province(df):
    # Aggregate provinces into one row per country and date
    # (uses the df argument rather than the global merged)
    df = df.groupby(['country', 'date']).agg({'lat': 'first',
                                              'long': 'first',
                                              'confirmed': 'sum',
                                              'fatalities': 'sum',
                                              'recovered': 'sum'}).reset_index()
    return df
df = drop_province(merged)
df.head()
We will also add some extra features that will be useful for our EDA: the daily numbers of confirmed cases, fatalities and recoveries.
def edit_features(df):
    # Daily figures are the day-over-day differences of the cumulative totals
    df['daily_confirmed'] = df.groupby('country')['confirmed'].diff()
    df['daily_fatalities'] = df.groupby('country')['fatalities'].diff()
    df['daily_recovered'] = df.groupby('country')['recovered'].diff()
    # The first date has no previous day, so use the cumulative value itself.
    # Compare Timestamp with Timestamp; comparing with a datetime.date never matches
    mask = df['date'] == df['date'].min()
    df.loc[mask, 'daily_confirmed'] = df.loc[mask, 'confirmed']
    df.loc[mask, 'daily_fatalities'] = df.loc[mask, 'fatalities']
    df.loc[mask, 'daily_recovered'] = df.loc[mask, 'recovered']
    return df
df = edit_features(df)
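As a small illustration of how the daily figures are derived from cumulative totals, the sketch below uses made-up numbers for two hypothetical countries, "A" and "B". The name `toy` is an assumption for this example only.

```python
import pandas as pd

# Made-up cumulative totals for two hypothetical countries
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24"] * 2),
    "confirmed": [1, 4, 9, 0, 2, 2],
})

# Difference within each country so totals never leak from one country to the next
toy["daily_confirmed"] = toy.groupby("country")["confirmed"].diff()

# The first day has no previous day, so its daily value is the total itself
first_day = toy["date"] == toy["date"].min()
toy.loc[first_day, "daily_confirmed"] = toy.loc[first_day, "confirmed"]
print(toy)
```

Summing the daily column per country gives back the latest cumulative total, which is a quick sanity check for this kind of feature.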
Nice, now that the data is ready we can start with some descriptive statistics and EDA.
print("Number of countries: ", df['country'].nunique())
print("Dates: from", min(df['date']).date(), "to", max(df['date']).date(), "(total of", df['date'].nunique(), "days)")
df.describe()
There seem to be some incorrect entries, as we get some negative values in daily_confirmed, daily_fatalities and daily_recovered.
These datasets rely upon publicly available data from multiple sources that do not always agree. I do not know yet what the cause is, but as we can see from the following records for Algeria, for example, the number of recovered changes on the 24th of March and then goes back to the original value the following day.
df[df['country']=='Algeria'].set_index('date').loc['2020-03-23':'2020-03-25']
incorrect_entries = len(df[(df['daily_confirmed']<0) | (df['daily_fatalities']<0) | (df['daily_recovered']<0)])
print('Number of incorrect entries:', incorrect_entries, '(', round(incorrect_entries/len(df)*100, 2), '% of total data)')
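As the incorrect entries make up a small portion of the data, I will be removing them for now.
def get_incorrect_entries(df):
    # Difference within each country so consecutive countries are never compared
    grouped = df.groupby('country')
    mask = ((grouped['confirmed'].diff() < 0) |
            (grouped['fatalities'].diff() < 0) |
            (grouped['recovered'].diff() < 0)) & (df['date'] != df['date'].min())
    return df[mask].index
# Drop the flagged rows, as described above
df = df.drop(get_incorrect_entries(df))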
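A cumulative series should never decrease, so a negative day-over-day difference flags a suspicious record. The sketch below shows the idea on made-up numbers (the names `toy`, `per_country` and `naive` are assumptions for illustration). It also shows why the difference is best taken per country: a plain difference over the whole column compares the last row of one country with the first row of the next.

```python
import pandas as pd

# Made-up cumulative 'recovered' counts; the dip in the third row mimics a bad record
toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "recovered": [10, 20, 15, 20, 5, 5, 6, 7],
})

# Per-country differences: negative means the cumulative total went down
per_country = toy[toy.groupby("country")["recovered"].diff() < 0]

# Plain differences also flag the first row of country B (5 - 20 < 0), a false alarm
naive = toy[toy["recovered"].diff() < 0]

print("Per country:", per_country.index.tolist())
print("Naive:", naive.index.tolist())
```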
Great! Let's start plotting.
2. EDA ¶
We will start the exploratory data analysis by studying the worldwide situation. Although this gives us a collective picture of the pandemic, it is important to note that the virus did not spread throughout the world all at once. This strongly affects the results, so we will also look at a few cases independently.
2. 1. Worldwide cases ¶
The first plots focus on the global situation; all countries in the dataset are taken into consideration.
2. 1. 1. Total and daily cases ¶
by_date = df.groupby('date').sum()
def plot_total_and_daily(df):
    fig, ax = plt.subplots(1, 2, figsize=(18,8))
    ax[0].plot(df.daily_confirmed, label='Confirmed', color='orange')
    ax[0].plot(df.daily_fatalities, label='Fatalities', color='r')
    ax[0].plot(df.daily_recovered, label='Recovered', color='g')
    ax[0].legend()
    ax[1].plot(df.confirmed, color='orange')
    ax[1].plot(df.fatalities, color='r')
    ax[1].plot(df.recovered, color='g')
    ax[1].fill_between(df.index, df.confirmed, df.recovered, label='Confirmed', color='orange')
    ax[1].fill_between(df.index, df.recovered, df.fatalities, label='Recovered', color='g')
    ax[1].fill_between(df.index, df.fatalities, 0, label='Fatalities', color='r')
    ax[1].ticklabel_format(axis='y', style='plain')
    ax[1].legend()
    return fig, ax
fig, ax = plot_total_and_daily(by_date)
ax[0].set_title('Daily new cases worldwide')
ax[1].set_title('Cumulative number of cases worldwide')
plt.show()
Observations: The global curve shows a rich fine structure, but these numbers are strongly affected by China, the country where the outbreak started. During the initial expansion of the virus there was no reliable information about the real number of infected cases. In fact, the criteria for counting infection cases were modified around 11th February 2020, which strongly perturbed the curve, as we can see from the figure.
by_date_no_china = df[df['country']!='China'].groupby('date').sum()
fig, ax = plot_total_and_daily(by_date_no_china)
ax[0].set_title('Daily new cases worldwide excluding China')
ax[1].set_title('Cumulative number of cases worldwide excluding China')
plt.show()
Observations: In this case the general behaviour looks cleaner, and in fact the curve resembles a typical epidemiological model such as SIR. SIR models show a sharp increase in the number of infections which, once the peak of contagion is reached, decreases with a gentler slope.
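For readers unfamiliar with SIR, the sketch below integrates the standard SIR equations (susceptible, infected, recovered) with one Euler step per day. It is only an illustration: the function name `simulate_sir` and the parameter values for `beta` (transmission rate) and `gamma` (recovery rate) are assumptions chosen for the example, not values fitted to the data in this notebook.

```python
def simulate_sir(beta, gamma, population, initial_infected, days):
    """Integrate the SIR equations with one Euler step per day."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population  # S -> I
        new_recoveries = gamma * i                  # I -> R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Illustrative parameters only, not fitted to any data
history = simulate_sir(beta=0.3, gamma=0.1, population=1_000_000,
                       initial_infected=10, days=200)
infected = [i for _, i, _ in history]
peak_day = infected.index(max(infected))
print("Infections peak on day", peak_day)
```

Plotting `infected` against the day index gives the characteristic shape described above: a rise up to a single peak followed by a decline.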
2. 1. 2. Confirmed, recovered and fatalities on a daily basis ¶
PALETTE = itertools.cycle(sns.color_palette())
def plot_in_cols(df, to_plot=()):
    # An immutable default avoids sharing one list between calls
    n_plots = len(to_plot)
    fig, ax = plt.subplots(1, n_plots, figsize=(n_plots*6,8))
    for i, col in enumerate(to_plot):
        ax[i].plot(df[col], label=col.replace('_', ' '), color=next(PALETTE))
        ax[i].set_title('Number of ' + col.replace('_', ' '))
        ax[i].xaxis.set_major_locator(mdates.MonthLocator())
        ax[i].ticklabel_format(axis='y', style='plain')
    return fig, ax
fig, ax = plot_in_cols(by_date, to_plot=['daily_confirmed', 'daily_fatalities', 'daily_recovered'])
Observations: The plots above show a clear upward trend.
2. 1. 3. Growth Factor ¶
Growth factor is the factor by which a quantity multiplies itself over time. The formula used is every day's new cases / new cases on the previous day. For example, a quantity growing by 7% every period (in this case daily) has a growth factor of 1.07.
A growth factor that stays above 1 indicates an increase and could signal exponential growth, whereas one that remains between 0 and 1 is a sign of decline, with the quantity eventually becoming zero.
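A short worked example may help here. The numbers below are made up, and the name `daily_new` is an assumption for illustration; the point is how each day's new cases are divided by the previous day's.

```python
import pandas as pd

# Made-up daily new cases for five consecutive days
daily_new = pd.Series(
    [100, 107, 107, 80, 40],
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
)

# shift(1) moves values forward by one day, so each day is divided by the
# PREVIOUS day's count; the first day has no predecessor and becomes NaN
growth_factor = daily_new / daily_new.shift(1)
print(growth_factor.round(2))
```

The second day grows by 7% (factor 1.07), the third day is flat (factor 1), and the last two days are below 1, which indicates declining new cases.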
def plot_growth_factor(df):
    fig, ax = plt.subplots(figsize=(9, 9))
    # shift(1) aligns each day with the previous day's new cases
    # (shift(-1) would divide by the NEXT day's value instead)
    growth_factor = df['daily_confirmed'] / df['daily_confirmed'].shift(1)
    ax.plot(df.index, growth_factor, marker='o')
    ax.axhline(0)
    ax.axhline(1, linewidth=2, color='r')
    ax.set_ylabel('New cases / new cases on previous day')
    return fig, ax
fig, ax = plot_growth_factor(by_date)
ax.set_title('Global growth factor')
plt.show()
Observations: The curve seems to be tending to one, which means the number of daily new cases is staying roughly constant from one day to the next; values that stay above one would signal exponential growth. There are quite a few peaks in February, which could be due to the fact that during the initial expansion of the virus there was no reliable information about the real number of infected cases. Some of the incorrect records we saw in the dataset might also be influencing the plot.
2. 1. 4. Active cases by date¶
By removing fatalities and recovered from total cases, we get "currently infected cases" or "active cases" (cases still awaiting an outcome).
def get_active_cases(df):
    active_cases = df['confirmed'] - df['recovered'] - df['fatalities']
    return active_cases
def plot_active_cases(df):
    fig, ax = plt.subplots(figsize=(14, 6))
    ax.plot(get_active_cases(df))
    ax.ticklabel_format(axis='y', style='plain')
    return fig, ax
fig, ax = plot_active_cases(by_date)
ax.set_title('Active cases by date worldwide')
plt.show()
Observations: It is important to know that not all recoveries are being accounted for, so the number of active cases computed here may be overestimated.
2. 1. 5. Top countries for total confirmed cases¶
# The totals are cumulative, so take each country's latest row instead of
# summing over all dates (which would count every case many times)
by_country = df.sort_values('date').groupby('country').last()
def plot_top_countries(df, n_countries=10):
    top = df.sort_values('confirmed', ascending=False).head(n_countries)
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.bar(top.index, top['fatalities'], label='fatalities')
    ax.bar(top.index, top['confirmed'], bottom=top['fatalities'], label='confirmed')
    ax.bar(top.index, top['recovered'], bottom=top['confirmed'] + top['fatalities'], label='recovered')
    # Rotate the existing tick labels instead of overwriting them
    ax.tick_params(axis='x', labelrotation=90)
    ax.legend()
    return fig, ax
fig, ax = plot_top_countries(by_country)
ax.set_title('Top countries per number of confirmed cases (ordered by confirmed cases)')
plt.show()
Observations: Here we can see how the relationship between the number of fatalities and recovered changes amongst the countries most afflicted by the coronavirus. In China more than half the people recovered and the number of fatalities has stayed relatively low compared to the US and European countries. The same cannot be said for Italy and Spain, although these countries are still at a different stage of the pandemic.
2. 1. 6. New cases for countries with most cases ¶
Next I will plot the number of confirmed cases per country. I will not use all countries for the following plots; only the countries with the highest number of cases are taken into consideration.
def plot_subplots_for_country_cases(df, column, n_countries=5, **kwargs):
    # Rank countries by their highest cumulative total; summing a cumulative
    # column over all dates would count every case many times
    top = df.groupby('country')['confirmed'].max().sort_values(ascending=False).head(n_countries)
    fig, ax = plt.subplots(len(top), 1, **kwargs)
    for i, country in enumerate(top.index):
        country_df = df[df['country']==country]
        grouped_by_date = country_df.groupby('date')[column].sum()
        ax[i].plot(grouped_by_date, color=next(PALETTE))
        ax[i].set_title(country)
        ax[i].xaxis.set_ticks(grouped_by_date.index[::14])
        ax[i].ticklabel_format(axis='y', style='plain')
    fig.tight_layout(pad=3.0)
    fig.suptitle('Number of ' + column.replace('_', ' '))
    return fig, ax
fig, ax = plot_subplots_for_country_cases(df, n_countries=6, column='daily_confirmed', figsize=(12, 12))
Observations: We can see how China's curve differs from those of the other countries. Besides the fact that it was the first country to deal with the coronavirus outbreak, the measures seem to have worked well, as we can see a downward trend in new cases after slightly more than one month.
2. 1. 7. Active cases by country ¶
def plot_active_cases_countries(df, n_countries=5):
    # Use the DataFrame passed in rather than the global by_country
    active_cases = get_active_cases(df).sort_values(ascending=False)
    # Copy the slice before adding the 'Other' row to avoid modifying a view
    out = active_cases.iloc[:n_countries].copy()
    out.loc['Other'] = active_cases.iloc[n_countries:].sum()
    fig, ax = plt.subplots(figsize=(7,7))
    labels = out.index
    sizes = [ x/out.sum() for x in out ]
    explode = np.zeros(len(out))
    explode[-1] = 0.1
    ax.pie(sizes, labels=labels, autopct='%1.1f%%',
           shadow=True, startangle=90, explode=explode)
    ax.axis('equal')
    return fig, ax
fig, ax = plot_active_cases_countries(by_country, n_countries=5)
ax.set_title('Active cases by country')
plt.show()
2. 1. 8. Outcome of cases (recovery or death) ¶
The outcome of cases is the share of cumulative recoveries, and of cumulative deaths, over the cumulative number of closed cases (recoveries plus deaths).
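As a quick worked example with made-up figures for a single day (the numbers are assumptions for illustration only), the two rates are computed over closed cases, not over all confirmed cases:

```python
# Made-up cumulative figures for a single day
confirmed, recovered, fatalities = 1000, 300, 100

closed_cases = recovered + fatalities    # cases that already have an outcome
active_cases = confirmed - closed_cases  # cases still awaiting an outcome

recovery_rate = recovered / closed_cases * 100  # % of closed cases that recovered
death_rate = fatalities / closed_cases * 100    # % of closed cases that died

print(f"active: {active_cases}, recovered: {recovery_rate:.1f}%, deaths: {death_rate:.1f}%")
```

The two percentages always add up to 100, so early in an outbreak, when few cases are closed, they can swing widely from day to day.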
def plot_outcome_of_cases(df):
    fig, ax = plt.subplots(figsize=(14, 6))
    closed_cases = df['recovered'] + df['fatalities']
    perc_recovered = df['recovered'] / closed_cases * 100
    perc_deaths = df['fatalities'] / closed_cases * 100
    ax.plot(perc_recovered, marker='o', color='g', label='recovered')
    ax.plot(perc_deaths, marker='o', color='r', label='fatalities')
    ax.set_ylabel('Percent (%)')
    ax.set_title('Outcome of total closed cases (recovery rate vs death rate)')
    ax.legend(loc='upper left')
    return fig, ax
fig, ax = plot_outcome_of_cases(by_date)
plt.show()
Observations: The difference in stages between countries makes this graph of little use. We will see a more informative outcome of total closed cases when analysing the countries separately.
2. 2. China ¶
by_date_china = df[df.country=='China'].groupby('date').sum()
2. 2. 1. Total and daily cases ¶
fig, ax = plot_total_and_daily(by_date_china)
ax[0].set_title('Daily new cases in China')
ax[1].set_title('Cumulative number of cases in China')
plt.show()
Observations: China seems to be at a later stage of the outbreak, with the graphs showing encouraging behaviour.
2. 2. 2. Growth factor ¶
yesterday = date.today() - timedelta(days = 1)
fig, ax = plot_growth_factor(by_date_china.loc['2020-02-23':])
ax.set_title("China's growth factor")
plt.show()
Observations: Although the growth factor seems to be centering around one, we are only working with a very small number of daily new cases compared to other countries, as we can see below.
# Use the last 7 days available in the data rather than the calendar date,
# so the cell still works when the dataset is not updated daily
new_cases_past_week = by_date_china['daily_confirmed'].tail(7).mean()
print('There was an average of', int(new_cases_past_week), 'daily new cases in China over the last 7 days of data')
2. 2. 3. Active cases ¶
fig, ax = plot_active_cases(by_date_china)
ax.set_title('Active cases in China')
plt.show()
Observations: The number of active cases has dropped drastically. As we will see from the following plot, most of these people have recovered.
2. 2. 4. Outcome of cases ¶
fig, ax = plot_outcome_of_cases(by_date_china)
plt.show()
Observations: China looks to have handled the pandemic well, with very few deaths after the initial outbreak.
2. 3. Europe ¶
We will be analysing Europe as a whole. Though it is true that different measures were taken amongst European countries, we take the geographical context into consideration, with the total EU area being smaller than that of the US or China.
european_countries = ['Austria',
'Belgium',
'Bulgaria',
'Croatia',
'Cyprus',
'Czechia',
'Denmark',
'Estonia',
'Finland',
'France',
'Germany',
'Greece',
'Hungary',
'Ireland',
'Italy',
'Latvia',
'Lithuania',
'Luxembourg',
'Malta',
'Netherlands',
'Poland',
'Portugal',
'Romania',
'Slovakia',
'Slovenia',
'Spain',
'Sweden']
by_date_eu = df[df.country.isin(european_countries)].groupby('date').sum()
2. 3. 1. Total and daily cases ¶
fig, ax = plot_total_and_daily(by_date_eu)
ax[0].set_title('Daily new cases in the EU')
ax[1].set_title('Cumulative number of cases in the EU')
plt.show()
Observations: We can see how the new cases are starting to follow a downward trend; on the other hand, the daily number of people recovered seems to be rising while the fatalities are slightly increasing.
2. 3. 2. Growth factor ¶
fig, ax = plot_growth_factor(by_date_eu)
ax.set_title('Growth factor in EU')
plt.show()
Observations: We can see some inconsistency with some of the initial data being reported. Starting from around 15th of March we can see that most datapoints are below one, suggesting a decrease in new cases.
2. 3. 3. Active cases ¶
fig, ax = plot_active_cases(by_date_eu)
ax.set_title('Active cases in EU')
plt.show()
Observations:
2. 3. 4. Outcome of cases ¶
fig, ax = plot_outcome_of_cases(by_date_eu)
ax.set_title('Outcome of cases in EU')
plt.show()
Observations: In the later stages there seems to be a steady increase in the number of people recovered.
2. 4. USA ¶
by_date_usa = df[df.country=='US'].groupby('date').sum()
2. 4. 1. Total and daily cases ¶
fig, ax = plot_total_and_daily(by_date_usa)
ax[0].set_title('Daily new cases in USA')
ax[1].set_title('Cumulative number of cases in USA')
plt.show()
Observations: The USA seems to be at the initial stages of the pandemic, with no positive signs being shown yet. The number of deaths is also higher than the number of people recovered so far.
2. 4. 2. Growth factor ¶
fig, ax = plot_growth_factor(by_date_usa)
ax.set_title("USA's growth factor")
plt.show()
Observations: As the datapoints are all very close to one, the number of daily new cases is holding steady rather than declining; values persistently above one would indicate exponential growth.
2. 4. 3. Active cases ¶
fig, ax = plot_active_cases(by_date_usa)
ax.set_title('Active cases in USA')
plt.show()
Observations: The number of active cases is also increasing steeply.
2. 4. 4. Outcome of cases ¶
fig, ax = plot_outcome_of_cases(by_date_usa)
ax.set_title('Outcome of cases in USA')
plt.show()
Observations: The USA's outcome of cases highlights issues with the handling of coronavirus. Though it is still at the initial stages, it is the only country analysed that shows a higher percentage of deaths than recoveries.
3. Disclaimer¶
- With new data being added on a daily basis, in order to include the most recent cases, some sections might not be up to date.
- The objective of this notebook is to provide some insights about the COVID-19 transmission from a data-centric perspective in a didactical and simple way. Results should not be considered in any way an affirmation of what will happen in the future. Observations obtained from data exploration are personal opinions.