Xinyi Miao
This project uses data provided by the Korea Centers for Disease Control & Prevention and data obtained from the COVID Tracking Project. Many thanks to the data providers.
COVID-19 became a trending term at the start of 2020, when the virus began to spread in China. By May 5th, 2020, about 105 days after people started to pay attention to it, the number of confirmed cases worldwide had risen to 3.66 million. The COVID-19 data gives us insight into the virus and the patients: how did they get infected, and what kinds of people are more vulnerable? Thanks to the Korea Centers for Disease Control & Prevention, we have such a valuable data set to learn from.
According to the news, Korea announced its first case on January 20th, 2020: a Chinese woman who had traveled from Wuhan. From then on, the number of confirmed cases increased slowly until an outbreak began around February 20th. The sudden jump has been attributed to 'Patient 31', a super spreader who attended a gathering held by the Shincheonji Church of Jesus the Temple of the Tabernacle of the Testimony in Daegu. As of May 5th, Korea had 10,804 confirmed cases, 254 deaths, and 9,283 recovered cases.
In this project, I first do exploratory analysis to understand the general situation in Korea, such as the time pattern of the total number of confirmed cases as well as the daily growth. I also analyze the characteristics of the patients and find that people in their 20s make up the greatest proportion of them. Older people and people with underlying diseases are more vulnerable to COVID-19, as they have the highest fatality rates. Moreover, for Korea, the group infection in the Shincheonji Church, overseas inflow, and contact with patients are the top three infection cases.
Then, I build models to make predictions for Korea and the United States. The models provide a general insight into the future situation of the virus in the two countries, but they should be improved in the future to make the predictions more accurate.
# import the packages we will use in the analysis and modelling
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from datetime import datetime
Case = pd.read_csv('/content/drive/My Drive/CIS545_2020/Case.csv',
sep=',')
PatientRoute = pd.read_csv('/content/drive/My Drive/CIS545_2020/PatientRoute.csv',
sep=',')
PatientInfo = pd.read_csv('/content/drive/My Drive/CIS545_2020/PatientInfo.csv' , dtype={'patient_id': object,'infected_by': object},
sep=',')
Region = pd.read_csv('/content/drive/My Drive/CIS545_2020/Region.csv',
sep=',')
SeoulFloating = pd.read_csv('/content/drive/My Drive/CIS545_2020/SeoulFloating.csv',
sep=',')
Time = pd.read_csv('/content/drive/My Drive/CIS545_2020/Time.csv',
sep=',')
TimeProvince = pd.read_csv('/content/drive/My Drive/CIS545_2020/TimeProvince.csv',
sep=',')
TimeAge = pd.read_csv('/content/drive/My Drive/CIS545_2020/TimeAge.csv',
sep=',')
SearchTrend = pd.read_csv('/content/drive/My Drive/CIS545_2020/SearchTrend.csv',
sep=',')
Weather = pd.read_csv('/content/drive/My Drive/CIS545_2020/Weather.csv',
sep=',')
In this part, I first analyze the general situation in South Korea to understand the mortality rate, the daily growth rate, and the development trend of the virus in the country from January 20th. Next, I explore which groups of people are more likely to get infected or even die from the coronavirus, and whether weather effects influence the spread of the virus. Moreover, I analyze and visualize the data by geography and by infection case. Finally, I treat the spread from person to person as a graph and track the transmission from a super spreader using the PatientInfo dataset.
To gain a general understanding of the situation in Korea, we first plot the daily confirmed cases to see how the number of infections changes over time in the whole country. We also calculate the mortality and the test positive rate for the country.
# calculate some index: mortality, positive rate
Time['mortality'] = Time['deceased']/Time['confirmed']*100
Time['positive_rate'] = Time['confirmed']/Time['test']*100
print('The current mortality: {}%.'.format(round(Time.mortality.iloc[-1], 2)))
print('The current test positive rate: {}%.'.format(round(Time.positive_rate.iloc[-1], 2)))
As of the end of April, the fatality rate of COVID-19 in Korea is about 2.29% and the test positive rate is 1.74%. For comparison, the mortality rate in China is 5.51% and the global figure is 4.34%. In terms of these numbers, Korea did a better job than China. After the outbreak in the country, Korea took several measures to isolate infected people and to trace and publish their routes without a further lockdown.
# convert the date field from string to datetime
Time['date'] = Time['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
From the plot we can see a great jump in the number of confirmed cases starting in the last week of February. However, the country brought the situation under control soon afterwards, and the daily increase dropped noticeably after about half a month. We can also see that there seems to be a time lag of about two weeks between the trend of confirmed cases and that of released cases.
from matplotlib import pyplot as plt
%matplotlib inline
Time = Time.set_index('date')
Time['confirmed'].plot(figsize=(10,5))
Time['released'].plot()
Time['deceased'].plot()
plt.xlabel('Date', fontdict={'size':12})
plt.ylabel('Count', fontdict={'size':12})
plt.title('Confirmed, Released, and Deceased Cases in Korea by Day', size=16)
plt.legend()
plt.show()
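To put a rough number on the apparent lag between the confirmed and released curves, we can shift the released series and check where the correlation peaks. This is a quick supplementary check, not part of the original analysis, and it assumes the date-indexed Time frame built above.
# shift the released curve backwards by 0-30 days and see which shift aligns it best with the confirmed curve
lags = range(0, 31)
corrs = [Time['confirmed'].corr(Time['released'].shift(-lag)) for lag in lags]
print('The correlation peaks at a shift of about {} days.'.format(int(np.argmax(corrs))))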
The next question to explore is when Korean people started to notice COVID-19: before the domestic outbreak, when China started to struggle with the coronavirus, or after the outbreak, when the number of confirmed cases in their own country was already high?
# we only care about the search trend after January 20th, 2020
SearchTrend['date'] = SearchTrend['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
SearchTrend = SearchTrend.loc[SearchTrend['date'] > datetime(2020, 1, 20) ]
# plot the COVID status in Korea and the search trend
fig = plt.subplots(figsize=(10,5))
ax1 = plt.subplot(111)
ax1.plot(SearchTrend.date, SearchTrend.coronavirus, lw=3)
ax1.set_ylabel('Search Trend', size=12)
plt.suptitle(" Search Trend of 'coronavirus' in Korea", size=18)
plt.title("(Compare with the number of the confirmed case)", c='gray')
ax2 = ax1.twinx()
ax2.plot(Time.confirmed, '-.', lw=2, c='lightblue')
ax2.set_ylabel('The number of confirmed case', size=12)
ax2.set_xlabel('Date', size=12)
plt.show()
Apparently, Korean people started to notice COVID-19 long before the outbreak in their country, about a month earlier, when China reported its first death related to COVID-19 and the virus started to spread. The word 'coronavirus' became trending around February 20th, as shown, when the group infection occurred at the Shincheonji Church in Daegu. After the surge, as the growth of confirmed cases slowed down, the word 'coronavirus' received fewer searches, though the curve still fluctuates.
In the last part, we mainly focused on the number of confirmed cases; in this part, we calculate the daily growth rate to capture the inflection point in Korea.
Before the calculation and the analysis, I first set up the environment for running Spark SQL.
! sudo apt install openjdk-8-jdk
! sudo update-alternatives --config java
### Install required packages
%%capture
!pip install pandasql
!apt install libkrb5-dev
!wget https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install findspark
!pip install sparkmagic
!pip install pyspark
! pip install pyspark --user
! pip install seaborn --user
! pip install plotly --user
! pip install imageio --user
! pip install folium --user
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
import os
spark = SparkSession.builder.appName('Graphs-HW2').config("spark.driver.memory", '5g').getOrCreate()
%load_ext sparkmagic.magics
import pandasql
import sqlite3
os.environ['SPARK_HOME'] = '/content/spark-2.4.5-bin-hadoop2.7'
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
import pyspark
from pyspark.sql import SQLContext
try:
if(spark == None):
spark = SparkSession.builder.appName('Initial').getOrCreate()
sqlContext=SQLContext(spark)
except NameError:
spark = SparkSession.builder.appName('Initial').getOrCreate()
sqlContext=SQLContext(spark)
# read the data as spark data frame
Time_sdf = spark.read.format("csv").option("header", "true").load("/content/drive/My Drive/CIS545_2020/Time.csv")
Time_sdf.createOrReplaceTempView('Time')
In this project, the daily growth rate is calculated by dividing the number of daily new cases by the cumulative number of confirmed cases on the previous day (times 100). To calculate it, we first extract the daily increase from the accumulated values.
# extract the daily increase
DailyGrowth = spark.sql('''
SELECT t1.date, t1.confirmed, t1.confirmed - t2.confirmed as today, (t1.confirmed - t2.confirmed)*100/t2.confirmed as GrowthRate FROM Time t1
JOIN Time t2 ON t1.date == date_add(t2.date,1)
''')
DailyGrowth = DailyGrowth.toPandas()
DailyGrowth['date'] = DailyGrowth['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
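As a sanity check on the SQL above, the same quantities can be computed directly in pandas from the date-indexed Time frame built earlier (a minimal equivalent sketch):
# daily new cases and percentage growth relative to the previous day's cumulative count
daily_new_pd = Time['confirmed'].diff()
growth_rate_pd = Time['confirmed'].pct_change() * 100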
# plot the case growth status in Korea
fig = plt.subplots(figsize=(10,5))
ax1 = plt.subplot(111)
ax1.plot(DailyGrowth.date, DailyGrowth.GrowthRate, lw=3,c='#FFD43B')
ax1.set_ylabel('Growth Rate', size=12)
plt.suptitle("Case Growth in Korea", size=18)
plt.title('The yellow line indicates the growth rate and the blue line indicates the daily increase', c='gray')
ax2 = ax1.twinx()
ax2.plot(DailyGrowth.date, DailyGrowth.today, lw=3)
ax2.set_ylabel('Daily Increase', size=12)
ax2.set_xlabel('Date', size=12)
plt.show()
From the plot above, we can conclude that growth was fast in mid-to-late February, and in the first week of March the country experienced its highest daily increase. In more detail, the growth rate jumped from 3% to 65% to 104% between February 18th and February 20th. According to the news at that time, this jump mainly came from participants in a church activity in Daegu who were infected by 'Patient 31', a super spreader.
TimeProvince_sdf = spark.read.format("csv").option("header", "true").load("/content/drive/My Drive/CIS545_2020/TimeProvince.csv")
TimeProvince_sdf.createOrReplaceTempView('TimeProvince')
# extract the daily increase
DailyGrowthByProv = spark.sql('''
SELECT t1.date, t1.province, t1.confirmed, t1.confirmed - t2.confirmed AS DailyGrowth, (t1.confirmed - t2.confirmed)*100/t2.confirmed as GrowthRate FROM TimeProvince t1
JOIN TimeProvince t2 ON t1.date == date_add(t2.date,1) AND t1.province==t2.province
''')
DailyGrowthByProv = DailyGrowthByProv.toPandas()
DailyGrowthByProv['date'] = DailyGrowthByProv['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
DailyGrowthByProv.confirmed = DailyGrowthByProv.confirmed.astype(int)
DailyGrowthByProv = DailyGrowthByProv.fillna(0)
confirmedByProv = DailyGrowthByProv.pivot_table(
index=['date'],
columns=['province'],
values=['confirmed']
)
confirmedByProv.columns = [x[1] for x in confirmedByProv.columns]
fig, ax = plt.subplots(figsize=(9,5))
plot = sns.lineplot(x="date", y="GrowthRate",
hue="province",
data=DailyGrowthByProv)
plt.xticks(rotation=45)
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width* 0.8 , box.height])
plt.xlabel('Date', fontdict={'size':12})
plt.ylabel('Growth Rate (%)', fontdict={'size':12})
plt.title('Daily Growth Rate of Confirmed Cases in Korea \nby Province', size=16)
ax.legend(loc='lower right', bbox_to_anchor=(1.7, -0.05), ncol=2)
plt.show()
fig, ax = plt.subplots(figsize=(9,5))
plot = sns.lineplot(x="date", y="DailyGrowth",
hue="province",
data=DailyGrowthByProv)
plt.xticks(rotation=45)
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width* 0.8 , box.height])
plt.xlabel('Date', fontdict={'size':12})
plt.ylabel('Count', fontdict={'size':12})
plt.title('Daily New Confirmed Cases in Korea by Province', size=16)
ax.legend(loc='lower right', bbox_to_anchor=(1.7, -0.05), ncol=2)
plt.show()
Obviously, the COVID-19 situation shows different patterns in different provinces. From the two plots above, we again conclude that the outbreak area is Daegu, and the values for Gyeongsangbuk-do are also quite high.
In this section, we group the patients by characteristics like age and sex in order to explore the common characteristics of the infected people.
TimeAge['date'] = TimeAge['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
People in their 20s form the group with the highest number of confirmed cases. But we should also notice that the population in their 20s is large, especially compared with the 0s, 10s, and 80s groups. Thus, we should focus on the percentage instead of the raw count.
TimeAge_plot = TimeAge.set_index('date')
fig, ax = plt.subplots(figsize=(10,5))
ax.stackplot(np.unique(TimeAge['date']), TimeAge.loc[TimeAge.age=='0s']['confirmed'].tolist(), TimeAge.loc[TimeAge.age=='10s']['confirmed'].tolist(),
TimeAge.loc[TimeAge.age=='20s']['confirmed'].tolist(), TimeAge.loc[TimeAge.age=='30s']['confirmed'].tolist(),
TimeAge.loc[TimeAge.age=='40s']['confirmed'].tolist(), TimeAge.loc[TimeAge.age=='50s']['confirmed'].tolist(),
TimeAge.loc[TimeAge.age=='60s']['confirmed'].tolist(), TimeAge.loc[TimeAge.age=='70s']['confirmed'].tolist(),
TimeAge.loc[TimeAge.age=='80s']['confirmed'].tolist(), labels = np.unique(TimeAge['age']),
colors = ['#003f5c', '#2f4b7c', '#665191', '#a05195', '#d45087', '#f95d6a', '#ff7c43', '#ffa600', '#ffd311'])
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width* 0.8 , box.height])
plt.xlabel('Date', fontdict={'size':12})
plt.xticks(rotation=45)
plt.ylabel('Count', fontdict={'size':12})
plt.title('Number of Confirmed Case in Korea by Age (Accumulated)', size=16)
ax.legend(loc='lower right', bbox_to_anchor=(1.2, 0.15), ncol=1)
plt.show()
#TimeAge_plot = TimeAge.set_index('date')
fig, ax = plt.subplots(figsize=(10,5))
sns.lineplot(x="date", y="confirmed",
hue="age",
data=TimeAge)
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width* 0.8 , box.height])
plt.xlabel('Date', fontdict={'size':12})
plt.ylabel('Count', fontdict={'size':12})
plt.xticks(rotation=45)
plt.title('Number of Confirmed Case in Korea by Age (Accumulated)', size=16)
ax.legend(loc='lower right', bbox_to_anchor=(1.2, 0.15), ncol=1)
plt.show()
To compare more fairly, we calculate the infection rate for each age group based on Korean population data. The infection rate is defined as the number of confirmed cases as a percentage of the total population in that age group. In addition, the fatality rate by age is calculated to help understand the infection in different age groups.
# create a dataframe to store the population in Korea by age
KorPop = {'age':np.unique(TimeAge['age']).tolist(),
'Population':[4055740, 4732100, 6971785, 7203550, 8291728, 8587047, 6472987, 3591533, 1646916]}
KorPop = pd.DataFrame(KorPop)
# add the number of confirmed/deceased cases to the dataframe and calculate the mortality and infection rate
KorPop = pd.merge(KorPop, TimeAge.loc[TimeAge.date=='2020-04-20'][['age','confirmed','deceased']], on='age')
KorPop['infection_rate'] = KorPop['confirmed']/KorPop['Population']*100
KorPop['mortality'] = KorPop['deceased']/KorPop['confirmed']*100
As expected, people in their 20s have the highest infection rate in Korea, about 0.04%; the 80s group comes second with an infection rate of 0.03%. However, the mortality rate for the 80s group is much higher than for the other age groups, about 25%. Thus, we can conclude that people in their 20s and 80s get infected more easily, and elderly patients in their 80s have a higher risk of death. Some possible reasons: people in their 20s have more contact with society, while people in their 80s tend to have weaker immunity and are more vulnerable.
fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(14,4))
# plot the infection rate by age
ax1.bar(KorPop['age'], KorPop['infection_rate'])
ax1.set_title('Infection Rate in Korea by Age', size=14)
ax1.set_xlabel('Age', size=12)
ax1.set_ylabel('Infection Rate(%)', size=12)
# plot the mortality by age
ax2.bar(KorPop['age'], KorPop['mortality'])
ax2.set_title('Mortality in Korea by Age', size=14)
ax2.set_xlabel('Age', size=12)
ax2.set_ylabel('Mortality(%)', size=12)
plt.show()
PatientBySex = PatientInfo.groupby('sex')[['patient_id']].count().reset_index().rename(columns={'patient_id':'count'})
There are more female patients in Korea, making up about 56% of the total.
fig, ax = plt.subplots()
piechart = ax.pie(PatientBySex['count'], radius=1, autopct='%1.1f%%', pctdistance = 0.4, textprops = {'size':12},
wedgeprops=dict(width=0.3, edgecolor='w'), labels = ['Female','Male'])
aspect = ax.set(aspect="equal")
title = ax.set_title('The percentage of patients by sex', size=16)
plt.show()
MortalityBySex = PatientInfo.groupby(['sex','state'])[['patient_id']].count().reset_index().rename(columns={'patient_id':'count'})
MortalityBySex
According to research, patients with underlying diseases are more likely to die of COVID-19. This might also be one reason why older people tend to be at higher risk. Unfortunately, we only have 19 non-null records for the disease field, and all of the recorded patients with underlying diseases died. We may also care about the number of days between the confirmation date and the date of death.
PatientDisease = PatientInfo.copy().loc[pd.notnull(PatientInfo.disease)]
PatientDisease['confirmed_date'] = PatientDisease.confirmed_date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDisease['deceased_date'] = PatientDisease['deceased_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDisease['number_of_day'] = PatientDisease['deceased_date'] - PatientDisease['confirmed_date']
print('According to the patient records, patients with underlying diseases died on average {} days after confirmation.'.format(PatientDisease['number_of_day'].mean().days))
Though we do not have a dedicated dataset for patients without underlying diseases, we calculate the same number for all deceased patients to see whether there is any difference.
PatientDeath_D = PatientInfo.copy().loc[pd.notnull(PatientInfo.deceased_date) & (PatientInfo.disease==True)]
PatientDeath_D['deceased_date'] = PatientDeath_D['deceased_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDeath_D['confirmed_date'] = PatientDeath_D['confirmed_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDeath_D['number_of_day'] = PatientDeath_D['deceased_date'] - PatientDeath_D['confirmed_date']
print('Among deceased patients with underlying diseases, death occurred on average {} days after confirmation.'.format(PatientDeath_D['number_of_day'].mean().days))
PatientDeath = PatientInfo.copy().loc[pd.notnull(PatientInfo.deceased_date)]
PatientDeath['confirmed_date'] = PatientDeath['confirmed_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDeath['deceased_date'] = PatientDeath['deceased_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDeath['number_of_day'] = PatientDeath['deceased_date'] - PatientDeath['confirmed_date']
print('Across all deceased patients in the records, death occurred on average {} days after confirmation.'.format(PatientDeath['number_of_day'].mean().days))
Thus, generally, patients with underlying diseases in Korea die around the 8th day after confirmation, earlier than deceased patients overall; people without underlying diseases tend to survive longer after infection and thus have more chance to overcome the coronavirus.
visit_mean = PatientRoute.groupby('patient_id')['global_num'].count().mean()
visit_max = PatientRoute.groupby('patient_id')['global_num'].count().max()
print('People testing positive have on average {} recorded route points before confirmation. \nThe patient with the most route records visited {} places.'.format(round(visit_mean,2), visit_max))
The top three types of places the patients visited most often are hospitals, public transportation, and restaurants, where people inevitably stay in close contact. About 85% of the patients visited a hospital before confirmation.
PatientRoute.groupby(['type'])['global_num'].count().sort_values(ascending=False).head(4)
people_visit_hosp = PatientRoute.loc[PatientRoute.type=='hospital'].groupby('patient_id').count()
total_num = len(np.unique(PatientRoute['patient_id']))
percent = len(people_visit_hosp)/total_num*100
percent
We now visualize the confirmed cases in Korea by province. Before the analysis, we first import the packages and read the province shapefile to link the data with geographic information.
! pip install geopandas
! pip install datashader
import geopandas as gpd
# Datashader imports
import datashader as ds
import datashader.transfer_functions as tf
from colorcet import fire
korea_shp = gpd.read_file('https://map.igismap.com/igismap/Api/Api_download/downloadGlobalFile?out=shp&layerid=7eabe3a1649ffa2b3ff8c02ebfd5659f&token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpZCI6MTUxMTQsImVtYWlsIjoiaGlzdW15ZWVAaG90bWFpbC5jb20iLCJpYXQiOjE1ODg0NTQ1NDQsImV4cCI6MTU4ODU0MDk0NH0.LJaujIExBAk0YLR1E2hRu0JFopxdFp8-gSKwqyLSwQ0')
korea_shp = korea_shp.drop(['id', 'country', 'enname', 'locname', 'offname', 'boundary',
'adminlevel', 'wikidata', 'wikimedia', 'timestamp', 'note', 'path',
'rpath', 'iso3166_2'], axis=1)
# merge the Korea geography information with the TimeProvince data to create the map
# keep the date column as 'YYYY-MM-DD' strings so it matches TimeProvince.date in the GIF loop below
TimeProvinceGEO = pd.merge(TimeProvince, korea_shp, left_on='province', right_on='name')
TimeProvinceGEOByDate = TimeProvinceGEO.groupby('date')[['confirmed','deceased','released']].sum().reset_index()
TimeProvinceGEOByDate['total'] = TimeProvinceGEOByDate['confirmed'] - TimeProvinceGEOByDate['released'] - TimeProvinceGEOByDate['deceased']
TimeProvinceGEO = pd.merge(TimeProvinceGEO, TimeProvinceGEOByDate[['date','total']], on='date', how='left')
TimeProvinceGEO = gpd.GeoDataFrame(TimeProvinceGEO)
To make the visualization more informative, I calculate, for each day, each province's share of the people currently in isolation, i.e., the province's active cases (confirmed minus released minus deceased) as a percentage of the national total.
TimeProvinceGEO['percent'] = round((TimeProvinceGEO['confirmed'] - TimeProvinceGEO['released'] - TimeProvinceGEO['deceased'])/TimeProvinceGEO['total']*100)
# import the package to plot the gif
import imageio
def plot_status_by_day(fig, data, date, x_range, y_range):
# trim to the specific date
df_today = data.loc[data["date"] == date]
# plot the image on a matplotlib axes
plt.clf()
ax = fig.gca()
df_today.plot(ax=ax, column = 'percent', edgecolor='white', legend=True, vmin=0, vmax=100)
ax.set_axis_off()
# add a text label for the date
ax.text(
0.6,
0.1,
date,
color="black",
fontsize=22,
ha="left",
transform=ax.transAxes,
)
# draw the figure and return the image
fig.canvas.draw()
image = np.frombuffer(fig.canvas.tostring_rgb(), dtype="uint8")
image = image.reshape(fig.canvas.get_width_height()[::-1] + (3,))
return image
# create a figure
x_range, y_range = [126.117397903, 129.468304478], [34.3900458847, 38.6122429469]
fig, ax = plt.subplots(figsize=(8,5), facecolor='white')
# Create an image for each day
imgs = []
for date in np.unique(TimeProvince.date).tolist():
img = plot_status_by_day(fig, TimeProvinceGEO, date, x_range=x_range, y_range=y_range)
imgs.append(img)
# Combine the images for each day into a single GIF
imageio.mimsave('TimeProvince.gif', imgs, fps=1);
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython import display
from fastai.imports import *
gifPath = Path("/content/drive/My Drive/CIS545_2020/TimeProvince.gif")
with open(gifPath,'rb') as f:
display.Image(data=f.read(), format='png')
From the GIF we obtained, we can see that at first the confirmed cases were concentrated in the northwestern provinces, such as Incheon, Gyeonggi-do, and Seoul, and that from February 20th the center shifted to Daegu and Gyeongsangbuk-do. We can also see signs of spatial autocorrelation; for example, after Incheon had its confirmed case on January 20th, Seoul and Gyeonggi-do also reported cases soon afterwards.
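The spatial-autocorrelation observation could also be checked more formally with Moran's I. The sketch below is an optional addition and assumes the libpysal and esda packages, which are not used elsewhere in this project:
! pip install libpysal esda
from libpysal.weights import Queen
from esda.moran import Moran
# build queen-contiguity weights from the province polygons and test the latest day's confirmed counts
latest = TimeProvinceGEO.loc[TimeProvinceGEO.date == TimeProvinceGEO.date.max()]
w = Queen.from_dataframe(latest)
mi = Moran(latest['confirmed'].astype(float).values, w)
print('Moran I: {:.3f}, pseudo p-value: {:.3f}'.format(mi.I, mi.p_sim))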
From existing research, we know that weather factors such as temperature and humidity may affect the spread of the virus. Thus, in this part, we explore the relationship between the weather and the daily new cases in Korea. Since we only have a sample from one country and the time span is rather short, the relationship may not be very reliable. Before the analysis, I convert the precipitation to a categorical variable, because what we care about is whether it rained and how that relates to the new cases per day, rather than the amount of rainfall. According to the Pearson correlation, humidity has the highest value among the weather factors, but that value is only 0.07, so it is hard to say whether the weather is related to the daily growth of confirmed cases.
Weather.date = Weather.date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
WeatherByProv = Weather[['date','province','avg_temp','precipitation','avg_relative_humidity']]
WeatherByProv['isPrecip'] = WeatherByProv.precipitation.apply(lambda x: 1 if x!=0 else 0)
WeatherByProv = WeatherByProv.drop(['precipitation'], axis=1)
WeatherEffects = pd.merge(TimeLag[['date','province','today']], WeatherByProv, how='left', on=['date','province'])
WeatherEffects = WeatherEffects.dropna()
WeatherEffects.drop(['date','province'], axis=1).corr()
With the Case table, we explore in more detail how people got infected. For those with a specific infection cause, the top three reasons are the Shincheonji Church (59%), contact with a patient (14%), and overseas inflow (8%). The table below shows the top five reasons; three of them could be classified as group infections.
# We want to check the specific reasons, so we drop the records whose infection_case is 'etc'.
CaseNoEtc = Case.loc[Case.infection_case!='etc']
CauseCount = CaseNoEtc.groupby('infection_case')['confirmed'].sum().reset_index().rename(columns = {'confirmed':'count'}).sort_values('count', ascending=False)
CauseCount['percent'] = CauseCount['count']/CauseCount['count'].sum()*100
top5 = CauseCount.sort_values('count', ascending=False).head(5)
top5
Focusing on these reasons, we would like to know more about the weekly time pattern of the number of cases attributable to these five causes. Thus, we switch to the PatientInfo dataset; its number of records is limited, but the general trend can still be obtained from it.
In this part, we analyze the cases at the week level: for each date, we extract the corresponding week of the year and count the confirmed cases per infection case per day. Then, in the second SQL query, I aggregate to week-level numbers and compute the proportion of each infection case within each week.
PatientInfo_sdf = spark.read.format("csv").option("header", "true").load("/content/drive/My Drive/CIS545_2020/PatientInfo.csv")#
PatientInfo_sdf.createOrReplaceTempView('PatientInfo')
CauseCountByWeek = spark.sql('''
SELECT confirmed_date, infection_case, mean(weekofyear(confirmed_date)) AS week, COUNT(*) AS cnt FROM PatientInfo
GROUP BY confirmed_date, infection_case
ORDER BY confirmed_date, infection_case
''')
CauseCountByWeek.createOrReplaceTempView('CauseCountByWeek')
CausePercentByWeek = spark.sql('''
SELECT * FROM
(SELECT t1.week, t1.infection_case, SUM(t1.cnt) AS CauseTotal, mean(t2.WeekTotal) AS WeekTotal,
ROUND(SUM(t1.cnt)*100.0/mean(t2.WeekTotal),2) AS percent FROM CauseCountByWeek t1
JOIN
(SELECT week, SUM(cnt) AS WeekTotal FROM CauseCountByWeek
GROUP BY week) t2
ON t1.week==t2.week
GROUP BY t1.week, t1.infection_case
ORDER BY t1.week, t1.infection_case) result
WHERE infection_case IN
(SELECT infection_case FROM
(SELECT infection_case, COUNT(*) as total FROM PatientInfo
WHERE infection_case not like 'etc' GROUP BY infection_case ORDER BY total DESC LIMIT 5) t)
''')
CausePercentByWeek = CausePercentByWeek.toPandas()
CausePercentByWeek['percent_round'] = CausePercentByWeek.percent.apply(lambda x: round(x))
CausePercentByWeek['week'] = CausePercentByWeek.week.apply(lambda x: round(x))
CausePercentByWeek['infection_case'] = CausePercentByWeek['infection_case'].apply(lambda x: x[0].upper()+x[1:])
Here, I create a panel to display the data because the CausePercentByWeek table only has a record when the count is nonzero; however, certain infection cases have no case in some weeks. To display the statistics completely, we have to create the panel.
# expand grid (all combinations of the input iterables)
import itertools
def expandgrid(*itrs):
product = list(itertools.product(*itrs))
return {'Var{}'.format(i+1):[x[i] for x in product] for i in range(len(itrs))}
cases = np.unique(CausePercentByWeek.infection_case).tolist()
week = np.unique(CausePercentByWeek.week).tolist()
# create a panel to store the numbers
panel = pd.DataFrame(expandgrid(week, cases)).rename(columns={'Var1':'week', 'Var2':'infection_case'})
casesStatFull = pd.merge(CausePercentByWeek, panel, on=['week','infection_case'], how='outer').fillna(0)
Take week 11 as an example. Note that we only display the statistics for the top five reasons shown above.
casesStatFull.loc[casesStatFull.week==11].sort_values('CauseTotal', ascending=False)
import altair as alt
chart = alt.Chart(CausePercentByWeek).mark_bar(
cornerRadiusTopLeft=3,
cornerRadiusTopRight=3
).encode(
x=alt.X('week:Q',title='Week of the Year'),
y=alt.Y('mean(percent_round):Q', title='Percentage'),
color='infection_case:N',
size=alt.value(20)
).properties(
width=500,
height=300,
title = 'Percentage of top 5 infection cases by week'
)
chart.configure_title(
fontSize=20
)
From the bar plot above, it can be concluded that the proportions of the infection cases change over time. For example, at the very beginning and recently, overseas inflow is the main infection case, while in the middle period there is a wider variety of infection cases, such as the Onchun Church and Shincheonji Church in week 8 and the Guro-gu Call Center in week 11. In general, the cases caused by contact with patients need attention. As a next step, the Korean government should focus more on controlling overseas inflow.
*The numbers shown in the bar plot might differ from the trend obtained from the Case dataset because only part of the patients are recorded in the PatientInfo table.
In this section, we treat the spread from person to person as a graph and try to trace the spread from one of the super spreaders recorded in the PatientInfo table.
# Extract the fields to use
infected = PatientInfo[['patient_id','infected_by']]
schema = StructType([
StructField("patient_id", StringType(), True),
StructField("infected_by", StringType(), False)
])
infected_sdf = spark.createDataFrame(infected, schema)
infected_sdf.createOrReplaceTempView('infected')
First, we want to know who the super spreaders are according to the data we have. According to the statistics, a woman in her 70s who has infected more than 50 people is the top super spreader in our data set.
infected = PatientInfo[['patient_id','infected_by']].dropna()
SuperSpreader = infected.groupby('infected_by').count().reset_index().sort_values('patient_id',ascending=False).rename(columns={'patient_id':'cnt','infected_by':'patient_id'}).head(3)
SuperSpreader = pd.merge(SuperSpreader, PatientInfo, on='patient_id')
SuperSpreader
The super spreader is patient #2000000205, who was infected by #1000000138 and spread the virus to about 51 people according to the records we have. The spread of the virus among people can be thought of as a directed graph. We then track the spread from patient #2000000205.
PatientInfo.loc[PatientInfo.patient_id=='1000000138']
# iterative search over the infection graph (BFS up to max_depth)
def super_spreader(G, origins, max_depth):
G.createOrReplaceTempView('G')
schema = StructType([
StructField("patient_id", StringType(), True),
StructField("distance", IntegerType(), False)
])
visited = spark.createDataFrame(origins, schema)
to_visit = spark.createDataFrame(origins, schema)
for i in range(1, max_depth+1):
to_visit.createOrReplaceTempView('to_visit')
to_visit = spark.sql('SELECT g1.patient_id as patient_id, min(t.distance)+1 as distance from to_visit t \
JOIN G g1 ON g1.infected_by = t.patient_id \
GROUP BY g1.patient_id \
')
visited = visited.union(to_visit)
if i==max_depth:
visited.createOrReplaceTempView('visited')
visited = spark.sql('SELECT patient_id, min(distance) as distance FROM visited GROUP BY patient_id ORDER BY patient_id')
return visited
We set the max depth to 5 to compare the number of infections across the five transmission steps.
orig = [{'patient_id': '2000000205','distance':0}]
#grab the count
count = super_spreader(infected_sdf, orig, 5)
infected5 = count.toPandas()
Fortunately, after the spread caused by patient #2000000205, the situation was brought under control: only 7 people were infected by those who were infected by the super spreader, and the count drops to 0 by the 5th transmission step.
infected5.groupby('distance').count().rename(columns={'patient_id':'count'})
In 1.3, we grouped the patients by their characteristics; here we do so using clustering analysis. Since there are mixed numerical and categorical variables, I choose K-Prototypes, which is built on K-Means and K-Modes.
First, we extract the variables to use.
X = PatientInfo.drop(['patient_id','global_num','city', 'country','birth_year','disease','infection_order', 'infected_by','symptom_onset_date','confirmed_date','released_date','deceased_date','state'], axis=1)
X.info()
Based on the table shown above, I select the features that might be useful for grouping patients. However, as you can see, there are many NAs in the contact_number field, so we will first fill the missing values. Before doing that, we need to decide how to impute them, so I plot the distribution of the contact number. To make it easy to interpret, I set the number of bins to 120, so the width of each bar is about 10.
sns.distplot(PatientInfo.contact_number, hist=True, kde=False,
bins=int(120), color='#143d59',
hist_kws={'edgecolor':'black'})
plt.title('The Distribution of The Contact Number', size=16)
plt.xlabel('Contact Number',size=12)
plt.ylabel('Count', size=12)
plt.show()
The distribution of the contact number is shown above: most of the patients had fewer than 10 contacts with others, while a few people had more than 1,000 contacts. Filling the missing values with the mean is therefore not appropriate; instead, we use the median here.
X.contact_number = X.contact_number.fillna(X.contact_number.median())
X = X.dropna()
X.head(3)
# convert the string to the int
X['age'] = X['age'].apply(lambda x: int(x.split('s')[0]))
from sklearn.preprocessing import StandardScaler
scal = StandardScaler()
X[['age','contact_number']] = scal.fit_transform(X[['age','contact_number']])
To use K-Prototypes, we first need to define k, i.e., the number of clusters. Here, to pick a proper k, I first use K-Means and try 10 values of k to compare the distortion.
X_kmeans = X.copy()
# clustering analysis
X_kmeans.province = X_kmeans.province.astype('category')
X_kmeans.sex = X_kmeans.sex.astype('category')
X_kmeans.infection_case = X_kmeans.infection_case.astype('category')
# Use pd.get_dummies() to convert the categorical variables to dummy variables.
X_kmeans = pd.concat([X_kmeans[['age', 'contact_number']], pd.get_dummies(X_kmeans[['province','sex','infection_case']])], axis=1)
from sklearn.cluster import KMeans
distortions = []
max_k = 10
for i in range(1,max_k+1):
km = KMeans(n_clusters=i,
init='random',
n_init=1,
max_iter=300,
random_state=0)
km.fit(X_kmeans)
distortions.append(km.inertia_)
To pick the right value of k, we look for the value at which the distortion stops decreasing sharply (the 'elbow') while still splitting the data in a proper and interpretable way. Here, we set k to 4.
plt.plot(range(1,max_k+1), distortions, marker='o')
plt.xlabel('Cluster count (k)')
plt.ylabel('Distortion')
plt.show()
However, there is a problem with using K-Means for the clustering analysis here due to the type of data we have: K-Means is meant for continuous data, while we have mixed numerical and categorical data. Thus, after some searching, I found an algorithm proposed by Zhexue Huang, K-Prototypes, which is based on K-Means and K-Modes.
! pip install kmodes
X_kprototype = X.copy()
from kmodes.kprototypes import KPrototypes
kproto = KPrototypes(n_clusters=4, init='Cao', verbose=1)
clusters = kproto.fit_predict(X_kprototype, categorical=[0, 2, 3])
Most of the patients are assigned to the first two and the last clusters. What are the characteristics of these clusters?
X_kprototype['cluster'] = clusters
X_kprototype.groupby('cluster')['sex'].count().reset_index().rename(columns={'sex':'count'})
kproto_center = pd.concat([pd.DataFrame(kproto.cluster_centroids_[0],columns=['age', 'contact_number']),
pd.DataFrame(kproto.cluster_centroids_[1],columns=['sex', 'province', 'infection_case'])], axis=1)
kproto_center[['age','contact_number']] = scal.inverse_transform(kproto_center[['age','contact_number']])
kproto_center
As the table above shows, patients in different clusters have different characteristics. For the first cluster, the representative patient is a woman in her 60s-70s in Gyeongsangbuk-do who got infected despite limited contact with other people. People in cluster 1 were infected overseas and tested positive in Seoul; these patients are young, in their 20s, and in my opinion they might be students returning from abroad. The persona of the third cluster is a woman in her 30s living in Gyeonggi-do who got infected through contact with a patient. We only have 5 people in cluster 3, and surprisingly they had contacted more than 700 people on average, probably due to participation in a church activity in Busan.
In this section, I make predictions for Korea and the United States. It is important to have a general understanding of the outlook for COVID-19 in order to take better measures to control the situation.
For Korea, the situation is easing off, so I first use a logistic growth model to fit the confirmed cases data and predict the future situation. I also tried ARIMA for Korea; however, the accuracy of that model is rather low and the time pattern of the data keeps changing. In addition, I predict the daily new cases by province using a space-time model, based on the idea that nearby infections and recent infections may strongly influence each province's numbers in the coming days.
I use a logistic growth model to mimic the situation in Korea. It is not the best model for understanding COVID-19 in Korea, but with it we can capture the general trend. The reason for using this function is that its curve is similar to that of COVID-19 in Korea: at first the number of infected people is small, then as time passes the virus spreads from person to person and the number of confirmed cases increases rapidly, and after the government notices and takes measures, the growth slows down, giving the curve an 'S' shape. It is true that an ARIMA model should be suitable for time-series data; however, this series is not stationary even after differencing. So I use the logistic growth model when predicting the situation in Korea.
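For reference, the curve fitted below is the standard logistic growth function P(t) = K * P0 * exp(r * (t - t0)) / (K + P0 * (exp(r * (t - t0)) - 1)), where K is the final cumulative count (the carrying capacity), P0 is the count at the reference time t0, and r is the growth rate; this is exactly the form implemented in logistic_increase_function below, with r fixed to 0.11.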
I use the first 90 days of data to predict the situation over the coming days. r is set to 0.11 and the result is shown in the plot below.
def logistic_increase_function(t, K, P0, r):
r=0.11
t0 = 1
exp_value = np.exp(r * (t - t0))
return (K * exp_value * P0) / (K + (exp_value - 1) * P0)
# fitting the recorded data
t=np.linspace(1, 90, 90)
P=Time.confirmed[:90]
from scipy.optimize import curve_fit
popt, pocv = curve_fit(logistic_increase_function, t, P)
# prediction for a long time in future
P_predict=logistic_increase_function(t,popt[0],popt[1],popt[2])
future=[100,103,107,114,120,130,140,150]
future=np.array(future)
future_predict=logistic_increase_function(future,popt[0],popt[1],popt[2])
# prediction for the next 15 days
tomorrow=range(86,100)
tomorrow=np.array(tomorrow)
tomorrow_predict=logistic_increase_function(tomorrow,popt[0],popt[1],popt[2])
# plot the fitting and the prediction
plot1=plt.plot(t,P,'s',label='confirmed_actual')
plot2=plt.plot(t,P_predict,'r',label='confirmed_fit')
plot3=plt.plot(tomorrow,tomorrow_predict,'s',label='confirmed_pred')
plt.xlabel('Number of Day From Jan 20th', size=12)
plt.ylabel('Count', size=12)
plt.legend(loc=0)
plt.title('Use Logistic model to fit the recorded confirmed cases', size=14)
plt.show()
Generally, the model fits the data well, and the prediction to some extent reflects the outlook for the virus in Korea. The plot below is the prediction for the coming month: the increase is quite slow, suggesting COVID-19 in Korea will stay under control.
# convert the number of days since Jan 20th into dates
from datetime import timedelta
future_date = []
for i in future:
    future_date.append((datetime(2020,1,20)+timedelta(days=int(i))).date())
# plot the predicted confirmed case for the future
plt.plot(future_date,future_predict,'s',label='pred')
plt.xlabel('Date', size=12)
plt.xticks(rotation=45)
plt.ylabel('Total confirmed',size=12)
plt.title('Prediction for future',size=14)
plt.show()
This is a simple mathematical model that does not take socio-economic characteristics into account. According to the model, there would be about 10,730 confirmed cases on May 1st, while the actual number is 10,774, which is pretty close. Thus, the logistic model here can give us a general insight.
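As a quick check of the May 1st figure quoted above (counting January 20th as day 1, May 1st is day 103; 10774 is the reported count), the fitted curve can be evaluated directly:
# compare the fitted logistic curve's value for May 1st with the reported count
may1_pred = logistic_increase_function(np.array([103]), *popt)[0]
print('Predicted for May 1st: {:.0f}; reported: 10774.'.format(may1_pred))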
In this section, I take time effects, spatial effects, and socio-economic characteristics into account when predicting the number of confirmed cases by province. As mentioned above, the number of cases in each province is related to some degree to the numbers in its neighbors and varies over time.
# extract the fields to use and convert the dataframe to the spark dataframe
DailyGrowthByProv_sdf = spark.createDataFrame(DailyGrowthByProv[['date','province','DailyGrowth']])
DailyGrowthByProv_sdf.createOrReplaceTempView('DailyGrowthByProv')
Time effects are measured by time lags. A time lag is defined as the daily growth some number of days earlier, for example, the daily growth one day ago; here I use lags of 1, 3, 7, and 14 days.
# calculate the time lag for each province
TimeLag = spark.sql('''
SELECT t1.date, t1.province, t1.DailyGrowth AS today, t2.DailyGrowth AS 1DayBefore, t3.DailyGrowth AS 3DaysBefore,
t4.DailyGrowth AS 7DaysBefore, t5.DailyGrowth AS 14DaysBefore FROM DailyGrowthByProv t1
JOIN DailyGrowthByProv t2 ON t1.date == date_add(t2.date,1) AND t1.province == t2.province
JOIN DailyGrowthByProv t3 ON t1.date == date_add(t3.date,3) AND t1.province == t3.province
JOIN DailyGrowthByProv t4 ON t1.date == date_add(t4.date,7) AND t1.province == t4.province
JOIN DailyGrowthByProv t5 ON t1.date == date_add(t5.date,14) AND t1.province == t5.province
ORDER BY t1.date
''')
TimeLag = TimeLag.toPandas()
As shown in the table below, we calculate the time lags for each province, taking Seoul as an example. It is clear that the time lags could be strong features when predicting the daily growth, especially the counts one and three days before.
TimeLag[['today','1DayBefore','3DaysBefore','7DaysBefore','14DaysBefore']]=TimeLag[['today','1DayBefore','3DaysBefore','7DaysBefore','14DaysBefore']].astype(float)
TimeLag.loc[TimeLag.province=='Seoul'].drop(['date','province'], axis=1).astype(int).corr()
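The same lag features can also be built directly in pandas with a groupby-shift, which may be handy when Spark is not available. A minimal equivalent sketch using the DailyGrowthByProv frame from above:
# lagged daily growth per province, assuming one row per province per consecutive day
lagged = DailyGrowthByProv.sort_values(['province', 'date']).copy()
for days in [1, 3, 7, 14]:
    lagged['{}DaysBefore'.format(days)] = lagged.groupby('province')['DailyGrowth'].shift(days)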
University count, nursing home count, and the elderly population ratio are considered as factors that might be related to the number of confirmed cases.
count = Region.groupby('province')['university_count','nursing_home_count'].sum().reset_index()
# drop the province field from the ratio frame to avoid a duplicated column after concatenation
ratio = Region.groupby('province')['elderly_population_ratio','elderly_alone_ratio'].mean().reset_index().drop(['province'],axis=1)
CensusBuiltin = pd.concat([count,ratio],axis=1)
data = pd.merge(TimeLag,CensusBuiltin, on='province')
data = pd.merge(data, WeatherEffects[['date','avg_relative_humidity','province']], on=['date','province'])
Spatial effects are measured by a spatial lag, defined here as the sum of the daily new cases in each province's neighboring provinces.
import geopandas as gpd
df = korea_shp.copy()
df["NEIGHBORS"] = None # add NEIGHBORS column
for index, country in df.iterrows():
# get 'not disjoint' countries
neighbors = df[~df.geometry.disjoint(country.geometry)].name.tolist()
# remove own name from the list
neighbors = [name for name in neighbors if country[0] != name ]
# add names of neighbors as NEIGHBORS value
df.at[index, "NEIGHBORS"] = ", ".join(neighbors)
neighbor = df[['name','NEIGHBORS']]
def spatialLag(data):
sumDailyNew = 0
if data['province']=='Jeju-do':
return 0
else:
for prov in data['NEIGHBORS'].split(', '):
temp = TimeLag.loc[(TimeLag.date==data.date) & (TimeLag.province == prov)]
value = temp.today.values
sumDailyNew += value[0]
return sumDailyNew
SpatialLag = pd.merge(TimeLag, neighbor, left_on='province', right_on='name')
spatiallag_result = SpatialLag.apply(lambda x: spatialLag(x), axis=1)
SpatialLag.insert(3, "spatiallag", spatiallag_result)
data = pd.merge(data, SpatialLag[['date','province','spatiallag']], on=['date','province'])
Then, I calculate the correlation between the predicted variable and the predictors. The socio-economic characteristics have low correlation values.
data.corr()
from sklearn.model_selection import train_test_split
import sklearn.metrics
ProvinceData = TimeLag.loc[TimeLag.province=='Seoul']
X = data.drop(['date','province','today'], axis=1)
y = data.today
Given that the data form a time series, I divide the data set into training and test sets based on the date instead of splitting randomly. I use the data up to March 30th as the training set and the rest as the test set.
training = data.loc[data.date <= datetime(2020, 3, 30)]
test = data.loc[data.date > datetime(2020, 3, 30)]
X_train = training.drop(['date','province','today'], axis=1)
y_train = training.today
X_test = test.drop(['date','province','today'], axis=1)
y_test = test.today
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
regr = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=1000, random_state=42)
)
regr.fit(X_train, y_train)
I calculate the error for both the training set and the test set. The MAE for the test set is 2.21 while the mean daily growth is 1.91, so the model clearly does not perform as well as I expected. More improvement is needed here, for example using a more appropriate model or adding stronger features.
# Use the forest's predict method on the test data
predictions = regr.predict(X_test).astype(int)
# Calculate the absolute errors
abs_errors = abs(predictions - y_test)
errors = predictions - y_test
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(abs_errors), 2))
print('Mean Error:', round(np.mean(errors), 2))
print('Mean Daily Growth:', round(y_test.mean(),2))
# Use the forest's predict method on the training data
predictions = regr.predict(X_train).astype(int)
# Calculate the absolute errors
abs_errors = abs(predictions - y_train)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(abs_errors), 2))
print('Mean Daily Growth:', round(y_train.mean(),2))
We pick a single day as a spot check. Following the analysis in section 2.2, we use the numbers on April 1st to predict the daily increase on April 2nd and compare the prediction with the actual situation.
status31 = data.loc[data['date']==datetime(2020,4,1)]
prediction = regr.predict(status31.drop(['date','province','today'], axis=1)).astype(int)
status31.insert(3, "prediction", prediction.astype(float).tolist())
status32 = TimeLag.loc[TimeLag.date==datetime(2020,4,2)]
comp = pd.merge(status32[['province','today']], status31[['province','prediction']],on='province').rename(columns={'today':'actual'})
comp
For most of the provinces, the predicted values are close to the actual numbers, and the model does give us a general sense of the future daily growth across the country.
In the United States, the coronavirus spread fast and the number of confirmed cases keeps increasing rapidly every day. It is possible to use the existing time pattern to predict the future; for this prediction, I use an ARIMA model.
USAdata = pd.read_csv('/content/drive/My Drive/CIS545_2020/us_covid19_daily.csv', dtype={'date': str},
sep=',')
# Days with 0 cases are recorded as NA in this dataset; before the analysis, we fill the NAs with 0.
USAdata = USAdata[['date','positive']].fillna(0)
USAdata.date = USAdata.date.apply(lambda x: datetime.strptime(x, '%Y%m%d'))
# reverse the data; the original data is ordered by date in descending order
USAdata = USAdata[::-1]
# we first train the model using the first 96 days of data, then predict the remaining days and evaluate the accuracy
confirmedCases90days = USAdata[['date','positive']].iloc[:96]
toPred10days = USAdata[['date','positive']].iloc[96:]
confirmedCases90days = confirmedCases90days.set_index('date')
toPred10days = toPred10days.set_index('date')
rolmean = confirmedCases90days.rolling(window=7).mean()  # rolling mean over a 7-day window
rolstd = confirmedCases90days.rolling(window=7).std()  # rolling standard deviation over a 7-day window
#Plot rolling statistics
plt.figure(figsize=(13,5),dpi = 80)
plt.xticks(rotation=45)
orig = plt.plot(confirmedCases90days.positive, color='blue', label='Original')  # original data
mean = plt.plot(rolmean, color='red', label='Rolling Mean')  # rolling mean
std = plt.plot(rolstd, color='black', label='Rolling Std')  # rolling standard deviation
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation', size=16)
plt.xlabel('Date', size=12)
plt.ylabel('Count', size=12)
plt.show()
Apparently the original count data are not stationary according to the plots, but we also run a statistical test, the augmented Dickey-Fuller test, to check stationarity formally. According to the result, with a p-value of 0.99, we fail to reject the null hypothesis of a unit root; the data are probably non-stationary.
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
#Determine rolling statistics
movingAverage = timeseries.rolling(window=7).mean()
movingSTD = timeseries.rolling(window=7).std()
#Plot rolling statistics
orig = plt.plot(timeseries, color='blue', label='Original')
mean = plt.plot(movingAverage, color='red', label='Rolling Mean')
std = plt.plot(movingSTD, color='black', label='Rolling Std')
plt.xticks(rotation=45)
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)
#Perform Dickey–Fuller test
print('Results of Dickey Fuller Test:')
dftest = adfuller(timeseries, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
test_stationarity(confirmedCases90days.positive)
I then check the stationarity of the log-transformed data. With a p-value of 0.88, the log data are probably not stationary either.
#Estimating the trend
confirmedCases90days_logScale = np.log(confirmedCases90days.positive) #taking log
test_stationarity(confirmedCases90days_logScale)
Thus, differencing is used to remove the trend.
#Differencing
datasetLogDiffShifting_1 = confirmedCases90days_logScale - confirmedCases90days_logScale.shift()
plt.plot(datasetLogDiffShifting_1,'r')
datasetLogDiffShifting_2 = datasetLogDiffShifting_1 - datasetLogDiffShifting_1.shift()
plt.plot(datasetLogDiffShifting_2)
After one round of differencing, the p-value is 0.01, which indicates that the data are probably stationary.
datasetLogDiffShifting_1.dropna(inplace=True)
test_stationarity(datasetLogDiffShifting_1)
After the time series has been made stationary by differencing, the next step is to determine whether AR or MA terms are needed to correct any autocorrelation that remains in the differenced series. By looking at the ACF and PACF plots, we can identify the AR and MA orders. In this case, I consider p ranging from 1 to 3 and q ranging from 1 to 2.
import statsmodels.api as sm
fig = plt.figure(figsize=(12,8))
#acf from statsmodels.tsa.stattools import acf, pacf
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(datasetLogDiffShifting_1[2:], lags=20,ax=ax1)
ax1.xaxis.set_ticks_position('bottom')
fig.tight_layout();
#pacf
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(datasetLogDiffShifting_1[2:], lags=20, ax=ax2)
ax2.xaxis.set_ticks_position('bottom')
fig.tight_layout();
We have several combinations of p and q. One way to identify the best combination is to calculate the AIC (Akaike information criterion); generally, the lower the value, the better the model.
import sys
from statsmodels.tsa.arima_model import ARIMA  # used inside _proper_model
def _proper_model(ts_log_diff, maxp, maxq):
best_p = 0
best_q = 0
best_aic = sys.maxsize
best_model=None
for p in np.arange(maxp+1):
for q in np.arange(maxq+1):
model = ARIMA(ts_log_diff, order=(p,1, q))
try:
results_ARIMA = model.fit(disp=-1)
except:
continue
aic = results_ARIMA.aic
print(aic, best_aic)
if aic < best_aic:
best_p = p
best_q = q
best_aic = aic
best_model = results_ARIMA
return best_p,best_q,best_model
_proper_model(confirmedCases90days_logScale, 3, 2)
from statsmodels.tsa.arima_model import ARIMA
# build the model based on the training data and predict on the test data
model = ARIMA(confirmedCases90days_logScale, order=(1,1,1)) # after trying all combination, the best model has p=1 and q=1.
results_ARIMA = model.fit(disp=-1)
plt.plot(datasetLogDiffShifting_1)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_ARIMA.fittedvalues - datasetLogDiffShifting_1)**2))
plt.xticks(rotation=45)
plt.show()
We transform the ARIMA fitted values back to the original scale to see how well the model fits.
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(confirmedCases90days_logScale.iloc[0], index=confirmedCases90days_logScale.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
# exponentiate the log values.
predictions_ARIMA = np.exp(predictions_ARIMA_log)
The red line indicates the fitted values and the blue line indicates the actual numbers in the U.S. From the plot below, our model underpredicts the number of positive cases.
plt.plot(predictions_ARIMA, color='red')
plt.plot(confirmedCases90days.positive.iloc[:96])
plt.xticks(rotation=45)
plt.xlabel('Date', size=12)
plt.ylabel('Count', size=12)
plt.show()
USAdata_logScale = np.log(USAdata.positive)
Then, I use the model to predict the test data, i.e., the situation in the most recent days held out from training.
# calculate the predicted values
predictions_ARIMA_=results_ARIMA.predict(1,103)
predictions_ARIMA_diff = pd.Series(predictions_ARIMA_, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(USAdata_logScale.iloc[0], index=predictions_ARIMA_.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
# exponentiate the log values.
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.figure(figsize=(10,8),dpi = 80)
plt.xticks(rotation=45,fontsize=10)
plt.yticks(fontsize=10)
USAdata = USAdata.set_index('date')
plt.plot(predictions_ARIMA[:96],markersize=30,label='Predicted',color = 'darkblue',linestyle='--',linewidth=2.0)
plt.plot(predictions_ARIMA[95:],markersize=30,label='Predicted-future',color = 'blue',linewidth=2.0)
plt.plot(USAdata.positive,markersize=30,label='Actual',color = 'orange',linewidth=2.0)
plt.plot(confirmedCases90days.positive,markersize=30,label='Training',color = 'green',linestyle='--',linewidth=2.0)
plt.xlabel("Date",fontsize=15)
plt.ylabel("The total number of confirmed case",fontsize=15)
plt.title("Accumulated confirmed cases",fontsize=20)
plt.grid(alpha=1)
plt.legend(loc="upper left")
plt.show()
Based on the prediction results, the model fails to perform as well as we expected. According to the actual numbers, the daily growth in the U.S. has started to slow down, but in the prediction the count keeps increasing rapidly: by the first week of May the predicted total rises to 1.4 million, while the real number for May 6th is 1.25 million.
Then, I use all of the data we now have to build the ARIMA model and make the prediction. The steps are the same as above, but this time we build the ARIMA model on the whole data set instead of only part of it.
USAdata_logScale = np.log(USAdata.positive)
After differencing, the p-value is lower than 0.05, which suggests the data are stationary.
# Differencing
datasetLogDiffShifting_1 = USAdata_logScale - USAdata_logScale.shift()
test_stationarity(datasetLogDiffShifting_1[1:])
_proper_model(USAdata_logScale, 3, 2)
Here, p and q are both set to 1 after comparing all the combinations.
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(USAdata_logScale, order=(1,1,1))
results_ARIMA = model.fit(disp=-1)
plt.plot(datasetLogDiffShifting_1[1:])
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_ARIMA.fittedvalues - datasetLogDiffShifting_1[1:])**2))
I predict the situation in the coming week using the model, and the result is plotted below. According to the model, the total number of positive cases would be about 1.75 million; from my perspective this is likely a worst-case figure, but the government should make more effort at control, since the actual number may approach that value at the current rate.
predictions_ARIMA_=results_ARIMA.predict(1,110)
predictions_ARIMA_diff = pd.Series(predictions_ARIMA_, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(USAdata_logScale.iloc[0], index=predictions_ARIMA_.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
# Inverse of log is exp.
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.figure(figsize=(10,8),dpi = 80)
plt.xticks(rotation=45,fontsize=10)
plt.yticks(fontsize=10)
plt.plot(predictions_ARIMA,markersize=30,label='Predicted',color = 'blue',linewidth=2.0)
plt.plot(USAdata.positive,markersize=30,label='Actual',color = 'orange',linewidth=2.0)
plt.xlabel("Date",fontsize=15)
plt.ylabel("The total number of confirmed case",fontsize=15)
plt.title("Accumulated confirmed cases",fontsize=20)
plt.grid(alpha=1)
plt.legend(loc="upper left")
plt.show()
There is room for improvement in the models. For the prediction of the situation in Korea, I think the space-time model is a good approach for predicting the daily increase in each province; however, with stronger features the model would perform much better. In the future, I might try different ways to difference the Korean time series to make it stationary and then build an ARIMA model for prediction. For the model in section 3.2, the accuracy is rather poor and more should be done to eliminate the trend, such as differencing against a moving average.
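As one concrete direction, the moving-average detrending mentioned above could look roughly like the sketch below, which reuses test_stationarity and the log-scaled U.S. series from section 3.2:
# detrend by subtracting a 7-day moving average instead of simple differencing
moving_avg = USAdata_logScale.rolling(window=7).mean()
detrended = (USAdata_logScale - moving_avg).dropna()
test_stationarity(detrended)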