COVID-19 IN KOREA AND THE UNITED STATES: Analysis and Prediction

Xinyi Miao

This project uses data provided by the Korea Centers for Disease Control & Prevention and data obtained from the COVID-19 Tracking project. Many thanks to the data providers.

1 Introduction

COVID-19 became a trending topic at the start of 2020, when the virus started to spread in China. The number of confirmed cases worldwide rose to about 3.66 million by May 5th, 2020, roughly 105 days after people started to pay attention to it. The COVID-19 data gives us insight into the virus and the patients: how they got infected and what kinds of people are more vulnerable to it. Thanks to the Korea Centers for Disease Control & Prevention, we have such a valuable data set to learn from.

According to the news, Korea announced its first case on January 20th, 2020, a Chinese woman who had traveled from Wuhan. From then on, the number of confirmed cases increased slowly until an outbreak began around February 20th. The sudden jump has been linked to 'Patient 31', a super spreader who attended an event held by the Shincheonji Church of Jesus the Temple of the Tabernacle of the Testimony in Daegu. As of May 5th, Korea had 10,804 confirmed cases, 254 deaths, and 9,283 recovered cases.

In this project, I first do exploratory analysis to understand the general situation in Korea, such as the time pattern of the total number of confirmed cases and the daily growth. I also analyze the characteristics of the patients and find that people in their 20s account for the largest proportion of patients. Elderly people and people with underlying diseases are more vulnerable to COVID-19, as they have the highest fatality rates. Moreover, for Korea, the group infection in the Shincheonji Church, contact with patients, and overseas inflow are the top 3 infection causes.

Then, I build models to make predictions for Korea and the United States. The models provide a general insight into the future course of the virus in the two countries. However, they should be improved in the future to make the predictions more accurate.

2 Exploratory Analysis

2.1 Set up & Read the data

In [0]:
# import the package we would use in the analysis and modelling
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from datetime import datetime
In [0]:
Case = pd.read_csv('/content/drive/My Drive/CIS545_2020/Case.csv',
            sep=',')
In [0]:
PatientRoute = pd.read_csv('/content/drive/My Drive/CIS545_2020/PatientRoute.csv',
            sep=',')
PatientInfo = pd.read_csv('/content/drive/My Drive/CIS545_2020/PatientInfo.csv' , dtype={'patient_id': object,'infected_by': object}, 
            sep=',')
In [0]:
Region = pd.read_csv('/content/drive/My Drive/CIS545_2020/Region.csv',
            sep=',')
In [0]:
SeoulFloating = pd.read_csv('/content/drive/My Drive/CIS545_2020/SeoulFloating.csv',
            sep=',')
In [0]:
Time = pd.read_csv('/content/drive/My Drive/CIS545_2020/Time.csv',
            sep=',')
TimeProvince = pd.read_csv('/content/drive/My Drive/CIS545_2020/TimeProvince.csv',
            sep=',')
TimeAge = pd.read_csv('/content/drive/My Drive/CIS545_2020/TimeAge.csv',
            sep=',')
In [0]:
SearchTrend = pd.read_csv('/content/drive/My Drive/CIS545_2020/SearchTrend.csv',
            sep=',')
In [0]:
Weather = pd.read_csv('/content/drive/My Drive/CIS545_2020/Weather.csv',
            sep=',')

2.2 Visualization and Analysis

COVID-19 status in South Korea

In this part, I first analyze the general situation in South Korea to understand the mortality rate, the daily growth rate, and the development trend of the virus in the country from January 20th. Next, I explore which groups of people are more likely to get infected or even die from the coronavirus, and whether weather effects influence the spread of the virus. Moreover, I analyze and visualize the data by geography and by infection case. Finally, I treat the spread from person to person as a graph and track the transmission of a super spreader using the Patient Information dataset.

To gain a general understanding of the situation in Korea, we first plot the daily confirmed cases to see how the number of infections changes over time in the whole country. We also calculate the mortality rate and the test positive rate for the country.

In [0]:
# calculate some index: mortality, positive rate
Time['mortality'] = Time['deceased']/Time['confirmed']*100
Time['positive_rate'] = Time['confirmed']/Time['test']*100
In [11]:
print('The current mortality:{}.'.format(round(Time.mortality.iloc[-1],2)))
print('The current test positive rate:{}.'.format(round(Time.positive_rate.iloc[-1],2)))
The current mortality:2.29.
The current test positive rate:1.74.

As of the end of April, the fatality rate of COVID-19 in Korea is about 2.29% and the test positive rate is 1.74%. For comparison, the mortality rate in China is 5.51% and the global figure is 4.34%. In terms of these numbers, Korea did a better job than China. After the outbreak in the country, Korea took several measures to isolate the infected people and to trace and publish their routes without a further lockdown.

In [0]:
# convert the date field from string to datetime
Time['date'] = Time['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

From the plot we can see a great jump in the number of confirmed cases from the last week of February. However, the country brought the situation under control soon, and the daily increase dropped noticeably after about half a month. We can also see that there seems to be a time lag of about 2 weeks between the trend of the confirmed cases and that of the released cases.

In [22]:
from matplotlib import pyplot as plt
%matplotlib inline

Time = Time.set_index('date')

Time['confirmed'].plot(figsize=(10,5))
Time['released'].plot()
Time['deceased'].plot()
plt.xlabel('Date', fontdict={'size':12})
plt.ylabel('Count', fontdict={'size':12})
plt.title('Number of Confirmed Case in Korea by Day', size=16)
plt.legend()
plt.show()

The next question to explore is: when did Korean people start to notice COVID-19? Before the outbreak, when China started to struggle with the coronavirus, or after the outbreak, when the number of confirmed cases was already high in their own country?

In [0]:
# we only care about the search trend after January 20th, 2020
SearchTrend['date'] = SearchTrend['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
SearchTrend = SearchTrend.loc[SearchTrend['date'] > datetime(2020, 1, 20) ]
In [27]:
# plot the COVID status in Korea and the search trend
fig = plt.subplots(figsize=(10,5))

ax1 = plt.subplot(111)
ax1.plot(SearchTrend.date, SearchTrend.coronavirus, lw=3)
ax1.set_ylabel('Search Trend', size=12)
plt.suptitle("  Search Trend of 'coronavirus' in Korea", size=18)
plt.title("(Compare with the number of the confirmed case)", c='gray')

ax2 = ax1.twinx()
ax2.plot(Time.confirmed, '-.', lw=2, c='lightblue')
ax2.set_ylabel('The number of confirmed case', size=12)
ax2.set_xlabel('Date', size=12)

plt.show()

Apparently, Korean people started to notice the COVID-19 virus long before the outbreak in their country, about a month earlier, when China reported its first death related to COVID-19 and the virus started to spread. The word 'coronavirus' became trending around February 20th, when the group infection at the Shincheonji Church in Daegu occurred. After the surge, as the growth of confirmed cases slowed down, though the curve still fluctuated, the word 'coronavirus' got fewer searches.

Daily Growth Rate

In the last part, we mainly focused on the number of confirmed cases; in this part, we calculate the daily growth rate to capture the inflection point in Korea.

Before the calculation and the analysis, I first set up the environment for running Spark SQL.

In [0]:
! sudo apt install openjdk-8-jdk
! sudo update-alternatives --config java
In [0]:
%%capture
### Install required packages
!pip install pandasql
In [0]:
!apt install libkrb5-dev
!wget https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install findspark
!pip install sparkmagic
!pip install pyspark
! pip install pyspark --user
! pip install seaborn --user
! pip install plotly --user
! pip install imageio --user
! pip install folium --user
In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F

import os

spark = SparkSession.builder.appName('Graphs-HW2').config("spark.driver.memory", '5g').getOrCreate()
In [0]:
%load_ext sparkmagic.magics
In [0]:
import pandasql
import sqlite3
os.environ['SPARK_HOME'] = '/content/spark-2.4.5-bin-hadoop2.7'
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
import pyspark
from pyspark.sql import SQLContext
In [0]:
try:
    if(spark == None):
        spark = SparkSession.builder.appName('Initial').getOrCreate()
        sqlContext=SQLContext(spark)
except NameError:
    spark = SparkSession.builder.appName('Initial').getOrCreate()
    sqlContext=SQLContext(spark)
In [0]:
# read the data as spark data frame
Time_sdf = spark.read.format("csv").option("header", "true").load("/content/drive/My Drive/CIS545_2020/Time.csv")
Time_sdf.createOrReplaceTempView('Time')

In this project, the daily growth rate is calculated by dividing the number of daily new cases by the cumulative number of confirmed cases on the previous day. To calculate it, we first extract the daily increase from the accumulated values.
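
For reference, a pandas-only sketch of the same calculation is shown below (illustrative; it assumes the Time DataFrame is indexed by date and holds the cumulative confirmed counts, as prepared above). The Spark SQL version actually used in this project follows in the next cells.

# Pandas sketch of the growth-rate calculation (illustrative only)
ts = Time.sort_index()[['confirmed']].copy()
ts['today'] = ts['confirmed'].diff()                       # daily new cases
ts['GrowthRate'] = ts['today'] * 100 / ts['confirmed'].shift(1)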

In [0]:
# extract the daily increase
DailyGrowth = spark.sql('''
                        SELECT t1.date, t1.confirmed, t1.confirmed - t2.confirmed as today, (t1.confirmed - t2.confirmed)*100/t2.confirmed as GrowthRate FROM Time t1
                        JOIN Time t2 ON t1.date == date_add(t2.date,1)
                        ''')
DailyGrowth = DailyGrowth.toPandas()
DailyGrowth['date'] = DailyGrowth['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
In [0]:
# plot the case growth status in Korea
fig = plt.subplots(figsize=(10,5))

ax1 = plt.subplot(111)
ax1.plot(DailyGrowth.date, DailyGrowth.GrowthRate, lw=3,c='#FFD43B')
ax1.set_ylabel('Growth Rate', size=12)
plt.suptitle("Case Growth in Korea", size=18)
plt.title('The yellow line indicates the growth rate and the blue line indicates the daily increase', c='gray')

ax2 = ax1.twinx()
ax2.plot(DailyGrowth.date, DailyGrowth.today, lw=3)
ax2.set_ylabel('Daily Increase', size=12)
ax2.set_xlabel('Date', size=12)

plt.show()

From the figure above, we can conclude that the growth was fast in mid-to-late February, and in the first week of March the country experienced the highest daily increase. In more detail, the growth rate jumped from 3% to 65% to 104% between Feb 18th and Feb 20th. According to the news at that time, this great jump mainly came from participants of an event at a church in Daegu who were infected by 'Patient 31', a super spreader.

In [0]:
TimeProvince_sdf = spark.read.format("csv").option("header", "true").load("/content/drive/My Drive/CIS545_2020/TimeProvince.csv")
TimeProvince_sdf.createOrReplaceTempView('TimeProvince')

# extract the daily increase
DailyGrowthByProv = spark.sql('''
                        SELECT t1.date, t1.province, t1.confirmed, t1.confirmed - t2.confirmed AS DailyGrowth, (t1.confirmed - t2.confirmed)*100/t2.confirmed as GrowthRate FROM TimeProvince t1
                        JOIN TimeProvince t2 ON t1.date == date_add(t2.date,1) AND t1.province==t2.province
                        ''')
DailyGrowthByProv = DailyGrowthByProv.toPandas()
DailyGrowthByProv['date'] = DailyGrowthByProv['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
In [0]:
DailyGrowthByProv.confirmed = DailyGrowthByProv.confirmed.astype(int)
In [0]:
DailyGrowthByProv = DailyGrowthByProv.fillna(0)
confirmedByProv = DailyGrowthByProv.pivot_table(
                  index=['date'],
                  columns=['province'], 
                  values=['confirmed']
                      )
In [0]:
confirmedByProv.columns = [x[1] for x in confirmedByProv.columns]
In [47]:
fig, ax = plt.subplots(figsize=(9,5))
plot = sns.lineplot(x="date", y="GrowthRate",
             hue="province", 
             data=DailyGrowthByProv)
plt.xticks(rotation=45)
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width* 0.8 , box.height])
plt.xlabel('Date', fontdict={'size':12})
plt.ylabel('Growth Rate (%)', fontdict={'size':12})
plt.title('Daily Growth Rate of Confirmed Cases in Korea \nby Province', size=16)
ax.legend(loc='lower right', bbox_to_anchor=(1.7, -0.05), ncol=2)
plt.show()
In [46]:
fig, ax = plt.subplots(figsize=(9,5))
plot = sns.lineplot(x="date", y="DailyGrowth",
             hue="province", 
             data=DailyGrowthByProv)
plt.xticks(rotation=45)
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width* 0.8 , box.height])
plt.xlabel('Date', fontdict={'size':12})
plt.ylabel('Count', fontdict={'size':12})
plt.title('Daily Increase of Confirmed Cases in Korea by Province', size=16)
ax.legend(loc='lower right', bbox_to_anchor=(1.7, -0.05), ncol=2)
plt.show()

Obviously, the provinces show different patterns in the development of COVID-19. From the two plots above, we can again conclude that the outbreak area is Daegu, and the values for Gyeongsangbuk-do are also quite high.

Personal Characteristics

In this section, we group the patients by characteristics like age and sex in order to explore the common characteristics of the infected people.

  • Age
In [0]:
TimeAge['date'] = TimeAge['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

The 20s group has the highest number of confirmed cases. But we should also notice that the population in their 20s is large, especially compared with the 0s, 10s, and 80s groups. Thus, we should focus on the percentage instead of the raw number.

In [54]:
TimeAge_plot = TimeAge.set_index('date')

fig, ax = plt.subplots(figsize=(10,5))

ax.stackplot(np.unique(TimeAge['date']), TimeAge.loc[TimeAge.age=='0s']['confirmed'].tolist(), TimeAge.loc[TimeAge.age=='10s']['confirmed'].tolist(), 
             TimeAge.loc[TimeAge.age=='20s']['confirmed'].tolist(), TimeAge.loc[TimeAge.age=='30s']['confirmed'].tolist(),
             TimeAge.loc[TimeAge.age=='40s']['confirmed'].tolist(), TimeAge.loc[TimeAge.age=='50s']['confirmed'].tolist(),
             TimeAge.loc[TimeAge.age=='60s']['confirmed'].tolist(), TimeAge.loc[TimeAge.age=='70s']['confirmed'].tolist(),
             TimeAge.loc[TimeAge.age=='80s']['confirmed'].tolist(), labels = np.unique(TimeAge['age']),
             colors = ['#003f5c', '#2f4b7c', '#665191', '#a05195', '#d45087', '#f95d6a', '#ff7c43', '#ffa600', '#ffd311'])
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width* 0.8 , box.height])
plt.xlabel('Date', fontdict={'size':12})
plt.xticks(rotation=45)
plt.ylabel('Count', fontdict={'size':12})
plt.title('Number of Confirmed Case in Korea by Age (Accumulated)', size=16)
ax.legend(loc='lower right', bbox_to_anchor=(1.2, 0.15), ncol=1)
plt.show()
In [56]:
#TimeAge_plot = TimeAge.set_index('date')

fig, ax = plt.subplots(figsize=(10,5))

sns.lineplot(x="date", y="confirmed",
             hue="age", 
             data=TimeAge)
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width* 0.8 , box.height])
plt.xlabel('Date', fontdict={'size':12})
plt.ylabel('Count', fontdict={'size':12})
plt.xticks(rotation=45)
plt.title('Number of Confirmed Case in Korea by Age (Accumulated)', size=16)
ax.legend(loc='lower right', bbox_to_anchor=(1.2, 0.15), ncol=1)
plt.show()

For a better comparison, we calculate the infection rate for each age group based on Korean population data. The infection rate is defined as the percentage of infected people in the total population of each age group. Besides, the fatality rate by age is also calculated to help understand the infection in different age groups.

In [0]:
# create a dataframe to store the population in Korea by age
KorPop = {'age':np.unique(TimeAge['age']).tolist(),
          'Population':[4055740, 4732100, 6971785, 7203550, 8291728, 8587047, 6472987, 3591533, 1646916]}
KorPop = pd.DataFrame(KorPop)
In [0]:
# add the fields of the number of confirmed/deceased case to the dataframe and calculate the mortolity and infection rate
KorPop = pd.merge(KorPop, TimeAge.loc[TimeAge.date=='2020-04-20'][['age','confirmed','deceased']], on='age')
KorPop['infection_rate'] = KorPop['confirmed']/KorPop['Population']*100
KorPop['mortality'] = KorPop['deceased']/KorPop['confirmed']*100

As expected, the 20s group has the highest infection rate in Korea, about 0.04%. The 80s group comes second with an infection rate of 0.03%. However, the mortality of the 80s group is much higher than that of the other age groups, about 25%. Thus, we can conclude that people in their 20s and 80s are more likely to get infected, and elderly patients in their 80s have a higher risk of death. Some possible reasons: people in their 20s have more contact with society, and people in their 80s tend to have weaker immunity and are more vulnerable.

In [60]:
fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(14,4))

# plot the infection rate by age
ax1.bar(KorPop['age'], KorPop['infection_rate'])
ax1.set_title('Infection Rate in Korea by Age', size=14)
ax1.set_xlabel('Age', size=12)
ax1.set_ylabel('Infection Rate(%)', size=12)

# plot the infection rate by age
ax2.bar(KorPop['age'], KorPop['mortality'])
ax2.set_title('Mortality in Korea by Age', size=14)
ax2.set_xlabel('Age', size=12)
ax2.set_ylabel('Mortality(%)', size=12)

plt.show()
  • Sex
In [0]:
PatientBySex = PatientInfo.groupby('sex')[['patient_id']].count().reset_index().rename(columns={'patient_id':'count'})

There are more female patients in Korea, accounting for about 56% of the cases.

In [0]:
fig, ax = plt.subplots()
piechart = ax.pie(PatientBySex['count'], radius=1, autopct='%1.1f%%', pctdistance = 0.4, textprops = {'size':12},
       wedgeprops=dict(width=0.3, edgecolor='w'), labels = ['Female','Male'])
aspect = ax.set(aspect="equal")
title = ax.set_title('The percentage of patients by sex', size=16)
plt.show()
In [0]:
MortalityBySex = PatientInfo.groupby(['sex','state'])[['patient_id']].count().reset_index().rename(columns={'patient_id':'count'})
In [0]:
MortalityBySex
Out[0]:
sex state count
0 female deceased 25
1 female isolated 780
2 female released 1051
3 male deceased 44
4 male isolated 595
5 male released 816
  • Underlying Disease

According to research, patients who have underlying diseases are more likely to die of COVID-19. This might also be one of the reasons why elderly people tend to have a higher risk. Unfortunately, we only have 19 records where the disease field is not null, and all of the recorded patients with underlying diseases died. We also care about the number of days between the confirmation date and the date they passed away.

In [0]:
PatientDisease = PatientInfo.copy().loc[pd.notnull(PatientInfo.disease)]
PatientDisease['confirmed_date'] = PatientDisease.confirmed_date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDisease['deceased_date'] = PatientDisease['deceased_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
In [0]:
PatientDisease['number_of_day'] = PatientDisease['deceased_date'] - PatientDisease['confirmed_date']
print('According to the patient records, patients with underlying disease died {} days after the confirmation on average.'.format(PatientDisease['number_of_day'].mean().days))
According to the patient records, patients with underlying disease died 2 days after the confirmation on average.

Though we do not have a full record of patients without underlying diseases, we calculate the same number for all deceased patients to see whether there is any difference.

In [0]:
PatientDeath_D = PatientInfo.copy().loc[pd.notnull(PatientInfo.deceased_date) & (PatientInfo.disease==True)]
PatientDeath_D['deceased_date'] = PatientDeath_D['deceased_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDeath_D['confirmed_date'] = PatientDeath_D['confirmed_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDeath_D['number_of_day'] = PatientDeath_D['deceased_date'] - PatientDeath_D['confirmed_date']

print('According to the patient records, patients with underlying disease died {} days after the confirmation on average.'.format(PatientDeath_D['number_of_day'].mean().days))
According to the patient records, patients with underlying disease died 2 days after the confirmation on average.
In [0]:
PatientDeath = PatientInfo.copy().loc[pd.notnull(PatientInfo.deceased_date)]
PatientDeath['confirmed_date'] = PatientDeath['confirmed_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDeath['deceased_date'] = PatientDeath['deceased_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
PatientDeath['number_of_day'] = PatientDeath['deceased_date'] - PatientDeath['confirmed_date']

print('According to the patient records, deceased patients died {} days after the confirmation on average.'.format(PatientDeath['number_of_day'].mean().days))
According to the patient records, deceased patients died 8 days after the confirmation on average.

Thus, generally, deceased patients in Korea died about 8 days after confirmation, while those with underlying diseases died only about 2 days after confirmation. In other words, patients without underlying diseases tend to survive longer and thus have more chance to overcome the coronavirus, whereas those with underlying diseases fail to survive for long after being infected.

Route

In [0]:
visit_mean = PatientRoute.groupby('patient_id')['global_num'].count().mean()
visit_max = PatientRoute.groupby('patient_id')['global_num'].count().max()
print('People testing positive have on average {} recorded routes before the confirmation. \nThe patient with the highest route count visited {} different places.'.format(round(visit_mean,2), visit_max))
People testing positive have on average 1.15 recorded routes before the confirmation.
The patient with the highest route count visited 38 different places.

The top 3 types of places the patients visited most often (excluding the 'etc' category) are hospitals, public transportation, and restaurants, where people inevitably stay in close contact. About 85% of the patients visited a hospital before the confirmation.

In [0]:
PatientRoute.groupby(['type'])['global_num'].count().sort_values(ascending=False).head(4)
Out[0]:
type
hospital                 355
etc                      275
public_transportation    137
restaurant               120
Name: global_num, dtype: int64
In [0]:
people_visit_hosp = PatientRoute.loc[PatientRoute.type=='hospital'].groupby('patient_id').count()
total_num = len(np.unique(PatientRoute['patient_id']))
percent = len(people_visit_hosp)/total_num*100
percent
Out[0]:
85.62300319488818

Geography

We visualize the confirmed cases in Korea by province. Before the analysis, we first import the packages and read the province shapefile to link the data with geographic information.

In [0]:
! pip install geopandas
! pip install datashader
In [0]:
import geopandas as gpd
# Datashader imports
import datashader as ds
import datashader.transfer_functions as tf

from colorcet import fire
In [0]:
korea_shp = gpd.read_file('https://map.igismap.com/igismap/Api/Api_download/downloadGlobalFile?out=shp&layerid=7eabe3a1649ffa2b3ff8c02ebfd5659f&token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpZCI6MTUxMTQsImVtYWlsIjoiaGlzdW15ZWVAaG90bWFpbC5jb20iLCJpYXQiOjE1ODg0NTQ1NDQsImV4cCI6MTU4ODU0MDk0NH0.LJaujIExBAk0YLR1E2hRu0JFopxdFp8-gSKwqyLSwQ0')
In [0]:
korea_shp = korea_shp.drop(['id', 'country', 'enname', 'locname', 'offname', 'boundary',
       'adminlevel', 'wikidata', 'wikimedia', 'timestamp', 'note', 'path',
       'rpath', 'iso3166_2'], axis=1)
In [0]:
# merge the Korea geography information with the TimeProvince data to create the map
TimeProvinceGEO = pd.merge(TimeProvince, korea_shp, left_on='province', right_on='name')
# convert the date field to datetime after the merge (the merged frame must exist first)
TimeProvinceGEO.date = TimeProvinceGEO.date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
TimeProvinceGEOByDate = TimeProvinceGEO.groupby('date')[['confirmed','deceased','released']].sum().reset_index()
TimeProvinceGEOByDate['total'] = TimeProvinceGEOByDate['confirmed'] - TimeProvinceGEOByDate['released']  - TimeProvinceGEOByDate['deceased']
TimeProvinceGEO = pd.merge(TimeProvinceGEO, TimeProvinceGEOByDate[['date','total']], on='date', how='left')
TimeProvinceGEO = gpd.GeoDataFrame(TimeProvinceGEO)

To make the visualization more informative, I calculate, for each day, each province's share of the country's active (isolated) cases, i.e. confirmed minus released minus deceased.

In [0]:
TimeProvinceGEO['percent'] = round((TimeProvinceGEO['confirmed'] - TimeProvinceGEO['released']  - TimeProvinceGEO['deceased'])/TimeProvinceGEO['total']*100)
In [0]:
# import the package to plot the gif
import imageio
In [0]:
def plot_status_by_day(fig, data, date, x_range, y_range):
    # trim to the specific date
    df_today = data.loc[data["date"] == date]

    # plot the image on a matplotlib axes
    plt.clf()
    ax = fig.gca()
    df_today.plot(ax=ax, column = 'percent', edgecolor='white', legend=True, vmin=0, vmax=100)
    ax.set_axis_off()

    # add a text label for the date
    ax.text(
        0.6,
        0.1,
        date,
        color="black",
        fontsize=22,
        ha="left",
        transform=ax.transAxes,
    )

    # draw the figure and return the image
    fig.canvas.draw()
    image = np.frombuffer(fig.canvas.tostring_rgb(), dtype="uint8")
    image = image.reshape(fig.canvas.get_width_height()[::-1] + (3,))

    return image
In [0]:
# create a figure
x_range, y_range =  [126.117397903, 129.468304478], [34.3900458847, 38.6122429469]

fig, ax = plt.subplots(figsize=(8,5), facecolor='white')

# Create an image for each day
imgs = []
for date in np.unique(TimeProvince.date).tolist():
    img = plot_status_by_day(fig, TimeProvinceGEO, date, x_range=x_range, y_range=y_range)
    imgs.append(img)
    
# Combine the images for each day into a single GIF
imageio.mimsave('TimeProvince.gif', imgs, fps=1);
In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython import display
from fastai.imports import *

gifPath = Path("/content/drive/My Drive/CIS545_2020/TimeProvince.gif")
with open(gifPath,'rb') as f:
    display.Image(data=f.read(), format='png')
Out[2]:

With the gif image we obtained, we can see that at first the confirmed cases were concentrated in the northwestern provinces, like Incheon, Gyeonggi-do and Seoul, and then from Feb 20th the center moved to Daegu and Gyeongsangbuk-do. We can also observe some spatial autocorrelation; for example, after Incheon had its first confirmed case on January 20th, Seoul and Gyeonggi-do also reported cases soon after.

Weather

From research, we know that weather factors like temperature and humidity may influence the spread of the virus. Thus, in this part, we explore the relationship between the weather and the daily increase in new cases in Korea. Since we only have a sample from one country and the time span is rather short, the relationship may not be very reliable. Before the analysis, I convert precipitation to a categorical variable, because what we care about is whether it rained and how that relates to the daily growth, not the amount of rainfall. According to the Pearson correlation coefficients, humidity has the highest value among the weather variables, but it is only 0.07, so it is hard to say whether the weather is related to the daily increase in confirmed cases.

In [0]:
Weather.date = Weather.date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
WeatherByProv = Weather[['date','province','avg_temp','precipitation','avg_relative_humidity']]
WeatherByProv['isPrecip'] = WeatherByProv.precipitation.apply(lambda x: 1 if x!=0 else 0)
# WeatherByProv = WeatherByProv.drop(['precipitation'],axis=1)  # kept, since the correlation table below also reports raw precipitation
In [0]:
# note: TimeLag (the per-province daily increase and its time lags) is built in section 3.1.2,
# so this cell should be run after that one
WeatherEffects = pd.merge(TimeLag[['date','province','today']], WeatherByProv, how='left', on=['date','province'])
In [0]:
WeatherEffects = WeatherEffects.dropna()
In [0]:
WeatherEffects.drop(['date','province'], axis=1).corr()
Out[0]:
today avg_temp precipitation avg_relative_humidity isPrecip
today 1.000000 -0.037880 -0.024445 0.072334 -0.014158
avg_temp -0.037880 1.000000 0.054099 -0.026802 -0.012774
precipitation -0.024445 0.054099 1.000000 0.485703 0.552046
avg_relative_humidity 0.072334 -0.026802 0.485703 1.000000 0.579864
isPrecip -0.014158 -0.012774 0.552046 0.579864 1.000000

Cause for Infection

With the Case table, we explore in more detail how people got infected. For the people with a specific infection cause recorded, the top 3 causes are the Shincheonji Church (59%), contact with patients (14%), and overseas inflow (8%). The table below shows the top 5 causes, three of which could be classified as group infections.

In [0]:
# We want to check the specific reasons and thus, delete the records with 'etc' in the infection_case.
CaseNoEtc = Case.loc[Case.infection_case!='etc']
CauseCount = CaseNoEtc.groupby('infection_case')['confirmed'].sum().reset_index().rename(columns = {'confirmed':'count'}).sort_values('count', ascending=False)
In [12]:
CauseCount['percent'] = CauseCount['count']/CauseCount['count'].sum()*100
top5 = CauseCount.sort_values('count', ascending=False).head(5)
top5
Out[12]:
infection_case count percent
31 Shincheonji Church 5212 59.099671
38 contact with patient 1250 14.173943
41 overseas inflow 777 8.810523
28 Second Mi-Ju Hospital 196 2.222474
11 Guro-gu Call Center 166 1.882300

Focusing on these causes, we would like to know more about the weekly time pattern of the number of cases for the top 5 causes. Thus, we switch to the PatientInfo dataset, where the number of records is limited, but the general trend can still be obtained from it.

In this part, we analyze the cases at the week level. Thus, for each date, we extract the corresponding week of the year and calculate the total number of confirmed cases by infection case per day. Then, in the second SQL query, I compute the week-level totals and the proportion of each infection case within each week.
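
For readers less familiar with Spark SQL, a rough pandas equivalent of this two-step aggregation is sketched below (illustrative only; it omits the top-5 filter applied in the actual query and assumes the pandas PatientInfo table loaded earlier).

# Pandas sketch of the weekly-share computation (illustrative only)
pi = PatientInfo.dropna(subset=['confirmed_date']).copy()
pi['week'] = pd.to_datetime(pi['confirmed_date']).dt.isocalendar().week   # pandas >= 1.1
weekly = pi.groupby(['week', 'infection_case']).size().rename('CauseTotal').reset_index()
weekly['WeekTotal'] = weekly.groupby('week')['CauseTotal'].transform('sum')
weekly['percent'] = (weekly['CauseTotal'] * 100.0 / weekly['WeekTotal']).round(2)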

In [0]:
PatientInfo_sdf = spark.read.format("csv").option("header", "true").load("/content/drive/My Drive/CIS545_2020/PatientInfo.csv")# 
PatientInfo_sdf.createOrReplaceTempView('PatientInfo')

CauseCountByWeek = spark.sql('''
                             SELECT confirmed_date, infection_case, mean(weekofyear(confirmed_date)) AS week, COUNT(*) AS cnt FROM PatientInfo
                             GROUP BY confirmed_date, infection_case
                             ORDER BY confirmed_date, infection_case
                        ''')
CauseCountByWeek.createOrReplaceTempView('CauseCountByWeek')

CausePercentByWeek = spark.sql('''
                        SELECT * FROM 
                        (SELECT t1.week, t1.infection_case, SUM(t1.cnt) AS CauseTotal, mean(t2.WeekTotal) AS WeekTotal, 
                        ROUND(SUM(t1.cnt)*100.0/mean(t2.WeekTotal),2) AS percent FROM CauseCountByWeek t1
                        JOIN 
                        (SELECT week, SUM(cnt) AS WeekTotal FROM CauseCountByWeek
                        GROUP BY week) t2
                        ON t1.week==t2.week
                        GROUP BY t1.week, t1.infection_case
                        ORDER BY t1.week, t1.infection_case) result
                        WHERE infection_case IN 
                              (SELECT infection_case FROM 
                                (SELECT infection_case, COUNT(*) as total FROM PatientInfo 
                                WHERE infection_case not like 'etc' GROUP BY infection_case ORDER BY total DESC LIMIT 5) t)
                        ''')
CausePercentByWeek = CausePercentByWeek.toPandas()
In [0]:
CausePercentByWeek['percent_round'] = CausePercentByWeek.percent.apply(lambda x: round(x))
CausePercentByWeek['week'] = CausePercentByWeek.week.apply(lambda x: round(x))
CausePercentByWeek['infection_case'] = CausePercentByWeek['infection_case'].apply(lambda x: x[0].upper()+x[1:])

Here, I create a panel to display the data, because the table CausePercentByWeek only has a record when the count is not zero; however, there are weeks with no cases for certain infection causes. To display the statistics completely, we have to create the panel.
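
As a side note, pandas can also build this full week x infection_case panel directly; a minimal alternative sketch (equivalent in spirit to the expandgrid helper used below) would be:

# Alternative sketch: build the full (week, infection_case) panel with a MultiIndex
full_index = pd.MultiIndex.from_product(
    [CausePercentByWeek.week.unique(), CausePercentByWeek.infection_case.unique()],
    names=['week', 'infection_case'])
panel_alt = full_index.to_frame(index=False)   # one row per (week, infection_case) pair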

In [0]:
# expand grid: build every (week, infection_case) combination
import itertools
def expandgrid(*itrs):
   product = list(itertools.product(*itrs))
   return {'Var{}'.format(i+1):[x[i] for x in product] for i in range(len(itrs))}

cases = np.unique(CausePercentByWeek.infection_case).tolist()
week = np.unique(CausePercentByWeek.week).tolist()

# create a panel to store the numbers
panel = pd.DataFrame(expandgrid(week, cases)).rename(columns={'Var1':'week', 'Var2':'infection_case'})
casesStatFull = pd.merge(CausePercentByWeek, panel, on=['week','infection_case'], how='outer').fillna(0)

Take week 11 as an example. Note that we only display the statistics for the top 5 causes shown above.

In [54]:
casesStatFull.loc[casesStatFull.week==11].sort_values('CauseTotal', ascending=False)
Out[54]:
week infection_case CauseTotal WeekTotal percent percent_round
21 11 Contact with patient 124.0 356.0 34.83 35.0
19 11 Guro-gu Call Center 94.0 356.0 26.40 26.0
22 11 Overseas inflow 15.0 356.0 4.21 4.0
20 11 Shincheonji Church 3.0 356.0 0.84 1.0
54 11 Onchun Church 0.0 0.0 0.00 0.0
In [45]:
import altair as alt

chart = alt.Chart(CausePercentByWeek).mark_bar(
    cornerRadiusTopLeft=3,
    cornerRadiusTopRight=3
).encode(
    x=alt.X('week:Q',title='Week of the Year'),
    y=alt.Y('mean(percent_round):Q', title='Percentage'),
    color='infection_case:N',
    size=alt.value(20)
).properties(
    width=500,
    height=300,
    title = 'Percentage of top 5 infection cases by week'
)

chart.configure_title(
    fontSize=20
)
Out[45]:

From the bar plot above, it can be concluded that the proportions of the infection causes change over time. For example, at the very beginning and recently, overseas inflow is the main infection cause, while in the middle period there are more diverse infection causes, like the Onchun Church and Shincheonji Church in week 8 and the Guro-gu Call Center in week 11. In general, the cases caused by contact with patients need attention. For the next step, the Korean government should focus more on controlling the overseas inflow.

*The numbers shown in the bar plot might differ from the trend obtained from the Case dataset, because only part of the patients are recorded in the PatientInfo table.

Super Spreader

In this section, we treat the spread from person to person as a graph, and try to trace the transmission from one of the super spreaders recorded in the PatientInfo table.

In [0]:
# Extract the fields to use
infected = PatientInfo[['patient_id','infected_by']]
schema = StructType([
                StructField("patient_id", StringType(), True),
                StructField("infected_by", StringType(), False)
            ])
infected_sdf = spark.createDataFrame(infected, schema)
In [0]:
infected_sdf.createOrReplaceTempView('infected')

First, we want to know who the super spreaders are according to the data we have. Based on the statistics, a female in her 70s who infected more than 50 people is the top super spreader in our data set.

In [57]:
infected = PatientInfo[['patient_id','infected_by']].dropna()
SuperSpreader = infected.groupby('infected_by').count().reset_index().sort_values('patient_id',ascending=False).rename(columns={'patient_id':'cnt','infected_by':'patient_id'}).head(3)
SuperSpreader = pd.merge(SuperSpreader, PatientInfo, on='patient_id')
SuperSpreader
Out[57]:
patient_id cnt global_num sex birth_year age country province city disease infection_case infection_order infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state
0 2000000205 51 8100.0 female 1946.0 70s Korea Gyeonggi-do Seongnam-si NaN contact with patient NaN 1000000138 8.0 NaN 2020-03-14 NaN NaN isolated
1 4100000008 27 NaN female 1974.0 40s Korea Chungcheongnam-do Cheonan-si NaN gym facility in Cheonan NaN NaN 130.0 2020-02-20 2020-02-26 2020-03-29 NaN released
2 2000000167 24 7663.0 female 1976.0 40s Korea Gyeonggi-do Bucheon-si NaN contact with patient NaN 1000000125 NaN NaN 2020-03-10 NaN NaN isolated

The super spreader is patient #2000000205, who was infected by #1000000138 and spread the virus to about 51 people according to the records we have. The spread of the virus among people can be thought of as a directed graph. We then track the spread from patient #2000000205.

In [0]:
PatientInfo.loc[PatientInfo.patient_id=='1000000138']
Out[0]:
patient_id global_num sex birth_year age country province city disease infection_case infection_order infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state
137 1000000138 7683.0 male 1987.0 30s Korea Seoul etc NaN etc NaN NaN NaN NaN 2020-03-09 NaN NaN released
In [0]:
# iterative BFS over the directed infection graph (who infected whom)
def super_spreader(G, origins, max_depth):
    G.createOrReplaceTempView('G')
    schema = StructType([
                StructField("patient_id", StringType(), True),
                StructField("distance", IntegerType(), False)
            ])
    visited = spark.createDataFrame(origins, schema)
    to_visit = spark.createDataFrame(origins, schema)
    for i in range(1, max_depth+1):
      to_visit.createOrReplaceTempView('to_visit')
      to_visit = spark.sql('SELECT g1.patient_id as patient_id, min(t.distance)+1 as distance from to_visit t \
                      JOIN G g1 ON g1.infected_by = t.patient_id \
                      GROUP BY g1.patient_id \
                      ')
      visited = visited.union(to_visit)
      if i==max_depth:
        visited.createOrReplaceTempView('visited')
        visited = spark.sql('SELECT patient_id, min(distance) as distance FROM visited GROUP BY patient_id ORDER BY patient_id')
    return visited

We set the max depth to 5 to compare the number of infections at each step of the transmission chain.

In [0]:
orig = [{'patient_id': '2000000205','distance':0}]
#grab the count
count = super_spreader(infected_sdf, orig, 5)
In [0]:
infected5 = count.toPandas()

Fortunately, after the spread caused by patient #2000000205, the transmission was brought under control: only 7 people were infected by those who had been infected by the super spreader, and the count drops to 1 at the third step and to 0 afterwards.

In [61]:
infected5.groupby('distance').count().rename(columns={'patient_id':'count'})
Out[61]:
count
distance
0 1
1 51
2 7
3 1

2.3 Clustering Analysis

In this section, we group the patients by their characteristics using clustering analysis. Since there are mixed numerical and categorical variables, I choose to use K-Prototypes, which is built on K-Means and K-Modes.

First, we extract the variables to use.

In [0]:
X = PatientInfo.drop(['patient_id','global_num','city', 'country','birth_year','disease','infection_order', 'infected_by','symptom_onset_date','confirmed_date','released_date','deceased_date','state'], axis=1)
In [63]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3388 entries, 0 to 3387
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sex             3311 non-null   object 
 1   age             3304 non-null   object 
 2   province        3388 non-null   object 
 3   infection_case  2620 non-null   object 
 4   contact_number  643 non-null    float64
dtypes: float64(1), object(4)
memory usage: 132.5+ KB

Based on the table shown above, I select the features that might be useful for grouping patients. However, as you can see, there are many NAs in the contact_number field, so we first fill the missing values. Before doing that, we need to decide how to impute them, so I plot the distribution of the contact number. To make it easy to interpret, I set the number of bins to 120, so the width of each bar is about 10.

In [64]:
sns.distplot(PatientInfo.contact_number, hist=True, kde=False, 
             bins=int(120), color='#143d59',
             hist_kws={'edgecolor':'black'})
plt.title('The Distribution of The Contact Number', size=16)
plt.xlabel('Contact Number',size=12)
plt.ylabel('Count', size=12)
plt.show()
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f62c4106518>
Out[64]:
Text(0.5, 1.0, 'The Distribution of The Contact Number')
Out[64]:
Text(0.5, 0, 'Contact Number')
Out[64]:
Text(0, 0.5, 'Count')

The distribution of the contact number is shown above. Most of the patients had fewer than 10 contacts with others, while a few people had more than 1,000 contacts. Thus, filling the missing values with the mean is not appropriate; instead, we use the median value here.

In [0]:
X.contact_number = X.contact_number.fillna(X.contact_number.median())
In [66]:
X = X.dropna()
X.head(3)
Out[66]:
sex age province infection_case contact_number
0 male 50s Seoul overseas inflow 75.0
1 male 30s Seoul overseas inflow 31.0
2 male 50s Seoul contact with patient 17.0
In [0]:
# convert the string to the int
X['age'] = X['age'].apply(lambda x: int(x.split('s')[0]))
In [0]:
from sklearn.preprocessing import StandardScaler
scal = StandardScaler()
X[['age','contact_number']] = scal.fit_transform(X[['age','contact_number']])

To use K-Prototypes, we first need to choose k, i.e., the number of clusters. Here, to pick a proper k, I first use K-Means and try 10 values of k to compare the distortion.

In [0]:
X_kmeans = X.copy()
In [0]:
# clustering analysis
X_kmeans.province = X_kmeans.province.astype('category')
X_kmeans.sex = X_kmeans.sex.astype('category')
X_kmeans.infection_case = X_kmeans.infection_case.astype('category')

#  Use pd.get_dummies() to convert the categorical variables to dummy variables.
X_kmeans = pd.concat([X_kmeans[['age', 'contact_number']], pd.get_dummies(X_kmeans[['province','sex','infection_case']])], axis=1)
In [0]:
from sklearn.cluster import KMeans
distortions = []

max_k = 10
for i in range(1,max_k+1):
  km = KMeans(n_clusters=i,
              init='random',
              n_init=1,
              max_iter=300,
              random_state=0)
  km.fit(X_kmeans)
  distortions.append(km.inertia_)

To pick the right value of k, we look for the value that (roughly) minimizes the distortion while still splitting the clusters in a proper and interpretable way. Here, we set k to 4.

In [0]:
plt.plot(range(1,max_k+1), distortions, marker='o')
plt.xlabel('Cluster count (k)')
plt.ylabel('Distortion')
plt.show()

However, there is a problem with doing the clustering analysis using K-Means here due to the type of data we have: K-Means is designed for continuous data, while we have mixed numerical and categorical data. Thus, after some searching, I found an algorithm proposed by Zhexue Huang, K-Prototypes, which is based on K-Means and K-Modes.

In [0]:
! pip install kmodes
In [0]:
X_kprototype = X.copy()
In [0]:
from kmodes.kprototypes import KPrototypes
kproto = KPrototypes(n_clusters=4, init='Cao', verbose=1)
clusters = kproto.fit_predict(X_kprototype, categorical=[0, 2, 3])

Most of the patients are assigned to the first three clusters. What are the characteristics of these clusters?

In [79]:
X_kprototype['cluster'] = clusters
X_kprototype.groupby('cluster')['sex'].count().reset_index().rename(columns={'sex':'count'})
Out[79]:
cluster count
0 0 578
1 1 994
2 2 960
3 3 5
In [0]:
kproto_center = pd.concat([pd.DataFrame(kproto.cluster_centroids_[0],columns=['age', 'contact_number']),
                          pd.DataFrame(kproto.cluster_centroids_[1],columns=['sex', 'province', 'infection_case'])], axis=1)
In [0]:
kproto_center[['age','contact_number']] = scal.inverse_transform(kproto_center[['age','contact_number']])
In [86]:
kproto_center
Out[86]:
age contact_number sex province infection_case
0 67.387543 4.717993 female Gyeongsangbuk-do contact with patient
1 21.016097 6.153924 male Seoul overseas inflow
2 42.322917 7.020833 female Gyeonggi-do contact with patient
3 34.000000 721.600000 male Busan Shincheonji Church

As shown in the table above, patients in different clusters have different characteristics. For the first cluster, the representative patient is a female in her 60s-70s in Gyeongsangbuk-do who got infected despite limited contact with other people. People in cluster 1 were infected overseas and tested positive in Seoul; these patients are young, in their 20s, and in my opinion they might be students returning from abroad. The persona of the third cluster is a female in her 40s living in Gyeonggi-do who got infected through contact with a patient. We only have 5 people in cluster 3, and surprisingly they contacted more than 700 people on average, probably due to participation in the church activities in Busan.

3 Modeling

In this section, I make predictions for Korea and the United States. It is important to have a general understanding of the prospects of COVID-19 in order to take better measures to control the situation.

3.1 Prediction for Korea

For Korea, the situation is easing off, so I first use a logistic growth model to fit the confirmed case data and predict the future situation. I also tried to use ARIMA for Korea; however, the accuracy of that model is rather low and the time pattern of the data keeps varying. Besides, I predict the daily growth by province using a space-time model, based on the idea that nearby infections and recent infections might greatly influence the prediction for each province in the coming days.

3.1.1 Logistic Growth Model

I use a logistic growth model to mimic the situation in Korea. The model is not the best one to help us understand COVID-19 in Korea, but with it we can capture the general trend. The reason for using this function is that its curve is similar to that of COVID-19 in Korea: at first, few people are infected; as time passes, the virus spreads from person to person and the number of confirmed cases increases rapidly; after the government notices and takes measures, the growth slows down, making the curve an 'S' shape. It is true that an ARIMA model should be suitable for time-series data; however, this series is not stationary even after differencing. So I use the logistic growth model when predicting the situation in Korea.
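
In symbols, the logistic growth curve fitted below can be written as follows (this matches the function defined in the next cell; K is the carrying capacity, P0 the initial number of cases, r the growth rate, and t0 the starting day):

P(t) = \frac{K P_0 e^{r (t - t_0)}}{K + P_0 \left( e^{r (t - t_0)} - 1 \right)}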

I use the 90-day data to predict the situation in the coming days. r is set to 0.11 and the result is shown in the plot below.

In [0]:
def logistic_increase_function(t, K, P0, r):
     # r is fixed to 0.11 inside the function, so curve_fit effectively only fits K and P0
     r=0.11
     t0 = 1
     exp_value = np.exp(r * (t - t0))
     return (K * exp_value * P0) / (K + (exp_value - 1) * P0)
In [92]:
# fitting the recorded data
t=np.linspace(1, 90, 90)
P=Time.confirmed[:90]

from scipy.optimize import curve_fit
popt, pocv = curve_fit(logistic_increase_function, t, P)

# prediction for a long time in future
P_predict=logistic_increase_function(t,popt[0],popt[1],popt[2])

future=[100,103,107,114,120,130,140,150]

future=np.array(future)

future_predict=logistic_increase_function(future,popt[0],popt[1],popt[2])

# prediction for days 86 to 99
tomorrow=range(86,100)
tomorrow=np.array(tomorrow)
tomorrow_predict=logistic_increase_function(tomorrow,popt[0],popt[1],popt[2])

# plot the fitting and the prediction
plot1=plt.plot(t,P,'s',label='confirmed_actual')
plot2=plt.plot(t,P_predict,'r',label='confirmed_fit')
plot3=plt.plot(tomorrow,tomorrow_predict,'s',label='confirmed_pred')
plt.xlabel('Number of Day From Jan 20th', size=12)
plt.ylabel('Count', size=12)
plt.legend(loc=0)
plt.title('Use Logistic model to fit the recorded confirmed cases', size=14)
plt.show()
/usr/local/lib/python3.6/dist-packages/scipy/optimize/minpack.py:808: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)
Out[92]:
Text(0.5, 0, 'Number of Day From Jan 20th')
Out[92]:
Text(0, 0.5, 'Count')
Out[92]:
<matplotlib.legend.Legend at 0x7f62c2501ef0>
Out[92]:
Text(0.5, 1.0, 'Use Logistic model to fit the recorded confirmed cases')

Generally, the model fits the data well, and the prediction to some extent reflects the prospects for the virus in Korea. The plot below is the prediction for the coming month. The increase is pretty slow, and COVID-19 in Korea should be under control.

In [0]:
# convert the number of days from Jan 20th to dates
from datetime import timedelta

future_date = []
for i in future:
  future_date.append((datetime(2020,1,20)+timedelta(days=int(i))).date())

# plot the predicted confirmed case for the future
plt.plot(future_date,future_predict,'s',label='pred')
plt.xlabel('Date', size=12)
plt.xticks(rotation=45)
plt.ylabel('Total confirmed',size=12)
plt.title('Prediction for future',size=14)
plt.show()

This is a simple mathematical model that does not take socio-economic characteristics into account. According to the model, there would be about 10,730 confirmed cases on May 1st, while the actual number is 10,774, which is pretty close. Thus, the logistic model can still provide us with a general insight.

3.1.2 Space-Time Model

In this section, I take the time effects, spatial effects, and socio-economic characteristics into account when predicting the number of confirmed cases by province. As mentioned above, the number of cases in each province is related to the numbers in its neighbors and varies over time.

In [0]:
# extract the fields to use and convert the dataframe to the spark dataframe
DailyGrowthByProv_sdf = spark.createDataFrame(DailyGrowthByProv[['date','province','DailyGrowth']])
DailyGrowthByProv_sdf.createOrReplaceTempView('DailyGrowthByProv')

Time effects are measured by time lags. A time lag is defined as the daily growth observed some days earlier, for example the daily growth one day before.
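
A pandas sketch of the same lag features is shown below for reference (illustrative; the project itself computes them with the Spark SQL self-joins in the next cell).

# Pandas sketch of the lag features (illustrative only)
lagged = DailyGrowthByProv.sort_values(['province', 'date']).copy()
for lag in (1, 3, 7, 14):
    # shift the daily growth within each province by `lag` days
    lagged['{}DaysBefore'.format(lag)] = lagged.groupby('province')['DailyGrowth'].shift(lag)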

In [0]:
# calculate the time lag for each province
TimeLag = spark.sql('''
                        SELECT t1.date, t1.province, t1.DailyGrowth AS today, t2.DailyGrowth AS 1DayBefore, t3.DailyGrowth AS 3DaysBefore,
                        t4.DailyGrowth AS 7DaysBefore, t5.DailyGrowth AS 14DaysBefore FROM DailyGrowthByProv t1
                        JOIN DailyGrowthByProv t2 ON t1.date == date_add(t2.date,1) AND t1.province == t2.province
                        JOIN DailyGrowthByProv t3 ON t1.date == date_add(t3.date,3) AND t1.province == t3.province
                        JOIN DailyGrowthByProv t4 ON t1.date == date_add(t4.date,7) AND t1.province == t4.province
                        JOIN DailyGrowthByProv t5 ON t1.date == date_add(t5.date,14) AND t1.province == t5.province
                        ORDER BY t1.date
                        ''')
TimeLag = TimeLag.toPandas()

As shown in the table below, we calculate the time lags for each province and take Seoul as an example. It is obvious that the time lags could be strong features when predicting the daily growth, especially the counts 1 and 3 day(s) before.

In [0]:
TimeLag[['today','1DayBefore','3DaysBefore','7DaysBefore','14DaysBefore']]=TimeLag[['today','1DayBefore','3DaysBefore','7DaysBefore','14DaysBefore']].astype(float)
In [0]:
TimeLag.loc[TimeLag.province=='Seoul'].drop(['date','province'], axis=1).astype(int).corr()
Out[0]:
today 1DayBefore 3DaysBefore 7DaysBefore 14DaysBefore
today 1.000000 0.635373 0.524951 0.246392 0.197434
1DayBefore 0.635373 1.000000 0.519484 0.329432 0.139027
3DaysBefore 0.524951 0.519484 1.000000 0.409023 0.306777
7DaysBefore 0.246392 0.329432 0.409023 1.000000 0.253750
14DaysBefore 0.197434 0.139027 0.306777 0.253750 1.000000

University count, nursing home count, and the elderly population ratio are considered as factors that might be related to the number of confirmed cases.

In [0]:
count = Region.groupby('province')[['university_count','nursing_home_count']].sum().reset_index()
# we drop the province field in the ratio part to avoid duplication after the concat
ratio = Region.groupby('province')[['elderly_population_ratio','elderly_alone_ratio']].mean().reset_index().drop(['province'],axis=1)
CensusBuiltin = pd.concat([count,ratio],axis=1)
In [0]:
data = pd.merge(TimeLag,CensusBuiltin, on='province')
In [0]:
data = pd.merge(data, WeatherEffects[['date','avg_relative_humidity','province']], on=['date','province'])

Spatial effects are measured by a spatial lag. Here the spatial lag is defined as the sum of the daily new cases in the neighboring provinces of each province.

In [0]:
import geopandas as gpd
df = korea_shp.copy()
df["NEIGHBORS"] = None  # add NEIGHBORS column

for index, row in df.iterrows():
    # get the 'not disjoint' provinces, i.e. those sharing a border with this one
    neighbors = df[~df.geometry.disjoint(row.geometry)].name.tolist()
    # remove the province's own name from the list
    neighbors = [name for name in neighbors if row['name'] != name ]
    # add names of neighbors as the NEIGHBORS value
    df.at[index, "NEIGHBORS"] = ", ".join(neighbors)

neighbor = df[['name','NEIGHBORS']]
In [0]:
def spatialLag(data):
    sumDailyNew = 0
    # Jeju-do is an island with no land neighbors, so its spatial lag is 0
    if data['province']=='Jeju-do':
      return 0
    else:
      for prov in data['NEIGHBORS'].split(', '):
        temp = TimeLag.loc[(TimeLag.date==data.date) & (TimeLag.province == prov)]
        value = temp.today.values
        sumDailyNew += value[0]
      return sumDailyNew
In [0]:
SpatialLag = pd.merge(TimeLag, neighbor, left_on='province', right_on='name')
In [0]:
spatiallag_result = SpatialLag.apply(lambda x: spatialLag(x), axis=1)
SpatialLag.insert(3, "spatiallag", spatiallag_result)
In [0]:
data = pd.merge(data, SpatialLag[['date','province','spatiallag']], on=['date','province'])

Then, I calculate the correlation between the predicted variable and the predictors. The socio-economic characteristics have low correlation values.

In [0]:
data.corr()
Out[0]:
today 1DayBefore 3DaysBefore 7DaysBefore 14DaysBefore university_count nursing_home_count elderly_population_ratio elderly_alone_ratio avg_relative_humidity spatiallag
today 1.000000 0.929543 0.830090 0.561966 0.173956 -0.013925 0.032521 -0.043970 -0.046879 0.072334 0.174473
1DayBefore 0.929543 1.000000 0.874492 0.636522 0.206115 -0.013984 0.032450 -0.043921 -0.046827 0.075439 0.175532
3DaysBefore 0.830090 0.874492 1.000000 0.758790 0.280700 -0.014176 0.032217 -0.043851 -0.046733 0.062889 0.155266
7DaysBefore 0.561966 0.636522 0.758790 1.000000 0.562140 -0.014324 0.032076 -0.043731 -0.046609 0.021247 0.112917
14DaysBefore 0.173956 0.206115 0.280700 0.562140 1.000000 -0.015106 0.031325 -0.043668 -0.046457 -0.060934 0.010376
university_count -0.013925 -0.013984 -0.014176 -0.014324 -0.015106 1.000000 0.882574 0.057055 0.005488 0.094845 0.048895
nursing_home_count 0.032521 0.032450 0.032217 0.032076 0.031325 0.882574 1.000000 -0.247552 -0.297447 0.048044 -0.033359
elderly_population_ratio -0.043970 -0.043921 -0.043851 -0.043731 -0.043668 0.057055 -0.247552 1.000000 0.984624 -0.041523 0.209722
elderly_alone_ratio -0.046879 -0.046827 -0.046733 -0.046609 -0.046457 0.005488 -0.297447 0.984624 1.000000 -0.044950 0.223864
avg_relative_humidity 0.072334 0.075439 0.062889 0.021247 -0.060934 0.094845 0.048044 -0.041523 -0.044950 1.000000 0.061348
spatiallag 0.174473 0.175532 0.155266 0.112917 0.010376 0.048895 -0.033359 0.209722 0.223864 0.061348 1.000000
In [0]:
from sklearn.model_selection import train_test_split
import sklearn.metrics

ProvinceData = TimeLag.loc[TimeLag.province=='Seoul']
X = data.drop(['date','province','today'], axis=1)
y = data.today

Given that the data is a time series, I divide the data set into training and test sets based on the date instead of splitting them randomly. I take the data up to March 30th as the training set and the rest as the test set.

In [0]:
training = data.loc[data.date <= datetime(2020, 3, 30)]
test = data.loc[data.date > datetime(2020, 3, 30)]
X_train = training.drop(['date','province','today'], axis=1)
y_train = training.today
X_test = test.drop(['date','province','today'], axis=1)
y_test = test.today
In [0]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

regr = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=1000, random_state=42)
)
In [0]:
regr.fit(X_train, y_train)
Out[0]:
Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('randomforestregressor',
                 RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                       criterion='mse', max_depth=None,
                                       max_features='auto', max_leaf_nodes=None,
                                       max_samples=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       n_estimators=1000, n_jobs=None,
                                       oob_score=False, random_state=42,
                                       verbose=0, warm_start=False))],
         verbose=False)

I calculate the error for both the training set and the test set. The MAE on the test set is 2.21, while the mean daily growth itself is only 1.91, so the model does not perform as well as I had hoped. It could be improved, for example, by using a more appropriate model or adding stronger predictors; a hedged sketch of one alternative follows the error calculations below.

In [0]:
# Use the forest's predict method on the test data
predictions = regr.predict(X_test).astype(int)
# Calculate the absolute errors
abs_errors = abs(predictions - y_test)
errors = predictions - y_test
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(abs_errors), 2))
print('Mean Error:', round(np.mean(errors), 2))
print('Mean Daily Growth:', round(y_test.mean(),2))
Mean Absolute Error: 2.21
Mean Error: 1.08
Mean Daily Growth: 1.91
In [0]:
# Use the forest's predict method on the test data
predictions = regr.predict(X_train).astype(int)
# Calculate the absolute errors
abs_errors = abs(predictions - y_train)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(abs_errors), 2))
print('Mean Daily Growth:', round(y_train.mean(),2))
Mean Absolute Error: 1.79
Mean Daily Growth: 10.5
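
As one illustration of trying a more appropriate model, here is a minimal sketch that fits a gradient-boosted regressor on the same date-based split; the choice of GradientBoostingRegressor and its hyperparameters are my own assumptions and not part of the original analysis.

# Sketch (assumed alternative model): compare a gradient-boosted regressor on the same split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error

alt_regr = make_pipeline(
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=42),
)
alt_regr.fit(X_train, y_train)
print('Alternative test MAE:', round(mean_absolute_error(y_test, alt_regr.predict(X_test)), 2))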

We pick a single day in the test period as an example: using the status on April 1st, we predict the daily increase on April 2nd and compare the prediction with the actual counts.

In [0]:
status31 = data.loc[data['date']==datetime(2020,4,1)]
prediction = regr.predict(status31.drop(['date','province','today'], axis=1)).astype(int)
prediction
Out[0]:
array([ 1,  0, 46, 19,  0,  0,  0,  7, 13,  5,  0,  1,  0,  0,  1])
In [0]:
# attach the predictions to the April 1st snapshot for comparison
status31.insert(3, "prediction", prediction.astype(float).tolist())
In [0]:
status32 = TimeLag.loc[TimeLag.date==datetime(2020,4,2)]
comp = pd.merge(status32[['province','today']], status31[['province','prediction']],on='province').rename(columns={'today':'actual'})
comp
Out[0]:
province actual prediction
0 Gyeongsangnam-do 6.0 0.0
1 Jeollanam-do 2.0 0.0
2 Daejeon 0.0 0.0
3 Gangwon-do 1.0 0.0
4 Incheon 4.0 5.0
5 Jeollabuk-do 0.0 0.0
6 Ulsan 1.0 1.0
7 Gyeongsangbuk-do 2.0 7.0
8 Gyeonggi-do 17.0 13.0
9 Chungcheongnam-do 2.0 1.0
10 Jeju-do 0.0 0.0
11 Daegu 21.0 46.0
12 Seoul 14.0 19.0
13 Busan 0.0 1.0
14 Gwangju 1.0 0.0

For most provinces the predicted values are close to the actual numbers, so the model does give a general picture of the next day's growth across the country.
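
To put a number on how close the one-day prediction is, here is a quick sketch of the per-province mean absolute error for April 2nd, assuming the comp frame from the cell above:

# Sketch: average absolute gap between predicted and actual daily counts on April 2nd
print('Per-province MAE, April 2nd:', round(abs(comp.actual - comp.prediction).mean(), 2))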

3.2 Prediction for United States Using ARIMA Model

In the United States the virus has spread quickly and the number of confirmed cases keeps increasing rapidly every day, so it is possible to use the existing time pattern to forecast the future. For this prediction, I use an ARIMA model.

In [0]:
USAdata = pd.read_csv('/content/drive/My Drive/CIS545_2020/us_covid19_daily.csv', dtype={'date': str}, 
            sep=',')
In [0]:
# Days with 0 cases are recorded as NA in this dataset, so we fill the NAs with 0 before the analysis.
USAdata = USAdata[['date','positive']].fillna(0)
USAdata.date = USAdata.date.apply(lambda x: datetime.strptime(x, '%Y%m%d'))
In [0]:
# reverse the rows: the original data are ordered by date in descending order.
USAdata = USAdata[::-1]
In [0]:
# we first train the model on the first 96 days, then predict the remaining days and evaluate the accuracy.
confirmedCases90days = USAdata[['date','positive']].iloc[:96]
In [0]:
toPred10days = USAdata[['date','positive']].iloc[96:]
In [0]:
confirmedCases90days = confirmedCases90days.set_index('date')
toPred10days = toPred10days.set_index('date')
In [103]:
rolmean = confirmedCases90days.rolling(window=7).mean()  # rolling mean over a 7-day window
rolstd = confirmedCases90days.rolling(window=7).std()    # rolling standard deviation over a 7-day window

#Plot rolling statistics
plt.figure(figsize=(13,5),dpi = 80)
plt.xticks(rotation=45)

orig = plt.plot(confirmedCases90days.positive, color='blue', label='Original')  # original series
mean = plt.plot(rolmean, color='red', label='Rolling Mean')                      # rolling mean
std = plt.plot(rolstd, color='black', label='Rolling Std')                       # rolling standard deviation

plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation', size=16)
plt.xlabel('Date', size=12)
plt.ylabel('Count', size=12)
plt.show()
[Figure Out[103]: 'Rolling Mean & Standard Deviation' — original positive counts with 7-day rolling mean and rolling std by date]

From the plots, the original count data do not look stationary, but we also run the augmented Dickey-Fuller test to check stationarity statistically. With a p-value of about 0.999, we fail to reject the null hypothesis of a unit root, so the data are probably non-stationary.

In [0]:
from statsmodels.tsa.stattools import adfuller 
In [0]:
def test_stationarity(timeseries):  
    #Determine rolling statistics
    movingAverage = timeseries.rolling(window=7).mean() 
    movingSTD = timeseries.rolling(window=7).std()    
    
    #Plot rolling statistics
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(movingAverage, color='red', label='Rolling Mean')
    std = plt.plot(movingSTD, color='black', label='Rolling Std')
    plt.xticks(rotation=45)
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    
    #Perform Dickey–Fuller test
    print('Results of Dickey Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
In [111]:
test_stationarity(confirmedCases90days.positive)
Results of Dickey Fuller Test:
Test Statistic                  1.934444
p-value                         0.998581
#Lags Used                     12.000000
Number of Observations Used    83.000000
Critical Value (1%)            -3.511712
Critical Value (5%)            -2.897048
Critical Value (10%)           -2.585713
dtype: float64

I then check stationarity for the log-transformed data. With a p-value of about 0.88, the log series is probably still not stationary.

In [112]:
#Estimating the trend
confirmedCases90days_logScale = np.log(confirmedCases90days.positive)  #taking log 
test_stationarity(confirmedCases90days_logScale)
Results of Dickey Fuller Test:
Test Statistic                 -0.549680
p-value                         0.881952
#Lags Used                      2.000000
Number of Observations Used    93.000000
Critical Value (1%)            -3.502705
Critical Value (5%)            -2.893158
Critical Value (10%)           -2.583637
dtype: float64

Thus, differencing is used to remove the trend.

In [114]:
#Differencing
datasetLogDiffShifting_1 = confirmedCases90days_logScale - confirmedCases90days_logScale.shift()
plt.plot(datasetLogDiffShifting_1,'r')

datasetLogDiffShifting_2 = datasetLogDiffShifting_1 - datasetLogDiffShifting_1.shift()
plt.plot(datasetLogDiffShifting_2)
[Figure Out[114]: first-order (red) and second-order differenced log series]

After first-order differencing, the p-value drops to about 0.008, which indicates that the differenced series is probably stationary.

In [115]:
datasetLogDiffShifting_1.dropna(inplace=True)
test_stationarity(datasetLogDiffShifting_1)
Results of Dickey Fuller Test:
Test Statistic                 -3.487027
p-value                         0.008327
#Lags Used                      1.000000
Number of Observations Used    93.000000
Critical Value (1%)            -3.502705
Critical Value (5%)            -2.893158
Critical Value (10%)           -2.583637
dtype: float64

After the time series has been stationarized by differencing, the next step is to determine whether AR or MA terms are needed to correct any autocorrelation remaining in the differenced series. By looking at the ACF and PACF plots, we can identify candidate AR and MA orders. In this case, I consider p ranging from 1 to 3 and q ranging from 1 to 2.

In [0]:
import statsmodels.api as sm
fig = plt.figure(figsize=(12,8))
#acf   from statsmodels.tsa.stattools import acf, pacf
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(datasetLogDiffShifting_1[2:], lags=20,ax=ax1)
ax1.xaxis.set_ticks_position('bottom')
fig.tight_layout();
#pacf
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(datasetLogDiffShifting_1[2:], lags=20, ax=ax2)
ax2.xaxis.set_ticks_position('bottom')
fig.tight_layout();

This gives several candidate combinations of p and q. One way to pick the best combination is to compare the AIC (Akaike information criterion): in general, the lower the AIC, the better the model.

In [0]:
import sys
from statsmodels.tsa.arima_model import ARIMA

# grid-search p (AR order) and q (MA order) with d=1, keeping the lowest-AIC model
def _proper_model(ts_log_diff, maxp, maxq):
    best_p = 0 
    best_q = 0
    best_aic = sys.maxsize
    best_model=None
    for p in np.arange(maxp+1):
        for q in np.arange(maxq+1):
            model = ARIMA(ts_log_diff, order=(p,1, q))
            try:
                results_ARIMA = model.fit(disp=-1)
            except:
                continue
            aic = results_ARIMA.aic
            print(aic, best_aic)
            if aic < best_aic:
                best_p = p
                best_q = q
                best_aic = aic
                best_model = results_ARIMA
    return best_p,best_q,best_model
_proper_model(confirmedCases90days_logScale, 3, 2)
In [137]:
from statsmodels.tsa.arima_model import ARIMA 
# build the model based on the training data and predict on the test data
model = ARIMA(confirmedCases90days_logScale, order=(1,1,1)) # after trying all combinations, the best model has p=1 and q=1
results_ARIMA = model.fit(disp=-1)
plt.plot(datasetLogDiffShifting_1)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_ARIMA.fittedvalues - datasetLogDiffShifting_1)**2))
plt.xticks(rotation=45)
plt.show()
/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/base/tsa_model.py:165: ValueWarning: No frequency information was provided, so inferred frequency D will be used.
  % freq, ValueWarning)
[Figure Out[137]: ARIMA(1,1,1) fitted values (red) over the differenced log series; RSS: 3.2132]

We transform the fitted ARIMA values back to the original scale to see how well the model tracks the observed counts.

In [0]:
# undo the differencing: cumulative sum of the fitted differences added back to the first log value
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()

predictions_ARIMA_log = pd.Series(confirmedCases90days_logScale.iloc[0], index=confirmedCases90days_logScale.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)

# exponentiate the log values.
predictions_ARIMA = np.exp(predictions_ARIMA_log)

The red line shows the predicted values and the blue line shows the actual numbers in the U.S. As the plot below shows, the model underpredicts the number of positive cases.

In [138]:
plt.plot(predictions_ARIMA, color='red') 
plt.plot(confirmedCases90days.positive.iloc[:96])
plt.xticks(rotation=45)
plt.xlabel('Date', size=12)
plt.ylabel('Count', size=12)
plt.show()
[Figure Out[138]: back-transformed fitted values (red) vs. actual positive counts over the training period; x-axis Date, y-axis Count]
In [0]:
USAdata_logScale = np.log(USAdata.positive)

Then, I use the model to predict the held-out test data, i.e., the most recent days that were not used for training (toPred10days).

In [0]:
# calculate the predicted values
predictions_ARIMA_=results_ARIMA.predict(1,103) 
predictions_ARIMA_diff = pd.Series(predictions_ARIMA_, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()

predictions_ARIMA_log = pd.Series(USAdata_logScale.iloc[0], index=predictions_ARIMA_.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)

# exponentiate the log values.
predictions_ARIMA = np.exp(predictions_ARIMA_log)
In [158]:
plt.figure(figsize=(10,8),dpi = 80)
plt.xticks(rotation=45,fontsize=10)
plt.yticks(fontsize=10)

USAdata = USAdata.set_index('date')

plt.plot(predictions_ARIMA[:96],markersize=30,label='Predicted',color = 'darkblue',linestyle='--',linewidth=2.0)
plt.plot(predictions_ARIMA[95:],markersize=30,label='Predicted-future',color = 'blue',linewidth=2.0)
plt.plot(USAdata.positive,markersize=30,label='Actual',color = 'orange',linewidth=2.0)
plt.plot(confirmedCases90days.positive,markersize=30,label='Training',color = 'green',linestyle='--',linewidth=2.0)

plt.xlabel("Date",fontsize=15)
plt.ylabel("The total number of confirmed case",fontsize=15)
plt.title("Accumulated confirmed cases",fontsize=20)
plt.grid(alpha=1)

plt.legend(loc="upper left")
plt.show()
[Figure Out[158]: 'Accumulated confirmed cases' — training (green, dashed), predicted (dark blue, dashed), predicted-future (blue), and actual (orange) counts by date]

Based on the prediction result, the model does not perform as well as we expected: the actual numbers show that daily growth in the U.S. has started to slow down, whereas the prediction keeps increasing rapidly, reaching about 1.4 million total cases in the first week of May. The real number for May 6th is about 1.25 million.
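
To quantify the gap on the held-out days, here is a minimal sketch; it assumes that predictions_ARIMA carries the same date index as the observed series so the two can be aligned by index:

# Sketch: mean absolute percentage error on the held-out days (index alignment assumed)
heldout = toPred10days.join(predictions_ARIMA.rename('predicted'), how='inner')
mape = (abs(heldout.predicted - heldout.positive) / heldout.positive).mean() * 100
print('Held-out MAPE: %.1f%%' % mape)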

Then, I build the ARIMA model on all of the data we have and make a forecast. The steps are the same as above, but this time the model is fitted on the whole data set instead of only part of it.

In [0]:
USAdata_logScale = np.log(USAdata.positive)

After differencing, the p-value is below 0.05, which suggests that the differenced data are stationary.

In [167]:
# Differencing
datasetLogDiffShifting_1 = USAdata_logScale - USAdata_logScale.shift()
test_stationarity(datasetLogDiffShifting_1[1:])
Results of Dickey Fuller Test:
Test Statistic                  -3.563806
p-value                          0.006492
#Lags Used                       1.000000
Number of Observations Used    100.000000
Critical Value (1%)             -3.497501
Critical Value (5%)             -2.890906
Critical Value (10%)            -2.582435
dtype: float64
In [0]:
_proper_model(USAdata_logScale, 3, 2)

Here, p and q are both set to 1 after comparing all the combinations.

In [171]:
from statsmodels.tsa.arima_model import ARIMA 
model = ARIMA(USAdata_logScale, order=(1,1,1))
results_ARIMA = model.fit(disp=-1)
plt.plot(datasetLogDiffShifting_1[1:])
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'%sum((results_ARIMA.fittedvalues - datasetLogDiffShifting_1[1:])**2))
/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/base/tsa_model.py:165: ValueWarning: No frequency information was provided, so inferred frequency D will be used.
  % freq, ValueWarning)
[Figure Out[171]: ARIMA(1,1,1) fitted values (red) over the differenced log series for the full data; RSS: 3.2181]

I predict the situation in the coming week using this model; the result is plotted below. According to the model, the total number of positive cases reaches about 1.75 million, which I would consider a worst-case scenario. Even so, the government should put more effort into control measures, since at the current rate the actual number could approach that value.

In [0]:
predictions_ARIMA_=results_ARIMA.predict(1,110) 
predictions_ARIMA_diff = pd.Series(predictions_ARIMA_, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()

predictions_ARIMA_log = pd.Series(USAdata_logScale.iloc[0], index=predictions_ARIMA_.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)

# Inverse of log is exp.
predictions_ARIMA = np.exp(predictions_ARIMA_log)
In [173]:
plt.figure(figsize=(10,8),dpi = 80)
plt.xticks(rotation=45,fontsize=10)
plt.yticks(fontsize=10)

plt.plot(predictions_ARIMA,markersize=30,label='Predicted',color = 'blue',linewidth=2.0)
plt.plot(USAdata.positive,markersize=30,label='Actual',color = 'orange',linewidth=2.0)

plt.xlabel("Date",fontsize=15)
plt.ylabel("The total number of confirmed case",fontsize=15)
plt.title("Accumulated confirmed cases",fontsize=20)
plt.grid(alpha=1)

plt.legend(loc="upper left")
plt.show()
[Figure Out[173]: 'Accumulated confirmed cases' — predicted (blue) vs. actual (orange) counts by date, including the forecast for the coming week]

4 Challenges

  1. The provided data set only contains accumulated counts, while this study focuses on both the accumulated numbers and the daily growth, so the daily growth has to be derived from differences of the accumulated counts.
  2. When analyzing and predicting the cases in each province, we cannot ignore spatial autocorrelation. Identifying each province's neighbors and calculating its spatial lag is somewhat challenging (see the sketch after this list).
  3. Since the data set mixes numeric and categorical features, K-Means is not appropriate here.
  4. When building the ARIMA model for the time series, I found the data were not stationary, so I differenced them in the simplest way and did not try other detrending approaches. The accuracy of the model is rather low.
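
As a minimal sketch of the spatial-lag idea in item 2, using a made-up toy adjacency list and counts rather than the actual province neighborhood structure used above:

# Sketch: spatial lag as the mean of a province's neighbors' daily counts
# toy_neighbors and toy_counts are hypothetical illustrative data
toy_neighbors = {'Seoul': ['Incheon', 'Gyeonggi-do'],
                 'Incheon': ['Seoul', 'Gyeonggi-do'],
                 'Gyeonggi-do': ['Seoul', 'Incheon']}
toy_counts = {'Seoul': 14, 'Incheon': 4, 'Gyeonggi-do': 17}

def toy_spatial_lag(province):
    nbrs = toy_neighbors[province]
    return sum(toy_counts[n] for n in nbrs) / len(nbrs)

print(toy_spatial_lag('Seoul'))  # (4 + 17) / 2 = 10.5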

5 Future Directions

There is room for improvement in the models. For the prediction of the situation in Korea, I think a space-time model is a good approach to predicting the daily increase in each province; however, with stronger features the model would perform much better. In the future, I might also try different ways of differencing the Korean time series to make it stationary and then build an ARIMA model for prediction. For the model in Section 3.2, the accuracy is rather poor and more should be done to eliminate the trend, for example differencing against a moving average (see the sketch below).
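
A minimal sketch of the moving-average idea, assuming the USAdata_logScale series and the test_stationarity helper from Section 3.2 are still in scope:

# Sketch: detrend by subtracting a 7-day moving average instead of first-order
# differencing, then re-check stationarity with the ADF test
movingAvg = USAdata_logScale.rolling(window=7).mean()
logMinusMA = (USAdata_logScale - movingAvg).dropna()
test_stationarity(logMinusMA)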