- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series
- ⚕️ Healthcare
In Part-1 of this study, we explored and cleaned the nursing home COVID-19 time series data and built Exponential Smoothing and Forest-based Forecast models for each US county.
In this notebook, we will:
- build ARIMA models to forecast weekly nursing home resident death rates for each US county,
- combine the ARIMA results with the Exponential Smoothing and Forest-based Forecast results from Part-1,
- generate overlay plots comparing the forecasts from the different models with the actual data, and
- map the forecast errors generated by each model.
Each record in our data represents information about an individual nursing home in the United States for each week. For the analysis, we will use data ranging from June 7, 2020 - June 6, 2021.
# Import Libraries
# Display
from IPython.display import display
# ArcGIS
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap
# Data Exploration
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)
# Plotting Libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
from ipywidgets import HBox, VBox, Layout, Label
# Datetime and os
from datetime import datetime
import time
import os
import itertools
# ARIMA models
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tools.eval_measures import rmse
# Stop warnings
import warnings
warnings.filterwarnings("ignore")
# Create a GIS connection
gis = GIS("home")
ARIMA (Autoregressive Integrated Moving Average) [1][2] models provide another approach to time series forecasting. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.
ARIMA [3] is a combination of 3 models:
- AR (Autoregression): the variable is regressed on its own lagged values.
- I (Integrated): the series is differenced to make it stationary.
- MA (Moving Average): the regression error is modeled as a linear combination of past error terms.
ARIMA [3] combines the ideas of autoregression and differencing. Instead of using autoregression with actual observed data, we use autoregression on the differences. An ARIMA(p,d,q) model can be written as: \begin{array}{l} y_{t}^{\prime}=c+\phi_{1} y_{t-1}^{\prime}+\cdots+\phi_{p} y_{t-p}^{\prime}+\theta_{1} \varepsilon_{t-1}+\cdots+\theta_{q} \varepsilon_{t-q}+\varepsilon_{t} \end{array} where $y_{t}^{\prime}$ is the differenced series. The “predictors” on the right hand side include both lagged values of $y_{t}$ and lagged errors.
In this section, we will:
- access and prepare the nursing home time series data, and
- build an ARIMA model for a single county.
We will use data from June 7, 2020 - March 14, 2021 to train the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
ts_item = gis.content.search(f'NH_TS_US_Cleaned_0621 {group_query}',
item_type = "Feature Service",
outside_org=True)
ts_item
# Access layer
ts_layer = ts_item[1].layers[0]
ts_layer
# Get column names in layer
for f in ts_layer.properties.fields:
print(f['name'])
# Create dataframe from layer
testdf = ts_layer.query(as_df=True, out_fields=['county','provider_state',
'week_ending','weekly_resident_covid19_deaths_'])
testdf.shape
# Check head of data
testdf.head()
We will group the data by county and state to get the average weekly resident death rate for each county, state pair for each time period.
# Group data by County and State
grouped_df = testdf.groupby(['week_ending','provider_state','county']) \
.aggregate('mean')[['weekly_resident_covid19_deaths_']].copy()
grouped_df.reset_index(inplace=True)
grouped_df.set_index('week_ending', inplace=True)
grouped_df.head()
# Check shape
grouped_df.shape
There are many counties for which the data for the training time period, June 7, 2020 - March 14, 2021, is missing. As such, we will remove these counties from the dataset. The counties are identified by NaN values.
# Subset missing data
miss_vals = grouped_df.groupby(['county', 'provider_state']) \
.filter(lambda x:x['weekly_resident_covid19_deaths_'][:'2021-03-14'] \
.sum()==0)
miss_vals.shape
# Identify county, state pairs for which data is missing
cnty_ls = [(county, state) for county, state in zip(miss_vals['county'], miss_vals['provider_state'])]
cnty_ls_new = list(dict.fromkeys(cnty_ls))
# Display count and names of some counties with missing data
display(len(cnty_ls_new))
display(cnty_ls_new[:5])
There are 266 counties where resident death rate data for the time period June 7, 2020 - March 14, 2021 is missing. We will remove the records for these counties.
# Subset counties with data
grouped_df2 = grouped_df.groupby(['county', 'provider_state']) \
.filter(lambda x:x['weekly_resident_covid19_deaths_'][:'2021-03-14'].sum()!=0).copy()
grouped_df2.shape
# Identify county, state pairs for which data is present
grp_ls = [(county, state) for county, state in zip(grouped_df2['county'], grouped_df2['provider_state'])]
grp_ls_new = list(dict.fromkeys(grp_ls))
# Display count and names of some counties with data
display(len(grp_ls_new))
display(grp_ls_new[:5])
There are 2554 counties where resident death rate data for the time period June 7, 2020 - March 14, 2021 exists.
# Check to confirm that data for missing county is not present
grouped_df2.query('(county in ["Benson"]) and (provider_state in ["ND"])')
# Check head of data
grouped_df2.head()
Here, we will build an ARIMA model. The order (p,d,q) for an ARIMA(p,d,q) model can be different for each county. We will:
- subset the data for a single county,
- use `auto_arima` to find the optimal (p,d,q) order,
- fit an ARIMA model with that order, and
- forecast 12 time steps and plot the results.
Later, we will define helper functions that will iterate over all counties to build ARIMA(p,d,q) models for each county and gather results. We will use data from June 7, 2020 - March 14, 2021 to train the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.
We will subset and explore the data for San Bernardino County, CA.
# Subset grouped data for SB County and specific feature (resident deaths per 1000 residents)
ar_sb = grouped_df2.query('(county == "san bernardino") and (provider_state == "CA")')[['weekly_resident_covid19_deaths_']].copy()
ar_sb_train = ar_sb[:'2021-03-14'] # remaining time periods are used for forecasting
ar_sb_test = ar_sb['2021-03-21':] # used for performance evaluation
# Check frequency
ar_sb_train.index
# Set Frequency for data
ar_sb_train.index.freq = 'W'
ar_sb_test.index.freq = 'W'
ar_sb_train.index
# Check shape
ar_sb_train.shape
Use `auto_arima` from `pmdarima` to find the optimal model order (p,d,q).
# Install pmdarima
import sys
!{sys.executable} -m pip install pmdarima
# Import auto_arima
from pmdarima.arima import auto_arima
ar_model = auto_arima(ar_sb_train, suppress_warnings=True,
error_action='warn', trace=True,
stepwise=True)
The optimal order generated by `auto_arima` is ARIMA(3,0,0).
Use the optimal order to fit an ARIMA model.
# Build model using ARIMA from statsmodels library
ar_model = ARIMA(ar_sb_train.iloc[:,0], order=(3,0,0))
ar_results = ar_model.fit()
ar_results.summary()
# Get ARIMA fitted values
arima_fitted = ar_results.fittedvalues
When forecasting time series data, the idea is to estimate how the sequence of observations will continue into the future. Using the model just created, we will generate a forecast for 12 time periods, i.e., from March 21, 2021 - June 6, 2021.
# Generate forecast and confidence interval for 12 weeks
forecast = ar_results.get_forecast(12)
arima_pred = forecast.predicted_mean
arima_conf_int = forecast.conf_int(alpha=0.05)
# Plot ARIMA model fitted, forecasted values and CI
ax = arima_fitted.plot(legend=True, label='ARIMA Model: Fitted and Forecasted Values',
marker='o', markersize=5,
linewidth=1.8,color='blue',figsize=(18,12))
ax = arima_pred.plot(label='_nolegend_', marker='o', markersize=5,
linewidth=1.8,color='blue')
plt.fill_between(arima_conf_int.index,
arima_conf_int.iloc[:,0],
arima_conf_int.iloc[:,1],
color='blue', alpha=0.1, label='ARIMA Model-Confidence Interval')
# Plot actual values
ax = ar_sb_train.iloc[:,0].plot(legend=True, label='Actual Deaths',
marker='x', markersize=5,
linewidth=1.8, color='black', alpha=0.8)
# Plot line showing Start of Forecast
plt.vlines(arima_pred.index[-12],
ymin=arima_conf_int.iloc[:,0].min(),
ymax=arima_conf_int.iloc[:,1].max(),
color='grey', linewidth=1, linestyle="--",
label='Forecast results start here')
# Plot label and title
ax.autoscale(axis='x')
ax.set(ylabel='Resident death rate', xlabel='',
title='ARIMA Model: Resident death rates - San Bernardino, CA')
ax.title.set_fontsize(15)
plt.legend(loc='upper left', prop={'size': 12})
plt.show()
The plot shows:
- the fitted values generated by the ARIMA(3,0,0) model for June 7, 2020 - March 14, 2021.
- the forecasted resident death rates generated by the ARIMA(3,0,0) model for March 21, 2021 - June 6, 2021.
- the actual resident death rates for June 7, 2020 - March 14, 2021.
- the confidence interval of the forecast for March 21, 2021 - June 6, 2021.
Observations:
- The fitted values seem to generalize the actual death rates well.
- The forecast shows an increase in trend for the county.
- The increasing width of the confidence intervals shows higher uncertainty associated with each forecast as the forecasting time period increases (quantified in the short check below).
As mentioned earlier, the order (p,d,q) for an ARIMA(p,d,q) model can be different for each county. We just saw how an ARIMA(p,d,q) model can be created and used to forecast for a single county. In the next section, we will define helper functions that iterate over all counties to get ARIMA model results and errors for each county.
Let's build ARIMA models for each county and then combine the results with Exponential Smoothing and Forest-based Forecast models.
Here, we will:
- access the Exponential Smoothing and Forest-based Forecast results from Part-1,
- define helper functions to build ARIMA models for each county, and
- combine the ARIMA results with the Exponential Smoothing and Forest-based results.
The combined results will be used next to generate plots overlaying various models.
Use the `ES_FB_BigLayer` feature layer to get Exponential Smoothing and Forest-based Forecast model results. This layer was created in the notebook for Part-1 of this analysis.
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
es_fb_item = gis.content.search(f'ES_FB_BigLayer {group_query}',
item_type = "Feature Service",
outside_org=True)
es_fb_item
# Access layer
es_fb_layer = es_fb_item[0].layers[0]
es_fb_layer
# Create dataframe from layer
es_fb_df = es_fb_layer.query(as_df=True)
# Check data
display(es_fb_df.shape)
display(es_fb_df.columns)
Let's define the helper functions. The functions in this section are designed to:
- build an ARIMA model for a single county and assemble its fitted values, forecasts, confidence intervals, and RMSE into a dataframe, and
- iterate over all counties and combine the ARIMA results with the Exponential Smoothing and Forest-based results.
def arima_df(ar_best_fitted, ar_best_pred, ar_best_conf_int, ordr, test_data, county, state):
"""
Creates dataframe of ARIMA model results along with model error.
:param ar_best_fitted: array, fitted values from ARIMA model
:param ar_best_pred: array, forecasted values from ARIMA model
:param ar_best_conf_int: dataframe, high and low confidence interval values from ARIMA model
:param ordr: tuple, model (p,d,q) order
:param test_data: dataframe, test data
:param county: string, name of the county in lower case
:param state: string, state abbreviation in upper case
"""
# dataframe for fitted
fit_df = pd.DataFrame(ar_best_fitted, columns=['fitted_ar']).reset_index()
# dataframe for forecast
fcst_df = pd.DataFrame(ar_best_pred).reset_index()
cols_to_rename = {'index':'week_ending', 'predicted_mean':'fitted_ar'}
fcst_df.rename(columns=cols_to_rename, inplace=True)
# Concat fitted and fcst df
ar_df = pd.concat([fit_df, fcst_df])
# Process confidence interval
ar_best_conf_int.reset_index(inplace=True)
cols_to_rename = {'index':'week_ending',
'lower weekly_resident_covid19_deaths_':'low_ar',
'upper weekly_resident_covid19_deaths_':'high_ar'}
ar_best_conf_int.rename(columns=cols_to_rename, inplace=True)
# Merge conf int with ar_df
ar_df = ar_df.merge(ar_best_conf_int, on='week_ending', how='left')
# Create df for raw and fitted values for test data
test_data.reset_index(inplace=True)
test_data = test_data.merge(fcst_df)
cols_to_rename = {'weekly_resident_covid19_deaths_':'raw_value'}
test_data.rename(columns=cols_to_rename, inplace=True)
# Calculate error
error = rmse(test_data['raw_value'], test_data['fitted_ar'])
# Create column for error
dt1 = datetime.strptime('2021-03-21', '%Y-%m-%d')
ar_df['f_rmse_ar'] = ar_df['week_ending'].apply(lambda x: 0 if x < dt1 else error)
# Replace NA with 0
ar_df = ar_df.fillna(0)
# Rename week ending col
ar_df.rename(columns={'week_ending':'end_date'}, inplace=True)
# Create county,state and model order columns
ar_df['county'] = county.title()
ar_df['state'] = state
ar_df['ar_order'] = str(ordr)
return ar_df
def arima_model_results(grouped_df, county, state, forecast_period):
"""
Builds ARIMA model and returns a dataframe of the results.
:param grouped_df: dataframe containing aggregated values of resident deaths per county per time period
:param county: string, name of the county in lower case
:param state: string, state abbreviation in upper case
:param forecast_period: int, number of periods to forecast ARIMA models
"""
cnty = county
state = state
# Subset data for county, state
ar_cnty = grouped_df.query('(county == @cnty) and (provider_state == @state)')[['weekly_resident_covid19_deaths_']].copy()
# Split train, test data
ar_cnty_train = ar_cnty[:'2021-03-14']
ar_cnty_test = ar_cnty['2021-03-21':]
# Set Frequency for data
ar_cnty_train.index.freq = 'W'
# Find Best Model and get its (p,d,q) order
ar_mod = auto_arima(ar_cnty_train, suppress_warnings=True,
error_action='warn', trace=False,
stepwise=True)
ordr = ar_mod.get_params()['order']
p,d,q = ordr
# Fit the best model
ar_best_mod = ARIMA(ar_cnty_train, order=(p,d,q))
ar_best_results = ar_best_mod.fit()
# Get fitted values
ar_best_fitted = ar_best_results.fittedvalues
# Forecast for 12 weeks and get predictions and interval
ar_forecast = ar_best_results.get_forecast(forecast_period)
ar_best_pred = ar_forecast.predicted_mean # predicted values
ar_best_conf_int = ar_forecast.conf_int(alpha=0.05) # interval
# Create dataframe of fitted, forecasted and RMSE using `arima_df` function
ar_df = arima_df(ar_best_fitted, ar_best_pred, ar_best_conf_int, ordr, ar_cnty_test, cnty, state)
return ar_df
def combine_results(main_df, grouped_df, fcast=12):
"""
Combines the results of the ARIMA models for different counties with the results from
Exponential Smoothing and Forest-based models.
:param main_df: dataframe containing results from Exponential Smoothing and Forest based models
:param grouped_df: dataframe containing aggregated values of resident deaths per county per time period
:param fcast: int, number of periods to forecast ARIMA models
"""
# Create a list of unique (county, state) tuples from main df
ls = [(county, state) for county, state in zip(main_df['county'], main_df['state'])]
new_ls = list(dict.fromkeys(ls))
# Create a list of unique (county, state) tuples from grouped df
# Change first letter of county to upper case
grpd_ls = [(county.title(), state) for county, state in zip(grouped_df['county'], grouped_df['provider_state'])]
grpd_new_l = list(dict.fromkeys(grpd_ls))
cnty_len = len(grpd_new_l)
new_df = pd.DataFrame() # empty df to append arima results
not_present = [] # stores county,state for which data is not in grouped df (actual data not available)
errors = [] # stores county, state for which ARIMA model errored out
"""
The code below:
1. Iterates over (county, state) values from main_df
2. Checks if (county, state) values are in grouped_df
a. If values are present, builds ARIMA model for the (county,state) using `arima_model_results` function
b. If not, append (county,state) to the not_present list
3. Combines ARIMA results for all counties
"""
count = 0
for i in new_ls:
if i in grpd_new_l:
county = i[0].lower() # convert county to lower case
state = i[1]
try:
ar_res = arima_model_results(grouped_df, county, state, fcast) # Run ARIMA model
new_df = new_df.append(ar_res) # append to dataframe
count+= 1
if count % 50 == 0:
print(f'Processed {count} of {cnty_len} counties...')
# time.sleep(5)
except:
print('Error at ', i)
errors.append(i)
continue
else:
not_present.append(i)
continue
return new_df, not_present, errors
Run the `combine_results` helper function defined in the section above.
%%time
# Run to get ARIMA results combined with ESF and FBF results
comb_df, not_present, errors = combine_results(es_fb_df, grouped_df2, fcast=12)
# Check details of results
print(f'Shape of dataframe: {comb_df.shape}')
print(f'Number of counties for which data is not present: {len(not_present)}')
print(f'Counties for which ARIMA model errored out: {errors}')
# Combine ARIMA results with ESF and FBF results
comb_df2 = pd.merge(es_fb_df, comb_df, on=['county','state','end_date'], how='left')
comb_df2.shape
The data in `es_fb_df` includes records of counties for which no data is present. Since we are combining the ARIMA results with this dataframe, the number of records in the resulting dataframe increases. We will address this in the next section by removing records for counties where no data is present.
# Check head
comb_df2.head()
In this section, we will do some more clean-up of the resulting dataframe. We will:
- combine the actual death rates in `grouped_df2` with the results dataframe `comb_df2`, and
- remove records for counties where actual data is missing.
Add actual death rates from `grouped_df2` to `comb_df2`.
# Create a temporary df to rearrange grouped df data
temp_df = grouped_df2.copy()
temp_df.reset_index(inplace=True)
cols_to_rename = {'provider_state':'state',
'weekly_resident_covid19_deaths_':'agg_res_deaths_per1000_res',
'week_ending':'end_date'}
temp_df.rename(columns=cols_to_rename, inplace=True)
temp_df['county'] = [county.title() for county in temp_df['county']] # first letter capital
display(temp_df.shape)
display(temp_df.head())
# Combine temp df with arima results
es_fb_ar_df = pd.merge(comb_df2, temp_df, on=['county','state','end_date'], how='left')
es_fb_ar_df.shape
# Check head
es_fb_ar_df.head()
# No. of counties for which data is not present
len(not_present)
There are many counties for which the data was either missing or could not be aggregated by the Space-Time cube. As such, we will remove these counties from the dataset. The counties are identified by records where the actual death rate values sum to 0.
# Group records for counties with no data
z_vals = es_fb_ar_df.groupby(['county', 'state']).filter(lambda x:x['agg_res_deaths_per1000_res'].sum()==0)
z_vals.shape
# Create a list of (county, state) pairs with no data
cnty_ls = [(county, state) for county, state in zip(z_vals['county'], z_vals['state'])]
cnty_ls_new = list(dict.fromkeys(cnty_ls))
# Check details
display(len(cnty_ls_new))
display(cnty_ls_new[:5])
There are 490 counties where actual resident death rate data is missing. We will remove the records for these counties.
# Subset data
es_fb_ar_df_new = es_fb_ar_df.groupby(['county', 'state']).filter(lambda x:x['agg_res_deaths_per1000_res'].sum()!=0).copy()
es_fb_ar_df_new.shape
# Check to confirm that data for missing county is not present
es_fb_ar_df_new.query('(county in ["Benson"]) and (state in ["ND"])')
We will publish the combined model results to a feature layer for ease of access and use.
es_fb_ar_item = es_fb_ar_df_new.spatial.to_featurelayer('ES_FB_AR_BigLayer')
es_fb_ar_item
The commented cells below can be used to access the published `ES_FB_AR_BigLayer` layer.
# # Get item
# data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
# outside_org=True)
# group_query = f"group: {data_groups[0].id}" if data_groups else ""
# es_fb_ar_item = gis.content.search(f'ES_FB_AR_BigLayer {group_query}',
# item_type = "Feature Service",
# outside_org=True)[0]
# es_fb_ar_item
# # Get layer
# es_fb_ar_lyr = es_fb_ar_item.layers[0]
# # Query as dataframe and check shape
# es_fb_ar_df_new = es_fb_ar_lyr.query(as_df=True)
# es_fb_ar_df_new.shape
So far, we have generated forecast plots using the Exponential Smoothing, Forest-based, and ARIMA models individually. We have not, however, seen the forecasts generated by these models together, to visualize how each model performs with respect to the actual resident deaths data. The idea behind the overlay plot is to visualize the fitted values, the forecasted values, and their confidence intervals for all models on a single plot, along with the actual data for each US county.
In this section, we will:
- define a plotting function that overlays the fitted values, forecasts, and confidence intervals of all models for a county, and
- generate overlay plots for one or more counties.
We will start by defining a function, `overlay_plot`, which will subset the data for each county and plot the data. This function can then be called with the county FIPS code for one or more counties to generate plots.
# Define function
def overlay_plot(main_df, fips: list, ylabel: str, title: str, fcast_date = '2021-03-21'):
# Create grid
if len(fips) > 4:
fig = plt.figure(figsize=(20, 25))
else:
fig = plt.figure(figsize=(20, 15))
for sp in range(0,len(fips)):
if len(fips) == 1:
ax = fig
elif len(fips)%2 == 0:
ax = fig.add_subplot(len(fips)//2,2,sp+1)
else:
ax = fig.add_subplot(len(fips)//2+1,2,sp+1)
# Subset data for county
cnty_df = main_df[main_df['fips']==fips[sp]].copy()
cnty_df.set_index('end_date', inplace=True)
### Plot Random Forest Model ####
# Plot fitted values including forecast
ax = cnty_df['fitted_fb'].plot(legend=True, label='RF Model-Fitted Values',marker='o',
markersize=5, linewidth=1.8, color='orange')
# Plot confidence interval
plt.fill_between(cnty_df[-12:].index,
cnty_df.loc[fcast_date:,'low_fb'],
cnty_df.loc[fcast_date:,'high_fb'],
color='orange', alpha=0.1, label='RF Model-Confidence Interval')
### Plot Exponential Smoothing model ####
# Plot fitted values including forecast
ax = cnty_df['fitted_es'].plot(legend=True, label='ES Model-Fitted Values',marker='o',
markersize=5,linewidth=1.8,color='green')
# Plot confidence interval
plt.fill_between(cnty_df[-12:].index,
cnty_df.loc[fcast_date:,'low_es'],
cnty_df.loc[fcast_date:,'high_es'],
color='g', alpha=0.1, label='ES Model-Confidence Interval')
### Plot ARIMA model ####
# Plot fitted values including forecast
ax = cnty_df['fitted_ar'].plot(legend=True,
label='ARIMA{} Model-Fitted Values'.format(cnty_df['ar_order'].iloc[0]),
marker='o', markersize=5,linewidth=1.8,color='blue')
# Plot confidence interval
plt.fill_between(cnty_df[-12:].index,
cnty_df.loc[fcast_date:,'low_ar'],
cnty_df.loc[fcast_date:,'high_ar'],
color='blue', alpha=0.1, label='ARIMA Model-Confidence Interval')
### Plot actual death rates ####
ax = cnty_df['agg_res_deaths_per1000_res'].plot(legend=True, label='Actual Deaths',
marker='x', markersize=5,
linewidth=1.8, color='black', alpha=0.8)
# Plot line showing Start of Forecast
minimum = min(cnty_df[['low_ar','low_es','low_fb']].min())
maximum = max(cnty_df[['high_ar','high_es','high_fb']].max())
plt.vlines(cnty_df.index[-12],
ymin=minimum,
ymax=maximum,
color='grey', linewidth=1, linestyle="--",
label='Forecast results start here')
# Label, title and legend
ax.set(ylabel=ylabel, xlabel='')
ax.set_title(cnty_df['county'].iloc[0]+', '+cnty_df['state'].iloc[0], fontsize=20)
plt.legend(loc='upper left', prop={'size': 15})
# Remove spines
for key, spine in ax.spines.items():
spine.set_visible(False)
ax.margins(0.01, 0)
fig.suptitle(title, fontsize=25, y=1.0)
plt.rc('axes', labelsize=15)
plt.rc('xtick', labelsize=15)
plt.rc('ytick', labelsize=15)
plt.tight_layout() # automatically adjusts layout to fit long labels
plt.show();
Here, we will use the `overlay_plot` function to generate plots for one or more counties.
Let's start by creating a plot for San Bernardino County.
fips = [6071]
overlay_plot(es_fb_ar_df_new, fips, ylabel='Resident death rate',
title='Model Overlay: Average weekly nursing home resident death rates')
The plot shows:
- fitted values generated by the models for June 7, 2020 - March 14, 2021.
- forecasted values generated by the models for March 21, 2021 - June 6, 2021.
- actual resident death rates for June 7, 2020 - March 14, 2021.
- confidence intervals of the forecasts for March 21, 2021 - June 6, 2021.
Observations:
- The actual resident death rates for San Bernardino County follow an increasing trend from July '20 - Aug '20, followed by a decline until Oct '20. The trend goes up again from Oct '20 - Dec '20, followed by a decline.
- The fitted values for all models seem to follow the actual data closely. Although the actual resident death rates show a declining trend for the forecast period, the forecasts generated by ARIMA and Forest-based models predict a higher death rate.
- Forecasts generated by the ARIMA(3,0,0) model show a slightly higher death rate than the actual data and the other models.
- The forecast generated by the Forest-based Forecast (FBF) model follows a straight line for the first few time periods and then starts trending up for the remaining time periods, showing some variance between the FBF results and the actual data.
- Note that the Confidence Interval (CI) for FBF is the largest, indicating higher uncertainty associated with each forecast compared to the ARIMA or ESF models, as the forecasting time period progresses. The CI for the ARIMA model is the narrowest for this county.
Since Indiana, Arizona, and Pennsylvania were among the top 5 states with the highest average weekly resident deaths per 1000 residents, we will pick a county from each cluster to represent each state and plot them.
County, State | FIPS | Cluster Id |
---|---|---|
Maricopa, AZ | 4013 | 2 |
Delaware, PA | 42045 | 3 |
Porter, IN | 18127 | 4 |
# Generate plots
fips = [4013,42045,18127]
overlay_plot(es_fb_ar_df_new, fips, ylabel='Resident death rate',
title='Model Overlay: Average weekly nursing home resident death rates')
The plots show the:
- fitted values generated by the models for June 7, 2020 - March 14, 2021.
- forecasted values generated by the models for March 21, 2021 - June 6, 2021.
- actual resident deaths for June 7, 2020 - March 14, 2021.
- confidence intervals of the forecasts for March 21, 2021 - June 6, 2021.
Observations:
- The trend of actual resident death rates is different for each county. Earlier in 2020, Maricopa county shows an increase in death rates from June - August. Both Delaware and Porter counties show an increase from June - July.
- Maricopa and Delaware counties show an increase in death rates from mid Dec '20 - early Feb '21. Porter county shows a large spike in November, followed by alternating increases and decreases until early Feb '21.
- Porter county had the highest average weekly resident death rates, followed by Delaware and Maricopa counties.
- Most counties show a declining trend in death rate for the forecast period.
- The fitted values for all models seem to follow the actual data closely.
- The forecasted values and confidence intervals for all models vary for different counties.
- The increasing widths of the confidence intervals show higher uncertainty associated with each forecast as the forecasting time period increases.
- The confidence interval for the FBF model is the largest, indicating lower confidence in forecast results when compared to the ARIMA or ESF models as the forecasting time period progresses.
We have already plotted the forecasts generated by the different models and visualized how the models perform when compared to the actual data. Now, let's map the forecast errors generated by the Exponential Smoothing, Forest-based, and ARIMA models. In this section, we will:
- subset the data to keep only the error columns,
- publish the errors to a feature layer, and
- map the forecast errors generated by each model.
Subset `es_fb_ar_df_new` to keep only the error data.
# Subset to get a single record for each county
err_df = es_fb_ar_df_new.groupby(["county", "state"]).nth(-1)
err_df.reset_index(inplace=True)
# Keep relevant error columns
cols_to_keep = ['county','state','fips','f_rmse_es','f_rmse_fb','f_rmse_ar','v_rmse_es','v_rmse_fb','SHAPE']
errors_df = err_df[cols_to_keep].copy()
# Check shape
errors_df.shape
# Check head
errors_df.head()
We will publish the data to a feature layer for ease of access and use later.
errors_item = errors_df.spatial.to_featurelayer('TS_Errors')
errors_item
Map the errors generated by the different models.
# Search error item and get layer
err_item = gis.content.search('TS_Errors', 'Feature Layer')
err_lyr = err_item[0].layers[0]
err_lyr
# Generate map
map_esf = gis.map('US', zoomlevel=4)
map_esf.add_layer(err_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"f_rmse_es",
"opacity":0.7})
map_esf
The map shows variations in the forecast errors generated by the Exponential Smoothing Forecast model for each county. The variation seems to be high for some counties, with values ranging from 10.6 - 125.
# Take screenshot
map_esf.legend=True
map_esf.take_screenshot(output_in_cell=False)
# Generate map
map_fbf = gis.map('US', zoomlevel=4)
map_fbf.add_layer(err_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"f_rmse_fb",
"opacity":0.7})
map_fbf
The map shows variations in the forecast errors generated by the Forest-based Forecast model for each county. The variation seems to be high for some counties, with values ranging from 6 - 66.
# Take screenshot
map_fbf.legend=True
map_fbf.take_screenshot(output_in_cell=False)
# Generate map
map_arima = gis.map('US', zoomlevel=4)
map_arima.add_layer(err_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"f_rmse_ar",
"opacity":0.7})
map_arima
The map shows variation in the forecast errors generated by the ARIMA model for each county. The variation seems to be high for some counties, with values ranging from 4.2 - 95.
# Take screenshot
map_arima.legend=True
map_arima.take_screenshot(output_in_cell=False)
To identify the counties that fall within a certain error range, we can perform a simple query operation on the errors dataframe.
# Counties where ARIMA model error is greater than 90.
errors_df.query('f_rmse_ar > 90')
To summarize, in this notebook we:
- built ARIMA models to forecast weekly nursing home resident death rates for each US county,
- combined the ARIMA results with the Exponential Smoothing and Forest-based Forecast results from Part-1,
- generated overlay plots to compare the model forecasts with the actual data, and
- mapped the forecast errors generated by each model.
Because the future may not repeat the past for this pandemic, no prediction model is certain. However, any prediction tool with acceptable prediction performance can still be useful for public-health planning, policy decision-making, and facilitating the transition back to normalcy.
Our analysis utilizes advanced data analysis, visualization, and machine learning techniques on a publicly available dataset [6] to show the variation in forecasts generated by some of the more popular time series forecasting algorithms [2][4][5] over different geographical areas.
The analysis can be easily adapted to use other features in the dataset, such as resident and staff cases, COVID-19 testing features, and availability/shortage of personal protective equipment (PPE). The techniques defined in this analysis can also support further data integrations, such as vaccination rates, nursing home health inspections, and quality of care ratings data.
[1] https://people.duke.edu/~rnau/411arim.htm
[2] Sokol, J. (2018). Module 7: Time Series Models [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6501-intro-analytics-modeling
[3] Serban, N. (2018). Unit 2: Auto Regressive and Moving Average (ARMA) Model [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6402-time-series-analysis
[4] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/exponentialsmoothingforecast.htm
[5] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/forestbasedforecast.htm
[6] https://data.cms.gov/stories/s/bkwz-xpvg