Time Series Forecasting: COVID-19 trends in US nursing homes - Part 2/2

  • 👟 Ready To Run!
  • 🚀 Data Exploration and Cleaning
  • 🤖 Machine Learning
  • ⏱️ Time Series
  • ⚕️ Healthcare

In Part-1 of this study, we:

  • Explored the data
    • Created a Space Time Cube to structure the data into a netCDF data format and aggregate resident death rates at the US county geographic level.
    • Clustered the data using Time Series Clustering to better understand spatiotemporal patterns in our data.
  • Generated forecasts for resident deaths per 1000 residents using Exponential Smoothing Forecast and Forest-based Forecast tools.
  • Plotted forecast and validation errors as generated by the two models.

In this notebook, we will:

  • Build ARIMA models to generate forecasts for average weekly resident deaths per 1000 residents in each county.
  • Combine ARIMA results with Exponential Smoothing, Forest-based Forecast models and actual data.
  • Create plots that overlay the fitted values, forecasted values and their confidence intervals, and actual data values for all models.

Each record in our data represents information about an individual nursing home in the United States for each week. For the analysis, we will use data from June 7, 2020 - June 6, 2021.

Introduction

General Imports
In [4]:
# Import Libraries

# Display
from IPython.display import display

# ArcGIS
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap

# Data Exploration
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)

# Plotting Libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
from ipywidgets import HBox, VBox, Layout, Label

# Datetime and os
from datetime import datetime
import time
import os
import itertools

# ARIMA models
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tools.eval_measures import rmse

# Stop warnings
import warnings
warnings.filterwarnings("ignore")
In [5]:
# Create a GIS connection
gis = GIS("home")

ARIMA (Autoregressive Integrated Moving Average) Model

ARIMA (Autoregressive Integrated Moving Average) [1][2] models provide another approach to time series forecasting. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

ARIMA [3] is a combination of 3 models:

  • AR(p) Autoregression - a regression model that utilizes the dependent relationship between a current observation and observations over a previous period. An AR(p) model can be written as: \begin{array}{l} y_{t}=c+\phi_{1} y_{t-1}+\phi_{2} y_{t-2}+\cdots+\phi_{p} y_{t-p}+\varepsilon_{t} \end{array} where $\phi$ is the autoregressive parameter and $\varepsilon_{t}$ is the error term.
  • I(d) Integration - uses differencing of observations (subtracting an observation from the observation at the previous time step) to make the time series stationary. First-order differencing can be written as: \begin{array}{l} y_{t}^{\prime}=y_{t}-y_{t-1} \end{array}
  • MA(q) Moving Average - a model that uses previous errors as predictors in a regression-like model. An MA(q) model can be written as: \begin{array}{l} y_{t}=c+\theta_{1} \varepsilon_{t-1}+\theta_{2} \varepsilon_{t-2}+\cdots+\theta_{q} \varepsilon_{t-q}+\varepsilon_{t} \end{array} where $\theta$ is the moving average parameter and $\varepsilon_{t}$ is the error term.

ARIMA [3] combines the ideas of autoregression and differencing. Instead of using autoregression with the actual observed data, we use autoregression on the differences. An ARIMA(p,d,q) model can be written as: \begin{array}{l} y_{t}^{\prime}=c+\phi_{1} y_{t-1}^{\prime}+\cdots+\phi_{p} y_{t-p}^{\prime}+\theta_{1} \varepsilon_{t-1}+\cdots+\theta_{q} \varepsilon_{t-q}+\varepsilon_{t} \end{array} where $y_{t}^{\prime}$ is the differenced series. The “predictors” on the right-hand side include both lagged values of $y_{t}$ and lagged errors.
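
For example, an ARIMA(1,1,1) model differences the series once and then applies a single autoregressive and a single moving average term: \begin{array}{l} y_{t}^{\prime}=c+\phi_{1} y_{t-1}^{\prime}+\theta_{1} \varepsilon_{t-1}+\varepsilon_{t} \end{array} where $y_{t}^{\prime}=y_{t}-y_{t-1}$. Written in terms of the original series, this is $y_{t}=y_{t-1}+c+\phi_{1}\left(y_{t-1}-y_{t-2}\right)+\theta_{1} \varepsilon_{t-1}+\varepsilon_{t}$.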

In this section, we will:

  • get actual data for all dates and remove records for US counties with no data.
  • build an ARIMA(p,d,q) model for a single county.
  • generate forecasts.
  • combine ARIMA model results for all counties with Exponential Smoothing and Forest-based model results.
  • clean the combined results and publish them as a feature layer.

We will use data from June 7, 2020 - March 14, 2021 to train the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.
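
As a minimal sketch of this split, assuming df is a DataFrame with a weekly DatetimeIndex like the ones we build below (label-based date slicing on a DatetimeIndex is inclusive on both endpoints):

# Sketch only: df stands in for the county-level data built later
train = df[:'2021-03-14']   # June 7, 2020 - March 14, 2021 (training)
test = df['2021-03-21':]    # March 21, 2021 - June 6, 2021 (12 weeks, forecasting)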

Get Resident data - data for all dates

Use the NH_TS_US_Cleaned_0621 feature layer to get nursing home resident data. To learn more about how this service was created, review the following notebook.

In [6]:
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
                                outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
ts_item = gis.content.search(f'NH_TS_US_Cleaned_0621 {group_query}',
                              item_type = "Feature Service",
                              outside_org=True)
ts_item
Out[6]:
[<Item title:"Subset_NH_TS_US_Cleaned_0621" type:Feature Layer Collection owner:portaladmin>,
 <Item title:"NH_TS_US_Cleaned_0621" type:Feature Layer Collection owner:portaladmin>]
In [7]:
# Access layer
ts_layer = ts_item[1].layers[0]
ts_layer
Out[7]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/NH_TS_US_Cleaned_0621/FeatureServer/0">
In [8]:
# Get column names in layer
for f in ts_layer.properties.fields:
    print(f['name'])
objectid
county
provider_state
initial_covid_case
threeormore_cases_this_week
oneweek_ppe_shortage
total_staff_shortage
shortage_of_nursing_staff
oneweek_shortage_of_n95_masks
staff_total_confirmed_covid19
staff_total_covid19_deaths
residents_total_admissions_covi
weekly_resident_confirmed_covid
weekly_resident_covid19_deaths_
total_number_of_occupied_beds
week_ending
In [9]:
# Create dataframe from layer
testdf = ts_layer.query(as_df=True, out_fields=['county','provider_state',
                                                'week_ending','weekly_resident_covid19_deaths_'])
testdf.shape
Out[9]:
(769541, 6)
In [10]:
# Check head of data
testdf.head()
Out[10]:
SHAPE county objectid provider_state week_ending weekly_resident_covid19_deaths_
0 {"x": -9141988.7262, "y": 3396571.8598999977, ... marion 1 FL 2020-07-12 0.0
1 {"x": -9093028.0764, "y": 3501992.7968000025, ... clay 2 FL 2020-06-07 0.0
2 {"x": -8921663.4089, "y": 3000273.0898, "spati... broward 3 FL 2021-01-03 0.0
3 {"x": -9794605.9202, "y": 5185103.478500001, "... lake 4 IL 2020-10-04 0.0
4 {"x": -10041672.7395, "y": 4858313.148800001, ... cass 5 IL 2020-06-28 0.0

Group Data

We will group the data by county and state to get aggregate counts of resident deaths for each county, state pair for each time period.

  • aggregation by mean allows us to maintain the ratio of deaths per 1000 residents, such that the numbers represent the average weekly resident deaths per 1000 residents.
In [11]:
# Group data by County and State
grouped_df = testdf.groupby(['week_ending','provider_state','county']) \
                    .aggregate('mean')[['weekly_resident_covid19_deaths_']].copy()
grouped_df.reset_index(inplace=True)
grouped_df.set_index('week_ending', inplace=True)
grouped_df.head()
Out[11]:
provider_state county weekly_resident_covid19_deaths_
week_ending
2020-06-07 AK anchorage 0.0
2020-06-07 AK cordova-mccarthy 0.0
2020-06-07 AK fairbanks 0.0
2020-06-07 AK juneau 0.0
2020-06-07 AK kenai-cook inlet 0.0
In [12]:
# Check shape
grouped_df.shape
Out[12]:
(149267, 3)

Remove counties with no data

There are many counties for which data for the training time period, June 7, 2020 - March 14, 2021, is missing. As such, we will remove these counties from the dataset.

The counties are identified by:

  • Grouping the data by state and county.
  • Applying a filter where the resident death rate for each county/state pair, summed over the training time period (June 7, 2020 - March 14, 2021), is 0. This includes records with NaN values.
Identify counties with missing data
In [13]:
# Subset missing data
miss_vals = grouped_df.groupby(['county', 'provider_state']) \
                        .filter(lambda x:x['weekly_resident_covid19_deaths_'][:'2021-03-14'] \
                        .sum()==0)
miss_vals.shape
Out[13]:
(13933, 3)
In [14]:
# Identify county, state pairs for which data is missing
cnty_ls = [(county, state) for county, state in zip(miss_vals['county'], miss_vals['provider_state'])]
cnty_ls_new = list(dict.fromkeys(cnty_ls))
In [15]:
# Display count and names of some counties with missing data
display(len(cnty_ls_new))
display(cnty_ls_new[:5])
266
[('fairbanks', 'AK'),
 ('juneau', 'AK'),
 ('ketchikan', 'AK'),
 ('kodiak island borough', 'AK'),
 ('not available', 'AK')]

There are 266 counties where resident death rate data for the time period June 7, 2020 - March 14, 2021 is missing. We will remove the records for these counties.

Subset for counties that have data
In [16]:
# Subset counties with data
grouped_df2 = grouped_df.groupby(['county', 'provider_state']) \
                        .filter(lambda x:x['weekly_resident_covid19_deaths_'][:'2021-03-14'].sum()!=0).copy()
grouped_df2.shape
Out[16]:
(135334, 3)
In [17]:
# Identify county, state pairs for which data is present
grp_ls = [(county, state) for county, state in zip(grouped_df2['county'], grouped_df2['provider_state'])]
grp_ls_new = list(dict.fromkeys(grp_ls))

# Display count and names of some counties with data
display(len(grp_ls_new))
display(grp_ls_new[:5])
2554
[('anchorage', 'AK'),
 ('cordova-mccarthy', 'AK'),
 ('kenai-cook inlet', 'AK'),
 ('autauga', 'AL'),
 ('baldwin', 'AL')]

There are 2554 counties where resident death rate data for the time period June 7, 2020 - March 14, 2021 exists.

In [18]:
# Check to confirm that data for missing county is not present
grouped_df2.query('(county in ["Benson"]) and (provider_state in ["ND"])')
Out[18]:
provider_state county weekly_resident_covid19_deaths_
week_ending
In [19]:
# Check head of data
grouped_df2.head()
Out[19]:
provider_state county weekly_resident_covid19_deaths_
week_ending
2020-06-07 AK anchorage 0.0
2020-06-07 AK cordova-mccarthy 0.0
2020-06-07 AK kenai-cook inlet 0.0
2020-06-07 AL autauga 0.0
2020-06-07 AL baldwin 0.0

Build a Model

Here, we will build an ARIMA model. The (p,d,q) order of an ARIMA(p,d,q) model can be different for each county. We will:

  • subset the data for a single county.
  • use the auto_arima tool from pmdarima to determine the optimal (p,d,q) order for an ARIMA model.
  • build an ARIMA model for the optimized order using statsmodels.

Later, we will define helper functions that will iterate over all counties to build ARIMA(p,d,q) models for each county and gather results. We will use data from June 7, 2020 - March 14, 2021 to train the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.

Subset Data for a County

We will subset and explore the data for San Bernardino County, CA.

In [20]:
# Subset grouped data for SB County and specific feature (resident deaths per 1000 residents)
ar_sb = grouped_df2.query('(county == "san bernardino") and (provider_state == "CA")')[['weekly_resident_covid19_deaths_']].copy()
ar_sb_train = ar_sb[:'2021-03-14']  # remaining time periods are used for forecasting
ar_sb_test = ar_sb['2021-03-21':]  # used for performance evaluation
Set Frequency

By default, a DatetimeIndex does not have a frequency set. An ARIMA model requires the data to have a specific frequency, such as daily, weekly, or monthly. Since our observations are weekly, with each week_ending date falling on a Sunday, we'll use the W frequency. A full list of time series offset aliases can be found here.
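
As a side note, 'W' is shorthand for 'W-SUN' (weekly periods anchored on Sundays), which is why the index below reports freq='W-SUN'. A quick check with pandas alone:

import pandas as pd

# 'W' resolves to weekly periods ending on Sundays
pd.date_range('2020-06-07', periods=3, freq='W')
# DatetimeIndex(['2020-06-07', '2020-06-14', '2020-06-21'], dtype='datetime64[ns]', freq='W-SUN')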

In [21]:
# Check frequency
ar_sb_train.index
Out[21]:
DatetimeIndex(['2020-06-07', '2020-06-14', '2020-06-21', '2020-06-28',
               '2020-07-05', '2020-07-12', '2020-07-19', '2020-07-26',
               '2020-08-02', '2020-08-09', '2020-08-16', '2020-08-23',
               '2020-08-30', '2020-09-06', '2020-09-13', '2020-09-20',
               '2020-09-27', '2020-10-04', '2020-10-11', '2020-10-18',
               '2020-10-25', '2020-11-01', '2020-11-08', '2020-11-15',
               '2020-11-22', '2020-11-29', '2020-12-06', '2020-12-13',
               '2020-12-20', '2020-12-27', '2021-01-03', '2021-01-10',
               '2021-01-17', '2021-01-24', '2021-01-31', '2021-02-07',
               '2021-02-14', '2021-02-21', '2021-02-28', '2021-03-07',
               '2021-03-14'],
              dtype='datetime64[ns]', name='week_ending', freq=None)
In [22]:
# Set Frequency for data
ar_sb_train.index.freq = 'W'
ar_sb_test.index.freq = 'W'
ar_sb_train.index
Out[22]:
DatetimeIndex(['2020-06-07', '2020-06-14', '2020-06-21', '2020-06-28',
               '2020-07-05', '2020-07-12', '2020-07-19', '2020-07-26',
               '2020-08-02', '2020-08-09', '2020-08-16', '2020-08-23',
               '2020-08-30', '2020-09-06', '2020-09-13', '2020-09-20',
               '2020-09-27', '2020-10-04', '2020-10-11', '2020-10-18',
               '2020-10-25', '2020-11-01', '2020-11-08', '2020-11-15',
               '2020-11-22', '2020-11-29', '2020-12-06', '2020-12-13',
               '2020-12-20', '2020-12-27', '2021-01-03', '2021-01-10',
               '2021-01-17', '2021-01-24', '2021-01-31', '2021-02-07',
               '2021-02-14', '2021-02-21', '2021-02-28', '2021-03-07',
               '2021-03-14'],
              dtype='datetime64[ns]', name='week_ending', freq='W-SUN')
In [23]:
# Check shape
ar_sb_train.shape
Out[23]:
(41, 1)

Find the Best Model

Use auto_arima from pmdarima to find the optimal model order (p,d,q).

In [26]:
# Install pmdarima
import sys
!{sys.executable} -m pip install pmdarima
Collecting pmdarima
  Using cached pmdarima-1.8.2-cp37-cp37m-win_amd64.whl (591 kB)
Requirement already satisfied: setuptools!=50.0.0,>=38.6.0 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pmdarima) (52.0.0.post20210125)
Requirement already satisfied: Cython!=0.29.18,>=0.29 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pmdarima) (0.29.23)
Requirement already satisfied: scikit-learn>=0.22 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pmdarima) (0.23.2)
Requirement already satisfied: urllib3 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pmdarima) (1.26.4)
Requirement already satisfied: scipy>=1.3.2 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pmdarima) (1.6.2)
Requirement already satisfied: numpy~=1.19.0 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pmdarima) (1.19.5)
Requirement already satisfied: pandas>=0.19 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pmdarima) (1.2.3)
Requirement already satisfied: joblib>=0.11 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pmdarima) (1.0.1)
Requirement already satisfied: statsmodels!=0.12.0,>=0.11 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pmdarima) (0.12.2)
Requirement already satisfied: pytz>=2017.3 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages\pytz-2020.1-py3.7.egg (from pandas>=0.19->pmdarima) (2020.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from pandas>=0.19->pmdarima) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from python-dateutil>=2.7.3->pandas>=0.19->pmdarima) (1.15.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from scikit-learn>=0.22->pmdarima) (2.1.0)
Requirement already satisfied: patsy>=0.5 in c:\users\mohi9282\appdata\local\esri\conda\envs\arcgispro282\lib\site-packages (from statsmodels!=0.12.0,>=0.11->pmdarima) (0.5.1)
Installing collected packages: pmdarima
Successfully installed pmdarima-1.8.2
In [27]:
# Import auto_arima
from pmdarima.arima import auto_arima

ar_model = auto_arima(ar_sb_train, suppress_warnings=True, 
                      error_action='warn', trace=True,
                      stepwise=True)
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(0,0,0)[0] intercept   : AIC=inf, Time=0.21 sec
 ARIMA(0,0,0)(0,0,0)[0] intercept   : AIC=183.333, Time=0.01 sec
 ARIMA(1,0,0)(0,0,0)[0] intercept   : AIC=159.064, Time=0.02 sec
 ARIMA(0,0,1)(0,0,0)[0] intercept   : AIC=172.506, Time=0.02 sec
 ARIMA(0,0,0)(0,0,0)[0]             : AIC=212.821, Time=0.00 sec
 ARIMA(2,0,0)(0,0,0)[0] intercept   : AIC=157.912, Time=0.03 sec
 ARIMA(3,0,0)(0,0,0)[0] intercept   : AIC=153.517, Time=0.04 sec
 ARIMA(4,0,0)(0,0,0)[0] intercept   : AIC=153.709, Time=0.05 sec
 ARIMA(3,0,1)(0,0,0)[0] intercept   : AIC=inf, Time=0.20 sec
 ARIMA(2,0,1)(0,0,0)[0] intercept   : AIC=157.542, Time=0.05 sec
 ARIMA(4,0,1)(0,0,0)[0] intercept   : AIC=155.672, Time=0.05 sec
 ARIMA(3,0,0)(0,0,0)[0]             : AIC=156.490, Time=0.02 sec

Best model:  ARIMA(3,0,0)(0,0,0)[0] intercept
Total fit time: 0.708 seconds

The optimal order generated by auto_arima is ARIMA(3,0,0).
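
The selected (p,d,q) order can also be read back programmatically from the fitted auto_arima object; the helper functions later in this notebook rely on the same pattern:

# Retrieve the (p,d,q) order selected by auto_arima
ar_model.get_params()['order']   # (3, 0, 0)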

Fit the Best Model

Use the optimal order to fit an ARIMA model.

In [28]:
# Build model using ARIMA from statsmodels library
ar_model = ARIMA(ar_sb_train.iloc[:,0], order=(3,0,0))
ar_results = ar_model.fit()
ar_results.summary()
Out[28]:
SARIMAX Results
Dep. Variable: weekly_resident_covid19_deaths_ No. Observations: 41
Model: ARIMA(3, 0, 0) Log Likelihood -71.759
Date: Wed, 08 Sep 2021 AIC 153.517
Time: 13:05:58 BIC 162.085
Sample: 06-07-2020 HQIC 156.637
- 03-14-2021
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
const 2.3329 0.886 2.632 0.008 0.595 4.070
ar.L1 0.6109 0.120 5.072 0.000 0.375 0.847
ar.L2 0.4529 0.169 2.677 0.007 0.121 0.785
ar.L3 -0.3836 0.187 -2.047 0.041 -0.751 -0.016
sigma2 1.8811 0.422 4.459 0.000 1.054 2.708
Ljung-Box (L1) (Q): 0.24 Jarque-Bera (JB): 2.62
Prob(Q): 0.62 Prob(JB): 0.27
Heteroskedasticity (H): 0.96 Skew: 0.58
Prob(H) (two-sided): 0.94 Kurtosis: 3.43


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [29]:
# Get ARIMA fitted values
arima_fitted = ar_results.fittedvalues

Plot Forecasts

When forecasting time series data, the idea is to estimate how the sequence of observations will continue into the future. Using the model we just created, we will generate a forecast for 12 time periods, i.e., March 21, 2021 - June 6, 2021.

In [30]:
# Generate forecast and confidence interval for 12 weeks
forecast = ar_results.get_forecast(12)
arima_pred = forecast.predicted_mean
arima_conf_int = forecast.conf_int(alpha=0.05)

Create Plot

In [31]:
# Plot ARIMA model fitted, forecasted values and CI
ax = arima_fitted.plot(legend=True, label='ARIMA Model: Fitted and Forecasted Values',
                        marker='o', markersize=5,
                        linewidth=1.8,color='blue',figsize=(18,12))

ax = arima_pred.plot(label='_nolegend_', marker='o', markersize=5,
                       linewidth=1.8,color='blue')

plt.fill_between(arima_conf_int.index,
                 arima_conf_int.iloc[:,0],
                 arima_conf_int.iloc[:,1],
                 color='blue', alpha=0.1, label='ARIMA Model-Confidence Interval')

# Plot actual values
ax = ar_sb_train.iloc[:,0].plot(legend=True, label='Actual Deaths',
                marker='x', markersize=5,
                linewidth=1.8, color='black', alpha=0.8)

# Plot line showing Start of Forecast
plt.vlines(arima_pred.index[-12],
           ymin=arima_conf_int.iloc[:,0].min(),
           ymax=arima_conf_int.iloc[:,1].max(),
            color='grey', linewidth=1, linestyle="--",
           label='Forecast results start here')

# Plot label and title
ax.autoscale(axis='x')
ax.set(ylabel='Resident death rate', xlabel='', 
       title='ARIMA Model: Resident death rates - San Bernardino, CA')
ax.title.set_fontsize(15)
plt.legend(loc='upper left', prop={'size': 12})
plt.show()

The plot shows:

  • the fitted values generated by the ARIMA(3,0,0) model for June 7, 2020 - March 14, 2021.
  • the forecasted resident death rates generated by the ARIMA(3,0,0) model for March 21, 2021 - June 6, 2021.
  • the actual resident death rates for June 7, 2020 - March 14, 2021.
  • the confidence interval of the forecast for March 21, 2021 - June 6, 2021.

Observations:

  • The fitted values seem to generalize the actual death rates well.
  • The forecast shows an increase in trend for the county.
  • The increasing width of the confidence interval shows the higher uncertainty associated with each forecast as the forecasting time period increases; see the note below.
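
This widening is expected: for an ARIMA model, the variance of the $h$-step-ahead forecast error accumulates with the horizon. In the simplest case of a random walk, ARIMA(0,1,0), the $h$-step forecast variance is $h\sigma^{2}$, so the width of a confidence interval grows in proportion to $\sqrt{h}$.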

As mentioned earlier, the (p,d,q) order of an ARIMA model can be different for each county. We have just seen how an ARIMA(p,d,q) model can be created and used to forecast for a single county. In the next section, we will define helper functions that iterate over all counties to get ARIMA model results and errors for each county.

Combine ARIMA Model Results

Let's build ARIMA models for each county and then combine the results with Exponential Smoothing and Forest-based Forecast models.

Here, we will:

  • get Exponential Smoothing and Forest-based forecast model results.
  • define helper functions to iterate over all counties to get ARIMA model results and errors for each county.
  • combine the results generated using ARIMA models with the results generated from the Exponential Smoothing and Forest-based forecast models.

The combined results will be used next to generate plots overlaying various models.

NOTE: the analysis performed in this section (2.4) and the next section (2.5) is computationally expensive and may take ~ 40 minutes to run. A feature service with the results of the analysis has been provided in section 2.5.3. If needed, you may skip the two sections and jump straight to the feature service.

Get ESF and FBF combined data

Use the ES_FB_BigLayer feature layer to get Exponential Smoothing and Forest-based forecast model results. This layer was created in the notebook for part-1 of this analysis.

In [34]:
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
                                outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
es_fb_item = gis.content.search(f'ES_FB_BigLayer {group_query}',
                                item_type = "Feature Service",
                                outside_org=True)
es_fb_item
Out[34]:
[<Item title:"ES_FB_BigLayer" type:Feature Layer Collection owner:portaladmin>,
 <Item title:"ES_FB_AR_BigLayer" type:Feature Layer Collection owner:portaladmin>]
In [35]:
# Access layer
es_fb_layer = es_fb_item[0].layers[0]
es_fb_layer
Out[35]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/ES_FB_BigLayer/FeatureServer/0">
In [36]:
# Create dataframe from layer
es_fb_df = es_fb_layer.query(as_df=True)
In [37]:
# Check data
display(es_fb_df.shape)
display(es_fb_df.columns)
(156509, 29)
Index(['SHAPE', 'SHAPE__Area', 'SHAPE__Length', 'county', 'end_date',
       'f_rmse_es', 'f_rmse_fb', 'fips', 'fitted_es', 'fitted_fb', 'high_es',
       'high_fb', 'level_es', 'leveltrend_es', 'low_es', 'low_fb', 'objectid',
       'residual_es', 'residual_fb', 'season_es', 'shape_area_es',
       'shape_area_fb', 'shape_length_es', 'shape_length_fb', 'start_date',
       'state', 'trend_es', 'v_rmse_es', 'v_rmse_fb'],
      dtype='object')

Helper Functions

Let's define the helper functions. The functions in this section are designed to:

  • iterate over all counties.
  • build ARIMA models and compute model errors for each county.
  • combine the results in a dataframe.
  • return results, a list of counties for which data is unavailable, and any errors returned if the model did not run for a particular county.
In [38]:
def arima_df(ar_best_fitted, ar_best_pred, ar_best_conf_int, ordr, test_data, county, state):
    """
    Creates dataframe of ARIMA model results along with model error.
    
    :param ar_best_fitted: array, fitted values from ARIMA model
    :param ar_best_pred: array, forecasted values from ARIMA model
    :param ar_best_conf_int: dataframe, high and low confidence interval values from ARIMA model 
    :param ordr: tuple, model (p,d,q) order
    :param test_data: dataframe, test data
    :param county: string, name of the county in lower case
    :param state: string, state abbreviation in upper case 
    """
    # dataframe for fitted
    fit_df = pd.DataFrame(ar_best_fitted, columns=['fitted_ar']).reset_index()
    
    # dataframe for forecast
    fcst_df = pd.DataFrame(ar_best_pred).reset_index()

    cols_to_rename = {'index':'week_ending', 'predicted_mean':'fitted_ar'}
    fcst_df.rename(columns=cols_to_rename, inplace=True)
    
    # Concat fitted and fcst df
    ar_df = pd.concat([fit_df, fcst_df])

    # Process confidence interval
    ar_best_conf_int.reset_index(inplace=True)
    cols_to_rename = {'index':'week_ending', 
                      'lower weekly_resident_covid19_deaths_':'low_ar',
                      'upper weekly_resident_covid19_deaths_':'high_ar'}
    ar_best_conf_int.rename(columns=cols_to_rename, inplace=True)

    # Merge conf int with ar_df
    ar_df = ar_df.merge(ar_best_conf_int, on='week_ending', how='left')

    # Create df for raw and fitted values for test data
    test_data.reset_index(inplace=True)
    test_data = test_data.merge(fcst_df)
    cols_to_rename = {'weekly_resident_covid19_deaths_':'raw_value'}
    test_data.rename(columns=cols_to_rename, inplace=True)

    # Calculate error
    error = rmse(test_data['raw_value'], test_data['fitted_ar'])

    # Create column for error
    dt1 = datetime.strptime('2021-03-21', '%Y-%m-%d')
    ar_df['f_rmse_ar'] = ar_df['week_ending'].apply(lambda x: 0 if x < dt1 else error)

    # Replace NA with 0
    ar_df = ar_df.fillna(0)

    # Rename week ending col
    ar_df.rename(columns={'week_ending':'end_date'}, inplace=True)

    # Create county,state and model order columns
    ar_df['county'] = county.title()
    ar_df['state'] = state
    ar_df['ar_order'] = str(ordr)

    return ar_df
In [39]:
def arima_model_results(grouped_df, county, state, forecast_period):
    """
    Builds ARIMA model and returns a dataframe of the results.
    
    :param grouped_df: dataframe containing aggregated values of resident deaths per county per time period
    :param county: string, name of the county in lower case
    :param state: string, state abbreviation in upper case 
    :param forecast_period: int, number of periods to forecast ARIMA models
    """
    cnty = county
    
    # Subset data for county, state
    ar_cnty = grouped_df.query('(county == @cnty) and (provider_state == @state)')[['weekly_resident_covid19_deaths_']].copy()
    
    # Split train, test data
    ar_cnty_train = ar_cnty[:'2021-03-14']
    ar_cnty_test = ar_cnty['2021-03-21':] 
    
    # Set Frequency for data
    ar_cnty_train.index.freq = 'W'

    # Find Best Model and get its (p,d,q) order
    ar_mod = auto_arima(ar_cnty_train, suppress_warnings=True, 
                          error_action='warn', trace=False,
                          stepwise=True)

    ordr = ar_mod.get_params()['order']
    p,d,q = ordr

    # Fit the best model
    ar_best_mod = ARIMA(ar_cnty_train, order=(p,d,q))
    ar_best_results = ar_best_mod.fit()

    # Get fitted values
    ar_best_fitted = ar_best_results.fittedvalues

    # Forecast for 12 weeks and get predictions and interval
    ar_forecast = ar_best_results.get_forecast(forecast_period)
    ar_best_pred = ar_forecast.predicted_mean    # predicted values
    ar_best_conf_int = ar_forecast.conf_int(alpha=0.05)    # interval

    # Create dataframe of fitted, forecasted and RMSE using `arima_df` function
    ar_df = arima_df(ar_best_fitted, ar_best_pred, ar_best_conf_int, ordr, ar_cnty_test, cnty, state)
    
    return ar_df
In [40]:
def combine_results(main_df, grouped_df, fcast=12):
    """
    Builds an ARIMA model for each county present in main_df and combines the
    per-county results into a single dataframe.
    
    :param main_df: dataframe containing results from Exponential Smoothing and Forest-based models
    :param grouped_df: dataframe containing aggregated values of resident deaths per county per time period
    :param fcast: int, number of periods to forecast ARIMA models
    """
    # Create a list of unique (county, state) tuples from main df
    ls = [(county, state) for county, state in zip(main_df['county'], main_df['state'])]
    new_ls = list(dict.fromkeys(ls))
    
    # Create a list of unique (county, state) tuples from grouped df
    # Change first letter of county to upper case
    grpd_ls = [(county.title(), state) for county, state in zip(grouped_df['county'], grouped_df['provider_state'])]
    grpd_new_l = list(dict.fromkeys(grpd_ls))
    cnty_len = len(grpd_new_l)
    
    new_df = pd.DataFrame()  # empty df to append arima results
    
    not_present = []  # stores county,state for which data is not in grouped df (actual data not available) 
    errors = []   # stores county, state for which ARIMA model errored out
    
    """
    The code below:
    1. Iterates over (county, state) values from main_df
    2. Checks if (county, state) values are in grouped_df
        a. If values are present, builds ARIMA model for the (county,state) using `arima_model_results` function
        b. If not, append (county,state) to the not_present list
    3. Combines ARIMA results for all counties
    """
    count = 0
    for i in new_ls:
        if i in grpd_new_l:
            county = i[0].lower() # convert county to lower case
            state = i[1]
            try:
                ar_res = arima_model_results(grouped_df, county, state, fcast) # Run ARIMA model
                new_df = new_df.append(ar_res)  # append to dataframe
                count+= 1
                if count % 50 == 0:
                    print(f'Processed {count} of {cnty_len} counties...')
#                     time.sleep(5)
            except:
                print('Error at ', i)
                errors.append(i)
                continue
            
        else:
            not_present.append(i)
            continue

    return new_df, not_present, errors

Combine ARIMA Results

NOTE: The cell below takes ~30 minutes to run, as it executes the combine_results helper function defined in the section above.
In [41]:
%%time
# Run to get ARIMA results combined with ESF and FBF results
comb_df, not_present, errors = combine_results(es_fb_df, grouped_df2, fcast=12)
Processed 50 of 2554 counties...
Processed 100 of 2554 counties...
...
Processed 2400 of 2554 counties...
Processed 2450 of 2554 counties...
Wall time: 22min 54s
In [42]:
# Check details of results
print(f'Shape of dataframe: {comb_df.shape}')
print(f'Number of counties for which data is not present: {len(not_present)}')
print(f'Counties for which ARIMA model errored out: {errors}')
Shape of dataframe: (130534, 8)
Number of counties for which data is not present: 490
Counties for which ARIMA model errored out: []
In [43]:
# Combine ARIMA results with ESF and FBF results
comb_df2 = pd.merge(es_fb_df, comb_df, on=['county','state','end_date'], how='left')
comb_df2.shape
Out[43]:
(156509, 34)

The data in es_fb_df includes records for counties with no actual data. Because we left-join the ARIMA results onto this dataframe, all of its 156,509 records are retained, including those for counties with no ARIMA results, which receive NaN values. We will address this in the next section by removing records for counties where no data is present.
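
As a minimal illustration of this behavior, using toy data rather than the analysis dataframes, a left merge keeps every row of the left frame and fills NaN where the right frame has no match:

import pandas as pd

# Toy frames: 'Benson' has no ARIMA result on the right
left = pd.DataFrame({'county': ['Stevens', 'Benson'], 'state': ['WA', 'ND']})
right = pd.DataFrame({'county': ['Stevens'], 'state': ['WA'], 'fitted_ar': [1.26]})

left.merge(right, on=['county', 'state'], how='left')
#     county state  fitted_ar
# 0  Stevens    WA       1.26
# 1   Benson    ND        NaN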

In [44]:
# Check head
comb_df2.head()
Out[44]:
SHAPE SHAPE__Area SHAPE__Length county end_date f_rmse_es f_rmse_fb fips fitted_es fitted_fb high_es high_fb level_es leveltrend_es low_es low_fb objectid residual_es residual_fb season_es shape_area_es shape_area_fb shape_length_es shape_length_fb start_date state trend_es v_rmse_es v_rmse_fb fitted_ar low_ar high_ar f_rmse_ar ar_order
0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-06-07 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-05-31 00:00:01 WA 0.0 2.0375 1.757818 1.258745 0.0 0.0 0.0 (0, 0, 1)
1 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-06-14 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-06-07 00:00:01 WA 0.0 2.0375 1.757818 0.793563 0.0 0.0 0.0 (0, 0, 1)
2 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-06-21 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-06-14 00:00:01 WA 0.0 2.0375 1.757818 0.919087 0.0 0.0 0.0 (0, 0, 1)
3 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-06-28 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-06-21 00:00:01 WA 0.0 2.0375 1.757818 0.855265 0.0 0.0 0.0 (0, 0, 1)
4 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-07-05 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-06-28 00:00:01 WA 0.0 2.0375 1.757818 0.881464 0.0 0.0 0.0 (0, 0, 1)

More Cleanup

In this section, we will do some more clean-up of the resulting dataframe. We will:

  • merge actual resident death rate data from grouped_df2 with the results dataframe comb_df2.
  • remove any counties for which the data was either missing or could not be aggregated by the Space Time Cube.
  • publish results to a feature layer.

Add Actual Deaths

Add actual death rates from grouped_df2 to comb_df2.

In [45]:
# Create a temporary df to rearrange grouped df data 
temp_df = grouped_df2.copy()
temp_df.reset_index(inplace=True)
cols_to_rename = {'provider_state':'state',
                 'weekly_resident_covid19_deaths_':'agg_res_deaths_per1000_res',
                 'week_ending':'end_date'}
temp_df.rename(columns=cols_to_rename, inplace=True)
temp_df['county'] = [county.title() for county in temp_df['county']] # first letter capital
display(temp_df.shape)
display(temp_df.head())
(135334, 4)
end_date state county agg_res_deaths_per1000_res
0 2020-06-07 AK Anchorage 0.0
1 2020-06-07 AK Cordova-Mccarthy 0.0
2 2020-06-07 AK Kenai-Cook Inlet 0.0
3 2020-06-07 AL Autauga 0.0
4 2020-06-07 AL Baldwin 0.0
In [46]:
# Combine temp df with arima results
es_fb_ar_df = pd.merge(comb_df2, temp_df, on=['county','state','end_date'], how='left')
es_fb_ar_df.shape
Out[46]:
(156509, 35)
In [47]:
# Check head
es_fb_ar_df.head()
Out[47]:
SHAPE SHAPE__Area SHAPE__Length county end_date f_rmse_es f_rmse_fb fips fitted_es fitted_fb high_es high_fb level_es leveltrend_es low_es low_fb objectid residual_es residual_fb season_es shape_area_es shape_area_fb shape_length_es shape_length_fb start_date state trend_es v_rmse_es v_rmse_fb fitted_ar low_ar high_ar f_rmse_ar ar_order agg_res_deaths_per1000_res
0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-06-07 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-05-31 00:00:01 WA 0.0 2.0375 1.757818 1.258745 0.0 0.0 0.0 (0, 0, 1) 0.0
1 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-06-14 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-06-07 00:00:01 WA 0.0 2.0375 1.757818 0.793563 0.0 0.0 0.0 (0, 0, 1) 0.0
2 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-06-21 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-06-14 00:00:01 WA 0.0 2.0375 1.757818 0.919087 0.0 0.0 0.0 (0, 0, 1) 0.0
3 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-06-28 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-06-21 00:00:01 WA 0.0 2.0375 1.757818 0.855265 0.0 0.0 0.0 (0, 0, 1) 0.0
4 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2020-07-05 6.126846 3.597878 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5 0.0 0.0 0.0 1.484728e+10 1.484728e+10 684811.980009 684811.980009 2020-06-28 00:00:01 WA 0.0 2.0375 1.757818 0.881464 0.0 0.0 0.0 (0, 0, 1) 0.0
In [48]:
# No. of counties for which data is not present
len(not_present)
Out[48]:
490

Remove Counties with no data

There are many counties for which the data was either missing or could not be aggregated by the Space Time Cube. As such, we will remove these counties from the dataset.

The counties are identified by:

  • grouping the data by state and county
  • applying a filter where the aggregated death rate for each state/county pair sums to 0 across all time periods. This includes records with NaN values.
Identify counties with missing data
In [49]:
# Group records for counties with no data
z_vals = es_fb_ar_df.groupby(['county', 'state']).filter(lambda x:x['agg_res_deaths_per1000_res'].sum()==0)
z_vals.shape
Out[49]:
(25970, 35)
In [50]:
# Create a list of (county, state) pairs with no data
cnty_ls = [(county, state) for county, state in zip(z_vals['county'], z_vals['state'])]
cnty_ls_new = list(dict.fromkeys(cnty_ls))
In [51]:
# Check details
display(len(cnty_ls_new))
display(cnty_ls_new[:5])
490
[('Boundary', 'ID'),
 ('Glacier', 'MT'),
 ('Divide', 'ND'),
 ('Burke', 'ND'),
 ('Cavalier', 'ND')]

There are 490 counties where actual resident death rate data is missing. We will remove the records for these counties.

Subset for counties that have data
In [52]:
# Subset data
es_fb_ar_df_new = es_fb_ar_df.groupby(['county', 'state']).filter(lambda x:x['agg_res_deaths_per1000_res'].sum()!=0).copy()
es_fb_ar_df_new.shape
Out[52]:
(130539, 35)
In [53]:
# Check to confirm that data for missing county is not present
es_fb_ar_df_new.query('(county in ["Benson"]) and (state in ["ND"])')
Out[53]:
SHAPE SHAPE__Area SHAPE__Length county end_date f_rmse_es f_rmse_fb fips fitted_es fitted_fb high_es high_fb level_es leveltrend_es low_es low_fb objectid residual_es residual_fb season_es shape_area_es shape_area_fb shape_length_es shape_length_fb start_date state trend_es v_rmse_es v_rmse_fb fitted_ar low_ar high_ar f_rmse_ar ar_order agg_res_deaths_per1000_res

Publish to Feature Layer

We will publish the combined model results to a feature layer for ease of access and use.

In [54]:
es_fb_ar_item = es_fb_ar_df_new.spatial.to_featurelayer('ES_FB_AR_BigLayer')
es_fb_ar_item
Out[54]:
ES_FB_AR_BigLayer
Feature Layer Collection by portaladmin
Last Modified: September 08, 2021
0 comments, 0 views
Uncomment and run the cells below to directly access the results from the published ES_FB_AR_BigLayer layer.
In [55]:
# # Get item
# data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
#                                 outside_org=True)
# group_query = f"group: {data_groups[0].id}" if data_groups else ""
# es_fb_ar_item = gis.content.search(f'ES_FB_AR_BigLayer {group_query}',
#                                   item_type = "Feature Service",
#                                   outside_org=True)[0]
# es_fb_ar_item
In [56]:
# # Get layer
# es_fb_ar_lyr = es_fb_ar_item.layers[0]

# # Query as dataframe and check shape
# es_fb_ar_df_new = es_fb_ar_lyr.query(as_df=True)
# es_fb_ar_df_new.shape

Overlay Plot

So far, we have generated forecast plots for the Exponential Smoothing, Forest-based, and ARIMA models individually. We have not, however, seen the forecasts generated by these models side by side to visualize how each model performs with respect to the actual resident deaths data. The idea behind the overlay plot is to visualize the fitted values, the forecasted values and their confidence intervals, and the actual data for all models on a single plot for each US county.

In this section, we will:

  • define a function that creates a plot by overlaying fitted, forecasted, and confidence interval values generated by the different models. It will also plot actual values for the time period.
  • generate plots.

Define plot function

We will start by defining a function, overlay_plot, which will subset the data for each county and plot the data. This function can then be called using the county FIPS code for one or more counties to generate plots.

In [57]:
# Define function
def overlay_plot(main_df, fips: list, ylabel: str, title: str, fcast_date = '2021-03-21'):
    
    # Create grid
    if len(fips) > 4:
        fig = plt.figure(figsize=(20, 25))
    else:
        fig = plt.figure(figsize=(20, 15))
    for sp in range(0, len(fips)):
        if len(fips) == 1:
            ax = fig.add_subplot(1, 1, 1)
        elif len(fips) % 2 == 0:
            ax = fig.add_subplot(len(fips)//2, 2, sp+1)   # integer division: add_subplot needs int rows
        else:
            ax = fig.add_subplot(len(fips)//2 + 1, 2, sp+1)
        
        # Subset data for county (copy to avoid SettingWithCopyWarning)
        cnty_df = main_df[main_df['fips']==fips[sp]].copy()
        cnty_df.set_index('end_date', inplace=True)

        ### Plot Random Forest Model #### 
        
        # Plot fitted values including forecast  
        ax = cnty_df['fitted_fb'].plot(legend=True, label='RF Model-Fitted Values',marker='o', 
                                    markersize=5, linewidth=1.8, color='orange')

        # Plot confidence interval
        plt.fill_between(cnty_df[-12:].index,
                         cnty_df.loc[fcast_date:,'low_fb'],
                         cnty_df.loc[fcast_date:,'high_fb'],
                         color='orange', alpha=0.1, label='RF Model-Confidence Interval')
        
        
        ### Plot Exponential Smoothing model ####

        # Plot fitted values including forecast 
        ax = cnty_df['fitted_es'].plot(legend=True, label='ES Model-Fitted Values',marker='o', 
                                       markersize=5,linewidth=1.8,color='green')

        # Plot confidence interval
        plt.fill_between(cnty_df[-12:].index,
                         cnty_df.loc[fcast_date:,'low_es'],
                         cnty_df.loc[fcast_date:,'high_es'],
                         color='g', alpha=0.1, label='ES Model-Confidence Interval')
        
        
        ### Plot ARIMA model ####

        # Plot fitted values including forecast 
        ax = cnty_df['fitted_ar'].plot(legend=True, 
                                       label='ARIMA{} Model-Fitted Values'.format(cnty_df['ar_order'].iloc[0]),
                                       marker='o', markersize=5, linewidth=1.8, color='blue')

        # Plot confidence interval
        plt.fill_between(cnty_df[-12:].index,
                         cnty_df.loc[fcast_date:,'low_ar'],
                         cnty_df.loc[fcast_date:,'high_ar'],
                         color='blue', alpha=0.1, label='ARIMA Model-Confidence Interval')
    
        
        ### Plot actual deaths for the full time period ####
        ax = cnty_df['agg_res_deaths_per1000_res'].plot(legend=True, label='Actual Deaths',
                                                            marker='x', markersize=5,
                                                            linewidth=1.8, color='black', alpha=0.8)

        # Plot line showing Start of Forecast       
        minimum = min(cnty_df[['low_ar','low_es','low_fb']].min())
        maximum = max(cnty_df[['high_ar','high_es','high_fb']].max())
        
        plt.vlines(cnty_df.index[-12],
                   ymin=minimum,
                   ymax=maximum,
                    color='grey', linewidth=1, linestyle="--",
                   label='Forecast results start here')

        # Label, title and legend
        ax.set(ylabel=ylabel, xlabel='')
        ax.set_title(cnty_df['county'].iloc[0]+', '+cnty_df['state'].iloc[0], fontsize=20)
        plt.legend(loc='upper left', prop={'size': 15})

        # Remove spines
        for key, spine in ax.spines.items():
            spine.set_visible(False)
        ax.margins(0.01, 0)

    fig.suptitle(title, fontsize=25, y=1.0)
    plt.rc('axes', labelsize=15)
    plt.rc('xtick', labelsize=15)
    plt.rc('ytick', labelsize=15)
    plt.tight_layout() # automatically adjusts layout to fit long labels
    plt.show();

Create Plots

Here, we will use the overlay_plot function to generate plots for one or more counties.

Plot for a single county

Let's start by creating a plot for San Bernardino County, CA.

In [58]:
fips = [6071]
overlay_plot(es_fb_ar_df_new, fips, ylabel='Resident death rate',
             title='Model Overlay: Average weekly nursing home resident death rates')

The plot shows:

  • fitted values generated by the models for June 7, 2020 - March 14, 2021.
  • forecasted values generated by the models for March 21, 2021 - June 6, 2021.
  • actual resident death rates for June 7, 2020 - March 14, 2021.
  • confidence intervals of the forecasts for March 21, 2021 - June 6, 2021.

Observations:

  • Actual resident death rates for San Bernardino County follow an increasing trend from July '20 - Aug. '20, followed by a decline until Oct. '20. The trend rises again from Oct. '20 - Dec. '20, followed by a decline.
  • The fitted values for all models seem to follow the actual data closely. Although the actual resident death rates show a declining trend for the forecast period, the forecasts generated by ARIMA and Forest-based models predict a higher death rate.
  • Forecasts generated by the ARIMA(3,0,0) model show a slightly higher death rate than the actual data and the other models.
  • The forecast generated by the Forest-based Forecast (FBF) model follows a straight line for the first few time periods and then starts trending up for the remaining time periods, showing some variance between the FBF results and the actual data.
  • Note that the Confidence Interval (CI) for FBF is the largest, indicating higher uncertainty associated with each forecast compared to the ARIMA or ESF models, as the forecasting time period progresses. The CI for the ARIMA model is the narrowest for this county.

Plot for multiple counties

Since Indiana, Arizona, and Pennsylvania were among the top 5 states with the highest average weekly resident deaths per 1000 residents, we will pick a county from each cluster to represent each state and plot them.

County, State FIPS Cluster Id
Maricopa, AZ 4013 2
Delaware, PA 42045 3
Porter, IN 18127 4
In [59]:
# Generate plots
fips = [4013,42045,18127]
overlay_plot(es_fb_ar_df_new, fips, ylabel='Resident death rate',
             title='Model Overlay: Average weekly nursing home resident death rates')

The plots show the:

  • fitted values generated by the models for June 7, 2020 - March 14, 2021.
  • forecasted values generated by the models for March 21, 2021 - June 6, 2021.
  • actual resident deaths for June 7, 2020 - March 14, 2021.
  • confidence intervals of the forecasts for March 21, 2021 - June 6, 2021.

Observations:

  • The trend of actual resident death rates is different for each county. Earlier in 2020, Maricopa county shows an increase in death rates from June - August. Both Delaware and Porter counties show an increase from June - July.
  • Maricopa and Delaware counties show an increase in death rates from mid Dec '20 - early Feb '21. Porter county shows a large spike in November, followed by alternating increases and decreases through early Feb '21.
  • Porter county had the highest average weekly resident death rates, followed by Delaware and Maricopa counties.
  • Most counties show a declining trend in death rate for the forecast period.
  • The fitted values for all models seem to follow the actual data closely.
  • The forecasted values and confidence intervals for all models vary for different counties.
  • The increasing widths of the confidence intervals show higher uncertainty associated with each forecast as the forecasting time period increases.
  • The confidence interval for the FBF model is the largest, indicating lower confidence in its forecast results when compared to the ARIMA or ESF models as the forecasting time period progresses.

Plot Errors

We have already plotted the forecasts generated by the different models and visualized how the models perform when compared to the actual data. Now, let's map the forecast errors generated by the Exponential Smoothing, Forest-based, and ARIMA models. In this section, we will:

  • subset the data for errors.
  • map the forecast errors generated by the different models.

Subset data for Errors

Subset es_fb_ar_df to keep only the error data.

In [60]:
# Subset to get a single record for each county
err_df = es_fb_ar_df_new.groupby(["county", "state"]).nth(-1)
err_df.reset_index(inplace=True)

# Keep relevant error columns
cols_to_keep = ['county','state','fips','f_rmse_es','f_rmse_fb','f_rmse_ar','v_rmse_es','v_rmse_fb','SHAPE']
errors_df = err_df[cols_to_keep].copy()
In [61]:
# Check shape
errors_df.shape
Out[61]:
(2463, 9)
In [62]:
# Check head
errors_df.head()
Out[62]:
county state fips f_rmse_es f_rmse_fb f_rmse_ar v_rmse_es v_rmse_fb SHAPE
0 Acadia LA 22001 3.083111 0.607884 0.013995 3.744268 3.205362 {"rings": [[[-10309644.8827, 3516679.617499999...
1 Accomack VA 51001 4.492012 2.598795 1.440333 5.830255 2.415093 {"rings": [[[-8421081.9868, 4575224.680500001]...
2 Ada ID 16001 2.210151 1.184250 1.596018 2.396963 1.942228 {"rings": [[[-12909994.3696, 5400307.765799999...
3 Adair IA 19001 25.904524 18.226247 6.700970 7.959651 11.031152 {"rings": [[[-10517558.2214, 5036093.2192], [-...
4 Adair KY 21001 8.437757 4.662914 1.711458 5.586216 2.524163 {"rings": [[[-9513352.0417, 4430747.513999999]...

Publish to Feature Layer

We will publish the data to a feature layer for ease of access and use later.

In [63]:
errors_item = errors_df.spatial.to_featurelayer('TS_Errors')
errors_item
Out[63]:
TS_Errors
Feature Layer Collection by portaladmin
Last Modified: September 08, 2021
0 comments, 0 views

Generate Maps

Map the errors generated by the different models.

In [65]:
# Search error item and get layer
err_item = gis.content.search('TS_Errors', 'Feature Layer')
err_lyr = err_item[0].layers[0]
err_lyr
Out[65]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/a51c5f/FeatureServer/0">

Exponential Smoothing: Forecast Error

In [68]:
# Generate map
map_esf = gis.map('US', zoomlevel=4)
map_esf.add_layer(err_lyr, 
                    {"renderer":"ClassedColorRenderer",
                   "field_name":"f_rmse_es",
                   "opacity":0.7})
map_esf
Out[68]:

The plot shows variations in forecast errors as generated by the Exponential Smoothing Forecast model for each county. The variation seems to be high for some counties, with values ranging from 10.6 - 125.

In [67]:
# Take screenshot
map_esf.legend=True
map_esf.take_screenshot(output_in_cell=False)

Forest-based: Forecast Error

In [71]:
# Generate map
map_fbf = gis.map('US', zoomlevel=4)
map_fbf.add_layer(err_lyr, 
                    {"renderer":"ClassedColorRenderer",
                   "field_name":"f_rmse_fb",
                   "opacity":0.7})
map_fbf
Out[71]:

The plot shows variations in forecast errors as generated by the Forest-based Forecast model for each county. The variation seems to be high for some counties, with values ranging from 6 - 66.

In [70]:
# Take screenshot
map_fbf.legend=True
map_fbf.take_screenshot(output_in_cell=False)

ARIMA: Forecast Error

In [74]:
# Generate map
map_arima = gis.map('US', zoomlevel=4)
map_arima.add_layer(err_lyr, 
                    {"renderer":"ClassedColorRenderer",
                   "field_name":"f_rmse_ar",
                   "opacity":0.7})
map_arima
Out[74]:

The plot shows variation in forecast errors as generated by the ARIMA model for each county. The variation seems to be high for some counties, with values ranging from 4.2 - 95.

In [73]:
# Take screenshot
map_arima.legend=True
map_arima.take_screenshot(output_in_cell=False)

To identify the counties that fall within a certain error range, we can perform a simple query operation on the errors dataframe.

In [75]:
# Counties where ARIMA model forecast error is greater than 90
errors_df.query('f_rmse_ar > 90')
Out[75]:
county state fips f_rmse_es f_rmse_fb f_rmse_ar v_rmse_es v_rmse_fb SHAPE
1316 Lincoln OK 40081 15.028396 8.407123 95.493218 4.607633 0.1152 {"rings": [[[-10813687.5577, 4227929.468500003...

Summary

To summarize, in this notebook we:

  • built ARIMA models.
    • Created an ARIMA(p,d,q) model [1] for a single county.
    • Defined helper functions to build ARIMA models for all counties.
    • Combined the ARIMA results with the Exponential Smoothing [4] and Forest-based Forecast [5] model results and the actual data.
  • created plots to visualize the fitted values, as well as the forecasted values and their confidence intervals, for the Exponential Smoothing Forecast, Forest-based Forecast, and ARIMA models.
  • plotted the forecast errors generated by the models.

Conclusion

Because the future may not repeat the past for this pandemic, no prediction model is certain. However, any prediction tool with acceptable prediction performance can still be useful for public-health planning, policy decision-making, and facilitating the transition back to normalcy.

Our analysis utilizes advanced data analysis, visualization, and machine learning techniques on a publicly available dataset [6] to show the variation in forecasts generated by some of the more popular time series forecasting algorithms [2][4][5] over different geographical areas.

The analysis can be easily adapted to use other features in the dataset, such as resident and staff cases, COVID-19 testing features, and availability/shortage of personal protective equipment (PPE). The techniques defined in this analysis can also support further data integrations, such as vaccination rates, nursing home health inspections, and quality of care ratings data.

References