- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series
- ⚕️ Healthcare
Requirements:
- 🗄️ ArcGIS Notebooks Advanced Runtime
The Coronavirus disease 2019 (Covid-19), which would spark a pandemic, was first reported in China on December 31, 2019. In January 2020, the first case of Covid-19 in the United States was reported near Seattle, WA. In the following months, the Seattle area became the epicenter of an early U.S. outbreak: 39 residents of a nursing home in Kirkland, WA died from complications of the virus in a single four-week span [8].
The Covid-19 pandemic has swept through long-term care facilities in the United States, which account for nearly 4% of all U.S. cases (1.4M) and 31% of all deaths (182K) as of June 2021 [9]. The impact on nursing homes is even more severe, where nearly 1 in 10 residents has died. According to a study [10], 94% of nursing homes reviewed experienced more than one Covid-19 outbreak, and 85% had an outbreak lasting 5 or more weeks. As of August 2021, case counts have ticked up again, and with vaccinated elderly residents susceptible to breakthrough infections, nursing homes confront new Covid-19 outbreaks [11].
People living in nursing homes are generally 60 years and older, with underlying health conditions such as cardiovascular disease, chronic respiratory disease, cerebrovascular disease, malignancy, and dementia. They represent a vulnerable population that should benefit from additional protective measures against infection. The communal nature of nursing homes, along with these comorbidities, puts residents at increased risk of Covid-19 progression, severe outcomes, and death.
The ability to predict the progress of the pandemic has been crucial for decision making aimed at controlling its spread. This presents an opportunity to use advanced data analysis, visualization, and modeling techniques (both open source and ArcGIS based) to find trends that provide early warning of rising death counts and help prevent future increases.
In this 2-part study, we will:
In this notebook, we will:
In the subsequent notebook, we will:
The Centers for Medicare and Medicaid Services (CMS) provides a Nursing Home Covid-19 Public File. This file includes data reported by nursing homes to the CDC’s National Healthcare Safety Network (NHSN) system’s Covid-19 Long Term Care Facility Module. The data includes information about Resident Impact, Facility Capacity, Staff & Personnel, Personal Protective Equipment, Ventilator Capacity, and other Supplies information.
Dataset: https://data.cms.gov/covid-19/covid-19-nursing-home-data
Each record in the data represents information about an individual nursing home in the United States for each week. For our analysis, we will use data ranging from June 7, 2020 - June 6, 2021. This data has been cleaned, feature engineered, and made available as published feature services. To learn more about the raw data and the data cleaning performed to create these services, review the following notebook.
# Import Libraries
# Display
from IPython.display import display
from ipywidgets import *
# ArcGIS and Arcpy
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap
import arcpy
# Data Exploration
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)
# Plotting Libraries
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import dates
import seaborn as sns
plt.style.use('ggplot')
# Datetime and os
from datetime import datetime
import time
import os
import itertools
# Stop warnings
import warnings
warnings.filterwarnings("ignore")
# Create a GIS connection
gis = GIS("home")
The dataset we have is space-time data, where each record represents information about an individual nursing home in the United States for each week, ranging from June 7, 2020 - June 6, 2021. For exploration purposes, we will only consider the following columns:
- weekly resident cases per 1000 residents
- weekly resident deaths per 1000 residents
There are many ways to explore this data, but in this notebook, we will explore it in two particular ways. We will aggregate the data by state to:
In this section, we will:
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
ts_item = gis.content.search(f'NH_TS_US_Cleaned_0621 {group_query}',
item_type = "Feature Service",
outside_org=True)
ts_item
# Get data layer
ts_layer = ts_item[1].layers[0]
ts_layer
# View field names and their aliases in the layer
for f in ts_layer.properties.fields:
    print(f['name'], "{0:>35}".format(f['alias']))
# Check records in data
ts_layer.query(return_count_only=True)
From the results above, we can see that our dataset has ~770k records and 16 columns. Each record includes information about the nursing home staff, resident impacts, personal protective equipment and other supplies, occupied beds, etc.
Here, we will query and group the data.
%%time
# Query Data
subdf = ts_layer.query(as_df=True,
out_fields=['county','provider_state',
'week_ending','weekly_resident_confirmed_covid',
'weekly_resident_covid19_deaths_'])
subdf.shape
We will group the data by county and state to get the average weekly resident cases and deaths per 1000 residents for each state and county, for each time period.
# Group data by state for each time period
exp_df1 = subdf.groupby(['week_ending','provider_state','county']).aggregate('mean').copy()
exp_df1.reset_index(inplace=True)
exp_df1.drop('objectid', inplace=True, axis=1)
exp_df1.rename(columns={'weekly_resident_confirmed_covid':'resident_cases',
'weekly_resident_covid19_deaths_':'resident_deaths'}, inplace=True)
exp_df1.shape
# Check head
exp_df1.head()
Now that we have aggregated the data by county and state and subsetted it to include only specific columns, the dataset has 149,267 records and 5 columns. Each record represents the average weekly resident cases and deaths per 1000 residents for a given county and state, for each one-week time period.
We can further aggregate this data by state to get average weekly resident cases and deaths per 1000 residents by state. This data is combined for all time periods.
# Group data by state
exp_df2 = exp_df1.groupby(['provider_state']).aggregate('mean').copy()
exp_df2.reset_index(inplace=True)
exp_df2.shape
# Check head
exp_df2.head()
This subset represents the average weekly resident cases and deaths per 1000 residents by state over the entire time period. For example, across all nursing homes in AZ over the entire time period, on average 3 residents out of 1000 died per week due to Covid-19, and 13 out of 1000 residents tested positive for Covid-19 per week.
Let's plot the data using the two subsets of data created above.
# Plot
fig, (ax1, ax2) = plt.subplots(2, figsize=(20,15))
# Plot cases
sns.barplot(x='provider_state', y='resident_cases',
data=exp_df2.sort_values(by='resident_cases', ascending=False),
ax=ax1, color='lightcoral')
# Plot deaths
sns.barplot(x='provider_state', y='resident_deaths',
data=exp_df2.sort_values(by='resident_deaths', ascending=False),
ax=ax2, color='lightcoral')
# Set label and title
ax1.set(ylabel='Resident case rate', xlabel= "States",
title="Average weekly nursing home resident case rates by state")
ax2.set(ylabel='Resident death rate', xlabel= "States",
title="Average weekly nursing home resident death rates by state")
plt.rc('axes', labelsize=20)
plt.rc('axes', titlesize=15)
plt.rc('xtick', labelsize=10)
plt.show();
The plots show:
- average weekly nursing home resident case rates by state.
- average weekly nursing home resident death rates by state.
Observations:
- the average weekly resident case rate seems to be highest for Arkansas, followed by Arizona and Tennessee.
- the average weekly resident death rate is highest for Indiana, followed by Arizona and North Dakota.
Let's look more closely at how the average weekly resident death rates for Indiana, Arizona, and North Dakota changed over time.
Here, we plot the time series of average weekly resident deaths per 1000 residents for Indiana, Arizona, and North Dakota across different time periods.
fig, ax = plt.subplots(1,1, figsize=(20,10))
states = ['AZ','IN','ND']
for state in states:
    sns.lineplot(x='week_ending', y='resident_deaths',
                 data=exp_df1.query('provider_state==@state'),
                 marker='o', linewidth=1.8, ci=None, label=state)
ax.autoscale(axis='x')
plt.legend(loc='upper left', prop={'size': 15})
ax.set(ylabel='Resident death rate', xlabel= "")
ax.xaxis.set_major_locator(dates.MonthLocator(interval=1))
plt.title("Average weekly nursing home resident death rates", fontsize=20)
plt.gcf().autofmt_xdate()
plt.show();
The plot shows average weekly resident death rates for Indiana, Arizona, and North Dakota.
Observations:
- Arizona saw a higher resident mortality rate early on, compared to other states. The death rate increased steadily until August'20, followed by a decreasing trend until mid Sept.'20 and then no deaths until mid Oct.'20. The period from mid Oct.'20 - mid Dec.'20 saw a consistent increase in the death rate, followed by a decreasing trend until June'21. The state saw its highest death rates in early Aug.'20 and mid Dec.'20.
- Indiana reported a consistently low death rate until Oct'20, with a minor spike for a month around Aug.'20. From Oct.'20 - mid Dec.'20, there was an increase in the death rate, followed by a decreasing trend until June'21. From mid Dec.'20 onwards, Indiana reported a higher death rate than Arizona and North Dakota.
- North Dakota started very well early on, with almost no deaths until Sept.'20. There was a steep rise in the death rate in mid Oct.'20, with an average of 25 deaths per 1000 residents during the 2nd week of Oct.'20. While this trend seemed to settle down the next week, there was another increasing trend in the death rate from early Nov.'20 - early Dec.'20. From mid Dec.'20 onwards, there was a decreasing trend through June'21.
- All 3 states saw a decrease in resident death rates starting in mid Dec.'20 and almost no deaths starting in March'21. This trend is consistent with overall Covid-19 death rates across the United States: most states saw an increase in case and death rates during Dec.'20 - Jan.'21, and with the introduction of vaccination programs in early Jan.'21, deaths declined to the point that almost none were reported from April'21 onwards.
We will begin our analysis by creating a space time cube using the Create Space Time Cube From Defined Locations [1] tool. The data structure this tool creates can be thought of as a three-dimensional cube made up of space-time bins, with the x and y dimensions representing space and the t dimension representing time. Every bin has a fixed position in space (either an x,y location for points or a fixed set of vertices for polygon locations) and in time (t).
This tool structures the data into a netCDF data format by creating space-time bins. The netCDF output is then used as input to other time-series algorithms in our toolbox.
In this section, we will:
We will use data from June 7, 2020 - March 14, 2021 to build the space time cube. Data from March 21, 2021 - June 6, 2021 (12 time steps) has been excluded and will be used for forecasting.
Before we create the space time cube, we will:
Here, we will create the file geodatabase and setup the data directory to output our results to.
# Set data directory to output results
home_dir = os.path.join(os.getcwd(), 'home')
print(f"home dir: {home_dir}")
# Create empty FGDB in data directory
if not os.path.isdir(os.path.join(home_dir, 'Results.gdb')):
    arcpy.CreateFileGDB_management(home_dir, 'Results.gdb')
print('File Geodatabase exists')
# Setup Workspace
arcpy.env.workspace = os.path.join(home_dir,'Results.gdb')
arcpy.env.workspace
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
ts_sub_item = gis.content.search(f'Subset_NH_TS_US_Cleaned_0621 {group_query}',
item_type = "Feature Service",
outside_org=True)
ts_sub_item
# Get data layer
ts_sub_layer = ts_sub_item[0].layers[0]
ts_sub_layer
%%time
# Query as dataframe
ts_sub_df = ts_sub_layer.query(as_df=True)
# Check details
display(ts_sub_df.shape)
display(ts_sub_df.columns)
display(ts_sub_df.head())
From the results above, we can see that our dataset has ~596k records and 17 columns. The columns include information about resident impact, staff & personnel, personal protective equipment and other supplies.
%%time
# Store the data locally
point_lyr = ts_sub_df.spatial.to_featureclass(os.path.join(arcpy.env.workspace,'Subset_NH_TS_US_Cleaned_0621'))
point_lyr
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
polygon_item = gis.content.search(f'NH_TS_US_FIPS_layer {group_query}',
item_type = "Feature Service",
outside_org=True)
polygon_item
# Get published FIPS layer as polygon layer
polygon_layer = polygon_item[0].layers[0]
polygon_layer
# Check layer details
fips_df = polygon_layer.query(as_df=True)
display(fips_df.shape)
display(fips_df.head())
Here, we will build a space time cube. We will aggregate weekly resident deaths (point_lyr) at the US county geographic level for each time period by using aggregation_shape_type = "DEFINED_LOCATIONS" and passing the FIPS layer (polygon_layer) as defined_polygon_locations.
# Run Create Space Time Cube From Defined Locations
from arcpy.stpm import CreateSpaceTimeCube
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
try:
    # Define model variables
    in_features = point_lyr
    sptm_cube_sub = "usCubeNBLatest.nc" #"fcCubeNB.nc"
    time_field = 'week_ending'
    template_cube = None
    time_step_interval = "1 Weeks"
    time_step_alignment = "END_TIME"
    reference_time = None
    distance_interval = None
    summary_fields = "weekly_resident_covid19_deaths_ MEAN SPACE_TIME_NEIGHBORS;"
    aggregation_shape_type = "DEFINED_LOCATIONS"
    defined_polygon_locations = polygon_layer.url
    location_id = "fips"
    CreateSpaceTimeCube(in_features, sptm_cube_sub, time_field, template_cube, time_step_interval,
                        time_step_alignment, reference_time, distance_interval, summary_fields,
                        aggregation_shape_type, defined_polygon_locations, location_id)
    print(arcpy.GetMessages())
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
From the messages generated, we can see that:
- 3895 of 595650 input features could not be aggregated into the Defined Polygon locations. This represents 0.6 % of the total data, and we will ignore these data points for the analysis.
- 189 out of 3142 counties (Defined Locations) were excluded from the Space Time Cube due to the presence of bins that could not be estimated. These 189 counties didn't have enough data for estimation and will not be included in the analysis.
- Overall across the United States, there is a statistically significant increasing trend for average weekly resident deaths per 1000 residents.
Space Time Cube structures the data into a netCDF data format (.nc) by creating space-time bins. We can now use the netCDF output as input to other time-series algorithms in our toolbox.
Now that we have a Space Time Cube, let's try to understand some patterns in our space-time data of resident deaths per 1000 residents. Clustering allows us to understand the data better by identifying which features are more similar or dissimilar to each other. The Time Series Clustering [2] tool identifies the locations in a space-time cube that are most similar and partitions them into distinct clusters in which members of each cluster have similar time series characteristics.
The tool partitions a collection of time series, stored in a space-time cube, based on the similarity of time series characteristics. Time series can be clustered based on three criteria: having similar values across time, tending to increase and decrease at the same time, and having similar repeating patterns.
In this section, we will:
We will use the data from June 7, 2020 - March 14, 2021
for clustering. Data from March 21, 2021 - June 6, 2021
(12 time steps) has been excluded from clustering and will be used for forecasting.
Here, we use the Profile (Correlation) option for the characteristic_of_interest parameter, which clusters time series that tend to stay in proportion to one another and increase and decrease in value at the same time. This option measures the similarity of time series based on their statistical correlation across time.
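To build intuition for this option, below is a minimal sketch (not the tool's internal implementation) of a correlation-based dissimilarity between two hypothetical weekly series: series that rise and fall together have a correlation near 1 and a dissimilarity near 0, so they are more likely to land in the same cluster.
# Conceptual sketch only: correlation-based dissimilarity between two series
# (hypothetical values; the Time Series Clustering tool handles this internally)
import numpy as np
ts_a = np.array([1.0, 2.0, 4.0, 8.0, 6.0, 3.0])    # hypothetical county A weekly death rates
ts_b = np.array([2.1, 4.0, 7.9, 16.2, 12.1, 6.0])  # county B, roughly proportional to A
corr = np.corrcoef(ts_a, ts_b)[0, 1]               # Pearson correlation across time
dissimilarity = 1 - corr                           # near 0 -> similar profiles
print(round(corr, 3), round(dissimilarity, 3))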
from arcpy.stpm import TimeSeriesClustering
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
try:
    in_cube = os.path.join(home_dir, sptm_cube_sub)
    analysis_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
    output_features = "usNB_resDeaths_TSClustering"
    characteristic_of_interest = "PROFILE"
    cluster_count = None
    output_table_for_charts = "usNB_resDeaths_TSClustChart"
    shape_characteristic_to_ignore = None
    enable_time_series_popups = "NO_POPUP"
    TimeSeriesClustering(in_cube, analysis_variable, output_features, characteristic_of_interest,
                         cluster_count, output_table_for_charts, shape_characteristic_to_ignore,
                         enable_time_series_popups)
    print(arcpy.GetMessages())
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
The messages generated show:
- the optimal number of clusters determined by the tool is 4.
- the trend statistic for the average time series in each cluster shows a statistically significant trend in 3 of the 4 clusters.
- the number of locations (counties) assigned to each cluster.
# Access clustering results into a SeDF
tsc_resDeaths_df = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
tsc_resDeaths_df.head()
# Check shape
tsc_resDeaths_df.shape
# Create a feature service on portal
tsc_layer = tsc_resDeaths_df.spatial.to_featurelayer('NH_TS_Clustering')
tsc_layer
As time series clustering does not return a FIPS code, we will use a spatial join to join the clustering results with the FIPS layer.
# Join Fips and Clustering output layers
from arcgis.features.analysis import join_features
tsc_resDeaths_item = join_features(target_layer=polygon_layer.url,
join_layer=tsc_layer,
spatial_relationship='identicalto',
join_operation='JoinOneToOne',
output_name='NH_TS_Join')
#Check item
tsc_resDeaths_item = gis.content.search('title: NH_TS_Join', 'Feature Layer')
tsc_resDeaths_item
# Get join results in a data frame
tsc_resDeaths_df = tsc_resDeaths_item[0].layers[0].query(as_df=True)
tsc_resDeaths_df.head()
Now that we have cluster information along with their location, we will:
Here, we will visualize the number of counties in each cluster.
# Create plot
sns.set(font_scale = 1)
sns.countplot(x='cluster_id', data=tsc_resDeaths_df, color='lightcoral')
plt.title('No. of counties in each cluster');
The plot shows the count of counties within each cluster.
Here, we will plot the clusters on a map to visualize which counties are associated to which cluster.
# Create Map
tsc_resDeaths_map = gis.map('USA', 4)
tsc_resDeaths_map
The plot shows counties colored by cluster id.
# Add layer
tsc_resDeaths_map.remove_layers()
tsc_resDeaths_df.spatial.plot(map_widget=tsc_resDeaths_map,
renderer_type='u', # for unique renderer
col='cluster_id',
line_width=.5)
# Add legend and take screenshot
tsc_resDeaths_map.legend = True
tsc_resDeaths_map.take_screenshot(output_in_cell=False)
The Average Time Series per Cluster chart displays the average of the Analysis Variable at each time step for each cluster.
# Access clustering data from local
tsc_resDeaths_df2 = pd.DataFrame.spatial.from_table(os.path.join(arcpy.env.workspace,output_table_for_charts))
tsc_resDeaths_df2.head()
# Plot time series per cluster
plt.figure(figsize=(20,10))
sns.lineplot(x='END_DATE', y='CLUST_MEAN', data=tsc_resDeaths_df2,
hue='CLUSTER_ID', marker='o', palette='Set2', linewidth=2.5)
plt.title('Average Time Series per Cluster', fontdict={'fontsize':20})
plt.legend(loc='upper left', title='CLUSTER_ID', prop={'size': 13})
plt.xticks(rotation=90);
The plot shows the average time series for counties in each cluster.
Observations:
- The plot shows different trends for three clusters but no trend for one cluster.
- The average time series for cluster 2 shows an increase in average weekly resident deaths per 1000 residents early on, from mid June'20 - mid Aug.'20. The increase is followed by a decline until early Oct.'20. From Oct.'20 - Mar.'21, there is a slight increasing trend, however the death rate remains considerably lower.
- The average time series for cluster 3 shows an increase in average weekly resident deaths per 1000 residents starting mid Sept.'20, with a steep rise from early Oct.'20 - early Dec.'20. The increase is followed by a steep decline until early Jan.'21 and then a consistent decline until Mar.'21.
- The average time series for cluster 4 shows an increase in average weekly resident deaths per 1000 residents starting Nov.'20, with a steep rise from late Nov.'20 - early Jan.'21. The increase is followed by a steep decline until late Feb.'21 and then a consistent decline until Mar.'21.
- The final cluster, cluster 1, appears as a flat line with no trend. As seen in the messages generated by the model, this cluster does not have a statistically significant trend. There are 248 counties in this cluster.
Here, we use simple group and query operations to identify counties and states that belong to a particular cluster. We will start by looking at the composition of cluster #3.
# Top 5 states with highest no. of counties in a particular cluster
tsc_resDeaths_df.query('(cluster_id==3)')[['county','state']] \
.groupby('state').count() \
.sort_values(by ='county',ascending=False).head()
# Identify counties for a cluster id and state
tsc_resDeaths_df.query('(cluster_id==3) and (state == "CA")')[['cluster_id','county','state','fips']]
# For each cluster, identify states with highest number of counties
g_df = tsc_resDeaths_df.loc[:,['cluster_id','state','county']] \
.groupby(['cluster_id','state']).count() \
.sort_values(by='county', ascending=False) \
.groupby(level=0).head(1)
g_df
Now that we have explored patterns in our time series data, let's get started with building some models to forecast average weekly resident deaths per 1000 residents.
Exponential smoothing methods [3] are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, more recent observations are weighted higher than past observations. There are three types of exponential smoothing models:
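- Single (simple) exponential smoothing, which smooths only the level of the series.
- Double exponential smoothing (Holt's method), which adds a trend component.
- Triple exponential smoothing (Holt-Winters), which adds both trend and seasonal components.
As a reference, one common additive (Holt-Winters) form of the smoothing equations, written with the notation used below and with $y_{t}$ denoting the observed value at time t, is:
$$L_{t} = \alpha\,(y_{t} - S_{t-p}) + (1 - \alpha)\,(L_{t-1} + T_{t-1})$$
$$T_{t} = \beta\,(L_{t} - L_{t-1}) + (1 - \beta)\,T_{t-1}$$
$$S_{t} = \delta\,(y_{t} - L_{t}) + (1 - \delta)\,S_{t-p}$$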
where $L_{t}$ is the level at time t, $\alpha$ is the weight (or smoothing constant) for the level. $T_{t}$ is the trend at time t, $\beta$ is the weight (or smoothing constant) for the trend. Also, $S_{t}$ is the seasonal component at time t, $\delta$ is the weight (or smoothing constant) for the seasonal component, and p is the seasonal period.
The Exponential Smoothing Forecast [5] tool forecasts the values of each location of a space-time cube using the Holt-Winters exponential smoothing method by decomposing the time series at each location of the cube into seasonal and trend components. The exponential smoothing model assumes that all components are additive and linear. Damped trend is always used, and additive seasonality is supported but not required. Confidence intervals are constructed under the assumption that the residuals are additive and normally distributed.
In this section, we will:
We will use data from June 7, 2020 - March 14, 2021 to train and validate the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.
# Run Algorithm
from arcpy.stpm import ExponentialSmoothingForecast
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
try:
    in_cube = os.path.join(home_dir, sptm_cube_sub)
    analysis_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
    output_features = "usNB_resDeaths_ESFsub"
    out_cube = os.path.join(home_dir, 'usNB_resDeaths_ESFsub_cube.nc')
    # out_cube = None
    number_of_time_steps_to_forecast = 12
    season_length = None
    number_for_validation = 6
    outlier_option = None
    level_of_confidence = "95%"
    maximum_number_of_outliers = None
    ExponentialSmoothingForecast(in_cube, analysis_variable, output_features, out_cube,
                                 number_of_time_steps_to_forecast, season_length, number_for_validation,
                                 outlier_option, level_of_confidence, maximum_number_of_outliers)
    print(arcpy.GetMessages())
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
# Access model data from local
esf_resDeaths_subdf = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace, output_features))
esf_resDeaths_subdf.head()
When forecasting time series data, the idea is to estimate how the sequence of observations will continue into the future. Using the model just created, we will generate a forecast for 12 time periods, i.e., from 2021-03-21 - 2021-06-06.
To create forecast plots, we need the raw, fitted, and forecasted values. The model generates a feature class with forecasted values only, and the raw point data (point_lyr) is not aggregated by county. Let's create a space time cube in 3D using the Visualize Space Time Cube in 3D [6] tool to get the fitted values.
from arcpy.stpm import VisualizeSpaceTimeCube3D
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
in_cube = out_cube
cube_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
display_theme = "FORECAST_RESULTS"
output_features = "usNB_resDeaths_ESFsub_3D"
VisualizeSpaceTimeCube3D(in_cube, cube_variable, display_theme, output_features)
# Access cube data from local
cube3D_ESsub_df_resDeaths = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
cube3D_ESsub_df_resDeaths.head()
# Check shape
cube3D_ESsub_df_resDeaths.shape
# Merge with FIPS data
fips_df.columns = fips_df.columns.str.upper()
cube3D_ESsub_df_resDeaths = pd.merge(cube3D_ESsub_df_resDeaths, fips_df, on='FIPS')
cube3D_ESsub_df_resDeaths.shape
Here, we clean the data and keep only necessary columns.
# Rename Columns
cols_to_rename = {'OBJECTID_y':'OBJECTID',
'SHAPE_y':'SHAPE'}
cube3D_ESsub_df_resDeaths.rename(columns=cols_to_rename, inplace=True)
# Drop columns
cols_to_drop = ['OBJECTID_x','TIME_STEP','TIME_EXAG','ELEMENT','LOCATION','SHAPE_x']
cube3D_ESsub_df_resDeaths.drop(columns = cols_to_drop,
axis=1, inplace=True)
# Check data
display(cube3D_ESsub_df_resDeaths.shape)
display(cube3D_ESsub_df_resDeaths.columns)
Here, we will combine the errors generated by the Exponential Smoothing model with the fitted values from the space-time cube cube3D_ESsub_df_resDeaths.
# Get the forecast and validation RMSE values from the ES results df
esf_small = esf_resDeaths_subdf[['FIPS','F_RMSE', 'V_RMSE']].copy()
esf_small.shape
# Merge with 3D cube data
cube3D_ESsub_df_resDeaths = pd.merge(cube3D_ESsub_df_resDeaths, esf_small, on='FIPS')
cube3D_ESsub_df_resDeaths.shape
Next, we will publish the combined model results to a feature layer for ease of access and use.
%%time
# Publish layer
cube3D_ESsub_df_resDeaths.spatial.to_featurelayer('ES_Results')
Here, we will create plots that show the raw, fitted, and forecasted average weekly nursing home resident deaths per 1000 residents due to Covid-19 for various counties. The fitted and raw values are for the time period June 7, 2020 - March 14, 2021, whereas the forecasted values are for the time period March 21, 2021 - June 6, 2021.
Next, we will subset the data to keep only the necessary columns.
# Get column names
cube3D_ESsub_df_resDeaths.columns
# Keep necessary columns
resDeaths_cube_ESsub_df = cube3D_ESsub_df_resDeaths[['end_date','fips','county','state','value',
'fitted','high','low','level','trend','SHAPE']].copy()
resDeaths_cube_ESsub_df.set_index('end_date', inplace=True)
resDeaths_cube_ESsub_df.shape
# Check head of data
resDeaths_cube_ESsub_df.head()
Here, we will create plots that show the raw, fitted, and forecasted values of average weekly resident deaths per 1000 residents for various counties.
Let's define a function, ts_plot, that will:
We can then use this function to generate plots for one or more counties.
# Define function to plot data for one or multiple counties
def ts_plot(data, fips: list, ylabel: str, title: str):
    if len(fips) == 1:
        fig = plt.figure(figsize=(18, 12))
    else:
        fig = plt.figure(figsize=(20, 20))
    # Create grid
    for sp in range(0, len(fips)):
        if len(fips) == 1:
            ax = fig
        elif len(fips) % 2 == 0:
            ax = fig.add_subplot(len(fips) // 2, 2, sp + 1)
        else:
            ax = fig.add_subplot(len(fips) // 2 + 1, 2, sp + 1)
        # Subset data for county
        plot_df = data[data['fips'] == fips[sp]]
        # Plot raw, fitted and forecast values
        ax = plot_df.loc[:'2021-03-14', 'value'].plot(legend=True,
                                                      label='Raw Values', marker='o', markersize=10, linewidth=2)
        ax = plot_df['fitted'].plot(legend=True, label='Fitted Values', marker='o',
                                    markersize=10, linewidth=2)
        # Plot confidence interval
        plt.fill_between(plot_df['fitted'][-12:].index,
                         plot_df.loc['2021-03-21':, 'low'],
                         plot_df.loc['2021-03-21':, 'high'],
                         color='r', alpha=0.1, label='Confidence Interval')
        # Plot forecast start line
        plt.vlines(plot_df.index[-12],
                   ymin=plot_df['low'].min(),
                   ymax=plot_df['high'].max(),
                   color='grey', linewidth=1,
                   linestyle="--", label='Forecast results start here')
        # Label, title and legend
        ax.set(ylabel=ylabel, xlabel='')
        ax.set_title(plot_df['county'].iloc[0] + ', ' + plot_df['state'].iloc[0], fontsize=20)
        plt.legend(loc='upper left', prop={'size': 12})
        # Remove spines
        for key, spine in ax.spines.items():
            spine.set_visible(False)
        ax.margins(0.01, 0)
    fig.suptitle(title, fontsize=25, y=1.0)
    plt.rc('axes', labelsize=15)
    plt.rc('xtick', labelsize=15)
    plt.rc('ytick', labelsize=15)
    plt.tight_layout()  # automatically adjusts layout to fit long labels
    plt.show();
Since Indiana, Arizona, and Pennsylvania were among the top 5 states with the highest average weekly resident deaths per 1000 residents, we will pick one county from each of these states, each belonging to a different cluster, and plot them.
| County, State | FIPS | Cluster Id |
|---|---|---|
| Maricopa, AZ | 4013 | 2 |
| Delaware, PA | 42045 | 3 |
| Porter, IN | 18127 | 4 |
# Create Plot
fips = [4013,42045,18127]
ts_plot(resDeaths_cube_ESsub_df,fips,
ylabel='Resident death rate',
title='Average weekly resident death rates')
The plots show:
- the fitted values as generated by the Exponential Smoothing model for June 7, 2020 - March 14, 2021.
- the forecasted values generated by the model for March 21, 2021 - June 6, 2021.
- the actual resident death rates for June 7, 2020 - March 14, 2021.
- the confidence interval of the forecast for March 21, 2021 - June 6, 2021.
Observations:
- The fitted values generated by the exponential smoothing model follow the actual death rates fairly closely for Maricopa County, AZ and Delaware County, PA. For Porter County, IN, the fitted values deviate quite a bit from the raw values. This could be a result of the model generalizing from the training data rather than overfitting it.
- The forecast values vary across counties, which is expected, as a separate model is fit for each county.
- Looking closely at the plot for Porter County, IN, the confidence interval appears narrower than the others, but the resident death rate in Porter County is also higher than in the other counties. As such, the absolute range of the confidence interval in Porter County could be similar to, or even larger than, that of the other counties.
- The increasing width of the confidence intervals shows greater uncertainty in the forecast as the forecast horizon increases.
The forecast RMSE measures how much the fitted values from the model differ from the raw time series values. It only measures how well the exponential smoothing model fits the raw time series values; it does not measure how well the model actually forecasts future values. That question is addressed by the validation model. Here, the forecast model uses data from June 7, 2020 - March 14, 2021 for fitting, and the 12 time steps from March 21, 2021 - June 6, 2021 are forecasted.
The validation model is used to determine how well the forecast model can forecast the future values of each time series. It is constructed by excluding some of the final time steps of each time series and fitting the exponential smoothing model to the data that was not excluded. This model is then used to forecast the values of the withheld data, and the forecasted values are compared to the raw values that were hidden. Since we are using 6 time steps for validation, the model uses data from June 7, 2020 - Jan 31, 2021 for training and data from Jan 31, 2021 - March 14, 2021 for validation to calculate this error.
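Schematically, both errors are root mean square errors of the model's estimates against the raw values. With $\hat{y}_{t}$ the fitted or forecasted value, $y_{t}$ the raw value, T the number of fitted time steps, and m = 6 the number of withheld validation steps, they can be written as:
$$\text{Forecast RMSE} = \sqrt{\tfrac{1}{T}\sum_{t=1}^{T}(\hat{y}_{t} - y_{t})^{2}}, \qquad \text{Validation RMSE} = \sqrt{\tfrac{1}{m}\sum_{t=T-m+1}^{T}(\hat{y}_{t} - y_{t})^{2}}$$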
You can read more about the two errors in the tool documentation [5]. Let's map the forecast and validation errors.
# Search for layer
es_item = gis.content.search('ES_Results', 'Feature Layer')
es_lyr = es_item[0].layers[0]
es_lyr
# Create map
m1= gis.map('US', zoomlevel=4)
m1
The map shows the variation in validation errors generated by the Exponential Smoothing model for each county. The validation errors vary widely across counties, with values ranging from 9.8 - 389.
# Plot data
m1.add_layer(es_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"V_RMSE",
"opacity":0.7
})
# Add legend and screenshot
m1.legend = True
m1.take_screenshot(output_in_cell=False)
# Create map
m2 = gis.map('US', zoomlevel=4)
m2
The map shows the variation in forecast errors generated by the Exponential Smoothing model for each county. The forecast errors vary widely across counties, with values ranging from 9.8 - 125.
# Plot data
m2.add_layer(es_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"F_RMSE",
"opacity":0.7
})
# Add legend and screenshot
m2.legend = True
m2.take_screenshot(output_in_cell=False)
Let's identify the counties with high variance in forecast and validation errors.
high_var = cube3D_ESsub_df_resDeaths.query(
'f_rmse > 9.8 and v_rmse > 9.8').groupby(
['state', 'county']).aggregate('count')[['f_rmse', 'v_rmse']]
high_var
From the results, we can see that there are 292 counties for which the variation in error is in the highest band for both validation and forecast errors.
Now that we have built the Exponential Smoothing forecast model and generated forecasts for our time series data, let's look into the Forest-based Forecast model to forecast average weekly resident deaths per 1000 residents.
The Forest-based Forecast tool [7] forecasts the values of each location of a space-time cube using an adaptation of Leo Breiman's random forest algorithm. This tool uses the same underlying algorithm as the Forest-based Classification and Regression tool when used for regression. The training data used to build the forest regression model is constructed using time windows on each location of the space-time cube.
Compared to other forecasting tools in the Time Series Forecasting toolset, this tool is the most complex but includes the fewest assumptions about the data.
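To illustrate the time-window idea, below is a minimal sketch (not the tool's implementation) of how one location's weekly series can be turned into tabular training data for a regression forest, using an arbitrary 4-week window: each row's explanatory variables are the previous 4 weeks, and the target is the current week.
# Conceptual sketch only: build lagged time-window training rows from one series
# (hypothetical values and window length; the Forest-based Forecast tool does this internally)
import pandas as pd
rates = pd.Series([0.0, 1.2, 3.4, 5.1, 4.0, 2.2, 1.0, 0.5])  # hypothetical weekly death rates
window = 4                                                    # assumed window length
rows = []
for i in range(window, len(rates)):
    lags = rates.iloc[i - window:i].tolist()  # previous 4 weeks as explanatory variables
    rows.append(lags + [rates.iloc[i]])       # current week as the value to predict
train_df = pd.DataFrame(rows, columns=[f"lag_{k}" for k in range(window, 0, -1)] + ["target"])
print(train_df)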
In this section, we will:
We will use data from June 7, 2020 - March 14, 2021 to train and validate the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.
# Run Algorithm
from arcpy.stpm import ForestBasedForecast
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
try:
    in_cube = os.path.join(home_dir, sptm_cube_sub)
    analysis_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
    output_features = "usNB_resDeaths_FBFsub"
    out_cube = os.path.join(home_dir, 'usNB_resDeaths_FBFsub_cube.nc')
    number_of_time_steps_to_forecast = 12
    time_window = None
    # season_length = None
    number_for_validation = 6
    number_of_trees = 1000
    minimum_leaf_size = None
    maximum_depth = None
    sample_size = 100
    forecast_approach = "VALUE_DETREND"
    outlier_option = None
    level_of_confidence = "95%"
    maximum_number_of_outliers = None
    ForestBasedForecast(in_cube, analysis_variable, output_features, out_cube,
                        number_of_time_steps_to_forecast, time_window, number_for_validation,
                        number_of_trees, minimum_leaf_size, maximum_depth, sample_size, forecast_approach,
                        outlier_option, level_of_confidence, maximum_number_of_outliers)
    print(arcpy.GetMessages())
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
# Access FBF data from local
fbf_resDeaths_subdf = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
fbf_resDeaths_subdf.head()
When forecasting time series data, the idea is to estimate how the sequence of observations will continue into the future. Using the model just created, we will generate a forecast for 12 time periods, from 2021-03-21 - 2021-06-06.
To create forecast plots, we need the raw, fitted, and forecasted values. The model generates a feature class with forecasted values only, and the raw point data (point_lyr) is not aggregated by county. Let's create a space time cube in 3D using the Visualize Space Time Cube in 3D [6] tool to get the fitted values.
from arcpy.stpm import VisualizeSpaceTimeCube3D
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
in_cube = out_cube
cube_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
display_theme = "FORECAST_RESULTS"
output_features = "usNB_resDeaths_FBFsub_3D"
VisualizeSpaceTimeCube3D(in_cube, cube_variable, display_theme, output_features)
# Access cube data from local
cube3D_FBFsub_df_resDeaths = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
cube3D_FBFsub_df_resDeaths.head()
# Check shape
cube3D_FBFsub_df_resDeaths.shape
# Merge with FIPS data
cube3D_FBFsub_df_resDeaths = pd.merge(cube3D_FBFsub_df_resDeaths, fips_df, on='FIPS')
cube3D_FBFsub_df_resDeaths.shape
Next, we will clean the data and keep only the necessary columns.
# Rename Columns
cols_to_rename = {'OBJECTID_y':'OBJECTID',
'SHAPE_y':'SHAPE'}
cube3D_FBFsub_df_resDeaths.rename(columns=cols_to_rename, inplace=True)
# Drop columns
cols_to_drop = ['OBJECTID_x','TIME_STEP','TIME_EXAG','ELEMENT','LOCATION','SHAPE_x']
cube3D_FBFsub_df_resDeaths.drop(columns = cols_to_drop,
axis=1, inplace=True)
# Check data
display(cube3D_FBFsub_df_resDeaths.shape)
display(cube3D_FBFsub_df_resDeaths.columns)
Here, we will combine the errors generated by the Forest-based Forecast model with the fitted values from the space-time cube cube3D_FBFsub_df_resDeaths.
# Get the forecast and validation RMSE values from the FBF results df
fbf_small = fbf_resDeaths_subdf[['FIPS','F_RMSE', 'V_RMSE']].copy()
fbf_small.shape
# Merge with RMSE data
cube3D_FBFsub_df_resDeaths = pd.merge(cube3D_FBFsub_df_resDeaths, fbf_small, on='FIPS')
cube3D_FBFsub_df_resDeaths.shape
We will publish the combined model results to a feature layer for ease of access and use.
%%time
fbf_item = cube3D_FBFsub_df_resDeaths.spatial.to_featurelayer('FBF_BigLayer')
fbf_item
Here, we will create plots that show the raw, fitted, and forecasted average weekly nursing home resident deaths per 1000 residents due to Covid-19 for various counties. The fitted and raw values are for the time period June 7, 2020 - March 14, 2021, whereas the forecasted values are for the time period March 21, 2021 - June 6, 2021.
Next, we will subset the data to keep only the necessary columns.
# Check Column names
cube3D_FBFsub_df_resDeaths.columns
# Subset specific columns
resDeaths_cube_FBFsub_df = cube3D_FBFsub_df_resDeaths[['end_date','fips','county','state','value',
'fitted','residual','high','low']].copy()
resDeaths_cube_FBFsub_df.set_index('end_date', inplace=True)
resDeaths_cube_FBFsub_df.shape
Here, we will create plots that show the raw, fitted, and forecasted values of the average weekly resident deaths per 1000 residents for various counties. Let's use the ts_plot() function, defined earlier, to create the plots. Since Indiana, Arizona, and Pennsylvania were among the top 5 states with the highest average weekly resident deaths per 1000 residents, we will pick one county from each of these states, each belonging to a different cluster, and plot them.
| County, State | FIPS | Cluster Id |
|---|---|---|
| Maricopa, AZ | 4013 | 2 |
| Delaware, PA | 42045 | 3 |
| Porter, IN | 18127 | 4 |
# Plot data using the ts_plot function
fips = [4013,42045,18127]
ts_plot(resDeaths_cube_FBFsub_df,fips,
ylabel='Resident death rate',
title='Average weekly resident death rates')
The plots show:
- the fitted values as generated by the Forest-based Forecast model for June 7, 2020 - March 14, 2021.
- the forecasted values generated by the model for March 21, 2021 - June 6, 2021.
- the actual resident death rates for June 7, 2020 - March 14, 2021.
- the confidence interval of the forecast for March 21, 2021 - June 6, 2021.
Observations:
- The fitted values generated by the model follow the actual death rates closely, with slight deviations from the actual data. This could be a result of the model generalizing from the training data rather than overfitting it.
- The forecast values vary across counties, which is expected, as a separate model is fit for each county.
- The confidence intervals generated by this model are much wider than those of the exponential smoothing model, making it harder to visually compare the fitted and actual data.
- The increasing width of the confidence intervals shows greater uncertainty in the forecast as the forecast horizon increases.
The forecast RMSE measures how much the fitted values from the model differ from the raw time series values. It only measures how well the forest-based forecast model fits the raw time series values; it does not measure how well the model actually forecasts future values. That question is addressed by the validation model. Here, the forecast model uses data from June 7, 2020 - March 14, 2021 for fitting, and the 12 time steps from March 21, 2021 - June 6, 2021 are forecasted.
The validation model is used to determine how well the forecast model can forecast the future values of each time series. It is constructed by excluding some of the final time steps of each time series and fitting the forest-based forecast model to the data that was not excluded. This model is then used to forecast the values of the withheld data, and the forecasted values are compared to the raw values that were hidden. Since we are using 6 time steps for validation, the model uses data from June 7, 2020 - Jan 31, 2021 for training and data from Jan 31, 2021 - March 14, 2021 for validation to calculate this error.
You can read more about the two errors in the tool documentation [7]. Let's map the forecast and validation errors.
# Search for layer
fbf_item = gis.content.search('FBF_BigLayer', 'Feature Layer')
fbf_lyr = fbf_item[0].layers[0]
fbf_lyr
# Create map
fbf_m1 = gis.map('US', zoomlevel=4)
fbf_m1
The map shows the variation in validation errors generated by the Forest-based Forecast model for each county. The validation errors vary widely across counties, with values ranging from 7.8 - 137.
# Plot Data
fbf_m1.remove_layers()
fbf_m1.add_layer(fbf_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"V_RMSE",
"opacity":0.7
})
# Take screenshot
fbf_m1.legend=True
fbf_m1.take_screenshot(output_in_cell=False)
# Create map
fbf_m2 = gis.map('US', zoomlevel=4)
fbf_m2
The map shows the variation in forecast errors generated by the Forest-based Forecast model for each county. The forecast errors vary widely across counties, with values ranging from 5.5 - 66.
# Plot Data
fbf_m2.remove_layers()
fbf_m2.add_layer(fbf_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"F_RMSE",
"opacity":0.7
})
# Add layer and take screenshot
fbf_m2.legend=True
fbf_m2.take_screenshot(output_in_cell=False)
Let's identify the counties with high variance in forecast and validation errors.
high_var = cube3D_FBFsub_df_resDeaths.query(
'f_rmse > 5.5 and v_rmse > 7.8').groupby(
['state', 'county']).count()[['f_rmse', 'v_rmse']]
high_var
From the results, we can see that there are quite a few counties for which the variation in error is in the highest band for both validation and forecast errors.
Now that we have built the Forest-based Forecast model and generated forecasts for our time series data, let's combine the results from the models built above.
In this section, we will combine the results generated from the Exponential Smoothing and Forest-based forecast models. The combined results will be used later to generate plots overlaying various models.
# Check ES df
display(cube3D_ESsub_df_resDeaths.columns)
display(cube3D_ESsub_df_resDeaths.shape)
# Check FB df
display(cube3D_FBFsub_df_resDeaths.columns)
display(cube3D_FBFsub_df_resDeaths.shape)
Here, we will merge the Exponential Smoothing and Forest-based model results.
# Merge on fips and dates
es_fb_df = pd.merge(cube3D_ESsub_df_resDeaths, cube3D_FBFsub_df_resDeaths,
on=['fips', 'start_date', 'end_date'])
# Check merge results
display(es_fb_df.columns)
display(es_fb_df.shape)
display(es_fb_df.head())
Next, we will clean the data and keep the necessary columns.
# Rename Columns
cols_to_rename = {'objectid_x':'OBJECTID',
'SHAPE_x':'SHAPE',
'county_x':'county',
'state_x':'state'}
es_fb_df.rename(columns=cols_to_rename, inplace=True)
# Drop columns
cols_to_drop = ['objectid_y','SHAPE_y','state_y','county_y','value_x','value_y']
es_fb_df.drop(columns = cols_to_drop,
axis=1, inplace=True)
# Check names
display(es_fb_df.shape)
display(es_fb_df.columns)
# Replace x and y with ES, FB
es_fb_df.columns = es_fb_df.columns.str.replace('_x', '_ES')
es_fb_df.columns = es_fb_df.columns.str.replace('_y', '_FB')
# Rename some ES cols
cols_to_rename = {'level':'level_ES', 'trend':'trend_ES',
'leveltrend':'leveltrend_ES', 'season':'season_ES'}
es_fb_df = es_fb_df.rename(columns = cols_to_rename)
display(es_fb_df.shape)
display(es_fb_df.columns)
We will publish the combined model results to a feature layer for ease of access and use.
%%time
es_fb_item = es_fb_df.spatial.to_featurelayer('ES_FB_BigLayer')
es_fb_item
In this notebook we:
In the next notebook, we will generate forecasts using the ARIMA (Autoregressive Integrated Moving Average) model and create plots overlaying fitted and forecasted values from all of the models we have tested.
[1] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/createcubefromdefinedlocations.htm
[2] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/time-series-clustering.htm
[3] Sokol, J. (2018). Module 7: Time Series Models [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6501-intro-analytics-modeling
[4] https://online.stat.psu.edu/stat501/node/1001
[5] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/exponentialsmoothingforecast.htm
[6] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/visualizecube3d.htm
[7] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/forestbasedforecast.htm
[8] https://www.history.com/this-day-in-history/first-confirmed-case-of-coronavirus-found-in-us-washington-state
[9] https://www.nytimes.com/interactive/2020/us/coronavirus-nursing-homes.html
[10] https://www.gao.gov/products/gao-21-367
[11] https://www.nytimes.com/2021/08/04/health/nursing-homes-vaccine-delta-covid.html