- 👟 Ready To Run!
- 🚀 Data Exploration and Cleaning
- 🤖 Machine Learning
- ⏱️ Time Series
- ⚕️ Healthcare
Requirements:
- 🗄️ ArcGIS Notebooks Advanced Runtime
The Coronavirus disease 2019 (Covid-19), which would spark a pandemic, was first reported in China on December 31, 2019. In January 2020, the first case of Covid-19 in the United States was reported near Seattle, WA. In the following months, the Seattle area became the epicenter of an early U.S. outbreak: 39 residents of a nursing home in Kirkland, WA died from complications of the virus in a single four-week span [8].
The Covid-19 pandemic has swept through long-term care facilities in the United States, which account for nearly 4% of all U.S. cases (1.4M) and 31% of all deaths (182K) as of June 2021 [9]. The impact on nursing homes is even more severe, where nearly 1 in 10 residents has died. According to a study [10], 94% of nursing homes reviewed experienced more than one Covid-19 outbreak, and 85% had an outbreak lasting 5 or more weeks. As of August 2021, case counts have ticked up again, and with vaccinated elderly residents susceptible to breakthrough infections, nursing homes confront new Covid-19 outbreaks [11].
People living in nursing homes are generally 60 years and older, with underlying health conditions such as cardiovascular disease, chronic respiratory disease, cerebrovascular disease, malignancy, and dementia. They represent a vulnerable population that should benefit from additional protective measures against infection. The communal nature of nursing homes, along with these comorbidities, puts residents at increased risk of Covid-19 progression, severe outcomes, and death.
The ability to predict the progress of the pandemic has been crucial for decision making aimed at controlling its spread. This presents an opportunity to use advanced data analysis, visualization, and modeling techniques (both open source and ArcGIS based) to find trends that provide early warning of rising death counts and help prevent future increases.
In this 2-part study, we will:
In this notebook, we will:
In the subsequent notebook, we will:
The Centers for Medicare and Medicaid Services (CMS) provides a Nursing Home Covid-19 Public File. This file includes data reported by nursing homes to the CDC’s National Healthcare Safety Network (NHSN) system’s Covid-19 Long Term Care Facility Module. The data includes information about Resident Impact, Facility Capacity, Staff & Personnel, Personal Protective Equipment, Ventilator Capacity, and other Supplies information.
Dataset: https://data.cms.gov/covid-19/covid-19-nursing-home-data
Each record in the data represents information about an individual nursing home in the United States for each week. For our analysis, we will use data ranging from June 7, 2020 - June 6, 2021. This data has been cleaned, feature engineered, and made available as published feature services. To learn more about the raw data and the data cleaning performed to create these services, review the following notebook.
# Import Libraries
# Display
from IPython.display import display
from ipywidgets import *
# ArcGIS and Arcpy
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap
import arcpy
# Data Exploration
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)
# Plotting Libraries
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import dates
import seaborn as sns
plt.style.use('ggplot')
# Datetime and os
from datetime import datetime
import time
import os
import itertools
# Stop warnings
import warnings
warnings.filterwarnings("ignore")
# Create a GIS connection
gis = GIS("home")
The dataset we have is space-time data, where each record represents information about an individual nursing home in the United States for each week, ranging from June 7, 2020 - June 6, 2021. For exploration purposes, we will only consider the following columns:
- weekly resident cases per 1000 residents
- weekly resident deaths per 1000 residents
There are many ways to explore this data, but in this notebook, we will explore it in two particular ways. We will aggregate the data by state to:
In this section, we will:
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
ts_item = gis.content.search(f'NH_TS_US_Cleaned_0621 {group_query}',
item_type = "Feature Service",
outside_org=True)
ts_item
# Get data layer
ts_layer = ts_item[1].layers[0]
ts_layer
# View field names and their aliases in the layer
for f in ts_layer.properties.fields:
    print(f['name'], "{0:>35}".format(f['alias']))
# Check records in data
ts_layer.query(return_count_only=True)
From the results above, we can see that our dataset has ~770k records and 16 columns. Each record includes information about the nursing home staff, resident impacts, personal protective equipment and other supplies, occupied beds, etc.
Here, we will query and group the data.
%%time
# Query Data
subdf = ts_layer.query(as_df=True,
out_fields=['county','provider_state',
'week_ending','weekly_resident_confirmed_covid',
'weekly_resident_covid19_deaths_'])
subdf.shape
We will group the data by county and state to get the average weekly resident cases and deaths per 1000 residents for each state and county, for each time period.
# Group data by state for each time period
exp_df1 = subdf.groupby(['week_ending','provider_state','county']).aggregate('mean').copy()
exp_df1.reset_index(inplace=True)
exp_df1.drop('objectid', inplace=True, axis=1)
exp_df1.rename(columns={'weekly_resident_confirmed_covid':'resident_cases',
'weekly_resident_covid19_deaths_':'resident_deaths'}, inplace=True)
exp_df1.shape
# Check head
exp_df1.head()
Now that we have aggregated the data by county and state and subsetted it to include only specific columns, the dataset has 149,267 records and 5 columns. Each record represents the average weekly resident cases and deaths per 1000 residents for a given county and state, for each one-week time period.
We can further aggregate this data by state to get average weekly resident cases and deaths per 1000 residents by state. This data is combined for all time periods.
# Group data by state
exp_df2 = exp_df1.groupby(['provider_state']).aggregate('mean').copy()
exp_df2.reset_index(inplace=True)
exp_df2.shape
# Check head
exp_df2.head()
This subset represents the average weekly resident cases and deaths per 1000 residents by state over the entire time period. For example, across all nursing homes in AZ over the entire time period, on average 3 residents out of 1000 died per week due to Covid-19, and 13 out of 1000 residents tested positive for Covid-19 per week.
Let's plot the data using the two subsets of data created above.
# Plot
fig, (ax1, ax2) = plt.subplots(2, figsize=(20,15))
# Plot cases
sns.barplot(x='provider_state', y='resident_cases',
data=exp_df2.sort_values(by='resident_cases', ascending=False),
ax=ax1, color='lightcoral')
# Plot deaths
sns.barplot(x='provider_state', y='resident_deaths',
data=exp_df2.sort_values(by='resident_deaths', ascending=False),
ax=ax2, color='lightcoral')
# Set label and title
ax1.set(ylabel='Resident case rate', xlabel= "States",
title="Average weekly nursing home resident case rates by state")
ax2.set(ylabel='Resident death rate', xlabel= "States",
title="Average weekly nursing home resident death rates by state")
plt.rc('axes', labelsize=20)
plt.rc('axes', titlesize=15)
plt.rc('xtick', labelsize=10)
plt.show();
The plots show:
- average weekly nursing home resident case rates by state.
- average weekly nursing home resident death rates by state.
Observations:
- the average weekly resident case rate seems to be highest for Arkansas, followed by Arizona and Tennessee.
- the average weekly resident death rate is highest for Indiana, followed by Arizona and North Dakota.
Let's look more closely at how the average weekly resident death rates for Indiana, Arizona, and North Dakota changed over time.
Here, we plot the time series of average weekly resident deaths per 1000 residents for Indiana, Arizona, and North Dakota across different time periods.
fig, ax = plt.subplots(1,1, figsize=(20,10))
states = ['AZ','IN','ND']
for state in states:
    sns.lineplot(x='week_ending', y='resident_deaths',
                 data=exp_df1.query('provider_state==@state'),
                 marker='o', linewidth=1.8, ci=None, label=state)
ax.autoscale(axis='x')
plt.legend(loc='upper left', prop={'size': 15})
ax.set(ylabel='Resident death rate', xlabel= "")
ax.xaxis.set_major_locator(dates.MonthLocator(interval=1))
plt.title("Average weekly nursing home resident death rates", fontsize=20)
plt.gcf().autofmt_xdate()
plt.show();
The plot shows average weekly resident death rates for Indiana, Arizona, and North Dakota.
Observations:
- Arizona saw a higher resident mortality rate early on, compared to other states. The death rate increased steadily until August'20, followed by a decreasing trend until mid Sept.'20 and then no deaths until mid Oct.'20. The period from mid Oct.'20 - mid Dec.'20 saw a consistent increase in the death rate, followed by a decreasing trend until June'21. The state saw its highest death rates in early Aug.'20 and mid Dec.'20.
- Indiana reported a consistently low death rate until Oct'20, with a minor spike for a month around Aug.'20. From Oct.'20 - mid Dec.'20, there was an increase in the death rate, followed by a decreasing trend until June'21. From mid Dec.'20 onwards, Indiana reported a higher death rate than Arizona and North Dakota.
- North Dakota started very well early on, with almost no deaths until Sept.'20. There was a steep rise in the death rate in mid Oct.'20, with an average of 25 deaths per 1000 residents during the 2nd week of Oct.'20. While this trend seemed to settle down the next week, there was another increasing trend in the death rate from early Nov.'20 - early Dec.'20. From mid Dec.'20 onwards, there was a decreasing trend through June'21.
- All 3 states saw a decrease in resident death rates starting in mid Dec.'20 and almost no deaths starting in March'21. This trend is consistent with overall Covid-19 death rates across the United States: most states saw an increase in case and death rates during Dec.'20 - Jan.'21, and with the introduction of vaccination programs in early Jan.'21, deaths declined to the point that almost none were reported from April'21 onwards.
We will begin our analysis by creating a space time cube using the Create Space Time Cube From Defined Locations [1] tool. The data structure this tool creates can be thought of as a three-dimensional cube made up of space-time bins, with the x and y dimensions representing space and the t dimension representing time. Every bin has a fixed position in space (either an x,y location for points or a fixed set of vertices for polygon locations) and in time (t).
This tool structures the data into a netCDF data format by creating space-time bins. The netCDF output is then used as input to other time-series algorithms in our toolbox.
In this section, we will:
We will use data from June 7, 2020 - March 14, 2021 to build the space time cube. Data from March 21, 2021 - June 6, 2021 (12 time steps) has been excluded and will be used for forecasting.
Before we create the space time cube, we will:
Here, we will create the file geodatabase and setup the data directory to output our results to.
# Set data directory to output results
home_dir = os.path.join(os.getcwd(), 'home')
print(f"home dir: {home_dir}")
# Create empty FGDB in data directory
if not os.path.isdir(os.path.join(home_dir, 'Results.gdb')):
    arcpy.CreateFileGDB_management(home_dir, 'Results.gdb')
print('File Geodatabase exists')
# Setup Workspace
arcpy.env.workspace = os.path.join(home_dir,'Results.gdb')
arcpy.env.workspace
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
ts_sub_item = gis.content.search(f'Subset_NH_TS_US_Cleaned_0621 {group_query}',
item_type = "Feature Service",
outside_org=True)
ts_sub_item
# Get data layer
ts_sub_layer = ts_sub_item[0].layers[0]
ts_sub_layer
%%time
# Query as dataframe
ts_sub_df = ts_sub_layer.query(as_df=True)
# Check details
display(ts_sub_df.shape)
display(ts_sub_df.columns)
display(ts_sub_df.head())
From the results above, we can see that our dataset has ~596k records and 17 columns. The columns include information about resident impact, staff & personnel, personal protective equipment and other supplies.
%%time
# Store the data locally
point_lyr = ts_sub_df.spatial.to_featureclass(os.path.join(arcpy.env.workspace,'Subset_NH_TS_US_Cleaned_0621'))
point_lyr
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
polygon_item = gis.content.search(f'NH_TS_US_FIPS_layer {group_query}',
item_type = "Feature Service",
outside_org=True)
polygon_item
# Get published FIPS layer as polygon layer
polygon_layer = polygon_item[0].layers[0]
polygon_layer
# Check layer details
fips_df = polygon_layer.query(as_df=True)
display(fips_df.shape)
display(fips_df.head())
Here, we will build a space time cube. We will aggregate weekly resident deaths (point_lyr) at the US county geographic level for each time period by using aggregation_shape_type = "DEFINED_LOCATIONS" and passing the FIPS layer (polygon_layer) as defined_polygon_locations.
# Run Create Space Time Cube From Defined Locations
from arcpy.stpm import CreateSpaceTimeCube
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
try:
    # Define model variables
    in_features = point_lyr
    sptm_cube_sub = "usCubeNBLatest.nc" #"fcCubeNB.nc"
    time_field = 'week_ending'
    template_cube = None
    time_step_interval = "1 Weeks"
    time_step_alignment = "END_TIME"
    reference_time = None
    distance_interval = None
    summary_fields = "weekly_resident_covid19_deaths_ MEAN SPACE_TIME_NEIGHBORS;"
    aggregation_shape_type = "DEFINED_LOCATIONS"
    defined_polygon_locations = polygon_layer.url
    location_id = "fips"
    CreateSpaceTimeCube(in_features, sptm_cube_sub, time_field, template_cube, time_step_interval,
                        time_step_alignment, reference_time, distance_interval, summary_fields,
                        aggregation_shape_type, defined_polygon_locations, location_id)
    print(arcpy.GetMessages())
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
From the messages generated, we can see that:
- 3895 of 595650 input features could not be aggregated into the Defined Polygon locations. This represents 0.6 % of the total data, and we will ignore these data points for the analysis.
- 189 out of 3142 counties (Defined Locations) were excluded from the Space Time Cube due to the presence of bins that could not be estimated. These 189 counties didn't have enough data for estimation and will not be included in the analysis.
- Overall across the United States, there is a statistically significant increasing trend for average weekly resident deaths per 1000 residents.
Space Time Cube structures the data into a netCDF data format (.nc) by creating space-time bins. We can now use the netCDF output as input to other time-series algorithms in our toolbox.
Now that we have a Space Time Cube, let's try to understand some patterns in our space-time data of resident deaths per 1000 residents. Clustering allows us to understand the data better by identifying which features are more similar or dissimilar to each other. The Time Series Clustering [2] tool identifies the locations in a space-time cube that are most similar and partitions them into distinct clusters in which members of each cluster have similar time series characteristics.
The tool partitions a collection of time series, stored in a space-time cube, based on the similarity of time series characteristics. Time series can be clustered based on three criteria: having similar values across time, tending to increase and decrease at the same time, and having similar repeating patterns.
In this section, we will:
We will use the data from June 7, 2020 - March 14, 2021
for clustering. Data from March 21, 2021 - June 6, 2021
(12 time steps) has been excluded from clustering and will be used for forecasting.
Here, we use the Profile (Correlation) option for the characteristic_of_interest parameter, which clusters time series that tend to stay in proportion to one another and increase and decrease in value at the same time. This option measures the similarity of time series based on their statistical correlation across time.
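To build intuition for this option, below is a minimal sketch (not the tool's internal implementation) of a correlation-based dissimilarity between two hypothetical weekly series: series that rise and fall together have a correlation near 1 and a dissimilarity near 0, so they are more likely to land in the same cluster.
# Conceptual sketch only: correlation-based dissimilarity between two series
# (hypothetical values; the Time Series Clustering tool handles this internally)
import numpy as np
ts_a = np.array([1.0, 2.0, 4.0, 8.0, 6.0, 3.0])    # hypothetical county A weekly death rates
ts_b = np.array([2.1, 4.0, 7.9, 16.2, 12.1, 6.0])  # county B, roughly proportional to A
corr = np.corrcoef(ts_a, ts_b)[0, 1]               # Pearson correlation across time
dissimilarity = 1 - corr                           # near 0 -> similar profiles
print(round(corr, 3), round(dissimilarity, 3))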
from arcpy.stpm import TimeSeriesClustering
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
try:
    in_cube = os.path.join(home_dir, sptm_cube_sub)
    analysis_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
    output_features = "usNB_resDeaths_TSClustering"
    characteristic_of_interest = "PROFILE"
    cluster_count = None
    output_table_for_charts = "usNB_resDeaths_TSClustChart"
    shape_characteristic_to_ignore = None
    enable_time_series_popups = "NO_POPUP"
    TimeSeriesClustering(in_cube, analysis_variable, output_features, characteristic_of_interest,
                         cluster_count, output_table_for_charts, shape_characteristic_to_ignore,
                         enable_time_series_popups)
    print(arcpy.GetMessages())
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
The messages generated show:
- the optimal number of clusters determined by the tool is 4.
- the trend statistic for the average time series in each cluster shows a statistically significant trend in 3 of the 4 clusters.
- the number of locations (counties) assigned to each cluster.
# Access clustering results into a SeDF
tsc_resDeaths_df = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
tsc_resDeaths_df.head()
# Check shape
tsc_resDeaths_df.shape
# Create a feature service on portal
tsc_layer = tsc_resDeaths_df.spatial.to_featurelayer('NH_TS_Clustering')
tsc_layer
As time series clustering does not return a FIPS code, we will use a spatial join to join the clustering results with the FIPS layer.
# Join Fips and Clustering output layers
from arcgis.features.analysis import join_features
tsc_resDeaths_item = join_features(target_layer=polygon_layer.url,
join_layer=tsc_layer,
spatial_relationship='identicalto',
join_operation='JoinOneToOne',
output_name='NH_TS_Join')
#Check item
tsc_resDeaths_item = gis.content.search('title: NH_TS_Join', 'Feature Layer')
tsc_resDeaths_item
# Get join results in a data frame
tsc_resDeaths_df = tsc_resDeaths_item[0].layers[0].query(as_df=True)
tsc_resDeaths_df.head()
Now that we have cluster information along with their location, we will:
Here, we will visualize the number of counties in each cluster.
# Create plot
sns.set(font_scale = 1)
sns.countplot(x='cluster_id', data=tsc_resDeaths_df, color='lightcoral')
plt.title('No. of counties in each cluster');
The plot shows the count of counties within each cluster.
Here, we will plot the clusters on a map to visualize which counties are associated to which cluster.
# Create Map
tsc_resDeaths_map = gis.map('USA', 4)
tsc_resDeaths_map
The plot shows counties colored by cluster id.
# Add layer
tsc_resDeaths_map.remove_layers()
tsc_resDeaths_df.spatial.plot(map_widget=tsc_resDeaths_map,
renderer_type='u', # for unique renderer
col='cluster_id',
line_width=.5)
# Add legend and take screenshot
tsc_resDeaths_map.legend = True
tsc_resDeaths_map.take_screenshot(output_in_cell=False)
The Average Time Series per Cluster chart displays the average of the Analysis Variable at each time step for each cluster.
# Access clustering data from local
tsc_resDeaths_df2 = pd.DataFrame.spatial.from_table(os.path.join(arcpy.env.workspace,output_table_for_charts))
tsc_resDeaths_df2.head()
# Plot time series per cluster
plt.figure(figsize=(20,10))
sns.lineplot(x='END_DATE', y='CLUST_MEAN', data=tsc_resDeaths_df2,
hue='CLUSTER_ID', marker='o', palette='Set2', linewidth=2.5)
plt.title('Average Time Series per Cluster', fontdict={'fontsize':20})
plt.legend(loc='upper left', title='CLUSTER_ID', prop={'size': 13})
plt.xticks(rotation=90);
The plot shows the average time series for counties in each cluster.
Observations:
- The plot shows different trends for three clusters but no trend for one cluster.
- The average time series for cluster 2 shows an increase in average weekly resident deaths per 1000 residents early on, from mid June'20 - mid Aug.'20. The increase is followed by a decline until early Oct.'20. From Oct.'20 - Mar.'21, there is a slight increasing trend, however the death rate remains considerably lower.
- The average time series for cluster 3 shows an increase in average weekly resident deaths per 1000 residents starting mid Sept.'20, with a steep rise from early Oct.'20 - early Dec.'20. The increase is followed by a steep decline until early Jan.'21 and then a consistent decline until Mar.'21.
- The average time series for cluster 4 shows an increase in average weekly resident deaths per 1000 residents starting Nov.'20, with a steep rise from late Nov.'20 - early Jan.'21. The increase is followed by a steep decline until late Feb.'21 and then a consistent decline until Mar.'21.
- The final cluster, cluster 1, appears as a flat line with no trend. As seen in the messages generated by the model, this cluster does not have a statistically significant trend. There are 248 counties in this cluster.
Here, we use simple group and query operations to identify counties and states that belong to a particular cluster. We will start by looking at the composition of cluster #3.
# Top 5 states with highest no. of counties in a particular cluster
tsc_resDeaths_df.query('(cluster_id==3)')[['county','state']] \
.groupby('state').count() \
.sort_values(by ='county',ascending=False).head()
# Identify counties for a cluster id and state
tsc_resDeaths_df.query('(cluster_id==3) and (state == "CA")')[['cluster_id','county','state','fips']]
# For each cluster, identify states with highest number of counties
g_df = tsc_resDeaths_df.loc[:,['cluster_id','state','county']] \
.groupby(['cluster_id','state']).count() \
.sort_values(by='county', ascending=False) \
.groupby(level=0).head(1)
g_df
Now that we have explored patterns in our time series data, let's get started with building some models to forecast average weekly resident deaths per 1000 residents.
Exponential smoothing methods [3] are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, more recent observations are weighted higher than past observations. There are three types of exponential smoothing models:
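- Single (simple) exponential smoothing, which smooths only the level of the series.
- Double exponential smoothing (Holt's method), which adds a trend component.
- Triple exponential smoothing (Holt-Winters), which adds both trend and seasonal components.
As a reference, one common additive (Holt-Winters) form of the smoothing equations, written with the notation used below and with $y_{t}$ denoting the observed value at time t, is:
$$L_{t} = \alpha\,(y_{t} - S_{t-p}) + (1 - \alpha)\,(L_{t-1} + T_{t-1})$$
$$T_{t} = \beta\,(L_{t} - L_{t-1}) + (1 - \beta)\,T_{t-1}$$
$$S_{t} = \delta\,(y_{t} - L_{t}) + (1 - \delta)\,S_{t-p}$$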
where $L_{t}$ is the level at time t, $\alpha$ is the weight (or smoothing constant) for the level. $T_{t}$ is the trend at time t, $\beta$ is the weight (or smoothing constant) for the trend. Also, $S_{t}$ is the seasonal component at time t, $\delta$ is the weight (or smoothing constant) for the seasonal component, and p is the seasonal period.
The Exponential Smoothing Forecast [5] tool forecasts the values of each location of a space-time cube using the Holt-Winters exponential smoothing method by decomposing the time series at each location of the cube into seasonal and trend components. The exponential smoothing model assumes that all components are additive and linear. Damped trend is always used, and additive seasonality is supported but not required. Confidence intervals are constructed under the assumption that the residuals are additive and normally distributed.
In this section, we will:
We will use data from June 7, 2020 - March 14, 2021 to train and validate the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.
# Run Algorithm
from arcpy.stpm import ExponentialSmoothingForecast
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
try:
    in_cube = os.path.join(home_dir, sptm_cube_sub)
    analysis_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
    output_features = "usNB_resDeaths_ESFsub"
    out_cube = os.path.join(home_dir, 'usNB_resDeaths_ESFsub_cube.nc')
    # out_cube = None
    number_of_time_steps_to_forecast = 12
    season_length = None
    number_for_validation = 6
    outlier_option = None
    level_of_confidence = "95%"
    maximum_number_of_outliers = None
    ExponentialSmoothingForecast(in_cube, analysis_variable, output_features, out_cube,
                                 number_of_time_steps_to_forecast, season_length, number_for_validation,
                                 outlier_option, level_of_confidence, maximum_number_of_outliers)
    print(arcpy.GetMessages())
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
# Access model data from local
esf_resDeaths_subdf = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace, output_features))
esf_resDeaths_subdf.head()
When forecasting time series data, the idea is to estimate how the sequence of observations will continue into the future. Using the model just created, we will generate a forecast for 12 time periods, i.e., from 2021-03-21 - 2021-06-06.
To create forecast plots, we need the raw, fitted, and forecasted values. The model generates a feature class with forecasted values only, and the raw point data (point_lyr) is not aggregated by county. Let's create a space time cube in 3D using the Visualize Space Time Cube in 3D [6] tool to get the fitted values.
from arcpy.stpm import VisualizeSpaceTimeCube3D
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
in_cube = out_cube
cube_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
display_theme = "FORECAST_RESULTS"
output_features = "usNB_resDeaths_ESFsub_3D"
VisualizeSpaceTimeCube3D(in_cube, cube_variable, display_theme, output_features)
# Access cube data from local
cube3D_ESsub_df_resDeaths = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
cube3D_ESsub_df_resDeaths.head()
# Check shape
cube3D_ESsub_df_resDeaths.shape
# Merge with FIPS data
fips_df.columns = fips_df.columns.str.upper()
cube3D_ESsub_df_resDeaths = pd.merge(cube3D_ESsub_df_resDeaths, fips_df, on='FIPS')
cube3D_ESsub_df_resDeaths.shape
Here, we clean the data and keep only necessary columns.
# Rename Columns
cols_to_rename = {'OBJECTID_y':'OBJECTID',
'SHAPE_y':'SHAPE'}
cube3D_ESsub_df_resDeaths.rename(columns=cols_to_rename, inplace=True)
# Drop columns
cols_to_drop = ['OBJECTID_x','TIME_STEP','TIME_EXAG','ELEMENT','LOCATION','SHAPE_x']
cube3D_ESsub_df_resDeaths.drop(columns = cols_to_drop,
axis=1, inplace=True)
# Check data
display(cube3D_ESsub_df_resDeaths.shape)
display(cube3D_ESsub_df_resDeaths.columns)
Here, we will combine the errors generated by the Exponential Smoothing model with the fitted values from the space-time cube cube3D_ESsub_df_resDeaths.
# Get the forecast and validation RMSE values from the ES results df
esf_small = esf_resDeaths_subdf[['FIPS','F_RMSE', 'V_RMSE']].copy()
esf_small.shape
# Merge with 3D cube data
cube3D_ESsub_df_resDeaths = pd.merge(cube3D_ESsub_df_resDeaths, esf_small, on='FIPS')
cube3D_ESsub_df_resDeaths.shape
Next, we will publish the combined model results to a feature layer for ease of access and use.
%%time
# Publish layer
cube3D_ESsub_df_resDeaths.spatial.to_featurelayer('ES_Results')
Here, we will create plots that show the raw, fitted, and forecasted average weekly nursing home resident deaths per 1000 residents due to Covid-19 for various counties. The fitted and raw values are for the time period June 7, 2020 - March 14, 2021, whereas the forecasted values are for the time period March 21, 2021 - June 6, 2021.
Next, we will subset the data to keep only the necessary columns.
# Get column names
cube3D_ESsub_df_resDeaths.columns
# Keep necessary columns
resDeaths_cube_ESsub_df = cube3D_ESsub_df_resDeaths[['end_date','fips','county','state','value',
'fitted','high','low','level','trend','SHAPE']].copy()
resDeaths_cube_ESsub_df.set_index('end_date', inplace=True)
resDeaths_cube_ESsub_df.shape
# Check head of data
resDeaths_cube_ESsub_df.head()
Here, we will create plots that show the raw, fitted, and forecasted values of average weekly resident deaths per 1000 residents for various counties.
Let's define a function, ts_plot, that will:
We can then use this function to generate plots for one or more counties.
# Define function to plot data for one or multiple counties
def ts_plot(data, fips: list, ylabel: str, title: str):
    if len(fips) == 1:
        fig = plt.figure(figsize=(18, 12))
    else:
        fig = plt.figure(figsize=(20, 20))
    # Create grid
    for sp in range(0, len(fips)):
        if len(fips) == 1:
            ax = fig
        elif len(fips) % 2 == 0:
            ax = fig.add_subplot(len(fips) // 2, 2, sp + 1)
        else:
            ax = fig.add_subplot(len(fips) // 2 + 1, 2, sp + 1)
        # Subset data for county
        plot_df = data[data['fips'] == fips[sp]]
        # Plot raw, fitted and forecast values
        ax = plot_df.loc[:'2021-03-14', 'value'].plot(legend=True,
                                                      label='Raw Values', marker='o', markersize=10, linewidth=2)
        ax = plot_df['fitted'].plot(legend=True, label='Fitted Values', marker='o',
                                    markersize=10, linewidth=2)
        # Plot confidence interval
        plt.fill_between(plot_df['fitted'][-12:].index,
                         plot_df.loc['2021-03-21':, 'low'],
                         plot_df.loc['2021-03-21':, 'high'],
                         color='r', alpha=0.1, label='Confidence Interval')
        # Plot forecast start line
        plt.vlines(plot_df.index[-12],
                   ymin=plot_df['low'].min(),
                   ymax=plot_df['high'].max(),
                   color='grey', linewidth=1,
                   linestyle="--", label='Forecast results start here')
        # Label, title and legend
        ax.set(ylabel=ylabel, xlabel='')
        ax.set_title(plot_df['county'].iloc[0] + ', ' + plot_df['state'].iloc[0], fontsize=20)
        plt.legend(loc='upper left', prop={'size': 12})
        # Remove spines
        for key, spine in ax.spines.items():
            spine.set_visible(False)
        ax.margins(0.01, 0)
    fig.suptitle(title, fontsize=25, y=1.0)
    plt.rc('axes', labelsize=15)
    plt.rc('xtick', labelsize=15)
    plt.rc('ytick', labelsize=15)
    plt.tight_layout()  # automatically adjusts layout to fit long labels
    plt.show();
Since Indiana, Arizona, and Pennsylvania were among the top 5 states with the highest average weekly resident deaths per 1000 residents, we will pick one county from each of these states, each belonging to a different cluster, and plot them.
| County, State | FIPS | Cluster Id |
|---|---|---|
| Maricopa, AZ | 4013 | 2 |
| Delaware, PA | 42045 | 3 |
| Porter, IN | 18127 | 4 |
# Create Plot
fips = [4013,42045,18127]
ts_plot(resDeaths_cube_ESsub_df,fips,
ylabel='Resident death rate',
title='Average weekly resident death rates')
The plots show:
- the fitted values as generated by the Exponential Smoothing model for June 7, 2020 - March 14, 2021.
- the forecasted values generated by the model for March 21, 2021 - June 6, 2021.
- the actual resident death rates for June 7, 2020 - March 14, 2021.
- the confidence interval of the forecast for March 21, 2021 - June 6, 2021.
Observations:
- The fitted values generated by the exponential smoothing model follow the actual death rates fairly closely for Maricopa County, AZ and Delaware County, PA. For Porter County, IN, the fitted values deviate quite a bit from the raw values. This could be a result of the model generalizing from the training data rather than overfitting it.
- The forecast values vary across counties, which is expected, as a separate model is fit for each county.
- Looking closely at the plot for Porter County, IN, the confidence interval appears narrower than the others, but the resident death rate in Porter County is also higher than in the other counties. As such, the absolute range of the confidence interval in Porter County could be similar to, or even larger than, that of the other counties.
- The increasing width of the confidence intervals shows greater uncertainty in the forecast as the forecast horizon increases.
The forecast RMSE measures how much the fitted values from the model differ from the raw time series values. It only measures how well the exponential smoothing model fits the raw time series values; it does not measure how well the model actually forecasts future values. That question is addressed by the validation model. Here, the forecast model uses data from June 7, 2020 - March 14, 2021 for fitting, and the 12 time steps from March 21, 2021 - June 6, 2021 are forecasted.
The validation model is used to determine how well the forecast model can forecast the future values of each time series. It is constructed by excluding some of the final time steps of each time series and fitting the exponential smoothing model to the data that was not excluded. This model is then used to forecast the values of the withheld data, and the forecasted values are compared to the raw values that were hidden. Since we are using 6 time steps for validation, the model uses data from June 7, 2020 - Jan 31, 2021 for training and data from Jan 31, 2021 - March 14, 2021 for validation to calculate this error.
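Schematically, both errors are root mean square errors of the model's estimates against the raw values. With $\hat{y}_{t}$ the fitted or forecasted value, $y_{t}$ the raw value, T the number of fitted time steps, and m = 6 the number of withheld validation steps, they can be written as:
$$\text{Forecast RMSE} = \sqrt{\tfrac{1}{T}\sum_{t=1}^{T}(\hat{y}_{t} - y_{t})^{2}}, \qquad \text{Validation RMSE} = \sqrt{\tfrac{1}{m}\sum_{t=T-m+1}^{T}(\hat{y}_{t} - y_{t})^{2}}$$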
You can read more about the two errors in the tool documentation [5]. Let's map the forecast and validation errors.
# Search for layer
es_item = gis.content.search('ES_Results', 'Feature Layer')
es_lyr = es_item[0].layers[0]
es_lyr
# Create map
m1= gis.map('US', zoomlevel=4)
m1
The map shows the variation in validation errors generated by the Exponential Smoothing model for each county. The validation errors vary widely across counties, with values ranging from 9.8 - 389.
# Plot data
m1.add_layer(es_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"V_RMSE",
"opacity":0.7
})
# Add legend and screenshot
m1.legend = True
m1.take_screenshot(output_in_cell=False)
# Create map
m2 = gis.map('US', zoomlevel=4)
m2
The map shows the variation in forecast errors generated by the Exponential Smoothing model for each county. The forecast errors vary widely across counties, with values ranging from 9.8 - 125.
# Plot data
m2.add_layer(es_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"F_RMSE",
"opacity":0.7
})
# Add legend and screenshot
m2.legend = True
m2.take_screenshot(output_in_cell=False)
Let's identify the counties with high variance in forecast and validation errors.
high_var = cube3D_ESsub_df_resDeaths.query(
'f_rmse > 9.8 and v_rmse > 9.8').groupby(
['state', 'county']).aggregate('count')[['f_rmse', 'v_rmse']]
high_var
From the results, we can see that there are 292 counties for which the variation in error is in the highest band for both validation and forecast errors.
Now that we have built the Exponential Smoothing forecast model and generated forecasts for our time series data, let's look into the Forest-based Forecast model to forecast average weekly resident deaths per 1000 residents.
The Forest-based Forecast tool [7] forecasts the values of each location of a space-time cube using an adaptation of Leo Breiman's random forest algorithm. This tool uses the same underlying algorithm as the Forest-based Classification and Regression tool when used for regression. The training data used to build the forest regression model is constructed using time windows on each location of the space-time cube.
Compared to other forecasting tools in the Time Series Forecasting toolset, this tool is the most complex but includes the fewest assumptions about the data.
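To illustrate the time-window idea, below is a minimal sketch (not the tool's implementation) of how one location's weekly series can be turned into tabular training data for a regression forest, using an arbitrary 4-week window: each row's explanatory variables are the previous 4 weeks, and the target is the current week.
# Conceptual sketch only: build lagged time-window training rows from one series
# (hypothetical values and window length; the Forest-based Forecast tool does this internally)
import pandas as pd
rates = pd.Series([0.0, 1.2, 3.4, 5.1, 4.0, 2.2, 1.0, 0.5])  # hypothetical weekly death rates
window = 4                                                    # assumed window length
rows = []
for i in range(window, len(rates)):
    lags = rates.iloc[i - window:i].tolist()  # previous 4 weeks as explanatory variables
    rows.append(lags + [rates.iloc[i]])       # current week as the value to predict
train_df = pd.DataFrame(rows, columns=[f"lag_{k}" for k in range(window, 0, -1)] + ["target"])
print(train_df)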
In this section, we will:
We will use data from June 7, 2020 - March 14, 2021 to train and validate the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.
# Run Algorithm
from arcpy.stpm import ForestBasedForecast
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
try:
    in_cube = os.path.join(home_dir, sptm_cube_sub)
    analysis_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
    output_features = "usNB_resDeaths_FBFsub"
    out_cube = os.path.join(home_dir, 'usNB_resDeaths_FBFsub_cube.nc')
    number_of_time_steps_to_forecast = 12
    time_window = None
    # season_length = None
    number_for_validation = 6
    number_of_trees = 1000
    minimum_leaf_size = None
    maximum_depth = None
    sample_size = 100
    forecast_approach = "VALUE_DETREND"
    outlier_option = None
    level_of_confidence = "95%"
    maximum_number_of_outliers = None
    ForestBasedForecast(in_cube, analysis_variable, output_features, out_cube,
                        number_of_time_steps_to_forecast, time_window, number_for_validation,
                        number_of_trees, minimum_leaf_size, maximum_depth, sample_size, forecast_approach,
                        outlier_option, level_of_confidence, maximum_number_of_outliers)
    print(arcpy.GetMessages())
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
# Access FBF data from local
fbf_resDeaths_subdf = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
fbf_resDeaths_subdf.head()
When forecasting time series data, the idea is to estimate how the sequence of observations will continue into the future. Using the model just created, we will generate a forecast for 12 time periods, from 2021-03-21 - 2021-06-06.
To create forecast plots, we need the raw, fitted, and forecasted values. The model generates a feature class with forecasted values only, and the raw point data (point_lyr) is not aggregated by county. Let's create a space time cube in 3D using the Visualize Space Time Cube in 3D [6] tool to get the fitted values.
from arcpy.stpm import VisualizeSpaceTimeCube3D
# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True
# Run
in_cube = out_cube
cube_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
display_theme = "FORECAST_RESULTS"
output_features = "usNB_resDeaths_FBFsub_3D"
VisualizeSpaceTimeCube3D(in_cube, cube_variable, display_theme, output_features)
# Access cube data from local
cube3D_FBFsub_df_resDeaths = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
cube3D_FBFsub_df_resDeaths.head()
# Check shape
cube3D_FBFsub_df_resDeaths.shape
# Merge with FIPS data
cube3D_FBFsub_df_resDeaths = pd.merge(cube3D_FBFsub_df_resDeaths, fips_df, on='FIPS')
cube3D_FBFsub_df_resDeaths.shape
Next, we will clean the data and keep only the necessary columns.
# Rename Columns
cols_to_rename = {'OBJECTID_y':'OBJECTID',
'SHAPE_y':'SHAPE'}
cube3D_FBFsub_df_resDeaths.rename(columns=cols_to_rename, inplace=True)
# Drop columns
cols_to_drop = ['OBJECTID_x','TIME_STEP','TIME_EXAG','ELEMENT','LOCATION','SHAPE_x']
cube3D_FBFsub_df_resDeaths.drop(columns = cols_to_drop,
axis=1, inplace=True)
# Check data
display(cube3D_FBFsub_df_resDeaths.shape)
display(cube3D_FBFsub_df_resDeaths.columns)
Here, we will combine the errors generated by the Forest-based Forecast model with the fitted values from the space-time cube cube3D_FBFsub_df_resDeaths.
# Get the forecast and validation RMSE values from the FBF results df
fbf_small = fbf_resDeaths_subdf[['FIPS','F_RMSE', 'V_RMSE']].copy()
fbf_small.shape
# Merge with RMSE data
cube3D_FBFsub_df_resDeaths = pd.merge(cube3D_FBFsub_df_resDeaths, fbf_small, on='FIPS')
cube3D_FBFsub_df_resDeaths.shape
We will publish the combined model results to a feature layer for ease of access and use.
%%time
fbf_item = cube3D_FBFsub_df_resDeaths.spatial.to_featurelayer('FBF_BigLayer')
fbf_item
Here, we will create plots that show the raw, fitted, and forecasted average weekly nursing home resident deaths per 1000 residents due to Covid-19 for various counties. The fitted and raw values are for the time period June 7, 2020 - March 14, 2021, whereas the forecasted values are for the time period March 21, 2021 - June 6, 2021.
Next, we will subset the data to keep only the necessary columns.
# Check Column names
cube3D_FBFsub_df_resDeaths.columns
# Subset specific columns
resDeaths_cube_FBFsub_df = cube3D_FBFsub_df_resDeaths[['end_date','fips','county','state','value',
'fitted','residual','high','low']].copy()
resDeaths_cube_FBFsub_df.set_index('end_date', inplace=True)
resDeaths_cube_FBFsub_df.shape
Here, we will create plots that show the raw, fitted, and forecasted values of the average weekly resident deaths per 1000 residents for various counties. Let's use the ts_plot() function, defined earlier, to create the plots. Since Indiana, Arizona, and Pennsylvania were among the top 5 states with the highest average weekly resident deaths per 1000 residents, we will pick one county from each of these states, each belonging to a different cluster, and plot them.
| County, State | FIPS | Cluster Id |
|---|---|---|
| Maricopa, AZ | 4013 | 2 |
| Delaware, PA | 42045 | 3 |
| Porter, IN | 18127 | 4 |
# Plot data using the ts_plot function
fips = [4013,42045,18127]
ts_plot(resDeaths_cube_FBFsub_df,fips,
ylabel='Resident death rate',
title='Average weekly resident death rates')
The plots show:
- the fitted values as generated by the Forest-based Forecast model for June 7, 2020 - March 14, 2021.
- the forecasted values generated by the model for March 21, 2021 - June 6, 2021.
- the actual resident death rates for June 7, 2020 - March 14, 2021.
- the confidence interval of the forecast for March 21, 2021 - June 6, 2021.
Observations:
- The fitted values generated by the model follow the actual death rates closely, with slight deviations from the actual data. This could be a result of the model generalizing from the training data rather than overfitting it.
- The forecast values vary across counties, which is expected, as a separate model is fit for each county.
- The confidence intervals generated by this model are much wider than those of the exponential smoothing model, making it harder to visually compare the fitted and actual data.
- The increasing width of the confidence intervals shows greater uncertainty in the forecast as the forecast horizon increases.
The forecast RMSE measures how much the fitted values from the model differ from the raw time series values. It only measures how well the forest-based forecast model fits the raw time series values; it does not measure how well the model actually forecasts future values. That question is addressed by the validation model. Here, the forecast model uses data from June 7, 2020 - March 14, 2021 for fitting, and the 12 time steps from March 21, 2021 - June 6, 2021 are forecasted.
The validation model is used to determine how well the forecast model can forecast the future values of each time series. It is constructed by excluding some of the final time steps of each time series and fitting the forest-based forecast model to the data that was not excluded. This model is then used to forecast the values of the withheld data, and the forecasted values are compared to the raw values that were hidden. Since we are using 6 time steps for validation, the model uses data from June 7, 2020 - Jan 31, 2021 for training and data from Jan 31, 2021 - March 14, 2021 for validation to calculate this error.
You can read more about the two errors in the tool documentation [7]. Let's map the forecast and validation errors.
# Search for layer
fbf_item = gis.content.search('FBF_BigLayer', 'Feature Layer')
fbf_lyr = fbf_item[0].layers[0]
fbf_lyr
# Create map
fbf_m1 = gis.map('US', zoomlevel=4)
fbf_m1
The map shows the variation in validation errors generated by the Forest-based Forecast model for each county. The validation errors vary widely across counties, with values ranging from 7.8 - 137.
# Plot Data
fbf_m1.remove_layers()
fbf_m1.add_layer(fbf_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"V_RMSE",
"opacity":0.7
})
# Take screenshot
fbf_m1.legend=True
fbf_m1.take_screenshot(output_in_cell=False)
# Create map
fbf_m2 = gis.map('US', zoomlevel=4)
fbf_m2
The map shows the variation in forecast errors generated by the Forest-based Forecast model for each county. The forecast errors vary widely across counties, with values ranging from 5.5 - 66.
# Plot Data
fbf_m2.remove_layers()
fbf_m2.add_layer(fbf_lyr,
{"renderer":"ClassedColorRenderer",
"field_name":"F_RMSE",
"opacity":0.7
})
# Add layer and take screenshot
fbf_m2.legend=True
fbf_m2.take_screenshot(output_in_cell=False)
Let's identify the counties with high variance in forecast and validation errors.
high_var = cube3D_FBFsub_df_resDeaths.query(
'f_rmse > 5.5 and v_rmse > 7.8').groupby(
['state', 'county']).count()[['f_rmse', 'v_rmse']]
high_var
From the results, we can see that there are quite a few counties for which the variation in error is in the highest band for both validation and forecast errors.
Now that we have built the Forest-based Forecast model and generated forecasts for our time series data, let's combine the results from the models built above.
In this section, we will combine the results generated from the Exponential Smoothing and Forest-based forecast models. The combined results will be used later to generate plots overlaying various models.
# Check ES df
display(cube3D_ESsub_df_resDeaths.columns)
display(cube3D_ESsub_df_resDeaths.shape)
# Check FB df
display(cube3D_FBFsub_df_resDeaths.columns)
display(cube3D_FBFsub_df_resDeaths.shape)
Here, we will merge the Exponential Smoothing and Forest-based model results.
# Merge on fips and dates
es_fb_df = pd.merge(cube3D_ESsub_df_resDeaths, cube3D_FBFsub_df_resDeaths,
on=['fips', 'start_date', 'end_date'])
# Check merge results
display(es_fb_df.columns)
display(es_fb_df.shape)
display(es_fb_df.head())
Next, we will clean the data and keep the necessary columns.
# Rename Columns
cols_to_rename = {'objectid_x':'OBJECTID',
'SHAPE_x':'SHAPE',
'county_x':'county',
'state_x':'state'}
es_fb_df.rename(columns=cols_to_rename, inplace=True)
# Drop columns
cols_to_drop = ['objectid_y','SHAPE_y','state_y','county_y','value_x','value_y']
es_fb_df.drop(columns = cols_to_drop,
axis=1, inplace=True)
# Check names
display(es_fb_df.shape)
display(es_fb_df.columns)
# Replace x and y with ES, FB
es_fb_df.columns = es_fb_df.columns.str.replace('_x', '_ES')
es_fb_df.columns = es_fb_df.columns.str.replace('_y', '_FB')
# Rename some ES cols
cols_to_rename = {'level':'level_ES', 'trend':'trend_ES',
'leveltrend':'leveltrend_ES', 'season':'season_ES'}
es_fb_df = es_fb_df.rename(columns = cols_to_rename)
display(es_fb_df.shape)
display(es_fb_df.columns)
We will publish the combined model results to a feature layer for ease of access and use.
%%time
es_fb_item = es_fb_df.spatial.to_featurelayer('ES_FB_BigLayer')
es_fb_item
In this notebook we:
In the next notebook, we will generate forecasts using the ARIMA (Autoregressive Integrated Moving Average) model and create plots overlaying fitted and forecasted values from all of the models we have tested.
[1] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/createcubefromdefinedlocations.htm
[2] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/time-series-clustering.htm
[3] Sokol, J. (2018). Module 7: Time Series Models [Lecture Recording]. Georgia Institute of Technology. https://omscs.gatech.edu/isye-6501-intro-analytics-modeling
[4] https://online.stat.psu.edu/stat501/node/1001
[5] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/exponentialsmoothingforecast.htm
[6] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/visualizecube3d.htm
[7] https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/forestbasedforecast.htm
[8] https://www.history.com/this-day-in-history/first-confirmed-case-of-coronavirus-found-in-us-washington-state
[9] https://www.nytimes.com/interactive/2020/us/coronavirus-nursing-homes.html
[10] https://www.gao.gov/products/gao-21-367
[11] https://www.nytimes.com/2021/08/04/health/nursing-homes-vaccine-delta-covid.html