Time Series Forecasting: COVID-19 trends in US nursing homes - Part 1/2

  • 👟 Ready To Run!
  • 🚀 Data Exploration and Cleaning
  • 🤖 Machine Learning
  • ⏱️ Time Series
  • ⚕️ Healthcare

Requirements:

  • 🗄️ ArcGIS Notebooks Advanced Runtime

Coronavirus disease 2019 (Covid-19), which would spark a global pandemic, was first reported in China on December 31, 2019. On January 19, 2020, the first case of Covid-19 in the United States was reported near Seattle, WA. In the following months, the Seattle area became the epicenter of an early U.S. outbreak, and 39 residents of a nursing home in Kirkland, WA died from complications of the virus in a single four-week span [8].

image.png

Image Source

The Covid-19 pandemic has swept through long-term care facilities in the United States, which accounted for nearly 4% of all U.S. cases (1.4M) and 31% of all deaths (182K) as of June 2021 [9]. The impact on nursing homes is even worse, where nearly 1 in 10 residents has died. According to a study [10], 94% of nursing homes reviewed experienced more than one Covid-19 outbreak, and 85% had an outbreak lasting 5 or more weeks. As of August 2021, case counts have ticked up again, and with vaccinated elderly residents susceptible to breakthrough infections, nursing homes are confronting new Covid-19 outbreaks [11].

People living in nursing homes are generally 60 years and older, with underlying health conditions such as cardiovascular disease, chronic respiratory disease, cerebrovascular disease, malignancy, and dementia. They represent a vulnerable population that warrants additional protective measures against infection. The communal nature of nursing homes, along with these comorbidities, puts residents at increased risk of Covid-19 progression, severe outcomes, and death.

The ability to predict the progress of the pandemic has been crucial for decision making aimed at controlling its spread. This presents an opportunity to use advanced data analysis, visualization, and modeling techniques (both open source and ArcGIS based) to find trends that provide an early warning of rising resident deaths and help prevent future increases.

In this 2-part study, we will:

  • Explore the data to understand the distribution of average weekly resident deaths per 1000 residents.
  • Cluster the data using Time Series Clustering to better understand spatiotemporal patterns in our data.
  • Generate forecasts for average weekly resident deaths per 1000 residents using:
    • Exponential Smoothing Forecast model
    • Forest-based Forecast model
    • ARIMA model
  • Create plots to visualize the fitted values, forecasted values and their confidence intervals, and actual data values for all models.
  • Plot forecast errors generated by all models.

Introduction

In this notebook, we will:

  • Start by exploring the data to understand the distribution of average weekly resident cases and deaths per 1000 residents.
  • Create a Space Time Cube [1] to structure the data into a netCDF data format and aggregate resident death rates at the US county geographic level.
  • Cluster the data using Time Series Clustering [2] to better understand spatiotemporal patterns in our data.
  • Generate forecasts for average weekly resident deaths per 1000 residents using Exponential Smoothing Forecast [5] and Forest-based Forecast [7] tools.
  • Plot forecast and validation errors as generated by the two models.

In the subsequent notebook, we will:

  • Build ARIMA models to generate forecasts for average weekly resident deaths per 1000 residents in each county.
  • Combine ARIMA results with Exponential Smoothing, Forest-based Forecast models and actual data.
  • Create plots to visualize the fitted values, forecasted values and their confidence intervals, and actual data values for all models.
  • Plot forecast errors generated by all models.

About the Data

The Centers for Medicare and Medicaid Services (CMS) provides a Nursing Home Covid-19 Public File. This file includes data reported by nursing homes to the CDC’s National Healthcare Safety Network (NHSN) system’s Covid-19 Long Term Care Facility Module. The data includes information about Resident Impact, Facility Capacity, Staff & Personnel, Personal Protective Equipment, Ventilator Capacity, and other Supplies information.

Dataset: https://data.cms.gov/covid-19/covid-19-nursing-home-data

Each record in the data represents information about an individual nursing home in the United States for each week. For our analysis, we will use the data ranging from June 07, 2020 - June 6, 2021. This data has been cleaned, feature engineered and made available as published feature services. To learn more about the raw data and the data cleaning performed to create these services, review the following notebook.

General Imports

In [1]:
# Import Libraries

# Display 
from IPython.display import display
from ipywidgets import *

# ArcGIS and Arcpy
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap
import arcpy

# Data Exploration
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)

# Plotting Libraries
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import dates
import seaborn as sns
plt.style.use('ggplot')

# Datetime and os
from datetime import datetime
import time
import os
import itertools

# Stop warnings
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Create a GIS connection
gis = GIS("home")

Data Exploration

Our dataset is space-time data in which each record represents information about an individual nursing home in the United States for each week, ranging from June 07, 2020 - June 6, 2021. For exploration purposes, we will only consider the following columns:

  • weekly resident cases per 1000 residents
  • weekly resident deaths per 1000 residents

There are many ways to explore this data, but in this notebook, we will explore it in two particular ways. We will aggregate the data by state to:

  • get an overview of average weekly resident deaths per 1000 residents for all states for the entire time period.
  • understand the distribution of average weekly resident deaths per 1000 residents for states with the highest mortality rates.

In this section, we will:

  • get actual resident death data.
  • subset and aggregate data.
  • plot the data.

Get Resident Data

Use the NH_TS_US_Cleaned_0621 feature layer to get nursing home resident data. To learn more about how this service was created, review the following notebook.

In [3]:
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
                                outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
ts_item = gis.content.search(f'NH_TS_US_Cleaned_0621 {group_query}',
                              item_type = "Feature Service",
                              outside_org=True)
ts_item
Out[3]:
[<Item title:"Subset_NH_TS_US_Cleaned_0621" type:Feature Layer Collection owner:portaladmin>,
 <Item title:"NH_TS_US_Cleaned_0621" type:Feature Layer Collection owner:portaladmin>]
In [4]:
# Get data layer
ts_layer = ts_item[1].layers[0]
ts_layer
Out[4]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/NH_TS_US_Cleaned_0621/FeatureServer/0">
In [5]:
# View first few field names in the layers
for f in ts_layer.properties.fields:
    print(f['name'], "{0:>35}".format(f['alias']))
objectid                            OBJECTID
county                              county
provider_state                      provider_state
initial_covid_case                  initial_covid_case
threeormore_cases_this_week         threeormore_cases_this_week
oneweek_ppe_shortage                oneweek_ppe_shortage
total_staff_shortage                total_staff_shortage
shortage_of_nursing_staff           shortage_of_nursing_staff
oneweek_shortage_of_n95_masks       oneweek_shortage_of_n95_masks
staff_total_confirmed_covid19       staff_total_confirmed_covid19
staff_total_covid19_deaths          staff_total_covid19_deaths
residents_total_admissions_covi  residents_total_admissions_covid19
weekly_resident_confirmed_covid weekly_resident_confirmed_covid19_cases_per_1000_residents
weekly_resident_covid19_deaths_ weekly_resident_covid19_deaths_per_1000_residents
total_number_of_occupied_beds       total_number_of_occupied_beds
week_ending                         week_ending
In [6]:
# Check records in data
ts_layer.query(return_count_only=True)
Out[6]:
769541

From the results above, we can see that our dataset has ~770k records and 16 columns. Each record includes information about the nursing home staff, resident impacts, personal protective equipment and other supplies, occupied beds, etc.

Subset and Group Data

Here, we will query and group the data.

In [7]:
%%time
# Query Data
subdf = ts_layer.query(as_df=True, 
                       out_fields=['county','provider_state',
                                    'week_ending','weekly_resident_confirmed_covid',
                                    'weekly_resident_covid19_deaths_'])
subdf.shape
Wall time: 1min 49s
Out[7]:
(769541, 7)

We will group the data by county and state to get the average weekly resident cases and deaths per 1000 residents for each state and county, for each time period.

  • Aggregating by mean preserves the ratio of cases/deaths per 1000 residents, so the resulting numbers represent average weekly resident cases and deaths per 1000 residents.
  • We also rename the columns to resident_cases and resident_deaths.
In [8]:
# Group data by state for each time period
exp_df1 = subdf.groupby(['week_ending','provider_state','county']).aggregate('mean').copy()
exp_df1.reset_index(inplace=True)
exp_df1.drop('objectid', inplace=True, axis=1)
exp_df1.rename(columns={'weekly_resident_confirmed_covid':'resident_cases',
              'weekly_resident_covid19_deaths_':'resident_deaths'}, inplace=True)
exp_df1.shape
Out[8]:
(149267, 5)
In [9]:
# Check head
exp_df1.head()
Out[9]:
week_ending provider_state county resident_cases resident_deaths
0 2020-06-07 AK anchorage 81.08 0.0
1 2020-06-07 AK cordova-mccarthy 0.00 0.0
2 2020-06-07 AK fairbanks 0.00 0.0
3 2020-06-07 AK juneau 0.00 0.0
4 2020-06-07 AK kenai-cook inlet 0.00 0.0

Now that we have subsetted the data to specific columns and aggregated it by county and state, the dataset has 149,267 records and 5 columns. Each record represents the average weekly resident cases and deaths per 1000 residents for a given county and state over a one-week time period.

We can further aggregate this data by state to get average weekly resident cases and deaths per 1000 residents by state. This data is combined for all time periods.

In [10]:
# Group data by state 
exp_df2 = exp_df1.groupby(['provider_state']).aggregate('mean').copy()
exp_df2.reset_index(inplace=True)
exp_df2.shape
Out[10]:
(51, 3)
In [11]:
# Check head
exp_df2.head()
Out[11]:
provider_state resident_cases resident_deaths
0 AK 1.822731 0.293808
1 AL 12.955047 2.127964
2 AR 13.550926 2.735555
3 AZ 13.460469 3.306978
4 CA 10.686582 1.860854

This subset represents the average weekly resident cases and deaths per 1000 residents by state over the entire time period. For example, across all nursing homes in AZ over this period, on average about 3 residents out of every 1,000 died per week due to Covid-19, and about 13 out of every 1,000 tested positive for Covid-19.
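As a quick check of that reading, we can pull the Arizona row directly from the aggregated dataframe. This is a minimal sketch using the exp_df2 dataframe created above; the rounding is only for readability.

# Look up the aggregated rates for a single state (AZ)
az_row = exp_df2.query('provider_state == "AZ"')
print(az_row[['resident_cases', 'resident_deaths']].round(2))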

Plot Data

Let's plot the data using the two subsets of data created above.

Average weekly resident cases and death rates by state

In [12]:
# Plot
fig, (ax1, ax2) = plt.subplots(2, figsize=(20,15))

# Plot cases
sns.barplot(x='provider_state', y='resident_cases', 
            data=exp_df2.sort_values(by='resident_cases', ascending=False),
            ax=ax1, color='lightcoral')
# Plot deaths
sns.barplot(x='provider_state', y='resident_deaths', 
            data=exp_df2.sort_values(by='resident_deaths', ascending=False),
            ax=ax2, color='lightcoral')

# Set label and title
ax1.set(ylabel='Resident case rate', xlabel= "States", 
        title="Average weekly nursing home resident case rates by state")
ax2.set(ylabel='Resident death rate', xlabel= "States",
        title="Average weekly nursing home resident death rates by state")
plt.rc('axes', labelsize=20)
plt.rc('axes', titlesize=15)
plt.rc('xtick', labelsize=10)
plt.show();

The plots show:

  • average weekly nursing home resident case rates by state.
  • average weekly nursing home resident death rates by state.

Observations:

  • The average weekly resident case rate appears to be highest for Arkansas, followed by Arizona and Tennessee.
  • The average weekly resident death rate is highest for Indiana, followed by Arizona and North Dakota.

Let's plot the distribution of average weekly resident death rates for Indiana, Arizona, and North Dakota.

Average weekly resident death rates

Here, we plot the time series of average weekly resident deaths per 1000 residents for Indiana, Arizona, and North Dakota across different time periods.

In [13]:
fig, ax = plt.subplots(1,1, figsize=(20,10))

states = ['AZ','IN','ND']
for state in states:
    sns.lineplot(x='week_ending', y='resident_deaths', 
                 data=exp_df1.query('provider_state==@state'),
                 marker='o', linewidth=1.8, ci=None, label=state)

ax.autoscale(axis='x')
plt.legend(loc='upper left', prop={'size': 15})
ax.set(ylabel='Resident death rate', xlabel= "")
ax.xaxis.set_major_locator(dates.MonthLocator(interval=1))
plt.title("Average weekly nursing home resident death rates", fontsize=20)
plt.gcf().autofmt_xdate()
plt.show();

The plot shows average weekly resident death rates for Indiana, Arizona, and North Dakota.

Observations:

  • Arizona saw a higher resident mortality rate early on, compared to other states. The death rate increased steadily until August'20, followed by a decreasing trend until mid Sept.'20 and then no deaths until mid Oct.'20. The period from mid Oct.'20 - mid Dec.'20 saw a consistent increase in the death rate, followed by a decreasing trend until June'21. The state saw its highest death rates in early Aug.'20 and mid Dec.'20.
  • Indiana reported a consistently low death rate until Oct'20, with a minor spike for a month around Aug.'20. From Oct.'20 - mid Dec.'20, there was an increase in the death rate, followed by a decreasing trend until June'21. From mid Dec.'20 onwards, Indiana reported a higher death rate than Arizona and North Dakota.
  • North Dakota fared well early on, with almost no deaths until Sept.'20. There was a steep rise in the death rate in mid Oct.'20, with an average of 25 deaths per 1000 residents during the 2nd week of Oct.'20. While this spike settled down the next week, the death rate rose again from early Nov.'20 - early Dec.'20. From mid Dec.'20 onwards, there was a decreasing trend through June'21.
  • All 3 states saw a decrease in resident death rates starting in mid Dec.'20 and almost no deaths from March'21 onwards. This trend is consistent with overall Covid-19 death rates across the United States. Most states saw an increase in case and death rates during Dec.'20 - Jan.'21. With the rollout of vaccination programs in early Jan.'21, deaths declined overall, with almost none reported starting in April'21.

Create Space Time Cube

We will begin our analysis by creating a Space Time Cube using Create Space Time Cube From Defined Locations [1] tool. The data structure this tool creates can be thought of as a three-dimensional cube made up of space-time bins with the x and y dimensions representing space and the t dimension representing time. Every bin has a fixed position in space (either an x,y location for points or a fixed set of vertices for polygon locations) and in time (t).

image.png

Source

This tool structures the data into a netCDF data format by creating space-time bins. The netCDF output is then used as input to other time-series algorithms in our toolbox.

In this section, we will:

  • setup environment and get data.
    • get nursing home resident deaths data.
    • get data for US county boundaries.
  • build a space time cube to aggregate resident deaths at the US county geographic level for each time period.

We will use data from June 7, 2020 - March 14, 2021 to build the space time cube. Data from March 21, 2021 - June 6, 2021 (12 time steps) has been excluded and will be used for forecasting.
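As an aside, the temporal split is easy to reproduce with pandas. The sketch below is illustrative only and assumes a dataframe like subdf with a datetime week_ending column; it shows one way such a train/holdout split could be produced, not necessarily how the published subset service was built.

# Split the weekly data into a modeling window and a 12-week holdout window
cutoff = pd.Timestamp('2021-03-14')
model_df = subdf[subdf['week_ending'] <= cutoff]    # June 7, 2020 - March 14, 2021
holdout_df = subdf[subdf['week_ending'] > cutoff]   # March 21, 2021 - June 6, 2021
print(model_df['week_ending'].max(), holdout_df['week_ending'].nunique())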

Setup Environment and Get Data

Before we create the space time cube, we will:

  • set up an environment by creating a file geodatabase and a data directory to output results.
  • get US county boundary data to obtain the fips, county, and state name mapping.

Setup Environment

Here, we will create the file geodatabase and set up the data directory that our results will be written to.

In [14]:
# Set data directory to output results
home_dir = os.path.join(os.getcwd(), 'home')
print(f"home dir: {home_dir}")
home dir: C:\Users\mohi9282\Documents\ArcGIS\home
In [15]:
# Create empty FGDB in data directory
if not os.path.isdir(os.path.join(home_dir, 'Results.gdb')):
    arcpy.CreateFileGDB_management(home_dir, 'Results.gdb')
print('File Geodatabase exists')
File Geodatabase exists
In [16]:
# Setup Workspace
arcpy.env.workspace = os.path.join(home_dir,'Results.gdb')
arcpy.env.workspace
Out[16]:
'C:\\Users\\mohi9282\\Documents\\ArcGIS\\home\\Results.gdb'

Get Resident Data

We will use the Subset_NH_TS_US_Cleaned_0621 layer with data from June 7, 2020 - March 14, 2021. To learn more about how this service was created, review the following notebook.

In [19]:
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
                                outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
ts_sub_item = gis.content.search(f'Subset_NH_TS_US_Cleaned_0621 {group_query}',
                                  item_type = "Feature Service",
                                  outside_org=True)
ts_sub_item
Out[19]:
[<Item title:"Subset_NH_TS_US_Cleaned_0621" type:Feature Layer Collection owner:portaladmin>]
In [20]:
# Get data layer
ts_sub_layer = ts_sub_item[0].layers[0]
ts_sub_layer
Out[20]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/Subset_NH_TS_US_Cleaned_0621/FeatureServer/0">
In [21]:
%%time
# Query as dataframe
ts_sub_df = ts_sub_layer.query(as_df=True)
Wall time: 1min 33s
In [22]:
# Check details
display(ts_sub_df.shape)
display(ts_sub_df.columns)
display(ts_sub_df.head())
(595650, 17)
Index(['SHAPE', 'county', 'initial_covid_case', 'objectid',
       'oneweek_ppe_shortage', 'oneweek_shortage_of_n95_masks',
       'provider_state', 'residents_total_admissions_covi',
       'shortage_of_nursing_staff', 'staff_total_confirmed_covid19',
       'staff_total_covid19_deaths', 'threeormore_cases_this_week',
       'total_number_of_occupied_beds', 'total_staff_shortage', 'week_ending',
       'weekly_resident_confirmed_covid', 'weekly_resident_covid19_deaths_'],
      dtype='object')
SHAPE county initial_covid_case objectid oneweek_ppe_shortage oneweek_shortage_of_n95_masks provider_state residents_total_admissions_covi shortage_of_nursing_staff staff_total_confirmed_covid19 staff_total_covid19_deaths threeormore_cases_this_week total_number_of_occupied_beds total_staff_shortage week_ending weekly_resident_confirmed_covid weekly_resident_covid19_deaths_
0 {"x": -9141988.7262, "y": 3396571.8598999977, ... marion 0.0 1 0.0 0.0 FL 0.0 0.0 0.0 0.0 0.0 159.0 0.0 2020-07-12 0.0 0.0
1 {"x": -9093028.0764, "y": 3501992.7968000025, ... clay 0.0 2 0.0 0.0 FL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2020-06-07 0.0 0.0
2 {"x": -8921663.4089, "y": 3000273.0898, "spati... broward 0.0 3 0.0 0.0 FL 1.0 0.0 12.0 0.0 0.0 28.0 0.0 2021-01-03 0.0 0.0
3 {"x": -9794605.9202, "y": 5185103.478500001, "... lake 0.0 4 0.0 0.0 IL 22.0 0.0 59.0 0.0 0.0 0.0 0.0 2020-10-04 0.0 0.0
4 {"x": -10041672.7395, "y": 4858313.148800001, ... cass 0.0 5 0.0 0.0 IL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2020-06-28 0.0 0.0

From the results above, we can see that our dataset has ~596k records and 17 columns. The columns include information about resident impact, staff & personnel, personal protective equipment and other supplies.

In [23]:
%%time
# Store the data locally
point_lyr = ts_sub_df.spatial.to_featureclass(os.path.join(arcpy.env.workspace,'Subset_NH_TS_US_Cleaned_0621'))
point_lyr
Wall time: 1min 43s
Out[23]:
'C:\\Users\\mohi9282\\Documents\\ArcGIS\\home\\Results.gdb\\Subset_NH_TS_US_Cleaned_0621'

Get US County boundaries

Here, we will get the county boundaries layer NH_TS_US_FIPS_layer created for the analysis. To learn more about how this service was created, review the following notebook.

In [24]:
# Query specific item from the org
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
                                outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
polygon_item = gis.content.search(f'NH_TS_US_FIPS_layer {group_query}',
                                  item_type = "Feature Service",
                                  outside_org=True)
polygon_item
Out[24]:
[<Item title:"NH_TS_US_FIPS_layer" type:Feature Layer Collection owner:portaladmin>]
In [25]:
# Get published FIPS layer as polygon layer
polygon_layer = polygon_item[0].layers[0]
polygon_layer
Out[25]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/NH_TS_US_FIPS_layer/FeatureServer/0">
In [27]:
# Check layer details
fips_df = polygon_layer.query(as_df=True)
display(fips_df.shape)
display(fips_df.head())
(3142, 7)
SHAPE SHAPE__Area SHAPE__Length county fips objectid state
0 {"rings": [[[-13230502.797899999, 6098468.1313... 1.342148e+10 547763.657179 Ferry 53019 1 WA
1 {"rings": [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 53065 2 WA
2 {"rings": [[[-13243912.912999999, 6096737.8555... 3.124596e+10 878281.939690 Okanogan 53047 3 WA
3 {"rings": [[[-13073202.9669, 6274847.561999999... 8.528383e+09 453421.765692 Pend Oreille 53051 4 WA
4 {"rings": [[[-13027621.124200001, 6247430.6903... 7.613832e+09 385397.356080 Boundary 16021 5 ID

Run Algorithm

Here, we will build a space time cube. We will aggregate weekly resident deaths (point_lyr) at the US county geographic level for each time period by using aggregation_shape_type = "DEFINED_LOCATIONS" and passing the fips layer (polygon_layer) as defined_polygon_locations.

In [29]:
# Run Create Space Time Cube and output results to the FGDB
from arcpy.stpm import CreateSpaceTimeCube

# Workspace
arcpy.env.workspace

arcpy.env.overwriteOutput = True

try:

    # Define model variables
    in_features = point_lyr
    sptm_cube_sub = "usCubeNBLatest.nc"   #"fcCubeNB.nc"
    time_field = 'week_ending'
    template_cube = None
    time_step_interval = "1 Weeks"
    time_step_alignment = "END_TIME"
    reference_time = None
    distance_interval = None
    summary_fields = "weekly_resident_covid19_deaths_ MEAN SPACE_TIME_NEIGHBORS;"
    aggregation_shape_type = "DEFINED_LOCATIONS"
    defined_polygon_locations = polygon_layer.url
    location_id = "fips"

    CreateSpaceTimeCube(in_features, sptm_cube_sub, time_field, template_cube, time_step_interval,
                        time_step_alignment, reference_time, distance_interval, summary_fields,
                        aggregation_shape_type, defined_polygon_locations, location_id)
    print(arcpy.GetMessages())
    
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
Start Time: Wednesday, September 8, 2021 12:05:22 PM
WARNING 110109: 3895 of 595650 Input Features could not be aggregated into the Defined Polygon Locations.
WARNING 110110: Features that could not be aggregated (only includes first 30): OBJECTID = 7713, 8930, 8942, 8949, 8972, 8974, 8985, 9004, 9035, 9046, 9054, 9056, 9066, 9086, 9120, 9148, 9157, 9161, 9187, 9240, 9249, 9250, 9262, 9267, 9275, 9278, 9279, 9290, 9314, 9332.
WARNING 110089: 189 out of 3142 Defined Locations were excluded from the Space Time Cube due to the presence of bins that could not be estimated.
WARNING 110090: Defined Locations excluded from the Space Time Cube based on the inability to predict (only includes first 30): FIPS = 53019, 53051, 30101, 30051, 30015, 53017, 30099, 30055, 30033, 53043, 38083, 38025, 30069, 30045, 38033, 38007, 38043, 38065, 30079, 30059, 30039, 30007, 30037, 30065, 30107, 38037, 53023, 30025, 16049, 38087.
WARNING 110067: Your spatial reference is not compatible with CF Conventions.  You may experience difficulties using the resulting space-time cube with other NetCDF tools and software.

---------- Space Time Cube Characteristics -----------
Input feature time extent          2020-06-07 00:00:00
                                to 2021-03-14 00:00:00
                                                      
Number of time steps                                41
Time step interval                              1 week
Time step alignment                                End
                                                      
First time step temporal bias                  100.00%
First time step interval                         after
                                   2020-05-31 00:00:00
                                       to on or before
                                   2020-06-07 00:00:00
                                                      
Last time step temporal bias                     0.00%
Last time step interval                          after
                                   2021-03-07 00:00:00
                                       to on or before
                                   2021-03-14 00:00:00
                                                      
Cube extent across space       (coordinates in meters)
Min X                                   -17838103.0610
Min Y                                     2145729.6799
Max X                                    -7454985.1465
Max Y                                     9729151.6921


Locations                                           2953
% of locations with estimated observations          5.45
- Total number                                       161
Total observations                                121073
% of all observations that were estimated           5.38
- Total number                                      6512


 Overall Data Trend - WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS 
Trend direction                             Increasing
Trend statistic                                 2.5946
Trend p-value                                   0.0095


------------- Overall Data Trend - COUNT -------------
Trend direction                             Decreasing
Trend statistic                                -8.7346
Trend p-value                                   0.0000

Succeeded at Wednesday, September 8, 2021 12:05:59 PM (Elapsed Time: 37.80 seconds)

From the messages generated, we can see that:

  • 3895 of 595650 input features could not be aggregated into the Defined Polygon Locations. This represents roughly 0.65% of the total data, and we will ignore these data points for the analysis.
  • 189 out of 3142 counties (Defined Locations) were excluded from the Space Time Cube due to the presence of bins that could not be estimated. These 189 counties didn't have enough data for estimation and will not be included in the analysis.
  • Overall across the United States, there is a statistically significant increasing trend for average weekly resident deaths per 1000 residents.

The Create Space Time Cube tool structures the data into a netCDF format (.nc) by creating space-time bins; here, 2,953 locations × 41 weekly time steps give the 121,073 space-time bins reported above. We can now use the netCDF output as input to other time series algorithms in our toolbox.
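If you want to peek inside the cube outside of the stpm tools, the .nc output is a netCDF file and can be opened with the netCDF4 package. This is a minimal, optional sketch; it assumes the netCDF4 package is available in the notebook runtime and only lists what is stored in the cube.

# Inspect the space time cube file directly (optional)
import netCDF4

with netCDF4.Dataset(os.path.join(home_dir, sptm_cube_sub)) as nc:
    print(list(nc.dimensions))   # e.g. time and location dimensions
    print(list(nc.variables))    # analysis variable and summary fields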

Time Series Clustering

Now that we have a Space Time Cube, let's try to understand some patterns in our space-time data of resident deaths per 1000 residents. Clustering allows us to understand the data better by identifying which features are more similar or dissimilar to each other. The Time Series Clustering [2] tool identifies the locations in a space-time cube that are most similar and partitions them into distinct clusters in which members of each cluster have similar time series characteristics.

image.png Source

The tool partitions a collection of time series, stored in a space-time cube, based on the similarity of time series characteristics. Time series can be clustered based on three criteria: having similar values across time, tending to increase and decrease at the same time, and having similar repeating patterns.

In this section, we will:

  • run the time series clustering tool using the space time cube.
  • visualize the number of counties in each cluster.
  • plot clusters on a map.
  • create an average time series for each cluster.
  • use cluster information to identify states and counties that belong to a particular cluster.

We will use the data from June 7, 2020 - March 14, 2021 for clustering. Data from March 21, 2021 - June 6, 2021 (12 time steps) has been excluded from clustering and will be used for forecasting.

Run Algorithm

Here, we use the Profile (Correlation) option for characteristic_of_interest parameter, which is used to cluster time series such that they tend to stay in proportion to each other and increase and decrease in value at the same time. This option measures the similarity in time series based on their statistical correlation across time.
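To build intuition for this option, the short sketch below shows how correlation-based similarity treats two series that rise and fall together but differ in scale. It uses made-up numbers purely for illustration and is not the tool's internal implementation.

# Toy weekly series: same profile at a different scale vs. an inverted profile
a = np.array([0, 1, 3, 7, 4, 2, 1], dtype=float)
b = a * 10 + 5                   # moves in proportion to a -> same profile
c = a.max() - a                  # moves opposite to a -> inverted profile

print(np.corrcoef(a, b)[0, 1])   # 1.0  -> grouped together under Profile (Correlation)
print(np.corrcoef(a, c)[0, 1])   # -1.0 -> dissimilar profile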

In [30]:
from arcpy.stpm import TimeSeriesClustering

# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True

# Run
try:
    in_cube = os.path.join(home_dir, sptm_cube_sub)
    analysis_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
    output_features = "usNB_resDeaths_TSClustering"
    characteristic_of_interest = "PROFILE"
    cluster_count = None
    output_table_for_charts = "usNB_resDeaths_TSClustChart"
    shape_characteristic_to_ignore = None
    enable_time_series_popups = "NO_POPUP"
    
    TimeSeriesClustering(in_cube, analysis_variable, output_features, characteristic_of_interest,
                        cluster_count, output_table_for_charts, shape_characteristic_to_ignore,
                        enable_time_series_popups)
    print(arcpy.GetMessages())

except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
Start Time: Wednesday, September 8, 2021 12:07:23 PM

----------- Input Space Time Cube Details -----------
Time step interval                              1 week
                                                      
Shape Type                                     Polygon
                                                      
First time step temporal bias                  100.00%
First time step interval                         after
                                   2020-05-31 00:00:00
                                       to on or before
                                   2020-06-07 00:00:00
                                                      
Last time step temporal bias                     0.00%
Last time step interval                          after
                                   2021-03-07 00:00:00
                                       to on or before
                                   2021-03-14 00:00:00
                                                      
Number of time steps                                41
Number of locations analyzed                      2953
Number of space time bins analyzed              121073
-----------------------------------------------------

------ Pseudo F-Statistic Summary ------
Number of Clusters      Highest Pseudo F
2                                  0.000
3                                946.348
4                               1104.168
5                               1057.773
6                                989.514
7                                930.295
8                                922.507
9                                885.160
10                               858.593
----------------------------------------
Optimal number of clusters is 4 based on the highest pseudo F-statistic. Random Seed: 7165

------ Trend Statistics for Average Time-Series ------
Cluster ID     Direction           Statistic     p-value
1              Not Significant        0.0000      1.0000
2              Increasing             3.4707      0.0005
3              Decreasing            -3.0439      0.0023
4              Increasing             2.0105      0.0444
------------------------------------------------------

- Number of Locations per Cluster -
Cluster ID      Number of Locations
1                               248
2                              1379
3                               570
4                               756
-----------------------------------
Succeeded at Wednesday, September 8, 2021 12:08:23 PM (Elapsed Time: 1 minutes 0 seconds)

The messages generated show:

  • the optimal number of clusters identified by the model is 4.
  • the trend statistics for the average time series show a statistically significant trend in 3 of the 4 clusters.
  • the number of locations (counties) assigned to each cluster.
In [31]:
# Access clustering results into a SeDF
tsc_resDeaths_df = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
tsc_resDeaths_df.head()
Out[31]:
OBJECTID LOCATION CLUSTER_ID CENTER_REP SHAPE
0 1 0 2 0 {"rings": [[[-13073230.935899999, 6114197.2246...
1 2 1 2 0 {"rings": [[[-13243912.912999999, 6096737.8555...
2 3 2 1 1 {"rings": [[[-13027621.124200001, 6247430.6903...
3 4 3 4 0 {"rings": [[[-12919238.883299999, 6141609.4641...
4 5 4 2 0 {"rings": [[[-12631826.7216, 6040165.092500001...
In [32]:
# Check shape
tsc_resDeaths_df.shape
Out[32]:
(2953, 5)
In [33]:
# Create a feature service on portal
tsc_layer = tsc_resDeaths_df.spatial.to_featurelayer('NH_TS_Clustering')
tsc_layer
Out[33]:
NH_TS_Clustering
Feature Layer Collection by portaladmin
Last Modified: September 08, 2021
0 comments, 0 views

Merge with FIPS data

As time series clustering does not return a FIPS code, we will use a spatial join to join the results from clustering with the fips layer.

In [34]:
# Join Fips and Clustering output layers
from arcgis.features.analysis import join_features

tsc_resDeaths_item = join_features(target_layer=polygon_layer.url,
                                    join_layer=tsc_layer,
                                    spatial_relationship='identicalto',
                                    join_operation='JoinOneToOne',
                                    output_name='NH_TS_Join')
In [35]:
#Check item
tsc_resDeaths_item = gis.content.search('title: NH_TS_Join', 'Feature Layer')
tsc_resDeaths_item
Out[35]:
[<Item title:"NH_TS_Join" type:Feature Layer Collection owner:portaladmin>]
In [36]:
# Get join results in a data frame
tsc_resDeaths_df = tsc_resDeaths_item[0].layers[0].query(as_df=True)
tsc_resDeaths_df.head()
Out[36]:
SHAPE center_rep cluster_id county fips join_count location objectid state target_fid
0 {"rings": [[[-13073230.935899999, 6114197.2246... 0 2 Stevens 53065 1 0 1 WA 2
1 {"rings": [[[-13243912.912999999, 6096737.8555... 0 2 Okanogan 53047 1 1 2 WA 3
2 {"rings": [[[-13027621.124200001, 6247430.6903... 1 1 Boundary 16021 1 2 3 ID 5
3 {"rings": [[[-12919238.883299999, 6141609.4641... 0 4 Lincoln 30053 1 3 4 MT 6
4 {"rings": [[[-12631826.7216, 6040165.092500001... 0 2 Flathead 30029 1 4 5 MT 7

Plot Clusters

Now that we have cluster information along with their location, we will:

  • visualize the number of counties in each cluster.
  • plot the clusters on a map.
  • create an average time series for each cluster.
  • use cluster information to identify states and counties that belong to a particular cluster.

Cluster Allocation

Here, we will visualize the number of counties in each cluster.

In [37]:
# Create plot
sns.set(font_scale = 1)
sns.countplot(x='cluster_id', data=tsc_resDeaths_df, color='lightcoral')
plt.title('No. of counties in each cluster');

The plot shows the count of counties within each cluster.

Clusters on a Map

Here, we will plot the clusters on a map to visualize which counties are associated with which cluster.

In [41]:
# Create Map
tsc_resDeaths_map = gis.map('USA', 4)
tsc_resDeaths_map
Out[41]:

The plot shows counties colored by cluster id.

In [39]:
# Add layer
tsc_resDeaths_map.remove_layers()
tsc_resDeaths_df.spatial.plot(map_widget=tsc_resDeaths_map,
                                renderer_type='u',  # for unique renderer
                                col='cluster_id',
                                line_width=.5)
Out[39]:
True
In [40]:
# Add legend and take screenshot
tsc_resDeaths_map.legend = True
tsc_resDeaths_map.take_screenshot(output_in_cell=False)

Average Time Series per Cluster

The Average Time Series per Cluster chart displays the average of the Analysis Variable at each time step for each cluster.

In [42]:
# Access clustering data from local
tsc_resDeaths_df2 = pd.DataFrame.spatial.from_table(os.path.join(arcpy.env.workspace,output_table_for_charts))
tsc_resDeaths_df2.head()
Out[42]:
OBJECTID TIME_STEP START_DATE END_DATE TIME_EXAG ELEMENT LOCATION FIPS CLUST_MED CLUST_MEAN CLUSTER_ID
0 1 0 2020-05-31 00:00:01 2020-06-07 0.000000 2 2 16021 0.0 0.0 1
1 2 1 2020-06-07 00:00:01 2020-06-14 20766.235829 2955 2 16021 0.0 0.0 1
2 3 2 2020-06-14 00:00:01 2020-06-21 41532.471658 5908 2 16021 0.0 0.0 1
3 4 3 2020-06-21 00:00:01 2020-06-28 62298.707487 8861 2 16021 0.0 0.0 1
4 5 4 2020-06-28 00:00:01 2020-07-05 83064.943316 11814 2 16021 0.0 0.0 1
In [43]:
# Plot time series per cluster

plt.figure(figsize=(20,10))

sns.lineplot(x='END_DATE', y='CLUST_MEAN', data=tsc_resDeaths_df2, 
             hue='CLUSTER_ID', marker='o', palette='Set2', linewidth=2.5)

plt.title('Average Time Series per Cluster', fontdict={'fontsize':20})
plt.legend(loc='upper left', title='CLUSTER_ID', prop={'size': 13})
plt.xticks(rotation=90);

The plot shows the average time series for counties in each cluster.

Observations:

  • The plot shows distinct trends for three clusters and no trend for one cluster.
  • The average time series for cluster 2 shows an increase in average weekly resident deaths per 1000 residents early on, from mid June'20 - mid Aug.'20. The increase is followed by a decline until early Oct.'20. From Oct.'20 - Mar.'21, there is a slight increasing trend; however, the death rate remains considerably lower.
  • The average time series for cluster 3 shows an increase in average weekly resident deaths per 1000 residents starting mid Sept.'20, with a steep rise from early Oct.'20 - early Dec.'20. The increase is followed by a steep decline until early Jan.'21 and a consistent decline thereafter until Mar.'21.
  • The average time series for cluster 4 shows an increase in average weekly resident deaths per 1000 residents starting Nov.'20, with a steep rise from late Nov.'20 - early Jan.'21. The increase is followed by a steep decline until late Feb.'21 and a consistent decline thereafter until Mar.'21.
  • The final cluster, cluster 1, appears as a flat line with no trend. As seen in the messages generated by the model, this cluster does not have a statistically significant trend. There are 248 counties in this cluster.

Cluster Information: County, State by Cluster ID

Here, we use simple group and query operations to identify counties and states that belong to a particular cluster. We will start by looking at the composition of cluster #3.

In [44]:
# Top 5 states with highest no. of counties in a particular cluster
tsc_resDeaths_df.query('(cluster_id==3)')[['county','state']] \
                .groupby('state').count() \
                .sort_values(by ='county',ascending=False).head()
Out[44]:
county
state
GA 86
TX 77
FL 45
MS 40
LA 33
In [45]:
# Identify counties for a cluster id and state
tsc_resDeaths_df.query('(cluster_id==3) and (state == "CA")')[['cluster_id','county','state','fips']]
Out[45]:
cluster_id county state fips
440 3 Colusa CA 6011
468 3 Napa CA 6055
469 3 Sonoma CA 6097
472 3 Amador CA 6005
488 3 Calaveras CA 6009
493 3 Marin CA 6041
523 3 Madera CA 6039
542 3 Inyo CA 6027
568 3 San Benito CA 6069
582 3 Kings CA 6031
603 3 San Bernardino CA 6071
693 3 Imperial CA 6025
In [46]:
# For each cluster, identify states with highest number of counties
g_df = tsc_resDeaths_df.loc[:,['cluster_id','state','county']] \
                .groupby(['cluster_id','state']).count() \
                .sort_values(by='county', ascending=False) \
                .groupby(level=0).head(1)
g_df
Out[46]:
county
cluster_id state
2 TX 91
3 GA 86
4 MO 46
1 TX 15

Now that we have explored patterns in our time series data, let's get started with building some models to forecast average weekly resident deaths per 1000 residents.

Exponential Smoothing Forecast Model

Exponential smoothing methods [3] are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, more recent observations are weighted higher than past observations. There are three types of exponential smoothing models:

  • Simple Exponential Smoothing [3][4]: smoothes (levels) the data when no trend or seasonal components are present. In this method, the weight of each observation decreases exponentially as we move back in time. In other words, the smallest weights are associated with the oldest observations. This method adds a smoothing constant α (alpha) to assign weights to observations over time. If α is small, more weight is given to observations from the distant past, and if α is large (close to 1), more weight is given to recent observations. This method is suitable for forecasting data with no clear trend or seasonal pattern.
  • Double Exponential Smoothing [3][4]: also called Holt's method, extends the simple exponential smoothing method to allow the forecasting of data with a trend. This method adds a second smoothing constant β (beta) that addresses trends in the data. Like the α factor, values for the β factor fall between zero and one (0< β ≤ 1). The benefit here is that the model can anticipate future increases or decreases, whereas the level model would only work from recent calculations.
  • Triple Exponential Smoothing [3][4]: most closely associated with the Holt-Winters method, this approach adds support for both trend and seasonality in the data. It comprises three smoothing equations: one for the level, one for the trend, and one for the seasonal component, with corresponding smoothing parameters α, β, and δ, respectively. There are two variations of this method that differ in how they treat the seasonal component. The additive method is preferred when seasonal variations are roughly constant, while the multiplicative method is preferred when seasonal variations change in proportion to the level of the series. The equations below are shown for the additive method.
\begin{array}{l} L_{t}=\alpha\left(Y_{t}-S_{t-p}\right)+(1-\alpha)\left(L_{t-1}+T_{t-1}\right) \\ T_{t}=\beta\left(L_{t}-L_{t-1}\right)+(1-\beta) T_{t-1} \\ S_{t}=\delta\left(Y_{t}-L_{t}\right)+(1-\delta) S_{t-p} \\ \hat{Y}_{t}=L_{t-1}+T_{t-1}+S_{t-p} \end{array}

where $L_{t}$ is the level at time t and $\alpha$ is the weight (or smoothing constant) for the level; $T_{t}$ is the trend at time t and $\beta$ is the weight for the trend; $S_{t}$ is the seasonal component at time t, $\delta$ is the weight for the seasonal component, and p is the seasonal period.
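Outside of ArcGIS, the same family of models is available in open-source libraries. The sketch below fits a Holt-Winters (triple) exponential smoothing model to a synthetic weekly series with statsmodels; it assumes statsmodels is installed and is meant only to illustrate the level/trend/seasonality decomposition, not to reproduce the ArcGIS tool used below.

# Illustrative Holt-Winters fit on a synthetic weekly series
from statsmodels.tsa.holtwinters import ExponentialSmoothing

weeks = pd.date_range('2020-06-07', periods=41, freq='W-SUN')
trend = np.linspace(1, 5, 41)                              # slow upward drift
season = 2 * np.sin(np.arange(41) * 2 * np.pi / 4)         # 4-week seasonal cycle
noise = np.random.default_rng(0).normal(0, 0.3, 41)
toy = pd.Series(trend + season + noise, index=weeks)

hw = ExponentialSmoothing(toy, trend='add', damped_trend=True,
                          seasonal='add', seasonal_periods=4).fit()
print(hw.forecast(12).round(2))                            # 12 weekly steps ahead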

The Exponential Smoothing Forecast [5] tool forecasts the values at each location of a space-time cube using the Holt-Winters exponential smoothing method by decomposing the time series at each location of the cube into seasonal and trend components. The model assumes that all components are additive and linear. A damped trend is always used, and additive seasonality is supported but not required. Confidence intervals are constructed under the assumption that the residuals are additive and normally distributed.

In this section, we will:

  • build an exponential smoothing forecast model using the space time cube.
  • generate forecasts.
  • plot errors.

We will use data from June 7, 2020 - March 14, 2021 to train and validate the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.

Build a Model

In [47]:
# Run Algorithm
from arcpy.stpm import ExponentialSmoothingForecast

# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True

# Run
try:
    in_cube = os.path.join(home_dir, sptm_cube_sub)
    analysis_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
    output_features = "usNB_resDeaths_ESFsub"
    out_cube = os.path.join(home_dir, 'usNB_resDeaths_ESFsub_cube.nc')
    # out_cube = None
    number_of_time_steps_to_forecast = 12
    season_length = None
    number_for_validation = 6
    outlier_option = None
    level_of_confidence = "95%"
    maximum_number_of_outliers = None

    ExponentialSmoothingForecast(in_cube, analysis_variable, output_features,out_cube,
                            number_of_time_steps_to_forecast,season_length,number_for_validation,
                            outlier_option,level_of_confidence,maximum_number_of_outliers)
    print(arcpy.GetMessages())
    
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
Start Time: Wednesday, September 8, 2021 12:16:47 PM

----------- Input Space Time Cube Details -----------
Time step interval                              1 week
                                                      
Shape Type                                     Polygon
                                                      
First time step temporal bias                  100.00%
First time step interval                         after
                                   2020-05-31 00:00:00
                                       to on or before
                                   2020-06-07 00:00:00
                                                      
Last time step temporal bias                     0.00%
Last time step interval                          after
                                   2021-03-07 00:00:00
                                       to on or before
                                   2021-03-14 00:00:00
                                                      
Number of time steps                                41
Number of locations analyzed                      2953
Number of space time bins analyzed              121073
-----------------------------------------------------


-------------------- Analysis Details -------------------
Input Space Time Cube                      Forecast Method
usCubeNBLatest.nc                    Exponential Smoothing
---------------------------------------------------------
Number of forecast time steps                           12
Time Steps excluded for validation                       6
% locations modeled with seasonality                 35.83
Minimum of season length                              1.00
Maximum of season length                             13.00
Average season length                                 2.82
Median season length                                  1.00
Std. Dev. of season length                            3.42
---------------------------------------------------------
First forecast time step interval                    after
                                       2021-03-14 00:00:01
                                           to on or before
                                       2021-03-21 00:00:00
                                                          
Last forecast time step interval                     after
                                       2021-05-30 00:00:00
                                           to on or before
                                       2021-06-06 00:00:00
---------------------------------------------------------

------ Summary of Accuracy across Locations ------
Category          Min     Max  Mean  Median  Std. Dev.
Forecast RMSE    0.00  125.71  6.71    4.68       7.49
Validation RMSE  0.00  389.27  6.98    3.28      14.81
--------------------------------------------------

Succeeded at Wednesday, September 8, 2021 12:17:08 PM (Elapsed Time: 20.88 seconds)
In [48]:
# Access model data from local
esf_resDeaths_subdf = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace, output_features))
esf_resDeaths_subdf.head()
Out[48]:
OBJECTID LOCATION FIPS FCAST_1 FCAST_2 FCAST_3 FCAST_4 FCAST_5 FCAST_6 FCAST_7 FCAST_8 FCAST_9 FCAST_10 FCAST_11 FCAST_12 HIGH_1 LOW_1 HIGH_2 LOW_2 HIGH_3 LOW_3 HIGH_4 LOW_4 HIGH_5 LOW_5 HIGH_6 LOW_6 HIGH_7 LOW_7 HIGH_8 LOW_8 HIGH_9 LOW_9 HIGH_10 LOW_10 HIGH_11 LOW_11 HIGH_12 LOW_12 F_RMSE V_RMSE SEASON METHOD HTML_CHART SHAPE
0 1 0 53065 1.058795 1.058795 1.058795 1.058795 1.058795 1.058795 1.058795 1.058795 1.058795 1.058795 1.058795 1.058795 13.067414 -10.949824 13.072486 -10.954896 13.077556 -10.959967 13.082624 -10.965035 13.087690 -10.970100 13.092754 -10.975164 13.097816 -10.980226 13.102875 -10.985285 13.107932 -10.990343 13.112988 -10.995398 13.118041 -11.000451 13.123092 -11.005502 6.126846 2.037500 1 exponential smoothing; seasonality:auto_detect <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-13073230.935899999, 6114197.2246...
1 2 1 53047 0.016359 0.028585 0.039589 0.049492 0.058405 0.066427 0.073647 0.080144 0.085993 0.091256 0.095993 0.100256 28.811219 -28.778502 37.773283 -37.716113 44.986007 -44.906830 51.193395 -51.094411 56.726007 -56.609197 61.765174 -61.632320 66.423127 -66.275834 70.775066 -70.614777 74.874295 -74.702310 78.760236 -78.577724 82.463017 -82.271031 86.006280 -85.805768 14.691255 12.794694 1 exponential smoothing; seasonality:auto_detect <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-13243912.912999999, 6096737.8555...
2 3 2 16021 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1 exponential smoothing; seasonality:auto_detect <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-13027621.124200001, 6247430.6903...
3 4 3 30053 0.100010 0.100010 0.100010 0.100010 0.100010 0.100010 0.100010 0.100010 0.100010 0.100010 0.100010 0.100010 5.614756 -5.414737 5.737636 -5.537616 5.857893 -5.657874 5.975690 -5.775671 6.091172 -5.891152 6.204469 -6.004449 6.315701 -6.115682 6.424977 -6.224958 6.532398 -6.332378 6.638053 -6.438034 6.742028 -6.542009 6.844401 -6.644381 2.813646 0.555628 1 exponential smoothing; seasonality:auto_detect <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-12919238.883299999, 6141609.4641...
4 5 4 30029 0.144388 0.144388 0.144388 0.144388 0.144388 0.144388 0.144388 0.144388 0.144388 0.144388 0.144388 0.144388 7.412144 -7.123368 7.847356 -7.558580 8.259261 -7.970485 8.651244 -8.362468 9.025945 -8.737168 9.385464 -9.096688 9.731511 -9.442735 10.065496 -9.776720 10.388597 -10.099821 10.701815 -10.413039 11.006004 -10.717228 11.301904 -11.013127 3.708039 2.414484 1 exponential smoothing; seasonality:auto_detect <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-12631826.7216, 6040165.092500001...

Generate Forecasts

When forecasting time series data, the idea is to estimate how the sequence of observations will continue into the future. Using the model just created, we will generate a forecast for 12 time periods, i.e., from 2021-03-21 to 2021-06-06.
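Those 12 steps correspond to 12 weekly period-ending dates. As a small, purely illustrative sanity check, pandas can enumerate them:

# The 12 weekly period endings covered by the forecast horizon
forecast_weeks = pd.date_range('2021-03-21', periods=12, freq='W-SUN')
print(forecast_weeks.min().date(), '->', forecast_weeks.max().date())   # 2021-03-21 -> 2021-06-06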

Create Space Time Cube in 3D

To create forecast plots, we need raw, fitted, and forecasted values. The model generates a feature class with forecasted values only, and the data in ts_sub_layer is not aggregated. Let's create a space time cube in 3D using the Visualize Space Time Cube in 3D [6] tool to get the fitted values.

In [49]:
from arcpy.stpm import VisualizeSpaceTimeCube3D

# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True

# Run
in_cube = out_cube
cube_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
display_theme = "FORECAST_RESULTS"
output_features = "usNB_resDeaths_ESFsub_3D"

VisualizeSpaceTimeCube3D(in_cube, cube_variable, display_theme, output_features)
Out[49]:

Output

C:\Users\mohi9282\Documents\ArcGIS\home\Results.gdb\usNB_resDeaths_ESFsub_3D

Messages

Start Time: Wednesday, September 8, 2021 12:17:14 PM
WARNING 110044: The time it takes to render the cube in three dimensions may vary considerably based on the number of features and the graphics card associated with your CPU.
Succeeded at Wednesday, September 8, 2021 12:17:35 PM (Elapsed Time: 20.64 seconds)
In [50]:
# Access cube data from local
cube3D_ESsub_df_resDeaths = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
cube3D_ESsub_df_resDeaths.head()
Out[50]:
OBJECTID TIME_STEP START_DATE END_DATE TIME_EXAG ELEMENT LOCATION FIPS VALUE FITTED RESIDUAL HIGH LOW LEVEL TREND LEVELTREND SEASON SHAPE
0 1 0 2020-05-31 00:00:01 2020-06-07 0.000000 0 0 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
1 2 1 2020-06-07 00:00:01 2020-06-14 20766.235829 2953 0 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
2 3 2 2020-06-14 00:00:01 2020-06-21 41532.471658 5906 0 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
3 4 3 2020-06-21 00:00:01 2020-06-28 62298.707487 8859 0 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
4 5 4 2020-06-28 00:00:01 2020-07-05 83064.943316 11812 0 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
In [51]:
# Check shape
cube3D_ESsub_df_resDeaths.shape
Out[51]:
(156509, 18)
In [52]:
# Merge with FIPS data
fips_df.columns = fips_df.columns.str.upper()
cube3D_ESsub_df_resDeaths = pd.merge(cube3D_ESsub_df_resDeaths, fips_df, on='FIPS')
cube3D_ESsub_df_resDeaths.shape
Out[52]:
(156509, 24)
Clean data

Here, we clean the data and keep only necessary columns.

In [53]:
# Rename Columns
cols_to_rename = {'OBJECTID_y':'OBJECTID',
                 'SHAPE_y':'SHAPE'}
cube3D_ESsub_df_resDeaths.rename(columns=cols_to_rename, inplace=True)
In [54]:
# Drop columns
cols_to_drop = ['OBJECTID_x','TIME_STEP','TIME_EXAG','ELEMENT','LOCATION','SHAPE_x']

cube3D_ESsub_df_resDeaths.drop(columns = cols_to_drop,
                         axis=1, inplace=True)
In [55]:
# Check data
display(cube3D_ESsub_df_resDeaths.shape)
display(cube3D_ESsub_df_resDeaths.columns)
(156509, 18)
Index(['START_DATE', 'END_DATE', 'FIPS', 'VALUE', 'FITTED', 'RESIDUAL', 'HIGH',
       'LOW', 'LEVEL', 'TREND', 'LEVELTREND', 'SEASON', 'SHAPE', 'SHAPE__AREA',
       'SHAPE__LENGTH', 'COUNTY', 'OBJECTID', 'STATE'],
      dtype='object')

Combine Results

Here, we will combine the errors generated using Exponential Smoothing model with the fitted values from the space-time cube cube3D_ESsub_df_resDeaths.

In [56]:
# Get valid values from ES results df
esf_small = esf_resDeaths_subdf[['FIPS','F_RMSE', 'V_RMSE']].copy()
esf_small.shape
Out[56]:
(2953, 3)
In [57]:
# Merge with 3D cube data
cube3D_ESsub_df_resDeaths = pd.merge(cube3D_ESsub_df_resDeaths, esf_small, on='FIPS')
cube3D_ESsub_df_resDeaths.shape
Out[57]:
(156509, 20)
Publish Layer

Next, we will publish the combined model results to a feature layer for ease of access and use.

In [58]:
%%time
# Publish layer
cube3D_ESsub_df_resDeaths.spatial.to_featurelayer('ES_Results')
Wall time: 3min 31s
Out[58]:
ES_Results
Feature Layer Collection by portaladmin
Last Modified: September 08, 2021
0 comments, 0 views

Plot Forecasts

Here, we will create plots that show the raw, fitted, and forecasted average weekly nursing home resident deaths per 1000 residents due to Covid-19 for various counties. The fitted and raw values cover the period June 7, 2020 - March 14, 2021, whereas the forecasted values cover March 21, 2021 - June 6, 2021.
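Before plotting, it can help to pick counties whose forecasts are worth inspecting. One simple, illustrative heuristic is to rank counties by validation RMSE using the model output dataframe created above (esf_resDeaths_subdf):

# Counties with the largest validation errors are often worth a closer look
esf_resDeaths_subdf.sort_values('V_RMSE', ascending=False)[['FIPS', 'F_RMSE', 'V_RMSE']].head()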

Subset Data

Next, we will subset the data to keep only the necessary columns.

In [59]:
# Get column names
cube3D_ESsub_df_resDeaths.columns
Out[59]:
Index(['start_date', 'end_date', 'fips', 'value', 'fitted', 'residual', 'high',
       'low', 'level', 'trend', 'leveltrend', 'season', 'SHAPE', 'shape_area',
       'shape_length', 'county', 'objectid', 'state', 'f_rmse', 'v_rmse'],
      dtype='object')
In [60]:
# Keep necessary columns
resDeaths_cube_ESsub_df = cube3D_ESsub_df_resDeaths[['end_date','fips','county','state','value',
                                                     'fitted','high','low','level','trend','SHAPE']].copy()
resDeaths_cube_ESsub_df.set_index('end_date', inplace=True)
resDeaths_cube_ESsub_df.shape
Out[60]:
(156509, 10)
In [61]:
# Check head of data
resDeaths_cube_ESsub_df.head()
Out[61]:
fips county state value fitted high low level trend SHAPE
end_date
2020-06-07 53065 Stevens WA 0.0 0.0 0.0 0.0 0.0 0.0 {"rings": [[[-13073230.935899999, 6114197.2246...
2020-06-14 53065 Stevens WA 0.0 0.0 0.0 0.0 0.0 0.0 {"rings": [[[-13073230.935899999, 6114197.2246...
2020-06-21 53065 Stevens WA 0.0 0.0 0.0 0.0 0.0 0.0 {"rings": [[[-13073230.935899999, 6114197.2246...
2020-06-28 53065 Stevens WA 0.0 0.0 0.0 0.0 0.0 0.0 {"rings": [[[-13073230.935899999, 6114197.2246...
2020-07-05 53065 Stevens WA 0.0 0.0 0.0 0.0 0.0 0.0 {"rings": [[[-13073230.935899999, 6114197.2246...
Generate Plot

Here, we will create plots that show the raw, fitted, and forecasted values of average weekly resident deaths per 1000 residents for various counties.

Let's define a function ts_plot that will:

  • Create a grid
  • Subset data for a county
  • Plot raw, fitted, and forecast values
  • Plot confidence interval

We can then use this function to generate plots for one or more counties.

In [62]:
# Define function to plot data for one or multiple counties
def ts_plot(data, fips: list, ylabel: str, title: str):
    
    if len(fips) == 1:
        fig = plt.figure(figsize=(18, 12))
    else:
        fig = plt.figure(figsize=(20, 20))
    
    # Create grid
    for sp in range(0,len(fips)):
        if len(fips) == 1:
            ax = fig.add_subplot(1, 1, 1)
        elif len(fips) % 2 == 0:
            # use integer division: subplot grid dimensions must be whole numbers
            ax = fig.add_subplot(len(fips)//2, 2, sp + 1)
        else:
            ax = fig.add_subplot(len(fips)//2 + 1, 2, sp + 1)

        # Subset data for county
        plot_df = data[data['fips']==fips[sp]]

        # Plot raw, fitted and forecast values
        ax = plot_df.loc[:'2021-03-14','value'].plot(legend=True, 
                                  label='Raw Values', marker='o', markersize=10, linewidth=2)
        
        ax = plot_df['fitted'].plot(legend=True, label='Fitted Values',marker='o', 
                                       markersize=10,linewidth=2)

        # Plot confidence interval 
        plt.fill_between(plot_df['fitted'][-12:].index,
                         plot_df.loc['2021-03-21':,'low'],
                         plot_df.loc['2021-03-21':,'high'],
                         color='r', alpha=0.1, label='Confidence Interval')
        
        # Plot forecast start line
        plt.vlines(plot_df.index[-12],
                   ymin=plot_df['low'].min(),
                   ymax=plot_df['high'].max(),
                    color='grey', linewidth=1,
                   linestyle="--", label='Forecast results start here')

        # Label, title and legend
        ax.set(ylabel=ylabel, xlabel='')
        ax.set_title(plot_df['county'].iloc[0]+', '+plot_df['state'].iloc[0], fontsize=20)
        plt.legend(loc='upper left', prop={'size': 12})

        # Remove spines
        for key, spine in ax.spines.items():
            spine.set_visible(False)
        ax.margins(0.01, 0)

    fig.suptitle(title, fontsize=25, y=1.0)
    plt.rc('axes', labelsize=15)
    plt.rc('xtick', labelsize=15)
    plt.rc('ytick', labelsize=15)
    plt.tight_layout() # automatically adjusts layout to fit long labels
    plt.show();

Since Indiana, Arizona, and Pennsylvania were among the top 5 states with the highest average weekly resident deaths per 1000 residents, we will pick one representative county from each of these states (each belonging to a different cluster) and plot them.

County, State FIPS Cluster Id
Maricopa, AZ 4013 2
Delaware, PA 42045 3
Porter, IN 18127 4
In [63]:
# Create Plot
fips = [4013,42045,18127]
ts_plot(resDeaths_cube_ESsub_df,fips,
        ylabel='Resident death rate',
        title='Average weekly resident death rates')

The plots show:

  • the fitted values as generated by the Exponential Smoothing model for June 7, 2020 - March 14, 2021.
  • the forecasted values generated by the model for March 21, 2021 - June 6, 2021.
  • the actual resident death rates for June 7, 2020 - March 14, 2021.
  • the confidence interval of the forecast for March 21, 2021 - June 6, 2021.

Observations:

  • The fitted values generated by the Exponential Smoothing model follow the actual death rates fairly closely for Maricopa County, AZ and Delaware County, PA. For Porter County, IN, the fitted values deviate noticeably from the raw values. This could be a result of the model generalizing the training data rather than overfitting it.
  • The forecast values differ across counties, which is expected, as a separate model is fit for each county.
  • Carefully observing the plot for Porter County, IN, we see that although its confidence interval appears narrower than the others, the resident death rate for Porter County is slightly higher than in the other counties. As such, the absolute range of its confidence interval could be similar to, or even larger than, those of the other counties.
  • The increasing width of the confidence intervals reflects growing uncertainty in the forecasted values as the forecast horizon increases.

Plot Errors

The forecast RMSE measures how much the fitted values from the model differ from the raw time series values. It only measures how well the Exponential Smoothing model fits the raw time series; it does not measure how well the model forecasts future values. That question is addressed by the validation model. Here, the forecast model is trained on the full series from June 7, 2020 - March 14, 2021 and then forecasts the next 12 time steps (March 21, 2021 - June 6, 2021).

The validation model is used to determine how well the forecast model can predict the future values of each time series. It is constructed by excluding some of the final time steps of each time series and fitting the Exponential Smoothing model to the data that was not excluded. This model is then used to forecast the withheld values, and the forecasted values are compared to the raw values that were hidden. Since we are withholding 6 time steps for validation, the validation model is trained on data from June 7, 2020 - January 31, 2021 and evaluated on the withheld weeks ending February 7, 2021 - March 14, 2021.
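As a quick, illustrative check (a sketch, not part of the tool's output), the forecast RMSE for a single county could be recomputed directly from the raw and fitted values in the cube dataframe we subset above:

import numpy as np

# Illustrative sketch: recompute the forecast RMSE for Maricopa County, AZ (FIPS 4013)
# from the raw and fitted values over the fitted period (June 7, 2020 - March 14, 2021).
# The F_RMSE field reported by the tool remains the authoritative value.
maricopa = resDeaths_cube_ESsub_df[resDeaths_cube_ESsub_df['fips'] == 4013]
fit_period = maricopa.loc[:'2021-03-14']
forecast_rmse = np.sqrt(((fit_period['value'] - fit_period['fitted']) ** 2).mean())
print(f"Recomputed forecast RMSE (Maricopa, AZ): {forecast_rmse:.2f}")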

You can read more about the two errors here. Let's map the forecast and validation errors.

In [66]:
# Search for layer
es_item = gis.content.search('ES_Results', 'Feature Layer')
es_lyr = es_item[0].layers[0]
es_lyr
Out[66]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/a00c82/FeatureServer/0">

Validation Errors on a Map

In [69]:
# Create map
m1= gis.map('US', zoomlevel=4)
m1
Out[69]:

The map shows the variation in validation errors generated by the Exponential Smoothing model for each county. For some counties the errors are very high, with values in the highest band ranging from 9.8 to 389.

In [67]:
# Plot data
m1.add_layer(es_lyr, 
            {"renderer":"ClassedColorRenderer",
           "field_name":"V_RMSE",
           "opacity":0.7
            })
In [68]:
# Add legend and screenshot
m1.legend = True
m1.take_screenshot(output_in_cell=False)

Forecast Errors on a Map

In [73]:
# Create map
m2 = gis.map('US', zoomlevel=4)
m2
Out[73]:

The map shows the variation in forecast errors generated by the Exponential Smoothing model for each county. For some counties the errors are high, with values in the highest band ranging from 9.8 to 125.

In [71]:
# Plot data
m2.add_layer(es_lyr, 
            {"renderer":"ClassedColorRenderer",
           "field_name":"F_RMSE",
           "opacity":0.7
            })
In [72]:
# Add legend and screenshot
m2.legend = True
m2.take_screenshot(output_in_cell=False)

Let's identify the counties with high variance in forecast and validation errors.

In [74]:
high_var = cube3D_ESsub_df_resDeaths.query(
                    'f_rmse > 9.8 and v_rmse > 9.8').groupby(
                    ['state', 'county']).aggregate('count')[['f_rmse', 'v_rmse']]
high_var
Out[74]:
f_rmse v_rmse
state county
AL Cleburne 53 53
Crenshaw 53 53
Pike 53 53
Walker 53 53
AR Chicot 53 53
... ... ... ...
WV Mercer 53 53
Monroe 53 53
Upshur 53 53
Wyoming 53 53
WY Fremont 53 53

292 rows × 2 columns

From the results, we can see that there are 292 counties for which the variation in error is in the highest band for both validation and forecast errors.
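As an aside (illustrative only), the counties with the largest validation errors could also be listed directly from the dataframe:

# Illustrative: ten counties with the largest Exponential Smoothing validation RMSE.
top10 = (cube3D_ESsub_df_resDeaths
         .groupby(['state', 'county'])['v_rmse'].first()
         .sort_values(ascending=False)
         .head(10))
print(top10)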

Now that we have built the Exponential Smoothing forecast model and generated forecasts for our time series data, let's look into the Forest-based Forecast model to forecast average weekly resident deaths per 1000 residents.

Forest-based Forecast Model

The Forest-based Forecast tool [7] forecasts the values of each location of a space-time cube using an adaptation of Leo Breiman's random forest algorithm. This tool uses the same underlying algorithm as the Forest-based Classification and Regression tool when used for regression. The training data used to build the forest regression model is constructed using time windows on each location of the space-time cube.

image.png

Source

Compared to other forecasting tools in the Time Series Forecasting toolset, this tool is the most complex but includes the fewest assumptions about the data.
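As a conceptual sketch (illustrative only, not the tool's internal implementation), a time window of length k turns each location's series into lagged training pairs, where the k previous values serve as explanatory variables for predicting the next value:

import numpy as np

# Conceptual sketch of time-window training data; the Forest-based Forecast tool
# builds an analogous set of windows internally for every location in the cube.
def window_training_pairs(series, window):
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series[i - window:i])   # the `window` previous values
        y.append(series[i])              # the next value to predict
    return np.array(X), np.array(y)

# Toy weekly series with a window of 3
X, y = window_training_pairs([0, 1, 3, 2, 5, 4, 6], window=3)
print(X.shape, y.shape)   # (4, 3) (4,)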

In this section, we will:

  • build a forest-based forecast model using the space time cube.
  • generate forecasts.
  • plot errors.

We will use data from June 7, 2020 - March 14, 2021 to train and validate the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting.

Build a Model

In [75]:
# Run Algorithm
from arcpy.stpm import ForestBasedForecast

# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True

# Run
try:
    in_cube = os.path.join(home_dir, sptm_cube_sub)
    analysis_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
    output_features = "usNB_resDeaths_FBFsub"
    out_cube = os.path.join(home_dir, 'usNB_resDeaths_FBFsub_cube.nc')
    number_of_time_steps_to_forecast = 12
    time_window = None
    # season_length = None
    number_for_validation = 6
    number_of_trees = 1000
    minimum_leaf_size = None
    maximum_depth = None
    sample_size = 100
    forecast_approach = "VALUE_DETREND"
    outlier_option = None
    level_of_confidence = "95%"
    maximum_number_of_outliers = None

    ForestBasedForecast(in_cube, analysis_variable, output_features,out_cube,
                        number_of_time_steps_to_forecast,time_window,number_for_validation,
                        number_of_trees,minimum_leaf_size,maximum_depth,sample_size,forecast_approach,
                        outlier_option,level_of_confidence,maximum_number_of_outliers)
    
    print(arcpy.GetMessages())
    
except arcpy.ExecuteError:
    # If any error occurred when running the tool, print the messages
    print(arcpy.GetMessages(2))
Start Time: Wednesday, September 8, 2021 12:29:35 PM

----------- Input Space Time Cube Details -----------
Time step interval                              1 week
                                                      
Shape Type                                     Polygon
                                                      
First time step temporal bias                  100.00%
First time step interval                         after
                                   2020-05-31 00:00:00
                                       to on or before
                                   2020-06-07 00:00:00
                                                      
Last time step temporal bias                     0.00%
Last time step interval                          after
                                   2021-03-07 00:00:00
                                       to on or before
                                   2021-03-14 00:00:00
                                                      
Number of time steps                                41
Number of locations analyzed                      2953
Number of space time bins analyzed              121073
-----------------------------------------------------


------------------- Analysis Details ------------------
Input Space Time Cube                    Forecast Method
usCubeNBLatest.nc                           Forest-based
-------------------------------------------------------
Number of forecast time steps                         12
Time Steps excluded for validation                     6
% locations modeled with seasonality               35.83
Minimum of time window                              2.00
Maximum of time window                             13.00
Average time window                                 8.59
Median time window                                 10.00
Std. Dev. of time window                            3.06
-------------------------------------------------------
First forecast time step interval                  after
                                     2021-03-14 00:00:01
                                         to on or before
                                     2021-03-21 00:00:00
                                                        
Last forecast time step interval                   after
                                     2021-05-30 00:00:00
                                         to on or before
                                     2021-06-06 00:00:00
-------------------------------------------------------

------ Summary of Accuracy across Locations ------
Category          Min     Max  Mean  Median  Std. Dev.
Forecast RMSE    0.00   64.19  3.73    2.48       4.54
Validation RMSE  0.00  140.69  5.39    3.14       8.00
--------------------------------------------------

Succeeded at Wednesday, September 8, 2021 12:30:56 PM (Elapsed Time: 1 minutes 20 seconds)
In [76]:
# Access FBF data from local
fbf_resDeaths_subdf = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
fbf_resDeaths_subdf.head()
Out[76]:
OBJECTID LOCATION FIPS FCAST_1 FCAST_2 FCAST_3 FCAST_4 FCAST_5 FCAST_6 FCAST_7 FCAST_8 FCAST_9 FCAST_10 FCAST_11 FCAST_12 HIGH_1 LOW_1 HIGH_2 LOW_2 HIGH_3 LOW_3 HIGH_4 LOW_4 HIGH_5 LOW_5 HIGH_6 LOW_6 HIGH_7 LOW_7 HIGH_8 LOW_8 HIGH_9 LOW_9 HIGH_10 LOW_10 HIGH_11 LOW_11 HIGH_12 LOW_12 F_RMSE V_RMSE TIMEWINDOW IS_SEASON METHOD HTML_CHART SHAPE
0 1 0 53065 3.280066 4.187650 4.130642 4.408497 4.307801 4.159690 4.795857 4.925029 4.361156 3.757626 2.769107 2.835004 17.674663 0.195682 32.292750 -2.807511 46.073265 -6.656661 60.010600 -10.350181 73.717314 -14.242134 87.583994 -17.918033 101.781844 -21.213006 115.410313 -25.054906 128.851084 -28.936353 127.378642 -31.930005 126.564565 -33.309451 126.808967 -33.636652 3.576085 1.800630 10 0 forest-based; seed:32009; number_of_trees:1000... <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-13073230.935899999, 6114197.2246...
1 2 1 53047 9.595976 13.047301 16.396436 15.240901 14.375925 14.717977 12.942141 11.612516 9.839731 7.786485 6.390948 6.452929 30.129470 0.465302 71.244499 -8.403254 109.197685 -20.521899 143.860987 -35.648077 179.692873 -49.383438 216.492842 -62.123885 252.973609 -74.622399 271.006249 -85.058460 282.017671 -94.031924 293.778844 -100.487063 293.127178 -104.761302 294.031588 -107.712477 9.198349 3.824102 10 0 forest-based; seed:32009; number_of_trees:1000... <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-13243912.912999999, 6096737.8555...
2 3 2 16021 -0.000003 -0.000003 -0.000003 -0.000003 -0.000003 -0.000003 -0.000003 -0.000003 -0.000003 -0.000003 -0.000003 -0.000003 0.000007 -0.000009 0.000017 -0.000016 0.000027 -0.000023 0.000036 -0.000030 0.000046 -0.000036 0.000056 -0.000043 0.000066 -0.000050 0.000075 -0.000056 0.000085 -0.000063 0.000095 -0.000070 0.000105 -0.000077 0.000114 -0.000083 0.000002 0.000003 10 0 forest-based; seed:32009; number_of_trees:1000... <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-13027621.124200001, 6247430.6903...
3 4 3 30053 0.886052 1.719032 1.902360 2.636298 2.352632 2.121013 1.936059 1.808111 1.618762 1.147915 0.486185 0.497271 7.998071 0.021748 15.530076 -0.827406 28.434965 -2.494942 41.214422 -4.330191 53.263423 -6.880947 65.595626 -9.126077 78.158723 -11.113424 90.906601 -12.881377 97.564938 -14.483893 104.049252 -15.820655 103.471270 -16.532877 103.567337 -16.571082 1.730725 0.311888 10 0 forest-based; seed:32009; number_of_trees:1000... <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-12919238.883299999, 6141609.4641...
4 5 4 30029 3.605844 5.109240 6.789801 8.178999 7.754497 8.279101 6.984334 6.744166 5.875686 5.429073 5.431167 4.187095 13.484862 0.129425 23.861338 -3.241755 38.323356 -7.996128 51.223241 -13.872881 62.811414 -21.059821 74.933416 -27.762729 86.586476 -34.921460 99.596797 -40.774308 112.901204 -46.392580 127.152738 -51.106451 141.936570 -55.441406 151.201586 -59.731379 2.300957 2.232414 10 0 forest-based; seed:32009; number_of_trees:1000... <html>\n<head>\n <meta charset = "utf-8">\n ... {"rings": [[[-12631826.7216, 6040165.092500001...

Generate Forecasts

When forecasting time series data, the idea is to estimate how the sequence of observations will continue into the future. Using the model just created, we will generate a forecast for 12 time steps, from March 21, 2021 to June 6, 2021.

Create Space Time Cube in 3D

To create forecast plots, we need the raw, fitted, and forecasted values. The model generates a feature class with forecasted values only. The data in ppe_ts_layer is also not aggregated. Let's create a space time cube in 3D using the Visualize Space Time Cube in 3D [6] tool to get the fitted values.

In [77]:
from arcpy.stpm import VisualizeSpaceTimeCube3D

# Workspace
arcpy.env.workspace
arcpy.env.overwriteOutput = True

# Run
in_cube = out_cube
cube_variable = "WEEKLY_RESIDENT_COVID19_DEATHS__MEAN_SPACE_TIME_NEIGHBORS"
display_theme = "FORECAST_RESULTS"
output_features = "usNB_resDeaths_FBFsub_3D"

VisualizeSpaceTimeCube3D(in_cube, cube_variable, display_theme, output_features)
Out[77]:

Output

C:\Users\mohi9282\Documents\ArcGIS\home\Results.gdb\usNB_resDeaths_FBFsub_3D

Messages

Start Time: Wednesday, September 8, 2021 12:31:31 PM
WARNING 110044: The time it takes to render the cube in three dimensions may vary considerably based on the number of features and the graphics card associated with your CPU.
Succeeded at Wednesday, September 8, 2021 12:31:51 PM (Elapsed Time: 19.97 seconds)
In [78]:
# Access cube data from local
cube3D_FBFsub_df_resDeaths = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
cube3D_FBFsub_df_resDeaths.head()
Out[78]:
OBJECTID TIME_STEP START_DATE END_DATE TIME_EXAG ELEMENT LOCATION FIPS VALUE FITTED RESIDUAL HIGH LOW SHAPE
0 1 0 2020-05-31 00:00:01 2020-06-07 0.000000 0 0 53065 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
1 2 1 2020-06-07 00:00:01 2020-06-14 20766.235829 2953 0 53065 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
2 3 2 2020-06-14 00:00:01 2020-06-21 41532.471658 5906 0 53065 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
3 4 3 2020-06-21 00:00:01 2020-06-28 62298.707487 8859 0 53065 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
4 5 4 2020-06-28 00:00:01 2020-07-05 83064.943316 11812 0 53065 0.0 0.0 0.0 0.0 0.0 {"x": -13119702.4761, "y": 6173542.7597, "z": ...
In [79]:
# Check shape
cube3D_FBFsub_df_resDeaths.shape
Out[79]:
(156509, 14)
In [80]:
# Merge with FIPS data
cube3D_FBFsub_df_resDeaths = pd.merge(cube3D_FBFsub_df_resDeaths, fips_df, on='FIPS')
cube3D_FBFsub_df_resDeaths.shape
Out[80]:
(156509, 20)
Clean data

Next, we will clean the data and keep only the necessary columns.

In [81]:
# Rename Columns
cols_to_rename = {'OBJECTID_y':'OBJECTID',
                 'SHAPE_y':'SHAPE'}
cube3D_FBFsub_df_resDeaths.rename(columns=cols_to_rename, inplace=True)
In [82]:
# Drop columns

cols_to_drop = ['OBJECTID_x','TIME_STEP','TIME_EXAG','ELEMENT','LOCATION','SHAPE_x']

cube3D_FBFsub_df_resDeaths.drop(columns = cols_to_drop,
                         axis=1, inplace=True)
In [83]:
# Check data
display(cube3D_FBFsub_df_resDeaths.shape)
display(cube3D_FBFsub_df_resDeaths.columns)
(156509, 14)
Index(['START_DATE', 'END_DATE', 'FIPS', 'VALUE', 'FITTED', 'RESIDUAL', 'HIGH',
       'LOW', 'SHAPE', 'SHAPE__AREA', 'SHAPE__LENGTH', 'COUNTY', 'OBJECTID',
       'STATE'],
      dtype='object')

Combine Results

Here, we will combine the errors generated by the Forest-based Forecast model with the fitted values from the space-time cube dataframe cube3D_FBFsub_df_resDeaths.

In [84]:
# Get F_RMSE and V_RMSE from the Forest-based results df
fbf_small = fbf_resDeaths_subdf[['FIPS','F_RMSE', 'V_RMSE']].copy()
fbf_small.shape
Out[84]:
(2953, 3)
In [85]:
# Merge with RMSE data
cube3D_FBFsub_df_resDeaths = pd.merge(cube3D_FBFsub_df_resDeaths, fbf_small, on='FIPS')
cube3D_FBFsub_df_resDeaths.shape
Out[85]:
(156509, 16)
Publish Layer

We will publish the combined model results to a feature layer for ease of access and use.

In [86]:
%%time
fbf_item = cube3D_FBFsub_df_resDeaths.spatial.to_featurelayer('FBF_BigLayer')
fbf_item
Wall time: 4min 30s
Out[86]:
FBF_BigLayer
Feature Layer Collection by portaladmin
Last Modified: September 08, 2021
0 comments, 0 views

Plot Forecasts

Here, we will create plots that show the raw, fitted, and forecasted average weekly nursing home resident deaths per 1000 residents due to Covid-19 for various counties. The fitted and raw values cover the time period June 7, 2020 - March 14, 2021, whereas the forecasted values cover March 21, 2021 - June 6, 2021.

Subset Data

Next, we will subset the data to keep only the necessary columns.

In [87]:
# Check Column names
cube3D_FBFsub_df_resDeaths.columns
Out[87]:
Index(['start_date', 'end_date', 'fips', 'value', 'fitted', 'residual', 'high',
       'low', 'SHAPE', 'shape_area', 'shape_length', 'county', 'objectid',
       'state', 'f_rmse', 'v_rmse'],
      dtype='object')
In [88]:
# Subset specific columns
resDeaths_cube_FBFsub_df = cube3D_FBFsub_df_resDeaths[['end_date','fips','county','state','value',
                                                 'fitted','residual','high','low']].copy()
resDeaths_cube_FBFsub_df.set_index('end_date', inplace=True)
resDeaths_cube_FBFsub_df.shape
Out[88]:
(156509, 8)
Generate Plot

Here, we will create plots that show the raw, fitted, and forecasted values of the average weekly resident deaths per 1000 residents for various counties. Let's use the ts_plot() function, defined earlier, to create the plots. Since Indiana, Arizona, and Pennsylvania were among the top 5 states with the highest average weekly resident deaths per 1000 residents, we will pick one representative county from each of these states (each belonging to a different cluster) and plot them.

County, State FIPS Cluster Id
Maricopa, AZ 4013 2
Delaware, PA 42045 3
Porter, IN 18127 4
In [89]:
# Plot data using the ts_plot function
fips = [4013,42045,18127]
ts_plot(resDeaths_cube_FBFsub_df,fips,
        ylabel='Resident death rate',
        title='Average weekly resident death rates')

The plots show:

  • the fitted values as generated by the Forest-based Forecast model for June 7, 2020 - March 14, 2021.
  • the forecasted values generated by the model for March 21, 2021 - June 6, 2021.
  • the actual resident death rates for June 7, 2020 - March 14, 2021.
  • the confidence interval of the forecast for March 21, 2021 - June 6, 2021.

Observations:

  • The fitted values generated by the model seem to follow the actual death rates closely, with slight variation from the actual data. This could be a result of the model trying to generalize the training data, rather than overfitting it.
  • The forecast values vary for different counties, which is expected, as a different model is generated for each county.
  • The confidence intervals generated by the model are much wider than those of the Exponential Smoothing model, making it difficult to visually compare the fitted values against the actual data (see the quick check after this list).
  • The increasing width of the confidence intervals reflects growing uncertainty in the forecasted values as the forecast horizon increases.
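A rough way to quantify the difference in interval width (an illustrative check, not part of the tool output) is to compare the average width of the forecast confidence interval for one county across the two subsets we built for plotting:

# Illustrative check: average forecast confidence-interval width (high - low)
# for Maricopa County, AZ over the forecast window (March 21 - June 6, 2021).
def mean_ci_width(df, fips_code):
    forecast_window = df[df['fips'] == fips_code].loc['2021-03-21':]
    return (forecast_window['high'] - forecast_window['low']).mean()

for model_name, model_df in [('Exponential Smoothing', resDeaths_cube_ESsub_df),
                             ('Forest-based', resDeaths_cube_FBFsub_df)]:
    print(f"{model_name}: {mean_ci_width(model_df, 4013):.1f}")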

Plot Errors

The forecast RMSE measures how much the fitted values from the model differ from the raw time series values. It only measures how well the Forest-based Forecast model fits the raw time series; it does not measure how well the model forecasts future values. That question is addressed by the validation model. Here, the forecast model is trained on the full series from June 7, 2020 - March 14, 2021 and then forecasts the next 12 time steps (March 21, 2021 - June 6, 2021).

The validation model is used to determine how well the forecast model can predict the future values of each time series. It is constructed by excluding some of the final time steps of each time series and fitting the Forest-based Forecast model to the data that was not excluded. This model is then used to forecast the withheld values, and the forecasted values are compared to the raw values that were hidden. Since we are withholding 6 time steps for validation, the validation model is trained on data from June 7, 2020 - January 31, 2021 and evaluated on the withheld weeks ending February 7, 2021 - March 14, 2021.

You can read more about the two errors here. Let's map the forecast and validation errors.

In [91]:
# Search for layer
fbf_item = gis.content.search('FBF_BigLayer', 'Feature Layer')
fbf_lyr = fbf_item[0].layers[0]
fbf_lyr
Out[91]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/aad982/FeatureServer/0">

Validation Errors on a Map

In [95]:
# Create map
fbf_m1 = gis.map('US', zoomlevel=4)
fbf_m1
Out[95]:

The map shows the variation in validation errors generated by the Forest-based Forecast model for each county. For some counties the errors are very high, with values in the highest band ranging from 7.8 to 137.

In [93]:
# Plot Data
fbf_m1.remove_layers()
fbf_m1.add_layer(fbf_lyr, 
                    {"renderer":"ClassedColorRenderer",
                   "field_name":"V_RMSE",
                   "opacity":0.7
                    })
In [94]:
# Take screenshot
fbf_m1.legend=True
fbf_m1.take_screenshot(output_in_cell=False)

Forecast Errors on a Map

In [99]:
# Create map
fbf_m2 = gis.map('US', zoomlevel=4)
fbf_m2
Out[99]:

The map shows the variation in forecast errors generated by the Forest-based Forecast model for each county. For some counties the errors are high, with values in the highest band ranging from 5.5 to 66.

In [97]:
# Plot Data
fbf_m2.remove_layers()
fbf_m2.add_layer(fbf_lyr, 
                    {"renderer":"ClassedColorRenderer",
                   "field_name":"F_RMSE",
                   "opacity":0.7
                    })
In [98]:
# Add layer and take screenshot
fbf_m2.legend=True
fbf_m2.take_screenshot(output_in_cell=False)

Let's identify the counties with high variance in forecast and validation errors.

In [100]:
high_var = cube3D_FBFsub_df_resDeaths.query(
                    'f_rmse > 5.5 and v_rmse > 7.8').groupby(
                    ['state', 'county']).count()[['f_rmse', 'v_rmse']]
high_var
Out[100]:
f_rmse v_rmse
state county
AL Cleburne 53 53
Pike 53 53
AR Bradley 53 53
Cleburne 53 53
Cross 53 53
... ... ... ...
WV Upshur 53 53
Wetzel 53 53
Wyoming 53 53
WY Crook 53 53
Fremont 53 53

316 rows × 2 columns

From the results, we can see that there are 316 counties for which the variation in error falls in the highest band for both validation and forecast errors.

Now that we have built the Forest-based Forecast model and generated forecasts for our time series data, let's combine the results from the models built above.

Combine Exponential Smoothing and Forest-based Results

In this section, we will combine the results generated from the Exponential Smoothing and Forest-based forecast models. The combined results will be used later to generate plots overlaying various models.
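Before merging, a quick, purely illustrative comparison of the two models' per-county errors can be sketched from the cube dataframes (a sanity check only, not part of the published results):

# Illustrative comparison of per-county RMSE values from the two models.
es_err = cube3D_ESsub_df_resDeaths.groupby('fips')[['f_rmse', 'v_rmse']].first()
fb_err = cube3D_FBFsub_df_resDeaths.groupby('fips')[['f_rmse', 'v_rmse']].first()
err_cmp = es_err.join(fb_err, lsuffix='_ES', rsuffix='_FB')
print(err_cmp.describe().round(2))
print('Counties with lower validation RMSE under the Forest-based model:',
      (err_cmp['v_rmse_FB'] < err_cmp['v_rmse_ES']).sum())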

In [101]:
# Check ES df
display(cube3D_ESsub_df_resDeaths.columns)
display(cube3D_ESsub_df_resDeaths.shape)

# Check FB df
display(cube3D_FBFsub_df_resDeaths.columns)
display(cube3D_FBFsub_df_resDeaths.shape)
Index(['start_date', 'end_date', 'fips', 'value', 'fitted', 'residual', 'high',
       'low', 'level', 'trend', 'leveltrend', 'season', 'SHAPE', 'shape_area',
       'shape_length', 'county', 'objectid', 'state', 'f_rmse', 'v_rmse'],
      dtype='object')
(156509, 20)
Index(['start_date', 'end_date', 'fips', 'value', 'fitted', 'residual', 'high',
       'low', 'SHAPE', 'shape_area', 'shape_length', 'county', 'objectid',
       'state', 'f_rmse', 'v_rmse'],
      dtype='object')
(156509, 16)

Merge datasets

Here, we will merge the Exponential Smoothing and Forest-based model results.

In [102]:
# Merge on fips and dates
es_fb_df = pd.merge(cube3D_ESsub_df_resDeaths, cube3D_FBFsub_df_resDeaths, 
                    on=['fips', 'start_date', 'end_date'])
In [103]:
# Check merge results
display(es_fb_df.columns)
display(es_fb_df.shape)
display(es_fb_df.head())
Index(['start_date', 'end_date', 'fips', 'value_x', 'fitted_x', 'residual_x',
       'high_x', 'low_x', 'level', 'trend', 'leveltrend', 'season', 'SHAPE_x',
       'shape_area_x', 'shape_length_x', 'county_x', 'objectid_x', 'state_x',
       'f_rmse_x', 'v_rmse_x', 'value_y', 'fitted_y', 'residual_y', 'high_y',
       'low_y', 'SHAPE_y', 'shape_area_y', 'shape_length_y', 'county_y',
       'objectid_y', 'state_y', 'f_rmse_y', 'v_rmse_y'],
      dtype='object')
(156509, 33)
start_date end_date fips value_x fitted_x residual_x high_x low_x level trend leveltrend season SHAPE_x shape_area_x shape_length_x county_x objectid_x state_x f_rmse_x v_rmse_x value_y fitted_y residual_y high_y low_y SHAPE_y shape_area_y shape_length_y county_y objectid_y state_y f_rmse_y v_rmse_y
0 2020-05-31 00:00:01 2020-06-07 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 6.126846 2.0375 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 3.576085 1.80063
1 2020-06-07 00:00:01 2020-06-14 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 6.126846 2.0375 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 3.576085 1.80063
2 2020-06-14 00:00:01 2020-06-21 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 6.126846 2.0375 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 3.576085 1.80063
3 2020-06-21 00:00:01 2020-06-28 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 6.126846 2.0375 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 3.576085 1.80063
4 2020-06-28 00:00:01 2020-07-05 53065 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 6.126846 2.0375 0.0 0.0 0.0 0.0 0.0 {'rings': [[[-13073230.935899999, 6114197.2246... 1.484728e+10 684811.980009 Stevens 2 WA 3.576085 1.80063

Clean data

Next, we will clean the data and keep the necessary columns.

In [104]:
# Rename Columns
cols_to_rename = {'objectid_x':'OBJECTID',
                 'SHAPE_x':'SHAPE',
                 'county_x':'county',
                 'state_x':'state'}
es_fb_df.rename(columns=cols_to_rename, inplace=True)
In [105]:
# Drop columns

cols_to_drop = ['objectid_y','SHAPE_y','state_y','county_y','value_x','value_y']

es_fb_df.drop(columns = cols_to_drop,
            axis=1, inplace=True)
In [106]:
# Check names
display(es_fb_df.shape)
display(es_fb_df.columns)
(156509, 27)
Index(['start_date', 'end_date', 'fips', 'fitted_x', 'residual_x', 'high_x',
       'low_x', 'level', 'trend', 'leveltrend', 'season', 'SHAPE',
       'shape_area_x', 'shape_length_x', 'county', 'OBJECTID', 'state',
       'f_rmse_x', 'v_rmse_x', 'fitted_y', 'residual_y', 'high_y', 'low_y',
       'shape_area_y', 'shape_length_y', 'f_rmse_y', 'v_rmse_y'],
      dtype='object')
In [107]:
# Replace x and y with ES, FB
es_fb_df.columns = es_fb_df.columns.str.replace('_x', '_ES')
es_fb_df.columns = es_fb_df.columns.str.replace('_y', '_FB')
In [108]:
# Rename some ES cols
cols_to_rename = {'level':'level_ES', 'trend':'trend_ES',
                  'leveltrend':'leveltrend_ES', 'season':'season_ES'}

es_fb_df = es_fb_df.rename(columns = cols_to_rename)
In [109]:
display(es_fb_df.shape)
display(es_fb_df.columns)
(156509, 27)
Index(['start_date', 'end_date', 'fips', 'fitted_ES', 'residual_ES', 'high_ES',
       'low_ES', 'level_ES', 'trend_ES', 'leveltrend_ES', 'season_ES', 'SHAPE',
       'shape_area_ES', 'shape_length_ES', 'county', 'OBJECTID', 'state',
       'f_rmse_ES', 'v_rmse_ES', 'fitted_FB', 'residual_FB', 'high_FB',
       'low_FB', 'shape_area_FB', 'shape_length_FB', 'f_rmse_FB', 'v_rmse_FB'],
      dtype='object')

Publish to Feature Layer

We will publish the combined model results to a feature layer for ease of access and use.

In [110]:
%%time
es_fb_item = es_fb_df.spatial.to_featurelayer('ES_FB_BigLayer')
es_fb_item
Wall time: 4min 24s
Out[110]:
ES_FB_BigLayer
Feature Layer Collection by portaladmin
Last Modified: September 08, 2021
0 comments, 0 views

Summary

In this notebook, we:

  • Started with data exploration to get an overview of the distribution of average weekly resident deaths per 1000 residents.
  • Created a Space Time Cube to structure the data into a netCDF data format.
  • Clustered the data using Time Series Clustering to better understand some patterns in our space-time data.
    • Created average time series per cluster.
  • Generated forecasts for average weekly resident deaths per 1000 residents using Exponential Smoothing Forecast and Forest-based Forecast tools.
  • Plotted forecast and validation errors as generated by the two models.

In the next notebook, we will generate forecasts using the ARIMA (Autoregressive Integrated Moving Average) model and create plots overlaying fitted and forecasted values from all of the models we have tested.

References