The purpose of this notebook is to clean and engineer the data to be used for the COVID-19 Nursing Home Resident Deaths study. The Centers for Medicare and Medicaid Services (CMS) provides a Nursing Home COVID-19 Public File, which includes data reported by nursing homes to the CDC's National Healthcare Safety Network (NHSN) system's COVID-19 Long Term Care Facility Module. The data includes Resident Impact, Facility Capacity, Staff & Personnel, Personal Protective Equipment, Ventilator Capacity, and other Supplies information.
Dataset: https://data.cms.gov/covid-19/covid-19-nursing-home-data
We downloaded the data as a .csv file where each record represents information about an individual nursing home in the United States for each week from May 25, 2020 - June 6, 2021. The .csv file with the raw data can be found here, and the data dictionary can be found here.
General Imports
# Import Libraries
from IPython.display import display
# Import arcgis
import arcgis
from arcgis.gis import GIS
# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np
# Import library for time and os
import time
import os
# Create a GIS connection
gis = GIS("home")
# Read and check shape
data = pd.read_csv('COVID-19_Nursing_Home_Dataset_062121.csv', low_memory=False)
data.shape
# Identify null location values
pd.isnull(data['Geolocation']).sum()
# Drop records with no location data
data.dropna(subset=['Geolocation'], inplace=True)
data.shape
# Create lat long
data['LONGITUDE'] = data.Geolocation.str.extract(r'([-.0-9]+)', expand=True).astype(float)
data['LATITUDE'] = data.Geolocation.str.extract(r'([-.0-9]+)\)$', expand=True).astype(float)
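If the Geolocation field is a WKT-style point string such as "POINT (-96.79 32.77)" (an assumption based on the regexes above, with longitude listed first), both coordinates can also be extracted in a single pass; this is just an optional alternative sketch:
# Optional alternative: extract longitude and latitude in one pass from an assumed WKT point string
coords = data['Geolocation'].str.extract(r'POINT \(([-\d.]+) ([-\d.]+)\)')
data['LONGITUDE'] = coords[0].astype(float)
data['LATITUDE'] = coords[1].astype(float)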
# Create SeDF from lat/long
ppedf = pd.DataFrame.spatial.from_xy(data,'LONGITUDE','LATITUDE')
# Check geometry
ppedf['SHAPE']
# Check shape of data
ppedf.shape
# Drop columns
drop_cols = ['Geolocation','LATITUDE','LONGITUDE']
ppedf.drop(drop_cols,axis=1,inplace=True)
# Sanitize Column Names
ppedf.spatial.sanitize_column_names(inplace=True)
Some of the column names are very long, so we will rename them; a hedged sketch of what such a rename could look like follows below.
Note: long column names can lead to the following error when publishing as a feature layer, so make sure to rename all columns with long names so that publishing doesn't fail:
RuntimeError: The operation was attempted on an empty geometry.
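As an illustrative sketch only (this notebook instead creates short copies of the two long testing-ability columns further down, so the rename itself is left commented out), a rename could be done with a mapping like this:
# Illustrative only: map very long column names to shorter ones before publishing
long_name_map = {
    'able_to_test_or_obtain_resources_to_test_all_current_residents_within_next_7_days': 'able_to_test_residents',
    'able_to_test_or_obtain_resources_to_test_all_staff_andor_personnel_within_next_7_days': 'able_to_test_staff',
}
# ppedf.rename(columns=long_name_map, inplace=True)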
In this section we will:
Create date and month columns.
# Convert to datetime col
import datetime
ppedf['week_ending'] = pd.to_datetime(ppedf['week_ending'])
# Create Date and Month cols
ppedf['date'] = ppedf['week_ending'].dt.strftime("%b-%d")
ppedf['month'] = ppedf['week_ending'].dt.month_name()  # default locale returns English month names
# Check columns
ppedf.iloc[:, -3:].head()
# Check data type
ppedf['week_ending'].dtype
# Create a copy of dataframe
ppe_df = ppedf.copy()
ppe_df.shape
Convert some existing categorical features to binary (i.e. Y/N --> 0/1). Note that the PPE supply columns are mapped in the opposite direction (Y --> 0, N --> 1), so that a value of 1 indicates a shortage.
# Convert staff shortage cols
staff_cols = [col for col in ppe_df.columns if col.startswith('shortage')]
for col in staff_cols:
    ppe_df[col] = ppe_df[col].map({'Y': 1, 'N': 0})
# Convert ppe supply cols
# Inverted mapping: Y (has supply) -> 0, N (no supply) -> 1, so 1 flags a shortage
current_ppe_cols = [col for col in ppe_df.columns if col.startswith('any')]
for col in current_ppe_cols:
    ppe_df[col] = ppe_df[col].map({'Y': 0, 'N': 1})
oneweek_ppe_cols = [col for col in ppe_df.columns if col.startswith('one')]
for col in oneweek_ppe_cols:
    ppe_df[col] = ppe_df[col].map({'Y': 0, 'N': 1})
# Convert col for 3 or more cases
ppe_df['threeormore_cases_this_week'] = ppe_df['three_or_more_confirmed_covid19_cases_this_week'].map({'Y': 1, 'N': 0})
# Convert col for initial covid case for the first time since Jan2020
ppe_df['initial_covid_case'] = ppe_df['initial_confirmed_covid19_case_this_week'].map({'Y': 1, 'N': 0})
# Convert col for Residents access to testing
ppe_df['resident_access_to_testing_in_f'] = ppe_df['resident_access_to_testing_in_facility'].map({'Y': 1, 'N': 0})
# Convert testing ability columns for resident and staff
ppe_df['able_to_test_residents'] = ppe_df['able_to_test_or_obtain_resources_to_test_all_current_residents_within_next_7_days'].map({'Y':1, 'N':0})
ppe_df['able_to_test_staff'] = ppe_df['able_to_test_or_obtain_resources_to_test_all_staff_andor_personnel_within_next_7_days'].map({'Y':1, 'N':0})
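As an optional sanity check, the converted columns should now contain only 0 and 1 (plus NaN where a facility did not report):
# Optional check: value counts per converted column (expect only 0/1)
check_cols = staff_cols + current_ppe_cols + oneweek_ppe_cols
ppe_df[check_cols].apply(pd.Series.value_counts)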
Add new features to aggregate shortage of ppe and staff.
# Add staff shortage
ppe_df['total_staff_shortage'] = ppe_df[staff_cols].sum(axis=1)
# Add current ppe shortage
ppe_df['current_ppe_shortage'] = ppe_df[current_ppe_cols].sum(axis=1)
# Add one week ppe shortage
ppe_df['oneweek_ppe_shortage'] = ppe_df[oneweek_ppe_cols].sum(axis=1)
Rename current and one week supply columns to "shortage" columns.
# Replace supply with shortage in column names
ppe_df.columns = ppe_df.columns.str.replace('any_current_supply','current_shortage')
ppe_df.columns = ppe_df.columns.str.replace('one_week_supply','oneweek_shortage')
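An optional quick look at the resulting names confirms the rename took effect:
# Optional check: columns that now carry the "shortage" naming
[col for col in ppe_df.columns if 'shortage' in col]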
The following columns have an object data type but should be float, so we convert them to the correct data type.
cols_to_convert = ["residents_total_all_deaths","total_number_of_occupied_beds",
"total_resident_confirmed_covid19_cases_per_1000_residents",
"total_residents_covid19_deaths_as_a_percentage_of_confirmed_covid19_cases",
"total_resident_covid19_deaths_per_1000_residents",
"weekly_resident_confirmed_covid19_cases_per_1000_residents",
"weekly_resident_covid19_deaths_per_1000_residents",
"residents_total_admissions_covid19"]
ppe_df[cols_to_convert] = ppe_df[cols_to_convert].apply(pd.to_numeric, errors='coerce')
ppe_df[cols_to_convert].dtypes
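Since errors='coerce' silently turns unparseable values into NaN, it can be worth checking how many missing values the converted columns now contain (optional):
# Optional check: missing values per converted column (includes values coerced to NaN)
ppe_df[cols_to_convert].isna().sum()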
# Check shape
ppe_df.shape
Subset the data to keep only necessary columns.
cols_to_keep = ['SHAPE','county','provider_state',
'initial_covid_case','threeormore_cases_this_week','oneweek_ppe_shortage',
'total_staff_shortage','shortage_of_nursing_staff','oneweek_shortage_of_n95_masks',
'staff_total_confirmed_covid19', 'staff_total_covid19_deaths',
'residents_total_admissions_covid19',
'weekly_resident_confirmed_covid19_cases_per_1000_residents',
'weekly_resident_covid19_deaths_per_1000_residents',
'total_number_of_occupied_beds',
'week_ending']
ts_ppe_df = ppe_df[cols_to_keep].copy()
ts_ppe_df.shape
# Strip white spaces and lowercase
ts_ppe_df['county'] = ts_ppe_df['county'].str.lower().str.strip()
#Check Null Values
ts_ppe_df.isnull().sum()
# Replace null with 0
ts_ppe_df.replace(np.nan, 0, inplace=True)
# Check value counts
ts_ppe_df['week_ending'].value_counts().sort_index()
# Check information of dataframe
ts_ppe_df.info()
Per CMS, data for the earlier periods may have been reported incorrectly. We will inspect the data to see if it makes sense and remove records as needed.
# Group by week and state, then inspect the totals for California
subdf = ts_ppe_df.groupby(['week_ending', 'provider_state'])[
    ['weekly_resident_confirmed_covid19_cases_per_1000_residents',
     'weekly_resident_covid19_deaths_per_1000_residents']].sum()
subdf.query('provider_state == "CA"')
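Plotting the California series makes the anomaly in the earliest weeks easier to spot; a minimal sketch using pandas' matplotlib-based plotting (assumes matplotlib is available in the notebook environment):
# Plot weekly CA totals to visually inspect the early reporting anomaly
import matplotlib.pyplot as plt
ca = subdf.query('provider_state == "CA"').reset_index(level='provider_state', drop=True)
ca.plot(subplots=True, figsize=(10, 6))
plt.show()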
We can see that the data for May 25, 2020 does not make sense; it appears to have been reported incorrectly. We will remove all records for the first two available dates, i.e. May 25, 2020 and May 31, 2020, so that the dataset used for our analysis starts from June 07, 2020.
# Remove records for May 25, 2020 and May 31, 2020
ts_df = ts_ppe_df.query('(week_ending > "2020-05-31")').copy()
ts_df.shape
# Check data
ts_df['week_ending'].value_counts().sort_index()
We will publish the data to a feature layer for ease of access and use in the analysis.
ts_layer = ts_df.spatial.to_featurelayer('NH_TS_US_Cleaned_0621')
ts_layer
NOTE: This layer contains a subset of columns for all of the US for all time periods. We will use data from June 07, 2020 - March 14, 2021 to train and validate the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting and is not included in the training subset created next.
We will now subset the data to June 07, 2020 - March 14, 2021.
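Before creating the subset, an optional check that the held-out forecasting horizon indeed spans 12 weekly time steps:
# Optional check: number of distinct weekly time steps on each side of the split
train_weeks = ts_df.query('week_ending < "2021-03-21"')['week_ending'].nunique()
forecast_weeks = ts_df.query('week_ending >= "2021-03-21"')['week_ending'].nunique()
print(f'training/validation weeks: {train_weeks}, forecasting weeks: {forecast_weeks}')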
# Create subset
ts_df_sub = ts_df.query('(week_ending < "2021-03-21")').copy()
ts_df_sub.shape
# Check data
ts_df_sub['week_ending'].value_counts().sort_index()
We will publish the data to a feature layer for ease of access and use in the analysis.
ts_sub_layer = ts_df_sub.spatial.to_featurelayer('Subset_NH_TS_US_Cleaned_0621')
ts_sub_layer
NOTE: This layer contains a subset of columns for all of the US for June 07, 2020 - March 14, 2021, used to train and validate the model. These points will be aggregated when creating the "Space Time Cube by Aggregation"; we will supply US county boundaries to aggregate on. Data from March 21, 2021 - June 6, 2021 (12 time steps) is not included in this layer, as it will be used for forecasting.
Here, we will query a county boundaries layer to get the FIPS code, county, and state name mapping. The data is then cleaned and published for use as a polygon layer in the analysis notebooks.
# Get County Boundaries
boundary_item = gis.content.get('f7ea0c30856b4ef9a00d07978426f4d4')
boundary_layer = boundary_item.layers[0]
boundary_layer
# Look at the first 5 fields and their data types
for f in boundary_layer.properties.fields[:5]:
    print(f['name'], "{0:>35}".format(f['alias']))
# Query fips layer for relevant columns
# Change data type of FIPS field to int
fields = ['FIPS','COUNTY','STATE']
fips_df = boundary_layer.query(out_fields=fields, as_df=True)
fips_df['FIPS'] = fips_df['FIPS'].astype('int64')
fips_df.shape
# Check head
fips_df.head()
# Confirm data types
fips_df.dtypes
# Create layer for FIPS data
fips_lyr = fips_df.spatial.to_featurelayer('NH_TS_US_FIPS_layer', gis=gis)
fips_lyr
To summarize, this notebook cleans and prepares the data to be used for the COVID-19 Nursing Home Resident Deaths study.