Overview

The purpose of this notebook is to clean and engineer the data used for the COVID-19 Nursing Home Resident Deaths study. The Centers for Medicare & Medicaid Services (CMS) provides a Nursing Home COVID-19 Public File, which includes data reported by nursing homes to the CDC's National Healthcare Safety Network (NHSN) COVID-19 Long Term Care Facility Module. The data covers Resident Impact, Facility Capacity, Staff & Personnel, Personal Protective Equipment, Ventilator Capacity, and other Supplies information.

Dataset: https://data.cms.gov/covid-19/covid-19-nursing-home-data

We downloaded the data as a .csv file in which each record represents information about an individual nursing home in the United States for each week ending May 24, 2020 - June 6, 2021. The csv file with the raw data can be found here and the data dictionary can be found here.

General Imports

In [1]:
# Import Libraries
from IPython.display import display

# Import arcgis
import arcgis
from arcgis.gis import GIS

# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np

# Import time and os libraries
import time
import os
In [2]:
# Create a GIS connection
gis = GIS("home")

Get the Data

Read as csv

In [8]:
# Read and check shape
data = pd.read_csv('COVID-19_Nursing_Home_Dataset_062121.csv', low_memory=False)
data.shape
Out[8]:
(843467, 279)
In [9]:
# Identify null location values
pd.isnull(data['Geolocation']).sum()
Out[9]:
44811
In [11]:
# Drop records with no location data
data.dropna(subset=['Geolocation'], inplace=True)
data.shape
Out[11]:
(798656, 279)

Create Lat and Long cols

In [12]:
# Create lat/long cols
# The Geolocation column stores points as WKT, e.g. 'POINT (-92.40 34.51)':
# extract the first number (longitude) and the number before the closing paren (latitude)
data['LONGITUDE'] = data.Geolocation.str.extract(r'([-.0-9]+)', expand=True).astype(float)
data['LATITUDE'] = data.Geolocation.str.extract(r'([-.0-9]+)\)$', expand=True).astype(float)
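As a quick sanity check, the same patterns can be run against a single sample value (a minimal sketch; the point below is made up for illustration):

In [ ]:
# Hypothetical sample value in the Geolocation column's WKT format
sample = pd.Series(['POINT (-92.40 34.51)'])
sample.str.extract(r'([-.0-9]+)', expand=False)     # '-92.40' -> longitude
sample.str.extract(r'([-.0-9]+)\)$', expand=False)  # '34.51'  -> latitude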

Convert to Spatially Enabled DataFrame

In [13]:
# Create SeDF from lat/long
ppedf = pd.DataFrame.spatial.from_xy(data,'LONGITUDE','LATITUDE')
In [14]:
# Check geometry
ppedf['SHAPE']
Out[14]:
1         {"spatialReference": {"wkid": 4326}, "x": -92....
2         {"spatialReference": {"wkid": 4326}, "x": -94....
3         {"spatialReference": {"wkid": 4326}, "x": -118...
4         {"spatialReference": {"wkid": 4326}, "x": -121...
6         {"spatialReference": {"wkid": 4326}, "x": -80....
                                ...                        
843459    {"spatialReference": {"wkid": 4326}, "x": -95....
843460    {"spatialReference": {"wkid": 4326}, "x": -95....
843463    {"spatialReference": {"wkid": 4326}, "x": -95....
843464    {"spatialReference": {"wkid": 4326}, "x": -98....
843466    {"spatialReference": {"wkid": 4326}, "x": -95....
Name: SHAPE, Length: 798656, dtype: geometry
In [15]:
# Check shape of data
ppedf.shape
Out[15]:
(798656, 282)

Drop Geolocation cols

In [16]:
# Drop columns
drop_cols = ['Geolocation','LATITUDE','LONGITUDE']
ppedf.drop(drop_cols,axis=1,inplace=True)
In [17]:
# Sanitize Column Names
ppedf.spatial.sanitize_column_names(inplace=True)
Out[17]:
True

Rename Columns

Some column names are really long. We will rename those columns, as sketched below.

Note - Long column names can cause the following error when publishing as a feature layer. Make sure to rename all long column names so publishing doesn't error out.

RuntimeError: The operation was attempted on an empty geometry.
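The rename cell itself is not shown here; a minimal sketch of the pattern, with a hypothetical short name, could look like this (in this notebook, the longest source columns are instead mapped into new, shorter-named columns in the Data Wrangling section below):

In [ ]:
# Sketch: find column names that may be too long to publish cleanly
too_long = [col for col in ppedf.columns if len(col) > 60]
too_long

# A rename would then look like (hypothetical short name):
# ppedf.rename(columns={too_long[0]: 'able_to_test_residents_7d'}, inplace=True)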

Data Wrangling

In this section we will:

  1. Create date and month columns.
  2. Convert some existing categorical features to binary (i.e. Y/N --> 0/1).
  3. Add new features that aggregate shortages of PPE and staff.
  4. Rename the current and one-week supply columns to "shortage" columns.
  5. Change data types.

Create Date columns

Create date and month columns.

In [18]:
# Convert to datetime col
ppedf['week_ending'] = pd.to_datetime(ppedf['week_ending'])
In [19]:
# Create date and month cols
ppedf['date'] = ppedf['week_ending'].dt.strftime("%b-%d")
ppedf['month'] = ppedf['week_ending'].dt.month_name(locale='English')
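A portability note: the `locale='English'` string is resolved through the operating system's locale database and may not be available on every platform. Because pandas returns English month names by default, an equivalent and more portable call (an alternative sketch, not from the original notebook) is:

In [ ]:
# month_name() with no locale argument defaults to English month names
ppedf['month'] = ppedf['week_ending'].dt.month_name()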
In [20]:
# Check columns
ppedf.iloc[:, -3:].head()
Out[20]:
SHAPE date month
1 {"spatialReference": {"wkid": 4326}, "x": -92.... May-24 May
2 {"spatialReference": {"wkid": 4326}, "x": -94.... May-31 May
3 {"spatialReference": {"wkid": 4326}, "x": -118... May-24 May
4 {"spatialReference": {"wkid": 4326}, "x": -121... May-24 May
6 {"spatialReference": {"wkid": 4326}, "x": -80.... May-24 May
In [21]:
# Check data type 
ppedf['week_ending'].dtype
Out[21]:
dtype('<M8[ns]')
In [ ]:
# Create a copy of dataframe
ppe_df = ppedf.copy()
ppe_df.shape

Convert categorical to binary

Convert some existing categorical features to binary (i.e. Y/N --> 0/1).

In [23]:
# Convert staff shortage cols
staff_cols = [col for col in ppe_df.columns if col.startswith('shortage')]
for col in staff_cols:
    ppe_df[col] = ppe_df[col].map({'Y': 1, 'N': 0})    
In [24]:
# Convert ppe supply cols
# Note the inverted mapping: 'N' (no supply on hand) indicates a shortage,
# so it maps to 1, while 'Y' maps to 0
current_ppe_cols = [col for col in ppe_df.columns if col.startswith('any')]
for col in current_ppe_cols:
    ppe_df[col] = ppe_df[col].map({'Y': 0, 'N': 1})

oneweek_ppe_cols = [col for col in ppe_df.columns if col.startswith('one')]
for col in oneweek_ppe_cols:
    ppe_df[col] = ppe_df[col].map({'Y': 0, 'N': 1})
In [25]:
# Convert col for 3 or more cases
ppe_df['threeormore_cases_this_week'] = ppe_df['three_or_more_confirmed_covid19_cases_this_week'].map({'Y': 1, 'N': 0})
In [26]:
# Convert col for initial confirmed COVID-19 case (first since Jan 2020)
ppe_df['initial_covid_case'] = ppe_df['initial_confirmed_covid19_case_this_week'].map({'Y': 1, 'N': 0})
In [27]:
# Convert col for Residents access to testing
ppe_df['resident_access_to_testing_in_f'] = ppe_df['resident_access_to_testing_in_facility'].map({'Y': 1, 'N': 0})
In [28]:
# Convert testing ability columns for resident and staff
ppe_df['able_to_test_residents'] = ppe_df['able_to_test_or_obtain_resources_to_test_all_current_residents_within_next_7_days'].map({'Y':1, 'N':0})
ppe_df['able_to_test_staff'] = ppe_df['able_to_test_or_obtain_resources_to_test_all_staff_andor_personnel_within_next_7_days'].map({'Y':1, 'N':0})

Add new features

Add new features to aggregate shortage of ppe and staff.

In [29]:
# Add staff shortage
ppe_df['total_staff_shortage'] = ppe_df[staff_cols].sum(axis=1)

# Add current ppe shortage
ppe_df['current_ppe_shortage'] = ppe_df[current_ppe_cols].sum(axis=1)

# Add one week ppe shortage
ppe_df['oneweek_ppe_shortage'] = ppe_df[oneweek_ppe_cols].sum(axis=1)
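A quick way to sanity-check the new aggregate features is to look at their ranges; each one should fall between 0 and the number of columns summed (a minimal check, not from the original notebook):

In [ ]:
# Each aggregate ranges from 0 to the number of indicator columns summed
ppe_df[['total_staff_shortage', 'current_ppe_shortage', 'oneweek_ppe_shortage']].describe()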

Rename columns

Rename current and one week supply columns to "shortage" columns.

In [30]:
# Replace supply with shortage in column names
ppe_df.columns = ppe_df.columns.str.replace('any_current_supply','current_shortage')
ppe_df.columns = ppe_df.columns.str.replace('one_week_supply','oneweek_shortage')

Change data types

The following columns have an object data type but should be numeric. We will convert them to the correct data type.

In [31]:
cols_to_convert = ["residents_total_all_deaths","total_number_of_occupied_beds",
                   "total_resident_confirmed_covid19_cases_per_1000_residents",
                   "total_residents_covid19_deaths_as_a_percentage_of_confirmed_covid19_cases",
                   "total_resident_covid19_deaths_per_1000_residents",
                   "weekly_resident_confirmed_covid19_cases_per_1000_residents",
                   "weekly_resident_covid19_deaths_per_1000_residents",
                  "residents_total_admissions_covid19"]
ppe_df[cols_to_convert] = ppe_df[cols_to_convert].apply(pd.to_numeric, errors='coerce')
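`errors='coerce'` converts any value that cannot be parsed as a number to NaN instead of raising a ValueError. A minimal illustration with made-up sample values:

In [ ]:
# Unparseable strings become NaN rather than raising an error
pd.to_numeric(pd.Series(['12', '3.5', 'N/A']), errors='coerce')
# 0    12.0
# 1     3.5
# 2     NaN
# dtype: float64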
In [32]:
ppe_df[cols_to_convert].dtypes
Out[32]:
residents_total_all_deaths                                                   float64
total_number_of_occupied_beds                                                float64
total_resident_confirmed_covid19_cases_per_1000_residents                    float64
total_residents_covid19_deaths_as_a_percentage_of_confirmed_covid19_cases    float64
total_resident_covid19_deaths_per_1000_residents                             float64
weekly_resident_confirmed_covid19_cases_per_1000_residents                   float64
weekly_resident_covid19_deaths_per_1000_residents                            float64
residents_total_admissions_covid19                                           float64
dtype: object
In [33]:
# Check shape
ppe_df.shape
Out[33]:
(798656, 289)

Data for Time Series

Data for all Time Periods

Subset Columns

Subset the data to keep only necessary columns.

In [34]:
cols_to_keep = ['SHAPE','county','provider_state',
               'initial_covid_case','threeormore_cases_this_week','oneweek_ppe_shortage',
                'total_staff_shortage','shortage_of_nursing_staff','oneweek_shortage_of_n95_masks',
               'staff_total_confirmed_covid19', 'staff_total_covid19_deaths',
                'residents_total_admissions_covid19',
                'weekly_resident_confirmed_covid19_cases_per_1000_residents',
                'weekly_resident_covid19_deaths_per_1000_residents',
                'total_number_of_occupied_beds',
                'week_ending']
In [35]:
ts_ppe_df = ppe_df[cols_to_keep].copy()
ts_ppe_df.shape
Out[35]:
(798656, 16)
In [36]:
# Strip white spaces and lowercase
ts_ppe_df['county'] = ts_ppe_df['county'].str.lower().str.strip()
In [37]:
# Check null values
ts_ppe_df.isnull().sum()
Out[37]:
SHAPE                                                              0
county                                                             0
provider_state                                                     0
initial_covid_case                                             27791
threeormore_cases_this_week                                    27791
oneweek_ppe_shortage                                               0
total_staff_shortage                                               0
shortage_of_nursing_staff                                      14705
oneweek_shortage_of_n95_masks                                 198687
staff_total_confirmed_covid19                                   8067
staff_total_covid19_deaths                                      8067
residents_total_admissions_covid19                              8085
weekly_resident_confirmed_covid19_cases_per_1000_residents     14371
weekly_resident_covid19_deaths_per_1000_residents              14152
total_number_of_occupied_beds                                  11133
week_ending                                                        0
dtype: int64
In [38]:
# Replace null with 0
ts_ppe_df.replace(np.nan, 0, inplace=True)
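As a quick check that the fill worked (a small assertion, not from the original notebook):

In [ ]:
# Confirm no nulls remain anywhere in the frame
assert ts_ppe_df.isnull().sum().sum() == 0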
In [39]:
# Check value counts
ts_ppe_df['week_ending'].value_counts().sort_index()
Out[39]:
2020-05-24    14558
2020-05-31    14557
2020-06-07    14557
2020-06-14    14558
2020-06-21    14556
2020-06-28    14556
2020-07-05    14550
2020-07-12    14549
2020-07-19    14549
2020-07-26    14550
2020-08-02    14545
2020-08-09    14545
2020-08-16    14546
2020-08-23    14545
2020-08-30    14543
2020-09-06    14540
2020-09-13    14538
2020-09-20    14539
2020-09-27    14535
2020-10-04    14527
2020-10-11    14528
2020-10-18    14527
2020-10-25    14526
2020-11-01    14522
2020-11-08    14521
2020-11-15    14523
2020-11-22    14519
2020-11-29    14516
2020-12-06    14514
2020-12-13    14513
2020-12-20    14514
2020-12-27    14515
2021-01-03    14514
2021-01-10    14513
2021-01-17    14511
2021-01-24    14510
2021-01-31    14509
2021-02-07    14507
2021-02-14    14506
2021-02-21    14504
2021-02-28    14505
2021-03-07    14504
2021-03-14    14501
2021-03-21    14500
2021-03-28    14501
2021-04-04    14497
2021-04-11    14496
2021-04-18    14494
2021-04-25    14492
2021-05-02    14488
2021-05-09    14488
2021-05-16    14484
2021-05-23    14484
2021-05-30    14484
2021-06-06    14483
Name: week_ending, dtype: int64
In [40]:
# Check information of dataframe
ts_ppe_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 798656 entries, 1 to 843466
Data columns (total 16 columns):
 #   Column                                                      Non-Null Count   Dtype         
---  ------                                                      --------------   -----         
 0   SHAPE                                                       798656 non-null  geometry      
 1   county                                                      798656 non-null  object        
 2   provider_state                                              798656 non-null  object        
 3   initial_covid_case                                          798656 non-null  float64       
 4   threeormore_cases_this_week                                 798656 non-null  float64       
 5   oneweek_ppe_shortage                                        798656 non-null  float64       
 6   total_staff_shortage                                        798656 non-null  float64       
 7   shortage_of_nursing_staff                                   798656 non-null  float64       
 8   oneweek_shortage_of_n95_masks                               798656 non-null  float64       
 9   staff_total_confirmed_covid19                               798656 non-null  float64       
 10  staff_total_covid19_deaths                                  798656 non-null  float64       
 11  residents_total_admissions_covid19                          798656 non-null  float64       
 12  weekly_resident_confirmed_covid19_cases_per_1000_residents  798656 non-null  float64       
 13  weekly_resident_covid19_deaths_per_1000_residents           798656 non-null  float64       
 14  total_number_of_occupied_beds                               798656 non-null  float64       
 15  week_ending                                                 798656 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(12), geometry(1), object(2)
memory usage: 103.6+ MB

Remove Faulty Data

Per CMS, data for the earlier periods may have been reported incorrectly. We will inspect the data to see if it makes sense and remove records as needed.

In [50]:
# Group and subset for CA
subdf = ts_ppe_df.groupby(
    ['week_ending','provider_state']).aggregate('sum')[['weekly_resident_confirmed_covid19_cases_per_1000_residents',
                                                        'weekly_resident_covid19_deaths_per_1000_residents']]
subdf.query('provider_state == "CA"')
Out[50]:
weekly_resident_confirmed_covid19_cases_per_1000_residents weekly_resident_covid19_deaths_per_1000_residents
week_ending provider_state
2020-05-24 CA 64822.33 18843.20
2020-05-31 CA 10526.18 3740.68
2020-06-07 CA 13044.15 2737.44
2020-06-14 CA 10552.47 2544.03
2020-06-21 CA 12615.01 2612.71
2020-06-28 CA 13936.62 2083.85
2020-07-05 CA 15366.19 1902.56
2020-07-12 CA 14716.18 2677.40
2020-07-19 CA 15928.93 2906.59
2020-07-26 CA 21096.04 2940.45
2020-08-02 CA 21675.30 2670.54
2020-08-09 CA 17065.89 2694.64
2020-08-16 CA 20012.63 2950.37
2020-08-23 CA 15236.37 3707.93
2020-08-30 CA 10288.45 2581.94
2020-09-06 CA 9144.72 1997.02
2020-09-13 CA 6368.02 1938.68
2020-09-20 CA 5987.87 1323.03
2020-09-27 CA 6757.67 975.27
2020-10-04 CA 4361.12 1088.37
2020-10-11 CA 3772.21 886.30
2020-10-18 CA 5462.04 901.14
2020-10-25 CA 4142.42 914.06
2020-11-01 CA 3354.34 696.85
2020-11-08 CA 8657.22 698.16
2020-11-15 CA 12883.34 885.49
2020-11-22 CA 17824.22 1409.90
2020-11-29 CA 22483.93 1844.39
2020-12-06 CA 34345.27 2771.98
2020-12-13 CA 48001.86 4498.46
2020-12-20 CA 66761.84 6675.14
2020-12-27 CA 56359.45 7736.73
2021-01-03 CA 58834.34 8728.24
2021-01-10 CA 45147.31 8921.63
2021-01-17 CA 36879.14 7823.49
2021-01-24 CA 22382.82 5235.65
2021-01-31 CA 12626.54 3136.84
2021-02-07 CA 6303.13 2249.72
2021-02-14 CA 3781.25 1667.42
2021-02-21 CA 1756.66 629.46
2021-02-28 CA 1507.45 500.78
2021-03-07 CA 1287.57 459.36
2021-03-14 CA 981.89 93.59
2021-03-21 CA 988.75 395.16
2021-03-28 CA 812.66 123.88
2021-04-04 CA 744.33 68.34
2021-04-11 CA 745.01 48.44
2021-04-18 CA 956.08 64.39
2021-04-25 CA 475.46 50.87
2021-05-02 CA 597.15 299.33
2021-05-09 CA 363.46 44.17
2021-05-16 CA 441.84 55.00
2021-05-23 CA 319.83 21.52
2021-05-30 CA 369.50 62.27
2021-06-06 CA 325.39 12.35

We can see that the data for the week ending May 24, 2020 does not make sense: the values are drastically out of line with the weeks immediately following and appear to be incorrectly reported. We will remove all records for the first two available dates, i.e. May 24, 2020 and May 31, 2020. This way, the dataset used for our analysis will start from June 07, 2020.

In [51]:
# Remove records for May 24, 2020 and May 31, 2020
ts_df = ts_ppe_df.query('(week_ending > "2020-05-31")').copy()
ts_df.shape
Out[51]:
(769541, 16)
In [52]:
# Check data
ts_df['week_ending'].value_counts().sort_index()
Out[52]:
2020-06-07    14557
2020-06-14    14558
2020-06-21    14556
2020-06-28    14556
2020-07-05    14550
2020-07-12    14549
2020-07-19    14549
2020-07-26    14550
2020-08-02    14545
2020-08-09    14545
2020-08-16    14546
2020-08-23    14545
2020-08-30    14543
2020-09-06    14540
2020-09-13    14538
2020-09-20    14539
2020-09-27    14535
2020-10-04    14527
2020-10-11    14528
2020-10-18    14527
2020-10-25    14526
2020-11-01    14522
2020-11-08    14521
2020-11-15    14523
2020-11-22    14519
2020-11-29    14516
2020-12-06    14514
2020-12-13    14513
2020-12-20    14514
2020-12-27    14515
2021-01-03    14514
2021-01-10    14513
2021-01-17    14511
2021-01-24    14510
2021-01-31    14509
2021-02-07    14507
2021-02-14    14506
2021-02-21    14504
2021-02-28    14505
2021-03-07    14504
2021-03-14    14501
2021-03-21    14500
2021-03-28    14501
2021-04-04    14497
2021-04-11    14496
2021-04-18    14494
2021-04-25    14492
2021-05-02    14488
2021-05-09    14488
2021-05-16    14484
2021-05-23    14484
2021-05-30    14484
2021-06-06    14483
Name: week_ending, dtype: int64

Publish as Layer

We will publish the data to a feature layer for ease of access and use in the analysis.

In [53]:
ts_layer = ts_df.spatial.to_featurelayer('NH_TS_US_Cleaned_0621')
In [54]:
ts_layer
Out[54]:
NH_TS_US_Cleaned_0621
Feature Layer Collection by portaladmin
Last Modified: July 30, 2021
0 comments, 0 views

NOTE: This layer contains a subset of columns for all of the US for all time periods.

Data for Specific Time Period

We will use data from June 07, 2020 - March 14, 2021 to train and validate the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting and is not included in this subset.
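Since the data is weekly with weeks ending on Sundays, the 12 forecast time steps can be verified directly (a quick check, not from the original notebook):

In [ ]:
# Weeks ending between 2021-03-21 and 2021-06-06, inclusive
len(pd.date_range('2021-03-21', '2021-06-06', freq='W-SUN'))  # expected: 12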

Subset data

We will subset the data from June 07, 2020 - March 14, 2021.

In [55]:
# Create subset
ts_df_sub = ts_df.query('(week_ending < "2021-03-21")').copy()
ts_df_sub.shape
Out[55]:
(595650, 16)
In [56]:
# Check data
ts_df_sub['week_ending'].value_counts().sort_index()
Out[56]:
2020-06-07    14557
2020-06-14    14558
2020-06-21    14556
2020-06-28    14556
2020-07-05    14550
2020-07-12    14549
2020-07-19    14549
2020-07-26    14550
2020-08-02    14545
2020-08-09    14545
2020-08-16    14546
2020-08-23    14545
2020-08-30    14543
2020-09-06    14540
2020-09-13    14538
2020-09-20    14539
2020-09-27    14535
2020-10-04    14527
2020-10-11    14528
2020-10-18    14527
2020-10-25    14526
2020-11-01    14522
2020-11-08    14521
2020-11-15    14523
2020-11-22    14519
2020-11-29    14516
2020-12-06    14514
2020-12-13    14513
2020-12-20    14514
2020-12-27    14515
2021-01-03    14514
2021-01-10    14513
2021-01-17    14511
2021-01-24    14510
2021-01-31    14509
2021-02-07    14507
2021-02-14    14506
2021-02-21    14504
2021-02-28    14505
2021-03-07    14504
2021-03-14    14501
Name: week_ending, dtype: int64

Publish as Layer

We will publish the data to a feature layer for ease of access and use in the analysis.

In [57]:
ts_sub_layer = ts_df_sub.spatial.to_featurelayer('Subset_NH_TS_US_Cleaned_0621')
In [58]:
ts_sub_layer
Out[58]:
Subset_NH_TS_US_Cleaned_0621
Feature Layer Collection by portaladmin
Last Modified: July 30, 2021
0 comments, 0 views

NOTE: This layer contains a subset of columns for all of the US for June 07, 2020 - March 14, 2021, to train and validate the model. These points will be aggregated when creating a "Space Time Cube by Aggregation"; we will supply US county boundaries to aggregate on those boundaries. Data from March 21, 2021 - June 6, 2021 (12 time steps) is not included in this layer, as it will be used for forecasting.

Data for County boundaries

Here, we will query county boundaries layer to get data for fips code, county and state name mapping. The data is then cleaned and published for use as polygon layer in the analysis notebooks.

In [3]:
# Connect to ArcGIS Online to access the Esri-hosted boundaries layer
# (assumed connection; the original notebook does not show where agol_gis is created)
agol_gis = GIS()

# Get county boundaries
boundary_item = agol_gis.content.get('f7ea0c30856b4ef9a00d07978426f4d4')
boundary_layer = boundary_item.layers[0]
boundary_layer
Out[3]:
<FeatureLayer url:"https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/USA_Counties_Generalized_analysis_trim/FeatureServer/0">
In [4]:
# Look at the first 5 fields and their data types
for f in boundary_layer.properties.fields[:5]:
    print(f['name'], "{0:>35}".format(f['alias']))
OBJECTID                            OBJECTID
FIPS                                FIPS
COUNTY                              COUNTY
STATE                               STATE
POPULATION                             POP2014
In [5]:
# Query the boundaries layer for the relevant columns
# and change the data type of the FIPS field to int
fields = ['FIPS', 'COUNTY', 'STATE']
fips_df = boundary_layer.query(out_fields=fields, as_df=True, gis=agol_gis)
fips_df['FIPS'] = fips_df['FIPS'].astype('int64')
fips_df.shape
Out[5]:
(3142, 5)
In [6]:
# Check head
fips_df.head()
Out[6]:
COUNTY FIPS OBJECTID SHAPE STATE
0 Ferry 53019 1 {"rings": [[[-13230502.7979313, 6098468.131349... WA
1 Stevens 53065 2 {"rings": [[[-13073230.935907, 6114197.22455],... WA
2 Okanogan 53047 3 {"rings": [[[-13243912.9130111, 6096737.855632... WA
3 Pend Oreille 53051 4 {"rings": [[[-13073202.9669304, 6274847.562042... WA
4 Boundary 16021 5 {"rings": [[[-13027621.1242142, 6247430.690397... ID
In [7]:
# Confirm data types
fips_df.dtypes
Out[7]:
COUNTY        object
FIPS           int64
OBJECTID       int64
SHAPE       geometry
STATE         object
dtype: object
In [8]:
# Create layer for FIPS data
fips_lyr = fips_df.spatial.to_featurelayer('NH_TS_US_FIPS_layer', gis=gis)
fips_lyr
Out[8]:
NH_TS_US_FIPS_layer
Feature Layer Collection by portaladmin
Last Modified: August 05, 2021
0 comments, 0 views

Summary

To summarize, this notebook cleans and prepares the data used for the COVID-19 Nursing Home Resident Deaths study.