The purpose of this notebook is to clean and engineer the data to be used for the COVID-19 Nursing Home Resident Deaths study. The Centers for Medicare and Medicaid Services (CMS) provides a Nursing Home COVID-19 Public File, which includes data reported by nursing homes to the CDC's National Healthcare Safety Network (NHSN) system's COVID-19 Long Term Care Facility Module. The data includes Resident Impact, Facility Capacity, Staff & Personnel, Personal Protective Equipment, Ventilator Capacity, and other Supplies information.
Dataset: https://data.cms.gov/covid-19/covid-19-nursing-home-data
We downloaded the data as a .csv file where each record represents information about an individual nursing home in the United States for each week from May 25, 2020 - June 6, 2021. The .csv file with the raw data can be found here, and the data dictionary can be found here.
General Imports
# Import Libraries
from IPython.display import display
# Import arcgis
import arcgis
from arcgis.gis import GIS
# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np
# Import library for time and os
import time
import os
# Create a GIS connection
gis = GIS("home")
# Read and check shape
data = pd.read_csv('COVID-19_Nursing_Home_Dataset_062121.csv', low_memory=False)
data.shape
# Identify null location values
pd.isnull(data['Geolocation']).sum()
# Drop records with no location data
data.dropna(subset=['Geolocation'], inplace=True)
data.shape
# Create lat long
data['LONGITUDE'] = data.Geolocation.str.extract(r'([-.0-9]+)', expand=True).astype(float)
data['LATITUDE'] = data.Geolocation.str.extract(r'([-.0-9]+)\)$', expand=True).astype(float)
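If the Geolocation field is a WKT-style point string such as "POINT (-96.79 32.77)" (an assumption based on the regexes above, with longitude listed first), both coordinates can also be extracted in a single pass; this is just an optional alternative sketch:
# Optional alternative: extract longitude and latitude in one pass from an assumed WKT point string
coords = data['Geolocation'].str.extract(r'POINT \(([-\d.]+) ([-\d.]+)\)')
data['LONGITUDE'] = coords[0].astype(float)
data['LATITUDE'] = coords[1].astype(float)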
# Create SeDF from lat/long
ppedf = pd.DataFrame.spatial.from_xy(data,'LONGITUDE','LATITUDE')
# Check geometry
ppedf['SHAPE']
# Check shape of data
ppedf.shape
# Drop columns
drop_cols = ['Geolocation','LATITUDE','LONGITUDE']
ppedf.drop(drop_cols,axis=1,inplace=True)
# Sanitize Column Names
ppedf.spatial.sanitize_column_names(inplace=True)
Some of the column names are very long, so we will rename them; a hedged sketch of what such a rename could look like follows below.
Note: long column names can lead to the following error when publishing as a feature layer, so make sure to rename all columns with long names so that publishing doesn't fail:
RuntimeError: The operation was attempted on an empty geometry.
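As an illustrative sketch only (this notebook instead creates short copies of the two long testing-ability columns further down, so the rename itself is left commented out), a rename could be done with a mapping like this:
# Illustrative only: map very long column names to shorter ones before publishing
long_name_map = {
    'able_to_test_or_obtain_resources_to_test_all_current_residents_within_next_7_days': 'able_to_test_residents',
    'able_to_test_or_obtain_resources_to_test_all_staff_andor_personnel_within_next_7_days': 'able_to_test_staff',
}
# ppedf.rename(columns=long_name_map, inplace=True)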
In this section we will:
Create date and month columns.
# Convert to datetime col
import datetime
ppedf['week_ending'] = pd.to_datetime(ppedf['week_ending'])
# Create Date and Month cols
ppedf['date'] = ppedf['week_ending'].dt.strftime("%b-%d")
ppedf['month'] = ppedf['week_ending'].dt.month_name()  # default locale returns English month names
# Check columns
ppedf.iloc[:, -3:].head()
# Check data type
ppedf['week_ending'].dtype
# Create a copy of dataframe
ppe_df = ppedf.copy()
ppe_df.shape
Convert some existing categorical features to binary (i.e. Y/N --> 0/1). Note that the PPE supply columns are mapped in the opposite direction (Y --> 0, N --> 1), so that a value of 1 indicates a shortage.
# Convert staff shortage cols
staff_cols = [col for col in ppe_df.columns if col.startswith('shortage')]
for col in staff_cols:
    ppe_df[col] = ppe_df[col].map({'Y': 1, 'N': 0})
# Convert ppe supply cols
# Inverted mapping: Y (has supply) -> 0, N (no supply) -> 1, so 1 flags a shortage
current_ppe_cols = [col for col in ppe_df.columns if col.startswith('any')]
for col in current_ppe_cols:
    ppe_df[col] = ppe_df[col].map({'Y': 0, 'N': 1})
oneweek_ppe_cols = [col for col in ppe_df.columns if col.startswith('one')]
for col in oneweek_ppe_cols:
    ppe_df[col] = ppe_df[col].map({'Y': 0, 'N': 1})
# Convert col for 3 or more cases
ppe_df['threeormore_cases_this_week'] = ppe_df['three_or_more_confirmed_covid19_cases_this_week'].map({'Y': 1, 'N': 0})
# Convert col for initial covid case for the first time since Jan2020
ppe_df['initial_covid_case'] = ppe_df['initial_confirmed_covid19_case_this_week'].map({'Y': 1, 'N': 0})
# Convert col for Residents access to testing
ppe_df['resident_access_to_testing_in_f'] = ppe_df['resident_access_to_testing_in_facility'].map({'Y': 1, 'N': 0})
# Convert testing ability columns for resident and staff
ppe_df['able_to_test_residents'] = ppe_df['able_to_test_or_obtain_resources_to_test_all_current_residents_within_next_7_days'].map({'Y':1, 'N':0})
ppe_df['able_to_test_staff'] = ppe_df['able_to_test_or_obtain_resources_to_test_all_staff_andor_personnel_within_next_7_days'].map({'Y':1, 'N':0})
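As an optional sanity check, the converted columns should now contain only 0 and 1 (plus NaN where a facility did not report):
# Optional check: value counts per converted column (expect only 0/1)
check_cols = staff_cols + current_ppe_cols + oneweek_ppe_cols
ppe_df[check_cols].apply(pd.Series.value_counts)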
Add new features to aggregate shortage of ppe and staff.
# Add staff shortage
ppe_df['total_staff_shortage'] = ppe_df[staff_cols].sum(axis=1)
# Add current ppe shortage
ppe_df['current_ppe_shortage'] = ppe_df[current_ppe_cols].sum(axis=1)
# Add one week ppe shortage
ppe_df['oneweek_ppe_shortage'] = ppe_df[oneweek_ppe_cols].sum(axis=1)
Rename current and one week supply columns to "shortage" columns.
# Replace supply with shortage in column names
ppe_df.columns = ppe_df.columns.str.replace('any_current_supply','current_shortage')
ppe_df.columns = ppe_df.columns.str.replace('one_week_supply','oneweek_shortage')
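An optional quick look at the resulting names confirms the rename took effect:
# Optional check: columns that now carry the "shortage" naming
[col for col in ppe_df.columns if 'shortage' in col]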
The following columns have an object data type but should be float, so we convert them to the correct data type.
cols_to_convert = ["residents_total_all_deaths","total_number_of_occupied_beds",
"total_resident_confirmed_covid19_cases_per_1000_residents",
"total_residents_covid19_deaths_as_a_percentage_of_confirmed_covid19_cases",
"total_resident_covid19_deaths_per_1000_residents",
"weekly_resident_confirmed_covid19_cases_per_1000_residents",
"weekly_resident_covid19_deaths_per_1000_residents",
"residents_total_admissions_covid19"]
ppe_df[cols_to_convert] = ppe_df[cols_to_convert].apply(pd.to_numeric, errors='coerce')
ppe_df[cols_to_convert].dtypes
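Since errors='coerce' silently turns unparseable values into NaN, it can be worth checking how many missing values the converted columns now contain (optional):
# Optional check: missing values per converted column (includes values coerced to NaN)
ppe_df[cols_to_convert].isna().sum()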
# Check shape
ppe_df.shape
Subset the data to keep only necessary columns.
cols_to_keep = ['SHAPE','county','provider_state',
'initial_covid_case','threeormore_cases_this_week','oneweek_ppe_shortage',
'total_staff_shortage','shortage_of_nursing_staff','oneweek_shortage_of_n95_masks',
'staff_total_confirmed_covid19', 'staff_total_covid19_deaths',
'residents_total_admissions_covid19',
'weekly_resident_confirmed_covid19_cases_per_1000_residents',
'weekly_resident_covid19_deaths_per_1000_residents',
'total_number_of_occupied_beds',
'week_ending']
ts_ppe_df = ppe_df[cols_to_keep].copy()
ts_ppe_df.shape
# Strip white spaces and lowercase
ts_ppe_df['county'] = ts_ppe_df['county'].str.lower().str.strip()
#Check Null Values
ts_ppe_df.isnull().sum()
# Replace null with 0
ts_ppe_df.replace(np.nan, 0, inplace=True)
# Check value counts
ts_ppe_df['week_ending'].value_counts().sort_index()
# Check information of dataframe
ts_ppe_df.info()
Per CMS, data for the earlier periods may have been reported incorrectly. We will inspect the data to see if it makes sense and remove records as needed.
# Group by week and state, then inspect the totals for California
subdf = ts_ppe_df.groupby(['week_ending', 'provider_state'])[
    ['weekly_resident_confirmed_covid19_cases_per_1000_residents',
     'weekly_resident_covid19_deaths_per_1000_residents']].sum()
subdf.query('provider_state == "CA"')
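Plotting the California series makes the anomaly in the earliest weeks easier to spot; a minimal sketch using pandas' matplotlib-based plotting (assumes matplotlib is available in the notebook environment):
# Plot weekly CA totals to visually inspect the early reporting anomaly
import matplotlib.pyplot as plt
ca = subdf.query('provider_state == "CA"').reset_index(level='provider_state', drop=True)
ca.plot(subplots=True, figsize=(10, 6))
plt.show()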
We can see that the data for May 25, 2020 does not make sense; it appears to have been reported incorrectly. We will remove all records for the first two available dates, i.e. May 25, 2020 and May 31, 2020, so that the dataset used for our analysis starts from June 07, 2020.
# Remove records for May 25, 2020 and May 31, 2020
ts_df = ts_ppe_df.query('(week_ending > "2020-05-31")').copy()
ts_df.shape
# Check data
ts_df['week_ending'].value_counts().sort_index()
We will publish the data to a feature layer for ease of access and use in the analysis.
ts_layer = ts_df.spatial.to_featurelayer('NH_TS_US_Cleaned_0621')
ts_layer
NOTE: This layer contains a subset of columns for all of the US for all time periods. We will use data from June 07, 2020 - March 14, 2021 to train and validate the model. Data from March 21, 2021 - June 6, 2021 (12 time steps) will be used for forecasting and is not included in the training subset created next.
We will now subset the data to June 07, 2020 - March 14, 2021.
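Before creating the subset, an optional check that the held-out forecasting horizon indeed spans 12 weekly time steps:
# Optional check: number of distinct weekly time steps on each side of the split
train_weeks = ts_df.query('week_ending < "2021-03-21"')['week_ending'].nunique()
forecast_weeks = ts_df.query('week_ending >= "2021-03-21"')['week_ending'].nunique()
print(f'training/validation weeks: {train_weeks}, forecasting weeks: {forecast_weeks}')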
# Create subset
ts_df_sub = ts_df.query('(week_ending < "2021-03-21")').copy()
ts_df_sub.shape
# Check data
ts_df_sub['week_ending'].value_counts().sort_index()
We will publish the data to a feature layer for ease of access and use in the analysis.
ts_sub_layer = ts_df_sub.spatial.to_featurelayer('Subset_NH_TS_US_Cleaned_0621')
ts_sub_layer
NOTE: This layer contains a subset of columns for all of the US for June 07, 2020 - March 14, 2021, used to train and validate the model. These points will be aggregated when creating the "Space Time Cube by Aggregation"; we will supply US county boundaries to aggregate on. Data from March 21, 2021 - June 6, 2021 (12 time steps) is not included in this layer, as it will be used for forecasting.
Here, we will query a county boundaries layer to get the FIPS code, county, and state name mapping. The data is then cleaned and published for use as a polygon layer in the analysis notebooks.
# Get County Boundaries
boundary_item = gis.content.get('f7ea0c30856b4ef9a00d07978426f4d4')
boundary_layer = boundary_item.layers[0]
boundary_layer
# Look at the first 5 fields and their data types
for f in boundary_layer.properties.fields[:5]:
    print(f['name'], "{0:>35}".format(f['alias']))
# Query fips layer for relevant columns
# Change data type of FIPS field to int
fields = ['FIPS','COUNTY','STATE']
fips_df = boundary_layer.query(out_fields=fields, as_df=True)
fips_df['FIPS'] = fips_df['FIPS'].astype('int64')
fips_df.shape
# Check head
fips_df.head()
# Confirm data types
fips_df.dtypes
# Create layer for FIPS data
fips_lyr = fips_df.spatial.to_featurelayer('NH_TS_US_FIPS_layer', gis=gis)
fips_lyr
To summarize, this notebook cleans and prepares the data to be used for the COVID-19 Nursing Home Resident Deaths study.