- 👟 Ready To Run!
- 🔬 Data Science
- 🚀 Data Exploration and Cleaning
- ⚕️ Healthcare
America has a severe shortage of healthcare providers. According to a recent article [1] published in The Hill, this physician workforce shortage is only expected to worsen. A 2019 study [2] conducted by the Association of American Medical Colleges (AAMC) projects that, by 2032, the United States will face a shortage of nearly 122,000 physicians: a shortfall of roughly 25,000 to 66,000 specialists and 21,000 to 55,000 primary care physicians.
Image source: https://blog.medicaresolutions.com/the-u-s-doctor-shortage/
In this study, we will use National Plan and Provider Enumeration System (NPPES) healthcare provider data [3] to examine the availability of healthcare providers across the United States.
In this notebook, we will explore and clean the raw provider data at scale using Dask, analyze provider counts by state, speciality, and entity type, export the cleaned data as csv files for each state, and discuss how the data is geocoded for the spatial analysis in the rest of the study.
The latest Healthcare Provider data file can be downloaded from NPPES. For our analysis, we will use the file from May 2019. The dataset contains ~5.8 million records with 329 attributes and takes up 6.7 GB on disk. It includes healthcare provider information such as the provider's NPI, entity type, speciality (taxonomy) codes, gender, organization name, and practice location address.
# Import Libraries
# Import Dask for reading and processing big data
import dask.dataframe as dd
# Import Pandas for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
# Import plotting libraries
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
# Import libraries for time and output
import time
import os
import shutil
import zipfile
# Import arcgis
import arcgis
from arcgis.gis import GIS
This notebook uses the npidata.zip csv file located in /arcgis/samplesdata/. Run the cell below to copy the csv file to your working directory /arcgis/home/ so you can modify it locally.
arcgis_dir = './arcgis'
home_dir = os.path.join(arcgis_dir, 'home')
samples_dir = os.path.join(arcgis_dir, 'samplesdata')
print(f"root dir: {arcgis_dir}")
print(f"home dir: {home_dir}")
print(f"samples dir: {samples_dir}")
def copy_sample_csv_to_home(zip_csv_name):
"""Given the full filename (with extensions) of a csv zip file in
/arcgis/samplesdata/, will copy and unzip the file to /arcgis/home/
Will return the full path to the unzipped csv in home"""
# Get the full paths of all the source and destination files to copy
csv_dir_name = zip_csv_name.split(".zip")[0]
csv_zip_path_src = os.path.join(samples_dir, zip_csv_name)
csv_dir_path_src = os.path.join(samples_dir, csv_dir_name)
csv_zip_path_dst = os.path.join(home_dir, zip_csv_name)
csv_dir_path_dst = os.path.join(home_dir, csv_dir_name)
    # If the csv has been copied/unzipped to the home dir before, delete it
if os.path.exists(csv_zip_path_dst):
os.remove(csv_zip_path_dst)
if os.path.exists(csv_dir_path_dst):
shutil.rmtree(csv_dir_path_dst)
# Copy the zip file to home, unzip it
shutil.copy(csv_zip_path_src, csv_zip_path_dst)
zip_ref = zipfile.ZipFile(csv_zip_path_dst, 'r')
zip_ref.extractall(home_dir)
zip_ref.close()
    # Return the full path to the unzipped csv data in /arcgis/home/
return csv_dir_path_dst
# call the function to copy data needed for this analysis
csv_path = copy_sample_csv_to_home('npidata.zip')
print(f"CSV successfully copied to {csv_path}")
# Get the path to the unzipped csv file
data_file = os.path.join(home_dir, 'npidata.csv')
print(data_file)
The sheer size of this data does not allow for exploration using basic tools like Excel. Traditional packages such as Pandas also fall short on their own, as they expect data to fit fully in memory and are not designed to scale beyond a single machine.
Thus, in this part of the study, we will use Dask, a distributed data analysis library. Dask enables efficient parallel computation on a single machine by leveraging its multi-core CPU and streaming data efficiently from disk. Dask provides a DataFrame object that mimics the Pandas DataFrame and can be used to slice, dice, and explore the data. Operations on a Dask DataFrame are queued and executed only when necessary. When executed, Dask reads the data in chunks, distributes the work (across the cores of a single machine or across multiple machines in a cluster), and computes the result for you. This allows us to work with larger-than-memory datasets, since Dask performs the operations on chunks in a distributed manner.
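As a quick illustration of this lazy execution model, here is a minimal sketch (using a tiny synthetic dataframe, not the NPI data) that builds a task graph and only touches the data when compute is called:
import pandas as pd
import dask.dataframe as dd
# A tiny synthetic dataframe, split into 2 Dask partitions
pdf = pd.DataFrame({'state': ['CA', 'NY', 'CA', 'TX'], 'npi': ['1', '2', '3', '4']})
ddf = dd.from_pandas(pdf, npartitions=2)
# Defining the operation is instant; nothing is read or computed yet
lazy_counts = ddf['state'].value_counts()
# .compute() walks the task graph, processes each partition, and returns a Pandas object
print(lazy_counts.compute())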
We will read the data as a Dask dataframe for our analysis.
# Read the data as Dask dataframe
%time npi_data = dd.read_csv(data_file, blocksize=1000000, dtype=str)
# Confirm type of dataset
type(npi_data)
npi_data.head()
A Dask dataframe doesn't know how many records are in the data without reading through all of it. Reading through all of the data would trigger computation and take some time, which is not necessary at this point. We will just check the number of columns our data has.
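For reference, the exact record count could be obtained with a call like the one sketched below, which operates on the npi_data dataframe created above; because it forces Dask to read every chunk of the ~6.7 GB file, it takes a while, which is exactly why we skip it at this stage.
# Count the rows in each partition, then sum the counts and compute the result
# This triggers a full pass over the csv, so expect it to take some time
n_rows = npi_data.map_partitions(len).sum().compute()
print(f"Total records: {n_rows}")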
# Check no. of columns
len(npi_data.columns)
We can see that the dataset is quite wide, with 329 attributes covering provider identifiers, entity types, speciality (taxonomy) codes, gender, organization names, and address information, among others.
We will clean the data by performing the following tasks:
- Remove unwanted attributes
- Rename columns so they are easy to read and understand
- Filter the data for the US
- Create a new `full_address` column by combining the various address fields
- Update the categories for `entity_type`
For our analysis, we will keep only 12 attributes that are of value to us, removing all others.
# Identify columns to keep
cols_to_keep = ['NPI','Entity Type Code', 'Healthcare Provider Taxonomy Code_1','Healthcare Provider Taxonomy Group_1',\
'Provider Gender Code','Provider Organization Name (Legal Business Name)',\
'Provider First Line Business Practice Location Address',\
'Provider Second Line Business Practice Location Address','Provider Business Practice Location Address City Name',\
'Provider Business Practice Location Address State Name','Provider Business Practice Location Address Country Code (If outside U.S.)',\
'Provider Business Practice Location Address Postal Code']
# Get columns to drop
all_columns = set(npi_data.columns)
cols_to_keep_set = set(cols_to_keep)
cols_to_drop = all_columns - cols_to_keep_set
# Check length
len(all_columns), len(cols_to_keep_set), len(cols_to_drop)
# Drop columns
npi_updated = npi_data.drop(list(cols_to_drop), axis=1)
npi_updated.head()
Here we will rename columns to ensure they are easy to read and understand.
# Rename columns
npi_updated = npi_updated.rename(columns={"Entity Type Code":"entity_type",'Provider First Line Business Practice Location Address':'addressline1',
'Healthcare Provider Taxonomy Code_1':'taxonomy_code_1','Healthcare Provider Taxonomy Group_1':'taxonomy_group_1',
'Provider Gender Code':'provider_gender','Provider Organization Name (Legal Business Name)':'organization_name',
'Provider Second Line Business Practice Location Address':'addressline2',
'Provider Business Practice Location Address City Name':'city',
'Provider Business Practice Location Address State Name':'state',
'Provider Business Practice Location Address Country Code (If outside U.S.)':'country',
'Provider Business Practice Location Address Postal Code':'postal_code','NPI':'npi'})
npi_updated.head()
Here, we will filter the data for the US and add a new `full_address` column.
# Select data for US only
npi_updated = npi_updated[npi_updated['country']=='US']
# Create new address column
npi_updated['full_address'] = npi_updated['addressline1'].astype(str)+', '+npi_updated['addressline2'].fillna('').astype(str)+', '\
+npi_updated['city'].astype(str)+', '+npi_updated['state'].astype(str)+' '+npi_updated['postal_code'].astype(str)
npi_updated.head()
Entity types are categorized as Individual or Organization. Here, we will recode the values from the numeric codes 1 and 2 to the corresponding category names.
# Change category for Entity Type
npi_updated['entity_type'] = npi_updated['entity_type'].mask(npi_updated['entity_type'] == '1', 'Individual').mask(npi_updated['entity_type'] == '2', 'Organization')
npi_updated.head()
The compute operation triggers all of the queued-up computation: Dask reads through the entire dataset, applies our transformations, and returns the result as an in-memory Pandas dataframe.
# Compute
%time npi_clean = npi_updated.compute()
# Check the shape of cleaned dataframe
npi_clean.shape
We can see that the cleaned dataset has ~ 5.8M records with 13 attributes. Let's look at the first few rows of this dataset.
npi_clean.head()
# Check Null Values
npi_clean.isnull().sum()
We have a lot of missing values, and the dataset appears to be sparse. Let's explain some of these missing attributes:
- `taxonomy_code_1`: these codes identify the speciality of a provider. A healthcare provider has a primary speciality (`taxonomy_code_1`) and can have multiple secondary specialities (`taxonomy_code_2` through 15). The dataset shows only 11 missing values for the primary speciality. For our analysis, we will focus only on the primary taxonomy of these providers.
- `taxonomy_group_1`: these codes identify a business group of one or more individual practitioners who practice with different areas of specialization. We will not use it for our analysis.
- `addressline2`: an address in the U.S. may or may not include a second address line, so missing values here do not impact our analysis.
- `provider_gender`: providers are not required to report gender, hence the missing values. We will not use this attribute for our analysis.
- `organization_name`: we will not use an organization's name for our analysis.
Let's look at the records where `taxonomy_code_1` is missing.
# Identify records with `taxonomy_code_1` as null
npi_clean[npi_clean['taxonomy_code_1'].isnull()]
We will remove these records in future notebooks after geocoding the data.
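For reference, a minimal sketch of how those rows could be dropped from the cleaned Pandas dataframe (the actual removal happens in a later notebook, after geocoding):
# Hypothetical cleanup step: drop the rows whose primary taxonomy code is missing
npi_no_missing_taxonomy = npi_clean.dropna(subset=['taxonomy_code_1'])
print(npi_no_missing_taxonomy.shape)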
Let's explore healthcare provider counts for each state.
# UNIQUE STATES
state_list = npi_clean['state'].unique()
state_list
# Record Count for each state
unique_counts = npi_clean['state'].value_counts()
unique_counts[:5]
# Drop the trailing entries that are not US states
new_unique = unique_counts.drop(unique_counts.index[-21:])
# Plot Record Count by State
plt.figure(figsize=(25,12))
sns.barplot(x=new_unique.index, y=new_unique.values, color='lightcoral')
plt.title('Records by State', fontsize=22)
plt.xlabel('States', fontsize=18)
plt.ylabel('No. of Records', fontsize=18);
This chart shows healthcare provider counts by state and highlights the disparity in the availability of providers across states. States like CA, NY, and FL have more than 350,000 providers, whereas states like ND, VT, and WY have fewer than 20,000.
Taxonomy codes can be mapped to an area of specialization using this reference. Let's map the taxonomy codes of the top specialities to their names and then plot them.
# Create a copy of the npi_clean dataframe
npi_renamed = npi_clean.copy()
# Look at top 15 Specialities
npi_renamed['taxonomy_code_1'].value_counts()[:15]
# Remap taxonomy code values
npi_renamed['taxonomy_code_1'] = npi_renamed['taxonomy_code_1'].map({'225100000X':'Physical Therapist',
'183500000X':'Pharmacist','207R00000X':'Internal Medicine','101YM0800X':'Counselor-Mental Health',
'1041C0700X':'Social Worker-Clinical','390200000X':'Student-Healthcare','207Q00000X':'Family Medicine',
'1223G0001X':'Dentist-General','111N00000X':'Chiropractor','235Z00000X':'Speech Pathologist',
'363LF0000X':'Nurse-Family','122300000X':'Dentist','163W00000X':'Registered Nurse',
'174400000X':'Specialist-Non Doctors','363A00000X':'Physician Assistant','106H00000X':'Marriage-Family Therapist',
'106S00000X':'Behavior Technician','101YA0400X':'Counselor-Addiction','101Y00000X':'Counselor',
'171M00000X':'Case Manager','164W00000X':'Licensed Practical Nurse','104100000X':'Social Worker',
'103TC0700X':'Clinical Psychologist','225X00000X':'Occupational Therapist','172V00000X':'Community Health Worker',
'101YP2500X':'Professional Counselor'})
# Create a plot of top 15 Provider Specialities
fig, ax = plt.subplots(figsize=(20,8))
# Add data
provider = npi_renamed['taxonomy_code_1'].value_counts().reset_index()
ax.bar(provider.iloc[:15,0], provider.iloc[:15,1],color='lightcoral')
# Remove Spines
for key, spine in ax.spines.items():
spine.set_visible(False)
# Adjust margin, ticks etc and add title
ax.margins(0.01, 0)
ax.set_ylim(0,210000)
ax.set_title('Count of Providers by Speciality (Top 15)', fontsize=22)
ax.set_xlabel('Speciality (Taxonomy Codes)', fontsize=18)
ax.set_ylabel('Count', fontsize=18)
plt.xticks(rotation=90)
plt.show();
We can see that some of the top specializations of healthcare providers in the US include Physical Therapists, Pharmacists, Internal Medicine, Counselors, Social Workers, Family Medicine practitioners, and so on.
Let's look at the top 15 specialities for the states with the highest number of providers versus the states with the lowest.
states = ['CA','NY','FL','VT','ND','WY']
fig = plt.figure(figsize=(15, 15))
for sp in range(0,6):
ax = fig.add_subplot(2,3,sp+1)
provType = npi_renamed[npi_renamed['state']==states[sp]]['taxonomy_code_1'].value_counts().reset_index()
ax.bar(provType.iloc[:15,0], provType.iloc[:15,1],color='lightcoral')
ax.set_title(states[sp])
ax.set_ylim(0,provType.iloc[0,1])
plt.xticks(rotation=90)
for key, spine in ax.spines.items():
spine.set_visible(False)
ax.margins(0.01, 0)
plt.tight_layout() # automatically adjusts layout to fit long labels
plt.show();
CA, NY, and FL have the most providers, whereas states like ND, VT, and WY have the least. These charts clearly show the disparity in the availability of providers between the states with the most and the states with the fewest providers.
Let's also look at the split between entity types. Most providers are individuals, with far fewer organizations. This looks right, as an organization has its own NPI but may have multiple individual providers working for it.
fig, ax = plt.subplots(figsize=(5,5))
entity = npi_clean['entity_type'].value_counts().reset_index()
ax.bar(entity.iloc[:,0], entity.iloc[:,1],color='lightcoral')
for key, spine in ax.spines.items():
spine.set_visible(False)
ax.set_ylim(0,5000000)
ax.set_title('Entity Types in US', pad=20)
ax.tick_params(bottom=False)  # hide the bottom tick marks
plt.show();
We will export our cleaned data as separate csv files, one for each state, and store them in a single directory.
# Create a directory to store csv output
if "state_data" not in os.listdir(home_dir):
os.mkdir(os.path.join(home_dir,"state_data"))
state_data_dir = os.path.join(home_dir,"state_data")
os.listdir(home_dir)
# Create list of state names in U.S.
clean_statelist = ['NE', 'FL', 'NC', 'TX', 'CA', 'OH', 'NY', 'AZ', 'OK', 'MO', 'IL',
'TN', 'MN', 'NV', 'GA', 'AL', 'IA', 'NJ', 'CT', 'MD', 'MI', 'WI',
'NM', 'AR', 'PA', 'UT', 'KY', 'VA', 'IN', 'MS', 'ME', 'NH', 'MA',
'MT', 'LA', 'CO', 'WV', 'WA', 'RI', 'OR', 'DC', 'KS', 'ID', 'SC',
'HI', 'SD', 'DE', 'PR', 'AK', 'WY', 'VT', 'ND']
%%time
# Output csv files for each state and store in state_data_dir
for state_name in clean_statelist:
    state_df = npi_clean[npi_clean['state'] == state_name]
    state_df.to_csv(os.path.join(state_data_dir, 'npi_' + state_name + '.csv'), index=None, header=True)
So far, we have successfully cleaned and exported the provider data for each state. Now, to make this data useful for spatial analysis (i.e., to plot it on a map, aggregate it at a county or state level, or enrich it with additional socio-demographic features), we first need to geocode the cleaned data.
Considering the amount of data we have (~5.8M records), geocoding can be an expensive and time-consuming operation. Geocoding this data would consume ~232,000 credits and could take ~36 hours when performed as a sequential operation. This is where the power of GeoAnalytics Server really shines: it delivers high performance by processing massive datasets in a scalable and distributed fashion.
The Geocode Locations from Table tool is a convenient way to geocode large tables of addresses. Learn more about how this tool can be used through the ArcGIS API for Python here.
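To give a sense of what geocoding a table of addresses looks like with the ArcGIS API for Python, here is a minimal sketch using the batch_geocode function on a couple of made-up addresses. This is a simplified stand-in, not the GeoAnalytics tool or the exact workflow used for the full 5.8M-record dataset, and the sign-in credentials below are placeholders for an organization with geocoding privileges.
from arcgis.gis import GIS
from arcgis.geocoding import batch_geocode
# Sign in to an ArcGIS Online organization (placeholder credentials)
gis = GIS("https://www.arcgis.com", "your_username", "your_password")
# A couple of made-up addresses in the same shape as our full_address column
sample_addresses = [
    "380 New York St, , Redlands, CA 92373",
    "1600 Pennsylvania Ave NW, , Washington, DC 20500"
]
# batch_geocode returns one result per address, including its matched location (x, y)
results = batch_geocode(sample_addresses)
for address, result in zip(sample_addresses, results):
    print(address, "->", result["location"])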
We have geocoded this data for you and included it as part of the analysis. The geocoded data can be accessed as the provider_data_geocoded_7_30 feature layer, and we will see it being used in other parts of this study.
To summarize, in this notebook we:
- Read the ~5.8 million record NPPES provider dataset using Dask
- Cleaned the data by dropping unneeded columns, renaming the remaining ones, filtering for US records, building a `full_address` column, and recoding the entity types
- Explored provider counts by state, speciality, and entity type
- Exported the cleaned data as a separate csv file for each state
- Discussed how the cleaned data is geocoded for the spatial analysis in the rest of the study
[1] https://thehill.com/opinion/healthcare/447923-there-is-a-severe-physician-shortage-and-it-will-only-worsen
[2] https://news.aamc.org/press-releases/article/2019-workforce-projections-update/
[3] https://www.cms.gov/Regulations-and-Guidance/Administrative-Simplification/NationalProvIdentStand/DataDissemination.html