Analyzing Healthcare Provider Shortage - Part 1/4

  • 👟 Ready To Run!
  • 🔬 Data Science
  • 🚀 Data Exploration and Cleaning
  • ⚕️ Healthcare

America has a severe shortage of healthcare providers. According to a recent article [1] published in The Hill, the country faces a severe physician workforce shortage.

  • Demand for services is increasing
  • Shortages of physicians and mental health providers are limiting patients’ access to required treatment.

According to a 2019 study [2] conducted by the Association of American Medical Colleges (AAMC), by the year 2032, the United States will see a shortage of up to nearly 122,000 physicians. A shortfall of roughly 25,000 to 66,000 specialists and 21,000 to 55,000 primary care physicians.

Image source: https://blog.medicaresolutions.com/the-u-s-doctor-shortage/

In this study, we will use National Plan and Provider Enumeration System (NPPES) healthcare provider data [3] to:

  1. Read, clean and geocode ~ 5.8 million records of healthcare provider data.
  2. Identify shortage areas for all healthcare, mental health and OB-GYN healthcare providers in the U.S.
  3. Build global and local models to:
    • Identify key sociodemographic and economic factors that influence access to providers.
    • Identify the direction of influence i.e. whether the factors influence access to providers positively or negatively.
    • Identify how the influence of these factors varies across different counties.

Part 1: Provider Shortage - Data Preparation

In this notebook, we will:

  • Read large data file (~6GB) using Dask
  • Clean the data by:
    • dropping unused columns
    • adding new columns
    • renaming columns
    • change categories of a column
    • change data types
  • Compute data using Dask
  • Basic exploration and missing value determination
  • Export data into multiple csv files
  • Review how GeoAnalytics server can be used to geocode massive datasets

Get the Data

The latest Healthcare Provider data file can be downloaded from NPPES. For our analysis, we willl use the file from May' 2019. The dataset contains ~5.8 million records with 329 attributes; 6.7GB on disk. Data includes healthcare provider information such as:

  • Organization Name, Address, National Provider Identifier, Gender, Entity Type
  • Taxonomy codes that define provider type, classification and area of specialization​
In [19]:
# Import Libraries

# Import Dask for reading and processing big data
import dask.dataframe as dd

# Import Pandas for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)

# Import plotting libraries
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

# Import libraries for time and output
import time
import os
import shutil
import zipfile

# Import arcgis
import arcgis
from arcgis.gis import GIS

Extract data from zipped file

This notebook will use npidata.zip csv file located in /arcgis/samplesdata/. Run the below cell to copy csv file to your working directory of /arcgis/home/ so you can modify it locally.

In [23]:
arcgis_dir = './arcgis'
home_dir = os.path.join(arcgis_dir, 'home')
samples_dir = os.path.join(arcgis_dir, 'samplesdata')

print(f"root dir: {arcgis_dir}")
print(f"home dir: {home_dir}")
print(f"samples dir: {samples_dir}")
root dir: ./arcgis
home dir: ./arcgis\home
samples dir: ./arcgis\samplesdata
In [25]:
def copy_sample_csv_to_home(zip_csv_name):
    """Given the full filename (with extensions) of a csv zip file in 
    /arcgis/samplesdata/, will copy and unzip the file to /arcgis/home/
    Will return the full path to the unzipped csv in home"""

    # Get the full paths of all the source and destination files to copy
    csv_dir_name = zip_csv_name.split(".zip")[0]
    csv_zip_path_src = os.path.join(samples_dir, zip_csv_name)
    csv_dir_path_src = os.path.join(samples_dir, csv_dir_name)
    
    csv_zip_path_dst = os.path.join(home_dir, zip_csv_name)
    csv_dir_path_dst = os.path.join(home_dir, csv_dir_name)

    # If the gdb has been copied/unzipped to home dir before, delete it
    if os.path.exists(csv_zip_path_dst):
        os.remove(csv_zip_path_dst)
    if os.path.exists(csv_dir_path_dst):
        shutil.rmtree(csv_dir_path_dst)

    # Copy the zip file to home, unzip it
    shutil.copy(csv_zip_path_src, csv_zip_path_dst)
    zip_ref = zipfile.ZipFile(csv_zip_path_dst, 'r')
    zip_ref.extractall(home_dir)
    zip_ref.close()
    
    # Return the output full path to /arcgis/home/unzipped_gdb
    return csv_dir_path_dst

# call the function to copy data needed for this analysis
csv_path = copy_sample_csv_to_home('npidata.zip')

print(f"CSV succesfully copied {csv_path}")
GDB succesfully copied ./arcgis\home\npidata
In [28]:
# Get Unzipped file
data_file = os.path.join(home_dir, 'npidata.csv')
print(data_file)
./arcgis\home\npidata.csv

Read data using Dask

Sheer size of this data does not allow for exploration using basic tools like Excel or Pandas. Traditional packages such as Pandas cannot be used as they expect data to fit fully in memory and are not designed to scale beyond a single machine.

Thus, in this part of the study, we will use Dask, a distributed data analysis library. Dask enables efficient parallel computations on single machines by leveraging their multi-core CPUs and streaming data efficiently from disk. Dask provides a DataFrame object that mimics traditional Pandas DataFrame which can be used to perform operations to slice, dice the data and do explorations. Operations on a DataFrame get queued and are operated only when necessary. When executed, Dask will read data in chunks, distribute it (be it cores on a single machine or multiple machines in a cluster set up) and compute the data for you. Thus, Dask allows us to work with any larger than memory dataset as it performs operations on chunks in a distributed manner.

We will read the data as a Dask dataframe for our analysis.

In [29]:
# Read the data as Dask dataframe

%time npi_data = dd.read_csv(data_file, blocksize=1000000, dtype=str)
Wall time: 1min 4s
In [30]:
# Confirm type of dataset
type(npi_data)
Out[30]:
dask.dataframe.core.DataFrame
In [31]:
npi_data.head()
Out[31]:
NPI Entity Type Code Replacement NPI Employer Identification Number (EIN) Provider Organization Name (Legal Business Name) Provider Last Name (Legal Name) Provider First Name Provider Middle Name Provider Name Prefix Text Provider Name Suffix Text Provider Credential Text Provider Other Organization Name Provider Other Organization Name Type Code Provider Other Last Name Provider Other First Name Provider Other Middle Name Provider Other Name Prefix Text Provider Other Name Suffix Text Provider Other Credential Text Provider Other Last Name Type Code Provider First Line Business Mailing Address Provider Second Line Business Mailing Address Provider Business Mailing Address City Name Provider Business Mailing Address State Name Provider Business Mailing Address Postal Code Provider Business Mailing Address Country Code (If outside U.S.) Provider Business Mailing Address Telephone Number Provider Business Mailing Address Fax Number Provider First Line Business Practice Location Address Provider Second Line Business Practice Location Address Provider Business Practice Location Address City Name Provider Business Practice Location Address State Name Provider Business Practice Location Address Postal Code Provider Business Practice Location Address Country Code (If outside U.S.) Provider Business Practice Location Address Telephone Number Provider Business Practice Location Address Fax Number Provider Enumeration Date Last Update Date NPI Deactivation Reason Code NPI Deactivation Date NPI Reactivation Date Provider Gender Code Authorized Official Last Name Authorized Official First Name Authorized Official Middle Name Authorized Official Title or Position Authorized Official Telephone Number Healthcare Provider Taxonomy Code_1 Provider License Number_1 Provider License Number State Code_1 Healthcare Provider Primary Taxonomy Switch_1 Healthcare Provider Taxonomy Code_2 Provider License Number_2 Provider License Number State Code_2 Healthcare Provider Primary Taxonomy Switch_2 Healthcare Provider Taxonomy Code_3 Provider License Number_3 Provider License Number State Code_3 Healthcare Provider Primary Taxonomy Switch_3 Healthcare Provider Taxonomy Code_4 Provider License Number_4 Provider License Number State Code_4 Healthcare Provider Primary Taxonomy Switch_4 Healthcare Provider Taxonomy Code_5 Provider License Number_5 Provider License Number State Code_5 Healthcare Provider Primary Taxonomy Switch_5 Healthcare Provider Taxonomy Code_6 Provider License Number_6 Provider License Number State Code_6 Healthcare Provider Primary Taxonomy Switch_6 Healthcare Provider Taxonomy Code_7 Provider License Number_7 Provider License Number State Code_7 Healthcare Provider Primary Taxonomy Switch_7 Healthcare Provider Taxonomy Code_8 Provider License Number_8 Provider License Number State Code_8 Healthcare Provider Primary Taxonomy Switch_8 Healthcare Provider Taxonomy Code_9 Provider License Number_9 Provider License Number State Code_9 Healthcare Provider Primary Taxonomy Switch_9 Healthcare Provider Taxonomy Code_10 Provider License Number_10 Provider License Number State Code_10 Healthcare Provider Primary Taxonomy Switch_10 Healthcare Provider Taxonomy Code_11 Provider License Number_11 Provider License Number State Code_11 Healthcare Provider Primary Taxonomy Switch_11 Healthcare Provider Taxonomy Code_12 Provider License Number_12 Provider License Number State Code_12 Healthcare Provider Primary Taxonomy Switch_12 Healthcare Provider Taxonomy Code_13 Provider License Number_13 Provider License Number State Code_13 Healthcare Provider Primary Taxonomy Switch_13 Healthcare Provider Taxonomy Code_14 Provider License Number_14 Provider License Number State Code_14 Healthcare Provider Primary Taxonomy Switch_14 Healthcare Provider Taxonomy Code_15 Provider License Number_15 Provider License Number State Code_15 Healthcare Provider Primary Taxonomy Switch_15 Other Provider Identifier_1 Other Provider Identifier Type Code_1 Other Provider Identifier State_1 Other Provider Identifier Issuer_1 Other Provider Identifier_2 Other Provider Identifier Type Code_2 Other Provider Identifier State_2 Other Provider Identifier Issuer_2 Other Provider Identifier_3 Other Provider Identifier Type Code_3 Other Provider Identifier State_3 Other Provider Identifier Issuer_3 Other Provider Identifier_4 Other Provider Identifier Type Code_4 Other Provider Identifier State_4 Other Provider Identifier Issuer_4 Other Provider Identifier_5 Other Provider Identifier Type Code_5 Other Provider Identifier State_5 Other Provider Identifier Issuer_5 Other Provider Identifier_6 Other Provider Identifier Type Code_6 Other Provider Identifier State_6 Other Provider Identifier Issuer_6 Other Provider Identifier_7 Other Provider Identifier Type Code_7 Other Provider Identifier State_7 Other Provider Identifier Issuer_7 Other Provider Identifier_8 Other Provider Identifier Type Code_8 Other Provider Identifier State_8 Other Provider Identifier Issuer_8 Other Provider Identifier_9 Other Provider Identifier Type Code_9 Other Provider Identifier State_9 Other Provider Identifier Issuer_9 Other Provider Identifier_10 Other Provider Identifier Type Code_10 Other Provider Identifier State_10 Other Provider Identifier Issuer_10 Other Provider Identifier_11 Other Provider Identifier Type Code_11 Other Provider Identifier State_11 Other Provider Identifier Issuer_11 Other Provider Identifier_12 Other Provider Identifier Type Code_12 Other Provider Identifier State_12 Other Provider Identifier Issuer_12 Other Provider Identifier_13 Other Provider Identifier Type Code_13 Other Provider Identifier State_13 Other Provider Identifier Issuer_13 Other Provider Identifier_14 Other Provider Identifier Type Code_14 Other Provider Identifier State_14 Other Provider Identifier Issuer_14 Other Provider Identifier_15 Other Provider Identifier Type Code_15 Other Provider Identifier State_15 Other Provider Identifier Issuer_15 Other Provider Identifier_16 Other Provider Identifier Type Code_16 Other Provider Identifier State_16 Other Provider Identifier Issuer_16 Other Provider Identifier_17 Other Provider Identifier Type Code_17 Other Provider Identifier State_17 Other Provider Identifier Issuer_17 Other Provider Identifier_18 Other Provider Identifier Type Code_18 Other Provider Identifier State_18 Other Provider Identifier Issuer_18 Other Provider Identifier_19 Other Provider Identifier Type Code_19 Other Provider Identifier State_19 Other Provider Identifier Issuer_19 Other Provider Identifier_20 Other Provider Identifier Type Code_20 Other Provider Identifier State_20 Other Provider Identifier Issuer_20 Other Provider Identifier_21 Other Provider Identifier Type Code_21 Other Provider Identifier State_21 Other Provider Identifier Issuer_21 Other Provider Identifier_22 Other Provider Identifier Type Code_22 Other Provider Identifier State_22 Other Provider Identifier Issuer_22 Other Provider Identifier_23 Other Provider Identifier Type Code_23 Other Provider Identifier State_23 Other Provider Identifier Issuer_23 Other Provider Identifier_24 Other Provider Identifier Type Code_24 Other Provider Identifier State_24 Other Provider Identifier Issuer_24 Other Provider Identifier_25 Other Provider Identifier Type Code_25 Other Provider Identifier State_25 Other Provider Identifier Issuer_25 Other Provider Identifier_26 Other Provider Identifier Type Code_26 Other Provider Identifier State_26 Other Provider Identifier Issuer_26 Other Provider Identifier_27 Other Provider Identifier Type Code_27 Other Provider Identifier State_27 Other Provider Identifier Issuer_27 Other Provider Identifier_28 Other Provider Identifier Type Code_28 Other Provider Identifier State_28 Other Provider Identifier Issuer_28 Other Provider Identifier_29 Other Provider Identifier Type Code_29 Other Provider Identifier State_29 Other Provider Identifier Issuer_29 Other Provider Identifier_30 Other Provider Identifier Type Code_30 Other Provider Identifier State_30 Other Provider Identifier Issuer_30 Other Provider Identifier_31 Other Provider Identifier Type Code_31 Other Provider Identifier State_31 Other Provider Identifier Issuer_31 Other Provider Identifier_32 Other Provider Identifier Type Code_32 Other Provider Identifier State_32 Other Provider Identifier Issuer_32 Other Provider Identifier_33 Other Provider Identifier Type Code_33 Other Provider Identifier State_33 Other Provider Identifier Issuer_33 Other Provider Identifier_34 Other Provider Identifier Type Code_34 Other Provider Identifier State_34 Other Provider Identifier Issuer_34 Other Provider Identifier_35 Other Provider Identifier Type Code_35 Other Provider Identifier State_35 Other Provider Identifier Issuer_35 Other Provider Identifier_36 Other Provider Identifier Type Code_36 Other Provider Identifier State_36 Other Provider Identifier Issuer_36 Other Provider Identifier_37 Other Provider Identifier Type Code_37 Other Provider Identifier State_37 Other Provider Identifier Issuer_37 Other Provider Identifier_38 Other Provider Identifier Type Code_38 Other Provider Identifier State_38 Other Provider Identifier Issuer_38 Other Provider Identifier_39 Other Provider Identifier Type Code_39 Other Provider Identifier State_39 Other Provider Identifier Issuer_39 Other Provider Identifier_40 Other Provider Identifier Type Code_40 Other Provider Identifier State_40 Other Provider Identifier Issuer_40 Other Provider Identifier_41 Other Provider Identifier Type Code_41 Other Provider Identifier State_41 Other Provider Identifier Issuer_41 Other Provider Identifier_42 Other Provider Identifier Type Code_42 Other Provider Identifier State_42 Other Provider Identifier Issuer_42 Other Provider Identifier_43 Other Provider Identifier Type Code_43 Other Provider Identifier State_43 Other Provider Identifier Issuer_43 Other Provider Identifier_44 Other Provider Identifier Type Code_44 Other Provider Identifier State_44 Other Provider Identifier Issuer_44 Other Provider Identifier_45 Other Provider Identifier Type Code_45 Other Provider Identifier State_45 Other Provider Identifier Issuer_45 Other Provider Identifier_46 Other Provider Identifier Type Code_46 Other Provider Identifier State_46 Other Provider Identifier Issuer_46 Other Provider Identifier_47 Other Provider Identifier Type Code_47 Other Provider Identifier State_47 Other Provider Identifier Issuer_47 Other Provider Identifier_48 Other Provider Identifier Type Code_48 Other Provider Identifier State_48 Other Provider Identifier Issuer_48 Other Provider Identifier_49 Other Provider Identifier Type Code_49 Other Provider Identifier State_49 Other Provider Identifier Issuer_49 Other Provider Identifier_50 Other Provider Identifier Type Code_50 Other Provider Identifier State_50 Other Provider Identifier Issuer_50 Is Sole Proprietor Is Organization Subpart Parent Organization LBN Parent Organization TIN Authorized Official Name Prefix Text Authorized Official Name Suffix Text Authorized Official Credential Text Healthcare Provider Taxonomy Group_1 Healthcare Provider Taxonomy Group_2 Healthcare Provider Taxonomy Group_3 Healthcare Provider Taxonomy Group_4 Healthcare Provider Taxonomy Group_5 Healthcare Provider Taxonomy Group_6 Healthcare Provider Taxonomy Group_7 Healthcare Provider Taxonomy Group_8 Healthcare Provider Taxonomy Group_9 Healthcare Provider Taxonomy Group_10 Healthcare Provider Taxonomy Group_11 Healthcare Provider Taxonomy Group_12 Healthcare Provider Taxonomy Group_13 Healthcare Provider Taxonomy Group_14 Healthcare Provider Taxonomy Group_15
0 1679576722 1 NaN NaN NaN WIEBE DAVID A NaN NaN M.D. NaN NaN NaN NaN NaN NaN NaN NaN NaN PO BOX 2168 NaN KEARNEY NE 688482168 US 3088652512 3088652506 3500 CENTRAL AVE NaN KEARNEY NE 688472944 US 3088652512 3088652506 05/23/2005 07/08/2007 NaN NaN NaN M NaN NaN NaN NaN NaN 207X00000X 12637 NE Y NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 645540 01 KS FIRSTGUARD 46969 01 KS BCBS 1553 01 NE BCBS NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN X NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1588667638 1 NaN NaN NaN PILCHER WILLIAM C DR. NaN MD NaN NaN NaN NaN NaN NaN NaN NaN NaN 1824 KING STREET SUITE 300 JACKSONVILLE FL 322044736 US 9043881820 9043881827 1824 KING STREET SUITE 300 JACKSONVILLE FL 322044736 US 9043881820 9043881827 05/23/2005 05/29/2014 NaN NaN NaN M NaN NaN NaN NaN NaN 207RC0000X ME68414 FL Y 207RC0000X 032024 GA N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 00532485C 05 GA NaN 510265 01 GA BCBS 251286600 05 FL NaN 27888 01 FL BCBS 208143 01 FL AVMED 00706626A 05 GA NaN 0897705 01 FL AETNA NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1497758544 2 NaN <UNAVAIL> CUMBERLAND COUNTY HOSPITAL SYSTEM, INC NaN NaN NaN NaN NaN NaN CAPE FEAR VALLEY HOME HEALTH AND HOSPICE 3 NaN NaN NaN NaN NaN NaN NaN 3418 VILLAGE DR NaN FAYETTEVILLE NC 283044552 US 9106096740 NaN 3418 VILLAGE DR NaN FAYETTEVILLE NC 283044552 US 9106096740 NaN 05/23/2005 09/26/2011 NaN NaN NaN NaN NAGOWSKI MICHAEL NaN CEO 9106096700 251G00000X HC0283 NC Y NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3401562 05 NC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN MR. NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1306849450 1 NaN NaN NaN SMITSON HAROLD LEROY DR. II M.D. NaN NaN NaN NaN NaN NaN NaN NaN NaN 810 LUCAS DR NaN ATHENS TX 757513446 US 9036756778 9036752333 810 LUCAS DR NaN ATHENS TX 757513446 US 9036756778 9036752333 05/23/2005 01/03/2008 NaN NaN NaN M NaN NaN NaN NaN NaN 2085R0202X E5444 TX Y NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 132476603 05 TX NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1215930367 1 NaN NaN NaN GRESSOT LAURENT NaN DR. NaN M.D. NaN NaN NaN NaN NaN NaN NaN NaN NaN 17323 RED OAK DR NaN HOUSTON TX 770901243 US 2814405006 2814406149 17323 RED OAK DR NaN HOUSTON TX 770901243 US 2814405006 2814406149 05/23/2005 11/25/2014 NaN NaN NaN M NaN NaN NaN NaN NaN 174400000X H6257 TX N 207RH0003X H6257 TX Y NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

A Dask dataframe doesn't know how many records are in data without reading through all of it. Reading through all the data will trigger computation and take some time which is not necessary at this time. We will just check the number of columns our data has.

In [32]:
# Check no. of columns
len(npi_data.columns)
Out[32]:
329

We can see that the dataset is pretty wide with 329 attributes. We can classify the attributes of this dataset as:

  • Provider Details - this includes items like:
    • National Provider Identifier (NPI) which is a unique identifier
    • Name, Gender, Credential, Provider License No., Employer Identification Number (EIN), Entity Type (Individual/Org)
    • Address and Phone No. details for both mailing and practice location addresses
  • Provider License and Speciality Details - this includes items like:
    • Taxonomy Code, Taxonomy Group, Taxonomy Switch - these identify the speciality of a provider
    • Provider Identifier/type/state/issuer columns
    • License No., License No. State Codes for each state for which provider holds a license

Clean the Data

We will clean the data by performing following tasks:

  • Remove unnecessary columns
  • Rename columns
  • Filter data for United States
  • Create a new full_address column by combining various address fields
  • Change category names for entity_type

Drop Columns

For our analysis, we will keep only 12 attributes that are of value to us, removing all others.

In [33]:
# Identify columns to keep
cols_to_keep = ['NPI','Entity Type Code', 'Healthcare Provider Taxonomy Code_1','Healthcare Provider Taxonomy Group_1',\
                    'Provider Gender Code','Provider Organization Name (Legal Business Name)',\
                    'Provider First Line Business Practice Location Address',\
                   'Provider Second Line Business Practice Location Address','Provider Business Practice Location Address City Name',\
                    'Provider Business Practice Location Address State Name','Provider Business Practice Location Address Country Code (If outside U.S.)',\
                   'Provider Business Practice Location Address Postal Code']
In [34]:
# Get columns to drop
all_columns = set(npi_data.columns)
cols_to_keep_set = set(cols_to_keep)
cols_to_drop = all_columns - cols_to_keep_set
In [35]:
# Check length 
len(all_columns), len(cols_to_keep_set), len(cols_to_drop)
Out[35]:
(329, 12, 317)
In [36]:
# Drop columns
npi_updated = npi_data.drop(list(cols_to_drop), axis=1)
In [37]:
npi_updated.head()
Out[37]:
NPI Entity Type Code Provider Organization Name (Legal Business Name) Provider First Line Business Practice Location Address Provider Second Line Business Practice Location Address Provider Business Practice Location Address City Name Provider Business Practice Location Address State Name Provider Business Practice Location Address Postal Code Provider Business Practice Location Address Country Code (If outside U.S.) Provider Gender Code Healthcare Provider Taxonomy Code_1 Healthcare Provider Taxonomy Group_1
0 1679576722 1 NaN 3500 CENTRAL AVE NaN KEARNEY NE 688472944 US M 207X00000X NaN
1 1588667638 1 NaN 1824 KING STREET SUITE 300 JACKSONVILLE FL 322044736 US M 207RC0000X NaN
2 1497758544 2 CUMBERLAND COUNTY HOSPITAL SYSTEM, INC 3418 VILLAGE DR NaN FAYETTEVILLE NC 283044552 US NaN 251G00000X NaN
3 1306849450 1 NaN 810 LUCAS DR NaN ATHENS TX 757513446 US M 2085R0202X NaN
4 1215930367 1 NaN 17323 RED OAK DR NaN HOUSTON TX 770901243 US M 174400000X NaN

Rename Columns

Here we will rename columns to ensure they are easy to read and understand.

In [38]:
# Rename columns
npi_updated = npi_updated.rename(columns={"Entity Type Code":"entity_type",'Provider First Line Business Practice Location Address':'addressline1',
                'Healthcare Provider Taxonomy Code_1':'taxonomy_code_1','Healthcare Provider Taxonomy Group_1':'taxonomy_group_1',
                'Provider Gender Code':'provider_gender','Provider Organization Name (Legal Business Name)':'organization_name',
                'Provider Second Line Business Practice Location Address':'addressline2',
                'Provider Business Practice Location Address City Name':'city',
                'Provider Business Practice Location Address State Name':'state',
                'Provider Business Practice Location Address Country Code (If outside U.S.)':'country',
                'Provider Business Practice Location Address Postal Code':'postal_code','NPI':'npi'})
In [39]:
npi_updated.head()
Out[39]:
npi entity_type organization_name addressline1 addressline2 city state postal_code country provider_gender taxonomy_code_1 taxonomy_group_1
0 1679576722 1 NaN 3500 CENTRAL AVE NaN KEARNEY NE 688472944 US M 207X00000X NaN
1 1588667638 1 NaN 1824 KING STREET SUITE 300 JACKSONVILLE FL 322044736 US M 207RC0000X NaN
2 1497758544 2 CUMBERLAND COUNTY HOSPITAL SYSTEM, INC 3418 VILLAGE DR NaN FAYETTEVILLE NC 283044552 US NaN 251G00000X NaN
3 1306849450 1 NaN 810 LUCAS DR NaN ATHENS TX 757513446 US M 2085R0202X NaN
4 1215930367 1 NaN 17323 RED OAK DR NaN HOUSTON TX 770901243 US M 174400000X NaN

Filter data and Add new column

Here we will filter the data for US and add a new full_address column

In [40]:
# Select data for US only
npi_updated = npi_updated[npi_updated['country']=='US']
In [41]:
# Create new address column
npi_updated['full_address'] = npi_updated['addressline1'].astype(str)+', '+npi_updated['addressline2'].fillna('').astype(str)+', '\
    +npi_updated['city'].astype(str)+', '+npi_updated['state'].astype(str)+' '+npi_updated['postal_code'].astype(str)
In [42]:
npi_updated.head()
Out[42]:
npi entity_type organization_name addressline1 addressline2 city state postal_code country provider_gender taxonomy_code_1 taxonomy_group_1 full_address
0 1679576722 1 NaN 3500 CENTRAL AVE NaN KEARNEY NE 688472944 US M 207X00000X NaN 3500 CENTRAL AVE, , KEARNEY, NE 688472944
1 1588667638 1 NaN 1824 KING STREET SUITE 300 JACKSONVILLE FL 322044736 US M 207RC0000X NaN 1824 KING STREET, SUITE 300, JACKSONVILLE, FL ...
2 1497758544 2 CUMBERLAND COUNTY HOSPITAL SYSTEM, INC 3418 VILLAGE DR NaN FAYETTEVILLE NC 283044552 US NaN 251G00000X NaN 3418 VILLAGE DR, , FAYETTEVILLE, NC 283044552
3 1306849450 1 NaN 810 LUCAS DR NaN ATHENS TX 757513446 US M 2085R0202X NaN 810 LUCAS DR, , ATHENS, TX 757513446
4 1215930367 1 NaN 17323 RED OAK DR NaN HOUSTON TX 770901243 US M 174400000X NaN 17323 RED OAK DR, , HOUSTON, TX 770901243

Change Categories of Entity Types

Entity types are categorized as Individual or Organization. Here, we will change the categories from 0 and 1 to reflect the correct categories.

In [43]:
# Change category for Entity Type
npi_updated['entity_type'] = npi_updated['entity_type'].mask(npi_updated['entity_type'] == '1', 'Individual').mask(npi_updated['entity_type'] == '2', 'Organization')
In [44]:
npi_updated.head()
Out[44]:
npi entity_type organization_name addressline1 addressline2 city state postal_code country provider_gender taxonomy_code_1 taxonomy_group_1 full_address
0 1679576722 Individual NaN 3500 CENTRAL AVE NaN KEARNEY NE 688472944 US M 207X00000X NaN 3500 CENTRAL AVE, , KEARNEY, NE 688472944
1 1588667638 Individual NaN 1824 KING STREET SUITE 300 JACKSONVILLE FL 322044736 US M 207RC0000X NaN 1824 KING STREET, SUITE 300, JACKSONVILLE, FL ...
2 1497758544 Organization CUMBERLAND COUNTY HOSPITAL SYSTEM, INC 3418 VILLAGE DR NaN FAYETTEVILLE NC 283044552 US NaN 251G00000X NaN 3418 VILLAGE DR, , FAYETTEVILLE, NC 283044552
3 1306849450 Individual NaN 810 LUCAS DR NaN ATHENS TX 757513446 US M 2085R0202X NaN 810 LUCAS DR, , ATHENS, TX 757513446
4 1215930367 Individual NaN 17323 RED OAK DR NaN HOUSTON TX 770901243 US M 174400000X NaN 17323 RED OAK DR, , HOUSTON, TX 770901243

Compute the Data

The compute operation allows Dask to read through all the data and trigger computation that has been queued up.

In [45]:
# Compute
%time npi_clean = npi_updated.compute()
Wall time: 20min 12s
In [46]:
# Check the shape of cleaned dataframe
npi_clean.shape
Out[46]:
(5796083, 13)

We can see that the cleaned dataset has ~ 5.8M records with 13 attributes. Let's look at the first few rows of this dataset.

In [47]:
npi_clean.head()
Out[47]:
npi entity_type organization_name addressline1 addressline2 city state postal_code country provider_gender taxonomy_code_1 taxonomy_group_1 full_address
0 1679576722 Individual NaN 3500 CENTRAL AVE NaN KEARNEY NE 688472944 US M 207X00000X NaN 3500 CENTRAL AVE, , KEARNEY, NE 688472944
1 1588667638 Individual NaN 1824 KING STREET SUITE 300 JACKSONVILLE FL 322044736 US M 207RC0000X NaN 1824 KING STREET, SUITE 300, JACKSONVILLE, FL ...
2 1497758544 Organization CUMBERLAND COUNTY HOSPITAL SYSTEM, INC 3418 VILLAGE DR NaN FAYETTEVILLE NC 283044552 US NaN 251G00000X NaN 3418 VILLAGE DR, , FAYETTEVILLE, NC 283044552
3 1306849450 Individual NaN 810 LUCAS DR NaN ATHENS TX 757513446 US M 2085R0202X NaN 810 LUCAS DR, , ATHENS, TX 757513446
4 1215930367 Individual NaN 17323 RED OAK DR NaN HOUSTON TX 770901243 US M 174400000X NaN 17323 RED OAK DR, , HOUSTON, TX 770901243

Check for Null Values

In [48]:
# Check Null Values
npi_clean.isnull().sum()
Out[48]:
npi                        0
entity_type                0
organization_name    4469338
addressline1               0
addressline2         4230506
city                       1
state                      0
postal_code                5
country                    0
provider_gender      1326764
taxonomy_code_1           11
taxonomy_group_1     5116464
full_address               0
dtype: int64

We have a lot of missing values and dataset seems to be sparse. Let's explain some of these missing attributes:

  1. taxonomy_code_1 - these codes identify the speciality of a provider. A healthcare provider has a primary speciality (taxonomy_code_1) and can have multiple secondary specialities (taxonomy_code_2 through 15). This dataset shows only 11 missing values for the primary speciality. For our analysis, we will only focus on the primary taxonomy of these providers.
  2. taxonomy_group_1 - these codes identify a business group of one or more individual practitioners, who practice with different areas of specialization. We will not use it for our analysis.
  3. addressline2 - an address in U.S. may or may not include a second address line and does not impact out analysis.
  4. provider_gender - providers are not required to report gender and hence the missing values. We will not use this attribute for our analysis.
  5. organization_name - we will not use an organization's name for our analysis.

Let's look at records where taxonomy_code_1 is missing.

In [49]:
# Identify records with `taxonomy_code_1` as null
npi_clean[npi_clean['taxonomy_code_1'].isnull()]
Out[49]:
npi entity_type organization_name addressline1 addressline2 city state postal_code country provider_gender taxonomy_code_1 taxonomy_group_1 full_address
544 1255524401 Individual NaN 1011 HAWKINS WAY NaN GLEN BURNIE MD 210615179 US M NaN NaN 1011 HAWKINS WAY, , GLEN BURNIE, MD 210615179
12 1568648855 Individual NaN 1815 PLEASANT GROVE RD NaN JONESBORO AR 724017870 US F NaN NaN 1815 PLEASANT GROVE RD, , JONESBORO, AR 724017870
571 1447556469 Organization LEWIS FAMILY PHARMACY INC 29930 W 12 MILE RD STE 2 NaN FARMINGTON HILLS MI 483343983 US NaN NaN NaN 29930 W 12 MILE RD STE 2, , FARMINGTON HILLS, ...
309 1528352010 Organization ROSEWOOD FAMILY PHARMACY PLLC 1010 S MAGNOLIA BLVD STE D NaN MAGNOLIA TX 773558550 US NaN NaN NaN 1010 S MAGNOLIA BLVD STE D, , MAGNOLIA, TX 773...
126 1023305125 Organization SHIV PHARMACY INC. 1948 PERFECTION AVE NaN BELMONT NC 280122173 US NaN NaN NaN 1948 PERFECTION AVE, , BELMONT, NC 280122173
66 1215302476 Individual NaN 2725 HIGHWAY 51 S. NaN HERNANDO MS 38632 US F NaN NaN 2725 HIGHWAY 51 S., , HERNANDO, MS 38632
578 1023566205 Organization ALLYSCRIPTS LLC 201 LONNIE E CRAWFORD BLVD STE B NaN SCOTTSBORO AL 357697408 US NaN NaN NaN 201 LONNIE E CRAWFORD BLVD STE B, , SCOTTSBORO...
119 1679003917 Individual NaN 232 E GISH RD NaN SAN JOSE CA 951124706 US F NaN NaN 232 E GISH RD, , SAN JOSE, CA 951124706
453 1194231928 Organization JAYS PHARMACY OF MADISON LLC 683 E BASE ST NaN MADISON FL 323402705 US NaN NaN NaN 683 E BASE ST, , MADISON, FL 323402705
379 1548767650 Organization KING TUT PHARMACY LLC 90 E 3RD ST SUITE B DOUGLAS AZ 856073154 US NaN NaN NaN 90 E 3RD ST, SUITE B, DOUGLAS, AZ 856073154
758 1649768383 Individual NaN 215 PARK AVE S NaN NEW YORK NY 100031603 US M NaN NaN 215 PARK AVE S, , NEW YORK, NY 100031603

We will remove these records in future notebooks after geocoding the data.

Initial Data Exploration

Provider Count for Each State

Let's explore healthcare provider counts by each state

In [50]:
# UNIQUE STATES
state_list = npi_clean['state'].unique()
state_list
Out[50]:
array(['NE', 'FL', 'NC', 'TX', 'CA', 'OH', 'NY', 'AZ', 'OK', 'MO', 'IL',
       'TN', 'MN', 'NV', 'GA', 'AL', 'IA', 'NJ', 'CT', 'MD', 'MI', 'WI',
       'NM', 'AR', 'PA', 'UT', 'KY', 'VA', 'IN', 'MS', 'ME', 'NH', 'MA',
       'MT', 'LA', 'CO', 'WV', 'WA', 'RI', 'OR', 'DC', 'KS', 'ID', 'SC',
       'HI', 'SD', 'DE', 'PR', 'AK', 'WY', 'VT', 'ND', 'AE', 'FM', 'VI',
       'GU', 'AP', 'PW', 'AA', 'AS', 'FPO', 'MP', 'ITALY', 'FP', 'MH',
       'MILITARY', 'UNITED STATES', 'US', 'USA', 'US NAVY', 'GUAM',
       'GERMANY', 'KOREA'], dtype=object)
In [51]:
# Record Count for each state
unique_counts = npi_clean['state'].value_counts()
unique_counts[:5]
Out[51]:
CA    678482
NY    467416
FL    375603
TX    365783
PA    233208
Name: state, dtype: int64
In [52]:
# Drop records that are not in US
new_unique = unique_counts.drop(unique_counts.index[-21:])
In [53]:
# Plot Record Count by State

plt.figure(figsize=(25,12))
sns.barplot(new_unique.index, new_unique.values,color='lightcoral')
plt.title('Records by State', fontsize=22)
plt.xlabel('States', fontsize=18)
plt.ylabel('No. of Records', fontsize=18);
Out[53]:
Text(0, 0.5, 'No. of Records')

This chart shows healthcare provider counts by each state. Here we can see the disparity in availability of providers for different states. States like CA, NY, FL have more than 350,000 providers whereas states like ND, VT, WY have fewer than 20,000 providers.

Top 15 Provider Specialities (Taxonomy Codes)

Taxonomy codes can be mapped to area of specialization using this reference. Let's create a mapping of the names of top specialities with taxonomy codes and then plot.

In [54]:
# Create a copy of the npi_clean dataframe
npi_renamed = npi_clean.copy()

# Look at top 15 Specialities
npi_renamed['taxonomy_code_1'].value_counts()[:15]
Out[54]:
225100000X    200961
183500000X    197467
207R00000X    193138
101YM0800X    192386
1041C0700X    187014
390200000X    181283
207Q00000X    175167
1223G0001X    147548
111N00000X    132915
235Z00000X    131542
363LF0000X    118961
122300000X    106991
163W00000X    105641
174400000X     98625
363A00000X     93967
Name: taxonomy_code_1, dtype: int64
In [55]:
# Remap taxonomy code values
npi_renamed['taxonomy_code_1'] = npi_renamed['taxonomy_code_1'].map({'225100000X':'Physical Therapist',
            '183500000X':'Pharmacist','207R00000X':'Internam Medicine','101YM0800X':'Counselor-Mental Health',
            '1041C0700X':'Social Worker-Clinical','390200000X':'Student-Healthcare','207Q00000X':'Family Medicine',
            '1223G0001X':'Dentist-General','111N00000X':'Chiropractor','235Z00000X':'Speech Pathologist',
            '363LF0000X':'Nurse-Family','122300000X':'Dentist','163W00000X':'Registered Nurse',
            '174400000X':'Spacialist-Non Doctors','363A00000X':'Physician Assistant','106H00000X':'Marriage-Family Therapist',
            '106S00000X':'Behavior Technician','101YA0400X':'Counselor-Addiction','101Y00000X':'Counselor',
            '171M00000X':'Case Manager','164W00000X':'Licensed Practical Nurse','104100000X':'Social Worker',
            '103TC0700X':'Clinical Psychologist','225X00000X':'Occupational Therapist','172V00000X':'Community Health Worker',
            '101YP2500X':'Professional Counselor'})
In [56]:
# Create a plot of top 15 Provider Specialities
fig, ax = plt.subplots(figsize=(20,8))

# Add data
provider = npi_renamed['taxonomy_code_1'].value_counts().reset_index()
ax.bar(provider.iloc[:15,0], provider.iloc[:15,1],color='lightcoral')

# Remove Spines
for key, spine in ax.spines.items():
    spine.set_visible(False)
    
# Adjust margin, ticks etc and add title    
ax.margins(0.01, 0)
ax.set_ylim(0,210000)
ax.set_title('Count of Providers by Speciality (Top 15)', fontsize=22)
ax.set_xlabel('Spaciality(Taxonomy Codes)', fontsize=18)
ax.set_ylabel('Count', fontsize=18)
plt.xticks(rotation=90)
plt.show();

We can see that some of top specializations of healthcare providers in US include Physical Therapists, Pharmacists, Internal Medicine, Counselors, Social Workers, Family Medicine practitioners and so on.

Top 15 provider specialities for states with most and least providers

Let's look at the top 15 specialities for states that have the highest number of providers vs states with lowest number of providers.

In [57]:
states = ['CA','NY','FL','VT','ND','WY']
fig = plt.figure(figsize=(15, 15))
for sp in range(0,6):
    ax = fig.add_subplot(2,3,sp+1)
    provType = npi_renamed[npi_renamed['state']==states[sp]]['taxonomy_code_1'].value_counts().reset_index()
    ax.bar(provType.iloc[:15,0], provType.iloc[:15,1],color='lightcoral')
    ax.set_title(states[sp])
    ax.set_ylim(0,provType.iloc[0,1])
    plt.xticks(rotation=90)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout() # automatically adjusts layout to fit long labels
plt.show();

CA, NY, FL have the most providers whereas states like ND, VT, WY have the least. This chart clearly shows the disparity in availability of providers in the states with most vs states with least providers. Taxonomy codes can be mapped to area of specialization using this reference.

Count of Entity Types

We can see that most providers are individuals with fewer organizations. This looks right as an organization would have its own individual NPI but would have multiple individual providers working for them.

In [58]:
fig, ax = plt.subplots(figsize=(5,5))
entity = npi_clean['entity_type'].value_counts().reset_index()
ax.bar(entity.iloc[:,0], entity.iloc[:,1],color='lightcoral')
for key, spine in ax.spines.items():
    spine.set_visible(False)
ax.set_ylim(0,5000000)
ax.set_title('Entity Types in US', pad=20)
ax.tick_params(bottom="off")
plt.show();

Export Data

We will export our cleaned data as different csv files for each state and store them in a single directory

In [60]:
# Create a directory to store csv output
if "state_data" not in os.listdir(home_dir):
    os.mkdir(os.path.join(home_dir,"state_data"))
    
state_data_dir = os.path.join(home_dir,"state_data")
os.listdir(home_dir)
Out[60]:
['npidata.csv', 'npidata.zip', 'Results.gdb', 'state_data']
In [61]:
# Create list of state names in U.S.
clean_statelist = ['NE', 'FL', 'NC', 'TX', 'CA', 'OH', 'NY', 'AZ', 'OK', 'MO', 'IL',
       'TN', 'MN', 'NV', 'GA', 'AL', 'IA', 'NJ', 'CT', 'MD', 'MI', 'WI',
       'NM', 'AR', 'PA', 'UT', 'KY', 'VA', 'IN', 'MS', 'ME', 'NH', 'MA',
       'MT', 'LA', 'CO', 'WV', 'WA', 'RI', 'OR', 'DC', 'KS', 'ID', 'SC',
       'HI', 'SD', 'DE', 'PR', 'AK', 'WY', 'VT', 'ND']
In [62]:
%%time

# Output csv files for each state and store in state_data_dir
for stateName in clean_statelist:
    name = stateName
    i = npi_clean[npi_clean['state']==stateName]
    i.to_csv(os.path.join(state_data_dir,'npi_'+name+'.csv'), index = None, header=True)
Wall time: 55.2 s

Geocode using GeoAnalytics Server

So far, we have successfully cleaned and exported provider data for each state. Now, to make this data useful and perform any spatial analysis on it (i.e plot it on a map, aggregate it at a county/state level or to add additional social demographic features), we first need to geocode the cleaned data.

Considering the amount of data we have (~5.8M records), Geocoding can be an expensive and time consuming operation. Geocoding this data will consume ~2,32,000 credits and can take ~36 hours when performed as a sequential operation. This is where the power of GeoAnalytics server really shines. The GeoAnalytics server provides extreme performance by processing on massive datasets in a scalable and distributed fashion.

The Geocode Locations from Table tool is a convenient way to geocode large tables of addresses. Learn more about how this tool can be used through ArcGIS API for Python here.

We have geocoded this data for you and have included it as a part of the analysis. The geocoded data can be accessed as provider_data_geocoded_7_30 feature layer and we will see it being utilized in other parts of this study.

Summary

To summarize, in this notebook we:

  1. Read large data (~5.8 million records with 329 attirbutes) using Dask.
  2. Cleaned the data by removing redundant columns, renaming columns, changing categories etc.
  3. Performed initial data exploration.
  4. Filtered and exported cleaned data for each state to be geocoded.
  5. Reviewed how GeoAnalytics server and tools can be used to geocode massive datasets.

References