Analyzing Healthcare Provider Shortage - Part 2/4¶

👟 Ready To Run!

🔬 Data Science

🚀 Data Exploration and Cleaning

📈 Statistics & Graphing

⚕️ Healthcare

Requirements

🗺️ Living Atlas configured

In this part of the study, we will explore geocoded provider data with respect to socioeconomic and demographic factors such as population density, median income, median age, population with health insurance etc. We will study the distribution of:

Overall providers in the U.S.
Explore Texas to find out how healthcare providers are distributed at the county level

In this section, we will:

Get geocoded data and check geocoding results
Gather and Process Demographic and Health Expenditure Data
Aggregate Provider Data at the County Level
Explore distribution of Healthcare Providers in U.S.
Explore Texas and study how:
- Provider Count varies with Population Density, Median Income and Median Age

1 Get Geocoded Provider Data
2 Check Geocoding
- 2.1 Geocoding Check: Wyoming
3 Heat Map - All Providers
- 3.1 Heatmap of Providers in U.S.
4 Gather and Process Data
5 Plot People to Provider Ratio - US
- 5.1 Plot People per Provider
  - 5.1.1 Define Renderer
6 The Lone Star State - Exploring Texas
7 Summary
8 References

Part 2: Provider Shortage - Data Exploration¶

# Import Libraries
from IPython.display import display

# Import arcgis
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap

# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np

# Import plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Import library for time and os
import time
import os

# Create a GIS connection
gis = GIS("home")

Get Geocoded Provider Data¶

Provider data was geocoded using the GeoAnalytics server. Let's get the geocoded provider data feature layer for exploration.

# Search the feature layer
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
                                outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
provider_data_item = gis.content.search(f'provider_data_geocoded_7_30 {group_query}',
                                      item_type = "Feature Service",
                                      outside_org=True)[0]
provider_data_item

# Get the layer needed for analysis
provider_data_layer = provider_data_item.layers[0]
provider_data_layer

<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/provider_data_geocoded_7_30/FeatureServer/0">

# Look at the first 5 fields and their data types
for f in provider_data_layer.properties.fields[:5]:
    print(f['name'],"{0:>35}".format(f['type']))

objectid                    esriFieldTypeOID
user_npi                 esriFieldTypeDouble
user_entity_type                 esriFieldTypeString
user_address                 esriFieldTypeString
user_address2                 esriFieldTypeString

# Check number of records in `provider_data_layer`
provider_data_layer.query(return_count_only=True)

5790417

Check Geocoding¶

Let's do a random test to check accuracy of geocoded data. For a given state, we will look at and plot the data points based on address provided vs address geocoded.

Geocoding Check: Wyoming¶

# Create a spatially enabled dataframe for WY
%time wy_df = provider_data_layer.query(where="user_Region='WY'", as_df=True)

wy_df.shape

Wall time: 17.6 s

(12770, 32)

# Look at data points that are geocoded outside of Wyoming
len(wy_df[wy_df['region']!='Wyoming'])

3

# Check the accuracy of geocoding
print(f'{round(100-((wy_df.shape[1]/wy_df.shape[0])*100),2)} %')

99.75 %

# Create a map
map1 = gis.map('USA',4)
map1

# Plot data points for WY
wy_df.spatial.plot(map1)

True

We have an accuracy of >99% with geocoding. From this map, we can see that out of 12770 points that have been geocoded only 3 are outside of WY. Similar analysis can be performed for other states to check for geocoding errors.

Heat Map - All Providers¶

Heatmap of Providers in U.S.¶

To visualize how healthcare providers are distributed, let's do a heatmap of providers in the United States.

# Create a map
map_usa = gis.map('USA', zoomlevel=4)
map_usa

The map shows distribution of providers throughout the US. We can see large gaps in Nevada, Idaho, Utah, Wyoming and Texas. These and other states can be explored further for analysis.

# Add provider data and create a heatmap
renderer = {"renderer": "autocast", #This tells python to use JS autocasting
            "type": "heatmap",
            "blurRadius":1,  # changes the size of the clusters
            "maxPixelIntensity":2,
            "minPixelIntensity":0,
            "field":None}
renderer["colorStops"] = [{"ratio":0,"color":[63, 40, 102, 0]},
                          {"ratio":0.25,"color":[167,97,170,179]},
                          {"ratio":0.50,"color":"#7b3ce9"},
                          {"ratio":0.75,"color":[222,102,0,179]},
                          {"ratio":1,"color":[244,204,0,179]}]

map_usa.add_layer(provider_data_layer,
               { "type": "FeatureLayer",
                 "renderer": renderer,
                })

# Add Legend
map_usa.legend = True

Gather and Process Data¶

Let's gather and process Demographic and Health Expenditure Data to create a data layer that includes aggregated provider count.

Create County Dataframe from Demographics Data¶

Here, we will search for demographic data at the county level. To do this, we will:

Search for Population density feature layer collection from Esri Living Atlas
Select the layer for county level data
Create a dataframe of county level demographic data by specifying the columns we need for our analysis

# Search for and get Population data layer
popdensity = gis.content.get('b9095ebdf5e8442588ab3f269dc7ee5e')
popdensity

# Check first 5 layers in population Density
popdensity.layers[:5]

[<FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/0">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/1">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/2">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/3">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/4">]

# Look at first few field names for county layer
county_layer = popdensity.layers[46]
print('FIELD NAME', "{0:>50}".format('FIELD ALIAS'))
for field in county_layer.properties.fields[:10]:
    print(field['name'], "{0:>50}".format(field['alias']))

FIELD NAME                                        FIELD ALIAS
OBJECTID                                           OBJECTID
Shape                                              Shape
ID                                                 ID
NAME                                               NAME
STATE_NAME                                         STATE_NAME
ST_ABBREV                                          ST_ABBREV
TOTPOP_CY                       2020 Total Population (Esri)
HHPOP_CY                   2020 Household Population (Esri)
FAMPOP_CY                      2020 Family Population (Esri)
GQPOP_CY              2020 Group Quarters Population (Esri)

%%time

# Create dataframe of specific attributes for Counties
county_layer = popdensity.layers[46]
county_df = pd.DataFrame()
out_fields=['Shape','ST_ABBREV','NAME','ASIAN_CY','AMERIND_CY','AVGHHSZ_CY','AVGHINC_CY','BLACK_CY',
            'EDUCBASECY','HISPPOP_CY','MEDAGE_CY','MINORITYCY','OTHRACE_CY','PCI_CY','POPDENS_CY',
            'UNEMPRT_CY','WHITE_CY','SMCOLL_CY','ASSCDEG_CY','BACHDEG_CY','GRADDEG_CY','TOTPOP_CY']

offset = 0
while offset <= 3000:
    chunk_df = county_layer.query(out_fields=out_fields,return_all_records=False,
                                  result_offset=offset,result_record_count=750,as_df=True)
    county_df = pd.concat([chunk_df, county_df], ignore_index=True)
    
    offset += 750

Wall time: 27.5 s

county_df.shape

(3142, 23)

Create County Dataframe from Expenditure Data¶

Here, we will search for health expenditure data at the county level. To do this, we will:

Search for layers related to healthcare spendings from Esri Living Atlas
Select the Health Insurance Spending layer for county level data
Create a dataframe of county level health expenses data by specifying the columns we need for our analysis

# Search for and get Healthcare expenditure data layer
health_exp = gis.content.get('1fb3254792694ee285a835dd5fffc20b')
health_exp

# Check first few layers in population Density
health_exp.layers[:5]

[<FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/0">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/1">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/2">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/3">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/4">]

health_exp_county_layer = health_exp.layers[46]
print('FIELD NAME', "{0:>50}".format('FIELD ALIAS'))
for f in health_exp_county_layer.properties.fields[:10]:
    print(f['name'], "{0:>50}".format(f['alias']))

FIELD NAME                                        FIELD ALIAS
OBJECTID                                           OBJECTID
Shape                                              Shape
ID                                                 ID
NAME                                               NAME
STATE_NAME                                         STATE_NAME
ST_ABBREV                                          ST_ABBREV
X8001_X                                   2020 Health Care
X8001_A                          2020 Health Care: Average
X8001_I                            2020 Health Care: Index
X8002_X                              2020 Health Insurance

%%time

# Create a dataframe using specific attributes for Counties
health_exp_county_layer = health_exp.layers[46]
health_exp_county_df = pd.DataFrame()
out_fields=['ST_ABBREV','NAME','X8001_A','X8002_A','X8013_A','X8018_A','X8019_A','X8024_A','X8032_A',
            'X13002_A','X13004_A']
    
offset = 0
while offset <= 3000:
    chunk_df = health_exp_county_layer.query(out_fields=out_fields, return_all_records=False,
                                             result_offset=offset,result_record_count=500,as_df=True)
    health_exp_county_df = pd.concat([chunk_df, health_exp_county_df], ignore_index=True)
    offset += 500

Wall time: 26.2 s

health_exp_county_df.shape

(3142, 13)

Spatially Join Demographic and Health Expenditure Data¶

We will spatially join both the demographic and expenditure layer so as to combine data into a single polygon layer.

# Merge demographic and health expenditure data at county level
%time county_healthexp_df = county_df.spatial.join(health_exp_county_df,how='left', op='within')

Wall time: 1min 19s

# Check geometry type
county_healthexp_df.spatial.geometry_type

['polygon']

# Check shape of data
county_healthexp_df.shape

(3142, 36)

We will now write this spatially enabled dataframe as a feature service to the Portal.

# Create a feature service on portal
demo_healthexp = county_healthexp_df.spatial.to_featurelayer('demographic_healthexp_layer')

demo_healthexp

# Access the feature service
demo_healthexp_layer = demo_healthexp.layers[0]
demo_healthexp_layer

<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/a7cc14/FeatureServer/0">

Aggregate Data - Get Provider Count¶

Now, we need to get the provider count for each county. Count of providers will serve as a key variable and we can get this data in multiple ways:

- Method 1: We can query the provider data layer, store results in a spatially enabled dataframe and generate counts.
- Method 2: Use `AggregatePoints` tool from the features module.

Method 1 fails due to the size and memory usage required for our data (~5.8 million points). This is where tools for data summarization come into play.

AggregatePoints tool helps us aggregate point data (provider count) on polygon data (counties). The output generated is a polygon layer with aggregated statistics.

We will run AggregatePoints with provider data as our point layer, demo_healthexp_layer as the polygon layer and write out the results to a feature service demo_healthexp_allproviders. In simple words, this tool will grab all the providers in a particular county and give us an aggregate count of providers for that county along with other demographic and health expenditure data that was in the polygon layer for counties.

Learn more about Aggregate Points.

# Import Aggregate Points
from arcgis.features.summarize_data import aggregate_points

Let's create a function that:

Check if the layers exists on Portal
If layer exists, get layer from the portal
If layer does not exist, run Aggregate Points tool to create layer

# Run Aggregate Points and write to a feature service

def Aggregation():

    # Check if the layers exists on Portal
    search_result = gis.content.search("demos_healthexp_allproviders", item_type='Feature Layer')

    # If exists, get layer from the item
    if (len(search_result) > 0):
        layer = search_result[0].layers[0]
        print('Layer exists... continue')

    # If layer does not exist, run Aggregate Points tool to create layer
    else:
        print('Running Aggregate Points')
        start_time = time.time()
        
        item = aggregate_points(point_layer = provider_data_layer,
                                 polygon_layer = demo_healthexp_layer,
                                 output_name = 'demos_healthexp_allproviders')

        layer = item.layers[0]
        duration = time.time()-start_time
        print(f'Time elapsed: {duration/60:.2f} mins') 
        print('Layer created... continue')
    
    return layer

allprovider = Aggregation()

Running Aggregate Points
Time elapsed: 11.62 mins
Layer created... continue

Get Aggregated Data Layer¶

Now that we have aggregated our data, we will get the aggrageted data layer that includes provider count and other demographics data.

# Get the feature layer with aggregated data
allprovider

<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/demos_healthexp_allproviders/FeatureServer/0">

# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider.query(as_df=True)
allprovider_df.shape

(3142, 39)

# Look at the first few rows of dataframe
allprovider_df.head()

We can see that the dataframe allprovider_df has 3142 rows and 39 columns. This is the same number of rows as the number of counties. Thus aggregation has given us the number of providers in each county along with other variables from the demographic and health expenditure layer.

Clean Data¶

Here we will:

Check for Null values
Remove any duplicate columns
Rename columns

# Check for null values
allprovider_df.isnull().sum()

SHAPE              0
amerind_cy         0
analysisarea       0
asian_cy           0
asscdeg_cy         0
avghhsz_cy         0
avghinc_cy         0
bachdeg_cy         0
black_cy           0
educbasecy         0
graddeg_cy         0
hisppop_cy         0
index_right        0
medage_cy          0
minoritycy         0
name_left          0
name_right         0
objectid           0
objectid_left      0
objectid_right     0
othrace_cy         0
pci_cy             0
point_count        0
popdens_cy         0
smcoll_cy          0
st_abbrev_left     0
st_abbrev_right    0
totpop_cy          0
unemprt_cy         0
white_cy           0
x13002_a           0
x13004_a           0
x8001_a            0
x8002_a            0
x8013_a            0
x8018_a            0
x8019_a            0
x8024_a            0
x8032_a            0
dtype: int64

# Check for duplicate columns
allprovider_df.columns

Index(['SHAPE', 'amerind_cy', 'analysisarea', 'asian_cy', 'asscdeg_cy',
       'avghhsz_cy', 'avghinc_cy', 'bachdeg_cy', 'black_cy', 'educbasecy',
       'graddeg_cy', 'hisppop_cy', 'index_right', 'medage_cy', 'minoritycy',
       'name_left', 'name_right', 'objectid', 'objectid_left',
       'objectid_right', 'othrace_cy', 'pci_cy', 'point_count', 'popdens_cy',
       'smcoll_cy', 'st_abbrev_left', 'st_abbrev_right', 'totpop_cy',
       'unemprt_cy', 'white_cy', 'x13002_a', 'x13004_a', 'x8001_a', 'x8002_a',
       'x8013_a', 'x8018_a', 'x8019_a', 'x8024_a', 'x8032_a'],
      dtype='object')

We can see that name, objectid and st_abbrev columns have duplicates. We will drop these duplicate columns from our data.

# Drop duplicate columns
allprovider_df.drop(['objectid','objectid_right','st_abbrev_right','name_right','analysisarea','index_right'],
                    axis=1, inplace=True)
allprovider_df.columns

Index(['SHAPE', 'amerind_cy', 'asian_cy', 'asscdeg_cy', 'avghhsz_cy',
       'avghinc_cy', 'bachdeg_cy', 'black_cy', 'educbasecy', 'graddeg_cy',
       'hisppop_cy', 'medage_cy', 'minoritycy', 'name_left', 'objectid_left',
       'othrace_cy', 'pci_cy', 'point_count', 'popdens_cy', 'smcoll_cy',
       'st_abbrev_left', 'totpop_cy', 'unemprt_cy', 'white_cy', 'x13002_a',
       'x13004_a', 'x8001_a', 'x8002_a', 'x8013_a', 'x8018_a', 'x8019_a',
       'x8024_a', 'x8032_a'],
      dtype='object')

# Rename Columns
allprovider_df.rename(columns={'x8001_a':'avg_healthcare','x8002_a':'avg_healthinsurance','x8013_a':'avg_medicare',
                                  'x8018_a':'avg_medicalcare','x8019_a':'avg_medicalsrvc','x8024_a':'avg_labtest',
                                  'x8032_a':'avg_presdrug','x13002_a':'avg_personalinsurance','x13004_a':'avg_socsecurity',
                                  'asian_cy':'asian_pop','amerind_cy':'amerind_pop','avghhsz_cy':'avg_hhsz',
                                  'avghinc_cy':'avg_hhinc','black_cy':'black_pop','educbasecy':'edubase',
                                  'hisppop_cy':'hisp_pop','medage_cy':'median_age','minoritycy':'minority_pop',
                                  'othrace_cy':'otherace_pop','pci_cy':'percap_income','popdens_cy':'pop_density',
                                  'unemprt_cy':'unemp_rate','white_cy':'white_pop','smcoll_cy':'some_college',
                                  'asscdeg_cy':'asso_deg','bachdeg_cy':'bach_deg','graddeg_cy':'grad_deg','totpop_cy':'total_population',
                                  'st_abbrev_left':'state','name_left':'county','point_count':'provider_count','objectid_left':'objectid'}, inplace=True)
allprovider_df.columns

Index(['SHAPE', 'amerind_pop', 'asian_pop', 'asso_deg', 'avg_hhsz',
       'avg_hhinc', 'bach_deg', 'black_pop', 'edubase', 'grad_deg', 'hisp_pop',
       'median_age', 'minority_pop', 'county', 'objectid', 'otherace_pop',
       'percap_income', 'provider_count', 'pop_density', 'some_college',
       'state', 'total_population', 'unemp_rate', 'white_pop',
       'avg_personalinsurance', 'avg_socsecurity', 'avg_healthcare',
       'avg_healthinsurance', 'avg_medicare', 'avg_medicalcare',
       'avg_medicalsrvc', 'avg_labtest', 'avg_presdrug'],
      dtype='object')

# Check size of dataframe after dropping columns
allprovider_df.shape

(3142, 33)

Write Cleaned Data as a Feature Service¶

Now that our aggregated data is clean, we will write this data as a feature service to portal. This feature service demographic_healthexp_clean_allproviders will be used later for creating models.

# Create a feature service on portal
allprovider_clean = allprovider_df.spatial.to_featurelayer('demographic_healthexp_clean_allproviders')

Plot People to Provider Ratio - US¶

To plot Population to Provider Ratio, we will create a new column for the no. of people per Healthcare Provider. To achieve this, we will divide total population by provider count for each county.

# Create a new column - People per Provider
allprovider_df['people_per_prov'] = allprovider_df['total_population']/allprovider_df['provider_count']

Plot People per Provider¶

allprovider_map = gis.map('USA', 4)
allprovider_map

Darker shades on the map show large no. of people per provider which is bad. From this map, we can see that:

Counties in Westen and North Eastern US seem to have the best people to provider ratio with 0-100 people per healthcare provider.

Areas in North Central US seem to have a mix of both low and high number of people per provider. Nebraska stands out with Thomas, Logan, Loup, Hayes and Wheeler counties where number of people per healthcare provider is > 500.

Counties in Southern states seem to suffer the most where a majority of counties have 100-200 people per provider or higher.

Texas seems to stand out where counties like Borden, Glasscock, Irion, San Jacinto, Newton have > 500 people per healthcare provider. We can also see more counties where number of people per provider is high.

Define Renderer¶

# Define Renderer
allProvTest = {"renderer": {
                 "type": "classBreaks",  
                 "field":"people_per_prov",
                 "transparency":.5,
                 "minValue":1}}

# Define Manual Class breaks
allProvTest['renderer']["classBreakInfos"] = [{
  "classMaxValue": 100.00,
  "label": "0 - 100.00",
  "description": "0 - 100.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [255,247,236,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 200.00,
  "label": "100.001 - 200.00",
  "description": "100.001 - 200.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [253,220,174,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 350.00,
  "label": "200.001 - 350.00",
  "description": "200.001 - 350.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [252,177,123,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 500.00,
  "label": "350.001 - 500.00",
  "description": "350.001 - 500.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [241,109,75,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 1100.00,
  "label": "500.001 - 1100.00",
  "description": "500.001 - 1100.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [200,28,18,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}]

# Plot Map using defined Renderer
allprovider_map.remove_layers()
allprovider_df.spatial.plot(map_widget=allprovider_map, renderer=allProvTest['renderer'])

True

# Add legend
allprovider_map.legend=True

An alternate way to create the map without define a renderer is to directly plot the spatial dataframe by using Esri's classification algorithm. Here we show the use of esriClassifyNaturalBreaks which breaks the column values naturally into different classes.

# allprovider_df.spatial.plot(map_widget=allprovider_map,
#         renderer_type='c',  # for class breaks renderer
#         method='esriClassifyNaturalBreaks',  # classification algorithm
#         class_count=6,  # choose the number of classes
#         col='people_per_prov',  # numeric column to classify
#         cmap='OrRd',  # color map to pick colors from for each class
#         alpha=0.7 ,  # specify opacity
#         linewidth = 0.05)

The Lone Star State - Exploring Texas¶

The second largest state [1] in U.S. both by area and population, Texas baosts of 261,231.71 sq mi land area [2] with a population of ~28.7 million [3]. The population density is low with just 105.2 people per square mile [4] and varies drastically between heavely populated urban areas to sparsely populated rural.

Image source: https://www.npr.org/sections/thetwo-way/2018/05/24/614195884/new-census-data-shows-texas-cities-are-growing-faster-than-all-other-states

As seen in the previous notebook, Texas has the fourth largest number of healthcare providers in U.S. However, the state stands second highest on Health Resources and Services Administration's (HRSA) list of shortage areas [5]. Texas also tops HRSA's list of medically underserved areas [5].

Let's start our journey with exploring Texas!

Get the Data

We will filter our allprovider_df dataframe for TX and subset to get specific columns.

# Filter data for Texas
q = allprovider_df['state']=='TX'
txprovider_df = allprovider_df[q]

# Subset to keep specific columns
cols_to_keep = ['SHAPE','objectid','state','county','avg_hhinc','median_age',
                'total_population','provider_count','people_per_prov']
txprovider_df = txprovider_df[cols_to_keep]
txprovider_df.shape

(254, 9)

Provider and Population Density¶

Let's explore distribution of providers with respect to population density.

# Create a map 
tx_pop_map = gis.map('Texas, USA', zoomlevel=6)
tx_pop_map.layout.height="650px"
tx_pop_map

Darker shades on the map show large no. of people per provider which is bad. From this map, we can see that:

Irion County (in red) seems to be the worst with 799 people per healthcare provider.

Borden, Glasscock, San Jacinto, Newton are some counties (dark rust) with 500-700 people per healthcare provider.

There are very few counties (in white) in Texas with less than 100 people per healthcare provider.

Define Renderer¶

# Define Renderer
txpopTest = {"renderer": { #This tells python to use JS autocasting
                 "type": "classBreaks",  
                 "field":"people_per_prov",
                 "transparency":.5,
                 "minValue":1}}

# Define Manual Class breaks
txpopTest['renderer']["classBreakInfos"] = [{
  "classMaxValue": 100.00,
  "label": "0 - 100.00",
  "description": "0 - 100.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [255,247,236,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 300.00,
  "label": "100.001 - 300.00",
  "description": "100.001 - 300.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [253,220,174,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 500.00,
  "label": "300.001 - 500.00",
  "description": "300.001 - 500.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [252,177,123,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 700.00,
  "label": "500.001 - 700.00",
  "description": "500.001 - 700.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [241,109,75,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 800.00,
  "label": "700.001 - 800.00",
  "description": "700.001 - 800.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [200,28,18,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}]

Plot Map¶

# Plot Map using defined Renderer
txprovider_df.spatial.plot(map_widget=tx_pop_map, renderer=txpopTest['renderer'])

True

# Add legend
tx_pop_map.legend=True

Provider and Avg. Household Income¶

We will create a Kernel Density Plot of People per Provider w.r.t. Avg. Household Income. Before we plot, let's check for any Nan or Infinite values in our data and replace or drop as necessary.

txprovider_df.describe().transpose()

Here we see that one or more records for provider_count is 0 causing the max for people_per_provider to be infinite. Let's find the record(s) with 0 min.

# Check for rows with zero provider count
txprovider_df[txprovider_df['provider_count']==0]

Let's ignore this single record and work with remaining data.

# Create dataframe where provider count is greater than zero
txprovider_df = txprovider_df[txprovider_df['provider_count']>0]

Let's create a density plot to explore providers with respect to avg. household income.

g = sns.jointplot(x='people_per_prov', y='avg_hhinc', data=txprovider_df,xlim=[0,400], ylim=[35000,100000] ,kind='kde', n_levels=8)
g.fig.suptitle('Kernel Density Plot of People per Provider vs Avg. Household Income', y=1.05);

From this plot we can see that for majority of the counties in Texas the average number of people per healthcare provider vary from 100-200 and the average household income for these counties vary from 60,000 - 70,000 dollars.

Provider and Median Age¶

Let's create a density plot to explore providers with respect to median age.

g = sns.jointplot(y='people_per_prov', x='median_age', data=txprovider_df, ylim=[0,400], xlim=[20,60] ,kind='kde')
g.fig.suptitle('Kernel Density Plot of People per Provider vs Median Age', y=1.05);

From this plot we can see that for majority of the counties in Texas the median age is between 35-45. The average number of people per healthcare provider vary from 60 - 210 for these counties.

Summary¶

To summarize, in this notebook we saw that:

Counties in Westen and North Eastern US seem to have the best people to provider ratio with 0-100 people per healthcare provider.

Counties in Southern states seem to suffer the most where a majority of counties have 100-200 people per provider or higher.

Texas seems to stand out in terms of overall healthcare provider shortage:
- Irion County (in red) seems to be the worst with 799 people per healthcare provider.
- Borden, Glasscock, San Jacinto, Newton are some counties (dark rust) with 500-700 people per healthcare provider.
- There are very few counties (in white) in Texas with less than 100 people per healthcare provider.

References¶

[1] https://en.wikipedia.org/wiki/Texas
[2] https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_area
[3] https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population
[4] http://worldpopulationreview.com/states/texas-population/
[5] https://data.hrsa.gov/topics/health-workforce/shortage-areas

	SHAPE	amerind_cy	analysisarea	asian_cy	asscdeg_cy	avghhsz_cy	avghinc_cy	bachdeg_cy	black_cy	educbasecy	graddeg_cy	hisppop_cy	index_right	medage_cy	minoritycy	name_left	name_right	objectid	objectid_left	objectid_right	othrace_cy	pci_cy	point_count	popdens_cy	smcoll_cy	st_abbrev_left	st_abbrev_right	totpop_cy	unemprt_cy	white_cy	x13002_a	x13004_a	x8001_a	x8002_a	x8013_a	x8018_a	x8019_a	x8024_a	x8032_a
0	{"rings": [[[-8966175.063999997, 4783327.06530...	33	320.588718	14	399	2.37	55987	797	166	6048	251	71	0	43.5	412	Doddridge County	Doddridge County	1	3001	3001	5	21059	33	25.8	1032	WV	WV	8259	9.5	7908	402.36	4631.13	4372.78	2797.49	481.64	1575.29	760.50	44.89	114.15
1	{"rings": [[[-9003763.871099995, 4594051.14379...	97	668.400698	126	2829	2.33	53306	4004	2012	34080	1723	577	1	45.0	3574	Fayette County	Fayette County	2	3002	3002	112	22007	461	69.6	6078	WV	WV	46044	9.5	42851	369.20	4396.48	4060.98	2607.20	446.05	1453.78	704.75	41.96	105.12
2	{"rings": [[[-8651697.2798, 4767802.112999998]...	196	211.494165	1016	3074	2.60	99084	8210	3591	41040	5425	3790	10	40.7	10091	Jefferson County	Jefferson County	3	3011	3011	1396	37431	517	280.7	8227	WV	WV	58849	10.3	50714	567.43	9197.22	6445.56	4162.27	471.27	2283.29	1266.35	77.03	127.93
3	{"rings": [[[-8986636.1429, 4735386.049599998]...	60	340.215576	54	448	2.31	49788	619	1037	6210	371	514	2	39.4	1755	Gilmer County	Gilmer County	4	3003	3003	213	17155	54	24.7	1302	WV	WV	8358	14.2	6823	337.54	4119.71	3730.03	2395.73	403.85	1334.30	649.71	38.73	95.56
4	{"rings": [[[-8792002.028199999, 4755766.55449...	21	480.087132	28	677	2.37	51798	905	125	9343	557	183	3	47.4	483	Grant County	Grant County	5	3004	3004	88	21642	158	26.1	1406	WV	WV	12436	7.4	12011	360.26	4416.26	3940.01	2517.00	403.61	1423.02	707.25	42.51	99.50

	count	mean	std	min	25%	50%	75%	max
objectid	254.0	2.650500e+03	73.467680	2524.000000	2587.250000	2650.500000	2713.750000	2777.0
avg_hhinc	254.0	7.119343e+04	15027.773448	41857.000000	61321.000000	68941.000000	77239.500000	127417.0
median_age	254.0	4.061102e+01	5.877521	26.400000	36.300000	40.000000	44.575000	58.2
total_population	254.0	1.173478e+05	416409.337617	88.000000	7079.000000	19534.500000	54647.250000	4778365.0
provider_count	254.0	1.440287e+03	6174.628398	0.000000	36.250000	105.500000	387.750000	70973.0
people_per_prov	254.0	inf	NaN	42.368833	109.109202	155.083449	209.904211	inf

Analyzing Healthcare Provider Shortage - Part 2/4¶

Table of Contents

Part 2: Provider Shortage - Data Exploration¶

Get Geocoded Provider Data¶

Check Geocoding¶

Geocoding Check: Wyoming¶

Heat Map - All Providers¶

Heatmap of Providers in U.S.¶

Gather and Process Data¶

Create County Dataframe from Demographics Data¶

Create County Dataframe from Expenditure Data¶

Spatially Join Demographic and Health Expenditure Data¶

Aggregate Data - Get Provider Count¶

Get Aggregated Data Layer¶

Clean Data¶

Write Cleaned Data as a Feature Service¶

Plot People to Provider Ratio - US¶

Plot People per Provider¶

Define Renderer¶

The Lone Star State - Exploring Texas¶

Provider and Population Density¶

Define Renderer¶

Plot Map¶

Provider and Avg. Household Income¶

Provider and Median Age¶

Summary¶

References¶