Analyzing Healthcare Provider Shortage - Part 2/4

  • 👟 Ready To Run!
  • 🔬 Data Science
  • 🚀 Data Exploration and Cleaning
  • 📈 Statistics & Graphing
  • ⚕️ Healthcare

Requirements

  • 🗺️ Living Atlas configured

In this part of the study, we will explore geocoded provider data with respect to socioeconomic and demographic factors such as population density, median income, median age, population with health insurance etc. We will study the distribution of:

  • Overall providers in the U.S.
  • Explore Texas to find out how healthcare providers are distributed at the county level

In this section, we will:

  • Get geocoded data and check geocoding results
  • Gather and Process Demographic and Health Expenditure Data
  • Aggregate Provider Data at the County Level
  • Explore distribution of Healthcare Providers in U.S.
  • Explore Texas and study how:
    • Provider Count varies with Population Density, Median Income and Median Age

Part 2: Provider Shortage - Data Exploration

In [2]:
# Import Libraries
from IPython.display import display

# Import arcgis
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap

# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np

# Import plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Import library for time and os
import time
import os
In [3]:
# Create a GIS connection
gis = GIS("home")

Get Geocoded Provider Data

Provider data was geocoded using the GeoAnalytics server. Let's get the geocoded provider data feature layer for exploration.

In [4]:
# Search the feature layer
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
                                outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
provider_data_item = gis.content.search(f'provider_data_geocoded_7_30 {group_query}',
                                      item_type = "Feature Service",
                                      outside_org=True)[0]
provider_data_item
Out[4]:
provider_data_geocoded_7_30
Geocoded healthcare provider point data of 5.8M providers in US.Feature Layer Collection by portaladmin
Last Modified: August 17, 2020
0 comments, 0 views
In [5]:
# Get the layer needed for analysis
provider_data_layer = provider_data_item.layers[0]
provider_data_layer
Out[5]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/provider_data_geocoded_7_30/FeatureServer/0">
In [5]:
# Look at the first 5 fields and their data types
for f in provider_data_layer.properties.fields[:5]:
    print(f['name'],"{0:>35}".format(f['type']))
objectid                    esriFieldTypeOID
user_npi                 esriFieldTypeDouble
user_entity_type                 esriFieldTypeString
user_address                 esriFieldTypeString
user_address2                 esriFieldTypeString
In [6]:
# Check number of records in `provider_data_layer`
provider_data_layer.query(return_count_only=True)
Out[6]:
5790417

Check Geocoding

Let's do a random test to check accuracy of geocoded data. For a given state, we will look at and plot the data points based on address provided vs address geocoded.

Geocoding Check: Wyoming

In [8]:
# Create a spatially enabled dataframe for WY
%time wy_df = provider_data_layer.query(where="user_Region='WY'", as_df=True)

wy_df.shape
Wall time: 17.6 s
Out[8]:
(12770, 32)
In [9]:
# Look at data points that are geocoded outside of Wyoming
len(wy_df[wy_df['region']!='Wyoming'])
Out[9]:
3
In [10]:
# Check the accuracy of geocoding
print(f'{round(100-((wy_df.shape[1]/wy_df.shape[0])*100),2)} %')
99.75 %
In [15]:
# Create a map
map1 = gis.map('USA',4)
map1
In [13]:
# Plot data points for WY
wy_df.spatial.plot(map1)
Out[13]:
True

We have an accuracy of >99% with geocoding. From this map, we can see that out of 12770 points that have been geocoded only 3 are outside of WY. Similar analysis can be performed for other states to check for geocoding errors.

Heat Map - All Providers

Heatmap of Providers in U.S.

To visualize how healthcare providers are distributed, let's do a heatmap of providers in the United States.

In [3]:
# Create a map
map_usa = gis.map('USA', zoomlevel=4)
map_usa
Out[3]:

The map shows distribution of providers throughout the US. We can see large gaps in Nevada, Idaho, Utah, Wyoming and Texas. These and other states can be explored further for analysis.

In [12]:
# Add provider data and create a heatmap
renderer = {"renderer": "autocast", #This tells python to use JS autocasting
            "type": "heatmap",
            "blurRadius":1,  # changes the size of the clusters
            "maxPixelIntensity":2,
            "minPixelIntensity":0,
            "field":None}
renderer["colorStops"] = [{"ratio":0,"color":[63, 40, 102, 0]},
                          {"ratio":0.25,"color":[167,97,170,179]},
                          {"ratio":0.50,"color":"#7b3ce9"},
                          {"ratio":0.75,"color":[222,102,0,179]},
                          {"ratio":1,"color":[244,204,0,179]}]

map_usa.add_layer(provider_data_layer,
               { "type": "FeatureLayer",
                 "renderer": renderer,
                })
In [13]:
# Add Legend
map_usa.legend = True

Gather and Process Data

Let's gather and process Demographic and Health Expenditure Data to create a data layer that includes aggregated provider count.

Create County Dataframe from Demographics Data

Here, we will search for demographic data at the county level. To do this, we will:

  • Search for Population density feature layer collection from Esri Living Atlas
  • Select the layer for county level data
  • Create a dataframe of county level demographic data by specifying the columns we need for our analysis
In [7]:
# Search for and get Population data layer
popdensity = gis.content.get('b9095ebdf5e8442588ab3f269dc7ee5e')
popdensity
Out[7]:
2020 USA Population Density
This layer shows the population density in the United States in 2020 in persons per square mile in a multiscale map by country, state, county, ZIP Code, tract, and block group. ArcGIS Online subscription required.Map Image Layer by esri
Last Modified: June 29, 2020
0 comments, 10,318 views
In [8]:
# Check first 5 layers in population Density
popdensity.layers[:5]
Out[8]:
[<FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/0">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/1">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/2">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/3">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Demographics_and_Boundaries_2020/MapServer/4">]
In [9]:
# Look at first few field names for county layer
county_layer = popdensity.layers[46]
print('FIELD NAME', "{0:>50}".format('FIELD ALIAS'))
for field in county_layer.properties.fields[:10]:
    print(field['name'], "{0:>50}".format(field['alias']))
FIELD NAME                                        FIELD ALIAS
OBJECTID                                           OBJECTID
Shape                                              Shape
ID                                                 ID
NAME                                               NAME
STATE_NAME                                         STATE_NAME
ST_ABBREV                                          ST_ABBREV
TOTPOP_CY                       2020 Total Population (Esri)
HHPOP_CY                   2020 Household Population (Esri)
FAMPOP_CY                      2020 Family Population (Esri)
GQPOP_CY              2020 Group Quarters Population (Esri)
In [10]:
%%time

# Create dataframe of specific attributes for Counties
county_layer = popdensity.layers[46]
county_df = pd.DataFrame()
out_fields=['Shape','ST_ABBREV','NAME','ASIAN_CY','AMERIND_CY','AVGHHSZ_CY','AVGHINC_CY','BLACK_CY',
            'EDUCBASECY','HISPPOP_CY','MEDAGE_CY','MINORITYCY','OTHRACE_CY','PCI_CY','POPDENS_CY',
            'UNEMPRT_CY','WHITE_CY','SMCOLL_CY','ASSCDEG_CY','BACHDEG_CY','GRADDEG_CY','TOTPOP_CY']

offset = 0
while offset <= 3000:
    chunk_df = county_layer.query(out_fields=out_fields,return_all_records=False,
                                  result_offset=offset,result_record_count=750,as_df=True)
    county_df = pd.concat([chunk_df, county_df], ignore_index=True)
    
    offset += 750
Wall time: 27.5 s
In [11]:
county_df.shape
Out[11]:
(3142, 23)

Create County Dataframe from Expenditure Data

Here, we will search for health expenditure data at the county level. To do this, we will:

  • Search for layers related to healthcare spendings from Esri Living Atlas
  • Select the Health Insurance Spending layer for county level data
  • Create a dataframe of county level health expenses data by specifying the columns we need for our analysis
In [12]:
# Search for and get Healthcare expenditure data layer
health_exp = gis.content.get('1fb3254792694ee285a835dd5fffc20b')
health_exp
Out[12]:
2020 USA Health Insurance Spending
This layer shows the average amount spent on health insurance per household in the U.S. in 2020, by country, state, county, ZIP Code, tract, and block group. ArcGIS Online subscription required.Map Image Layer by esri
Last Modified: June 29, 2020
0 comments, 257 views
In [13]:
# Check first few layers in population Density
health_exp.layers[:5]
Out[13]:
[<FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/0">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/1">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/2">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/3">,
 <FeatureLayer url:"https://demographics5.arcgis.com/arcgis/rest/services/USA_Consumer_Expenditures_2_2020/MapServer/4">]
In [14]:
health_exp_county_layer = health_exp.layers[46]
print('FIELD NAME', "{0:>50}".format('FIELD ALIAS'))
for f in health_exp_county_layer.properties.fields[:10]:
    print(f['name'], "{0:>50}".format(f['alias']))
FIELD NAME                                        FIELD ALIAS
OBJECTID                                           OBJECTID
Shape                                              Shape
ID                                                 ID
NAME                                               NAME
STATE_NAME                                         STATE_NAME
ST_ABBREV                                          ST_ABBREV
X8001_X                                   2020 Health Care
X8001_A                          2020 Health Care: Average
X8001_I                            2020 Health Care: Index
X8002_X                              2020 Health Insurance
In [15]:
%%time

# Create a dataframe using specific attributes for Counties
health_exp_county_layer = health_exp.layers[46]
health_exp_county_df = pd.DataFrame()
out_fields=['ST_ABBREV','NAME','X8001_A','X8002_A','X8013_A','X8018_A','X8019_A','X8024_A','X8032_A',
            'X13002_A','X13004_A']
    
offset = 0
while offset <= 3000:
    chunk_df = health_exp_county_layer.query(out_fields=out_fields, return_all_records=False,
                                             result_offset=offset,result_record_count=500,as_df=True)
    health_exp_county_df = pd.concat([chunk_df, health_exp_county_df], ignore_index=True)
    offset += 500
  
Wall time: 26.2 s
In [16]:
health_exp_county_df.shape
Out[16]:
(3142, 13)

Spatially Join Demographic and Health Expenditure Data

We will spatially join both the demographic and expenditure layer so as to combine data into a single polygon layer.

In [17]:
# Merge demographic and health expenditure data at county level
%time county_healthexp_df = county_df.spatial.join(health_exp_county_df,how='left', op='within')
Wall time: 1min 19s
In [18]:
# Check geometry type
county_healthexp_df.spatial.geometry_type
Out[18]:
['polygon']
In [19]:
# Check shape of data
county_healthexp_df.shape
Out[19]:
(3142, 36)

We will now write this spatially enabled dataframe as a feature service to the Portal.

In [20]:
# Create a feature service on portal
demo_healthexp = county_healthexp_df.spatial.to_featurelayer('demographic_healthexp_layer')
In [21]:
demo_healthexp
Out[21]:
demographic_healthexp_layer
Feature Layer Collection by portaladmin
Last Modified: August 17, 2020
0 comments, 0 views
In [22]:
# Access the feature service
demo_healthexp_layer = demo_healthexp.layers[0]
demo_healthexp_layer
Out[22]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/a7cc14/FeatureServer/0">

Aggregate Data - Get Provider Count

Now, we need to get the provider count for each county. Count of providers will serve as a key variable and we can get this data in multiple ways:

- Method 1: We can query the provider data layer, store results in a spatially enabled dataframe and generate counts.
- Method 2: Use `AggregatePoints` tool from the features module.

Method 1 fails due to the size and memory usage required for our data (~5.8 million points). This is where tools for data summarization come into play.

AggregatePoints tool helps us aggregate point data (provider count) on polygon data (counties). The output generated is a polygon layer with aggregated statistics.

We will run AggregatePoints with provider data as our point layer, demo_healthexp_layer as the polygon layer and write out the results to a feature service demo_healthexp_allproviders. In simple words, this tool will grab all the providers in a particular county and give us an aggregate count of providers for that county along with other demographic and health expenditure data that was in the polygon layer for counties.

Learn more about Aggregate Points.

In [23]:
# Import Aggregate Points
from arcgis.features.summarize_data import aggregate_points

Let's create a function that:

  1. Check if the layers exists on Portal
  2. If layer exists, get layer from the portal
  3. If layer does not exist, run Aggregate Points tool to create layer
In [24]:
# Run Aggregate Points and write to a feature service

def Aggregation():

    # Check if the layers exists on Portal
    search_result = gis.content.search("demos_healthexp_allproviders", item_type='Feature Layer')

    # If exists, get layer from the item
    if (len(search_result) > 0):
        layer = search_result[0].layers[0]
        print('Layer exists... continue')

    # If layer does not exist, run Aggregate Points tool to create layer
    else:
        print('Running Aggregate Points')
        start_time = time.time()
        
        item = aggregate_points(point_layer = provider_data_layer,
                                 polygon_layer = demo_healthexp_layer,
                                 output_name = 'demos_healthexp_allproviders')

        layer = item.layers[0]
        duration = time.time()-start_time
        print(f'Time elapsed: {duration/60:.2f} mins') 
        print('Layer created... continue')
    
    return layer

allprovider = Aggregation()
Running Aggregate Points
Time elapsed: 11.62 mins
Layer created... continue

Get Aggregated Data Layer

Now that we have aggregated our data, we will get the aggrageted data layer that includes provider count and other demographics data.

In [25]:
# Get the feature layer with aggregated data
allprovider
Out[25]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/demos_healthexp_allproviders/FeatureServer/0">
In [26]:
# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider.query(as_df=True)
allprovider_df.shape
Out[26]:
(3142, 39)
In [27]:
# Look at the first few rows of dataframe
allprovider_df.head()
Out[27]:
SHAPE amerind_cy analysisarea asian_cy asscdeg_cy avghhsz_cy avghinc_cy bachdeg_cy black_cy educbasecy graddeg_cy hisppop_cy index_right medage_cy minoritycy name_left name_right objectid objectid_left objectid_right othrace_cy pci_cy point_count popdens_cy smcoll_cy st_abbrev_left st_abbrev_right totpop_cy unemprt_cy white_cy x13002_a x13004_a x8001_a x8002_a x8013_a x8018_a x8019_a x8024_a x8032_a
0 {"rings": [[[-8966175.063999997, 4783327.06530... 33 320.588718 14 399 2.37 55987 797 166 6048 251 71 0 43.5 412 Doddridge County Doddridge County 1 3001 3001 5 21059 33 25.8 1032 WV WV 8259 9.5 7908 402.36 4631.13 4372.78 2797.49 481.64 1575.29 760.50 44.89 114.15
1 {"rings": [[[-9003763.871099995, 4594051.14379... 97 668.400698 126 2829 2.33 53306 4004 2012 34080 1723 577 1 45.0 3574 Fayette County Fayette County 2 3002 3002 112 22007 461 69.6 6078 WV WV 46044 9.5 42851 369.20 4396.48 4060.98 2607.20 446.05 1453.78 704.75 41.96 105.12
2 {"rings": [[[-8651697.2798, 4767802.112999998]... 196 211.494165 1016 3074 2.60 99084 8210 3591 41040 5425 3790 10 40.7 10091 Jefferson County Jefferson County 3 3011 3011 1396 37431 517 280.7 8227 WV WV 58849 10.3 50714 567.43 9197.22 6445.56 4162.27 471.27 2283.29 1266.35 77.03 127.93
3 {"rings": [[[-8986636.1429, 4735386.049599998]... 60 340.215576 54 448 2.31 49788 619 1037 6210 371 514 2 39.4 1755 Gilmer County Gilmer County 4 3003 3003 213 17155 54 24.7 1302 WV WV 8358 14.2 6823 337.54 4119.71 3730.03 2395.73 403.85 1334.30 649.71 38.73 95.56
4 {"rings": [[[-8792002.028199999, 4755766.55449... 21 480.087132 28 677 2.37 51798 905 125 9343 557 183 3 47.4 483 Grant County Grant County 5 3004 3004 88 21642 158 26.1 1406 WV WV 12436 7.4 12011 360.26 4416.26 3940.01 2517.00 403.61 1423.02 707.25 42.51 99.50

We can see that the dataframe allprovider_df has 3142 rows and 39 columns. This is the same number of rows as the number of counties. Thus aggregation has given us the number of providers in each county along with other variables from the demographic and health expenditure layer.

Clean Data

Here we will:

  • Check for Null values
  • Remove any duplicate columns
  • Rename columns
In [28]:
# Check for null values
allprovider_df.isnull().sum()
Out[28]:
SHAPE              0
amerind_cy         0
analysisarea       0
asian_cy           0
asscdeg_cy         0
avghhsz_cy         0
avghinc_cy         0
bachdeg_cy         0
black_cy           0
educbasecy         0
graddeg_cy         0
hisppop_cy         0
index_right        0
medage_cy          0
minoritycy         0
name_left          0
name_right         0
objectid           0
objectid_left      0
objectid_right     0
othrace_cy         0
pci_cy             0
point_count        0
popdens_cy         0
smcoll_cy          0
st_abbrev_left     0
st_abbrev_right    0
totpop_cy          0
unemprt_cy         0
white_cy           0
x13002_a           0
x13004_a           0
x8001_a            0
x8002_a            0
x8013_a            0
x8018_a            0
x8019_a            0
x8024_a            0
x8032_a            0
dtype: int64
In [29]:
# Check for duplicate columns
allprovider_df.columns
Out[29]:
Index(['SHAPE', 'amerind_cy', 'analysisarea', 'asian_cy', 'asscdeg_cy',
       'avghhsz_cy', 'avghinc_cy', 'bachdeg_cy', 'black_cy', 'educbasecy',
       'graddeg_cy', 'hisppop_cy', 'index_right', 'medage_cy', 'minoritycy',
       'name_left', 'name_right', 'objectid', 'objectid_left',
       'objectid_right', 'othrace_cy', 'pci_cy', 'point_count', 'popdens_cy',
       'smcoll_cy', 'st_abbrev_left', 'st_abbrev_right', 'totpop_cy',
       'unemprt_cy', 'white_cy', 'x13002_a', 'x13004_a', 'x8001_a', 'x8002_a',
       'x8013_a', 'x8018_a', 'x8019_a', 'x8024_a', 'x8032_a'],
      dtype='object')

We can see that name, objectid and st_abbrev columns have duplicates. We will drop these duplicate columns from our data.

In [30]:
# Drop duplicate columns
allprovider_df.drop(['objectid','objectid_right','st_abbrev_right','name_right','analysisarea','index_right'],
                    axis=1, inplace=True)
allprovider_df.columns
Out[30]:
Index(['SHAPE', 'amerind_cy', 'asian_cy', 'asscdeg_cy', 'avghhsz_cy',
       'avghinc_cy', 'bachdeg_cy', 'black_cy', 'educbasecy', 'graddeg_cy',
       'hisppop_cy', 'medage_cy', 'minoritycy', 'name_left', 'objectid_left',
       'othrace_cy', 'pci_cy', 'point_count', 'popdens_cy', 'smcoll_cy',
       'st_abbrev_left', 'totpop_cy', 'unemprt_cy', 'white_cy', 'x13002_a',
       'x13004_a', 'x8001_a', 'x8002_a', 'x8013_a', 'x8018_a', 'x8019_a',
       'x8024_a', 'x8032_a'],
      dtype='object')
In [31]:
# Rename Columns
allprovider_df.rename(columns={'x8001_a':'avg_healthcare','x8002_a':'avg_healthinsurance','x8013_a':'avg_medicare',
                                  'x8018_a':'avg_medicalcare','x8019_a':'avg_medicalsrvc','x8024_a':'avg_labtest',
                                  'x8032_a':'avg_presdrug','x13002_a':'avg_personalinsurance','x13004_a':'avg_socsecurity',
                                  'asian_cy':'asian_pop','amerind_cy':'amerind_pop','avghhsz_cy':'avg_hhsz',
                                  'avghinc_cy':'avg_hhinc','black_cy':'black_pop','educbasecy':'edubase',
                                  'hisppop_cy':'hisp_pop','medage_cy':'median_age','minoritycy':'minority_pop',
                                  'othrace_cy':'otherace_pop','pci_cy':'percap_income','popdens_cy':'pop_density',
                                  'unemprt_cy':'unemp_rate','white_cy':'white_pop','smcoll_cy':'some_college',
                                  'asscdeg_cy':'asso_deg','bachdeg_cy':'bach_deg','graddeg_cy':'grad_deg','totpop_cy':'total_population',
                                  'st_abbrev_left':'state','name_left':'county','point_count':'provider_count','objectid_left':'objectid'}, inplace=True)
allprovider_df.columns
Out[31]:
Index(['SHAPE', 'amerind_pop', 'asian_pop', 'asso_deg', 'avg_hhsz',
       'avg_hhinc', 'bach_deg', 'black_pop', 'edubase', 'grad_deg', 'hisp_pop',
       'median_age', 'minority_pop', 'county', 'objectid', 'otherace_pop',
       'percap_income', 'provider_count', 'pop_density', 'some_college',
       'state', 'total_population', 'unemp_rate', 'white_pop',
       'avg_personalinsurance', 'avg_socsecurity', 'avg_healthcare',
       'avg_healthinsurance', 'avg_medicare', 'avg_medicalcare',
       'avg_medicalsrvc', 'avg_labtest', 'avg_presdrug'],
      dtype='object')
In [32]:
# Check size of dataframe after dropping columns
allprovider_df.shape
Out[32]:
(3142, 33)

Write Cleaned Data as a Feature Service

Now that our aggregated data is clean, we will write this data as a feature service to portal. This feature service demographic_healthexp_clean_allproviders will be used later for creating models.

In [33]:
# Create a feature service on portal
allprovider_clean = allprovider_df.spatial.to_featurelayer('demographic_healthexp_clean_allproviders')

Plot People to Provider Ratio - US

To plot Population to Provider Ratio, we will create a new column for the no. of people per Healthcare Provider. To achieve this, we will divide total population by provider count for each county.

In [49]:
# Create a new column - People per Provider
allprovider_df['people_per_prov'] = allprovider_df['total_population']/allprovider_df['provider_count']

Plot People per Provider

In [57]:
allprovider_map = gis.map('USA', 4)
allprovider_map
Out[57]:

Darker shades on the map show large no. of people per provider which is bad. From this map, we can see that:

  • Counties in Westen and North Eastern US seem to have the best people to provider ratio with 0-100 people per healthcare provider.
  • Areas in North Central US seem to have a mix of both low and high number of people per provider. Nebraska stands out with Thomas, Logan, Loup, Hayes and Wheeler counties where number of people per healthcare provider is > 500.
  • Counties in Southern states seem to suffer the most where a majority of counties have 100-200 people per provider or higher.
  • Texas seems to stand out where counties like Borden, Glasscock, Irion, San Jacinto, Newton have > 500 people per healthcare provider. We can also see more counties where number of people per provider is high.

Define Renderer

In [52]:
# Define Renderer
allProvTest = {"renderer": {
                 "type": "classBreaks",  
                 "field":"people_per_prov",
                 "transparency":.5,
                 "minValue":1}}
In [53]:
# Define Manual Class breaks
allProvTest['renderer']["classBreakInfos"] = [{
  "classMaxValue": 100.00,
  "label": "0 - 100.00",
  "description": "0 - 100.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [255,247,236,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 200.00,
  "label": "100.001 - 200.00",
  "description": "100.001 - 200.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [253,220,174,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 350.00,
  "label": "200.001 - 350.00",
  "description": "200.001 - 350.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [252,177,123,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 500.00,
  "label": "350.001 - 500.00",
  "description": "350.001 - 500.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [241,109,75,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 1100.00,
  "label": "500.001 - 1100.00",
  "description": "500.001 - 1100.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [200,28,18,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}]
In [54]:
# Plot Map using defined Renderer
allprovider_map.remove_layers()
allprovider_df.spatial.plot(map_widget=allprovider_map, renderer=allProvTest['renderer'])
Out[54]:
True
In [55]:
# Add legend
allprovider_map.legend=True

An alternate way to create the map without define a renderer is to directly plot the spatial dataframe by using Esri's classification algorithm. Here we show the use of esriClassifyNaturalBreaks which breaks the column values naturally into different classes.

In [ ]:
# allprovider_df.spatial.plot(map_widget=allprovider_map,
#         renderer_type='c',  # for class breaks renderer
#         method='esriClassifyNaturalBreaks',  # classification algorithm
#         class_count=6,  # choose the number of classes
#         col='people_per_prov',  # numeric column to classify
#         cmap='OrRd',  # color map to pick colors from for each class
#         alpha=0.7 ,  # specify opacity
#         linewidth = 0.05)

The Lone Star State - Exploring Texas

The second largest state [1] in U.S. both by area and population, Texas baosts of 261,231.71 sq mi land area [2] with a population of ~28.7 million [3]. The population density is low with just 105.2 people per square mile [4] and varies drastically between heavely populated urban areas to sparsely populated rural.

Image source: https://www.npr.org/sections/thetwo-way/2018/05/24/614195884/new-census-data-shows-texas-cities-are-growing-faster-than-all-other-states

As seen in the previous notebook, Texas has the fourth largest number of healthcare providers in U.S. However, the state stands second highest on Health Resources and Services Administration's (HRSA) list of shortage areas [5]. Texas also tops HRSA's list of medically underserved areas [5].

Let's start our journey with exploring Texas!

Get the Data

We will filter our allprovider_df dataframe for TX and subset to get specific columns.

In [58]:
# Filter data for Texas
q = allprovider_df['state']=='TX'
txprovider_df = allprovider_df[q]
In [59]:
# Subset to keep specific columns
cols_to_keep = ['SHAPE','objectid','state','county','avg_hhinc','median_age',
                'total_population','provider_count','people_per_prov']
txprovider_df = txprovider_df[cols_to_keep]
txprovider_df.shape
Out[59]:
(254, 9)

Provider and Population Density

Let's explore distribution of providers with respect to population density.

In [65]:
# Create a map 
tx_pop_map = gis.map('Texas, USA', zoomlevel=6)
tx_pop_map.layout.height="650px"
tx_pop_map
Out[65]:

Darker shades on the map show large no. of people per provider which is bad. From this map, we can see that:

  1. Irion County (in red) seems to be the worst with 799 people per healthcare provider.
  2. Borden, Glasscock, San Jacinto, Newton are some counties (dark rust) with 500-700 people per healthcare provider.
  3. There are very few counties (in white) in Texas with less than 100 people per healthcare provider.
Define Renderer
In [61]:
# Define Renderer
txpopTest = {"renderer": { #This tells python to use JS autocasting
                 "type": "classBreaks",  
                 "field":"people_per_prov",
                 "transparency":.5,
                 "minValue":1}}
In [62]:
# Define Manual Class breaks
txpopTest['renderer']["classBreakInfos"] = [{
  "classMaxValue": 100.00,
  "label": "0 - 100.00",
  "description": "0 - 100.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [255,247,236,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 300.00,
  "label": "100.001 - 300.00",
  "description": "100.001 - 300.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [253,220,174,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 500.00,
  "label": "300.001 - 500.00",
  "description": "300.001 - 500.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [252,177,123,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 700.00,
  "label": "500.001 - 700.00",
  "description": "500.001 - 700.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [241,109,75,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 800.00,
  "label": "700.001 - 800.00",
  "description": "700.001 - 800.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [200,28,18,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}]
Plot Map
In [63]:
# Plot Map using defined Renderer
txprovider_df.spatial.plot(map_widget=tx_pop_map, renderer=txpopTest['renderer'])
Out[63]:
True
In [64]:
# Add legend
tx_pop_map.legend=True

Provider and Avg. Household Income

We will create a Kernel Density Plot of People per Provider w.r.t. Avg. Household Income. Before we plot, let's check for any Nan or Infinite values in our data and replace or drop as necessary.

In [66]:
txprovider_df.describe().transpose()
Out[66]:
count mean std min 25% 50% 75% max
objectid 254.0 2.650500e+03 73.467680 2524.000000 2587.250000 2650.500000 2713.750000 2777.0
avg_hhinc 254.0 7.119343e+04 15027.773448 41857.000000 61321.000000 68941.000000 77239.500000 127417.0
median_age 254.0 4.061102e+01 5.877521 26.400000 36.300000 40.000000 44.575000 58.2
total_population 254.0 1.173478e+05 416409.337617 88.000000 7079.000000 19534.500000 54647.250000 4778365.0
provider_count 254.0 1.440287e+03 6174.628398 0.000000 36.250000 105.500000 387.750000 70973.0
people_per_prov 254.0 inf NaN 42.368833 109.109202 155.083449 209.904211 inf

Here we see that one or more records for provider_count is 0 causing the max for people_per_provider to be infinite. Let's find the record(s) with 0 min.

In [67]:
# Check for rows with zero provider count
txprovider_df[txprovider_df['provider_count']==0]
Out[67]:
SHAPE objectid state county avg_hhinc median_age total_population provider_count people_per_prov
206 {"rings": [[[-10883854.1443, 3158571.473099995... 2654 TX Kenedy County 43518 42.9 445 0 inf

Let's ignore this single record and work with remaining data.

In [68]:
# Create dataframe where provider count is greater than zero
txprovider_df = txprovider_df[txprovider_df['provider_count']>0]

Let's create a density plot to explore providers with respect to avg. household income.

In [70]:
g = sns.jointplot(x='people_per_prov', y='avg_hhinc', data=txprovider_df,xlim=[0,400], ylim=[35000,100000] ,kind='kde', n_levels=8)
g.fig.suptitle('Kernel Density Plot of People per Provider vs Avg. Household Income', y=1.05);

From this plot we can see that for majority of the counties in Texas the average number of people per healthcare provider vary from 100-200 and the average household income for these counties vary from 60,000 - 70,000 dollars.

Provider and Median Age

Let's create a density plot to explore providers with respect to median age.

In [71]:
g = sns.jointplot(y='people_per_prov', x='median_age', data=txprovider_df, ylim=[0,400], xlim=[20,60] ,kind='kde')
g.fig.suptitle('Kernel Density Plot of People per Provider vs Median Age', y=1.05);

From this plot we can see that for majority of the counties in Texas the median age is between 35-45. The average number of people per healthcare provider vary from 60 - 210 for these counties.

Summary

To summarize, in this notebook we saw that:

  • Counties in Westen and North Eastern US seem to have the best people to provider ratio with 0-100 people per healthcare provider.
  • Counties in Southern states seem to suffer the most where a majority of counties have 100-200 people per provider or higher.
  • Texas seems to stand out in terms of overall healthcare provider shortage:
    • Irion County (in red) seems to be the worst with 799 people per healthcare provider.
    • Borden, Glasscock, San Jacinto, Newton are some counties (dark rust) with 500-700 people per healthcare provider.
    • There are very few counties (in white) in Texas with less than 100 people per healthcare provider.

References