- 👟 Ready To Run!
- 🔬 Data Science
- 🚀 Data Exploration and Cleaning
- 📈 Statistics & Graphing
- ⚕️ Healthcare
Requirements
- 🗺️ Living Atlas configured
In this part of the study, we will explore geocoded provider data with respect to socioeconomic and demographic factors such as population density, median income, median age, population with health insurance etc. We will study the distribution of:
In this section, we will:
# Import Libraries
from IPython.display import display
# Import arcgis
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap
# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np
# Import plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Import library for time and os
import time
import os
# Create a GIS connection
gis = GIS("home")
Provider data was geocoded using the GeoAnalytics server. Let's get the geocoded provider data feature layer for exploration.
# Search the feature layer
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
provider_data_item = gis.content.search(f'provider_data_geocoded_7_30 {group_query}',
item_type = "Feature Service",
outside_org=True)[0]
provider_data_item
# Get the layer needed for analysis
provider_data_layer = provider_data_item.layers[0]
provider_data_layer
# Look at the first 5 fields and their data types
for f in provider_data_layer.properties.fields[:5]:
print(f['name'],"{0:>35}".format(f['type']))
# Check number of records in `provider_data_layer`
provider_data_layer.query(return_count_only=True)
Let's do a random test to check accuracy of geocoded data. For a given state, we will look at and plot the data points based on address provided vs address geocoded.
# Create a spatially enabled dataframe for WY
%time wy_df = provider_data_layer.query(where="user_Region='WY'", as_df=True)
wy_df.shape
# Look at data points that are geocoded outside of Wyoming
len(wy_df[wy_df['region']!='Wyoming'])
# Check the accuracy of geocoding
print(f'{round(100-((wy_df.shape[1]/wy_df.shape[0])*100),2)} %')
# Create a map
map1 = gis.map('USA',4)
map1
# Plot data points for WY
wy_df.spatial.plot(map1)
We have an accuracy of >99% with geocoding. From this map, we can see that out of 12770 points that have been geocoded only 3 are outside of WY. Similar analysis can be performed for other states to check for geocoding errors.
To visualize how healthcare providers are distributed, let's do a heatmap of providers in the United States.
# Create a map
map_usa = gis.map('USA', zoomlevel=4)
map_usa
The map shows distribution of providers throughout the US. We can see large gaps in Nevada, Idaho, Utah, Wyoming and Texas. These and other states can be explored further for analysis.
# Add provider data and create a heatmap
renderer = {"renderer": "autocast", #This tells python to use JS autocasting
"type": "heatmap",
"blurRadius":1, # changes the size of the clusters
"maxPixelIntensity":2,
"minPixelIntensity":0,
"field":None}
renderer["colorStops"] = [{"ratio":0,"color":[63, 40, 102, 0]},
{"ratio":0.25,"color":[167,97,170,179]},
{"ratio":0.50,"color":"#7b3ce9"},
{"ratio":0.75,"color":[222,102,0,179]},
{"ratio":1,"color":[244,204,0,179]}]
map_usa.add_layer(provider_data_layer,
{ "type": "FeatureLayer",
"renderer": renderer,
})
# Add Legend
map_usa.legend = True
Let's gather and process Demographic and Health Expenditure Data to create a data layer that includes aggregated provider count.
Here, we will search for demographic data at the county level. To do this, we will:
# Search for and get Population data layer
popdensity = gis.content.get('b9095ebdf5e8442588ab3f269dc7ee5e')
popdensity
# Check first 5 layers in population Density
popdensity.layers[:5]
# Look at first few field names for county layer
county_layer = popdensity.layers[46]
print('FIELD NAME', "{0:>50}".format('FIELD ALIAS'))
for field in county_layer.properties.fields[:10]:
print(field['name'], "{0:>50}".format(field['alias']))
%%time
# Create dataframe of specific attributes for Counties
county_layer = popdensity.layers[46]
county_df = pd.DataFrame()
out_fields=['Shape','ST_ABBREV','NAME','ASIAN_CY','AMERIND_CY','AVGHHSZ_CY','AVGHINC_CY','BLACK_CY',
'EDUCBASECY','HISPPOP_CY','MEDAGE_CY','MINORITYCY','OTHRACE_CY','PCI_CY','POPDENS_CY',
'UNEMPRT_CY','WHITE_CY','SMCOLL_CY','ASSCDEG_CY','BACHDEG_CY','GRADDEG_CY','TOTPOP_CY']
offset = 0
while offset <= 3000:
chunk_df = county_layer.query(out_fields=out_fields,return_all_records=False,
result_offset=offset,result_record_count=750,as_df=True)
county_df = pd.concat([chunk_df, county_df], ignore_index=True)
offset += 750
county_df.shape
Here, we will search for health expenditure data at the county level. To do this, we will:
# Search for and get Healthcare expenditure data layer
health_exp = gis.content.get('1fb3254792694ee285a835dd5fffc20b')
health_exp
# Check first few layers in population Density
health_exp.layers[:5]
health_exp_county_layer = health_exp.layers[46]
print('FIELD NAME', "{0:>50}".format('FIELD ALIAS'))
for f in health_exp_county_layer.properties.fields[:10]:
print(f['name'], "{0:>50}".format(f['alias']))
%%time
# Create a dataframe using specific attributes for Counties
health_exp_county_layer = health_exp.layers[46]
health_exp_county_df = pd.DataFrame()
out_fields=['ST_ABBREV','NAME','X8001_A','X8002_A','X8013_A','X8018_A','X8019_A','X8024_A','X8032_A',
'X13002_A','X13004_A']
offset = 0
while offset <= 3000:
chunk_df = health_exp_county_layer.query(out_fields=out_fields, return_all_records=False,
result_offset=offset,result_record_count=500,as_df=True)
health_exp_county_df = pd.concat([chunk_df, health_exp_county_df], ignore_index=True)
offset += 500
health_exp_county_df.shape
We will spatially join both the demographic and expenditure layer so as to combine data into a single polygon layer.
# Merge demographic and health expenditure data at county level
%time county_healthexp_df = county_df.spatial.join(health_exp_county_df,how='left', op='within')
# Check geometry type
county_healthexp_df.spatial.geometry_type
# Check shape of data
county_healthexp_df.shape
We will now write this spatially enabled dataframe as a feature service to the Portal.
# Create a feature service on portal
demo_healthexp = county_healthexp_df.spatial.to_featurelayer('demographic_healthexp_layer')
demo_healthexp
# Access the feature service
demo_healthexp_layer = demo_healthexp.layers[0]
demo_healthexp_layer
Now, we need to get the provider count for each county. Count of providers will serve as a key variable and we can get this data in multiple ways:
- Method 1: We can query the provider data layer, store results in a spatially enabled dataframe and generate counts.
- Method 2: Use `AggregatePoints` tool from the features module.
Method 1 fails due to the size and memory usage required for our data (~5.8 million points). This is where tools for data summarization come into play.
AggregatePoints
tool helps us aggregate point data (provider count) on polygon data (counties). The output generated is a polygon layer with aggregated statistics.
We will run AggregatePoints
with provider data
as our point layer, demo_healthexp_layer
as the polygon layer and write out the results to a feature service demo_healthexp_allproviders
.
In simple words, this tool will grab all the providers in a particular county and give us an aggregate count of providers for that county along with other demographic and health expenditure data that was in the polygon layer for counties.
Learn more about Aggregate Points.
# Import Aggregate Points
from arcgis.features.summarize_data import aggregate_points
Let's create a function that:
# Run Aggregate Points and write to a feature service
def Aggregation():
# Check if the layers exists on Portal
search_result = gis.content.search("demos_healthexp_allproviders", item_type='Feature Layer')
# If exists, get layer from the item
if (len(search_result) > 0):
layer = search_result[0].layers[0]
print('Layer exists... continue')
# If layer does not exist, run Aggregate Points tool to create layer
else:
print('Running Aggregate Points')
start_time = time.time()
item = aggregate_points(point_layer = provider_data_layer,
polygon_layer = demo_healthexp_layer,
output_name = 'demos_healthexp_allproviders')
layer = item.layers[0]
duration = time.time()-start_time
print(f'Time elapsed: {duration/60:.2f} mins')
print('Layer created... continue')
return layer
allprovider = Aggregation()
Now that we have aggregated our data, we will get the aggrageted data layer that includes provider count and other demographics data.
# Get the feature layer with aggregated data
allprovider
# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider.query(as_df=True)
allprovider_df.shape
# Look at the first few rows of dataframe
allprovider_df.head()
We can see that the dataframe allprovider_df
has 3142 rows and 39 columns. This is the same number of rows as the number of counties. Thus aggregation has given us the number of providers in each county along with other variables from the demographic and health expenditure layer.
Here we will:
# Check for null values
allprovider_df.isnull().sum()
# Check for duplicate columns
allprovider_df.columns
We can see that name, objectid and st_abbrev
columns have duplicates. We will drop these duplicate columns from our data.
# Drop duplicate columns
allprovider_df.drop(['objectid','objectid_right','st_abbrev_right','name_right','analysisarea','index_right'],
axis=1, inplace=True)
allprovider_df.columns
# Rename Columns
allprovider_df.rename(columns={'x8001_a':'avg_healthcare','x8002_a':'avg_healthinsurance','x8013_a':'avg_medicare',
'x8018_a':'avg_medicalcare','x8019_a':'avg_medicalsrvc','x8024_a':'avg_labtest',
'x8032_a':'avg_presdrug','x13002_a':'avg_personalinsurance','x13004_a':'avg_socsecurity',
'asian_cy':'asian_pop','amerind_cy':'amerind_pop','avghhsz_cy':'avg_hhsz',
'avghinc_cy':'avg_hhinc','black_cy':'black_pop','educbasecy':'edubase',
'hisppop_cy':'hisp_pop','medage_cy':'median_age','minoritycy':'minority_pop',
'othrace_cy':'otherace_pop','pci_cy':'percap_income','popdens_cy':'pop_density',
'unemprt_cy':'unemp_rate','white_cy':'white_pop','smcoll_cy':'some_college',
'asscdeg_cy':'asso_deg','bachdeg_cy':'bach_deg','graddeg_cy':'grad_deg','totpop_cy':'total_population',
'st_abbrev_left':'state','name_left':'county','point_count':'provider_count','objectid_left':'objectid'}, inplace=True)
allprovider_df.columns
# Check size of dataframe after dropping columns
allprovider_df.shape
Now that our aggregated data is clean, we will write this data as a feature service to portal. This feature service demographic_healthexp_clean_allproviders
will be used later for creating models.
# Create a feature service on portal
allprovider_clean = allprovider_df.spatial.to_featurelayer('demographic_healthexp_clean_allproviders')
To plot Population to Provider Ratio, we will create a new column for the no. of people per Healthcare Provider. To achieve this, we will divide total population by provider count for each county.
# Create a new column - People per Provider
allprovider_df['people_per_prov'] = allprovider_df['total_population']/allprovider_df['provider_count']
allprovider_map = gis.map('USA', 4)
allprovider_map
Darker shades on the map show large no. of people per provider which is bad. From this map, we can see that:
# Define Renderer
allProvTest = {"renderer": {
"type": "classBreaks",
"field":"people_per_prov",
"transparency":.5,
"minValue":1}}
# Define Manual Class breaks
allProvTest['renderer']["classBreakInfos"] = [{
"classMaxValue": 100.00,
"label": "0 - 100.00",
"description": "0 - 100.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [255,247,236,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}, {
"classMaxValue": 200.00,
"label": "100.001 - 200.00",
"description": "100.001 - 200.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [253,220,174,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}, {
"classMaxValue": 350.00,
"label": "200.001 - 350.00",
"description": "200.001 - 350.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [252,177,123,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}, {
"classMaxValue": 500.00,
"label": "350.001 - 500.00",
"description": "350.001 - 500.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [241,109,75,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}, {
"classMaxValue": 1100.00,
"label": "500.001 - 1100.00",
"description": "500.001 - 1100.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [200,28,18,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}]
# Plot Map using defined Renderer
allprovider_map.remove_layers()
allprovider_df.spatial.plot(map_widget=allprovider_map, renderer=allProvTest['renderer'])
# Add legend
allprovider_map.legend=True
An alternate way to create the map without define a renderer is to directly plot the spatial dataframe by using Esri's classification algorithm. Here we show the use of esriClassifyNaturalBreaks
which breaks the column values naturally into different classes.
# allprovider_df.spatial.plot(map_widget=allprovider_map,
# renderer_type='c', # for class breaks renderer
# method='esriClassifyNaturalBreaks', # classification algorithm
# class_count=6, # choose the number of classes
# col='people_per_prov', # numeric column to classify
# cmap='OrRd', # color map to pick colors from for each class
# alpha=0.7 , # specify opacity
# linewidth = 0.05)
The second largest state [1] in U.S. both by area and population, Texas baosts of 261,231.71 sq mi land area [2] with a population of ~28.7 million [3]. The population density is low with just 105.2 people per square mile [4] and varies drastically between heavely populated urban areas to sparsely populated rural.
As seen in the previous notebook, Texas has the fourth largest number of healthcare providers in U.S. However, the state stands second highest on Health Resources and Services Administration's (HRSA) list of shortage areas [5]. Texas also tops HRSA's list of medically underserved areas [5].
Let's start our journey with exploring Texas!
Get the Data
We will filter our allprovider_df
dataframe for TX and subset to get specific columns.
# Filter data for Texas
q = allprovider_df['state']=='TX'
txprovider_df = allprovider_df[q]
# Subset to keep specific columns
cols_to_keep = ['SHAPE','objectid','state','county','avg_hhinc','median_age',
'total_population','provider_count','people_per_prov']
txprovider_df = txprovider_df[cols_to_keep]
txprovider_df.shape
Let's explore distribution of providers with respect to population density.
# Create a map
tx_pop_map = gis.map('Texas, USA', zoomlevel=6)
tx_pop_map.layout.height="650px"
tx_pop_map
Darker shades on the map show large no. of people per provider which is bad. From this map, we can see that:
- Irion County (in red) seems to be the worst with 799 people per healthcare provider.
- Borden, Glasscock, San Jacinto, Newton are some counties (dark rust) with 500-700 people per healthcare provider.
- There are very few counties (in white) in Texas with less than 100 people per healthcare provider.
# Define Renderer
txpopTest = {"renderer": { #This tells python to use JS autocasting
"type": "classBreaks",
"field":"people_per_prov",
"transparency":.5,
"minValue":1}}
# Define Manual Class breaks
txpopTest['renderer']["classBreakInfos"] = [{
"classMaxValue": 100.00,
"label": "0 - 100.00",
"description": "0 - 100.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [255,247,236,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}, {
"classMaxValue": 300.00,
"label": "100.001 - 300.00",
"description": "100.001 - 300.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [253,220,174,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}, {
"classMaxValue": 500.00,
"label": "300.001 - 500.00",
"description": "300.001 - 500.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [252,177,123,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}, {
"classMaxValue": 700.00,
"label": "500.001 - 700.00",
"description": "500.001 - 700.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [241,109,75,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}, {
"classMaxValue": 800.00,
"label": "700.001 - 800.00",
"description": "700.001 - 800.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [200,28,18,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}]
# Plot Map using defined Renderer
txprovider_df.spatial.plot(map_widget=tx_pop_map, renderer=txpopTest['renderer'])
# Add legend
tx_pop_map.legend=True
We will create a Kernel Density Plot of People per Provider w.r.t. Avg. Household Income. Before we plot, let's check for any Nan or Infinite values in our data and replace or drop as necessary.
txprovider_df.describe().transpose()
Here we see that one or more records for provider_count
is 0 causing the max
for people_per_provider
to be infinite. Let's find the record(s) with 0 min.
# Check for rows with zero provider count
txprovider_df[txprovider_df['provider_count']==0]
Let's ignore this single record and work with remaining data.
# Create dataframe where provider count is greater than zero
txprovider_df = txprovider_df[txprovider_df['provider_count']>0]
Let's create a density plot to explore providers with respect to avg. household income.
g = sns.jointplot(x='people_per_prov', y='avg_hhinc', data=txprovider_df,xlim=[0,400], ylim=[35000,100000] ,kind='kde', n_levels=8)
g.fig.suptitle('Kernel Density Plot of People per Provider vs Avg. Household Income', y=1.05);
From this plot we can see that for majority of the counties in Texas the average number of people per healthcare provider vary from 100-200 and the average household income for these counties vary from 60,000 - 70,000 dollars.
Let's create a density plot to explore providers with respect to median age.
g = sns.jointplot(y='people_per_prov', x='median_age', data=txprovider_df, ylim=[0,400], xlim=[20,60] ,kind='kde')
g.fig.suptitle('Kernel Density Plot of People per Provider vs Median Age', y=1.05);
From this plot we can see that for majority of the counties in Texas the median age is between 35-45. The average number of people per healthcare provider vary from 60 - 210 for these counties.
To summarize, in this notebook we saw that:
[1] https://en.wikipedia.org/wiki/Texas
[2] https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_area
[3] https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population
[4] http://worldpopulationreview.com/states/texas-population/
[5] https://data.hrsa.gov/topics/health-workforce/shortage-areas