- 👟 Ready To Run!
- 📈 Statistics & Graphing
- 💻 Predictive Modeling
- ⚕️ Healthcare
Requirements
- 🗄️ Notebook Server Advanced License
The big question that we would like to answer with this analysis is that: What are the socieconomic and demographic factors that influence access to healthcare providers?
In this notebook, we will build various global and local models and will try to answer this question by:
In this section, we will:
# Import Libraries
from IPython.display import display
# Import arcgis
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap
# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np
# Import plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Import library for time
import time
# Supress Warning
import warnings
warnings.filterwarnings("ignore")
# Create connection
gis = GIS("home")
To understand shortage of healthcare providers we will try to identify socioeconomic and demographic factors that influence access to these providers. We will use the cleaned aggregated data layer demographic_healthexp_clean_allproviders
created in Part 2 of this 4 part study. This layer includes the provider count, demographic data and some health expenditure features.
# Search the feature layer
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
allprovider_item = gis.content.search(f'demographic_healthexp_clean_allproviders {group_query}',
item_type = "Feature Service",
outside_org=True)[0]
allprovider_item
# Get the layer
allprovider_layer = allprovider_item.layers[0]
allprovider_layer
# Look at the first few fields
for f in allprovider_layer.properties.fields[:5]:
print(f['name'], "{0:>25}".format(f['alias']))
# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider_layer.query(as_df=True)
allprovider_df.shape
We can see that the dataframe (allprovider_df
) has 3139 rows and 35 columns.
# Look at the first few rows of dataframe
allprovider_df.head()
Before we plot the data, let's create a copy of allprovider_df
and drop unnecessary columns.
# Create copy of dataframe
test_newcounty_df = allprovider_df.copy()
# Drop the shape and objectid columns
test_newcounty_df.drop(['objectid','SHAPE','SHAPE__Length','SHAPE__Area'], axis=1, inplace=True)
test_newcounty_df.columns
We will plot each numerical variable with respect to others to see how the data is distributed and how correlated the variables are to each other.
sns.pairplot(test_newcounty_df.iloc[:,0:9]);
sns.pairplot(test_newcounty_df.iloc[:,10:19]);
sns.pairplot(test_newcounty_df.iloc[:,20:]);
From these pair plots we can see that:
Correlation plot is another great way to visualize correlation among predictor variables and correlation of predictors with response variable.
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(test_newcounty_df.corr(), vmin=-1.0, vmax=1.0, annot=True,
linewidths=.5, fmt= '.1f',ax=ax)
plt.title('Correlation Plot', fontdict = {'fontsize' : 30});
From this plot, we can see that:
# Sort the values by descending order of Provider Count
test_newcounty_df = test_newcounty_df.sort_values(by=['provider_count'], ascending=False)
test_newcounty_df.head()
The idea behind a global model is to identify various socioeconomic and demographic factors that influence access to healthcare providers across all counties in the United States.
To build this global model, we will follow a 3 step process:
1. Build an Ordinary Least Squares (OLS) Regression Model and verify regressions assumptions
2. Perform Feature Selection using Lasso, Elastic Net and Recursive Feature Elimination techniques
3. Using selected features, build another OLS model, check performance and verify assumptions
# Create predictor variables
train_pred = test_newcounty_df.drop(['county','state','provider_count'],axis=1)
train_pred.head()
# Create response variables
train_y = test_newcounty_df['provider_count']
train_y.head()
We will start with building an Ordinary Least Squares (OLS) Regression [1] model with all predictors and verify regressions assumptions.
# Import libraries
import statsmodels.api as sm
# import statsmodels
from statsmodels.regression import linear_model
# Set the seed for consistent results
np.random.seed(101)
# Create Model
X_train = train_pred
X_train = sm.add_constant(X_train) # add constant
sm_ols = sm.OLS(train_y, X_train).fit()
# Generate model summary
sm_ols.summary()
# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
sm_ols.params[1:].sort_values(ascending=False).plot(kind='barh')
plt.title('Model Coefficients - Base Model');
# Calculate RMSE of model
from statsmodels.tools.eval_measures import rmse
pred_val = sm_ols.fittedvalues.copy()
rmse_base_ols = rmse(train_y, pred_val)
print('RMSE for Base Model:',round(rmse_base_ols,2))
Analysing the model summary, we can see that:
- The R-squared value of 0.972 shows that 97.2% of the variability in Provider Count is explained by the model. The R-squared value is too good to be true and the model seems to be overfitting.
- To identify variables that are statistically significant, we look at the p-values of individual variables. Among the coefficients, those that are statistically significant at 90% significance level are:
- asian_pop, amerind_pop, black_pop, edubase, hisp_pop, median_age, minority_pop, otherace_pop, percap_income, pop_density, white_pop, some_college, asso_deg, grad_deg, total_population, avg_labtest, avg_presdrug, avg_socsecurity.
- The p-value of F-statistic is less than 0.05 which shows that atleast one of the predicting variables has predicting power on the variability of Provider Count.
Here, we will verify regression assumtions [2] of:
# Get residual value
residual = sm_ols.resid
import scipy as sp
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))
# Residuals vs Fitted
ax1.scatter(pred_val, residual)
ax1.hlines(y=0, xmin=0, xmax=max(pred_val), linewidth=2, color='r')
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs Fitted Values')
# QQ plot
sp.stats.probplot(residual, plot=ax2, fit=True);
# Check Linearity - Plot predictors with response
preds = train_pred.columns
fig = plt.figure(figsize=(15, 20))
for sp in range(0,27):
ax = fig.add_subplot(7,4,sp+1)
ax.scatter(train_pred.loc[:,preds[sp]], train_y)
ax.set_xlabel(preds[sp])
ax.set_ylabel('Provider Count')
plt.xticks(rotation=0)
for key, spine in ax.spines.items():
spine.set_visible(False)
ax.margins(0.01, 0)
plt.tight_layout(pad=1.0) # automatically adjusts layout to fit long labels
plt.show()
One way to think about whether the results we have are driven by a given data point is to calculate how far the predicted values for data would move if model was fit without the data point in question. This calculated total distance is called Cook's distance. Cook's D is a function of the leverage and standardized residual associated with each data point. The influence of each point can be visualized using an Influence Plot [3].
# Create Influence Plot
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(sm_ols, ax= ax, criterion="cooks")
From this plot, we can see that points with larger size have high leverage and higher than average residuals. The larger size of points indicates these are influential points. Points that do not have a high leverage but very high residual and are considered outliers.
Note that numbers shown next to the points in this plot are observation numbers (row ids) from the data. This context is important to understand the following section for removing outliers.
Removing outliers is an iterative process and outliers need to be investigated before removal. Here, we followed a 4-step process to remove outliers as follows:
The outliers were removed in different iterations. Here are the obervations that were removed with each iteration:
- Iteration 1: 2631,2774,1277,1245,296
- Iteration 2: 1198,2353,2459,1688,1077
- Iteration 3: 916,2632,2262,1207,797,531
- Iteration 4: 2613,310,1052,38,1632,1512
- Iteration 5: 629,1846,2710,388,1614,2784,1727
The code below shows all outliers being removed. If you see a different set of points as outliers for your analysis, update the global_df.drop()
command in code cell below with outlier observations and run the analysis.
# Remove Outliers
global_df = test_newcounty_df.copy()
global_df.drop([2631,2774,1277,1245,296,
1198,2353,2459,1688,1077,
916,2632,2262,1207,797,531,
2613, 310, 1052, 38, 1632, 1512,
629, 1846, 2710, 388, 1614, 2784, 1727],axis=0, inplace=True)
global_df.shape
# Create prdictor and response variables
train_x_rerun = global_df.drop(['county','state','provider_count'],axis=1)
train_y_rerun = global_df['provider_count']
train_x_rerun.head()
# Run Model
X_train_rerun = train_x_rerun
X_train_rerun = sm.add_constant(X_train_rerun)
global_ols_rerun = sm.OLS(train_y_rerun, X_train_rerun).fit()
residual = global_ols_rerun.resid
# Plot Outliers
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(global_ols_rerun, ax= ax, criterion="cooks")
Comparing this influece plot to the original plot before removing outliers, we can see that the range for H Leverage
has considerably reduced from 0.0 - 0.8 to 0.0 - 0.40
. The range for Studentized Residuals
has also reduced from -12.0 - 20.0 to -10.0 - 10.0
.
# Check Linearity - Plot predictors with response
preds = train_x_rerun.columns
fig = plt.figure(figsize=(15, 15))
for sp in range(0,27):
ax = fig.add_subplot(7,4,sp+1)
ax.scatter(train_x_rerun.loc[:,preds[sp]], train_y_rerun)
ax.set_xlabel(preds[sp])
ax.set_ylabel('Provider Count')
plt.xticks(rotation=0)
for key, spine in ax.spines.items():
spine.set_visible(False)
ax.margins(0.01, 0)
plt.tight_layout(pad=0.5) # automatically adjusts layout to fit long labels
plt.show()
Our regression model results showed only a few predictors that were statistically significant in predicting provider count. Feature selection is used to select those features in data that contribute most to the response variable. Having irrelevant features in the data can decrease the accuracy of many models, especially linear algorithms. To identify which predictors play a key role and remove predictors that are not statistically significant, we can use various feature selection techniques.
Greedy algorithms like Stepwise Regression
[4] and Recursive Feature Elimination (RFE)
[5] work by adding or removing attributes one at a time and building a model on those attributes that remain. RFE algorithm recursively removes attributes and uses accuracy metric to rank the feature according to their importance.
Regularization methods like Lasso
[6] and ElasticNet
[6] seek to minimize the complexity (magnitude and number of regression coefficients) of the model by penalizing a feature given a coefficient threshold.
Lasso penalizes the sum of absolute value of regression coefficients thereby forcing many coefficients to 0.
ElasticNet combines the properties of both Lasso (L1 term) and Ridge [6] (L2 term) regression. It penalizes the model by using both the L2-norm (sum of squared values of coefficients) and the L1-norm (sum of absolute value of coefficients) thereby shrinking some coefficients closer to 0 to reduce variance and making other coefficients 0.
Feature importance techniques like Random Forest
[7] are used to select features using a trained supervised classifier. Random forest consists of a number of decision trees where each node in a tree is a condition on a single feature, designed to split the dataset.
Here we will:
# Create prdictor and response variables
train_x_global = train_x_rerun
train_y_global = train_y_rerun
train_x_global.head()
# Import Libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring ='neg_mean_squared_error'
# Create Pipeline and Standardize the dataset
clf = LinearRegression()
pipelines = []
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()),('LASSO',Lasso(max_iter=10000))])))
pipelines.append(('ScaledEN', Pipeline([('Scaler', StandardScaler()),('EN',ElasticNet())])))
pipelines.append(('ScaledRFECV', Pipeline([('Scaler', StandardScaler()),('RFECV',RFECV(estimator=clf, cv=5))])))
pipelines.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF',RandomForestRegressor(n_estimators = 100))])))
# Run models
import math
results = []
names = []
for name, model in pipelines:
kfold = KFold(n_splits=num_folds, random_state=seed,shuffle=True)
cv_results = cross_val_score(model, train_x_global, train_y_global, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, math.sqrt(abs(cv_results.mean())), cv_results.std())
print(msg)
From the results above, Lasso seems to be the most promising with lowest RMSE. Let's explore Lasso to see which variables are selected by this model.
# Standardize the data using Standard Scalar
from sklearn.preprocessing import StandardScaler
train_x_std = StandardScaler().fit_transform(train_x_global)
# Run Lasso Model
from sklearn.linear_model import Lasso
# from sklearn.linear_model import ElasticNet
lasso_model=Lasso(max_iter=10000)
lasso_model.fit(train_x_std,train_y_global)
# Identify how many variables are picked
coef = pd.Series(lasso_model.coef_, index = train_pred.columns)
print("Model picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
# Get important variables and their coefficients
imp_coef = pd.concat([coef.sort_values(ascending=False).head(10),
coef.sort_values(ascending=False).tail(10)])
imp_coef
# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Model");
From the analysis above, we can see that factors that are predictive of higher provider count are:
Factors that negatively influence provider count are:
In the base model, we saw deviations from linearity, constant variance and independence assumptions. The model also seemed to overfit data with an R-squared value of 0.97.
For Linearity - If a model does not fit well, it does not mean that the regression is not useful. One problem could be that the relation between one or more predictors and the response is not linear.
For Constant Variance and Independence - If normality or the constant variance do not hold, then we transform the response variable. A common transformation is the power transformation of y to the lambda.
Here we will use the features selected by lasso model, and transform
both the predictor and response variables to create a regression model on our data.
# Create a copy of dataframe
test_log = global_df.copy()
test_log.shape
# Describe to identify any zero values in dataset
test_log.describe().transpose()
We can see that the minimum for asian, black population and some other variables is 0. Let's find out the observations with these zero values and remove them.
# Find observations that have 0 values
state_list = ['asian_pop','black_pop','amerind_pop','hisp_pop','otherace_pop','pop_density','unemp_rate','grad_deg']
x = test_log[(test_log[state_list] == 0).any(axis=1)]
x.shape
We can see that 50 out of 3110 observations have 0 values. We will go ahead and remove these.
Replace 0 values with Nan and then drop those values
# Replace 0 with NaN for columns in state_list
test_log[state_list] = test_log[state_list].replace({0:np.nan})
test_log.dropna(inplace=True)
test_log.shape
test_log.describe().transpose()
Now that 0 valued observations are removed, we can see that minimum value is not 0.
# Subset important predictors chosen from Lasso Model
train_x_lasso = test_log.loc[:,imp_coef.index]
train_y_lasso = test_log['provider_count']
train_x_lasso.head()
# Transform data
from numpy import log
sc_data = StandardScaler()
train_x_log = np.log(train_x_lasso)
train_y_log = np.log(test_log.loc[:,'provider_count'])
train_y_log.shape
# Distribution of Dependent variable before and after transformation
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))
sns.distplot(train_y_log, ax=ax2)
sns.distplot(train_y_lasso, ax=ax1);
From the plots above, we can see how the distribution of provider count varies before and after log transformation.
In this model, we will build an Ordinary Least Squares (OLS) Regression [1] model using predictors selected from feature selection.
X_train_log = sm.add_constant(train_x_log)
global_ols = sm.OLS(train_y_log, X_train_log).fit()
global_ols.summary()
Analysing the model summary, we can see that:
- The Adjusted R2 of our global model is 95% which means that 95% of variation in provider count can be explained by our model.
- The p-value of F-statistic is less than 0.05 which shows that atleast one of the predicting variables has predicting power on the variability of Provider Count.
- The p-values of individual variables determine which variables are statistically significant. The equation below is the final regression equation for our model showing statistically significant variables and their coefficients. Note that both dependent and independent variables are log transformed.
# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
global_ols.params[1:].sort_values(ascending=False).plot(kind='barh')
plt.title('Model Coefficients - Global Model');
Regression Equation
# Model Interpretation
coef = 1.1271
((1.01**coef)-1)*100
Model Interpretations [9]:
𝑚𝑖𝑛𝑜𝑟𝑖𝑡𝑦_𝑝𝑜𝑝
- The coefficient is 0.1668. We can say that for 1% increase in minority population, Provider Count will increase by 0.17% ((1.01^coef-1)*100) holding all other predictors fixed.grad_deg
- The coefficient is 0.2330. We can say that for 1% increase in graduate degree holders, Provider Count will increase by 0.23% holding all other predictors fixed.𝑡𝑜𝑡𝑎𝑙_𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛
- The coefficient is 1.1271. We can say that for 1% increase in total population, Provider Count will increase by 1.127% holding all other predictors fixed.edubase
- The coefficient is -0.0523. We can say that for 1% increase in education base, Provider Count will decrease by 0.05% holding all other predictors fixed.𝑝𝑒𝑟𝑐𝑎𝑝_𝑖𝑛𝑐𝑜𝑚𝑒
- The coefficient is 0.4684. We can say that for 1% increase in per capita income, Provider Count will increase by 0.47% holding all other predictors fixed.𝑎𝑠𝑠𝑜_𝑑𝑒𝑔
- The coefficient is 0.0846. We can say that for 1% increase in associate degree holders, Provider Count will increase by 0.08% holding all other predictors fixed.𝑎𝑣𝑔_𝑚𝑒𝑑𝑖𝑐𝑎𝑟𝑒
- The coefficient is 3.3141. We can say that for 1% increase in average medicare, Provider Count will increase by 3.31% holding all other predictors fixed.pop_density
- The coefficient is -0.0277. We can say that for 1% increase in population density, Provider Count will decrease by 0.027% holding all other predictors fixed.avg_hhsz
- The coefficient is -2.7010. We can say that for 1% increase in average household size, Provider Count will decrease by 2.70% holding all other predictors fixed.white_pop
- The coefficient is -0.2686. We can say that for 1% increase in white population, Provider Count will decrease by 0.27% holding all other predictors fixed.unemp_rate
- The coefficient is 0.0087. We can say that for 1% increase in unemployment rate, Provider Count will increase by 0.009% holding all other predictors fixed.median_age
- The coefficient is -0.9122. We can say that for 1% increase in median age, Provider Count will decrease by 0.91% holding all other predictors fixed.otherace_pop
- The coefficient is 0.0319. We can say that for 1% increase in other race population, Provider Count will increase by 0.03% holding all other predictors fixed.some_college
- The coefficient is 0.0294. We can say that for 1% increase in people with some college education, Provider Count will increase by 0.03% holding all other predictors fixed.𝑎𝑣𝑔_𝑝𝑟𝑒𝑠𝑑𝑟𝑢𝑔
- The coefficient is -3.9498. We can say that for 1% increase in average prescription drug, Provider Count will decrease by 3.95% holding all other predictors fixed.𝑎𝑣𝑔_𝑠𝑜𝑐𝑠𝑒𝑐𝑢𝑟𝑖𝑡𝑦
- The coefficient is -0.4264. We can say that for 1% increase in average social security, Provider Count will decrease by 0.43% holding all other predictors fixed.𝑎sian_𝑝𝑜𝑝
- The coefficient is 0.0608. We can say that for 1% increase in asian population, Provider Count will increase by 0.06% holding all other predictors fixed.bach_deg
- The coefficient is 0.0059. We can say that for 1% increase in bachelors degree holders, Provider Count will increase by 0.005% holding all other predictors fixed.black_pop
- The coefficient is -0.1246. We can say that for 1% increase in black population, Provider Count will decrease by 0.125% holding all other predictors fixed.ℎ𝑖𝑠𝑝_𝑝𝑜𝑝
- The coefficient is -0.1045. We can say that for 1% increase in hispanic population, Provider Count will decrease by 0.105% holding all other predictors fixed.
# Calculate RMSE of global model
pred_val = global_ols.fittedvalues.copy()
new_rmse = rmse(train_y_log, pred_val)
new_rmse
We can see that the RMSE of our model is 0.38. Since RMSE has the same unit as the dependent variable, we can compare RMSE with the range of dependent variable to see how spread out our residuals are and see how fit our model is. A value of 0.38 on a range of 0-11 (dependent variable range) is pretty small showing the model is a good fit.
Here, we will verify regression assumptions [2] of global model.
# Residual Value
residual = global_ols.resid
import scipy as sp
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))
# Residuals vs Fitted
ax1.scatter(pred_val, residual)
ax1.hlines(y=0, xmin=min(pred_val), xmax=max(pred_val), linewidth=2, color='r')
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')
# QQ plot
sp.stats.probplot(residual, plot=ax2, fit=True);
# Residuals vs Predictors
states = train_x_log.columns
fig = plt.figure(figsize=(15, 20))
for sp in range(0,10):
mini = min(train_x_log.loc[:,states[sp]])
maxi = max(train_x_log.loc[:,states[sp]])
ax = fig.add_subplot(7,4,sp+1)
ax.scatter(train_x_log.loc[:,states[sp]], residual)
ax.hlines(y=0, xmin=mini, xmax=maxi, linewidth=2, color='r')
ax.set_xlabel(states[sp])
ax.set_ylabel('Residuals')
# ax.set_ylim(0,provType.iloc[0,1])
plt.xticks(rotation=0)
for key, spine in ax.spines.items():
spine.set_visible(False)
ax.margins(0.01, 0)
plt.tight_layout(pad=0.5) # automatically adjusts layout to fit long labels
plt.show()
Let's build some spatial models such as Geographically Weighted Regression(GWR), Forest Based Classification and Regression Trees and Local Bivariate Relationships to understand how provider count is influenced by different factors at county level. Before we build these models, let's import libraries, define an output directory and create an environment workspace.
# Import Libraries
import arcpy
import os
# Set data directory to output results
arcgis_dir = './arcgis'
home_dir = os.path.join(arcgis_dir, 'home')
print(f"root dir: {arcgis_dir}")
print(f"home dir: {home_dir}")
# Create empty FGDB in data directory
if not arcpy.Exists(os.path.join(home_dir, 'Results.gdb')):
arcpy.CreateFileGDB_management(home_dir, 'Results.gdb')
# Setup Workspace
arcpy.env.workspace = os.path.join(home_dir,'Results.gdb')
arcpy.env.workspace
The global model helps us identify a relationship between predictor variables and provider count at the country level. However, the dynamics of how different variables impact provider count can be very different for different counties within a state. The question we ask here is:
What if an influential variable's impact on provider count varies accross different counties, i.e. how does average household size (or any other variable's) impact provider count vary across different counties?
To answer this question we will explore Geographically Weighted Regression Model
.
Learn more about Geographically Weighted Regression.
To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders
. The output of the model will be stored in a file geodatabase in /arcgis/home
which can then be accessed to visualize the results.
# Run GWR and output results to the FGDB
from arcpy.stats import GWR
# Workspace
arcpy.env.workspace
# Url for the layer
lyr_url = allprovider_layer.url
predictors = ['asian_pop','amerind_pop','avg_hhsz','black_pop','hisp_pop',
'median_age','minority_pop','otherace_pop','percap_income',
'unemp_rate','white_pop','some_college','grad_deg','avg_healthinsurance',
'avg_medicare','avg_labtest','avg_presdrug','avg_personalinsurance']
# Define model variables
in_features = lyr_url
dependent_variable = 'provider_count'
model_type = 'CONTINUOUS'
explanatory_variables = predictors
output_features = 'GWRResults_NB1'
neighborhood_type = 'NUMBER_OF_NEIGHBORS'
neighborhood_selection_method = 'USER_DEFINED'
minimum_number_of_neighbors = None
maximum_number_of_neighbors = None
minimum_search_distance = None
maximum_search_distance = None
number_of_neighbors_increment = None
search_distance_increment = None
number_of_increments = None
number_of_neighbors = 50
distance_band = None
prediction_locations = None
explanatory_variables_to_match = None
output_predicted_features = None
robust_prediction = 'ROBUST'
local_weighting_scheme = 'GAUSSIAN'
coefficient_raster_workspace = None
# Run Model
GWR(in_features, dependent_variable, model_type, explanatory_variables, output_features, neighborhood_type,
neighborhood_selection_method, minimum_number_of_neighbors, maximum_number_of_neighbors, minimum_search_distance,
maximum_search_distance, number_of_neighbors_increment, search_distance_increment, number_of_increments,
number_of_neighbors, distance_band, prediction_locations, explanatory_variables_to_match, output_predicted_features,
robust_prediction, local_weighting_scheme,coefficient_raster_workspace)
# Access GWR data from local
gwr_df = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
gwr_df.head()
gwr_df.shape
# Create Map
gwr_avg_hhsz_map = gis.map('Washington DC', 5)
gwr_avg_hhsz_map
The map shows how provider count varies with average household size for different counties. We can see:
- Counties in New York, New Jersey and Philadelphia where an increase in average household size will have a higher impact on provider count.
- Similarly, for counties in Washington, Virginia, Maine, New Hampshire, Massachusetts an increase in average household size will have a higher impact on provider count.
- In the Midwest, areas near Chicago and Minneapolis also see a higher impact on provider count.
# Create Map
gwr_avg_hhsz_map = gis.map('Washington DC', 5)
gwr_avg_hhsz_map
gwr_avg_hhsz_map.remove_layers()
gwr_df.spatial.plot(map_widget=gwr_avg_hhsz_map,
renderer_type='c', # for class breaks renderer
method='esriClassifyNaturalBreaks', # classification algorithm
class_count=6, # choose the number of classes
col='C_AVG_HHSZ', # numeric column to classify
cmap='OrRd', # color map to pick colors from for each class
alpha=0.7,
line_width=0) # specify opacity
# Add Legend
gwr_avg_hhsz_map.legend = True
# Create Map
gwr_avg_presdrug_map = gis.map('USA', 4)
gwr_avg_presdrug_map
The map shows how provider count varies with average prescription drug prices for different counties. We can see:
- Provider Count in majority of California and areas in New Hampshire is strongly negatively impacted by increase in average prescription drug prices. This means increase in average prescription drug prices will reduce provider count
- For areas in Minnesota and Wisconsin, provider count is strongly positively impacted by increase in average prescription drug prices.This means increase in average prescription drug prices will increase provider count
- Although for majority of midwest, provider count is positively impacted by average prescription drug prices, some counties in North New Mexico, Southern Colorado and Kansas seem to have a slight negative impact.
gwr_avg_presdrug_map.remove_layers()
gwr_df.spatial.plot(map_widget=gwr_avg_presdrug_map,
renderer_type='c', # for class breaks renderer
method='esriClassifyNaturalBreaks', # classification algorithm
class_count=6, # choose the number of classes
col='C_AVG_PRESDRUG', # numeric column to classify
cmap='OrRd', # color map to pick colors from for each class
alpha=0.7,
line_width=0.05) # specify opacity
gwr_avg_presdrug_map.legend = True
The global model and GWR model are parametric models and they assume either a linear relationship between predictors and response variable or assume a form of the function and identify influential factors at the country level and at county level. These methods are easy to understand however, are highly constrained to the specified form. But:
What if this relationship between predictors and response is not linear? OR What if we do not make strong assumptions about the form of function?
To answer this question, we will explore a non-parametric technique called Forest Based Classification and Regression Trees Model
. Forest based models are good black box models that map non-linear and complex relationships quite well and are not influenced by outliers and missing values to an extent.
Forest Based Classification and Regression Trees creates an ensemble of decision trees. Each decision tree is created using randomly generated portions of data. Each tree generates its own prediction and votes on an outcome. The forest model considers votes from all decision trees to predict or classify the outcome. The model will output a variable importance table to identify importance of each variable given to the model.
Leran more about Forest Based Classification and Regression Trees.
To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders
. The output of the model will be stored in a file geodatabase which can then be accessed to visualize the results.
Here, we created a 'Train Only' model with all predictors. The model is trained on 70% data and 30% data is excluded for validation. The model was run for 5 iterations.
Note that increasing the number of trees in the forest model will result in more accurate model prediction, but the model will take longer to calculate. Learn more about the parameters used and outputs generated by this model here.
from arcpy.stats import Forest
# Workspace
arcpy.env.workspace
lyr_url = allprovider_layer.url
predictors = [['amerind_pop','false'],['asian_pop','false'],['asso_deg','false'],['avg_healthcare','false'],
['avg_healthinsurance','false'],['avg_hhinc','false'],['avg_hhsz','false'],['avg_labtest','false'],
['avg_medicalcare','false'],['avg_medicalsrvc','false'],['avg_medicare','false'],['avg_personalinsurance','false'],
['avg_presdrug','false'],['avg_socsecurity','false'],['bach_deg','false'],['grad_deg','false'],
['hisp_pop','false'],['median_age','false'],['minority_pop','false'],['otherace_pop','false'],
['percap_income','false'],['pop_density','false'],['some_college','false'],['unemp_rate','false'],
['white_pop','false']]
# Define model variables
prediction_type = 'TRAIN'
in_features = lyr_url
variable_predict = 'provider_count'
treat_variable_as_categorical = None
explanatory_variables = predictors
distance_features = None
explanatory_rasters = None
features_to_predict = None
output_features = None
output_raster = None
explanatory_variable_matching = None
explanatory_distance_matching = None
explanatory_rasters_matching = None
output_trained_features = 'RF_Output_Result1'
output_importance_table = 'RF_VImp_Result1'
use_raster_values = True
number_of_trees = 100
minimum_leaf_size = None
maximum_depth = None
sample_size = 100
random_variables = None
percentage_for_training = 30
output_classification_table = None
output_validation_table = None
compensate_sparse_categories = False
number_validation_runs = 5
calculate_uncertainty = False
# Run Model
Forest(prediction_type,in_features, variable_predict, treat_variable_as_categorical, explanatory_variables, distance_features,
explanatory_rasters, features_to_predict, output_features, output_raster, explanatory_variable_matching,
explanatory_distance_matching, explanatory_rasters_matching, output_trained_features, output_importance_table,
use_raster_values, number_of_trees, minimum_leaf_size, maximum_depth, sample_size, random_variables,
percentage_for_training, output_classification_table, output_validation_table, compensate_sparse_categories,
number_validation_runs, calculate_uncertainty)
# RF Vimp Table
rf_df = pd.DataFrame.spatial.from_table(os.path.join(arcpy.env.workspace,output_importance_table))
rf_df.head()
rf_df.shape
data = rf_df.drop(['OBJECTID','BEST_ITE'],axis=1)
data
# Plot Importance
sns.set(rc={'figure.figsize':(15,8.27)})
data = rf_df.drop(['OBJECTID','BEST_ITE'],axis=1)
g = sns.boxplot(data = data)
plt.xticks(rotation=90);
We ran 5 iterations of Random Forest model. The box plot shows variable importance of each variable and how importance varies for each iteration.
# Create dataframe for Best Iteration
best = rf_df[rf_df['BEST_ITE']=='Best Iteration']
best
# Create Plot
sns.set(rc={'figure.figsize':(15,8.27)})
sns.barplot(data=best, color='lightcoral')
plt.xticks(rotation=90)
plt.title('Variable Importance - Best Iteration', fontsize=16);
The barplot shows variable importance of each variable for the best iteration. These variables will now be used in the next step of our analysis to understand how they vary with respect to provider count accross various counties in U.S.
Random Forest model generates variable importance and gives us important variables. However, this model does not tell us:
To understand the type and significance of the relations of a predictor variable with respect to provider count at the County level, we will explore Local Bivariate Relations (LBR) Model
.
Local Bivariate Relations analyzes two variables for statistically significant relationships using local entropy. The output can be used to visualize the type of relationship between two variables and explore how their relationship changes across counties.
We will use Local Bivariate Relations to study how relationship of important variables determined by Random Forest model change with respect to Provider Count accross various counties. Learn more about Local Bivariate Relationships.
Two of the top five most important variables generated by Forest Based CART model are Graduate Degree
and White Population
. Let's build LBR models for Provider Count w.r.t these variables.
To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders
. The output of the model will be stored in a file geodatabase which can then be accessed to visualize the results.
from arcpy.stats import LocalBivariateRelationships
# Workspace
arcpy.env.workspace
# Get url of layer
lyr_url = allprovider_layer.url
# Define model variables
in_features = lyr_url
dependent_variable = 'provider_count'
explanatory_variable = 'grad_deg'
output_features = 'allprovider_grad_deg_LBR_Results1'
number_of_neighbors = 50
number_of_permutations = 99
enable_local_scatterplot_popups = 'NO_POPUP'
level_of_confidence = '95%'
apply_false_discovery_rate_fdr_correction = 'APPLY_FDR'
scaling_factor = 0.5
# Run Model
LocalBivariateRelationships(in_features, dependent_variable,explanatory_variable,output_features,number_of_neighbors,
number_of_permutations, enable_local_scatterplot_popups, level_of_confidence,
apply_false_discovery_rate_fdr_correction, scaling_factor)
# Access GWR data from local
lbr_graddeg = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
lbr_graddeg.head()
lbr_graddeg.shape
lbr_graddeg.LBR_TYPE.value_counts()
# Create Map
lbr_graddeg_map = gis.map('USA', 4)
lbr_graddeg_map
The map shows how relationship of provider count varies with graduate degree holders for different counties. We can see:
- Majority of the counties from North Dakota to Texas share a Convex or Positive Linear relation. This means that provider count increase with increasing graduate degree holders for these counties either linearly or polynomially.
- Majority of East and West coast share a concave relation whcih means that provider count increases with graduater degree holders to a certain extent and then plateaus off at some point.
lbr_graddeg_map.remove_layers()
lbr_graddeg.spatial.plot(map_widget=lbr_graddeg_map,
renderer_type='u', # for unique value renderer
col='LBR_TYPE', # numeric column to classify
cmap='prism', # color map to pick colors from for each class
alpha=0.7,
line_width=0.05) # specify opacity
lbr_graddeg_map.legend = True
from arcpy.stats import LocalBivariateRelationships
# Workspace
arcpy.env.workspace
# Get url of layer
lyr_url = allprovider_layer.url
# Define model variables
in_features = lyr_url
dependent_variable = 'provider_count'
explanatory_variable = 'white_pop'
output_features = 'allprovider_white_pop_LBR_Results1'
number_of_neighbors = 50
number_of_permutations = 99
enable_local_scatterplot_popups = 'NO_POPUP'
level_of_confidence = '95%'
apply_false_discovery_rate_fdr_correction = 'APPLY_FDR'
scaling_factor = 0.5
# Run Model
LocalBivariateRelationships(in_features, dependent_variable,explanatory_variable,output_features,number_of_neighbors,
number_of_permutations, enable_local_scatterplot_popups, level_of_confidence,
apply_false_discovery_rate_fdr_correction, scaling_factor)
# Access GWR data from local
lbr_whitepop = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
lbr_whitepop.head()
lbr_whitepop.shape
# Counts of different relationship types
lbr_whitepop.LBR_TYPE.value_counts()
# Create Map
lbr_whitepop_map = gis.map('USA', 4)
lbr_whitepop_map
The map shows how relationship of provider count varies with white population for different counties. We can see:
- For most of the counties in United States, the relationship is either Convex or Positive Linear which means that provider count increase with increasing white population either linearly or polynomially.
- Southern parts of New Mexico, some counties in North Texas, Wyoming and Idaho have a concave relationship. This means that provider count increases to a certain extent and then plateaus off at some point.
lbr_whitepop_map.remove_layers()
lbr_whitepop.spatial.plot(map_widget=lbr_whitepop_map,
renderer_type='u', # for class breaks renderer
col='LBR_TYPE', # numeric column to classify
cmap='prism', # color map to pick colors from for each class
alpha=0.7,
line_width=0.05) # specify opacity
lbr_whitepop_map.legend = True
To summarize, we:
Base (OLS) Regression model
using all predictors:Lasso regularization
technique.Global (OLS) Regression model
using selected features.
The Global model helped us identify relationship between predictor variables and provider count at the country level.Geographically Weighted Regression (GWR) model
, to understand how different variables impact provider count accross different counties.Forest Based Classification and Regression Trees model
.variable importance
of each variable. Local Bivariate Relations Model
.Studies [10] [11] have shown that shortages of health providers limit patients' access to required treatment. In addition to provider supply, sociodemographic factors such as age, race, income and health insurance, are associated with access to care and health outcomes [12] [13].
Public health stakeholders use numerous datasets to derive insights into the prevalence of healthcare providers and the impact of social determinants of health within their geographic service areas. The volume of available data can make it difficult for these stakeholders to determine how to best reach underserved populations [14].
Our analysis integrates publicly available sociodemographic and provider location datasets [15] and uses advanced data analysis, visualization and machine learning techniques to show the variation of provider supply over different geographical areas along with a summary of the most distinctive and relevant sociodemographic features that are predictive of the variation in the selected area.
By better understanding these social determinants of health in their geographies, both government and non-governmental healthcare leaders can prioritize interventions around addressing underserved areas with healthcare provider shortages.
The analysis is designed to be easily extended with additional features or adapted to other provider specialties. The techniques defined in this analysis can support further data integrations including Medicare, Medicaid, and commercial insurance claims data to better understand actual patient utilization behavior and improve models.
To conclude, in this 4-part study we use Dask, ArcGIS API for Python, ArcGIS WebGIS stack, various spatial analysis tools and machine learning techniques to:
[1] https://en.wikipedia.org/wiki/Ordinary_least_squares
[2] https://online.stat.psu.edu/stat501/lesson/7/7.3
[3] https://sites.google.com/site/statisticsforspreadsheets/regression-modeling/influence-plot
[4] https://online.stat.psu.edu/stat501/lesson/10/10.2
[5] https://bookdown.org/max/FES/recursive-feature-elimination.html
[6] https://towardsdatascience.com/ridge-lasso-and-elastic-net-regularization-7861ca575c64
[7] https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f
[8] Feng C, Wang H, Lu N, et al. Log-transformation and its implications for data analysis [published correction appears in Gen Psychiatr. 2019 Sep 6;32(5):e100146corr1]. Shanghai Arch Psychiatry. 2014;26(2):105-109. doi:10.3969/j.issn.1002-0829.2014.02.009 [Link]
[9] https://www.cscu.cornell.edu/news/statnews/stnews83.pdf
[10] Institute of Medicine (US) National Cancer Policy Forum. Ensuring Quality Cancer Care through the Oncology Workforce: Sustaining Care in the 21st Century: Workshop Summary. Washington (DC): National Academies Press (US); 2009. Supply and Demand in the Health Care Workforce. Available from: https://www.ncbi.nlm.nih.gov/books/NBK215247/
[11] Kullgren J, McLaughlin C, Mitra N, Armstrong K. Nonfinancial barriers and access to care for U.S. adults. Health Serv Res. 2012;47(1):462–485. doi: 10.1111/j.1475-6773.2011.01308.x. [PubMed]
[12] Sorkin, D.H., Ngo-Metzger, Q. & De Alba, I. Racial/Ethnic Discrimination in Health Care: Impact on Perceived Quality of Care. J GEN INTERN MED 25, 390–396 (2010). https://doi.org/10.1007/s11606-010-1257-5
[13] Reiners F, Sturm J, Bouw LJW, Wouters EJM. Sociodemographic Factors Influencing the Use of eHealth in People with Chronic Diseases. Int J Environ Res Public Health. 2019;16(4):645. Published 2019 Feb 21. doi:10.3390/ijerph16040645
[14] Community Health Needs Assessment PeaceHealth (2016). Sacred Heart Medical Center. https://www.peacehealth.org/about-peacehealth/Pages/community-health-needs-assessment
[15] https://www.cms.gov/Regulations-and-Guidance/Administrative-Simplification/NationalProvIdentStand/DataDissemination.html