Analyzing Healthcare Provider Shortage - Part 4/4¶

👟 Ready To Run!

📈 Statistics & Graphing

💻 Predictive Modeling

⚕️ Healthcare

Requirements

🗄️ Notebook Server Advanced License

The big question that we would like to answer with this analysis is that: What are the socieconomic and demographic factors that influence access to healthcare providers?

In this notebook, we will build various global and local models and will try to answer this question by:

Identifying key sociodemographic and economic factors that are most predictive of provider count.
Identifying the direction of influence that these factors have on provider count i.e. whether the factors influence access to providers positively or negatively.
Identify how the influence of these factors varies across different counties.

1 Gather and Plot Data
- 1.1 Get the Cleaned Data Layer
- 1.2 Plot Data
  - 1.2.1 Pair Plot
  - 1.2.2 Correlation Plot
2 Generate a Global Model
3 Spatial Models - Setup
4 Geographically Weighted Regression (GWR) Model
5 Forest Based Classification and Regression Trees Model
6 Local Bivariate Relationships (LBR) Model
- 6.1 Build Model - Provider Count and Graduate Degree
  - 6.1.1 Access Model Results as Spatially Enabled Dataframe
  - 6.1.2 Plot Results on Map
- 6.2 Build Model - Provider Count and White Population
  - 6.2.1 Access Model Results as Spatially Enabled Dataframe
  - 6.2.2 Plot Results on Map
7 Summary
8 Conclusion
9 References

Part 4: Provider Shortage - Model and Understand Shortage¶

In this section, we will:

Gather and Process Demographic and Heath Expenditure data
Aggragate Provider Count for all counties in U.S.
Generate a Base (OLS) Model of provider count using demographic and health expenditure variables
Perform Feature Selection using multiple techniques to select relevant features
Run a Global (OLS) model using selected features
Create a Geographically Weighted Regression (GWR) Model to understand how various predictors vary accross different counties
Create a Forest Based Classification and Regression Trees Model to understand Non-linear relations and to indentify important variables
Create Local Bivariate Relationships (LBR) Model to understand the type and significance of relationships of Provider Count with respect to variables selected from Forest based model.

# Import Libraries
from IPython.display import display

# Import arcgis
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap

# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np

# Import plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Import library for time
import time

# Supress Warning
import warnings
warnings.filterwarnings("ignore")

# Create connection
gis = GIS("home")

Gather and Plot Data¶

To understand shortage of healthcare providers we will try to identify socioeconomic and demographic factors that influence access to these providers. We will use the cleaned aggregated data layer demographic_healthexp_clean_allproviders created in Part 2 of this 4 part study. This layer includes the provider count, demographic data and some health expenditure features.

Get the Cleaned Data Layer¶

# Search the feature layer
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
                                outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
allprovider_item = gis.content.search(f'demographic_healthexp_clean_allproviders {group_query}',
                                      item_type = "Feature Service",
                                      outside_org=True)[0]
allprovider_item

# Get the layer
allprovider_layer = allprovider_item.layers[0]
allprovider_layer

<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/demographic_healthexp_clean_allproviders/FeatureServer/0">

# Look at the first few fields
for f in allprovider_layer.properties.fields[:5]:
    print(f['name'], "{0:>25}".format(f['alias']))

objectid                  OBJECTID
state                     state
county                    county
asian_pop                 asian_pop
amerind_pop               amerind_pop

# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider_layer.query(as_df=True)
allprovider_df.shape

(3139, 35)

We can see that the dataframe (allprovider_df) has 3139 rows and 35 columns.

# Look at the first few rows of dataframe
allprovider_df.head()

Plot Data¶

Before we plot the data, let's create a copy of allprovider_df and drop unnecessary columns.

# Create copy of dataframe
test_newcounty_df = allprovider_df.copy()

# Drop the shape and objectid columns
test_newcounty_df.drop(['objectid','SHAPE','SHAPE__Length','SHAPE__Area'], axis=1, inplace=True)
test_newcounty_df.columns

Index(['amerind_pop', 'asian_pop', 'asso_deg', 'avg_healthcare',
       'avg_healthinsurance', 'avg_hhinc', 'avg_hhsz', 'avg_labtest',
       'avg_medicalcare', 'avg_medicalsrvc', 'avg_medicare',
       'avg_personalinsurance', 'avg_presdrug', 'avg_socsecurity', 'bach_deg',
       'black_pop', 'county', 'edubase', 'grad_deg', 'hisp_pop', 'median_age',
       'minority_pop', 'otherace_pop', 'percap_income', 'pop_density',
       'provider_count', 'some_college', 'state', 'total_population',
       'unemp_rate', 'white_pop'],
      dtype='object')

Pair Plot¶

We will plot each numerical variable with respect to others to see how the data is distributed and how correlated the variables are to each other.

sns.pairplot(test_newcounty_df.iloc[:,0:9]);

sns.pairplot(test_newcounty_df.iloc[:,10:19]);

sns.pairplot(test_newcounty_df.iloc[:,20:]);

From these pair plots we can see that:

Distribution of most variables is heavily skewed to the right.
Some variables have outliers that influencing data distribution.
Most demographic variables are not correlated to other demographic variables.
Most health expenditure variables seem to be highly correlated to other health expenditure variables.

Correlation Plot¶

Correlation plot is another great way to visualize correlation among predictor variables and correlation of predictors with response variable.

f,ax = plt.subplots(figsize=(18, 18))

sns.heatmap(test_newcounty_df.corr(), vmin=-1.0, vmax=1.0, annot=True, 
            linewidths=.5, fmt= '.1f',ax=ax)
plt.title('Correlation Plot', fontdict = {'fontsize' : 30});

From this plot, we can see that:

Education level variables are highly correlated with each other.
Similarly, health expenditure variables also seem to be highly correlated.

# Sort the values by descending order of Provider Count
test_newcounty_df = test_newcounty_df.sort_values(by=['provider_count'], ascending=False)
test_newcounty_df.head()

Generate a Global Model¶

The idea behind a global model is to identify various socioeconomic and demographic factors that influence access to healthcare providers across all counties in the United States.

To build this global model, we will follow a 3 step process:

1. Build an Ordinary Least Squares (OLS) Regression Model and verify regressions assumptions
2. Perform Feature Selection using Lasso, Elastic Net and Recursive Feature Elimination techniques
3. Using selected features, build another OLS model, check performance and verify assumptions

Create Predictor and Response Variables¶

# Create predictor variables
train_pred = test_newcounty_df.drop(['county','state','provider_count'],axis=1)
train_pred.head()

# Create response variables
train_y = test_newcounty_df['provider_count']
train_y.head()

1198    188179.0
2784    104037.0
2459     80839.0
1614     71069.0
1405     70953.0
Name: provider_count, dtype: float64

Create a Base Model¶

We will start with building an Ordinary Least Squares (OLS) Regression [1] model with all predictors and verify regressions assumptions.

# Import libraries
import statsmodels.api as sm
# import statsmodels
from statsmodels.regression import linear_model

# Set the seed for consistent results
np.random.seed(101)

Linear Regression Model using Stats Models¶

# Create Model
X_train = train_pred
X_train = sm.add_constant(X_train) # add constant
sm_ols = sm.OLS(train_y, X_train).fit()

# Generate model summary
sm_ols.summary()

# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
sm_ols.params[1:].sort_values(ascending=False).plot(kind='barh')
plt.title('Model Coefficients - Base Model');

# Calculate RMSE of model
from statsmodels.tools.eval_measures import rmse

pred_val = sm_ols.fittedvalues.copy()
rmse_base_ols = rmse(train_y, pred_val)

print('RMSE for Base Model:',round(rmse_base_ols,2))

RMSE for Base Model: 1111.25

Analysing the model summary, we can see that:

The R-squared value of 0.972 shows that 97.2% of the variability in Provider Count is explained by the model. The R-squared value is too good to be true and the model seems to be overfitting.

To identify variables that are statistically significant, we look at the p-values of individual variables. Among the coefficients, those that are statistically significant at 90% significance level are:

asian_pop, amerind_pop, black_pop, edubase, hisp_pop, median_age, minority_pop, otherace_pop, percap_income, pop_density, white_pop, some_college, asso_deg, grad_deg, total_population, avg_labtest, avg_presdrug, avg_socsecurity.

The p-value of F-statistic is less than 0.05 which shows that atleast one of the predicting variables has predicting power on the variability of Provider Count.

Verify Assumptions¶

Here, we will verify regression assumtions [2] of:

Linearity
Independence
Normality
Equal Variance

# Get residual value
residual = sm_ols.resid

import scipy as sp
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))

# Residuals vs Fitted
ax1.scatter(pred_val, residual)
ax1.hlines(y=0, xmin=0, xmax=max(pred_val), linewidth=2, color='r')
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs Fitted Values')

# QQ plot
sp.stats.probplot(residual, plot=ax2, fit=True);

Residuals vs Fitted plot
- Constant Variance: From the plot, we observe a departure from constant variance as the distance of data points from zero are not consistently equal. We can see that the residuals are closer to 0 in the beginning but values start to go up as we move a little towards the right of the plot.
- Independence: We see that the data is clustered together and is not randomly distributed. This indicates a departure from independence as clusters of residuals generally indicate correlation errors, or an indication that multicollinearity exists in parts of the data.

Probability Plot
- Normality: From this plot, we can see that the residuals have tails on both the positive and negative side. Although residuals stay close to the zero line they are heavily tailed showing violation of Normality assumption.

# Check Linearity - Plot predictors with response
preds = train_pred.columns
fig = plt.figure(figsize=(15, 20))
for sp in range(0,27):
    ax = fig.add_subplot(7,4,sp+1)
    ax.scatter(train_pred.loc[:,preds[sp]], train_y)
    ax.set_xlabel(preds[sp])
    ax.set_ylabel('Provider Count')
    plt.xticks(rotation=0)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout(pad=1.0) # automatically adjusts layout to fit long labels
plt.show()

Predictors vs Response
- Linearity - From the plot above, we can see the each predictor variable is either strongly or weakly related to response variable. From this plot, we can say that the Linearity assumption holds.

Outlier Detection¶

One way to think about whether the results we have are driven by a given data point is to calculate how far the predicted values for data would move if model was fit without the data point in question. This calculated total distance is called Cook's distance. Cook's D is a function of the leverage and standardized residual associated with each data point. The influence of each point can be visualized using an Influence Plot [3].

# Create Influence Plot

fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(sm_ols, ax= ax, criterion="cooks")

From this plot, we can see that points with larger size have high leverage and higher than average residuals. The larger size of points indicates these are influential points. Points that do not have a high leverage but very high residual and are considered outliers.

Note that numbers shown next to the points in this plot are observation numbers (row ids) from the data. This context is important to understand the following section for removing outliers.

Remove Outliers¶

Removing outliers is an iterative process and outliers need to be investigated before removal. Here, we followed a 4-step process to remove outliers as follows:

Remove initial outliers based on Influence Plot
Rerun the model with data after removing outliers
Plot predictors with response to see outliers in each predictor
Repeat the process above until you see satisfactory results

The outliers were removed in different iterations. Here are the obervations that were removed with each iteration:

- Iteration 1: 2631,2774,1277,1245,296
- Iteration 2: 1198,2353,2459,1688,1077
- Iteration 3: 916,2632,2262,1207,797,531
- Iteration 4: 2613,310,1052,38,1632,1512
- Iteration 5: 629,1846,2710,388,1614,2784,1727

The code below shows all outliers being removed. If you see a different set of points as outliers for your analysis, update the global_df.drop() command in code cell below with outlier observations and run the analysis.

# Remove Outliers
global_df = test_newcounty_df.copy()

global_df.drop([2631,2774,1277,1245,296,
                1198,2353,2459,1688,1077,
                916,2632,2262,1207,797,531,
                2613, 310, 1052, 38, 1632, 1512,
                629, 1846, 2710, 388, 1614, 2784, 1727],axis=0, inplace=True)

global_df.shape

(3110, 31)

Re-run Model¶

# Create prdictor and response variables
train_x_rerun = global_df.drop(['county','state','provider_count'],axis=1)
train_y_rerun = global_df['provider_count']
train_x_rerun.head()

# Run Model
X_train_rerun = train_x_rerun
X_train_rerun = sm.add_constant(X_train_rerun)
global_ols_rerun = sm.OLS(train_y_rerun, X_train_rerun).fit() 
residual = global_ols_rerun.resid

# Plot Outliers
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(global_ols_rerun, ax= ax, criterion="cooks")

Comparing this influece plot to the original plot before removing outliers, we can see that the range for H Leverage has considerably reduced from 0.0 - 0.8 to 0.0 - 0.40. The range for Studentized Residuals has also reduced from -12.0 - 20.0 to -10.0 - 10.0.

# Check Linearity - Plot predictors with response
preds = train_x_rerun.columns
fig = plt.figure(figsize=(15, 15))
for sp in range(0,27):
    ax = fig.add_subplot(7,4,sp+1)
    ax.scatter(train_x_rerun.loc[:,preds[sp]], train_y_rerun)
    ax.set_xlabel(preds[sp])
    ax.set_ylabel('Provider Count')
    plt.xticks(rotation=0)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout(pad=0.5) # automatically adjusts layout to fit long labels
plt.show()

Feature Selection and Importance¶

Our regression model results showed only a few predictors that were statistically significant in predicting provider count. Feature selection is used to select those features in data that contribute most to the response variable. Having irrelevant features in the data can decrease the accuracy of many models, especially linear algorithms. To identify which predictors play a key role and remove predictors that are not statistically significant, we can use various feature selection techniques.

Greedy algorithms like Stepwise Regression [4] and Recursive Feature Elimination (RFE) [5] work by adding or removing attributes one at a time and building a model on those attributes that remain. RFE algorithm recursively removes attributes and uses accuracy metric to rank the feature according to their importance.

Regularization methods like Lasso [6] and ElasticNet [6] seek to minimize the complexity (magnitude and number of regression coefficients) of the model by penalizing a feature given a coefficient threshold.

Lasso penalizes the sum of absolute value of regression coefficients thereby forcing many coefficients to 0.
ElasticNet combines the properties of both Lasso (L1 term) and Ridge [6] (L2 term) regression. It penalizes the model by using both the L2-norm (sum of squared values of coefficients) and the L1-norm (sum of absolute value of coefficients) thereby shrinking some coefficients closer to 0 to reduce variance and making other coefficients 0.

Feature importance techniques like Random Forest [7] are used to select features using a trained supervised classifier. Random forest consists of a number of decision trees where each node in a tree is a condition on a single feature, designed to split the dataset.

Here we will:

Run some feature selection algorithms
Choose an algorithm with lowest error
Run the chosen algorithm on all predictors
Choose the predictors selected by this model for further analysis

# Create prdictor and response variables
train_x_global = train_x_rerun
train_y_global = train_y_rerun
train_x_global.head()

# Import Libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

Run Feature Selection Algorithms¶

# Test options and evaluation metric
num_folds = 10
seed = 7
scoring ='neg_mean_squared_error'

# Create Pipeline and Standardize the dataset
clf = LinearRegression()
pipelines = []
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()),('LASSO',Lasso(max_iter=10000))])))
pipelines.append(('ScaledEN', Pipeline([('Scaler', StandardScaler()),('EN',ElasticNet())])))
pipelines.append(('ScaledRFECV', Pipeline([('Scaler', StandardScaler()),('RFECV',RFECV(estimator=clf, cv=5))])))
pipelines.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF',RandomForestRegressor(n_estimators = 100))])))

# Run models
import math
results = []
names = []
for name, model in pipelines:
    kfold = KFold(n_splits=num_folds, random_state=seed,shuffle=True)
    cv_results = cross_val_score(model, train_x_global, train_y_global, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, math.sqrt(abs(cv_results.mean())), cv_results.std())
    print(msg)

ScaledLASSO: 810.966882 (144215.366577)
ScaledEN: 1036.948941 (257837.375171)
ScaledRFECV: 823.745259 (121412.751591)
ScaledRF: 1069.088212 (951602.468757)

From the results above, Lasso seems to be the most promising with lowest RMSE. Let's explore Lasso to see which variables are selected by this model.

Identify Features Using Lasso¶

# Standardize the data using Standard Scalar
from sklearn.preprocessing import StandardScaler
train_x_std = StandardScaler().fit_transform(train_x_global)

# Run Lasso Model
from sklearn.linear_model import Lasso
# from sklearn.linear_model import ElasticNet
lasso_model=Lasso(max_iter=10000)
lasso_model.fit(train_x_std,train_y_global)

Lasso(max_iter=10000)

# Identify how many variables are picked
coef = pd.Series(lasso_model.coef_, index = train_pred.columns)
print("Model picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")

Model picked 20 variables and eliminated the other 8 variables

# Get important variables and their coefficients
imp_coef = pd.concat([coef.sort_values(ascending=False).head(10),
                     coef.sort_values(ascending=False).tail(10)])
imp_coef

minority_pop        6112.644751
grad_deg            3080.902796
total_population    1746.411139
edubase             1045.670237
percap_income        651.533545
asso_deg             632.276396
avg_medicare         612.894279
pop_density          161.600102
avg_hhsz              49.771492
white_pop             -0.000000
unemp_rate           -53.553069
median_age          -180.673223
otherace_pop        -189.915878
some_college        -399.935748
avg_presdrug        -574.941967
avg_socsecurity     -951.248207
asian_pop          -1269.292044
bach_deg           -1703.537961
black_pop          -1950.128199
hisp_pop           -3752.211812
dtype: float64

# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Model");

Text(0.5, 1.0, 'Coefficients in the Model')

From the analysis above, we can see that factors that are predictive of higher provider count are:

minority_pop (2020 Minority Population (Esri))
grad_deg (2020 Education: Graduate/Professional Degree (Esri))
total_population (2020 Total Population (Esri))
percap_income (2020 Per Capita Income (Esri))
avg_medicare (2020 Medicare Payments: Average)
asso_deg (2020 Education: Associate's Degree (Esri))
edubase (2020 Educational Attainment Base (Esri))
pop_density (2020 Population Density (Pop per Square Mile) (Esri))
avg_hhsz (2020 Average Household Size (Esri))

Factors that negatively influence provider count are:

hisp_pop (2020 Hispanic Population (Esri))
black_pop (2020 Black/African American Population (Esri))
bach_deg (2020 Education: Bachelor's Degree (Esri))
asian_pop (2020 Asian Population (Esri))
avg_socsecurity (2020 Pensions & Social Security: Average)
avg_presdrug (2020 Prescription Drugs: Average)
median_age (2020 Median Age (Esri))
some_college (2020 Education: Some College/No Degree (Esri))
otherrace_pop (2020 Other Race Population by Age Base (Esri))
unemp_rate (2020 Unemployment Rate (Esri))

Regression with Selected Features¶

In the base model, we saw deviations from linearity, constant variance and independence assumptions. The model also seemed to overfit data with an R-squared value of 0.97.

For Linearity - If a model does not fit well, it does not mean that the regression is not useful. One problem could be that the relation between one or more predictors and the response is not linear.

For Constant Variance and Independence - If normality or the constant variance do not hold, then we transform the response variable. A common transformation is the power transformation of y to the lambda.

Here we will use the features selected by lasso model, and transform both the predictor and response variables to create a regression model on our data.

Clean the Data¶

Some transformations such as Log transformation [8] do not work well with zero or negative values. We do not have any negatives but let's check and remove any zero values from the data.

# Create a copy of dataframe
test_log = global_df.copy()
test_log.shape

(3110, 31)

# Describe to identify any zero values in dataset
test_log.describe().transpose()

We can see that the minimum for asian, black population and some other variables is 0. Let's find out the observations with these zero values and remove them.

# Find observations that have 0 values
state_list = ['asian_pop','black_pop','amerind_pop','hisp_pop','otherace_pop','pop_density','unemp_rate','grad_deg']
x = test_log[(test_log[state_list] == 0).any(axis=1)]
x.shape

(50, 31)

We can see that 50 out of 3110 observations have 0 values. We will go ahead and remove these.

Replace 0 values with Nan and then drop those values

# Replace 0 with NaN for columns in state_list
test_log[state_list] = test_log[state_list].replace({0:np.nan})
test_log.dropna(inplace=True)
test_log.shape

(3060, 31)

test_log.describe().transpose()

Now that 0 valued observations are removed, we can see that minimum value is not 0.

Transform Data¶

We will use Log transformation [8] to transform the data.

# Subset important predictors chosen from Lasso Model
train_x_lasso = test_log.loc[:,imp_coef.index]
train_y_lasso = test_log['provider_count']
train_x_lasso.head()

# Transform data
from numpy import log
sc_data = StandardScaler()
train_x_log = np.log(train_x_lasso)
train_y_log = np.log(test_log.loc[:,'provider_count'])

train_y_log.shape

(3060,)

# Distribution of Dependent variable before and after transformation
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))
sns.distplot(train_y_log, ax=ax2)
sns.distplot(train_y_lasso, ax=ax1);

From the plots above, we can see how the distribution of provider count varies before and after log transformation.

Run Global Model¶

In this model, we will build an Ordinary Least Squares (OLS) Regression [1] model using predictors selected from feature selection.

X_train_log = sm.add_constant(train_x_log)
global_ols = sm.OLS(train_y_log, X_train_log).fit()

global_ols.summary()

Analysing the model summary, we can see that:

The Adjusted R2 of our global model is 95% which means that 95% of variation in provider count can be explained by our model.

The p-value of F-statistic is less than 0.05 which shows that atleast one of the predicting variables has predicting power on the variability of Provider Count.

The p-values of individual variables determine which variables are statistically significant. The equation below is the final regression equation for our model showing statistically significant variables and their coefficients. Note that both dependent and independent variables are log transformed.

# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
global_ols.params[1:].sort_values(ascending=False).plot(kind='barh')
plt.title('Model Coefficients - Global Model');

Regression Equation

$$ \begin{equation*} provider\_count = 0.1668*minority\_pop + 0.2330*grad\_deg + 1.1271*total\_population - 0.0523*edubase + 0.4684*percap\_income\\ + 0.0846*asso\_deg + 3.3141*avg\_medicare - 0.0277*pop\_density - 2.7010*avg\_hhsz - 0.2686*white\_pop + 0.0087*unemp\_rate\\ - 0.9122*median\_age + 0.0319*otherace\_pop + 0.0294*some\_college - 3.9498*avg\_presdrug - 0.4264*avg\_socsecurity\\ + 0.0608*asian\_pop + 0.0059*bach\_deg - 0.1246*black\_pop - 0.1045*hisp\_pop \end{equation*} $$

# Model Interpretation
coef = 1.1271
((1.01**coef)-1)*100

1.1278141976397205

Model Interpretations [9]:

𝑚𝑖𝑛𝑜𝑟𝑖𝑡𝑦_𝑝𝑜𝑝 - The coefficient is 0.1668. We can say that for 1% increase in minority population, Provider Count will increase by 0.17% ((1.01^coef-1)*100) holding all other predictors fixed.

grad_deg - The coefficient is 0.2330. We can say that for 1% increase in graduate degree holders, Provider Count will increase by 0.23% holding all other predictors fixed.

𝑡𝑜𝑡𝑎𝑙_𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 - The coefficient is 1.1271. We can say that for 1% increase in total population, Provider Count will increase by 1.127% holding all other predictors fixed.

edubase - The coefficient is -0.0523. We can say that for 1% increase in education base, Provider Count will decrease by 0.05% holding all other predictors fixed.

𝑝𝑒𝑟𝑐𝑎𝑝_𝑖𝑛𝑐𝑜𝑚𝑒 - The coefficient is 0.4684. We can say that for 1% increase in per capita income, Provider Count will increase by 0.47% holding all other predictors fixed.

𝑎𝑠𝑠𝑜_𝑑𝑒𝑔 - The coefficient is 0.0846. We can say that for 1% increase in associate degree holders, Provider Count will increase by 0.08% holding all other predictors fixed.

𝑎𝑣𝑔_𝑚𝑒𝑑𝑖𝑐𝑎𝑟𝑒 - The coefficient is 3.3141. We can say that for 1% increase in average medicare, Provider Count will increase by 3.31% holding all other predictors fixed.

pop_density - The coefficient is -0.0277. We can say that for 1% increase in population density, Provider Count will decrease by 0.027% holding all other predictors fixed.

avg_hhsz - The coefficient is -2.7010. We can say that for 1% increase in average household size, Provider Count will decrease by 2.70% holding all other predictors fixed.

white_pop - The coefficient is -0.2686. We can say that for 1% increase in white population, Provider Count will decrease by 0.27% holding all other predictors fixed.

unemp_rate - The coefficient is 0.0087. We can say that for 1% increase in unemployment rate, Provider Count will increase by 0.009% holding all other predictors fixed.

median_age - The coefficient is -0.9122. We can say that for 1% increase in median age, Provider Count will decrease by 0.91% holding all other predictors fixed.

otherace_pop - The coefficient is 0.0319. We can say that for 1% increase in other race population, Provider Count will increase by 0.03% holding all other predictors fixed.

some_college - The coefficient is 0.0294. We can say that for 1% increase in people with some college education, Provider Count will increase by 0.03% holding all other predictors fixed.

𝑎𝑣𝑔_𝑝𝑟𝑒𝑠𝑑𝑟𝑢𝑔 - The coefficient is -3.9498. We can say that for 1% increase in average prescription drug, Provider Count will decrease by 3.95% holding all other predictors fixed.

𝑎𝑣𝑔_𝑠𝑜𝑐𝑠𝑒𝑐𝑢𝑟𝑖𝑡𝑦 - The coefficient is -0.4264. We can say that for 1% increase in average social security, Provider Count will decrease by 0.43% holding all other predictors fixed.

𝑎sian_𝑝𝑜𝑝 - The coefficient is 0.0608. We can say that for 1% increase in asian population, Provider Count will increase by 0.06% holding all other predictors fixed.

bach_deg - The coefficient is 0.0059. We can say that for 1% increase in bachelors degree holders, Provider Count will increase by 0.005% holding all other predictors fixed.

black_pop - The coefficient is -0.1246. We can say that for 1% increase in black population, Provider Count will decrease by 0.125% holding all other predictors fixed.

ℎ𝑖𝑠𝑝_𝑝𝑜𝑝 - The coefficient is -0.1045. We can say that for 1% increase in hispanic population, Provider Count will decrease by 0.105% holding all other predictors fixed.

# Calculate RMSE of global model
pred_val = global_ols.fittedvalues.copy()
new_rmse = rmse(train_y_log, pred_val)
new_rmse

0.3796336341163295

We can see that the RMSE of our model is 0.38. Since RMSE has the same unit as the dependent variable, we can compare RMSE with the range of dependent variable to see how spread out our residuals are and see how fit our model is. A value of 0.38 on a range of 0-11 (dependent variable range) is pretty small showing the model is a good fit.

Verify Assumptions¶

Here, we will verify regression assumptions [2] of global model.

# Residual Value
residual = global_ols.resid

import scipy as sp
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))

# Residuals vs Fitted
ax1.scatter(pred_val, residual)
ax1.hlines(y=0, xmin=min(pred_val), xmax=max(pred_val), linewidth=2, color='r')
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')

# QQ plot
sp.stats.probplot(residual, plot=ax2, fit=True);

# Residuals vs Predictors
states = train_x_log.columns
fig = plt.figure(figsize=(15, 20))

for sp in range(0,10):
    mini = min(train_x_log.loc[:,states[sp]])
    maxi = max(train_x_log.loc[:,states[sp]])
    ax = fig.add_subplot(7,4,sp+1)
    ax.scatter(train_x_log.loc[:,states[sp]], residual)
    ax.hlines(y=0, xmin=mini, xmax=maxi, linewidth=2, color='r')
    ax.set_xlabel(states[sp])
    ax.set_ylabel('Residuals')
#     ax.set_ylim(0,provType.iloc[0,1])
    plt.xticks(rotation=0)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout(pad=0.5) # automatically adjusts layout to fit long labels
plt.show()

Residuals vs Fitted plot
- Constant Variance: From the plot, we still observe a departure from constant variance as the distance of data points from zero line is not consistent. We can see that the variation in residuals is high initially and then starts to drop.
- Independence: We see that the data is randomly distributed along the zero line indicating the independence holds.

Probability Plot
- Normality: From this plot, we see that the residuals follow straight line with departure at the tails. Normality assumption does not hold.

Residuals vs Predictors Plot
- Linearity: From the plot, we can see that the residuals for each predictor variable are randomly distributed accross the 0 line. We can say that the linearity assumption holds.

Spatial Models - Setup¶

Let's build some spatial models such as Geographically Weighted Regression(GWR), Forest Based Classification and Regression Trees and Local Bivariate Relationships to understand how provider count is influenced by different factors at county level. Before we build these models, let's import libraries, define an output directory and create an environment workspace.

# Import Libraries
import arcpy
import os

# Set data directory to output results

arcgis_dir = './arcgis'
home_dir = os.path.join(arcgis_dir, 'home')

print(f"root dir: {arcgis_dir}")
print(f"home dir: {home_dir}")

root dir: ./arcgis
home dir: ./arcgis\home

# Create empty FGDB in data directory
if not arcpy.Exists(os.path.join(home_dir, 'Results.gdb')):
    arcpy.CreateFileGDB_management(home_dir, 'Results.gdb')

# Setup Workspace
arcpy.env.workspace = os.path.join(home_dir,'Results.gdb')
arcpy.env.workspace

'./ArcGIS\\home\\Results.gdb'

Geographically Weighted Regression (GWR) Model¶

The global model helps us identify a relationship between predictor variables and provider count at the country level. However, the dynamics of how different variables impact provider count can be very different for different counties within a state. The question we ask here is:

What if an influential variable's impact on provider count varies accross different counties, i.e. how does average household size (or any other variable's) impact provider count vary across different counties?

To answer this question we will explore Geographically Weighted Regression Model.

Learn more about Geographically Weighted Regression.

Run GWR Model¶

To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders. The output of the model will be stored in a file geodatabase in /arcgis/home which can then be accessed to visualize the results.

# Run GWR and output results to the FGDB
from arcpy.stats import GWR

# Workspace
arcpy.env.workspace

# Url for the layer
lyr_url = allprovider_layer.url

predictors = ['asian_pop','amerind_pop','avg_hhsz','black_pop','hisp_pop',
              'median_age','minority_pop','otherace_pop','percap_income',
              'unemp_rate','white_pop','some_college','grad_deg','avg_healthinsurance',
              'avg_medicare','avg_labtest','avg_presdrug','avg_personalinsurance']

# Define model variables
in_features = lyr_url
dependent_variable = 'provider_count'
model_type = 'CONTINUOUS'
explanatory_variables = predictors
output_features = 'GWRResults_NB1'
neighborhood_type = 'NUMBER_OF_NEIGHBORS'
neighborhood_selection_method = 'USER_DEFINED'
minimum_number_of_neighbors = None
maximum_number_of_neighbors = None
minimum_search_distance = None
maximum_search_distance = None
number_of_neighbors_increment = None
search_distance_increment = None
number_of_increments = None
number_of_neighbors = 50
distance_band = None
prediction_locations = None
explanatory_variables_to_match = None
output_predicted_features = None
robust_prediction = 'ROBUST'
local_weighting_scheme = 'GAUSSIAN'
coefficient_raster_workspace = None

# Run Model
GWR(in_features, dependent_variable, model_type, explanatory_variables, output_features, neighborhood_type,
    neighborhood_selection_method, minimum_number_of_neighbors, maximum_number_of_neighbors, minimum_search_distance,
    maximum_search_distance, number_of_neighbors_increment, search_distance_increment, number_of_increments,
    number_of_neighbors, distance_band, prediction_locations, explanatory_variables_to_match, output_predicted_features,
    robust_prediction, local_weighting_scheme,coefficient_raster_workspace)

Access GWR Results as Spatially Enabled Dataframe¶

# Access GWR data from local
gwr_df = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
gwr_df.head()

gwr_df.shape

(3139, 67)

Plot GWR Results¶

Plot Average Household Size¶

# Create Map
gwr_avg_hhsz_map = gis.map('Washington DC', 5)
gwr_avg_hhsz_map

The map shows how provider count varies with average household size for different counties. We can see:

Counties in New York, New Jersey and Philadelphia where an increase in average household size will have a higher impact on provider count.

Similarly, for counties in Washington, Virginia, Maine, New Hampshire, Massachusetts an increase in average household size will have a higher impact on provider count.

In the Midwest, areas near Chicago and Minneapolis also see a higher impact on provider count.

# Create Map
gwr_avg_hhsz_map = gis.map('Washington DC', 5)
gwr_avg_hhsz_map

gwr_avg_hhsz_map.remove_layers()
gwr_df.spatial.plot(map_widget=gwr_avg_hhsz_map,
        renderer_type='c',  # for class breaks renderer
        method='esriClassifyNaturalBreaks',  # classification algorithm
        class_count=6,  # choose the number of classes
        col='C_AVG_HHSZ',  # numeric column to classify
        cmap='OrRd',  # color map to pick colors from for each class
        alpha=0.7,
        line_width=0) # specify opacity

True

# Add Legend
gwr_avg_hhsz_map.legend = True

Plot Average Prescription Drug¶

# Create Map
gwr_avg_presdrug_map = gis.map('USA', 4)
gwr_avg_presdrug_map

The map shows how provider count varies with average prescription drug prices for different counties. We can see:

Provider Count in majority of California and areas in New Hampshire is strongly negatively impacted by increase in average prescription drug prices. This means increase in average prescription drug prices will reduce provider count

For areas in Minnesota and Wisconsin, provider count is strongly positively impacted by increase in average prescription drug prices.This means increase in average prescription drug prices will increase provider count

Although for majority of midwest, provider count is positively impacted by average prescription drug prices, some counties in North New Mexico, Southern Colorado and Kansas seem to have a slight negative impact.

gwr_avg_presdrug_map.remove_layers()
gwr_df.spatial.plot(map_widget=gwr_avg_presdrug_map,
        renderer_type='c',  # for class breaks renderer
        method='esriClassifyNaturalBreaks',  # classification algorithm
        class_count=6,  # choose the number of classes
        col='C_AVG_PRESDRUG',  # numeric column to classify
        cmap='OrRd',  # color map to pick colors from for each class
        alpha=0.7,
        line_width=0.05) # specify opacity

True

gwr_avg_presdrug_map.legend = True

Forest Based Classification and Regression Trees Model¶

The global model and GWR model are parametric models and they assume either a linear relationship between predictors and response variable or assume a form of the function and identify influential factors at the country level and at county level. These methods are easy to understand however, are highly constrained to the specified form. But:

What if this relationship between predictors and response is not linear? OR What if we do not make strong assumptions about the form of function?

To answer this question, we will explore a non-parametric technique called Forest Based Classification and Regression Trees Model. Forest based models are good black box models that map non-linear and complex relationships quite well and are not influenced by outliers and missing values to an extent.

Forest Based Classification and Regression Trees creates an ensemble of decision trees. Each decision tree is created using randomly generated portions of data. Each tree generates its own prediction and votes on an outcome. The forest model considers votes from all decision trees to predict or classify the outcome. The model will output a variable importance table to identify importance of each variable given to the model.

Leran more about Forest Based Classification and Regression Trees.

Build Forest Based CART Model¶

To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders. The output of the model will be stored in a file geodatabase which can then be accessed to visualize the results.

Here, we created a 'Train Only' model with all predictors. The model is trained on 70% data and 30% data is excluded for validation. The model was run for 5 iterations.

Note that increasing the number of trees in the forest model will result in more accurate model prediction, but the model will take longer to calculate. Learn more about the parameters used and outputs generated by this model here.

from arcpy.stats import Forest

# Workspace
arcpy.env.workspace
lyr_url = allprovider_layer.url

predictors = [['amerind_pop','false'],['asian_pop','false'],['asso_deg','false'],['avg_healthcare','false'],
              ['avg_healthinsurance','false'],['avg_hhinc','false'],['avg_hhsz','false'],['avg_labtest','false'],
              ['avg_medicalcare','false'],['avg_medicalsrvc','false'],['avg_medicare','false'],['avg_personalinsurance','false'],
              ['avg_presdrug','false'],['avg_socsecurity','false'],['bach_deg','false'],['grad_deg','false'],
              ['hisp_pop','false'],['median_age','false'],['minority_pop','false'],['otherace_pop','false'],
              ['percap_income','false'],['pop_density','false'],['some_college','false'],['unemp_rate','false'],
              ['white_pop','false']]

# Define model variables
prediction_type = 'TRAIN'
in_features = lyr_url
variable_predict = 'provider_count'
treat_variable_as_categorical = None
explanatory_variables = predictors
distance_features = None
explanatory_rasters = None
features_to_predict = None
output_features = None
output_raster = None
explanatory_variable_matching = None
explanatory_distance_matching = None
explanatory_rasters_matching = None
output_trained_features = 'RF_Output_Result1'
output_importance_table = 'RF_VImp_Result1'
use_raster_values = True
number_of_trees = 100
minimum_leaf_size = None
maximum_depth = None
sample_size = 100
random_variables = None
percentage_for_training = 30
output_classification_table = None
output_validation_table = None
compensate_sparse_categories = False
number_validation_runs = 5
calculate_uncertainty = False

# Run Model
Forest(prediction_type,in_features, variable_predict, treat_variable_as_categorical, explanatory_variables, distance_features, 
       explanatory_rasters, features_to_predict, output_features, output_raster, explanatory_variable_matching,
      explanatory_distance_matching, explanatory_rasters_matching, output_trained_features, output_importance_table,
      use_raster_values, number_of_trees, minimum_leaf_size, maximum_depth, sample_size, random_variables, 
       percentage_for_training, output_classification_table, output_validation_table, compensate_sparse_categories, 
       number_validation_runs, calculate_uncertainty)

Get Variable Importance¶

# RF Vimp Table
rf_df = pd.DataFrame.spatial.from_table(os.path.join(arcpy.env.workspace,output_importance_table))
rf_df.head()

rf_df.shape

(5, 27)

Plot Variable Importance¶

data = rf_df.drop(['OBJECTID','BEST_ITE'],axis=1)
data

# Plot Importance
sns.set(rc={'figure.figsize':(15,8.27)})
data = rf_df.drop(['OBJECTID','BEST_ITE'],axis=1)
g = sns.boxplot(data = data)
plt.xticks(rotation=90);

We ran 5 iterations of Random Forest model. The box plot shows variable importance of each variable and how importance varies for each iteration.

Plot Best Iteration¶

# Create dataframe for Best Iteration
best = rf_df[rf_df['BEST_ITE']=='Best Iteration']
best

# Create Plot
sns.set(rc={'figure.figsize':(15,8.27)})
sns.barplot(data=best, color='lightcoral')
plt.xticks(rotation=90)
plt.title('Variable Importance - Best Iteration', fontsize=16);

The barplot shows variable importance of each variable for the best iteration. These variables will now be used in the next step of our analysis to understand how they vary with respect to provider count accross various counties in U.S.

Local Bivariate Relationships (LBR) Model¶

Random Forest model generates variable importance and gives us important variables. However, this model does not tell us:

the type of relationship and
significance of relation of a predictor variable with respect to provider count.

To understand the type and significance of the relations of a predictor variable with respect to provider count at the County level, we will explore Local Bivariate Relations (LBR) Model.

Local Bivariate Relations analyzes two variables for statistically significant relationships using local entropy. The output can be used to visualize the type of relationship between two variables and explore how their relationship changes across counties.

We will use Local Bivariate Relations to study how relationship of important variables determined by Random Forest model change with respect to Provider Count accross various counties. Learn more about Local Bivariate Relationships.

Two of the top five most important variables generated by Forest Based CART model are Graduate Degree and White Population. Let's build LBR models for Provider Count w.r.t these variables.

Build Model - Provider Count and Graduate Degree¶

To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders. The output of the model will be stored in a file geodatabase which can then be accessed to visualize the results.

from arcpy.stats import LocalBivariateRelationships

# Workspace
arcpy.env.workspace

# Get url of layer
lyr_url = allprovider_layer.url

# Define model variables
in_features = lyr_url
dependent_variable = 'provider_count'
explanatory_variable = 'grad_deg'
output_features = 'allprovider_grad_deg_LBR_Results1'
number_of_neighbors = 50
number_of_permutations = 99
enable_local_scatterplot_popups = 'NO_POPUP'
level_of_confidence = '95%'
apply_false_discovery_rate_fdr_correction = 'APPLY_FDR'
scaling_factor = 0.5

# Run Model
LocalBivariateRelationships(in_features, dependent_variable,explanatory_variable,output_features,number_of_neighbors,
                           number_of_permutations, enable_local_scatterplot_popups, level_of_confidence, 
                            apply_false_discovery_rate_fdr_correction, scaling_factor)

Access Model Results as Spatially Enabled Dataframe¶

# Access GWR data from local
lbr_graddeg = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
lbr_graddeg.head()

lbr_graddeg.shape

(3139, 20)

lbr_graddeg.LBR_TYPE.value_counts()

Positive Linear    1173
Convex             1085
Concave             881
Name: LBR_TYPE, dtype: int64

Plot Results on Map¶

# Create Map
lbr_graddeg_map = gis.map('USA', 4)
lbr_graddeg_map

The map shows how relationship of provider count varies with graduate degree holders for different counties. We can see:

Majority of the counties from North Dakota to Texas share a Convex or Positive Linear relation. This means that provider count increase with increasing graduate degree holders for these counties either linearly or polynomially.

Majority of East and West coast share a concave relation whcih means that provider count increases with graduater degree holders to a certain extent and then plateaus off at some point.

lbr_graddeg_map.remove_layers()
lbr_graddeg.spatial.plot(map_widget=lbr_graddeg_map,
        renderer_type='u',  # for unique value renderer
        col='LBR_TYPE',  # numeric column to classify
        cmap='prism',  # color map to pick colors from for each class
        alpha=0.7,
        line_width=0.05) # specify opacity

True

lbr_graddeg_map.legend = True

Build Model - Provider Count and White Population¶

from arcpy.stats import LocalBivariateRelationships

# Workspace
arcpy.env.workspace

# Get url of layer
lyr_url = allprovider_layer.url

# Define model variables
in_features = lyr_url
dependent_variable = 'provider_count'
explanatory_variable = 'white_pop'
output_features = 'allprovider_white_pop_LBR_Results1'
number_of_neighbors = 50
number_of_permutations = 99
enable_local_scatterplot_popups = 'NO_POPUP'
level_of_confidence = '95%'
apply_false_discovery_rate_fdr_correction = 'APPLY_FDR'
scaling_factor = 0.5

# Run Model
LocalBivariateRelationships(in_features, dependent_variable,explanatory_variable,output_features,number_of_neighbors,
                           number_of_permutations, enable_local_scatterplot_popups, level_of_confidence, 
                            apply_false_discovery_rate_fdr_correction, scaling_factor)

Access Model Results as Spatially Enabled Dataframe¶

# Access GWR data from local
lbr_whitepop = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
lbr_whitepop.head()

lbr_whitepop.shape

(3139, 20)

# Counts of different relationship types
lbr_whitepop.LBR_TYPE.value_counts()

Convex             1938
Positive Linear    1047
Concave             154
Name: LBR_TYPE, dtype: int64

Plot Results on Map¶

# Create Map
lbr_whitepop_map = gis.map('USA', 4)
lbr_whitepop_map

The map shows how relationship of provider count varies with white population for different counties. We can see:

For most of the counties in United States, the relationship is either Convex or Positive Linear which means that provider count increase with increasing white population either linearly or polynomially.

Southern parts of New Mexico, some counties in North Texas, Wyoming and Idaho have a concave relationship. This means that provider count increases to a certain extent and then plateaus off at some point.

lbr_whitepop_map.remove_layers()
lbr_whitepop.spatial.plot(map_widget=lbr_whitepop_map,
        renderer_type='u',  # for class breaks renderer
        col='LBR_TYPE',  # numeric column to classify
        cmap='prism',  # color map to pick colors from for each class
        alpha=0.7,
        line_width=0.05) # specify opacity

True

lbr_whitepop_map.legend = True

Summary¶

To summarize, we:

Started with a Base (OLS) Regression model using all predictors:
- Performed feature selection using Lasso regularization technique.
- Created a Global (OLS) Regression model using selected features. The Global model helped us identify relationship between predictor variables and provider count at the country level.

Next, we created a Geographically Weighted Regression (GWR) model, to understand how different variables impact provider count accross different counties.
- Here we plotted how provider count varies with average prescription drug prices and average household size across different counties.

To investigate non-linear relationship, we explored a non-parametric technique called Forest Based Classification and Regression Trees model.
- The model returned variable importance of each variable.

To understand the type and significance of relationship of an individual predictor with respect to provider count, we created a Local Bivariate Relations Model.

Conclusion¶

Studies [10] [11] have shown that shortages of health providers limit patients' access to required treatment. In addition to provider supply, sociodemographic factors such as age, race, income and health insurance, are associated with access to care and health outcomes [12] [13].

Public health stakeholders use numerous datasets to derive insights into the prevalence of healthcare providers and the impact of social determinants of health within their geographic service areas. The volume of available data can make it difficult for these stakeholders to determine how to best reach underserved populations [14].

Our analysis integrates publicly available sociodemographic and provider location datasets [15] and uses advanced data analysis, visualization and machine learning techniques to show the variation of provider supply over different geographical areas along with a summary of the most distinctive and relevant sociodemographic features that are predictive of the variation in the selected area.

By better understanding these social determinants of health in their geographies, both government and non-governmental healthcare leaders can prioritize interventions around addressing underserved areas with healthcare provider shortages.

The analysis is designed to be easily extended with additional features or adapted to other provider specialties. The techniques defined in this analysis can support further data integrations including Medicare, Medicaid, and commercial insurance claims data to better understand actual patient utilization behavior and improve models.

To conclude, in this 4-part study we use Dask, ArcGIS API for Python, ArcGIS WebGIS stack, various spatial analysis tools and machine learning techniques to:

Read, clean and geocode ~ 5.7 million records of healthcare provider data.
Identify shortage areas for Overall, Mental health and OB-GYN healthcare providers in the U.S.
Identify sociodemographic and economic factors that influence access to providers.
Identify the influence of factors on access to providers.

References¶

[1] https://en.wikipedia.org/wiki/Ordinary_least_squares [2] https://online.stat.psu.edu/stat501/lesson/7/7.3
[3] https://sites.google.com/site/statisticsforspreadsheets/regression-modeling/influence-plot
[4] https://online.stat.psu.edu/stat501/lesson/10/10.2
[5] https://bookdown.org/max/FES/recursive-feature-elimination.html
[6] https://towardsdatascience.com/ridge-lasso-and-elastic-net-regularization-7861ca575c64
[7] https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f
[8] Feng C, Wang H, Lu N, et al. Log-transformation and its implications for data analysis [published correction appears in Gen Psychiatr. 2019 Sep 6;32(5):e100146corr1]. Shanghai Arch Psychiatry. 2014;26(2):105-109. doi:10.3969/j.issn.1002-0829.2014.02.009 [Link]
[9] https://www.cscu.cornell.edu/news/statnews/stnews83.pdf
[10] Institute of Medicine (US) National Cancer Policy Forum. Ensuring Quality Cancer Care through the Oncology Workforce: Sustaining Care in the 21st Century: Workshop Summary. Washington (DC): National Academies Press (US); 2009. Supply and Demand in the Health Care Workforce. Available from: https://www.ncbi.nlm.nih.gov/books/NBK215247/
[11] Kullgren J, McLaughlin C, Mitra N, Armstrong K. Nonfinancial barriers and access to care for U.S. adults. Health Serv Res. 2012;47(1):462–485. doi: 10.1111/j.1475-6773.2011.01308.x. [PubMed]
[12] Sorkin, D.H., Ngo-Metzger, Q. & De Alba, I. Racial/Ethnic Discrimination in Health Care: Impact on Perceived Quality of Care. J GEN INTERN MED 25, 390–396 (2010). https://doi.org/10.1007/s11606-010-1257-5
[13] Reiners F, Sturm J, Bouw LJW, Wouters EJM. Sociodemographic Factors Influencing the Use of eHealth in People with Chronic Diseases. Int J Environ Res Public Health. 2019;16(4):645. Published 2019 Feb 21. doi:10.3390/ijerph16040645
[14] Community Health Needs Assessment PeaceHealth (2016). Sacred Heart Medical Center. https://www.peacehealth.org/about-peacehealth/Pages/community-health-needs-assessment
[15] https://www.cms.gov/Regulations-and-Guidance/Administrative-Simplification/NationalProvIdentStand/DataDissemination.html

	SHAPE	SHAPE__Area	SHAPE__Length	amerind_pop	asian_pop	asso_deg	avg_healthcare	avg_healthinsurance	avg_hhinc	avg_hhsz	avg_labtest	avg_medicalcare	avg_medicalsrvc	avg_medicare	avg_personalinsurance	avg_presdrug	avg_socsecurity	bach_deg	black_pop	county	edubase	grad_deg	hisp_pop	median_age	minority_pop	objectid	otherace_pop	percap_income	pop_density	provider_count	some_college	state	total_population	unemp_rate	white_pop
0	{"rings": [[[-8966174.9526, 4783327.065400004]...	1.386118e+09	181962.662239	26	33	363	4466.09	2894.59	51537	2.42	52.65	1571.51	794.24	632.30	270.40	132.74	4548.75	695	201	Doddridge County	6405	368	91	43.2	467	1	5	20870	27.7	33	1048	WV	8841	5.8	8454
1	{"rings": [[[-9053514.349800002, 4605160.99729...	2.782337e+09	338942.950065	112	101	2901	4258.47	2763.71	50170	2.37	49.59	1494.76	759.18	592.84	257.88	124.51	4436.31	3215	2000	Fayette County	33677	1731	540	44.7	3456	2	106	20953	69.2	407	5473	WV	45748	4.7	42649
2	{"rings": [[[-8986636.588199995, 4735386.19310...	1.453919e+09	190697.949972	53	51	336	4182.28	2716.19	50297	2.35	47.76	1466.09	745.24	579.07	248.70	122.37	4427.67	632	1044	Gilmer County	6106	460	479	39.0	1720	3	209	20296	24.5	54	1272	WV	8294	7.6	6772
3	{"rings": [[[-8792002.250900002, 4755765.97959...	2.067679e+09	253575.180085	19	26	731	4382.40	2848.84	52611	2.40	50.31	1533.57	784.03	596.32	272.29	125.52	4723.08	804	118	Grant County	9296	655	171	46.9	453	4	82	21869	26.2	158	1270	WV	12489	3.8	12092
4	{"rings": [[[-9814179.0036, 5448666.451099999]...	3.612248e+09	304874.767338	501	1882	8678	5281.24	3469.30	72853	2.38	57.14	1811.94	974.64	594.38	357.65	124.23	6907.69	11652	1953	Fond du Lac County	73127	5425	5716	41.6	11180	5	2234	30109	144.3	1444	14708	WI	103861	2.1	95613

	amerind_pop	asian_pop	asso_deg	avg_healthcare	avg_healthinsurance	avg_hhinc	avg_hhsz	avg_labtest	avg_medicalcare	avg_medicalsrvc	avg_medicare	avg_personalinsurance	avg_presdrug	avg_socsecurity	bach_deg	black_pop	county	edubase	grad_deg	hisp_pop	median_age	minority_pop	otherace_pop	percap_income	pop_density	provider_count	some_college	state	total_population	unemp_rate	white_pop
1198	73559	1517296	469649	5905.52	3907.44	94861	3.00	59.76	1998.08	1132.94	579.43	399.93	118.60	9046.16	1445895	850350	Los Angeles County	6899087	782758	5043293	35.7	7644196	2305030	31563	2535.5	188179.0	1300786	CA	10288937	4.5	5019340
2784	21113	395712	237028	5692.87	3774.06	88254	2.58	56.05	1918.81	1061.74	585.60	381.05	122.50	8356.49	830433	1233716	Cook County	3607345	552050	1374256	36.6	3078944	608672	34083	5579.2	104037.0	690225	IL	5274129	6.9	2859209
2459	8756	212844	48360	8035.56	5382.10	136860	2.00	69.80	2653.46	1464.96	796.17	530.01	167.61	12793.70	406878	247085	New York County	1251653	371255	441304	37.9	899778	193191	66805	72736.2	80839.0	121070	NY	1660472	3.9	924396
1614	98584	184150	241714	5574.19	3675.76	83005	2.69	57.84	1898.43	1041.69	581.95	362.02	123.67	7908.62	597653	253576	Maricopa County	2891837	342159	1373153	35.8	1981147	598035	30701	476.9	71069.0	711033	AZ	4387226	5.3	3066684
1405	30001	341640	193511	5800.32	3840.81	89252	2.85	60.64	1959.50	1087.37	564.48	374.86	119.69	8547.08	605813	901459	Harris County	3029538	345886	2035551	33.5	3339578	712410	31405	2780.1	70953.0	606842	TX	4735852	5.9	2573473

	amerind_pop	asian_pop	asso_deg	avg_healthcare	avg_healthinsurance	avg_hhinc	avg_hhsz	avg_labtest	avg_medicalcare	avg_medicalsrvc	avg_medicare	avg_personalinsurance	avg_presdrug	avg_socsecurity	bach_deg	black_pop	edubase	grad_deg	hisp_pop	median_age	minority_pop	otherace_pop	percap_income	pop_density	some_college	total_population	unemp_rate	white_pop
1198	73559	1517296	469649	5905.52	3907.44	94861	3.00	59.76	1998.08	1132.94	579.43	399.93	118.60	9046.16	1445895	850350	6899087	782758	5043293	35.7	7644196	2305030	31563	2535.5	1300786	10288937	4.5	5019340
2784	21113	395712	237028	5692.87	3774.06	88254	2.58	56.05	1918.81	1061.74	585.60	381.05	122.50	8356.49	830433	1233716	3607345	552050	1374256	36.6	3078944	608672	34083	5579.2	690225	5274129	6.9	2859209
2459	8756	212844	48360	8035.56	5382.10	136860	2.00	69.80	2653.46	1464.96	796.17	530.01	167.61	12793.70	406878	247085	1251653	371255	441304	37.9	899778	193191	66805	72736.2	121070	1660472	3.9	924396
1614	98584	184150	241714	5574.19	3675.76	83005	2.69	57.84	1898.43	1041.69	581.95	362.02	123.67	7908.62	597653	253576	2891837	342159	1373153	35.8	1981147	598035	30701	476.9	711033	4387226	5.3	3066684
1405	30001	341640	193511	5800.32	3840.81	89252	2.85	60.64	1959.50	1087.37	564.48	374.86	119.69	8547.08	605813	901459	3029538	345886	2035551	33.5	3339578	712410	31405	2780.1	606842	4735852	5.9	2573473

Dep. Variable:	provider_count	R-squared:	0.972
Model:	OLS	Adj. R-squared:	0.972
Method:	Least Squares	F-statistic:	3863.
Date:	Mon, 17 Aug 2020	Prob (F-statistic):	0.00
Time:	14:21:29	Log-Likelihood:	-26469.
No. Observations:	3139	AIC:	5.300e+04
Df Residuals:	3110	BIC:	5.317e+04
Df Model:	28
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	2632.0977	840.995	3.130	0.002	983.137	4281.059
amerind_pop	-0.0445	0.009	-5.236	0.000	-0.061	-0.028
asian_pop	-0.0838	0.006	-14.154	0.000	-0.095	-0.072
asso_deg	0.0251	0.011	2.259	0.024	0.003	0.047
avg_healthcare	468.7517	3996.627	0.117	0.907	-7367.543	8305.047
avg_healthinsurance	-463.0344	3996.586	-0.116	0.908	-8299.249	7373.180
avg_hhinc	0.2111	0.128	1.647	0.100	-0.040	0.462
avg_hhsz	-134.0568	252.876	-0.530	0.596	-629.878	361.765
avg_labtest	-106.1273	52.970	-2.004	0.045	-209.987	-2.267
avg_medicalcare	-460.9320	3997.058	-0.115	0.908	-8298.073	7376.209
avg_medicalsrvc	-1.7558	14.157	-0.124	0.901	-29.513	26.002
avg_medicare	4.8259	12.003	0.402	0.688	-18.709	28.361
avg_personalinsurance	3.0744	8.394	0.366	0.714	-13.385	19.534
avg_presdrug	-133.4695	64.206	-2.079	0.038	-259.360	-7.579
avg_socsecurity	-4.7149	1.430	-3.297	0.001	-7.519	-1.911
bach_deg	-0.0004	0.004	-0.091	0.928	-0.009	0.008
black_pop	-0.0576	0.005	-12.104	0.000	-0.067	-0.048
edubase	0.0574	0.004	16.112	0.000	0.050	0.064
grad_deg	0.0798	0.005	17.414	0.000	0.071	0.089
hisp_pop	-0.0880	0.012	-7.570	0.000	-0.111	-0.065
median_age	-52.1251	10.225	-5.098	0.000	-72.174	-32.076
minority_pop	0.0954	0.012	7.895	0.000	0.072	0.119
otherace_pop	0.0370	0.011	3.304	0.001	0.015	0.059
percap_income	0.1126	0.025	4.544	0.000	0.064	0.161
pop_density	0.3258	0.022	15.136	0.000	0.284	0.368
some_college	-0.0116	0.005	-2.309	0.021	-0.021	-0.002
total_population	-0.0610	0.010	-5.978	0.000	-0.081	-0.041
unemp_rate	-1.7981	9.948	-0.181	0.857	-21.304	17.707
white_pop	0.0323	0.010	3.168	0.002	0.012	0.052

Omnibus:	1740.052	Durbin-Watson:	1.824
Prob(Omnibus):	0.000	Jarque-Bera (JB):	397367.718
Skew:	1.487	Prob(JB):	0.00
Kurtosis:	58.039	Cond. No.	1.88e+08

	count	mean	std	min	25%	50%	75%	max
amerind_pop	3110.0	925.189389	3253.585408	0.00	61.0000	160.000	523.5000	56871.00
asian_pop	3110.0	3841.547267	20680.860712	0.00	48.0000	162.000	803.7500	501246.00
asso_deg	3110.0	5340.767524	12191.720275	5.00	647.2500	1596.500	4428.7500	193511.00
avg_healthcare	3110.0	5078.543723	947.949453	2821.51	4437.6100	4927.990	5558.7475	10690.41
avg_healthinsurance	3110.0	3318.752814	625.270951	1871.50	2899.4575	3221.570	3633.6175	6873.42
avg_hhinc	3110.0	66116.593891	15742.065759	36078.00	55847.7500	63405.500	72196.2500	162801.00
avg_hhsz	3110.0	2.484016	0.218515	1.96	2.3500	2.460	2.5700	4.34
avg_labtest	3110.0	57.226193	12.376326	27.75	49.0125	54.795	63.1475	149.41
avg_medicalcare	3110.0	1759.790894	324.316111	950.01	1541.2025	1705.395	1921.3075	3816.99
avg_medicalsrvc	3110.0	926.770312	186.200670	496.89	799.9600	897.325	1020.8125	1955.37
avg_medicare	3110.0	628.224878	110.561226	342.05	556.8850	610.050	682.0225	1501.33
avg_personalinsurance	3110.0	326.753232	74.904571	162.69	276.3250	315.235	362.0075	791.45
avg_presdrug	3110.0	131.243627	21.053413	73.65	117.8525	128.335	141.7175	298.34
avg_socsecurity	3110.0	6112.216814	1642.662502	3121.71	5053.2750	5813.570	6727.9125	16133.33
bach_deg	3110.0	11675.685209	33447.991357	2.00	846.0000	2118.000	6539.2500	605813.00
black_pop	3110.0	11506.624759	44758.931850	0.00	121.0000	840.000	5567.5000	901459.00
edubase	3110.0	61346.768489	149479.894659	69.00	7912.5000	18437.500	46770.7500	3029538.00
grad_deg	3110.0	7112.766881	21273.455845	0.00	420.2500	1141.500	3759.2500	345886.00
hisp_pop	3110.0	14253.300322	73794.908292	0.00	378.0000	1128.000	5182.7500	2035551.00
median_age	3110.0	41.960514	5.362405	23.20	38.7250	41.900	45.2000	62.50
minority_pop	3110.0	32202.119614	125773.042120	4.00	1270.2500	4461.000	15167.5000	3339578.00
otherace_pop	3110.0	5257.274920	27156.607632	0.00	109.0000	422.000	1985.7500	712410.00
percap_income	3110.0	26343.677170	6071.379975	9853.00	22218.2500	25470.500	29272.2500	66766.00
pop_density	3110.0	193.653055	611.577761	0.00	17.1000	45.600	114.7000	11869.90
provider_count	3110.0	1498.314469	4319.913817	1.00	88.0000	263.000	935.7500	70953.00
some_college	3110.0	12867.820257	30879.598848	22.00	1666.2500	3847.000	10220.7500	606842.00
total_population	3110.0	90102.121543	223536.760943	82.00	11248.5000	26225.500	67613.5000	4735852.00
unemp_rate	3110.0	4.622669	2.548112	0.00	2.9000	4.300	5.8000	27.90
white_pop	3110.0	65542.636334	141893.693591	65.00	8914.0000	21478.000	56310.5000	2573473.00

	count	mean	std	min	25%	50%	75%	max
amerind_pop	3060.0	938.172549	3277.662837	1.00	64.0000	166.000	534.7500	56871.00
asian_pop	3060.0	3904.242484	20843.326126	1.00	51.0000	167.000	823.7500	501246.00
asso_deg	3060.0	5426.352614	12272.396268	5.00	683.0000	1645.000	4518.0000	193511.00
avg_healthcare	3060.0	5070.785444	939.047067	2821.51	4434.8850	4920.420	5556.8650	10125.26
avg_healthinsurance	3060.0	3314.343611	620.278570	1871.50	2897.2850	3218.950	3626.4725	6711.29
avg_hhinc	3060.0	66136.494444	15747.147692	36078.00	55877.0000	63487.500	72235.5000	162801.00
avg_hhsz	3060.0	2.486990	0.216719	1.96	2.3600	2.460	2.5800	4.34
avg_labtest	3060.0	56.990644	12.038398	27.75	48.9300	54.730	62.8500	118.65
avg_medicalcare	3060.0	1756.441820	320.330034	950.01	1539.2550	1704.200	1919.3575	3413.96
avg_medicalsrvc	3060.0	925.372297	184.806441	496.89	799.6600	895.695	1019.7550	1930.73
avg_medicare	3060.0	626.078137	107.524692	342.05	555.9500	608.515	679.6750	1192.23
avg_personalinsurance	3060.0	326.419869	74.697177	162.69	276.0600	315.110	361.7850	791.45
avg_presdrug	3060.0	130.870804	20.516788	73.65	117.7375	128.065	141.4075	236.92
avg_socsecurity	3060.0	6116.741948	1645.083037	3121.71	5059.6025	5817.650	6734.8375	16133.33
bach_deg	3060.0	11863.987582	33687.509892	38.00	895.7500	2192.000	6773.7500	605813.00
black_pop	3060.0	11694.084641	45099.006112	1.00	132.0000	893.500	5684.5000	901459.00
edubase	3060.0	62331.507190	150496.226005	444.00	8389.5000	18912.500	47753.0000	3029538.00
grad_deg	3060.0	7227.982026	21427.345484	11.00	442.0000	1192.500	3880.5000	345886.00
hisp_pop	3060.0	14483.974837	74373.297150	2.00	394.0000	1179.000	5362.7500	2035551.00
median_age	3060.0	41.861307	5.298345	23.20	38.7000	41.800	45.0000	62.50
minority_pop	3060.0	32722.983007	126730.157837	14.00	1355.7500	4659.000	15646.7500	3339578.00
otherace_pop	3060.0	5342.442810	27369.400986	1.00	117.0000	440.000	2042.0000	712410.00
percap_income	3060.0	26313.961438	6055.021104	9853.00	22179.0000	25453.500	29277.7500	66766.00
pop_density	3060.0	196.779967	616.061929	0.10	18.3000	46.550	117.7750	11869.90
provider_count	3060.0	1522.592157	4350.862736	1.00	93.0000	274.000	978.2500	70953.00
some_college	3060.0	13073.895752	31088.465733	107.00	1740.0000	3959.000	10384.2500	606842.00
total_population	3060.0	91549.904248	225066.638803	623.00	11819.2500	26956.000	68665.5000	4735852.00
unemp_rate	3060.0	4.643072	2.524086	0.10	3.0000	4.300	5.8000	27.90
white_pop	3060.0	66593.150327	142808.374409	205.00	9355.0000	22318.500	57582.7500	2573473.00

	minority_pop	grad_deg	total_population	edubase	percap_income	asso_deg	avg_medicare	pop_density	avg_hhsz	white_pop	unemp_rate	median_age	otherace_pop	some_college	avg_presdrug	avg_socsecurity	asian_pop	bach_deg	black_pop	hisp_pop
1405	3339578	345886.0	4735852	3029538	31405	193511	564.48	2780.1	2.85	2573473	5.9	33.5	712410.0	606842	119.69	8547.08	341640.0	605813	901459.0	2035551.0
1698	1844110	331783.0	3344185	2246969	36291	191669	649.22	795.0	2.77	2056884	3.9	35.7	483908.0	487496	135.09	9730.65	401095.0	525473	168807.0	1144104.0
1348	891019	320833.0	2199247	1552376	47839	126421	720.68	1039.6	2.42	1401488	5.0	38.3	94894.0	278602	150.11	11292.96	395459.0	484710	146077.0	216100.0
209	1890363	193333.0	2654282	1709385	29987	102480	534.73	3046.4	2.76	1335362	4.8	33.6	436469.0	332724	113.62	7832.26	170268.0	337376	608225.0	1072941.0
1355	1224672	159309.0	1901425	1358758	31438	138229	585.04	1571.7	2.56	1120747	5.3	41.0	83078.0	263467	124.13	7615.72	70007.0	279433	555082.0	574151.0

Omnibus:	174.445	Durbin-Watson:	1.753
Prob(Omnibus):	0.000	Jarque-Bera (JB):	673.039
Skew:	0.116	Prob(JB):	7.10e-147
Kurtosis:	5.286	Cond. No.	7.86e+03

	OBJECTID	SOURCE_ID	provider_count	asian_pop	amerind_pop	avg_hhsz	black_pop	hisp_pop	median_age	minority_pop	otherace_pop	percap_income	unemp_rate	white_pop	some_college	grad_deg	avg_healthinsurance	avg_medicare	avg_labtest	avg_presdrug	avg_personalinsurance	INTRCPT	SE_INTRCPT	C_ASIAN_POP	SE_ASIAN_POP	C_AMERIND_POP	SE_AMERIND_POP	C_AVG_HHSZ	SE_AVG_HHSZ	C_BLACK_POP	SE_BLACK_POP	C_HISP_POP	SE_HISP_POP	C_MEDIAN_AGE	SE_MEDIAN_AGE	C_MINORITY_POP	SE_MINORITY_POP	C_OTHERACE_POP	SE_OTHERACE_POP	C_PERCAP_INCOME	SE_PERCAP_INCOME	C_UNEMP_RATE	SE_UNEMP_RATE	C_WHITE_POP	SE_WHITE_POP	C_SOME_COLLEGE	SE_SOME_COLLEGE	C_GRAD_DEG	SE_GRAD_DEG	C_AVG_HEALTHINSURANCE	SE_AVG_HEALTHINSURANCE	C_AVG_MEDICARE	SE_AVG_MEDICARE	C_AVG_LABTEST	SE_AVG_LABTEST	C_AVG_PRESDRUG	SE_AVG_PRESDRUG	C_AVG_PERSONALINSURANCE	SE_AVG_PERSONALINSURANCE	PREDICTED	RESIDUAL	STDRESID	INFLUENCE	COOKS_D	CND_NUMBER	LOCALR2	SHAPE
0	1	1	1044.0	1701	423	2.85	12062	3945	41.0	20664	886	41430	5.6	75625	16211	8931	5227.44	767.21	83.06	157.57	592.98	7083.320304	2163.643990	0.310213	0.034698	0.990575	0.151966	-4103.182427	1022.086678	0.361931	0.029911	0.352328	0.028752	97.160179	30.503968	-0.303474	0.031008	-0.182483	0.025863	-0.356893	0.069323	-10.183314	37.897324	0.029947	0.002078	-0.138177	0.016391	0.077414	0.005721	13.028243	2.624787	30.874419	13.078704	-262.776975	48.454211	-209.364470	65.866365	-36.852385	9.987946	638.394954	405.605046	0.630561	0.113512	1.214450e-04	5.690942e+07	0.973956	{"rings": [[[-8529642.2486, 4688391.039899997]...
1	2	2	5.0	6	3	2.40	895	8	37.9	913	2	16066	5.2	494	199	45	2149.61	475.83	36.65	102.27	191.88	-1715.168222	2993.517884	-0.458553	0.134184	-0.362877	0.178817	664.571445	1007.225468	-0.330165	0.121548	-0.246908	0.115457	-8.918920	31.196270	0.343730	0.123639	-0.231777	0.053857	0.072423	0.118846	-26.403059	22.367435	0.000929	0.007559	0.015853	0.042727	0.139246	0.034656	3.947265	4.227333	56.098070	28.122803	-224.833764	112.938033	-231.319434	132.906664	-20.224481	20.583998	-9.689861	14.689861	0.023290	0.147643	2.241195e-07	3.286173e+07	0.974494	{"rings": [[[-10120538.0346, 3896326.054200001...
2	3	3	621.0	421	168	2.47	1422	2977	40.0	5873	882	24981	5.5	51908	7627	2153	3081.77	547.19	51.40	114.41	313.28	3784.309355	2761.277425	-0.214660	0.068190	-0.482683	0.178969	-47.118911	993.920916	-0.075986	0.057489	-0.166814	0.044674	-31.561083	28.793589	0.080670	0.055741	0.148595	0.050896	-0.049633	0.105370	-10.370135	29.855035	-0.016545	0.003015	0.102708	0.014245	0.292671	0.014161	0.032765	3.547578	25.490189	20.667847	-114.990943	82.141477	-92.796269	102.054007	4.643850	16.127691	468.338221	152.661779	0.227026	0.031206	3.960186e-06	1.113030e+08	0.995132	{"rings": [[[-9221715.5229, 5050074.560699999]...
3	4	4	79.0	258	69	2.77	558	1307	40.4	2464	759	21821	3.1	16894	2043	574	3042.93	557.61	47.81	122.89	289.06	1903.340959	2856.660109	0.182783	0.157086	0.206154	0.184325	177.776565	1150.078324	0.191140	0.155578	0.095773	0.179583	-50.293477	30.293061	-0.188790	0.156154	0.162334	0.091728	0.103288	0.120806	-0.971110	30.753106	0.004361	0.008765	0.065683	0.047084	0.116723	0.026147	-2.651421	4.443284	14.083429	22.637054	-108.647983	107.579662	-21.039989	111.969049	17.232070	18.195763	-173.106408	252.106408	0.400106	0.149376	6.705767e-05	6.998668e+07	0.988206	{"rings": [[[-9290699.0981, 4093591.7991999984...
4	5	5	1210.0	1727	307	2.81	49467	9491	37.8	62154	4497	26350	6.1	31300	11944	5462	3356.61	520.50	54.48	109.17	344.50	863.913879	2593.718643	0.288447	0.170895	0.365302	0.233586	199.681335	1023.574012	0.299485	0.172295	0.246918	0.210308	-35.740864	29.912414	-0.286954	0.172091	0.051971	0.120297	0.064150	0.111697	4.024449	29.745940	0.011930	0.010230	0.020039	0.053421	0.100307	0.029699	-0.201878	4.513428	27.416702	24.186869	-196.397368	120.138953	-84.888220	117.761353	11.976873	17.869470	868.152228	341.847772	0.540363	0.142536	1.157803e-04	5.948622e+07	0.991026	{"rings": [[[-9341355.0324, 3994505.8356000036...

id	value
0
1
2	./ArcGIS\home\Results.gdb\RF_Output_Result1
3	.\ArcGIS\home\Results.gdb\RF_VImp_Result1
4
5
6

	OBJECTID	ASIAN_POP	AMERIND_POP	AVG_HHSZ	AVG_HHINC	HISP_POP	MEDIAN_AGE	MINORITY_POP	OTHERACE_POP	PERCAP_INCOME	POP_DENSITY	UNEMP_RATE	WHITE_POP	SOME_COLLEGE	ASSO_DEG	BACH_DEG	GRAD_DEG	AVG_HEALTHCARE	AVG_HEALTHINSURANCE	AVG_MEDICARE	AVG_MEDICALCARE	AVG_MEDICALSRVC	AVG_LABTEST	AVG_PRESDRUG	AVG_PERSONALINSURANCE	AVG_SOCSECURITY	BEST_ITE
0	1	1.936236e+09	3.626429e+08	3.493813e+08	1.125670e+08	1.250103e+09	1.173539e+08	4.140227e+09	6.644813e+08	1.448598e+08	7.810601e+08	1.923894e+08	7.744006e+09	7.352001e+09	7.441370e+09	1.640525e+10	1.491561e+10	8.568482e+07	7.602678e+07	1.409397e+08	1.779115e+08	1.250645e+08	1.550210e+08	1.979505e+08	1.010077e+08	8.123558e+07	Iterations
1	2	2.433282e+09	3.596755e+08	5.197042e+08	8.798179e+07	7.996029e+08	8.551507e+07	3.733401e+09	1.230084e+09	2.092976e+08	8.595931e+08	2.186328e+08	7.824331e+09	7.049030e+09	3.981829e+09	1.680595e+10	1.614554e+10	9.478313e+07	1.280119e+08	8.784919e+07	8.976733e+07	8.299869e+07	1.931147e+08	1.128664e+08	1.231691e+08	1.423122e+08	Iterations
2	3	6.599357e+09	3.059011e+09	3.728035e+08	1.275330e+08	3.492286e+09	2.321379e+08	7.863340e+09	2.919583e+09	6.579626e+08	6.159041e+08	1.683346e+08	1.039170e+10	1.243282e+10	7.768626e+09	1.736604e+10	2.124281e+10	1.431652e+08	1.175526e+08	1.172117e+08	1.422636e+08	1.450450e+08	1.939403e+08	1.784143e+08	1.297512e+08	1.678500e+08	Best Iteration
3	4	1.757684e+09	1.945869e+08	4.493219e+08	1.671524e+08	1.149944e+09	1.763526e+08	3.769591e+09	1.570518e+09	1.501774e+08	9.313179e+08	2.405480e+08	1.040796e+10	5.983284e+09	1.079745e+10	1.993911e+10	1.675146e+10	1.288261e+08	1.356361e+08	1.155745e+08	1.714808e+08	1.001931e+08	2.944538e+08	1.551372e+08	8.507679e+07	8.355390e+07	Iterations
4	5	2.814250e+09	1.713711e+08	4.762753e+08	1.089614e+08	9.045489e+08	1.936031e+08	4.892379e+09	8.828760e+08	1.527177e+08	7.046872e+08	3.028327e+08	9.538610e+09	1.130780e+10	1.121870e+10	1.794521e+10	1.795797e+10	1.915149e+08	1.026670e+08	1.464976e+08	1.400921e+08	1.150179e+08	2.381621e+08	1.880367e+08	9.893507e+07	1.269060e+08	Iterations

	OBJECTID	SOURCE_ID	provider_count	grad_deg	ENTROPY	PVALUES	LBR_SIG	INTERCEPT	COEF_1	PINTERCEPT	P_COEF_1	P_COEF_2	AICC	R2	P_AICc	P_R2	SIG_COEF	P_SIG_COEF	LBR_TYPE	SHAPE
0	1	1	1044.0	8931	0.502901	0.01	99% Confidence	0.002812	0.458814	-0.001986	0.796103	-1.276968	-247.674353	0.731508	-256.013530	0.778649	_*	_**	Concave	{"rings": [[[-8529642.2486, 4688391.039899997]...
1	2	2	5.0	45	0.152911	0.01	99% Confidence	0.000017	1.115497	0.000651	0.470873	31.338637	-446.296708	0.802377	-449.717733	0.820235	_*	_**	Convex	{"rings": [[[-10120538.0346, 3896326.054200001...
2	3	3	621.0	2153	0.352790	0.01	99% Confidence	-0.001671	1.237995	0.000260	0.938741	2.075127	-329.879103	0.967074	-333.452877	0.970141	_*	_**	Convex	{"rings": [[[-9221715.5229, 5050074.560699999]...
3	4	4	79.0	574	0.291545	0.01	99% Confidence	0.000094	0.603532	0.000494	0.531451	0.468285	-446.685913	0.985494	-449.537319	0.986654	_*	_**	Positive Linear	{"rings": [[[-9290699.0981, 4093591.7991999984...
4	5	5	1210.0	5462	0.312456	0.01	99% Confidence	0.000090	0.603811	0.000518	0.535677	0.441553	-434.274636	0.981340	-435.557616	0.982284	_*	_**	Positive Linear	{"rings": [[[-9341355.0324, 3994505.8356000036...

	OBJECTID	SOURCE_ID	provider_count	white_pop	ENTROPY	PVALUES	LBR_SIG	INTERCEPT	COEF_1	PINTERCEPT	P_COEF_1	P_COEF_2	AICC	R2	P_AICc	P_R2	SIG_COEF	P_SIG_COEF	LBR_TYPE	SHAPE
0	1	1	1044.0	75625	0.462784	0.01	95% Confidence	-0.000806	0.964163	-0.005220	1.439742	-4.516278	-222.974441	0.559980	-222.721608	0.569225	_*	_*_	Positive Linear	{"rings": [[[-8529642.2486, 4688391.039899997]...
1	2	2	5.0	494	0.172985	0.01	95% Confidence	-0.000454	0.991115	-0.002512	2.194106	-60.014844	-401.267715	0.513645	-407.081515	0.578266	_*	***	Concave	{"rings": [[[-10120538.0346, 3896326.054200001...
2	3	3	621.0	51908	0.396061	0.01	95% Confidence	-0.010495	1.138899	-0.004588	0.700722	2.639104	-304.416195	0.945209	-314.253990	0.956163	**	***	Convex	{"rings": [[[-9221715.5229, 5050074.560699999]...
3	4	4	79.0	16894	0.324165	0.01	95% Confidence	-0.003290	0.899859	-0.000728	0.483160	5.162043	-304.611806	0.751356	-305.247346	0.760867	**	_**	Positive Linear	{"rings": [[[-9290699.0981, 4093591.7991999984...
4	5	5	1210.0	31300	0.347374	0.01	95% Confidence	-0.003391	0.903693	-0.000686	0.501848	4.935998	-302.769404	0.741085	-303.106776	0.749499	**	_*_	Positive Linear	{"rings": [[[-9341355.0324, 3994505.8356000036...

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	-2.8120	1.504	-1.870	0.062	-5.760	0.136
minority_pop	0.1668	0.027	6.125	0.000	0.113	0.220
grad_deg	0.2330	0.027	8.724	0.000	0.181	0.285
total_population	1.1271	0.372	3.027	0.002	0.397	1.857
edubase	-0.0523	0.350	-0.149	0.881	-0.739	0.634
percap_income	0.4684	0.276	1.694	0.090	-0.074	1.010
asso_deg	0.0846	0.027	3.169	0.002	0.032	0.137
avg_medicare	3.3141	0.636	5.215	0.000	2.068	4.560
pop_density	-0.0277	0.010	-2.862	0.004	-0.047	-0.009
avg_hhsz	-2.7010	0.295	-9.171	0.000	-3.278	-2.124
white_pop	-0.2686	0.048	-5.567	0.000	-0.363	-0.174
unemp_rate	0.0087	0.016	0.534	0.593	-0.023	0.041
median_age	-0.9122	0.209	-4.355	0.000	-1.323	-0.502
otherace_pop	0.0319	0.016	2.004	0.045	0.001	0.063
some_college	0.0294	0.046	0.633	0.527	-0.062	0.120
avg_presdrug	-3.9498	0.666	-5.934	0.000	-5.255	-2.645
avg_socsecurity	-0.4264	0.250	-1.706	0.088	-0.917	0.064
asian_pop	0.0608	0.012	5.168	0.000	0.038	0.084
bach_deg	0.0059	0.039	0.152	0.879	-0.071	0.082
black_pop	-0.1246	0.009	-13.506	0.000	-0.143	-0.106
hisp_pop	-0.1045	0.024	-4.303	0.000	-0.152	-0.057

id	value
0	./ArcGIS\home\Results.gdb\GWRResults_NB1
1
2

Analyzing Healthcare Provider Shortage - Part 4/4¶

Table of Contents

Part 4: Provider Shortage - Model and Understand Shortage¶

Gather and Plot Data¶

Get the Cleaned Data Layer¶

Plot Data¶

Pair Plot¶

Correlation Plot¶

Generate a Global Model¶

Create Predictor and Response Variables¶

Create a Base Model¶

Linear Regression Model using Stats Models¶

Verify Assumptions¶

Outlier Detection¶

Remove Outliers¶

Re-run Model¶

Feature Selection and Importance¶

Run Feature Selection Algorithms¶

Identify Features Using Lasso¶

Regression with Selected Features¶

Clean the Data¶

Transform Data¶

Run Global Model¶

Verify Assumptions¶

Spatial Models - Setup¶

Geographically Weighted Regression (GWR) Model¶

Run GWR Model¶

Output

Messages

Access GWR Results as Spatially Enabled Dataframe¶

Plot GWR Results¶

Plot Average Household Size¶

Plot Average Prescription Drug¶

Forest Based Classification and Regression Trees Model¶

Build Forest Based CART Model¶

Output

Messages

Get Variable Importance¶

Plot Variable Importance¶

Plot Best Iteration¶

Local Bivariate Relationships (LBR) Model¶

Build Model - Provider Count and Graduate Degree¶

Output

Messages

Access Model Results as Spatially Enabled Dataframe¶

Plot Results on Map¶

Build Model - Provider Count and White Population¶

Output

Messages

Access Model Results as Spatially Enabled Dataframe¶

Plot Results on Map¶

Summary¶

Conclusion¶

References¶