Analyzing Healthcare Provider Shortage - Part 4/4

  • 👟 Ready To Run!
  • 📈 Statistics & Graphing
  • 💻 Predictive Modeling
  • ⚕️ Healthcare

Requirements

  • 🗄️ Notebook Server Advanced License

The big question we would like to answer with this analysis is: What are the socioeconomic and demographic factors that influence access to healthcare providers?

In this notebook, we will build various global and local models and will try to answer this question by:

  • Identifying key sociodemographic and economic factors that are most predictive of provider count.
  • Identifying the direction of influence that these factors have on provider count, i.e., whether they influence access to providers positively or negatively.
  • Identifying how the influence of these factors varies across different counties.

Part 4: Provider Shortage - Model and Understand Shortage

In this section, we will:

  • Gather and Process Demographic and Health Expenditure data
  • Aggregate Provider Count for all counties in the U.S.
  • Generate a Base (OLS) Model of provider count using demographic and health expenditure variables
  • Perform Feature Selection using multiple techniques to select relevant features
  • Run a Global (OLS) model using selected features
  • Create a Geographically Weighted Regression (GWR) Model to understand how various predictors vary across different counties
  • Create a Forest-based Classification and Regression Trees Model to understand non-linear relationships and to identify important variables
  • Create a Local Bivariate Relationships (LBR) Model to understand the type and significance of relationships of Provider Count with respect to the variables selected from the Forest-based model.
In [1]:
# Import Libraries
from IPython.display import display

# Import arcgis
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap

# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np

# Import plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Import library for time
import time

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Create connection
gis = GIS("home")

Gather and Plot Data

To understand the shortage of healthcare providers, we will try to identify socioeconomic and demographic factors that influence access to these providers. We will use the cleaned, aggregated data layer demographic_healthexp_clean_allproviders created in Part 2 of this 4-part study. This layer includes the provider count, demographic data, and some health expenditure features.

Get the Cleaned Data Layer

In [3]:
# Search the feature layer
data_groups = gis.groups.search('"ArcGIS Sample Notebooks Data" owner:esri_notebook',
                                outside_org=True)
group_query = f"group: {data_groups[0].id}" if data_groups else ""
allprovider_item = gis.content.search(f'demographic_healthexp_clean_allproviders {group_query}',
                                      item_type = "Feature Service",
                                      outside_org=True)[0]
allprovider_item
Out[3]:
demographic_healthexp_clean_allproviders
Clean Healthcare provider data with demographic and health expenditure information. Feature Layer Collection by portaladmin
Last Modified: August 17, 2020
0 comments, 0 views
In [4]:
# Get the layer
allprovider_layer = allprovider_item.layers[0]
allprovider_layer
Out[4]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/demographic_healthexp_clean_allproviders/FeatureServer/0">
In [7]:
# Look at the first few fields
for f in allprovider_layer.properties.fields[:5]:
    print(f['name'], "{0:>25}".format(f['alias']))
objectid                  OBJECTID
state                     state
county                    county
asian_pop                 asian_pop
amerind_pop               amerind_pop
In [8]:
# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider_layer.query(as_df=True)
allprovider_df.shape
Out[8]:
(3139, 35)

We can see that the dataframe (allprovider_df) has 3139 rows and 35 columns.

In [8]:
# Look at the first few rows of dataframe
allprovider_df.head()
Out[8]:
SHAPE SHAPE__Area SHAPE__Length amerind_pop asian_pop asso_deg avg_healthcare avg_healthinsurance avg_hhinc avg_hhsz avg_labtest avg_medicalcare avg_medicalsrvc avg_medicare avg_personalinsurance avg_presdrug avg_socsecurity bach_deg black_pop county edubase grad_deg hisp_pop median_age minority_pop objectid otherace_pop percap_income pop_density provider_count some_college state total_population unemp_rate white_pop
0 {"rings": [[[-8966174.9526, 4783327.065400004]... 1.386118e+09 181962.662239 26 33 363 4466.09 2894.59 51537 2.42 52.65 1571.51 794.24 632.30 270.40 132.74 4548.75 695 201 Doddridge County 6405 368 91 43.2 467 1 5 20870 27.7 33 1048 WV 8841 5.8 8454
1 {"rings": [[[-9053514.349800002, 4605160.99729... 2.782337e+09 338942.950065 112 101 2901 4258.47 2763.71 50170 2.37 49.59 1494.76 759.18 592.84 257.88 124.51 4436.31 3215 2000 Fayette County 33677 1731 540 44.7 3456 2 106 20953 69.2 407 5473 WV 45748 4.7 42649
2 {"rings": [[[-8986636.588199995, 4735386.19310... 1.453919e+09 190697.949972 53 51 336 4182.28 2716.19 50297 2.35 47.76 1466.09 745.24 579.07 248.70 122.37 4427.67 632 1044 Gilmer County 6106 460 479 39.0 1720 3 209 20296 24.5 54 1272 WV 8294 7.6 6772
3 {"rings": [[[-8792002.250900002, 4755765.97959... 2.067679e+09 253575.180085 19 26 731 4382.40 2848.84 52611 2.40 50.31 1533.57 784.03 596.32 272.29 125.52 4723.08 804 118 Grant County 9296 655 171 46.9 453 4 82 21869 26.2 158 1270 WV 12489 3.8 12092
4 {"rings": [[[-9814179.0036, 5448666.451099999]... 3.612248e+09 304874.767338 501 1882 8678 5281.24 3469.30 72853 2.38 57.14 1811.94 974.64 594.38 357.65 124.23 6907.69 11652 1953 Fond du Lac County 73127 5425 5716 41.6 11180 5 2234 30109 144.3 1444 14708 WI 103861 2.1 95613

Plot Data

Before we plot the data, let's create a copy of allprovider_df and drop unnecessary columns.

In [9]:
# Create copy of dataframe
test_newcounty_df = allprovider_df.copy()
In [10]:
# Drop the shape and objectid columns
test_newcounty_df.drop(['objectid','SHAPE','SHAPE__Length','SHAPE__Area'], axis=1, inplace=True)
test_newcounty_df.columns
Out[10]:
Index(['amerind_pop', 'asian_pop', 'asso_deg', 'avg_healthcare',
       'avg_healthinsurance', 'avg_hhinc', 'avg_hhsz', 'avg_labtest',
       'avg_medicalcare', 'avg_medicalsrvc', 'avg_medicare',
       'avg_personalinsurance', 'avg_presdrug', 'avg_socsecurity', 'bach_deg',
       'black_pop', 'county', 'edubase', 'grad_deg', 'hisp_pop', 'median_age',
       'minority_pop', 'otherace_pop', 'percap_income', 'pop_density',
       'provider_count', 'some_college', 'state', 'total_population',
       'unemp_rate', 'white_pop'],
      dtype='object')

Pair Plot

We will plot each numerical variable with respect to others to see how the data is distributed and how correlated the variables are to each other.

In [30]:
sns.pairplot(test_newcounty_df.iloc[:,0:9]);
In [31]:
sns.pairplot(test_newcounty_df.iloc[:,10:19]);
In [32]:
sns.pairplot(test_newcounty_df.iloc[:,20:]);

From these pair plots we can see that:

  • The distribution of most variables is heavily skewed to the right.
  • Some variables have outliers that influence the data distribution.
  • Most demographic variables are not correlated to other demographic variables.
  • Most health expenditure variables seem to be highly correlated to other health expenditure variables.

Correlation Plot

A correlation plot is another great way to visualize correlation among the predictor variables and between the predictors and the response variable.

In [38]:
f,ax = plt.subplots(figsize=(18, 18))

sns.heatmap(test_newcounty_df.corr(), vmin=-1.0, vmax=1.0, annot=True, 
            linewidths=.5, fmt= '.1f',ax=ax)
plt.title('Correlation Plot', fontdict = {'fontsize' : 30});

From this plot, we can see that:

  • Education level variables are highly correlated with each other.
  • Similarly, health expenditure variables also seem to be highly correlated.
In [68]:
# Sort the values by descending order of Provider Count
test_newcounty_df = test_newcounty_df.sort_values(by=['provider_count'], ascending=False)
test_newcounty_df.head()
Out[68]:
amerind_pop asian_pop asso_deg avg_healthcare avg_healthinsurance avg_hhinc avg_hhsz avg_labtest avg_medicalcare avg_medicalsrvc avg_medicare avg_personalinsurance avg_presdrug avg_socsecurity bach_deg black_pop county edubase grad_deg hisp_pop median_age minority_pop otherace_pop percap_income pop_density provider_count some_college state total_population unemp_rate white_pop
1198 73559 1517296 469649 5905.52 3907.44 94861 3.00 59.76 1998.08 1132.94 579.43 399.93 118.60 9046.16 1445895 850350 Los Angeles County 6899087 782758 5043293 35.7 7644196 2305030 31563 2535.5 188179.0 1300786 CA 10288937 4.5 5019340
2784 21113 395712 237028 5692.87 3774.06 88254 2.58 56.05 1918.81 1061.74 585.60 381.05 122.50 8356.49 830433 1233716 Cook County 3607345 552050 1374256 36.6 3078944 608672 34083 5579.2 104037.0 690225 IL 5274129 6.9 2859209
2459 8756 212844 48360 8035.56 5382.10 136860 2.00 69.80 2653.46 1464.96 796.17 530.01 167.61 12793.70 406878 247085 New York County 1251653 371255 441304 37.9 899778 193191 66805 72736.2 80839.0 121070 NY 1660472 3.9 924396
1614 98584 184150 241714 5574.19 3675.76 83005 2.69 57.84 1898.43 1041.69 581.95 362.02 123.67 7908.62 597653 253576 Maricopa County 2891837 342159 1373153 35.8 1981147 598035 30701 476.9 71069.0 711033 AZ 4387226 5.3 3066684
1405 30001 341640 193511 5800.32 3840.81 89252 2.85 60.64 1959.50 1087.37 564.48 374.86 119.69 8547.08 605813 901459 Harris County 3029538 345886 2035551 33.5 3339578 712410 31405 2780.1 70953.0 606842 TX 4735852 5.9 2573473

Generate a Global Model

The idea behind a global model is to identify various socioeconomic and demographic factors that influence access to healthcare providers across all counties in the United States.

To build this global model, we will follow a 3 step process:

1. Build an Ordinary Least Squares (OLS) Regression Model and verify regression assumptions
2. Perform Feature Selection using Lasso, Elastic Net and Recursive Feature Elimination techniques
3. Using selected features, build another OLS model, check performance and verify assumptions

Create Predictor and Response Variables

In [204]:
# Create predictor variables
train_pred = test_newcounty_df.drop(['county','state','provider_count'],axis=1)
train_pred.head()
Out[204]:
amerind_pop asian_pop asso_deg avg_healthcare avg_healthinsurance avg_hhinc avg_hhsz avg_labtest avg_medicalcare avg_medicalsrvc avg_medicare avg_personalinsurance avg_presdrug avg_socsecurity bach_deg black_pop edubase grad_deg hisp_pop median_age minority_pop otherace_pop percap_income pop_density some_college total_population unemp_rate white_pop
1198 73559 1517296 469649 5905.52 3907.44 94861 3.00 59.76 1998.08 1132.94 579.43 399.93 118.60 9046.16 1445895 850350 6899087 782758 5043293 35.7 7644196 2305030 31563 2535.5 1300786 10288937 4.5 5019340
2784 21113 395712 237028 5692.87 3774.06 88254 2.58 56.05 1918.81 1061.74 585.60 381.05 122.50 8356.49 830433 1233716 3607345 552050 1374256 36.6 3078944 608672 34083 5579.2 690225 5274129 6.9 2859209
2459 8756 212844 48360 8035.56 5382.10 136860 2.00 69.80 2653.46 1464.96 796.17 530.01 167.61 12793.70 406878 247085 1251653 371255 441304 37.9 899778 193191 66805 72736.2 121070 1660472 3.9 924396
1614 98584 184150 241714 5574.19 3675.76 83005 2.69 57.84 1898.43 1041.69 581.95 362.02 123.67 7908.62 597653 253576 2891837 342159 1373153 35.8 1981147 598035 30701 476.9 711033 4387226 5.3 3066684
1405 30001 341640 193511 5800.32 3840.81 89252 2.85 60.64 1959.50 1087.37 564.48 374.86 119.69 8547.08 605813 901459 3029538 345886 2035551 33.5 3339578 712410 31405 2780.1 606842 4735852 5.9 2573473
In [70]:
# Create response variables
train_y = test_newcounty_df['provider_count']
train_y.head()
Out[70]:
1198    188179.0
2784    104037.0
2459     80839.0
1614     71069.0
1405     70953.0
Name: provider_count, dtype: float64

Create a Base Model

We will start by building an Ordinary Least Squares (OLS) Regression [1] model with all predictors and verifying regression assumptions.

In [71]:
# Import libraries
import statsmodels.api as sm
# import statsmodels
from statsmodels.regression import linear_model

# Set the seed for consistent results
np.random.seed(101)

Linear Regression Model using Stats Models

In [72]:
# Create Model
X_train = train_pred
X_train = sm.add_constant(X_train) # add constant
sm_ols = sm.OLS(train_y, X_train).fit() 
In [73]:
# Generate model summary
sm_ols.summary()
Out[73]:
OLS Regression Results
Dep. Variable: provider_count R-squared: 0.972
Model: OLS Adj. R-squared: 0.972
Method: Least Squares F-statistic: 3863.
Date: Mon, 17 Aug 2020 Prob (F-statistic): 0.00
Time: 14:21:29 Log-Likelihood: -26469.
No. Observations: 3139 AIC: 5.300e+04
Df Residuals: 3110 BIC: 5.317e+04
Df Model: 28
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2632.0977 840.995 3.130 0.002 983.137 4281.059
amerind_pop -0.0445 0.009 -5.236 0.000 -0.061 -0.028
asian_pop -0.0838 0.006 -14.154 0.000 -0.095 -0.072
asso_deg 0.0251 0.011 2.259 0.024 0.003 0.047
avg_healthcare 468.7517 3996.627 0.117 0.907 -7367.543 8305.047
avg_healthinsurance -463.0344 3996.586 -0.116 0.908 -8299.249 7373.180
avg_hhinc 0.2111 0.128 1.647 0.100 -0.040 0.462
avg_hhsz -134.0568 252.876 -0.530 0.596 -629.878 361.765
avg_labtest -106.1273 52.970 -2.004 0.045 -209.987 -2.267
avg_medicalcare -460.9320 3997.058 -0.115 0.908 -8298.073 7376.209
avg_medicalsrvc -1.7558 14.157 -0.124 0.901 -29.513 26.002
avg_medicare 4.8259 12.003 0.402 0.688 -18.709 28.361
avg_personalinsurance 3.0744 8.394 0.366 0.714 -13.385 19.534
avg_presdrug -133.4695 64.206 -2.079 0.038 -259.360 -7.579
avg_socsecurity -4.7149 1.430 -3.297 0.001 -7.519 -1.911
bach_deg -0.0004 0.004 -0.091 0.928 -0.009 0.008
black_pop -0.0576 0.005 -12.104 0.000 -0.067 -0.048
edubase 0.0574 0.004 16.112 0.000 0.050 0.064
grad_deg 0.0798 0.005 17.414 0.000 0.071 0.089
hisp_pop -0.0880 0.012 -7.570 0.000 -0.111 -0.065
median_age -52.1251 10.225 -5.098 0.000 -72.174 -32.076
minority_pop 0.0954 0.012 7.895 0.000 0.072 0.119
otherace_pop 0.0370 0.011 3.304 0.001 0.015 0.059
percap_income 0.1126 0.025 4.544 0.000 0.064 0.161
pop_density 0.3258 0.022 15.136 0.000 0.284 0.368
some_college -0.0116 0.005 -2.309 0.021 -0.021 -0.002
total_population -0.0610 0.010 -5.978 0.000 -0.081 -0.041
unemp_rate -1.7981 9.948 -0.181 0.857 -21.304 17.707
white_pop 0.0323 0.010 3.168 0.002 0.012 0.052
Omnibus: 1740.052 Durbin-Watson: 1.824
Prob(Omnibus): 0.000 Jarque-Bera (JB): 397367.718
Skew: 1.487 Prob(JB): 0.00
Kurtosis: 58.039 Cond. No. 1.88e+08


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.88e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
In [74]:
# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
sm_ols.params[1:].sort_values(ascending=False).plot(kind='barh')
plt.title('Model Coefficients - Base Model');
In [75]:
# Calculate RMSE of model
from statsmodels.tools.eval_measures import rmse

pred_val = sm_ols.fittedvalues.copy()
rmse_base_ols = rmse(train_y, pred_val)

print('RMSE for Base Model:',round(rmse_base_ols,2))
RMSE for Base Model: 1111.25

Analysing the model summary, we can see that:

  • The R-squared value of 0.972 shows that 97.2% of the variability in Provider Count is explained by the model. The R-squared value is too good to be true and the model seems to be overfitting.
  • To identify variables that are statistically significant, we look at the p-values of individual variables. Among the coefficients, those that are statistically significant at the 90% confidence level are:
  • asian_pop, amerind_pop, black_pop, edubase, hisp_pop, median_age, minority_pop, otherace_pop, percap_income, pop_density, white_pop, some_college, asso_deg, grad_deg, total_population, avg_labtest, avg_presdrug, avg_socsecurity.
  • The p-value of the F-statistic is less than 0.05, which shows that at least one of the predictor variables has predictive power on the variability of Provider Count.
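The summary also flags a very large condition number (1.88e+08), pointing to multicollinearity among the predictors. One common way to quantify this, not part of the original notebook, is to compute variance inflation factors (VIF) for the base-model design matrix:

# Compute variance inflation factors for the predictors in the base model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print(vif.drop('const').sort_values(ascending=False).head(10))  # largest VIFs indicate the most collinear predictors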

Verify Assumptions

Here, we will verify the regression assumptions [2] of:

  • Linearity
  • Independence
  • Normality
  • Equal Variance
In [47]:
# Get residual value
residual = sm_ols.resid
In [48]:
import scipy as sp
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))

# Residuals vs Fitted
ax1.scatter(pred_val, residual)
ax1.hlines(y=0, xmin=0, xmax=max(pred_val), linewidth=2, color='r')
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs Fitted Values')

# QQ plot
sp.stats.probplot(residual, plot=ax2, fit=True);
  1. Residuals vs Fitted plot
    • Constant Variance: From the plot, we observe a departure from constant variance, as the distances of the data points from zero are not consistent. We can see that the residuals are close to 0 at the lower fitted values, but they start to spread out as we move toward the right of the plot.
    • Independence: We see that the data points are clustered together rather than randomly distributed. This indicates a departure from independence, as clusters of residuals generally indicate correlated errors or suggest that multicollinearity exists in parts of the data.
  2. Probability Plot
    • Normality: From this plot, we can see that the residuals have tails on both the positive and negative side. Although the residuals stay close to the reference line in the middle, they are heavily tailed, indicating a violation of the normality assumption.
In [59]:
# Check Linearity - Plot predictors with response
preds = train_pred.columns
fig = plt.figure(figsize=(15, 20))
for sp in range(0,27):
    ax = fig.add_subplot(7,4,sp+1)
    ax.scatter(train_pred.loc[:,preds[sp]], train_y)
    ax.set_xlabel(preds[sp])
    ax.set_ylabel('Provider Count')
    plt.xticks(rotation=0)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout(pad=1.0) # automatically adjusts layout to fit long labels
plt.show()
  1. Predictors vs Response
    • Linearity - From the plot above, we can see that each predictor variable is either strongly or weakly related to the response variable. From this plot, we can say that the linearity assumption holds.

Outlier Detection

One way to think about whether our results are driven by a given data point is to calculate how far the predicted values would move if the model were fit without that data point. This calculated total distance is called Cook's distance. Cook's D is a function of the leverage and standardized residual associated with each data point. The influence of each point can be visualized using an Influence Plot [3].
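Cook's distance values can also be pulled directly from the fitted statsmodels results as a numeric complement to the influence plot generated next. Below is a minimal sketch using the sm_ols model fit above; the 4/n cutoff is a common rule of thumb and is not part of the original analysis.

# Extract Cook's distance for every observation from the fitted base model
influence = sm_ols.get_influence()
cooks_d, _ = influence.cooks_distance

# Flag observations above a common 4/n rule-of-thumb threshold (illustrative only)
threshold = 4 / len(cooks_d)
print((cooks_d > threshold).sum(), "observations exceed the 4/n cutoff of", round(threshold, 5))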

In [35]:
# Create Influence Plot

fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(sm_ols, ax= ax, criterion="cooks")

From this plot, we can see that points with a larger size have high leverage and higher-than-average residuals; the larger size indicates these are influential points. Points that do not have high leverage but have very high residuals are considered outliers.

Note that the numbers shown next to the points in this plot are observation numbers (row IDs) from the data. This context is important for understanding the following section on removing outliers.

Remove Outliers

Removing outliers is an iterative process and outliers need to be investigated before removal. Here, we followed a 4-step process to remove outliers as follows:

  1. Remove initial outliers based on Influence Plot
  2. Rerun the model with data after removing outliers
  3. Plot predictors with response to see outliers in each predictor
  4. Repeat the process above until you see satisfactory results

The outliers were removed in different iterations. Here are the observations that were removed with each iteration:

- Iteration 1: 2631,2774,1277,1245,296
- Iteration 2: 1198,2353,2459,1688,1077
- Iteration 3: 916,2632,2262,1207,797,531
- Iteration 4: 2613,310,1052,38,1632,1512
- Iteration 5: 629,1846,2710,388,1614,2784,1727

The code below removes all of these outliers at once. If you see a different set of points as outliers in your analysis, update the global_df.drop() command in the code cell below with those observations and rerun the analysis.

In [146]:
# Remove Outliers
global_df = test_newcounty_df.copy()

global_df.drop([2631,2774,1277,1245,296,
                1198,2353,2459,1688,1077,
                916,2632,2262,1207,797,531,
                2613, 310, 1052, 38, 1632, 1512,
                629, 1846, 2710, 388, 1614, 2784, 1727],axis=0, inplace=True)
In [147]:
global_df.shape
Out[147]:
(3110, 31)
Re-run Model
In [148]:
# Create predictor and response variables
train_x_rerun = global_df.drop(['county','state','provider_count'],axis=1)
train_y_rerun = global_df['provider_count']
train_x_rerun.head()
Out[148]:
amerind_pop asian_pop asso_deg avg_healthcare avg_healthinsurance avg_hhinc avg_hhsz avg_labtest avg_medicalcare avg_medicalsrvc avg_medicare avg_personalinsurance avg_presdrug avg_socsecurity bach_deg black_pop edubase grad_deg hisp_pop median_age minority_pop otherace_pop percap_income pop_density some_college total_population unemp_rate white_pop
1405 30001 341640 193511 5800.32 3840.81 89252 2.85 60.64 1959.50 1087.37 564.48 374.86 119.69 8547.08 605813 901459 3029538 345886 2035551 33.5 3339578 712410 31405 2780.1 606842 4735852 5.9 2573473
1698 27438 401095 191669 6514.21 4306.51 101373 2.77 66.80 2207.70 1234.01 649.22 439.85 135.09 9730.65 525473 168807 2246969 331783 1144104 35.7 1844110 483908 36291 795.0 487496 3344185 3.9 2056884
1348 17234 395459 126421 7408.65 4907.57 117025 2.42 74.96 2501.08 1398.15 720.68 505.47 150.11 11292.96 484710 146077 1552376 320833 216100 38.3 891019 94894 47839 1039.6 278602 2199247 5.0 1401488
209 17668 170268 102480 5366.30 3553.67 82550 2.76 55.22 1812.63 1000.06 534.73 342.48 113.62 7832.26 337376 608225 1709385 193333 1072941 33.6 1890363 436469 29987 3046.4 332724 2654282 4.8 1335362
1355 6003 70007 138229 5441.04 3587.55 80411 2.56 55.26 1853.49 1012.03 585.04 357.09 124.13 7615.72 279433 555082 1358758 159309 574151 41.0 1224672 83078 31438 1571.7 263467 1901425 5.3 1120747
In [149]:
# Run Model
X_train_rerun = train_x_rerun
X_train_rerun = sm.add_constant(X_train_rerun)
global_ols_rerun = sm.OLS(train_y_rerun, X_train_rerun).fit() 
residual = global_ols_rerun.resid
In [150]:
# Plot Outliers
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(global_ols_rerun, ax= ax, criterion="cooks")

Comparing this influence plot to the original plot before removing outliers, we can see that the range of H Leverage has been considerably reduced (from roughly 0 to 0.8 down to roughly 0 to 0.4). The range of Studentized Residuals has also narrowed (from about -12 to 20 down to about -10 to 10).

In [152]:
# Check Linearity - Plot predictors with response
preds = train_x_rerun.columns
fig = plt.figure(figsize=(15, 15))
for sp in range(0,27):
    ax = fig.add_subplot(7,4,sp+1)
    ax.scatter(train_x_rerun.loc[:,preds[sp]], train_y_rerun)
    ax.set_xlabel(preds[sp])
    ax.set_ylabel('Provider Count')
    plt.xticks(rotation=0)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout(pad=0.5) # automatically adjusts layout to fit long labels
plt.show()

Feature Selection and Importance

Our regression model results showed only a few predictors that were statistically significant in predicting provider count. Feature selection is used to select those features in data that contribute most to the response variable. Having irrelevant features in the data can decrease the accuracy of many models, especially linear algorithms. To identify which predictors play a key role and remove predictors that are not statistically significant, we can use various feature selection techniques.

Greedy algorithms like Stepwise Regression [4] and Recursive Feature Elimination (RFE) [5] work by adding or removing attributes one at a time and building a model on the attributes that remain. The RFE algorithm recursively removes attributes and uses an accuracy metric to rank features according to their importance.
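As an illustration of how RFE ranks features, a standalone run outside the cross-validated pipelines used later might look like the sketch below. The choice of LinearRegression as the estimator and of 10 retained features are assumptions made only for demonstration.

# Rank predictors with Recursive Feature Elimination wrapped around a linear model
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe.fit(train_x_rerun, train_y_rerun)
print(list(train_x_rerun.columns[rfe.support_]))  # features retained by RFE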

Regularization methods like Lasso [6] and ElasticNet [6] seek to minimize the complexity of the model (the magnitude and number of regression coefficients) by penalizing large coefficients; their objective functions are sketched after the list below.

  • Lasso penalizes the sum of the absolute values of the regression coefficients, thereby forcing many coefficients to 0.

  • ElasticNet combines the properties of both Lasso (L1 term) and Ridge [6] (L2 term) regression. It penalizes the model by using both the L2-norm (sum of squared values of coefficients) and the L1-norm (sum of absolute value of coefficients) thereby shrinking some coefficients closer to 0 to reduce variance and making other coefficients 0.
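In objective-function terms (a standard textbook formulation shown only for intuition; scikit-learn's exact scaling of these terms differs slightly), the two penalties can be written with α controlling the penalty strength and ρ the L1/L2 mix:

$$ \begin{equation*} \hat{\beta}_{Lasso} = \underset{\beta}{\arg\min}\ \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \end{equation*} $$

$$ \begin{equation*} \hat{\beta}_{ElasticNet} = \underset{\beta}{\arg\min}\ \lVert y - X\beta \rVert_2^2 + \alpha \left( \rho \lVert \beta \rVert_1 + \tfrac{1-\rho}{2} \lVert \beta \rVert_2^2 \right) \end{equation*} $$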

Feature importance techniques like Random Forest [7] are used to select features using a trained supervised classifier. Random forest consists of a number of decision trees where each node in a tree is a condition on a single feature, designed to split the dataset.
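For intuition, impurity-based importances could be pulled from a random forest fit directly on the cleaned data, as in the minimal sketch below. The 100-tree forest and fixed random_state are assumptions made for illustration; the notebook's own feature selection relies on the pipelines that follow.

# Fit a random forest and rank predictors by impurity-based importance
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=7)
rf.fit(train_x_rerun, train_y_rerun)

rf_importance = pd.Series(rf.feature_importances_, index=train_x_rerun.columns)
print(rf_importance.sort_values(ascending=False).head(10))  # top 10 most important predictors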

Here we will:

  • Run some feature selection algorithms
  • Choose an algorithm with lowest error
  • Run the chosen algorithm on all predictors
  • Choose the predictors selected by this model for further analysis
In [205]:
# Create predictor and response variables
train_x_global = train_x_rerun
train_y_global = train_y_rerun
train_x_global.head()
Out[205]:
amerind_pop asian_pop asso_deg avg_healthcare avg_healthinsurance avg_hhinc avg_hhsz avg_labtest avg_medicalcare avg_medicalsrvc avg_medicare avg_personalinsurance avg_presdrug avg_socsecurity bach_deg black_pop edubase grad_deg hisp_pop median_age minority_pop otherace_pop percap_income pop_density some_college total_population unemp_rate white_pop
1405 30001 341640 193511 5800.32 3840.81 89252 2.85 60.64 1959.50 1087.37 564.48 374.86 119.69 8547.08 605813 901459 3029538 345886 2035551 33.5 3339578 712410 31405 2780.1 606842 4735852 5.9 2573473
1698 27438 401095 191669 6514.21 4306.51 101373 2.77 66.80 2207.70 1234.01 649.22 439.85 135.09 9730.65 525473 168807 2246969 331783 1144104 35.7 1844110 483908 36291 795.0 487496 3344185 3.9 2056884
1348 17234 395459 126421 7408.65 4907.57 117025 2.42 74.96 2501.08 1398.15 720.68 505.47 150.11 11292.96 484710 146077 1552376 320833 216100 38.3 891019 94894 47839 1039.6 278602 2199247 5.0 1401488
209 17668 170268 102480 5366.30 3553.67 82550 2.76 55.22 1812.63 1000.06 534.73 342.48 113.62 7832.26 337376 608225 1709385 193333 1072941 33.6 1890363 436469 29987 3046.4 332724 2654282 4.8 1335362
1355 6003 70007 138229 5441.04 3587.55 80411 2.56 55.26 1853.49 1012.03 585.04 357.09 124.13 7615.72 279433 555082 1358758 159309 574151 41.0 1224672 83078 31438 1571.7 263467 1901425 5.3 1120747
In [206]:
# Import Libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

Run Feature Selection Algorithms

In [207]:
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring ='neg_mean_squared_error'
In [208]:
# Create Pipeline and Standardize the dataset
clf = LinearRegression()
pipelines = []
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()),('LASSO',Lasso(max_iter=10000))])))
pipelines.append(('ScaledEN', Pipeline([('Scaler', StandardScaler()),('EN',ElasticNet())])))
pipelines.append(('ScaledRFECV', Pipeline([('Scaler', StandardScaler()),('RFECV',RFECV(estimator=clf, cv=5))])))
pipelines.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF',RandomForestRegressor(n_estimators = 100))])))
In [209]:
# Run models
import math
results = []
names = []
for name, model in pipelines:
    kfold = KFold(n_splits=num_folds, random_state=seed,shuffle=True)
    cv_results = cross_val_score(model, train_x_global, train_y_global, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, math.sqrt(abs(cv_results.mean())), cv_results.std())
    print(msg)
ScaledLASSO: 810.966882 (144215.366577)
ScaledEN: 1036.948941 (257837.375171)
ScaledRFECV: 823.745259 (121412.751591)
ScaledRF: 1069.088212 (951602.468757)

From the results above, Lasso seems to be the most promising, with the lowest RMSE. Let's explore Lasso to see which variables are selected by this model.

Identify Features Using Lasso

In [210]:
# Standardize the data using StandardScaler
from sklearn.preprocessing import StandardScaler
train_x_std = StandardScaler().fit_transform(train_x_global)
In [211]:
# Run Lasso Model
from sklearn.linear_model import Lasso
# from sklearn.linear_model import ElasticNet
lasso_model=Lasso(max_iter=10000)
lasso_model.fit(train_x_std,train_y_global)
Out[211]:
Lasso(max_iter=10000)
In [213]:
# Identify how many variables are picked
coef = pd.Series(lasso_model.coef_, index = train_pred.columns)
print("Model picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")
Model picked 20 variables and eliminated the other 8 variables
In [214]:
# Get important variables and their coefficients
imp_coef = pd.concat([coef.sort_values(ascending=False).head(10),
                     coef.sort_values(ascending=False).tail(10)])
imp_coef
Out[214]:
minority_pop        6112.644751
grad_deg            3080.902796
total_population    1746.411139
edubase             1045.670237
percap_income        651.533545
asso_deg             632.276396
avg_medicare         612.894279
pop_density          161.600102
avg_hhsz              49.771492
white_pop             -0.000000
unemp_rate           -53.553069
median_age          -180.673223
otherace_pop        -189.915878
some_college        -399.935748
avg_presdrug        -574.941967
avg_socsecurity     -951.248207
asian_pop          -1269.292044
bach_deg           -1703.537961
black_pop          -1950.128199
hisp_pop           -3752.211812
dtype: float64
In [215]:
# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Model");
Out[215]:
Text(0.5, 1.0, 'Coefficients in the Model')

From the analysis above, we can see that factors that are predictive of higher provider count are:

  • minority_pop (2020 Minority Population (Esri))
  • grad_deg (2020 Education: Graduate/Professional Degree (Esri))
  • total_population (2020 Total Population (Esri))
  • percap_income (2020 Per Capita Income (Esri))
  • avg_medicare (2020 Medicare Payments: Average)
  • asso_deg (2020 Education: Associate's Degree (Esri))
  • edubase (2020 Educational Attainment Base (Esri))
  • pop_density (2020 Population Density (Pop per Square Mile) (Esri))
  • avg_hhsz (2020 Average Household Size (Esri))

Factors that negatively influence provider count are:

  • hisp_pop (2020 Hispanic Population (Esri))
  • black_pop (2020 Black/African American Population (Esri))
  • bach_deg (2020 Education: Bachelor's Degree (Esri))
  • asian_pop (2020 Asian Population (Esri))
  • avg_socsecurity (2020 Pensions & Social Security: Average)
  • avg_presdrug (2020 Prescription Drugs: Average)
  • median_age (2020 Median Age (Esri))
  • some_college (2020 Education: Some College/No Degree (Esri))
  • otherace_pop (2020 Other Race Population by Age Base (Esri))
  • unemp_rate (2020 Unemployment Rate (Esri))

Regression with Selected Features

In the base model, we saw deviations from linearity, constant variance and independence assumptions. The model also seemed to overfit data with an R-squared value of 0.97.

For Linearity - If a model does not fit well, it does not mean that the regression is not useful. One problem could be that the relation between one or more predictors and the response is not linear.

For Constant Variance and Independence - If the normality or constant variance assumptions do not hold, we can transform the response variable. A common choice is a power transformation that raises y to a power lambda.
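Written out, this power (Box-Cox) family of transformations is, with the log transform as its limiting λ = 0 case:

$$ \begin{equation*} y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases} \end{equation*} $$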

Here, we will use the features selected by the Lasso model and transform both the predictor and response variables to create a new regression model on our data.

Clean the Data

Some transformations, such as the log transformation [8], do not work well with zero or negative values. We do not have any negative values, but let's check for and remove any zero values from the data.

In [216]:
# Create a copy of dataframe
test_log = global_df.copy()
test_log.shape
Out[216]:
(3110, 31)
In [217]:
# Describe to identify any zero values in dataset
test_log.describe().transpose()
Out[217]:
count mean std min 25% 50% 75% max
amerind_pop 3110.0 925.189389 3253.585408 0.00 61.0000 160.000 523.5000 56871.00
asian_pop 3110.0 3841.547267 20680.860712 0.00 48.0000 162.000 803.7500 501246.00
asso_deg 3110.0 5340.767524 12191.720275 5.00 647.2500 1596.500 4428.7500 193511.00
avg_healthcare 3110.0 5078.543723 947.949453 2821.51 4437.6100 4927.990 5558.7475 10690.41
avg_healthinsurance 3110.0 3318.752814 625.270951 1871.50 2899.4575 3221.570 3633.6175 6873.42
avg_hhinc 3110.0 66116.593891 15742.065759 36078.00 55847.7500 63405.500 72196.2500 162801.00
avg_hhsz 3110.0 2.484016 0.218515 1.96 2.3500 2.460 2.5700 4.34
avg_labtest 3110.0 57.226193 12.376326 27.75 49.0125 54.795 63.1475 149.41
avg_medicalcare 3110.0 1759.790894 324.316111 950.01 1541.2025 1705.395 1921.3075 3816.99
avg_medicalsrvc 3110.0 926.770312 186.200670 496.89 799.9600 897.325 1020.8125 1955.37
avg_medicare 3110.0 628.224878 110.561226 342.05 556.8850 610.050 682.0225 1501.33
avg_personalinsurance 3110.0 326.753232 74.904571 162.69 276.3250 315.235 362.0075 791.45
avg_presdrug 3110.0 131.243627 21.053413 73.65 117.8525 128.335 141.7175 298.34
avg_socsecurity 3110.0 6112.216814 1642.662502 3121.71 5053.2750 5813.570 6727.9125 16133.33
bach_deg 3110.0 11675.685209 33447.991357 2.00 846.0000 2118.000 6539.2500 605813.00
black_pop 3110.0 11506.624759 44758.931850 0.00 121.0000 840.000 5567.5000 901459.00
edubase 3110.0 61346.768489 149479.894659 69.00 7912.5000 18437.500 46770.7500 3029538.00
grad_deg 3110.0 7112.766881 21273.455845 0.00 420.2500 1141.500 3759.2500 345886.00
hisp_pop 3110.0 14253.300322 73794.908292 0.00 378.0000 1128.000 5182.7500 2035551.00
median_age 3110.0 41.960514 5.362405 23.20 38.7250 41.900 45.2000 62.50
minority_pop 3110.0 32202.119614 125773.042120 4.00 1270.2500 4461.000 15167.5000 3339578.00
otherace_pop 3110.0 5257.274920 27156.607632 0.00 109.0000 422.000 1985.7500 712410.00
percap_income 3110.0 26343.677170 6071.379975 9853.00 22218.2500 25470.500 29272.2500 66766.00
pop_density 3110.0 193.653055 611.577761 0.00 17.1000 45.600 114.7000 11869.90
provider_count 3110.0 1498.314469 4319.913817 1.00 88.0000 263.000 935.7500 70953.00
some_college 3110.0 12867.820257 30879.598848 22.00 1666.2500 3847.000 10220.7500 606842.00
total_population 3110.0 90102.121543 223536.760943 82.00 11248.5000 26225.500 67613.5000 4735852.00
unemp_rate 3110.0 4.622669 2.548112 0.00 2.9000 4.300 5.8000 27.90
white_pop 3110.0 65542.636334 141893.693591 65.00 8914.0000 21478.000 56310.5000 2573473.00

We can see that the minimum for the Asian population, Black population, and some other variables is 0. Let's find the observations with these zero values and remove them.

In [218]:
# Find observations that have 0 values
state_list = ['asian_pop','black_pop','amerind_pop','hisp_pop','otherace_pop','pop_density','unemp_rate','grad_deg']
x = test_log[(test_log[state_list] == 0).any(axis=1)]
x.shape
Out[218]:
(50, 31)

We can see that 50 out of 3110 observations have 0 values. We will go ahead and remove these.

Replace 0 values with NaN and then drop those rows

In [219]:
# Replace 0 with NaN for columns in state_list
test_log[state_list] = test_log[state_list].replace({0:np.nan})
test_log.dropna(inplace=True)
test_log.shape
Out[219]:
(3060, 31)
In [220]:
test_log.describe().transpose()
Out[220]:
count mean std min 25% 50% 75% max
amerind_pop 3060.0 938.172549 3277.662837 1.00 64.0000 166.000 534.7500 56871.00
asian_pop 3060.0 3904.242484 20843.326126 1.00 51.0000 167.000 823.7500 501246.00
asso_deg 3060.0 5426.352614 12272.396268 5.00 683.0000 1645.000 4518.0000 193511.00
avg_healthcare 3060.0 5070.785444 939.047067 2821.51 4434.8850 4920.420 5556.8650 10125.26
avg_healthinsurance 3060.0 3314.343611 620.278570 1871.50 2897.2850 3218.950 3626.4725 6711.29
avg_hhinc 3060.0 66136.494444 15747.147692 36078.00 55877.0000 63487.500 72235.5000 162801.00
avg_hhsz 3060.0 2.486990 0.216719 1.96 2.3600 2.460 2.5800 4.34
avg_labtest 3060.0 56.990644 12.038398 27.75 48.9300 54.730 62.8500 118.65
avg_medicalcare 3060.0 1756.441820 320.330034 950.01 1539.2550 1704.200 1919.3575 3413.96
avg_medicalsrvc 3060.0 925.372297 184.806441 496.89 799.6600 895.695 1019.7550 1930.73
avg_medicare 3060.0 626.078137 107.524692 342.05 555.9500 608.515 679.6750 1192.23
avg_personalinsurance 3060.0 326.419869 74.697177 162.69 276.0600 315.110 361.7850 791.45
avg_presdrug 3060.0 130.870804 20.516788 73.65 117.7375 128.065 141.4075 236.92
avg_socsecurity 3060.0 6116.741948 1645.083037 3121.71 5059.6025 5817.650 6734.8375 16133.33
bach_deg 3060.0 11863.987582 33687.509892 38.00 895.7500 2192.000 6773.7500 605813.00
black_pop 3060.0 11694.084641 45099.006112 1.00 132.0000 893.500 5684.5000 901459.00
edubase 3060.0 62331.507190 150496.226005 444.00 8389.5000 18912.500 47753.0000 3029538.00
grad_deg 3060.0 7227.982026 21427.345484 11.00 442.0000 1192.500 3880.5000 345886.00
hisp_pop 3060.0 14483.974837 74373.297150 2.00 394.0000 1179.000 5362.7500 2035551.00
median_age 3060.0 41.861307 5.298345 23.20 38.7000 41.800 45.0000 62.50
minority_pop 3060.0 32722.983007 126730.157837 14.00 1355.7500 4659.000 15646.7500 3339578.00
otherace_pop 3060.0 5342.442810 27369.400986 1.00 117.0000 440.000 2042.0000 712410.00
percap_income 3060.0 26313.961438 6055.021104 9853.00 22179.0000 25453.500 29277.7500 66766.00
pop_density 3060.0 196.779967 616.061929 0.10 18.3000 46.550 117.7750 11869.90
provider_count 3060.0 1522.592157 4350.862736 1.00 93.0000 274.000 978.2500 70953.00
some_college 3060.0 13073.895752 31088.465733 107.00 1740.0000 3959.000 10384.2500 606842.00
total_population 3060.0 91549.904248 225066.638803 623.00 11819.2500 26956.000 68665.5000 4735852.00
unemp_rate 3060.0 4.643072 2.524086 0.10 3.0000 4.300 5.8000 27.90
white_pop 3060.0 66593.150327 142808.374409 205.00 9355.0000 22318.500 57582.7500 2573473.00

Now that the zero-valued observations are removed, we can see that the minimum value of each of these variables is no longer 0.

Transform Data

We will use Log transformation [8] to transform the data.

In [221]:
# Subset important predictors chosen from Lasso Model
train_x_lasso = test_log.loc[:,imp_coef.index]
train_y_lasso = test_log['provider_count']
train_x_lasso.head()
Out[221]:
minority_pop grad_deg total_population edubase percap_income asso_deg avg_medicare pop_density avg_hhsz white_pop unemp_rate median_age otherace_pop some_college avg_presdrug avg_socsecurity asian_pop bach_deg black_pop hisp_pop
1405 3339578 345886.0 4735852 3029538 31405 193511 564.48 2780.1 2.85 2573473 5.9 33.5 712410.0 606842 119.69 8547.08 341640.0 605813 901459.0 2035551.0
1698 1844110 331783.0 3344185 2246969 36291 191669 649.22 795.0 2.77 2056884 3.9 35.7 483908.0 487496 135.09 9730.65 401095.0 525473 168807.0 1144104.0
1348 891019 320833.0 2199247 1552376 47839 126421 720.68 1039.6 2.42 1401488 5.0 38.3 94894.0 278602 150.11 11292.96 395459.0 484710 146077.0 216100.0
209 1890363 193333.0 2654282 1709385 29987 102480 534.73 3046.4 2.76 1335362 4.8 33.6 436469.0 332724 113.62 7832.26 170268.0 337376 608225.0 1072941.0
1355 1224672 159309.0 1901425 1358758 31438 138229 585.04 1571.7 2.56 1120747 5.3 41.0 83078.0 263467 124.13 7615.72 70007.0 279433 555082.0 574151.0
In [224]:
# Transform data
from numpy import log
sc_data = StandardScaler()
train_x_log = np.log(train_x_lasso)
train_y_log = np.log(test_log.loc[:,'provider_count'])
In [226]:
train_y_log.shape
Out[226]:
(3060,)
In [227]:
# Distribution of Dependent variable before and after transformation
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))
sns.distplot(train_y_log, ax=ax2)
sns.distplot(train_y_lasso, ax=ax1);

From the plots above, we can see how the distribution of provider count varies before and after log transformation.

Run Global Model

In this model, we will build an Ordinary Least Squares (OLS) Regression [1] model using predictors selected from feature selection.

In [228]:
X_train_log = sm.add_constant(train_x_log)
global_ols = sm.OLS(train_y_log, X_train_log).fit() 
In [229]:
global_ols.summary()
Out[229]:
OLS Regression Results
Dep. Variable: provider_count R-squared: 0.950
Model: OLS Adj. R-squared: 0.950
Method: Least Squares F-statistic: 2904.
Date: Mon, 17 Aug 2020 Prob (F-statistic): 0.00
Time: 16:28:41 Log-Likelihood: -1378.2
No. Observations: 3060 AIC: 2798.
Df Residuals: 3039 BIC: 2925.
Df Model: 20
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -2.8120 1.504 -1.870 0.062 -5.760 0.136
minority_pop 0.1668 0.027 6.125 0.000 0.113 0.220
grad_deg 0.2330 0.027 8.724 0.000 0.181 0.285
total_population 1.1271 0.372 3.027 0.002 0.397 1.857
edubase -0.0523 0.350 -0.149 0.881 -0.739 0.634
percap_income 0.4684 0.276 1.694 0.090 -0.074 1.010
asso_deg 0.0846 0.027 3.169 0.002 0.032 0.137
avg_medicare 3.3141 0.636 5.215 0.000 2.068 4.560
pop_density -0.0277 0.010 -2.862 0.004 -0.047 -0.009
avg_hhsz -2.7010 0.295 -9.171 0.000 -3.278 -2.124
white_pop -0.2686 0.048 -5.567 0.000 -0.363 -0.174
unemp_rate 0.0087 0.016 0.534 0.593 -0.023 0.041
median_age -0.9122 0.209 -4.355 0.000 -1.323 -0.502
otherace_pop 0.0319 0.016 2.004 0.045 0.001 0.063
some_college 0.0294 0.046 0.633 0.527 -0.062 0.120
avg_presdrug -3.9498 0.666 -5.934 0.000 -5.255 -2.645
avg_socsecurity -0.4264 0.250 -1.706 0.088 -0.917 0.064
asian_pop 0.0608 0.012 5.168 0.000 0.038 0.084
bach_deg 0.0059 0.039 0.152 0.879 -0.071 0.082
black_pop -0.1246 0.009 -13.506 0.000 -0.143 -0.106
hisp_pop -0.1045 0.024 -4.303 0.000 -0.152 -0.057
Omnibus: 174.445 Durbin-Watson: 1.753
Prob(Omnibus): 0.000 Jarque-Bera (JB): 673.039
Skew: 0.116 Prob(JB): 7.10e-147
Kurtosis: 5.286 Cond. No. 7.86e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.86e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Analysing the model summary, we can see that:

  • The adjusted R-squared of our global model is 0.95, which means that 95% of the variation in provider count can be explained by our model.
  • The p-value of the F-statistic is less than 0.05, which shows that at least one of the predictor variables has predictive power on the variability of Provider Count.
  • The p-values of individual variables determine which variables are statistically significant. The equation below is the final regression equation for our model showing statistically significant variables and their coefficients. Note that both dependent and independent variables are log transformed.
In [230]:
# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
global_ols.params[1:].sort_values(ascending=False).plot(kind='barh')
plt.title('Model Coefficients - Global Model');

Regression Equation

$$ \begin{equation*} provider\_count = 0.1668*minority\_pop + 0.2330*grad\_deg + 1.1271*total\_population - 0.0523*edubase + 0.4684*percap\_income\\ + 0.0846*asso\_deg + 3.3141*avg\_medicare - 0.0277*pop\_density - 2.7010*avg\_hhsz - 0.2686*white\_pop + 0.0087*unemp\_rate\\ - 0.9122*median\_age + 0.0319*otherace\_pop + 0.0294*some\_college - 3.9498*avg\_presdrug - 0.4264*avg\_socsecurity\\ + 0.0608*asian\_pop + 0.0059*bach\_deg - 0.1246*black\_pop - 0.1045*hisp\_pop \end{equation*} $$
In [44]:
# Model Interpretation
coef = 1.1271
((1.01**coef)-1)*100
Out[44]:
1.1278141976397205
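Because both the response and the predictors are log transformed, each coefficient β can be read as an approximate elasticity. The calculation above generalizes to the following standard log-log interpretation, matching the code in the previous cell: for a 1% increase in a predictor, holding the others fixed,

$$ \begin{equation*} \%\Delta\ provider\_count \approx \left(1.01^{\beta} - 1\right) \times 100 \end{equation*} $$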

Model Interpretations [9]:

  1. minority_pop - The coefficient is 0.1668. We can say that for a 1% increase in minority population, Provider Count will increase by 0.17% ((1.01^coef - 1)*100), holding all other predictors fixed.
  2. grad_deg - The coefficient is 0.2330. We can say that for a 1% increase in graduate degree holders, Provider Count will increase by 0.23%, holding all other predictors fixed.
  3. total_population - The coefficient is 1.1271. We can say that for a 1% increase in total population, Provider Count will increase by 1.127%, holding all other predictors fixed.
  4. edubase - The coefficient is -0.0523. We can say that for a 1% increase in the educational attainment base, Provider Count will decrease by 0.05%, holding all other predictors fixed.
  5. percap_income - The coefficient is 0.4684. We can say that for a 1% increase in per capita income, Provider Count will increase by 0.47%, holding all other predictors fixed.
  6. asso_deg - The coefficient is 0.0846. We can say that for a 1% increase in associate degree holders, Provider Count will increase by 0.08%, holding all other predictors fixed.
  7. avg_medicare - The coefficient is 3.3141. We can say that for a 1% increase in average Medicare payments, Provider Count will increase by 3.31%, holding all other predictors fixed.
  8. pop_density - The coefficient is -0.0277. We can say that for a 1% increase in population density, Provider Count will decrease by 0.027%, holding all other predictors fixed.
  9. avg_hhsz - The coefficient is -2.7010. We can say that for a 1% increase in average household size, Provider Count will decrease by 2.70%, holding all other predictors fixed.
  10. white_pop - The coefficient is -0.2686. We can say that for a 1% increase in the White population, Provider Count will decrease by 0.27%, holding all other predictors fixed.
  11. unemp_rate - The coefficient is 0.0087. We can say that for a 1% increase in the unemployment rate, Provider Count will increase by 0.009%, holding all other predictors fixed.
  12. median_age - The coefficient is -0.9122. We can say that for a 1% increase in median age, Provider Count will decrease by 0.91%, holding all other predictors fixed.
  13. otherace_pop - The coefficient is 0.0319. We can say that for a 1% increase in the other race population, Provider Count will increase by 0.03%, holding all other predictors fixed.
  14. some_college - The coefficient is 0.0294. We can say that for a 1% increase in people with some college education, Provider Count will increase by 0.03%, holding all other predictors fixed.
  15. avg_presdrug - The coefficient is -3.9498. We can say that for a 1% increase in average prescription drug spending, Provider Count will decrease by 3.95%, holding all other predictors fixed.
  16. avg_socsecurity - The coefficient is -0.4264. We can say that for a 1% increase in average pensions and Social Security spending, Provider Count will decrease by 0.43%, holding all other predictors fixed.
  17. asian_pop - The coefficient is 0.0608. We can say that for a 1% increase in the Asian population, Provider Count will increase by 0.06%, holding all other predictors fixed.
  18. bach_deg - The coefficient is 0.0059. We can say that for a 1% increase in bachelor's degree holders, Provider Count will increase by 0.006%, holding all other predictors fixed.
  19. black_pop - The coefficient is -0.1246. We can say that for a 1% increase in the Black population, Provider Count will decrease by 0.125%, holding all other predictors fixed.
  20. hisp_pop - The coefficient is -0.1045. We can say that for a 1% increase in the Hispanic population, Provider Count will decrease by 0.105%, holding all other predictors fixed.
In [232]:
# Calculate RMSE of global model
pred_val = global_ols.fittedvalues.copy()
new_rmse = rmse(train_y_log, pred_val)
new_rmse
Out[232]:
0.3796336341163295

We can see that the RMSE of our model is 0.38. Since RMSE has the same unit as the dependent variable, we can compare it with the range of the dependent variable to see how spread out our residuals are and how well the model fits. A value of 0.38 on a range of roughly 0-11 (the range of the log-transformed dependent variable) is small, indicating the model is a good fit.
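The 0-11 range quoted above can be verified directly from the transformed response (a minimal check using the train_y_log series created earlier):

# Range of the log-transformed response, used to put the RMSE in context
print(round(train_y_log.min(), 2), round(train_y_log.max(), 2))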

Verify Assumptions

Here, we will verify regression assumptions [2] of global model.

In [233]:
# Residual Value
residual = global_ols.resid
In [236]:
import scipy as sp
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))

# Residuals vs Fitted
ax1.scatter(pred_val, residual)
ax1.hlines(y=0, xmin=min(pred_val), xmax=max(pred_val), linewidth=2, color='r')
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')

# QQ plot
sp.stats.probplot(residual, plot=ax2, fit=True);
In [237]:
# Residuals vs Predictors
states = train_x_log.columns
fig = plt.figure(figsize=(15, 20))

for sp in range(0,10):
    mini = min(train_x_log.loc[:,states[sp]])
    maxi = max(train_x_log.loc[:,states[sp]])
    ax = fig.add_subplot(7,4,sp+1)
    ax.scatter(train_x_log.loc[:,states[sp]], residual)
    ax.hlines(y=0, xmin=mini, xmax=maxi, linewidth=2, color='r')
    ax.set_xlabel(states[sp])
    ax.set_ylabel('Residuals')
#     ax.set_ylim(0,provType.iloc[0,1])
    plt.xticks(rotation=0)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout(pad=0.5) # automatically adjusts layout to fit long labels
plt.show()
  1. Residuals vs Fitted plot
    • Constant Variance: From the plot, we still observe a departure from constant variance, as the distance of the data points from the zero line is not consistent. We can see that the variation in the residuals is high initially and then starts to drop.
    • Independence: We see that the data points are randomly distributed along the zero line, indicating that the independence assumption holds.
  2. Probability Plot
    • Normality: From this plot, we see that the residuals follow a straight line with departures at the tails, so the normality assumption does not hold.
  3. Residuals vs Predictors Plot
    • Linearity: From the plot, we can see that the residuals for each predictor variable are randomly distributed across the zero line. We can say that the linearity assumption holds.

Spatial Models - Setup

Let's build some spatial models, such as Geographically Weighted Regression (GWR), Forest-based Classification and Regression Trees, and Local Bivariate Relationships, to understand how provider count is influenced by different factors at the county level. Before we build these models, let's import libraries, define an output directory and create an environment workspace.

In [5]:
# Import Libraries
import arcpy
import os
In [6]:
# Set data directory to output results

arcgis_dir = './arcgis'
home_dir = os.path.join(arcgis_dir, 'home')

print(f"root dir: {arcgis_dir}")
print(f"home dir: {home_dir}")
root dir: ./arcgis
home dir: ./arcgis\home
In [7]:
# Create empty FGDB in data directory
if not arcpy.Exists(os.path.join(home_dir, 'Results.gdb')):
    arcpy.CreateFileGDB_management(home_dir, 'Results.gdb')
In [8]:
# Setup Workspace
arcpy.env.workspace = os.path.join(home_dir,'Results.gdb')
arcpy.env.workspace
Out[8]:
'./ArcGIS\\home\\Results.gdb'

Geographically Weighted Regression (GWR) Model

The global model helps us identify a relationship between predictor variables and provider count at the country level. However, the dynamics of how different variables impact provider count can be very different for different counties within a state. The question we ask here is:

What if an influential variable's impact on provider count varies across different counties, i.e., how does the impact of average household size (or any other variable) on provider count vary across different counties?

To answer this question, we will explore a Geographically Weighted Regression model.
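In its standard textbook form (not taken from the notebook), GWR fits a separate set of coefficients at each location, so for county i with centroid coordinates (u_i, v_i):

$$ \begin{equation*} y_i = \beta_0(u_i, v_i) + \sum_{k} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i \end{equation*} $$

The C_* columns in the GWR output dataframe shown later hold these locally estimated coefficients, with the SE_* columns giving their standard errors.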

Learn more about Geographically Weighted Regression.

Run GWR Model

To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders. The output of the model will be stored in a file geodatabase in /arcgis/home which can then be accessed to visualize the results.

In [9]:
# Run GWR and output results to the FGDB
from arcpy.stats import GWR

# Workspace
arcpy.env.workspace

# Url for the layer
lyr_url = allprovider_layer.url

predictors = ['asian_pop','amerind_pop','avg_hhsz','black_pop','hisp_pop',
              'median_age','minority_pop','otherace_pop','percap_income',
              'unemp_rate','white_pop','some_college','grad_deg','avg_healthinsurance',
              'avg_medicare','avg_labtest','avg_presdrug','avg_personalinsurance']

# Define model variables
in_features = lyr_url
dependent_variable = 'provider_count'
model_type = 'CONTINUOUS'
explanatory_variables = predictors
output_features = 'GWRResults_NB1'
neighborhood_type = 'NUMBER_OF_NEIGHBORS'
neighborhood_selection_method = 'USER_DEFINED'
minimum_number_of_neighbors = None
maximum_number_of_neighbors = None
minimum_search_distance = None
maximum_search_distance = None
number_of_neighbors_increment = None
search_distance_increment = None
number_of_increments = None
number_of_neighbors = 50
distance_band = None
prediction_locations = None
explanatory_variables_to_match = None
output_predicted_features = None
robust_prediction = 'ROBUST'
local_weighting_scheme = 'GAUSSIAN'
coefficient_raster_workspace = None

# Run Model
GWR(in_features, dependent_variable, model_type, explanatory_variables, output_features, neighborhood_type,
    neighborhood_selection_method, minimum_number_of_neighbors, maximum_number_of_neighbors, minimum_search_distance,
    maximum_search_distance, number_of_neighbors_increment, search_distance_increment, number_of_increments,
    number_of_neighbors, distance_band, prediction_locations, explanatory_variables_to_match, output_predicted_features,
    robust_prediction, local_weighting_scheme,coefficient_raster_workspace)
Out[9]:

Output

id  value
0   ./ArcGIS\home\Results.gdb\GWRResults_NB1
1
2

Messages

Start Time: Tuesday, August 18, 2020 10:16:36 AM

------------- Analysis Details -------------
Number of Features: 3139
Dependent Variable: PROVIDER_COUNT
Explanatory Variables: ASIAN_POP
AMERIND_POP
AVG_HHSZ
BLACK_POP
HISP_POP
MEDIAN_AGE
MINORITY_POP
OTHERACE_POP
PERCAP_INCOME
UNEMP_RATE
WHITE_POP
SOME_COLLEGE
GRAD_DEG
AVG_HEALTHINSURANCE
AVG_MEDICARE
AVG_LABTEST
AVG_PRESDRUG
AVG_PERSONALINSURANCE
Number of Neighbors: 50
--------------------------------------------

----------- Model Diagnostics -----------
R2: 0.9913
AdjR2: 0.9894
AICc: 50227.1493
Sigma-Squared: 466744.4992
Sigma-Squared MLE: 382271.0780
Effective Degrees of Freedom: 2570.8903
-----------------------------------------

WARNING 110249: Input variables number is 19, will only use the first 10 to create the scatter plot matrix.
Succeeded at Tuesday, August 18, 2020 10:17:19 AM (Elapsed Time: 43.30 seconds)

Access GWR Results as Spatially Enabled Dataframe

In [10]:
# Access GWR data from local
gwr_df = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
gwr_df.head()
Out[10]:
OBJECTID SOURCE_ID provider_count asian_pop amerind_pop avg_hhsz black_pop hisp_pop median_age minority_pop otherace_pop percap_income unemp_rate white_pop some_college grad_deg avg_healthinsurance avg_medicare avg_labtest avg_presdrug avg_personalinsurance INTRCPT SE_INTRCPT C_ASIAN_POP SE_ASIAN_POP C_AMERIND_POP SE_AMERIND_POP C_AVG_HHSZ SE_AVG_HHSZ C_BLACK_POP SE_BLACK_POP C_HISP_POP SE_HISP_POP C_MEDIAN_AGE SE_MEDIAN_AGE C_MINORITY_POP SE_MINORITY_POP C_OTHERACE_POP SE_OTHERACE_POP C_PERCAP_INCOME SE_PERCAP_INCOME C_UNEMP_RATE SE_UNEMP_RATE C_WHITE_POP SE_WHITE_POP C_SOME_COLLEGE SE_SOME_COLLEGE C_GRAD_DEG SE_GRAD_DEG C_AVG_HEALTHINSURANCE SE_AVG_HEALTHINSURANCE C_AVG_MEDICARE SE_AVG_MEDICARE C_AVG_LABTEST SE_AVG_LABTEST C_AVG_PRESDRUG SE_AVG_PRESDRUG C_AVG_PERSONALINSURANCE SE_AVG_PERSONALINSURANCE PREDICTED RESIDUAL STDRESID INFLUENCE COOKS_D CND_NUMBER LOCALR2 SHAPE
0 1 1 1044.0 1701 423 2.85 12062 3945 41.0 20664 886 41430 5.6 75625 16211 8931 5227.44 767.21 83.06 157.57 592.98 7083.320304 2163.643990 0.310213 0.034698 0.990575 0.151966 -4103.182427 1022.086678 0.361931 0.029911 0.352328 0.028752 97.160179 30.503968 -0.303474 0.031008 -0.182483 0.025863 -0.356893 0.069323 -10.183314 37.897324 0.029947 0.002078 -0.138177 0.016391 0.077414 0.005721 13.028243 2.624787 30.874419 13.078704 -262.776975 48.454211 -209.364470 65.866365 -36.852385 9.987946 638.394954 405.605046 0.630561 0.113512 1.214450e-04 5.690942e+07 0.973956 {"rings": [[[-8529642.2486, 4688391.039899997]...
1 2 2 5.0 6 3 2.40 895 8 37.9 913 2 16066 5.2 494 199 45 2149.61 475.83 36.65 102.27 191.88 -1715.168222 2993.517884 -0.458553 0.134184 -0.362877 0.178817 664.571445 1007.225468 -0.330165 0.121548 -0.246908 0.115457 -8.918920 31.196270 0.343730 0.123639 -0.231777 0.053857 0.072423 0.118846 -26.403059 22.367435 0.000929 0.007559 0.015853 0.042727 0.139246 0.034656 3.947265 4.227333 56.098070 28.122803 -224.833764 112.938033 -231.319434 132.906664 -20.224481 20.583998 -9.689861 14.689861 0.023290 0.147643 2.241195e-07 3.286173e+07 0.974494 {"rings": [[[-10120538.0346, 3896326.054200001...
2 3 3 621.0 421 168 2.47 1422 2977 40.0 5873 882 24981 5.5 51908 7627 2153 3081.77 547.19 51.40 114.41 313.28 3784.309355 2761.277425 -0.214660 0.068190 -0.482683 0.178969 -47.118911 993.920916 -0.075986 0.057489 -0.166814 0.044674 -31.561083 28.793589 0.080670 0.055741 0.148595 0.050896 -0.049633 0.105370 -10.370135 29.855035 -0.016545 0.003015 0.102708 0.014245 0.292671 0.014161 0.032765 3.547578 25.490189 20.667847 -114.990943 82.141477 -92.796269 102.054007 4.643850 16.127691 468.338221 152.661779 0.227026 0.031206 3.960186e-06 1.113030e+08 0.995132 {"rings": [[[-9221715.5229, 5050074.560699999]...
3 4 4 79.0 258 69 2.77 558 1307 40.4 2464 759 21821 3.1 16894 2043 574 3042.93 557.61 47.81 122.89 289.06 1903.340959 2856.660109 0.182783 0.157086 0.206154 0.184325 177.776565 1150.078324 0.191140 0.155578 0.095773 0.179583 -50.293477 30.293061 -0.188790 0.156154 0.162334 0.091728 0.103288 0.120806 -0.971110 30.753106 0.004361 0.008765 0.065683 0.047084 0.116723 0.026147 -2.651421 4.443284 14.083429 22.637054 -108.647983 107.579662 -21.039989 111.969049 17.232070 18.195763 -173.106408 252.106408 0.400106 0.149376 6.705767e-05 6.998668e+07 0.988206 {"rings": [[[-9290699.0981, 4093591.7991999984...
4 5 5 1210.0 1727 307 2.81 49467 9491 37.8 62154 4497 26350 6.1 31300 11944 5462 3356.61 520.50 54.48 109.17 344.50 863.913879 2593.718643 0.288447 0.170895 0.365302 0.233586 199.681335 1023.574012 0.299485 0.172295 0.246918 0.210308 -35.740864 29.912414 -0.286954 0.172091 0.051971 0.120297 0.064150 0.111697 4.024449 29.745940 0.011930 0.010230 0.020039 0.053421 0.100307 0.029699 -0.201878 4.513428 27.416702 24.186869 -196.397368 120.138953 -84.888220 117.761353 11.976873 17.869470 868.152228 341.847772 0.540363 0.142536 1.157803e-04 5.948622e+07 0.991026 {"rings": [[[-9341355.0324, 3994505.8356000036...
In [11]:
gwr_df.shape
Out[11]:
(3139, 67)
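
Before mapping the local coefficients, it can help to check how well the GWR model fits locally. The snippet below is a minimal sketch (it assumes the gwr_df DataFrame loaded above and uses an illustrative 0.9 threshold) that summarizes the LOCALR2 column:

In [ ]:
# Summarize local R-squared values across counties
print(gwr_df['LOCALR2'].describe())

# Count counties where the local fit is comparatively weak (illustrative threshold)
print((gwr_df['LOCALR2'] < 0.9).sum())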

Plot GWR Results

Plot Average Household Size

In [262]:
# Create Map
gwr_avg_hhsz_map = gis.map('Washington DC', 5)
gwr_avg_hhsz_map
Out[262]:

The map shows how provider count varies with average household size across different counties. We can see:

  • In counties in New York, New Jersey, and the Philadelphia area, an increase in average household size has a higher impact on provider count.
  • Similarly, in counties in Washington, Virginia, Maine, New Hampshire, and Massachusetts, an increase in average household size has a higher impact on provider count.
  • In the Midwest, areas near Chicago and Minneapolis also see a higher impact on provider count.
In [12]:
# Create Map
gwr_avg_hhsz_map = gis.map('Washington DC', 5)
gwr_avg_hhsz_map
In [13]:
gwr_avg_hhsz_map.remove_layers()
gwr_df.spatial.plot(map_widget=gwr_avg_hhsz_map,
        renderer_type='c',  # for class breaks renderer
        method='esriClassifyNaturalBreaks',  # classification algorithm
        class_count=6,  # choose the number of classes
        col='C_AVG_HHSZ',  # numeric column to classify
        cmap='OrRd',  # color map to pick colors from for each class
        alpha=0.7,  # specify opacity
        line_width=0)  # no outline
Out[13]:
True
In [253]:
# Add Legend
gwr_avg_hhsz_map.legend = True
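
As a quick numeric check on the patterns described above, the spread of the local household-size coefficients can be summarized directly from the results table. This is a minimal sketch that assumes the gwr_df DataFrame loaded earlier:

In [ ]:
# Spread of the local AVG_HHSZ coefficients across counties
print(gwr_df['C_AVG_HHSZ'].describe())

# Counties with the strongest positive local household-size effect
print(gwr_df.nlargest(10, 'C_AVG_HHSZ')[['SOURCE_ID', 'C_AVG_HHSZ', 'LOCALR2']])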

Plot Average Prescription Drug Prices

In [266]:
# Create Map
gwr_avg_presdrug_map = gis.map('USA', 4)
gwr_avg_presdrug_map
Out[266]:

The map shows how provider count varies with average prescription drug prices across different counties. We can see:

  • Provider count in the majority of California and in areas of New Hampshire is strongly negatively impacted by an increase in average prescription drug prices, meaning that an increase in average prescription drug prices will reduce provider count.
  • For areas in Minnesota and Wisconsin, provider count is strongly positively impacted by an increase in average prescription drug prices, meaning that an increase in average prescription drug prices will increase provider count.
  • Although provider count is positively impacted by average prescription drug prices across the majority of the Midwest, some counties in northern New Mexico, southern Colorado, and Kansas show a slight negative impact.
In [264]:
gwr_avg_presdrug_map.remove_layers()
gwr_df.spatial.plot(map_widget=gwr_avg_presdrug_map,
        renderer_type='c',  # for class breaks renderer
        method='esriClassifyNaturalBreaks',  # classification algorithm
        class_count=6,  # choose the number of classes
        col='C_AVG_PRESDRUG',  # numeric column to classify
        cmap='OrRd',  # color map to pick colors from for each class
        alpha=0.7,  # specify opacity
        line_width=0.05)  # outline width
Out[264]:
True
In [265]:
gwr_avg_presdrug_map.legend = True
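
The positive/negative split described above can also be verified by counting the sign of the local coefficients. A minimal sketch, assuming gwr_df is still in memory:

In [ ]:
# Number of counties with a positive vs. negative local prescription-drug coefficient
print((gwr_df['C_AVG_PRESDRUG'] > 0).value_counts())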

Forest Based Classification and Regression Trees Model

The global model and the GWR model are parametric models: they either assume a linear relationship between the predictors and the response variable or assume a specific functional form, and they identify influential factors at the country level and at the county level, respectively. These methods are easy to understand; however, they are highly constrained to the specified form. But:

What if this relationship between predictors and response is not linear? OR What if we do not make strong assumptions about the form of function?

To answer this question, we will explore a non-parametric technique called the Forest Based Classification and Regression Trees model. Forest based models are flexible black box models that capture non-linear and complex relationships quite well and are, to an extent, robust to outliers and missing values.

Forest Based Classification and Regression Trees creates an ensemble of decision trees. Each decision tree is built using a randomly sampled portion of the data. Each tree generates its own prediction and votes on an outcome, and the forest model combines the votes from all decision trees to predict or classify the outcome. The model also outputs a variable importance table that quantifies the importance of each variable supplied to the model.

Learn more about Forest Based Classification and Regression Trees.
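
The idea behind the forest model can also be illustrated outside of ArcGIS. The sketch below is not the arcpy Forest tool used in this notebook; it is a small scikit-learn analogue on synthetic data that shows the same ingredients: an ensemble of bootstrapped trees, a held-out validation set, an out-of-bag error estimate, and per-variable importances.

In [ ]:
# Illustrative scikit-learn analogue of a forest-based regression (synthetic data)
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                                      # 5 synthetic predictors
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)   # non-linear response

# Hold out 30% of the data for validation, mirroring the setup used below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each fit on a bootstrap sample; the out-of-bag score approximates validation R2
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print('Out-of-bag R2:', rf.oob_score_)
print('Validation R2:', rf.score(X_test, y_test))
print('Variable importances:', rf.feature_importances_)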

Build Forest Based CART Model

To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders. The output of the model will be stored in a file geodatabase which can then be accessed to visualize the results.

Here, we created a 'Train Only' model with all predictors. The model is trained on 70% of the data and the remaining 30% is excluded for validation. The model was run for 5 validation iterations.

Note that increasing the number of trees in the forest model will result in more accurate model prediction, but the model will take longer to calculate. Learn more about the parameters used and outputs generated by this model here.

In [17]:
from arcpy.stats import Forest

# Workspace
arcpy.env.workspace
lyr_url = allprovider_layer.url

predictors = [['amerind_pop','false'],['asian_pop','false'],['asso_deg','false'],['avg_healthcare','false'],
              ['avg_healthinsurance','false'],['avg_hhinc','false'],['avg_hhsz','false'],['avg_labtest','false'],
              ['avg_medicalcare','false'],['avg_medicalsrvc','false'],['avg_medicare','false'],['avg_personalinsurance','false'],
              ['avg_presdrug','false'],['avg_socsecurity','false'],['bach_deg','false'],['grad_deg','false'],
              ['hisp_pop','false'],['median_age','false'],['minority_pop','false'],['otherace_pop','false'],
              ['percap_income','false'],['pop_density','false'],['some_college','false'],['unemp_rate','false'],
              ['white_pop','false']]

# Define model variables
prediction_type = 'TRAIN'
in_features = lyr_url
variable_predict = 'provider_count'
treat_variable_as_categorical = None
explanatory_variables = predictors
distance_features = None
explanatory_rasters = None
features_to_predict = None
output_features = None
output_raster = None
explanatory_variable_matching = None
explanatory_distance_matching = None
explanatory_rasters_matching = None
output_trained_features = 'RF_Output_Result1'
output_importance_table = 'RF_VImp_Result1'
use_raster_values = True
number_of_trees = 100
minimum_leaf_size = None
maximum_depth = None
sample_size = 100
random_variables = None
percentage_for_training = 30
output_classification_table = None
output_validation_table = None
compensate_sparse_categories = False
number_validation_runs = 5
calculate_uncertainty = False

# Run Model
Forest(prediction_type,in_features, variable_predict, treat_variable_as_categorical, explanatory_variables, distance_features, 
       explanatory_rasters, features_to_predict, output_features, output_raster, explanatory_variable_matching,
      explanatory_distance_matching, explanatory_rasters_matching, output_trained_features, output_importance_table,
      use_raster_values, number_of_trees, minimum_leaf_size, maximum_depth, sample_size, random_variables, 
       percentage_for_training, output_classification_table, output_validation_table, compensate_sparse_categories, 
       number_validation_runs, calculate_uncertainty)
Out[17]:

Output

id  value
0
1
2   ./ArcGIS\home\Results.gdb\RF_Output_Result1
3   .\ArcGIS\home\Results.gdb\RF_VImp_Result1
4
5
6

Messages

Start Time: Tuesday, August 18, 2020 11:18:47 AM
Random Seed: 743240

-------------- Model Characteristics ---------------
Number of Trees 100
Leaf Size 5
Tree Depth Range 21-35
Mean Tree Depth 24
% of Training Available per Tree 100
Number of Randomly Sampled Variables 8
% of Training Data Excluded for Validation 30

----------------- Model Out of Bag Errors -----------------
Number of Trees 50 100
MSE 3548137.480 3557138.319
% of variation explained 89.876 89.850

------------------ Top Variable Importance -------------------
Variable Importance %
bach_deg 19939112417.90 26
grad_deg 16751463301.93 22
asso_deg 10797452932.51 14
white_pop 10407959333.70 14
some_college 5983283750.09 8
minority_pop 3769590765.62 5
asian_pop 1757683683.45 2
otherace_pop 1570518369.82 2
hisp_pop 1149944355.78 2
pop_density 931317892.05 1
avg_hhsz 449321852.69 1
avg_labtest 294453790.43 0
unemp_rate 240548007.81 0
amerind_pop 194586943.19 0
median_age 176352561.41 0
avg_medicalcare 171480757.12 0
avg_hhinc 167152354.09 0
avg_presdrug 155137159.98 0
percap_income 150177406.71 0
avg_healthinsurance 135636139.98 0

----- Training Data: Regression Diagnostics ------
R-Squared 0.980
p-value 0.000
Standard Error 0.003
*Predictions for the data used to train the model compared to the observed categories for those features

---- Validation Data: Regression Diagnostics -----
R-Squared 0.806
p-value 0.000
Standard Error 0.010
*Predictions for the test data (excluded from model training) compared to the observed values for those test features

Median R2 0.806 was approximately reached at seed 600633

------------------------- Explanatory Variable Range Diagnostics -------------------------
Training Validation Training Validation
Variable Minimum Maximum Minimum Maximum Share(a) Share(b)
asian_pop 0.00 721646.00 0.00 1517296.00 0.48* 2.10
amerind_pop 0.00 56871.00 0.00 98584.00 0.58* 1.73
avg_hhsz 2.00 4.34 1.96 4.32 0.98* 0.99*
avg_hhinc 36078.00 162801.00 36312.00 146887.00 1.00 0.87*
hisp_pop 0.00 2035551.00 4.00 5043293.00 0.40* 1.00*
median_age 23.20 62.50 23.60 58.60 1.00 0.89*
minority_pop 4.00 3339578.00 9.00 7644196.00 0.44* 1.00*
otherace_pop 0.00 712410.00 0.00 2305030.00 0.31* 3.24
percap_income 9853.00 71106.00 12247.00 57556.00 1.00 0.74*
pop_density 0.00 72736.20 0.10 38009.90 1.00 0.52*
unemp_rate 0.10 27.90 0.00 23.20 1.00* 0.83*
white_pop 65.00 2859209.00 205.00 5019340.00 0.57* 1.00*
some_college 22.00 690225.00 106.00 1300786.00 0.53* 1.00*
asso_deg 5.00 237028.00 33.00 469649.00 0.50* 1.00*
bach_deg 2.00 830433.00 38.00 1445895.00 0.57* 1.00*
grad_deg 0.00 552050.00 19.00 782758.00 0.71* 1.00*
avg_healthcare 2821.51 10690.41 2979.78 9276.07 1.00 0.80*
avg_healthinsurance 1871.50 6873.42 1942.71 6168.19 1.00 0.84*
avg_medicare 342.05 1501.33 358.05 1150.11 1.00 0.68*
avg_medicalcare 950.01 3816.99 1033.36 3107.89 1.00 0.72*
avg_medicalsrvc 496.89 1955.37 523.61 1747.84 1.00 0.84*
avg_labtest 25.87 149.41 30.83 114.46 1.00 0.68*
avg_presdrug 73.65 298.34 76.64 228.55 1.00 0.68*
avg_personalinsurance 162.69 791.45 174.19 687.83 1.00 0.82*
avg_socsecurity 3121.71 16133.33 3163.22 14201.61 1.00 0.85*
(a) % of overlap between the ranges of the training data and the input explanatory variable
(b) % of overlap between the ranges of the validation data and the training data
* Data ranges do not coincide. Training or validation is occurring with incomplete data.
+ Ranges of the training data and prediction data do not coincide and the tool is attempting to extrapolate.
Succeeded at Tuesday, August 18, 2020 11:19:20 AM (Elapsed Time: 33.42 seconds)

Get Variable Importance

In [18]:
# RF Vimp Table
rf_df = pd.DataFrame.spatial.from_table(os.path.join(arcpy.env.workspace,output_importance_table))
rf_df.head()
Out[18]:
OBJECTID ASIAN_POP AMERIND_POP AVG_HHSZ AVG_HHINC HISP_POP MEDIAN_AGE MINORITY_POP OTHERACE_POP PERCAP_INCOME POP_DENSITY UNEMP_RATE WHITE_POP SOME_COLLEGE ASSO_DEG BACH_DEG GRAD_DEG AVG_HEALTHCARE AVG_HEALTHINSURANCE AVG_MEDICARE AVG_MEDICALCARE AVG_MEDICALSRVC AVG_LABTEST AVG_PRESDRUG AVG_PERSONALINSURANCE AVG_SOCSECURITY BEST_ITE
0 1 1.936236e+09 3.626429e+08 3.493813e+08 1.125670e+08 1.250103e+09 1.173539e+08 4.140227e+09 6.644813e+08 1.448598e+08 7.810601e+08 1.923894e+08 7.744006e+09 7.352001e+09 7.441370e+09 1.640525e+10 1.491561e+10 8.568482e+07 7.602678e+07 1.409397e+08 1.779115e+08 1.250645e+08 1.550210e+08 1.979505e+08 1.010077e+08 8.123558e+07 Iterations
1 2 2.433282e+09 3.596755e+08 5.197042e+08 8.798179e+07 7.996029e+08 8.551507e+07 3.733401e+09 1.230084e+09 2.092976e+08 8.595931e+08 2.186328e+08 7.824331e+09 7.049030e+09 3.981829e+09 1.680595e+10 1.614554e+10 9.478313e+07 1.280119e+08 8.784919e+07 8.976733e+07 8.299869e+07 1.931147e+08 1.128664e+08 1.231691e+08 1.423122e+08 Iterations
2 3 6.599357e+09 3.059011e+09 3.728035e+08 1.275330e+08 3.492286e+09 2.321379e+08 7.863340e+09 2.919583e+09 6.579626e+08 6.159041e+08 1.683346e+08 1.039170e+10 1.243282e+10 7.768626e+09 1.736604e+10 2.124281e+10 1.431652e+08 1.175526e+08 1.172117e+08 1.422636e+08 1.450450e+08 1.939403e+08 1.784143e+08 1.297512e+08 1.678500e+08 Best Iteration
3 4 1.757684e+09 1.945869e+08 4.493219e+08 1.671524e+08 1.149944e+09 1.763526e+08 3.769591e+09 1.570518e+09 1.501774e+08 9.313179e+08 2.405480e+08 1.040796e+10 5.983284e+09 1.079745e+10 1.993911e+10 1.675146e+10 1.288261e+08 1.356361e+08 1.155745e+08 1.714808e+08 1.001931e+08 2.944538e+08 1.551372e+08 8.507679e+07 8.355390e+07 Iterations
4 5 2.814250e+09 1.713711e+08 4.762753e+08 1.089614e+08 9.045489e+08 1.936031e+08 4.892379e+09 8.828760e+08 1.527177e+08 7.046872e+08 3.028327e+08 9.538610e+09 1.130780e+10 1.121870e+10 1.794521e+10 1.795797e+10 1.915149e+08 1.026670e+08 1.464976e+08 1.400921e+08 1.150179e+08 2.381621e+08 1.880367e+08 9.893507e+07 1.269060e+08 Iterations
In [19]:
rf_df.shape
Out[19]:
(5, 27)

Plot Variable Importance

In [20]:
data = rf_df.drop(['OBJECTID','BEST_ITE'],axis=1)
data
Out[20]:
ASIAN_POP AMERIND_POP AVG_HHSZ AVG_HHINC HISP_POP MEDIAN_AGE MINORITY_POP OTHERACE_POP PERCAP_INCOME POP_DENSITY UNEMP_RATE WHITE_POP SOME_COLLEGE ASSO_DEG BACH_DEG GRAD_DEG AVG_HEALTHCARE AVG_HEALTHINSURANCE AVG_MEDICARE AVG_MEDICALCARE AVG_MEDICALSRVC AVG_LABTEST AVG_PRESDRUG AVG_PERSONALINSURANCE AVG_SOCSECURITY
0 1.936236e+09 3.626429e+08 3.493813e+08 1.125670e+08 1.250103e+09 1.173539e+08 4.140227e+09 6.644813e+08 1.448598e+08 7.810601e+08 1.923894e+08 7.744006e+09 7.352001e+09 7.441370e+09 1.640525e+10 1.491561e+10 8.568482e+07 7.602678e+07 1.409397e+08 1.779115e+08 1.250645e+08 1.550210e+08 1.979505e+08 1.010077e+08 8.123558e+07
1 2.433282e+09 3.596755e+08 5.197042e+08 8.798179e+07 7.996029e+08 8.551507e+07 3.733401e+09 1.230084e+09 2.092976e+08 8.595931e+08 2.186328e+08 7.824331e+09 7.049030e+09 3.981829e+09 1.680595e+10 1.614554e+10 9.478313e+07 1.280119e+08 8.784919e+07 8.976733e+07 8.299869e+07 1.931147e+08 1.128664e+08 1.231691e+08 1.423122e+08
2 6.599357e+09 3.059011e+09 3.728035e+08 1.275330e+08 3.492286e+09 2.321379e+08 7.863340e+09 2.919583e+09 6.579626e+08 6.159041e+08 1.683346e+08 1.039170e+10 1.243282e+10 7.768626e+09 1.736604e+10 2.124281e+10 1.431652e+08 1.175526e+08 1.172117e+08 1.422636e+08 1.450450e+08 1.939403e+08 1.784143e+08 1.297512e+08 1.678500e+08
3 1.757684e+09 1.945869e+08 4.493219e+08 1.671524e+08 1.149944e+09 1.763526e+08 3.769591e+09 1.570518e+09 1.501774e+08 9.313179e+08 2.405480e+08 1.040796e+10 5.983284e+09 1.079745e+10 1.993911e+10 1.675146e+10 1.288261e+08 1.356361e+08 1.155745e+08 1.714808e+08 1.001931e+08 2.944538e+08 1.551372e+08 8.507679e+07 8.355390e+07
4 2.814250e+09 1.713711e+08 4.762753e+08 1.089614e+08 9.045489e+08 1.936031e+08 4.892379e+09 8.828760e+08 1.527177e+08 7.046872e+08 3.028327e+08 9.538610e+09 1.130780e+10 1.121870e+10 1.794521e+10 1.795797e+10 1.915149e+08 1.026670e+08 1.464976e+08 1.400921e+08 1.150179e+08 2.381621e+08 1.880367e+08 9.893507e+07 1.269060e+08
In [22]:
# Plot Importance
sns.set(rc={'figure.figsize':(15,8.27)})
data = rf_df.drop(['OBJECTID','BEST_ITE'],axis=1)
g = sns.boxplot(data = data)
plt.xticks(rotation=90);

We ran 5 iterations of the Random Forest model. The box plot shows the variable importance of each variable and how that importance varies across iterations.
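
To compare iterations more directly, the raw importance values can be normalized to percentage shares per iteration, mirroring the 'Importance %' column reported by the tool. A minimal sketch using the data DataFrame from the previous cell:

In [ ]:
# Convert raw importance values to percentage shares within each iteration
importance_pct = data.div(data.sum(axis=1), axis=0) * 100
importance_pct.round(1)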

Plot Best Iteration

In [23]:
# Create dataframe for Best Iteration
best = rf_df[rf_df['BEST_ITE']=='Best Iteration']
best
Out[23]:
OBJECTID ASIAN_POP AMERIND_POP AVG_HHSZ AVG_HHINC HISP_POP MEDIAN_AGE MINORITY_POP OTHERACE_POP PERCAP_INCOME POP_DENSITY UNEMP_RATE WHITE_POP SOME_COLLEGE ASSO_DEG BACH_DEG GRAD_DEG AVG_HEALTHCARE AVG_HEALTHINSURANCE AVG_MEDICARE AVG_MEDICALCARE AVG_MEDICALSRVC AVG_LABTEST AVG_PRESDRUG AVG_PERSONALINSURANCE AVG_SOCSECURITY BEST_ITE
2 3 6.599357e+09 3.059011e+09 3.728035e+08 1.275330e+08 3.492286e+09 2.321379e+08 7.863340e+09 2.919583e+09 6.579626e+08 6.159041e+08 1.683346e+08 1.039170e+10 1.243282e+10 7.768626e+09 1.736604e+10 2.124281e+10 1.431652e+08 1.175526e+08 1.172117e+08 1.422636e+08 1.450450e+08 1.939403e+08 1.784143e+08 1.297512e+08 1.678500e+08 Best Iteration
In [25]:
# Create Plot
sns.set(rc={'figure.figsize':(15,8.27)})
sns.barplot(data=best.drop(['OBJECTID','BEST_ITE'], axis=1), color='lightcoral')
plt.xticks(rotation=90)
plt.title('Variable Importance - Best Iteration', fontsize=16);

The bar plot shows the variable importance of each variable for the best iteration. These variables will now be used in the next step of our analysis to understand how they vary with respect to provider count across various counties in the U.S.
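
The same ranking can also be pulled out programmatically from the best iteration rather than read off the chart. A minimal sketch using the best DataFrame from the previous cell:

In [ ]:
# Rank variables by importance for the best iteration
top_vars = (best.drop(['OBJECTID', 'BEST_ITE'], axis=1)
                .iloc[0]
                .sort_values(ascending=False))
top_vars.head(5)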

Local Bivariate Relationships (LBR) Model

The Random Forest model generates variable importance and identifies the most important variables. However, this model does not tell us:

  • the type of relationship, and
  • the significance of the relationship of a predictor variable with respect to provider count.

To understand the type and significance of the relationships of a predictor variable with respect to provider count at the county level, we will explore the Local Bivariate Relationships (LBR) model.

Local Bivariate Relationships analyzes two variables for statistically significant relationships using local entropy. The output can be used to visualize the type of relationship between two variables and explore how their relationship changes across counties.

We will use Local Bivariate Relationships to study how the relationships of the important variables identified by the Random Forest model change with respect to Provider Count across various counties. Learn more about Local Bivariate Relationships.

Two of the top five most important variables identified by the Forest Based CART model are Graduate Degree and White Population. Let's build LBR models for Provider Count with respect to these variables.

Build Model - Provider Count and Graduate Degree

To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders. The output of the model will be stored in a file geodatabase which can then be accessed to visualize the results.

In [27]:
from arcpy.stats import LocalBivariateRelationships

# Workspace
arcpy.env.workspace

# Get url of layer
lyr_url = allprovider_layer.url

# Define model variables
in_features = lyr_url
dependent_variable = 'provider_count'
explanatory_variable = 'grad_deg'
output_features = 'allprovider_grad_deg_LBR_Results1'
number_of_neighbors = 50
number_of_permutations = 99
enable_local_scatterplot_popups = 'NO_POPUP'
level_of_confidence = '95%'
apply_false_discovery_rate_fdr_correction = 'APPLY_FDR'
scaling_factor = 0.5

# Run Model
LocalBivariateRelationships(in_features, dependent_variable,explanatory_variable,output_features,number_of_neighbors,
                           number_of_permutations, enable_local_scatterplot_popups, level_of_confidence, 
                            apply_false_discovery_rate_fdr_correction, scaling_factor)
Out[27]:

Output

./ArcGIS\home\Results.gdb\allprovider_grad_deg_LBR_Results1

Messages

Start Time: Tuesday, August 18, 2020 11:44:27 AM

------------ Categorical Summary -------------
----------------------------------------------
Description # of features % of features
Positive Linear 1173 37.37
Negative Linear 0 0.00
Concave 881 28.07
Convex 1085 34.57
Undefined Complex 0 0.00
Not Significant 0 0.00
Total 3139 100.00
----------------------------------------------


--------- Entropy Results Summary ----------
--------------------------------------------
Description Min Max Mean Median
Entropy 0.0518 0.8344 0.2526 0.2220
p-value 0.0100 0.0100 0.0100 0.0100
--------------------------------------------


--------------- FDR Comparison ---------------
----------------------------------------------
Description # significant % significant
Without FDR 3139 100.00
FDR 3139 100.00
----------------------------------------------

Succeeded at Tuesday, August 18, 2020 11:44:57 AM (Elapsed Time: 29.96 seconds)

Access Model Results as a Spatially Enabled DataFrame

In [28]:
# Access LBR results from the local file geodatabase
lbr_graddeg = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
lbr_graddeg.head()
Out[28]:
OBJECTID SOURCE_ID provider_count grad_deg ENTROPY PVALUES LBR_SIG INTERCEPT COEF_1 PINTERCEPT P_COEF_1 P_COEF_2 AICC R2 P_AICc P_R2 SIG_COEF P_SIG_COEF LBR_TYPE SHAPE
0 1 1 1044.0 8931 0.502901 0.01 99% Confidence 0.002812 0.458814 -0.001986 0.796103 -1.276968 -247.674353 0.731508 -256.013530 0.778649 _* _** Concave {"rings": [[[-8529642.2486, 4688391.039899997]...
1 2 2 5.0 45 0.152911 0.01 99% Confidence 0.000017 1.115497 0.000651 0.470873 31.338637 -446.296708 0.802377 -449.717733 0.820235 _* _** Convex {"rings": [[[-10120538.0346, 3896326.054200001...
2 3 3 621.0 2153 0.352790 0.01 99% Confidence -0.001671 1.237995 0.000260 0.938741 2.075127 -329.879103 0.967074 -333.452877 0.970141 _* _** Convex {"rings": [[[-9221715.5229, 5050074.560699999]...
3 4 4 79.0 574 0.291545 0.01 99% Confidence 0.000094 0.603532 0.000494 0.531451 0.468285 -446.685913 0.985494 -449.537319 0.986654 _* _** Positive Linear {"rings": [[[-9290699.0981, 4093591.7991999984...
4 5 5 1210.0 5462 0.312456 0.01 99% Confidence 0.000090 0.603811 0.000518 0.535677 0.441553 -434.274636 0.981340 -435.557616 0.982284 _* _** Positive Linear {"rings": [[[-9341355.0324, 3994505.8356000036...
In [29]:
lbr_graddeg.shape
Out[29]:
(3139, 20)
In [30]:
lbr_graddeg.LBR_TYPE.value_counts()
Out[30]:
Positive Linear    1173
Convex             1085
Concave             881
Name: LBR_TYPE, dtype: int64

Plot Results on Map

In [34]:
# Create Map
lbr_graddeg_map = gis.map('USA', 4)
lbr_graddeg_map
Out[34]:

The map shows how the relationship between provider count and graduate degree holders varies across different counties. We can see:

  • The majority of counties from North Dakota to Texas share a Convex or Positive Linear relationship. This means that provider count increases with an increasing number of graduate degree holders in these counties, either linearly or polynomially.
  • The majority of the East and West coasts share a Concave relationship, which means that provider count increases with graduate degree holders up to a certain point and then plateaus.
In [32]:
lbr_graddeg_map.remove_layers()
lbr_graddeg.spatial.plot(map_widget=lbr_graddeg_map,
        renderer_type='u',  # for unique value renderer
        col='LBR_TYPE',  # categorical column to symbolize
        cmap='prism',  # color map to pick colors from for each class
        alpha=0.7,  # specify opacity
        line_width=0.05)  # outline width
Out[32]:
True
In [33]:
lbr_graddeg_map.legend = True

Build Model - Provider Count and White Population

In [35]:
from arcpy.stats import LocalBivariateRelationships

# Workspace
arcpy.env.workspace

# Get url of layer
lyr_url = allprovider_layer.url

# Define model variables
in_features = lyr_url
dependent_variable = 'provider_count'
explanatory_variable = 'white_pop'
output_features = 'allprovider_white_pop_LBR_Results1'
number_of_neighbors = 50
number_of_permutations = 99
enable_local_scatterplot_popups = 'NO_POPUP'
level_of_confidence = '95%'
apply_false_discovery_rate_fdr_correction = 'APPLY_FDR'
scaling_factor = 0.5

# Run Model
LocalBivariateRelationships(in_features, dependent_variable,explanatory_variable,output_features,number_of_neighbors,
                           number_of_permutations, enable_local_scatterplot_popups, level_of_confidence, 
                            apply_false_discovery_rate_fdr_correction, scaling_factor)
Out[35]:

Output

./ArcGIS\home\Results.gdb\allprovider_white_pop_LBR_Results1

Messages

Start Time: Tuesday, August 18, 2020 12:21:32 PM

------------ Categorical Summary -------------
----------------------------------------------
Description # of features % of features
Positive Linear 1047 33.35
Negative Linear 0 0.00
Concave 154 4.91
Convex 1938 61.74
Undefined Complex 0 0.00
Not Significant 0 0.00
Total 3139 100.00
----------------------------------------------


--------- Entropy Results Summary ----------
--------------------------------------------
Description Min Max Mean Median
Entropy 0.0721 0.8752 0.2771 0.2497
p-value 0.0100 0.0300 0.0100 0.0100
--------------------------------------------


--------------- FDR Comparison ---------------
----------------------------------------------
Description # significant % significant
Without FDR 3139 100.00
FDR 3139 100.00
----------------------------------------------

Succeeded at Tuesday, August 18, 2020 12:22:01 PM (Elapsed Time: 29.71 seconds)

Access Model Results as a Spatially Enabled DataFrame

In [36]:
# Access LBR results from the local file geodatabase
lbr_whitepop = pd.DataFrame.spatial.from_featureclass(os.path.join(arcpy.env.workspace,output_features))
lbr_whitepop.head()
Out[36]:
OBJECTID SOURCE_ID provider_count white_pop ENTROPY PVALUES LBR_SIG INTERCEPT COEF_1 PINTERCEPT P_COEF_1 P_COEF_2 AICC R2 P_AICc P_R2 SIG_COEF P_SIG_COEF LBR_TYPE SHAPE
0 1 1 1044.0 75625 0.462784 0.01 95% Confidence -0.000806 0.964163 -0.005220 1.439742 -4.516278 -222.974441 0.559980 -222.721608 0.569225 _* _*_ Positive Linear {"rings": [[[-8529642.2486, 4688391.039899997]...
1 2 2 5.0 494 0.172985 0.01 95% Confidence -0.000454 0.991115 -0.002512 2.194106 -60.014844 -401.267715 0.513645 -407.081515 0.578266 _* *** Concave {"rings": [[[-10120538.0346, 3896326.054200001...
2 3 3 621.0 51908 0.396061 0.01 95% Confidence -0.010495 1.138899 -0.004588 0.700722 2.639104 -304.416195 0.945209 -314.253990 0.956163 ** *** Convex {"rings": [[[-9221715.5229, 5050074.560699999]...
3 4 4 79.0 16894 0.324165 0.01 95% Confidence -0.003290 0.899859 -0.000728 0.483160 5.162043 -304.611806 0.751356 -305.247346 0.760867 ** _** Positive Linear {"rings": [[[-9290699.0981, 4093591.7991999984...
4 5 5 1210.0 31300 0.347374 0.01 95% Confidence -0.003391 0.903693 -0.000686 0.501848 4.935998 -302.769404 0.741085 -303.106776 0.749499 ** _*_ Positive Linear {"rings": [[[-9341355.0324, 3994505.8356000036...
In [37]:
lbr_whitepop.shape
Out[37]:
(3139, 20)
In [38]:
# Counts of different relationship types
lbr_whitepop.LBR_TYPE.value_counts()
Out[38]:
Convex             1938
Positive Linear    1047
Concave             154
Name: LBR_TYPE, dtype: int64

Plot Results on Map

In [43]:
# Create Map
lbr_whitepop_map = gis.map('USA', 4)
lbr_whitepop_map
Out[43]:

The map shows how the relationship between provider count and white population varies across different counties. We can see:

  • For most counties in the United States, the relationship is either Convex or Positive Linear, which means that provider count increases with an increasing white population, either linearly or polynomially.
  • Southern parts of New Mexico and some counties in north Texas, Wyoming, and Idaho have a Concave relationship. This means that provider count increases up to a certain point and then plateaus.
In [41]:
lbr_whitepop_map.remove_layers()
lbr_whitepop.spatial.plot(map_widget=lbr_whitepop_map,
        renderer_type='u',  # for unique value renderer
        col='LBR_TYPE',  # categorical column to symbolize
        cmap='prism',  # color map to pick colors from for each class
        alpha=0.7,  # specify opacity
        line_width=0.05)  # outline width
Out[41]:
True
In [42]:
lbr_whitepop_map.legend = True
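
As a quick comparison of the two LBR runs, the relationship types assigned to each county can be cross-tabulated. This minimal sketch assumes both result DataFrames come from the same source layer and therefore share the same row order:

In [ ]:
# Compare relationship types assigned by the two LBR models for each county
pd.crosstab(lbr_graddeg['LBR_TYPE'], lbr_whitepop['LBR_TYPE'])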

Summary

To summarize, we:

  1. Started with a Base (OLS) Regression model using all predictors:
    • Performed feature selection using the Lasso regularization technique.
    • Created a Global (OLS) Regression model using the selected features. The Global model helped us identify relationships between the predictor variables and provider count at the country level.
  2. Next, we created a Geographically Weighted Regression (GWR) model to understand how different variables impact provider count across different counties.
    • Here we plotted how provider count varies with average prescription drug prices and average household size across different counties.
  3. To investigate non-linear relationships, we explored a non-parametric technique called the Forest Based Classification and Regression Trees model.
    • The model returned the variable importance of each variable.
  4. To understand the type and significance of the relationship of an individual predictor with respect to provider count, we created a Local Bivariate Relationships model.

Conclusion

Studies [10] [11] have shown that shortages of health providers limit patients' access to required treatment. In addition to provider supply, sociodemographic factors such as age, race, income and health insurance, are associated with access to care and health outcomes [12] [13].

Public health stakeholders use numerous datasets to derive insights into the prevalence of healthcare providers and the impact of social determinants of health within their geographic service areas. The volume of available data can make it difficult for these stakeholders to determine how to best reach underserved populations [14].

Our analysis integrates publicly available sociodemographic and provider location datasets [15] and uses advanced data analysis, visualization and machine learning techniques to show the variation of provider supply over different geographical areas along with a summary of the most distinctive and relevant sociodemographic features that are predictive of the variation in the selected area.

By better understanding these social determinants of health in their geographies, both government and non-governmental healthcare leaders can prioritize interventions around addressing underserved areas with healthcare provider shortages.

The analysis is designed to be easily extended with additional features or adapted to other provider specialties. The techniques defined in this analysis can support further data integrations including Medicare, Medicaid, and commercial insurance claims data to better understand actual patient utilization behavior and improve models.

To conclude, in this 4-part study we use Dask, ArcGIS API for Python, ArcGIS WebGIS stack, various spatial analysis tools and machine learning techniques to:

  1. Read, clean and geocode ~ 5.7 million records of healthcare provider data.
  2. Identify shortage areas for Overall, Mental health and OB-GYN healthcare providers in the U.S.
  3. Identify sociodemographic and economic factors that influence access to providers.
  4. Identify the influence of factors on access to providers.

References

[1] https://en.wikipedia.org/wiki/Ordinary_least_squares
[2] https://online.stat.psu.edu/stat501/lesson/7/7.3
[3] https://sites.google.com/site/statisticsforspreadsheets/regression-modeling/influence-plot
[4] https://online.stat.psu.edu/stat501/lesson/10/10.2
[5] https://bookdown.org/max/FES/recursive-feature-elimination.html
[6] https://towardsdatascience.com/ridge-lasso-and-elastic-net-regularization-7861ca575c64
[7] https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f
[8] Feng C, Wang H, Lu N, et al. Log-transformation and its implications for data analysis [published correction appears in Gen Psychiatr. 2019 Sep 6;32(5):e100146corr1]. Shanghai Arch Psychiatry. 2014;26(2):105-109. doi:10.3969/j.issn.1002-0829.2014.02.009 [Link]
[9] https://www.cscu.cornell.edu/news/statnews/stnews83.pdf
[10] Institute of Medicine (US) National Cancer Policy Forum. Ensuring Quality Cancer Care through the Oncology Workforce: Sustaining Care in the 21st Century: Workshop Summary. Washington (DC): National Academies Press (US); 2009. Supply and Demand in the Health Care Workforce. Available from: https://www.ncbi.nlm.nih.gov/books/NBK215247/
[11] Kullgren J, McLaughlin C, Mitra N, Armstrong K. Nonfinancial barriers and access to care for U.S. adults. Health Serv Res. 2012;47(1):462–485. doi: 10.1111/j.1475-6773.2011.01308.x. [PubMed]
[12] Sorkin, D.H., Ngo-Metzger, Q. & De Alba, I. Racial/Ethnic Discrimination in Health Care: Impact on Perceived Quality of Care. J GEN INTERN MED 25, 390–396 (2010). https://doi.org/10.1007/s11606-010-1257-5
[13] Reiners F, Sturm J, Bouw LJW, Wouters EJM. Sociodemographic Factors Influencing the Use of eHealth in People with Chronic Diseases. Int J Environ Res Public Health. 2019;16(4):645. Published 2019 Feb 21. doi:10.3390/ijerph16040645
[14] Community Health Needs Assessment PeaceHealth (2016). Sacred Heart Medical Center. https://www.peacehealth.org/about-peacehealth/Pages/community-health-needs-assessment
[15] https://www.cms.gov/Regulations-and-Guidance/Administrative-Simplification/NationalProvIdentStand/DataDissemination.html