Goal: Create a naive framework for building linear and logistic regression models with $l1$ (lasso), $l2$ (ridge), and combined $l1$ and $l2$ (elastic net) penalties.
Overfitting is a phenomenon that occurs when a model fits the training data very well but fails to generalize to test data or other datasets. Overfit models are characterized by high variance (high uncertainty) and low bias.
Linear models can overfit if irrelevant or highly correlated features are included, because the model learns a coefficient for every feature regardless of whether it carries signal or noise. This is especially true when p (the number of features) is close to n (the number of observations). Large estimated coefficients are also a sign of overfitting: the larger the absolute value of a coefficient, the more power it has to change the predicted response, resulting in higher variance.
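To see this concretely, here is a minimal sketch (separate from the framework built below) that fits an ordinary least-squares model on pure noise with p close to n; the training $R^2$ looks excellent while the test $R^2$ collapses. The synthetic data and variable names are purely illustrative.
# Illustrative only: overfitting when p (features) is close to n (observations)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
n_obs, p_feat = 60, 50
X_noise = rng.randn(n_obs, p_feat)        # features are pure noise
y_noise = rng.randn(n_obs)                # target is unrelated to the features

X_tr, X_te = X_noise[:40], X_noise[40:]
y_tr, y_te = y_noise[:40], y_noise[40:]
ols = LinearRegression().fit(X_tr, y_tr)
print('Train R2:', r2_score(y_tr, ols.predict(X_tr)))   # near 1: the model memorizes the noise
print('Test R2:', r2_score(y_te, ols.predict(X_te)))    # typically far below 0: fails to generalize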
Regularization is the process of penalizing model complexity to prevent overfitting. Heuristically, 'large' is interpreted as 'complex': the idea is to jointly penalize large coefficient values ($\theta$), thereby reducing model complexity. With regularization, we add a penalty for complexity to the cost. For example, the cost function for linear regression with a penalty can be written as:
$J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}+ \lambda\ Penalty (\theta_{1}, \ldots, \theta_{p})\right)$
$\lambda$ is a constant that balances the tradeoff between the lack of fit, measured by the model's error, and the complexity, measured by the penalty, which depends on the regression coefficients. The bigger $\lambda$ is, the bigger the penalty for model complexity. Choosing $\lambda$ well balances two goals (a small numeric sketch follows this list):
- fitting the training data well, and
- keeping the model parameters small
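As a quick illustration of the penalty term (a sketch with made-up coefficient values, not part of the framework below), the penalty grows with both the size of the coefficients and the value of $\lambda$:
# Illustrative only: how the penalty term grows with lambda for fixed coefficients
import numpy as np

theta = np.array([0.5, -2.0, 3.0])        # made-up coefficients (excluding the bias)
for lambd in [0.1, 1, 10]:
    l1_penalty = lambd * np.sum(np.abs(theta))        # lasso penalty
    l2_penalty = lambd * np.sum(np.square(theta))     # ridge penalty
    print(f'lambda={lambd}: l1 penalty={l1_penalty:.2f}, l2 penalty={l2_penalty:.2f}')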
The cost function gives an idea of how far the predicted values are from the actual values. The cost functions for the different penalty terms are:
REGRESSION:
$J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}+\lambda\, Penalty\left(\theta_{1}, \ldots, \theta_{p}\right)\right)$
CLASSIFICATION:
$J(\theta)=-\frac{1}{m}\left(\sum_{i=1}^{m}\left[y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]-\lambda\, Penalty\left(\theta_{1}, \ldots, \theta_{p}\right)\right)$
The penalty term depends on the regularization type:
- Lasso ($l1$): $Penalty=\sum_{j=1}^{p}\left|\theta_{j}\right|$
- Ridge ($l2$): $Penalty=\sum_{j=1}^{p} \theta_{j}^{2}$
- Elastic net ($l1$ and $l2$): $\lambda\, Penalty=\lambda_{1} \sum_{j=1}^{p}\left|\theta_{j}\right|+\lambda_{2} \sum_{j=1}^{p} \theta_{j}^{2}$ (with $\lambda_{1}$ and $\lambda_{2}$ as defined below)
where:
$ m = no.\ of\ records$
$ p = no.\ of\ features$
$ y_{i} = true\ values $
$Features:\ x=\left\{x_{0}, x_{1}, x_{2}, \ldots x_{p}\right\}$
$Parameters:\ \theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{p}\right\}$
$i = 1,...,m$
$j = 1,...,p$
Hypothesis:
- Linear regression: $h_{\theta}(x)=\theta^{T} x=\theta_{0}+\theta_{1} x_{1}+\ldots+\theta_{p} x_{p}$
- Logistic regression: $h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T} x}}$
$\lambda_{1}=\lambda \cdot\left(l_{1}\ ratio\right)$
$\lambda_{2}=\lambda \cdot\left(1-l_{1}\ ratio\right)$
$l_{1}$ ratio = a number between 0 and 1 that controls the emphasis on the lasso vs. the ridge penalty
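A small sketch (again with made-up values) showing that the elastic net penalty used below is simply the $\lambda_{1}$-weighted lasso penalty plus the $\lambda_{2}$-weighted ridge penalty:
# Illustrative only: elastic net penalty = lambda1 * l1 penalty + lambda2 * l2 penalty
import numpy as np

theta = np.array([0.5, -2.0, 3.0])
lambd, l1_ratio = 10, 0.6
lambda1 = lambd * l1_ratio
lambda2 = lambd * (1 - l1_ratio)
elastic = lambd * (l1_ratio * np.sum(np.abs(theta)) + (1 - l1_ratio) * np.sum(np.square(theta)))
decomposed = lambda1 * np.sum(np.abs(theta)) + lambda2 * np.sum(np.square(theta))
print(elastic, decomposed)   # both print the same penalty value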
Batch Gradient Descent is an optimization technique that seeks to find the model parameters (coefficients and bias) that minimize a cost function. In this technique, we use all of the training data to compute the gradient of the cost function and then update the parameters.
Gradients can be calculated by taking the partial derivative of the cost function with respect to each parameter $\theta_{j}$. The gradients for the various models with penalties can be written as follows (a short numeric sketch follows the definitions below):
Bias: $\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)$
Note: the update rule for $\theta_{0}$ (the bias) is the same for all models.
Weights:
- No penalty: $\theta_{j}:=\theta_{j}-\alpha\, \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}$
- Lasso ($l1$): $\theta_{j}:=\theta_{j}-\alpha\left(\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}+\frac{\lambda}{m} \frac{\partial\left|\theta_{j}\right|}{\partial \theta_{j}}\right)$
- Ridge ($l2$): $\theta_{j}:=\theta_{j}-\alpha\left(\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}+\frac{\lambda}{m} \theta_{j}\right)$
- Elastic net: $\theta_{j}:=\theta_{j}-\alpha\left(\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}+\frac{\lambda_{1}}{m} \frac{\partial\left|\theta_{j}\right|}{\partial \theta_{j}}+\frac{\lambda_{2}}{m} \theta_{j}\right)$
where:
$\frac{\partial|\omega|}{\partial \omega}=\left\{\begin{array}{rl}1 & \omega>0 \\ 0 & \omega=0 \\ -1 & \omega<0\end{array}\right.$
$ m = no.\ of\ records$
$ p = no.\ of\ features$
$i = 1,...,m$
$j = 1,...,p$
$ y_{i} = true\ values $
$h_{\theta}(x) = hypothesis$
$\alpha = learning\ rate$
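Before wrapping these update rules in classes, here is a minimal sketch of a single batch gradient descent step with a ridge penalty, using the same formulas as above. The tiny dataset and starting values are made up purely for illustration.
# Illustrative only: one batch gradient descent update with a ridge penalty
import numpy as np

X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])   # m = 3 records, p = 2 features
y_demo = np.array([3.0, 2.0, 4.0])                        # true values
w_demo = np.array([0.1, 0.1])                             # theta_1 ... theta_p
b_demo = 0.0                                              # theta_0 (bias)
alpha_demo, lambd_demo = 0.01, 1.0                        # learning rate and penalty strength
m_demo = X_demo.shape[0]

y_pred = b_demo + X_demo.dot(w_demo)                      # hypothesis h_theta(x)
D_b = (1/m_demo) * np.sum(y_pred - y_demo)                # gradient for the bias
D_w = (1/m_demo) * (y_pred - y_demo).dot(X_demo) + (lambd_demo/m_demo) * w_demo   # weight gradient + ridge term
b_demo -= alpha_demo * D_b
w_demo -= alpha_demo * D_w
print(b_demo, w_demo)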
In this section, we will define the classes that build, fit and predict linear and logistic regression models with $l1$ (lasso), $l2$ (ridge), and combined $l1$ and $l2$ (elastic net) penalty terms. We will use batch gradient descent to fit the models and generate the feature weights; a quick sanity check on synthetic data follows the class definitions. These classes will then be used to build various regression and classification models with penalty terms in later sections.
We will also define two functions to help us better visualize the performance of the various models and their results. The functions will:
- plot the loss per epoch for models trained with different learning rates, and
- plot the model's estimated feature weights for different $\lambda$ (penalty) values.
The idea is to build three classes:
1. Base_Regression, the parent class, with the following methods:
   - _regularization: computes the penalty cost based on penalty type (l1, l2, l1 and l2)
   - _gradient: computes the gradient depending on penalty type
   - fit: fits a regression model using batch gradient descent
   - _approximation: computes the hypothesis $h_{\theta}(x)$, i.e. approximate values of y given the training data
   - _cost: computes the model cost with regularization
   - predict: makes predictions based on the model weights/coefficients
2. Linear_Regression, which inherits from Base_Regression and overrides the following methods:
   - _approximation: computes the hypothesis $h_{\theta}(x)$ for linear regression
   - _cost: computes the model cost with regularization for linear regression
   - _predict: makes predictions based on the model weights/coefficients
3. Logistic_Regression, which inherits from Base_Regression and overrides the following methods:
   - _approximation: computes the hypothesis $h_{\theta}(x)$ for logistic regression
   - _cost: computes the model cost with regularization for logistic regression
   - _predict: makes predictions based on the model weights/coefficients
   - _sigmoid: new method to compute the sigmoid for the hypothesis
class Base_Regression:
    '''
    Models the relationship between a dependent variable y and independent variables X
    '''
    def __init__(self):
        self.b = None
        self.w = None
    def _regularization(self, lambd, w, reg_type, l1_ratio):
        '''
        Computes the penalty cost depending on penalty type:
        1. For lasso, we compute the sum of the absolute values of the regression coefficients (the l1 norm),
           also written as: np.linalg.norm(self.w, ord=1)
        2. For ridge, we compute the sum of squared values of the regression coefficients
        3. Elastic net is a combination of lasso and ridge
        '''
        if reg_type == 'lasso':
            return lambd * np.sum(np.abs(self.w))
        elif reg_type == 'ridge':
            return lambd * np.sum(np.square(self.w))
        elif reg_type == 'elastic net':
            l1 = l1_ratio * np.sum(np.abs(self.w))
            l2 = (1 - l1_ratio) * np.sum(np.square(self.w))
            return lambd * (l1 + l2)
        else:
            return 0
    def _gradient(self, n, y, y_pred, w, X, lambd, reg_type, l1_ratio):
        '''
        Computes the gradient depending on penalty type:
        1. For lasso, gradient is (lambd/n_obs) * np.sign(Q)
        2. For ridge, gradient is (lambd/n_obs) * Q (theta)
        3. For elastic net, gradient is l1_ratio*gradient(lasso) + (1-l1_ratio)*gradient(ridge)
        '''
        D_b = (1/n) * np.sum(y_pred - y)
        D_w = (1/n) * np.dot(y_pred - y, X)
        if reg_type == 'lasso':
            D_w += (lambd/n) * np.sign(self.w)
        elif reg_type == 'ridge':
            D_w += (lambd/n) * self.w
        elif reg_type == 'elastic net':
            l1 = l1_ratio * np.sign(self.w)
            l2 = (1 - l1_ratio) * self.w
            D_w += (lambd/n) * (l1 + l2)
        return D_b, D_w
    def fit(self, X, y, learning_rate=0.01, epochs=1000, verbose=False, lambd=0, reg_type=None, l1_ratio=0.5):
        '''
        fits a regression model using batch gradient descent
        '''
        n_obs, m_features = X.shape
        total_loss = []
        # Initialize weights and bias
        self.b = np.random.rand(1)
        self.w = np.random.rand(m_features)
        for e in range(epochs):
            # Compute loss
            y_pred = self._approximation(self.w, self.b, X)
            loss = self._cost(n_obs, y_pred, y, self.w, lambd, reg_type, l1_ratio)
            total_loss.append(loss)
            # Calculate gradients
            D_b, D_w = self._gradient(n_obs, y, y_pred, self.w, X, lambd, reg_type, l1_ratio)
            # Update parameters
            self.b -= learning_rate * D_b
            self.w -= learning_rate * D_w
            if verbose and e % 1000 == 0:
                print(f'Epoch: {e}, Loss: {loss}')
        return self.b, self.w, total_loss
    def _approximation(self, w, b, X):
        raise NotImplementedError()

    def _cost(self, n_obs, y_pred, y, w, lambd, reg_type, l1_ratio):
        raise NotImplementedError()

    def predict(self, X, threshold=0.5):
        # pass the threshold through so that Logistic_Regression can use it
        return self._predict(X, self.w, self.b, threshold)

    def _predict(self, X, w, b, threshold=0.5):
        raise NotImplementedError()
class Linear_Regression(Base_Regression):
    def _approximation(self, w, b, X):
        '''
        Computes the hypothesis h_theta(x): y_pred = b + X.w
        '''
        return self.b + np.dot(self.w, X.T)

    def _cost(self, n_obs, y_pred, y, w, lambd, reg_type, l1_ratio):
        '''
        Computes cost with regularization:
        (1/n_obs) * (sum((y - y_pred)**2) + regularization cost)
        '''
        cost = (1/n_obs) * (np.sum((y - y_pred)**2) +
                            self._regularization(lambd, self.w, reg_type, l1_ratio))
        return cost

    def _predict(self, X, w, b, threshold=0.5):
        '''
        Make predictions based on model weights/coefficients (threshold is unused for regression)
        '''
        return self.b + np.dot(X, self.w)
class Logistic_Regression(Base_Regression):
    def _approximation(self, w, b, X):
        Qt_x = self.b + np.dot(self.w, X.T)
        return self._sigmoid(Qt_x)

    def _sigmoid(self, z):
        return 1/(1 + np.exp(-z))

    def _cost(self, n_obs, y_pred, y, w, lambd, reg_type, l1_ratio):
        '''
        a really small value 'epsilon' is added to avoid a divide-by-zero
        error when taking the log of zero
        loss = (-1/m) * (sum(y * log(hQ_x) + (1-y) * log(1 - hQ_x)) - regularization cost)
        where hQ_x = 1/(1 + e^(-Q_t*x))
        '''
        eps = 1e-5
        cost = (-1/n_obs) * (np.sum(y * np.log(y_pred+eps) + (1-y) * np.log(1-y_pred+eps)) -
                             self._regularization(lambd, self.w, reg_type, l1_ratio))
        return cost

    def _predict(self, X, w, b, threshold=0.5):
        y_predicted = self._approximation(self.w, self.b, X)
        y_predicted_cls = [1 if i >= threshold else 0 for i in y_predicted]
        return np.array(y_predicted_cls)
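As a quick sanity check of the framework (a sketch on synthetic data, separate from the datasets used below), Linear_Regression should approximately recover the coefficients of a known linear relationship:
# Illustrative only: sanity-check Linear_Regression on synthetic data with known weights
import numpy as np

rng = np.random.RandomState(42)
X_check = rng.randn(200, 3)                              # 200 records, 3 standardized features
true_w = np.array([2.0, -1.0, 0.5])
y_check = 4.0 + X_check.dot(true_w) + 0.1 * rng.randn(200)

check_model = Linear_Regression()
b_check, w_check, _ = check_model.fit(X_check, y_check, learning_rate=0.1, epochs=2000)
print('bias (expected ~ 4):', b_check)
print('weights (expected ~ [2, -1, 0.5]):', w_check)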
The learning rate is extremely important for finding the best model. We will define a function that iterates over various learning rates, builds a model with each one, and then plots the loss produced by the different models.
def loss_per_epoch_plot(epochs, model, learning_rate):
    """
    Pass in the no. of epochs, the model class and a list of learning rates,
    fit a model on the (global) x_train/y_train for each learning rate
    and plot the loss per epoch for the resulting models
    """
    fig = plt.figure(figsize=(15, 8))
    losses = {}
    e = epochs
    for lr in learning_rate:
        try:
            b, w, epoch_loss = model.fit(x_train, y_train,
                                         learning_rate=lr, epochs=e)
            losses[f'LR={str(lr)}'] = epoch_loss
        except Exception:
            # skip learning rates for which training fails (e.g. the loss overflows)
            continue
    # Plot loss
    xs = np.arange(e)
    for k, v in losses.items():
        plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")
    plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
    plt.xlabel('Epochs', size=14)
    plt.ylabel('Loss', size=14)
    plt.legend(fontsize=12)
    return fig
We will define a function to help us visualize the model's estimated weight for different lambda values.
# Define a function to plot weights and lambda values
def weight_versus_lambda_plot(weight, lambd, features):
    """
    Pass in the estimated weight, the lambda value and the names
    for the features and plot the model's estimated weight
    for different lambda values
    """
    fig = plt.figure(figsize=(10, 8))
    # ensure that the weight is an array
    weight = np.array(weight)
    for col in range(weight.shape[1]):
        plt.plot(lambd, weight[:, col], label=features[col])
    plt.axhline(0, color='black', linestyle='--', linewidth=3)
    # manually specify the coordinate of the legend
    plt.legend(bbox_to_anchor=(1.3, 0.9))
    plt.ylabel('Feature weight')
    plt.xlabel('Lambda')
    return fig
For regression, we will use the Boston Housing Prices toy dataset available in the scikit-learn library. We will build and evaluate the following models:
- Linear Regression (no regularization)
- Ridge Regression ($l2$ penalty)
- Lasso Regression ($l1$ penalty)
- Elastic Net Regression ($l1$ and $l2$ penalties)
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
from sklearn.datasets import load_boston
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version
boston_data = pd.DataFrame(load_boston().data, columns=load_boston().feature_names)
boston_data.head()
# Create dependent and independent variables
data = load_boston()
X = data.data
Y = data.target
features = data.feature_names
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=123)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
# use the scaler fitted on the training data; do not refit on the test set
x_test = scaler.transform(x_test)
We will train a regression model without regularization for different learning rates. Next, we will build a model using a selected learning rate and evaluate its performance on the test set.
The learning rate is extremely important for finding the best model. Here, we iterate over various learning rates, build a model with each one, and then plot the loss produced by the different models.
# Build and fit LR model for different learning rates
epochs = 10000
model = Linear_Regression()
learning_rate = [0.1, 0.01, 0.001, 0.0001, 0.00001]
mlr_fig = loss_per_epoch_plot(epochs, model, learning_rate)
From the plot above, we can see the loss initially decrease significantly and then stabilize as the epochs increase, for the different learning rates $\alpha$. If the learning rate is too high, the model may not converge. On the other hand, if it is too low, the model may take longer to converge. Let's use a learning rate of 0.001 to build and fit a model and then evaluate its performance.
We will build the model using this learning rate and evaluate its performance on the test data.
# Build and fit best LR model
alpha = 0.001
e = 10000
# Build model
mlr_model = Linear_Regression()
# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, verbose=True)
# Create predictions on test set
mlr_test_pred = mlr_model.predict(x_test)
# Calculate metrics
mlr_test_r2 = r2_score(y_test, mlr_test_pred)
mlr_test_mse = mean_squared_error(y_test, mlr_test_pred)
mlr_test_rmse = np.sqrt(mlr_test_mse)
print("Performance on test set:")
print(f'Test R2 is: {mlr_test_r2}')
print(f'Test MSE is: {mlr_test_mse}')
print(f'Test RMSE is: {mlr_test_rmse}')
We will train a regression model with the $l2$ penalty (ridge) for different values of $\lambda$ and visualize how the weights vary with $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on the test set.
# loop through different penalty scores (lambda) and obtain the estimated coefficients (weights)
lambd = [10, 100, 1000]
e = 10000
# stores the weights of each feature
weights = []
for l in lambd:
    b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=0.001,
                                     epochs=e, verbose=False, lambd=l,
                                     reg_type='ridge')
    weights.append(w)
# Plot weights and lambda
ridge_fig = weight_versus_lambda_plot(weights, lambd, features)
ridge_fig.suptitle('Ridge: Variation in feature weights as lambda grows', fontsize=16, y=0.93);
From the plot, we can see that feature weights continue to decrease and come close to zero as $\lambda$ increases. Weights remain close to zero but do not become zero.
Here we will build a model using a selected value of lambda and evaluate its performance on the test set.
# Build and fit model
alpha = 0.001
e = 10000
lambd = 1000
# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, verbose=True, lambd=lambd,
reg_type='ridge')
# Create predictions on test set
mlr_ridge_test_pred = mlr_model.predict(x_test)
# Calculate metrics
mlr_ridge_test_r2 = r2_score(y_test, mlr_ridge_test_pred)
mlr_ridge_test_mse = mean_squared_error(y_test, mlr_ridge_test_pred)
mlr_ridge_test_rmse = np.sqrt(mlr_ridge_test_mse)
print("Performance on test set:")
print(f'Test R2 is: {mlr_ridge_test_r2}')
print(f'Test MSE is: {mlr_ridge_test_mse}')
print(f'Test RMSE is: {mlr_ridge_test_rmse}')
We will train a regression model with the $l1$ penalty (lasso) for different values of $\lambda$ and visualize how the weights vary with $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on the test set.
# loop through different penalty scores (lambda) and obtain the estimated coefficients (weights)
lambd = [1, 10, 50, 100, 150, 200, 250]
# stores the weights of each feature
weights = []
for l in lambd:
    b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=0.01,
                                     epochs=e, verbose=False, lambd=l,
                                     reg_type='lasso')
    weights.append(w)
# Plot weights and lambda
lasso_fig = weight_versus_lambda_plot(weights, lambd, features)
lasso_fig.suptitle('Lasso: Variation in feature weights as lambda grows',
fontsize=16, y=0.93);
From the plot, we can see that feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero.
Here we will build a model using a selected value of lambda and evaluate its performance on the test set.
# Build and fit model
alpha = 0.001
e = 5000
lambd = 250
# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, verbose=True, lambd=lambd, reg_type='lasso')
# Create predictions on test set
mlr_lasso_test_pred = mlr_model.predict(x_test)
# Calculate metrics
mlr_lasso_test_r2 = r2_score(y_test, mlr_lasso_test_pred)
mlr_lasso_test_mse = mean_squared_error(y_test, mlr_lasso_test_pred)
mlr_lasso_test_rmse = np.sqrt(mlr_lasso_test_mse)
print("Performance on test set:")
print(f'Test R2 is: {mlr_lasso_test_r2}')
print(f'Test MSE is: {mlr_lasso_test_mse}')
print(f'Test RMSE is: {mlr_lasso_test_rmse}')
We will train a regression model with the combined $l1$ and $l2$ penalty (elastic net) for different values of $\lambda$ and visualize how the weights vary with $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on the test set.
# loop through different penalty scores (lambda) and obtain the estimated coefficients (weights)
lambd = [1, 10, 100]
# stores the weights of each feature
weights = []
for l in lambd:
    b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=0.01,
                                     epochs=e, verbose=False, lambd=l,
                                     reg_type='elastic net', l1_ratio=0.6)
    weights.append(w)
# Plot weights and lambda
elastic_fig = weight_versus_lambda_plot(weights, lambd, features)
elastic_fig.suptitle('Elastic Net: Variation in feature weights as lambda grows', fontsize=16, y=0.93);
From the plot, we can see that feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero.
Here we will build a model using a selected value of lambda and evaluate its performance on the test set.
# Build and fit model
alpha = 0.001
e = 10000
lambd = 100
# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, verbose=True, lambd=lambd,
reg_type='elastic net', l1_ratio=0.6)
# Create predictions on test set
mlr_elastic_test_pred = mlr_model.predict(x_test)
# Calculate metrics
mlr_elastic_test_r2 = r2_score(y_test, mlr_elastic_test_pred)
mlr_elastic_test_mse = mean_squared_error(y_test, mlr_elastic_test_pred)
mlr_elastic_test_rmse = np.sqrt(mlr_elastic_test_mse)
print("Performance on test set:")
print(f'Test R2 is: {mlr_elastic_test_r2}')
print(f'Test MSE is: {mlr_elastic_test_mse}')
print(f'Test RMSE is: {mlr_elastic_test_rmse}')
For classification, we will use the Breast Cancer toy dataset available in the scikit-learn library. We will build and evaluate the following logistic regression models:
- Logistic Regression (no regularization)
- Ridge ($l2$ penalty)
- Lasso ($l1$ penalty)
- Elastic Net ($l1$ and $l2$ penalties)
# Basic Imports
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()
# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')
X = load_breast_cancer().data[:, :10]
display(f'Shape of X: {X.shape}')
# keep only the names of the 10 features used above
features = load_breast_cancer().feature_names[:10]
# Train test split
from sklearn.model_selection import train_test_split
# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=1234)
display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
# use the scaler fitted on the training data; do not refit on the test set
x_test = scaler.transform(x_test)
We will train a logistic regression model without regularization for different learning rates. Next, we will build a model using a selected learning rate and evaluate its performance on the test set.
The learning rate is extremely important for finding the best model. Here, we iterate over various learning rates, build a model with each one, and then plot the loss produced by the different models.
# Build and fit LR model for different learning rates
epochs = 5000
model = Logistic_Regression()
learning_rate = [0.1, 0.01, 0.001]
log_fig = loss_per_epoch_plot(epochs, model, learning_rate)
From the plot above, we can see the loss initially decrease significantly and then stabilize as the epochs increase, for the different learning rates $\alpha$. If the learning rate is too high, the model may not converge. On the other hand, if it is too low, the model may take longer to converge. Let's use a learning rate of 0.01 to build and fit a model and then evaluate its performance.
We will build the model using this learning rate and evaluate its performance on the test data.
# Build and fit best LR model
alpha = 0.01
e = 5000
# Build model
log_model = Logistic_Regression()
# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, verbose=True)
# Create predictions on test set
log_test_pred = log_model.predict(x_test, 0.5)
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc
# Calculate accuracy
log_test_acc = accuracy(y_test, log_test_pred)
print(f'Accuracy on test set: {log_test_acc}')
# Calculate metrics
cm_test = confusion_matrix(y_test, log_test_pred)
test_report = classification_report(y_test, log_test_pred)
print("Performance on test set:\n")
print(f'Confusion Matrix:\n {cm_test}\n')
print(f'Classification Report:\n {test_report}')
We will train a logistic regression model with the $l2$ penalty (ridge) for different values of $\lambda$ and visualize how the weights vary with $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on the test set.
# loop through different penalty scores (lambda) and obtain the estimated coefficients (weights)
log_model = Logistic_Regression()
lambd = [10, 50, 100, 150]
e = 5000
# stores the weights of each feature
weights = []
for l in lambd:
    b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=0.01,
                                     epochs=e, verbose=False, lambd=l,
                                     reg_type='ridge')
    weights.append(w)
# Plot weights and lambda
ridge_fig = weight_versus_lambda_plot(weights, lambd, features)
ridge_fig.suptitle('Ridge: Variation in feature weights as lambda grows', fontsize=16, y=0.93);
From the plot, we can see that feature weights continue to decrease and come close to zero as $\lambda$ increases. Weights remain close to zero but do not become zero.
Here we will build a model using a selected value of lambda and evaluate its performance on the test set.
# Build and fit model
alpha = 0.01
e = 5000
lambd = 150
# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, verbose=True, lambd=lambd,
reg_type='ridge')
# Create predictions on test set
log_ridge_test_pred = log_model.predict(x_test, 0.5)
# Calculate accuracy
ridge_test_acc = accuracy(y_test, log_ridge_test_pred)
print(f'Accuracy on test set: {ridge_test_acc}')
# Calculate metrics
ridge_cm_test = confusion_matrix(y_test, log_ridge_test_pred)
ridge_test_report = classification_report(y_test, log_ridge_test_pred)
print("Performance on test set:\n")
print(f'Confusion Matrix:\n {ridge_cm_test}\n')
print(f'Classification Report:\n {ridge_test_report}')
We will train a logistic regression model with the $l1$ penalty (lasso) for different values of $\lambda$ and visualize how the weights vary with $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on the test set.
# loop through different penalty scores (lambda) and obtain the estimated coefficients (weights)
lambd = [0.1, 1, 10]
e = 5000
# stores the weights of each feature
weights = []
for l in lambd:
    b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=0.01,
                                     epochs=e, verbose=False, lambd=l,
                                     reg_type='lasso')
    weights.append(w)
# Plot weights and lambda
lasso_fig = weight_versus_lambda_plot(weights, lambd, features)
lasso_fig.suptitle('Lasso: Variation in feature weights as lambda grows', fontsize=16, y=0.93);
From the plot, we can see that most feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero. For a few features, the weights tend to increase once $\lambda > 1$.
Here we will build a model using a selected value of lambda and evaluate its performance on the test set.
# Build and fit model
alpha = 0.01
e = 5000
lambd = 10
# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, verbose=True, lambd=lambd,
reg_type='lasso')
# Create predictions on test set
log_lasso_test_pred = log_model.predict(x_test, 0.5)
# Calculate accuracy
lasso_test_acc = accuracy(y_test, log_lasso_test_pred)
print(f'Accuracy on test set: {lasso_test_acc}')
# Calculate metrics
lasso_cm_test = confusion_matrix(y_test, log_lasso_test_pred)
lasso_test_report = classification_report(y_test, log_lasso_test_pred)
print("Performance on test set:\n")
print(f'Confusion Matrix:\n {lasso_cm_test}\n')
print(f'Classification Report:\n {lasso_test_report}')
We will train a logistic regression model with the combined $l1$ and $l2$ penalty (elastic net) for different values of $\lambda$ and visualize how the weights vary with $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on the test set.
# loop through different penalty scores (lambda) and obtain the estimated coefficients (weights)
lambd = [1, 10, 50, 100]
e = 5000
# stores the weights of each feature
weights = []
for l in lambd:
    b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=0.01,
                                     epochs=e, verbose=False, lambd=l,
                                     reg_type='elastic net', l1_ratio=0.6)
    weights.append(w)
# Plot weights and lambda
elastic_fig = weight_versus_lambda_plot(weights, lambd, features)
elastic_fig.suptitle('Elastic Net: Variation in feature weights as lambda grows', fontsize=16, y=0.93);
From the plot, we can see that most feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero. For a few features, the weights tend to increase once $\lambda > 1$.
Here we will build a model using a selected value of lambda and evaluate its performance on the test set.
# Build and fit model
alpha = 0.01
e = 5000
lambd = 100
# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha,
epochs=e, verbose=True, lambd=lambd,
reg_type='elastic net', l1_ratio=0.6)
# Create predictions on test set
log_elastic_test_pred = log_model.predict(x_test, 0.5)
# Calculate accuracy
elastic_test_acc = accuracy(y_test, log_elastic_test_pred)
print(f'Accuracy on test set: {elastic_test_acc}')
# Calculate metrics
elastic_cm_test = confusion_matrix(y_test, log_elastic_test_pred)
elastic_test_report = classification_report(y_test, log_elastic_test_pred)
print("Performance on test set:\n")
print(f'Confusion Matrix:\n {elastic_cm_test}\n')
print(f'Classification Report:\n {elastic_test_report}')
REGRESSION:
Model | R2 | MSE | RMSE |
---|---|---|---|
Linear | 0.67 | 26.38 | 5.14 |
Ridge | 0.51 | 39.33 | 6.27 |
Lasso | 0.59 | 32.54 | 5.7 |
Elastic Net | 0.64 | 28.13 | 5.3 |
CLASSIFICATION:
Model | Accuracy | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1 (0) | F1 (1) |
---|---|---|---|---|---|---|---|
Logistic | 0.92 | 0.92 | 0.92 | 0.87 | 0.95 | 0.9 | 0.94 |
Ridge | 0.89 | 0.95 | 0.86 | 0.8 | 0.99 | 0.84 | 0.91 |
Lasso | 0.92 | 0.94 | 0.9 | 0.84 | 0.97 | 0.88 | 0.93 |
Elastic Net | 0.83 | 0.94 | 0.84 | 0.6 | 0.98 | 0.73 | 0.88 |
To summarize, in this notebook we created a naive framework for building linear and logistic regression models with $l1$ (lasso), $l2$ (ridge), and combined $l1$ and $l2$ (elastic net) penalties. The framework was then used to build and fit various models, with and without regularization, on different datasets, and we evaluated the results on test data. We saw how iterating over different learning rates helped identify a learning rate that reduces the loss and produces better results. We also visualized how the feature weights vary for different $\lambda$ (penalty) values.