Regularization: Ridge, Lasso and Elastic Net

Overview

Basics

Goal: The goal is to create a naive framework for building linear and logistic regression models with $l1$ (lasso), $l2$ (ridge), $l1\ and\ l2$ (elastic net) penalties.

Overfitting: is a phenomenon that occurs when a model fits very well on training data but fails to generalize to test data or other datasets. Overfit models are characterized by high variance (high uncertainity) and low bias.

Linear models can overfit if irrelevant or highly correlated features are included as the model will learn a coefficient for every feature, regardless of whether it is a signal or noise. This is especially true when p (number of features) is close to n (number of observations). Also, large estimated coefficients in a model is a sign of overfitting. The larger the absolute value of the coefficient, the more power it has to change the predicted response, resulting in a higher variance.

image.png

Regularization: is a process of penalizing model complexity to prevent overfitting. Heuristically 'large' is interpreted as 'complex' and the idea is to penalize large values of coefficients ($\theta$) jointly thereby reducing model complexity. With regularization, we add a penalty to the model for complexity. For e.g., the cost function for linear regression with penalty can be written as:

$J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}+ \lambda\ Penalty (\theta_{1}, \ldots, \theta_{p})\right)$

$\lambda$ is a constant that balances the tradeoff between the lack of fit measured by the model and the complexity measured by the penalty which depends on the regression coefficients. The bigger $\lambda$ is, the bigger the penalty for model complexity. Choosing $\lambda$ helps satisfy two goals:

- It allows for fitting the training data well and 
- Keeping model parameters small

Penalty Terms

  1. $l1$ penalty (Lasso): penalizes the sum of the absolute values of regression coefficients thereby forcing some coefficients ($\theta$) to 0. Lasso performs variable selection by forcing coefficients of some features to 0. Lasso does not handle multicollinearity well, so if there is a group of variables with high correlation, it tends to select only one variable from the group.
    $\sum_{j=1}^{p} \left|\theta_{j}\right|$
  1. $l2$ penalty (Ridge): penalizes the sum of the squared values of regression coefficients thereby reducing the magnitude of coefficients and bringing them closer to 0. By reducing the magnitude, Ridge corrects for the impact of multicollinearity. Ridge only shrinks the coefficients closer to 0, but it does not force the coefficients to be 0, and hence does not perform variable selection.
    $\sum_{j=1}^{p} \theta_{j}^{2}$
  1. $l1\ and\ l2$ penalty (Elastic-Net): penalizes the sum of the absolute values and the sum of the squared values of regression coefficients. It combines the effects of both lasso and ridge penalties; simultaneously performs variable selection and continues shrinkage. Elastic-Net often out-performs Lasso in terms of prediction accuracy.
    $\sum_{j=1}^{p} \left|\theta_{j}\right|+\sum_{j=1}^{p} \theta_{j}^{2}$

Cost Function

Cost Function gives an idea of how far the predicted values are from the actual values. Cost function for different penalty terms are:

REGRESSION:

  1. Linear Regression
    $J(\theta)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}$
  1. With $l1$ penalty: Lasso
    $J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}+\lambda \sum_{j=1}^{p} \left|\theta_{j}\right| \right)$
  1. With $l2$ penalty: Ridge
    $J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}+\lambda \sum_{j=1}^{p} \theta_{j}^{2}\right)$
  1. With $l1\ and\ l2$ penalty: Elastic Net
    $J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{0}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}+\lambda_{1} \sum_{j=1}^{p}\left|\theta_{j}\right|+\lambda_{2} \sum_{j=1}^{p} \theta_{j}^{2}\right)$

CLASSIFICATION:

  1. Logistic Regression
    $J(\theta)= -\frac{1}{m}\left[\sum_{i=1}^{m} \left(y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right)\right]$
  1. With $l1$ penatly: Lasso
    $J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} \left(y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right.\right.\right.)$ $\left.-\lambda \sum_{j=1}^{p}\left|\theta_{j}\right|\right]$
  1. With $l2$ penatly: Ridge
    $J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} (y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right.\right.)$ $\left.-\lambda \sum_{j=1}^{p} \theta_{j}^{2}\right]$
  1. With $l1\ and\ l2$ penatly: Elastic Net
    $J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} (y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right.\right.)$ $-\lambda_{1} \sum_{j=1}^{p}\left|\theta_{j}\right|\left.-\lambda_{2} \sum_{j=1}^{p} \theta_{j}^{2}\right]$

where:
$ m = no.\ of\ records$
$ p = no.\ of\ features$
$ y_{i} = true\ values $
$Features:\ x=\left\{x_{0}, x_{1}, x_{2}, \ldots x_{p}\right\}$
$Parameters:\ \theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{p}\right\}$
$i = 1,...,m$
$j = 1,...,p$
Hypothesis:

  • Linear Regression: $h_{\theta}(x)=\theta^{T}x=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$
  • Logistic Regression: $h_{\theta}(x)=\frac{1}{1+e^{-\left(\theta^{T} x\right)}}$

$\lambda_{1}=\lambda \left(l_{1}\right.$ ratio $)$
$\lambda_{2}=\lambda \left(1-l_{1}\right.$ ratio $)$
$l_{1}$ ratio = number between 0 and 1 that emphasis on lasso vs ridge penalty

Gradient Descent

Batch Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we use all the training data to compute the gradient of cost function and then updates parameters.

Gradients can be calculated by taking a partial derivative of the cost function w.r.t each parameter $\theta$. Gradients for various models with penalties can be written as:

Bias: $\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)$
Note: $\theta_{0}$ remains the same for all models.

Weights:

  1. Model without penalty terms
    $\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{j}$
  1. With $l1$ penatly: Lasso
    $\theta_{j}=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta} x^{(i)}-y^{(i)}\right) x^{(i)}_{j}+\frac{\lambda}{m} \frac{\partial\left|\theta_{j}\right|}{\partial \theta_{j}}\right]$
  1. With $l2$ penatly: Ridge
    $\theta_{j}=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta} x^{(i)}-y^{(i)}\right) x^{(i)}_{j}+\frac{\lambda}{m} \theta_{j}\right]$
  1. With $l1\ and\ l2$ penatly: Elastic Net
    $\theta_{j}=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta} x^{(i)}-y^{(i)}\right) x^{(i)}_{j}+\frac{\lambda}{m} \theta_{j}+\frac{\lambda}{m} \frac{\partial\left|\theta_{j}\right|}{\partial \theta_{j}}\right]$

where:
$\frac{\partial|\omega|}{\partial \omega}=\left\{\begin{array}{cc}1 & \omega>0 \\ -1 & \omega<0\end{array}\right.$
$ m = no.\ of\ records$
$ p = no.\ of\ features$
$i = 1,...,m$
$j = 1,...,p$
$ y_{i} = true\ values $
$h_{\theta}(x) = hypothesis$
$\alpha = learning\ rate$

image.png

Regularization

In this section, we will define the classes that will build, fit and predict linear and logistic regression models including penalty terms for $l1$ (lasso), $l2$ (ridge), $l1\ and\ l2$ (elastic net) penalties. We will use batch gradient descent to fit the model and generate feature weights. These classes will then be used to build various regression and classification models with penalty terms in later sections.

We will also define two function to help us better visualize the performance of various models and their results. The functions will:

  1. Plot Loss for different learning rates: iterates over various learning rates and build model using each learning rate. Next, it plots the loss generated by different models.
  2. Plot model weights for different penalty values ($\lambda$): visualize the model's estimated coefficient weight for different lambda values.

Define Classes

The idea is to build three classes:

  1. Base_Regression: models the relationship between a dependent variable y and independent variables X. It will include the following methods:
    a. _regularization - computes the penalty cost based on penalty type (l1, l2, l1 and l2)
    b. _gradient - computes the gradient depending on penalty type
    c. fit - fits a regression model using batch gradient descent
    d. _approximation - to compute hypothesis $h_{\theta}(x)$: approximate values of y given training data
    e. _cost - to compute model cost with regularization
    f. predict - to make predictions based on model weights/coefficients
  1. Linear_Regression(): inherits from Base_Regression and overwrites the following methods:
    a. _approximation - computes hypothesis $h_{\theta}(x)$ for linear regression
    b. _cost - computes model cost with regularization for linear regression
    c. predict - makes predictions based on model weights/coefficients
  1. Logistic_Regression(): inherits from Base_Regression and overwrites the following methods:
    a. _approximation - computes hypothesis $h_{\theta}(x)$ for logistic regression
    b. _cost - computes model cost with regularization for logistic regression
    c. predict - makes predictions based on model weights/coefficients
    d. sigmoid - new method to compute sigmoid for hypothesis
In [1]:
class Base_Regression:
    '''
    Models the relationship between a dependent variable y and independent variables X
    '''
    def __init__(self):
        self.b = None
        self.w = None
         
    def _regularization(self, lambd, w, reg_type, l1_ratio):
        '''
        Computes the penalty cost depending on penalty type:
        
        1. For lasso, we compute the sum of the abs. value of regression coefficients = norm 1 
        also written as: np.linalg.norm(self.w, ord=1)
        
        2. For ridge, we compute the sum of squared values of regression coefficients
        
        3. Elastic net is a combination lasso and ridge 
        '''
        
        if reg_type == 'lasso':
            return lambd * np.sum(np.abs(self.w))
        
        elif reg_type == 'ridge':
            return lambd * np.sum(np.square(self.w))
        
        elif reg_type == 'elastic net':
            l1 = l1_ratio * np.sum(np.abs(self.w))
            l2 = (1 - l1_ratio) * np.sum(np.square(self.w))
            return lambd * (l1 + l2)
        
        else:
            return 0
    
    def _gradient(self, n, y, y_pred, w, X, lambd, reg_type, l1_ratio):

        '''
        Computes the gradient depending on penalty type:
        
        1. For lasso, gradient is (lambd/n_obs) * np.sign(Q)      
        
        2. For ridge, gradient is (lambd/n_obs) * Q (theta)
        
        3. For elastic net, gradient is l1_ratio*gradient(lasso) + (1-l1_ratio)*gradient(ridge)
        '''
        
        D_b = (1/n) * np.sum(y_pred - y)
        D_w = (1/n) * np.dot(y_pred - y, X)
        
        if reg_type == 'lasso':
            D_w += (lambd/n) * np.sign(self.w)
       
        elif reg_type == 'ridge':
            D_w += (lambd/n) * self.w
            
        elif reg_type == 'elastic net':
            l1 = l1_ratio * np.sign(self.w)
            l2 = (1 - l1_ratio) * self.w
            D_w += (lambd/n) * (l1 + l2)
            
        else:
            D_w = D_w
            
        return D_b, D_w
    
    
    def fit(self, X, y, learning_rate=0.01, epochs=1000, verbose=False, lambd=0, reg_type=None, l1_ratio=0.5):
        '''
        fits a regression model using batch gradient descent
        '''
        n_obs, m_features = X.shape
        total_loss = []
        
        # Initialize weights and bias
        self.b = np.random.rand(1)
        self.w = np.random.rand(m_features)
        
        for e in range(epochs):
            
            # Compute loss
            y_pred = self._approximation(self.w, self.b, X)
            loss = self._cost(n_obs, y_pred, y, self.w, lambd, reg_type, l1_ratio)
            total_loss.append(loss)
            
            # Calculate gradients
            D_b, D_w = self._gradient(n_obs, y, y_pred, self.w, X, lambd, reg_type, l1_ratio)
            
            # Update parameters
            self.b -= learning_rate * D_b
            self.w -= learning_rate * D_w
            
            if verbose == True and e%1000 == 0:
                print(f'Epoch: {e}, Loss: {loss}')
        
        return self.b, self.w, total_loss

    def _approximation(self, w, b, X):
        raise NotImplementedError()
          
    def _cost(self, n_obs, y_pred, y, w, lambd, reg_type, l1_ratio):
        raise NotImplementedError()
    
    def predict(self, X, threshold=0.5):
        return self._predict(X, self.w, self.b)
    
    def _predict(self, X, w, b):
        raise NotImplementedError()
    

class Linear_Regression(Base_Regression):
    
    def _approximation(self, w, b, X):
        '''
        Computes hypothesis (hQ_x): y_pred = (b0 + b1Xm)
        '''
        return self.b + np.dot(self.w, X.T)
    
    def _cost(self, n_obs, y_pred, y, w, lambd, reg_type, l1_ratio):
        '''
        Computes cost with regularization
        (1/n_obs) * (sum((ym - y_pred)**2) + regularization cost)
        '''
        cost = (1/n_obs) * (np.sum((y - y_pred)**2) + 
                            self._regularization(lambd, self.w, reg_type, l1_ratio))
        return cost
    
    def _predict(self, X, w, b):
        '''
        Make predictions based on model weights/coefficients
        '''
        return self.b + np.dot(X, self.w)
    

class Logistic_Regression(Base_Regression):
    
    def _approximation(self, w, b, X):
        Qt_x = self.b + np.dot(self.w, X.T)
        return self._sigmoid(Qt_x)
    
    def _sigmoid(self, z):
        return 1/(1 + np.exp(-z))
    
    def _cost(self, n_obs, y_pred, y, w, lambd, reg_type, l1_ratio):
        '''
        a really small value 'epsilon' is added to avoid 
        overflow and divison by zero error for log
        
        loss = (-1/m) * (sum(y * log(hQ_x) + (1-y) * log(1 - hQ_x)) - regularization cost)
        where hQ_x = 1/(1 + e^(-Q_t*x))
        '''
        eps = 1e-5
        cost = (-1/n_obs) * (np.sum(y * np.log(y_pred+eps) + (1-y) * np.log(1-y_pred+eps)) - 
                             self._regularization(lambd, self.w, reg_type, l1_ratio))
        return cost
    
    def _predict(self, X, w, b, threshold=0.5):
        y_predicted = self._approximation(self.w, self.b, X)
        y_predicted_cls = [1 if i>=threshold else 0 for i in y_predicted]
        return np.array(y_predicted_cls)
        

Plot Loss for different learning rates ($\alpha$)

Learning rate is extremely important for finding best model. We will define a function that iterates over various learning rates and build model using each learning rate. Next, it plots the loss generated by different models.

In [2]:
def loss_per_epoch_plot(epochs, model, learning_rate):
    """
    Pass in the no. of epochs, model class, list of learning rates 
    and plot the loss per epoch for models generated using different 
    learning rates
    """ 
    fig = plt.figure(figsize = (15, 8))
    losses = {} 
    
    e = epochs
    model = model
    for lr in learning_rate:
        try:
            b, w, epoch_loss = model.fit(x_train, y_train, 
                                         learning_rate=lr, epochs=e)
            losses[f'LR={str(lr)}'] = epoch_loss
        except:
            continue

    # Plot loss    
    xs = np.arange(len(epoch_loss))

    for k, v in losses.items():
        plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")

    plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
    plt.xlabel('Epochs', size=14)
    plt.ylabel('Loss', size=14)
    plt.legend(fontsize=12)
    return fig

Plot model weights for different penalty values ($\lambda$)

We will define a function to help us visualize the model's estimated weight for different lambda values.

In [3]:
# Define a function to plot weights and lambda values
def weight_versus_lambda_plot(weight, lambd, features):
    """
    Pass in the estimated weight, the lambda value and the names
    for the features and plot the model's estimated weight 
    for different lambda values
    """
    fig = plt.figure(figsize = (10, 8))
    
    # ensure that the weight is an array
    weight = np.array(weight)
    for col in range(weight.shape[1]):
        plt.plot(lambd, weight[:, col], label = features[col])

    plt.axhline(0, color = 'black', linestyle = '--', linewidth = 3)
    
    # manually specify the coordinate of the legend
    plt.legend(bbox_to_anchor = (1.3, 0.9))
    plt.ylabel('Feature weight')
    plt.xlabel('Lambda')
    return fig

Regression

For regression, we will use the Boston Housing Prices toy dataset available in the scikit-learn library. We will:

  1. Load the data
  2. Split and Standardize the data
  3. Build and fit a basic model using batch gradient descent
    a. Evaluate model performance on test set
  4. Build a regularized model using l2 penalty: Ridge Regression
    a. Plot variation of weights for different lambda values b. Fit model and evaluate performance
  5. Build a regularized model using l1 penalty: Lasso Regression
    a. Plot variation of weights for different lambda values b. Fit model and evaluate performance
  6. Build a regularized model using l1 and l2 penalty: Elastic Net Regression
    a. Plot variation of weights for different lambda values b. Fit model and evaluate performance

Get Data

Load Data

In [4]:
# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
In [5]:
# Read the data
from sklearn.datasets import load_boston
boston_data=pd.DataFrame(load_boston().data,columns=load_boston().feature_names)
boston_data.head()
Out[5]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
In [6]:
# Create dependent and independent variables
data = load_boston()
X = data.data
Y = data.target
features = data.feature_names
In [7]:
# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=123)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (379, 13)'
'Shape of test data (127, 13)'
In [8]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Linear Model

We will train a regression model without regularization for different learning rates. Next, we will build a model using a selected value of learning rate and evaluate its performance on test set.

Plot Loss for different learning rates

Learning rate is extremely important for finding best model. Here, we iterate over various learning rates and build model using each learning rate. Next, we plot the loss generated by different models.

In [9]:
# Build and fit LR model for different learning rates

epochs = 10000
model = Linear_Regression()
learning_rate = [0.1, 0.01, 0.001, 0.0001, 0.00001]

mlr_fig = loss_per_epoch_plot(epochs, model, learning_rate)

From the plot above, we can initially see the loss reducing significantly and then stabalize, as epochs increase, for different learning rates $\alpha$. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.001 to build and fit a model and then evaluate its performance.

Fit Model and Evaluate Performance

We will bild the model using a learning rate and evaluate the performance of our model on test data.

In [10]:
# Build and fit best LR model
alpha = 0.001
e = 10000
batch = 32

# Build model
mlr_model = Linear_Regression()


# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, verbose=True)
Epoch: 0, Loss: 568.5920328719429
Epoch: 1000, Loss: 90.3083203919063
Epoch: 2000, Loss: 32.10197340638541
Epoch: 3000, Loss: 23.821701722205226
Epoch: 4000, Loss: 22.43743142232134
Epoch: 5000, Loss: 22.072809206405893
Epoch: 6000, Loss: 21.899094236283833
Epoch: 7000, Loss: 21.785539160309746
Epoch: 8000, Loss: 21.70337462375776
Epoch: 9000, Loss: 21.641697656784903
In [11]:
# Create predictions on test set
mlr_test_pred = mlr_model.predict(x_test)
In [12]:
# Calculate metrics
mlr_test_r2 = np.abs(r2_score(y_test, mlr_test_pred))
mlr_test_mse = mean_squared_error(y_test, mlr_test_pred)
mlr_test_rmse = np.sqrt(mlr_test_mse)

print("Performance on test set:")
print(f'Test R2 is: {mlr_test_r2}')
print(f'Test MSE is: {mlr_test_mse}')
print(f'Test RMSE is: {mlr_test_rmse}')
Performance on test set:
Test R2 is: 0.6664908666139335
Test MSE is: 26.327957451131354
Test RMSE is: 5.131077611099968

Ridge Model

We will train a regression model with $l2$ penalty (Ridge) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on test set.

Plot Weights for different Lambda values

In [13]:
# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [10, 100, 1000]
e = 10000

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=0.001, 
                                     epochs=e, verbose=False, lambd=l, 
                                     reg_type='ridge')
    weights.append(w)
In [14]:
# Plot weights and lambda

ridge_fig = weight_versus_lambda_plot(weights, lambd, features)
ridge_fig.suptitle('Ridge: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that feature weights continue to decrease and come close to zero as $\lambda$ increases. Weights remain close to zero but do not become zero.

Build and Fit Model

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

In [15]:
# Build and fit model
alpha = 0.001
e = 10000
lambd = 1000

# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha, 
                                 epochs=e, verbose=True, lambd=lambd, 
                                 reg_type='ridge')
Epoch: 0, Loss: 622.8415551485903
Epoch: 1000, Loss: 121.2586514272164
Epoch: 2000, Loss: 63.209795385576655
Epoch: 3000, Loss: 55.36534799076975
Epoch: 4000, Loss: 54.30478750806339
Epoch: 5000, Loss: 54.161399830930264
Epoch: 6000, Loss: 54.14201382773705
Epoch: 7000, Loss: 54.139392841551754
Epoch: 8000, Loss: 54.13903848441503
Epoch: 9000, Loss: 54.13899057535658
In [16]:
# Create predictions on test set
mlr_ridge_test_pred = mlr_model.predict(x_test)
In [17]:
# Calculate metrics
mlr_ridge_test_r2 = np.abs(r2_score(y_test, mlr_ridge_test_pred))
mlr_ridge_test_mse = mean_squared_error(y_test, mlr_ridge_test_pred)
mlr_ridge_test_rmse = np.sqrt(mlr_ridge_test_mse)

print("Performance on test set:")
print(f'Test R2 is: {mlr_ridge_test_r2}')
print(f'Test MSE is: {mlr_ridge_test_mse}')
print(f'Test RMSE is: {mlr_ridge_test_rmse}')
Performance on test set:
Test R2 is: 0.5018065147532993
Test MSE is: 39.32850878426335
Test RMSE is: 6.271244596111951

Lasso Model

We will train a regression model with $l1$ penalty (Lasso) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on test set.

Plot Weights for different Lambda values

In [18]:
# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [1,10,50,100,150,200,250]

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=0.01, 
                                     epochs=e, verbose=False, lambd=l, 
                                     reg_type='lasso')
    weights.append(w)
In [19]:
# Plot weights and lambda

lasso_fig = weight_versus_lambda_plot(weights, lambd, features)
lasso_fig.suptitle('Lasso: Variation in feature weights as lambda grows', 
                   fontsize=16, y=0.93);

From the plot, we can see that feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero.

Build and Fit Model

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

In [20]:
# Build and fit model
alpha = 0.001
e = 5000
lambd = 250

# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, reg_type='lasso')
Epoch: 0, Loss: 612.7721592862399
Epoch: 1000, Loss: 102.05510468089096
Epoch: 2000, Loss: 41.37289240919998
Epoch: 3000, Loss: 32.883892146603614
Epoch: 4000, Loss: 31.70026715530038
In [21]:
# Create predictions on test set
mlr_lasso_test_pred = mlr_model.predict(x_test)
In [22]:
# Calculate metrics
mlr_lasso_test_r2 = np.abs(r2_score(y_test, mlr_lasso_test_pred))
mlr_lasso_test_mse = mean_squared_error(y_test, mlr_lasso_test_pred)
mlr_lasso_test_rmse = np.sqrt(mlr_lasso_test_mse)

print("Performance on test set:")
print(f'Test R2 is: {mlr_lasso_test_r2}')
print(f'Test MSE is: {mlr_lasso_test_mse}')
print(f'Test RMSE is: {mlr_lasso_test_rmse}')
Performance on test set:
Test R2 is: 0.5878305143618532
Test MSE is: 32.5375817158631
Test RMSE is: 5.704172307694001

Elastic Net Model

We will train a regression model with $l1\ and\ l2$ penalty (Elastic Net) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on test set.

Plot Weights for different Lambda values

In [23]:
# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [1,10,100]

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=0.01, 
                                    epochs=e, verbose=False, lambd=l, 
                                    reg_type='elastic net', l1_ratio=0.6)
    weights.append(w)
In [24]:
# Plot weights and lambda

elastic_fig = weight_versus_lambda_plot(weights, lambd, features)
elastic_fig.suptitle('Elastic Net: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero.

Build and Fit Model

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

In [25]:
# Build and fit model
alpha = 0.001
e = 10000
lambd = 100

# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, 
                                reg_type='elastic net', l1_ratio=0.6)
Epoch: 0, Loss: 562.3691236063022
Epoch: 1000, Loss: 94.7863715039122
Epoch: 2000, Loss: 37.51882786055542
Epoch: 3000, Loss: 29.653415539049384
Epoch: 4000, Loss: 28.49646337953559
Epoch: 5000, Loss: 28.286807139443578
Epoch: 6000, Loss: 28.226957040648003
Epoch: 7000, Loss: 28.199134418791104
Epoch: 8000, Loss: 28.18246156038106
Epoch: 9000, Loss: 28.171421424880272
In [26]:
# Create predictions on test set
mlr_elastic_test_pred = mlr_model.predict(x_test)
In [27]:
# Calculate metrics
mlr_elastic_test_r2 = np.abs(r2_score(y_test, mlr_elastic_test_pred))
mlr_elastic_test_mse = mean_squared_error(y_test, mlr_elastic_test_pred)
mlr_elastic_test_rmse = np.sqrt(mlr_elastic_test_mse)

print("Performance on test set:")
print(f'Test R2 is: {mlr_elastic_test_r2}')
print(f'Test MSE is: {mlr_elastic_test_mse}')
print(f'Test RMSE is: {mlr_elastic_test_rmse}')
Performance on test set:
Test R2 is: 0.6437118460102644
Test MSE is: 28.126184321690875
Test RMSE is: 5.303412516643494

Classification

Get Data

For classification, we will use the Breast Cancer toy dataset available in the scikit-learn library. We will:

  1. Load the data
  2. Split and Standardize the data
  3. Build and fit a basic model using batch gradient descent
    a. Evaluate model performance on test set
  4. Build a regularized model using l2 penalty: Ridge Regression
    a. Plot variation of weights for different lambda values b. Fit model and evaluate performance
  5. Build a regularized model using l1 penalty: Lasso Regression
    a. Plot variation of weights for different lambda values b. Fit model and evaluate performance
  6. Build a regularized model using l1 and l2 penalty: Elastic Net Regression
    a. Plot variation of weights for different lambda values b. Fit model and evaluate performance

Load Data

In [28]:
# Basic Imports
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report
In [29]:
# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()
Out[29]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

In [30]:
# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')

X=load_breast_cancer().data[:,:10]
display(f'Shape of X: {X.shape}')

features = load_breast_cancer().feature_names
'Shape of Y: (569,)'
'Shape of X: (569, 10)'
In [31]:
# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=1234)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')
'Shape of training data (426, 10)'
'Shape of test data (143, 10)'
In [32]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Logistic Model

We will train a logistic regression model without regularization for different learning rates. Next, we will build a model using a selected value of learning rate and evaluate its performance on test set.

Plot Loss for different learning rates

Learning rate is extremely important for finding best model. Here, we iterate over various learning rates and build model using each learning rate. Next, we plot the loss generated by different models.

In [33]:
# Build and fit LR model for different learning rates

epochs = 5000
model = Logistic_Regression()
learning_rate = [0.1, 0.01, 0.001]

log_fig = loss_per_epoch_plot(epochs, model, learning_rate)

From the plot above, we can initially see the loss reducing significantly and then stabalize, as epochs increase, for different learning rates $\alpha$. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.01 to build and fit a model and then evaluate its performance.

Fit Model and Evaluate Performance

We will bild the model using a learning rate and evaluate the performance of our model on test data.

In [34]:
# Build and fit best LR model
alpha = 0.01
e = 5000

# Build model
log_model = Logistic_Regression()


# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, verbose=True)
Epoch: 0, Loss: 2.801124198008247
Epoch: 1000, Loss: 0.16386755120467672
Epoch: 2000, Loss: 0.13467828554073283
Epoch: 3000, Loss: 0.12344957042805385
Epoch: 4000, Loss: 0.1174704941905151
In [35]:
# Create predictions on test set
log_test_pred = log_model.predict(x_test, 0.5)
In [36]:
# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc
In [37]:
# Calculate accuracy
log_test_acc = accuracy(y_test, log_test_pred)
print(f'Accuracy on test set: {log_test_acc}')
Accuracy on test set: 0.9230769230769231
In [38]:
# Calculate metrics
cm_test = confusion_matrix(y_test, log_test_pred)
test_report = classification_report(y_test, log_test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {cm_test}\n')
print(f'Classification Report:\n {test_report}')
Performance on test set:

Confusion Matrix:
 [[48  7]
 [ 4 84]]

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.87      0.90        55
           1       0.92      0.95      0.94        88

    accuracy                           0.92       143
   macro avg       0.92      0.91      0.92       143
weighted avg       0.92      0.92      0.92       143

Ridge Model

We will train a regression model with $l2$ penalty (Ridge) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of 𝜆 and evaluate its performance on test set.

Plot Weights for different Lambda values

In [39]:
# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
log_model = Logistic_Regression()
lambd = [10,50,100,150]
e = 5000

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=0.01, 
                                    epochs=e, verbose=False, lambd=l, 
                                    reg_type='ridge')
    weights.append(w)
In [40]:
# Plot weights and lambda

ridge_fig = weight_versus_lambda_plot(weights, lambd, features)
ridge_fig.suptitle('Ridge: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that feature weights continue to decrease and come close to zero as $\lambda$ increases. Weights remain close to zero but do not become zero.

Build and Fit Model

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

In [41]:
# Build and fit model
alpha = 0.01
e = 5000
lambd = 150

# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, 
                                reg_type='ridge')
Epoch: 0, Loss: 4.810672897888645
Epoch: 1000, Loss: 0.4630314146284181
Epoch: 2000, Loss: 0.46222240367092043
Epoch: 3000, Loss: 0.4621400387231419
Epoch: 4000, Loss: 0.4621245765327778
In [42]:
# Create predictions on test set
log_ridge_test_pred = log_model.predict(x_test, 0.5)
In [43]:
# Calculate accuracy
ridge_test_acc = accuracy(y_test, log_ridge_test_pred)
print(f'Accuracy on test set: {ridge_test_acc}')
Accuracy on test set: 0.8881118881118881
In [44]:
# Calculate metrics
ridge_cm_test = confusion_matrix(y_test, log_ridge_test_pred)
ridge_test_report = classification_report(y_test, log_ridge_test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {ridge_cm_test}\n')
print(f'Classification Report:\n {ridge_test_report}')
Performance on test set:

Confusion Matrix:
 [[41 14]
 [ 2 86]]

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.75      0.84        55
           1       0.86      0.98      0.91        88

    accuracy                           0.89       143
   macro avg       0.91      0.86      0.88       143
weighted avg       0.90      0.89      0.88       143

Lasso Model

We will train a regression model with $l1$ penalty (Lasso) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of 𝜆 and evaluate its performance on test set.

Plot Weights for different Lambda values

In [45]:
# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [0.1, 1, 10]
e = 5000

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=0.01, 
                                    epochs=e, verbose=False, lambd=l, 
                                    reg_type='lasso')
    weights.append(w)
In [46]:
# Plot weights and lambda

lasso_fig = weight_versus_lambda_plot(weights, lambd, features)
lasso_fig.suptitle('Lasso: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that most feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero. There are a few features for which weights tend to go up as $\lambda\ >\ 1$.

Build and Fit Model

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

In [47]:
# Build and fit model
alpha = 0.01
e = 5000
lambd = 10

# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, 
                                reg_type='lasso')
Epoch: 0, Loss: 2.3778737795894664
Epoch: 1000, Loss: 0.26718522501660646
Epoch: 2000, Loss: 0.2584320409432221
Epoch: 3000, Loss: 0.2565173888446097
Epoch: 4000, Loss: 0.2559426202287423
In [48]:
# Create predictions on test set
log_lasso_test_pred = log_model.predict(x_test, 0.5)
In [49]:
# Calculate accuracy
lasso_test_acc = accuracy(y_test, log_lasso_test_pred)
print(f'Accuracy on test set: {lasso_test_acc}')
Accuracy on test set: 0.916083916083916
In [50]:
# Calculate metrics
lasso_cm_test = confusion_matrix(y_test, log_lasso_test_pred)
lasso_test_report = classification_report(y_test, log_lasso_test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {lasso_cm_test}\n')
print(f'Classification Report:\n {lasso_test_report}')
Performance on test set:

Confusion Matrix:
 [[46  9]
 [ 3 85]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.84      0.88        55
           1       0.90      0.97      0.93        88

    accuracy                           0.92       143
   macro avg       0.92      0.90      0.91       143
weighted avg       0.92      0.92      0.92       143

Elastic Net Model

We will train a regression model with $l1\ and\ l2$ penalty (Elastic Net) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of 𝜆 and evaluate its performance on test set.

Plot Weights for different Lambda values

In [51]:
# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [1, 10, 50, 100]
e = 5000

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=0.01, 
                                    epochs=e, verbose=False, lambd=l, 
                                    reg_type='elastic net', l1_ratio=0.6)
    weights.append(w)
In [52]:
# Plot weights and lambda

elastic_fig = weight_versus_lambda_plot(weights, lambd, features)
elastic_fig.suptitle('Elastic Net: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that most feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero. There are a few features for which weights tend to go up as $\lambda\ >\ 1$.

Build and Fit Model

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

In [53]:
# Build and fit model
alpha = 0.01
e = 5000
lambd = 100

# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, 
                                reg_type='elastic net', l1_ratio=0.6)
Epoch: 0, Loss: 4.398304647049499
Epoch: 1000, Loss: 0.5581468375691527
Epoch: 2000, Loss: 0.5581587247090656
Epoch: 3000, Loss: 0.5583725172916181
Epoch: 4000, Loss: 0.5584206447040022
In [54]:
# Create predictions on test set
log_elastic_test_pred = log_model.predict(x_test, 0.5)
In [55]:
# Calculate accuracy
elastic_test_acc = accuracy(y_test, log_elastic_test_pred)
print(f'Accuracy on test set: {elastic_test_acc}')
Accuracy on test set: 0.8321678321678322
In [56]:
# Calculate metrics
elastic_cm_test = confusion_matrix(y_test, log_elastic_test_pred)
elastic_test_report = classification_report(y_test, log_elastic_test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {elastic_cm_test}\n')
print(f'Classification Report:\n {elastic_test_report}')
Performance on test set:

Confusion Matrix:
 [[33 22]
 [ 2 86]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.60      0.73        55
           1       0.80      0.98      0.88        88

    accuracy                           0.83       143
   macro avg       0.87      0.79      0.81       143
weighted avg       0.85      0.83      0.82       143

Model Results

REGRESSION:

Model R2 MSE RMSE
Linear 0.67 26.38 5.14
Ridge 0.51 39.33 6.27
Lasso 0.59 32.54 5.7
Elastic Net 0.64 28.13 5.3

CLASSIFICATION:

Model Accuracy Precision Recall F1 Score
0 1 0 1 0 1
Logistic 0.92 0.92 0.92 0.87 0.95 0.9 0.94
Ridge 0.89 0.95 0.86 0.8 0.99 0.84 0.91
Lasso 0.92 0.94 0.9 0.84 0.97 0.88 0.93
Elastic Net 0.83 0.94 0.84 0.6 0.98 0.73 0.88

Summary

To summarize, in this notebook we created a naive framework for building linear and logistic regression models with $l1$ (lasso), $l2$ (ridge), $l1\ and\ l2$ (elastic net) penalties. The framework was then used to build and fit various models with and without regularization using different datasets. Later, we evaluated the results of our models using test data. We saw how iterating over different learning rates helped identify a learning rate with reduced loss to generate better results. We also visualized the variation of feature weights for different 𝜆 (penalty) values.