Regularization: Ridge, Lasso and Elastic Net¶

Source: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

Overview¶

Basics¶

Goal: The goal is to create a naive framework for building linear and logistic regression models with $l1$ (lasso), $l2$ (ridge), $l1\ and\ l2$ (elastic net) penalties.

Overfitting: is a phenomenon that occurs when a model fits very well on training data but fails to generalize to test data or other datasets. Overfit models are characterized by high variance (high uncertainity) and low bias.

Linear models can overfit if irrelevant or highly correlated features are included as the model will learn a coefficient for every feature, regardless of whether it is a signal or noise. This is especially true when p (number of features) is close to n (number of observations). Also, large estimated coefficients in a model is a sign of overfitting. The larger the absolute value of the coefficient, the more power it has to change the predicted response, resulting in a higher variance.

Regularization: is a process of penalizing model complexity to prevent overfitting. Heuristically 'large' is interpreted as 'complex' and the idea is to penalize large values of coefficients ($\theta$) jointly thereby reducing model complexity. With regularization, we add a penalty to the model for complexity. For e.g., the cost function for linear regression with penalty can be written as:

$J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}+ \lambda\ Penalty (\theta_{1}, \ldots, \theta_{p})\right)$

$\lambda$ is a constant that balances the tradeoff between the lack of fit measured by the model and the complexity measured by the penalty which depends on the regression coefficients. The bigger $\lambda$ is, the bigger the penalty for model complexity. Choosing $\lambda$ helps satisfy two goals:

- It allows for fitting the training data well and 
- Keeping model parameters small

Penalty Terms¶

$l1$ penalty (Lasso): penalizes the sum of the absolute values of regression coefficients thereby forcing some coefficients ($\theta$) to 0. Lasso performs variable selection by forcing coefficients of some features to 0. Lasso does not handle multicollinearity well, so if there is a group of variables with high correlation, it tends to select only one variable from the group.
$\sum_{j=1}^{p} \left|\theta_{j}\right|$

$l2$ penalty (Ridge): penalizes the sum of the squared values of regression coefficients thereby reducing the magnitude of coefficients and bringing them closer to 0. By reducing the magnitude, Ridge corrects for the impact of multicollinearity. Ridge only shrinks the coefficients closer to 0, but it does not force the coefficients to be 0, and hence does not perform variable selection.
$\sum_{j=1}^{p} \theta_{j}^{2}$

$l1\ and\ l2$ penalty (Elastic-Net): penalizes the sum of the absolute values and the sum of the squared values of regression coefficients. It combines the effects of both lasso and ridge penalties; simultaneously performs variable selection and continues shrinkage. Elastic-Net often out-performs Lasso in terms of prediction accuracy.
$\sum_{j=1}^{p} \left|\theta_{j}\right|+\sum_{j=1}^{p} \theta_{j}^{2}$

Cost Function¶

Cost Function gives an idea of how far the predicted values are from the actual values. Cost function for different penalty terms are:

REGRESSION:

Linear Regression
$J(\theta)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}$

With $l1$ penalty: Lasso
$J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}+\lambda \sum_{j=1}^{p} \left|\theta_{j}\right| \right)$

With $l2$ penalty: Ridge
$J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}+\lambda \sum_{j=1}^{p} \theta_{j}^{2}\right)$

With $l1\ and\ l2$ penalty: Elastic Net
$J(\theta)=\frac{1}{m}\left(\sum_{i=1}^{m}\left(h_{0}\left(x^{(i)}\right)-\left(y^{(i)}\right)\right)^{2}+\lambda_{1} \sum_{j=1}^{p}\left|\theta_{j}\right|+\lambda_{2} \sum_{j=1}^{p} \theta_{j}^{2}\right)$

CLASSIFICATION:

Logistic Regression
$J(\theta)= -\frac{1}{m}\left[\sum_{i=1}^{m} \left(y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right)\right]$

With $l1$ penatly: Lasso
$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} \left(y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right.\right.\right.)$ $\left.-\lambda \sum_{j=1}^{p}\left|\theta_{j}\right|\right]$

With $l2$ penatly: Ridge
$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} (y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right.\right.)$ $\left.-\lambda \sum_{j=1}^{p} \theta_{j}^{2}\right]$

With $l1\ and\ l2$ penatly: Elastic Net
$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} (y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right.\right.)$ $-\lambda_{1} \sum_{j=1}^{p}\left|\theta_{j}\right|\left.-\lambda_{2} \sum_{j=1}^{p} \theta_{j}^{2}\right]$

where:
$ m = no.\ of\ records$
$ p = no.\ of\ features$
$ y_{i} = true\ values $
$Features:\ x=\left\{x_{0}, x_{1}, x_{2}, \ldots x_{p}\right\}$
$Parameters:\ \theta=\left\{\theta_{0}, \theta_{1}, \theta_{2}, \ldots \theta_{p}\right\}$
$i = 1,...,m$
$j = 1,...,p$
Hypothesis:

Linear Regression: $h_{\theta}(x)=\theta^{T}x=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$
Logistic Regression: $h_{\theta}(x)=\frac{1}{1+e^{-\left(\theta^{T} x\right)}}$

$\lambda_{1}=\lambda \left(l_{1}\right.$ ratio $)$
$\lambda_{2}=\lambda \left(1-l_{1}\right.$ ratio $)$
$l_{1}$ ratio = number between 0 and 1 that emphasis on lasso vs ridge penalty

Gradient Descent¶

Batch Gradient Descent is an optimization technique that seeks to find model parameters (coefficients and bias) that minimize a cost function. In this technique, we use all the training data to compute the gradient of cost function and then updates parameters.

Gradients can be calculated by taking a partial derivative of the cost function w.r.t each parameter $\theta$. Gradients for various models with penalties can be written as:

Bias: $\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)$
Note: $\theta_{0}$ remains the same for all models.

Weights:

Model without penalty terms
$\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)}_{j}$

With $l1$ penatly: Lasso
$\theta_{j}=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta} x^{(i)}-y^{(i)}\right) x^{(i)}_{j}+\frac{\lambda}{m} \frac{\partial\left|\theta_{j}\right|}{\partial \theta_{j}}\right]$

With $l2$ penatly: Ridge
$\theta_{j}=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta} x^{(i)}-y^{(i)}\right) x^{(i)}_{j}+\frac{\lambda}{m} \theta_{j}\right]$

With $l1\ and\ l2$ penatly: Elastic Net
$\theta_{j}=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta} x^{(i)}-y^{(i)}\right) x^{(i)}_{j}+\frac{\lambda}{m} \theta_{j}+\frac{\lambda}{m} \frac{\partial\left|\theta_{j}\right|}{\partial \theta_{j}}\right]$

where:
$\frac{\partial|\omega|}{\partial \omega}=\left\{\begin{array}{cc}1 & \omega>0 \\ -1 & \omega<0\end{array}\right.$
$ m = no.\ of\ records$
$ p = no.\ of\ features$
$i = 1,...,m$
$j = 1,...,p$
$ y_{i} = true\ values $
$h_{\theta}(x) = hypothesis$
$\alpha = learning\ rate$

Regularization¶

In this section, we will define the classes that will build, fit and predict linear and logistic regression models including penalty terms for $l1$ (lasso), $l2$ (ridge), $l1\ and\ l2$ (elastic net) penalties. We will use batch gradient descent to fit the model and generate feature weights. These classes will then be used to build various regression and classification models with penalty terms in later sections.

We will also define two function to help us better visualize the performance of various models and their results. The functions will:

Plot Loss for different learning rates: iterates over various learning rates and build model using each learning rate. Next, it plots the loss generated by different models.
Plot model weights for different penalty values ($\lambda$): visualize the model's estimated coefficient weight for different lambda values.

Define Classes¶

The idea is to build three classes:

Base_Regression: models the relationship between a dependent variable y and independent variables X. It will include the following methods:
a. _regularization - computes the penalty cost based on penalty type (l1, l2, l1 and l2)
b. _gradient - computes the gradient depending on penalty type
c. fit - fits a regression model using batch gradient descent
d. _approximation - to compute hypothesis $h_{\theta}(x)$: approximate values of y given training data
e. _cost - to compute model cost with regularization
f. predict - to make predictions based on model weights/coefficients

Linear_Regression(): inherits from Base_Regression and overwrites the following methods:
a. _approximation - computes hypothesis $h_{\theta}(x)$ for linear regression
b. _cost - computes model cost with regularization for linear regression
c. predict - makes predictions based on model weights/coefficients

Logistic_Regression(): inherits from Base_Regression and overwrites the following methods:
a. _approximation - computes hypothesis $h_{\theta}(x)$ for logistic regression
b. _cost - computes model cost with regularization for logistic regression
c. predict - makes predictions based on model weights/coefficients
d. sigmoid - new method to compute sigmoid for hypothesis

class Base_Regression:
    '''
    Models the relationship between a dependent variable y and independent variables X
    '''
    def __init__(self):
        self.b = None
        self.w = None
         
    def _regularization(self, lambd, w, reg_type, l1_ratio):
        '''
        Computes the penalty cost depending on penalty type:
        
        1. For lasso, we compute the sum of the abs. value of regression coefficients = norm 1 
        also written as: np.linalg.norm(self.w, ord=1)
        
        2. For ridge, we compute the sum of squared values of regression coefficients
        
        3. Elastic net is a combination lasso and ridge 
        '''
        
        if reg_type == 'lasso':
            return lambd * np.sum(np.abs(self.w))
        
        elif reg_type == 'ridge':
            return lambd * np.sum(np.square(self.w))
        
        elif reg_type == 'elastic net':
            l1 = l1_ratio * np.sum(np.abs(self.w))
            l2 = (1 - l1_ratio) * np.sum(np.square(self.w))
            return lambd * (l1 + l2)
        
        else:
            return 0
    
    def _gradient(self, n, y, y_pred, w, X, lambd, reg_type, l1_ratio):

        '''
        Computes the gradient depending on penalty type:
        
        1. For lasso, gradient is (lambd/n_obs) * np.sign(Q)      
        
        2. For ridge, gradient is (lambd/n_obs) * Q (theta)
        
        3. For elastic net, gradient is l1_ratio*gradient(lasso) + (1-l1_ratio)*gradient(ridge)
        '''
        
        D_b = (1/n) * np.sum(y_pred - y)
        D_w = (1/n) * np.dot(y_pred - y, X)
        
        if reg_type == 'lasso':
            D_w += (lambd/n) * np.sign(self.w)
       
        elif reg_type == 'ridge':
            D_w += (lambd/n) * self.w
            
        elif reg_type == 'elastic net':
            l1 = l1_ratio * np.sign(self.w)
            l2 = (1 - l1_ratio) * self.w
            D_w += (lambd/n) * (l1 + l2)
            
        else:
            D_w = D_w
            
        return D_b, D_w
    
    
    def fit(self, X, y, learning_rate=0.01, epochs=1000, verbose=False, lambd=0, reg_type=None, l1_ratio=0.5):
        '''
        fits a regression model using batch gradient descent
        '''
        n_obs, m_features = X.shape
        total_loss = []
        
        # Initialize weights and bias
        self.b = np.random.rand(1)
        self.w = np.random.rand(m_features)
        
        for e in range(epochs):
            
            # Compute loss
            y_pred = self._approximation(self.w, self.b, X)
            loss = self._cost(n_obs, y_pred, y, self.w, lambd, reg_type, l1_ratio)
            total_loss.append(loss)
            
            # Calculate gradients
            D_b, D_w = self._gradient(n_obs, y, y_pred, self.w, X, lambd, reg_type, l1_ratio)
            
            # Update parameters
            self.b -= learning_rate * D_b
            self.w -= learning_rate * D_w
            
            if verbose == True and e%1000 == 0:
                print(f'Epoch: {e}, Loss: {loss}')
        
        return self.b, self.w, total_loss

    def _approximation(self, w, b, X):
        raise NotImplementedError()
          
    def _cost(self, n_obs, y_pred, y, w, lambd, reg_type, l1_ratio):
        raise NotImplementedError()
    
    def predict(self, X, threshold=0.5):
        return self._predict(X, self.w, self.b)
    
    def _predict(self, X, w, b):
        raise NotImplementedError()
    

class Linear_Regression(Base_Regression):
    
    def _approximation(self, w, b, X):
        '''
        Computes hypothesis (hQ_x): y_pred = (b0 + b1Xm)
        '''
        return self.b + np.dot(self.w, X.T)
    
    def _cost(self, n_obs, y_pred, y, w, lambd, reg_type, l1_ratio):
        '''
        Computes cost with regularization
        (1/n_obs) * (sum((ym - y_pred)**2) + regularization cost)
        '''
        cost = (1/n_obs) * (np.sum((y - y_pred)**2) + 
                            self._regularization(lambd, self.w, reg_type, l1_ratio))
        return cost
    
    def _predict(self, X, w, b):
        '''
        Make predictions based on model weights/coefficients
        '''
        return self.b + np.dot(X, self.w)
    

class Logistic_Regression(Base_Regression):
    
    def _approximation(self, w, b, X):
        Qt_x = self.b + np.dot(self.w, X.T)
        return self._sigmoid(Qt_x)
    
    def _sigmoid(self, z):
        return 1/(1 + np.exp(-z))
    
    def _cost(self, n_obs, y_pred, y, w, lambd, reg_type, l1_ratio):
        '''
        a really small value 'epsilon' is added to avoid 
        overflow and divison by zero error for log
        
        loss = (-1/m) * (sum(y * log(hQ_x) + (1-y) * log(1 - hQ_x)) - regularization cost)
        where hQ_x = 1/(1 + e^(-Q_t*x))
        '''
        eps = 1e-5
        cost = (-1/n_obs) * (np.sum(y * np.log(y_pred+eps) + (1-y) * np.log(1-y_pred+eps)) - 
                             self._regularization(lambd, self.w, reg_type, l1_ratio))
        return cost
    
    def _predict(self, X, w, b, threshold=0.5):
        y_predicted = self._approximation(self.w, self.b, X)
        y_predicted_cls = [1 if i>=threshold else 0 for i in y_predicted]
        return np.array(y_predicted_cls)

Plot Loss for different learning rates ($\alpha$)¶

Learning rate is extremely important for finding best model. We will define a function that iterates over various learning rates and build model using each learning rate. Next, it plots the loss generated by different models.

def loss_per_epoch_plot(epochs, model, learning_rate):
    """
    Pass in the no. of epochs, model class, list of learning rates 
    and plot the loss per epoch for models generated using different 
    learning rates
    """ 
    fig = plt.figure(figsize = (15, 8))
    losses = {} 
    
    e = epochs
    model = model
    for lr in learning_rate:
        try:
            b, w, epoch_loss = model.fit(x_train, y_train, 
                                         learning_rate=lr, epochs=e)
            losses[f'LR={str(lr)}'] = epoch_loss
        except:
            continue

    # Plot loss    
    xs = np.arange(len(epoch_loss))

    for k, v in losses.items():
        plt.plot(xs, v, lw=3, label=f"{k}, Final loss = {v[-1]:.2f}")

    plt.title('Loss per Epoch (MSE) for different learning rates', size=20)
    plt.xlabel('Epochs', size=14)
    plt.ylabel('Loss', size=14)
    plt.legend(fontsize=12)
    return fig

Plot model weights for different penalty values ($\lambda$)¶

We will define a function to help us visualize the model's estimated weight for different lambda values.

# Define a function to plot weights and lambda values
def weight_versus_lambda_plot(weight, lambd, features):
    """
    Pass in the estimated weight, the lambda value and the names
    for the features and plot the model's estimated weight 
    for different lambda values
    """
    fig = plt.figure(figsize = (10, 8))
    
    # ensure that the weight is an array
    weight = np.array(weight)
    for col in range(weight.shape[1]):
        plt.plot(lambd, weight[:, col], label = features[col])

    plt.axhline(0, color = 'black', linestyle = '--', linewidth = 3)
    
    # manually specify the coordinate of the legend
    plt.legend(bbox_to_anchor = (1.3, 0.9))
    plt.ylabel('Feature weight')
    plt.xlabel('Lambda')
    return fig

Regression¶

For regression, we will use the Boston Housing Prices toy dataset available in the scikit-learn library. We will:

Load the data
Split and Standardize the data
Build and fit a basic model using batch gradient descent
a. Evaluate model performance on test set
Build a regularized model using l2 penalty: Ridge Regression
a. Plot variation of weights for different lambda values b. Fit model and evaluate performance
Build a regularized model using l1 penalty: Lasso Regression
a. Plot variation of weights for different lambda values b. Fit model and evaluate performance
Build a regularized model using l1 and l2 penalty: Elastic Net Regression
a. Plot variation of weights for different lambda values b. Fit model and evaluate performance

Get Data¶

Load Data¶

# Basic Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

# Read the data
from sklearn.datasets import load_boston
boston_data=pd.DataFrame(load_boston().data,columns=load_boston().feature_names)
boston_data.head()

# Create dependent and independent variables
data = load_boston()
X = data.data
Y = data.target
features = data.feature_names

# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=123)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')

'Shape of training data (379, 13)'

'Shape of test data (127, 13)'

# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Linear Model¶

We will train a regression model without regularization for different learning rates. Next, we will build a model using a selected value of learning rate and evaluate its performance on test set.

Plot Loss for different learning rates¶

Learning rate is extremely important for finding best model. Here, we iterate over various learning rates and build model using each learning rate. Next, we plot the loss generated by different models.

# Build and fit LR model for different learning rates

epochs = 10000
model = Linear_Regression()
learning_rate = [0.1, 0.01, 0.001, 0.0001, 0.00001]

mlr_fig = loss_per_epoch_plot(epochs, model, learning_rate)

From the plot above, we can initially see the loss reducing significantly and then stabalize, as epochs increase, for different learning rates $\alpha$. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.001 to build and fit a model and then evaluate its performance.

Fit Model and Evaluate Performance¶

We will bild the model using a learning rate and evaluate the performance of our model on test data.

# Build and fit best LR model
alpha = 0.001
e = 10000
batch = 32

# Build model
mlr_model = Linear_Regression()


# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, verbose=True)

Epoch: 0, Loss: 568.5920328719429
Epoch: 1000, Loss: 90.3083203919063
Epoch: 2000, Loss: 32.10197340638541
Epoch: 3000, Loss: 23.821701722205226
Epoch: 4000, Loss: 22.43743142232134
Epoch: 5000, Loss: 22.072809206405893
Epoch: 6000, Loss: 21.899094236283833
Epoch: 7000, Loss: 21.785539160309746
Epoch: 8000, Loss: 21.70337462375776
Epoch: 9000, Loss: 21.641697656784903

# Create predictions on test set
mlr_test_pred = mlr_model.predict(x_test)

# Calculate metrics
mlr_test_r2 = np.abs(r2_score(y_test, mlr_test_pred))
mlr_test_mse = mean_squared_error(y_test, mlr_test_pred)
mlr_test_rmse = np.sqrt(mlr_test_mse)

print("Performance on test set:")
print(f'Test R2 is: {mlr_test_r2}')
print(f'Test MSE is: {mlr_test_mse}')
print(f'Test RMSE is: {mlr_test_rmse}')

Performance on test set:
Test R2 is: 0.6664908666139335
Test MSE is: 26.327957451131354
Test RMSE is: 5.131077611099968

Ridge Model¶

We will train a regression model with $l2$ penalty (Ridge) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on test set.

Plot Weights for different Lambda values¶

# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [10, 100, 1000]
e = 10000

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=0.001, 
                                     epochs=e, verbose=False, lambd=l, 
                                     reg_type='ridge')
    weights.append(w)

# Plot weights and lambda

ridge_fig = weight_versus_lambda_plot(weights, lambd, features)
ridge_fig.suptitle('Ridge: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that feature weights continue to decrease and come close to zero as $\lambda$ increases. Weights remain close to zero but do not become zero.

Build and Fit Model¶

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

# Build and fit model
alpha = 0.001
e = 10000
lambd = 1000

# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha, 
                                 epochs=e, verbose=True, lambd=lambd, 
                                 reg_type='ridge')

Epoch: 0, Loss: 622.8415551485903
Epoch: 1000, Loss: 121.2586514272164
Epoch: 2000, Loss: 63.209795385576655
Epoch: 3000, Loss: 55.36534799076975
Epoch: 4000, Loss: 54.30478750806339
Epoch: 5000, Loss: 54.161399830930264
Epoch: 6000, Loss: 54.14201382773705
Epoch: 7000, Loss: 54.139392841551754
Epoch: 8000, Loss: 54.13903848441503
Epoch: 9000, Loss: 54.13899057535658

# Create predictions on test set
mlr_ridge_test_pred = mlr_model.predict(x_test)

# Calculate metrics
mlr_ridge_test_r2 = np.abs(r2_score(y_test, mlr_ridge_test_pred))
mlr_ridge_test_mse = mean_squared_error(y_test, mlr_ridge_test_pred)
mlr_ridge_test_rmse = np.sqrt(mlr_ridge_test_mse)

print("Performance on test set:")
print(f'Test R2 is: {mlr_ridge_test_r2}')
print(f'Test MSE is: {mlr_ridge_test_mse}')
print(f'Test RMSE is: {mlr_ridge_test_rmse}')

Performance on test set:
Test R2 is: 0.5018065147532993
Test MSE is: 39.32850878426335
Test RMSE is: 6.271244596111951

Lasso Model¶

We will train a regression model with $l1$ penalty (Lasso) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on test set.

Plot Weights for different Lambda values¶

# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [1,10,50,100,150,200,250]

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=0.01, 
                                     epochs=e, verbose=False, lambd=l, 
                                     reg_type='lasso')
    weights.append(w)

# Plot weights and lambda

lasso_fig = weight_versus_lambda_plot(weights, lambd, features)
lasso_fig.suptitle('Lasso: Variation in feature weights as lambda grows', 
                   fontsize=16, y=0.93);

From the plot, we can see that feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero.

Build and Fit Model¶

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

# Build and fit model
alpha = 0.001
e = 5000
lambd = 250

# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, reg_type='lasso')

Epoch: 0, Loss: 612.7721592862399
Epoch: 1000, Loss: 102.05510468089096
Epoch: 2000, Loss: 41.37289240919998
Epoch: 3000, Loss: 32.883892146603614
Epoch: 4000, Loss: 31.70026715530038

# Create predictions on test set
mlr_lasso_test_pred = mlr_model.predict(x_test)

# Calculate metrics
mlr_lasso_test_r2 = np.abs(r2_score(y_test, mlr_lasso_test_pred))
mlr_lasso_test_mse = mean_squared_error(y_test, mlr_lasso_test_pred)
mlr_lasso_test_rmse = np.sqrt(mlr_lasso_test_mse)

print("Performance on test set:")
print(f'Test R2 is: {mlr_lasso_test_r2}')
print(f'Test MSE is: {mlr_lasso_test_mse}')
print(f'Test RMSE is: {mlr_lasso_test_rmse}')

Performance on test set:
Test R2 is: 0.5878305143618532
Test MSE is: 32.5375817158631
Test RMSE is: 5.704172307694001

Elastic Net Model¶

We will train a regression model with $l1\ and\ l2$ penalty (Elastic Net) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of $\lambda$ and evaluate its performance on test set.

Plot Weights for different Lambda values¶

# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [1,10,100]

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=0.01, 
                                    epochs=e, verbose=False, lambd=l, 
                                    reg_type='elastic net', l1_ratio=0.6)
    weights.append(w)

# Plot weights and lambda

elastic_fig = weight_versus_lambda_plot(weights, lambd, features)
elastic_fig.suptitle('Elastic Net: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero.

Build and Fit Model¶

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

# Build and fit model
alpha = 0.001
e = 10000
lambd = 100

# Fit Model
b, w, epoch_loss = mlr_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, 
                                reg_type='elastic net', l1_ratio=0.6)

Epoch: 0, Loss: 562.3691236063022
Epoch: 1000, Loss: 94.7863715039122
Epoch: 2000, Loss: 37.51882786055542
Epoch: 3000, Loss: 29.653415539049384
Epoch: 4000, Loss: 28.49646337953559
Epoch: 5000, Loss: 28.286807139443578
Epoch: 6000, Loss: 28.226957040648003
Epoch: 7000, Loss: 28.199134418791104
Epoch: 8000, Loss: 28.18246156038106
Epoch: 9000, Loss: 28.171421424880272

# Create predictions on test set
mlr_elastic_test_pred = mlr_model.predict(x_test)

# Calculate metrics
mlr_elastic_test_r2 = np.abs(r2_score(y_test, mlr_elastic_test_pred))
mlr_elastic_test_mse = mean_squared_error(y_test, mlr_elastic_test_pred)
mlr_elastic_test_rmse = np.sqrt(mlr_elastic_test_mse)

print("Performance on test set:")
print(f'Test R2 is: {mlr_elastic_test_r2}')
print(f'Test MSE is: {mlr_elastic_test_mse}')
print(f'Test RMSE is: {mlr_elastic_test_rmse}')

Performance on test set:
Test R2 is: 0.6437118460102644
Test MSE is: 28.126184321690875
Test RMSE is: 5.303412516643494

Classification¶

Get Data¶

For classification, we will use the Breast Cancer toy dataset available in the scikit-learn library. We will:

Load the data
Split and Standardize the data
Build and fit a basic model using batch gradient descent
a. Evaluate model performance on test set
Build a regularized model using l2 penalty: Ridge Regression
a. Plot variation of weights for different lambda values b. Fit model and evaluate performance
Build a regularized model using l1 penalty: Lasso Regression
a. Plot variation of weights for different lambda values b. Fit model and evaluate performance
Build a regularized model using l1 and l2 penalty: Elastic Net Regression
a. Plot variation of weights for different lambda values b. Fit model and evaluate performance

Load Data¶

# Basic Imports
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, classification_report

# Read the data
cancer_data=pd.DataFrame(load_breast_cancer().data,columns=load_breast_cancer().feature_names)
cancer_data.head()

# Create dependent and independent variables
Y=load_breast_cancer().target
display(f'Shape of Y: {Y.shape}')

X=load_breast_cancer().data[:,:10]
display(f'Shape of X: {X.shape}')

features = load_breast_cancer().feature_names

'Shape of Y: (569,)'

'Shape of X: (569, 10)'

# Train test split
from sklearn.model_selection import train_test_split

# Split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=1234)

display(f'Shape of training data {x_train.shape}')
display(f'Shape of test data {x_test.shape}')

'Shape of training data (426, 10)'

'Shape of test data (143, 10)'

# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Logistic Model¶

We will train a logistic regression model without regularization for different learning rates. Next, we will build a model using a selected value of learning rate and evaluate its performance on test set.

Plot Loss for different learning rates¶

Learning rate is extremely important for finding best model. Here, we iterate over various learning rates and build model using each learning rate. Next, we plot the loss generated by different models.

# Build and fit LR model for different learning rates

epochs = 5000
model = Logistic_Regression()
learning_rate = [0.1, 0.01, 0.001]

log_fig = loss_per_epoch_plot(epochs, model, learning_rate)

From the plot above, we can initially see the loss reducing significantly and then stabalize, as epochs increase, for different learning rates $\alpha$. If the learning rate is too high, the model may not converge. On the other hand, if its too low, the model may take longer to converge. Let's use the learning rate of 0.01 to build and fit a model and then evaluate its performance.

Fit Model and Evaluate Performance¶

We will bild the model using a learning rate and evaluate the performance of our model on test data.

# Build and fit best LR model
alpha = 0.01
e = 5000

# Build model
log_model = Logistic_Regression()


# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                                   epochs=e, verbose=True)

Epoch: 0, Loss: 2.801124198008247
Epoch: 1000, Loss: 0.16386755120467672
Epoch: 2000, Loss: 0.13467828554073283
Epoch: 3000, Loss: 0.12344957042805385
Epoch: 4000, Loss: 0.1174704941905151

# Create predictions on test set
log_test_pred = log_model.predict(x_test, 0.5)

# Define Accuracy
def accuracy(y_true, y_pred):
    acc = np.sum(y_true == y_pred) / len(y_true)
    return acc

# Calculate accuracy
log_test_acc = accuracy(y_test, log_test_pred)
print(f'Accuracy on test set: {log_test_acc}')

Accuracy on test set: 0.9230769230769231

# Calculate metrics
cm_test = confusion_matrix(y_test, log_test_pred)
test_report = classification_report(y_test, log_test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {cm_test}\n')
print(f'Classification Report:\n {test_report}')

Performance on test set:

Confusion Matrix:
 [[48  7]
 [ 4 84]]

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.87      0.90        55
           1       0.92      0.95      0.94        88

    accuracy                           0.92       143
   macro avg       0.92      0.91      0.92       143
weighted avg       0.92      0.92      0.92       143

Ridge Model¶

We will train a regression model with $l2$ penalty (Ridge) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of 𝜆 and evaluate its performance on test set.

Plot Weights for different Lambda values¶

# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
log_model = Logistic_Regression()
lambd = [10,50,100,150]
e = 5000

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=0.01, 
                                    epochs=e, verbose=False, lambd=l, 
                                    reg_type='ridge')
    weights.append(w)

# Plot weights and lambda

ridge_fig = weight_versus_lambda_plot(weights, lambd, features)
ridge_fig.suptitle('Ridge: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that feature weights continue to decrease and come close to zero as $\lambda$ increases. Weights remain close to zero but do not become zero.

Build and Fit Model¶

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

# Build and fit model
alpha = 0.01
e = 5000
lambd = 150

# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, 
                                reg_type='ridge')

Epoch: 0, Loss: 4.810672897888645
Epoch: 1000, Loss: 0.4630314146284181
Epoch: 2000, Loss: 0.46222240367092043
Epoch: 3000, Loss: 0.4621400387231419
Epoch: 4000, Loss: 0.4621245765327778

# Create predictions on test set
log_ridge_test_pred = log_model.predict(x_test, 0.5)

# Calculate accuracy
ridge_test_acc = accuracy(y_test, log_ridge_test_pred)
print(f'Accuracy on test set: {ridge_test_acc}')

Accuracy on test set: 0.8881118881118881

# Calculate metrics
ridge_cm_test = confusion_matrix(y_test, log_ridge_test_pred)
ridge_test_report = classification_report(y_test, log_ridge_test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {ridge_cm_test}\n')
print(f'Classification Report:\n {ridge_test_report}')

Performance on test set:

Confusion Matrix:
 [[41 14]
 [ 2 86]]

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.75      0.84        55
           1       0.86      0.98      0.91        88

    accuracy                           0.89       143
   macro avg       0.91      0.86      0.88       143
weighted avg       0.90      0.89      0.88       143

Lasso Model¶

We will train a regression model with $l1$ penalty (Lasso) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of 𝜆 and evaluate its performance on test set.

Plot Weights for different Lambda values¶

# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [0.1, 1, 10]
e = 5000

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=0.01, 
                                    epochs=e, verbose=False, lambd=l, 
                                    reg_type='lasso')
    weights.append(w)

# Plot weights and lambda

lasso_fig = weight_versus_lambda_plot(weights, lambd, features)
lasso_fig.suptitle('Lasso: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that most feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero. There are a few features for which weights tend to go up as $\lambda\ >\ 1$.

Build and Fit Model¶

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

# Build and fit model
alpha = 0.01
e = 5000
lambd = 10

# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, 
                                reg_type='lasso')

Epoch: 0, Loss: 2.3778737795894664
Epoch: 1000, Loss: 0.26718522501660646
Epoch: 2000, Loss: 0.2584320409432221
Epoch: 3000, Loss: 0.2565173888446097
Epoch: 4000, Loss: 0.2559426202287423

# Create predictions on test set
log_lasso_test_pred = log_model.predict(x_test, 0.5)

# Calculate accuracy
lasso_test_acc = accuracy(y_test, log_lasso_test_pred)
print(f'Accuracy on test set: {lasso_test_acc}')

Accuracy on test set: 0.916083916083916

# Calculate metrics
lasso_cm_test = confusion_matrix(y_test, log_lasso_test_pred)
lasso_test_report = classification_report(y_test, log_lasso_test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {lasso_cm_test}\n')
print(f'Classification Report:\n {lasso_test_report}')

Performance on test set:

Confusion Matrix:
 [[46  9]
 [ 3 85]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.84      0.88        55
           1       0.90      0.97      0.93        88

    accuracy                           0.92       143
   macro avg       0.92      0.90      0.91       143
weighted avg       0.92      0.92      0.92       143

Elastic Net Model¶

We will train a regression model with $l1\ and\ l2$ penalty (Elastic Net) for different values of $\lambda$ and visualize variation of weights w.r.t $\lambda$. Next, we will build a model using a selected value of 𝜆 and evaluate its performance on test set.

Plot Weights for different Lambda values¶

# loop through different penalty score (lambda) and obtain the estimated coefficient (weights)
lambd = [1, 10, 50, 100]
e = 5000

# stores the weights of each feature
weights = []
for l in lambd:    
    b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=0.01, 
                                    epochs=e, verbose=False, lambd=l, 
                                    reg_type='elastic net', l1_ratio=0.6)
    weights.append(w)

# Plot weights and lambda

elastic_fig = weight_versus_lambda_plot(weights, lambd, features)
elastic_fig.suptitle('Elastic Net: Variation in feature weights as lambda grows', fontsize=16, y=0.93);

From the plot, we can see that most feature weights continue to decrease as $\lambda$ increases. Some weights touch the zero line while others remain close to zero. There are a few features for which weights tend to go up as $\lambda\ >\ 1$.

Build and Fit Model¶

Here we will build a model using a selected value of lambda and evaluate its performance on test set.

# Build and fit model
alpha = 0.01
e = 5000
lambd = 100

# Fit Model
b, w, epoch_loss = log_model.fit(x_train, y_train, learning_rate=alpha, 
                                epochs=e, verbose=True, lambd=lambd, 
                                reg_type='elastic net', l1_ratio=0.6)

Epoch: 0, Loss: 4.398304647049499
Epoch: 1000, Loss: 0.5581468375691527
Epoch: 2000, Loss: 0.5581587247090656
Epoch: 3000, Loss: 0.5583725172916181
Epoch: 4000, Loss: 0.5584206447040022

# Create predictions on test set
log_elastic_test_pred = log_model.predict(x_test, 0.5)

# Calculate accuracy
elastic_test_acc = accuracy(y_test, log_elastic_test_pred)
print(f'Accuracy on test set: {elastic_test_acc}')

Accuracy on test set: 0.8321678321678322

# Calculate metrics
elastic_cm_test = confusion_matrix(y_test, log_elastic_test_pred)
elastic_test_report = classification_report(y_test, log_elastic_test_pred)

print("Performance on test set:\n")
print(f'Confusion Matrix:\n {elastic_cm_test}\n')
print(f'Classification Report:\n {elastic_test_report}')

Performance on test set:

Confusion Matrix:
 [[33 22]
 [ 2 86]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.60      0.73        55
           1       0.80      0.98      0.88        88

    accuracy                           0.83       143
   macro avg       0.87      0.79      0.81       143
weighted avg       0.85      0.83      0.82       143

Model Results¶

REGRESSION:

Model	R2	MSE	RMSE
Linear	0.67	26.38	5.14
Ridge	0.51	39.33	6.27
Lasso	0.59	32.54	5.7
Elastic Net	0.64	28.13	5.3

CLASSIFICATION:

Model	Accuracy	Precision		Recall		F1 Score
		0	1	0	1	0	1
Logistic	0.92	0.92	0.92	0.87	0.95	0.9	0.94
Ridge	0.89	0.95	0.86	0.8	0.99	0.84	0.91
Lasso	0.92	0.94	0.9	0.84	0.97	0.88	0.93
Elastic Net	0.83	0.94	0.84	0.6	0.98	0.73	0.88

Summary¶

To summarize, in this notebook we created a naive framework for building linear and logistic regression models with $l1$ (lasso), $l2$ (ridge), $l1\ and\ l2$ (elastic net) penalties. The framework was then used to build and fit various models with and without regularization using different datasets. Later, we evaluated the results of our models using test data. We saw how iterating over different learning rates helped identify a learning rate with reduced loss to generate better results. We also visualized the variation of feature weights for different 𝜆 (penalty) values.

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678