Regularization Techniques for Machine Learning Models

Jan 30, 2022

Introduction

When building supervised learning algorithms in ML, it is natural to ask which regularization technique to use to get a better fit.

During training, a model may perform accurately on the training data but fail to perform well on test data. This is known as overfitting: the error is low on the training dataset and high on the test dataset.
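To make this concrete, here is a minimal sketch that fits polynomials of increasing degree to a small noisy dataset and compares training and test error. The data (a noisy sine curve), the helper names `noisy_samples` and `mse`, and the chosen degrees are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def noisy_samples(n):
    """Draw n points from a sine curve with Gaussian noise (toy data)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of the polynomial `coeffs` on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = noisy_samples(15)
x_test, y_test = noisy_samples(200)

errors = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, errors[degree])
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The test error typically stops improving at some point and then gets worse, which is the overfitting pattern described above.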

The following graph helps visualize the statement:

Various methods can be adopted to avoid this phenomenon. In this blog, I will share what I know about regularization techniques.

Regularization

To continue, let us first understand what regularization is.

In regression analysis, each feature's contribution is estimated by a coefficient during modelling. If these estimates can be restricted, shrunk, or regularized towards zero, the impact of insignificant features is reduced, which lowers the model's variance and gives a more stable fit.

In simple words, regularization penalizes the coefficients (in deep learning, the weight matrices of the nodes) to improve the model's performance.
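The shrinking effect can be sketched with ridge regression, which has a closed-form solution. This is only an illustration under assumptions: the synthetic data, the helper name `ridge_coefficients`, and the values of `lam` are not from any particular library or dataset.

```python
import numpy as np


def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.5])  # two features are irrelevant
y = X @ true_w + rng.normal(scale=0.5, size=50)

norms = []
for lam in (0.0, 1.0, 10.0, 100.0):
    w = ridge_coefficients(X, y, lam)
    norms.append(float(np.linalg.norm(w)))
    print(lam, np.round(w, 3), round(norms[-1], 3))
```

As `lam` grows, the overall size of the coefficient vector shrinks towards zero, which is the "restricted, or shrunk" behaviour described above.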

Regularization techniques:

L2 and L1 regularization

These techniques penalize complex ML models by adding a regularization term to the model's loss/cost function. The difference between L1 and L2 lies in the regularization term itself.

The function below calculates the error/loss without a regularization term:
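As a sketch, assuming a linear model with a squared-error loss (an assumption; any other loss fits the same pattern), the unregularized loss over m training examples with n features can be written as:

```latex
\mathrm{Loss}(w) = \sum_{i=1}^{m} \Bigl( y_i - \sum_{j=1}^{n} x_{ij} w_j \Bigr)^2
```

Here $y_i$ is the target of example $i$, $x_{ij}$ is its $j$-th feature, and $w_j$ is the coefficient of that feature.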

Mathematical Formula for L1 and L2 regularization

The function that calculates the error with L1 regularization:
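In its usual form, L1 regularization adds the sum of the absolute values of the weights to the unregularized loss, written here as $\mathrm{Loss}(w)$. The symbol $\lambda \ge 0$ for the regularization strength is a notational assumption:

```latex
\mathrm{Loss}_{L1}(w) = \mathrm{Loss}(w) + \lambda \sum_{j=1}^{n} |w_j|
```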

The function that calculates the error with L2 regularization:
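In its usual form, L2 regularization adds the sum of the squared weights to the unregularized loss, written here as $\mathrm{Loss}(w)$. The symbol $\lambda \ge 0$ for the regularization strength is a notational assumption:

```latex
\mathrm{Loss}_{L2}(w) = \mathrm{Loss}(w) + \lambda \sum_{j=1}^{n} w_j^2
```

Some formulations scale the penalty by a constant such as $1/(2m)$, which only changes the effective value of $\lambda$.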

From the formulas of these two sub-techniques (L1 and L2), we learn that L1 regularization tends to drive the weights of less useful features to exactly zero, producing sparse models, so it is often adopted to reduce the number of features in high-dimensional datasets. L2 regularization spreads the penalty across all the weights: it shrinks them towards zero without eliminating them, which often leads to more accurate final models when most features carry some signal.
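One way to see this difference is to look at what each penalty does to a single weight when it is minimized exactly. The sketch below uses the closed-form solutions of the two one-dimensional problems; the function names `l1_shrink` and `l2_shrink` and the example weights are hypothetical.

```python
import numpy as np


def l1_shrink(w, threshold):
    """Soft-thresholding: minimizer of 0.5 * (v - w)**2 + threshold * |v|."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)


def l2_shrink(w, strength):
    """Minimizer of 0.5 * (v - w)**2 + 0.5 * strength * v**2."""
    return w / (1.0 + strength)


w = np.array([2.0, -1.5, 0.3, -0.2, 0.05])
w_l1 = l1_shrink(w, 0.5)
w_l2 = l2_shrink(w, 0.5)
print(w_l1)  # small weights are set to zero
print(w_l2)  # every weight is scaled down, none becomes zero
```

The L1 step removes the three small weights entirely, while the L2 step keeps all five and only reduces their magnitude.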

L2 code implementation:

In this example we implement one gradient descent update with L2 regularization.

Let's consider this file, l2_reg_gradient_descent.py:

#!/usr/bin/env python3
"""Gradient Descent with L2 Regularization"""
import numpy as np


def tanh(Z):
    """
    Method:
        Computes the hyperbolic tangent of Z element-wise.
    Parameters:
        @Z (array): output of affine transformation.
    Returns:
        A (array): post activation output.
    """
    A = np.tanh(Z)
    return A


def l2_reg_gradient_descent(Y, weights, cache, alpha, lambtha, L):
    """
    Method:
        Updates the weights and biases of a neural network in place,
        using gradient descent with L2 regularization.
    Parameters:
        @Y: a one-hot matrix that contains the correct labels for the data.
        @weights: a dictionary of the weights and biases.
        @cache: a dictionary of the outputs of each layer of the neural network
        @alpha: the learning rate
        @lambtha: the L2 regularization parameter
        @L: the number of layers of the network
    """
    # Number of training examples
    m = Y.shape[1]
    # Initialization for the backpropagation algorithm
    # (softmax output layer with cross-entropy loss)
    dZ = cache['A' + str(L)] - Y
    for i in range(L, 0, -1):
        # dW is the derivative of the cost function w.r.t. W of the current layer
        dW = (np.matmul(dZ, cache['A' + str(i - 1)].T)) / m
        # db is the derivative of the cost function w.r.t. b of the current layer
        db = (np.sum(dZ, axis=1, keepdims=True)) / m
        # dA is the derivative of the tanh activation of the previous layer
        dA = 1 - np.square(cache['A' + str(i - 1)])
        # dZ is the derivative of the cost function w.r.t. Z of the previous layer
        # (computed with the current layer's weights before they are updated)
        dZ = np.multiply(np.matmul(weights['W' + str(i)].T, dZ), dA)
        # Updating the weight matrix and the bias vector of the current layer;
        # reg_term applies the L2 penalty as a shrink factor on W
        reg_term = 1 - ((alpha * lambtha) / m)
        weights['W' + str(i)] = reg_term * weights['W' + str(i)] - alpha * dW
        weights['b' + str(i)] = weights['b' + str(i)] - alpha * db
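The `reg_term` factor in the update above may look unrelated to the L2 penalty, so here is a small check, under the assumption that the penalty is (lambtha / (2 * m)) times the sum of squared weights. Shrinking W by `reg_term` is then the same as adding the penalty's gradient to dW. The random arrays are stand-ins, not real gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
dW = rng.normal(size=(4, 3))  # stand-in for the data-loss gradient
alpha, lambtha, m = 0.1, 0.1, 32

# Form used in the function above: shrink W, then take the usual step.
decay_form = (1 - (alpha * lambtha) / m) * W - alpha * dW

# Explicit form: add the gradient of (lambtha / (2 * m)) * sum(W ** 2) to dW.
explicit_form = W - alpha * (dW + (lambtha / m) * W)

print(np.allclose(decay_form, explicit_form))
```

This is why L2 regularization is often called weight decay: each step multiplies the weights by a factor slightly below 1.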

main.py file

#!/usr/bin/env python3
import numpy as np
# Import matches the file name l2_reg_gradient_descent.py used above
from l2_reg_gradient_descent import l2_reg_gradient_descent


def one_hot(Y, classes):
    """convert an array to a one-hot matrix"""
    m = Y.shape[0]
    one_hot = np.zeros((classes, m))
    one_hot[Y, np.arange(m)] = 1
    return one_hot


if __name__ == '__main__':
    lib = np.load('../data/MNIST.npz')
    X_train_3D = lib['X_train']
    print(X_train_3D.shape)
    Y_train = lib['Y_train']
    # Flatten each image and put one example per column
    X_train = X_train_3D.reshape((X_train_3D.shape[0], -1)).T
    Y_train_oh = one_hot(Y_train, 10)

    np.random.seed(0)

    weights = {}
    weights['W1'] = np.random.randn(256, 784)
    print(weights['W1'].shape)
    weights['b1'] = np.zeros((256, 1))
    weights['W2'] = np.random.randn(128, 256)
    weights['b2'] = np.zeros((128, 1))
    weights['W3'] = np.random.randn(10, 128)
    weights['b3'] = np.zeros((10, 1))

    # Forward pass: two tanh layers followed by a softmax output layer
    cache = {}
    cache['A0'] = X_train
    cache['A1'] = np.tanh(np.matmul(weights['W1'], cache['A0']) + weights['b1'])
    cache['A2'] = np.tanh(np.matmul(weights['W2'], cache['A1']) + weights['b2'])
    Z3 = np.matmul(weights['W3'], cache['A2']) + weights['b3']
    cache['A3'] = np.exp(Z3) / np.sum(np.exp(Z3), axis=0)

    # Print W1 before and after one regularized update
    print(weights['W1'])
    l2_reg_gradient_descent(Y_train_oh, weights, cache, 0.1, 0.1, 3)
    print(weights['W1'])

Dropout

Dropout is another regularization technique for reducing overfitting in artificial neural networks by preventing complex co-adaptations on training data. It is an efficient way of performing model averaging with neural networks (link).

In other words, it relies on stochastically "dropping out" neurons during training.
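Here is a minimal sketch of dropping out neurons for one layer's activations. It uses the "inverted dropout" variant, which rescales the surviving activations so their expected value is unchanged; that variant, the function name `dropout_forward`, and the parameter `keep_prob` are assumptions for illustration.

```python
import numpy as np


def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to the activations A of one layer.

    Each unit is kept with probability keep_prob; kept units are divided
    by keep_prob so the expected activation stays the same.
    """
    mask = rng.random(A.shape) < keep_prob
    return (A * mask) / keep_prob, mask


rng = np.random.default_rng(0)
A = np.ones((4, 5))
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
print(mask.astype(int))
print(A_dropped)
```

The mask is only sampled during training. At test time all units are used, and with the inverted variant no extra scaling is needed.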

Let's look at the dropout technique from a mathematical perspective: a random process is added to each unit of the network during training.

Network calculation formula without Dropout:
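In notation commonly used for dropout (an assumption about the symbols), with $y^{(l)}$ the output vector of layer $l$, $w_i^{(l+1)}$ and $b_i^{(l+1)}$ the weights and bias of unit $i$ in layer $l+1$, and $f$ the activation function, a standard feed-forward step is:

```latex
\begin{aligned}
z_i^{(l+1)} &= w_i^{(l+1)} \, y^{(l)} + b_i^{(l+1)} \\
y_i^{(l+1)} &= f\bigl(z_i^{(l+1)}\bigr)
\end{aligned}
```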

Network calculation formula with Dropout:
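In notation commonly used for dropout (an assumption about the symbols), each unit $j$ of layer $l$ gets a random variable $r_j^{(l)}$ that equals 1 with probability $p$ and 0 otherwise. The layer output $y^{(l)}$ is multiplied element-wise by this mask before it is passed on:

```latex
\begin{aligned}
r_j^{(l)} &\sim \mathrm{Bernoulli}(p) \\
\tilde{y}^{(l)} &= r^{(l)} \odot y^{(l)} \\
z_i^{(l+1)} &= w_i^{(l+1)} \, \tilde{y}^{(l)} + b_i^{(l+1)} \\
y_i^{(l+1)} &= f\bigl(z_i^{(l+1)}\bigr)
\end{aligned}
```

Here $w_i^{(l+1)}$ and $b_i^{(l+1)}$ are the weights and bias of unit $i$ in layer $l+1$, and $f$ is the activation function.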

Data Augmentation

This technique is different from the previous ones: it does not add terms to the loss or apply any probability to the layers. Instead, it adds more data to the training set, typically modified copies of existing examples, to improve the performance of the model.
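As a minimal sketch, here is one simple augmentation for image data: adding a randomly shifted copy of every image. The function name `augment_with_shifts`, the shift range, and the use of `np.roll` (which wraps pixels around the border, so it assumes images with an empty margin) are assumptions for illustration.

```python
import numpy as np


def augment_with_shifts(X, Y, max_shift, rng):
    """Return the original images plus one randomly shifted copy of each.

    X: array of shape (m, height, width); Y: array of m labels.
    """
    shifted = np.empty_like(X)
    for i, image in enumerate(X):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        # Shift rows by dy and columns by dx (pixels wrap around the edges)
        shifted[i] = np.roll(image, shift=(dy, dx), axis=(0, 1))
    return np.concatenate([X, shifted]), np.concatenate([Y, Y])


rng = np.random.default_rng(0)
X = rng.random((6, 28, 28))  # stand-in for six 28x28 images
Y = np.arange(6)
X_aug, Y_aug = augment_with_shifts(X, Y, max_shift=2, rng=rng)
print(X_aug.shape, Y_aug.shape)
```

The labels are simply repeated, because a small shift does not change what the image shows. Which transformations preserve the label depends on the task.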

Early Stopping

A major challenge in training neural networks is deciding how long to train them.

Too little training means that the model will underfit both the training and the test sets. Too much training means that the model will overfit the training dataset and perform poorly on the test set.

The solution is to stop gradient descent early: monitor the error on a held-out validation set and stop training when it no longer improves.
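A common way to decide when to stop is a "patience" rule: stop once the validation loss has not improved for a fixed number of epochs, and keep the weights from the best epoch. The sketch below applies that rule to a list of validation losses; the function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch
    return len(val_losses) - 1, best_epoch


val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In a real training loop the same logic would run after each epoch, saving a copy of the weights whenever the validation loss improves.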

That brings us to the end of this blog. If you want to go deeper into any of these techniques, I really recommend checking the resources below.

For a more code-oriented perspective, check my GitHub directory on regularization (GitHub link).

Analysis of Dropout Principle - Programmer All (www.programmerall.com)

Dilution (neural networks) - Wikipedia (en.wikipedia.org)

Figure 4. Dropout Strategy. (a) A standard neural network. (b) Applying... (www.researchgate.net)