Gradient Descent Optimization Techniques for Machine Learning Algorithms

Ahlemkaabi

In machine learning we aim to build models that give accurate predictions. To achieve this goal, optimizing the learning process is key: we adjust the model's parameters, and the hyper-parameters that control learning, to minimize the cost function using one of the optimization techniques below.

In this post, I will share with you some knowledge about optimization techniques.

Hyper-parameter (machine learning)

In machine learning, it is always about the model: building it, training it, and testing it.

Controlling the learning process of a machine learning model requires hyper-parameters; they govern both the speed and the quality of learning. Some examples of hyper-parameters: the topology and size of a neural network, the learning rate, and the mini-batch size (a term we will define below).

The time required to train and test a model can depend upon the choice of its hyperparameters.

Feature Scaling

Feature scaling is a method used to normalize the range of independent variables or features of data. It is also known as data normalization in the data processing step.

To scale is to normalize the range of all features so that each feature contributes approximately proportionately to the final result.

In other words, we need to bring all features to a similar order of magnitude.

Feature Scaling methods:

1- Re-scaling/min-max normalization:
This distribution will have values between 0 and 1.

x_p = (x - min(x)) / (max(x) - min(x))

2- Mean normalization:
This distribution will have values between -1 and 1, with μ = 0.

x_p = (x - avg(x)) / (max(x) - min(x))

3- Standardization/Z-score normalization:
Replaces the values by their Z-scores. This redistributes the features so that their mean is μ = 0 and their standard deviation is σ = 1.

x_p = (x - μ) / σ

4- Unit length/unit vector normalization:
Scaling is done considering the whole feature vector to be of unit length.

x_p = x / euclidean_norm(x) OR x_p = x / L1_norm(x)
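These formulas translate directly into NumPy. A minimal sketch, assuming x is a single feature column with illustrative values:

import numpy as np

x = np.array([1.0, 5.0, 10.0, 15.0, 20.0])  # example feature column

# 1- Re-scaling / min-max normalization: values in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# 2- Mean normalization: values in [-1, 1], mean 0
x_mean = (x - x.mean()) / (x.max() - x.min())

# 3- Standardization / Z-score normalization: mean 0, standard deviation 1
x_z = (x - x.mean()) / x.std()

# 4- Unit length normalization: divide by the L2 (or L1) norm
x_unit = x / np.linalg.norm(x)  # Euclidean norm
x_l1 = x / np.abs(x).sum()      # L1-norm variant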

Pros of Feature Scaling:

1- Some machine learning models fail to converge or perform poorly if the data is not normalized.
2- It speeds up gradient descent.
3- It is important to apply feature scaling if regularization is part of the loss function (so that coefficients are penalized appropriately).

Now we may ask: when should we scale? And does every machine learning algorithm require feature scaling?
The answer is as follows: if your algorithm computes distances or assumes normality, scale your features! Examples:

1- k-nearest neighbors
2- Principal Component Analysis (PCA)

Algorithms for which scaling is NOT important:

1- Tree based models
2- Linear Discriminant Analysis(LDA)
3- Naive Bayes

Batch normalization

"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift": the title of the 2015 paper by Ioffe and Szegedy that introduced this technique (available on arXiv) explains it well. But let us first understand internal covariate shift: it occurs when there is a change in the input distribution to our network.
When the input distribution changes, hidden layers have to learn to adapt to the new distribution. This slows down the training process, and a slower process takes longer to converge to a global minimum.

Batch Normalization standardizes (Z-score normalizes) the pre-activated output of a hidden layer.

# Python code implementation
import numpy as np

# Z: the pre-activated output of a hidden layer
mean = np.mean(Z, axis=0)
variance = np.var(Z - mean, axis=0)
z_norm = (Z - mean) / np.sqrt(variance + epsilon)
z_bn = gamma * z_norm + beta
# We add a very small number epsilon to prevent a divide-by-zero error

Gamma and beta are parameters that are learned by the network.

Pros of Batch Norm:

1- Works well with other optimization methods (SGD w/ momentum, RMSprop, Adam)
2- Can build deeper networks with shorter training time
3- Is stable if the batch size is large
4- Allows for higher learning rates
5- Can add some regularizing effects

Cons of Batch Norm:

1- Not good for online learning (small batches)
2- Not good for recurrent neural networks (RNNs) / long short-term memory (LSTM) units
3- Has to use a different calculation between training and testing (see the sketch below)
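To make con 3 concrete, here is a minimal sketch of the usual fix, assuming exponential running averages with an illustrative momentum of 0.9 (the batch_norm function and its state dict are my own illustration, not from the article):

import numpy as np

def batch_norm(Z, gamma, beta, state, training, momentum=0.9, epsilon=1e-8):
    # state is a dict holding the running mean/variance between calls,
    # e.g. state = {"mean": np.zeros(units), "var": np.ones(units)}
    if training:
        mean, var = Z.mean(axis=0), Z.var(axis=0)
        # update exponential running averages for later use at test time
        state["mean"] = momentum * state["mean"] + (1 - momentum) * mean
        state["var"] = momentum * state["var"] + (1 - momentum) * var
    else:
        # at test time, the stored statistics replace the batch statistics
        mean, var = state["mean"], state["var"]
    return gamma * (Z - mean) / np.sqrt(var + epsilon) + beta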

Gradient Descent

Gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms.

When the model makes predictions on the training data set, gradient descent updates the model parameters using the error of the previous predictions, in order to minimize that error.

How these updates affect the training process depends especially on the method used (Stochastic Gradient Descent, Batch Gradient Descent, Mini-batch Gradient Descent):

Stochastic Gradient Descent: makes an update after each training example.
Batch Gradient Descent: makes an update after each epoch (a full pass over the training set).
Mini-batch Gradient Descent: makes an update after each mini-batch.

*In this part I will discuss Mini-batch gradient descent.*

Mini-batch gradient descent

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training data set into small batches that are used to calculate model error and update model coefficients. In other words, perform back-propagation on each mini-batch.
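As a minimal sketch of this procedure (the mini_batch_gd helper and its grad_fn, alpha, batch_size parameters are illustrative assumptions, not a library API):

import numpy as np

def mini_batch_gd(X, Y, W, grad_fn, alpha=0.01, batch_size=32, epochs=10):
    # shuffle the data each epoch, split it into mini-batches,
    # and update W after every batch
    m = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(m)
        X_shuf, Y_shuf = X[perm], Y[perm]
        for start in range(0, m, batch_size):
            X_b = X_shuf[start:start + batch_size]
            Y_b = Y_shuf[start:start + batch_size]
            W = W - alpha * grad_fn(X_b, Y_b, W)  # back-propagation step
    return W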

Pros of Mini-batch Gradient Descent

1- Speeds up gradient descent

2- Has a more stable convergence

Cons of Mini-batch Gradient Descent

1- Difficult to choose an appropriate batch size

2- Hard to deal with saddle points

3- Hard to deal with ravines and local optima

Before we move further, let us first understand some terminology: saddle points, ravines, and local optima.

Local and Global optima:

  • Local optimization involves finding the optimal solution within a specific region of the search space (or the global optimum for problems with no local optima).
  • Global optimization involves finding the optimal solution for problems that contain local optima.
  • A local optimization algorithm should be used when you know you are in the region of the global optimum, or when your objective function contains a single optimum.

Ravines:

  • Ravines are common near local minima in deep learning, and gradient descent has trouble navigating them.

Saddle points

  • When we optimize neural networks or any high-dimensional function, most of the critical points (points where the derivative is zero or close to zero) encountered along the optimization trajectory are saddle points.
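A classic illustrative example (my own, not from the article): f(x, y) = x^2 - y^2 has a critical point at the origin that is a minimum along x but a maximum along y, so plain gradient descent can stall there.

import numpy as np

# f(x, y) = x**2 - y**2 has gradient (2x, -2y), which vanishes at the
# origin, yet the origin is neither a minimum nor a maximum: a saddle point
f = lambda x, y: x ** 2 - y ** 2
grad_f = lambda x, y: np.array([2 * x, -2 * y])
print(grad_f(0.0, 0.0))  # zero gradient: plain gradient descent stalls here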

Stochastic Gradient descent with momentum

To prevent gradient descent from overshooting or taking too much time to reach the minimum, we define a momentum, which is a moving average of our gradients, and use it to update the weights of the network.

# pseudocode, adapted from Andrew Ng's course
# alpha: the learning rate
# beta: the momentum weight
# W: the weight to be updated
# grad(W): the gradient of W
# VdW_prev: the previous first moment of W
# VdW: the new first moment of W

#### Applying the momentum ####
VdW = beta * VdW_prev + (1 - beta) * grad(W)
W = W - alpha * VdW

Pros of gradient descent with momentum:

1- Keeps the update vector in the same general direction => faster convergence

2- Less oscillation around ravines and local optima

Cons of gradient descent with momentum:

1- Can cause the model to pass the optimum

2- Have to remember to scale down the learning rate by a factor of (1 - beta)

RMSProp optimization

RMSProp divides the gradient by the square root of the moving (propagated) average of the squared gradient.

import numpy as np

# alpha: the learning rate
# beta2: the RMSProp weight
# epsilon: a small number to avoid division by zero
# W: the weight to be updated
# grad(W): the gradient of W
# SdW_prev: the previous second moment of W
# SdW: the new second moment of W

#### Applying RMSProp ####
SdW = beta2 * SdW_prev + (1 - beta2) * np.square(grad(W))
W = W - alpha * grad(W) / (np.sqrt(SdW) + epsilon)
# we add a small epsilon to the denominator to avoid division by zero

Pros of RMSprop:

1- Speeds up gradient descent

2- Helps to deal with ravines and saddle points

3- Can use a larger learning rate

Adam optimization

Adam can be looked at as a combination of RMSprop and Stochastic Gradient Descent with momentum (SGD). It uses the squared gradients to scale the learning rate, like RMSprop, and it takes advantage of momentum by using a moving average of the gradient, instead of the gradient itself, like SGD with momentum.

Its name is derived from adaptive moment estimation: Adam uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network.

import numpy as np

# alpha: the learning rate
# beta1: the weight used for the first moment
# beta2: the weight used for the second moment
# epsilon: a small number to avoid division by zero
# W: the weight to be updated
# grad(W): the gradient of W
# VdW: the previous first moment of W
# SdW: the previous second moment of W
# t: the time step, used for bias correction

#### Adam optimization algorithm ####
VdW = beta1 * VdW + (1 - beta1) * grad(W)
c_VdW = VdW / (1 - beta1 ** t)
SdW = beta2 * SdW + (1 - beta2) * (grad(W) ** 2)
c_SdW = SdW / (1 - beta2 ** t)
W = W - alpha * c_VdW / (np.sqrt(c_SdW) + epsilon)

Adam does not always achieve the optimal results relative to other optimization methods:

The Adam optimization algorithm is known for its fast training speed. But in some areas it does not converge to an optimal solution, so for some tasks (such as image classification on the popular CIFAR datasets) state-of-the-art results are still only achieved by applying SGD with momentum.

Learning rate decay / Learning Rate Schedules

Learning rate decay is a technique to speed up the training of modern neural networks.
It gradually reduces the learning rate as training approaches the optimum so that it does not overshoot.

Learning rate decay methods:

1- Exponential decay:

# d (decay rate) is a hyper-parameter and t is the iteration number
# lr_0: initial learning rate
lr = lr_0 * e^(-d * t)

2- Time-based decay:

# d (decay rate) is a hyper-parameter and t is the iteration number
# lr_0: initial learning rate
lr = lr_0 / (1 + d * t)

3- Step decay:
Step decay schedule drops the learning rate by a factor every few epochs.

# d (decay factor) and epoch_drops are hyper-parameters
# lr_0: initial learning rate
lr = lr_0 * d^(epoch // epoch_drops)
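As a quick illustration of how the three schedules behave, a sketch with arbitrary example values for lr_0 and the decay hyper-parameters (these numbers are my own choices, not prescriptions):

import numpy as np

lr_0 = 0.1                      # initial learning rate
d_exp, d_time = 0.05, 0.05      # decay rates for the first two schedules
d_step, epoch_drops = 0.5, 10   # step decay factor and drop interval

for epoch in range(0, 40, 10):
    exp_lr = lr_0 * np.exp(-d_exp * epoch)             # exponential decay
    time_lr = lr_0 / (1 + d_time * epoch)              # time-based decay
    step_lr = lr_0 * d_step ** (epoch // epoch_drops)  # step decay
    print(epoch, round(exp_lr, 4), round(time_lr, 4), round(step_lr, 4))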

Cons of learning rate decay:

1- There are too many hyper-parameters that require tuning

2- Hard to know when the learning rate should drop ahead of training

3- Because the hyper-parameters are static, it’s not adaptive

4- Better to use adaptive learning rate optimization methods (RMSprop, Adam, etc.)

Thank you for reading! Notes are welcome; feel free to share your knowledge or thoughts.
