Data Science | Machine Learning | Deep Learning

Neural Networks’ Secret Ingredient: The Learning Rate Hyperparameter

How can you choose an optimal value of the learning rate in gradient descent algorithms?

Swapnil Kangralkar

--

In this module, you’ll learn what a loss function is, how a machine learning model iteratively reduces loss, and how you can use the ‘learning rate’ hyperparameter to minimize the loss efficiently.

What is a loss function?

Loss is a number that indicates the difference between the model’s predictions and the actual values (labels). When the loss is 0, the model’s predictions are perfect; otherwise, the larger the number, the worse the predictions.

Fig 1: Image created by the author

For the linear regression example on the left of Fig 1, the Mean Squared Error (MSE) loss works out to (1 + 4)/8 ≈ 0.63: the numerator is the sum of the squared prediction errors, and ‘N’, the number of observations, is 8.
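If you prefer to see the calculation in code, here is a minimal sketch of MSE. The arrays below are made-up values chosen so that the result matches the (1 + 4)/8 arithmetic; they are not the exact points from Fig 1.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared differences
    # between the actual labels and the model's predictions.
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical example: 8 observations where two predictions
# are off by 1 and 2 respectively, giving (1 + 4) / 8 = 0.625.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y_pred = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0])
print(mse(y_true, y_pred))  # 0.625
```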

Now, let’s understand how a machine learning model iteratively reduces loss.

First, the model takes one or more features as input and outputs a prediction (y′). For simplicity, we will consider a single feature in this example, which gives the equation y′ = m·x + b, where ‘m’ and ‘b’ are the weights: ‘m’ is the slope and ‘b’ is the intercept on the y-axis. As shown in fig 2, the goal is to find the values of ‘m’ and ‘b’ for which the line fits the data points with minimum loss.

Fig 2: Image created by the author
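In code, the model’s prediction for a single feature is just this equation. The feature value and weights below are placeholders, not values from the figure:

```python
def predict(x, m, b):
    # y' = m * x + b: slope m, intercept b on the y-axis
    return m * x + b

print(predict(2.0, m=0.5, b=1.0))  # 2.0
```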

Initially, the machine learning algorithm assumes random values for ‘b’ and ‘m’ and calculates the value of the loss function. Learning then continues until the algorithm discovers the parameter values that yield the lowest possible loss, or until the overall loss stops changing (or changes extremely slowly). When this happens, we say that the model has converged. Fig 3 below shows the iterative process.

Fig 3: Image created by the author

If we were to draw a plot of loss vs the weight, we would get the curve shown in fig 4; the point marked with a star is the only place where the slope is exactly 0, i.e. the point where the loss function converges. But calculating the loss for every possible value of ‘m’ would be time-consuming. A more efficient way of finding the convergence point is gradient descent, which finds the optimal values for the weights ‘b’ and ‘m’.
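To make the “calculate the loss for every value of ‘m’” idea concrete, here is a brute-force sketch. The data and the grid of candidate slopes are made up, and ‘b’ is fixed at 0 just to keep the search one-dimensional:

```python
import numpy as np

# Made-up data that roughly follows y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

candidates = np.linspace(-5, 5, 1001)                 # grid of slopes to try
losses = [np.mean((m * x - y) ** 2) for m in candidates]
best_m = candidates[int(np.argmin(losses))]
print(best_m)  # close to 2.0, but we had to evaluate 1001 losses to find it
```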

The algorithm starts with a random value of ‘m’; in our case, we start with m = 0.3. The gradient descent algorithm then calculates the gradient of the loss curve at this starting point. The gradient of the loss is equal to the slope of the curve at that point.

Fig 4: Image created by the author

To find the slope of the curve at a particular point, you pass a tangent line through that point. The tangent touches the curve at that one point and, when you magnify and look closely, the two are locally indistinguishable there (as shown in Fig 5), so the slope of the tangent line is also the slope of the curve at that point. And you already know how to find the slope of a line (rise/run). Therefore, the gradient at p1 works out to -13.3.

Fig 5: Image created by the author
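In code, you can approximate that rise-over-run slope numerically by evaluating the loss at two nearby points. The data below is made up, so the number will not match the -13.3 from Fig 5, but the idea is the same:

```python
import numpy as np

# Made-up data, so the numbers will not match Fig 4/5 exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def loss_at(m, b=0.0):
    # MSE loss as a function of the slope m (intercept fixed for simplicity).
    return np.mean((m * x + b - y) ** 2)

def slope_at(m, eps=1e-6):
    # Rise over run between two very close points on the loss curve.
    return (loss_at(m + eps) - loss_at(m - eps)) / (2 * eps)

print(slope_at(0.3))  # negative here: the loss decreases as m increases
```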

To determine the next point, the algorithm multiplies the magnitude of the gradient by the learning rate (also known as the step size), a value that we set while tuning the hyperparameters, to get the size of the step to take. The formula is:

Fig 6: Image created by the author
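In code, that update boils down to a single line: subtract the learning rate times the gradient from the current value. The sketch below reuses the m = 0.3 starting point and the -13.3 gradient from the example; the learning rate of 0.01 is an arbitrary illustrative choice:

```python
m = 0.3            # current value of the weight
gradient = -13.3   # slope of the loss curve at m (from the example above)
learning_rate = 0.01

# Step against the gradient: subtract learning_rate * gradient.
m_next = m - learning_rate * gradient
print(m_next)  # 0.433 -- the negative gradient pushes m up, towards lower loss
```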

The sign of the gradient tells us its direction. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce the loss. So points p1, p2, and p3 have a negative gradient, while point x has a positive gradient; the algorithm therefore decreases the weight when the gradient is positive and increases it when the gradient is negative. When there are multiple weights, the gradient is a vector of partial derivatives of the loss with respect to each weight.
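For the two-weight case in this example, that gradient vector holds the partial derivative of the MSE loss with respect to ‘m’ and with respect to ‘b’. A minimal sketch, again with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def gradients(m, b):
    # Partial derivatives of MSE = mean((m*x + b - y)**2)
    error = m * x + b - y
    dm = 2 * np.mean(error * x)   # d(loss)/dm
    db = 2 * np.mean(error)       # d(loss)/db
    return dm, db

print(gradients(0.3, 0.0))  # both negative at this starting point
```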

This step is done simultaneously for the intercept ‘b’. As seen in fig 7, the red dot marks where the loss is at its minimum. The algorithm tries different values for ‘m’ and ‘b’ using gradient descent until it finds the optimal values that minimize the loss.

Fig 7: Image created by the author
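Putting the pieces together, a bare-bones gradient descent loop that updates ‘m’ and ‘b’ simultaneously might look like the sketch below. The data, learning rate, and number of iterations are illustrative choices, not values taken from the figures:

```python
import numpy as np

# Made-up data that roughly follows y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

m, b = 0.0, 0.0            # start from arbitrary weights
learning_rate = 0.05

for step in range(500):
    error = m * x + b - y              # prediction minus label
    dm = 2 * np.mean(error * x)        # d(loss)/dm
    db = 2 * np.mean(error)            # d(loss)/db
    m -= learning_rate * dm            # update both weights
    b -= learning_rate * db            #   simultaneously

print(m, b)  # should land roughly at 2 and 1 for this data
```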

Note: There are no concrete rules for tuning the learning rate. A few points you can keep in mind are:

  1. The loss should steadily decrease, steeply at first and then more slowly, until the slope of the loss curve approaches zero and the loss cannot be reduced much further.
  2. If the loss decreases too slowly, increase the learning rate. Note that setting the learning rate too high may prevent the loss from converging: the update will overshoot the minimum, landing at a place like point x in fig 5 (the sketch after this list illustrates both behaviours).
  3. If the loss jumps around, decrease the learning rate.
  4. For best results, keep the learning rate low, and increase the number of epochs or the batch size.
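To see points 2 and 3 in action, you can rerun the same kind of loop with a few different learning rates and compare the final loss. The data and the learning-rate values below are purely illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

def final_loss(learning_rate, steps=100):
    # Run plain gradient descent for a fixed number of steps
    # and report the loss we end up with.
    m, b = 0.0, 0.0
    for _ in range(steps):
        error = m * x + b - y
        m -= learning_rate * 2 * np.mean(error * x)
        b -= learning_rate * 2 * np.mean(error)
    return np.mean((m * x + b - y) ** 2)

for lr in (0.001, 0.05, 0.1):
    print(lr, final_loss(lr))
# 0.001 -> loss falls very slowly; still well above the others after 100 steps
# 0.05  -> loss converges to a small value
# 0.1   -> loss explodes; each step overshoots the minimum
```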

We will learn more about epochs and batch size in the next article.

Thank you for reading. Get in touch via LinkedIn if you have further questions.
