The Power of Gradient Descent in Machine Learning

Helping ML models minimize their error

Rohan Dawkhar
9 min read · Jun 27, 2021

Table of Contents -

  1. What are optimization problems in ML?
  2. Traditional Methods: (a) Differentiation (b) Grad
  3. Maxima and Minima
  4. Why the need for the Gradient Descent algorithm?
  5. Gradient Descent
    (a) Geometric intuition
    (b) Learning rate
    (c) Oscillation problem
    (d) Limitations of Gradient Descent
  6. Stochastic Gradient Descent
  7. Other optimizers
  8. References

1. What are optimization problems in ML?

Optimization problems are problems solved to find an optimal solution among many possible ones, by minimizing or maximizing an objective according to the task at hand. Examples: minimum cost, maximum profit, etc.
In the context of Machine Learning, these are the problems solved to minimize the error (or maximize the accuracy) of a model by adjusting the model’s parameters.
Below is an example of an optimization problem in which the <xi, yi> pairs are constants, since they are the data points already given in the dataset. The variable here is the weight ‘w’, over which we minimize to obtain the optimal weight w*.

w* is the optimal weight parameter from all the parameters w
Optimization problem of Linear Regression

2. Traditional Methods

Before the Gradient Descent method, people used to solve optimization problems with other techniques. This was before computers existed, so solving complex optimization problems was very hard. I will explain two methods that were used extensively in the pre-computer era.

(a) Differentiation

Differentiation gives the rate of change of a function: dy/dx is the rate of change of y with respect to x. If y = f(x), the derivative dy/dx is also written f′(x). This is scalar calculus, since both input and output are scalars.
Geometrically, the derivative is the slope of the tangent to the function f(x).
Differentiation can be used to find the minima and/or maxima of equations. One example is given in point 3 of this blog.

Must know basics of differentiation

All of modern Deep Learning is nothing but applying the Chain Rule. — Jeff Dean (American computer scientist and software engineer)

(b) Grad

Grad is the vector-calculus counterpart of differentiation (which is scalar calculus). Its symbol is ∇, also known as Del.
If x is a vector, i.e. x = <x1, x2, …, xd> is d-dimensional, then
df/dx = ∇ₓ f.

Grad of a function f with x as a vector

∇ₓ f is a d-dimensional vector of partial derivatives. Its first component differentiates f with respect to x₁ only, treating the other variables as constants; this is the partial derivative of f with respect to x₁. The same holds for every xᵢ.
The symbol used for partial derivatives is ∂ (read as “del” or “partial”).
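As a quick sanity check of the idea above, here is a small sketch (my own example, not from the course material) that approximates ∇ₓ f with finite differences: each component perturbs one coordinate at a time, exactly as the partial-derivative definition says.

```python
# Approximate the gradient of f(x1, x2) = x1**2 + 3*x2 numerically.
# The analytic gradient is [2*x1, 3].
def f(x):
    return x[0] ** 2 + 3 * x[1]

def grad(f, x, h=1e-6):
    # Central difference for each partial derivative ∂f/∂x_i:
    # perturb only coordinate i, keep every other coordinate fixed.
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

print(grad(f, [2.0, 5.0]))  # ≈ [4.0, 3.0], matching [2*x1, 3]
```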

3. Maxima and Minima

In the diagram, the point where the function attains its highest value is a maxima, and the point where it attains its lowest value is a minima.
One thing to notice about a maxima or minima: the slope there is zero. The tangent to f(x) at a minima or maxima is parallel to the X-axis, so the angle θ between them is zero. Since slope = tan(θ) and tan(0) = 0, the slope is zero.

Solving an optimization problem is really finding the minima or maxima of a function, which can be done with the traditional calculus methods above or with Gradient Descent.

A function can have multiple minima and/or multiple maxima. There is only one global minimum, but there can be multiple local minima; similarly, there is only one global maximum but possibly multiple local maxima.
It is also possible for a function to have no minima and/or no maxima.
For trivial equations, it is easy to find the minima or maxima just by differentiation.

Maximizing f(x) is equivalent to minimizing −f(x), and vice versa.

Let’s solve an easy example: find the minima/maxima of f(x) = x²-3x+2.

To find the minima/maxima of a function f(x), we set its derivative equal to zero, because the slope is zero at a minima/maxima,

i.e. df/dx = 0

f(x) = x²-3x+2 … equation 1

df/dx = 2x-3

But df/dx = 0

So 2x-3 = 0

2x = 3

x = 1.5

Put value of x in equation 1

f(1.5) = (1.5)²-3(1.5)+2 = -0.25

From df/dx = 0 alone, we can’t tell whether x = 1.5 is a maxima or a minima.

Let’s check f(1) = (1)²-3(1)+2 = 0

f(1.5) < f(1)

So x = 1.5 cannot be a maxima.

Therefore, x = 1.5 is the minima of f(x). (Equivalently, the second derivative d²f/dx² = 2 > 0 confirms a minimum.)
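The worked example above can be checked in a couple of lines: evaluate f at x = 1.5 and at nearby points, and confirm the nearby values are larger.

```python
# The example from the text: f(x) = x**2 - 3*x + 2,
# with df/dx = 2*x - 3 = 0 at x = 1.5.
def f(x):
    return x ** 2 - 3 * x + 2

print(f(1.5))          # -0.25, the minimum value found above
print(f(1.4), f(1.6))  # both larger than f(1.5), so 1.5 is a minimum
```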

4. Why the need for the Gradient Descent algorithm?

Optimization problems in Machine Learning are non-trivial, and it is hard to solve them with the traditional methods above. This is the prime reason for using the Gradient Descent algorithm and its variants.
The computation these problems require can be handled by computers, and Gradient Descent is easy to implement on a computer for many different optimization problems.
The parameters/weights updated in this way reduce the train loss/error of the ML model.

Gradient Descent is all you need!

5. Gradient Descent

Gradient Descent is an iterative algorithm which can compute the minima/maxima of a function f(x) easily with the help of computers.
To obtain the minima/maxima of f(x), the algorithm goes through k iterations to reach the desired optimal value of f(x).
x₀ ← first guess of x* (x₀ is chosen randomly)
x₁ ← iteration 1
x₂ ← iteration 2
:
:
xₖ ← iteration k (here we will get x*)

e.g. Find x* = argminₓ f(x)

(a) Geometric intuition

Let’s consider the following diagram.

Here, the slope changes its sign at x* (the minima): it is positive to the right of x* and negative to the left.
As you move closer to x* from the positive (right) side, the slope reduces toward zero.
As you move closer to x* from the negative (left) side, the slope increases toward zero.
We already know the slope at the minima (x*) is zero.
Gradient Descent is all about making the points/weights converge to x* (the minima).

After understanding this basic picture of the slopes and how they behave near the minima, let’s see how Gradient Descent actually works.

  • Working of Gradient Descent algorithm
Converging to the minima (x*) through ‘k’ iterations

Step 1 : Pick an initial point x₀ randomly.

Step 2 : Choose next point x₁ such that x₁ is closer to x* than x₀.
x₁ = x₀ - r [df/dx]ₓ₀ … where r is the step-size or learning rate, and [df/dx]ₓ₀ is the slope/derivative at the point x₀.
Let r = 1 (assume, for simplicity).
Then x₁ = x₀ - 1*(positive number), since [df/dx]ₓ₀ is positive when x₀ lies to the right of x* (refer to the diagram above).
So x₁ will be less than x₀, because we subtracted a positive number from x₀; the iterate has moved toward x*.

Step 3 : Now choose x₂ such that x₂ is closer to x* than x₁ and repeat the same calculations done in Step 2.
x₂ = x₁ - r [df/dx]ₓ₁

Step 4 : Compute x₀, x₁, x₂, …, xₖ (k iterations).
If |xₖ₊₁ - xₖ| is very small, terminate the loop.
Declare x* = xₖ (the minima obtained after k iterations).

xᵢ₊₁ = xᵢ - r [df/dx]ₓᵢ … Update equation for Gradient Descent.

The slope keeps reducing from x₀ toward x*, which means
[df/dx]ₓ₀ ≥ [df/dx]ₓ₁ ≥ [df/dx]ₓ₂ ≥ …

As we move from x₀ toward x*, the jumps are big initially and their size decreases as we converge. This happens because of the gradual decrease in slope from x₀ to x*.
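The steps above can be sketched in a few lines of code. This is a minimal illustration, not production code; the function f(x) = x²-3x+2 (from the earlier example, minimum at x* = 1.5), the starting point, and the learning rate r = 0.1 are my own choices.

```python
def df(x):
    return 2 * x - 3  # derivative of f(x) = x**2 - 3*x + 2

def gradient_descent(x0, r=0.1, tol=1e-8, max_iter=10_000):
    x = x0
    for _ in range(max_iter):
        x_new = x - r * df(x)      # update: x_{i+1} = x_i - r * [df/dx]_{x_i}
        if abs(x_new - x) < tol:   # terminate when the step becomes tiny
            return x_new
        x = x_new
    return x

print(gradient_descent(x0=5.0))  # converges very close to x* = 1.5
```

Note how the step size shrinks automatically as the iterate approaches x*, because the derivative df(x) itself shrinks toward zero there — exactly the "big jumps first, small jumps later" behaviour described above.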

(b) Learning rate

xᵢ₊₁ = xᵢ - r [df/dx]ₓᵢ . This is the update equation for Gradient Descent, in which r is the learning rate or step-size.
While walking through the algorithm above, we assumed r was a constant (r = 1) for simplicity. But a constant r can create problems while converging. One such problem is the oscillation problem, discussed in point (c).

(c) Oscillation problem

Let r = 1 and take f(x) = x², so [df/dx] = 2x.
At the iᵗʰ iteration, suppose we get xᵢ = 0.5.
Then xᵢ₊₁ = xᵢ - r [df/dx]ₓᵢ
xᵢ₊₁ = 0.5 - 1*(2 × 0.5)
So, xᵢ₊₁ = -0.5
xᵢ₊₂ = xᵢ₊₁ - r [df/dx]ₓᵢ₊₁
xᵢ₊₂ = -0.5 - 1*(2 × -0.5)
xᵢ₊₂ = 0.5, and so on…
So the iterate oscillates between x = 0.5 and x = -0.5 instead of converging to x* = 0.

The oscillation above happened purely because r was constant.

Solution to this problem → change r with each iteration.
One common technique is to reduce r with each iteration.

With this technique, Gradient Descent converges towards x* instead of jumping over it, and we obtain the minima.
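The oscillation and its fix can both be reproduced in a few lines. The decay schedule below (r shrinking with the iteration index) is one illustrative choice among many, not a prescription from the text.

```python
# f(x) = x**2, so df/dx = 2*x, and the minimum is at x* = 0.
def run(x0, steps, lr):
    """Run gradient descent on x**2 with a (possibly varying) learning rate."""
    x = x0
    history = []
    for i in range(steps):
        x = x - lr(i) * 2 * x  # update: x - r_i * df/dx
        history.append(x)
    return history

# Constant r = 1: the iterate flips sign forever, never converging.
print(run(0.5, 4, lambda i: 1.0))          # [-0.5, 0.5, -0.5, 0.5]

# Decaying r_i = 0.9 / (i + 1): magnitudes shrink toward x* = 0.
print(run(0.5, 6, lambda i: 0.9 / (i + 1)))
```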

(d) Limitations of Gradient Descent

While calculating the weights/parameters, every iteration takes all the data points of the dataset into consideration. Datasets often contain a million data points or more, so the computation becomes very expensive and time-consuming.
Gradient Descent works well for small datasets, but for larger datasets its cost and time consumption make it unattractive, so GD is not preferred there.
Solution for this problem is Stochastic Gradient Descent (SGD).

When the dataset is huge, go for SGD

6. Stochastic Gradient Descent (SGD)

The “Stochastic” in SGD means probabilistic/randomized.

Stochastic Gradient Descent is considered to be the most important optimization algorithm in Machine Learning.
In SGD, a set of k data points is considered in each iteration, where 1 ≤ k ≤ n,
i.e. k lies between 1 data point and all n data points.

Here k is the batch size; when k > 1, it is (mini-)batch SGD.
When k = 1, it is plain SGD.

The k data points are picked randomly, sampled without replacement, which means the set of k points picked for the first iteration will differ from the one picked for the second iteration, and so on.
Suppose that in Gradient Descent the number of points (n) is 1 million; we took all of them for the computation in each iteration.
In SGD, we don’t take 1 million points per iteration. We take a set of k points instead (say 1000 points), so k here is 1000.
In exchange, SGD needs to perform more iterations than GD to converge to the optimal solution.

As the batch size (k) decreases, the number of iterations required increases.

Despite the differences between these algorithms (GD and SGD), the final optimal weight (say W*) comes out essentially the same for both.

(W* of GD) ≈ (W* of SGD)

SGD doesn’t compute the exact gradient over the whole dataset; it approximates it using the batch. Given enough iterations, it still reaches the optimal weight W*.
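A rough sketch of mini-batch SGD for the simplest possible linear regression, y ≈ w·x. The synthetic data, batch size, learning rate, and epoch count below are all my own illustrative choices; the point is only that each update uses a random batch of k points, whose gradient approximates the full-dataset gradient.

```python
import random

random.seed(0)
n = 1000
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [3.0 * x for x in xs]  # true weight is 3.0 (no noise, for clarity)

def sgd(k=32, r=0.1, epochs=50):
    w = 0.0
    idx = list(range(n))
    for _ in range(epochs):
        random.shuffle(idx)                 # sample batches without replacement
        for start in range(0, n, k):
            batch = idx[start:start + k]
            # Gradient of the mean squared error on this batch only:
            # d/dw (1/k) * sum (w*x - y)**2 = (2/k) * sum (w*x - y) * x
            g = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) * 2 / len(batch)
            w -= r * g                       # same update rule as GD
    return w

print(sgd())  # close to the true weight 3.0
```

Each update here touches only k = 32 points instead of all 1000, which is exactly the cost saving that makes SGD preferable on huge datasets.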

7. Other optimizers

(Q) Why use other optimizers when we have SGD?
→ SGD’s gradient approximation is noisy, since it does not compute the exact full gradient, so it may take longer to converge to the minima.
Also, SGD can get stuck at saddle points (points where the slope is zero but which are neither minima nor maxima).

In Deep Learning, people have come up with variations of SGD which have become the most important optimizers in the whole of Machine Learning.
These algorithms are well suited to Deep Learning tasks.
I will just name a few such algorithms.
The algorithms are -
1. Batch SGD with Momentum
2. Nesterov Accelerated Gradient(NAG)
3. AdaGrad(Adaptive Gradients)
4. Adadelta
5. RMSProp
6. Adam(Adaptive Moment Estimation)

As this blog is about Gradient Descent and Stochastic GD, we will not go into the details of other optimizers; that topic deserves a full-fledged blog of its own.

8. References

  1. All the concepts were studied from the Applied AI Course (AAIC): https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
  2. A couple of videos on Gradient Descent from the StatQuest with Josh Starmer YouTube channel: https://www.youtube.com/watch?v=sDv4f4s2SB8&t=342s
