Stochastic Gradient Descent in Deep Learning

Hey - Nick here! This page is a free excerpt from my $199 course Python for Finance, which is 50% off for the next 50 students.

If you want the full course, click here to sign up.

The last tutorial introduced you to gradient descent, which is the algorithm used to minimize the cost function in deep learning predictive models.

Gradient descent is a useful algorithm. However, it has some drawbacks. There are certain cost functions for which the vanilla gradient descent algorithm will fail to identify the lowest value.

This tutorial will teach you how you can avoid the problems associated with gradient descent models by implementing stochastic gradient descent models instead.

You can skip to a specific section of this stochastic gradient descent tutorial using the table of contents below:

The Problems With Vanilla Gradient Descent Algorithms
What is Stochastic Gradient Descent?
A Word on Gradient Descent Nomenclature
The Performance of Stochastic Gradient Descent Versus Batch Gradient Descent
Final Thoughts

The Problems With Vanilla Gradient Descent Algorithms

Gradient descent algorithms are capable of identifying the minimum value of a cost function in certain situations.

However, there are other situations where a vanilla gradient descent will fail to find the absolute lowest value. Instead, it will find a value that is small compared to its surrounding values, but not the absolute lowest point in the data set.

Machine learning practitioners refer to this as finding the local minimum instead of the absolute minimum value of the cost function. To see how this is possible, consider the following cost function diagram where your initial value is where the red circle is located:

A stochastic gradient descent algorithm

A human eye can easily see where the cost function attains its lowest value. However, a vanilla gradient descent algorithm will erroneously select the red circle as the point of lowest error. The green circle shows the true point of lowest error.

A stochastic gradient descent algorithm

The solution to this problem is to use stochastic gradient descent.

What is Stochastic Gradient Descent?

The differences between a normal gradient descent algorithm and a stochastic gradient descent algorithm are subtle but very important.

In a normal gradient descent algorithm, the weights of each neuron's input signals are adjusted after the cost function has been calculated for the entire data set.

A stochastic gradient descent algorithm adjusts the weights of each neuron's input signals after the cost function has been calculated for each entry in the data set.

This means that in a data set with n observations, a stochastic gradient descent algorithm will have n iterations before passing through the entire data set. A normal gradient descent algorithm uses the entire data set on each pass.

A Word on Gradient Descent Nomenclature

The difference between "normal" gradient descent and stochastic gradient descent can be easier to remember if you understand the etymology behind each term.

First, let's stop using the term "normal" gradient descent and give this technique a more formal name. The first type of gradient descent that we explored in this course is typically called batch gradient descent because it groups together the entire data set into a batch, and uses that batch in each iteration.

Let's now discuss the etymology of stochastic gradient descent.

Stochastic means randomly determined, which refers to the ordering of observations within a data set that is used for deep learning.

Since a data set remains unchanged if you re-order its observations, then the random nature of observations within the data set give stochastic gradient descent its name.

The Performance of Stochastic Gradient Descent Versus Batch Gradient Descent

Perhaps the most important takeaway you should remember from this tutorial is the differences in performance between batch gradient descent algorithms and stochastic gradient descent algorithms.

First of all, note that there are two components to performance:

The ability to identify a global minimum for the cost function
The speed at which that global minimum can be found

Stochastic gradient descent wins the performance horse race on both metrics.

We've already discussed how stochastic gradient descent is capable of finding absolute minimums that batch gradient descent cannot. However, it might not be clear how stochastic gradient descent wins on speed.

Stochastic gradient descent is a faster algorithm than batch gradient descent because it only needs to calculate the cost function once for each iteration of the algorithm. On the other hand, batch gradient descent needs to compute the cost function for every observation in the data set during each iteration of the algorithm.

Final Thoughts

This tutorial introduced you to stochastic gradient descent, which is a variation of gradient descent that is widely considered to be more performant.

Here is a broad summary of what you learned in this tutorial:

That the "normal" gradient descent algorithm that we learned about previously is more commonly called the batch gradient descent algorithm
Why batch gradient descent can fail to identify absolute minimums in certain cost functions
The differences between stochastic gradient descent and batch gradient descent
The etymology behind the terms "batch gradient descent" and "stochastic gradient descent"
That stochastic gradient descent is more performant than batch gradient descent

Nick McCullum