Comparison of Optimizers

Version 1.00

(C) 2021 - Umberto Michelucci, Michela Sperti

This notebook is part of the book Applied Deep Learning: A Case-Based Approach, 2nd edition, published by Apress and written by U. Michelucci and M. Sperti.

Introduction

This notebook contains the code with which you can compare different optimizers in Keras and see how much faster (or slower) each converges when applied to a simple problem. If you are interested in a complete discussion of this topic, check Chapter 5 of the book (ADL, 2nd edition).

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

# tensorflow libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

Environment Set-up

The following cells will prepare the environment and download the files that you will need to run this notebook. You can run the entire notebook on Google Colab without downloading anything manually. For more information, check the comments in each cell.

# Referring to the following cell, if you want to re-clone a repository
# inside the Google Colab instance, you need to delete it first.
# You can delete the repository contained in this instance by executing
# the code below (after removing the # comment symbol).

# !rm -rf ADL-Book-2nd-Ed 
# This command clones the repository of the book into the Google Colab
# instance. In this way, this notebook will have access to the modules
# we have written for this book.

# Please note that if you have already run this cell and you run it again,
# you may get the following error message:
#
# fatal: destination path 'ADL-Book-2nd-Ed' already exists and is not an empty directory.
# 
# In this case you can safely ignore the error message.

!git clone https://github.com/toelt-llc/ADL-Book-2nd-Ed.git
Cloning into 'ADL-Book-2nd-Ed'...
remote: Enumerating objects: 1915, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 1915 (delta 3), reused 32 (delta 2), pack-reused 1880
Receiving objects: 100% (1915/1915), 655.46 MiB | 29.07 MiB/s, done.
Resolving deltas: 100% (951/951), done.
Checking out files: 100% (624/624), done.
# This cell imports some custom-written functions that we have created to 
# make the plotting easier. You don't need to understand the details and 
# you can simply ignore this cell.
# Simply run it with CMD+Enter (on Mac) or CTRL+Enter (Windows or Ubuntu) to
# import the necessary functions.

import sys
sys.path.append('ADL-Book-2nd-Ed/modules/')

from style_setting import set_style
# The following line contains the path to the fonts that are used to plot
# results in a uniform way.

f = set_style().set_general_style_parameters()

Data Generation

Now let’s generate some data (with a perfectly linear relationship) and try to perform linear regression with the different optimizers.

m = 30
w0 = 2
w1 = 0.5
x = np.linspace(-1,1,m)
y = w0 + w1 * x
# plot the data we are going to use

fig = plt.figure()
ax = fig.add_subplot(111)

plt.scatter(x, y, marker = 'o', c = 'blue')

plt.ylabel('y', fontproperties = fm.FontProperties(fname = f))
plt.xlabel('x', fontproperties = fm.FontProperties(fname = f))

plt.ylim(min(y), max(y))
plt.xlim(min(x), max(x))

plt.axis(True)
plt.show()
[Figure: scatter plot of the generated data points (x, y)]
# Equation (4)
def hypothesis(x,w0,w1):
  return w0 + w1*x

To do linear regression we will minimize the Mean Squared Error (MSE). For that we need to define it as a Python function.
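For reference, this is the cost function implemented in the cell below (the extra factor of 1/2 is a common convention that simplifies the gradient):

\[ J(w_0, w_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( y_i - (w_0 + w_1 x_i) \right)^2 \]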

# Equation (3)
def cost_function_mse(x,y,w0,w1):
  return np.mean((y - hypothesis(x,w0,w1))**2)/2

Gradient Descent Implementation

Since Keras (with TensorFlow 2.x) does not provide a plain gradient descent optimizer out of the box, here is an implementation from scratch that we can compare to the other optimizers.
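For reference, these are the update rules implemented in the function below (learning rate \(\gamma\), \(m\) samples); note that, as in the code, the updated \(w_0\) is already used in the \(w_1\) update:

\[ w_0 \leftarrow w_0 + \frac{\gamma}{m} \sum_{i=1}^{m} \left( y_i - w_0 - w_1 x_i \right), \qquad w_1 \leftarrow w_1 + \frac{\gamma}{m} \sum_{i=1}^{m} \left( y_i - w_0 - w_1 x_i \right) x_i \]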

def gradient_descent(x,y,x0,y0,gamma,epochs):
  """
  Returns the lists of w0 and w1 values visited during training and the cost
  function at each epoch.
  Inputs:
  - x: samples (array)
  - y: output (array)
  - x0: starting value for w0
  - y0: starting value for w1
  - gamma: learning rate
  - epochs: number of epochs to be performed
  """
  w0_list = [x0]
  w1_list = [y0]
  w0 = x0 # initialize w0 to the given starting value
  w1 = y0 # initialize w1 to the given starting value
  m = len(x) # number of samples
  cf = []
  for i in range(epochs): # repeat for the given number of epochs
    w0 = w0*(1 - gamma) + (gamma/m)*np.sum(y - w1*x) # update w0
    w1 = w1 + (gamma/m)*np.sum((y - w0 - w1*x)*x) # update w1 (uses the updated w0)
    cf.append(cost_function_mse(x,y,w0,w1))
    w0_list.append(w0)
    w1_list.append(w1)

  return w0_list,w1_list,cf

We initialize the two parameters (or weights) to \((0, 0)\); the starting values are stored in the variables w0_start and w1_start.

epochs = 200
w0_start = 0.0
w1_start = 0.0
w0l,w1l,cfl = gradient_descent(x,y,w0_start,w1_start,0.1,epochs)
# Gradient descent trajectory in the (w0, w1) parameter space

fig = plt.figure()
ax = fig.add_subplot(111)

plt.plot(w0l, w1l, '--', color = 'blue', label = r'Gradient Descent steps, $\gamma = 0.1$', marker = 'o', markersize = 5)

plt.ylabel('$w_1$', fontproperties = fm.FontProperties(fname = f))
plt.xlabel('$w_0$', fontproperties = fm.FontProperties(fname = f))

plt.scatter([w0],[w1], zorder = 100, color = 'red', s = 50, label = 'Expected minimum location')

plt.ylim(-0.05, 0.8)
plt.xlim(-0.05, None)

plt.axis(True)
legend = ax.legend(loc = 'upper left')

fig.savefig('Figure_GD_alone.png', dpi = 300)
plt.show()
[Figure: gradient descent trajectory in the (w0, w1) plane with the expected minimum marked]

Optimization with Adam and RMSProp (TensorFlow Implementations)

In the following sections we will minimize the loss function with the different optimizers. To do this, we perform linear regression with a neural network made of a single neuron without an activation function, using a custom training loop so that we can keep track of what each optimizer is doing at every epoch.
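As a quick sanity check (an addition to the original notebook, sketched with np.polyfit), note that since the data is noiseless and perfectly linear, the closed-form least-squares fit recovers the expected minimum exactly; these are the reference values marked as "Expected minimum location" in the plots.

# Sanity check (not part of the original analysis): closed-form least-squares fit.
# For degree 1, np.polyfit returns the coefficients (slope, intercept) = (w1, w0).
w1_ls, w0_ls = np.polyfit(x, y, 1)
print('w0 =', round(w0_ls, 4), ' w1 =', round(w1_ls, 4)) # expected: w0 = 2.0, w1 = 0.5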

# one unit as network's output
# identity function as activation function
# sequential groups a linear stack of layers into a tf.keras.Model
# activation parameter: if you don't specify anything, no activation 
# is applied (i.e. "linear" activation: a(x) = x).
model = keras.Sequential([ 
  layers.Dense(1, input_shape = [1], use_bias = True)
])

# optimizer that implements the Adam algorithm
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1)

# Mean Square Error (mse)
loss_fn = keras.losses.MeanSquaredError()
model.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_1 (Dense)             (None, 1)                 2         
                                                                 
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________
x_ = np.array(x).reshape(len(x),1)
y_ = np.array(y).reshape(len(y),1)

To make the results comparable we initialize the weights to the same values that we used for the gradient descent implementation.

# A Dense layer stores its weights as [kernel, bias]; the kernel corresponds to the
# slope (w1) and the bias to the intercept (w0). Both starting values are 0.0 here,
# so the order of w0_start and w1_start does not matter.
model.set_weights([np.array([w0_start]).reshape(1,1),np.array([w1_start]).reshape(1,)])
print(model.layers[0].get_weights())
[array([[0.]], dtype=float32), array([0.], dtype=float32)]

Adam Optimizer Loop

w0_adam_list = [w0_start]
w1_adam_list = [w1_start]

# optimizer that implements the Adam algorithm
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1)


for epoch in range(200):
    
    with tf.GradientTape() as tape:

        # Run the forward pass of the model.
        ypred = model(x_, training=True)  # predictions for the whole dataset (full batch)

        # Compute the loss value for this batch.
        loss_value = loss_fn(y_, ypred)

    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))

    # Save the weights to be able to plot them later
    # (get_weights() returns [kernel, bias], i.e. [w1, w0])
    w1_adam_list.append(float(model.get_weights()[0][0]))
    w0_adam_list.append(float(model.get_weights()[1][0]))
    
    # Some old school debugging (i.e. printing messages ;-))
    if (epoch % 20 == 0):
      print("epoch", epoch, " / ", model.get_weights())
epoch 0  /  [array([[0.09999911]], dtype=float32), array([0.09999992], dtype=float32)]
epoch 20  /  [array([[0.42432478]], dtype=float32), array([1.8398129], dtype=float32)]
epoch 40  /  [array([[0.5244303]], dtype=float32), array([2.2219062], dtype=float32)]
epoch 60  /  [array([[0.51567084]], dtype=float32), array([1.9516284], dtype=float32)]
epoch 80  /  [array([[0.50590926]], dtype=float32), array([1.9958987], dtype=float32)]
epoch 100  /  [array([[0.50202054]], dtype=float32), array([2.0070696], dtype=float32)]
epoch 120  /  [array([[0.50047296]], dtype=float32), array([1.9965487], dtype=float32)]
epoch 140  /  [array([[0.49994057]], dtype=float32), array([2.0013118], dtype=float32)]
epoch 160  /  [array([[0.49990508]], dtype=float32), array([1.99954], dtype=float32)]
epoch 180  /  [array([[0.49998975]], dtype=float32), array([2.00016], dtype=float32)]

RMSProp Loop

w0_rmsprop_list = [w0_start]
w1_rmsprop_list = [w1_start]


model.set_weights([np.array([w0_start]).reshape(1,1),np.array([w1_start]).reshape(1,)])
print(model.layers[0].get_weights())

# optimizer that implements the RMSprop algorithm
optimizer = tf.keras.optimizers.RMSprop(learning_rate = 0.02)


for epoch in range(200):
    
    with tf.GradientTape() as tape:

        # Run the forward pass of the model.
        ypred = model(x_, training=True)  # predictions for the whole dataset (full batch)

        # Compute the loss value for this batch.
        loss_value = loss_fn(y_, ypred)

    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    
    # Save the weights (kernel = w1, bias = w0) to be able to plot them later
    w1_rmsprop_list.append(float(model.get_weights()[0][0]))
    w0_rmsprop_list.append(float(model.get_weights()[1][0]))
    
    if (epoch % 20 == 0):
      print("epoch", epoch, " / ", model.get_weights())
[array([[0.]], dtype=float32), array([0.], dtype=float32)]
epoch 0  /  [array([[0.06324548]], dtype=float32), array([0.06324554], dtype=float32)]
epoch 20  /  [array([[0.43024722]], dtype=float32), array([0.5629626], dtype=float32)]
epoch 40  /  [array([[0.49914235]], dtype=float32), array([0.926377], dtype=float32)]
epoch 60  /  [array([[0.5]], dtype=float32), array([1.2652179], dtype=float32)]
epoch 80  /  [array([[0.5000007]], dtype=float32), array([1.5768003], dtype=float32)]
epoch 100  /  [array([[0.5016935]], dtype=float32), array([1.8357472], dtype=float32)]
epoch 120  /  [array([[0.50949186]], dtype=float32), array([1.9809263], dtype=float32)]
epoch 140  /  [array([[0.51007926]], dtype=float32), array([1.999977], dtype=float32)]
epoch 160  /  [array([[0.5097873]], dtype=float32), array([2.], dtype=float32)]
epoch 180  /  [array([[0.51010305]], dtype=float32), array([2.], dtype=float32)]
# Trajectories of the three optimizers in the (w0, w1) parameter space

fig = plt.figure(figsize = (12,6))
#ax = fig.add_subplot(111)

plt.plot(w0l, w1l, 'k--', color = 'blue', label = 'Gradient Descent')#, marker = 'o', markersize = 5)
plt.plot(w0_adam_list, w1_adam_list, 'k--', color = 'black', label = 'Adam')#, marker = 'o', markersize = 5)
plt.plot(w0_rmsprop_list, w1_rmsprop_list, 'k--', color = 'red', label = 'RMSProp')#, marker = 'o', markersize = 5)


plt.ylabel('$w_1$', fontproperties = fm.FontProperties(fname = f))
plt.xlabel('$w_0$', fontproperties = fm.FontProperties(fname = f))
plt.scatter([w0],[w1], color = 'yellow', s = 50, zorder = 500, label = 'Expected minimum location')
#plt.xticks([0,5,10,15,20])

plt.ylim(0, 1.0)
plt.xlim(0, None)

plt.axis(True)
plt.legend(loc = 'best', bbox_to_anchor=(1.1, 1.0))

plt.tight_layout()
fig.savefig('Figure_optimizers.png', dpi = 300)

plt.show()
[Figure: trajectories of Gradient Descent, Adam and RMSProp in the (w0, w1) plane]
# Zoom of the optimizer trajectories around the expected minimum

fig = plt.figure()
ax = fig.add_subplot(111)
start_epoch = 0

plt.plot(w0l[start_epoch:200], w1l[start_epoch:200], 'k--', color = 'blue', label = 'Gradient Descent')#, marker = 'o', markersize = 5)
plt.plot(w0_adam_list[start_epoch:200], w1_adam_list[start_epoch:200], 'k--', color = 'black', label = 'Adam')#, marker = 'o', markersize = 5)
plt.plot(w0_rmsprop_list[start_epoch:200], w1_rmsprop_list[start_epoch:200], 'k--', color = 'red', label = 'RMSProp')#, marker = 'o', markersize = 5)


plt.ylabel('$w_1$', fontproperties = fm.FontProperties(fname = f))
plt.xlabel('$w_0$', fontproperties = fm.FontProperties(fname = f))

plt.scatter([w0],[w1], color = 'yellow', s = 50, zorder = 500, label = 'Expected minimum location')
#plt.xticks([0,5,10,15,20])

plt.ylim(0.4, 0.6)
plt.xlim(1.9, 2.1)

plt.axis(True)
legend = ax.legend(loc = 'best')

plt.show()
[Figure: zoom of the optimizer trajectories around the expected minimum]

Plots vs. the EPOCH number

Here you can see the value of the parameters vs. the number of epochs, to get an idea of how the parameters converge to their expected values.
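To put a rough number on convergence speed (an addition to the original notebook; a minimal sketch that assumes the weight-history lists w1l, w1_adam_list and w1_rmsprop_list from the cells above are in scope), we can look for the first epoch after which w1 stays within 0.01 of its expected value of 0.5:

# Rough convergence check (not part of the original notebook): first epoch after
# which w1 stays within 0.01 of the expected value 0.5 for all remaining epochs.
# None means the history never settles inside the tolerance band.
def first_stable_epoch(w_history, target = 0.5, tol = 0.01):
  w = np.array(w_history)
  inside = np.abs(w - target) <= tol
  for i in range(len(w)):
    if inside[i:].all():
      return i
  return None

for name, hist in [('Gradient Descent', w1l),
                   ('Adam', w1_adam_list),
                   ('RMSProp', w1_rmsprop_list)]:
  print(name, '->', first_stable_epoch(hist))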

# w1 vs. epoch number for the three optimizers

fig = plt.figure()
start_epoch = 0

plt.plot(np.arange(0,200), w1l[start_epoch:200], 'k--', color = 'blue', label = 'Gradient Descent')#, marker = 'o', markersize = 5)
plt.plot(np.arange(0,200), w1_adam_list[start_epoch:200], 'k--', color = 'black', label = 'Adam')#, marker = 'o', markersize = 5)
plt.plot(np.arange(0,200), w1_rmsprop_list[start_epoch:200], 'k--', color = 'red', label = 'RMSProp')#, marker = 'o', markersize = 5)


plt.ylabel('$w_1$', fontproperties = fm.FontProperties(fname = f))
plt.xlabel('Epochs', fontproperties = fm.FontProperties(fname = f))
plt.hlines(0.5, -10, 200, ls = 'dotted', label = 'Expected value')

plt.xlim(-10,200)
plt.axis(True)
plt.legend(loc = 'best')
fig.savefig('w1_vs_n.png', dpi = 300)

plt.show()
[Figure: w1 vs. epoch number for the three optimizers, with the expected value 0.5 marked]

Note that the RMSProp optimizer keeps oscillating around the correct value; you can see that clearly if you zoom in around the expected value (see below).
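One way to quantify the oscillation (again an addition to the original notebook, assuming the same weight-history lists are in scope) is to look at the spread of w1 over the last 50 epochs for each optimizer:

# Spread of w1 over the last 50 epochs: a rough measure of how much each
# optimizer still oscillates around the minimum at the end of training.
for name, hist in [('Gradient Descent', w1l),
                   ('Adam', w1_adam_list),
                   ('RMSProp', w1_rmsprop_list)]:
  tail = np.array(hist[-50:])
  print('{:18s} mean = {:.4f}  peak-to-peak = {:.4f}'.format(
      name, tail.mean(), tail.max() - tail.min()))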


# Zoom of w1 vs. epoch number around the expected value

fig = plt.figure(figsize = (12,6))
start_epoch = 0

plt.plot(np.arange(0,200), w1l[start_epoch:200], 'k--', color = 'blue', label = 'Gradient Descent')#, marker = 'o', markersize = 5)
plt.plot(np.arange(0,200), w1_adam_list[start_epoch:200], 'k--', color = 'black', label = 'Adam')#, marker = 'o', markersize = 5)
plt.plot(np.arange(0,200), w1_rmsprop_list[start_epoch:200], 'k--', color = 'red', label = 'RMSProp')#, marker = 'o', markersize = 5)


plt.ylabel('$w_1$', fontproperties = fm.FontProperties(fname = f))
plt.xlabel('Epochs', fontproperties = fm.FontProperties(fname = f))
plt.hlines(0.5, -10, 200, ls = 'dotted', label = 'Expected value')

plt.xlim(150,200)
plt.ylim(0.48, 0.52)
plt.axis(True)
plt.legend(loc = 'best', bbox_to_anchor=(1.1, 1.0))
plt.tight_layout()
fig.savefig('w1_vs_n_zoom.png', dpi = 300)

plt.show()
[Figure: zoom of w1 vs. epoch number around the expected value 0.5]

The next plot shows w0 vs. the epoch number (restricted to the first 150 epochs) to give you an idea of how fast the parameter gets close to its expected value.

# w0 vs. epoch number for the three optimizers

fig = plt.figure()
start_epoch = 0

plt.plot(np.arange(0,200), w0l[start_epoch:200], 'k--', color = 'blue', label = 'Gradient Descent')#, marker = 'o', markersize = 5)
plt.plot(np.arange(0,200), w0_adam_list[start_epoch:200], 'k--', color = 'black', label = 'Adam')#, marker = 'o', markersize = 5)
plt.plot(np.arange(0,200), w0_rmsprop_list[start_epoch:200], 'k--', color = 'red', label = 'RMSProp')#, marker = 'o', markersize = 5)
plt.hlines(2.0, -10, 200, ls = 'dotted', label = 'Expected value')


plt.ylabel('$w_0$', fontproperties = fm.FontProperties(fname = f))
plt.xlabel('Epochs', fontproperties = fm.FontProperties(fname = f))
#plt.scatter([w0],[w1], color = 'red', s = 30)
#plt.xticks([0,5,10,15,20])

#plt.ylim(0.45, 0.55)
plt.xlim(-10,150)

plt.axis(True)
plt.legend(loc = 'best')
fig.savefig('w0_vs_n.png', dpi = 300)

plt.show()
[Figure: w0 vs. epoch number for the three optimizers, with the expected value 2.0 marked]

3D MSE Surface

You can use the code in this cell to plot the Adam trajectory in parameter space in 3D, overlaid on a 3D plot of the MSE loss surface.


from mpl_toolkits.mplot3d import Axes3D


# half-MSE as a function of w0 and w1 only (x and y are taken from the cells above)
def fun(w0, w1):
    return np.mean((y - hypothesis(x,w0,w1))**2)/2

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
w0_ = np.arange(1.0, 3.0, 0.05)
w1_ = np.arange(0.0, 1.0, 0.05)
w0__, w1__ = np.meshgrid(w0_, w1_)
zs = np.array([fun(xx,yy) for xx,yy in zip(np.ravel(w0__), np.ravel(w1__))])
Z = zs.reshape(w0__.shape)

ax.scatter3D(2.0, 0.5, 0.0, color = 'black') # expected minimum (the loss is zero there)
adamx = w0_adam_list[10:200] # Adam trajectory from epoch 10 onwards
adamy = w1_adam_list[10:200]
z = np.array([fun(xx,yy) for xx,yy in zip(np.ravel(adamx), np.ravel(adamy))])
ax.plot3D(adamx, adamy, z, color = 'black') # Adam trajectory on the loss surface

ax.view_init(50, -75)
ax.plot_surface(w0__, w1__, Z, alpha = 0.7)

ax.set_xlim(1.0,3.0)
ax.set_zlim(0.0,0.4)
ax.set_xlabel('w0', labelpad=10)
ax.set_ylabel('w1', labelpad=10)
ax.set_zlabel('MSE', labelpad=10)


plt.tight_layout()
plt.show()
[Figure: Adam trajectory overlaid on the 3D MSE loss surface]