Abstract
The notebook walks through the fundamental knowledge and intuition required to understand the inner workings of a minimally configured neural network model, using coding examples built from scratch. By avoiding any high-level machine learning API we strip away the abstraction and gain more in-depth, hands-on experience along the way.
The topics include 1.) an important section revisiting what linear and logistic regression are and how we can train these models using a first-order numerical optimizer, followed by 2.) a quick demonstration of automatic differentiation, which is considered the computing workhorse for training modern neural networks, 3.) a simple neural network solver built and examined from scratch, and finally 4.) regularization techniques to overcome the overfitting problem, followed by a discussion of batch normalization and gradient clipping.
Before we even start, it is of crucial importance to first understand linear and logistic regression deeply. This is because (as we will see in later discussion) they are the very basic components of a general neural network model.
A linear model can be written in a matrix form
\[ Y = X\beta + \epsilon, \]
where \(Y\) is the output or label vector of length \(N\) (number of observations), \(X\) is the input feature matrix (referred to as the design matrix in statistics) with dimension \(N\) by \(P\) (number of features), \(\beta\) is the model weights/coefficients (a column vector of length \(P\)) for which we’d like to solve, \(\epsilon\) is the model residual or error vector.
The classical way to solve for \(\beta\) is ordinary least squares (OLS). The idea of OLS is to find the weights that minimize the mean of the squared model errors:
\[ \begin{aligned} \mbox{mse (loss)} &= \frac{1}{N}\sum_i\epsilon^2_i \\ &= \frac{1}{N}\sum_i(y_i - \beta x_i)^2, \end{aligned} \]
where \(y_i\) and \(x_i\) are the \(i\)-th observation. The first-order condition (requiring that the first-order derivatives w.r.t. the weights are all zero) gives us the OLS solution for the model weights \(\beta\) analytically (in matrix notation):
\[ \begin{equation} \label{eq:ols} \hat{\beta} = (X'X)^{-1}X'Y. \end{equation} \]
import numpy as np
from numpy.linalg import inv
def ols(X, y):
return inv(X.T.dot(X)).dot(X.T).dot(y)
Let’s consider a toy model with only one non-constant feature:
\[ y_i = 6 + 4x_i + \epsilon_i. \]
In this model the outcome \(y\) is determined by a bias term plus a single variable \(x\), with an independently distributed noise term \(\epsilon \sim \mbox{Normal}(0, 1)\).
Create some random data generated from this model:
# Create toy example.
np.random.seed(777)
N = 1000
X = np.stack([np.ones(N), np.random.normal(size=N)], axis=1)
beta = np.array([6, 4], dtype=np.float32) # True model weights.
e = np.random.normal(size=N)
y = X.dot(beta) + e
print(X[:10])
[[ 1. -0.46820879]
[ 1. -0.82282485]
[ 1. -0.0653801 ]
[ 1. -0.71336192]
[ 1. 0.90635089]
[ 1. 0.76623673]
[ 1. 0.82605407]
[ 1. -1.32368279]
[ 1. -1.75244452]
[ 1. 1.00244907]]
from plotly.offline import plot
import plotly.graph_objs as go
def plot_offline(data, layout, ofile):
p = plot({"data": data, "layout": layout},
filename=ofile, auto_open=False,
include_plotlyjs=False)
return p
ofile = "plots/toy_reg.html"
p = plot_offline(
ofile=ofile,
data=[go.Scatter(x=X[:,1], y=y, mode="markers")],
layout=go.Layout(title="Data Generated from Toy Model",
xaxis=dict(title="x"),
yaxis=dict(title="y")))
Without the noise term (i.e., fitting on \(X\beta\) directly), the OLS estimator solves precisely for the true model weights:
[6. 4.]
Of course, in the real world the noise term cannot be observed, and the features alone usually cannot explain all the variation in the outcome. What we actually obtain is therefore an estimate whose expected value equals the true model weights:
[6.01132196 3.96247175]
We can check the result against scikit-learn (Pedregosa et al. (2011)):
from sklearn.linear_model import LinearRegression
# We set fit_intercept=False since our design matrix already contains intercept.
print(LinearRegression(fit_intercept=False).fit(X, y).coef_)
[6.01132196 3.96247175]
By the Law of Large Numbers and the Central Limit Theorem, under the relatively loose assumption that \(E(\epsilon | X) = 0\), the OLS estimator converges in probability to the true model weights and is asymptotically Normally distributed in large samples.1
The issue with the above approach is that equation \(\eqref{eq:ols}\) is not numerically stable in large-scale applications where we may have many observations and many features. One very useful way to solve for the estimator numerically at scale is the gradient descent approach.
Instead of solving the first-order condition analytically, we can do it numerically. Gradient descent is a first-order optimization technique for finding a local optimum of a given function.
In the model training exercise our target function is the loss so the optimization problem is:
\[ \operatorname*{argmin}_\beta \mbox{Loss} = \frac{1}{N}\sum_i(y_i - \beta x_i)^2. \]
That is, we'd like to find the model weights that minimize the loss, which for a regression model is defined as the mean squared error.
The idea of gradient descent is to start from an initial guess of the weights and repeatedly move them a small step in the direction opposite to the gradient of the loss, until the updates become negligible.
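In symbols, with \(\eta\) denoting the learning rate (a notation introduced here for illustration only), each iteration performs the update
\[ \beta^{(k+1)} = \beta^{(k)} - \eta \, \nabla_\beta \mbox{Loss}\big(\beta^{(k)}\big). \]
Iterating this rule drives the weights toward a point where the gradient vanishes, i.e., a local minimum of the loss.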
Let's use the toy example to actually implement a gradient descent optimizer from scratch. First we re-write the loss function explicitly with our setup of one coefficient plus a constant (\(\beta = [\beta_0, \beta_1]\)):
\[ \mbox{Loss} = \frac{1}{N}\sum_i\big[y_i-(\beta_0 + \beta_1x_i)\big]^2. \]
Now the gradient (or equivalently the first-order derivative) w.r.t. the weights will be:
\[ \begin{aligned} \frac{\partial\mbox{Loss}}{\partial\beta_0} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big], \\ \frac{\partial\mbox{Loss}}{\partial\beta_1} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big]x_i. \end{aligned} \]
The corresponding python function can be coded as:
def grad_func(X, y, beta):
"""Calculate vectorized gradients."""
return -2 * ((y - X.dot(beta)).dot(X)) / X.shape[0]
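As a quick sanity check (not part of the original notebook), we can compare this analytic gradient against a central-difference approximation; mse_loss and numerical_grad below are helper names introduced only for this check:
def mse_loss(X, y, beta):
    """Mean squared error of the linear model."""
    return ((y - X.dot(beta))**2).mean()
def numerical_grad(X, y, beta, eps=1e-6):
    """Central-difference approximation of the MSE gradient."""
    g = np.zeros_like(beta, dtype=float)
    for j in range(len(beta)):
        b_hi, b_lo = beta.astype(float).copy(), beta.astype(float).copy()
        b_hi[j] += eps
        b_lo[j] -= eps
        g[j] = (mse_loss(X, y, b_hi) - mse_loss(X, y, b_lo)) / (2*eps)
    return g
b_test = np.array([1.0, 1.0])
print(grad_func(X, y, b_test))       # Analytic gradient from above.
print(numerical_grad(X, y, b_test))  # Should agree to several decimal places.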
If we set the above equations to zero we can solve for \(\beta_0\) and \(\beta_1\) analytically, and the solution is exactly equation \(\eqref{eq:ols}\). But as we just pointed out, it suffers from numerical stability issues.
A minimal implementation of our gradient descent optimizer is just a few lines:
def gd_optimize(X, y, grad_func, lr=.01, n_step=100):
b = np.random.normal(size=X.shape[1])
out = b.copy()
for step in range(n_step):
b -= lr * grad_func(X, y, b)
out = np.row_stack([out, b]) # Trace result of each step.
return out
b = gd_optimize(X, y, grad_func, n_step=3000)
for s in [10, 50, 100, 500, 1000, 3000]:
print("Training Steps {:5} | Estimate: {}".format(s, b[s]))
Training Steps 10 | Estimate: [0.78324608 1.28348889]
Training Steps 50 | Estimate: [3.71164397 2.83692621]
Training Steps 100 | Estimate: [5.18681121 3.58312847]
Training Steps 500 | Estimate: [6.01108844 3.9624236 ]
Training Steps 1000 | Estimate: [6.01132195 3.96247175]
Training Steps 3000 | Estimate: [6.01132196 3.96247175]
As we can see, the result approaches our analytical solution as the number of steps grows.
The learning rate is a hyper-parameter of the gradient descent optimizer. The gradient update to the model weights is scaled down by the learning rate to help ensure convergence. A learning rate that is too large makes the updates explode and the solution diverge; one that is too small slows down convergence and can leave the optimizer trapped at a local sub-optimum.
Let's re-write our gradient descent optimizer to also track the loss at each training step. We use the same initialization for a fair comparison.
def loss(X, y, beta):
return ((X.dot(beta) - y)**2).mean()
def gd_optimize(X, y, grad_func, lr=.01, n_step=100):
b = np.array([0.0, 0.0])
l = [loss(X, y, b)]
for step in range(n_step):
b -= lr*grad_func(X, y, b)
l.append(loss(X, y, b))
return b, l
Now we run the optimization with a variety of learning rates. For illustration purposes we will only run a few steps.
betas = {}
losses = {}
for lr in [.001, .01, .05, .1, 1]:
betas[lr], losses[lr] = gd_optimize(X, y, grad_func, lr=lr, n_step=15)
for r, b in betas.items():
print("Learning Rate {:5} | Estimate: {}".format(r, b))
Learning Rate 0.001 | Estimate: [0.18155223 0.12379996]
Learning Rate 0.01 | Estimate: [1.60013258 1.08670659]
Learning Rate 0.05 | Estimate: [4.81547513 3.22246947]
Learning Rate 0.1 | Estimate: [5.81500752 3.85150789]
Learning Rate 1 | Estimate: [19.8158439 18.42703331]
The result suggests that a learning rate of 1 is too large for our problem: the gradient explodes, which makes our solution diverge. A lower learning rate in general converges more slowly to the optimum. The number of examples used to calculate the gradient also affects the convergence behavior; in general, if the sample size is small, a smaller learning rate should be used to avoid gradient explosion.
This can be more clearly seen if we plot the trace of our training losses:
ofile = "plots/toy_reg_loss_compare.html"
pdata = [go.Scatter(x=np.arange(len(losses[lr])), y=losses[lr], name=lr)
for lr in losses.keys()]
p = plot_offline(
ofile=ofile,
data=pdata,
layout=go.Layout(
title="Trace of Training Loss with Various Learning Rates",
xaxis=dict(title="Step"),
yaxis=dict(title="Loss")))
If the loss doesn't decrease over training iterations, it is a signal that something is wrong with our model.
Unlike our toy implementation, modern implementations of numerical optimizers include many techniques that do their best to avoid convergence failure. But it remains the model developer's responsibility to diagnose the training behavior before anything is delivered to the stakeholders. Checking the dynamics of the loss is usually the first and quickest step to verify that the training task is functioning as expected.
The vanilla gradient descent optimizer we just implemented has one issue: since each update needs to traverse the entire dataset, it becomes too slow for large datasets.
Stochastic gradient descent (SGD) overcomes this issue. Instead of calculating the gradient using the entire dataset, we use only one example per update. Each step is less precise, but statistically the final result should be consistent.
Here we introduce the term epoch: one epoch means the optimizer has traversed the entire dataset once. The number of epochs can be considered another hyper-parameter of the model.
Here is a minimal SGD implementation:
def sgd_optimize(X, y, grad_func, lr=.01, n_epoch=10):
b = np.random.normal(size=X.shape[1])
l = [loss(X, y, b)]
out = b.copy()
for epoch in range(n_epoch):
# Shuffle the dataset before each epoch.
sid = np.random.permutation(X.shape[0])
for i in sid:
b -= lr * grad_func(X[None,i,:], y[i], b)
l.append(loss(X[None,i,:], y[i], b))
# Trace the result per epoch.
out = np.row_stack([out, b])
return out, l
sgd_beta, sgd_loss = sgd_optimize(X, y, grad_func, lr=.01, n_epoch=100)
for epoch in [1, 5, 10, 50, 100]:
print("Training Epochs {:4} | Estimate: {}".format(epoch, sgd_beta[epoch]))
Training Epochs 1 | Estimate: [5.94304058 3.9934523 ]
Training Epochs 5 | Estimate: [5.90854388 3.87784646]
Training Epochs 10 | Estimate: [6.16073223 4.04277355]
Training Epochs 50 | Estimate: [5.92701188 3.86582267]
Training Epochs 100 | Estimate: [6.10863654 3.89402869]
SGD won't be as precise as vanilla GD, but for large-scale applications it can save a considerable amount of computing time (sometimes turning an infeasible task into a feasible one). In our simple problem, even a single epoch already gives a good approximation.
The trace of (per-instance) loss for the first 500 updates:
ofile = "plots/toy_reg_sgd_loss.html"
p = plot_offline(
ofile=ofile,
data=[go.Scatter(x=np.arange(len(sgd_loss[:500])), y=sgd_loss[:500])],
layout=go.Layout(
title="Trace of SGD Training Loss",
xaxis=dict(title="Step"),
yaxis=dict(title="Loss")))
Since the loss is calculated on a per-instance basis, it will fluctuate but with a decreasing trend if nothing went wrong about the optimization process.
In sklearn there are dedicated SGD classes for a variety of learning algorithms. For a linear regression model with an SGD solver:
from sklearn.linear_model import SGDRegressor
# Turn off the default elasticnet regularization for comparison.
print(SGDRegressor(alpha=0, l1_ratio=0, fit_intercept=False).fit(X, y).coef_)
[6.01665156 3.97557829]
To reduce the noise in SGD we can modify it by replacing the single training example with a batch of examples in each gradient update. Here is such an implementation of a batch gradient descent optimizer2:
def gd_batch_optimize(X, y, grad_func, lr=.01, n_epoch=10, batch_size=64):
b = np.random.normal(size=X.shape[1])
l = [loss(X, y, b)]
for epoch in range(n_epoch):
# Shuffle the dataset before each epoch.
sid = np.random.permutation(X.shape[0])
Xs = X[sid,:]
ys = y[sid]
i = 0
n_step = int(np.ceil(X.shape[0] / batch_size))
for step in range(n_step):
Xb = Xs[i:i+batch_size,:]
yb = ys[i:i+batch_size]
b -= lr*grad_func(Xb, yb, b)
l.append(loss(Xb, yb, b))
i += batch_size
return b, l
batch_beta, batch_loss = gd_batch_optimize(X, y, grad_func, n_epoch=100, batch_size=64)
print(batch_beta)
[6.01401595 3.96198209]
For this simple problem the batch size doesn't have any important impact, given enough training epochs:
for bsize in [8, 16, 32, 64, 128, 256]:
print("Batch Size: {:4} | Estimate: {}".format(
bsize,
gd_batch_optimize(X, y, grad_func, n_epoch=100, batch_size=bsize)[0]))
Batch Size: 8 | Estimate: [6.05434008 3.95548771]
Batch Size: 16 | Estimate: [6.01925548 3.97143916]
Batch Size: 32 | Estimate: [6.00071779 3.9636924 ]
Batch Size: 64 | Estimate: [6.01279841 3.96202727]
Batch Size: 128 | Estimate: [6.01202513 3.96292586]
Batch Size: 256 | Estimate: [6.00931051 3.9613001 ]
The batch optimizer is currently the best practice for training neural networks in large-scale applications. The batch size depends on the actual application but usually ranges from 8 (mini) to 1024 (large). Mini-batch stochastic gradient descent is so prevalent that in general when people talk about SGD for a neural network they are actually referring to batch gradient descent.
A logistic regression model describes the outcome probabilistically:
\[ \begin{equation} \label{eq:logit} P(Y = 1) = \frac{1}{1 + e^{-X\beta}}, \end{equation} \]
Here the sigmoid function \(s(t) = \frac{1}{1 + e^{-t}}\) is used to transform a real number into probability space \([0, 1]\).
We can interpret the model as a linear model in the log-odds. Assuming \(Y\) is binary and takes a value of 0 or 1, the odds of \(Y = 1\) are defined as \(\frac{P(Y = 1)}{P(Y = 0)} = \frac{P(Y = 1)}{1 - P(Y = 1)}\). We can re-arrange the logistic model equation:
\[ \begin{aligned} \ln \Bigg[ \frac{P(Y = 1)}{1 - P(Y = 1)} \Bigg] &= \ln \Bigg[ \frac{\frac{1}{1 + e^{-X\beta}}}{\frac{e^{-X\beta}}{1 + e^{-X\beta}}} \Bigg] \\ &= \ln(1 + e^{-X\beta}) - \ln \big[ e^{-X\beta}(1 + e^{-X\beta}) \big] \\ &= \ln(1 + e^{-X\beta}) - \ln e^{-X\beta} - \ln(1 + e^{-X\beta}) \\ &= - \ln e^{-X\beta} \\ &= X\beta. \end{aligned} \]
That is, the model weights are linear in the log-odds of our target outcome.
When the probability is transformed into odds, the range is transformed from \([0,1]\) to \([0,\infty)\).
# We plot only up to .95 since the value of odds will grow unboundedly.
prob = np.linspace(.001, .95, num=100)
odds = prob / (1 - prob)
log_odds = np.log(odds)
ofile = "plots/prob_odds.html"
p = plot_offline(
ofile=ofile,
data=[go.Scatter(x=prob, y=odds)],
layout=go.Layout(
title="Event Probabilities to Odds",
xaxis=dict(title="Probability"),
yaxis=dict(title="Odds")))
If we further transform odds into log-odds, the range is transformed from \([0,\infty)\) to \((-\infty,\infty)\).
ofile = "plots/prob_odds.html"
p = plot_offline(
ofile=ofile,
data=[go.Scatter(x=odds, y=log_odds)],
layout=go.Layout(
title="Event Odds to Log-Odds",
xaxis=dict(title="Odds"),
yaxis=dict(title="Log-Odds")))
Putting it together, this is the effect of the sigmoid function:
# Note that instead of coding s = 1 / (1 + np.exp(-t)),
# which is numerically unstable,
# we should use a math trick to make it stable.
def sigmoid(t):
"""Numerically stable sigmoid."""
return np.exp(-np.logaddexp(0, -t))
t = np.linspace(-10, 10, num=100)
ofile = "plots/sigmoid.html"
p = plot_offline(
ofile=ofile,
data=[go.Scatter(x=t, y=sigmoid(t))],
layout=go.Layout(
title="A Sigmoid Function",
xaxis=dict(title="Raw Value"),
yaxis=dict(title="Probability")))
How do we solve for the model weights \(\beta\) in equation \(\eqref{eq:logit}\)? In the linear regression model we solved for the weights by minimizing the mean squared error. For the logistic regression model we need to define a measure of modeling error as well.
Put differently, we'd like to calculate the distance between our predicted probability distribution and the real event label distribution (called the empirical distribution). In information theory, cross entropy is used to measure the distance between two probability distributions.
The entropy of a discrete probability distribution is defined as:
\[ \begin{equation} \label{eq:entropy} H(p) = - \sum_i p_i \log_2p_i, \end{equation} \]
where \(p_i\) is the probability for event \(i\). It measures the uncertainty of a stochastic event.
Take a coin flip as an example. We plot the entropy at different values of the coin bias (the probability of getting a head instead of a tail):
def entropy_binary(p):
return -p*np.log2(p) - (1 - p)*np.log2(1 - p)
prob = np.linspace(.01, .99, num=100)
ofile = "plots/entropy.html"
p = plot_offline(
ofile=ofile,
data=[go.Scatter(x=prob, y=entropy_binary(prob))],
layout=go.Layout(
title="Entropy of a Binary Event (Coin Flip)",
xaxis=dict(title="Probability of a Head"),
yaxis=dict(title="Entropy")))
It is obvious that the entropy of this event is maximized when the probability of flipping a head is exactly 0.5. At this level the event has the highest uncertainty, in the sense that it is the hardest case in which to predict the outcome of a flip.
What is the intuition behind entropy?
In equation \(\eqref{eq:entropy}\) we first need to interpret the seemingly mysterious term \(-\log_2p_i\). It is the log of the reciprocal of a probability. Generally speaking, the reciprocal of a probability \(\frac{1}{p(x)}\) is a 1-in-n restatement of the probability: for the outcome \(x\) to occur at least once, we expect to have to repeat the event \(n = \frac{1}{p(x)}\) times. The rarer the outcome, the higher the reciprocal of its probability. When a rare outcome occurs, it carries much more information than a frequent (more probable) one. Information in this sense can be considered a level of surprisal.
But why take the log? And why log base two in Shannon's entropy? This relates to the use of bits to encode messages/information/random states/facts. A bit (0 or 1) can encode two facts; for example, 0 for a tail and 1 for a head in a coin flip. Clearly, with \(N\) bits we can encode up to \(2^N\) different outcomes of a stochastic event.
The discussion is not limited to binary encodings. If a random event has 100 possible outcomes, then with 10 distinct symbols (0 to 9) we only need \(\log_{10}100 = 2\) digits to encode them (decimal 0 to 99).
Now put it together: the rarer the outcome, the higher the reciprocal of its probability, and the more bits required to encode this information (so as to distinguish it from other possible outcomes).
The formula \(\log_2\frac{1}{P(x)}\) calculates how many bits are necessary to encode a stochastic outcome \(x\) that may occur once out of \(\frac{1}{P(x)}\) times. Let's call this amount the information contained in outcome \(x\). The entropy in \(\eqref{eq:entropy}\) is then essentially the expected number of bits required to encode any possible outcome of the given stochastic event, or equivalently the expected information contained in that event.
One last thing to note is that the number of bits here is a mathematical artifact that cannot always be represented in our discrete real world. We certainly don't have non-integer bits in real-world computing, but the value of entropy is a positive real number not limited to integers.
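As a small worked example (not in the original text), consider a four-outcome event with probabilities 1/2, 1/4, 1/8 and 1/8: the outcomes need 1, 2, 3 and 3 bits respectively, so the expected number of bits is 1.75, a non-integer entropy.
# A four-outcome event: expected code length 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits.
p = np.array([.5, .25, .125, .125])
print(-(p*np.log2(p)).sum())  # 1.75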
Now move on to cross entropy. The cross entropy between two discrete probability distributions \(p\) and \(q\) over the same support is defined as:
\[ \begin{equation} \label{eq:crossentropy} H(p, q) = - \sum_i p_i \log_2q_i. \end{equation} \]
Intuitively, it is the expected number of bits required to encode all possibilities of a stochastic event when we encode it with a probability distribution that is NOT the event's own, for example a predicted distribution \(q\) based on a model, while the true distribution is \(p\).
To understand cross entropy one useful concept is the Kullback-Leibler divergence from \(q\) to \(p\):
\[ \begin{aligned} KL(p \vert\vert q) &= H(p, q) - H(p) \\ &= \sum_ip_i\log_2\frac{p_i}{q_i}. \end{aligned} \]
Mathematically it is the sum of the differences in log-probability between the two distributions \(p\) and \(q\), weighted by \(p\). Distribution \(p\) is the reference distribution and \(q\) the approximating distribution. It measures how closely the distribution \(q\) approximates \(p\). If \(q\) perfectly mimics \(p\), the KL divergence is 0 and the cross entropy between \(p\) and \(q\) equals the entropy of \(p\): \(H(p, q) = H(p)\).
Based on the above formula we can also interpret KL divergence as the extra bits required due to encoding a stochastic event with a wrong probability distribution.
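A minimal numerical illustration (added here, not from the original text): encode the four-outcome event above with a uniform model distribution \(q\) and compare against the true distribution \(p\).
def cross_entropy(p, q):
    """H(p, q): expected bits when p-distributed outcomes are encoded with a code built for q."""
    return -(p*np.log2(q)).sum()
p = np.array([.5, .25, .125, .125])  # True distribution.
q = np.array([.25, .25, .25, .25])   # Model (approximating) distribution.
print(cross_entropy(p, p))                        # Entropy of p: 1.75 bits.
print(cross_entropy(p, q))                        # Cross entropy: 2.0 bits.
print(cross_entropy(p, q) - cross_entropy(p, p))  # KL(p||q): 0.25 extra bits.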
For our logistic regression model, distribution \(p\) is the empirical distribution (label distribution) and distribution \(q\) is our model predicted distribution. Denote \(q_i = P(y_i = 1)\) for the prediction of \(i\)-th example. The cross-entropy-loss of our model hence can be written as:
\[ \mbox{Cross-Entropy Loss} = - \frac{1}{N}\sum_i^N \bigg[ y_i\log_2q_i + (1 - y_i)\log_2(1 - q_i)\bigg], \]
where \(y_i \in \{0,1\}\) is the \(i\)-th binary training label and \(P(y_i = 1)\) the model prediction for the \(i\)-th example out of \(N\) total training examples. This is the mean value of cross entropy for each training example. It measures how well our model is predicting the label based on the distributional difference between the labels and the model outputs.
Since \(q_i = P(y_i = 1)\) is expressed by our model equation \(\eqref{eq:logit}\), now we can apply gradient descent to the loss function to find out the optimum model weights that minimize the cross-entropy loss.
Before we implement our optimizer for a logistic regression model, we demonstrate that minimizing cross entropy is indeed equivalent to maximizing data likelihood.
The data likelihood is just the product of all predicted probabilities for individual example, provided that they are independent.
\[ \mbox{likelihood} = \prod_i^N q_i^{y_i}(1 - q_i)^{1 - y_i}. \]
The maximum likelihood estimator will try to maximize the log of the likelihood, which is:
\[ \begin{aligned} \mbox{log-lik} &= \log_2 \prod_i^N q_i^{y_i}(1 - q_i)^{1 - y_i} \\ &= \sum_i^N \log_2 q_i^{y_i}(1 - q_i)^{1 - y_i} \\ &= \sum_i^N \bigg[ y_i\log_2 q_i + (1 - y_i)\log_2(1 - q_i) \bigg]. \end{aligned} \]
By taking the average as well, the negative log-likelihood is exactly the cross entropy.
Though we didn’t discuss this linkage in linear regression, if we assume the error term in the model is independently distributed Normally, the OLS solution to the model weights (equation \(\eqref{eq:ols}\)) is indeed also the MLE solution.
Now let's also implement a batch gradient descent optimizer for a logistic regression model. We will use the same toy example and create additional random binary labels y2 for this exercise.
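The code that generates the binary labels y2 is not shown in the snippets here; a plausible sketch, assuming the labels are drawn from a logistic model using the same true weights as the linear toy example, would be:
# Hypothetical reconstruction: Bernoulli draws with success probability given by
# the sigmoid of the linear index (the true weights `beta` are an assumption).
np.random.seed(777)
y2 = np.random.binomial(n=1, p=sigmoid(X.dot(beta)))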
The gradient function must be derived with respect to model weights. To simplify the math we replace log of base 2 with natural log (with no impact on the optimization), and we take advantage of the fact that the derivative of the sigmoid function is:3
\[ \begin{aligned} \frac{ds(t)}{dt} &= \frac{d\frac{1}{1 + e^{-t}}}{dt} \\ &= -(1 + e^{-t})^{-2} \cdot (- e^{-t}) \\ &= \frac{e^{-t}}{(1 + e^{-t})^2} \\ &= \frac{1}{1 + e^{-t}} \cdot \frac{e^{-t}}{1 + e^{-t}} \\ &= s(t) \cdot [1 - s(t)]. \end{aligned} \]
The gradient in our univariate model w.r.t. the bias term will be:4
\[ \begin{aligned} \frac{\partial\mbox{LogLoss}}{\partial\beta_0} &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{\partial\ln q_i}{\partial\beta_0} + (1 - y_i)\frac{\partial\ln(1 - q_i)}{\partial\beta_0}\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}\frac{\partial q_i}{\partial\beta_0} + (1 - y_i)\frac{1}{1 - q_i}\frac{\partial(1 - q_i)}{\partial\beta_0}\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}\frac{\partial (\beta_0 + \beta_1x_i)}{\partial\beta_0}\frac{\partial q_i(t)}{\partial t} - (1 - y_i)\frac{1}{1 - q_i}\frac{\partial (\beta_0 + \beta_1x_i)}{\partial\beta_0}\frac{\partial q_i(t)}{\partial t}\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}q_i(1 - q_i) - (1 - y_i)\frac{1}{1 - q_i}q_i(1 - q_i)\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i(1 - q_i) - (1 - y_i)q_i \bigg] \\ &= - \frac{1}{N}\sum_i (y_i - q_i). \end{aligned} \]
Similarly for the weight:
\[ \frac{\partial\mbox{LogLoss}}{\partial\beta_1} = - \frac{1}{N}\sum_i (y_i - q_i)x_i. \]
Now the Python code for batch gradient descent with log-loss:
def logloss(X, y, b):
logloss_pos = y * np.log(sigmoid(X.dot(b)))
logloss_neg = (1 - y) * np.log(1 - sigmoid(X.dot(b)))
return - (logloss_pos + logloss_neg).mean()
def logit_grad_func(X, y, b):
return - (y - sigmoid(X.dot(b))).dot(X) / X.shape[0]
def cross_entropy_gd_optimize(X, y, lr=.01, n_epoch=10, batch_size=64):
b = np.random.normal(size=X.shape[1])
l = [logloss(X, y, b)]
for epoch in range(n_epoch):
# Shuffle the dataset before each epoch.
sid = np.random.permutation(X.shape[0])
Xs = X[sid,:]
ys = y[sid]
i = 0
n_step = int(np.ceil(X.shape[0] / batch_size))
for step in range(n_step):
Xb = Xs[i:i+batch_size,:]
yb = ys[i:i+batch_size]
b -= lr*logit_grad_func(Xb, yb, b)
l.append(logloss(Xb, yb, b))
i += batch_size
return b, l
log_beta, log_loss = cross_entropy_gd_optimize(X, y2, lr=.01, n_epoch=100)
print(log_beta)
[2.50014806 1.28482629]
ofile = "plots/toy_log_loss.html"
pdata = [go.Scatter(x=np.arange(len(log_loss)), y=log_loss)]
p = plot_offline(
ofile=ofile,
data=pdata,
layout=go.Layout(
title="Trace of Training Log-Loss",
xaxis=dict(title="Step"),
yaxis=dict(title="Loss")))
Compared to linear regression, a logistic regression model is harder to get to converge. Since our naive implementation does not do any convergence diagnostics, let's use R's built-in glm function, which uses Newton's method (a second-order optimizer utilizing not only first-order but also second-order derivatives), to check the estimation result on our toy example:
(Intercept) x
6.65212 4.50262
Now increase both the learning rate and training epochs of our naive gradient descent optimizer:
[6.66970964 4.45789135]
Seems better.
Or we can use the Python package statsmodels (Seabold and Perktold (2010)), which also uses a second-order optimizer by default, to check our result:
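The statsmodels call itself is omitted from the snippet; a hypothetical reconstruction that matches the printed output would be something like:
import statsmodels.api as sm
# Hypothetical call: Logit on the same design matrix X (which already contains
# the intercept column) and the binary labels y2.
print(sm.Logit(y2, X).fit().params)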
Optimization terminated successfully.
Current function value: 0.107650
Iterations 10
[6.65211962 4.50262017]
Or the sklearn result:
from sklearn.linear_model import LogisticRegression
# We need to set param C to an arbitrarily large number to suppress regularization
# since our naive approach doesn't implement any regularization.
# We set fit_intercept=False since our design matrix already contains intercept.
print(LogisticRegression(C=1e16, fit_intercept=False, solver="liblinear").fit(X, y2).coef_)
[[6.65211771 4.5026188 ]]
In the previous sections we implemented simple gradient descent optimizers by manually deriving the functional form of the gradient ourselves. This becomes troublesome as the model gets more and more complicated, as in the case of a deep neural net.
Automatic differentiation is a programming technique to calculate the gradient of any given function. One of the most popular libraries for this purpose is TensorFlow (Abadi et al. (2015)).
Let's use tensorflow to implement our simple gradient descent optimizer again. But this time we will NOT explicitly derive the gradient function; instead, we will only specify the target function, which is just the loss function of our model.
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
import tensorflow as tf
print(tf.__version__)
1.13.1
# Use tensor to represent our data.
# Note that we need to be very specific about dtype/shape of our tensors.
X_tf = tf.convert_to_tensor(X, dtype=tf.float32)
beta_tf = tf.reshape(tf.convert_to_tensor(beta, dtype=tf.float32), (2,1))
e_tf = tf.reshape(tf.convert_to_tensor(e, dtype=tf.float32), (N, 1))
y_tf = tf.matmul(X_tf, beta_tf) + e_tf
y2_tf = tf.reshape(tf.convert_to_tensor(y2, dtype=tf.float32), (N, 1))
def loss_mse_tf(X, y, beta):
y_hat = tf.matmul(X, beta)
return tf.reduce_mean(input_tensor = (y_hat - y)**2)
def logloss_tf_bad(X, y, beta):
# This can suffer from numerical stability issue.
logloss_pos = y * tf.log(tf.sigmoid(tf.matmul(X, beta)))
logloss_neg = (1 - y) * tf.log(1 - tf.sigmoid(tf.matmul(X, beta)))
return - tf.reduce_mean(logloss_pos + logloss_neg)
def logloss_tf(X, y, beta):
# tf.sigmoid is not numerically stable.
# But there is a convenient function to do the cross entropy calculation stably.
s = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=tf.matmul(X, beta))
return tf.reduce_mean(s)
def gd_optimize_tf(X, y, loss_func, lr=.01, n_step=100):
grad_func_tf = tf.contrib.eager.gradients_function(loss_func, params=[2])
beta = tf.random_normal((2, 1))
for step in range(n_step):
grad = grad_func_tf(X, y, beta)[0]
beta -= lr * grad
return beta.numpy()
print(gd_optimize_tf(X_tf, y_tf, loss_func=loss_mse_tf, n_step=3000)) # MSE loss.
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
[[6.0113106]
[3.9624665]]
[[6.6248612]
[4.482333 ]]
In the above coding example, note that we no longer need to hardcode the functional form of the gradients; we just plug in the loss function and let tensorflow do the gradient calculation for us. (The first pair of estimates is the MSE fit; the second pair comes from fitting the logistic model with logloss_tf on y2_tf, whose call is not shown above.)
Of course in actual development we will use higher-level APIs to implement our model, where the entire optimization process is abstracted away from the application code.
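As an illustration of such a higher-level API (not part of the original notebook, and assuming a TensorFlow version that ships tf.keras), the same linear regression could be fitted roughly as follows:
import tensorflow as tf
# A minimal sketch: a single Dense unit without bias reproduces y = X.dot(beta),
# since our design matrix X already contains the intercept column.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, use_bias=False, input_shape=(2,))])
model.compile(optimizer=tf.keras.optimizers.SGD(0.01), loss="mse")
model.fit(X, y, epochs=100, batch_size=64, verbose=0)
print(model.get_weights())  # Should be close to the OLS estimate.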
Both linear regression and logistic regression can be considered simple additive models of the form:
\[ \hat{y} = \Phi\bigg(\sum_{i=1}^Pw_ix_i\bigg), \]
where \(P\) is the number of features used, \(x_i\) is the \(i\)-th feature, and \(\Phi(\cdot)\) is a function applied to the output. In linear regression \(\Phi(\cdot)\) is simply an identity function. In logistic regression \(\Phi(\cdot)\) is the standard sigmoid function.
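As a tiny illustration of this shared form (added here for clarity; neuron is a name made up for this snippet), both models are just a weighted sum passed through some \(\Phi\):
def neuron(x, w, activation=lambda t: t):
    """A single additive unit: activation(sum_i w_i * x_i)."""
    return activation(np.dot(w, x))
x = np.array([1.0, 0.5])   # Bias input plus one feature value.
w = np.array([6.0, 4.0])   # Weights borrowed from the earlier toy model.
print(neuron(x, w))                                    # Identity activation: a linear regression prediction.
print(neuron(x, w, lambda t: 1.0/(1.0 + np.exp(-t))))  # Sigmoid activation: a logistic regression probability.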
Now consider ensembling multiple such additive models to produce a potentially better and more sophisticated model. We use the following diagram for illustration.
# Draw a simple neural nets.
DiagrammeR::grViz("
digraph subscript {
graph [layout = dot rankdir = LR ordering = in ranksep=2 splines=true]
node [shape = circle]
subgraph cluster_input_layer {
label = 'Input Layer'
x1 [label = 'x@_{1}']
x2 [label = 'x@_{2}']
x3 [label = 'x@_{3}']
}
subgraph cluster_hidden_layer {
label = 'Hidden Layer'
n11 [label = 'y@_{11}' color = blue fontcolor = blue]
n12 [label = 'y@_{12}' color = darkgreen fontcolor = darkgreen]
}
subgraph cluster_output_layer {
label = 'Output Layer'
n21 [label = 'y@_{2}']
}
edge [arrowsize = .25]
x1 -> n11 [label = 'w@_{111}' color = blue fontcolor = blue]
x1 -> n12 [label = 'w@_{112}' color = darkgreen fontcolor = darkgreen]
x2 -> n11 [label = 'w@_{211}' color = blue fontcolor = blue]
x2 -> n12 [label = 'w@_{212}' color = darkgreen fontcolor = darkgreen]
x3 -> n11 [label = 'w@_{311}' color = blue fontcolor = blue]
x3 -> n12 [label = 'w@_{312}' color = darkgreen fontcolor = darkgreen]
n11 -> n21 [label = 'w@_{1121}']
n12 -> n21 [label = 'w@_{1221}']
}")
In the above diagram, \(Y_{11}\) is a single additive model
\[ Y_{11} = \Phi(W_{111}X_1 + W_{211}X_2 + W_{311}X_3). \]
Similarly \(Y_{12}\) is another such model (with the same input feature set but different model weights)
\[ Y_{12} = \Phi(W_{112}X_1 + W_{212}X_2 + W_{312}X_3). \]
Model \(Y_2\) is yet another additive model but takes the output of the above two models:
\[ Y_2 = \Phi(W_{1121}Y_{11} + W_{1221}Y_{12}). \]
The above setup is a simple neural network architecture, with only one hidden layer of two neurons. (We also ignore the constant/bias term in each layer for simplicity.) A neuron is simply an additive model with a so-called activation function \(\Phi(\cdot)\) that transforms its output from an arbitrary real number into a scaled signal.
One can now easily see that a logistic regression model can be viewed as a degenerate neural network with a single neuron and a sigmoid activation function, and a linear regression model as a degenerate neural network with a single neuron and no activation function.
The above simple neural network is indeed a shallow one, with only one hidden layer. Conventionally, a deep neural network refers to a network with at least two hidden layers. The buzzword deep learning describes complicated neural network architectures that can solve problems traditional machine learning algorithms have difficulty with, especially in image (and video) and natural language (and voice) processing and understanding.
A deep model is indeed much more than just adding more hidden layers. The way neurons are connected and structured can be highly creative (convolutional and recurrent neural networks being outstanding examples) and is way beyond the scope of this notebook. Though a deep model is definitely more complicated than a shallow one, the foundation is all the same.
Why do we need the activation function? In the above neural network model, in the absence of an activation function the final output model \(Y_2\) degenerates into a simple linear model. We can use simpler notation to demonstrate this:
\[ \begin{aligned} y_1 &= ax + b, \\ y_2 &= cx + d, \\ y_3 &= e + fy_1 + gy_2 \\ &= e + f(ax + b) + g(cx + d) \\ &= \underbrace{(e + fb + gd)}_\text{Bias} + \underbrace{(fa + gc)}_\text{Weight}x. \end{aligned} \]
Without an activation function, no matter how many neurons or layers we design into our model, it eventually reduces to a simple linear model. With an activation function applied to each neuron, the model becomes non-linear and hence can handle much more complicated patterns hidden behind the data.
Some popular activation functions are the sigmoid, the hyperbolic tangent (tanh), and the rectified linear unit (ReLU).
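A small reference implementation of these three activations (added here, not from the original notebook; the _act suffix avoids shadowing the sigmoid defined earlier):
def sigmoid_act(t):
    """Logistic sigmoid, squashing to (0, 1)."""
    return np.exp(-np.logaddexp(0, -t))
def tanh_act(t):
    """Hyperbolic tangent, squashing to (-1, 1)."""
    return np.tanh(t)
def relu_act(t):
    """Rectified linear unit: max(0, t)."""
    return np.maximum(0, t)
t = np.array([-2., 0., 2.])
print(sigmoid_act(t), tanh_act(t), relu_act(t))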
To solve for the model weights in a neural network, we use a technique called backpropagation, which is essentially an iterative process of gradient descent applied layer by layer via the chain rule.
To simplify notation we assume each neuron is simply a univariate model. Consider the following minimum architecture:
DiagrammeR::grViz("
digraph subscript {
graph [layout = dot rankdir = LR ordering = in ranksep=2]
node [shape = circle]
x0 [label = '1']
x1 [label = 'x']
z0 [label = '1']
z1 [label = 'z']
y [label = 'y']
edge [arrowsize = .25]
x0 -> z1 [label = 'b@_{0}']
x1 -> z1 [label = 'w@_{0}']
z0 -> y [label = 'b@_{1}']
z1 -> y [label = 'w@_{1}']
}")
Mathematically:
\[ \begin{aligned} \hat{y} &= \Phi(b_1 + w_1z) \\ &= \Phi(b_1 + w_1\Phi(b_0 + w_0x)), \end{aligned} \]
where
\[ z = \Phi(b_0 + w_0x), \]
and
\[ \Phi(t) = \frac{1}{1 + e^{-t}}. \]
Though it may not be very meaningful to use the MSE loss when the output layer has an activation function applied, we can still do it for educational purposes.
\[ \mbox{MSE-Loss} = \frac{1}{N}\sum_i^N (y_i - \hat{y}_i)^2. \]
Firstly we take the derivative w.r.t. the weight in the last layer:
\[ \begin{aligned} \frac{\partial \mbox{MSE-Loss}}{\partial w_1} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)\frac{\partial \hat{y}_i}{\partial w_1} \\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \underbrace{ \frac{\partial t}{\partial w_1}\frac{\partial \Phi(t)}{\partial t} }_{t = b_1 + w_1\Phi(b_0 + w_0x)}\\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot z_i. \end{aligned} \]
Similarly for the bias in the last layer:
\[ \frac{\partial \mbox{MSE-Loss}}{\partial b_1} = - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)). \]
Now move on to the bias and weight in the first layer:
\[ \begin{aligned} \frac{\partial \mbox{MSE-Loss}}{\partial w_0} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)\frac{\partial \hat{y}_i}{\partial w_0} \\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \underbrace{ \frac{\partial t}{\partial w_0}\frac{\partial \Phi(t)}{\partial t} }_{t = b_1 + w_1\Phi(b_0 + w_0x)}\\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot \underbrace{ \Phi(k)(1 - \Phi(k)) }_{k = b_0 +w_0x} \cdot w_1 \cdot x_i, \\ \frac{\partial \mbox{MSE-Loss}}{\partial b_0} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot \Phi(k)(1 - \Phi(k)). \end{aligned} \]
One can clearly see there is a linkage between the derivative of the weights in consecutive layers:
\[ \begin{aligned} \frac{\partial \mbox{MSE-Loss}}{\partial w_1} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot z_i, \\ \frac{\partial \mbox{MSE-Loss}}{\partial w_0} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \underbrace{\Phi(t)(1 - \Phi(t))}_{\hat{y}_i(1 - \hat{y}_i)} \cdot \underbrace{\Phi(k)(1 - \Phi(k)) \cdot w_1 \cdot x_i}_{\frac{\partial w_1z}{\partial w_0} = \frac{\partial w_1\Phi(b_0 + w_0x)}{\partial w_0}}. \end{aligned} \]
Similarly we can derive the gradients for cross-entropy loss.
\[ \begin{aligned} \mbox{LogLoss} &= - \frac{1}{N}\sum_i^N \bigg[ y_i\ln\hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\bigg], \\ \frac{\partial \mbox{LogLoss}}{\partial w_1} &= - \frac{1}{N} \sum_i (y_i - \hat{y}_i)z_i, \\ \frac{\partial \mbox{LogLoss}}{\partial w_0} &= - \frac{1}{N} \sum_i (y_i - \hat{y}_i)\Phi(k)(1 - \Phi(k))w_1x_i.\\ \end{aligned} \]
(Skip the bias terms to save space.)
Now let’s implement the simple neural network model in Python:
def sigmoid(x):
return np.exp(-np.logaddexp(0, -x))
def sigmoid_grad(z):
return z*(1 - z)
class NeuralNetwork:
"""Simple neural network with 1 hidden layer of 4 neurons."""
def __init__(self, x, y):
self.input = x
self.w1 = np.random.rand(self.input.shape[1], 4)
self.w2 = np.random.rand(4, 1)
self.y = y
self.output = np.zeros(y.shape)
def feedforward(self):
self.layer1 = sigmoid(np.dot(self.input, self.w1))
self.output = sigmoid(np.dot(self.layer1, self.w2))
def backprop(self, lr):
raise NotImplementedError
def train(self, lr, n_step):
for step in range(n_step):
self.feedforward()
self.backprop(lr=lr)
yhat = ",".join([str(y) for y in np.squeeze(self.output)])
return yhat
class NerualNetworkWithMSELoss(NeuralNetwork):
def backprop(self, lr):
# We ignore all constant terms in the gradient.
self.dy = self.y - self.output
dw2 = - np.dot(self.layer1.T, (self.dy*sigmoid_grad(self.output)))
dw1 = - np.dot(self.input.T,
(np.dot(self.dy*sigmoid_grad(self.output),
self.w2.T)*sigmoid_grad(self.layer1)))
self.w1 -= lr * dw1
self.w2 -= lr * dw2
class NerualNetworkWithLogLoss(NeuralNetwork):
def backprop(self, lr):
# We ignore all constant terms in the gradient.
self.dy = self.y - self.output
dw2 = - np.dot(self.layer1.T, self.dy)
dw1 = - np.dot(self.input.T,
np.dot(self.dy, self.w2.T)*sigmoid_grad(self.layer1))
self.w1 -= lr * dw1
self.w2 -= lr * dw2
Here we will use the hello-world example of artificial neural networks: the XOR problem. The XOR logical outcome, besides being extremely simple, is not linearly separable, so it serves as a good example to showcase a neural network's non-linearity.
The input data is simply the combinations of two binary switches. (For completeness we also include a constant term, which always evaluates to 1, as the first feature.) The output data is the XOR result.
Xnn = np.array(
[[1, 0, 0],
[1, 1, 0],
[1, 0, 1],
[1, 1, 1]])
ynn = np.array([[0], [1], [1], [0]])
print(Xnn) # Input features: Bias + two binary switches.
[[1 0 0]
[1 1 0]
[1 0 1]
[1 1 1]]
[[0]
[1]
[1]
[0]]
Now let’s see if our simple neural network model can learn the XOR pattern.
nn_mse = NerualNetworkWithMSELoss(Xnn, ynn)
print(nn_mse.train(lr=1, n_step=3000)) # The predicted XOR outcome.
0.013322228699994642,0.9755617091310563,0.9799803737183681,0.024680348085204088
[[-1.78102396 -4.57750429 1.42272976 -0.69372436]
[ 6.06730656 2.57250239 5.87690076 -1.30777168]
[ 5.6074491 3.44863401 -3.70793822 4.57256567]]
[[10.35252814]
[-5.91518016]
[-5.30338267]
[-4.3944565 ]]
0.0007966687865677773,0.9998283998346575,0.9990768834133713,0.0004795536969026687
For this simple example, both the MSE and the cross-entropy loss can learn the XOR pattern. (Strictly speaking the cross-entropy loss performs better; the last line of output above shows its predictions.)
Regularization is a technique to mitigate overfitting. It is not particular to neural networks but general to all machine learning models.
Regularization adds a constraint onto the original optimization problem (loss minimization). Considering the model weights as a real vector, we usually constrain its size in the optimization using the norm of the vector, which is a measure of the vector's size. The general idea is to add a penalty to the target function (the loss function) so that it discourages large weights (which increase the size of the vector) in achieving the minimization goal.
In general, MSE loss minimization with a p-norm regularization for a linear model can be written (in matrix form) as:
\[ \begin{aligned} \hat{y} &= X\mathrm{B}, \\ \operatorname*{argmin}_\beta \mbox{Loss} &= \frac{1}{N}\bigg[ \underbrace{(y - X\mathrm{B})^T(y - X\mathrm{B})}_\text{sum of squared errors} + \underbrace{\lambda\Vert\mathrm{B}\Vert_{p}}_\text{p-norm regularization} \bigg], \end{aligned} \]
with a model weight vector \(\mathrm{B} = [\beta_1, \beta_2, ..., \beta_k]\). The parameter \(\lambda\) is a new hyper-parameter introduced by the regularization.
Or put it in a scalar form:
\[ \begin{equation} \label{eq:loss_min_reg} \operatorname*{argmin}_\beta \mbox{Loss} = \frac{1}{N}\sum_{i=1}^N\bigg( y_i - \mathrm{B}x_i\bigg)^2 + \frac{\lambda}{N}\bigg(\sum_{j=1}^k\vert\beta_j\vert^p\bigg)^{1/p}. \end{equation} \]
Popular choices of \(p\) are \(p = 1\) (L1-Norm) and \(p = 2\) (L2-Norm). Notably, \(p\) can also be a non-integer real value, which can sometimes give even better results than the integer norms, but the computational difficulty makes such norms much less desirable in practice.
In practice, we also drop the root in the norm for computational simplicity. So the problem indeed becomes
\[ \operatorname*{argmin}_\beta \mbox{Loss} = \frac{1}{N}\sum_{i=1}^N\bigg( y_i - \mathrm{B}x_i\bigg)^2 + \frac{\lambda}{N}\bigg(\sum_{j=1}^k\vert\beta_j\vert^p\bigg). \]
There is also the so-called elastic net regularization, which combines the L1 and L2 norms to give potentially better results. However, it requires considerably more computing resources and hence becomes much less feasible for a neural network model.
The L2-Norm shrinks the size of the model weights. Weights that contribute less to loss minimization shrink more.
Back to our univariate linear model \(\hat{y}_i = \beta_0 + \beta_1x_i\) for symbolic simplicity. Our MSE loss with an L2 penalty is
\[ \mbox{MSE-Loss} = \frac{1}{N}\sum_i\big[y_i-(\beta_0 + \beta_1x_i)\big]^2 + \frac{\lambda}{N}(\beta_0^2 + \beta_1^2). \]
Since the penalty is additive, the gradient is extremely easy:
\[ \begin{aligned} \frac{\partial\mbox{MSE-Loss}}{\partial\beta_0} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big] + \frac{2}{N}\lambda\beta_0, \\ \frac{\partial\mbox{MSE-Loss}}{\partial\beta_1} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big]x_i + \frac{2}{N}\lambda\beta_1. \end{aligned} \]
A linear regression model with L2 regularization is referred to as ridge regression. Be aware that when implementing the batch optimizer we need to scale down the update from the L2 norm by the batch size, in order to balance the size of the L2 penalty against the original loss.
def mse_grad_l2(X, y, beta):
a = 1 # Hardcode the regularization coefficient for simplicity.
return -2 * ((y - X.dot(beta)).dot(X)) / X.shape[0] + 2 * a * beta / X.shape[0]
batch_beta, batch_loss = gd_batch_optimize(X, y, mse_grad_l2, n_epoch=100, batch_size=64)
print(batch_beta)
[5.917945 3.90542865]
The resulting estimates are smaller than those from vanilla gradient descent.
One thing worth noting is that when batch gradient descent is used along with regularization, a side effect arises because the regularization term is scaled by the batch size rather than the full sample size.
Experiments with different batch sizes with l2 regularization:
for bsize in [8, 16, 32, 64, 128, 256, 512, 1000]:
print("Batch Size: {:4} | Estimate: {}".format(
bsize,
gd_batch_optimize(X, y, mse_grad_l2, n_epoch=500, batch_size=bsize)[0]))
Batch Size: 8 | Estimate: [5.40822766 3.58409232]
Batch Size: 16 | Estimate: [5.65860196 3.73012032]
Batch Size: 32 | Estimate: [5.82685968 3.81718301]
Batch Size: 64 | Estimate: [5.91404397 3.90443054]
Batch Size: 128 | Estimate: [5.9665459 3.93418727]
Batch Size: 256 | Estimate: [5.98786988 3.94779378]
Batch Size: 512 | Estimate: [5.99939005 3.95494569]
Batch Size: 1000 | Estimate: [6.00521432 3.95865926]
Let's check the result using sklearn's Ridge regressor (which doesn't use gradient descent as its solver):
# We set fit_intercept=False since our design matrix already contains intercept.
# According to the doc the constant 2 in the derivative is not included so we set alpha = 2.
from sklearn.linear_model import Ridge
print(Ridge(alpha=2, fit_intercept=False).fit(X, y).coef_)
[5.99956319 3.95500059]
The result is close to gradient descent with nearly a full-batch update (i.e., using the entire training set for each update).
Exactly the same logic can apply to logistic regression and is not discussed here to avoid redundancy.
The L1-Norm prefers sparsity in the model weights. That is, it may zero out weights that do not contribute to the loss, while the L2-Norm only makes them arbitrarily small.
MSE loss with l1 norm:
\[ \mbox{MSE-Loss} = \frac{1}{N}\sum_i\big[y_i-(\beta_0 + \beta_1x_i)\big]^2 + \lambda(\vert\beta_0\vert + \vert\beta_1\vert), \]
with gradients:5
\[ \begin{aligned} \frac{\partial\mbox{MSE-Loss}}{\partial\beta_0} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big] + \lambda\frac{\beta_0}{\vert\beta_0\vert}, \\ \frac{\partial\mbox{MSE-Loss}}{\partial\beta_1} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big]x_i + \lambda\frac{\beta_1}{\vert\beta_1\vert}. \end{aligned} \]
A linear regression model with L1 regularization is referred to as lasso regression. Let's first check the result using sklearn's Lasso regressor (which uses a coordinate descent solver):
# We set fit_intercept=False since our design matrix already contains intercept.
from sklearn.linear_model import Lasso
print(Lasso(alpha=1, fit_intercept=False).fit(X, y).coef_)
[5.042239 3.00143701]
To derive roughly the same result, we need to calibrate the coefficient on the regularization term in our gradient descent solver:
def mse_grad_l1(X, y, beta):
# a is set to 2 since in sklearn convention the entire loss is scaled by a factor of 1/2
# to cancel out the derivative constant on the squared error term.
a = 2
return -2 * ((y - X.dot(beta)).dot(X)) / X.shape[0] + a * np.sign(beta)
# We use a full-batch gradient descent here.
batch_beta, batch_loss = gd_batch_optimize(X, y, mse_grad_l1, n_epoch=1000, batch_size=1000)
print(batch_beta)
[5.04223889 3.00143701]
We can also check the consistency by calling sklearn's SGDRegressor:
from sklearn.linear_model import SGDRegressor
print(SGDRegressor(penalty="l1", alpha=1, fit_intercept=False).fit(X, y).coef_)
[5.04755575 3.02641941]
Now let’s purposely add one random feature as a noise into the design matrix and see how the regularization helps to shrink the weight of the noise.
# Introduce a totally irrelevant feature into our linear model.
noise = np.random.uniform(size=N).reshape(N, 1)
Xnoise = np.concatenate([X, noise], axis=1)
print(Ridge(alpha=2, fit_intercept=False).fit(Xnoise, y).coef_)
[5.98214338 3.95518688 0.03392461]
[5.042239 3.00143701 0. ]
As one can see, the Lasso regressor is able to completely wipe out the weight on the redundant noise feature, while Ridge can only shrink it toward zero.
To arrive at roughly the same result as Ridge using our simple gradient descent solver:
[5.97074062 3.95918406 0.0629631 ]
And for the Lasso case:
[ 5.02107606e+00 3.00162683e+00 -2.54383624e-03]
One thing to note: it is in general difficult to arrive at an exactly sparse solution (zero weights) with a naive gradient descent implementation such as ours. Further mathematical tricks such as the proximal gradient method must be applied in order to produce sparse results within finite iterations.
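A minimal sketch of such a proximal gradient step (ISTA) for the lasso problem, assuming the same MSE-plus-\(\lambda\) L1 objective used above (soft_threshold and ista are names introduced here, not from the original notebook):
def soft_threshold(b, t):
    """Proximal operator of the L1 norm: shrink towards zero and cut off at zero."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)
def ista(X, y, lam=1.0, lr=.1, n_step=5000):
    """Proximal gradient descent for MSE loss + lam * L1 penalty."""
    b = np.zeros(X.shape[1])
    for _ in range(n_step):
        grad = -2 * (y - X.dot(b)).dot(X) / X.shape[0]  # Gradient of the smooth MSE part only.
        b = soft_threshold(b - lr*grad, lr*lam)         # Then apply the L1 proximal step.
    return b
# lam=2 should roughly mirror sklearn's Lasso(alpha=1) under this convention;
# the weight on the pure-noise feature can be driven exactly to zero.
print(ista(Xnoise, y, lam=2))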
Finally, again, exactly the same logic applies to logistic regression and is not discussed further to avoid redundancy.
Consider a two-parameter model, where we plot the two possible weights \(w_1\) and \(w_2\) on a Euclidean coordinate system. The MSE loss minimization problem stated in equation \(\eqref{eq:loss_min_reg}\) can be rewritten as an optimization under a norm constraint:
\[ \begin{aligned} & \operatorname*{argmin}_\beta & & \frac{1}{N}\sum_{i=1}^N\bigg( y_i - \mathrm{B}x_i\bigg)^2 \\ & \text{subject to} & & \bigg(\sum_{j=1}^k\vert\beta_j\vert^p\bigg)^{1/p} \le t. \end{aligned} \]
Solving the above problem (using, for example, the method of Lagrange multipliers) results in exactly the same target function as in equation \(\eqref{eq:loss_min_reg}\).
Now if we plot the constraint region in this space (for the case of \(k = 2\) and an arbitrary value of the constant \(t\)), the feasible parameter combinations fall inside a circle for \(p = 2\) (L2 regularization) or a diamond (a rotated square) for \(p = 1\) (L1 regularization), both centered at the origin.
t = 2 # Ad-hoc constant constraint on the norm.
# Code some reusable objects.
shapes = [
dict(type="circle", x0=-t, y0=-t, x1=t, y1=t,
fillcolor="darkgreen", opacity=0.25, line=dict(color="darkgreen")),
dict(type="path", path="M 0,-{t} L-{t},0 L0,{t} L{t},0 Z".format(t=t),
fillcolor="blue", opacity=0.25, line=dict(color="blue"))]
w1_axis = dict(title=r"$w_{1}$", range=[-t*1.5, t*1.5])
w2_axis = dict(title=r"$w_{2}$", range=[-t*1.5, t*1.5], scaleanchor="x", scaleratio=1)
ofile = "plots/reg_geom.html"
p = plot_offline(
ofile=ofile,
data=[
go.Scatter(
x=[t+1], y=[t], mode="text", name="L2 Regularization",
text=[r"$\sqrt{w_1^1 + w_2^2} <= " + "{}$".format(t)],
textfont=dict(color="darkgreen")),
go.Scatter(
x=[t+1], y=[-t], mode="text", name="L1 Regularization",
text=[r"$\vert w_1 \vert + \vert w_2 \vert <= " + "{}$".format(t)],
textfont=dict(color="blue"))
],
layout=go.Layout(title="Parameter Space", showlegend=True,
xaxis=w1_axis, yaxis=w2_axis, shapes=shapes))
Simply plotting the constraint area is not that informative. We also need to visualize the possible values of the loss function.
Let's use the toy model from our earlier linear regression section for the visualization. We will iterate over the coordinate space within \(-10 \le w_1 \le 10\) and \(-10 \le w_2 \le 10\) to collect all the resulting loss values. The loss becomes a third axis added onto the existing plane.
z_loss = []
_x = np.arange(-10, 10)
_y = np.arange(-10, 10)
for _w1 in _x:
for _w2 in _y:
z_loss.append(loss(X, y, [_w1, _w2]))
z_loss = np.asarray(z_loss).reshape(20, 20, order="F")
w1, w2 = ols(X, y) # Unregularized OLS solution.
ofile = "plots/3d_loss.html"
p = plot_offline(
ofile=ofile,
data=[
go.Surface(
x=_x, y=_y, z=z_loss,
contours=go.surface.Contours(
z=go.surface.contours.Z(
show=True,
usecolormap=True,
highlightcolor="#42f462",
project=dict(z=True)))),
go.Scatter3d(
x=[w1], y=[w2], z=[0], mode="markers",
marker=dict(symbol="cross", color="blue"))
],
layout=go.Layout(
title="MSE Loss on Parameter Space",
scene=dict(
# Latex notation doesn't work in 3d render. Known issue of plotly.js.
xaxis=dict(title="w1"),
yaxis=dict(title="w2"),
zaxis=dict(title="Loss"))))
Note that we also plot the contours on the original 2-d plane. A contour depicts all the weight combinations that lead to the same amount of loss. And we mark the minimum with a cross to clearly point out the optimal loss, which is achieved by the unregularized OLS estimator in equation \(\eqref{eq:ols}\).
Now if we combine the two plots together, especially the contour on the parameter plane, the tangent point between the regularization region and the contour is exactly our regularized solution for the weights.6
ofile = "plots/loss_contour_reg.html"
p = plot_offline(
ofile=ofile,
data=[
go.Contour(
x=_x, y=_y, z=np.log(z_loss),
contours=dict(coloring="lines"),
colorbar=dict(title="Ln(MSE Loss)"),
line=dict(width=2))
],
layout=go.Layout(title="MSE Loss on Parameter Space",
xaxis=w1_axis, yaxis=w2_axis, shapes=shapes))
The constraint constant controls the amount of regularization: for a larger constant (a looser constraint) the tangent point moves closer to the unregularized optimum. From this geometric point of view, one should also realize that since the L1 regularization region is a sharp diamond, it is more likely to touch the contour at one of its corners, which represents exactly a sparse solution for the weights (at each corner one of the weights is zeroed out).
If we take a Bayesian approach to the weight estimation, the task is to solve for the posterior distribution of the weights:
\[ P(W|y) = \frac{P(y|W) \cdot P(W)}{P(y)} \propto P(y|W) \cdot P(W). \]
then a Normal prior on the weights \(W \sim \mbox{Normal}\) results in a maximum a posteriori estimator whose target function contains the L2 regularization term, and a Laplace prior \(W \sim \mbox{Laplace}\) results in the L1 term.
This is quite straightforward if we take a look at the probability density function of these two distributions:
\[ \begin{aligned} \mbox{Normal}(\mu, \sigma)_\text{pdf} &= f(w) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(w - \mu)^2}{2\sigma^2}}, \\ \mbox{Laplace}(\mu, b)_\text{pdf} &= g(w) = \frac{1}{2b}e^{-\frac{\vert w - \mu\vert}{b}}. \end{aligned} \]
Now if we want to maximize the log of \(P(W|y)\) (maximum a posteriori), the gradient to the log of a Normal prior will give us:
\[ \begin{aligned} \frac{\partial \ln f(w)}{\partial w} &= \frac{\partial \bigg[-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(w - \mu)^2}{2\sigma^2}\bigg]}{\partial w} \\ &= - \frac{1}{\sigma^2}(w - \mu), \end{aligned} \]
and for a Laplace prior:
\[ \begin{aligned} \frac{\partial \ln g(w)}{\partial w} &= \frac{\partial \bigg[-\ln(2b) - \frac{\vert w - \mu \vert}{b}\bigg]}{\partial w} \\ &= - \frac{1}{b}\frac{w - \mu}{\vert w - \mu \vert}. \end{aligned} \]
The regularization coefficient we previously discussed (hyper-parameter \(\lambda\)) can be mapped to the constant on the (reciprocal of) scale parameter of the Bayesian prior. A larger \(\lambda\) corresponds to a smaller variance on the prior, hence the size of weight is being constrained more in the posterior estimation, resulting in smaller weight.
We can plot the standardized Normal and Laplace distribution to clearly see their difference:
from scipy.stats import laplace, norm
xticks = np.linspace(-5, 5, num=1000)
ofile = "plots/reg_prior.html"
p = plot_offline(
ofile=ofile,
data=[go.Scatter(x=xticks, y=laplace.pdf(xticks), name="Laplacean Prior (L1)"),
go.Scatter(x=xticks, y=norm.pdf(xticks), name="Normal Prior (L2)")],
layout=go.Layout(
title="Bayesian Prior on Model Weight",
xaxis=dict(title="Model Weight"),
yaxis=dict(title="Density")))
Compared to a Normal prior, a Laplace prior puts much more density at the center. When both distributions are standardized (location 0 and scale 1), as plotted above, this can be interpreted as the Laplace prior having a stronger preference for a sparse solution (weight = 0).
Another practical, and perhaps easier, way to regularize is the dropout mechanism (Srivastava et al. (2014)). It was first introduced for neural network models, but the concept has also been successfully applied to some other machine learning algorithms (such as gradient boosting trees).
Dropping out a neuron means that it is ignored in the backprop training process. Take our previous simple neural network model as an example: there are two neurons in the hidden layer, and if we drop out the first unit the model temporarily becomes:
# Draw a simple neural nets.
DiagrammeR::grViz("
digraph subscript {
graph [layout = dot rankdir = LR ordering = in ranksep=2 splines=true]
node [shape = circle]
subgraph cluster_input_layer {
label = 'Input Layer'
x1 [label = 'x@_{1}']
x2 [label = 'x@_{2}']
x3 [label = 'x@_{3}']
}
subgraph cluster_hidden_layer {
label = 'Hidden Layer'
n11 [label = 'y@_{11}' color = grey fontcolor = grey]
n12 [label = 'y@_{12}' color = darkgreen fontcolor = darkgreen]
}
subgraph cluster_output_layer {
label = 'Output Layer'
n21 [label = 'y@_{2}']
}
edge [arrowsize = .25]
x1 -> n12 [label = 'w@_{112}' color = darkgreen fontcolor = darkgreen]
x2 -> n12 [label = 'w@_{212}' color = darkgreen fontcolor = darkgreen]
x3 -> n12 [label = 'w@_{312}' color = darkgreen fontcolor = darkgreen]
n12 -> n21 [label = 'w@_{1221}']
}")
When a neuron is dropped out, its connections to the rest of the network are all removed. This means that both the feed-forward pass and backpropagation will ignore the discarded connections.
Usually dropout is applied to a model by setting a dropout rate on one or more layers. For example we can set the dropout rate to 50% at the first hidden layer. This means that for each batch gradient update, every neuron in that layer has a 50% chance of being ignored before the training step.
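As a minimal sketch (not part of our earlier solver; `h` and `drop_rate` are illustrative names), masking a hidden-layer activation matrix with a 50% dropout rate for one training step could look like this:

import numpy as np

def dropout_mask(h, drop_rate=0.5, rng=np.random):
    # Sample a binary mask: each unit of each example has a `drop_rate`
    # chance of being zeroed out for this training step.
    mask = (rng.uniform(size=h.shape) >= drop_rate).astype(h.dtype)
    return h * mask

# Example: a batch of 4 examples with 2 hidden units, 50% dropout.
h = np.random.normal(size=(4, 2))
print(dropout_mask(h, drop_rate=0.5))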
Dropout is not limited to hidden layers. One can also apply dropout at the input layer, but usually at a lower rate (commonly 20%) than for hidden layers. Dropout is never applied at the output layer, since the output layer is required to define the model output.
After dropout is applied, a neural network becomes thinner in the sense that it has fewer weights responding to both the input feed-forward and the error backprop. The rationale is to prevent the neurons from co-adapting too much during training, thereby avoiding overfitting to the training data.
There is one side-effect when using dropout to train a neural network. When a trained model is put to an inference task, the output from the single full network (with all neurons unmasked) no longer represents what was learned during training, unless we run exactly the same set of subnets used in training and average their predictions, which is not practical.
To solve this problem, we still use only the original network (with all neurons unmasked) to feed forward for prediction, but every model weight is scaled by the retention probability (one minus the dropout rate) of the corresponding neuron. Many empirical results show that this weight-scaling technique works very well. Probabilistically, the model output then approximates the expected output over all the subnets induced by the dropout scheme.
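A minimal sketch of this weight-scaling rule at inference time, following Srivastava et al. (2014), where the outgoing weights of a unit retained with probability \(p\) during training are multiplied by \(p\) at test time; the names `W` and `keep_prob` are illustrative only:

import numpy as np

def scale_weights_for_inference(W, keep_prob):
    # At inference all neurons are active; multiplying the outgoing weights
    # of the dropped-out layer by its retention probability keeps the
    # expected input to the next layer the same as during training.
    return W * keep_prob

# Example: hidden layer trained with a 50% dropout rate (keep_prob = 0.5).
W_hidden_to_output = np.random.normal(size=(2, 1))
W_inference = scale_weights_for_inference(W_hidden_to_output, keep_prob=0.5)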
In a general additive model a bias term is just a feature with a constant value. In a neural network, when a bias term is used in a hidden layer it has no connection to the previous layer but simply acts as one extra unit linked to the next layer. (See our built-from-scratch toy example in the previous section.)
It is easier to understand why we may want a bias term in a linear regression model: without a bias term we are essentially restricting the fitted model to pass through the origin, which is unrealistic in most use cases. On the other hand, adding a bias term does not prevent that special scenario from occurring. If the true model does pass through the origin, then given enough data the OLS estimator should estimate the weight on the bias to be insignificantly different from zero.
Here is a simple example to showcase estimation on a true model without bias:
np.random.seed(777)
N = 1000
X = np.stack([np.ones(N), np.random.normal(size=N)], axis=1)
beta = np.array([0, 4], dtype=np.float32) # True model weights where bias is zero.
e = np.random.normal(size=N)
y = X.dot(beta) + e
# We use `statsmodels` here because it comes with hypothesis testing on the coefficients.
import statsmodels.api as sm
print(sm.OLS(y, X).fit().summary2())
Results: Ordinary least squares
==================================================================
Model: OLS Adj. R-squared: 0.937
Dependent Variable: y AIC: 2901.1253
Date: 2019-11-21 00:19 BIC: 2910.9408
No. Observations: 1000 Log-Likelihood: -1448.6
Df Model: 1 F-statistic: 1.487e+04
Df Residuals: 998 Prob (F-statistic): 0.00
R-squared: 0.937 Scale: 1.0632
--------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
--------------------------------------------------------------------
const 0.0113 0.0326 0.3471 0.7286 -0.0527 0.0753
x1 3.9625 0.0325 121.9539 0.0000 3.8987 4.0262
------------------------------------------------------------------
Omnibus: 4.505 Durbin-Watson: 1.973
Prob(Omnibus): 0.105 Jarque-Bera (JB): 5.227
Skew: -0.050 Prob(JB): 0.073
Kurtosis: 3.340 Condition No.: 1
==================================================================
As one can see, the estimated weight on the bias is statistically insignificant.
Should regularization apply to the bias term? In our previous working example we applied norm regularization on both the bias and the weights. Practically speaking there may not be any benefit in regularizing the bias term. The rationale is that the bias comes from a constant feature; such a feature cannot, in theory, overfit the data since it does not vary with the data at all.
Indeed, regularizing the bias term may even have a negative impact. By regularizing a term that is not exposed to any risk of overfitting, the model is constrained without any benefit: we only end up with a less flexible parameter space, which could be suboptimal.
To sum up, the general strategy of regularization is NOT to regularize the bias term, if there is one.
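As a tiny sketch of this strategy (assuming, as in our earlier convention, that the first element of the weight vector is the bias), excluding the bias from an L2 penalty gradient simply means zeroing out its entry:

import numpy as np

def l2_penalty_grad(b, lam, regularize_bias=False):
    # Gradient of lam * sum(b_j ** 2); by default skip the bias at index 0.
    g = 2.0 * lam * b.copy()
    if not regularize_bias:
        g[0] = 0.0
    return g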
Before we talk about batch normalization, let’s examine the impact of feature normalization on a simple linear regression model.7
Given a simple univariate regression model:
\[ y = a + bx + \epsilon. \]
What is the impact if we standardize \(x\)? It means that we estimate the following model instead:
\[ y = c + d\bigg(\frac{x - \mu_x}{\sigma_x}\bigg) + \eta. \]
Now simply a re-arrangement of the model formula gives:
\[ \begin{aligned} y &= c + d\bigg(\frac{x - \mu_x}{\sigma_x}\bigg) + \eta \\ &= \underbrace{c - \frac{d\mu_x}{\sigma_x}}_{a} + \underbrace{\frac{d}{\sigma_x}}_{b}\cdot x + \underbrace{\eta}_{\epsilon}. \end{aligned} \]
Apparently, by using a standardized feature the weights are simply rescaled; standardization has no analytical impact.
However, for large-scale applications we do not solve the model analytically but numerically. This is where feature normalization matters, especially for a first-order optimizer like gradient descent: if any feature has an outlying scale compared to the others, the performance of gradient descent is severely degraded.
Let’s examine this using a working example:
np.random.seed(777)
N = 1000
X = np.stack([
np.ones(N),
np.random.normal(size=N),
np.random.normal(size=N, loc=50, scale=10)], axis=1)
beta = np.array([6, 4, 3], dtype=np.float32) # True model weights.
e = np.random.normal(size=N)
y = X.dot(beta) + e
print(X[:10,:])
[[ 1. -0.46820879 43.4755395 ]
[ 1. -0.82282485 43.18192098]
[ 1. -0.0653801 54.09287917]
[ 1. -0.71336192 58.53293621]
[ 1. 0.90635089 54.1564682 ]
[ 1. 0.76623673 42.42103854]
[ 1. 0.82605407 54.26729587]
[ 1. -1.32368279 62.14688398]
[ 1. -1.75244452 34.17196666]
[ 1. 1.00244907 44.17910818]]
Here we create a linear model with two non-constant features, where the 2nd feature has a much larger scale than the 1st. The analytical OLS estimator has no problem solving this model:
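Presumably via the `ols` helper defined at the beginning of the notebook:

print(ols(X, y))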
[6.00952581 3.93270425 3.00017441]
Our batch gradient descent solver, however, suffers a lot:
[nan nan nan]
/home/kylechung/.pyenv/versions/k9-nn/bin/python:2: RuntimeWarning:
overflow encountered in square
/home/kylechung/.pyenv/versions/k9-nn/bin/python:14: RuntimeWarning:
invalid value encountered in subtract
Here, due to the imbalanced scale of the features, our solver becomes numerically unstable at a moderate learning rate (resulting in exploding gradients). This is because the size of the gradient update is dominated by the high-scaled feature.
We can still manage to solve the model, but only with a much lower learning rate to accommodate the high-scaled feature, accompanied by a larger number of epochs since the low-scaled features are now updated by too small a step:
for epoch in [100, 1000, 5000, 10000]:
    b, _ = gd_batch_optimize(X, y, grad_func, lr=0.0003, n_epoch=epoch, batch_size=64)
    print("Training Epochs {:5} | Estimate: {}".format(epoch, b))
Training Epochs 100 | Estimate: [0.45991644 2.27156008 3.10917524]
Training Epochs 1000 | Estimate: [3.28390533 3.95687798 3.05183397]
Training Epochs 5000 | Estimate: [5.09751447 3.94145204 3.01695605]
Training Epochs 10000 | Estimate: [5.88796916 3.93388052 3.00508586]
Even for such a simple linear model we now need many more training steps to ensure convergence, not to mention a more complicated neural network model.
A second-order optimizer could mitigate the problem, but for training large-scale neural networks we must stick to first-order optimizers for the sake of performance. Feature normalization comes to the rescue.
Let's experiment with a new design matrix where all the features (excluding the bias) are normalized to have zero mean and unit standard deviation.
Xs = np.column_stack([np.ones(N), (X[:,1:] - X[:,1:].mean(axis=0)) / X[:,1:].std(axis=0)])
print(Xs[:10,:])
[[ 1. -0.49861967 -0.64279134]
[ 1. -0.85198872 -0.6712771 ]
[ 1. -0.09720753 0.38726279]
[ 1. -0.74291072 0.81802033]
[ 1. 0.87110636 0.39343196]
[ 1. 0.73148492 -0.74509504]
[ 1. 0.79109191 0.40418404]
[ 1. -1.35108538 1.16863187]
[ 1. -1.77833937 -1.54538879]
[ 1. 0.96686661 -0.57453377]]
Since our feature scale changes, the analytical solution for the weights changes as well. Let’s check the OLS estimation (the correct solution for our reference) after normalization:
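Again presumably via the `ols` helper:

print(ols(Xs, y))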
[156.44822091 3.94658245 30.92446288]
Now examine our gradient descent solver:
for epoch in [1, 10, 50, 100]:
    b, _ = gd_batch_optimize(Xs, y, grad_func, lr=0.1, n_epoch=epoch, batch_size=64)
    print("Training Epochs {:5} | Estimate: {}".format(epoch, b))
Training Epochs 1 | Estimate: [152.08538432 3.69797309 30.14852295]
Training Epochs 10 | Estimate: [156.45440292 3.96343058 30.96520361]
Training Epochs 50 | Estimate: [156.44691019 3.98666368 30.88663084]
Training Epochs 100 | Estimate: [156.34646482 3.94522662 30.88778661]
Bingo!
Now it takes little effort (fewer than 10 epochs) for our numerical solver to arrive at a good approximation. The only additional step is to remember to also transform the features before model prediction, since the feature space is no longer in its original scale.
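As a minimal sketch of this bookkeeping (the names `x_mean`, `x_std`, and `normalize_features` are illustrative and not part of our earlier solver):

import numpy as np

# Remember the training statistics of the non-bias features at fit time.
x_mean = X[:,1:].mean(axis=0)
x_std = X[:,1:].std(axis=0)

def normalize_features(X_new):
    # Apply the *training* statistics to any new data before prediction.
    return np.column_stack([np.ones(X_new.shape[0]),
                            (X_new[:,1:] - x_mean) / x_std])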
When it comes to logistic regression, normalization can be even more critical since the sigmoid function suffers from saturation when the input is either too small or too big.
Remember the shape of the sigmoid function: the gradient becomes nearly zero at both extreme ends of the input space. Or mathematically:
\[ \begin{aligned} z &= \frac{1}{1 + e^{-t}}, \\ \frac{dz}{dt} &= z(1 - z), \\ t \to \infty & \Rightarrow e^{-t} \to 0 \mbox{ and } z \to 1 \Rightarrow z(1-z) \to 0, \\ t \to - \infty & \Rightarrow e^{-t} \to \infty \mbox{ and } z \to 0 \Rightarrow z(1-z) \to 0. \end{aligned} \]
Hence as the fitted value \(X\beta\) grows unbounded, the gradient approaches zero and the gradient descent optimization gets stuck, since all updates will be multiplied by (nearly) zero. This is referred to as the problem of vanishing gradients.
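A quick numerical check of this saturation (just evaluating the formula above at a few points; the tiny sigmoid helper is redefined here for self-containedness):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Gradient z * (1 - z) at a few inputs: nearly zero at both extremes.
for t in [-20.0, -5.0, 0.0, 5.0, 20.0]:
    z = sigmoid(t)
    print("t = {:6.1f} | gradient = {:.8f}".format(t, z * (1 - z)))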
Feature normalization in this case also helps avoid vanishing gradients, since the size of the fitted values is well constrained.
Exactly the same logic applies to training a neural network, especially when the sigmoid (or a function with a similar property, such as tanh) is used as the activation function in any layer of the network. The only problem is when the dataset is too large to normalize as a whole: since normalization requires the entire dataset to compute the scale, it may be prohibitively costly to do so.8
To overcome the computing cost the batch normalization approach (BN hereafter) was proposed. It simply normalizes the input at the batch level when doing batch gradient descent, and this is done on a per-layer basis when training a neural network.
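To make the idea concrete, here is a minimal sketch of the BN transform on a single mini-batch, following the formulation in Ioffe and Szegedy (2015). The learnable scale `gamma` and shift (named `beta_bn` here to avoid clashing with our model weights `beta`) are included; all names are illustrative rather than part of our earlier solver:

import numpy as np

def batch_norm_forward(Xb, gamma, beta_bn, eps=1e-5):
    # Normalize each input dimension with the statistics of the current
    # batch only, then re-scale and re-shift with learnable parameters.
    mu = Xb.mean(axis=0)
    var = Xb.var(axis=0)
    Xb_hat = (Xb - mu) / np.sqrt(var + eps)
    return gamma * Xb_hat + beta_bn, mu, var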
Indeed, BN exists for more than just mitigating numerical issues. In the original work of Ioffe and Szegedy (2015), BN was proposed mainly to deal with a problem called internal covariate shift. This point of view, however, has been heavily challenged by subsequent research, such as Santurkar et al. (2018).
The success of BN is backed by its empirical results rather than its theoretical standing, which is yet to be settled. The debate is still ongoing, so for now we should consider it a good practice that is always worth experimenting with.
Should we apply BN before or after the activation at each layer? Though the original paper applies BN before the activation, the common practice now seems to be to apply it after the activation, since the empirical results are better.
BN after the activation also seems more interpretable, since it is the output, not the input, of the activation that becomes the input to the next layer. By applying BN after the activation we are normalizing each input just before it is multiplied by the weight matrix of the next layer.
But if the sigmoid is used as the activation, wouldn't BN after activation defeat the other purpose, which is to mitigate vanishing gradients? If the input layer is also normalized, this should be less of a problem. And if the chosen activation has no saturating region (such as ReLU), it is not a problem at all.
How do we run inference on a model trained with BN? If the input data also arrives in batches, we can directly apply the same network architecture. But more often inference is done in a streaming manner (online prediction). In order to batch-normalize such data we need to keep track of the feature means and variances while training the model. A common practice is to use a moving average of the batch means and variances as additional parameters in inference mode.
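A hedged sketch of such bookkeeping (the momentum value, class, and variable names are illustrative only):

import numpy as np

class RunningStats:
    """Exponential moving average of batch means/variances for BN inference."""
    def __init__(self, n_features, momentum=0.9):
        self.momentum = momentum
        self.mu = np.zeros(n_features)
        self.var = np.ones(n_features)

    def update(self, batch_mu, batch_var):
        # Called once per training batch with that batch's statistics.
        self.mu = self.momentum * self.mu + (1 - self.momentum) * batch_mu
        self.var = self.momentum * self.var + (1 - self.momentum) * batch_var

def batch_norm_inference(x_new, stats, gamma, beta_bn, eps=1e-5):
    # At inference we normalize with the running (training-time) statistics,
    # not with statistics of the incoming (possibly single) example.
    x_hat = (x_new - stats.mu) / np.sqrt(stats.var + eps)
    return gamma * x_hat + beta_bn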
Gradient clipping is a technique to handle exploding gradients. Such a problem can result from improper learning rates, imbalanced feature scales, or a mis-configured loss function. The general idea is to constrain the calculated gradients before applying the backpropagation update.
The simplest way of doing this is to set a lower and an upper bound so that every gradient entry is floored at the lower bound and capped at the upper bound.
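A minimal sketch of this bounded clipping (a plain NumPy illustration with hypothetical bounds of ±5, not TensorFlow's API):

import numpy as np

def clip_values(g, lower=-5.0, upper=5.0):
    # Element-wise clipping: every gradient entry is forced into [lower, upper].
    return np.clip(g, lower, upper)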
Another way is to constrain the L2 norm of the gradient vector by a given value:
def clip(g, norm):
    # Scale down the gradient vector if its L2 norm exceeds the threshold.
    gnorm = np.linalg.norm(g)
    if gnorm > norm:
        g = g * norm / gnorm
    return g
def gd_clip_batch_optimize(X, y, grad_func, clip_norm=5, lr=.01, n_epoch=10, batch_size=64):
    b = np.random.normal(size=X.shape[1])
    l = [loss(X, y, b)]
    for epoch in range(n_epoch):
        # Shuffle the dataset before each epoch.
        sid = np.random.permutation(X.shape[0])
        Xs = X[sid,:]
        ys = y[sid]
        i = 0
        n_step = int(np.ceil(X.shape[0] / batch_size))
        for step in range(n_step):
            Xb = Xs[i:i+batch_size,:]
            yb = ys[i:i+batch_size]
            b -= lr*clip(grad_func(Xb, yb, b), clip_norm)
            l.append(loss(Xb, yb, b))
            i += batch_size
    return b, l
Let's add clipping to our batch gradient descent optimizer and revisit the earlier example where the gradients exploded because of imbalanced feature scales:
[[ 1. -0.46820879 43.4755395 ]
[ 1. -0.82282485 43.18192098]
[ 1. -0.0653801 54.09287917]
[ 1. -0.71336192 58.53293621]
[ 1. 0.90635089 54.1564682 ]]
for epoch in [1000, 5000, 10000, 20000, 30000]:
    b, _ = gd_clip_batch_optimize(X, y, grad_func, lr=0.1, n_epoch=epoch, batch_size=64)
    print("Training Epochs {:5} | Estimate: {}".format(epoch, b))
Training Epochs 1000 | Estimate: [4.2681595 3.91673636 2.59477635]
Training Epochs 5000 | Estimate: [5.9558284 3.94616001 3.39861039]
Training Epochs 10000 | Estimate: [6.0188452 3.9357066 2.92287519]
Training Epochs 20000 | Estimate: [6.00534916 3.94543276 2.91034558]
Training Epochs 30000 | Estimate: [5.96425905 3.9384467 2.88635615]
We have successfully avoided exploding gradients this time with gradient clipping and without normalizing the input features. But the noise in the estimate for the high-scaled feature appears to be larger as well. Also, gradient clipping won't improve the speed of convergence; it is merely a technique to tackle the numerical stability problem.9
Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.” http://tensorflow.org/.
Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv Preprint arXiv:1502.03167.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
Santurkar, Shibani, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. 2018. “How Does Batch Normalization Help Optimization?” In Advances in Neural Information Processing Systems, 2483–93.
Seabold, Skipper, and Josef Perktold. 2010. “Statsmodels: Econometric and Statistical Modeling with Python.” In 9th Python in Science Conference.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” The Journal of Machine Learning Research 15 (1). JMLR.org: 1929–58.
One can try changing the sample size arbitrarily to see the behavior of the estimator in our toy example.↩
Sometimes “mini-batch gradient descent” is used to describe SGD with batch size > 1. Here we ignore the redundant wording and just call it batch gradient descent.↩
Here we use the common derivatives: \(\frac{df(x)^n}{dx} = nf(x)^{n-1}f'(x)\) and \(\frac{de^{f(x)}}{dx} = f'(x)e^{f(x)}.\)↩
Here we use the common derivative \(\frac{d\ln f(x)}{dx} = \frac{f'(x)}{f(x)}\) and the chain rule to handle the term \(\frac{\partial\ln q_i}{\partial\beta_0}\) where \(q_i\) is the sigmoid computed for the \(i\)-th example.↩
Note that the derivative \(\frac{d\vert u \vert}{dx} = \frac{u}{\vert u \vert}\frac{du}{dx}\) is undefined at \(u = 0\).↩
Since plotly.js currently doesn’t support uneven intervals for contour plots, in the contour-regularization plot we take the natural log of the MSE just to scale the value so that the contour lines have non-linear intervals. This helps us show more lines when the loss becomes smaller.↩
The term normalization is indeed not well defined. What we mean here is actually standardization: transforming a variable to have zero mean and unit standard deviation. Other common normalizations include simple de-meaning (subtraction of the mean), compression to the range of \([0, 1]\) or \([-1, 1]\), etc.↩
Notice that we need to normalize not just the input layer but every hidden layer as well.↩
In tensorflow the function clip_by_value implements the bounded method and clip_by_norm implements exactly the L2 norm method we’ve implemented here.↩