import numpy as np
from IPython.display import clear_output
The Layer class will be the foundation of the neural network. A layer needs to be able to do two things:
The basic layer class will use the identity function $f(x) = x$ in the forward pass, and thus all derivatives in the backward pass are just 1.
class Layer:
'''Basic neural network class with a forward and backward pass functions. '''
def __init__(self, *kwargs):
'''Used to store layer variables, e.g. number of neurons. Empty by default'''
pass
def forward(self, value):
'''Compute the forward pass on the computational graph. Identity function is used by default.'''
return value
def backward(self, value, grad_output):
'''Compute derivative of the loss function from right to left in the computational graph, with respect to a
given input (backprop). Takes the previous layers gradient as an argument to compute efficently. Given
k layers, the chain rule for the derivative of the whole graph is written:
d loss / d value = d loss / d layer * d layer k / d layer (k-1) ... * d layer 0 / d value
At each layer only a single term in the product is computed, taking the previous computations as an input.
By default, the identity function is used for activation in the forward pass, so the Jacobian is simply an
n x n identity matrix. Nevertheless, the proper matrix math is shown here for completeness.
'''
n_units = value.shape[0]
d_layer_d_input = np.eye(n_units)
return grad_output @ d_layer_d_input
A relu function:
$$x = \begin{cases}x & \text{if} \;\; x > 0 \\ 0 & \text{otherwise} \end{cases}$$is a simple activation function commonly used in neural networks. This function can be applied in the forward function by simply applying a maximum. The derivative of the relu function is equally simple, defined piece-wise as above, but equal to either 1 (in the first case) or 0.
class Relu(Layer):
def __init__(self, *kwargs):
'''No additional information is needed by the class, because the Relu function will be applied elementwise
to everything.'''
pass
def forward(self, value):
return np.maximum(value, 0)
def backward(self, value, grad_output):
'''The piecewise derivative is quickly implemented by turning a boolian into a float using multiplication'''
relu_gradient = value > 0
return grad_output * relu_gradient
Testing the backward pass, we can see that it correctly sets values below 0 to 0, and then increases linearly afterwards.
l = Relu()
x = np.linspace(-1,1,10*32).reshape([10,32])
grads = l.backward(x,np.ones([10,32])/(32*10))
print(grads)
So far the layer has just been an activation function with a backward pass -- not much to be excited about. To expand into a full layer, we need to add the ability to learn weights. This functionality is implemented below. Stochastic gradient descent is used to find weights.
To find the derivatives needed by the backward pass, we begin with a linear approximation of the target variable y as the sum of i layers h:
$$ \hat{y} = \sum_i f(h_i)$$Where each layer is itself a linear combination of input variables X:
$$h_i = w_iX + b_i$$Because all layers are linear in inputs X, any arbitrary, differentiable loss function $\mathcal{L}$ has the following derivatives with respect to $X$, $W$, and $b$:
$$\frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial h} W^T$$$$\frac{\partial \mathcal{L}}{\partial W} = X^T \frac{\partial \mathcal{L}}{\partial h}$$$$\frac{\partial \mathcal{L}}{\partial b} = 0$$class Dense(Layer):
def __init__(self, n_inputs, n_outputs, learning_rate=0.001):
'''A functional approximation machine to learn weights W in f(x) = XW + b such that f(x) approximates
target values y as closely as possible, accoarding to some criteria.
Parameters:
---------------------------------------------------------------------------------------------
learning_rate: speed of updating weights W. Choose sufficently small values to ensure timely
convergence. Default = .001.
n_inputs: batch size, equal to number of samples processed each iteration.
n_outputs: number of targets.
'''
self.learning_rate = learning_rate
self.n_inputs = n_inputs
self.n_outputs = n_outputs
self.initialize_weights()
def initialize_weights(self):
self.weights = np.random.normal(scale=1e-2, size=(self.n_inputs, self.n_outputs))
self.bias = np.zeros((1, self.n_outputs))
def forward(self, value):
return value @ self.weights + self.bias
def backward(self, value, grad_output):
'''The backward pass computes the contribution of this layer to the overall loss function, given by'''
self.grad_input = grad_output @ self.weights.T
self.grad_weights = value.T @ grad_output
self.grad_bias = np.sum(grad_output, axis=0, keepdims=True)
'''Update weights and bias via SGD'''
self.weights -= self.learning_rate * self.grad_weights
self.bias -= self.learning_rate * self.grad_bias
return self.grad_input
def get_weights(self):
return self.weights
def get_gradient(self):
return self.grad_input
To transform model output into a probability of class membership given $i$ potential classes, a softmax function is used. The probability of belonging to class $j$ is given:
$$ p_j = \frac{e^{\hat{y}_j}}{\sum_i e^{\hat{y}_i}}$$This returns a log-probability of class membership. We might like to take a log to remove the exponents, which makes using cross-entropy appealing as a loss function. Cross entropy is written:
$$\mathcal{L} = {-\mathbb{E}[\log{p} \; | \;y]}$$Plugging in the value of $p$ using the softmax function, we obtain:
$$\mathcal{L} = -\log{\frac{e^{p_{correct}}}{\sum_i e^{p_i}}}$$Where $p_{correct}$ is the probability the model assigns to the correct class. This can be re-written:
$$\mathcal{L} = p_{correct} - \log{\sum_i e^{p_i}}$$Which is quite convenient from a numerical stability point of view as well. We will use this function to compute the model loss in the forward pass. We also need to know the derivative of this loss function for the backward pass, which is:
$$\frac{\partial \mathcal{L}}{\partial p} = \frac{1}{N}\left(\frac{e^p}{\sum_i e^{p_i}} - 1\right)$$def softmax_cross_entropy(logits, y_true):
"""Given model outputs (logits) and the indexes of the true class label, computes the softmax cross entropy"""
true_class_logits = logits[np.arange(len(logits)), y_true]
cross_entropy = - true_class_logits + np.log(np.sum(np.exp(logits), axis=-1))
return cross_entropy
def grad_softmax_cross_entropy(logits, y_true):
ones_true_class = np.zeros_like(logits)
ones_true_class[np.arange(len(logits)),y_true] = 1
softmax = np.exp(logits) / np.exp(logits).sum(axis=-1,keepdims=True)
return (-ones_true_class + softmax) / logits.shape[0]
To actually build a network, a list of layers and activation functions will be used. The list can then be iterated through to perform forward and backward passes. We need data to build the network, so the MNIST dataset will be used.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
X, y = fetch_openml('mnist_784', return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
y_train = np.array(list(map(lambda x: np.uint8(x), y_train)))
y_test = np.array(list(map(lambda x: np.uint8(x), y_test)))
fig, ax = plt.subplots(1,12, figsize=(12,1), dpi=100)
for axis, idx in zip(fig.axes, np.arange(1337, 1337+12)):
axis.imshow(X_train[idx, :].reshape(28,28), cmap='gray')
axis.axis('off')
class Network():
def __init__(self):
self.layers = []
def add_layer(self, layer, n_inputs=None, n_outputs=None, activation=None, **kwargs):
self.layers.append(layer(n_inputs=n_inputs,
n_outputs=n_outputs,
**kwargs))
if activation=='relu':
self.layers.append(Relu())
elif activation is not None:
raise NotImplementedError('Only Relu activation supported')
def get_layers(self):
return self.layers
def clear_layers(self):
self.layers = []
def reset_t(self):
for layer in self.layers:
if hasattr(layer, 't'):
layer.t = 0
def reset_weights(self):
for layer in self.layers:
if hasattr(layer, 'initialize_weights'):
layer.initialize_weights()
def get_weights(self, layer):
return self.layers[layer].get_weights()
def forward(self, X):
self.activations = []
self.values = X
for layer in self.layers:
self.values = layer.forward(self.values)
self.activations.append(self.values)
assert len(self.activations) == len(self.layers)
def predict(self, X):
self.forward(X)
logits = self.activations[-1]
y_hat = logits.argmax(axis=-1)
return y_hat
def step(self, X, y):
self.forward(X)
self.layer_inputs = [X] + self.activations
self.logits = self.layer_inputs.pop()
self.loss = softmax_cross_entropy(self.logits, y)
self.loss_grad = grad_softmax_cross_entropy(self.logits, y)
for layer, inputs in zip(self.layers[::-1], self.layer_inputs[::-1]):
self.loss_grad = layer.backward(inputs, self.loss_grad)
return np.mean(self.loss)
def batch_handler(self, X, y, batchsize, shuffle=False):
assert len(X) == len(y)
if shuffle:
idxs = np.random.permutation((len(y)))
for start_idx in range(0, len(X) - batchsize + 1, batchsize):
if shuffle:
batch = idxs[start_idx:start_idx + batchsize]
else:
batch = slice(start_idx, start_idx + batchsize)
yield X[batch], y[batch]
def train(self, X_train, y_train, X_test, y_test, n_epochs=25, batchsize=35,
reset_weights=True, verbose=True, graph=True):
training_log = []
test_log = []
if reset_weights:
self.reset_weights()
for epoch in range(n_epochs):
for x_batch, y_batch in self.batch_handler(X_train, y_train, batchsize=batchsize, shuffle=True):
self.step(x_batch, y_batch)
training_log.append(np.mean(self.predict(X_train) == y_train))
test_log.append(np.mean(self.predict(X_test) == y_test))
clear_output(wait=True)
if verbose:
print(f'Epoch: {epoch+1}')
print(f'Training accuracy: {training_log[-1]}')
print(f'Test accuracy: {test_log[-1]}')
print(self.get_weights(0))
if graph:
plt.plot(training_log, label='Training accuracy')
plt.plot(test_log, label='Test accuracy')
plt.legend(loc='best')
plt.grid()
plt.show()
self.reset_t()
network = Network()
network.add_layer(Dense, n_inputs=X_train.shape[1], n_outputs=100, activation='relu')
network.add_layer(Dense, n_inputs=100, n_outputs=200, activation='relu')
network.add_layer(Dense, n_inputs=200, n_outputs=10, activation=None)
network.get_layers()
network.train(X_train, y_train, X_test, y_test, batchsize=5000)
The network implemented above works well enough, but has plenty of room for improvement. Two such improvements will be made here.
First, the simple SDG used to update weights can be improved. An ADAM optimizer with momentum will be added. Rather than simply stepping in the direction of the anti-gradient scaled by the learning rate, we will scale both the learning rate and the gradient, so that:
$$w_j^t = w_j^{t-1} - \frac{\eta_t}{\sqrt{v_j^t + \epsilon}}m_j^t$$with:
$$m_j^t = \frac{\beta_1 m_j^{t-1} + (1-\beta_1)g_{j}^t}{1-\beta_1^t}$$$$v_j^t = \frac{\beta_2 v_j^{t-1} + (1 - \beta_2)(g^t_j)^2}{1-\beta_2}$$Note that because our bias term is not included in the weight matrix, we need to calculate moment estimates for both the weights and the bias term separately. Failure to do so will cause the function to diverge, as I painfully discovered.
In addition, the initialization of the network can be improved by using the method of Glorot and Bengio 2010, which suggests to set the variance for the distribution of initial weights to:
$$\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$Their work was, however, concerned linear activation functions. He et al. (2015) suggest a change to the method that will be valid with a Relu activation function:
$$\text{Var}(W) = \frac{2}{n_{\text{in}}}$$This second implementation will be used here.
class Dense2(Dense):
def __init__(self, name, n_inputs, n_outputs, learning_rate, beta_1, beta_2, ϵ=1e-8):
self.n_inputs = n_inputs
self.n_outputs = n_outputs
self.learning_rate = learning_rate
self.beta_1 = beta_1
self.beta_2 = beta_2
self.ϵ = ϵ
self.m = None
self.v = None
self.t = 0
self.name = name
self.initialize_weights()
def initialize_weights(self):
'''Use He at al. (2015) variance to initialize weights'''
w_var = 2 / self.n_inputs
self.t = 0
self.weights = np.random.normal(scale=w_var, size=(self.n_inputs, self.n_outputs))
self.bias = np.zeros((1, self.n_outputs))
def backward(self, value, grad_output):
'''The backward pass computes the contribution of this layer to the overall loss function, given by'''
self.grad_input = (grad_output @ self.weights.T)
self.grad_weights = value.T @ grad_output
self.grad_bias = np.sum(grad_output, axis=0, keepdims=True)
'''Update weights and bias via adam. We didnt know the dimensions of the gradient vector when the layer was
constructed, so we couldnt initialize m and v until now.'''
if self.t==0:
self.m_w = np.zeros_like(self.grad_weights)
self.m_b = np.zeros_like(self.grad_bias)
self.v_w = np.zeros_like(self.grad_weights)
self.v_b = np.zeros_like(self.grad_bias)
self.t += 1
self.m_w = self.beta_1 * self.m_w + (1 - self.beta_1) * self.grad_weights / (1 - self.beta_1**self.t)
self.v_w = self.beta_2 * self.v_w + (1 - self.beta_2) * np.power(self.grad_weights, 2) / (1 - self.beta_2**self.t)
self.m_b = self.beta_1 * self.m_b + (1 - self.beta_1) * self.grad_bias / (1 - self.beta_1**self.t)
self.v_b = self.beta_2 * self.v_b + (1 - self.beta_2) * np.power(self.grad_bias, 2) / (1 - self.beta_2**self.t)
self.weights -= self.learning_rate / (np.sqrt(self.v_w) + self.ϵ) * self.m_w
self.bias -= self.learning_rate / (np.sqrt(self.v_b) + self.ϵ) * self.m_b
return self.grad_input
network2 = Network()
network2.add_layer(Dense2, name='Input', n_inputs=X_train.shape[1], n_outputs=100,
activation='relu',
learning_rate=0.001, beta_1=0.9, beta_2=0.999)
network2.add_layer(Dense2, name='Hidden 1', n_inputs=100, n_outputs=200,
activation='relu',
learning_rate=0.001, beta_1=0.9, beta_2=0.999)
network2.add_layer(Dense2, name='Output', n_inputs=200, n_outputs=10,
activation=None,
learning_rate=0.001, beta_1=0.9, beta_2=0.999)
network2.get_layers()
We can see that by using the Xavier initialization and changing the optimizer has significantly improved both the in-and out-of-sample accuracy for the model. In addition, the speed of convergence is much faster. Note the slope of the training curves here compared those above, especially in early epochs. Ultimately the model appears to over-fit the data slightly, which suggests a perhaps simpler architecture, or implementation of some kind of regularization term. That will have to wait for next time, though.
network2.train(X_train, y_train, X_test, y_test, batchsize=5000)