The Basic Class¶

The Layer class will be the foundation of the neural network. A layer needs to be able to do two things:

Perform a forward pass: execute graph computation from left to right
Perform a backward pass: compute the derivatives of the loss function from right to left efficiently

The basic layer class will use the identity function $f(x) = x$ in the forward pass, and thus all derivatives in the backward pass are just 1.

class Layer:
    
    '''Basic neural network class with a forward and backward pass functions. '''
    
    def __init__(self, *kwargs):
        '''Used to store layer variables, e.g. number of neurons. Empty by default'''
        pass
    
    def forward(self, value):
        '''Compute the forward pass on the computational graph. Identity function is used by default.'''
        return value
    
    def backward(self, value, grad_output):
        '''Compute derivative of the loss function from right to left in the computational graph, with respect to a 
        given input (backprop). Takes the previous layers gradient as an argument to compute efficently. Given 
        k layers, the chain rule for the derivative of the whole graph is written:
        
        d loss / d value = d loss / d layer * d layer k / d layer (k-1) ... * d layer 0 / d value
        
        At each layer only a single term in the product is computed, taking the previous computations as an input.
        
        By default, the identity function is used for activation in the forward pass, so the Jacobian is simply an 
        n x n identity matrix. Nevertheless, the proper matrix math is shown here for completeness.
        '''
        
        n_units = value.shape[0]
        d_layer_d_input = np.eye(n_units)
        
        return grad_output @ d_layer_d_input

A Relu Layer¶

A relu function:

$$x = \begin{cases}x & \text{if} \;\; x > 0 \\ 0 & \text{otherwise} \end{cases}$$

is a simple activation function commonly used in neural networks. This function can be applied in the forward function by simply applying a maximum. The derivative of the relu function is equally simple, defined piece-wise as above, but equal to either 1 (in the first case) or 0.

class Relu(Layer):
    
    def __init__(self, *kwargs):
        '''No additional information is needed by the class, because the Relu function will be applied elementwise
        to everything.'''
        pass
    
    def forward(self, value):
        return np.maximum(value, 0)
    
    def backward(self, value, grad_output):
        '''The piecewise derivative is quickly implemented by turning a boolian into a float using multiplication'''
        relu_gradient = value > 0 
        return grad_output * relu_gradient

Testing the backward pass, we can see that it correctly sets values below 0 to 0, and then increases linearly afterwards.

l = Relu()
x = np.linspace(-1,1,10*32).reshape([10,32])
grads = l.backward(x,np.ones([10,32])/(32*10))
print(grads)

[[0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.      ]
 [0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.      ]
 [0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.      ]
 [0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.      ]
 [0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.
  0.       0.       0.       0.       0.       0.       0.       0.      ]
 [0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125]
 [0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125]
 [0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125]
 [0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125]
 [0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125
  0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125 0.003125]]

A Dense Layer¶

So far the layer has just been an activation function with a backward pass -- not much to be excited about. To expand into a full layer, we need to add the ability to learn weights. This functionality is implemented below. Stochastic gradient descent is used to find weights.

To find the derivatives needed by the backward pass, we begin with a linear approximation of the target variable y as the sum of i layers h:

$$ \hat{y} = \sum_i f(h_i)$$

Where each layer is itself a linear combination of input variables X:

$$h_i = w_iX + b_i$$

Because all layers are linear in inputs X, any arbitrary, differentiable loss function $\mathcal{L}$ has the following derivatives with respect to $X$, $W$, and $b$:

$$\frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial h} W^T$$$$\frac{\partial \mathcal{L}}{\partial W} = X^T \frac{\partial \mathcal{L}}{\partial h}$$$$\frac{\partial \mathcal{L}}{\partial b} = 0$$

class Dense(Layer):
    
    def __init__(self, n_inputs, n_outputs, learning_rate=0.001):
        '''A functional approximation machine to learn weights W in f(x) = XW + b such that f(x) approximates 
        target values y as closely as possible, accoarding to some criteria.

        Parameters:
        ---------------------------------------------------------------------------------------------
        learning_rate: speed of updating weights W. Choose sufficently small values to ensure timely
                       convergence. Default = .001.
        n_inputs: batch size, equal to number of samples processed each iteration.
        n_outputs: number of targets.
        '''

        self.learning_rate = learning_rate
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.initialize_weights()
        
    def initialize_weights(self):
        self.weights = np.random.normal(scale=1e-2, size=(self.n_inputs, self.n_outputs))
        self.bias = np.zeros((1, self.n_outputs))

    def forward(self, value):
        return value @ self.weights + self.bias
    
    def backward(self, value, grad_output):
        '''The backward pass computes the contribution of this layer to the overall loss function, given by'''
        self.grad_input = grad_output @ self.weights.T
    
        self.grad_weights = value.T @ grad_output
        self.grad_bias = np.sum(grad_output, axis=0, keepdims=True)
        
        '''Update weights and bias via SGD'''
        
        self.weights -= self.learning_rate * self.grad_weights
        self.bias -= self.learning_rate * self.grad_bias
        return self.grad_input
    
    def get_weights(self):
        return self.weights
    
    def get_gradient(self):
        return self.grad_input

Softmax-Cross Entropy Loss Function¶

To transform model output into a probability of class membership given $i$ potential classes, a softmax function is used. The probability of belonging to class $j$ is given:

$$ p_j = \frac{e^{\hat{y}_j}}{\sum_i e^{\hat{y}_i}}$$

This returns a log-probability of class membership. We might like to take a log to remove the exponents, which makes using cross-entropy appealing as a loss function. Cross entropy is written:

$$\mathcal{L} = {-\mathbb{E}[\log{p} \; | \;y]}$$

Plugging in the value of $p$ using the softmax function, we obtain:

$$\mathcal{L} = -\log{\frac{e^{p_{correct}}}{\sum_i e^{p_i}}}$$

Where $p_{correct}$ is the probability the model assigns to the correct class. This can be re-written:

$$\mathcal{L} = p_{correct} - \log{\sum_i e^{p_i}}$$

Which is quite convenient from a numerical stability point of view as well. We will use this function to compute the model loss in the forward pass. We also need to know the derivative of this loss function for the backward pass, which is:

$$\frac{\partial \mathcal{L}}{\partial p} = \frac{1}{N}\left(\frac{e^p}{\sum_i e^{p_i}} - 1\right)$$

def softmax_cross_entropy(logits, y_true):
    """Given model outputs (logits) and the indexes of the true class label, computes the softmax cross entropy"""
    true_class_logits = logits[np.arange(len(logits)), y_true]
    
    cross_entropy = - true_class_logits + np.log(np.sum(np.exp(logits), axis=-1))
    return cross_entropy

def grad_softmax_cross_entropy(logits, y_true):
    ones_true_class = np.zeros_like(logits)
    ones_true_class[np.arange(len(logits)),y_true] = 1

    softmax = np.exp(logits) / np.exp(logits).sum(axis=-1,keepdims=True)
    
    return (-ones_true_class + softmax) / logits.shape[0]

Build a Network¶

To actually build a network, a list of layers and activation functions will be used. The list can then be iterated through to perform forward and backward passes. We need data to build the network, so the MNIST dataset will be used.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = fetch_openml('mnist_784', return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y)

y_train = np.array(list(map(lambda x: np.uint8(x), y_train)))
y_test = np.array(list(map(lambda x: np.uint8(x), y_test)))

fig, ax = plt.subplots(1,12, figsize=(12,1), dpi=100)

for axis, idx in zip(fig.axes, np.arange(1337, 1337+12)):
    axis.imshow(X_train[idx, :].reshape(28,28), cmap='gray')
    axis.axis('off')

class Network():
    
    def __init__(self):
        self.layers = []
        
    def add_layer(self, layer, n_inputs=None, n_outputs=None, activation=None, **kwargs):
        self.layers.append(layer(n_inputs=n_inputs, 
                                 n_outputs=n_outputs,
                                 **kwargs))
        if activation=='relu':
            self.layers.append(Relu())
        elif activation is not None:
            raise NotImplementedError('Only Relu activation supported')
        
    def get_layers(self):
        return self.layers
    
    def clear_layers(self):
        self.layers = []
        
    def reset_t(self):
        for layer in self.layers:
            if hasattr(layer, 't'):
                layer.t = 0
        
    def reset_weights(self):
        for layer in self.layers:
            if hasattr(layer, 'initialize_weights'):
                layer.initialize_weights()
                
    def get_weights(self, layer):
        return self.layers[layer].get_weights()
    
    def forward(self, X):

        self.activations = []
        self.values = X

        for layer in self.layers:
            self.values = layer.forward(self.values)
            self.activations.append(self.values)
        assert len(self.activations) == len(self.layers)
    
    def predict(self, X):

        self.forward(X)
        
        logits = self.activations[-1]
        y_hat = logits.argmax(axis=-1)
        
        return y_hat

    def step(self, X, y):

        self.forward(X)
        self.layer_inputs = [X] + self.activations
        self.logits = self.layer_inputs.pop()

        self.loss = softmax_cross_entropy(self.logits, y)
        self.loss_grad = grad_softmax_cross_entropy(self.logits, y)

        for layer, inputs in zip(self.layers[::-1], self.layer_inputs[::-1]):
            self.loss_grad = layer.backward(inputs, self.loss_grad)

        return np.mean(self.loss)

    def batch_handler(self, X, y, batchsize, shuffle=False):

        assert len(X) == len(y)

        if shuffle:
            idxs = np.random.permutation((len(y)))

        for start_idx in range(0, len(X) - batchsize + 1, batchsize):
            if shuffle:
                batch = idxs[start_idx:start_idx + batchsize]
            else:
                batch = slice(start_idx, start_idx + batchsize)

            yield X[batch], y[batch]
    
    def train(self, X_train, y_train, X_test, y_test, n_epochs=25, batchsize=35, 
              reset_weights=True, verbose=True, graph=True):
        
        training_log = []
        test_log = []
        
        if reset_weights:
            self.reset_weights()

        for epoch in range(n_epochs):

            for x_batch, y_batch in self.batch_handler(X_train, y_train, batchsize=batchsize, shuffle=True):
                self.step(x_batch, y_batch)

            training_log.append(np.mean(self.predict(X_train) == y_train))
            test_log.append(np.mean(self.predict(X_test) == y_test))
            clear_output(wait=True)
            if verbose:
                print(f'Epoch: {epoch+1}')
                print(f'Training accuracy: {training_log[-1]}')
                print(f'Test accuracy: {test_log[-1]}')
                print(self.get_weights(0))

            if graph:
                plt.plot(training_log, label='Training accuracy')
                plt.plot(test_log, label='Test accuracy')
                plt.legend(loc='best')
                plt.grid()
                plt.show()
                
            self.reset_t()

network = Network()
network.add_layer(Dense, n_inputs=X_train.shape[1], n_outputs=100, activation='relu')
network.add_layer(Dense, n_inputs=100, n_outputs=200, activation='relu')
network.add_layer(Dense, n_inputs=200, n_outputs=10, activation=None)

network.get_layers()

[<__main__.Dense at 0x17d9dedc7c0>,
 <__main__.Relu at 0x17d9dedc040>,
 <__main__.Dense at 0x17d9d7ae6d0>,
 <__main__.Relu at 0x17d9dedcbb0>,
 <__main__.Dense at 0x17d9dedc6a0>]

network.train(X_train, y_train, X_test, y_test, batchsize=5000)

Epoch: 25
Training accuracy: 0.8653333333333333
Test accuracy: 0.8679428571428571
[[-0.01706702  0.01138093  0.00039369 ... -0.02086708 -0.00052835
   0.00301145]
 [ 0.00398888  0.0037105   0.01050611 ...  0.00756149  0.00103411
  -0.00964506]
 [-0.00546524  0.00637421  0.00282538 ...  0.00059777 -0.01479116
   0.00716643]
 ...
 [ 0.00377047  0.00225967  0.01088449 ...  0.00990514  0.00084542
   0.00276261]
 [ 0.01118148  0.01728287  0.00582262 ...  0.00540275  0.01808837
  -0.01239691]
 [ 0.00807876  0.00410311  0.01063573 ... -0.00255981  0.0111888
  -0.00715812]]

Improving the Network¶

The network implemented above works well enough, but has plenty of room for improvement. Two such improvements will be made here.

First, the simple SDG used to update weights can be improved. An ADAM optimizer with momentum will be added. Rather than simply stepping in the direction of the anti-gradient scaled by the learning rate, we will scale both the learning rate and the gradient, so that:

$$w_j^t = w_j^{t-1} - \frac{\eta_t}{\sqrt{v_j^t + \epsilon}}m_j^t$$

with:

$$m_j^t = \frac{\beta_1 m_j^{t-1} + (1-\beta_1)g_{j}^t}{1-\beta_1^t}$$$$v_j^t = \frac{\beta_2 v_j^{t-1} + (1 - \beta_2)(g^t_j)^2}{1-\beta_2}$$

Note that because our bias term is not included in the weight matrix, we need to calculate moment estimates for both the weights and the bias term separately. Failure to do so will cause the function to diverge, as I painfully discovered.

In addition, the initialization of the network can be improved by using the method of Glorot and Bengio 2010, which suggests to set the variance for the distribution of initial weights to:

$$\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

Their work was, however, concerned linear activation functions. He et al. (2015) suggest a change to the method that will be valid with a Relu activation function:

$$\text{Var}(W) = \frac{2}{n_{\text{in}}}$$

This second implementation will be used here.

class Dense2(Dense):
    
    def __init__(self, name, n_inputs, n_outputs, learning_rate, beta_1, beta_2, ϵ=1e-8):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.learning_rate = learning_rate
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.ϵ = ϵ
        self.m = None
        self.v = None
        self.t = 0
        self.name = name
        
        self.initialize_weights()
    
    def initialize_weights(self):
        '''Use He at al. (2015) variance to initialize weights'''
        w_var = 2 / self.n_inputs
        self.t = 0

        self.weights = np.random.normal(scale=w_var, size=(self.n_inputs, self.n_outputs))
        self.bias = np.zeros((1, self.n_outputs))
        
    def backward(self, value, grad_output):
        '''The backward pass computes the contribution of this layer to the overall loss function, given by'''
        self.grad_input = (grad_output @ self.weights.T)

        self.grad_weights = value.T @ grad_output
        self.grad_bias = np.sum(grad_output, axis=0, keepdims=True)
        
        '''Update weights and bias via adam. We didnt know the dimensions of the gradient vector when the layer was
        constructed, so we couldnt initialize m and v until now.'''
        if self.t==0:
            self.m_w = np.zeros_like(self.grad_weights)
            self.m_b = np.zeros_like(self.grad_bias)
            
            self.v_w = np.zeros_like(self.grad_weights)
            self.v_b = np.zeros_like(self.grad_bias)
            
        self.t += 1
        self.m_w = self.beta_1  * self.m_w + (1 - self.beta_1) * self.grad_weights / (1 - self.beta_1**self.t)
        self.v_w = self.beta_2  * self.v_w + (1 - self.beta_2) * np.power(self.grad_weights, 2) / (1 - self.beta_2**self.t)
        
        self.m_b = self.beta_1  * self.m_b + (1 - self.beta_1) * self.grad_bias / (1 - self.beta_1**self.t)
        self.v_b = self.beta_2  * self.v_b + (1 - self.beta_2) * np.power(self.grad_bias, 2) / (1 - self.beta_2**self.t)
        
        self.weights -= self.learning_rate / (np.sqrt(self.v_w) + self.ϵ) * self.m_w
        self.bias -= self.learning_rate / (np.sqrt(self.v_b) + self.ϵ) * self.m_b
            
        return self.grad_input

network2 = Network()
network2.add_layer(Dense2, name='Input', n_inputs=X_train.shape[1], n_outputs=100, 
                   activation='relu', 
                   learning_rate=0.001, beta_1=0.9, beta_2=0.999)
network2.add_layer(Dense2, name='Hidden 1', n_inputs=100, n_outputs=200, 
                   activation='relu', 
                   learning_rate=0.001, beta_1=0.9, beta_2=0.999)
network2.add_layer(Dense2, name='Output', n_inputs=200, n_outputs=10, 
                   activation=None, 
                   learning_rate=0.001, beta_1=0.9, beta_2=0.999)

network2.get_layers()

[<__main__.Dense2 at 0x17d838b0610>,
 <__main__.Relu at 0x17d804a6df0>,
 <__main__.Dense2 at 0x17d838b0cd0>,
 <__main__.Relu at 0x17d838b07c0>,
 <__main__.Dense2 at 0x17d8037ca90>]

We can see that by using the Xavier initialization and changing the optimizer has significantly improved both the in-and out-of-sample accuracy for the model. In addition, the speed of convergence is much faster. Note the slope of the training curves here compared those above, especially in early epochs. Ultimately the model appears to over-fit the data slightly, which suggests a perhaps simpler architecture, or implementation of some kind of regularization term. That will have to wait for next time, though.

network2.train(X_train, y_train, X_test, y_test, batchsize=5000)

Epoch: 25
Training accuracy: 0.9864380952380952
Test accuracy: 0.9721142857142857
[[ 2.94466221e-04 -2.60370501e-03  2.04460620e-03 ...  2.08674545e-03
  -2.33354397e-03 -6.93196189e-04]
 [ 8.87516360e-04 -1.03922507e-03 -3.18308036e-03 ...  3.98072414e-05
  -2.17139142e-03 -3.85950253e-03]
 [-3.00597055e-03  4.05710701e-03 -4.69508400e-04 ... -5.48462955e-03
   4.05172501e-03  2.69525983e-03]
 ...
 [ 4.04535221e-03 -9.19483632e-04  7.26324753e-04 ... -9.70873476e-04
  -3.00771398e-03 -2.00132922e-03]
 [-6.80045834e-03 -8.37552844e-04 -2.36941458e-03 ... -9.77751308e-04
   8.35533728e-04 -1.27257010e-03]
 [ 3.57132394e-03  1.87920514e-03 -1.29671139e-04 ...  2.74876529e-03
  -7.33313957e-05 -9.01769132e-04]]

Table of Contents