With this week largely focused on the introduction of feed-forward neural networks, I thought it would be a decent juncture to introduce the different types of network, with a particular focus on how they differ and on the structures covered so far.
In its simplest iteration, as seen in the previous installment, a machine learning network can consist of a single layer of calculation. For classification problems, this is known as a perceptron.
Perceptrons, as an abstraction, map roughly onto the functions known as ‘activation functions’ within the processes we use in PyTorch. They apply transformations to data in tensor / matrix form that enable binary selection of output. As a piece of biomimetic design, they mirror the activation of neurons in actual biological neural networks.
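To make the ‘binary selection’ idea concrete, here is a minimal sketch of a perceptron in plain Python. The function name, the weights and the bias are all my own illustrative choices (hand-picked rather than learned), set up so that the perceptron behaves like a logical AND.

```python
def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum of the inputs clears the cutoff (zero), else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0


# Hypothetical, hand-picked values: only (1, 1) gives a weighted sum above zero.
and_weights = [1.0, 1.0]
and_bias = -1.5

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, perceptron(pair, and_weights, and_bias))
```

Only the last pair produces a 1, which is the whole of the ‘classification’ here: a numerical input ends up on one side of a cutoff or the other.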
As can be seen from the diagram to the side, activation functions such as ReLU only allow through values above a certain cutoff (often zero), and can be used to add weight to those inputs.
This allows for classification in the abstract, as a category can be determined from numerical input. In addition, they can be used to add selectional elements to multilayered networks, and prevent those networks simplifying into a single linear relationship.
The move from single-layer to multilayer ‘feedforward’ networks relies heavily on these functions. The structure of the processing might be as follows:
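The point about simplification can be checked numerically. The sketch below is my own illustration, with arbitrary layer sizes and random inputs. It first shows ReLU clipping negative values to zero, then shows that two linear layers stacked with no activation between them give the same output as one linear layer with combined weights.

```python
import torch

torch.manual_seed(0)

# ReLU lets through values above zero and clips everything else to zero.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
relu_out = torch.relu(x)
print(relu_out)

# Two linear layers (sizes chosen arbitrarily for illustration).
layer1 = torch.nn.Linear(3, 4)
layer2 = torch.nn.Linear(4, 2)
inputs = torch.randn(5, 3)

with torch.no_grad():
    # Stacked with nothing in between...
    stacked = layer2(layer1(inputs))

    # ...they are equivalent to a single linear relation.
    combined_weight = layer2.weight @ layer1.weight
    combined_bias = layer2.weight @ layer1.bias + layer2.bias
    collapsed = inputs @ combined_weight.T + combined_bias
    same_without_relu = torch.allclose(stacked, collapsed, atol=1e-5)

    # With a ReLU in between, the output generally no longer matches.
    with_relu = layer2(torch.relu(layer1(inputs)))
    same_with_relu = torch.allclose(with_relu, collapsed, atol=1e-5)

print(same_without_relu, same_with_relu)
```

The first flag comes out true because composing two linear maps just gives another linear map. The activation in the middle is what breaks that equivalence.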
Though many of the layers are commented out (a side effect of experimentation), it can be seen that each linear tensor relation of the form y = aX^T + b is sandwiched between ‘relu’ activation functions, which prevent the stack of layers from simplifying into a single linear relation. In practice, these activation functions select which elements of the weights introduced in each layer (represented in the equation by a) will be applied moving forward through the process.
In this way, complexity between the different input variables can be introduced, modelling the less straightforward relationships between datapoints.
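For anyone wanting to experiment, a minimal sketch of this kind of structure is given here. It is not the exact model from this week's notebook: the class name, the layer sizes and the number of classes are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardNet(nn.Module):
    """Illustrative feedforward network: linear layers with ReLU in between."""

    def __init__(self, in_features=8, hidden=16, num_classes=3):
        super().__init__()
        self.linear1 = nn.Linear(in_features, hidden)
        self.linear2 = nn.Linear(hidden, hidden)
        self.linear3 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # Each linear relation is followed by a ReLU, except the last one.
        out = F.relu(self.linear1(x))
        out = F.relu(self.linear2(out))
        return self.linear3(out)  # raw scores, one per class


model = FeedForwardNet()
batch = torch.randn(4, 8)  # 4 made-up examples with 8 features each
logits = model(batch)
print(logits.shape)
```

The final layer is left without an activation so that its raw scores can be passed straight to a loss function such as cross-entropy.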
The layers demonstrated above are called during the training step of the ‘fit’ function. This produces an output from which the loss can be calculated. The gradients of this loss with respect to the weights are then fed into an optimiser, which adjusts the weights before the next batch. One full run through the training data is known as an ‘epoch’.
Our current optimiser runs a form of stochastic gradient descent, with the hyperparameters of the training (batch size and learning rate) kept constant throughout.
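The training step described above might be sketched roughly as follows. The data here is synthetic, and the model, batch size, learning rate and epoch count are placeholder assumptions rather than the values used in the course notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 64 examples, 8 features, 3 classes.
features = torch.randn(64, 8)
labels = torch.randint(0, 3, (64,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()
# Batch size (above) and learning rate (here) stay fixed for the whole run.
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(5):
    running_loss = 0.0
    for batch_features, batch_labels in train_loader:
        loss = loss_fn(model(batch_features), batch_labels)  # forward pass and loss
        loss.backward()        # gradients of the loss with respect to the weights
        optimiser.step()       # adjust the weights using those gradients
        optimiser.zero_grad()  # clear gradients before the next batch
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(train_loader))

print(epoch_losses)
```

Because the labels are random, the loss values themselves mean very little. The order of operations inside the inner loop is the part worth noting.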
Once the model has been trained against the training outputs, it can be evaluated against the validation set. This provides a history of loss and accuracy that can be used to measure the learning progress of the model. An example can be found below.
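As a rough sketch of how such a history could be collected (not the output from this week's notebook), the helper below computes loss and accuracy on a validation set. The function name `evaluate`, the dictionary keys and the synthetic data are my own assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def evaluate(model, features, labels, loss_fn):
    """Return loss and accuracy on a held-out set, without updating any weights."""
    model.eval()
    with torch.no_grad():
        logits = model(features)
        loss = loss_fn(logits, labels).item()
        predictions = logits.argmax(dim=1)  # highest-scoring class per example
        accuracy = (predictions == labels).float().mean().item()
    return {"val_loss": loss, "val_acc": accuracy}


# Untrained placeholder model and made-up validation data.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
val_features = torch.randn(32, 8)
val_labels = torch.randint(0, 3, (32,))

history = []
history.append(evaluate(model, val_features, val_labels, nn.CrossEntropyLoss()))
print(history)
```

In a full training loop this call would sit at the end of each epoch, so that the history grows by one entry per epoch and can be plotted afterwards.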
And that’s all we’ve got time for this week. Catch up in the following weeks as we tackle more complex deep learning problems, and start work on the course projects and data science competitions.