PyTorch Zero to GANs: Final Week

TH Clapp · Published in The Startup · 15 min read · Jul 4, 2020

Here, at the end of the course, with the final project in the rearview mirror, I wanted to take the time to summarise the main learning points from the experience, largely by way of comparing the various models we encountered and exploring their relative use cases and coding patterns.

Back in the second week, we tackled the WHO life expectancy dataset using simple linear modelling. An accuracy of 95% was achieved, amongst the highest in the course. So why was this?

At least part of the explanation lies in the innate relationships within the data. Since single-layer linear modelling could aptly cover the data relations, it's likely that the data itself was well suited to this style.

However, this dataset taught an important lesson in data handling. No matter how powerful the library codebase, the data itself has to be correctly pre-processed to make best use of it. Feature scaling is a core part of this, and takes the form of two processes:

Normalisation

Normalisation, in its simplest form, is the rescaling of a diverse dataset to a common scale. In terms of machine learning, this most often aims to convert all values into a proportional spread between 0 and 1.

This aims to remove disproportionate weighting to individual features as they pass through the layers of a learning network. This is especially important for single layer linear relations, as the presence of outlier scalings in the raw data can lead to incredibly inaccurate assumptions about the importance of particular columns.

Looking back at the WHO Health Data, it can be seen that GDP and similar financial metrics are of vastly different scale to the other columns. If the data is then processed with this discrepancy intact, vast losses will ensue as the curve is fitted to massively disparate data points.

Methods for this process include:

  • Student’s T Statistic
  • Coefficient of Variation
  • Min-Max Feature Scaling

For those not coming from a data science or statistical background, the selection of options can seem somewhat imposing. The distribution of the dataset needs to be measured in order to ascertain exactly which normalisation method is best suited.

Standardisation

Standardisation in many spheres appears to be used interchangeably with normalisation, yet most accurately it refers to scaling the data such that it has a mean of 0 and a standard deviation of 1.

The proposed benefit of standardisation over other normalisation procedures, such as min-max scaling, is its preservation of outliers.

In terms of the given dataset, a normal distribution was implied, and so the ‘Standard Score’ calculation was used to pre-process the data. This involves the subtraction of the mean from the dataset, before dividing by the standard deviation.

This is also known as 'z-scoring', producing the 'standard score' of each value.
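
As a minimal sketch (values hypothetical), both scalings reduce to a couple of tensor operations:

```python
import torch

# Hypothetical feature matrix: rows are records, columns are metrics
# (e.g. GDP, schooling, mortality) on wildly different scales.
features = torch.tensor([[1.2e4, 12.1, 0.05],
                         [3.4e3,  9.8, 0.12],
                         [5.6e4, 14.0, 0.03]])

# Min-max feature scaling: squeeze every column into the range [0, 1].
col_min, _ = features.min(dim=0)
col_max, _ = features.max(dim=0)
minmax_scaled = (features - col_min) / (col_max - col_min)

# Standardisation (z-scoring): subtract the mean and divide by the
# standard deviation, giving each column mean 0 and deviation 1.
standardised = (features - features.mean(dim=0)) / features.std(dim=0)
```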

But what about the structure?

As can be seen from the code block below, the model utilised is extremely simple. A class is created with a single linear layer, and a forward function is provided to call the methods attached in __init__.

The other functions correspond to the stages of processing through the dataset that occur during each epoch (run-through) of the ‘fit’ phase of training the model.
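
A minimal sketch of such a class (names and sizes hypothetical, following the standard nn.Module pattern):

```python
import torch.nn as nn

class LifeExpectancyModel(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        # A single linear layer mapping the input features to the target.
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        # Apply the layer attached in __init__ to a batch of inputs.
        return self.linear(xb)
```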

Structurally, the training and validation stages are near identical. Both first apply the model to the current batch, then compare the model's output against the target values to generate a loss.

The generation of loss is one of the most important steps in a machine learning model, as it provides the feedback needed to improve the model. There are many loss functions available in the PyTorch package, but for this simple relationship the smoothed L1 loss function was the most appropriate.

A more in depth look at choosing loss functions for ML projects can be found here.

At the end of each epoch the related losses are stored and printed, enabling a sequential view of the model's (hopeful) improvement. These records can later be graphed or inspected directly as progress telemetry.
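
Continuing the same hypothetical sketch, those per-batch and per-epoch stages might look as follows, using PyTorch's smooth_l1_loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LifeExpectancyModel(nn.Module):
    # ... __init__ and forward as sketched above ...

    def training_step(self, batch):
        inputs, targets = batch
        out = self(inputs)                     # apply the model to the batch
        return F.smooth_l1_loss(out, targets)  # compare output to targets

    def validation_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        return {'val_loss': F.smooth_l1_loss(out, targets).detach()}

    def validation_epoch_end(self, outputs):
        # Average the batch losses recorded over the epoch.
        batch_losses = [x['val_loss'] for x in outputs]
        return {'val_loss': torch.stack(batch_losses).mean().item()}

    def epoch_end(self, epoch, result):
        # Print per-epoch telemetry for later graphing.
        print(f"Epoch {epoch}: val_loss {result['val_loss']:.4f}")
```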

That second week also proved the introduction to the model evaluation and fit functions (shown below), which would act as the base framework for the rest of the course. Some key points to notice in the layout of the training code are:

  • The optimiser: the heart of the machine learning paradigm, and what allows the model to improve its own 'fit' to the data. In this instance the optimiser runs a form of stochastic gradient descent, an iterative method of descending towards close-fit points on a curve via gradients and errors to enable better fitting of the model. This requires a certain smoothness of the loss function to enable differentiation.
  • loss.backward(): the call by which autograd computes gradients of the loss for every tensor with requires_grad=True; the optimiser then uses these gradients to update the model's parameters.
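
A hedged reconstruction of those two functions in the shape the course used (names assumed):

```python
import torch

def evaluate(model, val_loader):
    # Run the validation stage across the whole validation set.
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader,
        opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()        # autograd computes the gradients
            optimizer.step()       # the optimiser adjusts the parameters
            optimizer.zero_grad()  # reset gradients for the next batch
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history
```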

By the third week we had moved on to classification problems, and started looking at image-based datasets and the complexities found within. To this end we took a look at the CIFAR10 dataset, consisting of some 60,000 images evenly split into 10 categories.

As can be seen from the selection, the images are small, a mere 32×32 pixels, and not necessarily easy to distinguish even for humans. The simplistic single-layer model would fare poorly.

Feed-forward networks allow for greater complexity.

The layers used, also linear functions, are demarcated by the inclusion of non-linear activation functions. These mimic the 'stepwise' activation of cells in biological neural networks, and provide a discriminatory capability to the multilayer network, as well as preventing the stacked layers from being simplified into a single linear equation.

Whilst most of the examples we worked with utilised 'ReLU', a rectified linear approach to gating particular data channels, it should be noted that other activation functions exist. The preservation, and degree of preservation, of channels, and hence features of the images, is a key assessment indicator for the models.

Graphical representation of a feed-forward neural network with a single hidden layer

The introduction of additional 'hidden' layers, mediated by the activation functions, enabled the modelling of more complex data relationships. Indeed, for the CIFAR dataset, the addition of only two extra layers saw modelling accuracy jump from around 30% to over 50%.
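
A sketch of such a network for CIFAR10 (hidden-layer sizes illustrative), with ReLU activations between the linear layers:

```python
import torch.nn as nn
import torch.nn.functional as F

class CIFAR10FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 channels x 32 x 32 pixels, flattened to a vector of 3072.
        self.linear1 = nn.Linear(3 * 32 * 32, 256)
        self.linear2 = nn.Linear(256, 128)
        self.linear3 = nn.Linear(128, 10)   # one output per category

    def forward(self, xb):
        out = xb.view(xb.size(0), -1)    # flatten each image in the batch
        out = F.relu(self.linear1(out))  # activations stop the stack
        out = F.relu(self.linear2(out))  # collapsing to one linear map
        return self.linear3(out)
```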

Due to the massive additional computational load these tensor operations cause, the week also saw the introduction of hardware acceleration for the models.

CUDA, Nvidea’s framework for enabling general purpose hardware acceleration with enabled chipsets, was the first port of call. The hardware attached to a running instance can be searched for CUDA enabled cards, and the models created ported to them.

GPUs, ever at the forefront of specialised chipsets, have come into their own with the wider uptake of parallel processing tasks. The industry, previously driven largely by video gaming advancements, is starting to see the adaptation of other modelling tasks to the advantages offered by specialised processing units.

In addition to Nvidea’s offering, specific TPU (Tensor Processing Unit) offers are starting to be seen on market. Principally developed by Google, they are often found on data science related cloud IaaS and SaaS offerings, for instance Kaggle and Google’s own Colab.

However, whilst it is possible to run PyTorch code on TPUs, it is an uncommon path. TPUs, given their in-house development at Google, remain best utilised by programs written against Google's own TensorFlow framework.

So, given the accelerated hardware and improved complexity of the model, what then is the next stage for improving performance?

Enter convolutional neural networks.

Similar to the feed-forward networks, they utilise multi-layered perceptrons to iteratively process multi-channelled data into discrete outcomes. They introduce two key differences from the initial strictly linear functions.

Convolutional kernel

Convolutional kernels are one of the key improvements in these network architectures. The use of a 'kernel', a smaller tensor slid across the input, enables the processing of layers into different layouts, and the selective retention or abstraction of the data within.

As shown in the example, the use of 'padding', a border added around the numeric representation of the image channels, prevents data loss at the edges of the image, whilst the 'stride' sets how far the kernel steps at each move. Through the use of stepwise decreases in the size of hidden layers, information is distilled into the final output.

In the case of a categorisation problem, the final layer would represent the categories themselves. For identification networks, a one-dimensional binary representation would be sufficient.

As can be seen, different forms of convolution can be applied, some acting more like filters than anything else. The size of the layers can increase, decrease, or stay the same, depending on the use case of the layer.

Inside a ‘convolutional block’ is found the following:

  • A convolutional process.
  • Batch normalisation.
  • An activation function.

Batch normalisation re-normalises the data, which may have had scaling differences introduced due to the convolutional processing, prior to the application of the activation function. This ensures that comparative data weighting is maintained throughout the layers of the network.

This technique is known to improve the overall performance and scalability of the networks, though the exact mechanism behind this advantage is still contested.
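
As a sketch, such a block can be wrapped up in a small helper (the pooling flag is an optional extra, not part of the list above):

```python
import torch.nn as nn

def conv_block(in_channels, out_channels, pool=False):
    # Convolution -> batch normalisation -> activation, per the list above.
    layers = [
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),   # re-normalise before activating
        nn.ReLU(inplace=True),
    ]
    if pool:
        layers.append(nn.MaxPool2d(2))  # optionally halve the spatial size
    return nn.Sequential(*layers)
```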

This combination of greater flexibility and performance enhancement from the structure of the model itself is bolstered by tweaks elsewhere.

The loss function is replaced with a cross-entropy calculation, which measures the difference between the true and predicted probability distributions and gives the optimiser a direction for minimisation. This loss method is most commonly used in classification problems.
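
In PyTorch this is a one-line swap in the training step; a small illustration with made-up values:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)             # hypothetical outputs, batch of 4
labels = torch.tensor([3, 7, 0, 1])     # true category indices
loss = F.cross_entropy(logits, labels)  # softmax + negative log-likelihood
```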

The data itself, beyond the standardisation and normalisation, can also be further preprocessed to aid in modelling.

  1. Validation test sets: Instead of setting aside a fraction (e.g. 10%) of the training data for validation, the test set doubles as the validation set. This marginally improves the amount of data available for training.
  2. Channel-wise data normalisation: As above, but normalising each image channel separately so that the scale relationships within each channel are maintained.
  3. Randomised data augmentations: The core augmentation set unique to image-based data. Methods include the horizontal or vertical flipping of images, the padding and moving of focus points, and the stretching or squashing of the image. This aims to prevent overfitting to the training set. (A sketch of such a pipeline follows this list.)
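
With torchvision, the augmentation and normalisation steps compose into a single pipeline; a sketch using commonly quoted CIFAR10 channel statistics (values assumed):

```python
import torchvision.transforms as tt

# Commonly quoted per-channel means and standard deviations for CIFAR10.
stats = ((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))

train_tfms = tt.Compose([
    tt.RandomCrop(32, padding=4, padding_mode='reflect'),  # pad, then re-crop
    tt.RandomHorizontalFlip(),                             # random mirroring
    tt.ToTensor(),
    tt.Normalize(*stats),                                  # channel-wise
])

valid_tfms = tt.Compose([tt.ToTensor(), tt.Normalize(*stats)])
```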

Residual Networks

Despite the increases in structural complexity and the pre-processing methods available, the performance of the network still depends heavily on manual tuning. A range of so-called 'hyperparameters' and a selection of structural choices remain experimental points for the training, and each experiment can cost significant time to process.

Hyperparameters represent those aspects of the model which are not automated by the learning process and must be manually chosen and set.

  • Learning Rate: the 'distance travelled' by each stepwise update of the optimisation function. A larger learning rate leads to greater jumps in the values of the model's parameters, whereas lower learning rates equate to fine-grained adjustment. Common advice is to start with a lower learning rate, rise to a higher one, and then return to a low rate over the course of training.
  • Epochs: an epoch represents one run of the fit process through the entire training dataset. The number of epochs a model runs through will affect the accuracy of the final outcome. Underfitting is inaccuracy due to insufficient training, whereas overfitting results from so much time spent on a single set that irrelevant quirks of the data become the fixation of the modelling process.
    Both are to be avoided.
  • Batch Size: the number of images presented in each training pass before the model's internal parameters are updated. It has an impact on the training speed and stability of the model, yet is often limited by hardware constraints more than anything else.
  • Layers: the number of layers present in the architecture, dictating the complexity of the relationships the model can capture.

Residual networks can add flexibility and a degree of stability to the question of how many layers to include in a model’s structure.

A 4 block residual network

By this stage, groups of layers are often represented as 'blocks' from a code perspective. As explored earlier, and seen in the diagram above, a standard 'convolutional block' is likely to consist of a convolution, a batch normalisation, and an activation function.

But how many of these layers will be necessary to maximally model the dataset? Are all of the layers included actually increasing accuracy?

A residual block

Representing one block in the flow of a tensor through the network, the inclusion of a datapath which bypasses the weighted layers prior to the final activation allows a dynamic method of judging a layer's effectiveness.

If the layer is ineffective, and its output suppressed by the activation, the re-inclusion of the original identity x bypasses the effect of the block. By careful structuring of these paths, the iterative increase of complexity in the model itself can be effectively automated.
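
As a sketch, the bypass is nothing more than an addition in the forward pass (channel counts preserved so the identity can be re-added without reshaping):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # re-include the identity before activation
```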

The exact positioning of these bypasses relative to the different layers can provide additional flexibility in the responsiveness of the network, as demonstrated graphically below.

In addition to this quasi-automation of the model's layering, the learning rate can also be handled more dynamically.

  • Learning rate scheduling: Instead of using a fixed learning rate, a scheduler can be enabled to alter the learning rate per batch. A range of mechanistic and policy-based approaches to scheduling are available, including the 'One Cycle Learning Rate Policy'.
  • Weight decay: Additional terms can be added to the loss function to prevent channel-wise weights becoming too large; this is known as weight decay.
  • Gradient clipping: gradient values can also grow to disproportionately affect the model, which can be solved by 'clipping' the gradients after loss.backward() is calculated. (All three tweaks are sketched in code below.)

Variants in learning rate scheduling
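
Putting the three together, a hedged sketch of a training loop (hyperparameter values illustrative, reusing the training_step pattern from earlier):

```python
import torch

def fit_one_cycle(epochs, max_lr, model, train_loader,
                  weight_decay=1e-4, grad_clip=0.1):
    # Weight decay is passed straight to the optimiser.
    optimizer = torch.optim.Adam(model.parameters(), max_lr,
                                 weight_decay=weight_decay)
    # One Cycle policy: ramp the learning rate up, then anneal it down.
    sched = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr, epochs=epochs,
        steps_per_epoch=len(train_loader))
    for epoch in range(epochs):
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            # Clip gradients after backward(), before the optimiser step.
            torch.nn.utils.clip_grad_value_(model.parameters(), grad_clip)
            optimizer.step()
            optimizer.zero_grad()
            sched.step()  # adjust the learning rate per batch
```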

Amongst the many options for dynamic learning rate assignment, entire alternative optimisation strategies can be implemented. Some, such as Adam (Adaptive Moment Estimation), scale each update by a momentum term (a running estimate of the gradient's mean) and divide by a factor related to the gradient's variance.

This overhaul of the optimisation strategy itself allows for far more efficient modelling, with local minima in the loss landscape located extremely quickly.

The issue of how to jump between minima, or predict the steepest gradients ahead of time, is a major area of study in this branch of statistics.

As can be seen, by this point in the evolution of the original model framework the complexity has increased significantly, with a greater pool of library resources being drawn upon. In addition to heavier usage of the PyTorch package, more complex Python language features are being utilised, such as functional decorators and class inheritance.

DCGANs and Generative Modelling

Thus far, be they regression or classification challenges, all of the problems have been what are known as 'supervised' problems. This means that the end point is pre-defined: from a dataset, the desired output is preselected and reified as the ground truth. The model then seeks to define the process by which the inputs are transformed in order to arrive at the defined output.

What then for generative problems?

When new content is being created, as in the well-known cases of deepfakes, non-existent portraits, and generated scripts, the process is not supervised. This unsupervised learning bridges the gap from an existing dataset toward machine-created content.

Whilst there are many approaches to this form of generative modelling, we focused on the creation of GANs: generative adversarial networks.

Generative Adversarial Network Superstructure

As can be seen from the diagram, the modelling in this case is split into two distinct areas, that of generation and that of discrimination.

It is important to note that for the creation of content, the previously mentioned image augmentations should be largely avoided, so as not to bias the generator. The two models are trained in turn, each reliant on the other's success, in an adversarial relationship of alternately maximised and minimised loss.
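
A hedged sketch of one such alternating step (helper names hypothetical, using the binary cross-entropy loss discussed below):

```python
import torch
import torch.nn.functional as F

def gan_train_step(generator, discriminator, real_images,
                   latent_size, opt_g, opt_d, device):
    batch_size = real_images.size(0)

    # 1. Train the discriminator: real images target 1, fakes target 0.
    opt_d.zero_grad()
    real_preds = discriminator(real_images)
    real_loss = F.binary_cross_entropy(
        real_preds, torch.ones_like(real_preds))
    latent = torch.randn(batch_size, latent_size, 1, 1, device=device)
    fake_images = generator(latent)
    fake_preds = discriminator(fake_images.detach())
    fake_loss = F.binary_cross_entropy(
        fake_preds, torch.zeros_like(fake_preds))
    (real_loss + fake_loss).backward()
    opt_d.step()

    # 2. Train the generator: try to make the discriminator answer 1.
    opt_g.zero_grad()
    preds = discriminator(generator(latent))
    g_loss = F.binary_cross_entropy(preds, torch.ones_like(preds))
    g_loss.backward()
    opt_g.step()
```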

The fiftieth epoch output of an anime face GAN

As can be seen from the above image, drawn from a GAN run for 50 epochs through a dataset of over 60,000 anime faces, the results can be rather volatile even after significant training. The reliance on very precise hyperparameter adjustments for the paired models makes this structure fairly difficult to train.

The discriminator network follows the previously discussed mechanism of multi-layered convolutional architecture. As the discriminator needs to make a boolean decision as to the veracity of presented images, the loss in this instance is calculated using binary cross-entropy. This means that the layered progression will need to distill the assessed image down to a 1×1×1 tensor representation.
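
A condensed sketch of such a discriminator (channel counts illustrative, assuming 64×64 inputs); note the LeakyReLU activations, discussed next:

```python
import torch.nn as nn

discriminator = nn.Sequential(
    # in: 3 x 64 x 64
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.2, inplace=True),
    # 64 x 32 x 32
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2, inplace=True),
    # 128 x 16 x 16
    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2, inplace=True),
    # 256 x 8 x 8
    nn.Conv2d(256, 512, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(512),
    nn.LeakyReLU(0.2, inplace=True),
    # 512 x 4 x 4
    nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=0, bias=False),
    # out: 1 x 1 x 1 -- a single real/fake score
    nn.Flatten(),
    nn.Sigmoid(),
)
```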

Leaky ReLU activation

In place of the previous activation functions, leaky ReLU passes a small gradient even for negative inputs and thereby preserves a greater number of channels; whilst unhelpful relationships can still be minimised, far less superfluous information is discarded during the layering process.

As can be seen, despite the convolutional structure and relative complexity, the code itself has become simpler. Rather than a class-based construction, the models are built in place and assigned directly as discrete objects. Their transfer to accelerated hardware is vital for this sort of processing, as it is highly calculation-intensive.

Transposed Convolution

Transposed convolution, often loosely termed deconvolution, takes place as the fundamental equivalent of the discrimination process for the generative portion of the model. As content is being actively generated, the per-datapoint padding seen in the input enables the rapid upscaling of the predictive model.

In addition to the inversion of the convolutional blocks, the activation function is once more changed; a bounded function is said to allow greater speed in saturation and colour building during the training process.

tanh(x)

Tanh() is chosen for this process, though others are available. The complexity of the input images appears to have some influence over which activation functions allow greater speed or precision during this aspect of the training.
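
The generator mirrors this structure in reverse (again a sketch; the latent size of 128 is an assumption), upscaling with ConvTranspose2d blocks and finishing with Tanh:

```python
import torch.nn as nn

latent_size = 128  # assumed size of the random input vector

generator = nn.Sequential(
    # in: latent_size x 1 x 1
    nn.ConvTranspose2d(latent_size, 512, 4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(512),
    nn.ReLU(True),
    # 512 x 4 x 4
    nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(True),
    # 256 x 8 x 8
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(True),
    # 128 x 16 x 16
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(True),
    # 64 x 32 x 32
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1, bias=False),
    nn.Tanh(),
    # out: 3 x 64 x 64, values bounded to [-1, 1]
)
```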

As can be seen from the structure, the generator's processing more or less mirrors the discriminator's, run in reverse.

Over the course of experimenting with this setup, the following images were produced with reference to the CelebA dataset of celebrity facial images.

This is represented by the following paired progress graphs, first displaying the loss per epoch, and then the relative scoring of the images.

Notice the inverse relationship between the two sets.

As opposed to the complementary scoring.

Recommendations and Words of Warning

All in all, having reached the end of this experience, I have to say I’d thoroughly recommend the PyTorch Zero to GANs course offered by jovian.ml.

Aakash has been a great instructor, and the chance to learn about and play with machine learning programming is an awesome opportunity not just for those in the industry but also for the casual observer.

Machine learning and adaptive algorithms are an ever greater part of everyday life. From recommendations to deepfakes, from gaming to politics, from data to praxis. Whether we like it or not, this technology isn’t going away.

This library, paired with the relative accessibility of Python as a language, is a great start to anyone's data science and modelling journey.

That being said, various aspects of the training of models can be immensely frustrating.

BE WARNED THAT KAGGLE NOTEBOOKS CUT OUT AFTER 9 HOURS.

I really wish that someone had told me that prior to losing an entire day’s work.

In addition to the now infamous timeout, if your computer goes into sleep mode, the processing will stop. It will also stop if no processes are running for more than an hour.

You will lose your work.

You have been warned.

To anyone who owns an Nvidia graphics card: you can set up CUDA at home.

…good luck.

And finally, to anyone attempting to import and open datasets in Google Colab: if you know how to make it work, please message me, I still don't.

Every attempt at unzipping folders within GDrive ended with a crashed browser and snarky messages from Google.

Make your sacrifices to the code gods.
