Kicking off my second extension project for the folks over at Jovian.ml, I took a gander through the WHO’s Life Expectancy Data with the aim of building a linear prediction algorithm. Found mirrored here for ease of use, the dataset provides a range of health and population metrics with which to train a model.
Running this time on the Kaggle platform, I imported the dataset as a Pandas dataframe for ease of initial manipulation.
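The import itself is a single `read_csv` call. As a minimal sketch (the Kaggle file path and the exact column subset are assumptions; a small in-memory sample stands in for the real file so the snippet runs anywhere):

```python
import io
import pandas as pd

# In-memory stand-in for the Kaggle CSV; the real call would be something
# like pd.read_csv('../input/.../Life Expectancy Data.csv') (path assumed).
csv_data = io.StringIO(
    "Country,Year,Status,Life expectancy,Adult Mortality,GDP\n"
    "Afghanistan,2015,Developing,65.0,263,584.26\n"
    "Germany,2015,Developed,81.0,57,41176.88\n"
)
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 6)
```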
With the data imported, the first order of business was to clean and normalise the values, ready for splitting.
With the target preset as ‘Life Expectancy’, it was necessary to identify the columns containing non-numeric data, in order to prevent processing errors. This functionality is built into Pandas, simplifying the problem from a technical perspective, but a quick look through the headings will turn up ‘Country’ and ‘Status’ as the likely offenders.
With these identified, and bound to a variable, it was time to start the pre-processing.
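Pandas can pick these columns out automatically via `select_dtypes` — a minimal sketch, using a hand-built dataframe in place of the real one:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Afghanistan", "Germany"],
    "Status": ["Developing", "Developed"],
    "Life expectancy": [65.0, 81.0],
})
# select_dtypes picks out the columns Pandas could not parse as numbers
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(categorical_cols)  # ['Country', 'Status']
```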
A copy of the entire dataframe is first made, and the rows with missing data dropped. This step is essential to preventing later errors: cleaning is a frequent part of preparing a dataset for processing, and despite the twenty-odd headers, it is not a given that data will exist for all rows in all categories.
With these removed, the previously identified non-numeric categories can be processed into numerical equivalents. Whilst ‘Country’ as merely a tag of sorts for the data, will not be used, the processing of ‘Status’ is of interest.
The column contains two categories, ‘Developed’ and ‘Developing’, which can be seen as a binary option. This will present values of 0 and 1.
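One way to perform this conversion is Pandas’ categorical codes (the 0/1 assignment shown here is alphabetical, which is an assumption about the encoding the project actually used):

```python
import pandas as pd

df = pd.DataFrame({"Status": ["Developing", "Developed", "Developing"]})
# cat.codes assigns integers alphabetically: 'Developed' -> 0, 'Developing' -> 1
df["Status"] = df["Status"].astype("category").cat.codes
print(df["Status"].tolist())  # [1, 0, 1]
```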
The data will now be split into two principal categories: inputs and targets.
Out of the available categories, ‘Life Expectancy’ is already confirmed as the target. Therefore it, ‘Country’, and ‘Year’ can be removed from the input data, as the latter two are unlikely to have a direct causal relationship with the target.
Of the two, ‘Year’ is more likely to be of use; however, any relevance between recent events and life expectancy would be outside the scope of the data available, and hence should not be a focus for the model.
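The split described above amounts to selecting the target column and dropping it, along with the excluded columns, from the inputs — a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Afghanistan", "Germany"],
    "Year": [2015, 2015],
    "Status": [1, 0],
    "Life expectancy": [65.0, 81.0],
    "GDP": [584.26, 41176.88],
})
targets = df["Life expectancy"]
# Drop the target plus the two columns judged non-causal
inputs = df.drop(columns=["Life expectancy", "Country", "Year"])
print(inputs.columns.tolist())  # ['Status', 'GDP']
```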
With extraneous and missing data removed, it would seem the work is done. Yet this is not the case.
In the various categories, values range between single figures and the millions. With this variance, it will be extremely difficult to build a linear model that accounts for the chaotic nature of the relations.
The data must first be standardised.
There are a selection of methods available for standardising data, including min-max scaling, range division, data centering, and standard deviation division. Of the available tools, z-score scaling was selected.
By first calculating the mean of the data, subtracting it from the values, then dividing by the standard deviation, the data can be both centered and scaled appropriately.
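In Pandas this is one vectorised expression over the whole inputs frame — a sketch on a single toy column:

```python
import pandas as pd

inputs = pd.DataFrame({"GDP": [100.0, 200.0, 300.0]})
# z-score: subtract the column mean, divide by the column standard deviation
scaled = (inputs - inputs.mean()) / inputs.std()
print(scaled["GDP"].tolist())  # [-1.0, 0.0, 1.0]
```

After scaling, every column is centered on zero with unit spread, so no single category dominates the gradient updates.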
Whilst not pursued in this iteration of the project, with much larger datasets it would be recommended to prevent crossflow (leakage) between the splits by isolating rows that belong to the same country, ensuring they are evenly distributed whilst remaining contiguous.
The Inputs and Targets datasets have now been prepared, dropping from some 2500+ rows to 1649. They are converted to numpy arrays via the .values attribute for easier manipulation as PyTorch tensors.
This done, both sets are then converted to tensors, and 12.5% of them are split off to form a validation set discrete from the training data. Values are stored as float32 rather than float64 to lower hardware costs in the later fitting steps.
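The conversion-and-split step can be sketched as follows (the arrays here are random stand-ins shaped like the cleaned data; the column count of 18 is an assumption):

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split

inputs_array = np.random.randn(1649, 18)   # stand-in for inputs.values
targets_array = np.random.randn(1649, 1)   # stand-in for targets.values

# .float() casts to float32, halving memory versus numpy's default float64
inputs_t = torch.from_numpy(inputs_array).float()
targets_t = torch.from_numpy(targets_array).float()

val_size = int(0.125 * len(inputs_t))       # 12.5% held out for validation
train_size = len(inputs_t) - val_size
dataset = TensorDataset(inputs_t, targets_t)
train_ds, val_ds = random_split(dataset, [train_size, val_size])
print(len(train_ds), len(val_ds))  # 1443 206
```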
Data loaders are constructed, with the batch size set at 64, in order to shuffle the data into batches for processing.
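A sketch of the loader construction, on a small synthetic dataset (the feature count is again an assumption):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 18), torch.randn(256, 1))
# shuffle=True re-orders the training rows each epoch before batching
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)
xb, yb = next(iter(train_loader))
print(xb.shape)  # torch.Size([64, 18])
```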
Next, the model itself is constructed.
As recommended by the course notes, the model used is structured according to that provided by Aakash, the site’s CEO and course leader. Over the course of testing, I experimented with a couple of different loss calculation mechanisms, but the unsmoothed L1 loss method appeared to produce the best results.
On instantiation, the class loads the model with randomised values, ready for training. Gradient data is maintained in order to calculate the direction of movement for the later training.
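A minimal sketch of such a model — a single linear layer with the unsmoothed L1 loss; the class name and input size are assumptions, and this follows the course template only loosely:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LifeExpectancyModel(nn.Module):
    """Single linear layer; weights are randomly initialised on instantiation
    and track gradients by default, ready for training."""
    def __init__(self, input_size):
        super().__init__()
        self.linear = nn.Linear(input_size, 1)

    def forward(self, xb):
        return self.linear(xb)

    def training_step(self, batch):
        inputs, targets = batch
        return F.l1_loss(self(inputs), targets)  # unsmoothed L1 loss

model = LifeExpectancyModel(input_size=18)
out = model(torch.randn(4, 18))
print(out.shape)  # torch.Size([4, 1])
```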
Evaluation and fit functions are defined, and the model is ready for training.
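These two functions can be sketched as below — a loose, simplified take rather than the course’s exact code, run here on a tiny synthetic dataset:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, val_loader):
    # Average L1 loss across the validation batches, no gradients needed
    with torch.no_grad():
        losses = [F.l1_loss(model(xb), yb) for xb, yb in val_loader]
    return torch.stack(losses).mean().item()

def fit(epochs, lr, model, train_loader, val_loader):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        for xb, yb in train_loader:
            loss = F.l1_loss(model(xb), yb)
            loss.backward()        # gradients give the direction of movement
            optimizer.step()
            optimizer.zero_grad()
        history.append(evaluate(model, val_loader))
    return history

# Tiny synthetic run to show the shape of the loop
model = torch.nn.Linear(3, 1)
ds = TensorDataset(torch.randn(64, 3), torch.randn(64, 1))
history = fit(epochs=5, lr=1e-1, model=model,
              train_loader=DataLoader(ds, batch_size=16, shuffle=True),
              val_loader=DataLoader(ds, batch_size=16))
print(len(history))  # 5
```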
As expected from randomised starting values, the first evaluation reports a significant loss, and the model is wildly inaccurate as a result.
With a learning rate set at 1e-1 and running for 100 epochs (loops through the training and evaluation cycle), the accuracy improves immensely.
As can be seen from the graph, the accuracy improves at a constant rate until reaching a plateau at a loss of around 2.9, representing an accuracy of 95%.
Playing around with reinstantiating the model, which picks a different starting point for the stochastic gradient descent process, the highest the accuracy got was 98%, though this can be considered something of a fluke.
It was an interesting week, and, barring technical difficulties, enjoyable.
See you all again for the next installment.