In an essay published in 1962, an IBM researcher called Arthur Samuel proposed a way to have computers ‘learn’, a different process from how we normally write code to instruct computers:
“Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.”
These three ideas (a weight assignment, an automatic means of testing the model's actual performance, and a mechanism for altering the weights to improve that performance) laid the foundation for the deep learning implementations we have today. Jeremy Howard describes a seven-step process that is key to training all deep learning models with stochastic gradient descent:
Step 1 — Initialise parameters
First we need to initialise our parameters. Please note we will use the terminology “parameters” instead of “weights”. For new models we can start with random values, because the process includes a mechanism to improve these parameters over iterations.
For example, we start with a quadratic function for our model:
f(x) = ax² + bx + c
“a”, “b” and “c” are our parameters and we will randomly assign values to them for the first iteration.
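As a minimal sketch of this step in plain Python: the names `params` and `f`, the fixed seed, and the choice of a standard normal distribution for the starting values are assumptions made for illustration, not something the process prescribes.

```python
import random

random.seed(42)  # fixed seed only so the example is repeatable (an assumption)

# Randomly initialise the three parameters a, b and c
params = [random.gauss(0.0, 1.0) for _ in range(3)]

def f(x, params):
    """Quadratic model: a*x^2 + b*x + c."""
    a, b, c = params
    return a * x**2 + b * x + c
```

At this point the values in `params` are arbitrary, so `f` is not expected to match the data yet.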
Step 2 — Predict
Next we need to run the model using these parameters. We will observe some differences between the model's predictions and the actual values in our dataset. The image below shows the results of our function (red dots) in comparison to the actual results (blue dots):
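A small sketch of the prediction step, in plain Python. The data points in `xs` and `actual` are hypothetical, and `params` is a hand-picked stand-in for randomly initialised values so that the numbers are easy to follow.

```python
def f(x, params):
    """Quadratic model: a*x^2 + b*x + c."""
    a, b, c = params
    return a * x**2 + b * x + c

# Hypothetical dataset: inputs and the values we actually observed
xs = [0.0, 1.0, 2.0, 3.0]
actual = [1.0, 2.0, 5.0, 10.0]

# Stand-in for randomly initialised parameters (a, b, c)
params = [0.5, 0.0, 0.0]

# Run the model on every input, then compare with the observed values
predicted = [f(x, params) for x in xs]
errors = [p - y for p, y in zip(predicted, actual)]
```

Every entry in `errors` is non-zero here, which is the numerical counterpart of the gap between the two sets of dots.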
Step 3 — Calculating Loss
To calculate how good the model is based on these predictions we need to design a loss function. We need to know how wrong we are. The function should give a small loss as an indicator of good performance. Our loss function might be defined as the root mean squared error, that is, the square root of the mean of the squared differences between the predictions and the actual results (using Python syntax, assuming the values are PyTorch tensors):
loss = ((predicted_speeds - actual_speeds)**2).mean().sqrt()
So as we shift the parameters and our model gets more accurate, the loss should decrease.
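The same loss can be sketched without any libraries. The function name `rmse` and the sample values below are assumptions for illustration; plain lists stand in for tensors.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - y) ** 2 for p, y in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical values: the second set of predictions is closer to the targets
actual = [1.0, 2.0, 5.0, 10.0]
rough_loss = rmse([0.0, 0.5, 2.0, 4.5], actual)
better_loss = rmse([1.0, 2.0, 4.5, 9.5], actual)
```

Here `better_loss` comes out smaller than `rough_loss`, which is the behaviour we want from a loss function.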
Step 4 — Calculating the gradient
Without going into the details, we next need to establish how the loss changes as each parameter changes. That relationship is itself a function of the parameters. As an example, let's use a quadratic function again:
We can see by eye which direction we need to move the parameter to minimise loss. Computationally we can also do this using differentiation, which calculates the gradient (the slope of the loss with respect to each parameter) for us.
Although in reality we have many more parameters and the function will be more complex, the same calculus, which dates back to Isaac Newton, works for more complex functions as well!
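To make the idea of a gradient concrete, here is a rough sketch that estimates the slope numerically with a central finite difference. This is a stand-in for the automatic differentiation that deep learning libraries provide, not how they work internally. The quadratic `loss`, with its minimum at 3.0, and the helper `numerical_gradient` are hypothetical.

```python
def loss(p):
    """Hypothetical loss curve with its minimum at p = 3.0."""
    return (p - 3.0) ** 2

def numerical_gradient(fn, p, eps=1e-6):
    """Central-difference estimate of the slope of fn at p."""
    return (fn(p + eps) - fn(p - eps)) / (2 * eps)

# Slope of the loss at p = 1.0; calculus gives 2 * (1.0 - 3.0) = -4.0
grad = numerical_gradient(loss, 1.0)
```

A negative gradient means the loss falls as the parameter increases, so the parameter should be moved to the right; a positive gradient means the opposite.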
Step 5 — Changing the parameters
We now know in which direction we need to move our parameters. But by how much? There are various ways to optimise this, but essentially we use a learning rate to control how much we adjust our parameters in each iteration. This is also known as adjusting the size of the “step”.
The larger the learning rate, the bigger the step we take along the x-axis. This can make the model learn faster, but it may also have the opposite effect: too big a step can overshoot the point with the smallest loss value. A small learning rate lets us arrive at a more precise minimum loss point, as there is less chance of missing the minimum, but training will take much longer.
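The update itself is a one-liner: subtract the gradient, scaled by the learning rate, from the parameter. The sketch below uses a hypothetical single-parameter loss with its minimum at 3.0, and the two learning rates are chosen only to contrast a cautious step with an overshoot.

```python
def loss(p):
    """Hypothetical loss curve with its minimum at p = 3.0."""
    return (p - 3.0) ** 2

def gradient(p):
    """Derivative of the loss above with respect to p."""
    return 2.0 * (p - 3.0)

def step(p, learning_rate):
    """Move p against the gradient, scaled by the learning rate."""
    return p - learning_rate * gradient(p)

start = 1.0
small_step = step(start, learning_rate=0.1)  # moves a little towards 3.0
large_step = step(start, learning_rate=1.1)  # jumps past 3.0
```

With the small learning rate the loss goes down; with the large one the parameter lands on the far side of the minimum and the loss ends up higher than where it started.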
Step 6 — Repeat
Once we have adjusted our parameters, we can go back to step 2, run the model again and make a new set of predictions. Hopefully the model will have improved and we should see the loss metric decrease. By going through a number of iterations our model should become more and more accurate.
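Putting steps 1 to 6 together, a training loop might look like the sketch below. Several simplifications are assumptions made to keep it short: the data is generated from a known quadratic (3x² + 2x + 1), the loss is mean squared error rather than its square root so the derivative stays simple, the gradients are written out by hand, and every step uses the whole dataset (plain gradient descent rather than the stochastic variant).

```python
import random

def f(x, params):
    a, b, c = params
    return a * x**2 + b * x + c

def mse(params, xs, ys):
    """Mean squared error of the model on the dataset."""
    return sum((f(x, params) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(params, xs, ys):
    """Hand-derived gradient of the MSE with respect to a, b and c."""
    n = len(xs)
    errors = [f(x, params) - y for x, y in zip(xs, ys)]
    grad_a = sum(2 * e * x**2 for e, x in zip(errors, xs)) / n
    grad_b = sum(2 * e * x for e, x in zip(errors, xs)) / n
    grad_c = sum(2 * e for e in errors) / n
    return [grad_a, grad_b, grad_c]

# Hypothetical dataset generated from a = 3, b = 2, c = 1
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [3 * x**2 + 2 * x + 1 for x in xs]

random.seed(0)
params = [random.gauss(0.0, 1.0) for _ in range(3)]  # step 1
learning_rate = 0.05
losses = []

for _ in range(500):                                  # step 6: repeat
    losses.append(mse(params, xs, ys))                # steps 2 and 3
    grads = gradients(params, xs, ys)                 # step 4
    params = [p - learning_rate * g                   # step 5
              for p, g in zip(params, grads)]
```

After the loop, `params` should sit close to the values used to generate the data, and `losses` records how the loss shrank along the way.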
Step 7 — Stop
At some point the model will no longer improve and we should stop iterating. There is also a second issue of overfitting, which essentially means the model is very accurate on the data it was trained on but may not predict well on new, unseen data.
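One common way to decide when to stop is to track the loss on data held back from training and stop once it has not improved for a few iterations. The sketch below assumes such a held-back set exists. The helper `early_stopping_step`, its `patience` parameter and the loss values are all hypothetical.

```python
def early_stopping_step(val_losses, patience=3):
    """Return (stop_step, best_step) for a sequence of held-back losses.

    Training stops once `patience` steps pass without a new best loss.
    """
    best = float("inf")
    best_step = 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            best_step = step
        elif step - best_step >= patience:
            return step, best_step
    return len(val_losses) - 1, best_step

# Hypothetical losses on unseen data: improving at first, then getting worse
val_losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.6, 0.65, 0.7]
stop_step, best_step = early_stopping_step(val_losses)
```

A loss on unseen data that rises while the training loss keeps falling is a typical sign of overfitting, so the parameters from `best_step` are usually the ones worth keeping.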