Research Progression 3: Asymptotic Extrapolation (2)
...continuing from the last post: we imagined the idea based on a limitation of Neural Networks and, from that idea, came up with a hypothesis. Now let’s continue with some logic and plots.
Background/ Supporting Logic:
In theory, the model weights slow their rate of change (let’s call it “Delta”) as the model gets trained,
i.e. “Delta” should shrink the closer a weight gets to its ideal value.
Plotted against training steps, such a weight traces a curve similar to one approaching an “Asymptote”.
(Example of a horizontal asymptote: the curve (green) approaches the horizontal line y = 4 (blue).)
Furthermore, logically speaking, “Delta” would not suddenly, jerkily drop as the curve nears its ideal value; it would keep decreasing with its own pattern. As we observe in the above graph, the slope (Delta is the slope) does not change abruptly; its value keeps shrinking smoothly (a slow exponential decay in this case).
What we are trying to do is build a prediction model that studies the pattern of Delta and approximates the value of the curve at which Delta becomes zero (or close to zero). That is the point where we reach the end of our training period.
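To make "predicting where Delta reaches zero" concrete, here is a minimal sketch. It is not the prediction model from this series; it swaps in Aitken's delta-squared formula, a classical extrapolation trick that is exact only when Delta shrinks by a constant factor each step. The function name `aitken_limit` and the example trajectory are assumptions made up for illustration.

```python
def aitken_limit(values):
    """Estimate the value a sequence is converging to from its last three points.

    Uses Aitken's delta-squared formula, which is exact when Delta shrinks
    by a constant factor every step (geometric convergence).
    """
    x0, x1, x2 = values[-3:]
    d1 = x1 - x0            # Delta between the first two points
    d2 = x2 - x1            # Delta between the last two points
    curvature = d2 - d1     # how fast Delta itself is changing
    if curvature == 0:
        return x2           # Delta is constant, so there is nothing to extrapolate
    return x2 - d2 * d2 / curvature


# Hypothetical weight trajectory approaching 4.0, like the example graph:
# w(t) = 4 - 3 * 0.9**t
history = [4 - 3 * 0.9 ** t for t in range(20)]
estimate = aitken_limit(history)
print(round(estimate, 6))
```

Real weight trajectories are noisier than this toy curve, which is one reason a learned predictor may be a better fit than a closed-form formula.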
Supporting logic verification:
Everything mentioned up to this point is completely theoretical.
Is that bad? Not at all. Everything starts with an idea, and discussing the idea is one of the most important things. However, ideas without evidence are just thoughts.
So, how do we get supporting evidence for our claim that model weight training patterns resemble the Asymptote graphs we have often observed in mathematics?
It’s simple: let’s train a simple Neural Network and save the weight values of all the layers at every step of the training period.
Now that we have all the values of the training weights, let’s split them into individual sequences, one for each trainable variable.
Finally, we can plot each weight sequence.
These are the ones I got:
1 Layer deep Vanilla Neural Network:
2 Layer deep Vanilla Neural Network:
(you can try it out by checking out this GitHub repository)
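For a feel of the experiment without the repository, here is a minimal sketch of the same three steps (train, save every weight at every step, split into one sequence per variable). It is not the repository's code: it assumes a single linear neuron trained with plain gradient descent on made-up data, and the names `train_and_record` and `split_by_variable` are hypothetical.

```python
# Toy data for the hypothetical target y = 3x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 1.0 for x in xs]


def train_and_record(steps=500, lr=0.05):
    """Train y = w*x + b with gradient descent, saving the weights after every step."""
    w, b = 0.0, 0.0
    snapshots = []
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
        snapshots.append({"w": w, "b": b})
    return snapshots


def split_by_variable(snapshots):
    """Turn per-step snapshots into one sequence per trainable variable."""
    return {name: [snap[name] for snap in snapshots] for name in snapshots[0]}


series = split_by_variable(train_and_record())
# Delta of w between consecutive steps; it should shrink as training proceeds
deltas_w = [abs(cur - prev) for prev, cur in zip(series["w"], series["w"][1:])]
print(series["w"][-1], series["b"][-1])
print(deltas_w[0], deltas_w[-1])
```

Plotting `series["w"]` and `series["b"]` against the step index gives the kind of flattening curve discussed above, at least for this toy setup.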
So, we have established a similarity between the training pattern of model weights and curves with “Asymptotic tendencies”.
What other inferences can we draw?
One other thing we can notice is that as the depth of the model increases, we get more complex training value curves. However, one fact remains: almost all values begin to head in the “correct direction” after a very brief confusion period.
What I mean by this is: take a complex model which is estimated to need around 4000 steps to train. We train it for the first 200 steps to get past the majority of the confusion period, then repeat a cycle of training for 200 steps and predicting the value 400 steps ahead,
i.e.
0 - 200 steps: Training (initial confusion period)
200 - 400 steps: Training
400 - 800 steps: skipped by predicting future value
800 - 1000 steps: Training
1000 - 1400 steps: skipped by predicting future value
... and so on. This way we can train the model with 1600 training steps and 2400 predicted steps instead of 4000 training steps.
Keep in mind the 2400 predicted steps are not actually steps; they’re more like jumps. If we exclude the time and computation for the prediction model (I will explain why we can leave that out of the calculation), we get an approximate efficiency gain of 60%, since 2400 of the 4000 steps are skipped.
Furthermore, this efficiency is based on a very conservative optimisation (i.e. the jump window is only 400 steps per 200 training steps).
.... let that sink in...
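The arithmetic behind the schedule can be checked with a short sketch. The function `build_schedule` and its parameter names are hypothetical; the numbers are the ones from the example above (4000 total steps, a 200-step confusion period, 200-step training windows and 400-step jumps).

```python
def build_schedule(total_steps=4000, warmup=200, train_window=200, jump_window=400):
    """Return a list of (start, end, kind) phases covering total_steps."""
    phases = [(0, warmup, "train")]  # initial confusion period
    step = warmup
    while step < total_steps:
        # Train for one window (shortened if we hit the end)
        end = min(step + train_window, total_steps)
        phases.append((step, end, "train"))
        step = end
        # Jump ahead only if a full jump still fits before the end
        if step + jump_window <= total_steps:
            phases.append((step, step + jump_window, "jump"))
            step += jump_window
    return phases


phases = build_schedule()
trained = sum(end - start for start, end, kind in phases if kind == "train")
jumped = sum(end - start for start, end, kind in phases if kind == "jump")
saving = jumped / (trained + jumped)
print(trained, jumped, saving)
```

Changing `jump_window` or `train_window` shows how the saving moves with a less conservative setting, keeping in mind that this counts steps only and says nothing about how accurate the jumps would be.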
Reason to leave prediction jump computation time out of the calculation:
For all of you hopeless critics (JK, always ask questions, as much as you can), the reason we can leave the prediction computation cost out of the calculation has 2 parts:
a) The computation required for predicting through a Neural Network is negligible compared to the training phase (training requires both a forward and a backward pass for every single step, while predicting requires only a single forward pass).
b) The other reason is that time series forecasting type models work on a sequence of inputs (in our case, the sequence of weight values after every step of the training procedure). Since the 2 models (the original model being trained and the asymptote predicting model) are separate, during actual deployment we can feed the sequence of values (200 values in the above example) to the asymptote predicting model while the original model is still being trained, which hides the time the asymptote predicting model needs. Let’s say every step of the original model takes 1 second to train: as soon as each step finishes, we can send the new value to the asymptote predicting model, letting it do its computation on the first 199 values before the 200 step training sequence of the original model is even finished.
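Point (b) can be sketched as a producer and a consumer sharing a queue. This is only an illustration under assumptions: the training loop is a stand-in that nudges one weight toward 4.0, the "predictor" merely tracks the latest Delta, and the names `predictor_worker` and `train_window` are hypothetical. Python threads also do not run CPU-bound work truly in parallel, so a real deployment would more likely use a separate process or device.

```python
import queue
import threading


def predictor_worker(inbox, state):
    """Consume weight values as they arrive, updating a running summary."""
    while True:
        value = inbox.get()
        if value is None:            # sentinel: the training window is over
            break
        state["seen"].append(value)
        # Placeholder for the real predictor's incremental computation:
        # here we only track the most recent Delta.
        if len(state["seen"]) >= 2:
            state["last_delta"] = state["seen"][-1] - state["seen"][-2]


def train_window(n_steps, inbox):
    """Stand-in for the original model's training loop."""
    w = 0.0
    for _ in range(n_steps):
        w += 0.1 * (4.0 - w)         # hypothetical update pulling w toward 4.0
        inbox.put(w)                 # hand the new value over as soon as the step ends
    inbox.put(None)                  # tell the predictor the window is finished
    return w


inbox = queue.Queue()
state = {"seen": [], "last_delta": None}
worker = threading.Thread(target=predictor_worker, args=(inbox, state))
worker.start()
final_w = train_window(200, inbox)
worker.join()
print(len(state["seen"]), state["last_delta"])
```

By the time the 200th value arrives, the consumer has had the whole window to process the earlier ones, which is the overlap the argument relies on.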
That was it for the idea.
Feel free to ask any questions in the comments. A code explanation for the repository is coming soon, stay tuned...
Thanks for following along,
One enigma decoded, on to the next one....










