Slow, Fast, or Furious? A Deep Dive into Learning Rates in Deep Learning
In deep learning, dozens of hyperparameters influence how effectively a model learns — but none is more critical than the learning rate. Whether training proceeds smoothly, crawls painfully, or explodes into chaos depends largely on this single value. A model can have the perfect architecture, the right optimizer, and a clean dataset, but with a poor learning rate, everything falls apart.
In this post, I explore the learning rate through both a theoretical and an experimental lens. After explaining what it is and why it matters, I run a simple experiment using a small CNN on the MNIST digit dataset and train it with three different learning rates. By comparing how the same model behaves under different settings, we see how dramatically this hyperparameter shapes the entire training process.
The aim is not just to explain what the learning rate is, but to understand how it shapes the learning process — slow, fast, or downright furious.
1. What Exactly Is the Learning Rate?
The learning rate, commonly represented as η, is a hyperparameter that controls how much a neural network’s weights are updated during training. Each update step aims to reduce the model’s loss, and the learning rate determines how large or small that step should be.
At the core of deep learning optimization is gradient descent. During each iteration, the model computes the gradient of the loss function with respect to its parameters and adjusts those parameters in the opposite direction of the gradient.
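Mathematically, weight updates follow the rule:
w ← w − η · ∇L(w)
Here w denotes the model's weights, η the learning rate, and ∇L(w) the gradient of the loss with respect to the weights.
This equation highlights the purpose of the learning rate: it scales the gradient to set the size of each step.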
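As a minimal sketch of the update rule, the snippet below applies a single gradient descent step to a small list of weights. The function name `sgd_step` and the example numbers are illustrative assumptions, not values from the experiment later in this post.

```python
def sgd_step(weights, grads, lr):
    """Return new weights after one gradient descent step: w <- w - lr * grad."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Hypothetical values chosen only to make the arithmetic easy to follow.
weights = [1.0, -2.0]
grads = [2.0, -4.0]

new_weights = sgd_step(weights, grads, lr=0.25)
print(new_weights)  # [0.5, -1.0]
```

Doubling `lr` doubles the distance each weight moves for the same gradient, which is all the learning rate does at the level of a single step.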
Purpose of the Learning Rate
Controls update size: A high learning rate results in large weight updates, while a low one leads to small updates.
Determines convergence behavior: It influences how quickly the model learns, how stable training is, and whether the model reaches an optimal solution.
Balances speed and stability: The learning rate must be high enough to make meaningful progress but low enough to avoid overshooting the optimum.
Theoretical Effects on Training
1. Effect on Model Convergence
The learning rate directly influences how fast and how smoothly the model converges to a minimum of the loss function.
Too high: The model may overshoot minima, causing oscillation or divergence.
Too low: Training becomes extremely slow and may get stuck in flat regions of the loss surface.
Well-tuned: The model converges efficiently with stable decreases in loss.
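These three regimes can be reproduced on a toy problem. The sketch below minimizes f(w) = w² with plain gradient descent. The function `run_gd` and the three learning rates are assumptions picked for this one-dimensional example only; they are unrelated to the values used for the CNN later on.

```python
def run_gd(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with gradient descent; return |w| after every step."""
    w = w0
    history = [abs(w)]
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # gradient descent update
        history.append(abs(w))
    return history

# Too high, well-tuned, and too low for this particular toy function.
for lr in (1.1, 0.25, 0.001):
    final = run_gd(lr)[-1]
    print(f"lr={lr}: |w| after 20 steps = {final:.4g}")
```

On this function each step multiplies w by (1 − 2·lr). With lr = 1.1 that factor is −1.2, so w flips sign and grows at every step; with lr = 0.25 it halves each step; with lr = 0.001 it shrinks by only 0.2% per step.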
2. Effect on Model Performance
A poorly chosen learning rate impacts performance significantly:
High LR → unstable weight updates → poor accuracy
Low LR → insufficient progress within the training budget → underfitting
Well-tuned LR → best balance of speed and stability
3. Effect on Generalization
Generalization refers to how well the model performs on unseen data.
If LR is too high, the model may never settle into a good minimum and ends up with high test error.
If LR is too low, the model may settle in a suboptimal region or, within a fixed training budget, stop well short of a good solution and underfit.
Proper LR helps the optimizer find a region of the loss landscape that generalizes well.
Experiment: Testing the Effect of Learning Rates on CNN Training
To understand how the learning rate influences the behavior of a deep learning model, I conducted a controlled experiment using a simplified convolutional neural network trained on a reduced subset of the MNIST dataset. By keeping the model architecture, optimizer, and number of epochs constant—and changing only the learning rate—we can clearly isolate its impact on convergence and performance.
Objective:
The goal of this experiment is to observe how three different learning rates affect:
• training speed
• stability of the loss curve
• accuracy progression
• overall learning behavior
This helps demonstrate, in a concrete and visual way, why the learning rate is considered one of the most important hyperparameters in deep learning.
Dataset
To make the experiment lightweight and fast while still meaningful, I used a subset of the MNIST dataset:
10,000 training images
2,000 test images
Grayscale, 28×28 resolution
10 class labels (digits 0–9)
MNIST is widely used for introductory experiments because it is clean, simple, and efficient to train on—even with modest hardware.
Model Architecture
For this experiment, I used a compact CNN designed specifically for speed while retaining enough capacity to learn digit features. The model consists of:
Conv2D (1 → 8 filters), kernel size 3
ReLU activation
Max Pooling (2×2)
Conv2D (8 → 16 filters), kernel size 3
ReLU activation
Max Pooling (2×2)
Flatten layer
Fully Connected Layer (400 → 64 units)
ReLU activation
Output Layer (64 → 10 units)
This smaller CNN trains in seconds, making it ideal for quick comparisons across multiple learning rate settings.
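The 400-unit input to the fully connected layer can be checked with a little shape arithmetic. The sketch below assumes stride-1 convolutions without padding and bias terms in every layer; none of these details is stated above, but they are the settings under which 28×28 inputs flatten to exactly 400 features. The helper `conv_out` is a hypothetical name.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                   # MNIST images are 28x28
size = conv_out(size, kernel=3)             # Conv2D 1 -> 8, kernel 3: 26
size = conv_out(size, kernel=2, stride=2)   # Max pooling 2x2: 13
size = conv_out(size, kernel=3)             # Conv2D 8 -> 16, kernel 3: 11
size = conv_out(size, kernel=2, stride=2)   # Max pooling 2x2: 5
flat_features = 16 * size * size            # 16 channels of 5x5 feature maps

# Parameter counts: weights plus one bias per output channel or unit (assumed).
conv_params = (1 * 8 * 3 * 3 + 8) + (8 * 16 * 3 * 3 + 16)
fc_params = (flat_features * 64 + 64) + (64 * 10 + 10)
total_params = conv_params + fc_params

print(flat_features, total_params)  # 400 27562
```

Under these assumptions the network has fewer than 28,000 trainable parameters, which is consistent with it training in seconds.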
Learning Rates Tested
I trained the exact same model with three different learning rates:
0.1
0.01
0.0001
5. Results
After training the same CNN with three different learning rates (0.1, 0.01, and 0.0001) for 3 epochs on a 10,000-sample MNIST training subset, I recorded the training loss and the test accuracy on the 2,000 test images after each epoch.
Numerical summary
LR = 0.1
Loss: 3.1660 → 2.3058 → 2.3066
Accuracy: 11.7% → 8.9% → 11.7%
LR = 0.01
Loss: 0.5158 → 0.1239 → 0.0872
Accuracy: 94.25% → 96.20% → 96.20%
LR = 0.0001
Loss: 2.2539 → 2.0417 → 1.6081
Accuracy: 43.20% → 60.50% → 66.80%
Figure 1 (Training Loss vs Epochs) shows that:
The model with LR = 0.01 rapidly drives the loss down to about 0.09 within just three epochs.
The LR = 0.0001 model also reduces the loss, but much more slowly; it is still about 1.61 after three epochs.
The LR = 0.1 model's loss drops after the first epoch but then plateaus at about 2.31, indicating that it fails to continue learning effectively.
Figure 2 (Test Accuracy vs Epochs) shows that:
LR = 0.01 reaches around 96% test accuracy, stabilizing by the second epoch.
LR = 0.0001 improves steadily from about 43% to 66.8%, but remains far behind LR = 0.01 within the same number of epochs.
LR = 0.1 hovers around 9–12% accuracy, which is close to random guessing for 10 classes, meaning the model essentially fails to learn.
6. Analysis: Slow, Fast, or Furious?
These results line up very well with the theoretical expectations discussed earlier.
LR = 0.1 — Furious and Unstable
The highest learning rate behaves like the “furious” setting:
Accuracy stays near chance level (around 9–12%), even after three epochs.
Loss drops from 3.17 to 2.31 after the first epoch but then stops improving, suggesting that the steps are too large for the optimizer to settle toward a good minimum.
This matches the theory that an excessively high learning rate can cause the optimizer to overshoot minima and effectively bounce around without settling.
In practice, this learning rate is clearly too aggressive for this model and dataset.
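The plateau near 2.3 has a simple interpretation if the loss is the usual softmax cross-entropy with natural logarithms. That is an assumption on my part here, since the loss function is not named above. Under it, a model that assigns equal probability to all 10 digits has a loss of ln(10):

```python
import math

num_classes = 10
uniform_prob = 1.0 / num_classes

# Cross-entropy of a uniform prediction: -ln(1/10) = ln(10).
chance_loss = -math.log(uniform_prob)
chance_accuracy = uniform_prob

print(f"{chance_loss:.4f}")      # 2.3026
print(f"{chance_accuracy:.0%}")  # 10%
```

A loss stuck at roughly this value, together with accuracy near 10%, is what a network that has collapsed to an uninformative, near-uniform output would produce.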
LR = 0.01 — Fast and Effective
The middle learning rate, 0.01, clearly wins in this experiment:
Training loss drops sharply from 0.52 to 0.09 in just three epochs.
Test accuracy quickly rises above 94% in the first epoch and stabilizes at around 96.2% by the second epoch.
The loss and accuracy curves are smooth and stable, indicating healthy convergence.
This setting represents the “fast but controlled” regime: the model learns quickly without becoming unstable. It also shows that a good learning rate can reach strong performance in just a few epochs, saving both time and compute.
LR = 0.0001 — Slow but Safe
The smallest learning rate behaves like the “slow” setting:
Loss decreases gradually, from 2.25 to 1.61, and remains relatively high after three epochs.
Accuracy improves from 43.2% to 66.8%, which is clearly better than random but still much lower than the 96.2% achieved by LR = 0.01.
The curves are stable and monotonic, indicating that the model is learning, just very slowly.
This reflects the theoretical behavior of a too-small learning rate: safe, stable progress but inefficient learning that may require many more epochs to reach competitive performance.
Overall insight
With everything else held constant (same CNN, optimizer, dataset subset, and epochs), the learning rate alone produced three very different behaviors:
Too high (0.1) → almost no learning, unstable training.
Too low (0.0001) → clear learning, but slow and incomplete within the same time budget.
Well-tuned (0.01) → fast, stable convergence and the best accuracy.
This experiment illustrates why the learning rate is often called the most important hyperparameter in deep learning: a good choice can turn the same model from “random guessing” into “highly accurate” with no other changes.
7. Conclusion
In this mini-project, I explored the effect of the learning rate on a small CNN trained on a subset of the MNIST dataset. By keeping the model, optimizer, and data fixed and varying only the learning rate, I observed three distinct regimes:
A furious learning rate (0.1) that was too aggressive and prevented the model from learning.
A slow learning rate (0.0001) that allowed gradual learning but failed to reach high accuracy within a small number of epochs.
A fast but stable learning rate (0.01) that achieved around 96% test accuracy in just a few epochs.
These findings reinforce the theoretical view that the learning rate directly shapes convergence speed, training stability, and final performance. In practical terms, this suggests:
Never rely on a single default learning rate without checking its behavior.
Start with a moderate value (e.g., 0.001–0.01 for Adam) and adjust based on loss/accuracy curves.
If loss barely changes, the learning rate is probably too low; if loss is unstable or accuracy stays near random, it may be too high.
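These rules of thumb can be turned into a rough automatic check. The sketch below is only an illustration: the function name `diagnose_lr` and both thresholds (`min_rel_drop`, `chance_margin`) are hypothetical choices of mine, not established values, and the inputs are the per-epoch numbers reported in the results above.

```python
import math

def diagnose_lr(losses, accuracies, num_classes=10,
                min_rel_drop=0.05, chance_margin=0.05):
    """Rough label for one run; accuracies are fractions in [0, 1]."""
    if any(not math.isfinite(x) for x in losses):
        return "may be too high (loss diverged)"
    rel_drop = (losses[0] - losses[-1]) / abs(losses[0])
    if rel_drop < 0:
        return "may be too high (loss increased)"
    if rel_drop < min_rel_drop:
        return "probably too low (loss barely changes)"
    if accuracies[-1] <= 1.0 / num_classes + chance_margin:
        return "may be too high (accuracy near chance)"
    return "no obvious problem"

# Training loss and test accuracy per epoch, copied from the results section.
runs = {
    0.1: ([3.1660, 2.3058, 2.3066], [0.117, 0.089, 0.117]),
    0.01: ([0.5158, 0.1239, 0.0872], [0.9425, 0.9620, 0.9620]),
    0.0001: ([2.2539, 2.0417, 1.6081], [0.4320, 0.6050, 0.6680]),
}
for lr, (losses, accuracies) in runs.items():
    print(lr, "->", diagnose_lr(losses, accuracies))
```

With these thresholds, only the 0.1 run is flagged. The 0.0001 run passes because its loss does move; its slowness only becomes visible when it is compared against the 0.01 run, which is why looking at the curves side by side still matters.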
Overall, this experiment made the abstract concept of “learning rate tuning” much more concrete: a single scalar hyperparameter can decide whether a neural network learns slowly, learns efficiently, or doesn’t learn at all.