Most deep learning tutorials tell you to “just use Adam.”
That works — until it doesn’t.
This post breaks down gradient noise, adaptive optimization, and why learning rate scheduling still matters for stable training.
Cross-posted from Zeromath. Original article: https://zeromathai.com/en/adaptive-optimization-en/
The Real Problem: Gradient Noise
In theory (full-batch gradient descent):
θ ← θ − ε ∇L(θ)
In practice:
- the gradient ĝ is estimated from a mini-batch
- so every update is noisy
- and optimization can become unstable
Deep learning training is fundamentally stochastic.
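A minimal sketch of that noise on a toy least-squares problem (numpy; the data, batch size, and seed are illustrative only) — the mini-batch estimate ĝ fluctuates around the full gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # toy data matrix
y = X @ rng.normal(size=5)            # targets from a hidden linear model
theta = np.zeros(5)

def full_grad(theta):
    # exact gradient of the mean squared error over the whole dataset
    return 2 * X.T @ (X @ theta - y) / len(X)

def minibatch_grad(theta, batch_size=32):
    # same gradient, estimated from a random mini-batch
    idx = rng.choice(len(X), batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ theta - yb) / batch_size

g = full_grad(theta)
g_hat = minibatch_grad(theta)
print(np.linalg.norm(g_hat - g))      # nonzero -> this is the gradient noise
```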
Momentum Solves Direction
Momentum smooths the gradient with an exponential moving average, then steps along it:
vₜ = αvₜ₋₁ + (1 − α)ĝₜ
θₜ = θₜ₋₁ − ε vₜ
It acts like inertia:
- reduces oscillation
- stabilizes direction
- speeds up convergence
Without momentum:
- zig-zag updates
- slow progress
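A minimal sketch of the two formulas above on an ill-conditioned quadratic (numpy; the learning rate, α, and step count are illustrative values, not recommendations):

```python
import numpy as np

# Ill-conditioned quadratic: loss = 0.5 * (100*x^2 + y^2); plain SGD zig-zags along x.
def grad(p):
    return np.array([100.0 * p[0], p[1]])

def momentum_step(p, v, lr=0.01, alpha=0.9):
    v = alpha * v + (1 - alpha) * grad(p)   # v_t = α v_{t-1} + (1 − α) ĝ_t
    return p - lr * v, v                    # θ_t = θ_{t-1} − ε v_t

p, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    p, v = momentum_step(p, v)
print(p)   # moves toward the minimum at (0, 0); the steep x-direction is damped, not zig-zagging
```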
Adaptive Learning Rates Solve Scale
Different parameters need different step sizes.
AdaGrad
- scales each parameter's step by its accumulated squared gradients
- works well for sparse features
- but the accumulated sum makes the learning rate decay too aggressively
RMSProp
- replaces the sum with a moving average of squared gradients
- keeps updates responsive
- fixes AdaGrad's decay problem
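A minimal sketch contrasting the two update rules (numpy; the hyperparameter values are illustrative defaults, not from the post):

```python
import numpy as np

def adagrad_step(theta, s, grad, lr=0.01, eps=1e-8):
    s = s + grad**2                          # sum accumulates forever -> step keeps shrinking
    return theta - lr * grad / (np.sqrt(s) + eps), s

def rmsprop_step(theta, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * grad**2        # moving average -> step stays responsive
    return theta - lr * grad / (np.sqrt(s) + eps), s
```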
Adam Combines Both
Adam = Momentum + RMSProp
That’s why it’s the default:
- stable
- fast
- easy to use
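A minimal sketch of a single Adam step (numpy; the defaults shown are the commonly used values, not taken from the post): momentum supplies the direction, the RMSProp-style second moment supplies the per-parameter scale, and both are bias-corrected.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment  (momentum / direction)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (adaptive scale)
    m_hat = m / (1 - beta1**t)                # bias correction, with t starting at 1
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```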
But Adam Isn’t the Full Story
In many real-world cases:
- Adam converges faster
- SGD generalizes better
A common strategy:
→ start with Adam
→ switch to SGD later
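One way that switch might look, sketched with PyTorch (the model, learning rates, and the epoch at which to switch are placeholders, not recommendations):

```python
import torch

model = torch.nn.Linear(10, 1)                # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(30):
    if epoch == 20:                           # hypothetical switch point
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    # ... run the usual training loop with `optimizer` for this epoch ...
```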
Learning Rate Scheduling Solves Time
Even with Adam, the learning rate still matters.
Because training changes over time:
- early → explore
- late → refine
What Actually Works in Practice
- cosine decay
- warm-up (especially for large models)
- step decay for simple setups
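A minimal sketch of warm-up followed by cosine decay, written as a plain schedule function (the peak learning rate, warm-up length, and total step count are illustrative):

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=1000, total_steps=100_000):
    if step < warmup_steps:                   # linear warm-up
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))   # cosine decay toward 0
```

In PyTorch, a function like this is typically wired in through torch.optim.lr_scheduler.LambdaLR, which expects a multiplier relative to the optimizer's base learning rate.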
Big Picture
Optimization =
- Momentum → direction
- Adaptive LR → scale
- Scheduling → time
Adaptive methods fix parameter-level issues.
Schedulers fix time-level issues.
You need both.
Practical Defaults
If you start a new project:
- Adam + cosine decay
- warm-up for large models
If performance matters:
- try switching to SGD at the end
One Insight That Changes Everything
In large-scale deep learning:
the learning rate schedule often matters more than the choice of optimizer
Question
Do you stick with Adam the whole time,
or switch to SGD for better generalization?
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai