"The present contains nothing more than the past, and what is found in the effect was already in the cause."
Henri Bergson
Everything So Far Assumed a Snapshot
Every network we've built treats the input as a static snapshot. Feed it in, get a prediction out. The order doesn't matter. There's no before or after.
That works for isolated images. A digit is a digit regardless of what came before it. But even for images, context matters in complex scenes. A round object next to a table is a plate. The same round object in the sky is the moon. CNNs detect the shape but don't understand the surrounding context.
For language, the problem is even more fundamental. Text is inherently sequential. What came before changes the meaning of what comes after.
"My teacher said I was slow, but he didn't know I was just getting started."
What does "he" refer to? The teacher. But only because you held "my teacher" in mind while reading the rest. You carried context forward, unconsciously, effortlessly.
Every architecture we've built so far would fail this. None of them carry anything forward.
Learning to Read, Letter by Letter
I remember learning to read. Not the fluent reading I do now. The early, effortful kind.
Each letter had to be identified consciously. Then combined with the next to form a sound. Then sounds stitched into a word. Then words assembled into meaning. It was slow, sequential, and exhausting. By the time I reached the end of a long sentence, I'd often forgotten how it started.
That's a vanilla RNN. It processes sequences one step at a time, maintaining a hidden state, a running summary of everything seen so far:
At each step t:
hidden(t) = tanh( W_h × hidden(t-1) + W_x × input(t) )
output(t) = W_o × hidden(t)
The hidden state is the memory. It blends the new input with what came before. The same weights are reused at every step. One set of weights, applied repeatedly across time.
h(0) ──► h(1) ──► h(2) ──► h(3) ──► ...
 ▲        ▲        ▲        ▲
 │        │        │        │
x(0)     x(1)     x(2)     x(3)
It works for short sequences. Just like the early reader who handles a short word fine but loses the thread of a long sentence.
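If that recurrence feels abstract, here's the same loop as a minimal numpy sketch. The dimensions, weight names, and random initialization are illustrative assumptions, not taken from any particular library:

import numpy as np

def rnn_forward(inputs, W_h, W_x, W_o, h0):
    # Run a vanilla RNN over a sequence, one step at a time.
    # The same three weight matrices are reused at every step.
    h = h0
    outputs = []
    for x in inputs:
        # Blend the previous hidden state with the new input.
        h = np.tanh(W_h @ h + W_x @ x)
        # Read a prediction off the current hidden state.
        outputs.append(W_o @ h)
    return outputs, h

# Toy dimensions: 8-dimensional inputs, 16-dimensional hidden state, 4-dimensional outputs.
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(16, 16))
W_x = rng.normal(scale=0.1, size=(16, 8))
W_o = rng.normal(scale=0.1, size=(4, 16))
xs = [rng.normal(size=8) for _ in range(5)]
outs, h_final = rnn_forward(xs, W_h, W_x, W_o, np.zeros(16))

Notice that h_final is the only thing carried forward: everything the network knows about the sequence has to fit in that one vector.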
The Long Sentence Problem
Training uses backpropagation through time: the network is unrolled across its time steps and gradients flow backward through the whole chain. And here's where the familiar problem returns: the vanishing gradient from Post 07.
For a sequence of 50 words, the backward pass multiplies the gradient by the recurrent weight matrix and the tanh derivative at every step, and those factors are usually smaller than one. After 50 such multiplications, the gradient reaching step 1 is effectively zero. The network forgets the beginning of the sentence. Like my early reading days: by the end of a long sentence, I'd forgotten how it started.
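You can watch this happen in a few lines of numpy. The weight scale and the constant standing in for the tanh derivative are assumptions chosen to illustrate the typical vanishing case, not measurements from a trained network:

import numpy as np

# Illustrative assumptions: a 16-unit recurrent layer with small random weights,
# and a constant 0.5 standing in for the tanh derivative (1 - tanh(z)^2, always < 1).
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.25, size=(16, 16))
tanh_deriv = 0.5

grad = np.ones(16)  # gradient arriving at the last time step
norms = []
for _ in range(50):
    # Each step backward multiplies by the recurrent weights and the tanh derivative.
    grad = (W_h.T @ grad) * tanh_deriv
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[9], norms[49])
# The norm collapses toward zero long before reaching step 1,
# so the start of the sentence barely influences the weight update.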
In Post 07, skip connections fixed vanishing gradients by adding a direct additive path. We need the same idea, but for time.
LSTM: Learning to Read Fluently
Think about what changes when reading becomes fluent. You stop processing letter by letter. You chunk into words, phrases, meaning. More importantly, you become selective. You don't hold every word in memory with equal weight. You retain what matters: the subject, the tension, the unresolved question. You discard the filler.
That selectivity is what the Long Short-Term Memory network introduced.
An LSTM has two states: a hidden state (what it's currently working with) and a cell state (long-term memory). The cell state runs through the sequence with only small, controlled modifications, an additive path that lets gradients flow backward with far less decay.
Three gates control what happens to memory at each step:
Forget gate: f = sigmoid( W_f × [h(t-1), x(t)] ) → keep or erase old memory?
Input gate: i = sigmoid( W_i × [h(t-1), x(t)] ) → is this input worth storing?
Output gate: o = sigmoid( W_o × [h(t-1), x(t)] ) → what to expose right now?
Candidate: candidate = tanh( W_c × [h(t-1), x(t)] ) → proposed new memory content
Cell update: c(t) = f × c(t-1) + i × candidate
Hidden: h(t) = o × tanh( c(t) )
Each gate outputs a value between 0 and 1. Near 1 means "yes, do this." Near 0 means "no, skip it." Consider reading "My teacher said I was slow, but he didn't know I was just getting started." When the network reads "my teacher," the input gate fires high to store the subject. As it reads "said I was slow," the forget gate stays high to keep "teacher" in memory. When it reaches "he," the output gate surfaces "teacher" from memory to resolve the reference.
All three gates are learned from data. Nobody programs when to remember or forget.
The cell state update is additive: old memory plus new information. That additive structure is what saves the gradient. Instead of multiplying through a squashing function at every step, gradients flow through the cell state with far less decay. Same idea as the ResNet skip connection from Post 07, applied to time instead of depth.
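Here's what a single step looks like in numpy, following the gate equations above. The weight shapes, names, and the bias-free form are simplifying assumptions for the sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_o, W_c):
    # One LSTM step. Each weight matrix acts on the concatenated [h(t-1), x(t)].
    hx = np.concatenate([h_prev, x])

    f = sigmoid(W_f @ hx)          # forget gate: keep or erase old memory?
    i = sigmoid(W_i @ hx)          # input gate: is this input worth storing?
    o = sigmoid(W_o @ hx)          # output gate: what to expose right now?
    candidate = np.tanh(W_c @ hx)  # proposed new memory content

    # Additive cell update: old memory plus gated new information.
    # This is the path gradients flow along with far less decay.
    c = f * c_prev + i * candidate
    h = o * np.tanh(c)             # hidden state: what the network exposes right now
    return h, c

# Toy dimensions: 8-dimensional input, 16-dimensional hidden and cell state (so [h, x] has 24 entries).
rng = np.random.default_rng(0)
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(16, 24)) for _ in range(4))
h, c = np.zeros(16), np.zeros(16)
for x in (rng.normal(size=8) for _ in range(5)):
    h, c = lstm_step(x, h, c, W_f, W_i, W_o, W_c)

The only change from the vanilla RNN loop is what gets carried between steps: a gated, additive cell state alongside the hidden state.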
The hidden state isn't a recording of the past. It's a compressed summary of the parts that seem relevant for predicting what comes next. Just like a fluent reader doesn't remember the exact words from three pages ago, but does remember that the detective is suspicious of the butler.
See It
Open the playground. Train both a vanilla RNN and an LSTM, then pick a sentence length and watch the confidence bars update word by word. You'll see the exact step where the vanilla RNN changes its mind and the LSTM doesn't.
That's the difference between letter-by-letter reading and fluent reading. One forgets. The other holds on.
What's Next
RNNs gave networks memory. But they process sequences step by step. Step 2 waits for step 1. Step 50 waits for step 49. For a sequence of 100 tokens, that's 100 sequential operations. You can't parallelize.
There's a deeper problem too. The hidden state has to compress everything seen so far into a fixed-size vector. For long sequences, that bottleneck loses information no matter how good the gating is.
What if the network could look back at any part of the input directly, regardless of distance? No compression. No sequential chain.
That's attention. And it's what made Transformers possible.
References:
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8).
Cho, K., et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder. EMNLP.
Series: From Perceptrons to Transformers | Code: GitHub
