The Overestimation Problem Cost Me 40% Performance
Vanilla DQN scored 287 average reward on Breakout after 10M frames. Double DQN hit 412. Dueling DQN reached 438.
That's not just a numbers game. The gap between vanilla and Double DQN represents the cost of Q-value overestimation bias — a silent failure mode that takes hours to surface in training curves. I ran all three variants on the same hardware (RTX 3080, Gymnasium 0.29.1, Python 3.11) with identical hyperparameters to isolate the architectural differences. Here's what actually breaks and why.
Why Vanilla DQN Overestimates Everything
The core DQN update uses the Bellman equation to learn Q-values:
$$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$$
Continue reading the full article on TildAlice

Top comments (0)