Recurrent Networks Explained: How RNNs Process Sequences
The Basic RNN Architecture
A standard feedforward neural network takes a fixed-size input and produces a fixed-size output. Every input is processed independently, with no connection between one prediction and the next. This works fine for tasks like classifying an image, where all the information needed is present in a single input. But it fails for sequences. Understanding the word "bank" in a sentence requires knowing what came before it: "river bank" and "savings bank" mean completely different things.
An RNN solves this by adding a loop. At each time step, the network takes two inputs: the current element of the sequence (like the current word) and the hidden state from the previous time step. It produces two outputs: an output for the current time step and an updated hidden state that will be passed to the next time step. The hidden state is a vector of numbers that encodes a compressed summary of everything the network has seen so far in the sequence.
Mathematically, the computation at each time step is: h(t) = activation(W_h * h(t-1) + W_x * x(t) + b), where h(t) is the new hidden state, h(t-1) is the previous hidden state, x(t) is the current input, W_h and W_x are weight matrices, and b is a bias vector. The activation function is typically tanh, which squashes values to the range -1 to 1. The same weights W_h and W_x are shared across all time steps, which means the network applies the same transformation at every position in the sequence, just with different hidden state context.
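The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name rnn_step, the layer sizes, and the random weights are all assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN update: h(t) = tanh(W_h h(t-1) + W_x x(t) + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Illustrative sizes and random weights (assumptions, not values from the text).
input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                     # initial hidden state h(0)
for x_t in rng.normal(size=(5, input_size)):  # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)         # same W_x, W_h, b at every step
```

After the loop, h is a vector of length hidden_size whose entries lie between -1 and 1, and it depends on all five inputs even though the weights never changed.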
When the RNN is "unrolled" across time, it looks like a very deep feedforward network where each layer corresponds to one time step and all layers share the same weights. This perspective is important for understanding how training works: backpropagation through time (BPTT) computes gradients by unrolling the network and applying the standard backpropagation algorithm through all the time steps.
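To make the unrolled view concrete, here is a hedged sketch of backpropagation through time for a vanilla tanh RNN. The toy loss (half the squared norm of the final hidden state) and the names forward, loss_fn, and bptt are assumptions chosen to keep the example short; a real model would use a task-specific loss and usually an automatic differentiation library.

```python
import numpy as np

def forward(xs, h0, W_x, W_h, b):
    """Unroll the RNN; hs[t] is the hidden state after t steps (hs[0] is h0)."""
    hs = [h0]
    for x_t in xs:
        hs.append(np.tanh(W_h @ hs[-1] + W_x @ x_t + b))
    return hs

def loss_fn(hs):
    # Toy loss (an assumption for illustration): 0.5 * ||h(T)||^2.
    return 0.5 * np.sum(hs[-1] ** 2)

def bptt(xs, hs, W_x, W_h):
    """Gradients of the toy loss with respect to the shared parameters."""
    dW_x = np.zeros_like(W_x)
    dW_h = np.zeros_like(W_h)
    db = np.zeros(W_h.shape[0])
    dh = hs[-1].copy()                      # dL/dh(T) for the toy loss
    for t in range(len(xs), 0, -1):         # walk backward through time
        da = dh * (1.0 - hs[t] ** 2)        # through tanh: derivative is 1 - tanh^2
        dW_h += np.outer(da, hs[t - 1])     # shared weights accumulate over steps
        dW_x += np.outer(da, xs[t - 1])
        db += da
        dh = W_h.T @ da                     # gradient passed to the previous state
    return dW_x, dW_h, db

rng = np.random.default_rng(0)
seq_len, input_size, hidden_size = 5, 3, 4
xs = rng.normal(size=(seq_len, input_size))
h0 = np.zeros(hidden_size)
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

hs = forward(xs, h0, W_x, W_h, b)
dW_x, dW_h, db = bptt(xs, hs, W_x, W_h)
```

Because the weights are shared, each time step adds its own contribution to the same gradient arrays. The line dh = W_h.T @ da is the repeated multiplication that the backward pass performs once per step.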
The Vanishing Gradient Problem
The basic RNN architecture has a fundamental flaw that limits its practical usefulness. During backpropagation through time, gradients must flow backward through every time step. At each step, the gradient is multiplied by the transpose of the weight matrix W_h and by the derivative of the activation function, which for tanh is never larger than 1. If the largest singular value of W_h is less than 1, the gradient is guaranteed to shrink exponentially with each step. After 50 or 100 time steps, the gradient is so close to zero that the network cannot learn dependencies that span that distance.
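The decay is easy to see numerically. The sketch below is a simplified demonstration under stated assumptions: it uses a random recurrent matrix rescaled so that its largest singular value is 0.9 (which also bounds the magnitude of its largest eigenvalue), and it ignores the tanh derivative, which could only shrink the gradient further.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8
W = rng.normal(size=(hidden_size, hidden_size))
W_h = 0.9 * W / np.linalg.norm(W, 2)   # rescale: largest singular value becomes 0.9

grad = rng.normal(size=hidden_size)
grad /= np.linalg.norm(grad)           # unit-norm gradient at the final time step

norms = []
for _ in range(100):
    grad = W_h.T @ grad                # one backward step through the recurrence
    norms.append(np.linalg.norm(grad))

# norms[k] is bounded above by 0.9 ** (k + 1), so after 100 steps
# the gradient norm is at most 0.9 ** 100, roughly 2.7e-5.
```

The bound is a worst case; for a typical random direction the norm falls even faster, which is why the signal from a distant time step contributes almost nothing to the weight update.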
This is the vanishing gradient problem, and it means that basic RNNs effectively cannot learn long-range dependencies. If the answer to a question depends on a word that appeared 200 words earlier in the text, a basic RNN will fail because the gradient signal from the answer cannot reach back to the relevant word during training. The network can maintain information in its hidden state for short spans, typically 10 to 20 time steps, but information from earlier than that is effectively lost.
The opposite problem, exploding gradients, can occur when W_h has an eigenvalue with magnitude greater than 1. Gradients then grow exponentially, causing weight updates so large that training becomes unstable. Gradient clipping, which rescales the gradient vector whenever its norm exceeds a threshold, is a simple and effective solution to exploding gradients. But vanishing gradients cannot be solved by such a simple trick: a gradient that has shrunk to nearly zero carries almost no signal about distant time steps, so there is nothing meaningful left to rescale.
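Gradient clipping by norm fits in a few lines. The following is a minimal sketch, assuming the gradients are held as a list of NumPy arrays; the function name clip_by_global_norm and the threshold of 1.0 are illustrative choices, and deep learning libraries ship their own versions of this utility.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads, total_norm          # small gradients pass through untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Two toy gradient arrays whose combined norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Clipping changes only the length of the gradient, not its direction, so the update still points the same way but cannot take an arbitrarily large step.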
LSTM: Long Short-Term Memory
Long Short-Term Memory networks, introduced by Hochreiter and Schmidhuber in 1997, largely overcame the vanishing gradient problem by introducing a gating mechanism that controls information flow. An LSTM cell maintains two states: the hidden state h (which functions like the basic RNN's hidden state) and the cell state c (a memory that can carry information across many time steps with little degradation).
The cell state is the key innovation. Information flows through the cell state via element-wise multiplication and addition, not matrix multiplication. This means that gradient signals can pass through the cell state without being multiplied by the weight matrix at each time step, avoiding the exponential decay that plagues basic RNNs. Three gates control what happens to the cell state at each time step.
The forget gate decides what information to remove from the cell state. It takes the previous hidden state and the current input, passes them through a sigmoid function (which outputs values between 0 and 1), and multiplies the result element-wise with the cell state. A value near 0 means "forget this information," while a value near 1 means "keep this information." The input gate decides what new information to add. It has two parts: a sigmoid that determines which values to update, and a tanh that creates a vector of candidate values. The output gate determines what the hidden state should be, based on a filtered version of the cell state.
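The three gates can be written out as a single step function. This is a sketch under assumptions: the four weight blocks are stacked into one matrix W in the order forget, input, candidate, output, and the names lstm_step, W, and b are chosen for the example. Libraries differ in how they order and store these blocks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H + D) and b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])            # forget gate: what to keep from c_prev
    i = sigmoid(z[H:2 * H])        # input gate: which candidate values to write
    g = np.tanh(z[2 * H:3 * H])    # candidate values
    o = sigmoid(z[3 * H:4 * H])    # output gate: what to expose as h
    c = f * c_prev + i * g         # element-wise update of the cell state
    h = o * np.tanh(c)             # hidden state is a filtered view of the cell state
    return h, c

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(6, D)):
    h, c = lstm_step(x_t, h, c, W, b)
```

Note that the cell state update uses only element-wise products and a sum. When the forget gate is close to 1 and the input gate is close to 0, c is copied forward almost unchanged, which is the path that lets gradients survive many steps.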
LSTMs can maintain relevant information across hundreds of time steps, and in favorable cases more. This made them the dominant architecture for sequence modeling through most of the 2010s, until the transformer arrived in 2017. Machine translation systems, speech recognition, text generation, handwriting recognition, music composition, and countless other sequence tasks were built on LSTMs during this period.
GRU: Gated Recurrent Unit
The Gated Recurrent Unit, introduced in 2014, simplifies the LSTM by merging the cell state and hidden state into a single state vector and reducing the number of gates from three to two. The reset gate determines how much of the previous state to incorporate when computing the candidate new state. The update gate determines how much of the candidate state to blend with the previous state, effectively combining the LSTM's forget and input gates into a single mechanism.
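A GRU step can be sketched the same way. The convention used here, where the update gate z weights the candidate state and (1 - z) weights the previous state, is one of two that appear in papers and libraries; the other swaps the roles. The name gru_step and the parameter layout are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step. Each weight matrix has shape (H, H + D)."""
    W_r, b_r, W_z, b_z, W_c, b_c = params
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx + b_r)                  # reset gate
    z = sigmoid(W_z @ hx + b_z)                  # update gate
    # The reset gate scales the previous state before it enters the candidate.
    h_cand = np.tanh(W_c @ np.concatenate([r * h_prev, x_t]) + b_c)
    # Blend: z near 0 keeps the old state, z near 1 takes the candidate.
    return (1.0 - z) * h_prev + z * h_cand

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
D, H = 3, 4
params = tuple(
    p for _ in range(3)
    for p in (rng.normal(scale=0.1, size=(H, H + D)), np.zeros(H))
)
h = np.zeros(H)
for x_t in rng.normal(size=(6, D)):
    h = gru_step(x_t, h, params)
```

The single blend on the last line of gru_step does the work of the LSTM's separate forget and input gates: whatever fraction is not kept from the old state is filled by the candidate.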
GRUs have fewer parameters than LSTMs (roughly 25% fewer for the same hidden size) and typically train faster as a result. Performance comparisons between LSTMs and GRUs on various benchmarks show no consistent winner: sometimes LSTM is slightly better, sometimes GRU is slightly better, and often the difference is within the noise margin. The choice between them is often practical: GRUs are preferable when training speed and memory matter, and LSTMs are the safer default when maximum sequence modeling capability is needed.
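The 25% figure follows from counting weight blocks: an LSTM has four (three gates plus the candidate), a GRU has three (two gates plus the candidate). A small sketch of the arithmetic, assuming one weight matrix and one bias vector per block and illustrative sizes of 128 inputs and 256 hidden units:

```python
def recurrent_param_count(input_size, hidden_size, num_blocks):
    """Parameters in one recurrent layer with num_blocks gate/candidate blocks.

    Each block has a (hidden_size x (hidden_size + input_size)) weight matrix
    and a bias vector of length hidden_size.
    """
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

lstm_params = recurrent_param_count(128, 256, num_blocks=4)
gru_params = recurrent_param_count(128, 256, num_blocks=3)
ratio = gru_params / lstm_params   # 0.75, i.e. 25% fewer parameters
```

Some library implementations store an extra bias vector per block, which changes the absolute counts slightly but leaves the 3-to-4 ratio intact.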
Bidirectional and Deep RNNs
A standard RNN processes sequences in one direction, from first element to last. But many tasks benefit from context in both directions. Understanding a word in the middle of a sentence is easier when you know what comes after it as well as what came before. Bidirectional RNNs address this by running two separate RNNs over the same sequence: one forward, one backward. The outputs of both are concatenated at each time step, giving the network full context from both directions.
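A bidirectional layer can be sketched by running the same kind of recurrence twice. For brevity this example uses vanilla tanh RNNs rather than LSTMs, and the names run_rnn and bidirectional_rnn, along with the sizes, are assumptions.

```python
import numpy as np

def run_rnn(xs, W_x, W_h, b):
    """Run a vanilla tanh RNN over xs (shape (T, D)); returns states of shape (T, H)."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    h_fwd = run_rnn(xs, *fwd_params)                # reads the sequence left to right
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]    # reads right to left, then re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)   # shape (T, 2 * H)

def init_params(rng, input_size, hidden_size):
    return (rng.normal(scale=0.5, size=(hidden_size, input_size)),
            rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
            np.zeros(hidden_size))

rng = np.random.default_rng(0)
seq_len, input_size, hidden_size = 5, 3, 4
xs = rng.normal(size=(seq_len, input_size))
fwd = init_params(rng, input_size, hidden_size)     # the two directions have separate weights
bwd = init_params(rng, input_size, hidden_size)
out = bidirectional_rnn(xs, fwd, bwd)
```

The second reversal in bidirectional_rnn matters: without it, row t of the backward states would describe position T - 1 - t instead of position t. One consequence of the design is that the full sequence must be available before any output can be produced.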
Deep RNNs stack multiple RNN layers on top of each other, with the output sequence of one layer serving as the input sequence to the next. This allows the network to learn hierarchical representations of sequential data, analogous to how deep CNNs learn hierarchical spatial features. A two-layer LSTM might have the first layer learning word-level patterns and the second layer learning phrase-level patterns. Stacking 2 to 4 layers is common; beyond that, training becomes difficult without residual connections.
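Stacking amounts to feeding one layer's sequence of hidden states into the next layer as its input sequence. The sketch below again uses vanilla tanh layers as a simplification; the names run_layer, deep_rnn, and init_layer and the layer sizes are assumptions.

```python
import numpy as np

def run_layer(xs, W_x, W_h, b):
    """One recurrent layer: maps a (T, D_in) sequence to a (T, H) sequence."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

def deep_rnn(xs, layers):
    seq = xs
    for W_x, W_h, b in layers:       # output sequence of one layer feeds the next
        seq = run_layer(seq, W_x, W_h, b)
    return seq

def init_layer(rng, in_size, hidden_size):
    return (rng.normal(scale=0.5, size=(hidden_size, in_size)),
            rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
            np.zeros(hidden_size))

rng = np.random.default_rng(0)
xs = rng.normal(size=(7, 3))                             # 7 time steps, 3 features
layers = [init_layer(rng, 3, 8), init_layer(rng, 8, 5)]  # layer 2 consumes layer 1's states
top = deep_rnn(xs, layers)                               # shape (7, 5)
```

The only constraint between layers is that each layer's input size matches the hidden size of the layer below it.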
Applications and Current Status
Before transformers, RNNs powered most sequence-to-sequence applications. Google Translate moved to a deep LSTM encoder-decoder architecture in 2016. Apple's Siri, Amazon's Alexa, and Google Assistant all used LSTM-based speech recognition. Text generation, sentiment analysis, named entity recognition, and summarization were all RNN territory.
The transformer architecture, introduced in 2017, has largely replaced RNNs for most natural language processing tasks. Transformers process all positions in a sequence simultaneously rather than sequentially, which makes them far more parallelizable and much faster to train on modern hardware. They also sidestep the vanishing gradient problem across time steps, because attention connects every position to every other position directly rather than through a long chain of recurrent updates. For language tasks, transformers have generally outperformed RNNs, particularly as models and datasets grow.
RNNs remain useful in specific niches. For real-time applications where data arrives one sample at a time and predictions must be made immediately, RNNs are a natural fit because they process input sequentially. For resource-constrained environments like embedded systems, small LSTM or GRU models require far less memory and computation than typical transformers. For very long time series data where the quadratic time and memory cost of standard transformer self-attention becomes prohibitive, RNNs are attractive because their computation scales linearly with sequence length and their state has a fixed size. Recent architectures like Mamba and RWKV attempt to combine the linear scaling of RNNs with the performance of transformers.
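The scaling difference can be illustrated with simple counting. This sketch assumes a standard self-attention implementation that materializes the full score matrix, and a hypothetical RNN hidden size of 512; the function names are made up for the example.

```python
def attention_score_entries(seq_len, num_heads=1):
    """Entries in the self-attention score matrix: one per pair of positions, per head."""
    return num_heads * seq_len * seq_len

def rnn_state_entries(hidden_size):
    """Values an RNN carries from one step to the next, regardless of sequence length."""
    return hidden_size

# (sequence length, attention score entries, RNN state entries)
rows = [(n, attention_score_entries(n), rnn_state_entries(512))
        for n in (1_000, 10_000, 100_000)]
```

Each tenfold increase in sequence length multiplies the attention score count by one hundred, while the recurrent state stays the same size. The RNN still needs one update per time step, so its total computation grows linearly rather than staying constant.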
RNNs process sequences by maintaining a hidden state that carries context from previous time steps. LSTMs and GRUs largely overcame the vanishing gradient problem that limited basic RNNs, enabling recurrent networks to learn long-range dependencies. While transformers have replaced RNNs for most language tasks, RNNs remain relevant for real-time processing, resource-constrained environments, and very long sequences.