What Is an RNN? Recurrent Neural Networks Explained

Updated May 2026
A recurrent neural network (RNN) is a neural network designed for sequential data, processing inputs one step at a time while maintaining a hidden state that carries information from previous steps. This hidden state acts as a memory, allowing the network to use context from earlier in the sequence when processing later elements. RNNs, particularly the LSTM variant, were the dominant architecture for text, speech, and time series tasks through much of the 2010s, until transformers (introduced in 2017) began replacing them for most applications around 2018.

Why Sequences Need Special Architecture

Feedforward networks treat each input independently. If you feed a sentence word by word into a feedforward network, each word is processed in isolation with no knowledge of what came before or after. But language is inherently sequential: the meaning of "bank" depends on whether the previous words were "river" or "investment." Time series data has the same property: a stock price of $150 means something very different if the previous price was $100 (rapid growth) versus $200 (sharp decline).

RNNs address this by adding recurrence: the hidden state computed at one time step feeds back as an input at the next time step. This feedback loop creates a chain of information flow across the sequence, allowing the network to build context that accumulates over time.

How an RNN Processes a Sequence

At each time step t, an RNN takes two inputs: the current element of the sequence (x_t) and the hidden state from the previous time step (h_{t-1}). It produces a new hidden state (h_t) and, optionally, an output (y_t). The computation is h_t = f(W_h h_{t-1} + W_x x_t + b), where W_h and W_x are weight matrices, b is a bias vector, and f is an element-wise activation function, typically tanh.
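The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming tanh as the activation; the layer sizes, the random initialization, and the function name rnn_step are illustrative choices, not part of any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_t = tanh(W_h @ h_prev + W_x @ x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Illustrative sizes and random weights (assumptions for this sketch).
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

h_prev = np.zeros(hidden_size)       # initial hidden state h_0
x_t = rng.normal(size=input_size)    # first input x_1
h_t = rnn_step(x_t, h_prev, W_x, W_h, b)
print(h_t.shape)  # (3,)
```

Because h_0 is all zeros here, the first step depends only on the input; from the second step onward, the W_h term mixes in what the network has already seen.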

The same weights (W_h, W_x, b) are used at every time step. This weight sharing across time is analogous to how CNNs share weights across spatial positions. It means the RNN applies the same learned transformation at every position in the sequence, and it can process sequences of any length with a fixed number of parameters.

For a sentence like "The cat sat on the mat," the RNN processes "The" first, producing a hidden state h_1 that encodes information about the first word. At the next step, it processes "cat" along with h_1, producing h_2 that encodes information about "The cat." This continues until the final word, where h_6 encodes information about the entire sentence. The final hidden state can then be used for classification (sentiment analysis, topic detection) or generation (predicting the next word).
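The walk through the sentence can be sketched as a loop that reuses the same weights at every word and then classifies from the final hidden state. This is an untrained toy under stated assumptions: the embedding table E, the two-class output head W_out, and all sizes are hypothetical, and the weights are random, so the probabilities are meaningless beyond showing the data flow.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

tokens = "The cat sat on the mat".split()
vocab = {w: i for i, w in enumerate(sorted(set(t.lower() for t in tokens)))}

# Hypothetical sizes and random, untrained parameters.
embed_size, hidden_size, num_classes = 8, 16, 2
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(len(vocab), embed_size))       # toy embeddings
W_x = rng.normal(scale=0.1, size=(hidden_size, embed_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
W_out = rng.normal(scale=0.1, size=(num_classes, hidden_size)) # toy classifier head

h = np.zeros(hidden_size)
hidden_states = []
for word in tokens:
    x_t = E[vocab[word.lower()]]
    h = rnn_step(x_t, h, W_x, W_h, b)   # same W_x, W_h, b at every position
    hidden_states.append(h)

# Classify from the final hidden state with a softmax.
logits = W_out @ hidden_states[-1]
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(len(hidden_states), probs.shape)  # 6 (2,)
```

The word "the" appears at positions 1 and 5 with the same embedding, yet the two hidden states differ, because the second one also depends on everything processed in between.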

The Vanishing Gradient Problem in RNNs

Vanilla RNNs have a severe limitation: they struggle with long-range dependencies. If information from the beginning of a long sequence is needed at the end, it must survive through dozens or hundreds of recurrent steps. At each step, the hidden state is multiplied by the recurrent weight matrix and passed through an activation function. If the largest eigenvalue magnitude of that matrix (its spectral radius) is less than 1, the signal carried from earlier steps tends to shrink exponentially with each step. If it is greater than 1, the signal tends to grow exponentially, limited only by the saturation of the activation function.
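The exponential shrinking and growth can be seen in an idealized case. The sketch below assumes a purely linear recurrence (no activation, no new inputs) and builds the recurrent matrix as a scaled orthogonal matrix, so every eigenvalue has exactly the chosen magnitude. Both simplifications are assumptions made to keep the arithmetic exact; a trained RNN is messier.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 50

# Random orthogonal matrix Q from a QR decomposition: all eigenvalues have magnitude 1.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))

def final_norm(scale):
    """Norm of the state after `steps` applications of h <- (scale * Q) @ h."""
    W_h = scale * Q                                    # eigenvalue magnitudes all equal `scale`
    h = np.ones(hidden_size) / np.sqrt(hidden_size)    # unit-norm starting state
    for _ in range(steps):
        h = W_h @ h
    return np.linalg.norm(h)

shrunk = final_norm(0.9)   # equals 0.9 ** 50, roughly 0.005
grown = final_norm(1.1)    # equals 1.1 ** 50, roughly 117
print(shrunk, grown)
```

A 10 percent difference in scale per step turns into a gap of more than four orders of magnitude after 50 steps.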

During backpropagation through time (BPTT), gradients must flow backward through the same chain of multiplications. When the weights cause signals to shrink (vanishing gradients), the network cannot learn to use information from early in the sequence because the gradient signal is too weak to update the relevant weights. When the weights cause signals to grow (exploding gradients), training becomes unstable as parameter updates become enormous.
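For a tanh RNN, the gradient of a late hidden state with respect to an early one is a product of per-step Jacobians, each equal to diag(1 - h_t^2) W_h. The sketch below accumulates that product and tracks its size. The recurrent matrix is deliberately built with norm 0.5 (an assumption chosen to show the vanishing case), and the inputs are random.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, steps = 8, 4, 30

Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
W_h = 0.5 * Q                                   # assumed small recurrent weights
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))

h = np.zeros(hidden_size)
grad = np.eye(hidden_size)                      # running d h_t / d h_0
norms = []
for _ in range(steps):
    x_t = rng.normal(size=input_size)
    h = np.tanh(W_h @ h + W_x @ x_t)
    jac = (1.0 - h ** 2)[:, None] * W_h         # d h_t / d h_{t-1} = diag(1 - h_t^2) W_h
    grad = jac @ grad                           # chain rule across time steps
    norms.append(np.linalg.norm(grad, 2))       # spectral norm of the accumulated Jacobian

print(norms[0], norms[9], norms[-1])
```

Since the derivative of tanh never exceeds 1, each step here scales the gradient by at most 0.5, so after 30 steps it is bounded by 0.5 ** 30 (under 1e-9). An error signal that small cannot meaningfully update weights that depend on the first inputs.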

In practice, vanilla RNNs often struggle to use information from more than roughly 10 to 20 time steps back. Information from further back fades to insignificance. For many tasks (short text classification, simple time series), this is sufficient. For tasks requiring longer memory (document understanding, long conversations, music composition), vanilla RNNs fail.

LSTM: Long Short-Term Memory

LSTMs, introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, address the vanishing gradient problem by adding a separate memory cell and gating mechanisms that control what information flows in and out.

An LSTM unit has three gates, each a small neural network with sigmoid activation that outputs values between 0 and 1:

The forget gate decides what to remove from the cell state. It looks at the previous hidden state and current input and produces a value between 0 (forget completely) and 1 (keep completely) for each dimension of the cell state. This allows the LSTM to clear outdated information when it is no longer relevant.

The input gate decides what new information to add to the cell state. It combines two operations: a sigmoid that selects which values to update and a tanh that creates candidate values. Only the selected candidate values are written to the cell state.

The output gate decides what to output from the cell state. It applies a sigmoid to select which parts of the cell state (after squashing with tanh) to expose as the hidden state output, which gets passed to the next time step and optionally to an output layer.
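The three gates can be sketched together as a single LSTM step. This follows the commonly used formulation, c_t = f * c_{t-1} + i * g and h_t = o * tanh(c_t); stacking all four weight blocks into one matrix W, and the sizes and random weights, are illustrative assumptions rather than any specific library's layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked pre-activations."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])            # forget gate: what to keep from c_prev
    i = sigmoid(z[H:2 * H])        # input gate: which candidate values to write
    g = np.tanh(z[2 * H:3 * H])    # candidate values
    o = sigmoid(z[3 * H:4 * H])    # output gate: what to expose as h_t
    c_t = f * c_prev + i * g       # element-wise update of the cell state
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Illustrative sizes and random weights (assumptions for this sketch).
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a sequence of 5 random inputs
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (3,) (3,)
```

Note that the weight matrix only appears when computing the gates and the candidate; the cell state itself is updated with one multiplication and one addition.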

The cell state is the key innovation. It runs through the entire sequence with only element-wise operations (multiplication by the forget gate and addition of the gated candidate values), never passing through a weight matrix or squashing activation along the way. This creates a "gradient highway": as long as the forget gate stays close to 1, gradients can flow backward through hundreds of time steps without vanishing. The gates learn when to write, read, and erase information, giving the LSTM fine-grained control over its memory.
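A rough back-of-the-envelope comparison shows why this matters. Along the direct cell-state path, the gradient of c_T with respect to c_0 is just the product of the forget-gate values at each step (ignoring the indirect paths through the gates themselves). The per-step values below, 0.99 for the forget gate and 0.9 for a vanilla RNN's shrink factor, are assumed for illustration.

```python
steps = 100

# Direct cell-state path: product of forget-gate values across all steps.
forget_gate = 0.99                    # assumed: the gate has learned to stay near 1
lstm_factor = forget_gate ** steps    # about 0.37

# Vanilla RNN for comparison: assume each step scales the gradient by 0.9.
rnn_factor = 0.9 ** steps             # about 2.7e-5

print(f"LSTM path: {lstm_factor:.3f}, vanilla RNN: {rnn_factor:.1e}")
```

The highway is not unconditional: if the network learns forget-gate values well below 1, the same product shrinks quickly, which is exactly the behavior you want when old information should be discarded.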

LSTMs can often maintain information across hundreds of steps, and in favorable cases around 1,000, a dramatic improvement over vanilla RNNs. They became the dominant architecture for machine translation, speech recognition, text generation, and time series forecasting from roughly 2014 to 2018.

GRU: Gated Recurrent Unit

The GRU, introduced by Kyunghyun Cho and colleagues in 2014, simplifies the LSTM by combining the forget and input gates into a single update gate and merging the cell state and hidden state. The result is a model with two gates instead of three and fewer parameters, which trains faster and often performs comparably to LSTMs.

The GRU's update gate (z) controls how much of the previous hidden state to retain versus how much to replace with new information. The reset gate (r) controls how much of the previous hidden state to use when computing the candidate new state. Under the convention used here, when z is close to 1, the GRU copies the previous state forward nearly unchanged (preserving long-term memory). When z is close to 0, the GRU replaces the state with the candidate, which is computed from the current input and the reset-gated previous state. Some papers and libraries swap the roles of z and 1 - z, so it is worth checking which convention a given implementation uses.
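A single GRU step can be sketched as follows, using the convention described above (h_t = z * h_{t-1} + (1 - z) * candidate). The parameter names, sizes, and random weights are illustrative assumptions. To make the role of the update gate visible, the example forces z toward 1 with a large bias and checks that the state is carried forward.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step; z near 1 keeps h_prev, z near 0 overwrites it."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    return z * h_prev + (1.0 - z) * h_cand

# Illustrative sizes and random weights (assumptions for this sketch).
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
p = {}
for name in ("z", "r", "h"):
    p[f"W_{name}"] = rng.normal(scale=0.5, size=(hidden_size, input_size))
    p[f"U_{name}"] = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
    p[f"b_{name}"] = np.zeros(hidden_size)

h_prev = rng.uniform(-1, 1, size=hidden_size)
x_t = rng.normal(size=input_size)

# Push the update gate toward 1 with a large bias: the old state is copied forward.
p_keep = dict(p, b_z=np.full(hidden_size, 50.0))
h_keep = gru_step(x_t, h_prev, p_keep)
print(np.allclose(h_keep, h_prev))  # True
```

With a large negative bias instead, z approaches 0 and the output becomes the candidate state alone.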

In practice, LSTMs and GRUs perform similarly on most benchmarks. GRUs are slightly faster to train and have fewer parameters, making them a reasonable default when compute is constrained. LSTMs have a slight edge on tasks requiring very long-range dependencies, likely because the separate cell state provides a cleaner gradient highway.
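The parameter difference is easy to estimate. The sketch below counts one input-weight matrix, one recurrent-weight matrix, and one bias vector per weight block, with four blocks for an LSTM (three gates plus the candidate) and three for a GRU (two gates plus the candidate). The layer sizes are assumed for illustration, and real libraries may differ slightly; PyTorch's LSTM and GRU layers, for example, keep two bias vectors per block.

```python
def recurrent_layer_params(input_size, hidden_size, num_blocks):
    """Parameter count for `num_blocks` weight blocks, each with input weights,
    recurrent weights, and a single bias vector."""
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 128, 256   # illustrative sizes (assumed)
lstm_params = recurrent_layer_params(input_size, hidden_size, num_blocks=4)
gru_params = recurrent_layer_params(input_size, hidden_size, num_blocks=3)

print(lstm_params, gru_params, gru_params / lstm_params)  # 394240 295680 0.75
```

Under this counting, a GRU layer has three quarters of the parameters of an LSTM layer of the same width, regardless of the sizes chosen.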

Bidirectional RNNs

A standard RNN processes the sequence left to right, which means each hidden state only contains information from previous elements. For many tasks, context from the future is equally important. In "The bank by the river was steep," you need the word "river" (which comes after "bank") to disambiguate the meaning of "bank."

Bidirectional RNNs run two separate RNNs: one processing left-to-right and one processing right-to-left. At each position, the outputs of both RNNs are concatenated, giving the network access to context from both directions. This is only possible when the entire sequence is available upfront (not for real-time generation, where future words do not exist yet). Bidirectional LSTMs were the state of the art for many NLP tasks before transformers.
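The two-pass idea can be sketched with a pair of vanilla tanh RNNs. The helper names run_rnn and init_params, the sizes, and the random weights are assumptions for this illustration; a practical model would more likely use LSTM or GRU cells in each direction.

```python
import numpy as np

def run_rnn(xs, W_x, W_h, b):
    """Return the hidden state at every position of xs (shape [T, input_size])."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

def init_params(rng, input_size, hidden_size):
    return (rng.normal(scale=0.5, size=(hidden_size, input_size)),
            rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
            np.zeros(hidden_size))

T, input_size, hidden_size = 7, 4, 3
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, input_size))
fwd_params = init_params(rng, input_size, hidden_size)   # left-to-right RNN
bwd_params = init_params(rng, input_size, hidden_size)   # separate right-to-left RNN

fwd = run_rnn(xs, *fwd_params)
bwd = run_rnn(xs[::-1], *bwd_params)[::-1]   # flip back so position t lines up
outputs = np.concatenate([fwd, bwd], axis=1)
print(outputs.shape)  # (7, 6)
```

Each row of outputs combines a forward state that has seen positions up to t with a backward state that has seen positions from t to the end, which is why the whole sequence must be available before the backward pass can start.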

Where RNNs Stand Today

Transformers have replaced RNNs for most language tasks and many sequence tasks. The self-attention mechanism processes all positions simultaneously rather than sequentially, providing better parallelization (faster training on GPUs) and better long-range dependency modeling (direct attention between any two positions, no matter how far apart).

However, RNNs retain advantages in specific scenarios. Their computation grows as O(n) in sequence length, and at inference time they only need to carry a fixed-size hidden state from one step to the next. Standard self-attention requires O(n^2) time and memory because every position attends to every other position. For very long sequences or real-time streaming applications where latency matters, RNNs can be more practical. State-space models (like Mamba), which share properties with RNNs, have recently shown competitive performance with transformers at lower computational cost for long sequences.
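The scaling difference can be made concrete with rough operation counts. The sketch below counts only the dominant terms: n recurrent matrix-vector products of size d x d for the RNN, and the n x n table of d-dimensional dot products for attention. The width d = 512 is an assumed value, and the counts ignore constants, projections, and the fact that attention parallelizes across positions while RNN steps must run in order.

```python
def recurrent_cost(n, d):
    """Rough multiply-add count: n sequential steps, each a d x d matrix-vector product."""
    return n * d * d

def attention_cost(n, d):
    """Rough multiply-add count: n x n attention scores, each a d-dimensional dot product."""
    return n * n * d

d = 512   # assumed model width
for n in (128, 512, 4096, 32768):
    ratio = attention_cost(n, d) / recurrent_cost(n, d)   # simplifies to n / d
    print(f"n={n:>6}: attention / recurrent = {ratio:g}")
```

Under these assumptions the two costs cross at n = d. Attention is cheaper for sequences shorter than the model width, and the gap in the RNN's favor grows linearly as sequences get longer.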

RNNs also remain common in time series forecasting, robotics control, and audio processing, where the data is naturally sequential and the sequences may not justify the overhead of a transformer's attention mechanism. Understanding RNNs is important both for historical context (much of the progress in NLP from 2013 to 2018 was built on LSTM foundations) and for the specific applications where they remain the best tool.

Key Takeaway

Recurrent neural networks process sequences by maintaining a hidden state that accumulates information across time steps. Vanilla RNNs suffer from vanishing gradients that limit their effective memory, which LSTMs and GRUs mitigate with gating mechanisms that control information flow. While transformers have replaced RNNs for most language tasks, RNNs remain relevant for streaming applications, long sequences where attention is too expensive, and time series tasks where their sequential nature is a natural fit.