What Is Reinforcement Learning?
The Core Framework
Every reinforcement learning system has four components: an agent, an environment, actions, and rewards.
The agent is the decision-maker. It observes the current state of the environment, chooses an action, and receives feedback. The agent is the system being trained.
The environment is everything the agent interacts with. In a chess game, the environment is the board and the rules. In robotics, the environment is the physical world. In a recommendation system, the environment is the user and their behavior.
Actions are the choices available to the agent at each step. A chess agent can move any legal piece. A robot arm can rotate each joint by some angle. A recommendation agent can show any item from its catalog. The set of possible actions can be discrete (a finite list of options) or continuous (any value within a range).
Rewards are numerical signals that tell the agent how good its action was. Positive rewards reinforce the action. Negative rewards (penalties) discourage it. In chess, winning gives a positive reward, losing gives a negative one, and every move in between might give zero reward. The agent's goal is to maximize the total reward it accumulates over time, not just the immediate reward from the next action.
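The loop of observing a state, choosing an action, and receiving a reward can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CoinFlipEnv` environment, its `reset` and `step` methods, and the always-guess-heads `agent_policy` are hypothetical names invented for this example, not part of any particular library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.7, episode_length=10, seed=0):
        self.p_heads = p_heads
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the step index

    def step(self, action):
        # action: 0 = guess tails, 1 = guess heads
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else -1.0
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


def agent_policy(state):
    """Hypothetical fixed policy: always guess heads."""
    return 1


def run_episode(env, policy):
    """Run one episode and return the total reward accumulated over it."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # rewards add up over time
    return total_reward


env = CoinFlipEnv()
total = run_episode(env, agent_policy)
print(total)
```

The quantity the agent cares about is `total_reward` for the whole episode, not the reward from any single step.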
The Exploration vs Exploitation Problem
The central challenge of reinforcement learning is the tension between exploration and exploitation.
Exploitation means using the best strategy the agent has found so far. If the agent knows that action A gives a reward of 10, it can keep choosing action A. This is safe and predictable but may miss better options.
Exploration means trying new actions to discover whether something better exists. The agent might try action B, which could give a reward of 15 or a penalty of 5. The agent does not know until it tries. Exploration involves accepting short-term risk for the possibility of long-term gain.
A simple strategy is epsilon-greedy: with probability epsilon (say, 10%), the agent chooses a random action (exploration), and with probability 1 minus epsilon (90%), it chooses the best-known action (exploitation). More sophisticated strategies adjust epsilon over time, exploring heavily at the start of training when the agent knows little and exploiting more as it gains experience.
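A minimal sketch of epsilon-greedy with a decaying epsilon might look like the following. The function names, the linear decay schedule, and the default values (starting at 1.0 and ending at 0.1 over 1000 steps) are assumptions chosen for illustration; other schedules, such as exponential decay, are equally common.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def decayed_epsilon(step, start=1.0, end=0.1, decay_steps=1000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


# Early in training the agent mostly explores; later it mostly exploits.
estimates = [1.0, 5.0, 2.0]  # hypothetical value estimates for three actions
early_action = epsilon_greedy(estimates, decayed_epsilon(step=0))
late_action = epsilon_greedy(estimates, decayed_epsilon(step=5000))
```

With epsilon set to 0 the function always returns the index of the highest estimate, and with epsilon set to 1 it always picks at random.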
This tradeoff appears everywhere in real life. A diner who always orders the same dish (exploitation) might miss a better option on the menu. A diner who orders randomly every time (exploration) will have many bad meals. The optimal strategy balances both, which is exactly what reinforcement learning algorithms try to achieve.
Key Algorithms
Q-Learning
Q-learning is one of the foundational reinforcement learning algorithms. It maintains a table (the Q-table) that estimates the expected total future reward for each combination of state and action. If the agent is in state S and takes action A, Q(S, A) estimates how much total reward it will accumulate from this point forward.
After each action, the agent updates the Q-value for the state-action pair it just tried, based on the reward it received and the best Q-value available in the next state. Provided every state-action pair keeps being tried and the step size shrinks appropriately, these estimates converge to the true values, and the agent learns the optimal strategy: in each state, choose the action with the highest Q-value.
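The update can be written as a short function. This sketch uses the standard Q-learning rule, in which the old estimate moves a fraction `alpha` of the way toward the target `reward + gamma * max Q(next_state, a)`. The learning rate `alpha`, the discount factor `gamma`, and the state and action names are assumptions introduced here; the text above does not fix them.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one Q-learning update to q_table in place and return the new value."""
    # Value of the best action in the next state (0 if the episode has ended).
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


q = defaultdict(float)            # unseen (state, action) pairs start at 0
actions = ["left", "right"]       # hypothetical action names
q[("s1", "right")] = 2.0          # pretend we already learned this value

# The target is 1.0 + 0.9 * 2.0 = 2.8, and alpha = 0.5 moves the old
# estimate (0.0) halfway toward it.
new_value = q_learning_update(q, "s0", "right", reward=1.0, next_state="s1",
                              actions=actions, alpha=0.5, gamma=0.9)
```

A dictionary keyed by (state, action) pairs plays the role of the Q-table here; a two-dimensional array works equally well when states and actions are numbered.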
Q-learning works well for small problems but breaks down when the number of states is large. Chess has on the order of 10 to the power of 44 legal positions, and a Q-table that large is impossible to store. This limitation led to the development of deep reinforcement learning.
Deep Q-Networks (DQN)
Deep Q-Networks, introduced by DeepMind in 2013, replace the Q-table with a neural network. Instead of looking up Q-values in a table, the network takes the state as input and outputs Q-values for all possible actions. The network is trained toward the same target as Q-learning (the reward plus the best predicted Q-value in the next state), but the update is a gradient descent step on the network's weights rather than an edit to a table entry.
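To show the idea of replacing a table lookup with a gradient step, the sketch below swaps in the simplest possible function approximator: a linear model with one weight vector per action, standing in for the neural network. It is not DQN itself, which uses a deep network and adds techniques such as experience replay and a separate target network. The function names, learning rate, and discount factor are assumptions for illustration.

```python
def q_values(weights, state):
    """Q-values for every action: one dot product per action (a linear 'network')."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def td_gradient_step(weights, state, action, reward, next_state,
                     lr=0.1, gamma=0.99, done=False):
    """One gradient step on 0.5 * (target - Q(state, action))**2, target held fixed."""
    best_next = 0.0 if done else max(q_values(weights, next_state))
    target = reward + gamma * best_next
    td_error = target - q_values(weights, state)[action]
    # For a linear model, the gradient of Q(state, action) with respect to
    # that action's weights is just the state vector.
    weights[action] = [w + lr * td_error * x
                       for w, x in zip(weights[action], state)]
    return td_error


weights = [[0.0, 0.0], [0.0, 0.0]]   # 2 actions x 2 state features
state, next_state = [1.0, 0.0], [0.0, 1.0]

# Repeating the same transition shrinks the error between prediction and target.
first_error = td_gradient_step(weights, state, 1, 1.0, next_state, done=True)
second_error = td_gradient_step(weights, state, 1, 1.0, next_state, done=True)
```

Because the weights are shared across states, updating one state's estimate also changes the estimates for similar states, which is what lets the approach scale beyond a table.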
DQN's landmark achievement was learning to play Atari games directly from pixel inputs. The system received raw screen images (the state), could press joystick buttons (the actions), and received the game score (the reward). With no prior knowledge of what the games were or how they worked, DQN learned to play dozens of Atari games, matching or exceeding human-level scores on many of them. This demonstrated that deep reinforcement learning could handle high-dimensional sensory inputs.
Policy Gradient Methods
Instead of learning Q-values (which estimate the value of actions), policy gradient methods directly learn a policy, a function that maps states to action probabilities. The policy is a neural network whose output is a probability distribution over actions. Training adjusts the network's parameters to increase the probability of actions that led to high rewards and decrease the probability of actions that led to low rewards.
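As a rough illustration of this idea, the sketch below trains a policy with a REINFORCE-style update on the smallest possible problem: a single state with two actions, where action 0 always pays 1.0 and action 1 always pays 0.0. The policy here is a softmax over one preference value per action rather than a neural network, and the problem, function names, learning rate, and step count are all assumptions made for the example.

```python
import math
import random


def softmax(prefs):
    """Turn a list of preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=500, lr=0.5, seed=0):
    """Train a softmax policy over len(rewards) actions with REINFORCE."""
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)  # policy parameters, one per action
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. prefs[i] is 1[i == action] - probs[i].
        for i in range(len(prefs)):
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            prefs[i] += lr * reward * grad_log
    return softmax(prefs)


final_probs = reinforce_bandit(rewards=[1.0, 0.0])
print(final_probs)
```

Each time the rewarded action is sampled, its probability is pushed up and the other action's probability is pushed down, so the policy drifts toward the better action without ever estimating a Q-value.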
The REINFORCE algorithm is the simplest policy gradient method. More advanced algorithms like Proximal Policy Optimization (PPO) add constraints that prevent the policy from changing too much in a single update, which stabilizes training. PPO is the algorithm behind the reinforcement learning phase of ChatGPT's training (RLHF), where human preferences serve as the reward signal.
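In the widely used clipped variant of PPO, the constraint takes the form of a clipped objective. The sketch below computes that objective for a single sample, assuming `ratio` is the new policy's probability of the action divided by the old policy's probability and `advantage` measures how much better the action was than expected. The function name and the default clip range of 0.2 are choices made for this example.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio
    # outside the range [1 - clip_eps, 1 + clip_eps].
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage) whose probability already rose by 50%:
# the objective is capped as if the ratio were only 1.2.
capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)

# A bad action (negative advantage) whose probability already fell by 50%:
# the objective is evaluated as if the ratio were 0.8.
floored = ppo_clipped_objective(ratio=0.5, advantage=-1.0)
```

Once the ratio has moved past the clip range in the direction the advantage favors, the objective stops improving, so a single update has no reason to push the policy further.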
Real-World Applications
Game playing produced the most famous RL successes. AlphaGo (2016) defeated Lee Sedol, one of the world's strongest Go players, at a game with more possible positions than there are atoms in the observable universe. AlphaZero (2017) learned chess, Go, and shogi from scratch with no human games to study, reaching superhuman performance in all three within 24 hours of training. AlphaStar (2019) reached grandmaster level in StarCraft II, a game requiring real-time strategy with incomplete information.
Robotics uses RL to teach physical systems to walk, grasp objects, and navigate environments. Training in simulation is standard because physical robots cannot crash thousands of times to learn, but simulated robots can. The policies learned in simulation are then transferred to physical robots, with additional fine-tuning to bridge the sim-to-real gap.
Recommendation systems treat user interaction as a sequential decision problem. Each recommendation is an action, user engagement (clicks, purchases, time spent) is the reward, and the goal is to maximize long-term engagement rather than just immediate clicks. This framing helps avoid the trap of recommending clickbait that gets short-term engagement but drives users away over time.
RLHF for language models uses human preferences as rewards to align language model outputs with human values. Annotators compare pairs of model outputs and indicate which is better. A reward model is trained on these preferences, and then PPO is used to adjust the language model to produce outputs that the reward model scores highly. This process is a major reason modern chatbots are more helpful and less harmful than raw language models.
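One common way to train a reward model on pairwise comparisons is a logistic (Bradley-Terry) loss, which is small when the model scores the preferred output higher than the rejected one. The text above does not specify a loss, so treat the following as an assumed, typical choice. The function name is hypothetical, and the scores are plain numbers standing in for a reward model's outputs.

```python
import math


def preference_loss(score_preferred, score_rejected):
    """Negative log-probability that the preferred output wins (Bradley-Terry)."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# The loss shrinks as the reward model ranks the pair more confidently
# in the same order as the annotator.
agrees = preference_loss(score_preferred=2.0, score_rejected=0.0)
undecided = preference_loss(score_preferred=0.0, score_rejected=0.0)
disagrees = preference_loss(score_preferred=0.0, score_rejected=2.0)
```

Minimizing this loss over many annotated pairs pushes the reward model to assign higher scores to the outputs people preferred, and those scores then serve as the reward for the PPO stage.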
Challenges and Limitations
Sample inefficiency is the biggest practical limitation. RL agents often need millions or billions of interactions to learn a good policy. AlphaGo played millions of games against itself. This is fine in simulation where games are free, but prohibitive in the physical world where each interaction takes time, costs money, or involves risk.
Reward specification is surprisingly difficult. The agent optimizes exactly the reward you give it, which may not be what you actually want. A cleaning robot rewarded for reducing visible dirt might learn to push dirt under furniture rather than actually cleaning. This problem, called reward hacking, is a central concern in AI safety research.
Unstable training plagues deep RL. The combination of function approximation (neural networks), bootstrapping (using estimates to update estimates), and off-policy data (experience collected by an exploring policy rather than the one being learned) makes for a notoriously unstable optimization problem. Training runs frequently diverge, and small changes to hyperparameters can cause large changes in performance.
Reinforcement learning trains agents through trial and error, using rewards to reinforce good actions and penalties to discourage bad ones. The exploration vs exploitation tradeoff is the central challenge. RL has achieved superhuman game playing and powers the RLHF alignment process for modern language models, but it remains sample-inefficient and difficult to apply in environments where mistakes are costly.