What Are Parameters in AI?

Updated May 2026
Parameters are the numerical values inside an AI model that are learned during training and collectively define everything the model knows. In a neural network, parameters consist of weights (which control how strongly one neuron influences another) and biases (which shift neuron activations). A model with 70 billion parameters has 70 billion individual numbers that were adjusted through training, and it is the specific combination of these numbers that gives the model its capabilities.

Parameters Are What the Model Learns

When you train an AI model, you start with randomly initialized parameters and gradually adjust them until the model makes good predictions. Before training, the model's millions or billions of parameters are essentially random noise. After training, those same parameters encode the patterns, relationships, and knowledge the model learned from its training data.

Consider a simple example. A linear model predicting house prices from square footage has two parameters: a weight (how much each additional square foot adds to the price) and a bias (the base price independent of size). If training produces a weight of 200 and a bias of 50,000, the model predicts that a house costs $50,000 plus $200 per square foot. Those two numbers are the model's entire knowledge. Everything the model learned about the relationship between size and price is encoded in those two parameters.
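The house-price example can be written out in a few lines. This is a minimal sketch: the name `predict_price` is illustrative, and the weight and bias are simply the values quoted above, not the result of real training.

```python
# The two learned parameters from the example above.
WEIGHT = 200.0     # dollars added per square foot
BIAS = 50_000.0    # base price in dollars, independent of size


def predict_price(square_feet: float) -> float:
    """Linear model: price = weight * size + bias."""
    return WEIGHT * square_feet + BIAS


if __name__ == "__main__":
    for size in (1_000, 1_500, 2_000):
        print(f"{size} sq ft -> ${predict_price(size):,.0f}")
```

Under these two values, a 1,500-square-foot house comes out at $350,000. Changing either number changes every prediction the model makes.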

Neural networks work on the same principle, just at enormous scale. Instead of two parameters, a modern large language model has billions to hundreds of billions. Each parameter is a single floating-point number, typically stored in 16 or 32 bits. Together, these numbers encode grammar rules, factual knowledge, reasoning patterns, and every other capability the model exhibits.

Weights and Biases

In a neural network, parameters come in two forms: weights and biases.

Weights are the connections between neurons. If neuron A connects to neuron B with a weight of 0.7, neuron A's output is multiplied by 0.7 before being sent to neuron B. A large positive weight means "neuron A strongly activates neuron B." A large negative weight means "neuron A strongly suppresses neuron B." A weight near zero means "neuron A has little influence on neuron B." The weights collectively define the network's wiring: which pathways are strong, which are weak, and which are inhibitory.

Biases are additional parameters added to each neuron. A bias shifts the neuron's activation threshold. A positive bias makes the neuron more likely to activate (fire) even with weak input. A negative bias makes it harder to activate, requiring stronger input signals. Without biases, a neuron's weighted sum would always be zero when all its inputs are zero, so its output would be stuck at one fixed value, which limits what functions the network can learn.
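A single neuron makes the roles of weights and biases concrete. The sketch below assumes a sigmoid activation and made-up input and weight values; the function names are illustrative only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)


inputs = [1.0, 0.5]
weights = [0.7, -1.2]  # first connection excites, second one suppresses

if __name__ == "__main__":
    print(neuron(inputs, weights, bias=0.0))
    print(neuron(inputs, weights, bias=2.0))    # positive bias: activates more easily
    print(neuron(inputs, weights, bias=-2.0))   # negative bias: harder to activate
    print(neuron([0.0, 0.0], weights, bias=0.0))  # zero inputs, no bias
```

With zero inputs and no bias, the output is pinned to a single value (0.5 for a sigmoid) no matter what the weights are. The bias is the only parameter that can move it.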

In a single layer of a neural network connecting 1,000 input neurons to 1,000 output neurons, there are 1,000,000 weights (every input connected to every output) plus 1,000 biases (one per output neuron), giving 1,001,000 parameters for just that one layer. A model with dozens of layers, each with thousands of neurons, accumulates billions of parameters quickly.
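The arithmetic for a fully connected layer is easy to check in code. This is a small sketch with hypothetical helper names; it counts only dense layers, not the attention or embedding parameters a real language model also has.

```python
def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """Weights (one per input-output pair) plus biases (one per output)."""
    weights = n_inputs * n_outputs
    biases = n_outputs
    return weights + biases


def mlp_params(layer_sizes):
    """Total parameters for a stack of fully connected layers."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


if __name__ == "__main__":
    print(dense_layer_params(1_000, 1_000))  # the single layer described above
    print(mlp_params([1_000] * 5))           # four such layers stacked
```

Because the weight count is the product of the two layer widths, doubling both widths roughly quadruples the parameter count.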

Why Parameter Count Matters

The number of parameters in a model roughly determines its capacity: the complexity of the function it can represent. A model with more parameters can encode more nuanced patterns and store more knowledge. This is why the AI industry tracks parameter counts as a proxy for model capability.

The progression has been dramatic. GPT-2 (2019) had 1.5 billion parameters. GPT-3 (2020) had 175 billion. GPT-4 is rumored to have over a trillion parameters across its mixture-of-experts architecture. With each jump in parameter count, researchers observed qualitative improvements: larger models could write better, reason more accurately, and handle more complex tasks.

This relationship has been formalized as scaling laws. Research by Kaplan et al. at OpenAI (2020) showed that model performance improves as a predictable power law function of three variables: parameter count, dataset size, and compute budget. Doubling the parameter count while keeping data constant gives diminishing but consistent improvements. The optimal strategy is to scale parameters, data, and compute together.
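A power law in parameter count can be explored numerically. The sketch below looks only at parameter count and ignores data and compute limits. The exponent 0.076 is close to the value reported for parameter scaling in that line of work, but treat it here as an illustrative assumption; `relative_loss` and the 1-billion-parameter reference point are hypothetical.

```python
ALPHA_N = 0.076  # assumed power-law exponent for parameter count


def relative_loss(n_params: float, n_ref: float = 1e9) -> float:
    """Loss relative to a reference model, assuming loss scales as N ** -ALPHA_N."""
    return (n_params / n_ref) ** -ALPHA_N


if __name__ == "__main__":
    n = 1e9
    for _ in range(5):
        print(f"{n:.0e} params -> relative loss {relative_loss(n):.3f}")
        n *= 2
```

Each doubling multiplies the loss by the same factor (2 ** -0.076, about 0.95), so the improvement is consistent in relative terms while the absolute gain shrinks as the model grows.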

However, parameter count alone does not determine quality. A 7-billion-parameter model trained on high-quality data with good training techniques can outperform a 70-billion-parameter model trained poorly. Architecture matters too: transformer models use their parameters more efficiently than older architectures like RNNs, getting better performance from the same number of parameters. Mixture-of-experts architectures go further, using only a fraction of their total parameters for any given input, which makes them computationally cheaper to run despite having very large total parameter counts.

Parameters vs. Hyperparameters

The terminology is confusing because "hyperparameter" sounds like a fancier version of "parameter," but they are fundamentally different things.

Parameters are learned during training. The training algorithm adjusts them automatically through gradient descent. Apart from their random starting values, you never set them by hand. The learning rate, loss function, and optimizer work together to find good parameter values. After training, the parameter values are fixed and define the model's behavior.

Hyperparameters are set by the engineer before training begins. They control the training process and the shape of the model. Examples include the learning rate (how fast parameters change), the number of layers (which helps determine how many parameters exist), the batch size (how many examples are processed together), and the dropout rate (what fraction of neurons are randomly disabled during training). Hyperparameters are not learned; they are chosen through experimentation, intuition, or automated search.

The distinction is straightforward: parameters are inside the model and are adjusted by the training algorithm. Hyperparameters are outside the model and are adjusted by the engineer. Choosing good hyperparameters is essential because they determine whether the training algorithm can find good parameters. A poor learning rate, for instance, can prevent the model from converging to a good solution regardless of how much data or compute you provide.
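The split between parameters and hyperparameters shows up clearly in a small training loop. The sketch below fits the earlier house-price model with plain gradient descent on synthetic data. The data, the `train` function, and the specific learning rates are assumptions chosen for illustration. Sizes are in thousands of square feet and prices in thousands of dollars, so the target weight is 200 and the target bias is 50.

```python
import random

# Synthetic data that follows price = 200 * size + 50 exactly.
SIZES = [1.0, 1.5, 2.0, 2.5, 3.0]
PRICES = [200.0 * s + 50.0 for s in SIZES]


def train(learning_rate: float, steps: int, seed: int = 0):
    """Fit price = w * size + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    # Parameters: start as random noise, then get adjusted by the loop.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    n = len(SIZES)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(SIZES, PRICES)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, SIZES))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    # Hyperparameters: chosen by the engineer, never updated by the loop.
    print(train(learning_rate=0.1, steps=5_000))  # settles near (200, 50)
    print(train(learning_rate=0.25, steps=50))    # step too large: values blow up
```

The loop only ever changes `w` and `b`. The learning rate and step count stay exactly as given, and the second call shows how a badly chosen learning rate keeps the same algorithm, on the same data, from converging at all.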

How Parameters Are Stored

Each parameter is a floating-point number. The precision of that number affects both the model's accuracy and its memory requirements. Full precision (32-bit floating point, or FP32) uses 4 bytes per parameter. A 70-billion-parameter model in FP32 requires 280 gigabytes of memory just for the parameters, not counting the additional memory needed for gradients and optimizer states during training.

Quantization reduces this by using fewer bits per parameter. Half precision (FP16) uses 2 bytes, cutting memory in half with minimal accuracy loss. More aggressive quantization to 8-bit or even 4-bit integers reduces memory further. A 70-billion-parameter model quantized to 4-bit precision needs only about 35 gigabytes for its weights, which fits on a single 48 GB workstation GPU or can be split across two 24 GB consumer GPUs.
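The memory figures follow directly from bytes per parameter. A minimal sketch, counting only the weights themselves and taking 1 GB as 10**9 bytes; the table and function names are illustrative.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weights_memory_gb(n_params: float, precision: str) -> float:
    """Memory for the parameters alone, in gigabytes (1 GB = 10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"70B @ {precision}: {weights_memory_gb(70e9, precision):.0f} GB")
```

Real deployments need extra memory on top of this for activations and caches, and practical quantization formats store a small amount of scaling metadata, so these numbers are lower bounds.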

The tradeoff is precision. Lower-bit representations cannot distinguish as many values, which introduces rounding errors. For most tasks, FP16 is indistinguishable from FP32. 8-bit quantization shows slight degradation on some benchmarks. 4-bit quantization causes noticeable degradation, but the result is often acceptable for practical applications. Research into quantization techniques continues to push the boundary, making it possible to run larger models on cheaper hardware.
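The rounding error is easy to see on a toy example. The sketch below uses the simplest possible scheme, one shared scale for all values with symmetric rounding, applied to random numbers standing in for real weights. Production quantizers are more elaborate (per-group scales, calibration), so this shows only the basic effect.

```python
import random


def quantize_dequantize(values, bits: int):
    """Round each value to a signed integer grid with `bits` bits, then map back."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]  # stand-in for real weights

if __name__ == "__main__":
    for bits in (8, 4):
        restored = quantize_dequantize(weights, bits)
        error = mean_abs_error(weights, restored)
        print(f"{bits}-bit mean absolute rounding error: {error:.6f}")
```

At 4 bits, every weight has to land on one of at most 15 values, which is why the error is much larger than at 8 bits.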

What Parameters Actually Encode

It is tempting to think of individual parameters as storing individual facts, like one parameter knowing the capital of France and another knowing that water boils at 100 degrees Celsius. But parameters do not work this way. Knowledge is distributed across millions of parameters, and any single parameter participates in encoding many different pieces of knowledge.

Researchers have studied this by selectively modifying parameters and observing the effects. Changing a single parameter typically has no noticeable effect on model outputs because the model is highly redundant: many parameters contribute to the same functionality. But changing a group of related parameters in a specific layer can alter the model's factual knowledge (making it believe Paris is in Germany, for instance) or change its style (making it more formal or more casual).

This distributed representation is both a strength and a weakness. It makes models robust to small perturbations and enables them to generalize beyond their training data. But it also makes models difficult to interpret, fix, or edit. You cannot simply look up which parameters encode a specific fact and correct it. The knowledge is too entangled across the network's structure.

Key Takeaway

Parameters are the learned numerical values (weights and biases) that define everything an AI model knows. Parameter count roughly determines model capacity, with modern large language models containing tens to hundreds of billions of parameters. Unlike hyperparameters, which are set by engineers before training, parameters are learned automatically through the training process. Knowledge is distributed across many parameters rather than stored in individual values, making models robust but difficult to interpret.