What Is an AI Model?
The Simplest Model: Linear Regression
The easiest way to understand an AI model is to start with the simplest one. Linear regression predicts a number using a straight-line equation:
y = wx + b
In this equation, x is the input (say, the square footage of a house), y is the output (the predicted price), w is the weight (how much each additional square foot adds to the price), and b is the bias (the base price when square footage is zero). The weight and bias are the model's parameters, the numbers it needs to learn.
Before training, w and b start as random values. The model's predictions are terrible. Training means showing the model actual house sales (input-output pairs), measuring the error, and adjusting w and b to reduce that error. After training on enough examples, w and b settle on values that produce reasonable predictions for houses the model has never seen.
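The training loop described above can be sketched in a few lines. This is a minimal illustration, not a production recipe: the house data is made up (generated from price = 150 * size + 50, in thousands), and the learning rate, epoch count, and the function name `train` are assumptions chosen so that plain gradient descent on mean squared error converges.

```python
import random

# Hypothetical training data: (size in thousands of sq ft, price in $1000s).
# Generated from price = 150 * size + 50, so we know what should be learned.
sizes = [0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.4, 3.0]
data = [(x, 150.0 * x + 50.0) for x in sizes]

def train(data, lr=0.05, epochs=5000, seed=0):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random starting parameters
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Move each parameter a small step in the direction that reduces error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train(data)
predicted = w * 2.2 + b  # a house size that was not in the training data
print(f"w={w:.2f}, b={b:.2f}, predicted price={predicted:.1f}")
```

Because the toy data lies exactly on a line, w and b settle very close to the values that generated it. Real sales data is noisy, so the fit would only be approximate.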
This two-parameter model is trivial, but it contains the core concepts present in models with billions of parameters. The architecture defines the mathematical operation (multiply and add). The parameters (w and b) hold the learned knowledge. Training adjusts the parameters to minimize prediction error. The main difference between this and a model like GPT-4 is scale and architectural complexity.
What Parameters Actually Are
Parameters are numbers stored in the model that encode what it has learned. In a neural network, the parameters are the weights that connect neurons between layers and the biases that shift each neuron's activation. During a forward pass, input data flows through the network, getting multiplied by weights and shifted by biases at each layer, until it reaches the output.
A small neural network for classifying handwritten digits might have 100,000 parameters. A medium-sized image classifier like ResNet-50 has about 25 million. GPT-3 has 175 billion. GPT-4 is estimated at over 1 trillion (across its mixture-of-experts architecture), though the figure has not been officially confirmed. Each parameter is typically stored as a 16-bit or 32-bit floating point number, so a model with 175 billion parameters takes roughly 350 gigabytes to store at 16-bit precision (2 bytes per parameter) and roughly 700 gigabytes at 32-bit precision.
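The arithmetic behind these figures is easy to check. The sketch below counts the parameters of a fully connected network and converts a parameter count into storage size. The layer sizes 784, 128, and 10 are an assumed example of a small digit classifier (28 x 28 pixel inputs, ten output classes), and both function names are illustrative.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def storage_gigabytes(num_params, bits_per_param):
    """Storage for the raw parameters only, using 1 GB = 10**9 bytes."""
    bytes_per_param = bits_per_param // 8
    return num_params * bytes_per_param / 1e9

# Assumed small digit classifier: 784 inputs, 128 hidden units, 10 outputs.
digit_net_params = dense_param_count([784, 128, 10])

gpt3_params = 175_000_000_000
gpt3_fp16_gb = storage_gigabytes(gpt3_params, 16)
gpt3_fp32_gb = storage_gigabytes(gpt3_params, 32)

print(digit_net_params)  # 101770
print(gpt3_fp16_gb)      # 350.0
print(gpt3_fp32_gb)      # 700.0
```

These numbers cover only the stored parameters. Running or training a model needs additional memory for activations and, during training, for gradients and optimizer state.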
Every parameter was learned during training. Before training, they were random. After training, they encode statistical patterns from the training data. In a language model, for example, the weights collectively capture the fact that the word "the" is very likely to follow certain words and very unlikely to follow others. No single parameter encodes a clean, interpretable fact on its own. The knowledge is distributed across billions of parameters in ways that are difficult for humans to decode.
Architecture: The Structure of the Function
The architecture is the blueprint of the model: what operations it performs and in what order. Different architectures are suited to different types of data.
Feedforward networks are the simplest neural architecture. Data flows in one direction: input to hidden layers to output. Each layer applies a linear transformation (multiply by weights, add biases) followed by a nonlinear activation function (like ReLU, which simply zeroes out negative values). The nonlinearity is critical because without it, stacking multiple layers would reduce to a single linear operation, no matter how many layers you used.
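The claim about nonlinearity can be verified on a tiny example. The sketch below uses two hand-picked weight matrices (assumed values, with biases omitted for brevity) and shows that two stacked linear layers give the same output as the single matrix formed by multiplying them, while inserting ReLU between the layers changes the result.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Zero out negative values, keep positive values unchanged."""
    return [max(0.0, a) for a in v]

# Hand-picked example weights (assumptions for illustration).
W1 = [[1.0, -2.0],
      [-1.0, 1.0]]   # first layer: 2 inputs -> 2 hidden units
W2 = [[1.0, 1.0]]    # second layer: 2 hidden units -> 1 output
x = [1.0, 1.0]

# Two linear layers with no activation in between...
stacked = matvec(W2, matvec(W1, x))
# ...equal one linear layer whose matrix is the product W2 * W1.
collapsed = matvec(matmul(W2, W1), x)

# With ReLU between the layers, the network is no longer a single matrix.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked, collapsed, with_relu)  # [-1.0] [-1.0] [0.0]
```

Adding biases does not change the conclusion: without an activation function, the stack still reduces to one multiply-and-add.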
Convolutional neural networks (CNNs) are designed for spatial data, primarily images. Instead of connecting every input to every neuron, CNNs use small filters that slide across the image, detecting local features like edges, corners, and textures. Early layers detect simple features. Deeper layers combine those into complex patterns like faces, objects, and scenes. This architecture exploits two facts about images: visual features are local (an edge can be detected from a small patch of neighboring pixels), and they look the same regardless of where they appear, so one filter can be reused at every position.
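A sliding filter can be sketched directly. The example below applies a 2 x 2 filter to a small made-up image that is dark on the left and bright on the right. The filter values are hand-picked here to respond to dark-to-bright transitions, whereas a real CNN learns its filter values during training. As in most deep learning libraries, the operation is technically cross-correlation (the filter is not flipped).

```python
def conv2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Hand-picked filter: negative on the left, positive on the right.
kernel = [[-1, 1],
          [-1, 1]]

# 4 x 6 image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
# Same edge, shifted one pixel to the right.
shifted = [[0, 0, 0, 0, 1, 1] for _ in range(4)]

feature_map = conv2d(image, kernel)
shifted_map = conv2d(shifted, kernel)

print(feature_map[0])  # [0, 0, 2, 0, 0]
print(shifted_map[0])  # [0, 0, 0, 2, 0]
```

The same four filter weights detect the edge wherever it sits. Only the location of the response in the feature map moves.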
Recurrent neural networks (RNNs) process sequential data by maintaining a hidden state that accumulates information from previous steps. Each element in the sequence updates the hidden state, so the model can consider context. RNNs were the standard for language processing before transformers, but they struggle with long sequences because the hidden state must compress the entire history into a fixed-size vector.
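The hidden-state update can be sketched as follows. This is a minimal version of a simple (Elman-style) recurrent step using tanh. The weights are arbitrary assumed values rather than trained ones, and the function names are illustrative.

```python
import math

def rnn_step(h, x, W_h, W_x, b):
    """One recurrent update: h_new = tanh(W_h @ h + W_x @ x + b)."""
    return [
        math.tanh(
            sum(W_h[i][j] * h[j] for j in range(len(h)))
            + sum(W_x[i][k] * x[k] for k in range(len(x)))
            + b[i]
        )
        for i in range(len(h))
    ]

def run_rnn(sequence, W_h, W_x, b):
    """Feed a sequence through the RNN and return the final hidden state."""
    h = [0.0] * len(b)  # the hidden state starts empty
    for x in sequence:
        h = rnn_step(h, x, W_h, W_x, b)
    return h

# Arbitrary example weights: hidden size 2, input size 1.
W_h = [[0.5, -0.3], [0.2, 0.4]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]

h_ab = run_rnn([[1.0], [0.0]], W_h, W_x, b)
h_ba = run_rnn([[0.0], [1.0]], W_h, W_x, b)   # same inputs, different order
h_long = run_rnn([[1.0], [0.0]] * 50, W_h, W_x, b)

print(len(h_ab), len(h_long))  # 2 2
print(h_ab != h_ba)            # True
```

The final state depends on the order of the inputs, which is how the model carries context. A 100-step sequence still ends up in the same two numbers as a 2-step one, which is the fixed-size bottleneck described above.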
Transformers are the architecture behind nearly every major language model since 2018. Instead of processing sequences step-by-step, transformers use a self-attention mechanism that lets every position in the sequence attend to every other position simultaneously. This largely addresses the long-range dependency problem because the model can directly connect a word at the beginning of a document to a word at the end. The cost is that computation grows quadratically with sequence length, since every position is compared with every other, but the parallelism makes training much faster than RNNs on modern hardware.
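A stripped-down version of self-attention shows where the quadratic cost comes from. The sketch below is a simplification: it uses a single attention head and uses the token vectors directly as queries, keys, and values, whereas a real transformer first applies learned projection matrices to obtain them. The token vectors are made up.

```python
import math

def softmax(row):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no projections)."""
    d = len(X[0])
    # One score for every (position, position) pair: an n x n matrix.
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in X]
              for qi in X]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all token vectors.
    output = [[sum(w * v[c] for w, v in zip(wrow, X)) for c in range(d)]
              for wrow in weights]
    return weights, output

# Four made-up token vectors of dimension 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
weights, output = self_attention(X)

print(len(weights), len(weights[0]))  # 4 4
print(len(output), len(output[0]))    # 4 2
```

The weights matrix has one entry per pair of positions, so doubling the sequence length quadruples its size. Each row can be computed independently of the others, which is what makes the operation easy to parallelize.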
Pre-trained Models and Fine-Tuning
Modern practice rarely trains a model from scratch. Instead, researchers start with a large pre-trained model and adapt it to their specific task through fine-tuning.
A pre-trained language model like GPT has already learned the statistical patterns of language from trillions of words of text. It knows grammar, facts, reasoning patterns, and coding syntax, all encoded in its parameters. To adapt it for a specific task, like answering medical questions, you fine-tune it on a smaller dataset of medical question-answer pairs. The model's existing knowledge transfers to the new task, and the fine-tuning adjusts the parameters slightly to specialize.
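The idea of starting from learned parameters and adjusting them slightly can be illustrated with a toy analogy using the two-parameter house price model. Everything here is assumed for illustration: the "pre-trained" values of w and b, the three-example dataset for a pricier city, and the function names. Real fine-tuning applies the same principle to networks with millions or billions of parameters.

```python
def mse(w, b, data):
    """Mean squared error of y = w*x + b on the given examples."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def fine_tune(w, b, data, lr=0.01, steps=200):
    """Continue gradient descent from existing parameters, using small steps."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical parameters already learned from a large, general dataset.
pretrained_w, pretrained_b = 150.0, 50.0

# Small hypothetical dataset from one city where prices run higher.
city_data = [(1.0, 230.0), (1.5, 320.0), (2.0, 410.0)]

error_before = mse(pretrained_w, pretrained_b, city_data)
tuned_w, tuned_b = fine_tune(pretrained_w, pretrained_b, city_data)
error_after = mse(tuned_w, tuned_b, city_data)

print(f"error before: {error_before:.1f}, after: {error_after:.1f}")
```

Only a few hundred small updates on three examples are needed because the starting parameters are already close to useful values, which is the point of beginning from a pre-trained model rather than from random numbers.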
This approach works because the representations learned during pre-training are broadly useful. The features a language model learns (word meanings, sentence structure, reasoning patterns) are relevant to almost any language task. Fine-tuning leverages this shared foundation rather than rebuilding it from scratch for every application.
Model Size and the Scaling Debate
The AI industry has been on a trajectory of increasing model size since 2018. BERT-Large (2018) had 340 million parameters. GPT-2 (2019) had 1.5 billion. GPT-3 (2020) had 175 billion. The trend continued with models estimated to exceed a trillion parameters.
Scaling laws, published by OpenAI in 2020, showed that a language model's loss (its prediction error) falls as a smooth power law of parameter count, data size, and compute budget. This mathematical relationship gives researchers a way to predict how much better a larger model will be, and the predictions have held up remarkably well.
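A power law has a simple practical consequence: every tenfold increase in size multiplies the loss by the same factor. The sketch below shows this for parameter count alone, assuming data and compute are not the bottleneck. The constants `n_c` and `alpha` are illustrative placeholders, not fitted values to rely on, and the function name is hypothetical.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Loss as a power law in parameter count: L(N) = (n_c / N) ** alpha.

    n_c and alpha are illustrative placeholder constants.
    """
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} parameters -> predicted loss {power_law_loss(n):.3f}")

# Each 10x increase in parameters scales the loss by the same factor.
ratio_small = power_law_loss(1e9) / power_law_loss(1e8)
ratio_large = power_law_loss(1e11) / power_law_loss(1e10)
```

With a small exponent like the one assumed here, each tenfold increase in parameters reduces the predicted loss by only about 16 percent, which is one way to see why gains from scale are steady but expensive.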
But scaling is not free. Training GPT-3 cost an estimated $4.6 million in compute. GPT-4 reportedly cost over $100 million. The energy consumption is substantial: by one widely cited estimate, training a single large model can produce as much carbon dioxide as five cars over their entire lifetimes, though the number varies widely depending on the model, the training procedure, and the data center's energy source.
Some researchers argue that scaling will eventually produce AGI, because the scaling laws show no sign of plateauing. Others argue that bigger models are hitting diminishing returns on benchmark tasks, and that fundamentally new architectures or training methods will be needed for the next leap in capability. Both sides have evidence, and the question is not yet settled.
An AI model is a mathematical function with trainable parameters. The architecture defines the structure, the parameters hold the learned knowledge, and training adjusts those parameters to minimize errors on specific tasks. Modern practice favors large pre-trained models that are fine-tuned for specific applications, and the trend toward larger models continues despite growing costs.