What Is a Neural Network?
The Basic Building Block: The Artificial Neuron
An artificial neuron performs one simple operation. It takes multiple numerical inputs, multiplies each input by a corresponding weight, sums the results, adds a bias term, and passes the total through an activation function. The output becomes input for neurons in the next layer.
Written as math, a neuron computes: output = f(w1*x1 + w2*x2 + ... + wn*xn + b), where x values are inputs, w values are weights, b is the bias, and f is the activation function. This is just a weighted sum followed by a nonlinear transformation. Each neuron is trivially simple, but thousands or millions of them working together produce remarkably complex behavior.
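The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the input values, weights, and bias are made up for the example, and ReLU is assumed as the activation function f.

```python
import numpy as np

def relu(z):
    """Activation f (assumed here): keep positive values, clamp negatives to zero."""
    return np.maximum(0.0, z)

def neuron(x, w, b, f=relu):
    """Compute f(w1*x1 + w2*x2 + ... + wn*xn + b) for a single neuron."""
    return f(np.dot(w, x) + b)

# Illustrative values only; a real network learns w and b during training.
x = np.array([1.0, 2.0, 3.0])    # inputs
w = np.array([0.5, -1.0, 2.0])   # weights
b = 0.5                          # bias

output = neuron(x, w, b)
print(output)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```

A layer of neurons is the same computation with a weight matrix in place of the weight vector, which is why neural network code is dominated by matrix multiplication.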
The weights determine what the neuron responds to. A neuron in an image recognition network might have large positive weights for input pixels in a horizontal line pattern and small or negative weights elsewhere. This neuron would activate strongly when it encounters a horizontal edge and weakly for everything else. The neuron did not start with this pattern; it learned these weight values during training by processing thousands of images.
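A toy version of the horizontal-edge neuron makes this concrete. The weights below are set by hand purely for illustration (a trained network would arrive at its own values), and the 3x3 patch size and 0-to-1 pixel scale are assumptions of the sketch.

```python
import numpy as np

# Hypothetical hand-set weights for a 3x3 patch: positive along the middle row,
# negative elsewhere. A trained network would learn values like these from data.
weights = np.array([
    [-1.0, -1.0, -1.0],
    [ 2.0,  2.0,  2.0],
    [-1.0, -1.0, -1.0],
])
bias = 0.0

def activation(patch):
    """Weighted sum of the patch plus bias, passed through ReLU."""
    return max(0.0, float(np.sum(weights * patch) + bias))

horizontal_line = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=float)
vertical_line = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
uniform_patch = np.ones((3, 3))

print(activation(horizontal_line))  # 6.0
print(activation(vertical_line))    # 0.0
print(activation(uniform_patch))    # 0.0
```

The neuron responds strongly to the pattern its weights match and stays silent for the vertical line and the featureless patch.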
Layers: From Input to Output
Neural networks organize neurons into layers. A network has at minimum three types of layers: an input layer that receives the raw data, one or more hidden layers that transform the data through learned operations, and an output layer that produces the final result.
The input layer does not compute anything; it simply passes the raw data to the first hidden layer. For an image classifier processing a 28x28 grayscale image, the input layer has 784 neurons (one per pixel), each holding a value from 0 (black) to 255 (white). For a text model, the input layer receives numerical token embeddings.
Hidden layers perform the actual computation. In a fully connected feedforward network, each neuron in a hidden layer connects to every neuron in the previous layer. The first hidden layer tends to learn simple features, the second layer combines those into more complex features, and so on. Adding more hidden layers allows the network to represent more complex functions, which is why "deep" learning uses networks with many layers. ResNet-152, a popular image model, has 152 layers.
The output layer produces the network's prediction. For a classifier distinguishing 10 categories, the output layer has 10 neurons, each producing a score for one category. A softmax function converts these scores into probabilities that sum to 1. For a regression task (predicting a number), the output layer has a single neuron producing the predicted value.
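As a rough sketch, softmax can be written as follows. The ten scores are invented for the example, and subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow in exp; result is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Made-up raw scores from a 10-neuron output layer, one per category.
scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 3.0, -0.5, 1.5, 0.2])

probs = softmax(scores)
predicted = int(np.argmax(probs))

print(probs.sum())   # 1.0, up to floating-point rounding
print(predicted)     # 6, the category with the highest score
```

Softmax preserves the ranking of the scores, so the predicted category is the same whether it is read from the scores or from the probabilities.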
How a Neural Network Makes a Prediction
Making a prediction is called a forward pass. The data enters the input layer and propagates forward through each hidden layer until it reaches the output. At each neuron, the same operation occurs: weighted sum, add bias, apply activation function, pass the result forward.
Consider a small network classifying handwritten digits (the classic MNIST task). A 28x28 image enters as 784 pixel values. The first hidden layer (say 128 neurons) transforms these 784 values into 128 values, each representing a learned feature like an edge orientation or curve. The second hidden layer (64 neurons) combines those 128 features into 64 higher-level features, perhaps digit parts like loops or straight strokes. The output layer (10 neurons) combines the 64 features into 10 scores, one per digit. The highest-scoring digit is the network's prediction.
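The 784-128-64-10 network described above might be sketched like this. The weights are random and untrained, so the prediction is meaningless; the point is only to show how the shapes flow from layer to layer. Scaling pixels to the 0-1 range, using ReLU in the hidden layers, and the initialization scale of 0.05 are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

# One (weights, biases) pair per layer transition: 784 -> 128 -> 64 -> 10.
layer_sizes = [784, 128, 64, 10]
params = [
    (rng.normal(0.0, 0.05, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def forward(pixels, params):
    """Run one forward pass and return 10 class probabilities."""
    a = pixels / 255.0                 # assumed preprocessing: scale to [0, 1]
    for W, b in params[:-1]:
        a = relu(W @ a + b)            # hidden layers: weighted sum, bias, activation
    W, b = params[-1]
    return softmax(W @ a + b)          # output layer: scores -> probabilities

image = rng.integers(0, 256, size=(28, 28))   # stand-in for a handwritten digit
probs = forward(image.reshape(784), params)
prediction = int(np.argmax(probs))

print(probs.shape)   # (10,)
print(prediction)    # some digit 0-9; arbitrary because the weights are untrained
```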
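The entire forward pass is pure arithmetic: multiply, add, apply function, repeat. A small network like this one performs roughly 110,000 multiply-add operations per prediction (784*128 + 128*64 + 64*10 = 109,184). Large language models operate at a far larger scale. GPT-4's size has not been published, but a widely circulated unofficial estimate puts it at about 1.8 trillion parameters, which would imply on the order of hundreds of billions to trillions of operations per generated token. The computations are simple individually but staggering in aggregate.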
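The operation count for the 784-128-64-10 network can be checked directly. This sketch counts one multiply-add per weight and one addition per bias, and ignores the activation functions.

```python
layer_sizes = [784, 128, 64, 10]

# Each connection between consecutive layers costs one multiply and one add.
multiply_adds = sum(
    n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
# Each non-input neuron adds its bias once.
bias_adds = sum(layer_sizes[1:])

print(multiply_adds)  # 109184
print(bias_adds)      # 202
```

Almost all of the work is in the first layer (784 * 128 = 100,352 multiply-adds), because that is where the input is widest.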
How a Neural Network Learns
Learning in a neural network means finding weight values that produce correct predictions. The process has four steps that repeat for every batch of training data.
Forward pass. The network makes a prediction for each example in the batch.
Loss computation. A loss function measures how wrong the predictions are. For classification, cross-entropy loss is standard. If the network predicts 0.1 probability for the correct class and 0.9 for a wrong class, the loss is high. If it predicts 0.95 for the correct class, the loss is low.
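For a single example, cross-entropy loss reduces to the negative natural logarithm of the probability assigned to the correct class. A minimal sketch using the two probabilities from the paragraph above:

```python
import math

def cross_entropy(prob_correct):
    """Cross-entropy loss for one example: -ln(probability of the correct class)."""
    return -math.log(prob_correct)

print(round(cross_entropy(0.10), 3))  # 2.303 (confident and wrong: high loss)
print(round(cross_entropy(0.95), 3))  # 0.051 (confident and right: low loss)
```

The loss grows without bound as the probability of the correct class approaches zero, so confident mistakes are penalized heavily.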
Backward pass (backpropagation). The loss is propagated backward through the network using the chain rule of calculus. This computes the gradient: how much each weight contributed to the error, and in which direction it should change to reduce it.
Weight update. The optimizer (typically Adam or SGD) adjusts each weight by a small amount in the direction that reduces the loss. The learning rate controls how much each weight changes per step.
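The steps above can be sketched on the smallest possible case: a single sigmoid neuron learning the logical AND function. This is a simplification under stated assumptions. With only one neuron, the backward pass is a single application of the chain rule, and the update is plain full-batch gradient descent, not Adam or mini-batch SGD. The dataset, learning rate, and step count are illustrative choices.

```python
import numpy as np

# Toy dataset (assumption for illustration): the logical AND of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=2)   # small random initial weights
b = 0.0
learning_rate = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(p, y):
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

losses = []
for step in range(2000):
    # 1. Forward pass: predict a probability for every example.
    p = sigmoid(X @ w + b)
    # 2. Loss computation: measure how wrong the predictions are.
    losses.append(binary_cross_entropy(p, y))
    # 3. Backward pass: for sigmoid plus cross-entropy, the chain rule gives
    #    dLoss/dz = (p - y) / N, from which the weight gradients follow.
    grad_z = (p - y) / len(y)
    grad_w = X.T @ grad_z
    grad_b = grad_z.sum()
    # 4. Weight update: step against the gradient, scaled by the learning rate.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

predictions = (sigmoid(X @ w + b) > 0.5).astype(int)
print(losses[0] > losses[-1])   # True: the loss has gone down
print(predictions.tolist())     # [0, 0, 0, 1]
```

In a multi-layer network the structure of the loop is the same; only step 3 grows, applying the chain rule once per layer from the output back to the input.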
This cycle repeats millions of times. Initially, the weights are random and the network's predictions are random. After thousands of iterations, the network begins to learn simple patterns. After millions of iterations, it has learned complex, reliable features. The final weights encode everything the network knows about the task.
Why Neural Networks Need Nonlinearity
Without activation functions, a neural network of any depth would be equivalent to a single linear layer. This is because the composition of linear functions is still linear (with bias terms, strictly speaking, affine). No matter how many layers of weighted sums you stack, the result collapses to one weighted sum of the input plus a constant, which can only represent linear relationships.
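The collapse is easy to verify numerically. In this sketch the layer sizes and random weights are arbitrary; the two stacked layers with no activation between them give exactly the same output as one layer whose weights and bias are computed from theirs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with no activation function in between (sizes are arbitrary).
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

print(np.allclose(two_layers, one_layer))  # True
```

The same algebra applies to any number of stacked layers, so depth adds nothing until a nonlinearity sits between them.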
Activation functions introduce nonlinearity, allowing the network to represent curved, complex decision boundaries. The ReLU function (output the input if positive, zero otherwise) is the most common. It is computationally cheap, does not saturate for positive values (which mitigates the vanishing-gradient problem), and empirically works well across a wide range of tasks.
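A short sketch of ReLU and its gradient follows. The sigmoid function is included only as a point of comparison (it is not discussed in the text above) to show what saturation looks like: its gradient shrinks toward zero for large inputs, while ReLU's stays at 1.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Gradient of the sigmoid, shown for comparison only."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.5, 10.0])

print(relu(z).tolist())        # [0.0, 0.5, 10.0]
print(relu_grad(z).tolist())   # [0.0, 1.0, 1.0]
print(sigmoid_grad(z)[2])      # about 4.5e-05: the sigmoid has saturated at z = 10
```

When many layers are chained, gradients are multiplied together during backpropagation, so a factor of 1 per layer survives where a factor near zero would not.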
The universal approximation theorem states that a feedforward network with a single hidden layer and a suitable nonlinear activation function can approximate any continuous function on a closed, bounded input domain to arbitrary precision, given enough neurons. The theorem guarantees that such a network exists, not that training will find it. In practice, using multiple layers with fewer neurons per layer is usually more efficient than one enormous layer, which is why deep networks dominate over wide, shallow ones.
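The idea can be illustrated, though not proved, with a hand-built example. The sketch below constructs a one-hidden-layer ReLU network whose weights are set directly (no training) so that its output is a piecewise-linear interpolation of sin(x). The function name build_relu_approximator, the target function, and the interval are all choices made for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def build_relu_approximator(f, a, b, n_hidden):
    """Hand-construct a one-hidden-layer ReLU network that interpolates f on [a, b].

    Hidden neuron i computes relu(x - knot_i). The output weights are the changes
    in slope between consecutive line segments, so the network's output is the
    piecewise-linear curve through f at the knots.
    """
    knots = np.linspace(a, b, n_hidden + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)    # slope of each segment
    out_weights = np.diff(slopes, prepend=0.0)   # slope change at each knot
    hidden_bias = -knots[:-1]                    # input weight is 1 for every neuron
    out_bias = values[0]

    def network(x):
        x = np.asarray(x, dtype=float)
        hidden = relu(x[..., None] + hidden_bias)
        return hidden @ out_weights + out_bias

    return network

xs = np.linspace(0.0, 2.0 * np.pi, 1000)
small_net = build_relu_approximator(np.sin, 0.0, 2.0 * np.pi, n_hidden=5)
large_net = build_relu_approximator(np.sin, 0.0, 2.0 * np.pi, n_hidden=50)

error_small = float(np.max(np.abs(small_net(xs) - np.sin(xs))))
error_large = float(np.max(np.abs(large_net(xs) - np.sin(xs))))

print(error_large < error_small)  # True: more hidden neurons, smaller error
print(error_large < 0.01)         # True
```

Adding hidden neurons keeps shrinking the error, which is the behavior the theorem describes; a trained network would have to discover comparable weights through gradient descent.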
A Brief History
The idea of artificial neurons dates to 1943, when McCulloch and Pitts showed that networks of simple threshold units could compute logical functions. Frank Rosenblatt built the perceptron in 1958, a single-layer network that could learn to classify linearly separable patterns. Enthusiasm was high until Minsky and Papert published their 1969 book demonstrating the perceptron's limitations, which contributed to the first "AI winter" of reduced funding and interest.
The field revived in the 1980s with the popularization of backpropagation (credited to Rumelhart, Hinton, and Williams in 1986, though the algorithm had been discovered earlier). Backpropagation made training multi-layer networks practical, enabling networks to learn nonlinear patterns. The second wave produced useful applications like handwriting recognition and speech processing, but computational limits prevented scaling to larger problems.
The modern deep learning revolution began around 2012, when Alex Krizhevsky's AlexNet, a deep CNN trained on GPUs, won the ImageNet competition by a large margin. The combination of more data (the internet), more compute (GPUs), and better training techniques (ReLU, dropout, batch normalization) unlocked capabilities that had been theoretically possible but practically unreachable. Since then, neural networks have advanced at a pace that has repeatedly surprised even researchers in the field.
Why Neural Networks Work So Well
Several properties make neural networks exceptionally effective learning machines.
Compositionality. Each layer builds on the representations of the previous layer, creating a hierarchy from simple to complex features. This compositional structure matches the hierarchical structure of many real-world problems: images are composed of objects, objects of parts, parts of edges; sentences are composed of clauses, clauses of phrases, phrases of words.
Distributed representations. Each concept is represented by a pattern of activity across many neurons, and each neuron participates in representing many concepts. This distributed coding is extraordinarily efficient, allowing a network with N neurons to represent far more than N concepts.
End-to-end learning. The network learns all of its representations from raw data to final output in a single, unified optimization. There is no hand-designed feature extraction step. This means the features the network learns are optimized for the actual task, not for what a human thought would be useful features.
Scalability. Neural network performance improves predictably with more parameters, more data, and more compute. Empirical scaling laws show that loss falls roughly as a power law in each of these resources, so each doubling produces a consistent, quantifiable (though gradually diminishing) improvement. This predictability has enabled the systematic investment in larger models that has driven recent advances.
A neural network is a system of layered artificial neurons that learns to perform tasks by adjusting weights through training. Each neuron computes a simple weighted sum with a nonlinear activation, but millions of neurons working together produce systems capable of image recognition, language understanding, and generation. Neural networks learn their own features from data rather than following hand-coded rules, and their performance scales predictably with size, data, and compute.