How Artificial Neurons Work
The Computation Inside a Single Neuron
Every artificial neuron performs the same three-step operation. First, it computes a weighted sum of its inputs. If the neuron receives three inputs with values 0.5, 0.8, and 0.2, and the corresponding weights are 0.3, -0.7, and 1.2, the weighted sum is (0.5 * 0.3) + (0.8 * -0.7) + (0.2 * 1.2) = 0.15 - 0.56 + 0.24 = -0.17.
Second, it adds a bias term. The bias is a learned constant that shifts the neuron's activation threshold. If the bias is 0.5, the pre-activation value becomes -0.17 + 0.5 = 0.33. The bias allows the neuron to activate even when all inputs are zero, and it gives the neuron additional flexibility in fitting the data.
Third, the pre-activation value passes through an activation function. If using ReLU, the function outputs 0.33 (since the value is positive). If using sigmoid, it outputs approximately 0.58. The activation function's nonlinearity is what gives neural networks their power, allowing them to learn curved decision boundaries and complex mappings that linear models cannot represent.
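The three steps above can be sketched in a few lines of Python. This is a minimal illustration using the example numbers from the text; the function names (`neuron`, `relu`, `sigmoid`) are chosen here for illustration and do not come from any particular library.

```python
import math

def relu(z):
    # ReLU passes positive values through and clips negatives to zero
    return max(0.0, z)

def sigmoid(z):
    # Sigmoid squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    # Step 1: weighted sum of the inputs
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Step 2: add the bias to get the pre-activation value
    z = weighted_sum + bias
    # Step 3: apply the activation function
    return z, activation(z)

inputs = [0.5, 0.8, 0.2]
weights = [0.3, -0.7, 1.2]
bias = 0.5

z, relu_out = neuron(inputs, weights, bias, relu)
_, sigmoid_out = neuron(inputs, weights, bias, sigmoid)
print(round(z, 2), round(relu_out, 2), round(sigmoid_out, 2))  # 0.33 0.33 0.58
```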
Weights: The Learned Knowledge
Weights are the most important part of a neuron because they encode what the neuron has learned. A weight controls how much influence one input has on the neuron's output. A large positive weight means that input strongly activates the neuron. A large negative weight means that input strongly suppresses the neuron. A weight near zero means the neuron essentially ignores that input.
Before training, weights are initialized randomly, typically from a carefully chosen distribution. Xavier initialization (for sigmoid and tanh activations) and He initialization (for ReLU) set the initial weight variance based on the number of connections into each neuron (Xavier also takes the number of outgoing connections into account), preventing signals from growing too large or shrinking too small as they propagate through the network. Poor initialization can slow training severely or prevent the network from learning at all.
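As a rough sketch of how these schemes pick a scale, the snippet below uses the commonly cited normal-distribution variants: variance 2 / (fan_in + fan_out) for Xavier and 2 / fan_in for He. The helper names and the layer sizes (256 inputs, 128 neurons) are assumptions made for illustration; deep learning frameworks provide their own initializers, which may differ in detail.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot (normal variant): variance = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He (normal variant): variance = 2 / fan_in
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, rng):
    # One row of fan_in weights per neuron, drawn from N(0, std^2)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
fan_in, fan_out = 256, 128      # hypothetical layer sizes

w_xavier = init_weights(fan_in, fan_out, xavier_std(fan_in, fan_out), rng)
w_he = init_weights(fan_in, fan_out, he_std(fan_in), rng)
```

Note that the standard deviation shrinks as the number of connections grows, which is what keeps the weighted sums from blowing up in wide layers.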
During training, each weight is adjusted by gradient descent. The gradient of the loss with respect to a weight indicates how the loss would change if that weight increased, so the optimizer moves the weight a small step in the opposite direction to reduce the loss. Over many updates (often millions), the weights settle on values that produce useful computations. In a trained image recognition network, the weights of a neuron in the first layer might form a pattern that detects a specific type of edge, while a neuron in a deeper layer might have weights that combine multiple edge-detecting neurons into a shape detector.
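To make the update rule concrete, here is a minimal sketch of gradient descent on a single linear neuron (no activation function) trained with mean squared error. The toy data, learning rate, and step count are assumptions chosen so the example converges quickly; real networks compute gradients with backpropagation across many layers.

```python
def train_linear_neuron(data, lr=0.1, steps=500):
    # Start from zero for simplicity; a single neuron has no symmetry to break
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Step against the gradient to reduce the loss
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear_neuron(data)
print(round(w, 3), round(b, 3))
```

With this data the learned weight and bias approach 2 and 1, the values used to generate the targets.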
The number of weights in a layer equals the number of input connections times the number of neurons. A fully connected layer with 1,000 inputs and 1,000 neurons has 1,000,000 weights. This is why neural networks accumulate billions of parameters quickly, especially with many large layers.
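The arithmetic can be checked with a short helper. The function names below are illustrative; the bias count (one per neuron) is included as an option because the text counts weights only.

```python
def dense_layer_params(n_inputs, n_neurons, include_bias=True):
    # One weight per (input, neuron) pair, plus one bias per neuron
    weights = n_inputs * n_neurons
    biases = n_neurons if include_bias else 0
    return weights + biases

def network_params(layer_sizes):
    # Sum over consecutive pairs of layer sizes in a fully connected stack
    return sum(dense_layer_params(a, b)
               for a, b in zip(layer_sizes, layer_sizes[1:]))

print(dense_layer_params(1000, 1000, include_bias=False))  # 1000000
print(network_params([1000, 1000, 1000]))                  # 2002000
```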
Connection Patterns
The way neurons connect to each other defines the network's architecture and determines what kinds of patterns it can learn efficiently.
Fully connected (dense) layers connect every neuron in one layer to every neuron in the next. This is the most general connection pattern, making no assumptions about the structure of the data. It is also the most parameter-heavy: a dense connection between two layers of 4,096 neurons requires over 16 million weights.
Convolutional connections share weights across spatial positions. A single 3x3 filter with 9 weights is applied at every position in the image, detecting the same pattern wherever it occurs. This weight sharing reduces parameters dramatically (a convolutional layer might have 50,000 parameters where a dense layer would need 50 million) and builds in translation equivariance: a feature produces the same response wherever it appears, only at a correspondingly shifted position in the output. Combined with pooling, this lets the network recognize a feature regardless of where it appears in the image.
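A small pure-Python sketch can illustrate the weight sharing. The same 9-weight filter slides over the whole image, and when the feature moves, the response moves with it. The vertical-edge filter and the tiny 8x8 test images are assumptions made for illustration; as in most deep learning libraries, the operation shown is technically cross-correlation (the filter is not flipped).

```python
def conv2d_valid(image, kernel):
    # Slide the kernel over every position where it fits entirely
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A 3x3 filter (9 shared weights) that responds to dark-to-bright vertical edges
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

def image_with_edge(col, size=8):
    # Dark (0) to the left of `col`, bright (1) from `col` onward
    return [[1 if c >= col else 0 for c in range(size)] for _ in range(size)]

resp_a = conv2d_valid(image_with_edge(3), kernel)
resp_b = conv2d_valid(image_with_edge(5), kernel)
print(resp_a[0])  # [0, 3, 3, 0, 0, 0]
print(resp_b[0])  # [0, 0, 0, 3, 3, 0]
```

Moving the edge two pixels to the right shifts the response two positions to the right, with no extra weights needed to detect it in the new location.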
Recurrent connections feed a neuron's output back to itself (or its layer) at the next time step. This creates a loop that maintains state across sequential inputs. The same weights process every time step, with the hidden state carrying information from previous steps. This design is natural for sequential data where the same computation should apply at every position.
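The loop described above can be sketched with a scalar hidden state. This is a deliberately simplified recurrent unit with a tanh activation; the weight values and the names `w_x`, `w_h`, and `b` are assumptions for illustration, and practical recurrent layers use vectors and weight matrices.

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    # New state depends on the current input and the previous state
    return math.tanh(w_x * x + w_h * h + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0                      # initial hidden state
    states = []
    for x in sequence:           # the same weights are reused at every step
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Only the first input is nonzero, yet later states are still nonzero:
# the hidden state carries information forward from earlier steps.
states = run_rnn([1.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```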
Attention connections are dynamic rather than fixed. In a transformer, each position in the input (for example, each token) computes attention weights that determine how much it attends to every other position. The connection strengths change for every input, allowing the network to focus on the most relevant parts of the data for each specific prediction. This flexibility is one reason transformers outperform fixed-connectivity architectures on many tasks.
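A minimal sketch of this idea is scaled dot-product attention for a single query: scores come from dot products with each key, a softmax turns them into weights that sum to 1, and the output is the weighted average of the values. The toy keys, values, and queries below are assumptions; a real transformer derives queries, keys, and values from learned projections and uses multiple heads.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    # Attention weights depend on the input, so they change per query
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values))
              for d in range(dim)]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights_a, out_a = attend([2.0, 0.0], keys, values)
weights_b, out_b = attend([0.0, 2.0], keys, values)
print([round(w, 3) for w in weights_a])
print([round(w, 3) for w in weights_b])
```

The two queries see the same keys and values but produce different weight patterns, which is the sense in which the connections are dynamic.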
Skip connections (residual connections) pass the input of a layer directly to a layer further ahead, bypassing intermediate layers. The intermediate layers learn to compute a residual, a correction to add to the input, rather than a complete transformation. Skip connections are essential for training very deep networks because they give gradients a direct path backward through the network, preventing them from vanishing in the intermediate layers.
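In code, a skip connection is just an addition. The sketch below wraps one small dense-plus-ReLU layer in a residual block; the helper names and the zero-weight example are assumptions used to show that when the inner layer contributes nothing, the block reduces to the identity.

```python
def dense_relu(x, weights, biases):
    # One dense layer followed by ReLU; each row of `weights` is one neuron
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def residual_block(x, weights, biases):
    # The inner layer computes a correction that is added to the input
    residual = dense_relu(x, weights, biases)
    return [xi + ri for xi, ri in zip(x, residual)]

x = [1.0, -2.0, 0.5]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

# With all-zero weights the correction is zero, so the input passes through
print(residual_block(x, zero_w, zero_b))  # [1.0, -2.0, 0.5]
```

Because the output contains the input as a direct summand, the gradient with respect to the input always includes an unattenuated term, which is the direct backward path the text refers to.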
What Individual Neurons Learn
Researchers have studied what specific neurons respond to by finding the inputs that maximally activate each neuron, a technique called feature visualization. The results reveal a clear hierarchy.
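The core idea, searching for an input that maximizes a neuron's activation, can be sketched on a toy sigmoid neuron using gradient ascent on the input rather than on the weights. Everything here (the weight vector, the step size, and the unit-norm constraint that keeps the input from growing without bound) is an assumption for illustration; feature visualization on real image networks backpropagates through many layers and relies on additional regularization.

```python
import math

def activation(x, w, b):
    # Sigmoid neuron: weighted sum plus bias, squashed into (0, 1)
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1.0 / (1.0 + math.exp(-z))

def maximize_activation(w, b, steps=200, lr=0.5):
    x = [0.0] * len(w)                     # start from a blank input
    for _ in range(steps):
        a = activation(x, w, b)
        # Derivative of the sigmoid output with respect to each input
        grad = [a * (1.0 - a) * wi for wi in w]
        # Gradient ascent: move the input to increase the activation
        x = [xi + lr * g for xi, g in zip(x, grad)]
        # Project back onto the unit ball so the input stays bounded
        norm = math.sqrt(sum(xi * xi for xi in x))
        if norm > 1.0:
            x = [xi / norm for xi in x]
    return x

w = [3.0, -4.0, 0.0]                       # hypothetical learned weights
best_input = maximize_activation(w, b=0.0)
print([round(v, 3) for v in best_input])
```

For a single neuron the optimized input ends up aligned with the weight vector, which matches the intuition that a neuron's weights describe the pattern it responds to most strongly.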
Neurons in the first layer of an image network learn edge detectors: Gabor-like filters that respond to edges at specific orientations and frequencies. These are among the simplest visual features, and remarkably, CNNs trained on natural images tend to discover very similar first-layer features independently, largely regardless of architecture or task.
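For reference, a Gabor filter is a sinusoidal wave multiplied by a Gaussian envelope. The sketch below builds one from the standard formula; the parameter names and the specific values (a 7x7 kernel, wavelength 4, sigma 2) are assumptions chosen for illustration, not values taken from any trained network.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    # size: odd kernel width; theta: orientation in radians;
    # wavelength: period of the sinusoid; sigma: width of the Gaussian;
    # gamma: aspect ratio of the envelope; psi: phase offset
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the wave runs along direction theta
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2)
                                / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

vertical = gabor_kernel(size=7, wavelength=4.0, theta=0.0, sigma=2.0)
diagonal = gabor_kernel(size=7, wavelength=4.0, theta=math.pi / 4, sigma=2.0)
```

Changing `theta` rotates the filter and changing `wavelength` alters the spatial frequency it responds to, which corresponds to the orientations and frequencies mentioned above.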
Neurons in middle layers respond to textures, patterns, and simple object parts. One neuron might activate for polka-dot patterns, another for honeycomb textures, another for circular shapes that resemble eyes. These features are more specific to the training data and task than the edge detectors in the first layer.
Neurons in deep layers respond to high-level concepts. Researchers have identified neurons that activate for specific object categories (dogs, cars, buildings), specific attributes (color, texture, orientation), and even specific individuals (there are neurons in face recognition networks that respond to specific people). These high-level neurons emerge from the combination of many simpler neurons in earlier layers.
In language models, individual neurons have been found that respond to specific syntactic structures (negation, questions, relative clauses), specific topics (sports, science, politics), and even specific factual associations (neurons that activate when the model processes text about the relationship between countries and capitals).
From Single Neurons to Network Intelligence
No single neuron is intelligent. A neuron that detects horizontal edges cannot recognize a cat. But when thousands of edge-detecting neurons feed into texture-detecting neurons, which feed into shape-detecting neurons, which feed into object-detecting neurons, the network as a whole can recognize cats with high accuracy, on some image classification benchmarks matching or exceeding human performance. The intelligence is emergent, arising from the interaction of many simple parts rather than residing in any individual component.
This emergent intelligence is both the power and the mystery of neural networks. The individual computations are completely transparent (multiply, add, apply function), but the behavior that emerges from billions of these computations interacting is difficult to predict or explain. Understanding this gap between simple components and complex behavior is one of the central challenges in AI interpretability research.
An artificial neuron computes a weighted sum of its inputs, adds a bias, and applies a nonlinear activation function. The weights are learned during training and encode what the neuron responds to. Different connection patterns (dense, convolutional, recurrent, attention-based) determine what kinds of data structures the network can process efficiently. Individual neurons learn simple features, but their collective interaction produces the complex pattern recognition that makes neural networks useful.