Weights and Biases in Neural Networks

Updated May 2026
Weights and biases are the learnable parameters inside a neural network. Weights control the strength of connections between neurons, determining how much influence one neuron's output has on another neuron's input. Biases are additional constants added to each neuron that shift its activation threshold. Together, weights and biases define the mathematical function that the network computes, and training a neural network means finding good values for them.

Weights: Connection Strengths

Every connection between two neurons has a weight, a single number that multiplies the signal traveling along that connection. If neuron A has an output of 0.8 and the weight connecting A to neuron B is 0.5, then A's contribution to B's input is 0.8 * 0.5 = 0.4. If the weight were -0.5, the contribution would be -0.4, meaning A would inhibit rather than excite neuron B.

The sign and magnitude of a weight encode what the network has learned about the relationship between two neurons. A large positive weight means "when this input neuron is active, the output neuron should also be active." A large negative weight means "when this input neuron is active, the output neuron should be suppressed." A weight near zero means "this input is not relevant to this output."

In a fully connected layer with 1,000 input neurons and 1,000 output neurons, there are 1,000,000 weights (every input connected to every output). Each weight is an independent learnable parameter. This is why neural networks accumulate millions to billions of parameters: the weight matrices between large layers are enormous.

Weights are typically stored as matrices. A layer connecting n inputs to m outputs has a weight matrix of shape m x n. The forward pass for that layer is a matrix multiplication: output = W * input + bias. This formulation allows efficient computation on GPUs, which are optimized for large matrix operations.
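As a minimal sketch of the formula output = W * input + bias, the plain-Python function below multiplies an m x n weight matrix by an input vector and adds the bias. The function name dense_forward and the numbers are illustrative assumptions; real frameworks run the same arithmetic with optimized matrix routines.

```python
def dense_forward(W, x, b):
    """Compute W * x + b for an m x n matrix W, input x (length n), bias b (length m)."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]

# Hypothetical layer: n = 3 inputs, m = 2 outputs, so W has shape 2 x 3.
W = [
    [0.5, -0.5, 0.0],
    [1.0, 0.0, 2.0],
]
x = [0.8, 0.2, 0.1]
b = [0.0, 0.1]

output = dense_forward(W, x, b)
print(output)  # one value per output neuron
```

The first entry of the first row reproduces the earlier example: an input of 0.8 multiplied by a weight of 0.5 contributes 0.4 to that neuron's sum.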

Biases: Shifting the Threshold

Each neuron has a single bias value that is added to its weighted sum before the activation function is applied. The bias acts as a learnable threshold that controls how easily the neuron activates.

Without a bias, a neuron's pre-activation value is zero whenever all its inputs are zero. For an all-zero input, the activation function therefore returns its value at zero (0.5 for sigmoid, 0 for ReLU) no matter what weights were learned. The bias gives the neuron the freedom to shift this baseline. A positive bias makes the neuron more likely to activate (it "wants" to fire by default). A negative bias makes it harder to activate (it requires strong positive input to fire).
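The effect of the bias on the baseline can be checked with a few lines of Python. This is a sketch with arbitrary illustrative weights; the helper names are assumptions, not part of any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def pre_activation(weights, inputs, bias=0.0):
    """Weighted sum of the inputs plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

weights = [2.0, -3.0]    # arbitrary values; they do not matter for a zero input
zero_input = [0.0, 0.0]

z_no_bias = pre_activation(weights, zero_input)
z_pos_bias = pre_activation(weights, zero_input, bias=1.0)
z_neg_bias = pre_activation(weights, zero_input, bias=-1.0)

for label, z in [("no bias", z_no_bias), ("bias +1", z_pos_bias), ("bias -1", z_neg_bias)]:
    print(label, "sigmoid:", sigmoid(z), "relu:", relu(z))
```

With no bias the neuron is pinned to the activation's value at zero; a positive bias raises the baseline and a negative bias lowers it (or keeps ReLU at zero).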

In mathematical terms, the bias allows the network to fit functions that do not pass through the origin. A linear model y = wx can only represent lines that pass through the point (0, 0). Adding a bias, y = wx + b, lets the line shift up or down. The same principle extends to neural networks: biases let the network represent a wider family of functions.

A layer with m output neurons has m biases, one per neuron. Compared to the millions of weights in the weight matrix, biases are a tiny fraction of the total parameters (1,000 biases vs. 1,000,000 weights in our earlier example). But removing biases can measurably hurt performance because it constrains the functions the network can represent.
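The parameter counts above can be reproduced with a small helper. The function name dense_param_count is a hypothetical one chosen for this sketch.

```python
def dense_param_count(n_inputs, n_outputs, use_bias=True):
    """Return (weights, biases) for a fully connected layer."""
    weights = n_inputs * n_outputs          # one weight per input-output pair
    biases = n_outputs if use_bias else 0   # one bias per output neuron
    return weights, biases

weights, biases = dense_param_count(1000, 1000)
bias_fraction = biases / (weights + biases)

print(weights, biases)
print(f"biases are {bias_fraction:.4%} of the layer's parameters")
```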

Initialization: Where Weights Start

Weights are initialized randomly before training begins, but the choice of initialization distribution has a significant impact on whether training succeeds.

If weights are initialized too large, the signals flowing through the network grow exponentially with each layer, causing numerical overflow and unstable training. If weights are initialized too small, signals shrink exponentially, vanishing to near zero by the time they reach the output, and gradients during backpropagation vanish as well.

Xavier initialization (also called Glorot initialization) sets each weight by sampling from a distribution with variance 2 / (n_in + n_out), where n_in and n_out are the number of input and output neurons for that layer. This keeps the variance of activations roughly constant across layers, preventing both explosion and vanishing. Xavier initialization is designed for sigmoid and tanh activations.

He initialization (also called Kaiming initialization) uses variance 2 / n_in, which is larger than the Xavier value (twice as large when n_in equals n_out). This accounts for the fact that ReLU sets roughly half of its outputs to zero, which roughly halves the signal variance at each layer. He initialization compensates by starting with larger weights. It is the standard for networks using ReLU and its variants.
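The two variance formulas can be sketched in plain Python as follows. This example assumes a zero-mean Gaussian distribution (both schemes also have uniform variants), and the layer sizes and helper names are illustrative.

```python
import math
import random

def xavier_std(n_in, n_out):
    """Standard deviation for Xavier/Glorot: variance 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))

def he_std(n_in):
    """Standard deviation for He/Kaiming: variance 2 / n_in."""
    return math.sqrt(2.0 / n_in)

def init_weights(n_in, n_out, std, rng):
    """Sample an n_out x n_in weight matrix from a zero-mean Gaussian."""
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def sample_variance(W):
    values = [w for row in W for w in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_in, n_out = 256, 128   # hypothetical layer sizes

W_xavier = init_weights(n_in, n_out, xavier_std(n_in, n_out), rng)
W_he = init_weights(n_in, n_out, he_std(n_in), rng)
biases = [0.0] * n_out   # biases commonly start at zero

print("Xavier target variance:", 2.0 / (n_in + n_out), "sampled:", sample_variance(W_xavier))
print("He target variance:    ", 2.0 / n_in, "sampled:", sample_variance(W_he))
```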

Biases are typically initialized to zero. Starting with zero biases means each neuron has no initial preference for activating or not; the preferences develop entirely through training. Some architectures initialize biases to small positive values to prevent ReLU neurons from starting in the dead zone, but zero initialization is the most common default.

How Weights and Biases Learn

During training, gradient descent adjusts every weight and bias to reduce the loss function. The process is identical for both:

The forward pass computes the network's prediction using the current weight and bias values. The loss function measures how wrong the prediction is. Backpropagation computes the gradient of the loss with respect to each weight and each bias. The optimizer updates each parameter by subtracting a fraction (the learning rate) of its gradient.
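The four steps can be sketched on the smallest possible model, a single linear neuron y = wx + b trained with mean squared error. The target line y = 2x + 1, the learning rate, and the step count are assumptions chosen for illustration.

```python
# Toy data sampled from the hypothetical target y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [-1.0, -0.5, 0.0, 0.5, 1.0]]

w, b = 0.0, 0.0
learning_rate = 0.1

def mean_squared_error(w, b):
    """Loss: average squared difference between prediction and target."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

initial_loss = mean_squared_error(w, b)

for step in range(500):
    # Forward pass and gradient of the loss with respect to w and b.
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / len(data)
    # Optimizer step: move each parameter against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mean_squared_error(w, b)
print(f"w = {w:.4f}, b = {b:.4f}, loss {initial_loss:.4f} -> {final_loss:.2e}")
```

Both parameters go through exactly the same update rule; only the gradient formula differs.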

For a weight connecting neuron A to neuron B, the gradient depends on: how much the loss depends on neuron B's output (the downstream gradient) and how strongly neuron A was active during this example (the input activation). If both the downstream gradient and the input activation are large, the weight gets a large update. This is the credit assignment mechanism: weights on pathways that contributed most to the error get the largest corrections.

For a bias, the gradient is simply the downstream gradient (there is no input activation to multiply by, since the bias is added regardless of the input). Equivalently, a bias behaves like a weight attached to an input that is always 1, so it receives a gradient even on examples where the neuron's inputs are all zero.
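The relationship between the two gradients can be verified numerically. The sketch below assumes a single sigmoid neuron B fed by one input activation a from neuron A, with a squared-error loss; all names and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, a, target):
    """Squared error of a single sigmoid neuron with input activation a."""
    out = sigmoid(w * a + b)
    return 0.5 * (out - target) ** 2

def gradients(w, b, a, target):
    out = sigmoid(w * a + b)
    delta = (out - target) * out * (1.0 - out)  # downstream gradient dL/dz
    grad_w = delta * a                          # weight: downstream gradient times input activation
    grad_b = delta                              # bias: downstream gradient alone
    return grad_w, grad_b

w, b, a, target = 0.5, 0.1, 0.8, 1.0
grad_w, grad_b = gradients(w, b, a, target)

# Central finite differences as an independent check.
eps = 1e-6
numeric_grad_w = (loss(w + eps, b, a, target) - loss(w - eps, b, a, target)) / (2 * eps)
numeric_grad_b = (loss(w, b + eps, a, target) - loss(w, b - eps, a, target)) / (2 * eps)

print("analytic:", grad_w, grad_b)
print("numeric: ", numeric_grad_w, numeric_grad_b)
```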

Weight Matrices in Different Architectures

Dense layers have a full m x n weight matrix. Every input connects to every output. This is the most general and most parameter-heavy configuration.

Convolutional layers use a small weight tensor (the filter) shared across all spatial positions. A 3x3 filter with 64 output channels on a 3-channel input has 3 * 3 * 3 * 64 = 1,728 weights, compared to the millions a dense layer would need for the same input size. Weight sharing is what makes CNNs practical for images.
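The weight-count arithmetic can be checked directly. The dense comparison below assumes a 224 x 224 RGB image flattened and connected to 64 output units; that image size is a hypothetical choice, not a figure from the text.

```python
def conv2d_weight_count(kernel_h, kernel_w, in_channels, out_channels):
    """Weights in a convolutional layer (biases not included)."""
    return kernel_h * kernel_w * in_channels * out_channels

def dense_weight_count(n_inputs, n_outputs):
    """Weights in a fully connected layer (biases not included)."""
    return n_inputs * n_outputs

conv_weights = conv2d_weight_count(3, 3, in_channels=3, out_channels=64)

# Hypothetical comparison: flatten a 224 x 224 x 3 image into a dense layer with 64 outputs.
dense_weights = dense_weight_count(224 * 224 * 3, 64)

print(conv_weights)   # the same filter weights are reused at every spatial position
print(dense_weights)
```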

Attention layers have four kinds of weight matrices: W_Q, W_K, and W_V for each head (projecting inputs to queries, keys, and values) and a single W_O per layer (projecting the concatenated head outputs back to the model width). These are dense matrices, but each head's projections map to a smaller dimension than the full model width, keeping the parameter count manageable.
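A rough parameter count for one attention layer can be sketched as follows. It assumes the common multi-head layout in which each head has its own W_Q, W_K, and W_V of shape d_model x d_head with d_head = d_model / num_heads, one W_O is applied to the concatenated head outputs, and biases are ignored. The sizes 512 and 8 are illustrative.

```python
def attention_weight_count(d_model, num_heads):
    """Weights in one multi-head attention layer, ignoring biases."""
    d_head = d_model // num_heads                       # assumed per-head dimension
    per_head_qkv = 3 * d_model * d_head                 # W_Q, W_K, W_V for one head
    output_projection = (num_heads * d_head) * d_model  # single W_O for the layer
    return num_heads * per_head_qkv + output_projection

total = attention_weight_count(d_model=512, num_heads=8)
print(total)
```

Under these assumptions the total equals 4 * d_model * d_model regardless of the number of heads, because adding heads shrinks each head's projection proportionally.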

Embedding layers are essentially a weight matrix where each row is the learned vector for one token. The "forward pass" is a table lookup rather than a matrix multiplication, but the weights are learned by gradient descent just like any other parameters.
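The equivalence between a table lookup and a matrix multiplication can be shown on a tiny, made-up embedding table. The vocabulary size, vector values, and function names here are assumptions for illustration.

```python
# Hypothetical embedding table: 4 tokens, 3 dimensions each.
embedding = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def lookup(embedding, token_id):
    """Table lookup: return the row for this token."""
    return embedding[token_id]

def one_hot_matmul(embedding, token_id):
    """Same result as lookup, computed as a one-hot vector times the matrix."""
    vocab_size, dim = len(embedding), len(embedding[0])
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * embedding[i][j] for i in range(vocab_size))
        for j in range(dim)
    ]

token_id = 2
print(lookup(embedding, token_id))
print(one_hot_matmul(embedding, token_id))
```

The lookup skips all the multiplications by zero, which is why it is the preferred implementation for large vocabularies.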

Regularizing Weights

Without regularization, weights can grow very large during training, leading to a model that memorizes training data rather than learning generalizable patterns. Several techniques constrain weight values:

L2 regularization (often called weight decay; the two are equivalent under plain gradient descent) adds the sum of squared weights, scaled by a coefficient, to the loss function, penalizing large values. This pushes all weights toward zero, effectively preferring simpler models that use smaller weights. A weight decay coefficient of 0.01 is a common starting point.
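A minimal sketch of the L2 penalty and its effect on a plain gradient descent step follows. The weights, learning rate, and function names are illustrative, and the data gradient is set to zero so that only the shrinking effect of the penalty is visible.

```python
def l2_penalty(weights, coeff):
    """Penalty term added to the loss: coeff times the sum of squared weights."""
    return coeff * sum(w * w for w in weights)

def sgd_step_with_l2(weights, data_grads, learning_rate, coeff):
    """One gradient descent step on (data loss + L2 penalty)."""
    return [
        w - learning_rate * (g + 2.0 * coeff * w)  # gradient of coeff * w^2 is 2 * coeff * w
        for w, g in zip(weights, data_grads)
    ]

weights = [3.0, -2.0, 0.5]
coeff = 0.01
penalty = l2_penalty(weights, coeff)

# With a zero data gradient, each weight shrinks by the factor 1 - 2 * lr * coeff.
new_weights = sgd_step_with_l2(weights, [0.0, 0.0, 0.0], learning_rate=0.1, coeff=coeff)

print(penalty)
print(new_weights)
```

Because the shrinkage is proportional to the weight itself, large weights are pulled back hardest while small weights barely move.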

L1 regularization adds the sum of absolute weight values to the loss. Unlike L2, L1 can push weights to exactly zero, effectively pruning connections and producing a sparser network. This is useful when you want the model to discover which connections are truly unnecessary.
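One way to see why L1 reaches exact zeros is to compare it with L2 on the same weights. The sketch below applies the L1 penalty with a soft-thresholding (proximal) step, which is one common technique and an assumption of this example rather than something the text prescribes; the weights and coefficients are illustrative.

```python
def l1_penalty(weights, coeff):
    """Penalty term added to the loss: coeff times the sum of absolute weights."""
    return coeff * sum(abs(w) for w in weights)

def soft_threshold(w, amount):
    """Shrink w toward zero by a fixed amount, stopping at exactly zero."""
    if w > amount:
        return w - amount
    if w < -amount:
        return w + amount
    return 0.0

weights = [0.5, -0.03, 0.002]
learning_rate, coeff = 0.1, 0.5

# L1: every weight moves toward zero by the same fixed amount (lr * coeff).
l1_result = [soft_threshold(w, learning_rate * coeff) for w in weights]

# L2 for comparison: every weight shrinks by the same factor, so none reaches zero.
l2_result = [w * (1.0 - 2.0 * learning_rate * coeff) for w in weights]

print(l1_result)
print(l2_result)
```

The small weights are set to exactly zero under L1, while under L2 they only get proportionally smaller.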

Dropout randomly sets neuron outputs to zero during training, which temporarily removes those neurons and all of their outgoing connections for that training step. This forces the network to not rely on any single neuron or connection, so it learns more robust, distributed representations.
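Dropout can be sketched in a few lines. This example uses the "inverted dropout" variant, which rescales the surviving outputs by 1 / keep_prob so that no adjustment is needed at inference time; that rescaling is a widespread convention assumed here, not something stated above.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)  # dropout is disabled outside training
    keep_prob = 1.0 - drop_prob
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0  # rescale survivors
        for a in activations
    ]

rng = random.Random(0)       # fixed seed so the sketch is repeatable
activations = [1.0] * 10000  # hypothetical layer output

dropped = dropout(activations, drop_prob=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)

print("fraction zeroed:", zero_fraction)
print("mean after dropout:", sum(dropped) / len(dropped))
```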

Weight clipping caps weight values at a maximum magnitude. This is used primarily in generative adversarial networks and certain reinforcement learning algorithms to stabilize training.
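Weight clipping itself is a one-line operation, typically applied after each optimizer step. The threshold of 0.01 and the weights below are illustrative values.

```python
def clip_weights(weights, max_magnitude):
    """Clamp every weight into the range [-max_magnitude, max_magnitude]."""
    return [max(-max_magnitude, min(max_magnitude, w)) for w in weights]

weights = [0.5, -0.02, 0.003, -1.2]
clipped = clip_weights(weights, max_magnitude=0.01)

print(clipped)  # weights already inside the range are left unchanged
```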

Key Takeaway

Weights control the strength of connections between neurons, and biases shift activation thresholds. Together, they define everything a neural network computes. Proper initialization (Xavier for sigmoid/tanh, He for ReLU) is critical for training stability. Weights and biases are both learned through the same gradient descent process, with weights receiving updates proportional to how much they contributed to the prediction error. Regularization techniques like weight decay and dropout prevent weights from growing too large and causing overfitting.