We have covered what neural networks are, how their layers are structured, and how activation functions introduce nonlinearity. But there is one critical question we have not fully answered: How does a neural network actually learn?
The answer is backpropagation — short for “backward propagation of errors.” It is the algorithm that allows a neural network to adjust its weights based on the mistakes it makes. Without backpropagation, deep learning as we know it would not exist.
And despite its fearsome reputation, the core idea is surprisingly intuitive.
The Big Idea: Learning From Mistakes
Imagine you are learning to throw darts at a dartboard. You throw a dart, and it lands far to the left of the bullseye. What do you do? You adjust — you aim a bit more to the right on your next throw. If that throw goes too far right, you adjust back a little. Over many throws, you get closer and closer to the center.
Backpropagation works on the same principle:
- The network makes a prediction (throws a dart)
- The prediction is compared to the correct answer (how far from the bullseye?)
- The error is calculated (the distance from the target)
- The error is used to adjust the weights (aim adjustment)
- Repeat thousands of times
The clever part is how the error is propagated backward through the network to determine which weights should be adjusted, and by how much.
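The dart-throwing loop above can be sketched as a toy one-parameter example. This is purely illustrative (the variable names are made up, and a real network adjusts millions of weights, not one):

```python
# Toy "dart throwing" loop: nudge a single aim value toward a target.
target = 0.0          # the bullseye
aim = -5.0            # initial aim, far to the left
learning_rate = 0.3   # how aggressively we correct each throw

for throw in range(20):
    error = aim - target               # how far off was the throw?
    aim = aim - learning_rate * error  # adjust in the opposite direction

print(round(aim, 4))  # close to 0 after 20 throws
```

Each correction is proportional to the error, so the overshoot shrinks geometrically, just like the dart thrower homing in on the bullseye.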
Step-by-Step Walkthrough
Let me walk through the entire process with a concrete example.
Step 1: Forward Pass
Data flows through the network, and a prediction is produced.
```python
import numpy as np

# Simple 2-layer network
# Input: [0.5, 0.8]
# Hidden layer: 2 neurons
# Output: 1 neuron
np.random.seed(42)

# Initialize weights and biases
W1 = np.array([[0.3, 0.7], [0.5, 0.2]])  # Input -> Hidden
b1 = np.array([0.1, 0.1])
W2 = np.array([[0.4], [0.6]])            # Hidden -> Output
b2 = np.array([0.1])

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Forward pass
input_data = np.array([0.5, 0.8])

# Hidden layer
hidden_raw = np.dot(input_data, W1) + b1
hidden_activated = sigmoid(hidden_raw)

# Output layer
output_raw = np.dot(hidden_activated, W2) + b2
prediction = sigmoid(output_raw)

print(f"Prediction: {prediction[0]:.4f}")
# Let's say the true answer is 1.0
```
Step 2: Calculate the Loss
The loss function (also called cost function) measures how wrong the prediction is. A common choice for binary classification is binary cross-entropy, and for regression, mean squared error (MSE).
```python
true_label = 1.0
loss = (prediction[0] - true_label) ** 2  # MSE for simplicity
print(f"Loss: {loss:.4f}")
```
The loss is a single number that summarizes the network’s error. The goal of training is to minimize this number.
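To make the two loss functions mentioned above concrete, here is a small sketch with an illustrative prediction of 0.7 (the values are chosen just for demonstration):

```python
import numpy as np

prediction = 0.7   # illustrative model output
true_label = 1.0

# Mean squared error (common for regression)
mse = (prediction - true_label) ** 2

# Binary cross-entropy (common for binary classification)
bce = -(true_label * np.log(prediction)
        + (1 - true_label) * np.log(1 - prediction))

print(f"MSE: {mse:.4f}, BCE: {bce:.4f}")
```

Both are small when the prediction is close to the label; cross-entropy punishes confident wrong answers much more harshly, which is why it is preferred for classification.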
Step 3: Backward Pass (Backpropagation)
Now comes the core of the algorithm. We need to figure out: how much did each weight contribute to the error?
This is where calculus enters the picture, specifically the chain rule. The chain rule tells us how to compute the derivative of a composite function — and a neural network is nothing but a series of composed functions.
The key insight: the gradient of the loss with respect to any weight tells us how much that weight should change to reduce the error.
```python
# The chain rule in action

# Derivative of sigmoid
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Step 3a: Output layer gradients
# d(loss)/d(prediction) = 2 * (prediction - true_label)
d_loss = 2 * (prediction[0] - true_label)

# d(prediction)/d(output_raw) = sigmoid_derivative(output_raw)
d_output = d_loss * sigmoid_derivative(output_raw[0])

# d(output_raw)/d(W2) = hidden_activated
d_W2 = hidden_activated.reshape(-1, 1) * d_output

# d(output_raw)/d(b2) = 1
d_b2 = d_output

# Step 3b: Hidden layer gradients (this is the "back" propagation)
# Error flows backward through W2
d_hidden = d_output * W2.flatten() * sigmoid_derivative(hidden_raw)

# d(hidden_raw)/d(W1) = input_data
d_W1 = np.outer(input_data, d_hidden)

# d(hidden_raw)/d(b1) = 1
d_b1 = d_hidden

print(f"Gradient for W2: {d_W2.flatten()}")
print(f"Gradient for W1: {d_W1}")
```
Step 4: Update the Weights
Once we have the gradients, we update each weight in the direction that reduces the loss. This is gradient descent.
```python
learning_rate = 0.1

W2 = W2 - learning_rate * d_W2
b2 = b2 - learning_rate * d_b2
W1 = W1 - learning_rate * d_W1
b1 = b1 - learning_rate * d_b1

print("Weights updated!")
```
The learning rate controls how big each adjustment is. Too high, and the network overshoots the optimal weights. Too low, and training takes forever.
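You can see both failure modes in miniature by minimizing the simplest possible loss, f(w) = w², whose gradient is 2w (a toy sketch, not a neural network):

```python
# Gradient descent on f(w) = w**2, whose gradient is 2w.
def descend(lr, steps=50, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(descend(0.01))  # too low: still far from 0 after 50 steps
print(descend(0.4))   # reasonable: converges to ~0 quickly
print(descend(1.1))   # too high: overshoots more each step and diverges
```

With lr = 1.1 each update flips the sign of w and grows its magnitude, which is exactly the "overshooting" behavior described above.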
Step 5: Repeat
Steps 1-4 constitute a single training iteration. In practice, this process is repeated thousands or millions of times across the entire dataset.
```python
# Full training loop
for epoch in range(1000):
    # Forward pass
    hidden_raw = np.dot(input_data, W1) + b1
    hidden_activated = sigmoid(hidden_raw)
    output_raw = np.dot(hidden_activated, W2) + b2
    prediction = sigmoid(output_raw)

    # Loss
    loss = (prediction[0] - true_label) ** 2

    # Backward pass (compute gradients)
    d_loss = 2 * (prediction[0] - true_label)
    d_output = d_loss * sigmoid_derivative(output_raw[0])
    d_W2 = hidden_activated.reshape(-1, 1) * d_output
    d_b2 = d_output
    d_hidden = d_output * W2.flatten() * sigmoid_derivative(hidden_raw)
    d_W1 = np.outer(input_data, d_hidden)
    d_b1 = d_hidden

    # Update weights
    W2 -= learning_rate * d_W2
    b2 -= learning_rate * d_b2
    W1 -= learning_rate * d_W1
    b1 -= learning_rate * d_b1

    if epoch % 200 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.6f}, Prediction = {prediction[0]:.4f}")
```
You would see the loss decreasing and the prediction approaching 1.0 over the iterations.
The Chain Rule: The Mathematical Core
The chain rule is the mathematical foundation of backpropagation. It states that if you have a composite function like f(g(h(x))), the derivative is:
df/dx = df/dg * dg/dh * dh/dx
In a neural network, each layer is a function composed with the next. The chain rule lets us decompose the derivative of the loss with respect to any weight, no matter how deep in the network, into a product of local derivatives.
This is both beautiful and practical — each layer only needs to know its own local gradient and the gradient flowing in from the layer above.
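One way to convince yourself the chain rule is doing its job is a finite-difference check: compute the gradient analytically, then perturb the weight slightly and compare. Here is a quick numerical sketch for a single-weight composite function (not part of any training loop):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Composite function: loss(w) = (sigmoid(w * x) - y)**2
x, y, w = 0.5, 1.0, 0.3

def loss(w):
    return (sigmoid(w * x) - y) ** 2

# Analytic gradient via the chain rule:
# d(loss)/dw = 2*(f - y) * f*(1 - f) * x
f = sigmoid(w * x)
analytic = 2 * (f - y) * f * (1 - f) * x

# Numerical gradient via central finite differences
eps = 1e-6
numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(abs(analytic - numerical))  # agreement to many decimal places
```

This "gradient check" technique is also how hand-written backpropagation code is traditionally debugged.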
Stochastic Gradient Descent and Mini-Batches
In practice, computing the gradient over the entire dataset for every update is computationally expensive. Instead, we use Stochastic Gradient Descent (SGD) or mini-batch gradient descent:
- Batch gradient descent: Compute gradient over the entire dataset. Slow but stable.
- Stochastic gradient descent: Compute gradient on a single sample. Fast but noisy.
- Mini-batch gradient descent: Compute gradient on a small batch (e.g., 32 or 64 samples). The practical sweet spot.
```python
# Mini-batch training (schematic: network, X_train, y_train, compute_loss,
# and compute_gradients stand in for a real model and helper functions)
batch_size = 32

for epoch in range(100):
    # Shuffle the dataset
    indices = np.random.permutation(len(X_train))

    for i in range(0, len(X_train), batch_size):
        batch_idx = indices[i:i + batch_size]
        X_batch = X_train[batch_idx]
        y_batch = y_train[batch_idx]

        # Forward pass on batch
        predictions = network.forward(X_batch)

        # Compute loss and gradients on batch
        loss = compute_loss(predictions, y_batch)
        gradients = compute_gradients(loss)

        # Update weights
        network.update_weights(gradients, learning_rate)
```
Modern Optimizers: Beyond Basic Gradient Descent
Plain gradient descent works but is not the best we can do. Modern optimizers improve upon it in several ways:
Adam (Adaptive Moment Estimation) — the most popular optimizer today:
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # minimal model for illustration
criterion = nn.MSELoss()
input_data = torch.randn(32, 10)
target = torch.randn(32, 1)

optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(100):
    optimizer.zero_grad()             # Clear old gradients
    output = model(input_data)        # Forward pass
    loss = criterion(output, target)  # Compute loss
    loss.backward()                   # Backpropagation!
    optimizer.step()                  # Update weights
```
Adam adapts the learning rate for each weight individually based on the history of its gradients. Weights that consistently have large gradients get smaller updates, and weights with small gradients get larger updates.
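A stripped-down sketch of the Adam update for a single parameter shows the mechanism (default hyperparameters from the original paper; real implementations apply this elementwise to whole weight tensors):

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
lr = 0.001
m, v = 0.0, 0.0   # running estimates of gradient mean and variance
w = 1.0           # the parameter being optimized

for t in range(1, 101):
    g = 2 * w                       # gradient of f(w) = w**2
    m = beta1 * m + (1 - beta1) * g         # first moment (mean)
    v = beta2 * v + (1 - beta2) * g ** 2    # second moment (variance)
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # steadily moving toward the minimum at 0
```

Dividing by the square root of the second moment is what normalizes the step size per weight: consistently large gradients inflate `v_hat` and shrink the update.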
Common Problems in Training
Vanishing Gradients
In deep networks with sigmoid or tanh activations, gradients can become exponentially small as they propagate backward through many layers. Early layers barely learn because the gradient signal is too weak.
Solutions: Use ReLU activations, batch normalization, skip connections (as in ResNet).
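The effect is easy to see numerically: the sigmoid's derivative peaks at 0.25 (at x = 0), so even in the best case the gradient shrinks by at least a factor of 4 per layer. A simplified sketch that ignores the weights:

```python
# Upper bound on the gradient magnitude after flowing backward
# through n sigmoid layers (sigmoid derivative is at most 0.25):
for n in [1, 5, 10, 20]:
    print(n, 0.25 ** n)
```

After 20 layers the signal is below 10⁻¹², far too small to drive meaningful weight updates. ReLU avoids this because its derivative is exactly 1 for positive inputs.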
Exploding Gradients
The opposite problem — gradients become exponentially large, causing weights to swing wildly and training to diverge.
Solutions: Gradient clipping, careful weight initialization, lower learning rates.
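Gradient clipping can be sketched in a few lines of NumPy (PyTorch offers this built in as `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])               # norm 50: an "exploding" gradient
clipped = clip_by_norm(g, max_norm=5.0)  # rescaled to norm 5
print(clipped, np.linalg.norm(clipped))
```

Note that clipping preserves the gradient's direction and only caps its magnitude, so the update still points downhill.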
Overfitting
The network memorizes the training data instead of learning general patterns.
Solutions: Dropout, regularization, data augmentation, early stopping.
```python
import torch.nn as nn

# Dropout: randomly disable neurons during training
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.5),  # 50% of neurons randomly zeroed during training
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(128, 10),
)
```
Backpropagation in Modern Frameworks
The beautiful thing about modern deep learning frameworks is that you rarely need to implement backpropagation yourself. PyTorch and TensorFlow handle it automatically through automatic differentiation (autograd).
```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training step
x = torch.randn(32, 10)
y = torch.randn(32, 1)

prediction = model(x)            # Forward pass
loss = criterion(prediction, y)  # Compute loss
loss.backward()                  # Automatic backpropagation!
optimizer.step()                 # Update weights
optimizer.zero_grad()            # Clear gradients for next step
```
That loss.backward() line does all the heavy lifting. PyTorch builds a computational graph during the forward pass and then traverses it backward to compute all gradients automatically. You never need to derive a single gradient by hand.
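A minimal illustration of autograd at work: for y = x², the derivative is 2x, and PyTorch recovers it with no derivation on our part.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2      # forward pass records the computational graph
y.backward()    # traverse the graph backward

print(x.grad)   # dy/dx = 2x evaluated at x = 3, i.e. 6
```

The same mechanism scales to millions of parameters: every tensor operation records its local derivative, and `backward()` multiplies them together via the chain rule.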
The Intuition to Remember
Here is the mental model I want you to walk away with:
Backpropagation answers the question: “How much is each weight responsible for the error?”
It does this by flowing the error signal backward through the network, layer by layer, using the chain rule to calculate each weight’s contribution. Weights that contributed heavily to the error get adjusted more. Weights that had little effect get adjusted less.
Over thousands of iterations, this process sculpts the weights into a configuration that transforms the input data into accurate predictions. That is how a neural network learns.
This is part of my Deep Learning series. Next up: Convolutional Neural Networks (CNNs) — the architecture that conquered computer vision.