In my previous post, I introduced neural networks as computational systems inspired by the brain. Now it is time to zoom in on their architecture — specifically, the three fundamental types of layers that make up every neural network: input layers, hidden layers, and output layers.
Understanding these layers is essential because every neural network architecture — whether it is a simple classifier or a massive language model — is built from these same building blocks. The differences between architectures come down to how these layers are arranged, connected, and configured.
The Big Picture: Data In, Prediction Out
Before we dive into each layer type, let me give you the high-level view. A neural network is essentially a pipeline:
- Raw data enters through the input layer
- The data is transformed through one or more hidden layers
- A prediction emerges from the output layer
Each layer takes the output of the previous layer, applies a mathematical transformation, and passes the result forward. This sequential flow is called forward propagation.
Think of it like a factory assembly line. Raw materials (data) enter at one end, go through multiple processing stations (hidden layers), and a finished product (prediction) comes out at the other end. Each station adds value by transforming the materials in a specific way.
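The pipeline above can be sketched in a few lines of NumPy. This is a toy illustration, not a trained model: the layer sizes and random weights are made up, and the hidden activation is assumed to be ReLU.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Toy network: 3 inputs -> 4 hidden neurons -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

x = np.array([0.5, -1.2, 3.0])   # raw data enters the input layer
h = relu(x @ W1 + b1)            # hidden layer transforms it
y = h @ W2 + b2                  # output layer produces the prediction
print(y.shape)  # (2,)
```

Each line of the forward pass is one "station" in the assembly line: a matrix multiply, a bias, and (for hidden layers) an activation.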
The Input Layer: Where Data Enters
The input layer is the simplest layer conceptually. Its job is to receive the raw data and pass it to the first hidden layer. It does not perform any computation — it simply acts as a gateway.
The number of neurons in the input layer is determined by the shape of your data:
- If you are classifying a 28x28 pixel grayscale image (like the MNIST handwritten digits dataset), your input layer has 784 neurons (28 x 28 = 784 pixels, each flattened into a single value)
- If you are predicting house prices based on 5 features (square footage, bedrooms, bathrooms, age, location score), your input layer has 5 neurons
- If you are processing a color image of size 224x224 (like in ImageNet), your input layer has 150,528 neurons (224 x 224 x 3 color channels)
```python
import torch
import torch.nn as nn

# Example: MNIST digit classification
# Input: 784 pixels (28x28 flattened)
# Hidden: 128 neurons
# Output: 10 classes (digits 0-9)
class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)  # Input -> Hidden
        self.layer2 = nn.Linear(128, 64)   # Hidden -> Hidden
        self.layer3 = nn.Linear(64, 10)    # Hidden -> Output

    def forward(self, x):
        x = x.view(-1, 784)  # Flatten the 28x28 image
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)
        return x
```
Common Input Preprocessing
Before data reaches the input layer, it typically goes through preprocessing:
- Normalization: Scaling values to a range like [0, 1] or [-1, 1]. This helps the network train faster and more stably.
- Encoding: Converting categorical data (like “red”, “blue”, “green”) into numerical representations (one-hot encoding or embeddings).
- Flattening: Converting multi-dimensional data (like images) into a single vector.
```python
import numpy as np

# Normalize pixel values from [0, 255] to [0, 1]
raw_image = np.array([128, 255, 0, 64, 192])
normalized = raw_image / 255.0
print(normalized)
# approximately [0.502 1.0 0.0 0.251 0.753]

# One-hot encode categories
categories = {"cat": [1, 0, 0], "dog": [0, 1, 0], "bird": [0, 0, 1]}
print(categories["dog"])
# [0, 1, 0]
```
The key takeaway: the input layer does not learn anything. Its size is fixed by your data, and its only job is to pass that data forward.
Hidden Layers: Where the Magic Happens
Hidden layers are where the actual learning and transformation occur. They are called “hidden” because their values are not directly observable from the outside — you see the inputs and the outputs, but the internal representations are hidden.
Each neuron in a hidden layer:
- Receives inputs from all neurons in the previous layer (in a fully connected network)
- Multiplies each input by a learned weight
- Adds a bias term
- Applies an activation function
- Sends the result to all neurons in the next layer
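The five steps above reduce to a weighted sum, a bias, and an activation. Here is a sketch for a single neuron; the inputs and weights are invented for illustration, and ReLU is assumed as the activation:

```python
import numpy as np

def single_neuron(inputs, weights, bias):
    # Steps 1-3: multiply each input by its weight, sum, add bias
    z = np.dot(inputs, weights) + bias
    # Step 4: apply the activation function (ReLU here)
    return max(0.0, z)

inputs = np.array([0.5, -0.3, 0.8])   # outputs of the previous layer
weights = np.array([0.4, 0.7, 0.2])   # learned weights
bias = 0.1
print(single_neuron(inputs, weights, bias))  # approximately 0.25
```

A hidden layer is just many of these neurons running in parallel, each with its own weights and bias, which is why the whole layer can be written as a single matrix multiplication.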
What Do Hidden Layers Learn?
The fascinating thing about hidden layers is that they automatically learn to extract features from the data. The features become increasingly abstract as you go deeper into the network.
For an image classification network:
- Layer 1 might learn to detect edges — horizontal lines, vertical lines, diagonal lines
- Layer 2 might combine edges into textures and simple shapes — corners, curves, circles
- Layer 3 might combine shapes into parts — eyes, ears, wheels, windows
- Layer 4 might combine parts into objects — faces, cars, animals
This hierarchical feature learning is what makes deep networks so powerful. You do not need to tell the network what features to look for — it discovers them on its own during training.
How Many Hidden Layers?
This is one of the most common questions in neural network design, and the honest answer is: it depends.
Here are some general guidelines:
| Problem Type | Typical Hidden Layers | Example |
|---|---|---|
| Simple linear problems | 0-1 | Basic regression |
| Moderate complexity | 1-3 | Tabular data classification |
| Image recognition | 5-150+ | ResNet, VGG |
| Natural language processing | 12-96+ | BERT, GPT |
More layers allow the network to learn more complex patterns, but they also:
- Require more data to train effectively
- Take longer to train
- Are more prone to overfitting (memorizing training data instead of learning general patterns)
- Can suffer from the vanishing gradient problem (gradients become too small in early layers to enable learning)
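The vanishing gradient problem is easy to see numerically. Backpropagation multiplies layer-by-layer derivatives together, and the sigmoid's derivative never exceeds 0.25; this sketch assumes that worst-case value at every layer to show how quickly the product shrinks:

```python
# Sigmoid's derivative peaks at 0.25. Chained through many layers,
# the gradient signal reaching early layers shrinks exponentially.
max_sigmoid_grad = 0.25
for depth in [1, 5, 10, 20]:
    print(f"{depth:2d} layers -> gradient factor {max_sigmoid_grad ** depth:.2e}")
```

At 20 layers the factor is below 1e-12, which is why modern deep networks lean on ReLU activations, careful initialization, and skip connections to keep gradients alive.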
How Many Neurons Per Layer?
Another critical design decision is the width of each layer — how many neurons it contains. Common patterns include:
Pyramid (decreasing width):
# Gradually compress the representation
model = nn.Sequential(
nn.Linear(784, 512),
nn.ReLU(),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, 10),
)
This is like a funnel that progressively distills the most important information from the data.
Constant width:
# Keep the same width throughout
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, 10),
)
Bottleneck (compress then expand):
# Used in autoencoders
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 32), # Bottleneck - compressed representation
nn.ReLU(),
nn.Linear(32, 256),
nn.ReLU(),
nn.Linear(256, 784), # Reconstruct original input
)
The bottleneck architecture is particularly interesting because it forces the network to learn a compressed representation of the data — capturing only the most essential information in the narrow middle layer.
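To make the compression concrete, the bottleneck model can be split into an encoder and a decoder; the 32-value code in the middle is the compressed representation. This is a sketch with untrained weights, so the reconstruction is meaningless until the model is trained:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.ReLU(), nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.rand(1, 784)           # a fake flattened 28x28 image
code = encoder(x)                # 784 values squeezed down to 32
reconstruction = decoder(code)   # attempt to rebuild the input from the code
print(code.shape, reconstruction.shape)
```

After training, the 32-dimensional code alone is enough to approximately reconstruct the 784-pixel input, which is exactly what "capturing only the most essential information" means.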
Fully Connected vs. Other Connection Patterns
In a fully connected (also called “dense”) layer, every neuron is connected to every neuron in the adjacent layers. But this is not the only option:
- Convolutional layers connect each neuron to only a small local region of the input (used in image processing)
- Recurrent connections allow neurons to connect back to themselves or previous time steps (used in sequence processing)
- Attention mechanisms allow neurons to dynamically select which inputs to focus on (used in transformers)
Each connection pattern is optimized for different types of data and problems. I will cover these specialized architectures in dedicated posts.
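One way to feel the difference between these patterns is parameter count. A fully connected layer needs a separate weight for every input-to-neuron pair, while a convolutional layer reuses one small filter across the whole image. The layer sizes here are chosen just for illustration:

```python
import torch.nn as nn

dense = nn.Linear(28 * 28, 32)          # 784 inputs, 32 neurons, all connected
conv = nn.Conv2d(1, 32, kernel_size=3)  # 32 filters of size 3x3, shared everywhere

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(dense))  # 784*32 + 32 = 25120
print(count(conv))   # 32*(3*3*1) + 32 = 320
```

Same number of output channels, nearly 80x fewer parameters, because the convolutional filter's weights are shared across every position in the image.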
The Output Layer: Where Predictions Are Made
The output layer produces the final result of the network. Its design depends entirely on what kind of problem you are solving.
Binary Classification (Yes/No)
For binary classification, the output layer has 1 neuron with a sigmoid activation function that outputs a value between 0 and 1 (interpreted as a probability).
```python
class BinaryClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(10, 32)
        self.output = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        x = torch.sigmoid(self.output(x))  # Output between 0 and 1
        return x
```
If the output is 0.87, the model is 87% confident the answer is “yes.”
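Turning that probability into a hard yes/no decision is just a threshold; 0.5 is the conventional default, though it can be tuned when false positives and false negatives have different costs:

```python
import torch

probability = torch.tensor([0.87])      # sigmoid output from the model
prediction = (probability > 0.5).int()  # 1 = "yes", 0 = "no"
print(prediction.item())  # 1
```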
Multi-Class Classification
For multi-class classification (e.g., classifying an image as cat, dog, or bird), the output layer has one neuron per class with a softmax activation function. Softmax ensures all outputs sum to 1, so they can be interpreted as probabilities.
```python
class MultiClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(784, 128)
        self.output = nn.Linear(128, 10)  # 10 classes

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        # Note: if training with nn.CrossEntropyLoss, return the raw
        # logits instead - that loss applies softmax internally
        x = torch.softmax(self.output(x), dim=1)
        return x

# Output might look like:
# [0.01, 0.02, 0.05, 0.80, 0.01, 0.03, 0.02, 0.04, 0.01, 0.01]
# Highest probability at index 3 -> prediction is class 3
```
Regression (Predicting a Number)
For regression (e.g., predicting house prices), the output layer has 1 neuron with no activation function (or a linear activation), allowing it to output any real number.
```python
class PricePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(5, 64)
        self.output = nn.Linear(64, 1)  # Single output, no activation

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        x = self.output(x)  # Raw output, can be any value
        return x
```
Summary Table
| Task | Output Neurons | Activation | Output Range |
|---|---|---|---|
| Binary classification | 1 | Sigmoid | [0, 1] |
| Multi-class (N classes) | N | Softmax | [0, 1] each, sum = 1 |
| Regression | 1 (or more) | None/Linear | (-inf, +inf) |
| Multi-label | N | Sigmoid (each) | [0, 1] each, independent |
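Each output design pairs with a matching loss function during training. In PyTorch, `nn.CrossEntropyLoss` and `nn.BCEWithLogitsLoss` expect raw outputs (logits) and apply softmax/sigmoid internally, which is more numerically stable than applying the activation yourself. A quick sketch with random tensors standing in for real model outputs:

```python
import torch
import torch.nn as nn

# Multi-class: CrossEntropyLoss takes raw logits and class indices
logits = torch.randn(4, 10)           # raw outputs for 4 samples, 10 classes
targets = torch.tensor([3, 1, 0, 9])  # true class indices
ce = nn.CrossEntropyLoss()(logits, targets)

# Binary: BCEWithLogitsLoss applies the sigmoid internally
binary_logits = torch.randn(4, 1)
binary_targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
bce = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)

# Regression: mean squared error on the raw outputs
preds, values = torch.randn(4, 1), torch.randn(4, 1)
mse = nn.MSELoss()(preds, values)

print(ce.item(), bce.item(), mse.item())
```

All three losses are non-negative scalars; training consists of nudging the weights to push the appropriate one toward zero.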
Putting It All Together
Let me show a complete example that classifies iris flowers into three species based on four measurements:
```python
import torch
import torch.nn as nn

class IrisClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Input layer: 4 features (sepal length, sepal width,
        #              petal length, petal width)
        # Hidden layer 1: 16 neurons
        # Hidden layer 2: 8 neurons
        # Output layer: 3 classes (setosa, versicolor, virginica)
        self.network = nn.Sequential(
            nn.Linear(4, 16),  # Input -> Hidden 1
            nn.ReLU(),
            nn.Linear(16, 8),  # Hidden 1 -> Hidden 2
            nn.ReLU(),
            nn.Linear(8, 3),   # Hidden 2 -> Output
        )

    def forward(self, x):
        return self.network(x)

model = IrisClassifier()

# Simulated input: one flower's measurements
sample = torch.tensor([5.1, 3.5, 1.4, 0.2])
output = model(sample)
predicted_class = torch.argmax(output).item()

species = ["setosa", "versicolor", "virginica"]
print(f"Predicted species: {species[predicted_class]}")
```
This simple network has:
- Input layer: 4 neurons (one per feature)
- Hidden layer 1: 16 neurons with ReLU activation
- Hidden layer 2: 8 neurons with ReLU activation
- Output layer: 3 neurons (one per species)
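You can count the learnable parameters to see how small this network is. Each `nn.Linear` holds inputs × outputs weights plus one bias per output, so the total is 4×16+16 + 16×8+8 + 8×3+3 = 243:

```python
import torch.nn as nn

network = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 3),
)
total = sum(p.numel() for p in network.parameters())
print(total)  # 243
```

For comparison, the MNIST classifier earlier in this post has over 100,000 parameters, and modern language models have billions; the building blocks are the same, only the scale changes.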
Key Takeaways
- The input layer is determined by your data shape — you do not choose its size
- Hidden layers are where learning happens — their depth and width are design decisions that affect model capacity
- The output layer is determined by your task — classification vs. regression, number of classes, etc.
- Data flows forward through the network (forward propagation), and learning happens by flowing errors backward (backpropagation)
- Deeper networks can learn more complex patterns but require more data and careful training
In the next post, I will explore activation functions — the mathematical functions that give neural networks their ability to learn nonlinear patterns. Without them, even a hundred-layer network would be no more powerful than a single layer.
This post is part of my Deep Learning series. Start with “What is a Neural Network?” if you have not read it yet.