In my previous post, I introduced neural networks as computational systems inspired by the brain. Now it is time to zoom in on their architecture — specifically, the three fundamental types of layers that make up every neural network: input layers, hidden layers, and output layers.
Understanding these layers is essential because every neural network architecture — whether it is a simple classifier or a massive language model — is built from these same building blocks. The differences between architectures come down to how these layers are arranged, connected, and configured.
The Big Picture: Data In, Prediction Out
Before we dive into each layer type, let me give you the high-level view. A neural network is essentially a pipeline:
- Raw data enters through the input layer
- The data is transformed through one or more hidden layers
- A prediction emerges from the output layer
Each layer takes the output of the previous layer, applies a mathematical transformation, and passes the result forward. This sequential flow is called forward propagation.
Think of it like a factory assembly line. Raw materials (data) enter at one end, go through multiple processing stations (hidden layers), and a finished product (prediction) comes out at the other end. Each station adds value by transforming the materials in a specific way.
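The pipeline above can be sketched in a few lines of NumPy. This is a toy illustration, not a trained model: the layer sizes and random weights are made up, and the hidden activation is assumed to be ReLU.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Toy network: 3 inputs -> 4 hidden neurons -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

x = np.array([0.5, -1.2, 3.0])   # raw data enters the input layer
h = relu(x @ W1 + b1)            # hidden layer transforms it
y = h @ W2 + b2                  # output layer produces the prediction
print(y.shape)  # (2,)
```

Each line of the forward pass is one "station" in the assembly line: a matrix multiply, a bias, and (for hidden layers) an activation.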
The Input Layer: Where Data Enters
The input layer is the simplest layer conceptually. Its job is to receive the raw data and pass it to the first hidden layer. It does not perform any computation — it simply acts as a gateway.
The number of neurons in the input layer is determined by the shape of your data:
- If you are classifying a 28x28 pixel grayscale image (like the MNIST handwritten digits dataset), your input layer has 784 neurons (28 x 28 = 784 pixels, each flattened into a single value)
- If you are predicting house prices based on 5 features (square footage, bedrooms, bathrooms, age, location score), your input layer has 5 neurons
- If you are processing a color image of size 224x224 (like in ImageNet), your input layer has 150,528 neurons (224 x 224 x 3 color channels)
```python
import torch
import torch.nn as nn

# Example: MNIST digit classification
# Input: 784 pixels (28x28 flattened)
# Hidden: 128 neurons
# Output: 10 classes (digits 0-9)
class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)  # Input -> Hidden
        self.layer2 = nn.Linear(128, 64)   # Hidden -> Hidden
        self.layer3 = nn.Linear(64, 10)    # Hidden -> Output

    def forward(self, x):
        x = x.view(-1, 784)  # Flatten the 28x28 image
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)
        return x
```
Common Input Preprocessing
Before data reaches the input layer, it typically goes through preprocessing:
- Normalization: Scaling values to a range like [0, 1] or [-1, 1]. This helps the network train faster and more stably.
- Encoding: Converting categorical data (like “red”, “blue”, “green”) into numerical representations (one-hot encoding or embeddings).
- Flattening: Converting multi-dimensional data (like images) into a single vector.
```python
import numpy as np

# Normalize pixel values from [0, 255] to [0, 1]
raw_image = np.array([128, 255, 0, 64, 192])
normalized = raw_image / 255.0
print(normalized)
# approximately [0.502 1.0 0.0 0.251 0.753]

# One-hot encode categories
categories = {"cat": [1, 0, 0], "dog": [0, 1, 0], "bird": [0, 0, 1]}
print(categories["dog"])
# [0, 1, 0]
```
The key takeaway: the input layer does not learn anything. Its size is fixed by your data, and its only job is to pass that data forward.
Hidden Layers: Where the Magic Happens
Hidden layers are where the actual learning and transformation occur. They are called “hidden” because their values are not directly observable from the outside — you see the inputs and the outputs, but the internal representations are hidden.
Each neuron in a hidden layer:
- Receives inputs from all neurons in the previous layer (in a fully connected network)
- Multiplies each input by a learned weight
- Adds a bias term
- Applies an activation function
- Sends the result to all neurons in the next layer
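The five steps above reduce to a weighted sum, a bias, and an activation. Here is a sketch for a single neuron; the inputs and weights are invented for illustration, and ReLU is assumed as the activation:

```python
import numpy as np

def single_neuron(inputs, weights, bias):
    # Steps 1-3: multiply each input by its weight, sum, add bias
    z = np.dot(inputs, weights) + bias
    # Step 4: apply the activation function (ReLU here)
    return max(0.0, z)

inputs = np.array([0.5, -0.3, 0.8])   # outputs of the previous layer
weights = np.array([0.4, 0.7, 0.2])   # learned weights
bias = 0.1
print(single_neuron(inputs, weights, bias))  # approximately 0.25
```

A hidden layer is just many of these neurons running in parallel, each with its own weights and bias, which is why the whole layer can be written as a single matrix multiplication.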
What Do Hidden Layers Learn?
The fascinating thing about hidden layers is that they automatically learn to extract features from the data. The features become increasingly abstract as you go deeper into the network.
For an image classification network:
- Layer 1 might learn to detect edges — horizontal lines, vertical lines, diagonal lines
- Layer 2 might combine edges into textures and simple shapes — corners, curves, circles
- Layer 3 might combine shapes into parts — eyes, ears, wheels, windows
- Layer 4 might combine parts into objects — faces, cars, animals
This hierarchical feature learning is what makes deep networks so powerful. You do not need to tell the network what features to look for — it discovers them on its own during training.
How Many Hidden Layers?
This is one of the most common questions in neural network design, and the honest answer is: it depends.
Here are some general guidelines:
| Problem Type | Typical Hidden Layers | Example |
|---|---|---|
| Simple linear problems | 0-1 | Basic regression |
| Moderate complexity | 1-3 | Tabular data classification |
| Image recognition | 5-150+ | ResNet, VGG |
| Natural language processing | 12-96+ | BERT, GPT |
More layers allow the network to learn more complex patterns, but they also:
- Require more data to train effectively
- Take longer to train
- Are more prone to overfitting (memorizing training data instead of learning general patterns)
- Can suffer from the vanishing gradient problem (gradients become too small in early layers to enable learning)
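The vanishing gradient problem is easy to see numerically. Backpropagation multiplies layer-by-layer derivatives together, and the sigmoid's derivative never exceeds 0.25; this sketch assumes that worst-case value at every layer to show how quickly the product shrinks:

```python
# Sigmoid's derivative peaks at 0.25. Chained through many layers,
# the gradient signal reaching early layers shrinks exponentially.
max_sigmoid_grad = 0.25
for depth in [1, 5, 10, 20]:
    print(f"{depth:2d} layers -> gradient factor {max_sigmoid_grad ** depth:.2e}")
```

At 20 layers the factor is below 1e-12, which is why modern deep networks lean on ReLU activations, careful initialization, and skip connections to keep gradients alive.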
How Many Neurons Per Layer?
Another critical design decision is the width of each layer — how many neurons it contains. Common patterns include:
Pyramid (decreasing width):
# Gradually compress the representation
model = nn.Sequential(
nn.Linear(784, 512),
nn.ReLU(),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, 10),
)
This is like a funnel that progressively distills the most important information from the data.
Constant width:
# Keep the same width throughout
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, 10),
)
Bottleneck (compress then expand):
# Used in autoencoders
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 32), # Bottleneck - compressed representation
nn.ReLU(),
nn.Linear(32, 256),
nn.ReLU(),
nn.Linear(256, 784), # Reconstruct original input
)
The bottleneck architecture is particularly interesting because it forces the network to learn a compressed representation of the data — capturing only the most essential information in the narrow middle layer.
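To make the compression concrete, the bottleneck model can be split into an encoder and a decoder; the 32-value code in the middle is the compressed representation. This is a sketch with untrained weights, so the reconstruction is meaningless until the model is trained:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.ReLU(), nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.rand(1, 784)           # a fake flattened 28x28 image
code = encoder(x)                # 784 values squeezed down to 32
reconstruction = decoder(code)   # attempt to rebuild the input from the code
print(code.shape, reconstruction.shape)
```

After training, the 32-dimensional code alone is enough to approximately reconstruct the 784-pixel input, which is exactly what "capturing only the most essential information" means.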
Fully Connected vs. Other Connection Patterns
In a fully connected (also called “dense”) layer, every neuron is connected to every neuron in the adjacent layers. But this is not the only option:
- Convolutional layers connect each neuron to only a small local region of the input (used in image processing)
- Recurrent connections allow neurons to connect back to themselves or previous time steps (used in sequence processing)
- Attention mechanisms allow neurons to dynamically select which inputs to focus on (used in transformers)
Each connection pattern is optimized for different types of data and problems. I will cover these specialized architectures in dedicated posts.
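One way to feel the difference between these patterns is parameter count. A fully connected layer needs a separate weight for every input-to-neuron pair, while a convolutional layer reuses one small filter across the whole image. The layer sizes here are chosen just for illustration:

```python
import torch.nn as nn

dense = nn.Linear(28 * 28, 32)          # 784 inputs, 32 neurons, all connected
conv = nn.Conv2d(1, 32, kernel_size=3)  # 32 filters of size 3x3, shared everywhere

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(dense))  # 784*32 + 32 = 25120
print(count(conv))   # 32*(3*3*1) + 32 = 320
```

Same number of output channels, nearly 80x fewer parameters, because the convolutional filter's weights are shared across every position in the image.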
The Output Layer: Where Predictions Are Made
The output layer produces the final result of the network. Its design depends entirely on what kind of problem you are solving.
Binary Classification (Yes/No)
For binary classification, the output layer has 1 neuron with a sigmoid activation function that outputs a value between 0 and 1 (interpreted as a probability).
```python
class BinaryClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(10, 32)
        self.output = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        x = torch.sigmoid(self.output(x))  # Output between 0 and 1
        return x
```
If the output is 0.87, the model is 87% confident the answer is “yes.”
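Turning that probability into a hard yes/no decision is just a threshold; 0.5 is the conventional default, though it can be tuned when false positives and false negatives have different costs:

```python
import torch

probability = torch.tensor([0.87])      # sigmoid output from the model
prediction = (probability > 0.5).int()  # 1 = "yes", 0 = "no"
print(prediction.item())  # 1
```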
Multi-Class Classification
For multi-class classification (e.g., classifying an image as cat, dog, or bird), the output layer has one neuron per class with a softmax activation function. Softmax ensures all outputs sum to 1, so they can be interpreted as probabilities.
```python
class MultiClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(784, 128)
        self.output = nn.Linear(128, 10)  # 10 classes

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        # Note: if training with nn.CrossEntropyLoss, return the raw
        # logits instead - that loss applies softmax internally
        x = torch.softmax(self.output(x), dim=1)
        return x

# Output might look like:
# [0.01, 0.02, 0.05, 0.80, 0.01, 0.03, 0.02, 0.04, 0.01, 0.01]
# Highest probability at index 3 -> prediction is class 3
```
Regression (Predicting a Number)
For regression (e.g., predicting house prices), the output layer has 1 neuron with no activation function (or a linear activation), allowing it to output any real number.
```python
class PricePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(5, 64)
        self.output = nn.Linear(64, 1)  # Single output, no activation

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        x = self.output(x)  # Raw output, can be any value
        return x
```
Summary Table
| Task | Output Neurons | Activation | Output Range |
|---|---|---|---|
| Binary classification | 1 | Sigmoid | [0, 1] |
| Multi-class (N classes) | N | Softmax | [0, 1] each, sum = 1 |
| Regression | 1 (or more) | None/Linear | (-inf, +inf) |
| Multi-label | N | Sigmoid (each) | [0, 1] each, independent |
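Each output design pairs with a matching loss function during training. In PyTorch, `nn.CrossEntropyLoss` and `nn.BCEWithLogitsLoss` expect raw outputs (logits) and apply softmax/sigmoid internally, which is more numerically stable than applying the activation yourself. A quick sketch with random tensors standing in for real model outputs:

```python
import torch
import torch.nn as nn

# Multi-class: CrossEntropyLoss takes raw logits and class indices
logits = torch.randn(4, 10)           # raw outputs for 4 samples, 10 classes
targets = torch.tensor([3, 1, 0, 9])  # true class indices
ce = nn.CrossEntropyLoss()(logits, targets)

# Binary: BCEWithLogitsLoss applies the sigmoid internally
binary_logits = torch.randn(4, 1)
binary_targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
bce = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)

# Regression: mean squared error on the raw outputs
preds, values = torch.randn(4, 1), torch.randn(4, 1)
mse = nn.MSELoss()(preds, values)

print(ce.item(), bce.item(), mse.item())
```

All three losses are non-negative scalars; training consists of nudging the weights to push the appropriate one toward zero.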
Putting It All Together
Let me show a complete example that classifies iris flowers into three species based on four measurements:
```python
import torch
import torch.nn as nn

class IrisClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Input layer: 4 features (sepal length, sepal width,
        #              petal length, petal width)
        # Hidden layer 1: 16 neurons
        # Hidden layer 2: 8 neurons
        # Output layer: 3 classes (setosa, versicolor, virginica)
        self.network = nn.Sequential(
            nn.Linear(4, 16),  # Input -> Hidden 1
            nn.ReLU(),
            nn.Linear(16, 8),  # Hidden 1 -> Hidden 2
            nn.ReLU(),
            nn.Linear(8, 3),   # Hidden 2 -> Output
        )

    def forward(self, x):
        return self.network(x)

model = IrisClassifier()

# Simulated input: one flower's measurements
sample = torch.tensor([5.1, 3.5, 1.4, 0.2])
output = model(sample)
predicted_class = torch.argmax(output).item()

species = ["setosa", "versicolor", "virginica"]
print(f"Predicted species: {species[predicted_class]}")
```
This simple network has:
- Input layer: 4 neurons (one per feature)
- Hidden layer 1: 16 neurons with ReLU activation
- Hidden layer 2: 8 neurons with ReLU activation
- Output layer: 3 neurons (one per species)
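You can count the learnable parameters to see how small this network is. Each `nn.Linear` holds inputs × outputs weights plus one bias per output, so the total is 4×16+16 + 16×8+8 + 8×3+3 = 243:

```python
import torch.nn as nn

network = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 3),
)
total = sum(p.numel() for p in network.parameters())
print(total)  # 243
```

For comparison, the MNIST classifier earlier in this post has over 100,000 parameters, and modern language models have billions; the building blocks are the same, only the scale changes.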
Key Takeaways
- The input layer is determined by your data shape — you do not choose its size
- Hidden layers are where learning happens — their depth and width are design decisions that affect model capacity
- The output layer is determined by your task — classification vs. regression, number of classes, etc.
- Data flows forward through the network (forward propagation), and learning happens by flowing errors backward (backpropagation)
- Deeper networks can learn more complex patterns but require more data and careful training
In the next post, I will explore activation functions — the mathematical functions that give neural networks their ability to learn nonlinear patterns. Without them, even a hundred-layer network would be no more powerful than a single layer.
This post is part of my Deep Learning series. Start with “What is a Neural Network?” if you have not read it yet.