What Is a Transformer?

A beginner-friendly explanation of the Transformer architecture that powers modern AI — from attention mechanisms to why it changed everything.

February 20, 2026 · 3 min read
Prerequisites: Basic linear algebra, Neural network fundamentals

The Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need", fundamentally changed how we build AI systems. Every major language model today — GPT, Claude, Gemini, LLaMA — is built on Transformers.

Why Transformers Matter

Before Transformers, the dominant approach for processing sequences (like text) was Recurrent Neural Networks (RNNs). RNNs process tokens one at a time, left to right. This sequential nature made them:

  • Slow to train — you can't parallelize sequential processing
  • Forgetful — information from early tokens fades over long sequences
  • Difficult to scale — wall-clock training time grows with sequence length, since each step must wait for the previous one

Transformers solved all three problems with a single key innovation: self-attention.

The Self-Attention Mechanism

Self-attention allows every token in a sequence to directly attend to every other token, computing relationships in parallel rather than sequentially.

The core idea is surprisingly simple. For each token in the input:

  1. Create three vectors: Query (Q), Key (K), and Value (V)
  2. Compute attention scores by comparing the Query against all Keys
  3. Use the scores to create a weighted sum of all Values

Mathematically, self-attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large, which would otherwise push the softmax into regions with near-zero gradients.
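As a minimal sketch, the formula translates almost line for line into NumPy (the function name and toy sizes here are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # compare each Query against all Keys, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over keys (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted sum of all Values
    return weights @ V

# toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Note that the output has the same shape as the input: each token's output is just a weighted blend of all the Value vectors.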

Multi-Head Attention

Rather than computing attention once, Transformers use multi-head attention — running several attention computations in parallel, each with different learned projections:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
 
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
 
    def forward(self, x):
        batch_size, seq_len, d_model = x.shape
 
        # Project and reshape for multi-head
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
 
        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attn = torch.softmax(scores, dim=-1)
        out = attn @ V
 
        # Concatenate heads and project
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return self.W_o(out)
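For comparison, PyTorch ships this same building block as `torch.nn.MultiheadAttention`. A quick shape check (sizes here are arbitrary; `batch_first=True` makes inputs `(batch, seq, d_model)`):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)  # (batch, seq_len, d_model)
# self-attention: the same tensor serves as query, key, and value
out, attn_weights = mha(x, x, x)
print(out.shape)           # torch.Size([2, 10, 64])
print(attn_weights.shape)  # (batch, seq_len, seq_len), averaged over heads
```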

Each head can learn to attend to different types of relationships — one might focus on syntax, another on semantics, another on positional patterns.

The Full Architecture

A Transformer block combines:

  1. Multi-head self-attention — captures relationships between tokens
  2. Feed-forward network — processes each token independently
  3. Layer normalization — stabilizes training
  4. Residual connections — helps gradients flow through deep networks

The original Transformer uses an encoder-decoder structure. Modern LLMs like GPT use only the decoder (with causal masking), while models like BERT use only the encoder.
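The four components above can be sketched as a single block. This is a pre-norm variant (normalization before each sublayer, common in modern LLMs) built on PyTorch's own attention module; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # feed-forward network applied to each token independently
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # self-attention sublayer with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # feed-forward sublayer with residual connection
        x = x + self.ff(self.norm2(x))
        return x

block = TransformerBlock(d_model=64, num_heads=8, d_ff=256)
x = torch.randn(2, 10, 64)
print(block(x).shape)  # torch.Size([2, 10, 64])
```

A decoder-only model like GPT would stack many of these blocks and pass a causal mask to the attention call so each token can only attend to earlier positions.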

Why It Changed Everything

The Transformer's parallel computation means:

  • Training is massively parallelizable on GPUs
  • Long-range dependencies are captured equally well regardless of distance
  • Scaling works — more parameters and data consistently improve performance

This scalability is what enabled the jump from models with millions of parameters to models with hundreds of billions — and the emergent capabilities that come with that scale.

Key Takeaways

  • Transformers replaced sequential processing (RNNs) with parallel self-attention
  • The attention mechanism lets every token directly relate to every other token
  • Multi-head attention captures diverse relationship types simultaneously
  • The architecture's parallelism enabled the scaling revolution behind modern AI

References

  1. Vaswani et al., "Attention Is All You Need" (2017) — arXiv:1706.03762
  2. Jay Alammar, "The Illustrated Transformer" — a visual walkthrough of the architecture
