What Is a Transformer?
A beginner-friendly explanation of the Transformer architecture that powers modern AI — from attention mechanisms to why it changed everything.
The Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need", fundamentally changed how we build AI systems. Every major language model today — GPT, Claude, Gemini, LLaMA — is built on Transformers.
Why Transformers Matter
Before Transformers, the dominant approach for processing sequences (like text) was Recurrent Neural Networks (RNNs). RNNs process tokens one at a time, left to right. This sequential nature made them:
- Slow to train — you can't parallelize sequential processing
- Forgetful — information from early tokens fades over long sequences
- Difficult to scale — gradients must flow back through every time step, making long sequences slow and unstable to train
Transformers solved all three problems with a single key innovation: self-attention.
The Self-Attention Mechanism
Self-attention allows every token in a sequence to directly attend to every other token, computing relationships in parallel rather than sequentially.
The core idea is surprisingly simple. For each token in the input:
- Create three vectors: Query (Q), Key (K), and Value (V)
- Compute attention scores by comparing the Query against all Keys
- Use the scores to create a weighted sum of all Values
Mathematically, self-attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where $d_k$ is the dimension of the key vectors. The scaling by $\sqrt{d_k}$ prevents the dot products from growing too large, which would push the softmax into regions with near-zero gradients.
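The three steps above can be sketched as a single function. This is a minimal single-head version (the tensor sizes are illustrative, not from the paper):

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    # Step 2: compare each Query against all Keys, scaled by sqrt(d_k)
    d_k = K.shape[-1]
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    # Softmax turns scores into attention weights that sum to 1 per token
    weights = torch.softmax(scores, dim=-1)
    # Step 3: weighted sum of all Values
    return weights @ V

x = torch.randn(4, 8)                       # 4 tokens, 8-dim embeddings
out = scaled_dot_product_attention(x, x, x) # self-attention: Q = K = V = x
print(out.shape)                            # torch.Size([4, 8])
```

Each output row is a mixture of all input rows, weighted by how strongly that token attends to each of the others.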
Multi-Head Attention
Rather than computing attention once, Transformers use multi-head attention — running several attention computations in parallel, each with different learned projections:
```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, d_model = x.shape
        # Project and reshape to (batch, heads, seq_len, d_k)
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, computed for all heads at once
        scores = (Q @ K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attn = torch.softmax(scores, dim=-1)
        out = attn @ V
        # Concatenate heads and apply the output projection
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return self.W_o(out)
```

Each head can learn to attend to different types of relationships — one might focus on syntax, another on semantics, another on positional patterns.
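PyTorch also ships a built-in module, `torch.nn.MultiheadAttention`, that implements the same idea. A quick shape check (the dimensions here are arbitrary examples):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)        # (batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)                  # torch.Size([2, 10, 64])
print(attn_weights.shape)         # (batch, seq_len, seq_len), averaged over heads
```

Note the output has the same shape as the input, which is what lets attention layers be stacked dozens of times in a deep model.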
The Full Architecture
A Transformer block combines:
- Multi-head self-attention — captures relationships between tokens
- Feed-forward network — processes each token independently
- Layer normalization — stabilizes training
- Residual connections — helps gradients flow through deep networks
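The four components above fit together in just a few lines. This is a sketch of one block in the "pre-norm" style used by most modern LLMs (the original paper normalized after each sublayer and used ReLU rather than GELU; the hyperparameters here are arbitrary):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm Transformer block: attention + FFN, each wrapped in a residual."""
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(              # position-wise feed-forward network
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual connection around attention (layer norm applied first)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Residual connection around the feed-forward network
        x = x + self.ff(self.norm2(x))
        return x
```

Because every sublayer adds to its input rather than replacing it, gradients have a direct path through the residual stream even in networks with nearly a hundred blocks.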
The original Transformer uses an encoder-decoder structure. Modern LLMs like GPT use only the decoder (with causal masking), while models like BERT use only the encoder.
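Causal masking is what makes a decoder-only model like GPT autoregressive: before the softmax, attention scores to future positions are set to negative infinity so they receive zero weight. A minimal illustration:

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw attention scores for one head
# Upper-triangular mask marks the "future" positions each token must not see
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)
# Row i now has zero weight on every position j > i
print(weights)
```

During training this lets the model predict every next token in the sequence simultaneously, since no position can cheat by looking ahead.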
Why It Changed Everything
The Transformer's parallel computation means:
- Training is massively parallelizable on GPUs
- Long-range dependencies are captured equally well regardless of distance
- Scaling works — more parameters and data consistently improve performance
This scalability is what enabled the jump from models with millions of parameters to models with hundreds of billions — and the emergent capabilities that come with that scale.
Key Takeaways
- Transformers replaced sequential processing (RNNs) with parallel self-attention
- The attention mechanism lets every token directly relate to every other token
- Multi-head attention captures diverse relationship types simultaneously
- The architecture's parallelism enabled the scaling revolution behind modern AI
References
- Vaswani et al., "Attention Is All You Need" (2017) — arXiv:1706.03762
- Jay Alammar, "The Illustrated Transformer" — a visual walkthrough of the architecture