What Is a Transformer?

A beginner-friendly explanation of the Transformer architecture that powers modern AI — from attention mechanisms to why it changed everything.

February 20, 2026 · 3 min read
Prerequisites: Basic linear algebra, Neural network fundamentals

The Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need", fundamentally changed how we build AI systems. Every major language model today — GPT, Claude, Gemini, LLaMA — is built on Transformers.

Why Transformers Matter

Before Transformers, the dominant approach for processing sequences (like text) was Recurrent Neural Networks (RNNs). RNNs process tokens one at a time, left to right. This sequential nature made them:

  • Slow to train — you can't parallelize sequential processing
  • Forgetful — information from early tokens fades over long sequences
  • Difficult to scale — wall-clock training time grows with sequence length, since each step must wait for the previous one

Transformers solved all three problems with a single key innovation: self-attention.

The Self-Attention Mechanism

Self-attention allows every token in a sequence to directly attend to every other token, computing relationships in parallel rather than sequentially.

The core idea is surprisingly simple. For each token in the input:

  1. Create three vectors: Query (Q), Key (K), and Value (V)
  2. Compute attention scores by comparing the Query against all Keys
  3. Use the scores to create a weighted sum of all Values

Mathematically, self-attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large, which would otherwise push the softmax into regions with near-zero gradients.
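As a minimal sketch, the formula translates almost line for line into NumPy (the function name and toy sizes here are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # compare each Query against all Keys, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over keys (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted sum of all Values
    return weights @ V

# toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Note that the output has the same shape as the input: each token's output is just a weighted blend of all the Value vectors.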

Multi-Head Attention

Rather than computing attention once, Transformers use multi-head attention — running several attention computations in parallel, each with different learned projections:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
 
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
 
    def forward(self, x):
        batch_size, seq_len, d_model = x.shape
 
        # Project and reshape for multi-head
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
 
        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attn = torch.softmax(scores, dim=-1)
        out = attn @ V
 
        # Concatenate heads and project
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return self.W_o(out)
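For comparison, PyTorch ships this same building block as `torch.nn.MultiheadAttention`. A quick shape check (sizes here are arbitrary; `batch_first=True` makes inputs `(batch, seq, d_model)`):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)  # (batch, seq_len, d_model)
# self-attention: the same tensor serves as query, key, and value
out, attn_weights = mha(x, x, x)
print(out.shape)           # torch.Size([2, 10, 64])
print(attn_weights.shape)  # (batch, seq_len, seq_len), averaged over heads
```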

Each head can learn to attend to different types of relationships — one might focus on syntax, another on semantics, another on positional patterns.

The Full Architecture

A Transformer block combines:

  1. Multi-head self-attention — captures relationships between tokens
  2. Feed-forward network — processes each token independently
  3. Layer normalization — stabilizes training
  4. Residual connections — helps gradients flow through deep networks

The original Transformer uses an encoder-decoder structure. Modern LLMs like GPT use only the decoder (with causal masking), while models like BERT use only the encoder.
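The four components above can be sketched as a single block. This is a pre-norm variant (normalization before each sublayer, common in modern LLMs) built on PyTorch's own attention module; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # feed-forward network applied to each token independently
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # self-attention sublayer with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # feed-forward sublayer with residual connection
        x = x + self.ff(self.norm2(x))
        return x

block = TransformerBlock(d_model=64, num_heads=8, d_ff=256)
x = torch.randn(2, 10, 64)
print(block(x).shape)  # torch.Size([2, 10, 64])
```

A decoder-only model like GPT would stack many of these blocks and pass a causal mask to the attention call so each token can only attend to earlier positions.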

Why It Changed Everything

The Transformer's parallel computation means:

  • Training is massively parallelizable on GPUs
  • Long-range dependencies are captured equally well regardless of distance
  • Scaling works — more parameters and data consistently improve performance

This scalability is what enabled the jump from models with millions of parameters to models with hundreds of billions — and the emergent capabilities that come with that scale.

Key Takeaways

  • Transformers replaced sequential processing (RNNs) with parallel self-attention
  • The attention mechanism lets every token directly relate to every other token
  • Multi-head attention captures diverse relationship types simultaneously
  • The architecture's parallelism enabled the scaling revolution behind modern AI

References

  1. Vaswani et al., "Attention Is All You Need" (2017) — arXiv:1706.03762
  2. Jay Alammar, "The Illustrated Transformer" — a visual walkthrough of the architecture
