Every major AI system you've heard of — ChatGPT, Claude, Gemini, Llama, Midjourney, even AI video generators — is built on the same fundamental architecture. It's called a transformer, and understanding it gives you a mental model for understanding everything happening in AI.

You don't need to know linear algebra. You need the right analogies.

The Core Problem

Language is full of references that point backward and forward. Consider: "The trophy wouldn't fit in the suitcase because it was too big." What does "it" refer to? The trophy, obviously — because trophies are big things that might not fit.

But change one word: "The trophy wouldn't fit in the suitcase because it was too small." Now "it" refers to the suitcase. Same sentence structure, different meaning, and you need world knowledge to figure it out.

Before transformers, AI processed language like reading through a keyhole — one word at a time, left to right. By the time the model reached "big" or "small," it had a fading memory of "trophy" and "suitcase." Long-range connections were hard.

The Key Insight: Attention

The transformer's breakthrough idea is called self-attention, and it works like this: instead of reading left-to-right, the model looks at every word in relation to every other word, all at once.

Imagine you're at a dinner party. The old approach is like hearing one conversation one word at a time, in order. The transformer approach is like being able to simultaneously hear every conversation at the table and understand how they all relate to each other.

When the transformer processes "it was too big," it calculates how strongly "it" should attend to every other word. It discovers that "it" has a strong connection to "trophy" (the pronoun needs a referent, and "trophy" is the best candidate) and "big" has a strong connection to "trophy" (trophies are big). The model effectively "looks back" at the whole sentence simultaneously.
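The attention calculation can be sketched in a few lines of Python. This is a toy version with made-up two-number "meaning vectors" (real models learn vectors with hundreds of dimensions from data, and add several other ingredients): each word scores every other word, and a softmax turns the scores into weights that sum to 1.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """How strongly one word (the query) attends to every word (the keys).
    Score = dot product of the query vector with each key vector."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

# Illustrative vectors only -- hand-picked so "it" sits near "trophy".
words = ["trophy", "suitcase", "it"]
vectors = {
    "trophy":   [1.0, 0.2],
    "suitcase": [0.1, 1.0],
    "it":       [0.9, 0.3],
}

weights = attention_weights(vectors["it"], [vectors[w] for w in words])
for word, w in zip(words, weights):
    print(f"it -> {word}: {w:.2f}")
```

With these hand-picked numbers, "it" puts more weight on "trophy" than on "suitcase", which is the kind of pattern a trained model discovers on its own.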

Layers of Understanding

A transformer isn't one attention step — it's dozens of them stacked. Each layer builds on the previous one:

Early layers handle syntax: "this word is a noun," "these words form a phrase." Middle layers handle meaning: "this sentence is about physical objects and sizes." Later layers handle complex reasoning: "given the context about sizes and fitting, 'it' refers to the trophy."

GPT-4 is widely reported (though never officially confirmed) to have on the order of 100 layers; Anthropic hasn't published Claude's layer count, but frontier models are generally in a similar range. Each layer refines the model's understanding of every word in context.
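The stacking idea itself is just a loop: each layer reads the current representation of every token and writes a refined one. Here's a minimal sketch with a hypothetical `toy_mixing_layer` that nudges each value toward the sequence average, standing in for the way attention mixes information across tokens:

```python
def transformer_stack(token_vectors, layers):
    """Run the tokens through each layer in order; later layers
    see (and build on) the output of earlier layers."""
    for layer in layers:
        token_vectors = layer(token_vectors)
    return token_vectors

def toy_mixing_layer(values):
    """Toy stand-in for attention: blend each token's value
    with the average of all tokens, spreading context around."""
    avg = sum(values) / len(values)
    return [0.5 * v + 0.5 * avg for v in values]

# One number per token instead of a full vector, for readability.
out = transformer_stack([1.0, 2.0, 9.0], [toy_mixing_layer] * 4)
print(out)
```

After four layers the values have moved much closer together: information from every token has been blended into every other token, layer by layer.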

Why This Matters For You

Understanding transformers explains several things you've probably noticed about AI:

Why AI is good at some things and bad at others: Transformers excel at pattern recognition across text — summarization, translation, coding, analysis. They struggle with precise counting, multi-step math, and tasks requiring information not in their training data.

Why "context windows" matter: The attention mechanism compares every word to every other word. Doubling the context length quadruples the computation (roughly). That's why processing a 100-page document is much harder than a 1-page document — it's not just more text, it's quadratically more relationships to track.
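The arithmetic behind that is simple to check: with n tokens, every-word-to-every-word comparison means roughly n × n pairs, so doubling n quadruples the work.

```python
def attention_pairs(n_tokens):
    """Self-attention compares every token with every token,
    so the number of pairwise comparisons grows as n squared."""
    return n_tokens * n_tokens

for n in [1_000, 2_000, 4_000]:
    print(f"{n:>5} tokens -> {attention_pairs(n):>12,} comparisons")
```

Real systems use various tricks to soften this quadratic cost, but the basic scaling is why long context windows are expensive.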

Why bigger models are smarter: More layers and wider layers mean more nuanced attention patterns. A small model might capture "it refers to trophy." A large model captures "it refers to trophy, and this sentence is making an implicit argument about the relationship between object size and container size, which is a common rhetorical structure."

The transformer was invented in 2017 by researchers at Google, in a paper titled "Attention Is All You Need." It's one of the most consequential inventions of the 21st century so far, and it happened less than a decade ago.