Foundation Models & LLMs

What makes generative AI different, how transformers and tokens work, and the vocabulary of foundation models.

A new kind of model

Traditional ML trains one model per task on your own labeled dataset. Generative AI flips this: a foundation model (FM) is pre-trained once — on internet-scale unlabeled data at enormous cost — and then adapted to *many* tasks. Large language models (LLMs) are foundation models specialized in text; other FMs generate images, audio, video, or code.

Core vocabulary

  • Token — the unit LLMs read and write; roughly a word chunk (~4 characters of English). Pricing and limits are measured in tokens.
  • Embedding — a list of numbers (vector) representing a piece of text's *meaning*; similar meanings sit close together in vector space. Powers semantic search and RAG.
  • Context window — the maximum number of tokens a model can consider at once (prompt + response). A bigger window fits more documents and more conversation history.
  • Transformer — the neural network architecture behind modern FMs; its self-attention mechanism lets the model weigh how every token relates to every other token.
  • Parameters — the billions of learned weights inside the model.
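  • Inference — generating output from a prompt. An LLM produces text one token at a time: at each step it assigns a probability to every possible next token, then picks one (the most likely, or a weighted random sample).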
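
The token and context-window entries above come down to simple arithmetic. The following is a minimal sketch, assuming the 4-characters-per-token rule of thumb; real tokenizers give different counts, and the function names and the price are hypothetical, not any provider's API.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Rough token count: about one token per 4 characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(prompt: str, max_response_tokens: int, context_window: int) -> bool:
    """The prompt and the reserved response budget share one window."""
    return estimate_tokens(prompt) + max_response_tokens <= context_window


def estimate_cost(tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of `tokens` at a hypothetical price per 1,000 tokens."""
    return tokens / 1000 * price_per_1k_tokens


prompt = "Summarize the attached quarterly report in three bullet points."
print(estimate_tokens(prompt))
print(fits_context(prompt, max_response_tokens=500, context_window=8192))
```

The point to remember is that the response counts against the window too, so a long prompt leaves less room for the answer.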
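
Self-attention, mentioned in the Transformer entry above, can be sketched in a few lines. This is a simplified illustration, not a full transformer layer: it assumes queries, keys, and values are all the raw input vectors (real models first apply learned weight matrices), and the three token vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = input."""
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token, including itself.
        weights = softmax([dot(query, key) / scale for key in vectors])
        # New representation: weighted average of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up 2-dimensional token vectors (illustrative only).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how strongly one token attends to every token. The first token gives more weight to the second, which points in a similar direction, than to the third.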

Think of it like this

A foundation model is like a brilliantly well-read new employee: they've read practically everything (pre-training), so instead of teaching them from scratch you just give good instructions (prompting), hand them your company docs to reference (RAG), or send them to a specialized bootcamp (fine-tuning).
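
The "hand them your company docs" step (RAG) relies on the embedding idea from the vocabulary list: find the stored text whose vector sits closest to the question's vector, then paste it into the prompt. Here is a minimal sketch, assuming hand-written 3-dimensional vectors in place of a real embedding model; the documents, vectors, and function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """1.0 means same direction (similar meaning), 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings"; a real system would get these from an embedding model.
docs = {
    "Vacation policy: 20 days per year": [0.9, 0.1, 0.0],
    "Expense reports are due monthly": [0.1, 0.9, 0.1],
    "Office wifi setup guide": [0.0, 0.2, 0.9],
}


def retrieve(query_vector, docs, top_k=1):
    """Return the top_k document texts most similar to the query vector."""
    ranked = sorted(
        docs,
        key=lambda text: cosine_similarity(query_vector, docs[text]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_docs):
    """Place the retrieved text in the prompt so the model can reference it."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "How many days off do I get?"
query_vector = [0.8, 0.2, 0.1]  # pretend output of an embedding model
print(build_prompt(question, retrieve(query_vector, docs)))
```

No model weights change here: the model only sees extra reference text at inference time, which is what separates RAG from fine-tuning.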

How FMs are built and adapted

  1. Pre-training — self-supervised learning over massive unlabeled corpora, which can cost millions of dollars in compute. You will almost never do this yourself.
  2. Fine-tuning — further training on a smaller labeled dataset to specialize the model.
  3. Instruction tuning & RLHF — teaching the model to follow instructions (instruction tuning) and to align with human preferences (RLHF, Reinforcement Learning from Human Feedback).
  4. In-context learning — no training at all: the model adapts from examples inside the prompt itself.
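
In-context learning (step 4) is often done as few-shot prompting: a handful of worked examples go in the prompt, and the model continues the pattern. This is a minimal sketch of assembling such a prompt; the "Review:/Sentiment:" layout, the example reviews, and the function name are illustrative assumptions, and no model is called.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, labeled examples, and one unlabeled input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # Leave the final label blank for the model to fill in.
    lines.append(f"Review: {new_input}")
    lines.append("Sentiment:")
    return "\n".join(lines)


examples = [
    ("Great battery life.", "positive"),
    ("Stopped working after a week.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    examples,
    "Arrived late but works fine.",
)
print(prompt)
```

The examples occupy tokens in the context window on every request, which is the trade-off against fine-tuning.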

Exam tip

Unimodal vs. multimodal: unimodal models handle one data type (text→text). Multimodal models accept or produce multiple types (image+text → text). Diffusion models are the technique behind most modern image generators: they learn to turn random noise into an image step by step.

Knowledge check

What is a foundation model?