Foundation Models & LLMs
What makes generative AI different, how transformers and tokens work, and the vocabulary of foundation models.
A new kind of model
Traditional ML trains one model per task on your own labeled dataset. Generative AI flips this: a foundation model (FM) is pre-trained once — on internet-scale unlabeled data at enormous cost — and then adapted to *many* tasks. Large language models (LLMs) are foundation models specialized in text; other FMs generate images, audio, video, or code.
Core vocabulary
- Token — the unit LLMs read and write: a word or word fragment, averaging about 4 characters of English text. Pricing and limits are measured in tokens.
- Embedding — a list of numbers (vector) representing a piece of text's *meaning*; similar meanings sit close together in vector space. Powers semantic search and RAG.
- Context window — the maximum number of tokens a model can consider at once (prompt + response). A bigger window fits more documents or conversation history.
- Transformer — the neural network architecture behind modern FMs; its self-attention mechanism lets the model weigh how every token relates to every other token.
- Parameters — the billions of learned weights inside the model.
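- Inference — generating output from a prompt. An LLM produces text one token at a time: at each step it predicts a probability distribution over the next token, picks one (the most likely, or a random sample), and repeats.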
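To make the token and context-window definitions concrete, here is a minimal sketch of a token budget check. It assumes the rough 4-characters-per-token rule of thumb from the list above. Real tokenizers give different counts, and the function names and the 8192-token window are illustrative, not taken from any particular model.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb only; real tokenizers vary by model and language


def estimate_tokens(text: str) -> int:
    """Rough token count for English text using the ~4 characters/token heuristic."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(prompt: str, max_response_tokens: int, context_window: int) -> bool:
    """True if the estimated prompt tokens plus the reserved response tokens fit the window."""
    return estimate_tokens(prompt) + max_response_tokens <= context_window


prompt = "Summarize the attached quarterly report in three bullet points."
print(estimate_tokens(prompt))
# 8192 is a made-up window size for illustration.
print(fits_context(prompt, max_response_tokens=500, context_window=8192))
```

Because the window covers prompt and response together, the check reserves room for the response up front rather than counting only the prompt.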
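The self-attention idea from the Transformer entry can be sketched in a few lines of plain Python. This is a deliberately simplified illustration: each token vector acts as its own query, key, and value, whereas a real transformer first applies learned projection matrices and runs many attention heads in parallel. The three toy vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention over a list of equal-length token vectors.

    Simplification: each vector serves as its own query, key, and value.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is a weighted average of all token vectors.
        mixed = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up 2-dimensional token vectors; the first two point in similar directions.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence. Tokens with similar vectors receive higher weights from each other.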
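A single inference step can be sketched as follows. The vocabulary, the scores (logits), and the temperature value are all invented for illustration. A real model scores tens of thousands of tokens, and its sampling options vary by provider.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(vocab, logits, temperature=1.0, greedy=False, rng=random):
    """Pick the next token: the most likely one (greedy) or a weighted random sample."""
    probs = softmax(logits, temperature)
    if greedy:
        return vocab[probs.index(max(probs))]
    return rng.choices(vocab, weights=probs, k=1)[0]


vocab = ["mat", "sofa", "moon"]
logits = [2.0, 1.0, -1.0]  # made-up scores for the prompt "The cat sat on the"

print(next_token(vocab, logits, greedy=True))
print(next_token(vocab, logits, temperature=0.8, rng=random.Random(0)))
```

Generation repeats this step, appending each chosen token to the input before predicting the next one.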
A foundation model is like a brilliantly well-read new employee: they've read practically everything (pre-training), so instead of teaching them from scratch you just give good instructions (prompting), hand them your company docs to reference (RAG), or send them to a specialized bootcamp (fine-tuning).
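The "hand them your company docs" step relies on embeddings: find the document whose vector is closest to the question's vector, then paste it into the prompt. The sketch below uses hand-written three-number vectors as stand-ins for real embeddings. In practice both the documents and the question would be embedded by the same embedding model, and all names and values here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors by angle: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings written by hand (hypothetical values, not model output).
documents = {
    "Employees accrue 20 vacation days per year.": [0.9, 0.1, 0.0],
    "The VPN requires multi-factor authentication.": [0.1, 0.9, 0.2],
    "Expense reports are due by the 5th of each month.": [0.0, 0.2, 0.9],
}


def retrieve(query_embedding, docs, top_k=1):
    """Return the top_k document texts ranked by similarity to the query."""
    ranked = sorted(
        docs,
        key=lambda text: cosine_similarity(query_embedding, docs[text]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_docs):
    """Place the retrieved text in the prompt so the model can reference it."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


question = "How much time off do I get?"
question_embedding = [0.8, 0.2, 0.1]  # stand-in for the embedding of the question
print(build_prompt(question, retrieve(question_embedding, documents)))
```

The model itself is unchanged; only the prompt grows, which is why retrieved text counts against the context window.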
How FMs are built and adapted
- Pre-training — self-supervised learning over massive unlabeled corpora; costs millions of dollars in compute. You will almost never do this.
- Fine-tuning — further training on a smaller labeled dataset to specialize the model.
- Instruction tuning & RLHF — teaching the model to follow instructions and align with human preferences (Reinforcement Learning from Human Feedback).
- In-context learning — no training at all: the model adapts from examples inside the prompt itself.
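In-context learning, the last item above, is often done with a few-shot prompt: a handful of worked examples followed by the new input, with no change to the model's weights. A minimal sketch, in which the sentiment task, the example reviews, and the prompt layout are all illustrative assumptions:

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, labeled examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # Leave the final label blank so the model completes it.
    lines += [f"Review: {new_input}", "Sentiment:"]
    return "\n".join(lines)


examples = [
    ("The battery lasts all day.", "positive"),
    ("It broke after a week.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    examples,
    "Setup was quick and painless.",
)
print(prompt)
```

The model infers the pattern from the two examples and continues it for the third review. Swapping the examples changes the task without any training.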
Multimodal? Unimodal models handle one data type (text → text). Multimodal models accept or produce multiple types (image+text → text). Diffusion models are the technique behind most current image generators: they learn to turn noise into images step by step.
What is a foundation model?