Large language models (LLMs) are the foundation of the current wave of generative AI. They power chatbots, code assistants, document summarization tools, and a growing list of enterprise applications. Despite widespread use, the mechanics of how LLMs work — how they are trained, why they can follow instructions, and where their limitations come from — remain unclear to many of the practitioners deploying them.
This article covers the fundamentals: what LLMs actually are, how the transformer architecture enables them, how training works, and why concepts like fine-tuning and alignment matter in production.
An LLM is a neural network trained to predict the probability of sequences of tokens — typically subword units from a tokenized text corpus. At inference time, the model generates text autoregressively: it predicts the next most likely token, appends it to the sequence, and repeats.
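That loop can be sketched in a few lines. Here `next_token_probs` is a hypothetical stand-in for a real model's forward pass, and decoding is greedy (always take the most likely token); real systems usually sample from the distribution instead:

```python
# Toy sketch of autoregressive generation. `next_token_probs` is a
# hypothetical stand-in for a real model's forward pass.

def next_token_probs(tokens):
    """Toy 'model': a fixed lookup keyed on the last token (illustration only)."""
    table = {
        "the": {"cat": 0.6, "dog": 0.4},
        "cat": {"sat": 0.9, "ran": 0.1},
        "sat": {"<eos>": 1.0},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)  # greedy: pick the most likely token
        if token == "<eos>":
            break
        tokens.append(token)  # append and repeat: the autoregressive step
    return tokens
```

With this toy table, `generate(["the"])` walks the chain to `["the", "cat", "sat"]` and stops at the end-of-sequence token.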
The “large” in LLM refers to scale on two dimensions:

- Parameter count: the number of learned weights, ranging from billions to hundreds of billions and beyond.
- Training data: the number of tokens the model is trained on, now commonly measured in trillions.
Scale matters because it changes what the model is capable of. Research has consistently shown that sufficiently large models trained on sufficiently large datasets exhibit emergent capabilities — abilities like multi-step reasoning, code generation, and analogical thinking that are weak or absent at smaller scales.
The dominant architecture for LLMs is the transformer, introduced in the 2017 paper Attention Is All You Need. While the field has evolved significantly since then, the core components remain central to all modern LLMs.
Before a language model can process text, it must convert characters to tokens. Modern LLMs use subword tokenization algorithms like BPE (Byte Pair Encoding) or SentencePiece. Common words become single tokens; rare words are split into subword pieces. A vocabulary typically contains 32,000 to 128,000 tokens.
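The core BPE idea can be sketched in a few lines: repeatedly find the most frequent adjacent symbol pair in the corpus and merge it into a new symbol. This is a simplified illustration (the function names and toy corpus are invented for the example); production tokenizers learn tens of thousands of merges over huge corpora:

```python
# One BPE merge step (simplified illustration).
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus as {symbol-sequence: frequency}; words start as characters.
corpus = {("l", "o", "w"): 5, ("l", "o", "g"): 3, ("n", "e", "w"): 2}
pair = most_frequent_pair(corpus)   # ("l", "o") occurs 8 times
corpus = merge_pair(corpus, pair)   # "lo" becomes a single vocabulary symbol
```

Training repeats this step until the vocabulary reaches its target size; frequent words eventually become single tokens while rare words stay split into pieces.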
Tokenization matters operationally because:

- API usage and pricing are typically measured in tokens, not characters or words.
- Context window limits are expressed in tokens.
- Token counts vary across languages and content types; non-English text and unusual strings often consume more tokens per character.
Each token is mapped to a high-dimensional vector (the embedding). These embeddings encode semantic and syntactic relationships learned during training. Positional encodings or positional biases are added to communicate the position of each token in the sequence.
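Positional information can be injected in several ways; the sinusoidal encodings from the original transformer paper are the classic example (many modern LLMs instead use learned or rotary position embeddings). A minimal sketch:

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encodings from 'Attention Is All You Need':
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

Each position gets a distinct vector, and the varying frequencies let the model recover both absolute and relative position from the encoding.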
The attention mechanism is the core innovation of the transformer. For each token position, attention computes a weighted combination of all other token representations in the context window. The weights are determined by how relevant each position is to the current position.
This allows the model to:

- Capture long-range dependencies, such as linking a pronoun to an antecedent many sentences earlier.
- Relate elements regardless of distance in the sequence, for example a function call to its definition.
- Build context-dependent representations of each token rather than fixed word meanings.
Multi-head attention runs several attention computations in parallel with different learned projections, allowing the model to attend to different kinds of relationships simultaneously.
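The computation can be sketched for a single head, without batching or masking; `queries`, `keys`, and `values` are lists of equal-length float vectors, and the scaling by the square root of the key dimension follows the standard formulation:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head (no batching, no mask).
    output[i] = sum_j softmax_j(q_i . k_j / sqrt(d_k)) * v_j
    """
    d_k = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted combination of the value vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out
```

Multi-head attention simply runs several copies of this with different learned projections of the same inputs and concatenates the results.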
Each transformer layer follows the attention sub-layer with a position-wise feed-forward network — typically two linear transformations with a nonlinearity (GELU or similar). These layers are where much of the model’s “knowledge” is stored, though the precise mechanisms are still an active research area.
Modern LLMs stack many transformer layers (24 to 96+ for frontier models). Depth allows models to build progressively more abstract representations. The final layer’s output is mapped back to vocabulary probabilities via a linear projection and softmax.
Most current LLMs — GPT-4, Llama, Mistral, Falcon, and their variants — use a decoder-only architecture: each token position can only attend to previous positions (causal attention). This is well-suited to autoregressive text generation.
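Causal attention is commonly pictured as a lower-triangular mask: position i may attend only to positions j ≤ i. A minimal sketch (in practice the mask is applied by adding negative infinity to disallowed attention scores before the softmax):

```python
def causal_mask(n):
    """Lower-triangular boolean mask: True where attention is allowed,
    i.e. position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]
```

The first row allows only self-attention; the last row sees the whole prefix, which is exactly what autoregressive generation needs.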
Encoder-decoder architectures (like T5 and BART) are also used, particularly for tasks like translation and summarization where the full input should be processed before generating output.
Pre-training is the first and most computationally expensive phase. The objective is simple: predict the next token given all previous tokens (autoregressive language modeling). The model processes vast amounts of text and updates its weights to minimize prediction error.
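The objective is ordinary cross-entropy on the next token, averaged over positions. A sketch with hand-written probabilities (the function and numbers are illustrative, not from any real model):

```python
import math

def next_token_loss(probs_per_step, targets):
    """Average cross-entropy of the autoregressive objective: at each
    position, the model's probability distribution over the vocabulary
    is scored against the id of the actual next token."""
    losses = [-math.log(step[t]) for step, t in zip(probs_per_step, targets)]
    return sum(losses) / len(losses)

# Two positions, toy vocabulary of 3 token ids; the model assigns the
# true next token probability 0.5 at both steps.
probs = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]]
targets = [0, 1]
loss = next_token_loss(probs, targets)  # -log(0.5) ~= 0.693
```

Minimizing this loss across trillions of tokens is, mechanically, all that pre-training does.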
Despite this simple objective, pre-training on diverse, high-quality data at scale produces a model with broad knowledge, reasoning patterns, language fluency, and some coding and mathematical ability.
Pre-training datasets typically combine:

- Filtered web crawl text
- Books and other long-form documents
- Source code
- Academic and scientific text
- Reference and encyclopedic content
Data quality matters enormously. Filtering, deduplication, and curation of pre-training data have a significant impact on model quality. NVIDIA NeMo Curator is an example of tooling designed specifically for this challenge at scale.
Pre-training frontier models requires enormous compute budgets — typically thousands of NVIDIA H100 GPUs running for weeks or months. The scaling laws research (Chinchilla, OpenAI scaling laws) shows that optimal pre-training balances model size and training token count: bigger models do not always outperform smaller models trained on more data, given fixed compute budgets.
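As a back-of-the-envelope sketch, two widely used approximations (training FLOPs C ≈ 6ND for N parameters and D tokens, and the Chinchilla heuristic D ≈ 20N) pin down a compute-optimal size for a given budget:

```python
def chinchilla_optimal(compute_flops):
    """Given a training FLOP budget C, return (params N, tokens D)
    under C = 6*N*D and the Chinchilla heuristic D = 20*N:
    C = 120 * N^2  =>  N = sqrt(C / 120), D = 20 * N.
    """
    n_params = (compute_flops / 120.0) ** 0.5
    n_tokens = 20.0 * n_params
    return n_params, n_tokens

# Example: a 1e23 FLOP budget yields roughly a 29B-parameter model
# trained on roughly 580B tokens.
n, d = chinchilla_optimal(1e23)
```

The same budget spent on a much larger model would leave too few training tokens, which is exactly the trade-off the scaling-laws research quantifies.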
A pre-trained LLM is a powerful next-token predictor but is not directly useful as an assistant — it continues text rather than answering questions or following instructions. Instruction fine-tuning (also called supervised fine-tuning, SFT) trains the model on examples of instruction-response pairs.
After SFT, the model learns to:

- Interpret a prompt as a task to complete rather than text to continue.
- Respond in a direct, conversational format and stop when the answer is complete.
- Follow explicit formatting and style constraints in the instruction.
This is the step that transforms a raw pre-trained model into something like a chat assistant.
Instruction fine-tuning alone is not sufficient for production assistant models. Models can still produce harmful, inaccurate, or unhelpful responses. Alignment training is the process of making model behavior more consistent with human preferences and safety requirements.
RLHF is the most widely used alignment technique for frontier models:

1. Sample multiple outputs from the fine-tuned model and collect human rankings of which outputs are preferred.
2. Train a reward model to predict those human preferences.
3. Optimize the LLM against the reward model with reinforcement learning (typically PPO), usually with a KL penalty that keeps the policy close to the SFT model.
RLHF is computationally demanding and operationally complex, but it has been central to the quality improvements in models like InstructGPT and GPT-4.
DPO is a simpler alternative that eliminates the need for a separate reward model. Instead, it trains directly on preference pairs (chosen vs rejected outputs) using a modified loss function. DPO has gained adoption because it is more stable and easier to implement than PPO-based RLHF.
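The per-pair DPO loss can be written down directly. This sketch assumes the sequence log-probabilities of the chosen and rejected responses, under both the policy and the frozen reference model, have already been computed; `beta` is the strength of the implicit KL constraint:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    L = -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))
    The loss falls as the policy raises the chosen response's likelihood
    relative to the rejected one (each measured against the reference).
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference on both responses the margin is zero and the loss is log 2; preferring the chosen response drives it lower. No reward model or RL loop is required, which is the source of DPO's stability advantage.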
NVIDIA NeMo Aligner supports both RLHF and DPO.
The context window is the maximum number of tokens an LLM can consider at once. Early models had context limits of 2,048 tokens. Modern models commonly support 8,192 to 128,000 tokens or more.
The KV cache (key-value cache) is a critical inference optimization: during autoregressive generation, the key and value tensors computed for previously generated tokens do not change, so they can be cached and reused instead of being recomputed at every step. Without KV caching, each generation step would re-run the forward pass over the entire prefix, making total generation cost scale quadratically with sequence length; with it, each step processes only the newly generated token.
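A toy sketch of the saving, counting per-token projection work only (`project_kv` is a hypothetical stand-in for one attention layer's key/value projections):

```python
def project_kv(token_embedding):
    # Real models apply learned weight matrices; here keys and values
    # are just the embedding itself, for illustration.
    return token_embedding, token_embedding

def generate_with_cache(embeddings):
    """Incremental decoding: each step projects ONLY the newest token
    and appends to the cache, instead of re-projecting the whole prefix."""
    k_cache, v_cache = [], []
    projections = 0
    for emb in embeddings:          # one new token per step
        k, v = project_kv(emb)
        projections += 1            # one projection per step
        k_cache.append(k)
        v_cache.append(v)
        # attention for the new token reads all of k_cache/v_cache here
    return projections

def generate_without_cache(embeddings):
    """Naive decoding: step t re-projects all t tokens of the prefix."""
    projections = 0
    for t in range(1, len(embeddings) + 1):
        for emb in embeddings[:t]:
            project_kv(emb)
            projections += 1
    return projections

embs = [[float(i)] for i in range(100)]
# 100 projections with the cache versus 1 + 2 + ... + 100 = 5050 without.
```

The linear-versus-quadratic gap grows with sequence length, which is why cache memory, not compute, often becomes the binding constraint in serving.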
In production serving, KV cache management is one of the primary challenges for high-concurrency LLM deployments. TensorRT-LLM and Triton Inference Server implement KV cache pooling and paged attention to improve cache utilization across concurrent requests.
LLMs sometimes generate text that sounds plausible but is factually incorrect, a failure mode known as hallucination. This is a structural property of how they are trained: the model learns statistical patterns in text, not ground truth about the world. Hallucination rates vary by model and domain. RAG (retrieval-augmented generation) is a primary mitigation strategy.
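At its core, the retrieval step of RAG is nearest-neighbor search over embedding vectors; a minimal sketch, assuming document and query embeddings have already been produced by some embedding model (real systems use approximate-nearest-neighbor indexes rather than a full sort):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, doc_vecs, k=1):
    """Return the indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The retrieved documents are then prepended to the prompt so the model can ground its answer in them instead of relying on parametric memory alone.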
For long documents or extended conversations, models eventually exceed their context window. Summarization, chunking, and retrieval patterns are common approaches to this constraint.
LLMs can produce significantly different outputs with minor prompt variations. Prompt engineering, few-shot examples, and structured output constraints are standard techniques for making outputs more reliable.
LLMs have a knowledge cutoff date. They cannot access real-time information unless paired with a retrieval system or tool use capability.
Training and serving LLMs at scale depend heavily on GPU infrastructure. NVIDIA’s technology stack is designed around this:

- H100-class GPUs provide the raw compute for pre-training and fine-tuning.
- NeMo Curator handles large-scale filtering and deduplication of pre-training data.
- NeMo Aligner implements SFT, RLHF, and DPO workflows.
- TensorRT-LLM optimizes inference, including KV cache management and paged attention.
- Triton Inference Server serves models in production at high concurrency.