Why Data Quality Matters More Than Quantity
How much instruction data do you actually need to fine-tune a language model? The intuitive answer is "as much as possible" — more data, better model. But one of the most surprising results in recent alignment research turns that intuition on its head.
Zhou et al. (2023) published LIMA: Less Is More for Alignment, in which they fine-tuned LLaMA-65B on just 1,000 carefully curated examples. The result? In human preference evaluations, LIMA was competitive with, and often preferred over, models trained on 50,000+ examples, including ones trained with reinforcement learning from human feedback. One thousand examples. That's a dataset small enough to fit in a single spreadsheet, holding its own against datasets that were far larger and far more expensive to annotate.
Why does this work? Because the purpose of supervised fine-tuning (SFT) is often misunderstood. A model that has been pre-trained on trillions of tokens, like LLaMA 2 (Touvron et al., 2023), which was trained on 2 trillion tokens, already knows an enormous amount. It has absorbed facts, syntax, reasoning patterns, code conventions, mathematical relationships, and the structure of dozens of languages. All of that knowledge is already encoded in its weights. What it doesn't know is how to present that knowledge in a helpful format. It can continue any text, but it doesn't know that when a user asks a question, it should respond with a clear, structured answer instead of just continuing the sentence as though it were a Wikipedia article.
SFT teaches the model a behaviour: this is what a helpful response looks like. This is where the answer starts. This is the tone, the format, the level of detail. It's not teaching the model new facts; it's teaching it a new style of interaction. Think of it this way: you don't need 50,000 examples to teach someone how to write a professional email if they already speak the language fluently. Ten excellent examples might be enough, because the person already has the vocabulary, grammar, and world knowledge. They just need to see the format.
This insight reshapes how we should think about building instruction datasets. Instead of maximising the number of examples, we should be maximising the quality, diversity, and consistency of each example. A dataset with 1,000 perfect examples will typically outperform one with 100,000 noisy, repetitive, or contradictory examples. The rest of this article is about how to build that high-quality dataset — what formats to use, how to template it for the model, which tokens to train on, and where to source the data.
Instruction Formats
What does an instruction-tuning example actually look like in practice? There are three dominant formats in the ecosystem, each with different strengths. Understanding them matters because the format you choose determines what tools you can use, which pre-built datasets are compatible, and how your training pipeline processes the data.
The first and simplest is the Alpaca format, introduced by Stanford's Alpaca project (Taori et al., 2023). Each example is a flat JSON object with three fields: an instruction (what the model should do), an optional input (additional context), and an output (the desired response):
{
  "instruction": "Summarize the following text in two sentences.",
  "input": "Retrieval-Augmented Generation (RAG) combines a retrieval step with a generative model. Instead of relying solely on parametric knowledge stored in model weights, RAG first retrieves relevant documents from an external corpus, then conditions the generation on those documents. This approach reduces hallucination and allows the knowledge base to be updated without retraining the model.",
  "output": "RAG enhances language models by retrieving relevant documents before generating a response, rather than relying only on knowledge stored in weights. This reduces hallucinations and allows knowledge updates without retraining."
}
Alpaca is popular for single-turn tasks: summarisation, translation, classification, question answering. Its simplicity is its strength: there's no ambiguity about what each field means, and converting existing NLP datasets into Alpaca format is straightforward. The input field is optional; many examples use only instruction and output (e.g., "Write a poem about recursion" needs no separate input).
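To make the optional input field concrete, here is a minimal sketch of how an Alpaca-style record might be flattened into a single prompt string. The function name `build_alpaca_prompt` and the `### Instruction:` / `### Input:` / `### Response:` section headers are assumptions for illustration, loosely modelled on layouts commonly used with Alpaca-style data; they are not taken from this article.

```python
def build_alpaca_prompt(example):
    """Render one Alpaca-style record as a prompt string (illustrative layout)."""
    instruction = example["instruction"].strip()
    # "input" is optional: treat a missing key and an empty string the same way.
    context = (example.get("input") or "").strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


with_input = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
without_input = {
    "instruction": "Write a poem about recursion.",
    "output": "To understand this poem, read this poem.",
}

prompt_with_input = build_alpaca_prompt(with_input)
prompt_without_input = build_alpaca_prompt(without_input)

# The training text is the prompt followed by the desired output.
print(prompt_with_input + with_input["output"])
```

Branching on the presence of input keeps examples without context from carrying an empty section, which would otherwise teach the model to expect a header with nothing under it.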
The second format is the ShareGPT / conversation format, which captures multi-turn dialogues. Each example is a list of conversation turns, alternating between the human and the model:
{
  "conversations": [
    {"from": "human", "value": "What is LoRA and why is it useful?"},
    {"from": "gpt", "value": "LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method. Instead of updating all model weights during fine-tuning, LoRA freezes the pre-trained weights and injects small trainable low-rank matrices into each layer. This reduces memory usage by 3-10x while achieving comparable performance to full fine-tuning."},
    {"from": "human", "value": "How does it compare to full fine-tuning in terms of quality?"},
    {"from": "gpt", "value": "For most tasks, LoRA with rank 16-64 achieves 95-100% of full fine-tuning performance. The gap is smallest for tasks that require stylistic or formatting changes (exactly what SFT typically teaches) and largest for tasks that require learning entirely new knowledge domains. The key insight from the intrinsic dimensionality literature is that weight updates during fine-tuning tend to be low-rank anyway, so LoRA is just making that structure explicit."}
  ]
}
ShareGPT format is essential when you want the model to handle multi-turn dialogue well — follow-up questions, clarifications, building on previous context. Many real-world chatbot interactions are multi-turn, so if your use case involves conversation, this format captures the dynamics that Alpaca cannot.
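Because the structure of a conversation record is easy to get subtly wrong, a small structural check can be useful before training. The sketch below assumes the strict layout shown above: turns alternate human, gpt, human, gpt, starting with the human and ending with the model. The function name `validate_sharegpt` is hypothetical, and real community dumps may use other speaker labels or include a system turn.

```python
def validate_sharegpt(example):
    """Return a list of structural problems found in one ShareGPT-style record."""
    turns = example.get("conversations", [])
    if not turns:
        return ["no conversation turns"]

    problems = []
    expected = "human"  # assumption: every conversation starts with the human
    for i, turn in enumerate(turns):
        speaker = turn.get("from")
        if speaker != expected:
            problems.append(f"turn {i}: expected '{expected}', got '{speaker}'")
        if not (turn.get("value") or "").strip():
            problems.append(f"turn {i}: empty value")
        expected = "gpt" if expected == "human" else "human"

    # A conversation that ends on a human turn has no response to learn from.
    if turns[-1].get("from") != "gpt":
        problems.append("conversation does not end with a model turn")
    return problems


good = {"conversations": [
    {"from": "human", "value": "What is LoRA?"},
    {"from": "gpt", "value": "A parameter-efficient fine-tuning method."},
]}
bad = {"conversations": [
    {"from": "human", "value": "What is LoRA?"},
    {"from": "human", "value": "Anyone there?"},
]}

good_problems = validate_sharegpt(good)
bad_problems = validate_sharegpt(bad)
print(good_problems)
print(bad_problems)
```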
The third format is the OpenAI messages format, which has become a de facto industry standard. It's used by OpenAI's fine-tuning API, HuggingFace TRL, vLLM, and most modern training and serving frameworks:
{
  "messages": [
    {"role": "system", "content": "You are a machine learning tutor. Explain concepts clearly and concisely, using analogies where helpful. Always mention relevant papers."},
    {"role": "user", "content": "What is LoRA and why is it useful?"},
    {"role": "assistant", "content": "LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method introduced by Hu et al. (2021). Think of it like this: instead of rewriting an entire textbook to add a chapter, you clip in a thin insert at each section. The original book (pre-trained weights) stays untouched, and the inserts (low-rank matrices) capture only what's new. This cuts memory usage by 3-10x with minimal quality loss."},
    {"role": "user", "content": "How do I choose the rank r?"},
    {"role": "assistant", "content": "Start with r=16 for most tasks. If the task is simple (style/format changes), r=8 often suffices. For complex domain adaptation, try r=32 or 64. The key diagnostic: if training loss plateaus and eval metrics are below target, increase r. If you see overfitting (train loss drops but eval gets worse), decrease r or add regularisation."}
  ]
}
The key addition here is the system message, which sets the model's persona, constraints, and behaviour before the conversation begins. This is powerful because it lets you control the model's personality at training time: you can train the same base model to be a concise code reviewer, a patient tutor, or a formal legal assistant just by varying the system message in your training examples.
Here's a practical comparison of when to use each format:
- Alpaca: best for single-turn tasks (classification, extraction, summarisation). Simple to create, easy to debug. Use when you don't need dialogue.
- ShareGPT: best for multi-turn training data sourced from existing conversations. Common in community datasets (ShareGPT, LMSYS-Chat). Use when your data is already in conversation form.
- OpenAI messages: best for production systems. Supports system messages, multi-turn, and is compatible with most modern frameworks (TRL, Axolotl, LLaMA-Factory). Use this as your default unless you have a specific reason not to.
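If the messages format is the default target, existing Alpaca and ShareGPT records need converting. The following is a minimal sketch of both conversions. The function names and the speaker-to-role mapping in `ROLE_MAP` are assumptions for illustration; in particular, joining an Alpaca instruction and input with a blank line is just one reasonable choice.

```python
# Assumed speaker-to-role mapping; some ShareGPT dumps use other labels.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def alpaca_to_messages(example):
    """Convert one Alpaca-style record into a messages-style record."""
    instruction = example["instruction"].strip()
    context = (example.get("input") or "").strip()
    # Fold the optional input into the user turn, separated by a blank line.
    user_content = f"{instruction}\n\n{context}" if context else instruction
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }


def sharegpt_to_messages(example):
    """Convert one ShareGPT-style record into a messages-style record."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }


alpaca_record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
sharegpt_record = {"conversations": [
    {"from": "human", "value": "What is LoRA?"},
    {"from": "gpt", "value": "A parameter-efficient fine-tuning method."},
]}

from_alpaca = alpaca_to_messages(alpaca_record)
from_sharegpt = sharegpt_to_messages(sharegpt_record)
print(from_alpaca)
print(from_sharegpt)
```

An unknown speaker label raises a `KeyError` here rather than being silently passed through, which tends to surface dataset quirks early.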
Chat Templates and Tokenization
Here's a question that trips up almost every first-time fine-tuner: when the model sees the token sequence during training, how does it know where the user's message ends and the assistant's response begins? The JSON format from the previous section is human-readable, but the model never sees JSON. It sees a flat stream of tokens. Something needs to translate the structured conversation into a token sequence with unambiguous boundaries. That something is the chat template.
A chat template is a set of special tokens and formatting rules that mark the boundaries between roles (system, user, assistant) in the token stream. Different model families use different templates, and using the wrong one is one of the most common and hardest-to-debug fine-tuning mistakes.
Let's look at two concrete examples. LLaMA 2 uses a template with [INST] and [/INST] tags to wrap user messages, with the system prompt nested inside special <<SYS>> delimiters:
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is LoRA? [/INST] LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that freezes pre-trained weights and injects trainable low-rank matrices into each layer. </s><s>[INST] How do I choose the rank? [/INST] Start with r=16 for most tasks. </s>
Notice the structure: <s> is the beginning-of-sequence token, </s> is end-of-sequence. The user message lives between [INST] and [/INST], and the assistant response comes right after [/INST]. For multi-turn, each turn pair is wrapped in <s>...</s>.
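The layout described above can be sketched as a plain string renderer. This is an illustration only, under the assumption that messages arrive as an optional system message followed by alternating user and assistant turns; the function name `render_llama2` is hypothetical. In a real pipeline, `<s>` and `</s>` are special token IDs inserted by the tokenizer rather than literal text, so the tokenizer's own template should be preferred.

```python
def render_llama2(messages):
    """Render messages into the LLaMA 2 chat layout as a string (sketch only)."""
    system_block = ""
    if messages and messages[0]["role"] == "system":
        # The system prompt is nested inside the first [INST] block.
        system_block = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]

    parts = []
    # Walk the conversation in (user, assistant) pairs.
    for i in range(0, len(messages), 2):
        user = messages[i]
        if user["role"] != "user":
            raise ValueError(f"expected a user turn at position {i}")
        prefix = system_block if i == 0 else ""
        text = f"<s>[INST] {prefix}{user['content']} [/INST]"
        if i + 1 < len(messages):
            assistant = messages[i + 1]
            if assistant["role"] != "assistant":
                raise ValueError(f"expected an assistant turn at position {i + 1}")
            text += f" {assistant['content']} </s>"
        parts.append(text)
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA is a method."},
    {"role": "user", "content": "How do I choose the rank?"},
    {"role": "assistant", "content": "Start with r=16 for most tasks."},
]

rendered = render_llama2(conversation)
print(rendered)
```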
Many newer models (Qwen, Yi, Hermes, and others) use a different standard called ChatML:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is LoRA?<|im_end|>
<|im_start|>assistant
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that freezes pre-trained weights and injects trainable low-rank matrices into each layer.<|im_end|>
<|im_start|>user
How do I choose the rank?<|im_end|>
<|im_start|>assistant
Start with r=16 for most tasks.<|im_end|>
ChatML uses <|im_start|> and <|im_end|> as boundary markers, with the role name (system, user, assistant) on the same line as the start token. It's cleaner and more uniform than the LLaMA 2 template: every role follows the same pattern, which makes parsing easier.
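The uniformity of ChatML shows up clearly in code: one formatting rule covers every role. The sketch below is a hand-rolled renderer for illustration, with the hypothetical name `render_chatml`; the optional `add_generation_prompt` flag mirrors the idea of ending the text with an open assistant turn at inference time. A real pipeline would normally rely on the tokenizer's template instead.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages into ChatML text (illustrative, string level only)."""
    # Every role uses the same pattern: start marker, role, newline, content, end marker.
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        text += "<|im_start|>assistant\n"
    return text


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
]

training_text = render_chatml(
    conversation + [{"role": "assistant", "content": "LoRA is a method."}]
)
inference_text = render_chatml(conversation, add_generation_prompt=True)
print(training_text)
print(inference_text)
```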
Why does this matter so much? Because if you train with one template but run inference with another, the model sees token patterns it has never encountered. The special tokens are the "anchors" the model uses to understand conversational structure. Mismatched templates lead to garbled outputs, the model refusing to stop generating, or responses that start mid-sentence. It's the equivalent of training someone to respond after hearing a bell, then using a whistle at test time.
Fortunately, HuggingFace provides a clean abstraction that handles template selection automatically. Tokenizers expose an apply_chat_template() method that formats your messages according to the chat template stored with the model's tokenizer (chat and instruct checkpoints ship with one; some base models do not):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
messages = [
    {"role": "system", "content": "You are a helpful ML tutor."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method."},
]
# Automatically applies the correct template for this model
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,               # return string, not token IDs
    add_generation_prompt=False,  # don't add assistant prompt at the end
)
print(formatted)
# Output for LLaMA 2:
# <s>[INST] <<SYS>>
# You are a helpful ML tutor.
# <</SYS>>
#
# What is LoRA? [/INST] LoRA is a parameter-efficient fine-tuning method. </s>
# To get token IDs directly for training:
token_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
)
print(f"Sequence length: {token_ids.shape[1]} tokens")
The beauty of apply_chat_template() is that it works the same way regardless of the model. Swap meta-llama/Llama-2-7b-chat-hf for Qwen/Qwen2-7B-Instruct and the same code produces the correct ChatML format automatically. Your data pipeline stays model-agnostic, and when you switch base models, the template switches with it.
Loss Masking: Only Train on Completions
When you train a language model, the loss function measures how well the model predicts the next token at every position in the sequence. But in instruction tuning, should the model be penalised for not "predicting" the user's message? The user's message is given context, not something the model should learn to generate. If we include it in the loss, the model wastes capacity memorising user prompts instead of learning to produce good responses.
This is why loss masking (also called completion-only training) is standard practice in SFT. We compute the cross-entropy loss only on the response tokens (the tokens we want the model to learn to generate) and mask out (set to zero) the loss on all prompt/instruction tokens.
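Formally, given a sequence of tokens $[x_1, \ldots, x_m, y_1, \ldots, y_n]$ where $x_1, \ldots, x_m$ are the prompt tokens (system message + user message) and $y_1, \ldots, y_n$ are the response tokens, the masked SFT loss is:
$$\mathcal{L}_{\text{SFT}}(\theta) = -\frac{1}{n} \sum_{t=1}^{n} \log P_\theta\left(y_t \mid x_1, \ldots, x_m, y_1, \ldots, y_{t-1}\right)$$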
Let's break this down carefully. The sum runs from $t = 1$ to $n$, covering only the response tokens. At each position $t$, the model predicts $y_t$ given everything before it: all $m$ prompt tokens and all previous response tokens $y_1, \ldots, y_{t-1}$. The normalisation factor $\frac{1}{n}$ averages over the number of response tokens, not the total sequence length. Without masking, the loss would instead sum over all $m + n$ positions, including the prompt tokens, which means the model would be trained to predict the user's message, a task that's at best irrelevant and at worst harmful (it can cause the model to "echo" user-like text during generation).
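The averaging described above can be checked numerically with a few lines of plain Python. This is a minimal sketch, not a training loop: the per-token probabilities are made-up numbers standing in for what a model might assign to each correct next token, and `masked_nll` is a hypothetical helper name. Positions labelled -100 are skipped, so the average is taken over the $n$ response tokens only.

```python
import math

IGNORE_INDEX = -100  # label value used to mark positions excluded from the loss


def masked_nll(token_log_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    kept = [lp for lp, label in zip(token_log_probs, labels) if label != IGNORE_INDEX]
    if not kept:
        raise ValueError("no response tokens to train on")
    return -sum(kept) / len(kept)


# Hypothetical probabilities the model assigned to the correct token at each position.
probs = [0.10, 0.20, 0.05, 0.50, 0.80, 0.90]
log_probs = [math.log(p) for p in probs]

# First three positions are prompt tokens (m = 3), last three are response tokens (n = 3).
labels_masked = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 11, 12, 13]
labels_unmasked = [1, 2, 3, 11, 12, 13]

masked_loss = masked_nll(log_probs, labels_masked)
unmasked_loss = masked_nll(log_probs, labels_unmasked)
print(f"masked:   {masked_loss:.4f}")
print(f"unmasked: {unmasked_loss:.4f}")
```

In this toy case the unmasked loss is dominated by the hard-to-predict prompt tokens, which is the effect masking is meant to remove.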
Why does this distinction matter quantitatively? Consider a typical training example where the prompt is 200 tokens and the response is 300 tokens. Without masking, 200 of the 500 per-token loss terms (40%) come from prompt tokens, so roughly 40% of the gradient signal is teaching the model something we don't care about. That's not just wasted compute; it actively competes with the learning signal from the response tokens, because the model's capacity is finite and the optimizer is trying to minimise the sum of both.
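The same arithmetic can be applied across a whole dataset to see how much of an unmasked loss would be spent on prompts. A small sketch, assuming token counts are already available as (prompt length, response length) pairs; the function name and the sample lengths are illustrative only.

```python
def prompt_token_share(lengths):
    """Fraction of all tokens that belong to prompts, given (prompt_len, response_len) pairs."""
    prompt_total = sum(p for p, _ in lengths)
    grand_total = sum(p + r for p, r in lengths)
    return prompt_total / grand_total


# The single example from the text: 200 prompt tokens, 300 response tokens.
single_share = prompt_token_share([(200, 300)])
print(f"{single_share:.0%}")

# Hypothetical dataset with long prompts (e.g. documents to summarise) and short answers.
summarisation_share = prompt_token_share([(900, 100), (1200, 150), (800, 50)])
print(f"{summarisation_share:.0%}")
```

For tasks with long inputs and short outputs, the prompt share can be far higher than 40%, which is where masking matters most.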
In HuggingFace TRL, loss masking has been handled by the DataCollatorForCompletionOnlyLM class (the API has changed across TRL releases, so check that your installed version still provides it). You tell it which string marks the beginning of the assistant's response, and it sets the label to -100 (PyTorch's default ignore index for cross-entropy loss) for every position up to and including that marker:
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# The response template marks where the assistant's response begins.
# For LLaMA 2, the assistant response starts right after "[/INST]"
response_template = "[/INST]"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer,
)
# What this does under the hood (conceptual):
# 1. Tokenize the full conversation: [prompt_tokens..., response_tokens...]
# 2. Find the tokens of "[/INST]" in the sequence; let resp_start be the
#    index of the first token after them
# 3. Set labels[0 : resp_start] = -100  (masked, no loss; includes "[/INST]")
# 4. Set labels[resp_start :] = token_ids[resp_start :]  (compute loss here)
#
# PyTorch's CrossEntropyLoss ignores positions where label == -100,
# so only response positions contribute to the loss and its gradient.
To see loss masking in action more concretely, here's what the labels look like before and after masking:
# Simplified example: what the collator produces
tokens = ["<s>", "[INST]", "What", "is", "LoRA", "?", "[/INST]",
"LoRA", "is", "a", "method", ".", "</s>"]
# WITHOUT masking — loss on every token:
labels_unmasked = ["<s>", "[INST]", "What", "is", "LoRA", "?", "[/INST]",
"LoRA", "is", "a", "method", ".", "</s>"]
# WITH masking — loss only on response:
labels_masked = [ -100, -100, -100, -100, -100, -100, -100,
"LoRA", "is", "a", "method", ".", "</s>"]
# The -100 entries produce zero loss and zero gradient.
# Only the 6 response tokens contribute to learning.
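The masking step itself is simple enough to sketch on raw token IDs. The version below is an illustration of the idea, not TRL's implementation: it finds the first occurrence of the response marker, masks everything up to and including it, and keeps the rest as labels. The token IDs are made up (10 stands for [INST], 11 for [/INST]), the helper names are hypothetical, and only single-turn sequences are handled.

```python
IGNORE_INDEX = -100


def find_subsequence(sequence, pattern):
    """Return the index where pattern first starts in sequence, or -1 if absent."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt(token_ids, response_template_ids):
    """Build labels that ignore every token up to and including the response marker."""
    start = find_subsequence(token_ids, response_template_ids)
    if start == -1:
        # No marker found: mask the whole sequence instead of training on a prompt.
        return [IGNORE_INDEX] * len(token_ids)
    response_start = start + len(response_template_ids)
    return [IGNORE_INDEX] * response_start + token_ids[response_start:]


# Made-up IDs: 1 = <s>, 10 = [INST], 11 = [/INST], 2 = </s>
token_ids = [1, 10, 20, 21, 22, 23, 11, 30, 31, 32, 33, 34, 2]
labels = mask_prompt(token_ids, response_template_ids=[11])
print(labels)

n_trained = sum(1 for label in labels if label != IGNORE_INDEX)
print(f"{n_trained} of {len(labels)} positions contribute to the loss")
```

Masking the whole sequence when the marker is missing is a deliberately conservative choice in this sketch: a malformed example then contributes nothing rather than teaching the model to generate prompts.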
Building Your Dataset
Where does good instruction data come from? You have four main sources, each with a different quality-cost trade-off, and most successful fine-tuning projects use a combination of them.
Public datasets are the easiest starting point. The open-source community has released dozens of high-quality instruction datasets that you can use directly or mix into your own:
- Open-Orca / SlimOrca: large-scale datasets (~500K–1M examples) generated by GPT-4 and GPT-3.5, filtered for quality. Good general-purpose coverage.
- Dolly (Databricks): 15K examples written by Databricks employees. Smaller but human-authored, which means more natural and fewer artifacts.
- OASST1 (Open Assistant): roughly 161K messages organised into about 66K conversation trees, contributed by volunteers. Multi-turn and multilingual.
- UltraChat: 1.5M multi-turn dialogues generated by ChatGPT, covering a wide range of topics. Good for conversation fine-tuning.
Synthetic generation uses a strong model (GPT-4, Claude, or another frontier model) to generate training examples for a weaker model. This is the approach Stanford used for Alpaca: they took 175 seed instruction-output pairs, sent them to GPT-3.5 as few-shot examples, and generated 52,000 new examples. The broader idea, formalised as Self-Instruct (Wang et al., 2023), is that a model can bootstrap its own instruction data from a small seed set. We'll explore synthetic data in more depth in the next section.
Manual curation is the gold standard for quality. Having domain experts write instruction-response pairs produces the cleanest, most accurate data. This is what the LIMA paper did — 1,000 examples, each handpicked from sources like Stack Exchange, wikiHow, and Reddit, then manually filtered to keep only the best. The downside is obvious: it's expensive and slow. But if you're building a model for a specific domain (medical, legal, financial), manually curated examples from actual experts are often irreplaceable.
Domain data conversion transforms existing assets — documentation, FAQs, support tickets, internal wikis — into instruction format. If your company has a knowledge base with 10,000 question-answer pairs already, that's 10,000 training examples waiting to be reformatted. The conversion is often mechanical:
import json
# Your existing FAQ data
faqs = [
    {
        "question": "How do I reset my password?",
        "answer": "Go to Settings > Security > Reset Password. You'll receive a confirmation email within 5 minutes."
    },
    {
        "question": "What file formats do you support?",
        "answer": "We support PDF, DOCX, TXT, and CSV files up to 50MB each."
    },
]
# Convert to OpenAI messages format
training_examples = []
for faq in faqs:
    example = {
        "messages": [
            {"role": "system", "content": "You are a helpful customer support agent for Acme Corp. Answer questions accurately and concisely."},
            {"role": "user", "content": faq["question"]},
            {"role": "assistant", "content": faq["answer"]},
        ]
    }
    training_examples.append(example)
# Save as JSONL (one JSON object per line, the standard format)
with open("training_data.jsonl", "w") as f:
    for ex in training_examples:
        f.write(json.dumps(ex) + "\n")
print(f"Created {len(training_examples)} training examples")
Regardless of the source, there are quality signals you should check before training. Think of this as a pre-flight checklist for your dataset:
- Response substance: responses should be substantive relative to prompts. If most responses are shorter than the prompts, the model may learn to give terse, unhelpful answers.
- Instruction diversity: a dataset where 80% of instructions are "summarise" will produce a model that's great at summarisation and mediocre at everything else. Aim for diverse instruction types: explain, compare, debug, write, classify, translate, brainstorm.
- Internal consistency: if one example says "always use bullet points" and another says "never use bullet points", the model receives contradictory gradient signals. Audit for conflicting instructions, especially around formatting and style.
- Deduplication: near-duplicate examples waste compute and can cause overfitting to specific phrasings. Use techniques like MinHash or exact substring matching to find and remove duplicates.
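Three of the checks above (response substance, instruction diversity, deduplication) can be approximated with a short audit script. This is a rough sketch under stated assumptions: character length stands in for token length, the first word of each instruction stands in for its type, and near-duplicates are found with exact Jaccard similarity over word trigrams, which is a simpler stand-in for MinHash that only suits small datasets. The function names and threshold are illustrative, and internal consistency still needs a human or model-based review.

```python
from collections import Counter


def word_shingles(text, k=3):
    """Set of k-word windows from the text (the whole text if it is shorter than k words)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def audit(examples, dup_threshold=0.8):
    """Flag short responses, count leading instruction words, and list near-duplicate pairs."""
    short = [i for i, ex in enumerate(examples)
             if len(ex["response"]) < len(ex["instruction"])]
    leading = Counter(ex["instruction"].split()[0].lower()
                      for ex in examples if ex["instruction"].split())
    shingles = [word_shingles(ex["instruction"]) for ex in examples]
    # Pairwise comparison is O(n^2): fine for a sketch, too slow for large datasets.
    near_dups = [(i, j)
                 for i in range(len(examples))
                 for j in range(i + 1, len(examples))
                 if jaccard(shingles[i], shingles[j]) >= dup_threshold]
    return {"short_responses": short, "leading_words": leading, "near_duplicates": near_dups}


examples = [
    {"instruction": "Summarise the following article about solar power in two sentences",
     "response": "Short."},
    {"instruction": "Summarise the following article about solar power in three sentences",
     "response": "Solar power converts sunlight into electricity using photovoltaic cells, and costs have fallen sharply."},
    {"instruction": "Explain what a hash table is",
     "response": "A hash table maps keys to values by hashing each key to a bucket index."},
]

report = audit(examples, dup_threshold=0.5)
print(report)
```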
Finally, if you're combining multiple sources (say, SlimOrca for general coverage plus your own domain data), balance by quality, not volume. A common mistake is mixing 100K generic examples with 1K high-quality domain examples and expecting the domain behaviour to come through. The 100K examples will dominate training. Instead, upsample the high-quality data or downsample the generic data to bring them closer to balance.
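One way to bring two sources closer to balance is to downsample the larger one to a target ratio. The sketch below does that with a fixed random seed so the mixture is reproducible. The function name `mix_datasets` and the `domain_share` parameter are assumptions for illustration; upsampling the domain data by repeating examples is the alternative, and the right ratio is something to tune empirically.

```python
import random


def mix_datasets(generic, domain, domain_share=0.5, seed=0):
    """Downsample generic examples so domain examples make up about domain_share of the mix."""
    if not 0 < domain_share < 1:
        raise ValueError("domain_share must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Number of generic examples needed for the requested ratio, capped by what exists.
    target_generic = round(len(domain) * (1 - domain_share) / domain_share)
    target_generic = min(target_generic, len(generic))
    mixture = rng.sample(generic, target_generic) + list(domain)
    rng.shuffle(mixture)
    return mixture


generic = [{"source": "generic", "id": i} for i in range(1000)]
domain = [{"source": "domain", "id": i} for i in range(10)]

mixture = mix_datasets(generic, domain, domain_share=0.25)
n_domain = sum(1 for ex in mixture if ex["source"] == "domain")
print(f"{len(mixture)} examples, {n_domain / len(mixture):.0%} domain")
```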
Synthetic Data: Teaching with a Teacher Model
What if you need thousands of training examples but don't have the budget for manual annotation? This is where synthetic data generation comes in — using a strong model (the "teacher") to generate training data for a weaker model (the "student"). It's often called distillation in the broad sense, because you're distilling the teacher's capabilities into training signal for the student.
The original Alpaca approach was straightforward: take 175 seed examples written by humans, use them as few-shot prompts for GPT-3.5 (OpenAI's text-davinci-003), and ask it to generate new instruction-output pairs. This produced 52,000 examples at a cost of under $500. The resulting model, fine-tuned from LLaMA-7B, was surprisingly capable for its size, demonstrating that cheap synthetic data can transfer real capability.
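The few-shot step can be pictured as a prompt builder that samples a handful of seed tasks and leaves an open slot for the teacher model to fill. The sketch below is only in the spirit of that approach: the prompt wording, the function name `build_generation_prompt`, and the seed tasks are all invented for illustration and are not the prompt used by Alpaca or Self-Instruct.

```python
import random


def build_generation_prompt(seed_tasks, n_shots=3, seed=0):
    """Assemble a few-shot prompt that asks a teacher model for one new task."""
    rng = random.Random(seed)
    shots = rng.sample(seed_tasks, min(n_shots, len(seed_tasks)))
    lines = ["Come up with a new task in the same style as the examples below.", ""]
    for i, task in enumerate(shots, start=1):
        lines.append(f"Task {i}")
        lines.append(f"Instruction: {task['instruction']}")
        lines.append(f"Output: {task['output']}")
        lines.append("")
    # Leave the final task open so the teacher completes the instruction and output.
    lines.append(f"Task {len(shots) + 1}")
    lines.append("Instruction:")
    return "\n".join(lines)


seed_tasks = [
    {"instruction": "Give three tips for staying healthy.",
     "output": "Eat well, sleep enough, and exercise regularly."},
    {"instruction": "Convert 100 degrees Fahrenheit to Celsius.",
     "output": "About 37.8 degrees Celsius."},
    {"instruction": "Name the capital of France.",
     "output": "Paris."},
]

prompt = build_generation_prompt(seed_tasks, n_shots=2)
print(prompt)
```

Sampling a different subset of seeds for each call is what keeps the generated instructions from collapsing onto a single pattern.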
But simple generation has a problem: the instructions tend to be repetitive and easy. "Write a poem about X", "Summarise the following text", "What is X?": the teacher model falls into comfortable patterns. Xu et al. (2023) addressed this with Evol-Instruct, the method behind WizardLM. The idea is to start with simple instructions and use an LLM to progressively make them more complex through specific evolution strategies:
# Evol-Instruct evolution strategies (conceptual)
evolution_prompts = {
    "add_constraints": """I want you to act as a Prompt Rewriter.
Given the prompt: "{instruction}"
Rewrite it by adding one or more constraints or requirements.
Make sure the new prompt is reasonable and can be answered.""",
    "deepen": """I want you to act as a Prompt Rewriter.
Given the prompt: "{instruction}"
Rewrite it to require deeper thinking or more detailed reasoning.
The new prompt should test understanding, not just recall.""",
    "concretize": """I want you to act as a Prompt Rewriter.
Given the prompt: "{instruction}"
Rewrite it by replacing general concepts with specific, concrete ones.
Add specific numbers, names, or scenarios.""",
    "increase_reasoning": """I want you to act as a Prompt Rewriter.
Given the prompt: "{instruction}"
Rewrite it to require multi-step reasoning or problem decomposition.
The answer should need at least 3 logical steps.""",
}
# Example evolution chain:
# Step 0: "What is gradient descent?"
# Step 1 (deepen): "Explain how gradient descent can get stuck in
#   local minima and what techniques help escape them."
# Step 2 (add_constraints): "Explain how gradient descent can get stuck
#   in local minima for non-convex loss surfaces, comparing at
#   least three escape techniques with their trade-offs."
# Step 3 (concretize): "For a ResNet-50 trained on ImageNet with SGD,
#   explain how gradient descent can get stuck in local minima,
#   comparing momentum, learning rate warmup, and stochastic
#   weight averaging with specific examples of when each helps."
Each evolution step produces a harder, more nuanced instruction. After evolving, you send the evolved instruction to a strong model (GPT-4 or similar) to generate the response. The result is a dataset where the instructions range from simple to complex, which is exactly the diversity you need for a well-rounded model.
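The evolution loop can be sketched independently of any particular API by passing the rewriting model in as a function. In the sketch below, `evolve` picks a strategy at random at each step, fills in the prompt, and keeps the rewrite only if it is non-empty and different from its parent. The shortened strategy prompts, the function names, and the `fake_rewriter` stand-in (which just appends a sentence instead of calling an LLM) are all assumptions made so the example runs offline.

```python
import random

# Shortened, illustrative strategy prompts; a real setup would use fuller wording.
EVOLUTION_PROMPTS = {
    "add_constraints": 'Rewrite the prompt "{instruction}" by adding one or more constraints.',
    "deepen": 'Rewrite the prompt "{instruction}" so that it requires deeper reasoning.',
}


def evolve(instruction, rewrite_fn, n_steps=3, seed=0):
    """Apply n_steps of randomly chosen evolution strategies, returning the whole chain."""
    rng = random.Random(seed)
    chain = [instruction]
    for _ in range(n_steps):
        strategy = rng.choice(sorted(EVOLUTION_PROMPTS))
        prompt = EVOLUTION_PROMPTS[strategy].format(instruction=chain[-1])
        candidate = rewrite_fn(prompt).strip()
        # Simple guard: skip rewrites that are empty or identical to the parent.
        if candidate and candidate != chain[-1]:
            chain.append(candidate)
    return chain


def fake_rewriter(prompt):
    """Offline stand-in for an LLM call: extracts the quoted instruction and extends it."""
    start = prompt.index('"') + 1
    end = prompt.rindex('"')
    return prompt[start:end] + " Justify each step."


chain = evolve("What is gradient descent?", fake_rewriter, n_steps=3)
for step, text in enumerate(chain):
    print(step, text)
```

Keeping every step of the chain, rather than only the final instruction, is what yields a dataset spanning simple to complex prompts.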
Raw synthetic data is noisy, though. The teacher model sometimes generates low-quality responses: factually wrong, overly verbose, off-topic, or lazy ("As an AI language model, I..."). Quality filtering is essential. A common pipeline looks like this:
import difflib
import re
import openai
def is_near_duplicate(example, kept, threshold=0.9):
    """Crude near-duplicate check on instructions (a hypothetical stand-in; use MinHash in practice)."""
    return any(
        difflib.SequenceMatcher(None, example["instruction"], other["instruction"]).ratio() >= threshold
        for other in kept
    )
def judge_quality(instruction, response):
    """Use a strong model to rate response quality 1-5 (returns 0 if no score can be parsed)."""
    judgment = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Rate the quality of the response on a scale of 1-5. Consider accuracy, helpfulness, and completeness. Reply with just the number."},
            {"role": "user", "content": f"Instruction: {instruction}\nResponse: {response}"},
        ],
        temperature=0.0,
    )
    # The judge may add stray text, so extract the first digit in range instead of calling int() directly
    match = re.search(r"[1-5]", judgment.choices[0].message.content or "")
    return int(match.group()) if match else 0
def generate_and_filter(evolved_instructions, n_target=10000):
    """Generate synthetic data with multi-stage filtering."""
    raw_examples = []
    # Stage 1: Generate responses for evolved instructions
    for instruction in evolved_instructions:
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Answer the following instruction helpfully and accurately."},
                {"role": "user", "content": instruction},
            ],
            temperature=0.7,  # some diversity, not too random
        )
        raw_examples.append({
            "instruction": instruction,
            "response": response.choices[0].message.content,
        })
    # Stage 2: Rule-based filtering
    filtered = []
    for ex in raw_examples:
        # Remove short or lazy responses
        if len(ex["response"]) < 100:
            continue
        # Remove refusals and meta-commentary
        if any(phrase in ex["response"].lower() for phrase in [
            "as an ai", "i cannot", "i'm sorry, but"
        ]):
            continue
        # Remove near-duplicates (simplified; use MinHash in practice)
        if is_near_duplicate(ex, filtered):
            continue
        filtered.append(ex)
    # Stage 3: LLM-as-judge scoring (optional but effective)
    scored = []
    for ex in filtered:
        score = judge_quality(ex["instruction"], ex["response"])
        if score >= 4:  # keep only 4/5 and 5/5
            scored.append(ex)
    return scored[:n_target]
This three-stage pipeline — generate, filter by rules, filter by judge — is a pattern you'll see across most serious synthetic data efforts. The rule-based stage is cheap and removes obvious junk. The LLM-as-judge stage is more expensive but catches subtler quality issues: responses that are technically correct but poorly structured, or accurate but unhelpful.
There is a fundamental trade-off to be aware of: synthetic data inherits the biases and limitations of the teacher model. If GPT-4 tends to be verbose, your synthetic data will be verbose, and your student model will learn verbosity. If the teacher makes factual errors on a topic, those errors propagate. Many leading open-source models (Mistral, Qwen, LLaMA-3, and others) use synthetic data extensively, but they combine it with human-curated data and careful filtering to mitigate these issues.
Quiz
Test your understanding of instruction data design and preparation.
According to the LIMA paper (Zhou et al., 2023), why can just 1,000 examples be sufficient for instruction fine-tuning?
Why is loss masking important during supervised fine-tuning?
What happens if you train a model with one chat template (e.g., LLaMA 2's [INST] format) but run inference with a different template (e.g., ChatML)?
What is the core idea behind Evol-Instruct (WizardLM)?