Lesson 7.1: How do you write a RAG prompt that actually works?

Mastering RAG- The Brain Friendly Way: Module 7: Making the LLM Use What You Retrieved

May 26, 2026

You’ve built solid retrieval. Your chunks are clean, your embeddings are tuned, and the right documents are landing in the prompt. So why is the LLM still answering like it’s never seen them? In this lesson, we’ll dissect the anatomy of a RAG prompt — six parts, four classic failures, and the small structural choices that decide whether your model grounds itself in your docs or wanders off into its training data.

🧠 Brain Power

Before reading on, pause. If you had to guess, which part of the prompt is most likely making the model “go rogue” and answer from memory?

A) The role definition
B) Missing grounding instruction
C) The temperature setting
D) The order of context and question

(Hint: We’ll find out it’s two of these. Keep reading.)

The Anatomy of a RAG Prompt

A RAG prompt isn’t one thing. It’s six things stacked together. Skip one, and the whole thing wobbles.

Think of a RAG prompt like a recipe. You can have the best ingredients (great retrieval), but if you skip the step that says "preheat the oven" or "add salt at the end," the dish fails. Each of the six parts plays a different role: some narrow the model's behavior, some give it raw material, and some tell it how to package the output. Missing any one creates a specific class of bug, which is why we'll map each part to the failures it prevents.

Let’s walk through them, one by one, like checking the parts of an engine.

Part 1: Role Definition: “Who are you, model?”

The role tells the model what hat to wear. It narrows the model’s behavior.

Example:
You are a customer support assistant for Acme Corp.
You answer questions strictly based on Acme’s internal documentation.

Compare with no role at all, the model defaults to “helpful general AI,” which means it’ll happily blend in things it learned on the internet in 2025.

Why is the "helpful general AI" default dangerous? When no role is set, the model assumes it should be maximally helpful using everything it knows. So if you ask "What's our refund policy?", it might combine your retrieved doc with general knowledge about Amazon's, Shopify's, or random e-commerce refund norms. By assigning a narrow role ("customer support assistant for Acme Corp"), you create a behavioral boundary: the model now thinks of itself as someone who only knows Acme's world. It's a small psychological nudge, but it measurably reduces drift.

Part 2: Instruction: “What do I do with this stuff?”

This is where you tell the model how to use the retrieved docs. The single most important sentence in any RAG prompt:

“Answer only using the information provided in the context below. If the context does not contain enough information, say: ‘I don’t have enough information to answer that.’”

Why does this one sentence matter so much? It does two jobs at once:

“Answer only using…” — locks the model’s source of truth. It tells the model: parametric knowledge is off-limits.
“If the context does not contain enough information, say…” — gives the model a graceful exit. Without this, the model has only one option when context is missing: invent something. Adding the escape hatch turns hallucination into honest uncertainty — which is what you actually want in production.

Part 3: The Context Block: “Here are your sources”

Here’s where most beginners trip. They concatenate all chunks into one blob:

Example:
❌ BAD:
Context: Refunds are issued within 14 days. Customers must contact support 
within 30 days. International orders take longer to refund. Refunds require 
the original receipt...

The model sees one giant document and can’t distinguish sources. It can’t tell you which chunk it used.

Why does concatenation confuse the model? LLMs reason about text using patterns and boundaries. When chunks are glued together, the model has no signal that says "this sentence came from doc A, that one came from doc B." It treats the whole block as a single narrative. So if doc A says "refunds in 14 days" and doc B says "refunds in 30 days," the model might average them ("refunds in 14–30 days") or pick whichever sounds more natural, not whichever is actually correct for the question. Numbered, delimited chunks restore those boundaries.

Do this instead:

Example:
✅ GOOD:
[1] (Source: refund_policy_2024.pdf, Section 2)
Refunds are issued within 14 days of return receipt.

[2] (Source: support_faq.md, Q12)
Customers must contact support within 30 days of purchase.

[3] (Source: international_orders.pdf, Page 4)
International orders may take 21 days for refund processing.

Three things to notice:

Numbered ([1], [2], [3]) → enables citation
Source metadata → traceable
Visually delimited → the model treats them as separate documents

Part 4: Citation Instruction: “Show your work”

A short sentence with massive ROI:

"When you use information from a source, cite it by its number, e.g., [1] or [2][3]."

Why does this matter? Because now your users (and your QA team) can verify the answer. And when the model is forced to cite, it hallucinates noticeably less.

Why does forced citation reduce hallucination? When the model has to attach a source number to each claim, it must mentally “anchor” every fact to a specific chunk before generating it. If no chunk supports the claim, the model has no number to attach — and that friction makes it more likely to either skip the claim or admit uncertainty. It’s the same reason students cheat less on open-book exams that require citations: the act of pointing to a source forces honesty. Researchers call this attribution-induced grounding.

Part 5: The Question: the user’s actual query

Goes near the end, after the context. We’ll see why in a minute.

Part 6: Format Instruction: “How should the answer look?”

Don’t leave this to chance:

"Answer in 2–3 sentences. Use a bulleted list if there are multiple steps. Always cite your sources."

Without this, you get answers ranging from one word to seven paragraphs. With it, you get something predictable enough to put in a UI.

Why predictability matters in production: if your frontend expects a short paragraph but the model returns a 500-word essay, your UI breaks. If it expects bullet points but gets prose, your parser breaks. Format instructions are the contract between the model and your application.

System Prompt vs User Turn — Where Does Everything Go?

A common confusion. Here’s the clean split:

Why this split? The system prompt sets stable, conversation-wide rules — things that don’t change between turns. The user turn carries per-query data — the retrieved chunks (which differ for every question) and the question itself. Mixing these up causes two real bugs:

Putting rules in the user turn → the model may “forget” them across turns.
Putting retrieved chunks in the system prompt → on multi-turn chats, old context lingers and contaminates new questions. A clean rule of thumb: if it’s the same for every query, it’s a system prompt. If it changes per query, it’s the user turn.

The Four Classic Prompt Failures

Let’s go back to those bug reports. Every RAG failure usually traces back to one of these.

Failure #1: No delimiter between chunks

Example:
❌ Context: chunk1 text chunk2 text chunk3 text...

Symptom: Model can’t cite. Can’t tell sources apart. Mixes facts from different documents into Frankenstein answers.

Fix: Number and label every chunk. Always.

Failure #2: No grounding instruction

Example:
❌ “Use the context below to answer the question.”

Fix:
✅ “Answer ONLY using the provided context. Do not use prior knowledge.”

Why is "use" too soft? Models are trained on millions of human-written instructions where "use this reference" usually means "consult it as one of many sources." So when you write "use the context," the model maps it to its training pattern of "consult, then synthesize with what I know." Switching to "ONLY using" changes the linguistic signal entirely — it activates patterns associated with strict, exclusive instructions (like legal, medical, or compliance writing) where deviation isn't allowed. The word choice literally shifts which behavioral patterns the model activates.

Failure #3: No “I don’t know” escape hatch

Without explicit permission to say “I don’t know,” the model will fabricate. Models are trained to be helpful, and silence feels unhelpful.

Fix:

Example:
✅ “If the answer is not present in the context, reply: 
‘I don’t have enough information to answer that.’”

Failure #4: Context placed after the question

Example:
❌ User: What is our refund window?
   Context: [1] Refunds are issued within 14 days...

LLMs pay more attention to text that comes before the question, not after. Bury context at the bottom and the model treats it as an afterthought.

Fix:

Example:
✅ Context: [1] ... [2] ... [3] ...
   Question: What is our refund window?

Why does position matter? LLMs generate text by predicting the next token based on everything that came before. The closer information is to the point of generation, the stronger its influence on the output. When the question comes last, the model reads the context first → builds its understanding → then sees the question → and the answer flows naturally from the just-read context. When the question comes first, the model has already started forming an answer (often from parametric memory) before it sees the context, and the context becomes a weak afterthought. There's also a known issue called "lost in the middle": information at the start and end of a long prompt gets more attention than info buried in the middle. Putting context first and question last places both critical pieces in high-attention zones.

The Reusable RAG Prompt Template

System prompt

Example:
You are a {{ROLE}} that answers questions using ONLY the documents 
provided in the user message.

Rules:
1. Answer ONLY using information from the provided context.
2. If the context does not contain enough information to answer,
   reply exactly: “I don’t have enough information to answer that.”
3. Do NOT use prior knowledge or make assumptions.
4. When you use information from a source, cite it by its number,
   e.g., [1] or [2][3].
5. Format your answer as: {{FORMAT}}.

How to fill in the placeholders:

{{ROLE}} → Be specific and narrow. “customer support assistant for Acme Corp” beats “helpful assistant.” The narrower the role, the tighter the behavior.
{{FORMAT}} → Match your UI. Examples:
- "a single short paragraph (2–3 sentences) with inline citations"
- "a bulleted list, each bullet ending with the source number"
- "a JSON object with keys 'answer' and 'sources'"

Pro tip: keep one template per use case (FAQ bot, internal search, code assistant). Trying to make one template fit all use cases dilutes each.

User turn

Example:
Context:
[1] (Source: {{source_1}})
{{chunk_1_text}}

[2] (Source: {{source_2}})
{{chunk_2_text}}

[3] (Source: {{source_3}})
{{chunk_3_text}}

Question: {{user_question}}

Before & After — A Real Example

Before (all four failures):

Example:
Answer the user’s question. 
Refunds are within 14 days. Support must be contacted within 30 days.
What is our refund policy?

Output: “Most companies offer 30-day refund windows...” 😱 (parametric knowledge leak)

Why did this fail? Look at how many failures stack here:

No role → the model defaults to “general AI” and reaches for industry-wide knowledge.
Soft instruction → “Answer the user’s question” doesn’t restrict the source of truth.
No delimiter → the two facts (”14 days” and “30 days”) are glued together; the model can’t tell them apart.
No “I don’t know” escape → the model feels obligated to produce something.
No citation rule → the model has no incentive to anchor its claims to the provided text.

The result? The model ignores the actual context entirely and falls back on what it “knows” about typical refund policies. That’s a parametric knowledge leak — the model answered from its training data, not from your docs.

After (template applied):

Example:
SYSTEM:
You are a customer support assistant. Answer ONLY using the context.
If insufficient info, say “I don’t have enough information to answer that.”
Cite sources with [n].

USER:
Context:
[1] (Source: refund_policy_2024.pdf)
Refunds are issued within 14 days of return receipt.

[2] (Source: support_faq.md)
Customers must contact support within 30 days of purchase.

Question: What is our refund policy?

Output: “Customers must contact support within 30 days of purchase [2]. Once a return is received, refunds are issued within 14 days [1].” ✅

Same retrieval. Same model. Different prompt. Night and day difference.

🧠 Brain Power: Answer

Looking back at our earlier puzzle, the two big culprits behind “rogue” answers are:

B) Missing grounding instruction (”answer ONLY from context”)
D) Context placed after the question

A good role helps a little. Temperature barely matters here. But these two structural choices define whether your RAG works or not.

Why not A or C?
A) Role definition helps shape tone and scope, but a strong role with a weak grounding instruction still hallucinates. It’s a supporting actor, not the lead.
C) Temperature controls randomness in word choice, not whether the model uses parametric knowledge. A model at temperature 0 will hallucinate just as confidently as one at temperature 1 if it’s not grounded properly. Temperature is a stylistic dial, not a factuality dial.
The two winners (B and D) are structural, they change what information the model sees and how strictly it’s told to use it.

A great RAG prompt isn't about wording, it's about structure. Nail the six parts, avoid the four classic failures, and you'll turn the same retrieval and the same model to give better answers. Next up: what happens when even a perfect prompt isn't enough because the question itself spans multiple documents.

AI Simplified

Discussion about this post

Ready for more?