Pillar Guide · March 24, 2026 · 15 min read

RAG — Retrieval-Augmented Generation: The Complete Guide

How to connect your AI applications with real company knowledge using RAG — architecture, implementation, and best practices from the field.


Harald Schwankl

Dipl.-Ing., Fullstack Developer & AI Specialist


Introduction: What is RAG?

Retrieval-Augmented Generation — RAG for short — is one of the most effective methods for connecting Large Language Models (LLMs) with current, company-specific knowledge. The principle is elegant: Instead of teaching the language model everything during training, we provide it with exactly the information it needs for a given question at runtime.

I have been working intensively with RAG systems for over two years — from internal knowledge bases to contract analysis to customer support automation. In this guide, I share my complete practical knowledge: How a RAG pipeline is structured, which decisions matter at which stage, and what mistakes you should avoid.

Whether you are building an AI assistant for your company, modernizing a document search, or want to understand how RAG works under the hood — this guide gives you the necessary foundation. At the end, you will find a link to our live RAG demo where you can try out what you have learned firsthand.

The Problem: Why LLMs Alone Are Not Enough

Large Language Models like Mistral, GPT, or Claude are impressive — they can summarize texts, write code, and answer complex questions. But they have three fundamental weaknesses that make their use in enterprises risky without additional measures.

1. Hallucinations

LLMs generate text based on statistical patterns. When they lack reliable information about a topic, they invent plausible-sounding answers. In a business context — such as legal questions, technical documentation, or customer communication — this can have fatal consequences. A chatbot that cites false contract clauses is worse than no chatbot at all.

2. Outdated Knowledge

An LLM's knowledge is limited to its training cutoff date. A model trained in January 2025 knows nothing about events in March 2026. For companies whose products, prices, and policies change constantly, this is a dealbreaker.

3. No Access to Company Knowledge

No publicly available LLM knows your internal processes, contract documents, product specifications, or customer data. It simply was not trained on them. Even if it could be — you do not want to feed your confidential data into a third-party training dataset.

The Solution: RAG

RAG solves all three problems simultaneously. Instead of retraining the model, we provide it with the right documents as context. The model answers based on facts rather than patterns. Sources are traceable, knowledge is current, and confidential data remains under your control.

The RAG Architecture in Detail

A RAG pipeline consists of two phases: the indexing phase (once per document) and the query phase (for every user question). Here is the complete flow:

RAG Pipeline — Complete Architecture

  INDEXING PHASE (offline)             QUERY PHASE (per request)
  ========================             =========================

  [Document]                           [User Question]
      |                                     |
      v                                     v
  1. Parsing                           5. Query Embedding
  (PDF/MD/TXT -> Text)                (Question -> Vector)
      |                                     |
      v                                     v
  2. Chunking                          6. Similarity Search
  (Text -> Segments)                   (Top-K most similar chunks)
      |                                     |
      v                                     v
  3. Embedding                         7. Context Assembly
  (Chunks -> Vectors)                  (Chunks + System Prompt)
      |                                     |
      v                                     v
  4. Vector Store                      8. LLM Generation
  (pgvector/Supabase)                  (Answer with source citations)

Phase 1: Indexing

During the indexing phase, your documents are prepared and made searchable. First, they are parsed — meaning the raw text is extracted from PDFs, Markdown files, or other formats. Then the text is divided into smaller segments (chunks), typically 500 to 2000 characters long, with overlaps to prevent context loss at chunk boundaries.

Each chunk is then converted by an embedding model into a high-dimensional vector — a mathematical representation of its meaning. These vectors are stored alongside the original text in a vector database.
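To make the four indexing steps concrete, here is a deliberately simplified, self-contained sketch. It uses fixed-size chunking and a toy embed() function in place of a real embedding model, and a plain Python list in place of a vector database — all simplifications for illustration, not the production pipeline described in this guide:

```python
# Minimal, self-contained sketch of the indexing phase (steps 1-4).
# embed() is a toy stand-in for a real embedding model, and the "store"
# is a plain list instead of a vector database.

def embed(text: str, dims: int = 8) -> list[float]:
    """Toy embedding: a normalized character-frequency fingerprint
    (NOT semantically meaningful -- a real model goes here)."""
    vec = [0.0] * dims
    for ch in text.lower():
        vec[ord(ch) % dims] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def index_document(text: str, chunk_size: int = 200) -> list[dict]:
    """Chunk the (already parsed) text, embed each chunk, and 'store' it."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [
        {"chunk_index": i, "chunk_text": c, "embedding": embed(c)}
        for i, c in enumerate(chunks)
    ]

store = index_document("Parsed document text. " * 30)
```

The real pipeline swaps in the paragraph-based chunker and the Scaleway embedding call shown in the following sections, but the data flow — text in, rows of (index, text, vector) out — stays the same.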

Phase 2: Query

When a user asks a question, it goes through the same embedding process. The resulting question vector is compared against all stored chunk vectors. The most similar chunks — measured by cosine similarity — are selected as context.

This context is passed to the LLM along with the original question and a system prompt. The model generates an answer grounded in the provided facts, ideally referencing the specific sources.

The result: precise, verifiable answers based on your own data — without hallucinations, without outdated knowledge.

Document Ingestion & Chunking

The quality of your RAG answers stands and falls with the quality of your chunking. Poor chunks lead to irrelevant retrieval results, and the LLM cannot generate good answers from bad context.

Document Parsing

The first step is text extraction. In our implementation, we support PDF, Markdown, and plaintext. PDF parsing is the biggest challenge — tables, multi-column layouts, and scanned documents require special treatment:

python
from PyPDF2 import PdfReader  # note: PyPDF2 is now maintained as "pypdf"; the API used here is identical
import io

def parse_document(content: bytes, filename: str, content_type: str) -> str:
    """Extracts text from PDF, Markdown, or plaintext."""
    if content_type in ("text/plain", "text/markdown"):
        return content.decode("utf-8", errors="replace")

    if content_type == "application/pdf":
        reader = PdfReader(io.BytesIO(content))
        text_parts = []
        for page in reader.pages:
            text = page.extract_text()
            if text:
                text_parts.append(text)
        return "\n\n".join(text_parts)

    raise ValueError(f"Unsupported format: {content_type}")

Chunking Strategies

There are various approaches to splitting text into chunks:

  • Fixed-size chunking: Fixed character count, simple but context-unaware
  • Paragraph-based: Splits at paragraph boundaries, respects natural text structure
  • Semantic chunking: Uses embeddings to detect semantic boundaries, more complex but more precise
  • Recursive character splitting: Attempts to split at natural boundaries (paragraphs, sentences, words)

In practice, I use paragraph-based chunking with sentence fallback — it offers the best compromise between quality and simplicity:

python
import re

def chunk_document(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Splits text into chunks with overlap. Paragraph-based with sentence fallback."""
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    chunks: list[str] = []
    current_chunk = ""

    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue

        # Paragraph fits into the current chunk: append and continue.
        if len(current_chunk) + len(paragraph) + 2 <= chunk_size:
            current_chunk += ("\n\n" if current_chunk else "") + paragraph
            continue

        # Paragraph does not fit: flush the current chunk, keep an overlap tail.
        if current_chunk:
            chunks.append(current_chunk)
            current_chunk = current_chunk[-overlap:] + "\n\n"

        if len(current_chunk) + len(paragraph) <= chunk_size:
            current_chunk += paragraph
        else:
            # Sentence fallback for paragraphs longer than chunk_size.
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
                if len(current_chunk) + len(sentence) + 1 <= chunk_size:
                    current_chunk += (" " if current_chunk else "") + sentence
                else:
                    if current_chunk:
                        chunks.append(current_chunk)
                    current_chunk = sentence

    if current_chunk:
        chunks.append(current_chunk)
    return chunks

Why overlap matters: Without overlap between chunks, information can be lost at the cut points. With 200 characters of overlap, we ensure that context is preserved. In our implementation, we use 2000 characters per chunk with 200 characters of overlap — approximately 500 tokens per chunk, a solid value for most use cases.
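The arithmetic behind those parameters, using the common heuristic of roughly 4 characters per English token (an approximation, not an exact tokenizer count):

```python
# Rule-of-thumb check for the chunk parameters above. The 4-characters-per-token
# ratio is a rough heuristic for English text, not a tokenizer measurement.
CHARS_PER_TOKEN = 4
chunk_size = 2000
overlap = 200

approx_tokens = chunk_size // CHARS_PER_TOKEN  # roughly 500 tokens per chunk
overlap_ratio = overlap / chunk_size           # 10% of each chunk is overlap
```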

Embedding Models: The Heart of RAG

Embeddings are numerical representations of text in a high-dimensional vector space. Texts with similar meanings have similar vectors — this is the foundation of semantic search that makes RAG so powerful.

What Happens During Embedding?

An embedding model converts a text (up to the model's maximum input length) into a vector of fixed dimension. The sentences "How do I cancel my contract?" and "Submit contract cancellation" sit close together in vector space, even though they share almost no words. This is the decisive advantage over traditional keyword search.

Model Comparison

| Model | Dimensions | Languages | Strength |
|---|---|---|---|
| BGE-Gemma2 (Multilingual) | 1024 | 30+ | Multilingual, strong for DE/EN |
| OpenAI text-embedding-3-small | 1536 | 50+ | Good all-round performance |
| OpenAI text-embedding-3-large | 3072 | 50+ | Highest quality, expensive |
| Cohere embed-v3 | 1024 | 100+ | Strong multilingual performance |
| E5-Mistral-7B | 4096 | 30+ | Open source, very precise |

Our Choice: BGE-Multilingual-Gemma2

For schwankl.info, I use the BGE-Multilingual-Gemma2 model via the Scaleway Inference API. The reasons:

  • 1024 dimensions: Good compromise between quality and storage requirements
  • Multilingual: Excellent performance in German and English — important for our bilingual platform
  • EU hosting: Scaleway operates inference servers in France, GDPR-compliant
  • Cost-effective: Significantly cheaper than OpenAI embeddings with comparable quality
python
from openai import AsyncOpenAI

async def embed_text(text: str) -> list[float]:
    """Creates a 1024-dimensional embedding vector via Scaleway."""
    client = AsyncOpenAI(
        base_url="https://api.scaleway.ai/v1",
        api_key=SCW_SECRET_KEY,
    )
    response = await client.embeddings.create(
        model="bge-multilingual-gemma2",
        input=text,
    )
    return response.data[0].embedding  # list[float], length 1024

Important for embeddings: The same model version must be used for indexing and querying. If you switch the embedding model, all documents must be re-indexed — the vectors from different models are not compatible.
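One way to guard against this failure mode is to record the model name in each chunk's metadata and refuse to mix versions at query time. A sketch — the `EMBEDDING_MODEL` constant and the `embedding_model` metadata key are illustrative names, not part of the actual implementation:

```python
# Sketch: store the embedding model name with every chunk and refuse to mix
# versions at query time. EMBEDDING_MODEL and the "embedding_model" metadata
# key are assumed names for illustration.
EMBEDDING_MODEL = "bge-multilingual-gemma2"

def ensure_compatible(chunk_metadata: dict) -> None:
    """Raises if a chunk's vector was produced by a different model."""
    stored = chunk_metadata.get("embedding_model")
    if stored != EMBEDDING_MODEL:
        raise ValueError(
            f"Chunk embedded with {stored!r}, queries use {EMBEDDING_MODEL!r} "
            "-- re-index this document before searching."
        )

ensure_compatible({"embedding_model": "bge-multilingual-gemma2"})  # passes silently
```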

Vector Databases Compared

The vector database stores your embeddings and enables fast similarity searches. Choosing the right database depends on your scaling needs, budget, and existing infrastructure.

Comparison Overview

| Criterion | pgvector | Pinecone | Weaviate | Qdrant |
|---|---|---|---|---|
| Type | PostgreSQL extension | Managed service | Self-hosted / Cloud | Self-hosted / Cloud |
| Setup | Simple (one SQL command) | No setup needed | Docker / Helm | Docker / Helm |
| Scaling | Up to ~5M vectors | Unlimited | Horizontal | Horizontal |
| Cost | Free (OSS) | From $70/month | Free (OSS) | Free (OSS) |
| Filtering | SQL WHERE | Metadata filter | GraphQL | Payload filter |
| Standout | In existing DB | Fully managed | Hybrid search | Fastest latency |

Our Choice: pgvector + Supabase

For schwankl.info, I use pgvector as a PostgreSQL extension within Supabase. The advantages:

  1. No additional infrastructure: Vectors live in the same database as all other data — user data, blog posts, configurations. One less system to maintain.
  2. SQL-based filtering: We can combine vector search with arbitrary SQL conditions. For example: "Find the most similar chunks, but only from documents with session ID X and status 'ready'."
  3. HNSW index: pgvector supports the HNSW index type (Hierarchical Navigable Small World), enabling searches in milliseconds — even with hundreds of thousands of vectors.
  4. Supabase RPC: We encapsulate complex vector searches in PostgreSQL functions that we call via RPC:
sql
CREATE FUNCTION match_document_chunks(
    query_embedding vector(1024),
    match_count INTEGER DEFAULT 5,
    filter_session_id TEXT DEFAULT NULL
)
RETURNS TABLE (
    id UUID, document_id UUID, chunk_text TEXT,
    chunk_index INTEGER, metadata JSONB, similarity FLOAT
) AS $$
BEGIN
    RETURN QUERY
    SELECT dc.id, dc.document_id, dc.chunk_text,
           dc.chunk_index, dc.metadata,
           1 - (dc.embedding <=> query_embedding) AS similarity
    FROM document_chunks dc
    JOIN demo_documents dd ON dc.document_id = dd.id
    WHERE dd.status = 'ready'
      AND (filter_session_id IS NULL OR dd.session_id = filter_session_id)
    ORDER BY dc.embedding <=> query_embedding
    LIMIT match_count;
END;
$$ LANGUAGE plpgsql;
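The HNSW index from point 3 is created with a single statement. The `m` and `ef_construction` values below are pgvector's defaults, written out explicitly as a starting point for tuning rather than a recommendation from this article:

```sql
-- HNSW index over cosine distance on the embedding column.
-- m = 16 and ef_construction = 64 are pgvector's defaults.
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```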

When a dedicated vector database makes sense: If you have more than 5 million vectors, need sub-millisecond latency, or require complex hybrid search (vector + full-text), take a look at Qdrant or Weaviate. For most mid-sized use cases, pgvector is more than sufficient.

Retrieval & Similarity Search

Retrieval is the most critical step of the RAG pipeline. If the wrong chunks are retrieved, even the best LLM cannot generate a good answer.

Cosine Similarity

The standard metric for vector similarity is cosine similarity. It measures the angle between two vectors in high-dimensional space:

  • 1.0 = identical meaning
  • 0.0 = no semantic connection
  • -1.0 = opposite meaning (rare in practice)

In pgvector, cosine distance is calculated with the <=> operator. Similarity is computed as 1 - distance.
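Cosine similarity is easy to compute by hand, which makes the metric less abstract. A self-contained toy example with 3-dimensional vectors — real embeddings have 1024+ dimensions, and the numbers below are invented for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional "embeddings" -- real models produce 1024+ dimensions.
cancel_contract    = [0.90, 0.10, 0.20]
submit_termination = [0.85, 0.15, 0.25]
pizza_recipe       = [0.10, 0.90, 0.10]

print(cosine_similarity(cancel_contract, submit_termination))  # close to 1.0
print(cosine_similarity(cancel_contract, pizza_recipe))        # much lower
```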

Top-K Retrieval

We typically retrieve the top 5 most similar chunks. Too few chunks mean missing context; too many confuse the LLM with irrelevant information and waste tokens.

python
async def retrieve_chunks(question: str, session_id: str, top_k: int = 5):
    """Vector search: embed question and find most similar chunks."""
    question_embedding = await embed_text(question)

    result = supabase.rpc("match_document_chunks", {
        "query_embedding": question_embedding,
        "match_count": top_k,
        "filter_session_id": session_id,
    }).execute()

    return result.data  # Top-K chunks with similarity score

Reranking: The Second Stage

For higher precision, a reranking step can be added after the initial retrieval. A cross-encoder model evaluates each (question, chunk) pair individually — slower, but significantly more precise than pure vector similarity.

A typical pattern: First retrieve Top-20 via vector search, then reduce to Top-5 via reranker. This noticeably improves answer quality, especially for ambiguous questions.
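A minimal sketch of this two-stage pattern. The cross-encoder is stubbed out with a toy lexical-overlap score; in a real system, score_pair would call a reranker model such as a BGE or Cohere cross-encoder:

```python
# Sketch of the two-stage pattern: broad vector recall (Top-20), then rerank
# down to Top-5. score_pair is a toy stand-in for a real cross-encoder.

def score_pair(question: str, chunk: str) -> float:
    """Toy 'cross-encoder': fraction of question words that appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def rerank(question: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Stage 2: rescore every candidate individually and keep the best top_k."""
    return sorted(candidates, key=lambda c: score_pair(question, c), reverse=True)[:top_k]

# Stage 1 would deliver the Top-20 from vector search; here the candidates are faked.
candidates = [f"chunk about topic {i}" for i in range(20)]
best = rerank("tell me about topic 7", candidates, top_k=5)
```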

Quality Metrics

  • Recall@K: How many relevant chunks are in the Top-K results?
  • MRR (Mean Reciprocal Rank): At which position is the first relevant chunk on average?
  • Precision@K: How many of the Top-K chunks are actually relevant?

In our production environment, we achieve a Recall@5 of over 85% with BGE-Gemma2 and pgvector — meaning in 85% of cases, at least one of the Top-5 chunks contains the relevant information.
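These metrics are simple to compute once you have a labeled test set. A self-contained toy example with two hand-made queries (chunk IDs and relevance labels are invented for illustration):

```python
# Toy computation of the metrics above over a tiny hand-labeled test set.

def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> bool:
    """Recall@K in the 'at least one relevant chunk in the Top-K' sense."""
    return any(doc in relevant for doc in retrieved[:k])

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

test_set = [
    (["c3", "c1", "c9"], {"c1"}),  # first relevant chunk at rank 2
    (["c4", "c5", "c6"], {"c7"}),  # relevant chunk missed entirely
]
recall_at_3 = sum(hit_at_k(r, rel, 3) for r, rel in test_set) / len(test_set)
mrr = sum(reciprocal_rank(r, rel) for r, rel in test_set) / len(test_set)
```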

Generation with Context

The final step of the RAG pipeline is answer generation. Here, the retrieved context is passed to the LLM along with the user question.

Context Assembly

The art lies in preparing the context so the LLM can use it optimally. We number the sources so the model can reference them in its answer:

python
def assemble_context(chunks: list[dict]) -> str:
    """Builds the context string with numbered sources."""
    parts = []
    for i, chunk in enumerate(chunks):
        parts.append(f"[Source {i + 1}]: {chunk['chunk_text']}")
    return "\n\n".join(parts)

System Prompt

The system prompt defines the model's behavior. For RAG, clear rules are essential — the model should answer ONLY based on the provided context:

python
RAG_SYSTEM_PROMPT = """You are a domain assistant that answers questions
based on provided documents.

Rules:
- Answer questions ONLY based on the provided context.
- If the context does not answer the question, say so honestly.
- Cite the source with [Source X] at the end of relevant statements.
- Answer precisely and informatively.
- Use Markdown formatting."""

LLM Call with Streaming

In practice, I use streaming so the user sees the answer token by token immediately — this feels much more responsive than waiting several seconds for a complete answer:

python
async def generate_answer_stream(question: str, context: str):
    """Generates a RAG answer as a stream via Scaleway Mistral."""
    client = AsyncOpenAI(
        base_url="https://api.scaleway.ai/v1",
        api_key=SCW_SECRET_KEY,
    )

    stream = await client.chat.completions.create(
        model="mistral-small-3.2-24b-instruct-2506",
        messages=[
            {"role": "system", "content": RAG_SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        max_tokens=1000,
        temperature=0.3,
        stream=True,
    )

    async for chunk in stream:
        # The final stream chunk can arrive without choices or without content.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

Temperature setting: For RAG applications, I recommend a low temperature (0.1—0.3). This reduces the model's creativity and ensures it stays closer to the provided context. Higher temperatures would be suitable for creative tasks, but for fact-based questions, we want the most precise answers possible.

RAG vs. Fine-Tuning: When to Use What?

One of the most common questions I get from clients: "Should we do RAG or fine-tune the model?" The answer is almost always: RAG first. But there are cases where fine-tuning is the better choice.

Decision Matrix

| Criterion | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updatable? | Yes, documents exchangeable | No, retraining required |
| Source citations? | Yes, chunks are traceable | No, knowledge is "baked in" |
| Cost (setup) | Low (pipeline + DB) | High (GPU, training data) |
| Cost (operation) | Medium (embedding + LLM) | Low (inference only) |
| Latency | Higher (+ retrieval step) | Lower (direct answer) |
| Data volume | Arbitrarily large | Limited by training budget |
| Hallucination risk | Low (context available) | Medium (knowledge can fade) |
| Domain language | Good (context provides terms) | Very good (model learns style) |
RAG is the right choice when:

  • Your knowledge changes regularly (products, prices, policies)
  • Traceability and source citations are important
  • You need different document collections per user/context
  • You want to start quickly without spending weeks preparing training data
  • Compliance requires the AI to disclose its sources

Fine-tuning makes sense when:

  • The model should learn a specific communication style
  • Answers must be generated in a fixed format
  • Latency is critical and every millisecond counts
  • You want a small specialized model that excels at one task

The Combination: RAG + Fine-Tuning

In practice, I often combine both: A model optimized via fine-tuning that has learned the company's communication style is augmented with RAG for current knowledge. This gives you the best of both worlds — though at the cost of higher complexity.

My recommendation: Always start with RAG. It is faster to set up, more flexible, and delivers sufficiently good results in most cases. You can always add fine-tuning later.

In Practice: Our RAG Implementation on schwankl.info

On schwankl.info, I run a complete RAG demo that you can try directly in your browser as a visitor. Here is the tech stack overview:

  • Frontend: Next.js 15, TypeScript — with streaming display for answer generation
  • Backend: FastAPI (Python 3.11), Gunicorn — asynchronous RAG pipeline with SSE streaming
  • Database: Supabase PostgreSQL with pgvector extension — vector storage and metadata in one DB
  • Embedding: Scaleway Inference API with BGE-Multilingual-Gemma2 (1024 dimensions)
  • LLM: Scaleway Inference API with Mistral Small 3.2 24B Instruct
  • Hosting: IONOS VPS in Germany, Docker containers, Plesk/nginx reverse proxy

The Complete Flow

  1. User uploads a document (PDF, Markdown, or plaintext, max 5 MB)
  2. FastAPI parses the document and extracts text
  3. Text is split into chunks (2000 characters, 200 character overlap)
  4. Each chunk is converted to a 1024-dimensional vector via Scaleway API
  5. Chunks and vectors are stored in Supabase (document_chunks table)
  6. User asks a question
  7. Question is embedded, Top-5 most similar chunks retrieved via pgvector search
  8. Chunks are sent as context along with the question to Mistral
  9. Answer is delivered to the frontend via SSE stream
  10. Sources with similarity scores are displayed at the end

Privacy and Security

  • All uploaded documents are session-bound and automatically deleted after 24 hours
  • PII (personally identifiable information) is anonymized before processing
  • All AI inference runs via Scaleway in the EU (France)
  • No data is transmitted to US services

Common Mistakes and Best Practices

After dozens of RAG implementations, I have compiled a clear list of dos and don'ts. These lessons can save you weeks of debugging.

Mistake 1: Chunks too large

Chunks with 5000+ characters contain too much irrelevant information. The LLM loses focus and the answer becomes imprecise. Better: 1000–2000 characters, so each chunk contains a clear, self-contained piece of information.

Mistake 2: No overlap

Without overlap between chunks, information is lost at the cut points. A sentence cut exactly at a chunk boundary becomes invisible to retrieval. 10–15% overlap (e.g., 200 characters for 2000-character chunks) is a good guideline.

Mistake 3: Switching embedding models without re-indexing

Different embedding models produce incompatible vectors. If you switch models, you must re-embed all documents. Plan for this from the start — with a re-indexing script and versioning of the embedding config.

Mistake 4: Sending too many or too few chunks to the LLM

Top-1 is too few (high chance of missing the relevant chunk), Top-20 is too many (the LLM gets lost in noise and the context window is wasted). Top-3 to Top-5 is optimal in most cases. Test with your actual data.

Mistake 5: System prompt without clear rules

Without an explicit instruction like "Answer ONLY based on the context," the LLM will mix its training knowledge with the provided context — leading to subtle hallucinations that are hard to detect.

Best Practices Overview

  • Measure retrieval quality with real test questions (Recall@K, MRR)
  • Test different chunk sizes with your specific documents
  • Implement monitoring for answer quality and user satisfaction
  • Use metadata filtering to narrow the search space (e.g., by document type, date, department)
  • Keep the system prompt simple and specific — every rule counts
  • Log questions without relevant chunks — this shows you which documents are missing
  • Implement a feedback loop: users can mark answers as helpful/unhelpful
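The "log questions without relevant chunks" practice can be implemented with a simple similarity threshold. A sketch — the 0.5 cutoff and the in-memory coverage_gaps list are illustrative choices to adapt to your own data:

```python
# If the best retrieval similarity falls below a threshold, record the question
# as a coverage gap instead of answering from weak context. The 0.5 threshold
# and the in-memory list are assumptions for illustration.
coverage_gaps: list[dict] = []

def has_relevant_context(question: str, chunks: list[dict], threshold: float = 0.5) -> bool:
    """Returns False and logs a gap when no chunk clears the threshold."""
    best = max((c["similarity"] for c in chunks), default=0.0)
    if best < threshold:
        coverage_gaps.append({"question": question, "best_similarity": best})
        return False
    return True

ok = has_relevant_context("What is our refund policy?",
                          [{"similarity": 0.31}, {"similarity": 0.28}])
```

In production, the gap would go to a database table or a log pipeline instead of a list; reviewing it weekly shows exactly which documents are missing from your knowledge base.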

Conclusion and Next Steps

RAG is not hype, but a proven architecture that transforms LLMs from impressive text generators into reliable knowledge assistants. With the right pipeline — solid chunking, good embedding model, performant vector database, and clear system prompt — you can build AI applications that create real value for your company.

The key takeaways from this guide:

  1. RAG solves the three core problems of LLMs: Hallucinations, outdated knowledge, and missing company knowledge
  2. Chunking quality determines answer quality: Invest time in finding the right chunk size and overlap strategy
  3. Start simple: pgvector + Supabase is sufficient for most use cases. You can scale later.
  4. Measure: Without metrics like Recall@K, you are optimizing blind. Build a test set.
  5. RAG before fine-tuning: In 90% of cases, RAG alone delivers sufficiently good results.

Try It Yourself

On our RAG demo, you can upload your own documents and ask questions — directly in your browser, no registration required. Experience how RAG works in practice.

If you want to implement RAG for your company — whether an internal knowledge assistant, an intelligent document search, or a customer support bot — I am happy to help. Let's find out in a no-obligation conversation how RAG can improve your processes.

Experience RAG in Action

Try our interactive RAG demo — upload your documents and ask questions. Or let us find out in a personal conversation how RAG can improve your processes.
