Vector Databases and RAG Systems — Building Intelligent LLM Applications

Complete guide to embeddings, retrieval-augmented generation, and semantic search

A Practical Guide to Building Knowledge-Enhanced AI Applications


1. Introduction to Vector Databases

A vector database is a specialized database designed to store and query high-dimensional vector embeddings efficiently. Unlike traditional databases that excel at exact matches (“find users named John”), vector databases excel at similarity searches (“find documents semantically similar to this query”).

Important

Why This Matters: Vector databases are the backbone of modern AI applications—enabling semantic search, recommendation systems, and knowledge-augmented LLMs.

Key Characteristics:

  • Similarity Search: Find the most similar items based on vector distance
  • Scalability: Handle billions of vectors with sub-second query times
  • AI-Native: Optimized for machine learning pipelines

Large Language Models (GPT-4, Claude, Llama) have critical limitations that vector databases solve:

| LLM Limitation | Vector DB Solution |
| --- | --- |
| Knowledge Cutoff | Store and retrieve current information |
| Hallucinations | Ground responses in factual data |
| No Long-Term Memory | Persist context across sessions |
| Token Limits | Retrieve only relevant information |
| Lacks Domain Knowledge | Inject specialized expertise |

Vector databases power many AI applications you use daily:

  • Customer Support Bots: Retrieve relevant help articles to answer user questions
  • Enterprise Search: Find documents by meaning, not just keywords
  • Code Assistants: Search codebases for similar implementations
  • Research Tools: Find related papers and citations
  • Recommendation Systems: “Users who liked X also liked Y”
  • Chatbots with Memory: Remember past conversations

2. Understanding Embeddings

Embeddings are numerical representations of data (text, images, etc.) as dense vectors of floating-point numbers. These vectors capture semantic meaning in a way machines can process and compare.

Example: The sentence “The quick brown fox” becomes a vector like:

[0.0123, -0.0456, 0.0789, -0.0234, 0.0567, ..., 0.0891]  (1536 dimensions)
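
As a concrete illustration, here is a minimal sketch of generating such a vector with the OpenAI Python client (any embedding API follows the same pattern; the model name matches the examples in this guide):

```python
# Minimal sketch: turning text into an embedding vector.
# Assumes the `openai` package (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",   # 1536-dimensional output
    input="The quick brown fox",
)

vector = response.data[0].embedding   # a plain Python list of 1536 floats
print(len(vector), vector[:5])
```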

Neural networks (specifically transformers) create embeddings through a training process:

  1. Input Processing: Text is broken into tokens (words or subwords)
  2. Contextual Understanding: The transformer processes all tokens together, understanding relationships
  3. Vector Output: The final layer produces a fixed-size vector representing the entire input

Training Method - Contrastive Learning:

  • The model sees pairs of similar texts (e.g., question and its answer)
  • It learns to place similar texts close together in vector space
  • Dissimilar texts are pushed apart
  • After training on millions of pairs, the model understands semantic relationships
Note

You don’t need to train your own embedding model. Pre-trained models like OpenAI’s text-embedding-3-small already understand language well. You just use them via API.

The training process creates a semantic space where:

  • “Dog” and “Puppy” → Close together (similar meaning)
  • “Dog” and “Airplane” → Far apart (unrelated)
  • “King - Man + Woman ≈ Queen” → Vector arithmetic captures relationships

This is why searching for “automobile” can find documents about “cars”—they occupy nearby regions in the embedding space.

When text is converted to embeddings, semantically similar items cluster together in vector space:

Vector Embedding Semantic Space - 3D visualization showing how semantically similar words cluster together. Animals (Wolf, Dog, Cat, Kitten) cluster on one side while Fruits (Banana, Apple) cluster on another.
Semantic embedding visualization showing word clustering based on meaning

Key Insight: A query for “Kitten” naturally finds related animal terms, not fruits—because they’re close in vector space.

| Type | Description | Example |
| --- | --- | --- |
| Dense | Every dimension has a value; captures semantic meaning | [0.01, 0.74, 0.52, ...] |
| Sparse | Most dimensions are zero; captures keyword presence | {"cat": 4, "dog": 1} |

Note

Modern systems combine both in hybrid search for semantic understanding AND keyword precision.

More dimensions capture more nuance, but at a cost:

| Dimensions | Speed | Accuracy | Memory | Best For |
| --- | --- | --- | --- | --- |
| 384 | Fastest | Good | ~1.5 KB/vector | Real-time apps, chatbots |
| 768 | Fast | Better | ~3 KB/vector | Balanced performance |
| 1536 | Medium | Excellent | ~6 KB/vector | Most production use cases |
| 3072 | Slower | Best | ~12 KB/vector | Maximum accuracy needed |

Rule of Thumb: Start with 1536 dimensions (OpenAI’s default). Only go larger if accuracy is critical and you have the infrastructure.

| Model | Dimensions | Provider | Best For |
| --- | --- | --- | --- |
| text-embedding-3-small | 1536 | OpenAI | Cost-effective general use |
| text-embedding-3-large | 3072 | OpenAI | Maximum accuracy |
| all-MiniLM-L6-v2 | 384 | HuggingFace | Fast, open-source |
| BGE-large-en-v1.5 | 1024 | BAAI | Top open-source performance |
| embed-multilingual-v3.0 | 1024 | Cohere | 100+ languages |

To find similar vectors, we measure “distance” between them:

Cosine Similarity (most common for text):

similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

  • Score of 1.0 = Identical meaning
  • Score of 0.0 = Completely unrelated
  • Score of -1.0 = Opposite meaning

| Metric | Best For |
| --- | --- |
| Cosine | Text embeddings |
| Euclidean (L2) | Image embeddings |
| Dot Product | Normalized vectors |
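
As a quick sketch, cosine similarity is a one-liner with NumPy; the vectors below are toy stand-ins for real embeddings:

```python
# Sketch: comparing two vectors with cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """similarity(A, B) = (A · B) / (‖A‖ ‖B‖), ranging from -1.0 to 1.0."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
dog = np.array([0.9, 0.1, 0.3])
puppy = np.array([0.85, 0.15, 0.35])
airplane = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(dog, puppy))     # high: similar meaning
print(cosine_similarity(dog, airplane))  # lower: unrelated
```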

Query: “Is Windows 8 any good?”

Semantic Similarity Scoring - Query matched against documents showing scores: 0.88 for “Windows 10 is good”, 0.82 for “I love Windows 11!”, 0.88 for “Not enjoying Windows 10”, 0.87 for “Always found Windows 8 kinda weird”
Semantic similarity scoring demonstration

Warning

Pitfall: “Windows 10 is good” scores 0.88, slightly higher than the actual Windows 8 comment (0.87). Embeddings treat version numbers as semantically similar. This is why hybrid search matters.

3. Vector Search Algorithms

Searching billions of vectors requires Approximate Nearest Neighbor (ANN) algorithms. Here’s how the main ones work:

HNSW (Hierarchical Navigable Small World) is the most popular algorithm for production systems. Think of it as a multi-level express train system:

How it works:

  1. Creates multiple layers of connected nodes (vectors)
  2. Top layer: Few nodes, long-distance connections (express trains)
  3. Bottom layer: All nodes, short-distance connections (local stops)
  4. Search: Start at top, quickly narrow down, then refine at bottom

Key Parameters:

  • M: Connections per node (higher = better recall, more memory)
  • ef: Search width (higher = more accurate, slower)

Trade-off: Excellent speed and accuracy, but uses more memory than other methods.
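
As a concrete sketch, the FAISS library exposes exactly these knobs (M at construction time, efSearch at query time); the random vectors here stand in for real embeddings:

```python
# Sketch: HNSW index with FAISS (assumes the `faiss-cpu` package).
import numpy as np
import faiss

dim = 128
vectors = np.random.random((10_000, dim)).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)       # M = 32 connections per node
index.hnsw.efConstruction = 200            # search width while building the graph
index.hnsw.efSearch = 64                   # search width at query time (accuracy vs. speed)
index.add(vectors)

query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 5)    # 5 approximate nearest neighbors
print(ids[0])
```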

IVF (Inverted File Index) uses clustering to organize vectors into buckets:

How it works:

  1. Training: Use k-means to create cluster centers (centroids)
  2. Indexing: Assign each vector to its nearest centroid
  3. Search: Find nearest centroids to query, then search only those clusters

Key Parameter:

  • nprobe: How many clusters to search (higher = more accurate, slower)

Trade-off: Fast for huge datasets, but requires training data and careful tuning.
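
A matching FAISS sketch shows the train → add → nprobe flow (again with random stand-in vectors):

```python
# Sketch: IVF index with FAISS (assumes the `faiss-cpu` package).
import numpy as np
import faiss

dim, nlist = 128, 100                      # nlist = number of k-means clusters
vectors = np.random.random((10_000, dim)).astype("float32")

quantizer = faiss.IndexFlatL2(dim)         # used to assign vectors to centroids
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

index.train(vectors)                       # run k-means to learn the centroids
index.add(vectors)                         # assign each vector to its nearest centroid

index.nprobe = 10                          # search 10 of the 100 clusters per query
query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 5)
```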

Product Quantization (PQ) is a compression technique that shrinks vectors dramatically:

How it works:

  1. Split each vector into sub-vectors (e.g., 1536D → 8 chunks of 192D)
  2. Replace each sub-vector with a code pointing to a codebook entry
  3. Store only the codes (8 bytes instead of 6 KB!)

Trade-off: Massive memory savings (32-64x), but loses some accuracy.
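
For completeness, a FAISS sketch of PQ combined with IVF (the IVF-PQ row in the comparison below); the sub-vector count and bits per code are illustrative values:

```python
# Sketch: product quantization with an IVF-PQ index in FAISS.
import numpy as np
import faiss

dim, nlist = 128, 100
m, nbits = 8, 8                            # 8 sub-vectors at 8 bits each = 8 bytes per vector
vectors = np.random.random((50_000, dim)).astype("float32")

quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)

index.train(vectors)                       # learns centroids and the PQ codebooks
index.add(vectors)                         # stores compressed codes, not raw vectors

index.nprobe = 10
query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 5)    # distances are approximate (computed on codes)
```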

| Algorithm | How It Works | Search Speed | Memory | Accuracy | Best For |
| --- | --- | --- | --- | --- | --- |
| HNSW | Graph navigation | O(log N) | High | Excellent | Production systems |
| IVF | Cluster search | O(√N) | Medium | Good | Large-scale search |
| PQ | Compressed codes | O(N) | Very Low | Moderate | Billions of vectors |
| IVF-PQ | Clusters + compression | O(√N) | Low | Good | Balance of all factors |

Tip

Most production systems use HNSW (Pinecone, Weaviate, Qdrant). It offers the best balance of speed and accuracy. Use IVF-PQ only when you have billions of vectors and limited memory.


4. RAG (Retrieval-Augmented Generation)

RAG enhances LLM responses by retrieving relevant information from external sources before generating an answer.

RAG Architecture Overview - Flow: User query → Embedding model → Vector database → Retrieved documents → LLM → Answer
Basic RAG pipeline architecture

  1. User Query → User asks a question
  2. Query Embedding → Convert query to vector
  3. Vector Search → Find similar documents
  4. Context Retrieval → Fetch document content
  5. Prompt Construction → Combine query + context
  6. LLM Generation → Generate grounded answer
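
To make the six steps concrete, here is a minimal end-to-end sketch using Chroma as the vector store and the OpenAI API for generation. The collection name, documents, model name, and prompt wording are illustrative, not a prescribed setup:

```python
# Minimal RAG sketch (assumes the `chromadb` and `openai` packages and an OPENAI_API_KEY).
import chromadb
from openai import OpenAI

llm = OpenAI()
chroma = chromadb.Client()
docs = chroma.create_collection("knowledge_base")    # uses Chroma's default embedding function

# Index some documents (normally produced by your chunking pipeline)
docs.add(
    ids=["doc1", "doc2"],
    documents=[
        "Our refund policy allows returns within 30 days of purchase.",
        "Support is available Monday to Friday, 9am to 5pm CET.",
    ],
)

# Steps 2-4: embed the query and retrieve the most similar chunks
question = "How long do I have to return a product?"
results = docs.query(query_texts=[question], n_results=2)
context = "\n".join(results["documents"][0])

# Steps 5-6: build a grounded prompt and generate the answer
prompt = (
    "Answer the question using ONLY the context below. "
    "If the context is insufficient, say you don't know.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
answer = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```
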
Important

Why RAG Works: LLMs can’t read thousands of documents and remember them. Fine-tuning influences style, not knowledge. RAG retrieves relevant knowledge dynamically at runtime.

| Benefit | Description |
| --- | --- |
| No Retraining | Update knowledge by updating the database |
| Verifiable | Can cite sources for answers |
| Cost-Effective | Cheaper than fine-tuning |
| Up-to-Date | Add new information instantly |

RAG combats hallucinations by:

  1. Providing factual context in the prompt
  2. Instructing the LLM to answer only from provided context
  3. Allowing “I don’t know” when context is insufficient

5. Document Processing Pipeline

Document Processing Pipeline - Flow: PDF → Raw text → Text chunks → Embeddings → Vector database
Basic document chunking pipeline

Steps:

  1. Extract raw text from documents (PDF, DOCX, HTML)
  2. Chunk into fixed-size pieces (500-1000 characters)
  3. Embed each chunk to a vector
  4. Store (vector, chunk) pairs in database

There are multiple ways to split documents. Each has trade-offs:

| Strategy | How It Works | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| Fixed-size | Split every N characters | Simple, predictable | May break mid-sentence | Quick prototyping |
| Sentence-based | Split at sentence boundaries | Grammatically correct | Variable chunk sizes | Articles, blogs |
| Recursive | Try paragraphs → sentences → words | Balances size and meaning | More complex | Mixed documents |
| Semantic | Use embeddings to find topic shifts | Best coherence | Computationally expensive | Technical docs |

Recursive Chunking (most recommended):

  1. First, try splitting by paragraph (\n\n)
  2. If chunks are still too large, split by sentence (. )
  3. If still too large, split by words
  4. This preserves natural document structure
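
A simplified pure-Python sketch of the idea (production systems usually reach for a library splitter such as LangChain's RecursiveCharacterTextSplitter, but the logic is the same; chunk overlap is omitted for brevity):

```python
# Simplified recursive chunker: try coarse separators first, fall back to finer ones.
# Separators at chunk boundaries are dropped for simplicity.
def recursive_chunk(text, max_chars=500, separators=("\n\n", ". ", " ")):
    if len(text) <= max_chars:
        return [text]
    if not separators:                     # no separator left: hard cut
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if len(piece) > max_chars:         # piece alone is too big: recurse with finer separators
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_chunk(piece, max_chars, finer))
        elif not current or len(current) + len(sep) + len(piece) <= max_chars:
            current = f"{current}{sep}{piece}" if current else piece
        else:                              # current chunk is full: start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```
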
| Size (characters) | Tokens (approx.) | Behavior | Best For |
| --- | --- | --- | --- |
| Small (~256) | ~64 | Precise retrieval, less context | Specific fact lookup |
| Medium (~512) | ~128 | Balanced approach | General Q&A |
| Large (~1024) | ~256 | More context, fewer chunks | Complex explanations |
| Very Large (~2048) | ~512 | Full paragraphs | Broad topic summaries |

Rule of Thumb: Match chunk size to expected query length. Short questions → smaller chunks. Complex questions → larger chunks.

Tip

Always use chunk overlap (10-20%). This prevents losing information that spans chunk boundaries. If a key fact is at the edge of two chunks, overlap ensures it appears in at least one complete chunk.

Document Summarization Pipeline - Flow: PDF → Raw text → Text chunks → Small LLM → Summary → Embeddings → Vector database
Enhanced pipeline with LLM-based summarization

Benefits:

  • Removes filler text that confuses embeddings
  • Creates denser, more meaningful representations
  • Normalizes formatting across document types

For hierarchical documents (books, papers):

  1. Index smaller chunks (paragraphs) for precise retrieval
  2. When retrieved, also fetch the parent section for context
  3. Or use a “windowed” approach—retrieve neighboring chunks

6. Prompt Engineering with RAG

Instead of hardcoding examples, retrieve relevant examples from a vector database:

Few-Shot Prompting with Vector Database - User query flows to vector database for examples, combined into prompt sent to LLM
Dynamic few-shot prompt construction

Flow:

  1. User asks a question
  2. Vector DB retrieves similar Q&A examples
  3. Examples are injected into the prompt
  4. LLM generates answer using those examples as guidance
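
A small sketch of step 3; the retrieved examples and their format are assumptions about what your vector database returns:

```python
# Sketch: assembling a few-shot prompt from retrieved Q&A examples.
def build_few_shot_prompt(question, examples):
    """`examples` is assumed to be a list of {"question": ..., "answer": ...} dicts."""
    parts = ["Answer in the same style as the examples below.\n"]
    for ex in examples:
        parts.append(f"Q: {ex['question']}\nA: {ex['answer']}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

retrieved = [   # would come from a similarity search over past Q&A pairs
    {"question": "How do I reset my password?", "answer": "Go to Settings > Security > Reset password."},
    {"question": "How do I change my email?", "answer": "Go to Settings > Account > Email."},
]
print(build_few_shot_prompt("How do I delete my account?", retrieved))
```
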
| Temperature | Behavior | Best For |
| --- | --- | --- |
| 0.0 - 0.3 | Deterministic, factual | RAG, Q&A, summarization |
| 0.4 - 0.7 | Balanced | General conversation |
| 0.8 - 1.0+ | Creative, diverse | Brainstorming, fiction |

Warning

For RAG: Use low temperature (0.0-0.3). Higher temperatures increase hallucination risk—defeating the purpose of retrieval-augmented generation.


7. Hybrid Search

Semantic search can fail for:

  • Entity-specific queries: “Windows 8” retrieves Windows 10/11 content
  • Exact terminology: Medical terms, legal citations, product SKUs
  • Negations: “not Python” still retrieves Python content
  • Rare terms: Words not well-represented in training data

Before combining approaches, understand how keyword search works:

A classic algorithm that scores documents based on keyword importance:

  • TF (Term Frequency): How often does the word appear in this document?
  • IDF (Inverse Document Frequency): How rare is this word across ALL documents?
  • Score = TF × IDF

Intuition: If “quantum” appears 5 times in a document and is rare across your corpus, that document scores high for “quantum” queries.

An improved version of TF-IDF used by most search engines:

  • Adds document length normalization (long docs don’t unfairly win just by repeating terms)
  • Diminishing returns for repeated terms (10 mentions isn’t 10x better than 1)
  • Tunable parameters (k1, b) for different use cases

When BM25 Shines: Exact terminology, product codes, legal citations, medical terms.
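
A small sketch using the rank_bm25 package; whitespace tokenization keeps it simple, while real systems normalize and stem tokens:

```python
# Sketch: BM25 keyword scoring with the `rank_bm25` package.
from rank_bm25 import BM25Okapi

corpus = [
    "Quantum computing uses qubits instead of classical bits.",
    "Classical computers process information as bits.",
    "Qubits can exist in superposition, enabling quantum parallelism.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)         # k1 and b use the library defaults
query = "quantum qubits".split()

scores = bm25.get_scores(query)            # one BM25 score per document
print(scores)                              # documents containing the rare query terms score highest
```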

Hybrid Search Architecture - Input flows to Dense embedding model AND Sparse embedding model, combined via Pinecone/hybrid into Hybrid index
Hybrid search combining dense and sparse vectors

How It Works:

  1. Dense Vector → Captures semantic meaning (“car” finds “automobile”)
  2. Sparse Vector → Captures exact keywords (BM25/TF-IDF)
  3. Combined Query → Search both indexes simultaneously
  4. Score Fusion → Merge results using Reciprocal Rank Fusion (RRF)

The standard method for combining search results:

RRF(d) = Σ_{r ∈ R} 1 / (k + rank_r(d))

Where:

  • d = document
  • R = set of ranking methods (dense, sparse)
  • k = constant (typically 60)
  • rank_r(d) = position of document d in ranking r

Why It Works: Documents appearing in BOTH dense and sparse results get boosted. A document ranked #3 in dense and #5 in sparse will outrank one that’s #1 in only one method.
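
RRF is only a few lines of code; a sketch assuming each ranking is a list of document IDs ordered best-first:

```python
# Sketch: Reciprocal Rank Fusion over any number of ranked result lists.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Each ranking is a list of doc IDs, best first. Returns (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_results = ["doc_a", "doc_b", "doc_c"]    # from vector search
sparse_results = ["doc_c", "doc_a", "doc_d"]   # from BM25
print(reciprocal_rank_fusion([dense_results, sparse_results]))
# doc_a and doc_c appear in both lists, so they float to the top
```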

| Scenario | Approach | Why |
| --- | --- | --- |
| General Q&A | Dense (semantic) | Users phrase questions differently than docs |
| Technical docs | Hybrid | Need both concepts AND specific terms |
| Legal/Medical | Hybrid (favor sparse) | Exact terminology is critical |
| Product search with SKUs | Hybrid | Must match exact product codes |
| Conversational AI | Dense | Natural language varies widely |

| Search Type | Finds | Misses | Example |
| --- | --- | --- | --- |
| Dense only | Synonyms, paraphrases | Exact codes, rare terms | “Tell me about vehicles” → finds “car”, “automobile” |
| Sparse only | Exact matches | Semantic variations | “SKU-12345” → finds exact match |
| Hybrid | Both semantic AND exact | Rarely misses anything | “Windows 8 issues” → finds Windows 8 specifically |

Tip

Rule of Thumb: If exact keywords matter, use hybrid search. Most modern vector databases (Pinecone, Qdrant, Weaviate) support it natively. Start with a 50/50 weight, then tune based on your data.


8. Reranking and Context Compression

Initial vector search isn’t perfect:

  • Relevant documents may rank low
  • Too much context overwhelms the LLM
  • Documents buried in the middle get ignored (“lost in the middle” problem)

Reranking Search Results - Initial results pass through Rerank model, scores adjust (20%, 15%, 80%), final ranking reorders results
Search result reranking process

Two-Stage Retrieval:

  1. Stage 1: Vector search retrieves broad candidates (fast, ~100 docs)
  2. Stage 2: Reranker re-scores top candidates by relevance (precise, ~20 docs)

The key to understanding reranking is the difference between two architectures:

Bi-Encoder:

  • Embeds query and document separately
  • Compares pre-computed embeddings using cosine similarity
  • Speed: Can search millions of docs in milliseconds
  • Accuracy: Good, but misses nuanced query-document relationships

Cross-Encoder:

  • Processes query AND document together as one input
  • Considers every word interaction between query and document
  • Speed: Slow (must process each doc individually)
  • Accuracy: Excellent (understands exact relevance)

| Type | How It Works | Speed | Accuracy | Stage |
| --- | --- | --- | --- | --- |
| Bi-Encoder | Embed separately, compare | Milliseconds for millions | Good | Initial retrieval |
| Cross-Encoder | Process together | Seconds for dozens | Excellent | Reranking |

Analogy: Bi-encoders are like speed dating (quick impressions). Cross-encoders are like in-depth interviews (thorough evaluation).

You can’t use cross-encoders for initial search—scoring 1 million documents would take hours. Instead:

  1. Stage 1 (Bi-Encoder): Cast a wide net, retrieve ~100 candidates fast
  2. Stage 2 (Cross-Encoder): Carefully evaluate top ~20 candidates
  3. Return the best ~5-10 to the LLM

This gives you the best of both worlds: speed AND accuracy.
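
A sketch of stage 2 using the sentence-transformers CrossEncoder wrapper; the candidate list stands in for the output of your vector search, and the model name matches the table below:

```python
# Sketch: reranking vector-search candidates with a cross-encoder
# (assumes the `sentence-transformers` package).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I fix memory leaks in Python?"
candidates = [                                  # e.g. the top candidates from the bi-encoder stage
    "Use tracemalloc to find objects that are never released.",
    "Python is a popular programming language.",
    "Circular references can keep objects alive; break the cycle or use weakref.",
]

# Score every (query, document) pair jointly, then keep the best few
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked[:2]:
    print(f"{score:.3f}  {doc}")
```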

| Model | Type | Speed | Accuracy | Notes |
| --- | --- | --- | --- | --- |
| rerank-english-v3.0 | Cohere API | Fast | Excellent | Production-ready, paid |
| bge-reranker-large | Open Source | Medium | Excellent | Best open-source option |
| ms-marco-MiniLM | Open Source | Fast | Good | Lightweight, fast |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | HuggingFace | Fast | Good | Easy to deploy |

Context compression: Use a smaller LLM to:

  • Extract only relevant portions from each document
  • Discard irrelevant context
  • Reduce token usage and cost

Research shows LLMs have a U-shaped attention pattern:

  • Pay most attention to the beginning of context
  • Pay good attention to the end of context
  • Ignore or forget information in the middle

Implications for RAG:

  • Don’t just append retrieved docs in order of similarity score
  • Put the MOST relevant document first
  • Put the SECOND most relevant document last
  • Less critical docs go in the middle
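
A small sketch of this reordering, assuming the input list is already sorted by relevance (most relevant first):

```python
# Sketch: place the strongest documents at the edges of the context window.
def reorder_for_llm(docs_by_relevance):
    """Most relevant doc first, second most relevant last, weaker docs in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["doc1 (best)", "doc2", "doc3", "doc4", "doc5 (weakest)"]
print(reorder_for_llm(docs))
# ['doc1 (best)', 'doc3', 'doc5 (weakest)', 'doc4', 'doc2']
```
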
Warning

Lost in the Middle: If you retrieve 10 documents and the answer is in document #5, the LLM might miss it entirely. Reorder your context strategically!


9. Query Transformation Techniques

Sometimes the user’s query isn’t optimal for retrieval. Query transformation techniques improve retrieval by reformulating the query before searching.

Problem: User queries are short and may not match document vocabulary.

Solution (HyDE, Hypothetical Document Embeddings): Generate a hypothetical answer, then search for documents similar to that answer.

How it works:

  1. User asks: “How do I fix memory leaks in Python?”
  2. LLM generates a hypothetical answer (even if imperfect)
  3. Embed the hypothetical answer (not the question)
  4. Search for real documents similar to this hypothetical
  5. Retrieved docs are often more relevant than direct query search

When to use: Technical queries, specialized domains where users and documents use different vocabulary.
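
A minimal HyDE sketch; the `vector_search` callable is a placeholder for your own retrieval function, and the model names and prompt wording are illustrative:

```python
# Sketch: HyDE - embed a hypothetical answer instead of the raw query.
# Assumes the `openai` package; `vector_search(embedding, top_k)` is your own retrieval call.
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(question, vector_search):
    # 1. Ask the LLM for a plausible (possibly imperfect) answer
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a short passage answering: {question}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer, not the question
    embedding = client.embeddings.create(
        model="text-embedding-3-small", input=draft
    ).data[0].embedding

    # 3. Search for real documents near the hypothetical answer
    return vector_search(embedding, top_k=5)
```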

Problem: A single query may miss relevant documents phrased differently.

Solution (Multi-Query Retrieval): Generate multiple variations of the query and search with all of them.

How it works:

  1. Original query: “Python web frameworks”
  2. Generate variations:
    • “Django vs Flask comparison”
    • “Best backend frameworks for Python”
    • “Building web apps with Python”
  3. Search with each query
  4. Combine results (using RRF or union)

When to use: Ambiguous queries, broad topics, when recall is more important than precision.
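
A sketch of the multi-query step; the `search` callable is a placeholder that returns a ranked list of document IDs, and the fused ranking reuses the reciprocal_rank_fusion sketch from the hybrid search section:

```python
# Sketch: multi-query retrieval - generate variations, search each, fuse the rankings.
# Assumes the `openai` package; `search(query)` is your own retrieval call.
from openai import OpenAI

client = OpenAI()

def multi_query_retrieve(query, search, n_variations=3):
    prompt = (
        f"Rewrite the search query below in {n_variations} different ways, one per line.\n"
        f"Query: {query}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    variations = [query] + [line.strip() for line in reply.splitlines() if line.strip()]
    rankings = [search(q) for q in variations]       # one ranked list of doc IDs per variation
    return reciprocal_rank_fusion(rankings)          # fuse with RRF (see the earlier sketch)
```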

Problem: Specific questions may miss broader context needed to answer well.

Solution (Step-Back Prompting): Generate a more general version of the query first.

How it works:

  1. Specific query: “What’s the boiling point of water at 2000m altitude?”
  2. Step-back query: “How does altitude affect boiling point?”
  3. Retrieve documents for BOTH queries
  4. Combine context (general + specific)

When to use: Complex questions that require background knowledge.

| Technique | When to Use | Latency Impact | Recall Improvement |
| --- | --- | --- | --- |
| HyDE | Technical/specialized queries | +1 LLM call | +15-25% |
| Multi-Query | Ambiguous queries | +3-5 searches | +10-20% |
| Step-Back | Complex questions | +1 LLM call, +1 search | Better context |
| Query Rewriting | Poor user queries | +1 LLM call | Variable |

Tip

Start simple: Most RAG systems work fine without query transformation. Add these techniques only if you see retrieval quality issues.


10. Advanced RAG Patterns

Basic RAG (chunk → embed → retrieve → generate) works well for simple use cases. For more complex scenarios, consider these advanced patterns:

Problem: Retrieved documents might be irrelevant or outdated.

Solution (CRAG, Corrective RAG): Evaluate retrieval quality and self-correct if needed.

How it works:

  1. Retrieve documents normally
  2. Use an LLM to evaluate: “Are these documents relevant to the query?”
  3. If confident → proceed to generation
  4. If uncertain → try alternative retrieval (e.g., web search)
  5. If irrelevant → fall back to web search or “I don’t know”

When to use: Knowledge bases that may be incomplete, time-sensitive information.

Problem: Not every query needs retrieval. Basic RAG always retrieves, wasting resources.

Solution (Self-RAG): Let the LLM decide WHEN to retrieve and verify its own outputs.

How it works:

  1. LLM evaluates: “Do I need external information for this query?”
  2. If yes → retrieve and generate with context
  3. If no → generate directly from knowledge
  4. After generation → LLM verifies: “Is this answer supported by the context?”

When to use: Mixed query types (some factual, some conversational), cost-sensitive applications.

Problem: Complex questions require multiple retrieval steps and reasoning.

Solution (Agentic RAG): The LLM acts as an agent that can retrieve, reason, and retrieve again.

How it works:

  1. LLM analyzes the question
  2. Breaks it into sub-questions if needed
  3. Retrieves information for each sub-question
  4. Reasons over retrieved information
  5. May retrieve again if gaps are found
  6. Synthesizes final answer

Example: “Compare the economic policies of the last 3 US presidents”

  • Agent retrieves info on President 1
  • Agent retrieves info on President 2
  • Agent retrieves info on President 3
  • Agent synthesizes comparison

When to use: Research questions, multi-hop reasoning, complex analysis.

| Pattern | Complexity | Best For | Key Benefit |
| --- | --- | --- | --- |
| Basic RAG | Simple | Straightforward Q&A | Easy to implement |
| CRAG | Medium | Incomplete knowledge bases | Handles retrieval failures |
| Self-RAG | Medium | Mixed query types | Efficient (skips unnecessary retrieval) |
| Agentic RAG | High | Complex research | Multi-step reasoning |

Important

Start with Basic RAG. Only add complexity when you have evidence that basic RAG isn’t working. Each additional pattern adds latency and cost.


11. RAG Evaluation Framework

You can’t improve what you don’t measure. RAG systems require evaluation at two stages: retrieval and generation.

These measure how well your system finds relevant documents:

| Metric | What It Measures | Formula | Good Score |
| --- | --- | --- | --- |
| Recall@k | % of relevant docs in top k results | relevant_in_k / total_relevant | > 0.8 |
| Precision@k | % of top k results that are relevant | relevant_in_k / k | > 0.6 |
| MRR | Position of first relevant result | 1 / rank_of_first_relevant | > 0.7 |
| NDCG | Ranking quality (position matters) | DCG / ideal_DCG | > 0.7 |

Example: For query “What is RAG?”, if you retrieve 10 docs and 3 are relevant:

  • If relevant docs are at positions 1, 2, 5 → Good (high MRR, good NDCG)
  • If relevant docs are at positions 6, 8, 10 → Bad (low MRR, poor NDCG)
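
These retrieval metrics are easy to compute yourself once you have labeled relevant documents per query; a sketch:

```python
# Sketch: Recall@k and MRR for a single query.
def recall_at_k(retrieved_ids, relevant_ids, k):
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(retrieved_ids, relevant_ids):
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

retrieved = ["d7", "d2", "d9", "d4", "d1"]       # system output, best first
relevant = {"d2", "d4"}                           # ground truth for this query
print(recall_at_k(retrieved, relevant, k=5))      # 1.0  (both relevant docs are in the top 5)
print(mrr(retrieved, relevant))                   # 0.5  (first relevant doc is at rank 2)
```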

RAGAS is the standard framework for evaluating RAG generation quality:

| Metric | What It Measures | How It’s Calculated |
| --- | --- | --- |
| Faithfulness | Is answer grounded in context? | LLM checks if claims are supported |
| Answer Relevancy | Does answer address the question? | LLM scores relevance |
| Context Precision | Are retrieved docs actually useful? | % of context that contributed to answer |
| Context Recall | Did we retrieve all needed info? | Can answer be derived from context alone? |

To evaluate your RAG system, you need:

  1. Test Questions: 50-100 representative queries
  2. Ground Truth Answers: What the correct answer should be
  3. Relevant Documents: Which docs should be retrieved

Quick Start:

  • Extract real user questions from logs
  • Have domain experts write answers
  • Run evaluation weekly to catch regressions
Best practices:

  1. Separate retrieval and generation evaluation - A bad answer might be due to poor retrieval OR poor generation
  2. Test edge cases - Queries with no answer, ambiguous queries, multi-hop questions
  3. Track metrics over time - Catch regressions early
  4. Use human evaluation for final quality assessment - Automated metrics don’t catch everything

Tip

Minimum viable evaluation: Start with 50 test questions and track Recall@10 for retrieval and Faithfulness for generation. Expand from there.


12. Debugging RAG Failures

When your RAG system gives wrong answers, use this systematic approach to find and fix the problem.

| Symptom | Likely Cause | Solution |
| --- | --- | --- |
| Returns irrelevant documents | Embedding model mismatch | Try domain-specific embeddings |
| Misses obvious answers | Chunks too small | Increase chunk size + overlap |
| “I don’t know” for known facts | Document not indexed | Check ingestion pipeline |
| Contradictory answers | Multiple conflicting sources | Add source reliability scoring |
| Slow responses | Too many docs retrieved | Reduce k, add reranking |
| Hallucinations despite context | LLM ignoring context | Lower temperature, stronger instructions |
| Wrong version info | Semantic similarity ignores numbers | Use hybrid search |

When a query fails, check these in order:

1. Is the document even in the index?

  • Search for an exact phrase from the expected document
  • If not found → ingestion problem

2. Is the document retrieved?

  • Look at similarity scores of retrieved docs
  • If expected doc has low score → embedding or chunking problem

3. Is the right chunk retrieved?

  • Check if the answer spans multiple chunks
  • If answer is split → adjust chunk size/overlap

4. Is the LLM using the context?

  • Check if answer matches retrieved context
  • If LLM ignores context → adjust prompt, lower temperature

5. Is there conflicting information?

  • Check for contradictory docs in results
  • If present → add source filtering or recency scoring

Test #1: Direct phrase search

  • Search for exact text from a document you know exists
  • If it doesn’t appear in top 10 → indexing problem

Test #2: Synonym search

  • Search using synonyms of known document content
  • If it works → your embedding model is fine
  • If it fails → consider different embedding model

Test #3: Compare dense vs sparse

  • Run query through dense search only
  • Run query through sparse (keyword) search only
  • Compare results → decide if hybrid search would help

| Component | Suspect If… |
| --- | --- |
| Chunking | Answers are partially correct, missing context |
| Embedding Model | Synonyms don’t match, domain terms fail |
| Index Config | High recall but slow, or fast but missing results |
| Retrieval k | Good docs exist but not in top k |
| Reranking | Good docs retrieved but ranked low |
| Prompt | Correct context but wrong answer |
| LLM Temperature | Answers vary wildly, or include made-up facts |

Warning

Don’t guess - measure! Log every query, retrieval result, and answer. Use these logs to identify patterns in failures.


13. Cost Optimization

RAG systems can get expensive at scale. Here’s how to optimize costs without sacrificing quality.

| Component | Cost Driver | Typical Cost |
| --- | --- | --- |
| Embedding API | Per token | ~$0.02 per 1M tokens |
| Vector Database | Storage + queries | $0.01-0.10 per 1K queries |
| LLM API | Input + output tokens | $0.50-5.00 per 1K queries |
| Reranking | Per document scored | ~$0.001 per document |

Key Insight: LLM calls dominate costs (often 80-90% of total).

Not every query needs RAG. Route simple queries directly to the LLM.

  • “What’s 2+2?” → Direct LLM (no retrieval needed)
  • “What’s our refund policy?” → RAG (needs company docs)

Savings: 30-50% reduction in RAG operations.

Cache at multiple levels:

  • Query cache: Same query → same results
  • Embedding cache: Same text → same embedding
  • Answer cache: Frequent questions → cached answers

Savings: 20-40% reduction in API calls.
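
A sketch of the simplest of these, an in-process embedding cache; the `embed` callable is a placeholder for your embedding API wrapper, and a production system would typically use a shared cache such as Redis instead of a dict:

```python
# Sketch: in-process embedding cache keyed by a hash of the input text.
import hashlib

_embedding_cache = {}

def cached_embedding(text, embed):
    """`embed(text)` is your real embedding call, e.g. a wrapper around the OpenAI API."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed(text)   # pay for the API call only on a cache miss
    return _embedding_cache[key]
```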

Only use expensive operations when needed:

  1. Fast vector search (always)
  2. Reranking (only if initial results are uncertain)
  3. Query transformation (only if initial retrieval fails)

Savings: 15-25% reduction in compute.

Reduce tokens sent to LLM:

  • Retrieve fewer documents (5 instead of 10)
  • Use smaller chunks
  • Compress context before sending

Savings: 20-40% reduction in LLM costs.

Use cheaper models for appropriate tasks:

| Task | Expensive Option | Cheaper Alternative |
| --- | --- | --- |
| Embedding | text-embedding-3-large | text-embedding-3-small |
| Query routing | GPT-4 | GPT-3.5 or classifier |
| Simple Q&A | GPT-4 | GPT-3.5 or Claude Haiku |
| Complex reasoning | GPT-4 | Keep GPT-4 (worth the cost) |

Scenario: 10,000 queries/month, 5 docs retrieved per query

| Approach | Embedding | Vector DB | LLM | Total/Month |
| --- | --- | --- | --- | --- |
| Basic RAG | $0.20 | $1.00 | $50.00 | ~$51 |
| With reranking | $0.20 | $1.00 | $70.00 | ~$71 |
| Optimized | $0.10 | $0.50 | $25.00 | ~$26 |

Optimizations applied: Query routing (50%), caching (20%), context compression (30%).

Tip

Measure before optimizing. Log costs per query type. You’ll often find 80% of costs come from 20% of query types—focus there first.


14. Popular Vector Databases

| Database | Type | Key Features | Best For |
| --- | --- | --- | --- |
| Pinecone | Managed SaaS | Serverless, hybrid search, easy setup | Production without DevOps |
| Weaviate | Open Source/Cloud | GraphQL API, built-in ML modules | Flexible deployments |
| Qdrant | Open Source/Cloud | Rust-based, very fast, good filtering | High performance needs |
| Milvus | Open Source | Distributed, GPU support, massive scale | Billions of vectors |
| Chroma | Open Source | Simple API, in-memory option | Prototyping, local dev |
| pgvector | Postgres Extension | SQL integration, familiar tooling | Teams using Postgres |

| Option | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Managed SaaS (Pinecone) | Zero DevOps, high reliability | Higher cost, vendor lock-in | Small teams, quick start |
| Managed Cloud (Weaviate Cloud, Qdrant Cloud) | Balance of control and convenience | Medium cost | Growing teams |
| Self-Hosted (Milvus, Qdrant, Weaviate) | Full control, lowest cost at scale | DevOps required | Large teams, enterprises |

Choose Pinecone if: You want zero infrastructure management and fastest time-to-production.

Choose Weaviate if: You need flexibility and like GraphQL APIs.

Choose Qdrant if: Performance is critical and you want both cloud and self-hosted options.

Choose Milvus if: You have billions of vectors and need distributed architecture.

Choose Chroma if: You’re prototyping or building local-first applications.

Choose pgvector if: You already use PostgreSQL and want to add vector search without new infrastructure.

Warning

Security:

  • Never expose API keys in client-side code
  • Implement rate limiting
  • Sanitize queries to prevent prompt injection
  • Use row-level security for multi-tenant applications

Performance:

  • Batch embedding requests when indexing
  • Use async operations for concurrent retrieval
  • Cache frequently-asked queries
  • Monitor query latency and adjust index parameters

Reliability:

  • Set up automated backups
  • Use replicas for high availability
  • Implement graceful degradation when DB is slow

15. Best Practices Summary

Tip

Chunking guidelines:

  1. Preserve context: Don’t split mid-sentence
  2. Use overlap: 10-20% prevents information loss
  3. Respect structure: Use headers as natural break points
  4. Include metadata: Store source, page number, section title
  5. Test empirically: Evaluate with different chunk sizes

Embedding model selection:

| Use Case | Model | Rationale |
| --- | --- | --- |
| General English | text-embedding-3-small | Cost-effective |
| Maximum accuracy | text-embedding-3-large | Best quality |
| Multi-language | embed-multilingual-v3.0 | 100+ languages |
| Open-source | BGE-large-en-v1.5 | Top OSS performance |
| Low latency | all-MiniLM-L6-v2 | Fast, 384 dimensions |

Key takeaways:

  1. Embeddings transform text into searchable vectors capturing semantic meaning
  2. Vector databases enable fast similarity search at scale (HNSW is the standard)
  3. RAG grounds LLM responses in factual, retrievable knowledge
  4. Chunking significantly impacts quality—use recursive chunking with overlap
  5. Hybrid search combines semantic + keyword for best coverage
  6. Reranking improves precision using cross-encoders
  7. Evaluation is essential—track Recall@k and Faithfulness at minimum
  8. Cost optimization starts with query routing and caching

Phase 1: Basic RAG (Start here)

  1. Choose embedding model (start with text-embedding-3-small)
  2. Set up vector database (Pinecone for managed, Chroma for local)
  3. Implement basic chunking (512 tokens, 10% overlap)
  4. Build retrieve → generate pipeline
  5. Test with 20-30 sample questions

Phase 2: Quality Improvements (When basic RAG isn’t enough)

  • Add hybrid search if keyword matching matters
  • Add reranking if relevant docs are retrieved but ranked low
  • Adjust chunk sizes based on retrieval quality
  • Build evaluation dataset, track metrics

Phase 3: Advanced Features (For complex use cases)

  • Query transformation (HyDE, multi-query) for better retrieval
  • CRAG/Self-RAG for handling failures
  • Agentic RAG for multi-hop reasoning
  • Cost optimization for scale
Important

Start Simple, Iterate

Don’t add complexity until you have evidence it’s needed. Each additional component adds latency, cost, and maintenance burden.

The best RAG system is the simplest one that meets your quality requirements.

| Question | If Yes → | If No → |
| --- | --- | --- |
| Do exact keywords matter? | Use hybrid search | Dense search is fine |
| Are relevant docs ranked low? | Add reranking | Skip reranking |
| Do queries use different vocabulary than docs? | Try HyDE | Basic retrieval is fine |
| Are some queries too complex? | Consider Agentic RAG | Basic RAG is fine |
| Is cost a concern? | Implement query routing + caching | Optimize later |

Further Reading: