Search Pipeline

A deep dive into the 16-step production-grade RAG pipeline that powers every Memcity search.

What is a RAG Pipeline?

If you've used ChatGPT, you know that language models are incredibly smart — but they only know what they were trained on. They don't know about your company's documents, your product's features, or your customer's support tickets.

RAG (Retrieval-Augmented Generation) solves this by adding a "retrieval" step before generation:

  1. User asks a question → "What's our refund policy?"
  2. Retrieval → Search your documents and find the relevant chunks
  3. Augmented Generation → Feed those chunks to the LLM as context, and it generates an accurate answer

Without RAG, the LLM guesses (and hallucinates). With RAG, it answers based on your actual data.

Memcity's getContext method is the retrieval part — it finds the most relevant chunks from your knowledge base. You then pass those chunks to whatever LLM you're using for the generation step.
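
To make the retrieval/generation split concrete, here is a minimal sketch of that flow. The getContext call is Memcity's retrieval step as described above; the client shape, option handling, and the OpenAI chat call are illustrative assumptions, not the exact API.

ts
// Sketch of the RAG flow: retrieve with Memcity, generate with your LLM.
// The Memcity client shape here is an assumption for illustration.
import OpenAI from "openai";

const openai = new OpenAI();

type MemcityLike = { getContext: (query: string) => Promise<{ text: string }[]> };

async function answer(memcity: MemcityLike, question: string) {
  // 1. Retrieval: find the most relevant chunks in your knowledge base.
  const chunks = await memcity.getContext(question);

  // 2. Augmented generation: pass those chunks to the LLM as context.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Answer using only this context:\n${chunks.map(c => c.text).join("\n---\n")}`,
      },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content;
}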

Why 16 Steps?

A naive RAG pipeline has 3 steps: embed query, find similar chunks, return them. It works, but the results are mediocre. Each additional step addresses a specific failure mode:

  • Queries are vague → Query routing, decomposition, and expansion fix this
  • Short queries match poorly → HyDE generates a hypothetical document to search with
  • Keyword searches miss synonyms → Semantic search understands meaning
  • Semantic search misses exact terms → BM25 keyword search catches them
  • One ranking isn't enough → RRF fusion + reranking improve precision
  • Related concepts are invisible → Knowledge graph traversal finds connections
  • Results lack context → Chunk expansion fetches surrounding text
  • Old results bury new ones → Temporal boost favors recent content

Each step is optional and tier-gated. Community tier uses steps 2, 6-8, 10, 14, 16. Pro enables everything.

Pipeline Overview

Query → [1. Quota Check] → [2. Cache] → [3. Route] → [4. Decompose]
      → [4.5. Expand] → [5. HyDE] → [6. Embed] → [7. Search ×2]
      → [8. RRF Fusion] → [9. ACL Filter] → [10. Dedup]
      → [11. GraphRAG] → [12. Rerank] → [13. Expand Chunks]
      → [13.5. Temporal Boost] → [13.6. Citations]
      → [14. Confidence] → [14.5. RAPTOR] → [15. Memory]
      → [16. Format + Cache + Analytics + Audit]

Step-by-Step Breakdown

Step 1: Quota Check (Team)

If enterprise.quotas is enabled, Memcity checks whether the requesting organization has exceeded their API rate limit before doing any expensive work. If they're over quota, the request is rejected immediately with a clear error.

This is intentionally the first step — no point embedding a query or searching if you're going to reject the request anyway.

Step 2: Cache Check

Memcity caches query embeddings. If the exact same query was searched recently, it skips re-embedding and jumps straight to the search step. This saves both time (~50ms) and money (Jina embedding API calls).

Cache hit rates vary by application — documentation search sites might see 30-50% cache hits (users ask similar questions), while creative applications see lower rates.
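
The exact cache internals aren't exposed, but the idea is a map from the query text to its embedding. A rough sketch, where the key normalization and TTL are assumptions:

ts
// Rough sketch of an embedding cache keyed by normalized query text.
// Normalization and the 15-minute TTL are assumptions; Memcity may differ.
type CachedEmbedding = { vector: number[]; cachedAt: number };

const cache = new Map<string, CachedEmbedding>();
const TTL_MS = 15 * 60 * 1000;

async function getQueryEmbedding(query: string, embed: (q: string) => Promise<number[]>) {
  const key = query.trim().toLowerCase();
  const hit = cache.get(key);
  if (hit && Date.now() - hit.cachedAt < TTL_MS) {
    return hit.vector; // cache hit: skip the ~50ms embedding API call
  }
  const vector = await embed(query); // cache miss: pay the embedding cost once
  cache.set(key, { vector, cachedAt: Date.now() });
  return vector;
}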

Step 3: Query Routing (Pro+)

Not all queries are created equal. "What is React?" is simple. "Compare the authentication approaches used in our microservices and recommend the best one for our new API" is complex.

The LLM classifies the query into one of three categories:

Classification   What happens                                         Example
Simple           Skip decomposition and HyDE. Fast path.              "What is the refund policy?"
Moderate         Use query expansion but skip decomposition.          "How do refunds work for digital vs physical?"
Complex          Full pipeline: decomposition, HyDE, max expansions.  "Compare our refund policy with competitor X and identify gaps"

This means simple queries are fast (~200ms) and complex queries get the full treatment (~800ms) without you writing any conditional logic.
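
Conceptually the router is a small LLM classification call whose output gates the expensive steps. A sketch, where the prompt wording and the complete helper are assumptions; only the three categories come from this section:

ts
// Sketch of LLM-based query routing. The prompt and `complete` helper are
// illustrative assumptions; the three categories come from the docs.
type Route = "simple" | "moderate" | "complex";

async function routeQuery(query: string, complete: (prompt: string) => Promise<string>): Promise<Route> {
  const reply = await complete(
    `Classify this search query as "simple", "moderate", or "complex". ` +
    `Reply with one word only.\nQuery: ${query}`
  );
  const label = reply.trim().toLowerCase();
  return label === "complex" ? "complex" : label === "moderate" ? "moderate" : "simple";
}

// The route then gates the expensive steps, e.g.:
// if (route !== "simple")  { /* run query expansion */ }
// if (route === "complex") { /* run decomposition + HyDE */ }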

Step 4: Query Decomposition (Pro+)

Complex queries contain multiple sub-questions. Decomposition breaks them apart so each can be searched independently:

Input query: "What are the differences between our vacation and sick leave policies, and how do they compare to industry standards?"

Decomposed into:

  1. "What is our vacation policy?"
  2. "What is our sick leave policy?"
  3. "What are industry standard vacation allowances?"
  4. "What are industry standard sick leave allowances?"

Each sub-query gets its own search, and results are merged. This dramatically improves recall for multi-part questions that a single search would struggle with.
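
In code, decomposition is another LLM call that returns standalone sub-questions, each searched independently. A sketch (the prompt and helper are assumptions):

ts
// Sketch of query decomposition: one LLM call that returns standalone
// sub-questions, one per line. Prompt wording is an assumption.
async function decompose(query: string, complete: (prompt: string) => Promise<string>): Promise<string[]> {
  const reply = await complete(
    `Break this question into independent sub-questions, one per line:\n${query}`
  );
  return reply
    .split("\n")
    .map(line => line.replace(/^\d+[.)]\s*/, "").trim()) // strip "1." / "2)" prefixes
    .filter(Boolean);
}

// Each sub-query is searched on its own and the result sets are merged:
// const results = (await Promise.all(subQueries.map(search))).flat();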

Step 4.5: Query Expansion (Pro+)

Generates semantic variations of the query to cast a wider net. Configurable via maxQueryExpansions (default: 3).

Original query: "Python web development"

Expanded to:

  1. "Django Flask FastAPI web frameworks"
  2. "Building web applications in Python"
  3. "Python HTTP server REST API development"

Each variation might match different documents that the original query would miss. More expansions = better recall but higher latency.

Step 5: HyDE Generation (Pro+)

HyDE (Hypothetical Document Embeddings) is one of the most powerful techniques in modern RAG. Here's the insight: your query is short, but the answer you want is in a long document. Short queries and long documents don't always match well in embedding space.

The trick: Ask the LLM to imagine the perfect document that answers the query, then search for real documents similar to that imaginary one.

Example:

  • Query: "vacation days"
  • Hypothetical document the LLM generates: "Our company provides all full-time employees with 20 days of paid vacation per year. Part-time employees receive a prorated amount based on their weekly hours. Vacation days must be used within the calendar year and do not roll over..."
  • What happens: This hypothetical text gets embedded alongside the original query. The embedding of this detailed hypothetical is much closer to the real vacation policy document than the two-word query "vacation days" would be.
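
A sketch of HyDE in code. The prompt and helper names are assumptions; the key point is that the detailed hypothetical answer, not the raw query, is what gets embedded and searched:

ts
// Sketch of HyDE: embed a hypothetical answer instead of (or alongside) the
// short query. Prompt wording and helper names are illustrative assumptions.
async function hydeSearch(
  query: string,
  complete: (prompt: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>,
  vectorSearch: (vector: number[]) => Promise<unknown[]>
) {
  // Ask the LLM to imagine the document that would answer the query.
  const hypothetical = await complete(
    `Write a short passage that would perfectly answer: "${query}"`
  );

  // Embed the detailed hypothetical; it lands much closer to real answer
  // documents in embedding space than a two-word query would.
  const vector = await embed(hypothetical);
  return vectorSearch(vector);
}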

Step 6: Embedding Generation

The query (and any expansions/HyDE outputs) are converted into vectors using Jina v4 embeddings.

What are embeddings? Think of them as coordinates in meaning-space. The word "king" might be at position [0.2, 0.8, 0.1, ...] and "queen" at [0.2, 0.7, 0.1, ...] — they're close because they're semantically similar. "Banana" would be far away at [0.9, 0.1, 0.8, ...].

Jina v4 produces 1,024-dimensional vectors — that's 1,024 numbers per text snippet. These high-dimensional vectors capture nuanced meaning: not just that "king" and "queen" are similar, but how they're similar (both royalty, different gender).

Embeddings are cached, so repeated queries skip this step entirely.
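
"Closeness" in meaning-space is usually measured with cosine similarity. A small worked sketch using the toy three-dimensional vectors above (real Jina v4 vectors have 1,024 dimensions):

ts
// Cosine similarity: 1.0 means identical direction in meaning-space,
// values near 0 mean unrelated. Toy 3-D vectors stand in for 1,024-D ones.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([0.2, 0.8, 0.1], [0.2, 0.7, 0.1]); // ≈ 0.999 ("king" vs "queen": close)
cosineSimilarity([0.2, 0.8, 0.1], [0.9, 0.1, 0.8]); // ≈ 0.34  ("king" vs "banana": far)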

Step 7: Parallel Search

Two searches run simultaneously:

Semantic search (vector similarity): Finds chunks whose embeddings are closest to the query embedding. This understands meaning — "How do I cancel?" matches "To terminate your subscription, navigate to account settings."

BM25 search (keyword matching): Finds chunks containing the same words as the query, weighted by term frequency and inverse document frequency. This catches exact matches: "error code 4012" matches documents containing exactly "4012".

Running both in parallel means you get the strengths of each without doubling the latency.
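
The parallelism itself is straightforward. A sketch, where the two search functions are hypothetical stand-ins:

ts
// Sketch of hybrid search: fire both retrievers at once and wait for both.
// The two search functions are hypothetical stand-ins for illustration.
type Ranked = { id: string; rank: number };

async function hybridSearch(
  query: string,
  queryVector: number[],
  semanticSearch: (v: number[]) => Promise<Ranked[]>,
  bm25Search: (q: string) => Promise<Ranked[]>
) {
  const [semanticResults, bm25Results] = await Promise.all([
    semanticSearch(queryVector), // vector similarity: understands meaning
    bm25Search(query),           // keyword matching: catches exact terms
  ]);
  return { semanticResults, bm25Results }; // both ranked lists feed RRF fusion (Step 8)
}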

Step 8: Weighted RRF Fusion

Reciprocal Rank Fusion merges the two result sets. The formula:

typescript
RRF_score(doc) = weight_semantic × (1 / (k + rank_semantic))
               + weight_bm25 × (1 / (k + rank_bm25))

Where k = 60 (a constant that prevents top results from dominating too much).

Worked example:

  • Document A is ranked #1 in semantic, #5 in BM25
  • Document B is ranked #3 in semantic, #1 in BM25
  • With default weights (0.7 semantic, 0.3 BM25):
    • A: 0.7 × (1/61) + 0.3 × (1/65) = 0.01148 + 0.00462 = 0.01610
    • B: 0.7 × (1/63) + 0.3 × (1/61) = 0.01111 + 0.00492 = 0.01603
  • Document A wins slightly — semantic ranking is weighted higher.

You can tune these weights via search.weights.semantic and search.weights.bm25.
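
Here is the worked example translated into code. The ranked-list shape is an assumption; the formula, k = 60, and the 0.7/0.3 default weights come from this section:

ts
// Weighted Reciprocal Rank Fusion over two ranked lists of document ids.
// k = 60 and the 0.7 / 0.3 weights are the defaults described above.
function rrfFuse(
  semanticRanking: string[], // doc ids in semantic-search order
  bm25Ranking: string[],     // doc ids in BM25 order
  weights = { semantic: 0.7, bm25: 0.3 },
  k = 60
): Map<string, number> {
  const scores = new Map<string, number>();
  const add = (ranking: string[], weight: number) =>
    ranking.forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight * (1 / (k + rank)));
    });
  add(semanticRanking, weights.semantic);
  add(bm25Ranking, weights.bm25);
  return scores;
}

// Reproduces the worked example: A is #1 semantic / #5 BM25, B is #3 / #1.
rrfFuse(["A", "x", "B", "y", "z"], ["B", "x", "y", "z", "A"]);
// A ≈ 0.0161, B ≈ 0.0160, so A wins slightly.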

Step 9: ACL Filtering (Team)

If enterprise.acl is enabled, results are filtered based on the requesting user's principals — identifiers like user:alice, role:admin, group:engineering.

A document is visible to a user only if the user's principals overlap with the document's ACL list. Documents without ACLs are visible to everyone.

This happens after retrieval, not during — so the vector search still operates over the full index for best recall, and ACL filtering narrows the results afterward.
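
The overlap check itself is simple set intersection. A sketch, assuming each result carries an optional acl list of principals:

ts
// Sketch of post-retrieval ACL filtering. The result shape (an optional
// `acl` array of principal strings) is an assumption for illustration.
type SearchResult = { id: string; text: string; acl?: string[] };

function filterByAcl(results: SearchResult[], userPrincipals: string[]): SearchResult[] {
  const principals = new Set(userPrincipals); // e.g. ["user:alice", "role:admin", "group:engineering"]
  return results.filter(result =>
    // Documents without ACLs are visible to everyone; otherwise the user
    // needs at least one overlapping principal.
    !result.acl || result.acl.length === 0 || result.acl.some(p => principals.has(p))
  );
}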

Step 10: Semantic Deduplication

After merging results from multiple sources (semantic, BM25, graph, expansions), you often get near-duplicates — chunks that say essentially the same thing in slightly different words.

Memcity computes cosine similarity between all result pairs and removes duplicates above a threshold. This ensures your results are diverse, not repetitive.
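
A sketch of the deduplication pass, reusing cosine similarity and a configurable threshold (the 0.95 default here is an assumption):

ts
// Sketch of semantic deduplication: keep a result only if it is not too
// similar to any result already kept. The 0.95 threshold is an assumption.
type Chunk = { id: string; embedding: number[] };

function dedupe(
  results: Chunk[],
  cosine: (a: number[], b: number[]) => number,
  threshold = 0.95
): Chunk[] {
  const kept: Chunk[] = [];
  for (const candidate of results) {
    const isDuplicate = kept.some(existing => cosine(candidate.embedding, existing.embedding) > threshold);
    if (!isDuplicate) kept.push(candidate); // earlier (higher-ranked) results win ties
  }
  return kept;
}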

Step 11: Entity Search + GraphRAG (Pro+)

This is where the knowledge graph comes in. Memcity:

  1. Extracts entities from the query — "What does the CEO think about remote work?" → entities: ["CEO", "remote work"]
  2. Matches entities in the graph — Finds the "CEO" node and the "remote work" node
  3. Traverses relationships — Follows connections: CEO → "leads" → Company → "has policy" → Remote Work Policy
  4. Retrieves connected chunks — Documents connected through the graph that vector search alone might miss

This is incredibly powerful for questions that span multiple documents. See Knowledge Graph for details.

Step 12: Jina Reranker v3 (Pro+)

The initial search uses bi-encoders — the query and documents are embedded separately and compared by distance. This is fast but approximate.

The reranker uses a cross-encoder — it looks at the query and each candidate together, considering how they interact word by word. This is slower (you can only rerank tens of results, not thousands) but much more accurate.

Think of it like a two-stage hiring process:

  • Stage 1 (initial search): Scan all 10,000 resumes with keyword matching → 50 candidates
  • Stage 2 (reranker): Interview each of the 50 candidates in depth → top 10

The reranker typically adds ~100ms of latency but significantly improves result quality.
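
In code, the two stages look roughly like this. The rerank call is a stand-in for a cross-encoder API (such as Jina's reranker endpoint) and its request shape is not shown; the candidate counts mirror the hiring analogy:

ts
// Sketch of the retrieve-then-rerank pattern. `rerank` stands in for a
// cross-encoder API call; the exact request shape is an assumption.
type Candidate = { id: string; text: string };

async function retrieveAndRerank(
  query: string,
  cheapSearch: (q: string, limit: number) => Promise<Candidate[]>,
  rerank: (q: string, docs: Candidate[]) => Promise<{ doc: Candidate; score: number }[]>
) {
  // Stage 1: fast, approximate bi-encoder search over the whole index.
  const candidates = await cheapSearch(query, 50);

  // Stage 2: slow, accurate cross-encoder scoring of just those candidates.
  const scored = await rerank(query, candidates);
  return scored.sort((a, b) => b.score - a.score).slice(0, 10);
}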

Step 13: Chunk Expansion (Pro+)

When a chunk matches your query, the surrounding context is often useful too. If the matching chunk says "All employees get 20 vacation days", the next chunk might explain the request process.

Chunk expansion fetches maxChunkExpansions chunks before and after each top result, assembling a wider context window. Think of it like highlighting a sentence in a book, then reading the full paragraph around it.
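
A sketch of chunk expansion, assuming chunks carry a document id and a sequential index within that document:

ts
// Sketch of chunk expansion: pull the neighbours around each matched chunk.
// The chunk shape and the fetch helper are assumptions for illustration.
type StoredChunk = { documentId: string; index: number; text: string };

async function expandChunk(
  match: StoredChunk,
  maxChunkExpansions: number,
  fetchChunk: (documentId: string, index: number) => Promise<StoredChunk | null>
): Promise<string> {
  const neighbours: StoredChunk[] = [];
  for (let offset = -maxChunkExpansions; offset <= maxChunkExpansions; offset++) {
    const chunk = offset === 0 ? match : await fetchChunk(match.documentId, match.index + offset);
    if (chunk) neighbours.push(chunk);
  }
  // Reassemble the surrounding context in document order.
  return neighbours.sort((a, b) => a.index - b.index).map(c => c.text).join("\n");
}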

Step 13.5: Temporal Boost

Applies a recency score based on when the document was ingested. Recent documents get a small boost, older documents get slightly penalized.

This is useful when you have evolving documentation — the 2024 policy should rank higher than the 2022 policy if both match the query.
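
The exact scoring curve isn't documented, but a typical approach is an exponential decay on document age. A sketch under that assumption (the decay shape and half-life are illustrative, not Memcity's actual values):

ts
// Sketch of a temporal boost using exponential decay. The decay shape and
// 180-day half-life are assumptions; Memcity's actual curve may differ.
function applyTemporalBoost(score: number, ingestedAt: Date, halfLifeDays = 180): number {
  const ageDays = (Date.now() - ingestedAt.getTime()) / (1000 * 60 * 60 * 24);
  const recency = Math.pow(0.5, ageDays / halfLifeDays); // 1.0 today, 0.5 after one half-life
  return score * (0.9 + 0.2 * recency); // small boost for new docs, slight penalty for old ones
}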

Step 13.6: Citation Generation (Pro+)

For each result, Memcity generates citations — breadcrumbs that tell you exactly where the answer came from:

json
{
  "text": "All employees get 20 vacation days per year.",
  "citations": {
    "source": "vacation-policy.md",
    "heading": "Vacation Policy > Requesting Time Off",
    "page": 1,
    "lineStart": 12,
    "lineEnd": 14
  }
}

This is essential for building trustworthy AI applications — users can verify the answer by clicking through to the source.

Step 14: Confidence Scoring

Each result gets a confidence score (0-1) based on multiple signals:

  • Relevance score from the search
  • Reranker score (if enabled)
  • Number of corroborating results
  • Source document quality signals

High confidence (above 0.8) means the result is very likely relevant. Low confidence (below 0.3) means it's a weak match — you might want to filter these out or flag them.
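
On the consuming side, the main use of the score is thresholding. A small sketch, assuming results expose a confidence field; the 0.8 and 0.3 cut-offs come from the guidance above:

ts
// Sketch of acting on confidence scores. The result shape is an assumption.
type ScoredResult = { text: string; confidence: number };

function triage(results: ScoredResult[]) {
  return {
    strong: results.filter(r => r.confidence >= 0.8),                     // safe to cite directly
    weak: results.filter(r => r.confidence >= 0.3 && r.confidence < 0.8), // show, but flag to the user
    discarded: results.filter(r => r.confidence < 0.3),                   // probably not relevant
  };
}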

Step 14.5: RAPTOR Summary Search (Pro+)

For high-level queries like "Give me an overview of our HR policies", individual chunks are too granular. RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) pre-builds hierarchical summaries:

  • Level 0: Individual chunks
  • Level 1: Summaries of chunk groups (e.g., "Vacation policy section")
  • Level 2: Summaries of summaries (e.g., "HR policies overview")
  • Level 3: Document-level summaries

When a high-level query is detected, RAPTOR searches these summaries instead of individual chunks.

Step 15: Memory Search (Pro+)

If the user has episodic memories, relevant ones are retrieved and included in the results. This personalizes search — a user who previously asked about "React" might get React-related results boosted.

Step 16: Format + Cache + Analytics + Audit

The final step:

  1. Format — Assemble the final response with all results, scores, citations, and metadata
  2. Cache — Store the query embedding for future cache hits
  3. Analytics — Record timing, result count, and pipeline statistics
  4. Audit (Team) — Write an immutable audit log entry recording the search

Performance Characteristics

Step                      Typical Latency           Tier
1. Quota Check            ~1ms                      Team
2. Cache Check            ~5ms                      All
3. Query Routing          ~50ms                     Pro+
4. Decomposition          ~100ms                    Pro+
4.5. Expansion            ~80ms                     Pro+
5. HyDE                   ~150ms                    Pro+
6. Embedding              ~50ms (or 0 if cached)    All
7. Parallel Search        ~30ms                     All
8. RRF Fusion             ~5ms                      All
9. ACL Filtering          ~5ms                      Team
10. Deduplication         ~5ms                      All
11. GraphRAG              ~100ms                    Pro+
12. Reranking             ~100ms                    Pro+
13. Chunk Expansion       ~20ms                     Pro+
14-16. Scoring + Format   ~10ms                     All

Total pipeline time:

  • Community (basic): ~100-150ms
  • Pro (full pipeline): ~400-800ms
  • Team (with ACL + audit): ~450-850ms

Tuning for Your Use Case

Minimize Latency

Disable expensive steps. Best for autocomplete, real-time search.

ts
search: {
  enableQueryRouting: false,
  enableHyde: false,
  reranking: false,
  maxQueryExpansions: 1,
}

Maximize Recall

Enable everything. Best for research, analysis, complex questions.

ts
search: {
  enableQueryRouting: true,
  enableQueryDecomposition: true,
  enableHyde: true,
  reranking: true,
  maxQueryExpansions: 5,
  maxChunkExpansions: 3,
}

Balanced

Good for most production applications.

ts
search: {
  enableQueryRouting: true,
  enableHyde: true,
  reranking: true,
  maxQueryExpansions: 3,
  maxChunkExpansions: 2,
}