Why We Built a Custom RAG Pipeline Instead of Buying a Vector DB SaaS

Local Qdrant plus direct LLM calls give us full control, zero egress costs, and 80% less debugging than managed RAG services. The trade-off is owning the infrastructure.

Published 2026-06-10


TL;DR: Managed RAG (Pinecone, Weaviate Cloud) adds latency, cost, and vendor lock-in for features we don’t need. Local Qdrant + 80 lines of Python beats them on our workload. Full comparison →

The Context

Hermes agents need retrieval over a 200k-token documentation corpus (brain/, project HERMES.md files, planning docs) for 18 cron jobs. Queries are sporadic (cron-scheduled), latency-sensitive (<5s end-to-end), and must work offline. We evaluated Pinecone Serverless, Weaviate Cloud, and Qdrant Cloud against local Qdrant embedded. Team: 1 operator. Constraint: zero external API dependencies, monthly cost <$10, data never leaves machine, debuggable in <5 min.

What We Tested

Tool / Configuration	Use Case	Verdict	Why
Pinecone Serverless	Managed vector DB	❌	Cold start latency (2-8s); $70/mo minimum; data leaves machine; overkill for 200k tokens
Weaviate Cloud	Managed vector DB + GraphQL	❌	Same cold starts; GraphQL adds complexity; $50/mo minimum; schema migration pain
Qdrant Cloud	Managed Qdrant	🟡	Better cold starts; but still $30/mo; why pay for what runs free locally?
Qdrant Local (embedded) + custom pipeline	Production RAG	✅ Current	Zero cold start (<50ms); $0; full control; 80 lines Python; sqlite-compatible backup
Chroma (local)	Previous attempt	❌	Unreliable HNSW persistence; silent corruption on crash; slower at scale
SQLite-vec / pgvector	Lightweight alternatives	🟡	Good for <50k tokens; our corpus grew; Qdrant’s HNSW + filtering won

The Pivot Point

Pinecone Serverless cold starts killed our 3 AM cron reliability. deep-research-001 would time out on its first query (8s Pinecone + 12s LLM = 20s > 15s cron timeout). We added retry logic, after which the first run took 30s and the second 3s (warmed). But “warming” a serverless index by running dummy queries at 2:55 AM is absurd for 200k tokens. Local Qdrant: first query 40ms, every query 40ms. No warmup. No bill.
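The timeout arithmetic is easy to check in a few lines. This is only a sketch using the figures quoted in this section; the helper name `fits_budget` is hypothetical and not part of the pipeline.

```python
def fits_budget(retrieval_s: float, llm_s: float, timeout_s: float) -> bool:
    """Return True when retrieval plus the LLM call finishes inside the cron timeout."""
    return retrieval_s + llm_s <= timeout_s


CRON_TIMEOUT_S = 15.0  # cron timeout quoted above
LLM_S = 12.0           # LLM call duration quoted above

# Cold Pinecone query: 8 s + 12 s = 20 s, which exceeds the 15 s timeout.
pinecone_cold = fits_budget(8.0, LLM_S, CRON_TIMEOUT_S)

# Local Qdrant query: 0.04 s + 12 s = 12.04 s, which fits.
qdrant_local = fits_budget(0.040, LLM_S, CRON_TIMEOUT_S)

print(pinecone_cold, qdrant_local)
```

With a 12s LLM call, anything above roughly 3s of retrieval latency breaks the job, so retrieval has little room to spare.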

What We Use Now

Local Qdrant + custom pipeline (production):

# qdrant_rag.py: ~80 lines, zero deps beyond qdrant-client + anthropic/openai
import hashlib
import json

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# embed(text) -> list[float] is defined elsewhere in the file (not shown here).
# It must return 1536-dim vectors to match the collection config below.

class LocalRAG:
    def __init__(self, path="./qdrant", collection="hermes-brain"):
        # Embedded (local) mode: everything is stored under `path`, no server process.
        self.client = QdrantClient(path=path)
        self.collection = collection
        self._ensure_collection()

    def _ensure_collection(self):
        # Create the collection on first run only.
        if not self.client.collection_exists(self.collection):
            self.client.create_collection(
                self.collection,
                vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
            )

    def upsert_docs(self, docs: list[dict]):  # {"id": "", "text": "", "metadata": {}}
        # Qdrant point IDs must be unsigned integers or UUID strings.
        points = [
            PointStruct(id=d["id"], vector=embed(d["text"]), payload={"text": d["text"], **d.get("metadata", {})})
            for d in docs
        ]
        self.client.upsert(self.collection, points=points)

    def search(self, query: str, limit: int = 5) -> list[dict]:
        # query_points (qdrant-client >= 1.10) replaces the deprecated client.search.
        hits = self.client.query_points(self.collection, query=embed(query), limit=limit).points
        return [{"text": h.payload["text"], "score": h.score, "metadata": {k: v for k, v in h.payload.items() if k != "text"}} for h in hits]

# Cron job indexer (runs daily 02:00):
# 1. Walk brain/ + projects/*/HERMES.md + planning/
# 2. Chunk by header (max 500 tokens)
# 3. Upsert to Qdrant with metadata: source_file, header_path, token_count
# 4. Log: chunks_indexed, duration, errors
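Step 2 of the indexer (chunk by header, max 500 tokens) might look roughly like the sketch below. It is not the production indexer. It assumes Markdown `#` headers, approximates tokens by whitespace-separated words because the tokenizer isn't specified here, and derives a deterministic UUID per chunk so the output can be upserted as-is. The names `chunk_by_header` and `count_tokens` are hypothetical.

```python
import re
import uuid

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def chunk_by_header(text: str, source_file: str, max_tokens: int = 500) -> list[dict]:
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (header level, title)
    buf: list[str] = []               # body lines under the current header

    def flush() -> None:
        body = "\n".join(buf).strip()
        buf.clear()
        if not body:
            return
        words = body.split()
        if len(words) <= max_tokens:
            pieces = [body]
        else:
            # Oversized section: split on word boundaries (line breaks are collapsed).
            pieces = [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
        header_path = " > ".join(title for _, title in path)
        for piece in pieces:
            chunks.append({
                # Deterministic UUID from file name + chunk position.
                "id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_file}#{len(chunks)}")),
                "text": piece,
                "metadata": {
                    "source_file": source_file,
                    "header_path": header_path,
                    "token_count": count_tokens(piece),
                },
            })

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the section that belonged to the previous header
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buf.append(line)
    flush()
    return chunks
```

The returned dicts have the `{"id", "text", "metadata"}` shape that `upsert_docs` expects. One known limitation of this sketch: a `#` comment inside a fenced code block would be mistaken for a header.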

Observability:

Every search logs: query_hash, top_5_scores, latency_ms, tokens_retrieved
hermes rag-stats shows index size, last index time, avg query latency (p50/p95/p99)
Backup: cp -r qdrant/ backups/qdrant-$(date +%F)/ (sqlite files, instantly portable)
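The per-search log fields and the p50/p95/p99 figures could be produced along these lines. This is a minimal sketch, not the actual `hermes rag-stats` implementation. It assumes a search callable that returns the same list-of-dicts shape as `LocalRAG.search`, approximates token counts by word counts, and collects records in an in-memory list. The helper names `logged_search` and `latency_percentiles` are hypothetical.

```python
import hashlib
import time
from statistics import quantiles


def logged_search(search_fn, query: str, sink: list) -> list[dict]:
    """Run search_fn(query), append one log record to sink, return the hits."""
    start = time.perf_counter()
    hits = search_fn(query)
    latency_ms = (time.perf_counter() - start) * 1000
    sink.append({
        # Hash instead of raw text so logs don't contain query contents.
        "query_hash": hashlib.sha256(query.encode("utf-8")).hexdigest()[:16],
        "top_5_scores": [round(h["score"], 4) for h in hits[:5]],
        "latency_ms": round(latency_ms, 2),
        # Assumption: whitespace word count stands in for a token count.
        "tokens_retrieved": sum(len(h["text"].split()) for h in hits),
    })
    return hits


def latency_percentiles(records: list[dict]) -> dict:
    """Compute p50/p95/p99 over logged latencies (needs at least two records)."""
    latencies = [r["latency_ms"] for r in records]
    if len(latencies) < 2:
        raise ValueError("need at least two records to compute percentiles")
    cuts = quantiles(latencies, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

In use, `logged_search(rag.search, "some query", records)` wraps the existing method without changing it. Writing each record as one JSON line would keep the log easy to grep.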

When You’d Choose Differently

Use Pinecone/Weaviate Cloud if: multi-region, team collaboration on index, >10M vectors, need managed scaling, compliance requires SOC2 vendor, budget absorbs $500+/mo.
Use Qdrant Cloud if: you want Qdrant but don’t want to manage the server (single-node, <$100/mo), team needs shared index.
Use Chroma/pgvector if: corpus <50k tokens, simplicity > performance, already on Postgres.
Stay local Qdrant if: single machine, data sovereignty, zero budget, operator can manage a directory.

Tool Crucible Rating

Dimension	Score (1–5)	Notes
Overall	5	Solves the problem perfectly at our scale; zero recurring cost; full ownership
Ease of Adoption	4	`pip install qdrant-client` + 80 lines; but you must write the pipeline (no managed UI)
Value	5	$0/mo vs $50-700/mo; better latency; no vendor risk; portable sqlite files
Support/Ecosystem	4	Qdrant team responsive; Python client stable; but you’re on your own for ops

This is part of our RAG Infrastructure evaluation series. See full comparison: Tool Crucible Vector DB Comparison 2026

Last reviewed 2026-06-10. See our methodology and affiliate policy.