Why We Built a Custom RAG Pipeline Instead of Buying a Vector DB SaaS
Qdrant local + direct LLM calls gives us full control, zero egress costs, and 80% less debugging than managed RAG services — the trade-off is owning the infrastructure.
Published 2026-06-10
TL;DR: Managed RAG (Pinecone, Weaviate Cloud, Qdrant Cloud) adds latency, cost, and vendor lock-in for features we don’t need. Local Qdrant + 80 lines of Python beats them on our workload. Full comparison →
The Context
Hermes agents need retrieval over a 200k-token documentation corpus (brain/, project HERMES.md files, planning docs) for 18 cron jobs. Queries are sporadic (cron-scheduled), latency-sensitive (<5s end-to-end), and must work offline. We evaluated Pinecone Serverless, Weaviate Cloud, and Qdrant Cloud against local Qdrant embedded. Team: 1 operator. Constraint: zero external API dependencies, monthly cost <$10, data never leaves machine, debuggable in <5 min.
What We Tested
| Tool / Configuration | Use Case | Verdict | Why |
|---|---|---|---|
| Pinecone Serverless | Managed vector DB | ❌ | Cold start latency (2-8s); $70/mo minimum; data leaves machine; overkill for 200k tokens |
| Weaviate Cloud | Managed vector DB + GraphQL | ❌ | Same cold starts; GraphQL adds complexity; $50/mo minimum; schema migration pain |
| Qdrant Cloud | Managed Qdrant | 🟡 | Better cold starts; but still $30/mo; why pay for what runs free locally? |
| Qdrant Local (embedded) + custom pipeline | Production RAG | ✅ Current | Zero cold start (<50ms); $0; full control; 80 lines Python; sqlite-compatible backup |
| Chroma (local) | Previous attempt | ❌ | No HNSW persistence reliability; silent corruption on crash; slower at scale |
| SQLite-vec / pgvector | Lightweight alternatives | 🟡 | Good for <50k tokens; our corpus grew; Qdrant’s HNSW + filtering won |
The Pivot Point
Pinecone Serverless cold starts killed our 3 AM cron reliability. deep-research-001 would time out on its first query (8s Pinecone + 12s LLM = 20s > 15s cron timeout). We added retry logic, so the first run now takes 30s and the second run 3s (warmed). But “warming” a serverless index by running dummy queries at 2:55 AM is absurd for 200k tokens. Local Qdrant: first query 40ms, every query 40ms. No warmup. No bill.
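The timeout arithmetic above can be written as a small budget check. This is only an illustrative sketch: the function name `fits_cron_budget` is hypothetical, and the numbers are the ones quoted in this section (8s worst-case cold start, 12s LLM call, 15s cron timeout, 40ms local query).

```python
def fits_cron_budget(retrieval_s: float, llm_s: float, timeout_s: float = 15.0) -> bool:
    """Return True when retrieval plus generation finishes inside the cron timeout."""
    return retrieval_s + llm_s <= timeout_s


# Worst-case managed cold start: 8s + 12s = 20s, which misses the 15s timeout.
managed_ok = fits_cron_budget(8.0, 12.0)

# Local Qdrant: 0.04s + 12s = 12.04s, which fits.
local_ok = fits_cron_budget(0.040, 12.0)
```

With these figures the LLM call dominates the budget, so the retrieval layer has roughly 3s of headroom before a job fails.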
What We Use Now
Local Qdrant + custom pipeline (production):
```python
# qdrant_rag.py: ~80 lines, zero deps beyond qdrant-client + anthropic/openai
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams


def embed(text: str) -> list[float]:
    """Placeholder: return a 1536-dim embedding from whichever model you use."""
    raise NotImplementedError("wire in your embedding model here")


class LocalRAG:
    def __init__(self, path="./qdrant", collection="hermes-brain"):
        self.client = QdrantClient(path=path)  # embedded mode, no server process
        self.collection = collection
        self._ensure_collection()

    def _ensure_collection(self):
        if not self.client.collection_exists(self.collection):
            self.client.create_collection(
                self.collection,
                vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
            )

    def upsert_docs(self, docs: list[dict]):  # {"id": "", "text": "", "metadata": {}}
        points = [
            PointStruct(
                # Qdrant point IDs must be unsigned ints or UUIDs, so derive a stable UUID
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, str(d["id"]))),
                vector=embed(d["text"]),
                payload={"text": d["text"], **d.get("metadata", {})},
            )
            for d in docs
        ]
        self.client.upsert(self.collection, points=points)

    def search(self, query: str, limit: int = 5) -> list[dict]:
        # query_points replaces client.search, which newer qdrant-client releases deprecate
        hits = self.client.query_points(self.collection, query=embed(query), limit=limit).points
        return [
            {"text": h.payload["text"], "score": h.score,
             "metadata": {k: v for k, v in h.payload.items() if k != "text"}}
            for h in hits
        ]

# Cron job indexer (runs daily 02:00):
# 1. Walk brain/ + projects/*/HERMES.md + planning/
# 2. Chunk by header (max 500 tokens)
# 3. Upsert to Qdrant with metadata: source_file, header_path, token_count
# 4. Log: chunks_indexed, duration, errors
```
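Step 2 of the indexer outline (chunk by header, max 500 tokens) could look roughly like the sketch below. It is not the production indexer. The function name `chunk_by_header`, the `source_file#n` id scheme, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration; a real tokenizer would give different counts.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_header(markdown: str, source_file: str, max_tokens: int = 500) -> list[dict]:
    """Split markdown at headers, then cap each chunk at max_tokens words."""
    chunks: list[dict] = []
    header_stack: list[str] = []  # current header path, e.g. ["Setup", "Install"]
    words: list[str] = []         # words collected under the current header

    def flush() -> None:
        # Emit the collected words in slices of at most max_tokens words.
        for start in range(0, len(words), max_tokens):
            piece = words[start:start + max_tokens]
            chunks.append({
                "id": f"{source_file}#{len(chunks)}",  # hypothetical id scheme
                "text": " ".join(piece),
                "metadata": {
                    "source_file": source_file,
                    "header_path": " > ".join(header_stack),
                    "token_count": len(piece),  # word count, not true tokens
                },
            })
        words.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the section that just ended
            level = len(match.group(1))
            del header_stack[level - 1:]  # drop headers at this depth or deeper
            header_stack.append(match.group(2).strip())
        else:
            words.extend(line.split())
    flush()
    return chunks
```

Each returned dict has the `id`, `text`, and `metadata` keys that `upsert_docs` expects, so the output of a directory walk can be passed to it directly.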
Observability:
- Every search logs: query_hash, top_5_scores, latency_ms, tokens_retrieved
- `hermes rag-stats` shows index size, last index time, avg query latency (p50/p95/p99)
- Backup: `cp -r qdrant/ backups/qdrant-$(date +%F)/` (sqlite files, instantly portable)
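The per-search log record and the latency percentiles could be produced along these lines. This is a minimal sketch, not the code behind `hermes rag-stats`: `logged_search` and `percentile` are hypothetical helpers, the 12-character hash prefix is arbitrary, and `tokens_retrieved` is approximated by a whitespace word count.

```python
import hashlib
import json
import math
import time


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def logged_search(search_fn, query: str, limit: int = 5) -> tuple[list[dict], dict]:
    """Run search_fn(query, limit) and build one structured log record for it."""
    start = time.perf_counter()
    hits = search_fn(query, limit)
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        # Hash the query so logs stay searchable without storing raw text.
        "query_hash": hashlib.sha256(query.encode("utf-8")).hexdigest()[:12],
        "top_5_scores": [round(h["score"], 4) for h in hits[:5]],
        "latency_ms": round(latency_ms, 2),
        # Whitespace word count stands in for a real token count (assumption).
        "tokens_retrieved": sum(len(h["text"].split()) for h in hits),
    }
    print(json.dumps(record))  # swap for your logger of choice
    return hits, record
```

Passing `rag.search` as `search_fn` wraps the class shown earlier; collecting `latency_ms` from each record and calling `percentile(latencies, 95)` yields the p95 figure.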
When You’d Choose Differently
- Use Pinecone/Weaviate Cloud if: multi-region, team collaboration on index, >10M vectors, need managed scaling, compliance requires SOC2 vendor, budget absorbs $500+/mo.
- Use Qdrant Cloud if: you want Qdrant but don’t want to manage the server (single-node, <$100/mo), team needs shared index.
- Use Chroma/pgvector if: corpus <50k tokens, simplicity > performance, already on Postgres.
- Stay local Qdrant if: single machine, data sovereignty, zero budget, operator can manage a directory.
Tool Crucible Rating
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Overall | 5 | Solves the problem perfectly at our scale; zero recurring cost; full ownership |
| Ease of Adoption | 4 | pip install qdrant-client + 80 lines; but you must write the pipeline (no managed UI) |
| Value | 5 | $0/mo vs $50-700/mo; better latency; no vendor risk; portable sqlite files |
| Support/Ecosystem | 4 | Qdrant team responsive; Python client stable; but you’re on your own for ops |
This is part of our RAG Infrastructure evaluation series. See full comparison: Tool Crucible Vector DB Comparison 2026
Last reviewed 2026-06-10. See our methodology and affiliate policy.