David Sterling  

PostgreSQL pgvector and RAG: Best Practices and Examples for Better Results

Retrieval-Augmented Generation (RAG) has become one of the most practical applications of AI in production systems, and PostgreSQL’s pgvector extension has emerged as a robust solution for implementing vector search. However, getting optimal results requires more than just installing pgvector and throwing embeddings at it. Let’s explore proven best practices and real-world examples to help you build high-performing RAG systems.

Understanding pgvector Fundamentals

Before diving into best practices, it’s essential to understand how pgvector works. The extension adds vector similarity search capabilities to PostgreSQL using three distance functions:

  • Cosine Distance (<=>): Best for normalized vectors; measures the angle between vectors
  • L2 Distance (<->): Euclidean distance; measures the straight-line distance between vectors
  • Inner Product (<#>): Returns the negative inner product; useful for embeddings trained for dot-product similarity

For most RAG applications using modern embedding models like OpenAI’s text-embedding-3-small or sentence-transformers, cosine distance is the preferred choice.
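For example, a basic cosine-distance query looks like this (a minimal sketch; $1 stands in for a query embedding with the same dimension as the column, and the documents table is defined in the next section):

SELECT id, content, embedding <=> $1::vector AS cosine_distance
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5;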

Best Practice 1: Optimize Your Table Schema

A well-designed table schema is crucial for performance. Here’s an optimized example:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1536),  -- Dimension matches your model
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Create index for vector similarity search
CREATE INDEX ON documents 
USING ivfflat (embedding vector_cosine_ops) 
WITH (lists = 100);

-- Additional indexes for filtering
CREATE INDEX idx_documents_metadata ON documents USING GIN (metadata);
CREATE INDEX idx_documents_created ON documents (created_at DESC);

Key points:

  • Use BIGSERIAL for scalability
  • Match vector dimension to your embedding model
  • Use JSONB for flexible metadata storage
  • Create appropriate indexes for common queries

Best Practice 2: Choose the Right Index Strategy

Index selection significantly impacts both query performance and recall quality. pgvector supports two index types:

IVFFlat (Inverted File with Flat Compression)

Best for datasets up to 1M vectors:

CREATE INDEX ON documents 
USING ivfflat (embedding vector_cosine_ops) 
WITH (lists = 100);

Choosing the lists parameter:

  • Small datasets (< 100K): lists = 100
  • Medium datasets (100K – 500K): lists = 500
  • Large datasets (500K – 1M): lists = 1000

Rule of thumb: lists = rows / 1000 (with minimum 100)
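One way to turn that rule of thumb into a number is to read the planner's row estimate (a sketch; reltuples is only an estimate and reads -1 on tables that have never been analyzed):

SELECT GREATEST(100, (reltuples / 1000)::int) AS suggested_lists
FROM pg_class
WHERE relname = 'documents';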

HNSW (Hierarchical Navigable Small World)

Better for larger datasets and higher recall requirements:

CREATE INDEX ON documents 
USING hnsw (embedding vector_cosine_ops) 
WITH (m = 16, ef_construction = 64);

Parameters:

  • m: Number of connections per layer (default: 16, higher = better recall but more memory)
  • ef_construction: Size of dynamic candidate list (default: 64, higher = better quality but slower builds)

For production RAG systems, HNSW typically provides better results but uses more memory.
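HNSW also has a query-time knob, hnsw.ef_search, which works much like the ivfflat.probes setting covered later:

SET hnsw.ef_search = 100;  -- Default is 40
-- Higher values = better recall but slower queries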

Best Practice 3: Implement Smart Chunking Strategies

How you chunk your documents dramatically affects retrieval quality. Here’s a production-ready example:

import tiktoken
from typing import List, Dict

def smart_chunk_document(
    text: str,
    chunk_size: int = 512,
    overlap: int = 128,
    model: str = "gpt-3.5-turbo"
) -> List[Dict]:
    """
    Chunk document with overlap for better context preservation.
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    chunks = []
    start = 0

    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = encoding.decode(chunk_tokens)

        chunks.append({
            'text': chunk_text,
            'start_idx': start,
            'end_idx': end,
            'token_count': len(chunk_tokens)
        })

        start += (chunk_size - overlap)

    return chunks

# Example usage
text = "Your long document here..."
chunks = smart_chunk_document(text, chunk_size=512, overlap=128)
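
To respect semantic boundaries (one of the principles listed below), a small variation first splits on paragraphs and only token-chunks each paragraph; a sketch built on the function above:

def chunk_by_paragraphs(
    text: str,
    chunk_size: int = 512,
    overlap: int = 128,
    model: str = "gpt-3.5-turbo"
) -> List[Dict]:
    """Split on blank lines first, then token-chunk each paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for paragraph in paragraphs:
        chunks.extend(smart_chunk_document(paragraph, chunk_size, overlap, model))
    return chunks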

Key chunking principles:

  • Use 512-1024 token chunks for most applications
  • Implement 10-20% overlap to preserve context across boundaries
  • Consider semantic boundaries (paragraphs, sections) when possible
  • Store chunk metadata (position, parent document ID) for retrieval

Best Practice 4: Optimize Query Performance

Here’s a production-optimized query pattern that combines vector search with filtering:

-- Efficient hybrid search query
WITH vector_matches AS (
    SELECT 
        id,
        content,
        metadata,
        embedding <=> $1::vector AS distance
    FROM documents
    WHERE 
        metadata @> $2::jsonb  -- Pre-filter with metadata
    ORDER BY embedding <=> $1::vector
    LIMIT 50  -- Get top candidates
)
SELECT 
    id,
    content,
    metadata,
    distance,
    1 - distance AS similarity_score
FROM vector_matches
WHERE distance < 0.5  -- Distance threshold
ORDER BY distance
LIMIT 10;

Query optimization tips:

  1. Pre-filter when possible: Use metadata filters before vector search
  2. Set appropriate LIMIT: Retrieve 2-5x more results than needed, then filter
  3. Use distance thresholds: Filter out low-quality matches
  4. Monitor probes for IVFFlat:
SET ivfflat.probes = 10;  -- Default is 1
-- Higher probes = better recall but slower queries
-- Recommended: 10-20 for production

Best Practice 5: Implement Hybrid Search

Combining vector search with traditional full-text search often yields better results:

-- Add tsvector column for full-text search
ALTER TABLE documents ADD COLUMN content_tsv tsvector;

-- Create trigger to maintain tsvector
CREATE TRIGGER documents_content_tsv_update 
BEFORE INSERT OR UPDATE ON documents
FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(content_tsv, 'pg_catalog.english', content);

-- Create GIN index for full-text search
CREATE INDEX idx_documents_tsv ON documents USING GIN (content_tsv);

-- Hybrid search query with RRF (Reciprocal Rank Fusion)
WITH vector_search AS (
    SELECT id, content,
           ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) as rank
    FROM documents
    ORDER BY embedding <=> $1::vector
    LIMIT 50
),
text_search AS (
    SELECT id, content,
           ROW_NUMBER() OVER (ORDER BY ts_rank(content_tsv, query) DESC) as rank
    FROM documents, plainto_tsquery('english', $2) query
    WHERE content_tsv @@ query
    LIMIT 50
)
SELECT 
    COALESCE(v.id, t.id) as id,
    COALESCE(v.content, t.content) as content,
    COALESCE(1.0 / (60 + v.rank), 0.0) + 
    COALESCE(1.0 / (60 + t.rank), 0.0) as score
FROM vector_search v
FULL OUTER JOIN text_search t ON v.id = t.id
ORDER BY score DESC
LIMIT 10;
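
On PostgreSQL 12 and later, a generated column is a simpler alternative to the trigger shown above (a sketch; use it instead of the ALTER TABLE plus trigger, not in addition to them):

ALTER TABLE documents
ADD COLUMN content_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;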

This hybrid approach provides:

  • Better handling of exact term matches
  • Improved recall for keyword-based queries
  • Robustness against embedding model weaknesses

Best Practice 6: Implement Proper Connection Pooling

For production RAG applications, connection pooling is essential:

import psycopg2
from psycopg2 import pool
import numpy as np

class VectorDB:
    def __init__(self, db_config):
        self.pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=5,
            maxconn=20,
            **db_config
        )

    def search_similar(self, query_embedding, limit=10, distance_threshold=0.5):
        # Build a '[x, y, ...]' literal so the ::vector cast works without a
        # registered type adapter (see the pgvector adapter sketch below).
        embedding_str = '[' + ', '.join(str(x) for x in query_embedding) + ']'
        conn = self.pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT id, content, metadata,
                           embedding <=> %s::vector AS distance
                    FROM documents
                    WHERE embedding <=> %s::vector < %s
                    ORDER BY embedding <=> %s::vector
                    LIMIT %s
                """, (embedding_str, embedding_str,
                      distance_threshold, embedding_str, limit))

                results = cur.fetchall()
                return [{
                    'id': r[0],
                    'content': r[1],
                    'metadata': r[2],
                    'distance': float(r[3]),
                    'similarity': 1 - float(r[3])
                } for r in results]
        finally:
            self.pool.putconn(conn)

# Usage
db = VectorDB({
    'host': 'localhost',
    'database': 'vectordb',
    'user': 'postgres',
    'password': 'your_password'
})

results = db.search_similar(query_embedding, limit=10)
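
If you would rather bind embeddings as Python lists or NumPy arrays instead of building '[...]' string literals, the pgvector Python package (pip install pgvector) provides a psycopg2 type adapter; a sketch, assuming that package is installed:

from pgvector.psycopg2 import register_vector

conn = db.pool.getconn()
register_vector(conn)  # registers the vector type on this connection
# Queries on this connection can now bind lists/arrays directly to %s parameters
db.pool.putconn(conn)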

Best Practice 7: Monitor and Optimize Index Performance

Regular monitoring helps maintain optimal performance:

-- Check index usage
SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan as index_scans,
    idx_tup_read as tuples_read,
    idx_tup_fetch as tuples_fetched
FROM pg_stat_user_indexes
WHERE tablename = 'documents'
ORDER BY idx_scan DESC;

-- Monitor query performance
EXPLAIN (ANALYZE, BUFFERS) 
SELECT id, content
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;

-- Vacuum and analyze regularly
VACUUM ANALYZE documents;
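
For the query-logging item in the checklist below, pg_stat_statements is a common choice (a sketch; the extension must be listed in shared_preload_libraries, and mean_exec_time is the PostgreSQL 13+ column name):

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest statements that use the cosine-distance operator
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE query LIKE '%<=>%'
ORDER BY mean_exec_time DESC
LIMIT 10;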

Performance monitoring checklist:

  • Track query latency (p50, p95, p99)
  • Monitor index hit ratios
  • Watch for index bloat
  • Run VACUUM ANALYZE after bulk inserts
  • Set up query performance logging

Complete Real-World Example: Building a Document RAG System

Here’s a complete example that ties everything together:

import json
from typing import List, Dict

import psycopg2
from psycopg2.extras import execute_values
from openai import OpenAI  # OpenAI Python SDK >= 1.0 client interface


class ProductionRAGSystem:
    def __init__(self, db_config, openai_api_key):
        self.conn = psycopg2.connect(**db_config)
        self.client = OpenAI(api_key=openai_api_key)
        self.embedding_model = "text-embedding-3-small"
        self.setup_database()

    def setup_database(self):
        """Initialize database schema"""
        with self.conn.cursor() as cur:
            cur.execute("""
                CREATE EXTENSION IF NOT EXISTS vector;

                CREATE TABLE IF NOT EXISTS documents (
                    id BIGSERIAL PRIMARY KEY,
                    content TEXT NOT NULL,
                    embedding vector(1536),
                    metadata JSONB,
                    created_at TIMESTAMP DEFAULT NOW()
                );

                CREATE INDEX IF NOT EXISTS idx_documents_embedding
                ON documents USING hnsw (embedding vector_cosine_ops)
                WITH (m = 16, ef_construction = 64);

                CREATE INDEX IF NOT EXISTS idx_documents_metadata
                ON documents USING GIN (metadata);
            """)
        self.conn.commit()

    def get_embedding(self, text: str) -> List[float]:
        """Generate embedding for text"""
        response = self.client.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return response.data[0].embedding

    def add_documents(self, documents: List[Dict]):
        """Batch insert documents with embeddings"""
        with self.conn.cursor() as cur:
            data = [
                (
                    doc['content'],
                    str(self.get_embedding(doc['content'])),  # '[...]' literal for ::vector
                    json.dumps(doc.get('metadata', {}))
                )
                for doc in documents
            ]

            execute_values(
                cur,
                """
                INSERT INTO documents (content, embedding, metadata)
                VALUES %s
                """,
                data,
                template="(%s, %s::vector, %s::jsonb)"
            )
        self.conn.commit()

    def search(self, query: str, limit: int = 5,
               metadata_filter: Dict = None) -> List[Dict]:
        """Search for relevant documents"""
        query_embedding = str(self.get_embedding(query))  # '[...]' literal for ::vector

        with self.conn.cursor() as cur:
            if metadata_filter:
                cur.execute("""
                    SELECT
                        id,
                        content,
                        metadata,
                        1 - (embedding <=> %s::vector) AS similarity
                    FROM documents
                    WHERE metadata @> %s::jsonb
                    ORDER BY embedding <=> %s::vector
                    LIMIT %s
                """, (query_embedding, json.dumps(metadata_filter),
                      query_embedding, limit))
            else:
                cur.execute("""
                    SELECT
                        id,
                        content,
                        metadata,
                        1 - (embedding <=> %s::vector) AS similarity
                    FROM documents
                    ORDER BY embedding <=> %s::vector
                    LIMIT %s
                """, (query_embedding, query_embedding, limit))

            results = cur.fetchall()
            return [{
                'id': r[0],
                'content': r[1],
                'metadata': r[2],
                'similarity': float(r[3])
            } for r in results]

    def generate_response(self, query: str, context_limit: int = 5) -> str:
        """Generate RAG response"""
        # Retrieve relevant context
        results = self.search(query, limit=context_limit)

        # Build context from results
        context = "\n\n".join([
            f"[{i+1}] {r['content']}"
            for i, r in enumerate(results)
        ])

        # Generate response with GPT
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content":
                 "You are a helpful assistant. Use the provided context to answer questions."},
                {"role": "user", "content":
                 f"Context:\n{context}\n\nQuestion: {query}"}
            ]
        )

        return response.choices[0].message.content


# Example usage
rag = ProductionRAGSystem(
    db_config={
        'host': 'localhost',
        'database': 'vectordb',
        'user': 'postgres',
        'password': 'your_password'
    },
    openai_api_key='your-openai-key'
)

# Add documents
rag.add_documents([
    {
        'content': 'PostgreSQL 16 introduces parallel query improvements...',
        'metadata': {'category': 'performance', 'version': '16'}
    },
    # ... more documents
])

# Search
results = rag.search(
    "How to optimize PostgreSQL queries?",
    metadata_filter={'category': 'performance'}
)

# Generate RAG response
response = rag.generate_response("What's new in PostgreSQL 16?")

Key Recommendations Summary

To build a production-grade RAG system with PostgreSQL and pgvector:

Schema Design:

  • Use appropriate vector dimensions matching your embedding model
  • Implement JSONB metadata for flexible filtering
  • Add proper indexes (vector + metadata + full-text)

Indexing Strategy:

  • Use HNSW for production systems (better recall, more memory)
  • Use IVFFlat for smaller datasets or memory-constrained environments
  • Tune index parameters based on your dataset size

Query Optimization:

  • Implement connection pooling (5-20 connections)
  • Use metadata pre-filtering before vector search
  • Set distance thresholds to filter low-quality results
  • Consider hybrid search for better recall

Data Management:

  • Chunk documents with 10-20% overlap
  • Use 512-1024 token chunks for most applications
  • Store chunk metadata for context reconstruction
  • Run VACUUM ANALYZE regularly

Monitoring:

  • Track query latency percentiles (p50, p95, p99)
  • Monitor index performance and hit ratios
  • Set up query logging for slow queries
  • Watch for index bloat and table growth

Common Pitfalls to Avoid

  1. Not normalizing embeddings: If your model doesn’t output normalized vectors, normalize them before storage when using cosine distance (a minimal sketch follows this list)
  2. Wrong index parameters: Using default parameters can lead to poor performance. Always tune based on your dataset size
  3. Ignoring metadata: Metadata filtering can dramatically improve both relevance and performance
  4. Poor chunking strategy: Too large chunks reduce precision, too small chunks lose context
  5. No connection pooling: Opening a new connection for each query kills performance
  6. Forgetting VACUUM: Vector indexes can bloat quickly without regular maintenance
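
For pitfall 1, normalization is a one-liner with NumPy (a minimal sketch; apply it to each embedding before insert and to the query embedding):

import numpy as np

def l2_normalize(embedding):
    """Return a unit-length copy of the embedding (zero vectors pass through unchanged)."""
    arr = np.asarray(embedding, dtype=np.float32)
    norm = np.linalg.norm(arr)
    return (arr / norm).tolist() if norm > 0 else arr.tolist()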

Conclusion

PostgreSQL with pgvector provides a robust, production-ready foundation for RAG applications. By following these best practices—optimizing your schema, choosing the right index strategy, implementing smart chunking, and maintaining proper monitoring—you can build high-performing RAG systems that scale.

The key to success is iterative optimization: start with these proven patterns, monitor your specific workload, and tune based on your actual usage patterns. Remember that RAG performance depends on the entire pipeline, from document chunking through retrieval to generation.

With PostgreSQL’s reliability, pgvector’s performance, and these best practices, you have everything needed to deploy RAG systems that deliver accurate, relevant results to your users.


Have questions about implementing these patterns? Drop a comment below or reach out—I’d love to hear about your RAG implementation experiences!
