
Building a Production-Ready RAG Pipeline with Supacrawler and pgvector

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances Large Language Models (LLMs) by providing them with up-to-date, external knowledge. Instead of relying solely on the model's training data, a RAG system retrieves relevant documents from a knowledge base to help generate more accurate, context-aware answers.

Building a robust RAG pipeline starts with high-quality, structured content. This guide provides a complete, end-to-end walkthrough of how to:

  1. Crawl an entire website to build a knowledge base using Supacrawler's Crawl API.
  2. Chunk the crawled content into effective segments for retrieval.
  3. Embed the chunks into vectors.
  4. Store and Query those vectors in a PostgreSQL database using the pgvector extension.

The Anatomy of a RAG Pipeline

Before diving into code, it's helpful to understand the flow of data. A user's query initiates a process where the system retrieves relevant information from your database, combines it with the original query, and feeds it to an LLM to generate a final, context-enriched answer.

[Diagram: the data flow for a production-ready RAG pipeline, from user query through Supacrawler and the vector database to the LLM.]
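
For context, the query-time half of this flow can be as small as the sketch below: fetch the closest chunks from your vector store, combine them with the question, and ask the LLM. This is a minimal illustration, not Supacrawler-specific code; the retrieve helper is a placeholder for whichever vector-store query you build in Step 4, and the model name is just an example.

from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
def retrieve(question: str) -> list[str]:
    # Placeholder: swap in a real vector-store query from Step 4,
    # e.g. Vecs col.query(...) or LangChain store.similarity_search(...)
    return []
def answer(question: str) -> str:
    # 1. Retrieve the most relevant chunks for the question
    context = "\n\n".join(retrieve(question))
    # 2. Combine the retrieved context with the original question
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 3. Generate a context-enriched answer with the LLM
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content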

This guide focuses on the critical data-ingestion part of this flow: getting from a live website to a populated vector database. We cover three storage options so you can pick your preferred Python stack:

  • Supabase Vecs
  • LangChain with PGVector
  • LlamaIndex with PGVector

Prerequisites: If you’re using the SDKs (recommended), first see Install the SDKs.

Step 1: Set Up Your Vector Database

First, ensure the pgvector extension is enabled in your PostgreSQL database.

  • Supabase: Navigate to Database → Extensions and enable pgvector. See: pgvector extension.
  • Self‑hosted Postgres: Connect to your database and run:
create extension if not exists vector;

For production environments, it is crucial to create an index (like HNSW or IVFFlat) on your vector column for efficient similarity searches.
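
For example, an HNSW index on the embedding column speeds up cosine-distance searches. The sketch below runs the DDL through SQLAlchemy, but you can execute the same SQL directly; the table and column names ('documents', 'embedding') are placeholders, so match them to the schema your storage library actually creates (Supabase Vecs, for instance, can create its own index via col.create_index(), as shown in Option A).

import os
from sqlalchemy import create_engine, text
# Placeholder table/column names ('documents', 'embedding'); adjust to your schema.
engine = create_engine(os.environ["DATABASE_URL"])
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS documents_embedding_hnsw_idx "
        "ON documents USING hnsw (embedding vector_cosine_ops)"
    ))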

Step 2: Crawl a Website with the Crawl API

The foundation of any good RAG system is a comprehensive, clean knowledge base. We’ll crawl a documentation site to create ours, using URL patterns to keep the scope focused.

Create a Crawl Job

curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.supacrawler.com",
    "format": "markdown",
    "depth": 3,
    "link_limit": 100,
    "include_patterns": ["/docs/*", "/api/*"],
    "render_js": true
  }'

When the job status is completed, the result will contain data.crawl_data, a map of each URL to its clean markdown content and metadata.
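
If you would rather stay in Python than use curl, a rough equivalent with the requests library looks like this. The request body mirrors the curl call above; polling for completion is left as a comment because the exact status endpoint or SDK helper is best taken from the Supacrawler docs, and the environment variable name here is an assumption.

import os
import requests
API_KEY = os.environ['SUPACRAWLER_API_KEY']  # assumed environment variable name
resp = requests.post(
    'https://api.supacrawler.com/api/v1/crawl',
    headers={'Authorization': f'Bearer {API_KEY}'},
    json={
        'url': 'https://docs.supacrawler.com',
        'format': 'markdown',
        'depth': 3,
        'link_limit': 100,
        'include_patterns': ['/docs/*', '/api/*'],
        'render_js': True,
    },
    timeout=60,
)
resp.raise_for_status()
job = resp.json()
# Poll the job (via the API or the Python SDK) until its status is 'completed';
# the completed payload's data.crawl_data map is what Step 4 consumes as `result`.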

Step 3: Choose Your Chunking Strategy

This is one of the most critical steps for RAG performance. Chunking is the process of breaking down large documents into smaller, semantically meaningful pieces. If chunks are too large, they can introduce noise; if they're too small, they may lack sufficient context.

Here are a few common strategies:

  • RecursiveCharacterTextSplitter (Recommended Start): This method, popular in frameworks like LangChain, recursively splits text on a list of separators (e.g., \n\n, then \n, then spaces). It's a robust starting point that tries to keep paragraphs and sentences together.
  • Token-Based Splitting: This approach splits text by token count, using the tokenizer of your embedding model. It's more precise but requires a tokenizer for your specific model.
  • Semantic Chunking: More advanced techniques use NLP libraries or embedding models to split text based on semantic shifts in meaning, creating the most contextually relevant chunks.

For this guide, we'll use RecursiveCharacterTextSplitter as it provides a great balance of simplicity and effectiveness.
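
To see the splitter in action before wiring it into the pipeline, here is a tiny, self-contained example using the same parameters as Step 4; the sample string is made up, and real inputs come from the crawl in Step 2.

from langchain_text_splitters import RecursiveCharacterTextSplitter
# A short markdown sample; real inputs come from the crawl in Step 2.
sample = "# Authentication\n\nUse an API key in the Authorization header.\n\n" + "Each request is validated before processing. " * 60
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=200,  # characters shared between neighbouring chunks
)
chunks = splitter.split_text(sample)
print(f"{len(chunks)} chunks, sizes: {[len(c) for c in chunks]}")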

Step 4: Embed and Store Vectors (Choose One Option)

Now, we'll process the crawled data. For each page, we'll chunk its markdown, create a vector embedding for each chunk, and store it in our database.

Option A: Supabase Vecs (Python)

import os, vecs
from openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Assumes 'result' is the completed crawl job from Step 2
crawl_data = result.data.get('crawl_data', {})
# 1. Initialize Clients
DB_URL = os.environ['DATABASE_URL']
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
vx = vecs.create_client(DB_URL)
col = vx.get_or_create_collection(name='documents', dimension=1536)  # `text-embedding-3-small` uses 1536 dimensions
openai_client = OpenAI(api_key=OPENAI_API_KEY)
# 2. Chunk Documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_chunks = []
for url, page in crawl_data.items():
    content = page.get('markdown', '')
    if not content:
        continue
    chunks = splitter.split_text(content)
    for i, chunk_text in enumerate(chunks):
        all_chunks.append({
            "id": f"{url}#{i}",
            "text": chunk_text,
            "metadata": {
                'url': url,
                'title': page.get('metadata', {}).get('title', ''),
            }
        })
# 3. Embed and Upsert in Batches
records_to_upsert = []
for chunk in all_chunks:
    emb = openai_client.embeddings.create(model='text-embedding-3-small', input=chunk["text"])
    vector = emb.data[0].embedding
    records_to_upsert.append((
        chunk["id"],
        vector,
        chunk["metadata"]
    ))
if records_to_upsert:
    col.upsert(records=records_to_upsert)
    print(f'Upserted {len(records_to_upsert)} chunks')
# 4. Query for Similar Chunks
col.create_index()
query_text = "What are the API endpoints?"
q_emb = openai_client.embeddings.create(model='text-embedding-3-small', input=query_text)
q_vector = q_emb.data[0].embedding
matches = col.query(data=q_vector, limit=3, include_metadata=True)
for match in matches:
    print(match)
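
The embedding loop above makes one API call per chunk for clarity. Since the OpenAI embeddings endpoint accepts a list of inputs and returns vectors in the same order, you can cut request overhead by batching, roughly like this (batch size is illustrative):

BATCH_SIZE = 100  # illustrative batch size
records_to_upsert = []
for start in range(0, len(all_chunks), BATCH_SIZE):
    batch = all_chunks[start:start + BATCH_SIZE]
    emb = openai_client.embeddings.create(
        model='text-embedding-3-small',
        input=[c['text'] for c in batch],
    )
    # Response vectors come back in the same order as the inputs
    for chunk, item in zip(batch, emb.data):
        records_to_upsert.append((chunk['id'], item.embedding, chunk['metadata']))
if records_to_upsert:
    col.upsert(records=records_to_upsert)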

Option B: LangChain PGVector (Python)

import os
from sqlalchemy import create_engine
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings
# Assumes 'result' is the completed crawl job from Step 2
crawl_data = result.data.get('crawl_data', {})
# 1. Prepare LangChain Documents
docs = []
for url, page in crawl_data.items():
    content = page.get('markdown', '')
    if content:
        docs.append(Document(
            page_content=content,
            metadata={
                'url': url,
                'title': page.get('metadata', {}).get('title', ''),
            }
        ))
# 2. Chunk Documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# 3. Embed and Store
DATABASE_URL = os.environ['DATABASE_URL']
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
embeddings = OpenAIEmbeddings(model='text-embedding-3-small', openai_api_key=OPENAI_API_KEY)
engine = create_engine(DATABASE_URL)
store = PGVector(connection=engine, collection_name='lc_documents', embeddings=embeddings, use_jsonb=True)
store.add_documents(chunks)
print(f'Added {len(chunks)} chunks to the store.')
# 4. Query
results = store.similarity_search('What are the possible auth methods?', k=3)
for doc in results:
    print(f"- {doc.metadata.get('title')}: {doc.metadata.get('url')}")

Option C: LlamaIndex + PGVector (Python)

import os
from sqlalchemy.engine import make_url
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import Document, VectorStoreIndex, StorageContext
# Assumes 'result' is the completed crawl job from Step 2
crawl_data = result.data.get('crawl_data', {})
# 1. Prepare LlamaIndex Documents
docs = []
for url, page in crawl_data.items():
    content = page.get('markdown', '')
    if content:
        docs.append(Document(
            text=content,
            metadata={
                'url': url,
                'title': page.get('metadata', {}).get('title', ''),
            }
        ))
# 2. Initialize Storage and Embedding Model
DB_URL = os.environ['DATABASE_URL']
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
embed_model = OpenAIEmbedding(model='text-embedding-3-small', api_key=OPENAI_API_KEY)
url_parts = make_url(DB_URL)  # PGVectorStore.from_params expects individual connection fields
store = PGVectorStore.from_params(
    host=url_parts.host,
    port=url_parts.port,
    user=url_parts.username,
    password=url_parts.password,
    database=url_parts.database,
    table_name='li_documents',
    embed_dim=1536,  # matches text-embedding-3-small
)
ctx = StorageContext.from_defaults(vector_store=store)
# 3. Build the Index (this chunks, embeds, and stores)
index = VectorStoreIndex.from_documents(docs, storage_context=ctx, embed_model=embed_model)
print("Index built and stored successfully.")
# 4. Query
query_engine = index.as_query_engine()
response = query_engine.query('What is this page about?')
print(response)
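
To inspect which chunks the index retrieves before an answer is synthesized, you can use a retriever instead of the query engine:

# Fetch the top-matching nodes directly, without LLM synthesis
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve('What is this page about?')
for node_with_score in nodes:
    print(node_with_score.score, node_with_score.node.metadata.get('url'))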

Step 5: Evaluate and Iterate

Building a RAG system is an iterative process. Once your pipeline is running, evaluate its performance by asking a set of test questions. If the answers are not accurate, consider:

  • Refining the Crawl Scope: Are your include_patterns too broad or too narrow?
  • Adjusting Chunking Strategy: Experiment with different chunk sizes and overlaps.
  • Improving Retrieval: You may need to attach more specific metadata to your chunks and filter on it at query time, as sketched below, to help the retriever surface the best possible context.
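
For example, with the Vecs collection and query embedding from Option A, you can combine vector similarity with a metadata filter so that only chunks from a specific page are considered; the URL in the filter below is hypothetical.

# Vector search restricted by a metadata filter (vecs filter operators such as $eq)
matches = col.query(
    data=q_vector,
    limit=3,
    filters={'url': {'$eq': 'https://docs.supacrawler.com/docs/getting-started'}},  # hypothetical URL
    include_metadata=True,
)
for match in matches:
    print(match)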

Conclusion: A Powerful Foundation for AI

By combining Supacrawler's powerful crawling capabilities with the efficiency of pgvector, you can build a robust and scalable data ingestion pipeline for any RAG application. This process—Crawl, Chunk, Embed, Store—provides the foundation for creating intelligent AI agents that can reason about and answer questions on any web-based knowledge base.


By Supacrawler Team
Published on August 28, 2025