Integrations: Build a simple RAG System with Supacrawler and Supabase pgvector using OpenAI embeddings
Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications by combining the power of large language models with real-time access to external knowledge bases. This comprehensive guide shows you how to build a production-ready RAG system using Supacrawler for web data extraction, Supabase pgvector for vector storage, and OpenAI embeddings for high-quality semantic search.
By the end of this tutorial, you'll have a complete RAG pipeline that can crawl any website, convert content into searchable vectors, and provide intelligent question-answering capabilities.
If you'd like to try it yourself, you can follow along in the Supabase Vectors notebook.
Table of Contents
- Understanding the RAG Architecture
- Setting Up Your Environment
- Configuring Supabase pgvector
- Web Crawling with Supacrawler
- Content Processing and Chunking
- Generating OpenAI Embeddings
- Vector Storage in Supabase
- Building the Query Engine
- Complete RAG Implementation
- Performance Optimization
- Production Deployment
Understanding the RAG Architecture
Our RAG system follows a four-stage pipeline that transforms web content into intelligent, searchable knowledge:
| Stage | Component | Purpose | Technology |
| --- | --- | --- | --- |
| Data Extraction | Supacrawler | Crawl and extract clean content from websites | Supacrawler Crawl API |
| Content Processing | Text Chunking | Split content into meaningful, searchable segments | Python text splitters |
| Embedding Generation | OpenAI API | Convert text chunks into high-dimensional vectors | OpenAI text-embedding-3-small |
| Vector Storage | Supabase pgvector | Store and search vectors with metadata | PostgreSQL + pgvector extension |
This architecture provides several key advantages:
- Scalable Data Ingestion: Supacrawler handles JavaScript rendering, rate limiting, and large-scale crawling
- High-Quality Embeddings: OpenAI's embeddings provide superior semantic understanding
- Production-Ready Storage: Supabase offers managed PostgreSQL with built-in vector operations
- Real-Time Updates: Easy to refresh content and maintain current knowledge
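Before wiring up real services, it helps to see the core retrieval idea in isolation: text is mapped to vectors, and "relevance" is just vector similarity. Here is a toy sketch with hand-made vectors (not real embeddings); the rest of this guide replaces them with OpenAI embeddings and pgvector.

```python
# Toy illustration of vector retrieval: the most relevant document is the one
# whose vector is most similar (by cosine similarity) to the query vector.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.3]  # pretend embedding of the user's question
doc_vecs = {
    "pgvector setup guide": [0.8, 0.2, 0.4],
    "billing FAQ": [0.1, 0.9, 0.2],
}

best_match = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
print(best_match)  # -> pgvector setup guide
```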
Setting Up Your Environment
First, install the required dependencies and set up your development environment:
```bash
# Install core dependencies
pip install supacrawler vecs openai python-dotenv

# Install text processing libraries
pip install beautifulsoup4 markdownify
```
Create a `.env` file with your API credentials:
```bash
# .env
SUPACRAWLER_API_KEY=your_supacrawler_api_key
OPENAI_API_KEY=your_openai_api_key
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key
DATABASE_URL=postgresql://postgres:[password]@db.[project].supabase.co:5432/postgres
```
Configuring Supabase pgvector
Before building our RAG system, we need to enable the pgvector extension in Supabase:
- Enable the pgvector Extension (a SQL alternative is shown at the end of this section):
  - Go to your Supabase dashboard
  - Navigate to Database → Extensions
  - Search for and enable pgvector
- Verify the Installation:
```python
import os
import vecs
from dotenv import load_dotenv

load_dotenv()

# Connect to Supabase
DB_URL = os.getenv('DATABASE_URL')
vx = vecs.create_client(DB_URL)

# Test connection
print("✅ Connected to Supabase successfully!")
```
- Create Vector Collection:
```python
# Create collection for our RAG system
# OpenAI text-embedding-3-small uses 1536 dimensions
collection = vx.get_or_create_collection(
    name="knowledge_base",
    dimension=1536
)

print(f"📦 Collection created: {collection.name}")
```
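If you prefer SQL to the dashboard toggle, the extension can also be enabled directly from the Supabase SQL editor:

```sql
-- Equivalent to enabling pgvector from the dashboard (the extension is named "vector")
create extension if not exists vector;
```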
Web Crawling with Supacrawler
Supacrawler excels at extracting clean, structured content from websites. Let's crawl a documentation site to build our knowledge base:
```python
import os
from supacrawler import SupacrawlerClient
from dotenv import load_dotenv

load_dotenv()

def crawl_documentation(base_url: str, patterns: list = None) -> dict:
    """Crawl a documentation site and return clean content

    Args:
        base_url: The starting URL to crawl
        patterns: URL patterns to include (e.g., ['/docs/*', '/api/*'])

    Returns:
        Dictionary mapping URLs to their content and metadata
    """
    client = SupacrawlerClient(api_key=os.getenv('SUPACRAWLER_API_KEY'))

    # Configure crawl parameters for documentation
    crawl_config = {
        'url': base_url,
        'format': 'markdown',      # Get clean markdown content
        'depth': 3,                # Crawl up to 3 levels deep
        'link_limit': 200,         # Limit total pages crawled
        'render_js': True,         # Handle JavaScript-rendered content
        'include_patterns': patterns or ['/docs/*', '/api/*', '/guides/*'],
        'exclude_patterns': ['/blog/*', '/changelog/*'],  # Skip non-documentation
        'timeout': 30000,          # 30 second timeout per page
    }

    print(f"🚀 Starting crawl of {base_url}")

    # Create and wait for crawl job
    job = client.create_crawl_job(**crawl_config)
    result = client.wait_for_crawl(job.job_id)

    if result.status == 'completed':
        crawl_data = result.data.get('crawl_data', {})
        print(f"✅ Crawl completed! Found {len(crawl_data)} pages")
        return crawl_data
    else:
        raise Exception(f"Crawl failed with status: {result.status}")

# Example: Crawl Supabase documentation
crawled_content = crawl_documentation(
    'https://supabase.com/docs',
    patterns=['/docs/*']
)

# Display crawl results
for url, page_data in list(crawled_content.items())[:3]:
    content = page_data.get('markdown', '')
    title = page_data.get('metadata', {}).get('title', 'No title')
    print(f"\n📄 {title}")
    print(f"🔗 {url}")
    print(f"📝 Content length: {len(content)} characters")
    print(f"📋 Preview: {content[:200]}...")
```
Content Processing and Chunking
Effective chunking is crucial for RAG performance. We need to split large documents into meaningful, searchable segments:
```python
import re
from typing import List, Dict

class DocumentChunker:
    """Intelligent document chunker optimized for RAG systems"""

    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def clean_content(self, content: str) -> str:
        """Clean and normalize content"""
        # Remove excessive whitespace
        content = re.sub(r'\n\s*\n\s*\n', '\n\n', content)
        # Remove markdown artifacts that don't add value
        content = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', content)  # Convert links to text
        content = re.sub(r'```[\s\S]*?```', '', content)            # Remove code blocks
        content = re.sub(r'`([^`]+)`', r'\1', content)              # Remove inline code formatting
        return content.strip()

    def split_by_headers(self, content: str) -> List[str]:
        """Split content by markdown headers for semantic boundaries"""
        # Split on headers (H1-H6)
        sections = re.split(r'\n(?=#{1,6}\s)', content)
        return [section.strip() for section in sections if section.strip()]

    def chunk_text(self, text: str, max_size: int = None) -> List[str]:
        """Split text into chunks with overlap"""
        max_size = max_size or self.chunk_size
        if len(text) <= max_size:
            return [text]

        chunks = []
        start = 0
        while start < len(text):
            # Find end position
            end = start + max_size
            if end >= len(text):
                chunks.append(text[start:])
                break

            # Try to end at a sentence boundary
            sentence_end = text.rfind('.', start, end)
            if sentence_end > start + max_size // 2:
                end = sentence_end + 1
            else:
                # Try to end at a word boundary
                word_end = text.rfind(' ', start, end)
                if word_end > start + max_size // 2:
                    end = word_end

            chunks.append(text[start:end])
            start = end - self.chunk_overlap

        return chunks

    def process_crawled_content(self, crawled_data: Dict) -> List[Dict]:
        """Process crawled content into chunks with metadata

        Returns:
            List of chunk dictionaries with text, metadata, and unique IDs
        """
        all_chunks = []

        for url, page_data in crawled_data.items():
            content = page_data.get('markdown', '')
            metadata = page_data.get('metadata', {})

            if not content:
                continue

            # Clean content
            cleaned_content = self.clean_content(content)

            # Split by headers first for semantic boundaries
            sections = self.split_by_headers(cleaned_content)

            chunk_index = 0
            for section in sections:
                # Further chunk large sections
                section_chunks = self.chunk_text(section)

                for chunk_text in section_chunks:
                    if len(chunk_text.strip()) < 100:  # Skip very small chunks
                        continue

                    chunk = {
                        'id': f"{url}#{chunk_index}",
                        'text': chunk_text.strip(),
                        'metadata': {
                            'url': url,
                            'title': metadata.get('title', ''),
                            'description': metadata.get('description', ''),
                            'chunk_index': chunk_index,
                            'source': 'supacrawler'
                        }
                    }
                    all_chunks.append(chunk)
                    chunk_index += 1

        print(f"📊 Created {len(all_chunks)} chunks from {len(crawled_data)} pages")
        return all_chunks

# Process our crawled content
chunker = DocumentChunker(chunk_size=800, chunk_overlap=150)
chunks = chunker.process_crawled_content(crawled_content)

# Display chunking results
print(f"\n📈 Chunking Statistics:")
print(f"Total chunks: {len(chunks)}")
print(f"Average chunk size: {sum(len(c['text']) for c in chunks) // len(chunks)} characters")

# Show sample chunks
print("\n📋 Sample chunks:")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1}:")
    print(f"ID: {chunk['id']}")
    print(f"Title: {chunk['metadata']['title']}")
    print(f"Text: {chunk['text'][:200]}...")
```
Generating OpenAI Embeddings
OpenAI's text-embedding-3-small model produces 1536-dimensional vectors with strong semantic search quality at low cost. Let's generate embeddings for our chunks:
```python
import openai
import time
from typing import List, Dict

class OpenAIEmbedder:
    """Generate embeddings using OpenAI's text-embedding-3-small model"""

    def __init__(self, api_key: str, model: str = "text-embedding-3-small"):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model
        self.dimension = 1536  # text-embedding-3-small dimension

    def embed_text(self, text: str) -> List[float]:
        """Generate embedding for a single text"""
        try:
            response = self.client.embeddings.create(
                model=self.model,
                input=text
            )
            return response.data[0].embedding
        except Exception as e:
            print(f"❌ Error generating embedding: {e}")
            return None

    def embed_batch(self, texts: List[str], batch_size: int = 50) -> List[List[float]]:
        """Generate embeddings for multiple texts with batching and rate limiting"""
        embeddings = []

        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]

            try:
                print(f"🔄 Processing batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")
                response = self.client.embeddings.create(
                    model=self.model,
                    input=batch
                )
                batch_embeddings = [item.embedding for item in response.data]
                embeddings.extend(batch_embeddings)

                # Rate limiting: small delay between batches
                time.sleep(0.1)
            except Exception as e:
                print(f"❌ Error in batch {i//batch_size + 1}: {e}")
                # Add None for failed embeddings
                embeddings.extend([None] * len(batch))

        return embeddings

    def embed_chunks(self, chunks: List[Dict]) -> List[Dict]:
        """Add embeddings to chunk data"""
        print(f"🧠 Generating embeddings for {len(chunks)} chunks...")

        # Extract texts for embedding
        texts = [chunk['text'] for chunk in chunks]

        # Generate embeddings
        embeddings = self.embed_batch(texts)

        # Add embeddings to chunks
        embedded_chunks = []
        for chunk, embedding in zip(chunks, embeddings):
            if embedding is not None:
                chunk['embedding'] = embedding
                embedded_chunks.append(chunk)
            else:
                print(f"⚠️ Skipping chunk {chunk['id']} due to embedding failure")

        print(f"✅ Successfully embedded {len(embedded_chunks)} chunks")
        return embedded_chunks

# Generate embeddings
embedder = OpenAIEmbedder(api_key=os.getenv('OPENAI_API_KEY'))
embedded_chunks = embedder.embed_chunks(chunks)

print(f"\n📊 Embedding Statistics:")
print(f"Successful embeddings: {len(embedded_chunks)}")
print(f"Embedding dimension: {len(embedded_chunks[0]['embedding']) if embedded_chunks else 'N/A'}")
```
Vector Storage in Supabase
Now let's store our embedded chunks in Supabase pgvector for efficient similarity search:
```python
import vecs
from typing import List, Dict

class SupabaseVectorStore:
    """Manage vector storage and retrieval in Supabase pgvector"""

    def __init__(self, db_url: str, collection_name: str = "knowledge_base"):
        self.client = vecs.create_client(db_url)
        self.collection_name = collection_name
        self.collection = None

    def create_collection(self, dimension: int = 1536):
        """Create or get vector collection"""
        self.collection = self.client.get_or_create_collection(
            name=self.collection_name,
            dimension=dimension
        )
        print(f"📦 Collection '{self.collection_name}' ready")

    def upsert_chunks(self, embedded_chunks: List[Dict], batch_size: int = 100):
        """Store embedded chunks in the vector database"""
        if not self.collection:
            raise ValueError("Collection not created. Call create_collection() first.")

        print(f"💾 Storing {len(embedded_chunks)} chunks in Supabase...")

        # Prepare records for upsert
        records = []
        for chunk in embedded_chunks:
            record = (
                chunk['id'],         # Unique ID
                chunk['embedding'],  # Vector embedding
                {                    # Metadata
                    'text': chunk['text'],
                    'url': chunk['metadata']['url'],
                    'title': chunk['metadata']['title'],
                    'description': chunk['metadata']['description'],
                    'chunk_index': chunk['metadata']['chunk_index'],
                    'source': chunk['metadata']['source']
                }
            )
            records.append(record)

        # Upsert in batches
        for i in range(0, len(records), batch_size):
            batch = records[i:i + batch_size]
            try:
                self.collection.upsert(records=batch)
                print(f"✅ Stored batch {i//batch_size + 1}/{(len(records)-1)//batch_size + 1}")
            except Exception as e:
                print(f"❌ Error storing batch {i//batch_size + 1}: {e}")

        print(f"🎉 Successfully stored {len(embedded_chunks)} chunks!")

    def create_index(self):
        """Create HNSW index for fast similarity search"""
        if not self.collection:
            raise ValueError("Collection not created.")

        print("🔍 Creating HNSW index for fast search...")
        self.collection.create_index()
        print("✅ Index created successfully!")

    def similarity_search(self, query_embedding: List[float], limit: int = 5) -> List[Dict]:
        """Search for similar chunks using vector similarity"""
        if not self.collection:
            raise ValueError("Collection not created.")

        # include_value=True so each result is a (id, distance, metadata) tuple
        results = self.collection.query(
            data=query_embedding,
            limit=limit,
            include_value=True,
            include_metadata=True
        )

        # Format results
        formatted_results = []
        for result in results:
            formatted_results.append({
                'id': result[0],
                'similarity': result[1],  # cosine distance from vecs (lower is more similar)
                'metadata': result[2]
            })

        return formatted_results

# Store vectors in Supabase
vector_store = SupabaseVectorStore(
    db_url=os.getenv('DATABASE_URL'),
    collection_name="supacrawler_rag_demo"
)

# Create collection and store chunks
vector_store.create_collection(dimension=1536)
vector_store.upsert_chunks(embedded_chunks)
vector_store.create_index()

print("\n🎯 Vector storage complete! Ready for queries.")
```
Building the Query Engine
Now let's create a query engine that can answer questions using our knowledge base:
```python
class RAGQueryEngine:
    """Complete RAG query engine with embedding search and response generation"""

    def __init__(self, vector_store: SupabaseVectorStore, embedder: OpenAIEmbedder):
        self.vector_store = vector_store
        self.embedder = embedder
        self.openai_client = embedder.client

    def search_knowledge_base(self, query: str, top_k: int = 5) -> List[Dict]:
        """Search knowledge base for relevant chunks"""
        print(f"🔍 Searching for: '{query}'")

        # Generate query embedding
        query_embedding = self.embedder.embed_text(query)
        if not query_embedding:
            raise ValueError("Failed to generate query embedding")

        # Search vector database
        results = self.vector_store.similarity_search(query_embedding, limit=top_k)

        print(f"📋 Found {len(results)} relevant chunks")
        return results

    def generate_response(self, query: str, context_chunks: List[Dict], model: str = "gpt-3.5-turbo") -> str:
        """Generate response using retrieved context"""
        # Build context from retrieved chunks
        context_texts = []
        for chunk in context_chunks:
            metadata = chunk['metadata']
            text = metadata['text']
            url = metadata['url']
            title = metadata['title']
            context_texts.append(f"Source: {title} ({url})\n{text}")

        context = "\n\n---\n\n".join(context_texts)

        # Create prompt
        prompt = f"""You are a helpful assistant that answers questions based on the provided context.
Use only the information from the context to answer the question. If the context doesn't contain
enough information to answer the question, say so clearly.

Context:
{context}

Question: {query}

Answer:"""

        try:
            response = self.openai_client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=500,
                temperature=0.1
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Error generating response: {e}"

    def ask(self, question: str, top_k: int = 5) -> Dict:
        """Complete RAG pipeline: search + generate"""
        print(f"\n💬 Question: {question}")

        # Search knowledge base
        relevant_chunks = self.search_knowledge_base(question, top_k=top_k)

        if not relevant_chunks:
            return {
                'question': question,
                'answer': "I couldn't find relevant information to answer your question.",
                'sources': []
            }

        # Generate response
        answer = self.generate_response(question, relevant_chunks)

        # Extract source URLs
        sources = list(set([
            chunk['metadata']['url']
            for chunk in relevant_chunks
        ]))

        result = {
            'question': question,
            'answer': answer,
            'sources': sources,
            'relevant_chunks': len(relevant_chunks)
        }

        print(f"✅ Answer generated using {len(relevant_chunks)} chunks from {len(sources)} sources")
        return result

# Create query engine
query_engine = RAGQueryEngine(vector_store, embedder)

# Test queries
test_questions = [
    "How do I set up pgvector in Supabase?",
    "What are the main features of Supabase?",
    "How do I query vectors in Supabase?",
]

print("\n🧪 Testing RAG System:")
print("=" * 50)

for question in test_questions:
    result = query_engine.ask(question)
    print(f"\n❓ {result['question']}")
    print(f"💡 {result['answer']}")
    print(f"📚 Sources: {', '.join(result['sources'])}")
    print("-" * 50)
```
Complete RAG Implementation
Here's the complete implementation that ties everything together:
```python
import os
from dotenv import load_dotenv
from supacrawler import SupacrawlerClient
import vecs
import openai
from typing import List, Dict, Any

class SupacrawlerRAGSystem:
    """Complete RAG system using Supacrawler, OpenAI, and Supabase"""

    def __init__(self):
        load_dotenv()

        # Initialize clients
        self.supacrawler_client = SupacrawlerClient(api_key=os.getenv('SUPACRAWLER_API_KEY'))
        self.openai_client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
        self.vector_client = vecs.create_client(os.getenv('DATABASE_URL'))

        # Initialize components
        self.chunker = DocumentChunker()
        self.collection = None

        print("🚀 RAG System initialized!")

    def build_knowledge_base(self, url: str, collection_name: str = "rag_knowledge"):
        """Complete pipeline: crawl → chunk → embed → store"""
        print(f"🏗️ Building knowledge base from {url}")

        # Step 1: Crawl website
        crawled_data = self._crawl_website(url)

        # Step 2: Process and chunk content
        chunks = self.chunker.process_crawled_content(crawled_data)

        # Step 3: Generate embeddings
        embedded_chunks = self._embed_chunks(chunks)

        # Step 4: Store in vector database
        self._store_vectors(embedded_chunks, collection_name)

        print("✅ Knowledge base built successfully!")
        return len(embedded_chunks)

    def _crawl_website(self, url: str) -> Dict:
        """Crawl website and return content"""
        job = self.supacrawler_client.create_crawl_job(
            url=url,
            format='markdown',
            depth=3,
            link_limit=100,
            render_js=True,
            include_patterns=['/docs/*', '/api/*', '/guides/*']
        )
        result = self.supacrawler_client.wait_for_crawl(job.job_id)

        if result.status == 'completed':
            return result.data.get('crawl_data', {})
        else:
            raise Exception(f"Crawl failed: {result.status}")

    def _embed_chunks(self, chunks: List[Dict]) -> List[Dict]:
        """Generate OpenAI embeddings for chunks"""
        embedded_chunks = []

        for chunk in chunks:
            try:
                response = self.openai_client.embeddings.create(
                    model="text-embedding-3-small",
                    input=chunk['text']
                )
                chunk['embedding'] = response.data[0].embedding
                embedded_chunks.append(chunk)
            except Exception as e:
                print(f"⚠️ Failed to embed chunk {chunk['id']}: {e}")

        return embedded_chunks

    def _store_vectors(self, embedded_chunks: List[Dict], collection_name: str):
        """Store vectors in Supabase"""
        self.collection = self.vector_client.get_or_create_collection(
            name=collection_name,
            dimension=1536
        )

        # Prepare records (keep the chunk text in metadata so it can be used as context at query time)
        records = [
            (chunk['id'], chunk['embedding'], {**chunk['metadata'], 'text': chunk['text']})
            for chunk in embedded_chunks
        ]

        # Upsert and create index
        self.collection.upsert(records=records)
        self.collection.create_index()

    def ask(self, question: str, top_k: int = 5) -> Dict:
        """Ask a question and get an answer"""
        if not self.collection:
            raise ValueError("No knowledge base loaded. Call build_knowledge_base() first.")

        # Generate query embedding
        query_response = self.openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=question
        )
        query_embedding = query_response.data[0].embedding

        # Search vectors (include_value=True so each result is (id, distance, metadata))
        results = self.collection.query(
            data=query_embedding,
            limit=top_k,
            include_value=True,
            include_metadata=True
        )

        # Build context
        context_parts = []
        sources = []
        for result in results:
            metadata = result[2]  # Metadata is the third element
            context_parts.append(f"Source: {metadata['title']}\n{metadata['text']}")
            sources.append(metadata['url'])

        context = "\n\n---\n\n".join(context_parts)

        # Generate answer
        response = self.openai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Answer questions based only on the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ],
            max_tokens=500,
            temperature=0.1
        )

        return {
            'question': question,
            'answer': response.choices[0].message.content,
            'sources': list(set(sources))
        }

# Usage example
if __name__ == "__main__":
    # Initialize RAG system
    rag = SupacrawlerRAGSystem()

    # Build knowledge base
    num_chunks = rag.build_knowledge_base("https://supabase.com/docs")
    print(f"📊 Knowledge base contains {num_chunks} chunks")

    # Ask questions
    questions = [
        "How do I enable pgvector in Supabase?",
        "What authentication methods does Supabase support?",
        "How do I create a vector search in Supabase?"
    ]

    for question in questions:
        result = rag.ask(question)
        print(f"\n❓ {result['question']}")
        print(f"💡 {result['answer']}")
        print(f"📚 Sources: {result['sources']}")
```
Performance Optimization
To optimize your RAG system for production use:
Chunking Optimization
```python
# Experiment with different chunk sizes
chunk_configs = [
    {'size': 500, 'overlap': 100},
    {'size': 1000, 'overlap': 200},
    {'size': 1500, 'overlap': 300}
]

for config in chunk_configs:
    chunker = DocumentChunker(
        chunk_size=config['size'],
        chunk_overlap=config['overlap']
    )
    # Test and measure retrieval performance for each configuration
```
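One lightweight way to compare these configurations is a retrieval hit-rate check: for a handful of questions whose source page you already know, verify that the expected URL appears in the top-k results returned by the query engine built earlier. The question/URL pairs below are placeholders; substitute ones from your own crawl.

```python
# Rough retrieval check (illustrative): does the expected source URL appear in the top-k results?
eval_set = [
    # (question, expected source URL) — replace with pairs from your own knowledge base
    ("How do I enable pgvector?", "https://supabase.com/docs/guides/database/extensions/pgvector"),
]

def hit_rate(engine, eval_set, top_k: int = 5) -> float:
    hits = 0
    for question, expected_url in eval_set:
        results = engine.search_knowledge_base(question, top_k=top_k)
        if any(r['metadata']['url'] == expected_url for r in results):
            hits += 1
    return hits / len(eval_set)

print(f"Hit rate @5: {hit_rate(query_engine, eval_set):.2f}")
```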
Embedding Batch Processing
```python
import os
import openai
from typing import List

# OpenAI client (assumed from earlier sections; repeated here so the snippet is self-contained)
openai_client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

# Process embeddings in batches for better throughput
def batch_embed(texts: List[str], batch_size: int = 100):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        embeddings.extend([item.embedding for item in response.data])
    return embeddings
```
Database Indexing
```sql
-- Optimize vector search performance
CREATE INDEX CONCURRENTLY ON your_collection
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
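Query-time recall can also be tuned: pgvector's hnsw.ef_search setting (default 40) controls how many candidates the HNSW index examines per search, trading latency for recall.

```sql
-- Larger ef_search improves recall at the cost of query latency (default: 40)
SET hnsw.ef_search = 100;
```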
Production Deployment
For production deployment, consider these enhancements:
Environment Configuration
```python
# config.py
import os
from pydantic import BaseSettings  # with Pydantic v2, import BaseSettings from pydantic_settings instead

class Settings(BaseSettings):
    supacrawler_api_key: str
    openai_api_key: str
    database_url: str
    supabase_url: str
    supabase_key: str

    # Performance settings
    embedding_batch_size: int = 50
    vector_search_limit: int = 10
    chunk_size: int = 1000
    chunk_overlap: int = 200

    class Config:
        env_file = ".env"

settings = Settings()
```
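These settings can then be wired into the components built earlier in this guide, for example:

```python
# app.py — configure the classes defined earlier from the central Settings object
from config import settings

chunker = DocumentChunker(
    chunk_size=settings.chunk_size,
    chunk_overlap=settings.chunk_overlap,
)
embedder = OpenAIEmbedder(api_key=settings.openai_api_key)
vector_store = SupabaseVectorStore(db_url=settings.database_url)
vector_store.create_collection(dimension=1536)
```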
Error Handling and Monitoring
```python
import logging
from functools import wraps

def with_retry(max_retries: int = 3):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        logging.error(f"Function {func.__name__} failed after {max_retries} attempts: {e}")
                        raise
                    logging.warning(f"Attempt {attempt + 1} failed: {e}, retrying...")
        return wrapper
    return decorator

@with_retry(max_retries=3)
def embed_with_retry(text: str):
    return openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
```
Scale Beyond Local Development with Supacrawler
While this tutorial demonstrates building a RAG system locally, production deployments require handling scale, reliability, and performance optimization:
- Large-Scale Crawling: Managing thousands of pages with rate limiting and error handling
- Content Updates: Keeping knowledge bases current with incremental updates
- Vector Management: Optimizing storage and search performance across millions of vectors
- Cost Optimization: Balancing embedding quality with API costs
Supacrawler's Crawl API handles these production challenges automatically:
```typescript
import { SupacrawlerClient } from '@supacrawler/js'

const client = new SupacrawlerClient({ apiKey: process.env.SUPACRAWLER_API_KEY })

// Production-scale RAG data pipeline
async function buildProductionRAG() {
  const job = await client.createCrawlJob({
    url: 'https://docs.company.com',
    format: 'markdown',
    depth: 5,              // Deep crawling
    link_limit: 10000,     // Large scale
    render_js: true,       // Full JavaScript support

    // Production optimizations
    include_patterns: ['/docs/*', '/api/*', '/guides/*'],
    exclude_patterns: ['/blog/*', '/changelog/*'],
    timeout: 30000,
    concurrent_limit: 10,  // Parallel processing

    // Content quality
    remove_selectors: ['.sidebar', '.nav', '.footer'],
    wait_for: '.main-content',
    block_ads: true
  })

  const result = await client.waitForCrawl(job.job_id)
  return result.data.crawl_data
}
```
Key Production Advantages:
- ✅ Automatic Scale Management: Handle 10,000+ pages without infrastructure complexity
- ✅ Content Quality: Clean, structured content optimized for RAG systems
- ✅ JavaScript Rendering: Full SPA and dynamic content support
- ✅ Rate Limiting: Built-in respect for robots.txt and site limits
- ✅ Error Recovery: Automatic retries and failure handling
- ✅ Incremental Updates: Efficient re-crawling for content freshness
Getting Started:
- 📖 Crawl API Documentation for RAG-optimized crawling
- 🔧 GitHub Repository for self-hosting
- 🆓 Start with 1,000 free crawl operations
Conclusion
You've built a complete RAG system that combines the best of modern AI technologies:
- Supacrawler for high-quality web data extraction
- OpenAI embeddings for superior semantic understanding
- Supabase pgvector for scalable vector storage and search
This foundation provides everything needed for intelligent question-answering systems, chatbots, and knowledge management applications. The modular design makes it easy to swap components, optimize performance, and scale to production requirements.
Whether you're building internal documentation search, customer support automation, or research assistants, this RAG architecture provides a robust, production-ready foundation for AI-powered applications.