Simple YouTube RAG
Built a RAG application that lets you ask questions about YouTube video content by ingesting subtitles, embedding them with OpenAI, and grounding LLM answers in transcript sources.
Overview
A Retrieval-Augmented Generation app that fetches YouTube subtitles, chunks and embeds them via OpenAI, stores vectors in a local ChromaDB instance, and answers natural-language questions with source citations — all through a Streamlit UI.
Problem
I wanted to understand how RAG actually works end-to-end, from document ingestion through retrieval to grounded generation, without the abstraction of a managed service hiding the moving parts.
Constraints
- Must run locally with no managed vector database; the only external service is the model API
- YouTube routinely blocks requests from cloud IPs, requiring proxy handling
- Subtitles can be manual or auto-generated, with varying quality
- Single API key (OpenRouter) for both embeddings and chat, keeping the stack minimal
Approach
Used LlamaIndex to orchestrate the indexing and querying pipeline. Built a custom ingestion module that fetches English subtitles via youtube-transcript-api, adds timestamp markers, and wraps the result in LlamaIndex Documents, which are chunked, embedded, and stored in a persistent ChromaDB collection. Querying retrieves the top-k most relevant chunks and passes them to an LLM, which answers with source citations. Wrapped everything in a Streamlit UI with configurable model selection and proxy support.
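The timestamp-marker step can be illustrated with a small sketch. This is not the project's ingest.py; it assumes each subtitle segment is a plain dict with `text` and `start` (seconds) keys, and the function names and the `[MM:SS]` marker format are hypothetical choices made for illustration.

```python
from typing import Iterable


def format_timestamp(seconds: float) -> str:
    """Render a start offset in seconds as [MM:SS], or [HH:MM:SS] past one hour."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"
    return f"[{minutes:02d}:{secs:02d}]"


def add_timestamp_markers(segments: Iterable[dict]) -> str:
    """Join subtitle segments into one text, prefixing each with its start time."""
    lines = []
    for segment in segments:
        # Collapse newlines and repeated spaces inside a segment.
        text = " ".join(segment["text"].split())
        if text:  # skip segments that contain only whitespace
            lines.append(f"{format_timestamp(segment['start'])} {text}")
    return "\n".join(lines)


# Hypothetical sample data, not taken from a real video.
sample = [
    {"text": "welcome back\nto the channel", "start": 0.0},
    {"text": "today we cover vector search", "start": 65.4},
    {"text": "  ", "start": 70.0},
    {"text": "and that's a wrap", "start": 3725.9},
]
marked = add_timestamp_markers(sample)
print(marked)
```

Because the markers live inside the text itself, they survive chunking, so any retrieved chunk carries at least one timestamp that a citation can point back to.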
Key Decisions
Use LlamaIndex over LangChain
LlamaIndex provides a tighter, more opinionated abstraction for indexing and querying. For a single-purpose RAG pipeline, its simpler API meant less boilerplate and fewer abstraction layers to debug.
Alternatives considered:
- LangChain
- Raw OpenAI API with manual retrieval
Use ChromaDB as the local vector store
ChromaDB persists to disk out of the box and requires no server process. It's sufficient for a personal tool and avoids the operational overhead of Pinecone or Weaviate.
Alternatives considered:
- Pinecone
- FAISS (in-memory)
Route everything through OpenRouter
OpenRouter provides a single API key that works across multiple model providers (OpenAI, Anthropic, etc.). This let me swap models without changing the integration code, and kept the .env config to a single key.
Alternatives considered:
- Direct OpenAI API
- Ollama (local models)
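A minimal sketch of what a single-key configuration could look like. It assumes OpenRouter's OpenAI-compatible endpoint at https://openrouter.ai/api/v1; the environment variable names (`OPENROUTER_API_KEY`, `CHAT_MODEL`, `EMBED_MODEL`), the `ModelConfig` class, and the default model IDs are placeholders for illustration, not the contents of the project's config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# OpenRouter's OpenAI-compatible base URL.
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


@dataclass(frozen=True)
class ModelConfig:
    """Everything the chat and embedding clients need, driven by one API key."""
    api_key: str
    base_url: str
    chat_model: str
    embed_model: str


def load_config(env: Optional[Mapping[str, str]] = None) -> ModelConfig:
    """Build the config from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return ModelConfig(
        api_key=api_key,
        base_url=OPENROUTER_BASE_URL,
        # Placeholder defaults; swapping models is a config change only.
        chat_model=env.get("CHAT_MODEL", "openai/gpt-4o-mini"),
        embed_model=env.get("EMBED_MODEL", "openai/text-embedding-3-small"),
    )


# Passing a dict instead of os.environ keeps the example self-contained.
config = load_config({
    "OPENROUTER_API_KEY": "sk-or-example",
    "CHAT_MODEL": "anthropic/claude-3.5-sonnet",
})
print(config.chat_model, config.embed_model)
```

Since the same key and base URL serve both the chat and embedding clients, changing providers touches only the model ID strings.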
Fetch subtitles instead of audio transcription
youtube-transcript-api retrieves subtitles in seconds, without downloading the full video or running Whisper. Auto-generated subtitles cover most videos, and the quality is sufficient for RAG retrieval.
Alternatives considered:
- Whisper (local transcription)
- AssemblyAI API
Tech Stack
- Python
- LlamaIndex
- ChromaDB
- OpenAI (via OpenRouter)
- youtube-transcript-api
- yt-dlp
- Streamlit
Result & Impact
Building this project demystified RAG for me. Seeing how chunking, embedding, retrieval, and generation actually connect — and where they break — gave me a concrete understanding that reading about vector search never did. Handling YouTube's IP blocking and auto-generated subtitle quirks also taught me practical lessons about building against real-world data sources.
Learnings
- For answer quality, embedding model choice matters more than LLM choice: the LLM can only ground its answer in the chunks that retrieval surfaces
- Public data sources often block cloud and data-center IPs; proxy support is a feature, not an edge case
- Auto-generated subtitles are noisy; timestamp markers help both retrieval and citation
- A single-purpose tool with a tight scope is more useful than a generalized platform that tries to do everything
- Local persistent vector stores are sufficient for personal tools — you don't need a managed database until you need concurrent access
How It Works
- Ingest — Paste a YouTube URL and the app fetches the video’s English subtitles (manual or auto-generated), chunks the text, embeds it via OpenAI, and stores it in a local ChromaDB vector store.
- Query — Ask a natural-language question and the app retrieves the most relevant transcript segments, then uses an LLM to generate an answer with source citations showing the video title, URL, and similarity score.
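The query step can be sketched in plain Python to show what the vector store and retriever do conceptually. This is a stand-in, not the app's query.py: in the real pipeline LlamaIndex and ChromaDB handle the search, and the embeddings come from a model rather than being hand-written. The `Chunk` class, function names, toy three-dimensional vectors, and placeholder URLs are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Chunk:
    text: str
    title: str
    url: str
    embedding: List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding: Sequence[float], chunks: List[Chunk],
             top_k: int = 2) -> List[Tuple[float, Chunk]]:
    """Score every chunk against the query and keep the top_k best matches."""
    scored = [(cosine_similarity(query_embedding, c.embedding), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def format_citation(score: float, chunk: Chunk) -> str:
    """Render the fields a citation shows: title, URL, similarity score."""
    return f"{chunk.title} ({chunk.url}) score={score:.2f}"


# Toy data with hand-written embeddings and placeholder URLs.
chunks = [
    Chunk("[00:10] intro to embeddings", "Video A", "https://example.com/a", [0.9, 0.1, 0.0]),
    Chunk("[04:20] unrelated tangent", "Video B", "https://example.com/b", [0.0, 1.0, 0.0]),
    Chunk("[02:00] embeddings in practice", "Video C", "https://example.com/c", [0.6, 0.8, 0.0]),
]
results = retrieve([1.0, 0.0, 0.0], chunks, top_k=2)
citations = [format_citation(score, chunk) for score, chunk in results]
print("\n".join(citations))
```

Only the retrieved chunks are placed in the LLM prompt, which is why the ranking step determines what the answer can be grounded in.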
Project Structure
simple-rag/
├── app.py # Streamlit web UI
├── config.py # Configuration and environment variables
├── ingest.py # YouTube subtitle fetching and ChromaDB ingestion
├── query.py # Query engine (retrieval + LLM answer generation)
├── requirements.txt
├── .env.example # Template for environment variables
└── chroma_db/ # Persisted vector store (gitignored)
Proxy Handling
YouTube blocks requests from common cloud IP ranges (AWS, GCP, etc.). The app supports an optional proxy URL configuration in the sidebar so it can be run from cloud environments. Ingestion also includes exponential backoff with retry to handle transient rate limiting.
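Retry with exponential backoff, plus an optional proxy setting, might look roughly like the sketch below. It is not the project's ingestion code: the function names, parameters, and default delays are hypothetical, the proxy mapping follows the `{"http": ..., "https": ...}` convention used by the requests library, and the fetch function is simulated so the example runs without network access.

```python
import random
import time
from typing import Callable, Dict, Optional, Tuple, Type


def build_proxies(proxy_url: Optional[str]) -> Optional[Dict[str, str]]:
    """Turn an optional proxy URL into a requests-style proxies mapping."""
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}


def retry_with_backoff(func: Callable[[], str],
                       max_attempts: int = 4,
                       base_delay: float = 1.0,
                       max_delay: float = 30.0,
                       jitter: float = 0.1,
                       retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Call func, doubling the wait after each failure, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter avoids retrying in lockstep.
            sleep(delay + random.uniform(0, delay * jitter))
    raise ValueError("max_attempts must be at least 1")


class RateLimited(Exception):
    """Stand-in for a transient rate-limit error."""


calls = {"count": 0}


def flaky_fetch() -> str:
    """Simulated fetch that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited("simulated rate limit")
    return "transcript"


delays = []  # record waits instead of actually sleeping
result = retry_with_backoff(flaky_fetch, retry_on=(RateLimited,),
                            jitter=0.0, sleep=delays.append)
print(result, delays)
print(build_proxies("http://proxy.example:8080"))
```

Restricting `retry_on` to transient errors matters: a video with no subtitles at all should fail immediately rather than be retried.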
Source Code
The full project is available on GitHub at omgsian/simple-rag.