Simple YouTube RAG

Data Engineer · 2026 · 1 week · 3 min read

Built a RAG application that lets you ask questions about YouTube video content by ingesting subtitles, embedding them with OpenAI, and grounding LLM answers in transcript sources.

Overview

A Retrieval-Augmented Generation app that fetches YouTube subtitles, chunks and embeds them with an OpenAI embedding model (routed through OpenRouter), stores the vectors in a local ChromaDB instance, and answers natural-language questions with source citations, all through a Streamlit UI.

Problem

I wanted to understand how RAG actually works end-to-end, from document ingestion through retrieval to grounded generation, without the abstraction of a managed service hiding the moving parts.

Constraints

  • The app and its vector store must run locally, with no managed vector database; the only hosted dependency is the model API
  • YouTube routinely blocks requests from cloud IPs, requiring proxy handling
  • Subtitles can be manual or auto-generated, with varying quality
  • Single API key (OpenRouter) for both embeddings and chat, keeping the stack minimal

Approach

Used LlamaIndex to orchestrate the indexing and querying pipeline. Built a custom ingestion module that fetches English subtitles via youtube-transcript-api, adds timestamp markers, and stores them as LlamaIndex Documents in a persistent ChromaDB collection. At query time, the app retrieves the top-k most relevant chunks and passes them to an LLM, which answers with citations back to the source video. Wrapped everything in a Streamlit UI with configurable model selection and proxy support.
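The timestamp-marker step is easy to picture in isolation. The following is a minimal sketch, not the project's actual ingest.py: it assumes each subtitle segment is a dict with "text" and "start" (seconds) keys, and the function names and the [MM:SS] marker format are hypothetical.

```python
from typing import Iterable


def format_timestamp(seconds: float) -> str:
    """Render a second offset as [MM:SS], or [HH:MM:SS] past one hour."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"
    return f"[{minutes:02d}:{secs:02d}]"


def add_timestamp_markers(segments: Iterable[dict]) -> str:
    """Join subtitle segments into one text, each line prefixed with its start time.

    Assumed segment shape (hypothetical): {"text": str, "start": float}.
    """
    lines = []
    for segment in segments:
        # Collapse embedded newlines and repeated spaces from the subtitle track.
        text = " ".join(segment["text"].split())
        if text:  # skip empty segments, which are common in auto-generated tracks
            lines.append(f"{format_timestamp(segment['start'])} {text}")
    return "\n".join(lines)
```

Because the marker travels with the text, a retrieved chunk can point back to the moment in the video it came from.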

Key Decisions

Use LlamaIndex over LangChain

Reasoning:

LlamaIndex provides a tighter, more opinionated abstraction for indexing and querying. For a single-purpose RAG pipeline, its simpler API meant less boilerplate and fewer abstraction layers to debug.

Alternatives considered:
  • LangChain
  • Raw OpenAI API with manual retrieval

Use ChromaDB as the local vector store

Reasoning:

ChromaDB persists to disk out of the box and requires no server process. It's sufficient for a personal tool and avoids the operational overhead of Pinecone or Weaviate.

Alternatives considered:
  • Pinecone
  • FAISS (in-memory)

Route everything through OpenRouter

Reasoning:

OpenRouter provides a single API key that works across multiple model providers (OpenAI, Anthropic, etc.). This let me swap models without changing the integration code, and kept the .env config to a single key.
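A single-key setup like this might be loaded as follows. This is a sketch under assumptions rather than the project's config.py: the environment variable names, the Settings class, and the default model slugs are illustrative. The base URL is OpenRouter's OpenAI-compatible endpoint.

```python
import os
from dataclasses import dataclass

# OpenRouter's OpenAI-compatible endpoint.
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: str
    chat_model: str
    embed_model: str


def load_settings(env=None) -> Settings:
    """Build settings from environment variables (names are assumptions).

    Only the API key is required; the model slugs fall back to
    illustrative defaults in OpenRouter's "provider/model" form.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return Settings(
        api_key=api_key,
        base_url=env.get("OPENROUTER_BASE_URL", OPENROUTER_BASE_URL),
        chat_model=env.get("CHAT_MODEL", "openai/gpt-4o-mini"),
        embed_model=env.get("EMBED_MODEL", "openai/text-embedding-3-small"),
    )
```

Swapping providers then means changing one model slug, while the key and base URL stay the same.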

Alternatives considered:
  • Direct OpenAI API
  • Ollama (local models)

Fetch subtitles instead of audio transcription

Reasoning:

youtube-transcript-api retrieves subtitles in seconds, without downloading the video or running Whisper. Auto-generated subtitles cover most videos, and although they are noisy, the quality is sufficient for RAG retrieval.
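youtube-transcript-api looks up subtitles by video ID rather than by full URL, so a pasted URL needs a small parsing step first. The sketch below shows one way to do that with the standard library. The function name and the set of URL shapes it accepts are assumptions, not the project's code.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, or None.

    Handles watch?v=..., youtu.be/..., and /shorts/, /embed/, /live/ paths.
    """
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
        return candidate or None

    if host in {"youtube.com", "m.youtube.com"}:
        if parsed.path == "/watch":
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in {"shorts", "embed", "live"}:
            return parts[1]

    return None  # not a recognized YouTube URL
```

Returning None instead of raising lets the UI show a friendly "not a YouTube link" message.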

Alternatives considered:
  • Whisper (local transcription)
  • AssemblyAI API

Tech Stack

  • Python
  • LlamaIndex
  • ChromaDB
  • OpenAI (via OpenRouter)
  • youtube-transcript-api
  • yt-dlp
  • Streamlit

Result & Impact

Building this project demystified RAG for me. Seeing how chunking, embedding, retrieval, and generation actually connect — and where they break — gave me a concrete understanding that reading about vector search never did. Handling YouTube's IP blocking and auto-generated subtitle quirks also taught me practical lessons about building against real-world data sources.

Learnings

  • Embedding model choice matters more than LLM choice: the LLM can only ground its answer in the chunks that retrieval surfaces
  • Public data sources often block cloud and server IPs; proxy support is a feature, not an edge case
  • Auto-generated subtitles are noisy; timestamp markers help both retrieval and citation
  • A single-purpose tool with a tight scope is more useful than a generalized platform that tries to do everything
  • Local persistent vector stores are sufficient for personal tools — you don't need a managed database until you need concurrent access

How It Works

  1. Ingest — Paste a YouTube URL and the app fetches the video’s English subtitles (manual or auto-generated), chunks the text, embeds it via OpenAI, and stores it in a local ChromaDB vector store.
  2. Query — Ask a natural-language question and the app retrieves the most relevant transcript segments, then uses an LLM to generate an answer with source citations showing the video title, URL, and similarity score.
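To make the query step concrete, here is a toy version of top-k retrieval with similarity scores and citations. In the app itself this work is delegated to LlamaIndex and ChromaDB. The sketch only illustrates the idea with plain cosine similarity, and the chunk shape ("title", "url", "vector") and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vector, chunks, k=3):
    """Return the k chunks most similar to the query as (score, chunk) pairs, best first."""
    scored = [(cosine_similarity(query_vector, chunk["vector"]), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def format_citation(score, chunk):
    """Render one source line with video title, URL, and similarity score."""
    return f'{chunk["title"]} ({chunk["url"]}) score={score:.2f}'
```

The retrieved chunks become the context for the LLM prompt, and the same (score, chunk) pairs feed the citation list shown under the answer.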

Project Structure

simple-rag/
├── app.py          # Streamlit web UI
├── config.py       # Configuration and environment variables
├── ingest.py       # YouTube subtitle fetching and ChromaDB ingestion
├── query.py        # Query engine (retrieval + LLM answer generation)
├── requirements.txt
├── .env.example    # Template for environment variables
└── chroma_db/      # Persisted vector store (gitignored)

Proxy Handling

YouTube blocks requests from common cloud IP ranges (AWS, GCP, etc.). The sidebar accepts an optional proxy URL so the app can also run from cloud environments. Ingestion retries with exponential backoff to handle transient rate limiting.
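The retry behaviour can be sketched as a small wrapper. This is a generic illustration of exponential backoff, not the project's implementation: the function name, default attempt count, and base delay are assumptions, and the sleep function is injectable so the logic can be tested without waiting.

```python
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=1.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Call fn(), retrying on failure with delays of base_delay * 2**attempt.

    With the defaults, the waits between attempts are 1s, 2s, and 4s.
    The last failure is re-raised so the caller can report it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
```

A caller would wrap the subtitle fetch in a zero-argument function or lambda and pass the exception types worth retrying via retry_on, so that permanent errors such as a video with no subtitles fail immediately.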

Source Code

The full project is available on GitHub at omgsian/simple-rag.