🤖
Mac Playbook
โฑ 30 min

Video Search & Summarization

Transcribe and analyze video with Whisper + VLMs

Replaces DGX Spark: Video Search & Summarization
videowhisper

Basic idea

Video search and summarization pipelines work in two stages: first, transcribe audio to text; then, use an LLM to analyze the text. Whisper is OpenAI's speech recognition model trained on 680,000 hours of multilingual audio. mlx-whisper is a port of Whisper to Apple's MLX framework, running natively on the Metal GPU in Apple Silicon Macs and achieving real-time or faster transcription where the same model on a laptop CPU would take 5-10x longer.
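A minimal sketch of those two stages, assuming `mlx-whisper` and the `ollama` Python client are installed (`pip install mlx-whisper ollama`), Ollama is running locally, and a hypothetical input file `lecture.mp4`:

```python
import mlx_whisper
import ollama

# Stage 1: transcribe. mlx-whisper hands the .mp4 to ffmpeg for audio
# extraction, so no separate conversion step is needed.
result = mlx_whisper.transcribe(
    "lecture.mp4",  # hypothetical input file
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
transcript = result["text"]

# Stage 2: analyze the transcript with a local LLM via Ollama.
response = ollama.chat(
    model="qwen2.5:7b",
    messages=[{
        "role": "user",
        "content": f"Summarize the key points of this transcript:\n\n{transcript}",
    }],
)
print(response["message"]["content"])
```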

The resulting transcript is a searchable, queryable document. A 60-minute lecture becomes a text file you can summarize in seconds, search for any topic, or ask questions about, all with a local LLM and nothing sent to the cloud.

What you'll accomplish

A pipeline that takes any local video file, transcribes it with word-level timestamps using mlx-whisper (large-v3-turbo model), saves a structured transcript, and uses Ollama to summarize key points, answer questions about the video content, and generate a chapter table of contents with timestamps.
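One way the chapter-generation step could look: give the LLM timestamped segment lines so it can anchor each chapter to a time. This is a sketch, not the playbook's exact code; the file name and prompt wording are assumptions:

```python
import mlx_whisper
import ollama

result = mlx_whisper.transcribe(
    "lecture.mp4",  # hypothetical input file
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)

def fmt(seconds: float) -> str:
    """Format seconds as M:SS."""
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

# One line per Whisper segment, prefixed with its start time.
timestamped = "\n".join(
    f"[{fmt(seg['start'])}] {seg['text'].strip()}" for seg in result["segments"]
)

response = ollama.chat(
    model="qwen2.5:7b",
    messages=[{
        "role": "user",
        "content": "Generate a chapter table of contents with timestamps "
                   f"for this video:\n\n{timestamped}",
    }],
)
print(response["message"]["content"])
```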

What to know before starting

Whisper: OpenAI's multilingual speech recognition model. `whisper-large-v3-turbo` is a fine-tuned version of large-v3 with most of its decoder layers pruned away, making it several times faster than the full large model with minimal quality loss — wait, no em-dash — making it several times faster than the full large model with minimal quality loss; it's the right choice for most transcription tasks.
Word timestamps: Whisper can output a start and end time for every word in the transcript. This enables jumping to specific moments: "the speaker mentioned neural networks at 4:32" (see the word-timestamp sketch after this list).
Faster than realtime: Transcribing a 60-minute video in exactly 60 minutes is realtime (1x); anything quicker is faster than realtime. mlx-whisper achieves ~10x realtime on M2 and later (a 60-minute video transcribed in ~6 minutes); on CPU, the same model takes 60+ minutes.
VAD (Voice Activity Detection): Skips silence in the audio, speeding up processing and preventing Whisper from hallucinating text over silent sections.
Context window limits: Long transcripts may exceed the LLM's context window. A 2-hour lecture transcript (~15,000 words) exceeds the context of smaller Ollama models, so you'll need to chunk or truncate (a chunked-summarization sketch follows this list).
ffmpeg: mlx-whisper delegates audio decoding to ffmpeg. It handles video-to-audio extraction automatically, so you can pass .mp4, .mov, .mkv, etc. directly.
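The word-timestamp sketch referenced in the list: with `word_timestamps=True`, each segment in the output carries a `words` list with per-word start and end times, mirroring openai-whisper's format. The search term and file name are placeholders:

```python
import mlx_whisper

result = mlx_whisper.transcribe(
    "lecture.mp4",  # hypothetical input file
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    word_timestamps=True,
)

def fmt(seconds: float) -> str:
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

# Find every moment a topic is mentioned, e.g. "neural networks at 4:32".
query = "neural"
for seg in result["segments"]:
    for word in seg["words"]:
        if query in word["word"].lower():
            print(f'{fmt(word["start"])}  {word["word"].strip()}')
```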
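For the context-window caveat, a chunked map-reduce summarization sketch: summarize fixed-size chunks, then summarize the summaries. The chunk size is an assumption to tune against your model's actual context length, and `transcript.txt` is a hypothetical file saved by the transcription step:

```python
from pathlib import Path
import ollama

CHUNK_WORDS = 3000  # assumed safe size for a 7B model; tune to your model

def summarize(text: str) -> str:
    resp = ollama.chat(
        model="qwen2.5:7b",
        messages=[{"role": "user", "content": f"Summarize:\n\n{text}"}],
    )
    return resp["message"]["content"]

transcript = Path("transcript.txt").read_text()  # saved by the earlier step
words = transcript.split()
chunks = [
    " ".join(words[i:i + CHUNK_WORDS])
    for i in range(0, len(words), CHUNK_WORDS)
]

# Map: summarize each chunk. Reduce: summarize the combined summaries.
partials = [summarize(chunk) for chunk in chunks]
print(summarize("\n\n".join(partials)))
```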

Prerequisites

• macOS 14.0+, Apple Silicon (M1 or later)
• Python 3.10+
• Ollama running with `qwen2.5:7b` pulled
• ffmpeg: `brew install ffmpeg`

Time & risk

Duration: 30 minutes
Risk level: Low; read-only operations on your video file, no system changes