🤖
Mac Playbook
โฑ 30 min

Video Search & Summarization

Transcribe and analyze video with Whisper + VLMs

Replaces DGX Spark: Video Search & Summarization
videowhisper

Basic idea

Video search and summarization pipelines work in two stages: first, transcribe audio to text; then, use an LLM to analyze the text. Whisper is OpenAI's speech recognition model trained on 680,000 hours of multilingual audio. mlx-whisper is a port of Whisper to Apple's MLX framework, running natively on the Metal GPU in Apple Silicon Macs and achieving real-time or faster transcription where the same model on a laptop CPU would take 5-10x longer.
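A minimal sketch of those two stages, assuming `mlx-whisper` and the `ollama` Python client are installed (`pip install mlx-whisper ollama`), Ollama is running locally, and a hypothetical input file `lecture.mp4`:

```python
import mlx_whisper
import ollama

# Stage 1: transcribe. mlx-whisper hands the .mp4 to ffmpeg for audio
# extraction, so no separate conversion step is needed.
result = mlx_whisper.transcribe(
    "lecture.mp4",  # hypothetical input file
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
transcript = result["text"]

# Stage 2: analyze the transcript with a local LLM via Ollama.
response = ollama.chat(
    model="qwen2.5:7b",
    messages=[{
        "role": "user",
        "content": f"Summarize the key points of this transcript:\n\n{transcript}",
    }],
)
print(response["message"]["content"])
```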

The resulting transcript is a searchable, queryable document. A 60-minute lecture becomes a text file you can summarize in seconds, search for any topic, or ask questions about, all with a local LLM and nothing sent to the cloud.

What you'll accomplish

A pipeline that takes any local video file, transcribes it with word-level timestamps using mlx-whisper (large-v3-turbo model), saves a structured transcript, and uses Ollama to summarize key points, answer questions about the video content, and generate a chapter table of contents with timestamps.
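One way the chapter-generation step could look: give the LLM timestamped segment lines so it can anchor each chapter to a time. This is a sketch, not the playbook's exact code; the file name and prompt wording are assumptions:

```python
import mlx_whisper
import ollama

result = mlx_whisper.transcribe(
    "lecture.mp4",  # hypothetical input file
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)

def fmt(seconds: float) -> str:
    """Format seconds as M:SS."""
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

# One line per Whisper segment, prefixed with its start time.
timestamped = "\n".join(
    f"[{fmt(seg['start'])}] {seg['text'].strip()}" for seg in result["segments"]
)

response = ollama.chat(
    model="qwen2.5:7b",
    messages=[{
        "role": "user",
        "content": "Generate a chapter table of contents with timestamps "
                   f"for this video:\n\n{timestamped}",
    }],
)
print(response["message"]["content"])
```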

What to know before starting

Whisper: OpenAI's multilingual speech recognition model. `whisper-large-v3-turbo` is a fine-tuned version of large-v3 with most of its decoder layers pruned away, making it several times faster than the full large model with minimal quality loss — wait, no em-dash — making it several times faster than the full large model with minimal quality loss; it's the right choice for most transcription tasks.
Word timestamps: Whisper can output a start and end time for every word in the transcript. This enables jumping to specific moments: "the speaker mentioned neural networks at 4:32" (see the word-timestamp sketch after this list).
Faster than realtime: Transcribing a 60-minute video in exactly 60 minutes is realtime (1x); anything quicker is faster than realtime. mlx-whisper achieves ~10x realtime on M2 and later (a 60-minute video transcribed in ~6 minutes); on CPU, the same model takes 60+ minutes.
VAD (Voice Activity Detection): Skips silence in the audio, speeding up processing and preventing Whisper from hallucinating text over silent sections.
Context window limits: Long transcripts may exceed the LLM's context window. A 2-hour lecture transcript (~15,000 words) exceeds the context of smaller Ollama models, so you'll need to chunk or truncate (a chunked-summarization sketch follows this list).
ffmpeg: mlx-whisper delegates audio decoding to ffmpeg. It handles video-to-audio extraction automatically, so you can pass .mp4, .mov, .mkv, etc. directly.
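The word-timestamp sketch referenced in the list: with `word_timestamps=True`, each segment in the output carries a `words` list with per-word start and end times, mirroring openai-whisper's format. The search term and file name are placeholders:

```python
import mlx_whisper

result = mlx_whisper.transcribe(
    "lecture.mp4",  # hypothetical input file
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    word_timestamps=True,
)

def fmt(seconds: float) -> str:
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

# Find every moment a topic is mentioned, e.g. "neural networks at 4:32".
query = "neural"
for seg in result["segments"]:
    for word in seg["words"]:
        if query in word["word"].lower():
            print(f'{fmt(word["start"])}  {word["word"].strip()}')
```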
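For the context-window caveat, a chunked map-reduce summarization sketch: summarize fixed-size chunks, then summarize the summaries. The chunk size is an assumption to tune against your model's actual context length, and `transcript.txt` is a hypothetical file saved by the transcription step:

```python
from pathlib import Path
import ollama

CHUNK_WORDS = 3000  # assumed safe size for a 7B model; tune to your model

def summarize(text: str) -> str:
    resp = ollama.chat(
        model="qwen2.5:7b",
        messages=[{"role": "user", "content": f"Summarize:\n\n{text}"}],
    )
    return resp["message"]["content"]

transcript = Path("transcript.txt").read_text()  # saved by the earlier step
words = transcript.split()
chunks = [
    " ".join(words[i:i + CHUNK_WORDS])
    for i in range(0, len(words), CHUNK_WORDS)
]

# Map: summarize each chunk. Reduce: summarize the combined summaries.
partials = [summarize(chunk) for chunk in chunks]
print(summarize("\n\n".join(partials)))
```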

Prerequisites

• macOS 14.0+, Apple Silicon (M1 or later)
• Python 3.10+
• Ollama running with `qwen2.5:7b` pulled
• ffmpeg: `brew install ffmpeg`

Time & risk

Duration: 30 minutes
Risk level: Low; read-only operations on your video file, no system changes