Video search and summarization pipelines work in two stages: first, transcribe audio to text; then, use an LLM to analyze the text. Whisper is OpenAI's speech recognition model trained on 680,000 hours of multilingual audio. mlx-whisper is a port of Whisper to Apple's MLX framework, which runs natively on the Metal GPU in Apple Silicon Macs, achieving real-time or faster transcription where the same model on a laptop CPU would take 5-10x longer.
The resulting transcript is a searchable, queryable document. A 60-minute lecture becomes a text file you can summarize in seconds, search for any topic, or ask questions about, all with a local LLM and nothing sent to the cloud.
What you'll accomplish
A pipeline that takes any local video file, transcribes it with word-level timestamps using mlx-whisper (large-v3-turbo model), saves a structured transcript, and uses Ollama to summarize key points, answer questions about the video content, and generate a chapter table of contents with timestamps.
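The pipeline above can be sketched end to end. This is a minimal, hedged sketch: the input filename `lecture.mp4` and output `transcript.json` are hypothetical, and it assumes the `mlx-community/whisper-large-v3-turbo` model repo on Hugging Face and an Ollama server on its default port (`localhost:11434`).

```python
import json
import urllib.request


def format_timestamp(seconds: float) -> str:
    """Render seconds as M:SS for chapter markers and search results."""
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"


def ask_ollama(prompt: str, model: str = "qwen2.5:7b") -> str:
    """Send a prompt to the local Ollama server and return its reply."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    import mlx_whisper  # pip install mlx-whisper

    # mlx-whisper hands the video to ffmpeg for audio extraction,
    # so the .mp4 can be passed directly.
    result = mlx_whisper.transcribe(
        "lecture.mp4",  # hypothetical input file
        path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
        word_timestamps=True,
    )
    with open("transcript.json", "w") as f:
        json.dump(result, f)  # structured transcript, reusable later

    print(ask_ollama("Summarize the key points of this lecture:\n\n" + result["text"]))
```

The same `ask_ollama` helper serves the other tasks: swap the prompt to ask questions about the content, or to request a chapter table of contents built from segment timestamps.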
What to know before starting
Whisper: OpenAI's multilingual speech recognition model. `whisper-large-v3-turbo` is a distilled version trained to be 4x faster than the full large model with minimal quality loss, making it the right choice for most transcription tasks.
Word timestamps: Whisper can output a start and end time for every word in the transcript. This enables jumping to specific moments: "the speaker mentioned neural networks at 4:32".
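A small sketch of that lookup: with `word_timestamps=True`, each transcript segment carries a `words` list with per-word `start`/`end` times, which can be scanned for a query. The sample data here is illustrative, not real model output.

```python
def find_mentions(segments, query):
    """Return (timestamp, word) pairs for every word matching the query,
    scanning Whisper-style segments that carry word-level timing."""
    hits = []
    for seg in segments:
        for w in seg.get("words", []):
            if query.lower() in w["word"].lower():
                m, s = divmod(int(w["start"]), 60)
                hits.append((f"{m}:{s:02d}", w["word"].strip()))
    return hits


# Illustrative data in the shape produced with word_timestamps=True
segments = [
    {"words": [
        {"word": " neural", "start": 272.1, "end": 272.5},
        {"word": " networks", "start": 272.5, "end": 273.0},
    ]},
]
print(find_mentions(segments, "neural"))  # [('4:32', 'neural')]
```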
Faster than realtime: transcribing a 60-minute video in under 60 minutes is realtime. mlx-whisper achieves ~10x realtime on M2 and later, so a 60-minute video transcribes in ~6 minutes; on CPU, the same model takes an hour or more.
VAD (Voice Activity Detection): Skips silence in the audio, speeding up processing and preventing Whisper from hallucinating text over silent sections.
Context window limits: Long transcripts may exceed the LLM's context window. A 2-hour lecture transcript (~15,000 words) exceeds the context of smaller Ollama models, so you'll need to chunk or truncate.
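One common workaround is to split the transcript into overlapping chunks, summarize each separately, then merge the partial summaries in a final pass. A minimal sketch, where the chunk size and overlap are assumptions to tune for your model:

```python
def chunk_text(text: str, max_words: int = 3000, overlap: int = 200):
    """Split a long transcript into overlapping word-count chunks that
    each fit a small model's context window. The overlap preserves
    continuity across chunk boundaries."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + max_words]))
        if i + max_words >= len(words):
            break
    return chunks
```

Each chunk is then summarized on its own, and the per-chunk summaries, which are far shorter than the transcript, are concatenated and summarized once more.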
ffmpeg: mlx-whisper delegates audio decoding to ffmpeg. It handles video-to-audio extraction automatically, so you can pass .mp4, .mov, .mkv, etc. directly.
Prerequisites
• macOS 14.0+, Apple Silicon (M1 or later)
• Python 3.10+
• Ollama running with `qwen2.5:7b` pulled
• ffmpeg: `brew install ffmpeg`
Time & risk
Duration: 30 minutes
Risk level: Low. Read-only operations on your video file; no system changes.