
Getting LLMs to understand videos and audio.

Dylan Moore · 4 min read

LLMs accept text and images. That's it. Audio, video, raw frames — none of it. You either convert it yourself or you pretend the problem doesn't exist.

Most setups pretend. They skip media entirely or dump a raw transcript straight into context and call it a day. The second approach is worse than it sounds. A one-hour podcast is ~8,000 words. A full conference day is ~60,000. Shove that unfiltered into a context window and you've either hit the limit, wasted money on tokens you didn't need, or gotten a worse response because the relevant part drowned in everything else. Video is worse — frames are large, and naive approaches either skip visuals entirely or immediately exceed any reasonable budget.

@dymoo/media-understanding is an MCP server I built to fix this. Open source, TypeScript, MIT.


Discover, Analyze, Iterate

The package is opinionated about workflow:

1. discover  → probe_media      cheap, batch-safe, no decoding
2. analyze   → understand_media one file, expensive, full output
3. iterate   → get_transcript   narrow the window with timestamps
               get_video_grids  just the frames you need
               get_frames       exact seconds, exact frames

Probe first. probe_media reads file headers only — 5–50ms per file, up to 200 files via glob, no decoding. Type, duration, resolution, codec, size. No tokens, no models invoked. Figure out what you have before you commit to processing any of it.

Analyze one file at a time. understand_media is the expensive call. Transcript plus keyframe grids for a single file. If you've got 10+ files, the tool descriptions actively teach the calling agent to launch one subagent per file — each does the heavy lifting, returns a summary, then gets thrown away. Raw transcripts and base64 images are discarded; the orchestrator works with summaries, not raw media. Simon Willison wrote about timestamped video frames for LLMs; this operationalizes that into a workflow that's conservative about what ends up in context.

Narrow, don't widen. Once you know where the interesting part is — a timestamp in a transcript, a grid — use get_transcript with start_sec/end_sec. Pull exactly the frames you need. You're zooming in, not dumping everything.


The Five Tools

probe_media — batch metadata scan, no decoding:

{
  "name": "probe_media",
  "arguments": { "paths": "recordings/**/*.{mp4,mp3,wav}" }
}

understand_media — full single-file analysis. Transcript plus keyframe grids, interleaved chronologically. More on why that matters below. For screen recordings where scene changes are more useful than uniform sampling:

{
  "name": "understand_media",
  "arguments": {
    "file_path": "/path/to/walkthrough.mp4",
    "sampling_strategy": "scene",
    "thumb_width": 320
  }
}

get_transcript — text only, with time windowing. text for timestamped lines, srt for subtitles, json for ms-precision segments. The full file is transcribed once and cached — subsequent calls with different windows are instant.

{
  "name": "get_transcript",
  "arguments": {
    "file_path": "/path/to/podcast.mp3",
    "format": "srt",
    "start_sec": 600,
    "end_sec": 900
  }
}
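
Because the full file is transcribed once and cached, a follow-up call against the same file with a different window returns immediately. For example, zooming in on a single minute in JSON format (same parameters as above, illustrative values):

```json
{
  "name": "get_transcript",
  "arguments": {
    "file_path": "/path/to/podcast.mp3",
    "format": "json",
    "start_sec": 1800,
    "end_sec": 1860
  }
}
```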

get_video_grids — JPEG contact sheets of keyframes. Budget-aware: omit max_grids and the server auto-fits as many as possible within the character limit.
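
A call that caps the response at two contact sheets might look like this — a sketch: only max_grids is named above, so the exact argument shape (beyond the file_path convention shared by the other tools) is an assumption:

```json
{
  "name": "get_video_grids",
  "arguments": {
    "file_path": "/path/to/walkthrough.mp4",
    "max_grids": 2
  }
}
```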

get_frames — exact individual frames at specific seconds. Use this once you've identified timestamps and want the actual frame.
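
For illustration, pulling two specific moments might look like the following. Treat the parameter name timestamps_sec as a guess — the docs above only say the tool takes exact seconds:

```json
{
  "name": "get_frames",
  "arguments": {
    "file_path": "/path/to/walkthrough.mp4",
    "timestamps_sec": [135, 142]
  }
}
```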


Two things worth understanding

The budget error teaches you how to fix it

Every MCP response is capped at 48,000 serialized characters by default. When a request would exceed that, the server doesn't silently truncate — it returns an error with the overage ratio, an estimated LLM vision token count using Claude's pixels/750 formula, and specific suggestions:
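
An illustrative sketch of what such an error could contain — not the server's literal output, but the three pieces the text above describes (overage ratio, vision-token estimate, actionable suggestions):

```json
{
  "error": "Response would exceed the 48,000-character budget (2.0x over)",
  "estimated_vision_tokens": 12000,
  "suggestions": [
    "Reduce thumb_width from 480 to 120",
    "Narrow the time window to 0-300s",
    "Lower max_grids"
  ]
}
```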

That last part is the bit that matters. "You exceeded the budget" is useless. "Reduce thumb_width from 480 to 120 or narrow the time window to 0–300s" is something an agent can act on without a human in the loop.

Interleaved output isn't an aesthetic choice

understand_media doesn't return a transcript block followed by all the images. It interleaves them:

[transcript 0–30s]
[grid covering 0–30s]
[transcript 30–60s]
[grid covering 30–60s]
...

If you return them as disconnected blocks, the model has to re-correlate speech to visuals across the whole document. Interleaving means when it reads "presenter moves to the whiteboard at 2:15", the frame is right there. Speech and visuals share context rather than just sharing a response.


Setup

npm install @dymoo/media-understanding

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "media-understanding": {
      "command": "npx",
      "args": ["-y", "@dymoo/media-understanding/mcp"]
    }
  }
}

Same pattern for Cursor, Windsurf, Cline, OpenCode. First run auto-downloads the Whisper model (base.en-q5_1, ~57 MB) to ~/.cache/media-understanding/models. The English-only model already emits non-speech tokens like [Music] and (applause). You don't need the multilingual variant — it has measurably worse English accuracy anyway.


The honest bit

Node >=22, ESM-only. Hard requirement. Native bindings need it.

Native bindings. Frame extraction uses node-av N-API bindings, no FFmpeg CLI spawning. Fast, no system FFmpeg dependency. Platform-specific install pain if something goes wrong.

AV1 on macOS. VideoToolbox is selected automatically. AV1 frames come in as IOSurface and can't be downloaded to system memory via hwdownload. The adapter falls back to software decode per-file and logs to stderr. Expected, not a bug.

Heavy files still take time. Two-hour video, first transcription pass — you're waiting minutes. That's Whisper, not this package.

v0.1.0. It's new. Smoke-test your specific file types before putting it anywhere near a production pipeline.


GitHub · npm

Written by Dylan Moore

Self-taught developer since age 13. Sold first software company at 16 for $60K, second for mid-six figures. Founded multiple ventures. Currently founding developer at PodFirst.
