Media ingestion turns audio, video, and transcript files into structured vault references. ztlctl transcribes audio and video locally using faster-whisper (no data leaves your machine), and parses pre-existing transcript files without any external dependency. Every ingest operation creates a captured reference — the first phase of a two-phase workflow that ends when an agent or user annotates and promotes the reference to annotated status.
Prerequisites
Warning
Audio and video transcription requires faster-whisper, which is an optional dependency not installed by default. Transcript files (.txt, .vtt, .srt) do not require faster-whisper — they are parsed locally with no extra dependencies.
Install faster-whisper before ingesting audio or video:
$ uv add --group media faster-whisper
If faster-whisper is not installed and you attempt to ingest an audio or video file, the command returns a DEPENDENCY_MISSING error with the install hint above.
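To check ahead of time whether audio and video ingestion is likely to work, you can test for the optional package from Python. This is a minimal sketch, not ztlctl's own check: the helper name `faster_whisper_available` is hypothetical, and the only assumption is that the package installs under the import name `faster_whisper`.

```python
import importlib.util


def faster_whisper_available() -> bool:
    """Return True if the optional faster-whisper package can be imported."""
    # find_spec locates the module without importing it, so no model is loaded.
    return importlib.util.find_spec("faster_whisper") is not None


if __name__ == "__main__":
    if faster_whisper_available():
        print("faster-whisper is installed; audio and video ingestion should work")
    else:
        print("faster-whisper is missing; run: uv add --group media faster-whisper")
```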
Supported formats
| Format | Type | Requires faster-whisper | Notes |
|---|---|---|---|
| .mp3 | Audio | Yes | Common podcast and voice recording format |
| .m4a | Audio | Yes | Apple audio; common for voice memos |
| .wav | Audio | Yes | Uncompressed audio; large files |
| .ogg | Audio | Yes | Open audio format |
| .flac | Audio | Yes | Lossless audio format |
| .mp4 | Video | Yes | Common video format; audio track is transcribed |
| .mkv | Video | Yes | Matroska container; audio track is transcribed |
| .webm | Video | Yes | Web video format; audio track is transcribed |
| .txt | Transcript | No | Plain text transcript; ingested as-is |
| .vtt | Transcript | No | WebVTT subtitle/caption format; timestamps stripped |
| .srt | Transcript | No | SubRip subtitle format; sequence numbers and timestamps stripped |
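The table above amounts to a lookup from file extension to input type. The sketch below illustrates that routing; it is not ztlctl's implementation. The function name `classify` and the case-insensitive extension match are assumptions made for the example.

```python
from pathlib import Path

# Extension sets copied from the supported formats table.
AUDIO = {".mp3", ".m4a", ".wav", ".ogg", ".flac"}
VIDEO = {".mp4", ".mkv", ".webm"}
TRANSCRIPT = {".txt", ".vtt", ".srt"}


def classify(path: str) -> tuple[str, bool]:
    """Return (kind, needs_whisper) for a path, based on its extension."""
    # Lowercasing the suffix is an assumption; the docs do not state case rules.
    ext = Path(path).suffix.lower()
    if ext in AUDIO:
        return ("audio", True)
    if ext in VIDEO:
        return ("video", True)
    if ext in TRANSCRIPT:
        return ("transcript", False)
    raise ValueError(f"unsupported extension: {ext or '(none)'}")
```

For example, `classify("captions.vtt")` yields `("transcript", False)`, which is why transcript files can be ingested without faster-whisper.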
Ingesting media files
Use ztlctl ingest media to ingest an audio or video file. ztlctl loads the whisper model, transcribes the file locally, and creates a captured reference in your vault.
$ ztlctl ingest media PATH [OPTIONS]
Arguments and options:
| Argument / Flag | Required | Description |
|---|---|---|
| PATH | Yes | Filesystem path to the media or transcript file |
| --title TEXT | No | Title override for the captured reference (defaults to the file stem) |
| --topic TEXT | No | Topic directory under notes/ for routing |
| --tags TEXT | No | Tags applied to the captured reference (repeatable) |
| --summary TEXT | No | Capture summary hint written into the reference frontmatter |
| --dry-run | No | Preview ingestion without creating any files |
Example — ingest a podcast recording:
$ ztlctl ingest media recordings/interview-2026-03-21.mp3 \
    --title "Interview: distributed systems patterns" \
    --topic ml/research \
    --tags podcast --tags distributed-systems
Example — preview before writing:
$ ztlctl ingest media lecture.mp4 --dry-run
The --dry-run flag returns a preview with the first 280 characters of the normalized transcript, the resolved title, and the ingest metadata — no files are written.
Note
Transcription runs entirely on your local machine. The audio or video file is never sent to an external service. The whisper model is downloaded once to a local cache on first use. See Configuration to choose a different model size.
Ingesting transcripts
If you already have a transcript file — from an external transcription service, a video platform, or a manual capture — use ztlctl ingest media with the transcript file directly. No faster-whisper installation is required.
WebVTT file:
$ ztlctl ingest media captions.vtt --title "Conference talk: CRDT internals"
SRT file:
$ ztlctl ingest media subtitles.srt --title "Lecture 4 — consensus protocols"
Plain text transcript:
$ ztlctl ingest media transcript.txt --title "Team meeting notes 2026-03-21"
VTT and SRT files are automatically stripped of timestamps, sequence numbers, and header lines before ingestion. The resulting plain text is stored in the reference body and the normalized text bundle.
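The stripping step described above can be approximated in a few lines. This is a simplified sketch, not ztlctl's parser: the function name `strip_subtitles` is hypothetical, only the `WEBVTT` header line is treated as a header, and joining the remaining cue text with single spaces is an assumption.

```python
import re

# Matches cue timing lines such as "00:00:01.000 --> 00:00:04.000" (VTT)
# or "00:00:01,000 --> 00:00:04,000" (SRT); VTT may omit the hours field.
TIMING = re.compile(r"^\s*(\d+:)?\d{2}:\d{2}[.,]\d{3}\s+-->\s+")


def strip_subtitles(raw: str) -> str:
    """Drop the header, sequence numbers, and timing lines; keep spoken text."""
    kept = []
    lines = raw.splitlines()
    for i, line in enumerate(lines):
        text = line.strip()
        if not text or text.startswith("WEBVTT") or TIMING.match(text):
            continue
        # A bare integer directly above a timing line is an SRT sequence number.
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if text.isdigit() and TIMING.match(nxt):
            continue
        kept.append(text)
    return " ".join(kept)
```

A production parser would also need to handle VTT `NOTE` blocks, cue identifiers, and inline styling tags, which this sketch ignores.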
The two-phase workflow
Every ztlctl ingest media call creates a captured reference — not a finished note. This is intentional: the transcription is raw material that needs human or agent review before it becomes a durable knowledge artifact.
Phase 1: Captured — ztlctl ingest media produces a reference with:
- status: captured in the frontmatter
- The normalized transcript text in the reference body
- A source bundle (written to .ztlctl/bundles/) containing: normalized_text, capture_agent (whisper for audio/video, transcript-parser for transcript files), modalities (e.g. ["audio", "text"]), and the original source_path.
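To make the bundle contents concrete, the sketch below assembles the four fields named above into a dictionary and prints it as JSON. It is illustrative only: the helper `build_bundle` is hypothetical, the real bundle file may carry additional fields, and the `["text"]` modality for transcript files is an assumption (the text above only gives `["audio", "text"]` as an example).

```python
import json


def build_bundle(normalized_text: str, source_path: str, from_transcript: bool) -> dict:
    """Assemble the four documented bundle fields into a plain dict."""
    return {
        "normalized_text": normalized_text,
        # The two capture agents named in the documentation.
        "capture_agent": "transcript-parser" if from_transcript else "whisper",
        # ["audio", "text"] is the documented example; ["text"] is an assumption.
        "modalities": ["text"] if from_transcript else ["audio", "text"],
        "source_path": source_path,
    }


if __name__ == "__main__":
    bundle = build_bundle("Hello there.", "recordings/interview-2026-03-21.mp3", False)
    print(json.dumps(bundle, indent=2))
```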
Phase 2: Annotated — An agent or user reviews the captured reference, adds key_points, a summary, and relevant tags, then updates the status to annotated:
$ ztlctl update ZTL-0123 \
    --summary "Key insights from the distributed systems interview" \
    --tags podcast --tags distributed-systems \
    --status annotated
Tip
Use ztlctl query work-queue to find all captured references waiting for annotation. The work queue scores items by age and status — newly captured media references appear near the top.
Once annotated, the reference is indexed, tagged, and eligible for reweave link discovery like any other vault content.
Configuration
The [ingest.media] section in ztlctl.toml controls the whisper model and transcription behavior:
[ingest.media]
whisper_model = "base" # model size: tiny, base, small, medium, large-v2
# language = "en" # ISO language code (e.g. "en", "de"); leave unset (null) to auto-detect
compute_type = "int8" # quantization: int8 (CPU), float16 (GPU), float32
All config fields (sourced from MediaIngestConfig in config/models.py):
| Field | Default | Description |
|---|---|---|
| whisper_model | "base" | Whisper model size. Larger models are more accurate but slower and require more memory. |
| language | null | ISO language code hint. null enables automatic language detection. |
| compute_type | "int8" | Quantization type. Use int8 on CPU (default), float16 on CUDA GPU for faster transcription. |
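The three fields correspond closely to arguments that faster-whisper itself accepts. The sketch below shows one plausible mapping, using the library's `WhisperModel` class and its `transcribe` method. How ztlctl actually wires these values is not shown in this document, so the function names, the `device="auto"` choice, and the space-joined output are assumptions.

```python
from typing import Iterable, Optional


def join_segments(texts: Iterable[str]) -> str:
    """Join per-segment text into one transcript string, skipping blanks."""
    return " ".join(t.strip() for t in texts if t.strip())


def transcribe(path: str, whisper_model: str = "base",
               language: Optional[str] = None, compute_type: str = "int8") -> str:
    """Transcribe a local file with faster-whisper using the three config fields."""
    # Imported lazily so this module loads even without the optional dependency.
    from faster_whisper import WhisperModel

    # device="auto" is an assumption; it lets the library pick CPU or GPU.
    model = WhisperModel(whisper_model, device="auto", compute_type=compute_type)
    # language=None asks faster-whisper to detect the language itself.
    segments, _info = model.transcribe(path, language=language)
    return join_segments(segment.text for segment in segments)
```

Calling `transcribe("lecture.mp4", whisper_model="small")` would download the `small` model on first use and then run entirely on the local machine.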
Model size trade-offs:
| Model | Relative speed | Accuracy | Memory |
|---|---|---|---|
| tiny | Fastest | Lower | ~40 MB |
| base | Fast (default) | Good | ~75 MB |
| small | Moderate | Better | ~240 MB |
| medium | Slow | High | ~770 MB |
| large-v2 | Slowest | Highest | ~1550 MB |
Tip
For voice memos and short recordings, base (the default) provides good accuracy with fast turnaround. For long lectures or interviews where accuracy matters, switch to small or medium.
MCP tool
The ingest_media MCP tool mirrors the CLI. Agents use it to ingest media files already accessible on the local filesystem.
Tool name: ingest_media
Side effect: write
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| path | string | Yes | Filesystem path to the media or transcript file |
| title | string | No | Title override for the captured reference |
| topic | string | No | Topic directory under notes/ |
| tags | list[string] | No | Tags applied to the captured reference |
| summary | string | No | Capture summary hint |
| dry_run | bool | No | Preview ingestion without creating files |
Return format:
{
  "id": "REF-0042",
  "path": "references/REF-0042.md",
  "title": "Interview: distributed systems patterns",
  "type": "reference",
  "input_kind": "media",
  "source_kind": "media",
  "modalities": ["audio", "text"],
  "capture_agent": "whisper",
  "source_bundle_version": 1,
  "source_bundle_path": ".ztlctl/bundles/REF-0042/bundle.json"
}
Common errors:
| Error code | Cause |
|---|---|
| NOT_FOUND | The file path does not exist |
| UNSUPPORTED_INPUT | File extension is not in the supported set |
| DEPENDENCY_MISSING | faster-whisper not installed (audio/video only) |
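An agent calling the tool can translate these codes into a next step. The sketch below is one way to do that; the mapping `ERROR_HINTS` and the function `hint_for` are hypothetical, and the hint wording paraphrases the causes listed in the table rather than quoting actual tool output.

```python
# Hints paraphrase the documented causes; the wording is illustrative.
ERROR_HINTS = {
    "NOT_FOUND": "Check that the file path exists.",
    "UNSUPPORTED_INPUT": "Use a supported audio, video, or transcript extension.",
    "DEPENDENCY_MISSING": "Install faster-whisper: uv add --group media faster-whisper",
}


def hint_for(code: str) -> str:
    """Return a remediation hint for an ingest error code."""
    return ERROR_HINTS.get(code, f"Unrecognized error code: {code}")
```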
What's next
- Concepts: understand references, lifecycle statuses, and source bundles
- Configuration: full ztlctl.toml reference including [ingest.media]
- Agentic workflows: orchestration recipes for capture-and-annotate pipelines