Media ingestion

Media ingestion turns audio, video, and transcript files into structured vault references. ztlctl transcribes audio and video locally using faster-whisper (no data leaves your machine), and parses pre-existing transcript files without any external dependency. Every ingest operation creates a captured reference. This is the first phase of a two-phase workflow, which ends when an agent or user reviews the reference and promotes it to annotated status.

Prerequisites

Warning

Audio and video transcription requires faster-whisper, which is an optional dependency not installed by default. Transcript files (.txt, .vtt, .srt) do not require faster-whisper — they are parsed locally with no extra dependencies.

Install faster-whisper before ingesting audio or video:

$ uv add --group media faster-whisper

If faster-whisper is not installed and you attempt to ingest an audio or video file, the command returns a DEPENDENCY_MISSING error with the install hint above.
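The dependency check described above can be sketched in a few lines. This is a minimal illustration, not ztlctl's actual implementation: the function name `check_whisper_dependency` and the shape of the returned payload are assumptions; only the `DEPENDENCY_MISSING` code and the install command come from this page.

```python
import importlib.util
from typing import Optional

# Install command quoted from the Prerequisites section.
INSTALL_HINT = "uv add --group media faster-whisper"


def check_whisper_dependency(module_name: str = "faster_whisper") -> Optional[dict]:
    """Return a hypothetical error payload if the module is missing, else None.

    find_spec() looks the module up without importing it, so the check is
    cheap and does not load any model code.
    """
    if importlib.util.find_spec(module_name) is None:
        return {"code": "DEPENDENCY_MISSING", "hint": INSTALL_HINT}
    return None
```

Calling `check_whisper_dependency()` before loading a model lets a script fail early with the install hint instead of a raw import error.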

Supported formats

Format	Type	Requires faster-whisper	Notes
`.mp3`	Audio	Yes	Common podcast and voice recording format
`.m4a`	Audio	Yes	Apple audio; common for voice memos
`.wav`	Audio	Yes	Uncompressed audio; large files
`.ogg`	Audio	Yes	Open audio format
`.flac`	Audio	Yes	Lossless audio format
`.mp4`	Video	Yes	Common video format; audio track is transcribed
`.mkv`	Video	Yes	Matroska container; audio track is transcribed
`.webm`	Video	Yes	Web video format; audio track is transcribed
`.txt`	Transcript	No	Plain text transcript; ingested as-is
`.vtt`	Transcript	No	WebVTT subtitle/caption format; timestamps stripped
`.srt`	Transcript	No	SubRip subtitle format; sequence numbers and timestamps stripped

Ingesting media files

Use ztlctl ingest media to ingest an audio or video file. ztlctl loads the whisper model, transcribes the file locally, and creates a captured reference in your vault.

$ ztlctl ingest media PATH [OPTIONS]

Arguments and options:

Argument / Flag	Required	Description
`PATH`	Yes	Filesystem path to the media or transcript file
`--title TEXT`	No	Title override for the captured reference (defaults to the file stem)
`--topic TEXT`	No	Topic directory under `notes/` for routing
`--tags TEXT`	No	Tags applied to the captured reference (repeatable)
`--summary TEXT`	No	Capture summary hint written into the reference frontmatter
`--dry-run`	No	Preview ingestion without creating any files

Example — ingest a podcast recording:

$ ztlctl ingest media recordings/interview-2026-03-21.mp3 \
    --title "Interview: distributed systems patterns" \
    --topic ml/research \
    --tags podcast --tags distributed-systems

Example — preview before writing:

$ ztlctl ingest media lecture.mp4 --dry-run

The --dry-run flag returns a preview with the first 280 characters of the normalized transcript, the resolved title, and the ingest metadata — no files are written.
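As a rough sketch of what such a preview involves, the snippet below resolves the title (falling back to the file stem, as documented for `--title`) and truncates the transcript to 280 characters. The function name `build_preview` and the keys of the returned dictionary are hypothetical; the real preview payload may be shaped differently.

```python
from pathlib import Path
from typing import Optional

PREVIEW_CHARS = 280  # preview length stated for --dry-run


def build_preview(path: str, transcript: str, title: Optional[str] = None) -> dict:
    """Build a hypothetical dry-run preview without touching the filesystem."""
    # The title defaults to the file stem, e.g. "lecture" for "lecture.mp4".
    resolved_title = title or Path(path).stem
    return {
        "title": resolved_title,
        "preview": transcript[:PREVIEW_CHARS],
        "dry_run": True,
    }
```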

Note

Transcription runs entirely on your local machine. The audio or video file is never sent to an external service. The whisper model is downloaded once to a local cache on first use. See Configuration to choose a different model size.
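For readers curious what local transcription looks like at the library level, here is a minimal sketch using faster-whisper's `WhisperModel` directly. It is not ztlctl's code: the helper names `join_segments` and `transcribe_locally` are made up for illustration, and the choice of `device="cpu"` is an assumption matching the `int8` default.

```python
from typing import Iterable, Optional


def join_segments(texts: Iterable[str]) -> str:
    """Join per-segment text into a single space-separated transcript."""
    return " ".join(t.strip() for t in texts if t.strip())


def transcribe_locally(
    path: str,
    model_size: str = "base",
    compute_type: str = "int8",
    language: Optional[str] = None,
) -> str:
    """Transcribe an audio or video file on the local machine."""
    # Imported lazily so this module loads even without the optional dependency.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type=compute_type)
    # transcribe() returns a lazy generator of segments plus detection info;
    # language=None asks the model to auto-detect the spoken language.
    segments, _info = model.transcribe(path, language=language)
    return join_segments(segment.text for segment in segments)
```

Because the segments are produced lazily, the actual transcription work happens while `join_segments` iterates, not when `transcribe()` returns.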

Ingesting transcripts

If you already have a transcript file — from an external transcription service, a video platform, or a manual capture — use ztlctl ingest media with the transcript file directly. No faster-whisper installation is required.

WebVTT file:

$ ztlctl ingest media captions.vtt --title "Conference talk: CRDT internals"

SRT file:

$ ztlctl ingest media subtitles.srt --title "Lecture 4 — consensus protocols"

Plain text transcript:

$ ztlctl ingest media transcript.txt --title "Team meeting notes 2026-03-21"

VTT and SRT files are automatically stripped of timestamps, sequence numbers, and header lines before ingestion. The resulting plain text is stored in the reference body and the normalized text bundle.
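The stripping step can be approximated as follows. This is a simplified sketch rather than the parser ztlctl ships: it drops the `WEBVTT` header, cue timing lines, and the identifier line directly above each timing line, and it does not handle VTT `NOTE` or `STYLE` blocks or extra header metadata. The name `strip_subtitle_markup` is hypothetical, as is the choice to join caption lines with spaces.

```python
import re

# Cue timing lines: "00:00:01.000 --> 00:00:04.000" (VTT, hours optional)
# or "00:00:01,000 --> 00:00:04,000" (SRT).
TIMING = re.compile(r"^\s*(\d+:)?\d{2}:\d{2}[.,]\d{3}\s+-->\s+")


def strip_subtitle_markup(raw: str) -> str:
    """Reduce VTT or SRT content to plain caption text."""
    lines = raw.lstrip("\ufeff").splitlines()  # drop a leading BOM if present
    kept = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue  # blank separator between cues
        if stripped.startswith("WEBVTT"):
            continue  # VTT file header
        if TIMING.match(stripped):
            continue  # timestamp line
        # SRT sequence number or VTT cue id: the line right before a timing line.
        if i + 1 < len(lines) and TIMING.match(lines[i + 1]):
            continue
        kept.append(stripped)
    return " ".join(kept)
```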

The two-phase workflow

Every ztlctl ingest media call creates a captured reference — not a finished note. This is intentional: the transcription is raw material that needs human or agent review before it becomes a durable knowledge artifact.

Phase 1: Captured — ztlctl ingest media produces a reference with:

status: captured in the frontmatter
The normalized transcript text in the reference body
A source bundle (written to .ztlctl/bundles/) containing: normalized_text, capture_agent (whisper for audio/video, transcript-parser for transcript files), modalities (e.g. ["audio", "text"]), and the original source_path.
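To make the bundle contents concrete, the sketch below assembles the four fields listed above into a dictionary. The helper names and the idea that the bundle is a flat JSON object are assumptions; the field names, the `whisper` and `transcript-parser` agent values, and the extension list come from this page.

```python
import json
from pathlib import Path
from typing import Iterable

# Audio and video extensions from the supported formats table.
AUDIO_VIDEO = {".mp3", ".m4a", ".wav", ".ogg", ".flac", ".mp4", ".mkv", ".webm"}


def capture_agent_for(path: str) -> str:
    """whisper for audio/video, transcript-parser for transcript files."""
    return "whisper" if Path(path).suffix.lower() in AUDIO_VIDEO else "transcript-parser"


def build_bundle(source_path: str, normalized_text: str, modalities: Iterable[str]) -> dict:
    """Assemble a hypothetical source bundle with the documented fields."""
    return {
        "normalized_text": normalized_text,
        "capture_agent": capture_agent_for(source_path),
        "modalities": list(modalities),
        "source_path": source_path,
    }


if __name__ == "__main__":
    bundle = build_bundle("recordings/interview.mp3", "Hello there.", ["audio", "text"])
    print(json.dumps(bundle, indent=2))
```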

Phase 2: Annotated — An agent or user reviews the captured reference, adds key_points, a summary, and relevant tags, then updates the status to annotated:

$ ztlctl update REF-0042 \
    --summary "Key insights from the distributed systems interview" \
    --tags podcast --tags distributed-systems \
    --status annotated

Tip

Use ztlctl query work-queue to find all captured references waiting for annotation. The work queue scores items by age and status — newly captured media references appear near the top.

Once annotated, the reference is indexed, tagged, and eligible for reweave link discovery like any other vault content.

Configuration

The [ingest.media] section in ztlctl.toml controls the whisper model and transcription behavior:

[ingest.media]
whisper_model = "base"   # model size: tiny, base, small, medium, large-v2
# language = "en"        # ISO language code (e.g. "en", "de"); leave unset to auto-detect (TOML has no null value)
compute_type = "int8"    # quantization: int8 (CPU), float16 (GPU), float32

All config fields (sourced from MediaIngestConfig in config/models.py):

Field	Default	Description
`whisper_model`	`"base"`	Whisper model size. Larger models are more accurate but slower and require more memory.
`language`	`null` (unset)	ISO language code hint. Leave the key out of ztlctl.toml to keep automatic language detection.
`compute_type`	`"int8"`	Quantization type. Use `int8` on CPU (the default) or `float16` on a CUDA GPU for faster transcription.

Model size trade-offs:

Model	Relative speed	Accuracy	Memory
`tiny`	Fastest	Lower	~40 MB
`base`	Fast (default)	Good	~75 MB
`small`	Moderate	Better	~240 MB
`medium`	Slow	High	~770 MB
`large-v2`	Slowest	Highest	~1550 MB

Tip

For voice memos and short recordings, base (the default) provides good accuracy with fast turnaround. For long lectures or interviews where accuracy matters, switch to small or medium.
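One way to apply the trade-off table is to pick the largest model whose approximate memory figure fits a budget. The sketch below does only that; the function name is hypothetical, the figures are the rounded values from the table, and real memory use during transcription may be higher.

```python
from typing import Optional

# (model, approximate memory in MB), ordered smallest to largest as in the table.
MODEL_MEMORY_MB = [
    ("tiny", 40),
    ("base", 75),
    ("small", 240),
    ("medium", 770),
    ("large-v2", 1550),
]


def largest_model_within(budget_mb: int) -> Optional[str]:
    """Return the largest listed model that fits the budget, or None."""
    best = None
    for name, memory_mb in MODEL_MEMORY_MB:
        if memory_mb <= budget_mb:
            best = name  # later entries are larger, so keep overwriting
    return best
```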

MCP tool

The ingest_media MCP tool mirrors the CLI. Agents use it to ingest media files already accessible on the local filesystem.

Tool name: ingest_media

Side effect: write

Parameters:

Parameter	Type	Required	Description
`path`	string	Yes	Filesystem path to the media or transcript file
`title`	string	No	Title override for the captured reference
`topic`	string	No	Topic directory under `notes/`
`tags`	list[string]	No	Tags applied to the captured reference
`summary`	string	No	Capture summary hint
`dry_run`	bool	No	Preview ingestion without creating files

Return format:

{
  "id": "REF-0042",
  "path": "references/REF-0042.md",
  "title": "Interview: distributed systems patterns",
  "type": "reference",
  "input_kind": "media",
  "source_kind": "media",
  "modalities": ["audio", "text"],
  "capture_agent": "whisper",
  "source_bundle_version": 1,
  "source_bundle_path": ".ztlctl/bundles/REF-0042/bundle.json"
}
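An agent consuming this payload might pull out just the fields it needs for the annotation step. The sketch below assumes the payload arrives as a JSON string with the keys shown above; the `IngestResult` class and `parse_ingest_result` function are illustrative names, not part of ztlctl.

```python
import json
from dataclasses import dataclass


@dataclass
class IngestResult:
    """Subset of the return fields an annotating agent is likely to need."""

    id: str
    path: str
    capture_agent: str
    source_bundle_path: str


def parse_ingest_result(payload: str) -> IngestResult:
    """Parse the JSON return value; raises KeyError if a field is absent."""
    data = json.loads(payload)
    return IngestResult(
        id=data["id"],
        path=data["path"],
        capture_agent=data["capture_agent"],
        source_bundle_path=data["source_bundle_path"],
    )
```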

Common errors:

Error code	Cause
`NOT_FOUND`	The file path does not exist
`UNSUPPORTED_INPUT`	File extension is not in the supported set
`DEPENDENCY_MISSING`	faster-whisper not installed (audio/video only)
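The three error conditions can be read as a sequence of pre-flight checks. The sketch below shows one plausible ordering; the order in which ztlctl actually evaluates them is an assumption, as is the `preflight` function itself. The error codes and extension sets are taken from this page.

```python
from pathlib import Path
from typing import Optional

TRANSCRIPT = {".txt", ".vtt", ".srt"}
AUDIO_VIDEO = {".mp3", ".m4a", ".wav", ".ogg", ".flac", ".mp4", ".mkv", ".webm"}


def preflight(path: str, whisper_installed: bool) -> Optional[str]:
    """Return the error code an ingest would hit, or None if it can proceed."""
    candidate = Path(path)
    if not candidate.exists():
        return "NOT_FOUND"
    extension = candidate.suffix.lower()
    if extension not in TRANSCRIPT | AUDIO_VIDEO:
        return "UNSUPPORTED_INPUT"
    # Transcript files never need faster-whisper; audio and video always do.
    if extension in AUDIO_VIDEO and not whisper_installed:
        return "DEPENDENCY_MISSING"
    return None
```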

What's next

Concepts — understand references, lifecycle statuses, and source bundles
Configuration — full ztlctl.toml reference including [ingest.media]
Agentic workflows — orchestration recipes for capture-and-annotate pipelines