type
Post
status
Published
date
Feb 8, 2026
slug
modular-ai-pipelines-video-translation-whisper
summary
By decoupling transcription from translation and implementing robust error handling, we can build specialized AI pipelines that outperform generic multimodal models in both accuracy and debuggability.
tags
LLM
category
Sharing
icon
password
Sometimes the best way to understand the limits of a model’s architecture is to force it to solve a personal annoyance. In my case, that meant automating the translation of K-pop content from Korean to English and Chinese.
 
I recently dusted off this project, which I started last year, treating it as an engineering "puzzle" to refine my chops. The objective was straightforward: take a raw video file, transcribe the audio, translate it, and burn subtitles back in. While the use case is recreational (and helps with my own language immersion), the architectural lessons regarding the division of labor in AI pipelines are universally applicable to enterprise work.
 

The Case for Specialization: Specialists over Generalists

There is a massive temptation in the current market to throw a generic multimodal model at a video file and ask for a final result. It feels magical to upload a video to Gemini or GPT-4o and ask, "Translate this."
 
However, I am a firm believer in the division of labor. When building reliable systems, you want specialists, not generalists.
For this pipeline, I enforced a strict separation of concerns:
  1. The Ear (Transcription): OpenAI’s Whisper (SOTA for audio-to-text).
  2. The Brain (Translation): Large Language Models (specifically Gemini 2.5 Pro).
This modular approach allows for optimization at each stage. If you rely on a single end-to-end multimodal call, debugging becomes a nightmare. If the output is wrong, did the model mishear the Korean phonemes, or did it hallucinate the English translation? You have no way to know.
By maintaining the raw transcript as an intermediate artifact, we gain control and recoverability. I can verify the Korean transcript independently before paying for the translation tokens. If the translation logic fails, I don't need to re-process the audio. This decoupling is essential for cost control and observability.
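To make the decoupling concrete, here is a minimal sketch, assuming the open-source openai-whisper package and a hypothetical translate_segments() helper standing in for the LLM stage; the key detail is that the transcript is persisted to disk between the two stages:

```python
import json

import whisper  # pip install openai-whisper


def transcribe_to_artifact(video_path: str, transcript_path: str) -> list[dict]:
    """Stage 1 (the Ear): transcribe once and persist the raw segments to disk."""
    model = whisper.load_model("large")  # model size is illustrative
    result = model.transcribe(video_path, language="ko")
    with open(transcript_path, "w", encoding="utf-8") as f:
        json.dump(result["segments"], f, ensure_ascii=False, indent=2)
    return result["segments"]


def translate_artifact(transcript_path: str) -> list[dict]:
    """Stage 2 (the Brain): work from the saved transcript; the audio is never re-processed."""
    with open(transcript_path, "r", encoding="utf-8") as f:
        segments = json.load(f)
    return translate_segments(segments)  # hypothetical LLM translation step
```

If the translation stage blows up, only translate_artifact is re-run; the expensive transcription artifact is already sitting on disk, ready to be inspected or reused.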
 

Optimizing Whisper: It’s Not a Chatbot

One specific detail I overlooked in my initial build—and one that trips up many engineers—is how different Whisper’s `initial_prompt` parameter is from a standard LLM system prompt.
As the OpenAI Cookbook highlights, Whisper does not follow instructions in the conversational sense. You cannot tell it, "Format lists in Markdown" or "Ignore filler words." Instead, Whisper’s prompt acts as a style and spelling guide. The model attempts to mimic the *style* of the preceding tokens.
If you want it to recognize specific K-pop member names or technical jargon, you don't ask it to; you simply feed it a string of correct spellings. Furthermore, this prompt is limited to a mere 224 tokens. You have to be incredibly concise. In my use case, I had to distill keywords from reference files into a dense "concept list" to guide the phonetic recognition toward the correct terminology.
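For concreteness, here is a minimal sketch using the open-source openai-whisper package; the glossary string is purely illustrative:

```python
import whisper

# Not instructions: just a string of correct spellings for Whisper to mimic.
# The prompt is capped at 224 tokens, so it has to stay a dense concept list.
GLOSSARY = "NewJeans, Minji, Hanni, Danielle, Haerin, Hyein, Bunnies, comeback stage"

model = whisper.load_model("large")
result = model.transcribe(
    "episode.mp4",
    language="ko",
    initial_prompt=GLOSSARY,  # nudges phonetic recognition toward these spellings
)
```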
 

The "Doom Loop" of Hallucination

I also encountered a specific, frustrating failure mode: Whisper would sometimes get stuck in a loop, repeating the same phrase endlessly, ignoring the actual audio track. This wasn't just a timestamp issue; it was a genuine generation loop.
The fix was counter-intuitive: setting `condition_on_previous_text` to False.
By default, Whisper uses the previous segment's text to guide the next segment's generation. Usually, this helps with flow. But if the model makes a mistake, that mistake is fed back in as context, reinforcing the error and causing a loop. As discussed here, disabling this history forces the model to look at each audio segment with fresh eyes. It reduced coherence slightly but eliminated the hallucinations entirely.
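With the openai-whisper package, that fix is a single flag on the transcribe() call (shown here in isolation):

```python
import whisper

model = whisper.load_model("large")
result = model.transcribe(
    "episode.mp4",
    language="ko",
    condition_on_previous_text=False,  # decode each segment with fresh context, so a
                                       # mis-heard phrase cannot feed back and loop
)
```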
 

The Translation Layer: Context, JSON, and Resilience

For the translation step, I opted for Gemini 2.5 Pro. The priority here was leveraging its massive context window. A 50-minute video contains callbacks, tonal shifts, and running jokes. Translating sentence-by-sentence destroys that context.
However, asking an LLM to output a perfectly formatted JSON object for an hour-long transcript is asking for trouble. LLMs are notoriously flaky with long-form structured output. One missing closing brace ruins the entire pipeline.
To solve this, I implemented a hierarchical batching strategy:
  1. The "Hail Mary": First, I attempt to translate the full transcript in one massive prompt. This maximizes narrative coherence and is the cheapest option regarding API calls.
  1. The Fallback: If the model times out or returns malformed JSON, the system catches the error and splits the transcript into batches of 200 segments.
  1. The Decomposition: If a specific batch fails, it decomposes further into smaller chunks.
This "graceful degradation" ensures cost-effectiveness without sacrificing reliability. We prioritize the best possible context but accept smaller batches to ensure the job actually finishes.
 

Handling Domain Specificity (Without RAG)

The biggest hurdle in automated translation isn't grammar; it's specific terminology—member names, obscure album titles, or fandom inside jokes.
Since Whisper’s prompt is too small (224 tokens) to handle a full glossary, I moved the heavy lifting of "knowledge management" to the translation stage. I created a separate, lightweight LLM pass that reads reference documents and summarizes them into a concise keyword list.
This "dynamic glossary" is then injected into the Gemini system prompt. It acts as a lightweight alternative to a Vector Store (RAG). Often, engineers jump straight to RAG pipelines when a simple string of hard constraints will do the job. For a finite scope like a single video, a static list of terms is cheaper, faster, and more deterministic than embedding retrieval.
 

See It in Action

All this architecture—the Whisper prompting, the fallback strategies, the glossary injection—comes together in the final render.
Here is a clip of the final output. Notice specifically how the timestamps (derived from Whisper) align perfectly with the speech, and how the specific terminology (handled by the translation layer) is preserved in Chinese.
Video preview
 

The Bottom Line

The final result—stitched together by FFMPEG using Whisper's precise timestamps—costs roughly $0.40 USD for a 50-minute video and is surprisingly robust.
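For reference, that final stitch can be as simple as converting Whisper's segment timestamps into an SRT file and handing it to FFmpeg's subtitles filter; the file names here are illustrative:

```python
import subprocess


def to_srt_timestamp(seconds: float) -> str:
    """Whisper's float seconds -> SubRip's HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def write_srt(segments: list[dict], srt_path: str) -> None:
    """Turn Whisper segments ('start', 'end', 'text') into an .srt file."""
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n"
                    f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n"
                    f"{seg['text'].strip()}\n\n")


# Burn the subtitles in with FFmpeg's subtitles filter (re-encodes video, copies audio).
subprocess.run(
    ["ffmpeg", "-i", "episode.mp4", "-vf", "subtitles=episode.srt",
     "-c:a", "copy", "episode_subbed.mp4"],
    check=True,
)
```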
Projects like this are more than just hobbies; they are low-stakes environments to test high-stakes concepts. You don't truly learn the fragility of JSON output or the nuance of `condition_on_previous_text` until you're trying to solve a puzzle you actually care about.
 
 
 