A Production Multimodal Pipeline for Document and Video Understanding

The outcome

4 modalities in one pipeline
250k+ annotations delivered
99.1% acceptance rate at final QA

Client context

An AI product company training a vision-language model for enterprise document and video workflows needed annotation across OCR/layout, video events, grounding, and VQA — without running four disconnected vendor processes.

Challenge

Cross-modal consistency was the blocker: the same entity had to be labeled identically in a PDF table, a screen recording, and a narrated walkthrough. Labels produced by separate pipelines kept drifting apart, which corrupted the training signal.

Data strategy

We built a single taxonomy spanning all four modalities, then routed every asset through modality-specialist annotators followed by a cross-modal consistency review layer that checked entity and event alignment across formats.
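
As an illustration only, the consistency review described above can be sketched as a check that every label comes from one shared taxonomy and that a given entity carries the same label in every asset where it appears. The taxonomy entries, the Annotation fields, and the find_inconsistencies helper are hypothetical names chosen for this sketch; they are not the production schema or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical shared taxonomy: one label set used by every modality.
TAXONOMY = {"invoice_total", "vendor_name", "approval_step"}


@dataclass(frozen=True)
class Annotation:
    asset_id: str   # e.g. a PDF, a screen recording, a narrated walkthrough
    modality: str   # "document", "image", "video", or "text"
    entity_id: str  # identifier of the real-world entity being labeled
    label: str      # expected to be a member of TAXONOMY


def find_inconsistencies(annotations):
    """Return {entity_id: {label: [modalities]}} for entities labeled differently across assets."""
    unknown = sorted({a.label for a in annotations if a.label not in TAXONOMY})
    if unknown:
        raise ValueError(f"labels outside the shared taxonomy: {unknown}")

    # Group the modalities in which each (entity, label) pair was observed.
    seen = defaultdict(lambda: defaultdict(set))
    for a in annotations:
        seen[a.entity_id][a.label].add(a.modality)

    # An entity is inconsistent when it has received more than one distinct label.
    return {
        entity: {label: sorted(mods) for label, mods in labels.items()}
        for entity, labels in seen.items()
        if len(labels) > 1
    }
```

In a setup like this, an empty result would mean every entity is labeled identically wherever it appears, while a non-empty result lists the conflicting labels and the modalities that produced them, which is the kind of item a reviewer would then resolve.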

Workflow

Taxonomy design — one entity/event schema across document, image, video, text
Pilot batch — 2k assets to calibrate guidelines and edge-case handling
Calibrated production — modality-specialist teams with shared glossary
Cross-modal QA — alignment review across asset pairs
Iteration — monthly guideline revisions from client feedback loops

Quality controls

Gold tasks per modality, hallucination-risk labeling for generated captions, temporal consistency checks on video labels, and domain-expert review for financial and legal document content.
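
To make one of these controls concrete, here is a minimal sketch of what a temporal consistency check on video labels might look like. It assumes each event label is a time span in seconds and applies two rules chosen for illustration: a span must lie inside the video with its start before its end, and two spans with the same label must not overlap. The VideoEvent type, the temporal_issues function, and both rules are assumptions of this sketch rather than the checks actually run in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoEvent:
    label: str
    start_s: float  # event start, in seconds from the beginning of the video
    end_s: float    # event end, in seconds from the beginning of the video


def temporal_issues(events, duration_s):
    """Return human-readable descriptions of temporally inconsistent event labels."""
    issues = []
    valid = []
    for e in events:
        # Rule 1 (assumed): a span must be non-empty and inside the video.
        if 0 <= e.start_s < e.end_s <= duration_s:
            valid.append(e)
        else:
            issues.append(f"{e.label}: invalid span [{e.start_s}, {e.end_s}]")

    # Rule 2 (assumed): spans sharing a label must not overlap.
    # Track the furthest end time seen so far for each label.
    furthest_end = {}
    for e in sorted(valid, key=lambda ev: ev.start_s):
        prev_end = furthest_end.get(e.label)
        if prev_end is not None and e.start_s < prev_end:
            issues.append(f"{e.label}: span starting at {e.start_s} overlaps one ending at {prev_end}")
        furthest_end[e.label] = e.end_s if prev_end is None else max(prev_end, e.end_s)
    return issues
```

A check of this shape can run automatically on every submitted video task, so that only flagged items need a human look before the cross-modal review.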

Outcome

The unified pipeline delivered 250k+ annotations at a 99.1% final acceptance rate. The client's model showed its largest internal gains on cross-modal retrieval tasks — the exact area the consistency layer was designed to protect.
