Inter-Annotator Agreement
Inter-annotator agreement (IAA) measures how consistently two or more annotators assign labels, scores, spans, rankings, or other judgments to the same items under a defined annotation protocol.
For AI leaders, data and evaluation teams, governance teams, security leaders, and technical buyers
Category: Data quality and human evaluation
Full Definition
IAA is evidence about reproducibility among reviewers, not a direct measure of truth or dataset usefulness. Common statistics include percent agreement, Cohen’s kappa for two raters assigning categorical labels, Fleiss’ kappa for more than two raters assigning categorical labels, Krippendorff’s alpha for nominal, ordinal, interval, or ratio scales with missing data, correlation coefficients for continuous scores, and task-specific overlap or ranking measures. The choice of statistic must match the number of raters, label scale, unit of analysis, missingness, and assumed chance model.
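As a minimal sketch of the two simplest statistics named above, the following Python computes percent agreement and Cohen’s kappa for two raters with nominal labels. The function names and the sample labels are illustrative assumptions, not part of any particular tool.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which two raters chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two raters and nominal labels."""
    p_o = percent_agreement(a, b)
    n = len(a)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability that both raters pick the same label
    # if each labels independently according to their own marginals.
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels only.
rater_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]
rater_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]

print(percent_agreement(rater_1, rater_2))        # 0.8
print(round(cohens_kappa(rater_1, rater_2), 3))   # 0.6
```

Here the raters agree on 8 of 10 items, but half of that agreement is expected by chance given their 50/50 marginals, so kappa is lower than the raw rate.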
A low agreement value can reflect unclear guidelines, ambiguous items, insufficient expertise, poor interfaces, legitimate plural values, prevalence effects, or a statistic that does not fit the task. High raw agreement can occur on an imbalanced task simply because reviewers mostly choose the dominant label, even when chance-corrected agreement is low. Agreement should therefore be reported with label distributions, sample size, confidence intervals or another uncertainty estimate, raw disagreement patterns, and qualitative error analysis.
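The prevalence effect can be illustrated with a small worked case. The sketch below assumes a hypothetical two-rater confusion table for a "safe" versus "unsafe" task in which almost every item is "safe"; the counts and the function name are invented for illustration.

```python
def agreement_from_counts(counts):
    """Return (percent agreement, Cohen's kappa, label prevalence) from a
    dict mapping (rater_1_label, rater_2_label) -> number of items."""
    n = sum(counts.values())
    labels = {lab for pair in counts for lab in pair}
    # Marginal totals: how often each rater used each label.
    row = {k: sum(c for (a, _), c in counts.items() if a == k) for k in labels}
    col = {k: sum(c for (_, b), c in counts.items() if b == k) for k in labels}
    p_o = sum(c for (a, b), c in counts.items() if a == b) / n
    p_e = sum(row[k] * col[k] for k in labels) / (n * n)
    kappa = float("nan") if p_e == 1.0 else (p_o - p_e) / (1 - p_e)
    # Prevalence: average share of each label across both raters.
    prevalence = {k: (row[k] + col[k]) / (2 * n) for k in labels}
    return p_o, kappa, prevalence


# Hypothetical counts: the raters never agree on an "unsafe" item.
table = {
    ("safe", "safe"): 94,
    ("safe", "unsafe"): 3,
    ("unsafe", "safe"): 3,
    ("unsafe", "unsafe"): 0,
}

p_o, kappa, prevalence = agreement_from_counts(table)
print(f"percent agreement = {p_o:.2f}")
print(f"Cohen's kappa     = {kappa:.3f}")
print(f"prevalence        = {prevalence}")
```

With these counts the raters agree on 94% of items, yet kappa is slightly negative because they never agree on the rare label. Reporting the label distribution next to the statistic makes this visible.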
How It Works in Practice
Teams first define the annotation unit and the judgment type: nominal, ordinal, or interval labels, spans, boxes, rankings, or free-form judgments. Reviewers then independently label a representative calibration set. Results are computed overall and by priority slice, reviewer cohort, label, and time period. Disagreements are examined to refine definitions, examples, qualification, tooling, or escalation rather than simply overwritten.
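The slice-level reporting described above can be sketched as follows. The record layout (slice name plus two reviewers' labels), the slice names, and the labels are assumptions chosen for illustration; a real pipeline would likely add a chance-corrected statistic and an uncertainty estimate per slice.

```python
from collections import Counter, defaultdict


def agreement_by_slice(records):
    """records: iterable of (slice_name, label_a, label_b) tuples.
    Returns per-slice item count, percent agreement, and the raw
    disagreement pattern (unordered label pairs and their counts)."""
    grouped = defaultdict(list)
    for slice_name, a, b in records:
        grouped[slice_name].append((a, b))

    report = {}
    for slice_name, pairs in grouped.items():
        disagreements = Counter(
            tuple(sorted(pair)) for pair in pairs if pair[0] != pair[1]
        )
        report[slice_name] = {
            "n": len(pairs),
            "percent_agreement": sum(a == b for a, b in pairs) / len(pairs),
            "disagreements": dict(disagreements),
        }
    return report


# Hypothetical calibration set with two language slices.
calibration = [
    ("en", "toxic", "toxic"), ("en", "ok", "ok"), ("en", "ok", "ok"),
    ("en", "toxic", "toxic"), ("en", "ok", "ok"), ("en", "ok", "ok"),
    ("sw", "toxic", "ok"), ("sw", "ok", "ok"),
    ("sw", "ok", "toxic"), ("sw", "toxic", "toxic"),
]

report = agreement_by_slice(calibration)
overall = sum(a == b for _, a, b in calibration) / len(calibration)
print("overall:", overall)
for name, stats in sorted(report.items()):
    print(name, stats)
```

In this toy data the overall rate of 0.8 hides a slice where reviewers agree on only half the items, which is the kind of pattern that should trigger a guideline or qualification review for that slice.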
Production monitoring uses hidden gold tasks, repeated overlap samples, reviewer drift metrics, and adjudication outcomes. Consequential or subjective tasks may preserve multiple judgments rather than collapsing them into one label. When an expert reference exists, reviewer accuracy against that reference should be reported separately from agreement among peers.
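To show why accuracy against a reference and agreement among peers are reported separately, here is a small sketch. It assumes each reviewer's judgments are stored as a mapping from item ID to label and that hidden gold items carry an expert reference label; the IDs, labels, and function names are hypothetical.

```python
def peer_agreement(labels_a, labels_b):
    """Percent agreement on the items both reviewers labeled."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("reviewers share no items")
    return sum(labels_a[i] == labels_b[i] for i in shared) / len(shared)


def accuracy_on_gold(labels, gold):
    """Share of gold items the reviewer labeled the same as the reference."""
    scored = labels.keys() & gold.keys()
    if not scored:
        raise ValueError("reviewer labeled no gold items")
    return sum(labels[i] == gold[i] for i in scored) / len(scored)


# Hypothetical data: g* are hidden gold items, t* are ordinary tasks.
gold = {"g1": "escalate", "g2": "ok", "g3": "escalate"}
reviewer_a = {"g1": "ok", "g2": "ok", "g3": "escalate", "t1": "ok", "t2": "escalate"}
reviewer_b = {"g1": "ok", "g2": "ok", "g3": "escalate", "t1": "ok", "t2": "escalate"}

print("peer agreement:", peer_agreement(reviewer_a, reviewer_b))
print("accuracy A    :", round(accuracy_on_gold(reviewer_a, gold), 3))
print("accuracy B    :", round(accuracy_on_gold(reviewer_b, gold), 3))
```

The two reviewers agree on every item, yet both miss the same gold item, so perfect peer agreement coexists with imperfect accuracy. A single blended number would hide that shared error.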
Why It Matters for AI Data
IAA helps determine whether a rubric is operational and whether annotations can be reproduced across people, sites, and time. It is especially important for preference data, safety labels, emotion and paralinguistic judgments, domain correctness, and complex multimodal annotations. Buyers should ask which statistic was used, why it fits the task, what the distributions were, and what happened to disagreements.
What a Production Record May Contain
| Field or artifact | Purpose |
|---|---|
| Task definition | Unit, label type, rubric version, examples, and ambiguity policy. |
| Assignment | Item, reviewer cohort, independent/blinded status, order, and timestamp. |
| Judgment | Label/score/span/ranking, confidence, rationale, and abstention. |
| Agreement analysis | Statistic, sample, distribution, slice, confidence interval, and disagreement pattern. |
| Resolution | Adjudication, retained plurality, guideline change, retraining, and version. |
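One possible way to represent part of such a record in code is sketched below. The class and field names are hypothetical and only mirror the judgment, agreement analysis, and resolution rows of the table; the example values are placeholders, not real results.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Judgment:
    item_id: str
    reviewer_id: str
    rubric_version: str
    label: Optional[str]              # None records an abstention
    confidence: Optional[float] = None
    rationale: str = ""
    blinded: bool = True              # reviewer could not see peer labels
    timestamp: str = ""               # e.g. an ISO 8601 string


@dataclass
class AgreementAnalysis:
    statistic: str                    # e.g. "cohens_kappa"
    value: float
    ci_low: float
    ci_high: float
    n_items: int
    slice_name: str
    label_distribution: dict


@dataclass
class Resolution:
    item_id: str
    outcome: str                      # e.g. "adjudicated", "plurality_retained"
    final_label: Optional[str]        # None when plural judgments are kept
    retained_labels: list
    guideline_version_after: str


# Placeholder values for illustration only.
judgment = Judgment("item-001", "rev-17", "rubric-v3", label=None,
                    rationale="insufficient patient context")
resolution = Resolution("item-001", "plurality_retained", None,
                        ["escalate", "ok"], "rubric-v4")

print(asdict(judgment))
print(asdict(resolution))
```

Keeping abstentions and retained plural labels as explicit fields, rather than forcing a single label, preserves the information needed for later disagreement analysis.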
Quality and Governance Risks
- Percent agreement ignores chance agreement and can be misleading on imbalanced labels.
- Kappa-like statistics can behave unexpectedly when prevalence or reviewer marginals are skewed.
- One overall number can hide severe disagreement in a rare but critical category or language.
- Adjudicating every disagreement to a single answer can erase uncertainty or legitimate plural perspectives.
- Annotators may become artificially consistent through shared shortcuts that are not substantively correct.
- Repeated exposure to the same gold items can turn quality monitoring into memorization.
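The risk that one overall number hides disagreement in a rare category can be checked with per-label specific agreement, computed for label k as twice the number of items both raters labeled k divided by the total number of times either rater used k. The sketch below assumes a hypothetical two-rater confusion table; the labels and counts are invented for illustration.

```python
def specific_agreement(counts):
    """Per-label specific agreement from a dict mapping
    (rater_1_label, rater_2_label) -> number of items."""
    labels = {lab for pair in counts for lab in pair}
    result = {}
    for k in labels:
        both = counts.get((k, k), 0)
        used_by_1 = sum(c for (a, _), c in counts.items() if a == k)
        used_by_2 = sum(c for (_, b), c in counts.items() if b == k)
        total_uses = used_by_1 + used_by_2
        # Undefined if neither rater ever used the label.
        result[k] = 2 * both / total_uses if total_uses else float("nan")
    return result


# Hypothetical counts with a rare but critical category.
table = {
    ("benign", "benign"): 90,
    ("self_harm", "self_harm"): 2,
    ("self_harm", "benign"): 4,
    ("benign", "self_harm"): 4,
}

n = sum(table.values())
overall = sum(c for (a, b), c in table.items() if a == b) / n
per_label = specific_agreement(table)

print(f"overall agreement = {overall:.2f}")
for label in sorted(per_label):
    print(f"{label:10s} specific agreement = {per_label[label]:.3f}")
```

With these counts overall agreement is 0.92, while specific agreement on the rare category is about one third, which is the figure a safety reviewer would need to see.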
Practical Example
In a medical-answer evaluation, two clinicians rate factual correctness, harmfulness, and need for escalation on separate ordinal scales. The program reports Krippendorff’s alpha by dimension and specialty, raw confusion matrices, label prevalence, and adjudication reasons. Agreement is high for correctness but lower for escalation in ambiguous cases, which leads the team to write clearer patient-context rules and to require multiple reviewers for high-risk items.
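For readers who want to see how an ordinal Krippendorff’s alpha like the one in this example could be computed, the following is a minimal sketch using the coincidence-matrix formulation with the ordinal distance metric. It is not the program's actual implementation; the function name, the 1 to 3 scale, and the ratings are assumptions, and a vetted statistics library is preferable for production reporting.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_ordinal(units, scale):
    """units: list of per-item rating lists (None marks a missing rating).
    scale: allowed values in rank order, lowest first."""
    rank = {value: i for i, value in enumerate(scale)}
    coincidences = Counter()
    for ratings in units:
        values = [rank[r] for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two ratings cannot be paired
        # Every ordered pair of ratings within the item, weighted by 1/(m-1).
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    n_c = [0.0] * len(scale)          # pairable ratings per scale value
    for (c, _k), weight in coincidences.items():
        n_c[c] += weight
    n = sum(n_c)
    if n <= 1:
        return float("nan")

    def delta_sq(c, k):
        # Ordinal distance: ratings lying between the two ranks, squared.
        lo, hi = min(c, k), max(c, k)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2) ** 2

    observed = sum(w * delta_sq(c, k) for (c, k), w in coincidences.items()) / n
    expected = sum(
        n_c[c] * n_c[k] * delta_sq(c, k)
        for c in range(len(scale))
        for k in range(len(scale))
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # no variation in the ratings at all
    return 1 - observed / expected


# Hypothetical escalation ratings from two clinicians on a 1-3 scale.
escalation = [[1, 1], [2, 2], [3, 3], [1, 2], [2, None]]
alpha = krippendorff_alpha_ordinal(escalation, scale=[1, 2, 3])
print(round(alpha, 3))
```

The last item has only one rating and is skipped, which shows how the statistic tolerates missing data. Running the same function once per dimension and specialty would produce the breakdown described in the example.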
Related Terms
RLHF · DPO · Data Curation · Model Integrity
Key Takeaway
Inter-annotator agreement is a diagnostic for the judgment system. Use a task-appropriate statistic, inspect slice-level disagreements, and keep agreement separate from correctness, representativeness, and model utility.