Inter-Annotator Agreement

Definition: Inter-annotator agreement (IAA) measures how consistently two or more annotators assign labels, scores, spans, rankings, or other judgments to the same items under a defined annotation protocol.

Category: Data quality and human evaluation

Full Definition

IAA is evidence about reproducibility among reviewers, not a direct measure of truth or dataset usefulness. Common statistics include percent agreement, Cohen’s kappa for two raters and categorical labels, Fleiss’ kappa for categorical labels with a fixed number of raters per item, Krippendorff’s alpha for nominal, ordinal, interval, or ratio scales and for data with missing ratings, correlation or intraclass correlation coefficients for continuous scores, and task-specific overlap or ranking measures. The choice must match the number of raters, label scale, unit of analysis, missingness, and assumed chance model.
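As a minimal sketch of the simplest case above (two raters, categorical labels), the following computes percent agreement and Cohen's kappa from scratch. The function names and the toy labels are illustrative assumptions, not taken from any particular library or dataset.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement,
    where chance is estimated from each rater's own label frequencies."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_o - p_e) / (1 - p_e)

# Toy labels, invented for illustration only.
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]

print(percent_agreement(rater_a, rater_b))       # 0.8
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.583
```

Here the raters match on 8 of 10 items, but because each uses "neg" 60% of the time, about half of that agreement is expected by chance, which is why kappa is noticeably lower than the raw rate.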

A low agreement value can reflect unclear guidelines, ambiguous items, insufficient expertise, poor interfaces, legitimate plural values, prevalence effects, or a statistic that does not fit the task. A high raw agreement value can be equally uninformative on an imbalanced task where reviewers mostly choose the dominant label. Agreement should therefore be reported with label distributions, sample size, confidence intervals or other uncertainty estimates, raw disagreement patterns, and qualitative error analysis.
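The prevalence effect can be seen in a small worked case. The sketch below, using invented 2x2 contingency tables and a hypothetical helper name, compares two tasks with the same 90% raw agreement: one with a dominant label and one with balanced labels.

```python
def agreement_from_table(table):
    """table[i][j] = number of items rater A labeled i and rater B labeled j.
    Returns (percent agreement, Cohen's kappa)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_share, col_share))
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")
    return p_o, kappa

# Invented counts. Rows: rater A (common, rare); columns: rater B.
skewed = [[90, 5], [5, 0]]     # both raters pick the common label 95% of the time
balanced = [[45, 5], [5, 45]]  # both raters split their labels 50/50

for name, table in (("skewed", skewed), ("balanced", balanced)):
    p_o, kappa = agreement_from_table(table)
    print(name, round(p_o, 3), round(kappa, 3))
# skewed 0.9 -0.053
# balanced 0.9 0.8
```

In the skewed table the raters never agree on the rare label, yet raw agreement still looks strong. Reporting the label distribution next to the statistic makes this visible.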

How It Works in Practice

Teams first define the annotation unit and whether labels are nominal, ordinal, interval, spans, boxes, rankings, or free-form judgments. A representative calibration set is independently labeled by reviewers. Results are computed overall and by priority slice, reviewer cohort, label, and time. Disagreements are examined to refine definitions, examples, reviewer qualification, tooling, or escalation paths, rather than simply being overwritten.
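A slice-level breakdown of this kind could be sketched as follows. The record layout (`label_a`, `label_b`, `language`) and the function name are assumptions made for illustration; a real pipeline would use its own schema and likely a chance-corrected statistic as well.

```python
from collections import Counter, defaultdict

def agreement_by_slice(records, slice_key):
    """Group doubly labeled records by a slice and report, per slice, the
    sample size, raw agreement, and which label pairs were confused."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[slice_key]].append(rec)
    report = {}
    for name, rows in groups.items():
        matches = sum(r["label_a"] == r["label_b"] for r in rows)
        confusions = Counter(
            (r["label_a"], r["label_b"]) for r in rows if r["label_a"] != r["label_b"]
        )
        report[name] = {
            "n": len(rows),
            "percent_agreement": matches / len(rows),
            "disagreements": dict(confusions),
        }
    return report

# Toy calibration records, invented for illustration only.
records = [
    {"item": 1, "language": "en", "label_a": "safe", "label_b": "safe"},
    {"item": 2, "language": "en", "label_a": "unsafe", "label_b": "unsafe"},
    {"item": 3, "language": "en", "label_a": "safe", "label_b": "safe"},
    {"item": 4, "language": "es", "label_a": "safe", "label_b": "unsafe"},
    {"item": 5, "language": "es", "label_a": "unsafe", "label_b": "unsafe"},
]

for slice_name, stats in agreement_by_slice(records, "language").items():
    print(slice_name, stats)
```

Overall agreement on these five items is 0.8, but the per-slice view shows that every disagreement sits in one language. With slices this small the numbers are only a prompt for closer review, not a conclusion.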

Production monitoring uses hidden gold tasks, repeated overlap samples, reviewer drift metrics, and adjudication outcomes. Consequential or subjective tasks may preserve multiple judgments rather than collapsing them into one label. When an expert reference exists, reviewer accuracy against that reference should be reported separately from agreement among peers.
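The distinction between peer agreement and accuracy against a reference can be illustrated with a short sketch. The labels and function names below are invented for illustration and assume a hidden gold set with one expert label per item.

```python
def match_rate(xs, ys):
    """Share of positions where two equal-length label lists match."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

# Toy hidden-gold sample, invented for illustration only.
gold       = ["a", "b", "a", "b", "a", "b"]
reviewer_1 = ["a", "a", "a", "a", "a", "b"]
reviewer_2 = ["a", "a", "a", "a", "a", "b"]

peer_agreement = match_rate(reviewer_1, reviewer_2)
accuracy_1 = match_rate(reviewer_1, gold)
accuracy_2 = match_rate(reviewer_2, gold)

print(peer_agreement)        # 1.0
print(round(accuracy_1, 3))  # 0.667
print(round(accuracy_2, 3))  # 0.667
```

The two reviewers agree perfectly with each other while sharing the same systematic error against the reference, which is why the two numbers are worth reporting side by side.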

Why It Matters for AI Data

IAA helps determine whether a rubric is operational and whether annotations can be reproduced across people, sites, and time. It is especially important for preference data, safety labels, emotion and paralinguistic judgments, domain correctness, and complex multimodal annotations. Buyers should ask which statistic was used, why it fits the task, what the distributions were, and what happened to disagreements.

What a Production Record May Contain

Field or artifact	Purpose
Task definition	Unit, label type, rubric version, examples, and ambiguity policy.
Assignment	Item, reviewer cohort, independent/blinded status, order, and timestamp.
Judgment	Label/score/span/ranking, confidence, rationale, and abstention.
Agreement analysis	Statistic, sample, distribution, slice, confidence interval, and disagreement pattern.
Resolution	Adjudication, retained plurality, guideline change, retraining, and version.
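One possible way to represent part of such a record in code is sketched below. Every class and field name is a hypothetical choice that loosely mirrors the table; it is not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple

@dataclass(frozen=True)
class Judgment:
    """One reviewer's judgment on one item (hypothetical field names)."""
    item_id: str
    reviewer_id: str
    reviewer_cohort: str
    rubric_version: str
    blinded: bool                 # labeled without seeing other reviewers' answers
    presented_at: datetime
    label: Optional[str]          # None records an explicit abstention
    confidence: Optional[float] = None
    rationale: str = ""

    @property
    def abstained(self) -> bool:
        return self.label is None

@dataclass(frozen=True)
class Resolution:
    """What happened after a disagreement (hypothetical field names)."""
    item_id: str
    outcome: str                  # e.g. "adjudicated" or "plurality_retained"
    final_label: Optional[str]    # None when multiple judgments are kept
    retained_labels: Tuple[str, ...] = ()
    rubric_version_after: Optional[str] = None

# Illustrative values only.
j = Judgment(
    item_id="item-001",
    reviewer_id="rev-17",
    reviewer_cohort="clinicians",
    rubric_version="v3",
    blinded=True,
    presented_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    label=None,
    rationale="Insufficient patient context to judge.",
)
r = Resolution(
    item_id="item-001",
    outcome="plurality_retained",
    final_label=None,
    retained_labels=("escalate", "do_not_escalate"),
)
print(j.abstained, r.outcome)
```

Keeping abstentions and retained pluralities as first-class values, rather than forcing a single label, lets the agreement analysis account for them later.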

Quality and Governance Risks

Percent agreement ignores chance agreement and can be misleading on imbalanced labels.
Kappa-like statistics can behave unexpectedly when prevalence or reviewer marginals are skewed.
One overall number can hide severe disagreement in a rare but critical category or language.
Adjudicating every disagreement to a single answer can erase uncertainty or legitimate plural perspectives.
Annotators may become artificially consistent through shared shortcuts that are not substantively correct.
Repeated exposure to the same gold items can turn quality monitoring into memorization.
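To illustrate how one overall number can hide disagreement in a rare category, the sketch below computes per-label specific agreement (twice the number of joint uses of a label, divided by the total number of times either rater used it). The counts and the function name are invented for illustration.

```python
from collections import Counter

def specific_agreement(a, b):
    """Per-label agreement: 2 * (items both raters gave label k)
    divided by (times rater A used k + times rater B used k)."""
    both = Counter(x for x, y in zip(a, b) if x == y)
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    return {k: 2 * both[k] / (count_a[k] + count_b[k]) for k in labels}

# Toy data: 100 items with a rare "critical" label, invented for illustration.
rater_a = ["benign"] * 96 + ["critical"] * 1 + ["critical"] * 2 + ["benign"] * 1
rater_b = ["benign"] * 96 + ["critical"] * 1 + ["benign"] * 2 + ["critical"] * 1

overall = sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)
per_label = specific_agreement(rater_a, rater_b)

print(overall)                          # 0.97
print(round(per_label["benign"], 3))    # 0.985
print(round(per_label["critical"], 3))  # 0.4
```

Overall agreement is 97%, yet the raters coincide on the critical label only once out of the five times either of them used it.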

Practical Example

In a medical-answer evaluation, two clinicians rate factual correctness, harmfulness, and escalation on separate ordinal scales. The program reports Krippendorff’s alpha by dimension and specialty, raw confusion matrices, label prevalence, and adjudication reasons. Agreement is high for correctness but lower for escalation in ambiguous cases, leading to clearer patient-context rules and a multi-reviewer requirement for high-risk items.
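For readers who want to see what an ordinal Krippendorff's alpha involves, here is a minimal from-scratch sketch based on the coincidence-matrix formulation with the ordinal distance metric. It is an illustration under stated assumptions, not the program's actual implementation: the function name, the `None` convention for missing ratings, and the toy ratings are all invented.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_ordinal(units, scale):
    """units: one list of ratings per item (None = missing rating).
    scale: the allowed values in rank order, lowest first."""
    # Coincidence matrix: each ordered pair of ratings from different raters
    # within an item, weighted by 1 / (ratings in that item - 1).
    coincidences = Counter()
    for ratings in units:
        present = [r for r in ratings if r is not None]
        m = len(present)
        if m < 2:
            continue  # an item with fewer than two ratings cannot be paired
        for c, k in permutations(present, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    totals = Counter()  # n_c: marginal total for each value
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n < 2:
        return float("nan")

    def delta2(c, k):
        # Ordinal metric: squared count of ratings lying between the two
        # ranks, counting each endpoint at half weight.
        i, j = sorted((scale.index(c), scale.index(k)))
        between = sum(totals[g] for g in scale[i:j + 1])
        return (between - (totals[c] + totals[k]) / 2) ** 2

    d_o = sum(w * delta2(c, k) for (c, k), w in coincidences.items()) / n
    d_e = sum(
        totals[c] * totals[k] * delta2(c, k) for c in scale for k in scale
    ) / (n * (n - 1))
    return 1 - d_o / d_e if d_e > 0 else float("nan")

# Toy ratings for one dimension on a 1-3 scale, invented for illustration.
escalation = [[1, 1], [2, 2], [3, 3], [1, 2]]
print(round(krippendorff_alpha_ordinal(escalation, scale=[1, 2, 3]), 2))  # 0.79
```

In practice the same function would be called once per dimension and per specialty, and the result reported with the confusion matrices and label prevalence rather than on its own.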

Key Takeaway

Inter-annotator agreement is a diagnostic for the judgment system. Use a task-appropriate statistic, inspect slice-level disagreements, and keep agreement separate from correctness, representativeness, and model utility.
