Data Product · 02

Data for Agents That Plan, Use Tools, and Complete Real Workflows

Execution-grade trajectories, environments, and failure taxonomies for training and evaluating agents on the workflows your customers actually run.

Use cases

What teams use it for.

Enterprise workflow agents
Software engineering agents
Browser agents
API agents
Customer support agents
Multi-agent systems
RL environment design
Task completion evaluation

What we build

Data we produce.

Golden trajectories
Tool-use logs
Executed trajectories
RL environments
Policy-following tasks
State transition data
Failure mode taxonomy
Human escalation labels

Evaluation

Evaluation metrics we instrument

Task success rate
Tool correctness
Policy adherence
State consistency
Recovery behavior
Long-horizon reliability
Human escalation accuracy

Delivery & integration

Built to drop into your pipeline.

Every dataset ships versioned, documented, and matched to your schema — with a QA report your research team can audit against acceptance criteria.

Golden trajectories
Tool-use logs
Executed trajectories
RL environments
Policy-following tasks

Workflow

How the program runs.

01 Environment Design
02 Task Design
03 Expert Execution
04 Failure Analysis
05 Evaluation Loop

Continuous loop — outputs feed back into the data engine.

Quality controls

How we keep it correct.

Executable environment validation
Trajectory replay checks
Policy adherence review
Failure taxonomy calibration
Gold task benchmarks
Delivery QA report

FAQ

Common questions.

What is a golden trajectory?

A golden trajectory is a verified, end-to-end record of an agent (or expert human) completing a task correctly — every tool call, state transition, and decision — used as a reference standard for training and evaluating agentic systems.

Do you build custom RL environments?

Yes. We design and implement task environments that mirror your production workflows — including tools, APIs, policies, and edge cases — so agents can be trained and evaluated under realistic conditions.

How do you evaluate agent reliability?

We instrument task success rate, tool correctness, policy adherence, state consistency, recovery behavior, long-horizon reliability, and human escalation accuracy, then deliver failure taxonomies your team can act on.

Product deep dive

Agentic AI Data for Tool Use, Trajectories, Environments, and Evaluation

The Data Layer Behind Reliable Agents That Plan, Use Tools, and Complete Work

An agent is not evaluated only by what it says. It changes state: editing files, calling APIs, navigating interfaces, writing code, sending messages, updating records, or operating a workflow over many steps. The data therefore has to represent the task environment, available tools, action sequence, intermediate observations, final artifacts, and the conditions under which success is accepted.

Recent benchmarks expose a widening gap between short synthetic tasks and real enterprise work. Long-horizon tasks can be only partly executable, can require subjective professional quality, and can fail despite a superficially correct final answer. Production evaluation needs deterministic checks where possible, rubric-based artifact review where necessary, and explicit measurement of intervention, recovery, latency, cost, and policy compliance.

Our role is not to sell a fixed, generic dataset. We design a program around the target model, deployment environment, failure profile, data rights, and acceptance criteria. Every engagement begins with a concrete definition of what a usable training or evaluation unit means for the customer—and how that unit will be verified before delivery.

Built for Teams That Need More Than Volume

This product supports teams building browser, computer-use, coding, research, support, operations, or multi-agent systems. The buyer often has a prototype and anecdotal failures but lacks a reproducible task suite, state-aware trajectory schema, or reliable method for converting successful and failed runs into training data.

Common engagement triggers

Offline language-model scores do not predict whether the agent completes the workflow.
Logs cannot distinguish model, tool, harness, permission, environment, and evaluator failures.
Human demonstrations are inconsistent, incomplete, or misaligned with the deployed tool interface.
Final-answer checks pass despite incorrect state changes, poor artifacts, or policy violations.
Long-horizon tasks require partial-credit rubrics, professional judgment, or review of multiple generated files.
The team needs tests for prompt injection, excessive agency, unsafe tool use, or cross-system data leakage.

What This Product Can Support

Golden and Corrective Trajectories

A trajectory records the sequence required to complete a task under a defined environment and tool set. High-value records explain not only the ideal path but also recoveries from realistic errors.

Expert-executed action and observation sequences.
Alternative valid paths and tool-selection decisions.
Corrective trajectories from failed or suboptimal model runs.
Human approval, correction, takeover, and escalation events.
Outcome verification linked to final state and artifacts.

Executable Task and Environment Design

Agent evaluation is strongest when a task can be reset, replayed, and verified. We design fixtures, initial state, permissions, mock or controlled services, hidden checks, and reproducibility controls.

Containerized or sandboxed tasks.
Browser and enterprise-application state snapshots.
Repository issues with tests and behavioral constraints.
API workflows with seeded records and post-condition checks.
Injected unavailable, delayed, stale, or misleading tool conditions.
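
As a rough illustration of a task that can be reset, replayed, and verified, the sketch below uses an in-memory dictionary as a stand-in for a real fixture. The names `CrmFixture`, `SEED_RECORDS`, and `check_postconditions` are hypothetical and chosen only for this example; a production environment would more likely be a container, a snapshot, or a test tenant.

```python
import copy
from dataclasses import dataclass, field

# Hypothetical seed data; a real fixture would be a container, snapshot, or test tenant.
SEED_RECORDS = {"acct_491": {"stage": "active", "owner": "ops"}}


@dataclass
class CrmFixture:
    """Toy in-memory environment that can be reset to a known initial state."""
    records: dict = field(default_factory=dict)

    def reset(self) -> None:
        # Deep-copy the seed so one episode cannot leak state into the next.
        self.records = copy.deepcopy(SEED_RECORDS)

    def update(self, account_id: str, **changes) -> None:
        self.records[account_id].update(changes)


def check_postconditions(env: CrmFixture) -> dict:
    """Hidden check: inspects resulting state, not the agent's final message."""
    record = env.records["acct_491"]
    return {
        "stage_updated": record["stage"] == "renewal-review",
        "owner_unchanged": record["owner"] == "ops",
    }


env = CrmFixture()
env.reset()
env.update("acct_491", stage="renewal-review")
print(check_postconditions(env))

env.reset()  # replaying from the same seed restores the initial state
print(check_postconditions(env))
```

The check reads the environment state directly, so an agent that only claims to have updated the record would still fail it.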

Tool-Use and State-Transition Data

Tool data should preserve arguments, results, errors, permissions, retries, and state before and after action. It can support function calling, planning, verifier models, and failure analysis.

Function/API examples with typed schemas.
Screen interaction and computer-use actions.
Database changes, file diffs, and artifact manifests.
Tool errors, partial availability, and stale observations.
Policy labels for allowed, prohibited, and approval-required actions.
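
One possible way to attach the allowed, prohibited, and approval-required labels to logged tool calls is sketched below. The policy table, the `label_tool_call` helper, and the default of treating unknown tools as prohibited are assumptions made for illustration, not a description of a delivered labeling pipeline.

```python
from enum import Enum


class PolicyLabel(str, Enum):
    ALLOWED = "allowed"
    APPROVAL_REQUIRED = "approval_required"
    PROHIBITED = "prohibited"


# Hypothetical policy table keyed by tool name; unknown tools default to prohibited.
POLICY = {
    "crm.search": PolicyLabel.ALLOWED,
    "crm.update": PolicyLabel.ALLOWED,
    "mail.draft": PolicyLabel.ALLOWED,
    "mail.send": PolicyLabel.APPROVAL_REQUIRED,
}


def label_tool_call(call: dict, approved_actions: set) -> dict:
    """Attach a policy label and a violation flag to one logged tool call."""
    label = POLICY.get(call["action"], PolicyLabel.PROHIBITED)
    violation = label is PolicyLabel.PROHIBITED or (
        label is PolicyLabel.APPROVAL_REQUIRED
        and call["action"] not in approved_actions
    )
    return {**call, "policy_label": label.value, "violation": violation}


# A send without prior approval is flagged; the same call after approval is not.
print(label_tool_call({"action": "mail.send", "arguments": {}}, approved_actions=set()))
print(label_tool_call({"action": "mail.send", "arguments": {}}, approved_actions={"mail.send"}))
```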

Long-Horizon Artifact Evaluation

Many enterprise tasks end in a document, spreadsheet, presentation, design, code change, or report. Evaluation combines requirements, grounding, file validity, functional or visual quality, and professional usefulness.

Requirement-level rubrics and partial credit.
File-open, parse, compile, test, and schema validation.
Source-grounding and citation checks.
Visual comparison or design-system compliance.
Human adjudication for subjective but consequential quality.
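
Requirement-level scoring with partial credit can be sketched as a weighted sum, with critical requirements reported separately so that a high score cannot hide a failed hard constraint. The `Requirement` class, the weights, and the requirement names are illustrative assumptions; real rubrics are authored per program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    name: str
    weight: float
    critical: bool = False


def score_artifact(requirements, met):
    """Return weighted partial credit plus a separate critical-constraint flag."""
    total = sum(r.weight for r in requirements)
    earned = sum(r.weight for r in requirements if r.name in met)
    critical_ok = all(r.name in met for r in requirements if r.critical)
    return {
        "partial_credit": earned / total if total else 0.0,
        "critical_ok": critical_ok,
    }


# Hypothetical rubric for a generated report.
rubric = [
    Requirement("file_opens", 1.0, critical=True),
    Requirement("cites_sources", 2.0),
    Requirement("matches_template", 1.0),
]
print(score_artifact(rubric, met={"file_opens", "cites_sources"}))
# {'partial_credit': 0.75, 'critical_ok': True}
```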

Agent Safety and Policy Evaluation

Agents amplify risk because they can act. Testing covers unsafe authority, prompt injection, credential exposure, irreversible actions, memory poisoning, and hidden cross-system effects.

Direct and indirect prompt-injection scenarios.
Least-privilege and permission-boundary tasks.
Approval gates for high-impact actions.
Data exfiltration and persistent-memory attacks.
Safe recovery from ambiguity and conflicting policies.

Data We Build

The delivery unit is defined at the level required by the model and the evaluation harness—not merely as a row of text or a media file. Depending on the program, one record may include source inputs, structured intermediate state, human judgments, provenance, quality evidence, and model- or environment-derived verification.

Deliverable	What it contains	Typical use
Golden trajectory set	Initial state, task, tools, actions, observations, checkpoints, final state, artifacts, and verifier output.	Imitation learning, SFT, trajectory ranking, planner/controller training.
Failure and recovery corpus	Failed run, localized cause, recovery action, human intervention, and corrected completion.	Recovery policy, critic training, failure classification, escalation design.
Executable task suite	Environment fixture, instruction, seed data, permissions, reset logic, hidden checks, and oracle notes.	Benchmarking, release gating, regression testing, RL environments.
Tool-call dataset	Tool schema, context, arguments, returned observation, errors, and correctness label.	Function calling, tool routing, argument generation, error handling.
Artifact-quality evaluation set	Task requirements, generated files, rubric, reference evidence, reviewer scores, and adjudication.	Professional knowledge-work and subjective long-horizon evaluation.
Agent safety suite	Threat scenario, attack channel, allowed authority, interaction log, impact, severity, and mitigation signal.	Red teaming, security regression, permission-policy evaluation.

Reference Record Design

A production schema is finalized during calibration, but a typical record may include the following fields:

episode_id — Stable identifier for one attempt under a specific environment and harness.
task_spec — User goal, constraints, success requirements, and prohibited actions.
environment_snapshot — Initial fixture, seeded records, files, applications, permissions, and time conditions.
tool_registry — Available tools, versions, schemas, side effects, and permission boundaries.
steps — Ordered observable actions, calls, observations, errors, and state hashes; private model reasoning is not required.
interventions — Human approvals, corrections, takeovers, or environment repairs.
artifacts — Generated or modified files, records, messages, code, and visual outputs with checksums.
final_state — Resulting system state and externally observable outcome.
verifier_results — Deterministic checks, rubric scores, policy checks, and adjudication.
cost_and_latency — Tokens, calls, wall-clock time, retries, and operational measures.

{
  "episode_id": "agent_crm_002157",
  "task_spec": {"goal": "Update the renewal record and draft a follow-up", "approval_required": ["send_email"]},
  "environment_snapshot": {"fixture": "crm-v3.8", "account_id": "acct_491", "state_hash": "sha256:..."},
  "tool_registry": ["crm.search", "crm.update", "mail.draft", "mail.send"],
  "steps": [
    {"index": 1, "action": "crm.search", "arguments": {"account_id": "acct_491"}, "observation_ref": "obs_001"},
    {"index": 2, "action": "crm.update", "arguments": {"stage": "renewal-review"}, "state_diff_ref": "diff_002"}
  ],
  "interventions": [{"type": "human_approval", "action": "mail.send", "outcome": "not_requested"}],
  "artifacts": [{"type": "email_draft", "path": "artifacts/follow_up.md", "checksum": "sha256:..."}],
  "final_state": {"account_stage": "renewal-review", "email_sent": false, "state_hash": "sha256:..."},
  "verifier_results": {"record_updated": true, "email_grounded": true, "policy_adherence": "pass", "overall": "partial"},
  "cost_and_latency": {"tool_calls": 5, "wall_clock_seconds": 42.7}
}

The schema is versioned. Changes to label definitions, evidence requirements, reviewer policy, or normalization rules are recorded so training and evaluation results can be traced to the exact specification used.
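
A minimal sketch of how a consumer might check that a record carries the fields listed above and was produced under the expected specification follows. The `schema_version` key, the version string, and both helper functions are hypothetical; the production schema is finalized during calibration.

```python
# Top-level fields taken from the field list above; "schema_version" is a
# hypothetical addition used here to show how a record could pin its spec.
REQUIRED_FIELDS = (
    "episode_id", "task_spec", "environment_snapshot", "tool_registry", "steps",
    "interventions", "artifacts", "final_state", "verifier_results", "cost_and_latency",
)


def missing_fields(record: dict) -> list:
    """Return required top-level fields that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def check_record(record: dict, expected_schema: str) -> list:
    """Collect problems instead of raising, so a QA report can list them all."""
    problems = [f"missing field: {name}" for name in missing_fields(record)]
    if record.get("schema_version") != expected_schema:
        problems.append(
            f"schema_version {record.get('schema_version')!r} != {expected_schema!r}"
        )
    return problems


# Deliberately incomplete record to show the reported problems.
record = {"episode_id": "agent_crm_002157", "steps": [], "schema_version": "1.2.0"}
for problem in check_record(record, expected_schema="1.2.0"):
    print(problem)
```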

Program Workflow

Workflow decomposition. Observe the real task, identify systems and artifacts, document decision points, and define what the agent may and may not do.
Environment and fixture design. Create a reproducible starting state, seed data, tool registry, permissions, reset procedure, and protected verification channel.
Task and rubric authoring. Write realistic instructions with explicit and implicit requirements; define deterministic post-conditions and human-scored artifact criteria.
Expert demonstration. Qualified operators complete tasks while preserving actions, observations, state changes, artifacts, and intervention points.
Model rollout collection. Run target agent/harness combinations under locked versions, capturing errors, retries, context management, and operational cost.
Failure localization. Separate planning, perception, tool selection, argument, execution, state-tracking, policy, and artifact-quality failures.
Verification and adjudication. Apply state checks, tests, file validation, rubric scoring, source checks, and human review. Distinguish full, partial, assisted, and unsafe completion.
Training/evaluation loop. Convert selected demonstrations, recoveries, and hard negatives into training data while preserving hidden tasks for regression and release decisions.
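
The distinction between full, partial, assisted, and unsafe completion could be encoded roughly as below. The precedence order, the extra `failed` bucket, and the choice to count hints, corrections, and takeovers (but not policy-required approvals) as assistance are all assumptions for this sketch; the actual rules are set in the program rubric.

```python
# Hypothetical set of intervention types that count as assistance.
ASSISTING_TYPES = {"hint", "correction", "takeover"}


def classify_completion(checks: dict, interventions: list, policy_violation: bool) -> str:
    """Map verifier output for one episode to a completion category."""
    if policy_violation:
        # Unsafe takes precedence: a policy violation is never reported as success.
        return "unsafe"
    passed = [bool(value) for value in checks.values()]
    if not any(passed):
        return "failed"
    if not all(passed):
        return "partial"
    if any(item["type"] in ASSISTING_TYPES for item in interventions):
        return "assisted"
    return "full"


print(classify_completion({"record_updated": True, "email_grounded": True}, [], False))
print(classify_completion({"record_updated": True, "email_grounded": False}, [], False))
print(classify_completion({"record_updated": True}, [{"type": "takeover"}], False))
```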

A pilot is considered complete only when the customer and delivery team have aligned on the rubric, reviewed representative disagreements, validated the export, and confirmed that the data is useful in the intended training or evaluation loop.

Quality Controls

Quality is designed into the workflow rather than added as a final inspection step. The control plan depends on task ambiguity, domain risk, annotator expertise, and whether an item has an executable or external verifier.

Environment version locking: Every episode records fixture, tool, harness, model, dependency, and policy versions.
State-based verification: Checks inspect resulting databases, files, messages, UI state, or repositories—not only the final response.
Artifact validation: Generated files are opened, parsed, compiled, rendered, or tested before acceptance.
Requirement-level scoring: Complex tasks are decomposed into explicit requirements for partial credit and diagnosis.
Trajectory sanity checks: Validators detect missing observations, invalid calls, impossible timestamps, untracked side effects, and broken references.
Harness-model separation: Reports distinguish model, adapter, prompt, environment, permission, and evaluator failures.
Human intervention labeling: Hints, approvals, corrections, and takeovers are recorded so assisted success is not reported as autonomous.
Safety impact review: Irreversible, privacy-sensitive, or high-impact actions receive authorization review independent of task success.
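
A trajectory sanity check of the kind described above might look like the following sketch. It covers only index ordering, missing observations, broken references, and backwards timestamps; the `timestamp` field and the `validate_steps` helper are hypothetical and not part of the reference record shown earlier.

```python
def validate_steps(steps: list, known_refs: set) -> list:
    """Return a list of human-readable problems found in an ordered step log."""
    problems = []
    previous_ts = None
    for position, step in enumerate(steps, start=1):
        if step.get("index") != position:
            problems.append(
                f"step {position}: index is {step.get('index')!r}, expected {position}"
            )
        # Every step should point at an observation or a state diff.
        ref = step.get("observation_ref") or step.get("state_diff_ref")
        if ref is None:
            problems.append(f"step {position}: missing observation or state diff")
        elif ref not in known_refs:
            problems.append(f"step {position}: broken reference {ref!r}")
        # "timestamp" is an assumed optional field (seconds since episode start).
        ts = step.get("timestamp")
        if ts is not None and previous_ts is not None and ts < previous_ts:
            problems.append(f"step {position}: timestamp goes backwards")
        if ts is not None:
            previous_ts = ts
    return problems


steps = [
    {"index": 1, "action": "crm.search", "observation_ref": "obs_001", "timestamp": 10.0},
    {"index": 3, "action": "crm.update", "state_diff_ref": "diff_999", "timestamp": 5.0},
]
for problem in validate_steps(steps, known_refs={"obs_001", "diff_002"}):
    print(problem)
```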

Recommended acceptance metrics

Verified task success: Completion based on post-condition and artifact checks, not self-report.
Requirement coverage: Fraction of requirements met, with critical constraints treated separately.
Intervention rate: Frequency and type of assistance required to finish safely and correctly.
Recovery rate: Share of injected or natural failures from which the agent returns to a valid path.
Policy adherence: Allowed, approval-required, and prohibited actions assessed against policy.
Efficiency: Tool calls, retries, tokens, wall-clock time, and cost reported with quality.
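
To make the metric definitions concrete, the sketch below aggregates a few of them over per-episode summaries. The summary keys (`verified_success`, `interventions`, `failures_encountered`, `failures_recovered`, `tool_calls`) and the sample values are invented for illustration and do not describe a fixed export format or real results.

```python
def acceptance_metrics(episodes: list) -> dict:
    """Aggregate per-episode summaries into program-level acceptance metrics."""
    n = len(episodes)
    if n == 0:
        return {}
    verified = sum(1 for e in episodes if e["verified_success"])
    assisted = sum(1 for e in episodes if e["interventions"] > 0)
    failures = sum(e["failures_encountered"] for e in episodes)
    recovered = sum(e["failures_recovered"] for e in episodes)
    return {
        "verified_task_success": verified / n,
        "intervention_rate": assisted / n,
        # Undefined when no failures occurred, so report None rather than 0 or 1.
        "recovery_rate": recovered / failures if failures else None,
        "mean_tool_calls": sum(e["tool_calls"] for e in episodes) / n,
    }


# Made-up episode summaries, for illustration only.
episodes = [
    {"verified_success": True, "interventions": 0, "failures_encountered": 1, "failures_recovered": 1, "tool_calls": 4},
    {"verified_success": True, "interventions": 1, "failures_encountered": 2, "failures_recovered": 1, "tool_calls": 6},
    {"verified_success": False, "interventions": 0, "failures_encountered": 1, "failures_recovered": 0, "tool_calls": 8},
    {"verified_success": True, "interventions": 0, "failures_encountered": 0, "failures_recovered": 0, "tool_calls": 2},
]
print(acceptance_metrics(episodes))
```

Reporting the metrics side by side, rather than folding them into one number, keeps efficiency and assistance visible next to success.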

No single aggregate score is sufficient. Agreement can diagnose ambiguity, but high agreement does not by itself prove correctness; disagreement can reveal plural preferences, unclear policy, underspecified context, or difficult edge cases. The QA report therefore pairs quantitative measures with sampled error analysis and adjudication notes.

Delivery and Integration

Supported delivery patterns

Versioned batch delivery for controlled model-training releases.
Incremental delivery for active learning, post-training, or continuous evaluation.
Secure customer-workspace delivery when source data cannot leave the customer environment.
API- or object-storage-based transfer for high-volume or multimodal programs.
Evaluation-ready task packs with rubrics, reference evidence, and scoring logic.

Common formats

JSONL trajectories, Parquet event tables, container/task bundles, Git repositories, browser traces, HAR, screen recordings, artifact ZIPs

Exports can be adapted to the customer’s agent framework, function schema, browser or computer-use action space, software-engineering harness, or RL environment. Raw events remain separate from normalized training views so future schemas can be regenerated without losing evidence.

Each release can include a dataset card or delivery memo, schema and ontology version, quality summary, known limitations, rights and consent metadata where applicable, and a machine-readable manifest with checksums and file-level lineage.
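
A machine-readable manifest with per-file checksums could be generated along these lines. This is a minimal sketch using only the Python standard library; the manifest layout, the `build_manifest` helper, and the `sha256:` prefix convention are assumptions, and lineage fields are omitted.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def build_manifest(release_dir: Path, schema_version: str) -> dict:
    """List every file under release_dir with its size and checksum."""
    files = [
        {
            "path": p.relative_to(release_dir).as_posix(),
            "bytes": p.stat().st_size,
            "checksum": sha256_of(p),
        }
        for p in sorted(release_dir.rglob("*"))
        if p.is_file()
    ]
    return {"schema_version": schema_version, "files": files}


def write_manifest(release_dir: Path, schema_version: str, out_path: Path) -> None:
    """Serialize the manifest as JSON outside or alongside the release directory."""
    manifest = build_manifest(release_dir, schema_version)
    out_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

A recipient can recompute the checksums after transfer and compare them with the manifest to confirm that nothing was altered or truncated in transit.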

Security, Rights, and Governance

Agent programs may touch live systems, credentials, personal data, source code, customer communications, and high-impact actions. Safe programs default to isolated fixtures or test tenants, synthetic or de-identified records, scoped service accounts, action allowlists, and explicit human approval for irreversible operations. Live-system collection requires documented authorization and incident response.

Program controls may include role-based access, workspace isolation, least-privilege review queues, de-identification, retention limits, geographic routing, approved-tool restrictions, audit logs, and customer-defined deletion procedures. These controls are scoped contractually; the page does not imply a certification or regulatory status that has not been independently verified.

Engagement Models

Engagement	Best for	Typical output
Agent eval sprint	A prototype with unclear production readiness.	Representative tasks, failure taxonomy, harness report, and prioritized data plan.
Trajectory production program	Teams training a tool-using or workflow agent.	Golden, alternative, failed, and recovery trajectories with state evidence.
Executable benchmark build	Teams needing a private release gate.	Resettable tasks, hidden verifiers, rubrics, baseline runs, and reporting pipeline.
Continuous agent assurance	Agents changing with models, tools, and policies.	Recurring regressions, attack mutations, drift analysis, and corrective data.

Illustrative Program Shapes

The examples below are representative program patterns, not claims about named customers or guaranteed outcomes.

Enterprise service workflow. Build tasks across CRM, ticketing, knowledge, and email fixtures. Verify record state, citation grounding, draft quality, approval requests, and prohibited outbound actions.
Software engineering agent. Create repository tasks with environment setup, tests, code-quality constraints, hidden edge cases, and review of changes that pass tests but violate intended behavior.
Research and report agent. Evaluate source discovery, evidence extraction, citation accuracy, synthesis, file delivery, and professional usefulness across multi-file assignments.
Computer-use safety suite. Test indirect prompt injection, destructive actions, secret exposure, permission escalation, and ambiguous intent in resettable desktop environments.

Why a Custom Program

Off-the-shelf datasets are useful for baseline experimentation, but production systems usually fail at the boundaries: domain-specific policy, uncommon languages, tool or sensor state, difficult negative examples, ambiguous evidence, long-tail user behavior, and deployment-specific risk. A custom program makes those boundaries explicit and converts them into measurable data requirements.

The result is not simply “more labels.” It is a controlled data asset with a defined purpose, documented provenance, repeatable quality process, and a path from observed model failure to the next training or evaluation cycle.

Case studies

Delivered with Agents.

Enterprise AI

Policy-Driven Task Environments for Enterprise Agent Evaluation

Policy-driven task environments and failure taxonomies for enterprise agents — executable evaluation suites covering tool use, recovery behavior, and escalation accuracy.
