Speech and Audio Data for Models That Listen, Speak, and Understand Human Nuance
Prompted, conversational, and expressive speech across languages, accents, and acoustic conditions — collected with consent controls and validated by locale experts.
Use cases
What teams use it for.
What we build
Data we produce.
Coverage
Collection dimensions
Delivery & integration
Built to drop into your pipeline.
Every dataset ships versioned, documented, and matched to your schema — with a QA report your research team can audit against acceptance criteria.
Workflow
How the program runs.
- 01 Scope
- 02 Speaker Sourcing
- 03 Consent & Setup
- 04 Collection
- 05 Transcription
- 06 Locale QA
- 07 Delivery
Quality controls
How we keep it correct.
- Speaker validation
- Audio quality checks
- Transcript review
- Locale expert QA
- Consent and privacy controls
- PII handling
FAQ
Common questions.
How do you handle consent and privacy for voice data?
Every contributor signs explicit consent covering usage scope, retention, and withdrawal. Recordings flow through PII screening and de-identification workflows before delivery, and consent records are retained as part of the dataset lineage.
Can you collect low-resource languages?
Yes. We source speakers through regional contributor networks and pair them with locale-expert reviewers, so that low-resource language and dialect programs maintain transcription quality.
Product deep dive
Speech and Audio Data for ASR, TTS, Voice Agents, and Speech Models
The Data Layer Behind Reliable Models That Listen, Speak, and Understand Human Nuance
Speech systems are moving from separate ASR, language, and TTS components toward unified audio-language and speech-to-speech models. Current systems increasingly need to preserve timing, prosody, speaker identity, overlap, background events, interruption, and tool-use state. A transcript remains important, but it no longer captures the full supervision signal.
Full-duplex interaction raises the bar further. The model must listen while speaking, distinguish the primary user from nearby voices, detect meaningful interruption, avoid false barge-ins, manage end-of-turn timing, and react to non-verbal audio. Data must be segmented on a shared timeline and labeled for interaction events—not just sentence boundaries.
Our role is not to sell a fixed, generic dataset. We design a program around the target model, deployment environment, failure profile, data rights, and acceptance criteria. Every engagement begins with a concrete definition of what a usable training or evaluation unit means for the customer—and how that unit will be verified before delivery.
Built for Teams That Need More Than Volume
This product supports speech recognition, expressive synthesis, speech translation, audio understanding, conversational voice AI, and audio-language foundation models. Buyers need controlled rights, locale coverage, acoustic diversity, and labels that connect linguistic content to speaker, timing, interaction, and environment.
Common engagement triggers
- Word error rate is acceptable on clean speech but degrades with accents, code-switching, noise, overlap, far-field capture, or domain vocabulary.
- A voice agent shows high latency, false interruptions, delayed turn-taking, or primary-speaker selection failures.
- TTS is intelligible but lacks style control, natural prosody, conversational timing, or consistent voice identity.
- Audio understanding is limited to transcription and misses non-speech events, delivery, prosody, or cross-speaker context.
- Existing speech data has unclear consent, rights, speaker metadata, or permitted use for voice generation.
- The team needs independent listening tests and production-condition benchmarks rather than studio-only evaluation.
What This Product Can Support
ASR and Speech Understanding Data
Speech recognition programs can target language, domain, acoustic condition, interaction style, and error category. Transcripts are aligned to audio and reviewed under locale-specific conventions.
- Prompted, read, spontaneous, and task-oriented speech.
- Far-field, mobile, headset, in-vehicle, and telephony capture.
- Accent, dialect, code-switching, and low-resource language coverage.
- Timestamped transcripts, punctuation, normalization, and uncertainty markers.
- Domain terminology, named entities, and contextual-biasing sets.
Conversational and Full-Duplex Interaction
Natural voice systems need continuous interaction data that preserves simultaneous channels, turn-management events, and system response timing.
- Overlapping user and assistant speech.
- Barge-in, backchannel, hesitation, correction, and interruption events.
- Primary-speaker versus bystander activity.
- Semantic end-of-turn and continued-thought examples.
- Tool calls or actions aligned to the audio timeline.
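As an illustration of labeling on a shared timeline, the sketch below derives overlap spans from per-channel speech segments. It assumes each channel's segments are `(start_ms, end_ms)` pairs on the same media clock and do not overlap within a channel. The function name and the sample values are hypothetical, not part of any delivered tooling.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms) on the shared media clock


def overlap_intervals(user: List[Interval], assistant: List[Interval]) -> List[Interval]:
    """Return spans where both channels contain speech at the same time."""
    user = sorted(user)
    assistant = sorted(assistant)
    i = j = 0
    overlaps: List[Interval] = []
    while i < len(user) and j < len(assistant):
        start = max(user[i][0], assistant[j][0])
        end = min(user[i][1], assistant[j][1])
        if start < end:
            overlaps.append((start, end))
        # Advance whichever segment finishes first.
        if user[i][1] <= assistant[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


# Hypothetical segments: the user starts talking while the assistant is speaking.
user_speech = [(1220, 4840), (6210, 9000)]
assistant_speech = [(5000, 7040)]
print(overlap_intervals(user_speech, assistant_speech))  # [(6210, 7040)]
```

Spans found this way are only candidates. Whether an overlap counts as a barge-in, a backchannel, or bystander speech still depends on the annotation guidelines.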
Expressive Speech and TTS
Synthesis data can encode speaking style, emotion expression, pace, emphasis, pronunciation, conversational role, and recording consistency while keeping voice rights explicit.
- Studio or controlled remote voice recording.
- Style, prosody, emotion-expression, and dialogue-act prompts.
- Pronunciation lexicons and phonetic review.
- Multi-speaker dialogue and long-form narration.
- Naturalness, similarity, intelligibility, and style-adherence evaluation.
Audio Events and Paralinguistics
Audio-language models may need environmental sounds, vocal behavior, and contextual cues. Labels should be operationally defined and should avoid unsupported claims about hidden mental states.
- Sound-event detection and temporal localization.
- Laughter, breath, cough, hesitation, emphasis, and speaking-rate labels.
- Acoustic-scene and noise-condition metadata.
- Perceived affect or communication style under culture-aware rubrics.
- Abstention when evidence is ambiguous or sensitive.
Speech and Audio Evaluation
A complete benchmark separates transcription, content understanding, speaker attribution, temporal behavior, audio quality, safety, and user experience.
- WER/CER and entity-aware recognition.
- Speaker diarization and overlap evaluation.
- Audio question answering and instruction following.
- Turn-taking, interruption, latency, and recovery tests.
- Human listening studies and pairwise preference evaluation.
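For reference, word error rate is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal version that assumes both strings have already been normalized under the program's transcription conventions and tokenizes on whitespace only. Production scoring would normally rely on an established scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("the") and one substitution ("light" -> "lights") over 5 words.
print(word_error_rate("turn on the kitchen light", "turn on kitchen lights"))  # 0.4
```

CER follows the same recurrence over characters instead of words. Entity-aware scoring needs an additional alignment step that this sketch does not cover.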
Data We Build
The delivery unit is defined at the level required by the model and the evaluation harness—not merely as a row of text or a media file. Depending on the program, one record may include source inputs, structured intermediate state, human judgments, provenance, quality evidence, and model- or environment-derived verification.
| Deliverable | What it contains | Typical use |
|---|---|---|
| Consented speech corpus | Audio, speaker/locale metadata, consent and rights status, script or task prompt, transcript, and QC. | ASR, speech-language pretraining, locale adaptation. |
| Conversational speech set | Multi-channel audio, speaker turns, overlap, dialogue acts, interruptions, transcript, and outcome. | Voice assistants, diarization, full-duplex systems. |
| Expressive TTS corpus | Clean speech, text, pronunciation, style/prosody labels, session metadata, rights record, and audio QC. | TTS training, voice adaptation, controllable synthesis. |
| Audio event dataset | Waveform, event intervals, source class, acoustic scene, confidence, and ambiguity notes. | Audio-language models, event detection, safety monitoring. |
| Speech preference/evaluation set | Audio candidates, listening protocol, blinded judgments, dimension scores, and reviewer context. | TTS evaluation, voice-agent quality, post-training. |
| Full-duplex interaction benchmark | Continuous dual streams, timing events, task outcomes, latency, false-interruption cases, and rubrics. | Release gating for real-time conversational systems. |
Reference Record Design
A production schema is finalized during calibration, but a typical record may include the following fields:
- utterance_or_session_id: Stable identifier for a clip, turn, or continuous interaction session.
- audio_assets: Channel files, codec, sample rate, bit depth, duration, loudness, and checksums.
- speaker_and_rights: Pseudonymous speaker ID, locale metadata, consent version, permitted uses, withdrawal status, and compensation record when applicable.
- transcript: Verbatim or normalized text, timestamps, non-speech markers, uncertainty, and review status.
- speaker_timeline: Turns, overlap, primary speaker, bystander speech, and diarization references.
- interaction_events: Barge-in, backchannel, end-of-turn, correction, tool call, pause, or system-response markers.
- acoustic_metadata: Device, distance, environment, noise, reverberation, SNR estimate, and channel conditions.
- paralinguistic_or_style_labels: Operationally defined perceived style, prosody, vocal event, or delivery attributes.
- evaluation_results: Recognition, diarization, timing, audio-quality, preference, and task-outcome measures.
- split_group: Speaker, household, session, script, or environment group preventing leakage.
{
  "utterance_or_session_id": "duplex_support_enUS_00192",
  "audio_assets": {"user_channel": "audio/user.flac", "assistant_channel": "audio/assistant.flac", "sample_rate": 24000},
  "speaker_and_rights": {"speaker_id": "spk_0184", "locale": "en-US", "consent_version": "voice-research-v2", "tts_use": false},
  "transcript": [{"speaker": "user", "start_ms": 1220, "end_ms": 4840, "text": "..."}],
  "speaker_timeline": [{"start_ms": 6210, "end_ms": 7040, "state": "overlap", "primary_speaker": "user"}],
  "interaction_events": [{"time_ms": 6290, "type": "barge_in", "system_behavior": "stop_speaking"}],
  "acoustic_metadata": {"device": "mobile-handset", "environment": "kitchen", "snr_db_estimate": 14.2},
  "paralinguistic_or_style_labels": [{"interval": [1220, 4840], "label": "hurried_delivery", "confidence": 0.82}],
  "evaluation_results": {"turn_taking": "pass", "false_barge_in": false, "task_success": true},
  "split_group": "speaker_spk_0184"
}
The schema is versioned. Changes to label definitions, evidence requirements, reviewer policy, or normalization rules are recorded so training and evaluation results can be traced to the exact specification used.
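A record of this shape can be checked mechanically before it enters a training or evaluation set. The sketch below is one possible pre-ingestion check, not the production validator. The choice of required fields and the `validate_record` name are assumptions made for illustration; the real list would come from the schema finalized during calibration.

```python
from typing import Any, Dict, List

# Assumption: these fields are treated as mandatory for illustration only.
REQUIRED_FIELDS = (
    "utterance_or_session_id",
    "audio_assets",
    "speaker_and_rights",
    "transcript",
    "split_group",
)


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems: List[str] = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
    # Every timed span must have integer bounds with start before end.
    for section in ("transcript", "speaker_timeline"):
        for idx, span in enumerate(record.get(section, [])):
            start, end = span.get("start_ms"), span.get("end_ms")
            if not isinstance(start, int) or not isinstance(end, int):
                problems.append(f"{section}[{idx}]: start_ms and end_ms must be integers")
            elif start >= end:
                problems.append(f"{section}[{idx}]: start_ms must precede end_ms")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to summarize failure types across a whole batch in a QA report.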
Program Workflow
- Use, rights, and risk scoping. Define model use, languages, voice-generation rights, biometric or identity risks, retention, and whether withdrawal must propagate to derived data.
- Locale and coverage design. Specify speaker, dialect, acoustic, device, domain, interaction, and long-tail coverage without treating demographic categories as proxies for all speech variation.
- Protocol and script design. Create prompts, tasks, conversations, acoustic conditions, interruptions, and safety boundaries; pilot for naturalness and participant burden.
- Speaker sourcing and consent. Verify eligibility, obtain purpose-specific informed consent, document permitted uses, and separate identity information from production data.
- Capture and session QC. Monitor channel integrity, clipping, loudness, noise, synchronization, script fidelity, and metadata during collection.
- Transcription and annotation. Produce timestamped transcripts, diarization, overlap, events, style or acoustic labels, and uncertainty markers under locale-specific guidelines.
- Independent review and listening tests. Apply audio validators, second-pass language review, speaker consistency checks, blinded listening panels, and adjudication.
- Delivery and lifecycle control. Version data and consent, create speaker-disjoint splits, document limitations, and maintain deletion or withdrawal mappings where required.
A pilot is considered complete only when the customer and delivery team have aligned on the rubric, reviewed representative disagreements, validated the export, and confirmed that the data is useful in the intended training or evaluation loop.
Quality Controls
Quality is designed into the workflow rather than added as a final inspection step. The control plan depends on task ambiguity, domain risk, annotator expertise, and whether an item has an executable or external verifier.
- Audio integrity checks: Detect clipping, truncation, dropouts, silence, encoding errors, channel swaps, loudness outliers, and synchronization drift.
- Locale-qualified transcription: Reviewers follow language-specific conventions for normalization, code-switching, names, disfluencies, and non-speech events.
- Speaker-disjoint splits: Speakers—and when relevant households, sessions, scripts, or environments—are grouped to prevent identity leakage.
- Consent-to-asset linkage: Every recording maps to the exact consent and permitted-use record; synthesis rights are not inferred from ASR consent.
- Timeline consistency: Turn, word, event, overlap, and tool-action timestamps are validated against the same media clock.
- Perceptual study controls: Listening tests randomize candidates, control playback conditions, separate dimensions, and monitor rater reliability.
- Sensitive-inference limits: Labels describe observable or perceived vocal attributes and avoid diagnosing health, identity, or mental state without a valid basis.
- Adversarial condition review: Hard cases include noise, competing speech, spoofing, replay, accents, code-switching, interruptions, and non-speech audio.
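The speaker-disjoint split control above can be sketched as a deterministic assignment keyed on the grouping field. The example assumes each record carries a `split_group` string as in the reference record. The hashing scheme, the `assign_split` name, and the 10% evaluation fraction are illustrative choices, not a description of the delivery pipeline.

```python
import hashlib


def assign_split(split_group: str, eval_fraction: float = 0.1) -> str:
    """Map a group key to 'train' or 'eval' so a group never straddles both."""
    digest = hashlib.sha256(split_group.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "eval" if bucket < eval_fraction else "train"


# Hypothetical records: two clips from one speaker, one clip from another.
records = [
    {"id": "clip_a", "split_group": "speaker_spk_0184"},
    {"id": "clip_b", "split_group": "speaker_spk_0184"},
    {"id": "clip_c", "split_group": "speaker_spk_0207"},
]
for record in records:
    record["split"] = assign_split(record["split_group"])
```

Because the assignment depends only on the group key, clips added in a later delivery for an existing speaker land in the same split. When households, scripts, or environments must also be disjoint, the group key would need to encode the coarsest of those units.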
Recommended acceptance metrics
- Transcript accuracy: WER, CER, entity error, and normalization accuracy sliced by language and condition.
- Diarization and overlap: Speaker error and overlap-aware measures under the specified channel setup.
- Timing quality: Boundary error for words, turns, end-of-turn, interruption, and audio events.
- Audio quality: Signal checks plus human intelligibility, naturalness, similarity, and artifact ratings.
- Interaction performance: Task success, false barge-ins, missed interruptions, latency, recovery, and user effort.
- Coverage and rights completeness: Distribution and missingness across target slices plus consent/provenance linkage rate.
No single aggregate score is sufficient. Agreement can diagnose ambiguity, but high agreement does not by itself prove correctness; disagreement can reveal plural preferences, unclear policy, underspecified context, or difficult edge cases. The QA report therefore pairs quantitative measures with sampled error analysis and adjudication notes.
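One common way to quantify agreement between two reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. It is shown here only as an example of a quantitative measure a QA report might include; the text above does not prescribe a specific statistic, and the function name and labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


rater_1 = ["pass", "pass", "fail", "fail"]
rater_2 = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(rater_1, rater_2))  # 0.5
```

A high kappa only indicates that reviewers apply the rubric consistently. Sampled error analysis is still needed to confirm that the shared reading is the intended one.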
Delivery and Integration
Supported delivery patterns
- Versioned batch delivery for controlled model-training releases.
- Incremental delivery for active learning, post-training, or continuous evaluation.
- Secure customer-workspace delivery when source data cannot leave the customer environment.
- API- or object-storage-based transfer for high-volume or multimodal programs.
- Evaluation-ready task packs with rubrics, reference evidence, and scoring logic.
Common formats
WAV, FLAC, Opus, JSONL, RTTM, CTM, TextGrid, Parquet, WebDataset, custom streaming event logs
Data can be delivered as clip-level corpora, continuous sessions, multi-channel recordings, or event streams. Training views can support ASR, TTS, diarization, speech-to-speech, audio question answering, and turn-taking, while a normalized master manifest preserves immutable audio references.
Each release can include a dataset card or delivery memo, schema and ontology version, quality summary, known limitations, rights and consent metadata where applicable, and a machine-readable manifest with checksums and file-level lineage.
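A machine-readable manifest with checksums can be as simple as one entry per delivered file. The sketch below assumes a local directory of release files and uses SHA-256. The manifest keys (`dataset_version`, `files`, `path`, `bytes`, `sha256`) are hypothetical and would be replaced by whatever the agreed schema specifies.

```python
import hashlib
from pathlib import Path
from typing import Any, Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large audio assets are not loaded into memory at once."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def build_manifest(root: Path, dataset_version: str) -> Dict[str, Any]:
    """List every file under root with its relative path, size, and checksum."""
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        files.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {"dataset_version": dataset_version, "files": files}
```

On receipt, recomputing the same checksums and comparing them against the manifest confirms that no asset was truncated or altered in transfer.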
Security, Rights, and Governance
Voice can be identifying and can support impersonation or biometric inference. Consent should state intended model uses, especially whether synthesis, cloning, speaker recognition, or public release is permitted. Programs should minimize direct identifiers, protect raw audio, separate identity records, define revocation handling, and avoid redistribution beyond agreed scope.
Program controls may include role-based access, workspace isolation, least-privilege review queues, de-identification, retention limits, geographic routing, approved-tool restrictions, audit logs, and customer-defined deletion procedures. These controls are scoped contractually; the page does not imply a certification or regulatory status that has not been independently verified.
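Downstream, permitted-use metadata can be enforced with a default-deny filter, so that a right is never inferred from its absence. The sketch below assumes per-use boolean flags under `speaker_and_rights`, such as the `tts_use` flag in the reference record. The `withdrawn` flag and the `permitted_for` name are hypothetical stand-ins for however withdrawal status is actually encoded.

```python
from typing import Any, Dict, Iterable, List


def permitted_for(record: Dict[str, Any], use_flag: str) -> bool:
    """True only if the record explicitly grants the use and consent is not withdrawn."""
    rights = record.get("speaker_and_rights", {})
    if rights.get("withdrawn", False):  # hypothetical withdrawal field
        return False
    # Default-deny: a missing or non-boolean flag never grants the use.
    return rights.get(use_flag) is True


def filter_for_use(records: Iterable[Dict[str, Any]], use_flag: str) -> List[Dict[str, Any]]:
    """Keep only records whose consent explicitly covers the requested use."""
    return [record for record in records if permitted_for(record, use_flag)]


corpus = [
    {"id": "a", "speaker_and_rights": {"tts_use": False}},
    {"id": "b", "speaker_and_rights": {"tts_use": True}},
    {"id": "c", "speaker_and_rights": {"tts_use": True, "withdrawn": True}},
    {"id": "d", "speaker_and_rights": {}},
]
print([record["id"] for record in filter_for_use(corpus, "tts_use")])  # ['b']
```

Applying the filter at training-view generation time, rather than once at delivery, lets later withdrawals take effect on the next export.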
Engagement Models
| Engagement | Best for | Typical output |
|---|---|---|
| Locale/acoustic gap audit | Teams with production error clusters. | Error taxonomy, coverage analysis, targeted collection, and evaluation plan. |
| Custom speech collection | New languages, domains, devices, or interactions. | Protocol, consented recordings, metadata, transcripts, and QA report. |
| Voice-agent interaction program | Real-time or full-duplex systems. | Continuous sessions, turn events, failures, timing metrics, and recovery examples. |
| Independent speech evaluation | Model comparison or launch decisions. | Blind listening study, test corpus, metric report, and slice analysis. |
Illustrative Program Shapes
The examples below are representative program patterns, not claims about named customers or guaranteed outcomes.
- Noisy multilingual ASR. Collect spontaneous and task-oriented speech across devices, noise, accents, and code-switching, with entity-aware review and speaker-disjoint evaluation.
- Full-duplex customer-service agent. Create dual-stream sessions with interruptions, backchannels, bystander speech, tool calls, delays, and safe escalation; evaluate timing and outcome together.
- Expressive long-form TTS. Record controlled narration and dialogue styles, manage pronunciation and session consistency, and run blinded naturalness, similarity, and style-adherence studies.
- Audio-language safety benchmark. Test spoken prompt injection, hidden or distant speech, replay, conflicting speakers, sensitive inference, and unsafe tool actions under controlled conditions.
Why a Custom Program
Off-the-shelf datasets are useful for baseline experimentation, but production systems usually fail at the boundaries: domain-specific policy, uncommon languages, tool or sensor state, difficult negative examples, ambiguous evidence, long-tail user behavior, and deployment-specific risk. A custom program makes those boundaries explicit and converts them into measurable data requirements.
The result is not simply “more labels.” It is a controlled data asset with a defined purpose, documented provenance, repeatable quality process, and a path from observed model failure to the next training or evaluation cycle.