VoiceThere

Advanced voice settings

VoiceThere cloud agents use the @node-webrtc-rust/sdk speech pipeline: voice activity detection (VAD), optional STT gating, barge-in during agent TTS, and structured speech events. Advanced settings let you override library defaults per project without changing your agent bundle.

Where to configure

  • Dashboard: Project overview → Voice (STT / TTS) → expand Advanced voice pipeline (collapsed by default). Each field has an info tooltip with tuning guidance.
  • CLI: voicethere projects voice-advanced — see examples below.

Changes are saved to the project immediately, but they apply to live sessions only after you select Deploy to cloud on the project overview.

Pipeline overview

  1. VAD detects when the user is speaking on the inbound audio track.
  2. When gateStt is enabled (recommended), STT receives audio only while the gate is open — during speech, pending lead-in, and post-speech hold.
  3. Barge-in stops agent TTS when the user interrupts. With vad.bargeIn.requireSttPartial enabled (the default), playback continues until STT returns a qualifying partial (semantic interrupt).
  4. Your agent receives user_speech_final and other speech events, then runs LLM/TTS for the reply.
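
The gating rule in step 2 can be pictured as a small predicate. This is an illustrative sketch only, not the SDK's implementation: the GateState shape, the phase names, and the isSttGateOpen function are hypothetical, and only the setting names gateStt, gateSttOpenOnPending, and sttGateHoldMs come from this page.

```typescript
// Illustrative model of the STT gate; not the SDK's actual code.
// VadPhase, GateState, GateSettings and isSttGateOpen are hypothetical names.
type VadPhase = "idle" | "pending" | "speaking";

interface GateState {
  phase: VadPhase;
  // Timestamp (ms) of the most recent VAD speech end, or null if none yet.
  lastSpeechEndMs: number | null;
}

interface GateSettings {
  gateStt: boolean;
  gateSttOpenOnPending: boolean;
  sttGateHoldMs: number;
}

function isSttGateOpen(state: GateState, cfg: GateSettings, nowMs: number): boolean {
  // With gating disabled, STT receives audio continuously.
  if (!cfg.gateStt) return true;
  // Confirmed speech always opens the gate.
  if (state.phase === "speaking") return true;
  // Pending speech (before SpeechStart) opens it only when configured to.
  if (state.phase === "pending") return cfg.gateSttOpenOnPending;
  // Idle: stay open only during the post-speech hold window.
  return state.lastSpeechEndMs !== null &&
    nowMs - state.lastSpeechEndMs < cfg.sttGateHoldMs;
}
```

Under this model, lowering sttGateHoldMs closes the gate sooner after speech ends, which is why a value that is too low can cut off trailing words.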

CLI examples

# List resolved values for the linked project
voicethere projects voice-advanced list

# Disable semantic barge-in (instant VAD interrupt — noisier)
voicethere projects voice-advanced set vad.bargeIn.requireSttPartial false

# Faster end-of-utterance in quiet rooms
voicethere projects voice-advanced set vad.minSilenceDurationMs 800
voicethere projects voice-advanced set vad.sttGateHoldMs 600

# Restore library defaults
voicethere projects voice-advanced reset
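
The dotted keys passed to set address nested settings. As a rough mental model (an assumption about how keys map to structure, not the CLI's source code), a key such as vad.bargeIn.requireSttPartial can be read as a path into a nested object. The setByPath helper below is hypothetical and exists only to illustrate that mapping.

```typescript
// Hypothetical helper: applies a dotted key to a nested settings object.
// This illustrates the key layout only; it is not part of the CLI or SDK.
type SettingValue = string | number | boolean;
interface SettingsTree { [key: string]: SettingValue | SettingsTree }

function setByPath(root: SettingsTree, dottedKey: string, value: SettingValue): SettingsTree {
  const parts = dottedKey.split(".");
  const leaf = parts.pop();
  if (leaf === undefined || leaf === "") {
    throw new Error("Setting key must not be empty");
  }
  let node = root;
  for (const part of parts) {
    const next = node[part];
    if (typeof next !== "object") {
      // Create (or overwrite a non-object with) an intermediate group.
      const created: SettingsTree = {};
      node[part] = created;
      node = created;
    } else {
      node = next;
    }
  }
  node[leaf] = value;
  return root;
}

// Mirrors two of the CLI examples above.
const overrides = setByPath({}, "vad.bargeIn.requireSttPartial", false);
setByPath(overrides, "vad.minSilenceDurationMs", 800);
```

After these two calls, overrides holds a vad group containing bargeIn.requireSttPartial set to false and minSilenceDurationMs set to 800.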

Setting reference

Defaults match the SDK production preset (VOICE_AGENT_VAD_PRESET) plus events.mode: both.

| Key | Default | When to change |
| --- | --- | --- |
| vad.enabled | true | Master switch for voice activity detection on inbound audio. Disable only for always-on STT experiments. Most voice agents should leave this on. |
| vad.provider | energy | energy = RMS level detector (default build). silero = neural VAD when the native build includes it. Stay on energy unless you ship a silero-enabled runner image and need softer speech detection. |
| vad.threshold | 0.15 | Energy VAD: RMS level (~0.05–0.2 typical). Silero: speech probability (~0.3–0.6). Raise in noisy rooms to reduce false speech starts. Lower if users must speak quietly and VAD misses them. |
| vad.minSpeechDurationMs | 250 | Voiced audio must exceed this before user_speaking_start fires. Increase to ignore brief coughs and clicks. Decrease for snappier turn detection in quiet environments. |
| vad.minSilenceDurationMs | 1300 | Continuous silence before VAD treats a phrase as ending (intra-utterance gaps). Lower for faster end-of-utterance and quicker replies. Raise if users pause mid-sentence and you get early finals. |
| vad.speechPadMs | 500 | Pre-roll ring capacity fed to STT at speech start when gateStt is enabled. Raise if the first syllable is clipped, especially during barge-in over agent TTS. |
| vad.sampleRate | 16000 | Internal VAD sample rate. WebRTC PCM is resampled to mono 16 kHz for STT. Leave at 16000 unless you have a specific 8 kHz telephony pipeline. |
| vad.gateStt | true | When true, STT receives audio only while the gate is open (speech, hold, or pending). Recommended for voice agents. Disable for continuous STT streaming (higher cost/noise). |
| vad.gateSttOpenOnPending | true | When gateStt is true, feed STT during VAD pending speech before SpeechStart. Keep enabled to capture WebRTC lead-in audio. Disable to tighten when STT opens. |
| vad.sttGateHoldMs | 1000 | After VAD speech end, keep passing audio to STT for trailing phonemes and word gaps. Raise if finals truncate the last word. Lower for faster user_speaking_end and agent replies. |
| vad.sttListenTimeoutMs | 4000 | After vad_triggered, emit user_stt_not_found when no STT partial arrives within this window. Raise in slow STT setups. Lower to fail fast when the user only made noise without speech. |
| vad.utteranceFinalizeTimeoutMs | 1500 | Grace after the last partial or VAD SpeechEnd before forcing user_speech_final. Raise if cloud STT stalls on finals. Lower when partials are stable and you want snappier turns. |
| vad.bargeIn.enabled | true | Master switch to stop agent TTS and emit barge_in when interrupted. Disable for broadcast-only agents where users must listen to the full reply. |
| vad.bargeIn.useVad | true | When true, inbound VAD SpeechStart can trigger barge-in during agent TTS. Set false to allow manual barge-in via flushTts only (no automatic interrupt on noise). |
| vad.bargeIn.flushTts | true | Clear pending outbound TTS PCM when barge-in runs. Disable only if your app manages playback cancellation itself. |
| vad.bargeIn.requireSttPartial | true | During agent TTS, wait for a qualifying user_speech_partial before barge-in (semantic interrupt). Keep on to ignore coughs and tones. Set false for instant VAD barge (noisier, faster). |
| vad.bargeIn.minSttPartialChars | 2 | Minimum trimmed STT partial length to trigger barge when requireSttPartial is true. Raise to ignore short false partials from echo or background speech. |
| vad.bargeIn.agentPlaybackGuardMs | 0 | Ignore VAD barge-in for this many ms after agent TTS starts (mitigates speaker echo). Prefer headphones and requireSttPartial first. Raise only if speaker bleed falsely interrupts TTS. |
| events.mode | both | How speech events reach your agent: callback, async iterator stream, or both. Cloud runners use both by default. Change only if your agent bundle expects a specific delivery mode. |
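
For reference in code, the defaults from the table above can be written as a typed object. The interface name AdvancedVoiceSettings and the constant ADVANCED_VOICE_DEFAULTS are hypothetical and are not exported by the SDK; only the field names and default values are taken from the table. The nesting is assumed to mirror the dotted keys, and events.mode is typed as a plain string because this page does not list the literal values for the other delivery modes.

```typescript
// Hypothetical type mirroring the setting reference; not an SDK export.
interface AdvancedVoiceSettings {
  vad: {
    enabled: boolean;
    provider: "energy" | "silero";
    threshold: number;
    minSpeechDurationMs: number;
    minSilenceDurationMs: number;
    speechPadMs: number;
    sampleRate: number;
    gateStt: boolean;
    gateSttOpenOnPending: boolean;
    sttGateHoldMs: number;
    sttListenTimeoutMs: number;
    utteranceFinalizeTimeoutMs: number;
    bargeIn: {
      enabled: boolean;
      useVad: boolean;
      flushTts: boolean;
      requireSttPartial: boolean;
      minSttPartialChars: number;
      agentPlaybackGuardMs: number;
    };
  };
  // "both" is the documented default; other literal values are not shown here.
  events: { mode: string };
}

// Default values copied from the setting reference table.
const ADVANCED_VOICE_DEFAULTS: AdvancedVoiceSettings = {
  vad: {
    enabled: true,
    provider: "energy",
    threshold: 0.15,
    minSpeechDurationMs: 250,
    minSilenceDurationMs: 1300,
    speechPadMs: 500,
    sampleRate: 16000,
    gateStt: true,
    gateSttOpenOnPending: true,
    sttGateHoldMs: 1000,
    sttListenTimeoutMs: 4000,
    utteranceFinalizeTimeoutMs: 1500,
    bargeIn: {
      enabled: true,
      useVad: true,
      flushTts: true,
      requireSttPartial: true,
      minSttPartialChars: 2,
      agentPlaybackGuardMs: 0,
    },
  },
  events: { mode: "both" },
};
```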

Example scenarios

Noisy room — reduce false speech detection

  • Raise vad.threshold (e.g. 0.2–0.25).
  • Increase vad.minSpeechDurationMs to ignore brief noise bursts.
  • Keep semantic barge-in enabled.
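
To get a feel for what vad.threshold is compared against with the energy provider, the sketch below computes the RMS level of one frame of normalized PCM samples (range -1 to 1). It shows the general technique under simplifying assumptions and is not the SDK's detector; frameRms and exceedsThreshold are hypothetical names, and the real pipeline also applies the duration settings described above.

```typescript
// Root-mean-square level of one audio frame of normalized samples (-1..1).
// Simplified illustration; not the SDK's energy VAD.
function frameRms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  samples.forEach((s) => {
    sumSquares += s * s;
  });
  return Math.sqrt(sumSquares / samples.length);
}

// Hypothetical check: does this frame's level exceed the configured threshold?
function exceedsThreshold(samples: Float32Array, threshold: number): boolean {
  return frameRms(samples) > threshold;
}
```

Under this simplified comparison, steady background noise at an RMS level of 0.18 would exceed the default threshold of 0.15 but not a raised threshold of 0.2.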

Faster replies — shorter pauses

  • Lower vad.minSilenceDurationMs (e.g. 800–1000).
  • Lower vad.sttGateHoldMs (e.g. 600–800).
  • Watch for truncated last words: raise vad.sttGateHoldMs again if finals clip endings.

Speaker echo interrupts agent TTS

  • Keep vad.bargeIn.requireSttPartial true (default).
  • Optionally raise vad.bargeIn.agentPlaybackGuardMs (e.g. 200–500) after trying headphones or lower speaker volume.
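
How the barge-in settings interact can be sketched as a single decision function. This is one plausible reading of the setting descriptions, not the SDK's logic: BargeInContext and shouldBargeIn are hypothetical, and in particular it is an assumption that the playback guard is checked before the STT partial.

```typescript
// Hypothetical decision model built from the setting descriptions on this page.
interface BargeInSettings {
  enabled: boolean;
  useVad: boolean;
  requireSttPartial: boolean;
  minSttPartialChars: number;
  agentPlaybackGuardMs: number;
}

interface BargeInContext {
  agentSpeaking: boolean;       // agent TTS is currently playing
  msSinceTtsStart: number;      // elapsed time since agent TTS started
  vadSpeechStarted: boolean;    // inbound VAD reported speech start
  latestPartial: string | null; // most recent STT partial, if any
}

function shouldBargeIn(cfg: BargeInSettings, ctx: BargeInContext): boolean {
  // Automatic barge-in needs both switches on and the agent actually speaking.
  if (!cfg.enabled || !cfg.useVad || !ctx.agentSpeaking) return false;
  if (!ctx.vadSpeechStarted) return false;
  // Ignore interrupts inside the guard window right after TTS starts.
  if (ctx.msSinceTtsStart < cfg.agentPlaybackGuardMs) return false;
  // Instant VAD barge-in when the semantic check is disabled.
  if (!cfg.requireSttPartial) return true;
  // Semantic interrupt: require a long enough trimmed partial.
  const partial = (ctx.latestPartial ?? "").trim();
  return partial.length >= cfg.minSttPartialChars;
}
```

In this model, speaker echo that trips VAD but produces no transcript never interrupts playback while requireSttPartial is on, which is why the guard window is only a second line of defense.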

Broadcast agent — no user interrupt

  • Set vad.bargeIn.enabled to false.

Self-hosted agents

If you run @node-webrtc-rust/sdk locally, pass the same fields on VoiceAgentConfig. See the open-source VAD and barge-in guide for detailed tuning timelines and event order.

Related: Control plane API, Agent environment & secrets.

Library default snapshot: gateStt=true, speechPadMs=500, requireSttPartial=true.
