Advanced voice settings

VoiceThere cloud agents use the @node-webrtc-rust/sdk speech pipeline: voice activity detection (VAD), optional STT gating, barge-in during agent TTS, and structured speech events. Advanced settings let you override library defaults per project without changing your agent bundle.

Where to configure

Dashboard: Project overview → Voice (STT / TTS) → expand Advanced voice pipeline (collapsed by default). Each field has an info tooltip with tuning guidance.
CLI: voicethere projects voice-advanced — see examples below.

Changes are saved to the project immediately, but they apply to live sessions only after you select Deploy to cloud on the project overview.

Pipeline overview

VAD detects when the user is speaking on the inbound audio track.
When gateStt is enabled (recommended), STT receives audio only while the gate is open — during speech, pending lead-in, and post-speech hold.
Barge-in stops agent TTS when the user interrupts. With requireSttPartial enabled (the default), playback continues until STT returns a qualifying partial (semantic interrupt).
Your agent receives user_speech_final and other speech events, then runs LLM/TTS for the reply.
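
The gating rule described above can be sketched as a small predicate. This is an illustration only: GateConfig, GateState, and isSttGateOpen are hypothetical names, not part of @node-webrtc-rust/sdk, and the sketch assumes that confirmed speech, pending lead-in, and the post-speech hold are the only conditions that open the gate.

```typescript
// Hypothetical model of the STT gate; names are illustrative, not SDK API.
interface GateConfig {
  gateStt: boolean;
  gateSttOpenOnPending: boolean;
  sttGateHoldMs: number;
}

interface GateState {
  speaking: boolean;               // VAD has confirmed speech (after SpeechStart)
  pending: boolean;                // VAD sees possible speech, not yet confirmed
  msSinceSpeechEnd: number | null; // null if no speech has ended yet
}

function isSttGateOpen(cfg: GateConfig, state: GateState): boolean {
  if (!cfg.gateStt) return true; // gating disabled: STT receives audio continuously
  if (state.speaking) return true;
  if (state.pending && cfg.gateSttOpenOnPending) return true;
  // Post-speech hold keeps the gate open for trailing phonemes and word gaps.
  return state.msSinceSpeechEnd !== null && state.msSinceSpeechEnd < cfg.sttGateHoldMs;
}

// Values taken from the documented defaults.
const gateDefaults: GateConfig = {
  gateStt: true,
  gateSttOpenOnPending: true,
  sttGateHoldMs: 1000,
};
```

Under this model, lowering sttGateHoldMs closes the gate sooner after speech ends, which is why a very low value can clip the last word.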

CLI examples

# List resolved values for the linked project
voicethere projects voice-advanced list

# Disable semantic barge-in (instant VAD interrupt — noisier)
voicethere projects voice-advanced set vad.bargeIn.requireSttPartial false

# Faster end-of-utterance in quiet rooms
voicethere projects voice-advanced set vad.minSilenceDurationMs 800
voicethere projects voice-advanced set vad.sttGateHoldMs 600

# Restore library defaults
voicethere projects voice-advanced reset

Setting reference

Defaults match the SDK production preset (VOICE_AGENT_VAD_PRESET) plus events.mode: both.

Key	Default	When to change
vad.enabled	true	Master switch for voice activity detection on inbound audio. Disable only for always-on STT experiments. Most voice agents should leave this on.
vad.provider	energy	energy = RMS level detector (default build). silero = neural VAD when the native build includes it. Stay on energy unless you ship a silero-enabled runner image and need softer speech detection.
vad.threshold	0.15	Energy VAD: RMS level (~0.05–0.2 typical). Silero: speech probability (~0.3–0.6). Raise in noisy rooms to reduce false speech starts. Lower if users must speak quietly and VAD misses them.
vad.minSpeechDurationMs	250	Voiced audio must exceed this before user_speaking_start fires. Increase to ignore brief coughs and clicks. Decrease for snappier turn detection in quiet environments.
vad.minSilenceDurationMs	1300	Continuous silence required before VAD treats a phrase as ended; shorter gaps count as pauses within the utterance. Lower for faster end-of-utterance and quicker replies. Raise if users pause mid-sentence and you get early finals.
vad.speechPadMs	500	Pre-roll ring capacity fed to STT at speech start when gateStt is enabled. Raise if the first syllable is clipped, especially during barge-in over agent TTS.
vad.sampleRate	16000	Internal VAD sample rate. WebRTC PCM is resampled to mono 16 kHz for STT. Leave at 16000 unless you have a specific 8 kHz telephony pipeline.
vad.gateStt	true	When true, STT receives audio only while the gate is open (speech, hold, or pending). Recommended for voice agents. Disable for continuous STT streaming (higher cost/noise).
vad.gateSttOpenOnPending	true	When gateStt is true, feed STT during VAD pending speech before SpeechStart. Keep enabled to capture WebRTC lead-in audio. Disable to tighten when STT opens.
vad.sttGateHoldMs	1000	After VAD speech end, keep passing audio to STT for trailing phonemes and word gaps. Raise if finals truncate the last word. Lower for faster user_speaking_end and agent replies.
vad.sttListenTimeoutMs	4000	After vad_triggered, emit user_stt_not_found when no STT partial arrives within this window. Raise in slow STT setups. Lower to fail fast when the user only made noise without speech.
vad.utteranceFinalizeTimeoutMs	1500	Grace after the last partial or VAD SpeechEnd before forcing user_speech_final. Raise if cloud STT stalls on finals. Lower when partials are stable and you want snappier turns.
vad.bargeIn.enabled	true	Master switch to stop agent TTS and emit barge_in when interrupted. Disable for broadcast-only agents where users must listen to the full reply.
vad.bargeIn.useVad	true	When true, inbound VAD SpeechStart can trigger barge-in during agent TTS. Set false to allow manual barge-in via flushTts only (no automatic interrupt on noise).
vad.bargeIn.flushTts	true	Clear pending outbound TTS PCM when barge-in runs. Disable only if your app manages playback cancellation itself.
vad.bargeIn.requireSttPartial	true	During agent TTS, wait for a qualifying user_speech_partial before barge-in (semantic interrupt). Keep on to ignore coughs and tones. Set false for instant VAD barge (noisier, faster).
vad.bargeIn.minSttPartialChars	2	Minimum trimmed STT partial length to trigger barge when requireSttPartial is true. Raise to ignore short false partials from echo or background speech.
vad.bargeIn.agentPlaybackGuardMs	0	Ignore VAD barge-in for this many ms after agent TTS starts (mitigates speaker echo). Prefer headphones and requireSttPartial first. Raise only if speaker bleed falsely interrupts TTS.
events.mode	both	How speech events reach your agent: callback, async iterator stream, or both. Cloud runners use both by default. Change only if your agent bundle expects a specific delivery mode.
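
For reference, the defaults in the table can be written out as a typed object. This is a sketch: the nested shape is inferred from the dotted key names, and AdvancedVoiceSettings, DEFAULT_ADVANCED_VOICE_SETTINGS, and getByPath are hypothetical names rather than SDK exports. The events.mode field is typed as a plain string because only the value both is spelled out above.

```typescript
// Sketch only: shape inferred from the dotted keys in the table above.
interface AdvancedVoiceSettings {
  vad: {
    enabled: boolean;
    provider: "energy" | "silero";
    threshold: number;
    minSpeechDurationMs: number;
    minSilenceDurationMs: number;
    speechPadMs: number;
    sampleRate: number;
    gateStt: boolean;
    gateSttOpenOnPending: boolean;
    sttGateHoldMs: number;
    sttListenTimeoutMs: number;
    utteranceFinalizeTimeoutMs: number;
    bargeIn: {
      enabled: boolean;
      useVad: boolean;
      flushTts: boolean;
      requireSttPartial: boolean;
      minSttPartialChars: number;
      agentPlaybackGuardMs: number;
    };
  };
  events: { mode: string };
}

// Default values copied from the table.
const DEFAULT_ADVANCED_VOICE_SETTINGS: AdvancedVoiceSettings = {
  vad: {
    enabled: true,
    provider: "energy",
    threshold: 0.15,
    minSpeechDurationMs: 250,
    minSilenceDurationMs: 1300,
    speechPadMs: 500,
    sampleRate: 16000,
    gateStt: true,
    gateSttOpenOnPending: true,
    sttGateHoldMs: 1000,
    sttListenTimeoutMs: 4000,
    utteranceFinalizeTimeoutMs: 1500,
    bargeIn: {
      enabled: true,
      useVad: true,
      flushTts: true,
      requireSttPartial: true,
      minSttPartialChars: 2,
      agentPlaybackGuardMs: 0,
    },
  },
  events: { mode: "both" },
};

// Look up a dotted key such as "vad.bargeIn.requireSttPartial".
// Returns undefined when any path segment is missing.
function getByPath(obj: unknown, path: string): unknown {
  let current: unknown = obj;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null || !(key in current)) {
      return undefined;
    }
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}
```

A call such as getByPath(DEFAULT_ADVANCED_VOICE_SETTINGS, "vad.sttGateHoldMs") shows how the dotted keys used by the CLI would map onto nested fields under this assumed shape.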

Example scenarios

Noisy room — reduce false speech detection

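Raise vad.threshold above the 0.15 default (e.g. 0.2–0.25 with the energy provider).
Increase vad.minSpeechDurationMs above the 250 ms default to ignore brief noise bursts.
Keep semantic barge-in enabled (vad.bargeIn.requireSttPartial true).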
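
To make the threshold concrete, here is a minimal sketch of an RMS energy check of the kind the energy provider is described as using. It assumes samples normalized to [-1, 1] and fixed-length frames; the SDK's actual framing, smoothing, and scaling are not covered here, so frameRms and detectSpeechStart are hypothetical helpers rather than library functions.

```typescript
// RMS level of one audio frame; samples are assumed normalized to [-1, 1].
function frameRms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  const sumSquares = frame.reduce((acc, sample) => acc + sample * sample, 0);
  return Math.sqrt(sumSquares / frame.length);
}

// Returns the index of the frame at which speech would be declared, or -1.
// Speech is declared once consecutive voiced frames exceed minSpeechDurationMs;
// any frame below the threshold resets the running voiced duration.
function detectSpeechStart(
  frames: Float32Array[],
  frameMs: number,
  threshold: number,
  minSpeechDurationMs: number,
): number {
  let voicedMs = 0;
  let index = 0;
  for (const frame of frames) {
    voicedMs = frameRms(frame) >= threshold ? voicedMs + frameMs : 0;
    if (voicedMs > minSpeechDurationMs) return index;
    index++;
  }
  return -1;
}
```

In this model, raising the threshold makes quieter frames count as silence, and raising minSpeechDurationMs means a short burst above the threshold resets before it can trigger a speech start.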

Faster replies — shorter pauses

Lower vad.minSilenceDurationMs from the 1300 ms default (e.g. 800–1000 ms).
Lower vad.sttGateHoldMs from the 1000 ms default (e.g. 600–800 ms).
Watch for truncated last words; if finals clip endings, raise vad.sttGateHoldMs again.

Speaker echo interrupts agent TTS

Keep vad.bargeIn.requireSttPartial true (default).
Optionally raise vad.bargeIn.agentPlaybackGuardMs (e.g. 200–500 ms) after trying headphones or a lower speaker volume.
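
How these two settings interact can be sketched as a decision function. This is one reading of the setting descriptions in the reference table, not the SDK's implementation: shouldBargeIn, BargeInConfig, and BargeInInput are hypothetical names, the sketch models automatic (VAD-driven) barge-in only, and it assumes the playback guard is checked before the STT partial requirement.

```typescript
interface BargeInConfig {
  enabled: boolean;
  useVad: boolean;
  requireSttPartial: boolean;
  minSttPartialChars: number;
  agentPlaybackGuardMs: number;
}

interface BargeInInput {
  agentSpeaking: boolean;    // agent TTS is currently playing
  msSinceTtsStart: number;   // time since agent TTS playback began
  vadSpeechStart: boolean;   // inbound VAD reported SpeechStart
  sttPartial: string | null; // latest STT partial text, if any
}

function shouldBargeIn(cfg: BargeInConfig, input: BargeInInput): boolean {
  if (!cfg.enabled || !cfg.useVad || !input.agentSpeaking) return false;
  // Guard window: ignore VAD right after TTS starts (speaker echo).
  if (input.msSinceTtsStart < cfg.agentPlaybackGuardMs) return false;
  if (!input.vadSpeechStart) return false;
  if (!cfg.requireSttPartial) return true; // instant VAD interrupt
  // Semantic interrupt: wait for a partial with enough trimmed characters.
  const text = (input.sttPartial ?? "").trim();
  return text.length >= cfg.minSttPartialChars;
}

// Values taken from the documented defaults.
const bargeInDefaults: BargeInConfig = {
  enabled: true,
  useVad: true,
  requireSttPartial: true,
  minSttPartialChars: 2,
  agentPlaybackGuardMs: 0,
};
```

With the defaults, echo that trips VAD but yields no STT partial does not interrupt playback, which is why requireSttPartial is the first line of defense and the guard window is a fallback.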

Broadcast agent — no user interrupt

Set vad.bargeIn.enabled to false.

Self-hosted agents

If you run @node-webrtc-rust/sdk locally, pass the same fields on VoiceAgentConfig. See the open-source VAD and barge-in guide for detailed tuning timelines and event order.

Library default snapshot: gateStt=true, speechPadMs=500, requireSttPartial=true.