Advanced voice settings
VoiceThere cloud agents use the @node-webrtc-rust/sdk speech pipeline: voice activity detection (VAD), optional STT gating, barge-in during agent TTS, and structured speech events. Advanced settings let you override library defaults per project without changing your agent bundle.
Where to configure
- Dashboard: Project overview → Voice (STT / TTS) → expand Advanced voice pipeline (collapsed by default). Each field has an info tooltip with tuning guidance.
- CLI: voicethere projects voice-advanced (see examples below).
Changes are saved on the project immediately but apply to live sessions only after you Deploy to cloud on the project overview.
Pipeline overview
- VAD detects when the user is speaking on the inbound audio track.
- When gateStt is enabled (recommended), STT receives audio only while the gate is open: during speech, pending lead-in, and post-speech hold.
- Barge-in stops agent TTS when the user interrupts. With requireSttPartial (default), playback continues until STT returns a qualifying partial (semantic interrupt).
- Your agent receives user_speech_final and other speech events, then runs LLM/TTS for the reply.
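The gating rule above can be pictured as a small predicate. This is a minimal sketch, not the SDK's implementation: the VadPhase, GateState, and isSttGateOpen names are hypothetical, and only the setting names (gateStt, gateSttOpenOnPending, sttGateHoldMs) and their defaults come from this page.

```typescript
// Hypothetical model of the STT gate; the SDK's real logic may differ.
type VadPhase = "idle" | "pending" | "speech";

interface GateSettings {
  gateStt: boolean;
  gateSttOpenOnPending: boolean;
  sttGateHoldMs: number;
}

interface GateState {
  phase: VadPhase;
  // Milliseconds since VAD last reported speech end; null if no speech has ended yet.
  msSinceSpeechEnd: number | null;
}

function isSttGateOpen(settings: GateSettings, state: GateState): boolean {
  if (!settings.gateStt) return true; // gating disabled: STT streams continuously
  if (state.phase === "speech") return true;
  if (state.phase === "pending") return settings.gateSttOpenOnPending;
  // Idle: the gate stays open only during the post-speech hold window.
  return state.msSinceSpeechEnd !== null && state.msSinceSpeechEnd < settings.sttGateHoldMs;
}

const gateDefaults: GateSettings = { gateStt: true, gateSttOpenOnPending: true, sttGateHoldMs: 1000 };
console.log(isSttGateOpen(gateDefaults, { phase: "idle", msSinceSpeechEnd: 400 }));  // true
console.log(isSttGateOpen(gateDefaults, { phase: "idle", msSinceSpeechEnd: 1500 })); // false
```

Under this model, lowering sttGateHoldMs closes the gate sooner after speech ends, which is why a short hold can clip trailing words.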
CLI examples
    # List resolved values for the linked project
    voicethere projects voice-advanced list
    # Disable semantic barge-in (instant VAD interrupt, noisier)
    voicethere projects voice-advanced set vad.bargeIn.requireSttPartial false
    # Faster end-of-utterance in quiet rooms
    voicethere projects voice-advanced set vad.minSilenceDurationMs 800
    voicethere projects voice-advanced set vad.sttGateHoldMs 600
    # Restore library defaults
    voicethere projects voice-advanced reset
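When several keys change together, it may be convenient to keep the overrides in one nested object and generate the set commands from it. The helper below is a sketch: toSetCommands and OverrideValue are hypothetical names, and the only CLI syntax it relies on is the set <key> <value> form shown above.

```typescript
// Hypothetical helper: flattens nested overrides into dotted keys and
// renders one `set` command per leaf value.
type OverrideValue = string | number | boolean | { [key: string]: OverrideValue };

function toSetCommands(overrides: { [key: string]: OverrideValue }, prefix = ""): string[] {
  const commands: string[] = [];
  for (const [key, value] of Object.entries(overrides)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (typeof value === "object") {
      // Nested group such as vad.bargeIn: recurse with the extended path.
      commands.push(...toSetCommands(value, path));
    } else {
      commands.push(`voicethere projects voice-advanced set ${path} ${value}`);
    }
  }
  return commands;
}

const quietRoom = {
  vad: { minSilenceDurationMs: 800, sttGateHoldMs: 600, bargeIn: { requireSttPartial: false } },
};
for (const command of toSetCommands(quietRoom)) console.log(command);
// voicethere projects voice-advanced set vad.minSilenceDurationMs 800
// voicethere projects voice-advanced set vad.sttGateHoldMs 600
// voicethere projects voice-advanced set vad.bargeIn.requireSttPartial false
```

The script only prints commands; running them, and deploying afterwards, remains a manual step.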
Setting reference
Defaults match the SDK production preset (VOICE_AGENT_VAD_PRESET) plus events.mode: both.
| Key | Default | When to change |
|---|---|---|
| vad.enabled | true | Master switch for voice activity detection on inbound audio. Disable only for always-on STT experiments. Most voice agents should leave this on. |
| vad.provider | energy | energy = RMS level detector (default build). silero = neural VAD when the native build includes it. Stay on energy unless you ship a silero-enabled runner image and need softer speech detection. |
| vad.threshold | 0.15 | Energy VAD: RMS level (~0.05–0.2 typical). Silero: speech probability (~0.3–0.6). Raise in noisy rooms to reduce false speech starts. Lower if users must speak quietly and VAD misses them. |
| vad.minSpeechDurationMs | 250 | Voiced audio must exceed this before user_speaking_start fires. Increase to ignore brief coughs and clicks. Decrease for snappier turn detection in quiet environments. |
| vad.minSilenceDurationMs | 1300 | Continuous silence before VAD treats a phrase as ending (intra-utterance gaps). Lower for faster end-of-utterance and quicker replies. Raise if users pause mid-sentence and you get early finals. |
| vad.speechPadMs | 500 | Pre-roll ring capacity fed to STT at speech start when gateStt is enabled. Raise if the first syllable is clipped, especially during barge-in over agent TTS. |
| vad.sampleRate | 16000 | Internal VAD sample rate. WebRTC PCM is resampled to mono 16 kHz for STT. Leave at 16000 unless you have a specific 8 kHz telephony pipeline. |
| vad.gateStt | true | When true, STT receives audio only while the gate is open (speech, hold, or pending). Recommended for voice agents. Disable for continuous STT streaming (higher cost/noise). |
| vad.gateSttOpenOnPending | true | When gateStt is true, feed STT during VAD pending speech before SpeechStart. Keep enabled to capture WebRTC lead-in audio. Disable to tighten when STT opens. |
| vad.sttGateHoldMs | 1000 | After VAD speech end, keep passing audio to STT for trailing phonemes and word gaps. Raise if finals truncate the last word. Lower for faster user_speaking_end and agent replies. |
| vad.sttListenTimeoutMs | 4000 | After vad_triggered, emit user_stt_not_found when no STT partial arrives within this window. Raise in slow STT setups. Lower to fail fast when the user only made noise without speech. |
| vad.utteranceFinalizeTimeoutMs | 1500 | Grace after the last partial or VAD SpeechEnd before forcing user_speech_final. Raise if cloud STT stalls on finals. Lower when partials are stable and you want snappier turns. |
| vad.bargeIn.enabled | true | Master switch to stop agent TTS and emit barge_in when interrupted. Disable for broadcast-only agents where users must listen to the full reply. |
| vad.bargeIn.useVad | true | When true, inbound VAD SpeechStart can trigger barge-in during agent TTS. Set false to allow manual barge-in via flushTts only (no automatic interrupt on noise). |
| vad.bargeIn.flushTts | true | Clear pending outbound TTS PCM when barge-in runs. Disable only if your app manages playback cancellation itself. |
| vad.bargeIn.requireSttPartial | true | During agent TTS, wait for a qualifying user_speech_partial before barge-in (semantic interrupt). Keep on to ignore coughs and tones. Set false for instant VAD barge (noisier, faster). |
| vad.bargeIn.minSttPartialChars | 2 | Minimum trimmed STT partial length to trigger barge when requireSttPartial is true. Raise to ignore short false partials from echo or background speech. |
| vad.bargeIn.agentPlaybackGuardMs | 0 | Ignore VAD barge-in for this many ms after agent TTS starts (mitigates speaker echo). Prefer headphones and requireSttPartial first. Raise only if speaker bleed falsely interrupts TTS. |
| events.mode | both | How speech events reach your agent: callback, async iterator stream, or both. Cloud runners use both by default. Change only if your agent bundle expects a specific delivery mode. |
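For reference, the table can be written down as a typed defaults object with a merge step that resolves per-project overrides. This is a sketch under assumptions: the nesting is inferred from the dotted key names, the interface and function names (AdvancedVoiceSettings, resolveSettings, TABLE_DEFAULTS) are hypothetical, and events.mode is typed as a plain string because this page does not list its exact literal values.

```typescript
// Shapes inferred from the table keys; the SDK's real types may differ.
interface BargeInSettings {
  enabled: boolean;
  useVad: boolean;
  flushTts: boolean;
  requireSttPartial: boolean;
  minSttPartialChars: number;
  agentPlaybackGuardMs: number;
}

interface VadSettings {
  enabled: boolean;
  provider: "energy" | "silero";
  threshold: number;
  minSpeechDurationMs: number;
  minSilenceDurationMs: number;
  speechPadMs: number;
  sampleRate: number;
  gateStt: boolean;
  gateSttOpenOnPending: boolean;
  sttGateHoldMs: number;
  sttListenTimeoutMs: number;
  utteranceFinalizeTimeoutMs: number;
  bargeIn: BargeInSettings;
}

interface AdvancedVoiceSettings {
  vad: VadSettings;
  events: { mode: string };
}

// Default values copied from the setting reference table.
const TABLE_DEFAULTS: AdvancedVoiceSettings = {
  vad: {
    enabled: true,
    provider: "energy",
    threshold: 0.15,
    minSpeechDurationMs: 250,
    minSilenceDurationMs: 1300,
    speechPadMs: 500,
    sampleRate: 16000,
    gateStt: true,
    gateSttOpenOnPending: true,
    sttGateHoldMs: 1000,
    sttListenTimeoutMs: 4000,
    utteranceFinalizeTimeoutMs: 1500,
    bargeIn: {
      enabled: true,
      useVad: true,
      flushTts: true,
      requireSttPartial: true,
      minSttPartialChars: 2,
      agentPlaybackGuardMs: 0,
    },
  },
  events: { mode: "both" },
};

type VadOverrides = Partial<Omit<VadSettings, "bargeIn">> & { bargeIn?: Partial<BargeInSettings> };

interface VoiceOverrides {
  vad?: VadOverrides;
  events?: { mode?: string };
}

// Overrides win over defaults at every level; the defaults object is not mutated.
function resolveSettings(overrides: VoiceOverrides): AdvancedVoiceSettings {
  return {
    vad: {
      ...TABLE_DEFAULTS.vad,
      ...overrides.vad,
      bargeIn: { ...TABLE_DEFAULTS.vad.bargeIn, ...overrides.vad?.bargeIn },
    },
    events: { ...TABLE_DEFAULTS.events, ...overrides.events },
  };
}

const resolved = resolveSettings({ vad: { minSilenceDurationMs: 800, bargeIn: { requireSttPartial: false } } });
console.log(resolved.vad.minSilenceDurationMs, resolved.vad.bargeIn.minSttPartialChars); // 800 2
```

Keeping overrides sparse, as here, makes it easy to see which values differ from the library defaults.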
Example scenarios
Noisy room — reduce false speech detection
- Raise vad.threshold (e.g. 0.2–0.25).
- Increase vad.minSpeechDurationMs to ignore brief noise bursts.
- Keep semantic barge-in enabled.
Faster replies — shorter pauses
- Lower vad.minSilenceDurationMs (e.g. 800–1000).
- Lower vad.sttGateHoldMs (e.g. 600–800).
- Watch for truncated last words; increase the hold if finals clip endings.
Speaker echo interrupts agent TTS
- Keep vad.bargeIn.requireSttPartial true (default).
- Optionally raise vad.bargeIn.agentPlaybackGuardMs (e.g. 200–500) after trying headphones or lower speaker volume.
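To see how these barge-in settings interact, the decision can be modelled as a single function. This is an illustrative sketch only: shouldBargeIn and the BargeInInput shape are hypothetical, the order of the checks is an assumption, and the SDK may evaluate these conditions differently.

```typescript
// Hypothetical model of automatic barge-in during agent TTS.
interface BargeInConfig {
  enabled: boolean;
  useVad: boolean;
  requireSttPartial: boolean;
  minSttPartialChars: number;
  agentPlaybackGuardMs: number;
}

interface BargeInInput {
  agentSpeaking: boolean;    // agent TTS is currently playing
  msSinceTtsStart: number;   // elapsed playback time
  vadSpeechStart: boolean;   // VAD has reported user speech
  sttPartial: string | null; // latest STT partial, if any
}

function shouldBargeIn(cfg: BargeInConfig, input: BargeInInput): boolean {
  if (!cfg.enabled || !cfg.useVad || !input.agentSpeaking) return false;
  // Guard window: ignore triggers right after TTS starts (speaker echo).
  if (input.msSinceTtsStart < cfg.agentPlaybackGuardMs) return false;
  // Instant mode: VAD speech start alone interrupts playback.
  if (!cfg.requireSttPartial) return input.vadSpeechStart;
  // Semantic mode: wait for a partial of sufficient trimmed length.
  const text = (input.sttPartial ?? "").trim();
  return text.length >= cfg.minSttPartialChars;
}

const bargeDefaults: BargeInConfig = {
  enabled: true, useVad: true, requireSttPartial: true, minSttPartialChars: 2, agentPlaybackGuardMs: 0,
};
const speaking = { agentSpeaking: true, msSinceTtsStart: 100, vadSpeechStart: true };
console.log(shouldBargeIn(bargeDefaults, { ...speaking, sttPartial: null }));   // false
console.log(shouldBargeIn(bargeDefaults, { ...speaking, sttPartial: "wait" })); // true
console.log(shouldBargeIn({ ...bargeDefaults, agentPlaybackGuardMs: 300 }, { ...speaking, sttPartial: "wait" })); // false
```

In this model a cough that triggers VAD but yields no partial never interrupts playback, while a guard of 300 ms suppresses even a valid partial during the first 300 ms of TTS.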
Broadcast agent — no user interrupt
- Set vad.bargeIn.enabled to false.
Self-hosted agents
If you run @node-webrtc-rust/sdk locally, pass the same fields on VoiceAgentConfig. See the open-source VAD and barge-in guide for deep tuning timelines and event order.
Related: Control plane API, Agent environment & secrets.
Library default snapshot: gateStt=true, speechPadMs=500, requireSttPartial=true.