feat(dictation): streaming STT via silence cut (Silero VAD)

Add a lightweight "streaming" dictation mode as a simpler alternative to the
realtime-websocket path: detect speech with Silero VAD (@ricky0123/vad-web),
cut each segment on a pause and POST it to the existing /ai-chat/transcribe
endpoint, so text appears progressively. No server changes.

- new useStreamingDictation hook (same API as useDictation), lazy-loads VAD,
  in-order seq emission, session-epoch guard against stop->start races
- new encodeWavPcm16 util (Float32 -> mono PCM16 WAV, accepted by the server)
- MicButton gains a `streaming` prop; enabled in the editor toolbar and chat
- VAD tuning: redemptionMs 640 / preSpeechPadMs 320 / minSpeechMs 96
- batch dictation kept as the fallback (streaming=false)
- deps: @ricky0123/vad-web@0.0.30, onnxruntime-web@1.27.0

Note: VAD assets load from the library CDN by default; for self-hosted/offline
set VAD_BASE_ASSET_PATH/VAD_ONNX_WASM_BASE_PATH and copy assets to public/vad/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
claude_code
2026-06-22 16:52:05 +03:00
parent 7ce1a24f82
commit 4f0da42d88
7 changed files with 555 additions and 16 deletions

View File

@@ -28,6 +28,7 @@
"@mantine/modals": "8.3.18",
"@mantine/notifications": "8.3.18",
"@mantine/spotlight": "8.3.18",
"@ricky0123/vad-web": "^0.0.30",
"@slidoapp/emoji-mart": "5.8.7",
"@slidoapp/emoji-mart-data": "1.2.4",
"@slidoapp/emoji-mart-react": "1.1.5",
@@ -53,6 +54,7 @@
"mantine-form-zod-resolver": "1.3.0",
"mermaid": "11.15.0",
"mitt": "3.0.1",
"onnxruntime-web": "^1.27.0",
"posthog-js": "1.372.2",
"react": "18.3.1",
"react-clear-modal": "^2.0.18",