feat(dictation): realtime streaming STT (live dictation) #118
Closed
vvzvlad wants to merge 4 commits from feature/streaming-dictation into develop
pull from: feature/streaming-dictation
merge into: vvzvlad:develop
4 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c70dac79ad |
fix(dictation): drop the under-input interim preview in chat
The live partial transcript was shown as a dimmed line under the chat textarea, but each segment only flickers there for an instant before its final lands in the draft, so the preview was unreadable noise. Remove it: realtime partials are no longer rendered separately; finalized segments are appended straight into the draft (where the user actually reads them). Editor dictation (inline ghost at the caret) is unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> |
||
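As a rough illustration of "finalized segments are appended straight into the draft", the sketch below shows one way such an append could work. The function name echoes `appendFinalToDraft` from this PR, but its signature and the trimming and separator rules here are assumptions for illustration, not the PR's actual code.

```typescript
// Hypothetical helper: append a finalized transcript segment to the chat draft.
// Signature and whitespace handling are assumptions, not taken from the PR.
function appendFinalToDraft(draft: string, finalText: string): string {
  const segment = finalText.trim();
  if (segment === "") return draft; // nothing recognized: leave the draft untouched
  // No separator needed for an empty draft or one that already ends in whitespace.
  if (draft === "" || /\s$/.test(draft)) return draft + segment;
  return draft + " " + segment; // single-space separator between segments
}
```

Because the function is pure (string in, string out), it can be unit-tested without a DOM or a live socket.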
|
|
85130da79f |
fix(dictation): realtime lockout (latest-wins), unify STT settings (#737), surface upstream failures
Found while live-testing the realtime dictation:

- 'already active' lockout (real bug): the per-user slot was tied to the connected socket lifetime, and a stale or racing socket could leave the counter stuck, so a fresh mic start was rejected. Now per-user single-session is enforced purely by LATEST-WINS EVICTION: a new connect disconnects the user's prior socket and frees its slot synchronously, and the user counter no longer participates in the cap decision (it could only cause false lockouts). Also free the slot when a start fails to open. The per-workspace cap is unchanged.
- #737: drop the separate sttRealtimeModel / sttRealtimeBaseUrl settings. Realtime dictation now reuses the existing STT model + base URL (the realtime WS endpoint is derived from it server-side). Removed the fields from the DTO, types, settings service, repo allowlist, and the settings UI. The STT 'Test endpoint' button is now a single context-aware button (probes the realtime WS endpoint when realtime is on, the batch endpoint otherwise), and the 'Request format' selector is disabled while realtime is on (realtime always uses the OpenAI Realtime protocol).
- no-silent-loss: parse the OpenAI conversation.item.input_audio_transcription.failed event (e.g. insufficient_quota, bad model) and surface its concrete reason to the client instead of dropping it silently. Previously a per-item transcription failure produced 'no words' with no explanation.

Tests: realtime suites green (gateway latest-wins eviction, parser .failed surfacing, ai-settings reuse-STT-model); server + client tsc clean; workspace vitest 37 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> |
||
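The latest-wins eviction described in this commit can be sketched roughly as follows. `SocketLike` and `LatestWinsRegistry` are hypothetical names introduced for illustration; the real gateway works with Socket.IO sockets, and its internals are not shown in this PR text.

```typescript
// Minimal stand-in for a connected socket; only `id` and `disconnect` matter here.
interface SocketLike {
  id: string;
  disconnect(): void;
}

// Hypothetical registry: at most one realtime session per user, newest wins.
class LatestWinsRegistry {
  private readonly byUser = new Map<string, SocketLike>();

  // Register a new connection, evicting the user's previous socket.
  // Returns the evicted socket id, or null when there was none.
  connect(userId: string, socket: SocketLike): string | null {
    const prior = this.byUser.get(userId);
    // Hand the slot to the new socket first, so a disconnect handler fired
    // by the eviction below cannot release the new session's slot.
    this.byUser.set(userId, socket);
    if (prior && prior.id !== socket.id) {
      prior.disconnect(); // synchronous eviction of the older session
      return prior.id;
    }
    return null;
  }

  // Release the slot only if it still belongs to this socket.
  release(userId: string, socket: SocketLike): void {
    if (this.byUser.get(userId)?.id === socket.id) this.byUser.delete(userId);
  }

  has(userId: string): boolean {
    return this.byUser.has(userId);
  }
}
```

The design point is that a connect never fails because of an older socket: instead of counting sessions and rejecting at a cap, the newest connection always takes the slot, so a stale socket cannot lock the user out.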
|
|
090a7c6b24 |
fix(dictation): address PR #118 review feedback (security, stability, tests)
Implements all reviewer comments (code-review, red-team, and test-strategy audit), accepting the recommended variants.

Server / realtime service (ai-realtime.service.ts):
- SSRF: pin the validated IP via a WebSocket `lookup` hook that re-checks every resolved address with isIpAllowed (mirrors external-mcp buildPinnedDispatcher), closing the TOCTOU/DNS-rebinding window; fix the misleading comment.
- no-silent-loss: on Stop, drain the in-flight segment (bounded 2.5s) and deliver the final via onFinal before closing instead of dropping the tail.
- fail-closed deriveRealtimeUrl: a non-empty unparseable base now THROWS (no silent api.openai.com fallback that would leak a self-hosted key); http:// and ws:// bases rejected (plaintext key). Path normalization preserved.
- parseUpstreamEvent keys the accumulator by item_id+content_index so GA segments don't concatenate.
- inject a wsFactory seam for testing; also fix a latent bug: `import WebSocket from 'ws'` resolved to undefined at runtime (no esModuleInterop) -> import=require.
- unref idle/max/drain timers.

Server / realtime gateway (ai-realtime.gateway.ts, session-limits.ts):
- reject revoked/disabled users and inactive sessions (mirror jwt.strategy: findById+isUserDisabled + findActiveById) with NO counter increment.
- CSWSH: Origin allowlist (matching APP_URL, or no Origin for native clients) before auth, no increment.
- extract SessionCounters (delete-at-zero, never negative) + pure canConnect (both caps >= checked before any increment); document the per-process/in-memory cap caveat (single-replica only).

Client:
- dictation-group: realtime final now inserts at the captured rangeRef SNAPSHOT (not the live caret) and guards editor.isEditable; single-space separator.
- use-realtime-dictation/realtime-dictation-client: stop-during-acquisition tears down the mic (no leak / button reset); reconnect re-emits start (double-start guarded); interim ghost cleared on teardown; io() options de-duplicated.
- pcm16-worklet: flush the partial sub-frame tail on stop; one-pole anti-aliasing low-pass before 48k->24k.
- extract shared mic-capture (acquireMicStream/mapGetUserMediaError, used by batch + realtime), pure DSP (pcm16-dsp.ts), and the session reducer/baseLanguageSubtag; extract applyInterimMeta/clampRange/resolveUrl/appendFinalToDraft.

Tests + infra: +~150 server tests (deriveRealtimeUrl, parseUpstreamEvent branches, openSession/lifecycle/timers/testConnection via fake ws, gateway auth/caps/no-leak, realtime-test admin contract, AiSettings update/resolve, DTO boolean, SSRF deny) and +~140 client tests (DSP property/edge, resampler continuity, framing, reducer, mic-capture, RealtimeDictationClient/MicButton, ProseMirror interim regression + history guards, appendFinalToDraft, resolveKeyField, route contract). Added @vitest/coverage-v8. CHANGELOG [Unreleased] entry incl. the single-replica caveat.

Review: APPROVE WITH SUGGESTIONS (no critical/regression); applied the drain-timer unref.

Server tsc clean + 358 tests; client tsc clean + 201 tests; vite build ok.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> |
||
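The "SessionCounters (delete-at-zero, never negative) + pure canConnect" item above could look roughly like the sketch below. The class and function names come from the commit message, but every signature, the `Caps` shape, and the cap values in the usage are assumptions for illustration. Note that commit 85130da79f later takes the per-user counter out of the cap decision, so this reflects the design as described in this commit only.

```typescript
// Hypothetical counter map: entries are deleted at zero and never go negative.
class SessionCounters {
  private readonly counts = new Map<string, number>();

  get(key: string): number {
    return this.counts.get(key) ?? 0;
  }

  increment(key: string): number {
    const next = this.get(key) + 1;
    this.counts.set(key, next);
    return next;
  }

  decrement(key: string): number {
    const next = Math.max(0, this.get(key) - 1); // a stray extra release cannot go below zero
    if (next === 0) this.counts.delete(key); // delete-at-zero keeps the map from growing
    else this.counts.set(key, next);
    return next;
  }

  get size(): number {
    return this.counts.size;
  }
}

// Assumed shape of the concurrency limits.
interface Caps {
  perUser: number;
  perWorkspace: number;
}

// Pure decision: both caps are checked with >= before anything is incremented,
// so a rejected connect leaves no counter behind.
function canConnect(userCount: number, workspaceCount: number, caps: Caps): boolean {
  return userCount < caps.perUser && workspaceCount < caps.perWorkspace;
}
```

A gateway would call `canConnect(users.get(userId), workspaces.get(workspaceId), caps)` first and increment both counters only on `true`. Since the counters live in process memory, the caps hold per replica, which is the single-replica caveat the commit documents.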
|
|
9faf927b93 |
feat(dictation): add realtime streaming STT (live dictation)
Layer an optional realtime speech-to-text path on top of the existing batch dictation, so transcribed text appears as the user speaks.

Transport A2: browser <-> our server (Socket.IO `/ai-realtime`) <-> OpenAI Realtime (raw ws). The provider API key never leaves the server; the upstream URL is SSRF-checked before connecting; the gateway enforces the dictation+dictationRealtime gate, cookie-JWT auth and per-user/per-workspace concurrency caps. Implemented against the GA (2026) OpenAI Realtime transcription contract (session.update / audio.input.format / server_vad), not the now-removed beta shape.

Editor UI B2: interim text is shown as a meta-only ProseMirror ghost decoration (no Yjs/history noise); only completed segments are committed. Chat shows interim as a dimmed tail. The mic button switches realtime vs batch by the workspace flag; batch remains the default and fallback.

Server:
- AiRealtimeService (upstream ws proxy, normalized events, idle/max-duration timeouts, idempotent teardown) + parseUpstreamEvent unit tests
- AiRealtimeGateway (Socket.IO `/ai-realtime`) wired into AiChatModule
- admin-gated POST /ai-chat/realtime/test connectivity probe
- config: settings.ai.dictationRealtime + provider sttRealtimeModel/sttRealtimeBaseUrl (realtime key reuses sttApiKey; no new secret)

Client:
- pcm16 AudioWorklet (24kHz mono PCM16), RealtimeDictationClient, use-realtime-dictation hook (status/start/stop/cancel + onInterim/onFinal)
- RealtimeMicButton + dictation-interim ProseMirror decoration
- editor/chat integration + AI settings UI (toggle, model, test endpoint)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> |
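For readers unfamiliar with the "24kHz mono PCM16" format the worklet produces, here is a minimal sketch of the two conversions involved, assuming a 48 kHz capture rate. Both function names are hypothetical, and the pair-averaging downsampler is a deliberately simple stand-in chosen for illustration, not the filter the worklet actually uses.

```typescript
// Halve the sample rate (e.g. 48 kHz -> 24 kHz) by averaging adjacent pairs.
// Averaging acts as a crude low-pass; a trailing odd sample is dropped here,
// whereas a streaming implementation would carry it into the next chunk.
function downsampleByTwo(input: Float32Array): Float32Array {
  const out = new Float32Array(Math.floor(input.length / 2));
  for (let i = 0; i < out.length; i++) {
    out[i] = ((input[2 * i] ?? 0) + (input[2 * i + 1] ?? 0)) / 2;
  }
  return out;
}

// Convert float samples in [-1, 1] to signed 16-bit PCM, clamping anything
// outside that range so loud input saturates instead of wrapping around.
function floatToPcm16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i] ?? 0));
    // Negative samples scale by 32768, positive by 32767 (asymmetric int16 range).
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}
```

A capture pipeline would typically chain the two, `floatToPcm16(downsampleByTwo(chunk))`, and send the resulting bytes to the server over the socket.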