vvzvlad/gitmost

feat(dictation): realtime streaming STT (live dictation) #118

Closed

vvzvlad wants to merge 4 commits from feature/streaming-dictation into develop

Author	SHA1	Message	Date
claude code agent 227	c70dac79ad	fix(dictation): drop the under-input interim preview in chat	2026-06-22 01:10:10 +03:00
    The live partial transcript was shown as a dimmed line under the chat textarea, but each segment only flickers there for an instant before its final lands in the draft, so the preview was unreadable noise. Remove it: realtime partials are no longer rendered separately; finalized segments are appended straight into the draft (where the user actually reads them). Editor dictation (inline ghost at the caret) is unaffected.
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
claude code agent 227	85130da79f	fix(dictation): realtime lockout (latest-wins), unify STT settings (#737), surface upstream failures	2026-06-22 01:10:10 +03:00
    Found while live-testing the realtime dictation:
    - 'already active' lockout (real bug): the per-user slot was tied to the connected socket lifetime and a stale/racing socket could leave the counter stuck, so a fresh mic start was rejected. Now per-user single-session is enforced purely by LATEST-WINS EVICTION (a new connect disconnects the user's prior socket and frees its slot synchronously), and the user counter no longer participates in the cap decision (it could only cause false lockouts). Also free the slot when a start fails to open. The per-workspace cap is unchanged.
    - #737: drop the separate sttRealtimeModel / sttRealtimeBaseUrl settings. Realtime dictation now reuses the existing STT model + base URL (the realtime WS endpoint is derived from it server-side). Removed the fields from the DTO, types, settings service, repo allowlist, and the settings UI. The STT 'Test endpoint' button is now a single context-aware button (probes the realtime WS endpoint when realtime is on, the batch endpoint otherwise), and the 'Request format' selector is disabled while realtime is on (realtime always uses the OpenAI Realtime protocol).
    - no-silent-loss: parse the OpenAI conversation.item.input_audio_transcription.failed event (e.g. insufficient_quota, bad model) and surface its concrete reason to the client instead of dropping it silently. Previously a per-item transcription failure produced 'no words' with no explanation.
    Tests: realtime suites green (gateway latest-wins eviction, parser .failed surfacing, ai-settings reuse-STT-model); server + client tsc clean; workspace vitest 37 pass.
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
claude code agent 227	090a7c6b24	fix(dictation): address PR #118 review feedback (security, stability, tests)	2026-06-22 01:08:50 +03:00
    Implements all reviewer comments (code-review, red-team, and test-strategy audit), accepting the recommended variants.
    Server, realtime service (ai-realtime.service.ts):
    - SSRF: pin the validated IP via a WebSocket `lookup` hook that re-checks every resolved address with isIpAllowed (mirrors external-mcp buildPinnedDispatcher), closing the TOCTOU/DNS-rebinding window; fix the misleading comment.
    - no-silent-loss: on Stop, drain the in-flight segment (bounded 2.5s) and deliver the final via onFinal before closing instead of dropping the tail.
    - fail-closed deriveRealtimeUrl: a non-empty unparseable base now THROWS (no silent api.openai.com fallback that would leak a self-hosted key); http:// and ws:// bases are rejected (plaintext key). Path normalization preserved.
    - parseUpstreamEvent keys the accumulator by item_id+content_index so GA segments don't concatenate.
    - inject a wsFactory seam for testing; also fix a latent bug: `import WebSocket from 'ws'` resolved to undefined at runtime (no esModuleInterop) -> import=require.
    - unref idle/max/drain timers.
    Server, realtime gateway (ai-realtime.gateway.ts, session-limits.ts):
    - reject revoked/disabled users and inactive sessions (mirror jwt.strategy: findById+isUserDisabled + findActiveById) with NO counter increment.
    - CSWSH: Origin allowlist (matching APP_URL, or no Origin for native clients) before auth, no increment.
    - extract SessionCounters (delete-at-zero, never negative) + pure canConnect (both caps >= checked before any increment); document the per-process/in-memory cap caveat (single-replica only).
    Client:
    - dictation-group: realtime final now inserts at the captured rangeRef SNAPSHOT (not the live caret) and guards editor.isEditable; single-space separator.
    - use-realtime-dictation/realtime-dictation-client: stop-during-acquisition tears down the mic (no leak / button reset); reconnect re-emits start (double-start guarded); interim ghost cleared on teardown; io() options de-duplicated.
    - pcm16-worklet: flush the partial sub-frame tail on stop; one-pole anti-aliasing low-pass before 48k->24k.
    - extract shared mic-capture (acquireMicStream/mapGetUserMediaError, used by batch + realtime), pure DSP (pcm16-dsp.ts), and the session reducer/baseLanguageSubtag; extract applyInterimMeta/clampRange/resolveUrl/appendFinalToDraft.
    Tests + infra: +~150 server tests (deriveRealtimeUrl, parseUpstreamEvent branches, openSession/lifecycle/timers/testConnection via fake ws, gateway auth/caps/no-leak, realtime-test admin contract, AiSettings update/resolve, DTO boolean, SSRF deny) and +~140 client tests (DSP property/edge, resampler continuity, framing, reducer, mic-capture, RealtimeDictationClient/MicButton, ProseMirror interim regression + history guards, appendFinalToDraft, resolveKeyField, route contract). Added @vitest/coverage-v8. CHANGELOG [Unreleased] entry incl. the single-replica caveat.
    Review: APPROVE WITH SUGGESTIONS (no critical/regression); applied the drain-timer unref. Server tsc clean + 358 tests; client tsc clean + 201 tests; vite build ok.
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
claude_code	9faf927b93	feat(dictation): add realtime streaming STT (live dictation)	2026-06-22 01:08:50 +03:00
    Layer an optional realtime speech-to-text path on top of the existing batch dictation, so transcribed text appears as the user speaks.
    Transport A2: browser <-> our server (Socket.IO `/ai-realtime`) <-> OpenAI Realtime (raw ws). The provider API key never leaves the server; the upstream URL is SSRF-checked before connecting; the gateway enforces the dictation+dictationRealtime gate, cookie-JWT auth and per-user/per-workspace concurrency caps. Implemented against the GA (2026) OpenAI Realtime transcription contract (session.update / audio.input.format / server_vad), not the now-removed beta shape.
    Editor UI B2: interim text is shown as a meta-only ProseMirror ghost decoration (no Yjs/history noise); only completed segments are committed. Chat shows interim as a dimmed tail. The mic button switches realtime vs batch by the workspace flag; batch remains the default and fallback.
    Server:
    - AiRealtimeService (upstream ws proxy, normalized events, idle/max-duration timeouts, idempotent teardown) + parseUpstreamEvent unit tests
    - AiRealtimeGateway (Socket.IO `/ai-realtime`) wired into AiChatModule
    - admin-gated POST /ai-chat/realtime/test connectivity probe
    - config: settings.ai.dictationRealtime + provider sttRealtimeModel/sttRealtimeBaseUrl (realtime key reuses sttApiKey; no new secret)
    Client:
    - pcm16 AudioWorklet (24kHz mono PCM16), RealtimeDictationClient, use-realtime-dictation hook (status/start/stop/cancel + onInterim/onFinal)
    - RealtimeMicButton + dictation-interim ProseMirror decoration
    - editor/chat integration + AI settings UI (toggle, model, test endpoint)
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
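
For readers skimming the commit list, the fail-closed URL derivation described in 090a7c6b24 can be pictured roughly as below. This is only a sketch under assumptions: the function name `deriveRealtimeUrlSketch`, the default endpoint constant, and the exact path rule (append `/realtime` unless already present) are hypothetical and not taken from ai-realtime.service.ts.

```typescript
// Hypothetical default; the commit only says an empty base may use api.openai.com.
const DEFAULT_REALTIME_URL_SKETCH = "wss://api.openai.com/v1/realtime";

// Derive the realtime WebSocket URL from the configured STT base URL.
// Fail closed: a non-empty base that cannot be parsed, or that would send the
// API key over plaintext, throws instead of silently falling back.
function deriveRealtimeUrlSketch(base?: string): string {
  const trimmed = (base ?? "").trim();
  if (trimmed === "") return DEFAULT_REALTIME_URL_SKETCH;

  let parsed: URL;
  try {
    parsed = new URL(trimmed);
  } catch {
    // No fallback here: falling back could leak a self-hosted key to another host.
    throw new Error("STT base URL is not parseable");
  }

  if (parsed.protocol === "http:" || parsed.protocol === "ws:") {
    throw new Error("plaintext STT base URL rejected (key would be sent unencrypted)");
  }
  if (parsed.protocol !== "https:" && parsed.protocol !== "wss:") {
    throw new Error("unsupported scheme in STT base URL");
  }

  // Assumed path normalization: drop trailing slashes, ensure a /realtime suffix.
  const path = parsed.pathname.replace(/\/+$/, "");
  const realtimePath = path.endsWith("/realtime") ? path : `${path}/realtime`;
  return `wss://${parsed.host}${realtimePath}`;
}
```

With these assumptions, `deriveRealtimeUrlSketch("https://stt.example.com/v1/")` yields `wss://stt.example.com/v1/realtime`, while `"not a url"` and `"http://stt.example.com/v1"` both throw.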
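
The `SessionCounters` (delete-at-zero, never negative) and pure `canConnect` extraction mentioned in 090a7c6b24 might look something like the following. The class and function names carry a `Sketch` suffix because the shapes, field names, and cap values here are assumptions, not the contents of session-limits.ts. Note that 85130da79f later removes the per-user counter from the cap decision in favour of latest-wins eviction, so the per-user check below reflects only the earlier commit.

```typescript
// In-memory, per-process counters (single-replica only, as the commit notes).
class SessionCountersSketch {
  private readonly counts = new Map<string, number>();

  get(key: string): number {
    return this.counts.get(key) ?? 0;
  }

  increment(key: string): number {
    const next = this.get(key) + 1;
    this.counts.set(key, next);
    return next;
  }

  // Never goes negative; the entry is deleted once it reaches zero
  // so the map does not grow with idle users or workspaces.
  decrement(key: string): number {
    const next = this.get(key) - 1;
    if (next <= 0) {
      this.counts.delete(key);
      return 0;
    }
    this.counts.set(key, next);
    return next;
  }

  get size(): number {
    return this.counts.size;
  }
}

interface CapCheckSketch {
  userCount: number;
  workspaceCount: number;
  userCap: number;
  workspaceCap: number;
}

// Pure decision: both caps are checked (count >= cap rejects) before the
// caller increments anything, so a rejected connect never leaks a slot.
function canConnectSketch(c: CapCheckSketch): boolean {
  return c.userCount < c.userCap && c.workspaceCount < c.workspaceCap;
}
```

A caller would run `canConnectSketch` first and only then call `increment` for the user and workspace keys, pairing each with a `decrement` on disconnect.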
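
The parser changes in 090a7c6b24 and 85130da79f (accumulator keyed by item_id+content_index, and surfacing the `.failed` event) can be illustrated with a small sketch. The event type strings come from the commit messages; the payload field names (`delta`, `transcript`, `error.code`, `error.message`), the normalized output shape, and the class name are assumptions and may not match parseUpstreamEvent.

```typescript
type UpstreamEventSketch =
  | { type: "conversation.item.input_audio_transcription.delta"; item_id: string; content_index: number; delta: string }
  | { type: "conversation.item.input_audio_transcription.completed"; item_id: string; content_index: number; transcript: string }
  | { type: "conversation.item.input_audio_transcription.failed"; item_id: string; content_index: number; error: { code?: string; message?: string } };

type NormalizedSketch =
  | { kind: "interim"; text: string }
  | { kind: "final"; text: string }
  | { kind: "error"; reason: string };

class TranscriptAccumulatorSketch {
  // One partial per (item_id, content_index) so interleaved segments stay separate.
  private readonly partials = new Map<string, string>();

  handle(ev: UpstreamEventSketch): NormalizedSketch {
    const key = `${ev.item_id}:${ev.content_index}`;
    switch (ev.type) {
      case "conversation.item.input_audio_transcription.delta": {
        const text = (this.partials.get(key) ?? "") + ev.delta;
        this.partials.set(key, text);
        return { kind: "interim", text };
      }
      case "conversation.item.input_audio_transcription.completed":
        this.partials.delete(key);
        return { kind: "final", text: ev.transcript };
      case "conversation.item.input_audio_transcription.failed": {
        this.partials.delete(key);
        // Surface the concrete reason instead of dropping the failure silently.
        const code = ev.error.code ?? "unknown_error";
        const message = ev.error.message ?? "transcription failed";
        return { kind: "error", reason: `${code}: ${message}` };
      }
    }
  }
}
```

Keying by the pair rather than keeping one running buffer means a delta for a second item never appends to the first item's interim text.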
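
The client-side audio path (one-pole low-pass before 48k->24k, then 24kHz mono PCM16) is described only in words above. A minimal sketch of such pure DSP helpers follows; the function names, the filter coefficient handling, and the simple keep-every-other-sample decimation are assumptions, and pcm16-dsp.ts may well differ (for example in how it keeps resampler phase across blocks).

```typescript
// One-pole low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).
// `state` carries y[n-1] across blocks so consecutive frames stay continuous.
function onePoleLowPassSketch(
  input: Float32Array,
  alpha: number,
  state = 0,
): { output: Float32Array; state: number } {
  const output = new Float32Array(input.length);
  let y = state;
  for (let i = 0; i < input.length; i++) {
    y += alpha * (input[i] - y);
    output[i] = y;
  }
  return { output, state: y };
}

// 2:1 decimation (e.g. 48 kHz -> 24 kHz): keep every other sample.
// Assumes the input was low-pass filtered first and has an even length.
function decimateBy2Sketch(input: Float32Array): Float32Array {
  const output = new Float32Array(Math.floor(input.length / 2));
  for (let i = 0; i < output.length; i++) {
    output[i] = input[i * 2];
  }
  return output;
}

// Convert float samples in [-1, 1] to signed 16-bit PCM, clamping out-of-range values.
function floatToPcm16Sketch(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    output[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return output;
}
```

A capture block would pass through the three helpers in order: filter, decimate, convert. A one-pole filter is a gentle anti-aliasing stage, which matches the commit's wording but is not a sharp resampling filter.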