feat(dictation): realtime streaming STT (live dictation) #118

Closed
vvzvlad wants to merge 4 commits from feature/streaming-dictation into develop

4 Commits

Author SHA1 Message Date
claude code agent 227
c70dac79ad fix(dictation): drop the under-input interim preview in chat
The live partial transcript was shown as a dimmed line under the chat textarea,
but each segment only flickers there for an instant before its final lands in
the draft — the preview was unreadable noise. Remove it: realtime partials are
no longer rendered separately; finalized segments are appended straight into the
draft (where the user actually reads them). Editor dictation (inline ghost at the
caret) is unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 01:10:10 +03:00
claude code agent 227
85130da79f fix(dictation): realtime lockout (latest-wins), unify STT settings (#737), surface upstream failures
Found while live-testing the realtime dictation:

- 'already active' lockout (real bug): the per-user slot was tied to the
  connected socket lifetime and a stale/racing socket could leave the counter
  stuck, so a fresh mic start was rejected. Now per-user single-session is
  enforced purely by LATEST-WINS EVICTION — a new connect disconnects the user's
  prior socket and frees its slot synchronously — and the user counter no longer
  participates in the cap decision (it could only cause false lockouts). Also
  free the slot when a start fails to open. The per-workspace cap is unchanged.

- #737: drop the separate sttRealtimeModel / sttRealtimeBaseUrl settings — realtime
  dictation now reuses the existing STT model + base URL (the realtime WS endpoint
  is derived from it server-side). Removed the fields from the DTO, types, settings
  service, repo allowlist, and the settings UI. The STT 'Test endpoint' button is
  now a single context-aware button (probes the realtime WS endpoint when realtime
  is on, the batch endpoint otherwise), and the 'Request format' selector is
  disabled while realtime is on (realtime always uses the OpenAI Realtime protocol).

- no-silent-loss: parse the OpenAI
  conversation.item.input_audio_transcription.failed event (e.g. insufficient_quota,
  bad model) and surface its concrete reason to the client instead of dropping it
  silently — previously a per-item transcription failure produced 'no words' with
  no explanation.

Tests: realtime suites green (gateway latest-wins eviction, parser .failed surfacing,
ai-settings reuse-STT-model); server + client tsc clean; workspace vitest 37 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 01:10:10 +03:00
claude code agent 227
090a7c6b24 fix(dictation): address PR #118 review feedback (security, stability, tests)
Implements all reviewer comments (code-review, red-team, and test-strategy
audit), accepting the recommended variants.

Server — realtime service (ai-realtime.service.ts):
- SSRF: pin the validated IP via a WebSocket `lookup` hook that re-checks every
  resolved address with isIpAllowed (mirrors external-mcp buildPinnedDispatcher),
  closing the TOCTOU/DNS-rebinding window; fix the misleading comment.
- no-silent-loss: on Stop, drain the in-flight segment (bounded 2.5s) and deliver
  the final via onFinal before closing instead of dropping the tail.
- fail-closed deriveRealtimeUrl: a non-empty unparseable base now THROWS (no
  silent api.openai.com fallback that would leak a self-hosted key); http://ws://
  bases rejected (plaintext key). Path normalization preserved.
- parseUpstreamEvent keys the accumulator by item_id+content_index so GA segments
  don't concatenate.
- inject a wsFactory seam for testing; also fix a latent bug — `import WebSocket
  from 'ws'` resolved to undefined at runtime (no esModuleInterop) -> import=require.
- unref idle/max/drain timers.

Server — realtime gateway (ai-realtime.gateway.ts, session-limits.ts):
- reject revoked/disabled users and inactive sessions (mirror jwt.strategy:
  findById+isUserDisabled + findActiveById) with NO counter increment.
- CSWSH: Origin allowlist (matching APP_URL, or no Origin for native clients)
  before auth, no increment.
- extract SessionCounters (delete-at-zero, never negative) + pure canConnect
  (both caps >= checked before any increment); document the per-process/in-memory
  cap caveat (single-replica only).

Client:
- dictation-group: realtime final now inserts at the captured rangeRef SNAPSHOT
  (not the live caret) and guards editor.isEditable; single-space separator.
- use-realtime-dictation/realtime-dictation-client: stop-during-acquisition tears
  down the mic (no leak / button reset); reconnect re-emits start (double-start
  guarded); interim ghost cleared on teardown; io() options de-duplicated.
- pcm16-worklet: flush the partial sub-frame tail on stop; one-pole anti-aliasing
  low-pass before 48k->24k.
- extract shared mic-capture (acquireMicStream/mapGetUserMediaError, used by batch
  + realtime), pure DSP (pcm16-dsp.ts), and the session reducer/baseLanguageSubtag;
  extract applyInterimMeta/clampRange/resolveUrl/appendFinalToDraft.

Tests + infra: +~150 server tests (deriveRealtimeUrl, parseUpstreamEvent branches,
openSession/lifecycle/timers/testConnection via fake ws, gateway auth/caps/no-leak,
realtime-test admin contract, AiSettings update/resolve, DTO boolean, SSRF deny)
and +~140 client tests (DSP property/edge, resampler continuity, framing, reducer,
mic-capture, RealtimeDictationClient/MicButton, ProseMirror interim regression +
history guards, appendFinalToDraft, resolveKeyField, route contract). Added
@vitest/coverage-v8. CHANGELOG [Unreleased] entry incl. the single-replica caveat.

Review: APPROVE WITH SUGGESTIONS (no critical/regression); applied the drain-timer
unref. Server tsc clean + 358 tests; client tsc clean + 201 tests; vite build ok.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 01:08:50 +03:00
claude_code
9faf927b93 feat(dictation): add realtime streaming STT (live dictation)
Layer an optional realtime speech-to-text path on top of the existing
batch dictation, so transcribed text appears as the user speaks.

Transport A2: browser <-> our server (Socket.IO `/ai-realtime`) <->
OpenAI Realtime (raw ws). The provider API key never leaves the server;
the upstream URL is SSRF-checked before connecting; the gateway enforces
the dictation+dictationRealtime gate, cookie-JWT auth and per-user/
per-workspace concurrency caps. Implemented against the GA (2026) OpenAI
Realtime transcription contract (session.update / audio.input.format /
server_vad), not the now-removed beta shape.

Editor UI B2: interim text is shown as a meta-only ProseMirror ghost
decoration (no Yjs/history noise); only completed segments are committed.
Chat shows interim as a dimmed tail. The mic button switches realtime vs
batch by the workspace flag; batch remains the default and fallback.

Server:
- AiRealtimeService (upstream ws proxy, normalized events, idle/max-
  duration timeouts, idempotent teardown) + parseUpstreamEvent unit tests
- AiRealtimeGateway (Socket.IO `/ai-realtime`) wired into AiChatModule
- admin-gated POST /ai-chat/realtime/test connectivity probe
- config: settings.ai.dictationRealtime + provider sttRealtimeModel/
  sttRealtimeBaseUrl (realtime key reuses sttApiKey; no new secret)

Client:
- pcm16 AudioWorklet (24kHz mono PCM16), RealtimeDictationClient,
  use-realtime-dictation hook (status/start/stop/cancel + onInterim/onFinal)
- RealtimeMicButton + dictation-interim ProseMirror decoration
- editor/chat integration + AI settings UI (toggle, model, test endpoint)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 01:08:50 +03:00