Commit Graph

5 Commits

Author SHA1 Message Date
Mark Backman
f1a3ee97de fix: surface TTSSpeakFrame greetings in on_assistant_turn_stopped
Two issues were causing TTSSpeakFrame(append_to_context=True) greetings to
silently lose their trailing words and never fire on_assistant_turn_stopped:

- LLMAssistantPushAggregationFrame was emitted without a PTS, so the
  transport routed it through the audio (sync) queue while word-level
  TTSTextFrames travel through the clock queue. The aggregation could reach
  the assistant aggregator before the final words, leaving them orphaned
  in the buffer. Stamp the frame with `_word_last_pts + 1` when there are
  word timestamps so it can't overtake them.

- The aggregator's LLMAssistantPushAggregationFrame handler called
  push_aggregation() directly, bypassing _trigger_assistant_turn_stopped.
  For TTS-only flows there is no LLMFullResponseStartFrame, so the turn
  start timestamp was never set and on_assistant_turn_stopped never fired.
  Open a turn (if needed) and trigger stopped from the handler.
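
The ordering argument in the first bullet can be illustrated with a toy model. This is only a sketch, assuming the clock queue behaves like a min-heap keyed on PTS; `Frame`, `drain_in_pts_order`, and the PTS values are hypothetical stand-ins, not the transport's actual classes.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Frame:
    # Hypothetical stand-in for a frame; only `pts` takes part in ordering.
    pts: int
    name: str = field(compare=False)


def drain_in_pts_order(frames):
    """Pop frames from a min-heap keyed on PTS (assumed clock-queue model)."""
    heap = list(frames)
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]


# Word-level frames carry timestamps (made-up values for illustration).
word_frames = [Frame(100, "hello"), Frame(200, "world")]
word_last_pts = max(f.pts for f in word_frames)

# Stamp the aggregation frame just past the last word so it cannot overtake it.
push = Frame(word_last_pts + 1, "push_aggregation")

# Even if the aggregation frame is enqueued first, it drains last.
order = drain_in_pts_order([push, *word_frames])
```

In this model `order` ends with `"push_aggregation"` regardless of enqueue order, which is the property the stamp is meant to guarantee.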

Fixes #4264.
2026-05-04 10:41:22 -04:00
Ian Lee
b435ddfa44 feat(tts): add includes_inter_frame_spaces flag to word-timestamp API
Some TTS providers (e.g. Inworld) return verbatim tokens where spaces and
punctuation are already embedded in the token text. When downstream consumers
join these tokens with an extra space they produce "hello , world" instead of
"hello, world".

Add an opt-in `includes_inter_frame_spaces: bool = False` parameter to
`add_word_timestamps` / `_add_word_timestamps`. The flag is threaded through
`_WordTimestampEntry` and stamped onto every emitted `TTSTextFrame`.
Defaults to `False` — no behaviour change for existing services.
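
A minimal sketch of how a downstream consumer might honour the flag when joining tokens. `join_tokens` and the token split below are hypothetical illustrations; only the flag name `includes_inter_frame_spaces` comes from this change.

```python
def join_tokens(tokens: list[str], includes_inter_frame_spaces: bool = False) -> str:
    """Join word tokens, adding a separator only when tokens lack their own spacing."""
    separator = "" if includes_inter_frame_spaces else " "
    return separator.join(tokens)


# Assumed verbatim tokens: punctuation and spacing are already embedded.
tokens = ["hello", ", world"]

naive = join_tokens(tokens)  # extra space inserted between tokens
verbatim = join_tokens(tokens, includes_inter_frame_spaces=True)
```

With the default, the join reproduces the "hello , world" problem described above; with the flag set, the tokens are concatenated as delivered.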

`InworldTTSService` passes `includes_inter_frame_spaces=True` and stops
pre-processing tokens in `_calculate_word_times`, returning them verbatim.

Tests added to `test_tts_frame_ordering.py` covering both HTTP and WebSocket
delivery paths: verbatim text preservation, PTS ordering, text-before-audio
ordering, and the Inworld punctuation-token scenario.

Made-with: Cursor
2026-04-18 12:03:32 -07:00
Aleix Conchillo Flaqué
b3bb6fdaa5 Modernize Python typing across the codebase
Automated via ruff UP006, UP007, UP035, UP045 rules (target: py311):

- Replace `typing.List`, `Dict`, `Tuple`, `Set`, `FrozenSet`, `Type`
  with their built-in equivalents (`list`, `dict`, `tuple`, etc.)
- Replace `typing.Optional[X]` with `X | None`
- Replace `typing.Union[X, Y]` with `X | Y`
- Move `Mapping`, `Sequence`, `Callable`, `Awaitable`,
  `MutableMapping`, `MutableSequence`, `Iterator`, `AsyncIterator`,
  `AsyncGenerator` imports from `typing` to `collections.abc`
- Remove now-unused `typing` imports
- Add `from __future__ import annotations` to 5 files that use
  forward-reference strings in `X | "Y"` annotations
2026-04-16 09:28:23 -07:00
Mark Backman
5d71de8aad Fix LLMFullResponseEndFrame racing ahead of final TTSTextFrame
Route LLMFullResponseEndFrame through the serialization queue instead
of pushing it directly downstream when push_text_frames is enabled.
This ensures the frame is emitted only after the audio context is
fully drained, preserving correct ordering relative to TTSTextFrames.

Previously, the final sentence TTSTextFrame would arrive at the
LLMAssistantAggregator after LLMFullResponseEndFrame, causing it to
be dropped from the conversation context (especially with RTVI text
input where no subsequent interruption would flush the orphaned text).
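
The race can be reproduced in a toy model. This is a sketch only, assuming a single `asyncio.Queue` drained by one task stands in for the serialization queue; `emit_order` and the string frame names are hypothetical, not the service's real API.

```python
import asyncio


async def emit_order(route_end_through_queue: bool) -> list[str]:
    """Return the order in which frames reach the downstream consumer."""
    emitted: list[str] = []
    queue: asyncio.Queue = asyncio.Queue()

    async def drain() -> None:
        # Single consumer: frames leave in exactly the order they were queued.
        while True:
            item = await queue.get()
            if item is None:  # sentinel: stop draining
                break
            await asyncio.sleep(0)  # stand-in for audio context work
            emitted.append(item)

    task = asyncio.create_task(drain())
    await queue.put("TTSTextFrame")
    if route_end_through_queue:
        await queue.put("LLMFullResponseEndFrame")  # waits its turn
    else:
        emitted.append("LLMFullResponseEndFrame")  # pushed directly downstream
    await queue.put(None)
    await task
    return emitted


direct = asyncio.run(emit_order(route_end_through_queue=False))
queued = asyncio.run(emit_order(route_end_through_queue=True))
```

In this model the direct push places the end frame ahead of the final text frame, while routing it through the queue preserves the intended order.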
2026-03-24 15:09:42 -04:00
filipi87
5fd98e1391 Fixing TTS frame order.
2026-03-19 09:43:40 -03:00