Two issues were causing TTSSpeakFrame(append_to_context=True) greetings to
silently lose their trailing words and never fire on_assistant_turn_stopped:
- LLMAssistantPushAggregationFrame was emitted without a PTS, so the
transport routed it through the audio (sync) queue while word-level
TTSTextFrames travel through the clock queue. The aggregation could reach
the assistant aggregator before the final words, leaving them orphaned
in the buffer. Stamp the frame with `_word_last_pts + 1` when there are
word timestamps so it can't overtake them.
- The aggregator's LLMAssistantPushAggregationFrame handler called
push_aggregation() directly, bypassing _trigger_assistant_turn_stopped.
For TTS-only flows there is no LLMFullResponseStartFrame, so the turn
start timestamp was never set and on_assistant_turn_stopped never fired.
Open a turn (if needed) and trigger the turn-stopped event from the handler.
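
The ordering race in the first item can be illustrated with a toy model. This is only a sketch, not the real transport: `Frame`, `deliver`, and the two-path routing are simplified, hypothetical stand-ins, and it assumes untimed frames are forwarded as soon as they are seen while timed frames wait for their PTS.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """Hypothetical stand-in for a pipeline frame."""
    name: str
    pts: Optional[int] = None  # None means the frame carries no PTS


def deliver(frames: List[Frame]) -> List[str]:
    """Toy transport: frames without a PTS are forwarded immediately,
    frames with a PTS wait in a clock queue and are released in PTS order."""
    delivered: List[str] = []
    clock_queue: list = []
    for seq, frame in enumerate(frames):
        if frame.pts is None:
            delivered.append(frame.name)
        else:
            # seq breaks ties so equal PTS values keep arrival order
            heapq.heappush(clock_queue, (frame.pts, seq, frame.name))
    while clock_queue:
        delivered.append(heapq.heappop(clock_queue)[2])
    return delivered


words = [Frame("hello", pts=100), Frame("world", pts=200)]
word_last_pts = max(f.pts for f in words)

# Without a PTS the aggregation frame overtakes the word frames.
unstamped = deliver(words + [Frame("push_aggregation")])

# Stamped with word_last_pts + 1 it is released after the final word.
stamped = deliver(words + [Frame("push_aggregation", pts=word_last_pts + 1)])
```

In this model the unstamped frame is delivered first and the stamped one last, which mirrors why the fix stamps the frame rather than changing the queues.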
Fixes #4264.
Some TTS providers (e.g. Inworld) return verbatim tokens where spaces and
punctuation are already embedded in the token text. When downstream consumers
join these tokens with an extra space they produce "hello , world" instead of
"hello, world".
Add an opt-in `includes_inter_frame_spaces: bool = False` parameter to
`add_word_timestamps` / `_add_word_timestamps`. The flag is threaded through
`_WordTimestampEntry` and stamped onto every emitted `TTSTextFrame`.
Defaults to `False`, so there is no behaviour change for existing services.
`InworldTTSService` passes `includes_inter_frame_spaces=True` and stops
pre-processing tokens in `_calculate_word_times`, returning them verbatim.
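
A minimal sketch of how a downstream consumer might honour the flag. `TextFrame` and `join_frames` are hypothetical names for illustration only; the joining logic is an assumption about consumers, not code from this change.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TextFrame:
    """Hypothetical stand-in for TTSTextFrame."""
    text: str
    includes_inter_frame_spaces: bool = False


def join_frames(frames: Iterable[TextFrame]) -> str:
    """Join frame text, adding a separator only when the frame does not
    already carry its own spacing."""
    parts = []
    for frame in frames:
        if parts and not frame.includes_inter_frame_spaces:
            parts.append(" ")
        parts.append(frame.text)
    return "".join(parts)


# Verbatim tokens with spacing and punctuation already embedded.
tokens = ["hello", ",", " world"]

# Flag left at its default: a space is inserted before the comma.
naive = join_frames(TextFrame(t) for t in tokens)

# Flag set: tokens are concatenated exactly as the provider sent them.
preserved = join_frames(
    TextFrame(t, includes_inter_frame_spaces=True) for t in tokens
)
```

With the flag set, `preserved` is `"hello, world"`; with the default, `naive` contains the stray space before the comma.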
Tests added to `test_tts_frame_ordering.py` covering both HTTP and WebSocket
delivery paths: verbatim text preservation, PTS ordering, text-before-audio
ordering, and the Inworld punctuation-token scenario.
Made-with: Cursor
When push_text_frames is enabled, route LLMFullResponseEndFrame through
the serialization queue instead of pushing it directly downstream.
This ensures the frame is emitted only after the audio context is
fully drained, preserving correct ordering relative to TTSTextFrames.
Previously, the final sentence TTSTextFrame would arrive at the
LLMAssistantAggregator after LLMFullResponseEndFrame, causing it to
be dropped from the conversation context (especially with RTVI text
input where no subsequent interruption would flush the orphaned text).
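
The effect of the routing change can be sketched with a plain `asyncio.Queue`. This is an illustrative model under simplifying assumptions: `run`, the `("text", ...)` / `("end", None)` tuples, and the drain task are hypothetical and do not reflect the service's actual queue.

```python
import asyncio
from typing import List


async def run(route_end_through_queue: bool) -> List[str]:
    received = []  # what the downstream aggregator would see, in order
    queue: asyncio.Queue = asyncio.Queue()

    async def drain() -> None:
        # Worker that emits queued items one at a time, in FIFO order.
        while True:
            item = await queue.get()
            if item is None:  # sentinel: nothing more to emit
                break
            await asyncio.sleep(0)  # stand-in for waiting on audio playout
            received.append(item)

    worker = asyncio.create_task(drain())

    for sentence in ("First sentence.", "Final sentence."):
        await queue.put(("text", sentence))

    if route_end_through_queue:
        # New behaviour: the end marker waits behind the text frames.
        await queue.put(("end", None))
    else:
        # Old behaviour: the end marker is pushed directly downstream.
        received.append(("end", None))

    await queue.put(None)
    await worker
    return [kind for kind, _ in received]


direct = asyncio.run(run(route_end_through_queue=False))
queued = asyncio.run(run(route_end_through_queue=True))
```

In this model the direct push delivers the end marker before any text, while the queued variant delivers it after the final sentence.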