BaseOutputTransport only clears buffered audio mid-playback on
InterruptionFrame. Realtime services stream audio downstream as fast as
they produce it, and playback necessarily trails the buffer — so when the
user interrupts, the bot keeps talking past the interruption unless the
service surfaces the interruption to the pipeline.
Two realtime services were missing this signal:
- AWS Nova Sonic acknowledged the INTERRUPTED stop reason internally
(closing its own response state) but never broadcast InterruptionFrame.
- Ultravox's playback_clear_buffer message — the server's explicit
"drop buffered output audio" signal for interruptions — was not
handled at all.
In both cases the latent bug was masked by enabling local VAD on the
user aggregator, which produced UserStartedSpeakingFrame and triggered
the aggregator-side interruption path. The realtime context aggregator
work makes local VAD optional, so the underlying gap needs fixing first.
Wire broadcast_interruption() into both services on the server-side
interruption signal, firing before the response-end signal so the
assistant aggregator marks the message interrupted=True before
LLMFullResponseEndFrame closes the turn.
660 B
660 B
- Fixed Ultravox Realtime not surfacing server-side interruption. The server sends a
playback_clear_buffermessage when the user interrupts the bot mid-speech, instructing clients to drop buffered output audio; this was previously unhandled, soBaseOutputTransportkept playing the buffered audio and the bot kept talking past the interruption. Ultravox now broadcastsInterruptionFrameonplayback_clear_buffer. This was previously masked by enabling local VAD on the user aggregator, which generatedUserStartedSpeakingFrameand triggered the aggregator-side interruption path; the fix makes the behavior correct without local VAD as a workaround.