Only create the EmulateUserStartedSpeakingFrame if we have received a transcription.

2025-07-14 17:38:03 -03:00
parent 8fd5576879
commit 727af2e6fb
2 changed files with 10 additions and 1 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - For `LmntTTSService`, changed the default `model` to `blizzard`, LMNT's
  recommended model.

+### Fixed
+
+- Fixed an issue where, in some edge cases, the `EmulateUserStartedSpeakingFrame`
+  could be created even if we didn't have a transcription.
+
 ## [0.0.76] - 2025-07-11

 ### Added
--- a/src/pipecat/processors/aggregators/llm_response.py
+++ b/src/pipecat/processors/aggregators/llm_response.py
@@ -693,7 +693,11 @@ class LLMUserContextAggregator(LLMContextResponseAggregator):
        # to emulate VAD (i.e. user start/stopped speaking), but we do it only
        # if the bot is not speaking. If the bot is speaking and we really have
        # a short utterance we don't really want to interrupt the bot.
-        if not self._user_speaking and not self._waiting_for_aggregation:
+        if (
+            not self._user_speaking
+            and not self._waiting_for_aggregation
+            and len(self._aggregation) > 0
+        ):
            if self._bot_speaking:
                # If we reached this case and the bot is speaking, let's ignore
                # what the user said.