pipecat

Author	SHA1	Message	Date
Mark Backman	d69a337def	Add text_aggregation_mode parameter to TTSService Move the sentence vs token aggregation concern into text aggregators so all text flows through them regardless of mode. This enables pattern detection and tag handling to work in TOKEN mode. - Add TextAggregationMode enum (SENTENCE, TOKEN) as the user-facing TTS setting, separate from the internal AggregationType - Add TOKEN mode support to Simple, SkipTags, and PatternPair aggregators - Add text_aggregation_mode parameter to TTSService and all TTS subclasses - Deprecate aggregate_sentences in favor of text_aggregation_mode - Merge TTSService._process_text_frame() into a single codepath	2026-02-26 08:55:41 -05:00
James Hush	763002f2bc	Fix sentence splitting for CJK and other non-Latin languages in TTS pipeline NLTK's sent_tokenize() only supports ~15 European languages and defaults to English. For Japanese, Chinese, Korean, Hindi, Arabic, and other non-Latin languages, NLTK fails to recognize sentence boundaries like 。？！ causing text to accumulate until flush instead of being emitted sentence-by-sentence. Add a fallback in match_endofsentence() that scans for unambiguous non-Latin sentence-ending punctuation when NLTK fails to split the text. Latin punctuation (. ! ? ; …) is excluded from the fallback since NLTK handles those correctly and they can be ambiguous (abbreviations, decimals, etc.). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 14:27:49 +08:00
Aleix Conchillo Flaqué	305ab44132	tests: add unittest.main() call	2026-01-30 10:07:34 -08:00
Aleix Conchillo Flaqué	2626154a64	update examples and tests copyright and use a proper dash in 2024-2026	2026-01-07 19:32:22 -08:00
Mark Backman	d79dd94019	Make aggregate return an AsyncIterator, other clean up	2025-12-03 22:00:34 -05:00
Mark Backman	ffbb6e5937	Update SimpleTextAggregator to handle character by character input, use a buffer to handle ambiguous EOS scenarios, and add a flush method to all aggregators	2025-12-03 22:00:02 -05:00
mattie ruth backman	dcc20f86e1	Updated the BaseTextAggregator to categorize aggregations Modified the BaseTextAggregator type so that when text gets aggregated, metadata can be associated with it. Currently, that just means a `type`, so that the aggregation can be classified or described. Changes made to support this: - IMPORTANT: Aggregators are now expected to strip leading/trailing white space characters before returning their aggregation from `aggregation()` or `.text`. This way all aggregators have a consistent contract allowing downstream use to know how to stitch aggregations back together - Introduced a new `Aggregation` dataclass to represent both the aggregated `text` and a string identifying the `type` of aggregation (ex. "sentence", "word", "my custom aggregation") - BREAKING: `BaseTextAggregator.text` now returns an `Aggregation` (instead of `str`). To update: `aggregated_text = myAggregator.text` -> `aggregated_text = myAggregator.text.text` - BREAKING: `BaseTextAggregator.aggregate()` now returns `Optional[Aggregation]` (instead of `Optional[str]`). To update: ``` aggregation = myAggregator.aggregate(text) if (aggregation): print(f"successfully aggregated text: {aggregation.text}") // instead of {aggregation} ``` - `SimpleTextAggregator`, `SkipTagsAggregator`, `PatternPairAggregator` updated to produce/consume `Aggregation` objects. - All uses of the above Aggregators have been updated accordingly.	2025-11-21 17:16:10 -05:00
Aleix Conchillo Flaqué	54b1d7fcc1	BaseTextAggregator: make functions async	2025-05-20 13:11:42 -07:00
Aleix Conchillo Flaqué	f8610a69a5	introduce text aggregators	2025-03-14 10:48:25 -07:00

9 Commits