pipecat/tests/test_utils_string.py at main

James Hush 763002f2bc Fix sentence splitting for CJK and other non-Latin languages in TTS pipeline

NLTK's sent_tokenize() supports only ~15 European languages and defaults to
English. For Japanese, Chinese, Korean, Hindi, Arabic, and other non-Latin
languages, it does not recognize sentence-ending punctuation such as 。？！, so
text accumulates until flush instead of being emitted sentence by sentence.

Add a fallback in match_endofsentence() that scans for unambiguous non-Latin
sentence-ending punctuation when NLTK fails to split the text. Latin
punctuation (. ! ? ; …) is excluded from the fallback since NLTK handles
those correctly and they can be ambiguous (abbreviations, decimals, etc.).
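
The fallback scan described above could look roughly like the sketch below. It
is an illustration only: the helper name find_non_latin_sentence_end, the
character set, and the "0 means no match" return convention are assumptions,
not the code or the punctuation list from this commit.

```python
# Hypothetical set of unambiguous non-Latin sentence terminators.
# This list is an assumption for illustration, not the one in the commit.
# Latin punctuation (. ! ? ; …) is left out on purpose, as the commit describes.
NON_LATIN_SENTENCE_ENDINGS = frozenset("。？！｡।॥؟۔")


def find_non_latin_sentence_end(text: str) -> int:
    """Return the index just past the first non-Latin terminator, or 0 if none."""
    for i, ch in enumerate(text):
        if ch in NON_LATIN_SENTENCE_ENDINGS:
            return i + 1
    return 0


if __name__ == "__main__":
    buffered = "こんにちは。元気ですか？"
    end = find_non_latin_sentence_end(buffered)
    # Emit the first complete sentence and keep the rest buffered.
    print(buffered[:end])
    print(buffered[end:])
```

In a pipeline, a helper like this would only run after the NLTK-based split
finds no boundary, so English text such as "Dr. Smith" or "3.14" is never
examined by the fallback.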

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-02 14:27:49 +08:00

12 KiB
