NLTK's sent_tokenize() supports only ~15 European languages and defaults to English. For Japanese, Chinese, Korean, Hindi, Arabic, and other languages written in non-Latin scripts, it fails to recognize sentence-ending punctuation such as 。？！, so text accumulates until flush instead of being emitted sentence by sentence.

Add a fallback in match_endofsentence() that scans for unambiguous non-Latin sentence-ending punctuation when NLTK fails to split the text.

Latin punctuation (. ! ? ; …) is excluded from the fallback because NLTK already handles it correctly and it can be ambiguous (abbreviations, decimals, etc.).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
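For reviewers, a minimal sketch of the kind of scan the fallback describes. It is not the code in this change: the helper name find_non_latin_sentence_end() and the exact punctuation set are assumptions made for illustration, and the NLTK call is left out so the snippet runs on its own. In practice it would be consulted only after NLTK returns the text unsplit.

```python
# Assumed set of unambiguous non-Latin sentence terminators. The real list
# used by the change is not shown here, so this selection is illustrative.
NON_LATIN_SENTENCE_ENDINGS = frozenset(
    "\u3002"  # CJK ideographic full stop
    "\uff1f"  # fullwidth question mark
    "\uff01"  # fullwidth exclamation mark
    "\u0964"  # Devanagari danda
    "\u061f"  # Arabic question mark
    "\u06d4"  # Arabic full stop
)


def find_non_latin_sentence_end(text: str) -> int:
    """Return the index just past the first non-Latin sentence terminator.

    A run of consecutive terminators is kept together with the sentence it
    closes. Returns 0 when no terminator is found, meaning the caller
    should keep buffering.
    """
    for i, ch in enumerate(text):
        if ch in NON_LATIN_SENTENCE_ENDINGS:
            end = i + 1
            # Absorb any terminators that immediately follow.
            while end < len(text) and text[end] in NON_LATIN_SENTENCE_ENDINGS:
                end += 1
            return end
    return 0


if __name__ == "__main__":
    buffered = "\u3053\u3093\u306b\u3061\u306f\u3002\u5143\u6c17"
    end = find_non_latin_sentence_end(buffered)
    print(end)  # 6: five characters plus the ideographic full stop
```

Latin terminators are deliberately absent from the set, so "Hello world. How are you" yields 0 and stays with NLTK.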