The initial implementation of long text fragmentation in textFragment.js takes the first 5 words and the last 5 words and drops the middle. This breaks down in a couple of cases:
- if there are in fact fewer than 10 words, the start and end overlap and you get duplicated words
- languages like Chinese and Japanese, which don't put spaces between words, have long strings treated as single words, making that situation more likely
For instance, sharing a link to an entire paragraph of Chinese text from the zh "Paris" article gave me a URL over 5000 characters long (tweaked to point at zhwiki instead of my localhost):
Consider treating each Han character as an entire word (which may help with Chinese and Japanese specifically), or cropping the total length of each side of the pair.
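A rough sketch of the first option, to make the idea concrete. This is not the existing textFragment.js code: `splitIntoWords`, `getFragmentEnds`, and the `textStart`/`textEnd` return shape are hypothetical names chosen for illustration. It assumes a runtime that supports Unicode property escapes in regular expressions (`\p{Script=Han}` with the `u` flag).

```javascript
// Hypothetical helper: split text into "words", counting each Han
// character as its own word and any other run of non-space characters
// as one word. Each entry keeps its offset in the original string.
function splitIntoWords(text) {
  const pattern = /\p{Script=Han}|[^\s\p{Script=Han}]+/gu;
  return Array.from(text.matchAll(pattern), (m) => ({
    word: m[0],
    index: m.index
  }));
}

// Hypothetical helper: pick the start and end of a long selection.
// If there are not enough words for two non-overlapping sides, return
// the whole text as textStart and no textEnd, so nothing is repeated.
function getFragmentEnds(text, wordsPerSide = 5) {
  const words = splitIntoWords(text);
  if (words.length <= wordsPerSide * 2) {
    return { textStart: text.trim(), textEnd: null };
  }
  const first = words[0];
  const lastOfStart = words[wordsPerSide - 1];
  const firstOfEnd = words[words.length - wordsPerSide];
  const last = words[words.length - 1];
  // Slice the original string so spacing and punctuation inside each
  // side are preserved exactly as they appear in the page text.
  return {
    textStart: text.slice(first.index, lastOfStart.index + lastOfStart.word.length),
    textEnd: text.slice(firstOfEnd.index, last.index + last.word.length)
  };
}
```

Note that this only splits Han characters. A long run of Japanese kana would still count as a single word, so the length-cropping option may still be needed as a fallback.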
Acceptance criteria:
- Never repeat words on both sides of the "long string" start/end pair in a Chinese string
- If cropping within long strings, do not break Unicode surrogate pairs
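For the surrogate-pair criterion, something along these lines might work when cropping by length. The function names `cropStart` and `cropEnd` are hypothetical, and the limit is counted in UTF-16 code units, which is an assumption rather than anything the current code does. The sketch also assumes well-formed input, and it does not try to keep grapheme clusters (for example, base characters plus combining marks) together.

```javascript
// Hypothetical helper: keep at most maxCodeUnits from the start of the
// text without cutting a surrogate pair in half.
function cropStart(text, maxCodeUnits) {
  if (text.length <= maxCodeUnits) {
    return text;
  }
  let end = maxCodeUnits;
  // If the last kept unit is a high surrogate, its low surrogate would
  // be cut off, so drop the high surrogate as well.
  const lastKept = text.charCodeAt(end - 1);
  if (lastKept >= 0xD800 && lastKept <= 0xDBFF) {
    end -= 1;
  }
  return text.slice(0, end);
}

// Hypothetical helper: keep at most maxCodeUnits from the end of the
// text without cutting a surrogate pair in half.
function cropEnd(text, maxCodeUnits) {
  if (text.length <= maxCodeUnits) {
    return text;
  }
  let start = text.length - maxCodeUnits;
  // If the first kept unit is a low surrogate, its high surrogate was
  // cut off, so skip the low surrogate as well.
  const firstKept = text.charCodeAt(start);
  if (firstKept >= 0xDC00 && firstKept <= 0xDFFF) {
    start += 1;
  }
  return text.slice(start);
}
```

Both helpers may return one code unit fewer than the limit when the cut would otherwise land inside a pair, which seems preferable to emitting a lone surrogate into the URL.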
