User Story: As a search engineer, I don't want any big language analysis surprises when we upgrade from Elasticsearch 6.8 to 7.10.
We can do a relatively quick check on 500–1000 random documents from a selection of Wikipedias and Wiktionaries, to test language-specific analysis and a variety of scripts. I will also test some additional "rare" characters (such as ♙, ☥, 〃, and 〆—see T211824: Investigate a “rare-character” index and 6cac1cbae6c4).
If there are no big issues, it should be relatively quick. If there are any big issues, well then it'll be a good thing we found them. (We really want to detect problems like the Chinese punctuation problem in T172653—though admittedly that was not caused by an upgrade.)
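As a rough illustration of the kind of comparison involved, here is a minimal sketch that diffs the token streams two Elasticsearch versions produce for the same text. It assumes each sample has already been run through the `_analyze` API on both clusters and that the JSON responses are at hand. The helper names (`token_strings`, `diff_tokens`) and the sample responses are made up for illustration and are not the tooling actually used for this task.

```python
import difflib


def token_strings(analyze_response):
    """Extract token text, in order, from an _analyze-style response body."""
    return [t["token"] for t in analyze_response.get("tokens", [])]


def diff_tokens(old_response, new_response):
    """Return (op, old_tokens, new_tokens) for each span where the streams differ."""
    old = token_strings(old_response)
    new = token_strings(new_response)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (op, old[i1:i2], new[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Hypothetical responses for the same input text on the old and new versions.
old_resp = {"tokens": [{"token": "foo"}, {"token": "bar"}]}
new_resp = {"tokens": [{"token": "foo"}, {"token": "☥"}, {"token": "bar"}]}

for op, before, after in diff_tokens(old_resp, new_resp):
    print(op, before, after)
```

Comparing only the token text keeps the sketch short; a fuller check would probably also compare offsets and positions, since those can change even when the token text does not.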
We are planning a similar analysis from ES 6.5 to ES 6.8 (see T300302). Review the notes from the 6.5 to 6.8 upgrade for potential tokenizer issues, especially the "Next Steps" / "Start fixing stuff" list.
Acceptance Criteria:
- Report on language analysis diffs between ES 6.5 & 6.8 and ES 7.10
- New phab tickets for any big issues that need to be addressed