Korean was reindexed incidentally as part of the ES 6 upgrade: some of our previous configuration for spaceless languages was deprecated, so we had to move those languages to BM25 and reindex, which picked up the Nori changes for Korean.
Thu, Apr 18
Fri, Apr 12
Thu, Apr 11
Wed, Apr 10
This looks good, @Gehel. You brought up some things we hadn't talked about before, so you covered more than 100% of the topics I had!
Mon, Apr 8
The code update is done, but I'm moving this back to "in progress" because I'm still working on my presentation.
Fri, Apr 5
Thu, Apr 4
Tue, Apr 2
Created the following tasks and will prioritize them into the Language Stuff workboard column:
The results are in! A brief summary:
Fri, Mar 29
Thu, Mar 28
As I thought, this is a customization that was added to the English language analysis years ago, before my time. It was originally limited to search on MediaWiki.org in 2013, and then expanded to all English-language wikis in 2014, but it was never expanded beyond that.
I'll take a look today. I'm pretty sure I know what's happening, but will double check.
Mar 20 2019
Mar 19 2019
Mar 12 2019
Mar 8 2019
Mar 7 2019
Mar 6 2019
After refactoring the lowercase-to-ICU-normalization upgrade code for Greek (T203117) so that the lowercase filter is kept if it is language-specific, I needed to test it for the other language-specific cases: Turkish and Irish. The impact is positive but small because it is limited to the plain field and to fields other than the text field (where the language-specific lowercasing is already in effect because the analyzers have not been unpacked). Full details on MediaWiki.
Unpacking the Greek analyzer exposes the lowercase filter, which is upgraded to icu_normalizer, losing the Greek-specific processing therein! So, we need to keep the Greek lowercasing even if we do ICU normalization. After that, everything is copacetic. Full write-up on MediaWiki.
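A rough sketch of the rule described above, for illustration only. This is not the actual upgrade code: the function names, the `greek_lowercase` filter name, and the choice to place `icu_normalizer` directly after a kept language-specific lowercase filter are all assumptions. It relies only on the fact that the Elasticsearch `lowercase` token filter accepts a `language` setting of `greek`, `irish`, or `turkish`.

```python
# Languages for which the Elasticsearch "lowercase" token filter has
# special handling via its "language" setting.
LANGUAGE_SPECIFIC = {"greek", "irish", "turkish"}


def is_language_specific_lowercase(filter_def):
    """True if a filter definition is a lowercase filter tied to a language."""
    return (
        filter_def.get("type") == "lowercase"
        and filter_def.get("language") in LANGUAGE_SPECIFIC
    )


def upgrade_lowercase_to_icu(analysis):
    """Return a copy of an analysis config with lowercase upgraded to ICU.

    Hypothetical helper: the generic "lowercase" filter is replaced by
    "icu_normalizer"; a language-specific lowercase filter is kept and
    "icu_normalizer" is added after it (an assumed ordering).
    """
    filters = analysis.get("filter", {})
    upgraded = {"filter": dict(filters), "analyzer": {}}
    for name, analyzer in analysis.get("analyzer", {}).items():
        chain = []
        for f in analyzer.get("filter", []):
            if f == "lowercase":
                # Generic lowercasing is safe to replace outright.
                chain.append("icu_normalizer")
            elif is_language_specific_lowercase(filters.get(f, {})):
                # Keep the language-specific step, then normalize.
                chain.extend([f, "icu_normalizer"])
            else:
                chain.append(f)
        upgraded["analyzer"][name] = {**analyzer, "filter": chain}
    return upgraded
```

With an unpacked analyzer whose chain starts with a Greek-specific lowercase filter, the filter survives the upgrade, while an analyzer using plain `lowercase` just gets `icu_normalizer` in its place.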
Mar 4 2019
Feb 26 2019
Feb 21 2019
We need to reindex, but not until after the ES6 upgrade is complete, and LTR has been disabled.
Feb 20 2019
@EBernhardson, thanks for the explanation!
The spikes on create_index are pretty extreme, with 194s for chi-eqiad-with-archive and 291s for omega-eqiad-with-archive. Is that just bad luck, or is something going on with the archives that makes this sometimes take much longer?
Feb 14 2019
Bleh. It looks like that symbol is turned into a text boundary by the standard analyzer, which isn't nice.
Feb 13 2019
Cool! Thanks, @Smalyshev!
Change 490412 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] admin: reset Julia SSH key
Removing this from current work and moving it to the "Language Stuff" backlog. I'm the only one who could work on this this quarter, and I'm a bit out of my depth with the integration. We'll reprioritize this for future work when we can assign a slightly larger team (≥2 people) to work on it.
Feb 12 2019
Hmm—what about Nori (the Korean analyzer) and LTR? I believe we have to disable LTR for Korean, enable Nori, gather more data, then rebuild the LTR model. Sounds like maybe all of that should wait until after the ES upgrade, even though it means re-indexing Korean wikis at a later date.
Looks good, and all the detail is much appreciated.
Feb 11 2019
Sounds good to me! If it turns out that the smallest volume languages have trouble, we can fall back to larger languages on the list.