I shudder to think whether there are any words with three or more character sets used, but at this point I wouldn't be surprised.
May 13 2019
I'm showing only the first O is Latin in (a), but the effect is the same—it gets no results. (Search for Latin "O" on the page and the first character of (a) will be highlighted.)
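A minimal sketch of how a mixed-script word like this could be flagged automatically. It assumes that the first word of a letter's Unicode character name is a good enough stand-in for its script, which is a rough heuristic, not a real script property lookup. The sample strings are hypothetical (the second script is assumed to be Cyrillic here for illustration), and `scripts_in_word` is an illustrative name, not part of any existing tool.

```python
import unicodedata


def scripts_in_word(word):
    """Return the set of script labels seen among the letters of `word`.

    The label is the first word of each letter's Unicode name
    (e.g. "LATIN", "CYRILLIC"), which only approximates the script.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


# Hypothetical examples: the second character of `mixed` is a Latin "O";
# every other letter is Cyrillic.
mixed = "\u0421O\u0412\u0415\u0422"
pure = "\u0421\u041e\u0412\u0415\u0422"

for word in (mixed, pure):
    found = scripts_in_word(word)
    label = "mixed" if len(found) > 1 else "single-script"
    print(word, sorted(found), label)
```

Any word where the returned set has more than one entry is a candidate for this kind of no-results query; a set of three or more would be the case I shudder to think about.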
I thought the goal was logging 100% for the smaller projects. 1:10 or even less is probably okay for the giant wikis (though it still makes relative percentages of volume either wrong or complicated to compute).
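To illustrate why mixed sampling rates complicate relative percentages, here is a rough sketch. The wiki names, counts, and rates are made up, and `estimated_shares` is a hypothetical helper: the idea is only that each logged count has to be scaled by the inverse of its sampling rate before shares are computed.

```python
def estimated_shares(logged_counts, sample_rates):
    """Scale logged counts by 1/sample_rate, then return each wiki's share.

    logged_counts: dict of wiki -> number of events actually logged
    sample_rates:  dict of wiki -> fraction of events logged (1.0 = 100%)
    """
    estimated = {
        wiki: count / sample_rates[wiki]
        for wiki, count in logged_counts.items()
    }
    total = sum(estimated.values())
    return {wiki: value / total for wiki, value in estimated.items()}


# Made-up numbers: a small wiki logged at 100%, a giant wiki logged at 1:10.
logged = {"smallwiki": 500, "giantwiki": 2000}
rates = {"smallwiki": 1.0, "giantwiki": 0.1}

naive_share = logged["smallwiki"] / sum(logged.values())
corrected_share = estimated_shares(logged, rates)["smallwiki"]
print(f"naive: {naive_share:.1%}, corrected: {corrected_share:.1%}")
```

Comparing the raw logged counts directly overstates the small wiki's share; the correction is simple here, but it has to be carried through every downstream calculation.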
May 6 2019
May 2 2019
Currently scheduled for 14:00 on Friday at one of the hacking tables—Table #9 (we've taken over the Technical Help Desk)
Apr 28 2019
Apr 26 2019
Apr 18 2019
Apr 12 2019
Apr 11 2019
Apr 10 2019
Korean got reindexed incidentally as part of the ES 6 upgrade: some of our previous configuration for spaceless languages was deprecated so we had to upgrade them to BM25, and reindex, which picked up the Nori changes for Korean.
This looks good, @Gehel. You brought up some things we hadn't talked about before, so you covered more than 100% of the topics I had!
Apr 8 2019
The code update is done.
Apr 5 2019
Apr 4 2019
Apr 2 2019
Created the following tasks and will prioritize them into the Language Stuff workboard column:
The results are in! A brief summary:
Mar 29 2019
Mar 28 2019
As I thought, this is a customization that was added to the English Language Analysis years ago before my time. It was originally limited to search on MediaWiki.org in 2013, and then expanded to all English-language wikis in 2014, but it was never expanded beyond that.
I'll take a look today. I'm pretty sure I know what's happening, but will double check.
Mar 20 2019
Mar 19 2019
Mar 12 2019
Mar 8 2019
Mar 7 2019
Mar 6 2019
After refactoring the lowercase-to-ICU-normalization upgrade code for Greek (T203117) so that the lowercase filter is kept if it is language-specific, I needed to test it for the other language-specific cases: Turkish and Irish. The impact is positive but small because it is limited to the plain field and other fields besides the text field (where the lang-specific lowercasing is already in effect because the analyzers have not been unpacked). Full details on MediaWiki.
Unpacking the Greek analyzer exposes the lowercase filter, which is upgraded to icu_normalizer, losing the Greek-specific processing therein! So, we need to keep the Greek lowercasing even if we do ICU normalization. After that, everything is copacetic. Full write-up on MediaWiki.
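The gist of the fix can be sketched as follows. This is not the actual upgrade code, just an illustration under some assumptions: filter definitions are modeled as plain dicts in analysis-chain order, a language-specific lowercase filter is recognized by its `language` setting (as in Elasticsearch's `lowercase` token filter), and the ICU filter uses the `nfkc_cf` mode. `upgrade_lowercase` is a hypothetical name.

```python
def upgrade_lowercase(filter_chain):
    """Return a new filter chain with ICU normalization added.

    A generic lowercase filter is replaced by icu_normalizer.
    A language-specific one (Greek, Turkish, Irish) is kept, and
    icu_normalizer is added right after it.
    """
    upgraded = []
    for flt in filter_chain:
        if flt.get("type") == "lowercase":
            if "language" in flt:
                # Keep the language-specific processing.
                upgraded.append(flt)
            upgraded.append({"type": "icu_normalizer", "name": "nfkc_cf"})
        else:
            upgraded.append(flt)
    return upgraded


# Illustrative chains only; real unpacked analyzers have more filters.
greek_chain = [{"type": "lowercase", "language": "greek"}, {"type": "stop"}]
generic_chain = [{"type": "lowercase"}, {"type": "stop"}]

print(upgrade_lowercase(greek_chain))
print(upgrade_lowercase(generic_chain))
```

The same rule covers the Turkish and Irish cases, since they are also configured with a `language` setting on the lowercase filter.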
Mar 4 2019
Feb 26 2019
Feb 21 2019
We need to reindex, but not until after the ES6 upgrade is complete, and LTR has been disabled.
Feb 20 2019
@EBernhardson, thanks for the explanation!
The spikes on create_index are pretty extreme, with 194s for chi-eqiad-with-archive and 291s for omega-eqiad-with-archive. Is that just bad luck, or is something going on with the archives that makes this sometimes take much longer?
Feb 14 2019
Bleh. It looks like that symbol is turned into a text boundary by the standard analyzer, which isn't nice.