Re-index un-fallbacked languages
Closed, Resolved | Public

Description

The following have had their language configs modified as a result of disabling fallback languages. They need to be re-indexed after the parent task (T147959) has been deployed.

We also need to think about how we communicate the changes—should we have general announcements or individual announcements on individual wikis? Just on Wikipedias, or on all affected projects in a language?

  • arz: Egyptian Arabic
  • oc: Occitan
  • sk: Slovak
  • kl: Greenlandic
  • nds-nl: Dutch Low Saxon
  • li: Limburgish
  • srn: Sranan
  • vls: West Flemish
  • zea: Zeelandic
  • olo: Livvi-Karelian
  • bm: Bambara
  • br: Breton
  • frp: Franco-Provençal
  • ff: Fula
  • ht: Haitian
  • ln: Lingala
  • mg: Malagasy
  • nrm: Norman
  • pcd: Picard
  • sg: Sango
  • ty: Tahitian
  • wa: Walloon
  • wo: Wolof
  • als: Alemannic
  • bar: Bavarian
  • nds: Low Saxon
  • dsb: Lower Sorbian
  • lb: Luxembourgish
  • frr: North Frisian
  • pfl: Palatinate German
  • pdc: Pennsylvania German
  • ksh: Ripuarian
  • stq: Saterland Frisian
  • hsb: Upper Sorbian
  • pnt: Pontic Greek
  • mai: Maithili
  • sa: Sanskrit
  • ace: Acehnese
  • bjn: Banjar
  • map-bms: Banyumasan
  • bug: Buginese
  • jv: Javanese
  • min: Minangkabau
  • su: Sundanese
  • co: Corsican
  • eml: Emilian-Romagnol
  • fur: Friulian
  • lij: Ligurian
  • lmo: Lombard
  • nap: Neapolitan
  • pms: Piedmontese
  • roa-tara: Tarantino
  • scn: Sicilian
  • vec: Venetian
  • ltg: Latgalian
  • bat-smg: Samogitian
  • glk: Gilaki
  • mzn: Mazandarani
  • lrc: Northern Luri
  • azb: Southern Azerbaijani
  • csb: Kashubian
  • szl: Silesian
  • mwl: Mirandese
  • roa-rup: Aromanian
  • mo: Moldovan Cyrillic (Romanian)
  • rmy: Romani
  • ab: Abkhazian
  • av: Avar
  • ba: Bashkir
  • bxr: Buryat
  • ce: Chechen
  • cv: Chuvash
  • myv: Erzya
  • mrj: Hill Mari
  • xal: Kalmyk
  • krc: Karachay-Balkar
  • kv: Komi
  • koi: Komi-Permyak
  • lbe: Lak
  • lez: Lezgian
  • mhr: Meadow Mari
  • mdf: Moksha
  • os: Ossetian
  • sah: Sakha
  • tt: Tatar
  • tyv: Tuvan
  • udm: Udmurt
  • an: Aragonese
  • ast: Asturian
  • ay: Aymara
  • cbk-zam: Chavacano
  • ext: Extremaduran
  • gn: Guarani
  • lad: Ladino
  • nah: Nahuatl
  • qu: Quechua
  • gag: Gagauz
  • rue: Rusyn

Configs have changed but no action is necessary:

  • yi: Yiddish
  • atj: Atikamekw
  • kbp: Kabiye

The last three were configured incorrectly, but didn't have a chance to be indexed incorrectly. Yiddish because there was no Hebrew analyzer available until very recently, and Atikamekw and Kabiye presumably because their fallback was configured after they were initially indexed and they haven't been re-indexed since.

TJones updated the task description. Oct 10 2017, 7:12 PM

@Smalyshev suggested reviewing zh-* languages specifically and un-fallbacked languages in general to see if there are any obvious changes that should be made. I think that should be a quick task, and if it is, I'll comment here and create another task if necessary to update their configs and mark them in the description above as being dependent on another task.

TJones updated the task description. Oct 10 2017, 8:17 PM

Okay, the zh-* languages thing got out of control. It's complicated (see T177888) and none were/will be changed by the fallback changes, so I'm dropping that from this ticket. I'll try to review the rest of the un-fallbacked languages this week.

That's a lot of languages, but hopefully the re-indexing won't be too much of a strain or take too long. :)

@CKoerner_WMF - how do you feel we should do the communication to the communities affected:

general announcements or individual announcements on individual wikis? Just on Wikipedias, or on all affected projects in a language?

That's a lot of languages, but hopefully the re-indexing won't be too much of a strain or take too long. :)

Some of the languages only have a Wikipedia, or only a Wikipedia and a Wiktionary. Many of the Wikipedias are very small—thousands to tens of thousands of articles—so the re-indexing should be quick and easy. The strain will be on the human who has to issue so many re-index commands!

All of the un-fallbacked languages seem okay with the default analyzer. Javanese script doesn't use spaces, but... (a) Javanese wikis mostly use the Javanese Latin alphabet, (b) the ICU tokenizer is already configured for Javanese as a result of earlier spaceless language config, and (c) none of the tokenizers do anything different with it anyway. So, we're ready to start re-indexing once the changes have been deployed.

debt added a comment. Oct 11 2017, 9:36 PM

...we're ready to start re-indexing once the changes have been deployed.

woohoo! :)

Oof, that's a lot of wikis, many of them small where messages from the WMF trounce any actual community discussion (random example).

I suggest we:

  • Document this work on MediaWiki somewhere - to help with future searches if people do have questions
  • Add to Tech/News (I just did!)
  • Post an update to wikitech-l and wikitech-ambassadors
  • Add to the weekly Discovery update

Do others agree?

debt added a comment. Oct 12 2017, 9:40 PM

I suggest we:

  • Document this work on MediaWiki somewhere - to help with future searches if people do have questions
  • Add to Tech/News (I just did!)
  • Post an update to wikitech-l and wikitech-ambassadors
  • Add to the weekly Discovery update

    Do others agree?

@TJones has this page that documents the what and why: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Fallback_Languages
and we can certainly add in the 'when' to it, or just reference the email posts and Discovery update and this ticket in the Notes.

I think we're in a good spot and will just need to craft the wording for the emails. Or, just use @CKoerner_WMF's wording that he added to Tech News:

  • Searching in some languages used other languages instead when the search in the first language didn't work. This created bad searches. The search index is being fixed to work better. [https://phabricator.wikimedia.org/T177871]

...and then, of course, when the re-indexing is done, we send the emails out.

TJones added a subscriber: dcausse. Oct 13 2017, 3:09 PM

Should we post an update before we know when the re-indexing is going to happen? I think @dcausse is likely to be the person who does the actual updates (I don't have permissions on the relevant servers), so should we plan around when he thinks he can do it? If it is next week, then TechNews 2017/42 would be a good place to announce it. Otherwise should we wait for 43? The other channels are all easy to post to right before we do it.

@TJones has this page that documents the what and why: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Fallback_Languages

The updated version of my notes is Fallback Redux, but the more user-friendly version is Disabling Messaging Fallbacks for Language Analysis, and that's probably what we should point people to on MediaWiki, if anything.

Johan added a subscriber: Johan. Oct 13 2017, 3:11 PM

It's already included in Tech News 2017/42 – @CKoerner_WMF added an item.

@Johan, I was suggesting that we consider delaying it to Tech News 2017/43 (unless that's not possible) if we need to.

Johan added a comment. Oct 13 2017, 3:18 PM

Right. Sorry for the misunderstanding. It's too late to add or change the text, but if you think the community will be less confused if we move it one week later, I can move the text to 2017/43 as long as you tell me by Monday morning UTC. (I should see any pings here, but email me to be on the safe side.)

debt added a comment. Oct 13 2017, 4:06 PM

It looks like @dcausse will be able to do the re-indexing after next week's train deployment. Because we really don't want to start the re-indexing on a Friday, the earliest we'd be able to do it is Monday, Oct 23rd.

@Johan - it sounds fine to publish the upcoming re-indexing next week, but can it mention that it'll actually start on Monday the 23rd? Or, maybe another note in issue 43?

Johan added a comment. Oct 16 2017, 2:22 PM

I very much prefer not to add information after the translators have been told it's safe to ignore the issue as no further edits will be made, so I'm moving it to the next issue. (:

Thanks, @Johan! Sorry for the last-minute shuffle as we get all our ducks in a row.

debt assigned this task to dcausse. Oct 17 2017, 5:22 PM

We'll get started on this next week on Monday.

List of affected wikis:

abwiki
abwiktionary
acewiki
anwiki
anwiktionary
arzwiki
astwiki
astwikibooks
astwikiquote
astwiktionary
avwiki
avwiktionary
aywiki
aywikibooks
aywiktionary
azbwiki
barwiki
bawiki
bawikibooks
bjnwiki
bmwiki
bmwikibooks
bmwikiquote
bmwiktionary
brwiki
brwikiquote
brwikisource
brwiktionary
bugwiki
bxrwiki
cbk_zamwiki
cewiki
cowiki
cowikibooks
cowikiquote
cowiktionary
csbwiki
csbwiktionary
cvwiki
cvwikibooks
dsbwiki
emlwiki
extwiki
ffwiki
frpwiki
frrwiki
furwiki
gagwiki
glkwiki
gnwiki
gnwikibooks
gnwiktionary
hsbwiki
hsbwiktionary
htwiki
htwikisource
jvwiki
jvwiktionary
klwiki
klwiktionary
koiwiki
krcwiki
kshwiki
kvwiki
ladwiki
lbewiki
lbwiki
lbwikibooks
lbwikiquote
lbwiktionary
lezwiki
lijwiki
liwiki
liwikibooks
liwikiquote
liwikisource
liwiktionary
lmowiki
lnwiki
lnwikibooks
lnwiktionary
lrcwiki
ltgwiki
maiwiki
maiwikimedia
map_bmswiki
mdfwiki
mgwiki
mgwikibooks
mgwiktionary
mhrwiki
minwiki
mowiki
mowiktionary
mrjwiki
mwlwiki
myvwiki
mznwiki
nahwiki
nahwikibooks
nahwiktionary
napwiki
nds_nlwiki
ndswiki
ndswikibooks
ndswikiquote
ndswiktionary
nrmwiki
ocwiki
ocwikibooks
ocwiktionary
olowiki
oswiki
pcdwiki
pdcwiki
pflwiki
pmswiki
pntwiki
quwiki
quwikibooks
quwikiquote
quwiktionary
rmywiki
roa_tarawiki
ruewiki
sahwiki
sahwikisource
sawiki
sawikibooks
sawikiquote
sawikisource
sawiktionary
scnwiki
scnwiktionary
sgwiki
sgwiktionary
skwiki
skwikibooks
skwikiquote
skwikisource
skwiktionary
srnwiki
stqwiki
suwiki
suwikibooks
suwikiquote
suwiktionary
szlwiki
ttwiki
ttwikibooks
ttwikiquote
ttwiktionary
tyvwiki
tywiki
udmwiki
vecwiki
vecwikisource
vecwiktionary
vlswiki
wawiki
wawikibooks
wawiktionary
wowiki
wowikiquote
wowiktionary
xalwiki
zeawiki
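
The wiki names in the list above follow a visible pattern relative to the language codes in the task description: hyphens in the code become underscores (nds-nl becomes nds_nlwiki, map-bms becomes map_bmswiki) and a project suffix is appended. A minimal sketch of that pattern, assuming nothing beyond what the list shows; the function name `wiki_db_name` is hypothetical, and the function says nothing about which projects actually exist for a given language:

```python
def wiki_db_name(lang_code: str, project: str = "wiki") -> str:
    """Build a wiki name from a language code and a project suffix.

    Hypothetical helper: it only mirrors the naming pattern seen in the
    list above (hyphen -> underscore, then the suffix). It does not check
    that the resulting wiki exists.
    """
    return lang_code.replace("-", "_") + project


# Language codes taken from the task description.
for code in ["nds-nl", "map-bms", "roa-tara", "cbk-zam", "sk"]:
    print(code, "->", wiki_db_name(code))

# Other projects use a different suffix, e.g. the Slovak Wiktionary.
print(wiki_db_name("sk", "wiktionary"))
```

This is only useful for cross-checking the language list against the wiki list by eye; the authoritative list of affected wikis is the one posted above.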

Mentioned in SAL (#wikimedia-operations) [2017-10-23T10:09:52Z] <dcausse> elasticsearch/cirrus reindexing 167 wikis from terbium (T177871)

Mentioned in SAL (#wikimedia-operations) [2017-10-24T10:50:56Z] <dcausse> cirrus/elasticsearch: reindexing of 167 small wikis done (T177871)
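
For a rough sense of how long the run took, the two SAL timestamps above can be subtracted. This is a sketch under a clear assumption: the timestamps record when the log messages were posted, so the result is only an approximation of the actual run time, and the per-wiki average is not a measurement of any individual wiki.

```python
from datetime import datetime

# Timestamps copied from the two SAL entries above (both UTC).
SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
started = datetime.strptime("2017-10-23T10:09:52Z", SAL_FORMAT)
finished = datetime.strptime("2017-10-24T10:50:56Z", SAL_FORMAT)

elapsed = finished - started
wiki_count = 167  # number of wikis named in the SAL entries

print("elapsed:", elapsed)  # elapsed: 1 day, 0:41:04
avg_minutes = elapsed.total_seconds() / 60 / wiki_count
print("average minutes per wiki:", round(avg_minutes, 1))  # 8.9
```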

TJones updated the task description. Oct 24 2017, 2:31 PM
TJones added a comment. Edited Oct 24 2017, 2:38 PM

Thanks @dcausse!

  • Document this work on MediaWiki somewhere - to help with future searches if people do have questions
  • Add to Tech/News (I just did!)
  • Post an update to wikitech-l and wikitech-ambassadors
  • Add to the weekly Discovery update

I've replied to my earlier messages to the mailing lists (discovery, wikitech-l and wikitech-ambassadors) with updates and I'll add it to the Discovery update for next week. I'll look around and see if there are any other channels that should get a message, too. (Better to communicate a bit too much rather than a bit too little, right?)

I've replied to Chris's comments on Babylon, and all of the Village Pumps that I had posted to before: Slovak, Mirandese, Occitan, Limburgish (which spilled over to my talk page), Egyptian Arabic, Gagauz, and Livvi-Karelian. I'll follow those conversations for the rest of the week to see if any concerns come up. I've also updated the page on MediaWiki to reflect what has happened rather than what will happen.

debt added a comment. Oct 24 2017, 4:07 PM

🎉 👍 thanks, @TJones and @dcausse!

debt closed this task as Resolved. Oct 26 2017, 3:53 PM
Restricted Application reassigned this task from dcausse to R3609901. Nov 17 2017, 3:45 AM
Aklapper reassigned this task from R3609901 to dcausse. Nov 17 2017, 1:53 PM
Aklapper added a subscriber: R3609901.