
RFC: Reevaluate LocalisationUpdate extension for WMF
Open, Normal, Public

Event Timeline

demon added a comment.Mar 7 2017, 2:10 AM

Is there any way we could look at statistics to see how many changes actually happen on a daily basis?

The best course of action is to see which repos (and how much in each) get updated by the daily commits from TWN: Gerrit changes by L10n-bot.

Reedy added a comment.Mar 7 2017, 8:04 PM

I wonder how much weight T45917 has in all this

Krinkle added a subscriber: Krinkle.Mar 8 2017, 9:26 PM
daniel added a comment.Mar 9 2017, 3:32 PM

A note from last night's ArchCom session: we talked about this briefly. @Krinkle raised concerns about the current implementation, which has caused outages in the past. He suggested overhauling or, better, rewriting LocalizationUpdate if we want to keep using it on the WMF cluster. This means the options are basically:

  1. drop LocalizationUpdate
  2. invest time into reviewing/overhauling/rewriting LocalizationUpdate

We also discussed whether this RFC is in ArchCom's scope at all. There seemed to be consensus that while ArchCom shouldn't decide whether LocalizationUpdate is used on the cluster (that's up to release engineering), it's ArchCom's job to make sure that if we use it, LocalizationUpdate is implemented in a way that is safe, scalable, and architecturally sound.

matmarex added a subscriber: matmarex.EditedMar 9 2017, 5:40 PM

If the problem is two different systems fighting… could we instead backport l10n update Git commits from master to wmf branches daily, and deploy that with Scap? Even if that couldn't run unattended (could it?), doing it daily-on-weekdays, or even daily-on-train-days, would still be far better than weekly.

Side note: I think quick updates matter the most for newly introduced messages – if someone merges a change on Tuesday morning that replaces an established message (with translations) for a new one (naturally without translations), it's important for users who translate it on Wednesday not to have to wait a week to see their translation on a wiki in their language. But if there has been a typo in some translation for months, it's not critical for the correction to go live immediately.

demon added a comment.Mar 9 2017, 5:46 PM

If the problem is two different systems fighting… could we instead backport l10n update Git commits from master to wmf branches daily, and deploy that with Scap? Even if that couldn't run unattended (could it?), doing it daily-on-weekdays, or even daily-on-train-days, would still be far better than weekly.

I was having that thought last night--that's a possible route forward and removes the special-snowflake status of these deploys.

MaxSem added a subscriber: MaxSem.EditedMar 9 2017, 7:10 PM

If the problem is two different systems fighting… could we instead backport l10n update Git commits from master to wmf branches daily, and deploy that with Scap? Even if that couldn't run unattended (could it?), doing it daily-on-weekdays, or even daily-on-train-days, would still be far better than weekly.

I was having that thought last night--that's a possible route forward and removes the special-snowflake status of these deploys.

That would potentially clash with security patches, however. Which returns us to the question of how we can make security releases easier?

Another thing: just recently I read somewhere (I cannot remember where, or whether it was an RFC or the like) that people at WMF are considering switching from weekly deployments to much more infrequent deployments. So if translation updates are tied to deployments, this would mean that we are not just talking about a week but a longer time span. Ouch. So tying translation updates to deployments does not appear to be cool at all.

Err, where is that even coming from? All talks related to deployment train that I'm aware of were about going faster, not slower.

(Edit: didn't finish my thought)

demon added a comment.Mar 9 2017, 7:52 PM

If the problem is two different systems fighting… could we instead backport l10n update Git commits from master to wmf branches daily, and deploy that with Scap? Even if that couldn't run unattended (could it?), doing it daily-on-weekdays, or even daily-on-train-days, would still be far better than weekly.

I was having that thought last night--that's a possible route forward and removes the special-snowflake status of these deploys.

That would potentially clash with security patches, however. Which returns us to the question of how we can make security releases easier?

As does every single SWAT deploy. The vast majority of security patches don't involve message changes, so I'm not particularly worried.

[..] There seemed to be consensus that while ArchCom shouldn't decide whether LocalizationUpdate is used on the cluster (that's up to release engineering), it's ArchCom's job to make sure that if we use it, LocalizationUpdate is implemented in a way that is safe, scalable, and architecturally sound.

If the problem is two different systems fighting… could we instead backport l10n update Git commits from master to wmf branches daily, and deploy that with Scap? Even if that couldn't run unattended (could it?), doing it daily-on-weekdays, or even daily-on-train-days, would still be far better than weekly.

I was having that thought last night--that's a possible route forward and removes the special-snowflake status of these deploys.

Yup, I was having the same thought as well. I propose a process that automatically generates a commit for the wmf branch(es) that effectively does what LocalisationUpdate currently does: merge compatible translations from the current master. This would then be a commit in Jenkins that we can roll out manually, either by one of the SWAT operators (as a default SWAT entry every day) or by the person doing the train (on 3/5 week days).

This also has the advantage of being publicly tracked in version control. Given that extensions are in separate repos, we'll probably want to auto-merge them in Jenkins. Perhaps the commits themselves can be drafted, submitted to Gerrit, and force-merged by a scap command? Similar to what translatewiki does once a day, except run from the deployment host by the person deploying it.
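A very rough sketch of what such a commit-generating step might look like, for illustration only (the checkout path and branch name below are assumptions, and the whole-directory copy is a simplification; a real implementation would merge translations per message key, as LocalisationUpdate does, and would go through Gerrit rather than committing directly):

```python
#!/usr/bin/env python3
"""Illustrative sketch: build a wmf-branch commit carrying i18n updates from
master. Paths, branch names and the copy strategy are assumptions, not an
existing scap command."""
import subprocess

REPO = "/srv/mediawiki-staging/php-master"  # assumed checkout path
WMF_BRANCH = "wmf/1.29.0-wmf.18"            # assumed branch name

def git(*args):
    subprocess.run(("git",) + args, cwd=REPO, check=True)

git("fetch", "origin", "master", WMF_BRANCH)
git("checkout", WMF_BRANCH)

# Naive version: take core's i18n files wholesale from master. A real tool
# would merge per message key, so only translations for keys that already
# exist in the wmf branch get updated ("compatible translations").
git("checkout", "origin/master", "--", "languages/i18n")

# Commit the result; in practice this would be submitted to Gerrit for the
# SWAT or train deployer to review and roll out.
git("commit", "-m", "Localisation updates from master (automated)")
```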

demon added a comment.Mar 10 2017, 5:08 PM

[..] There seemed to be consensus that while ArchCom shouldn't decide whether LocalizationUpdate is used on the cluster (that's up to release engineering), it's ArchCom's job to make sure that if we use it, LocalizationUpdate is implemented in a way that is safe, scalable, and architecturally sound.

In T158360#3088471, @demon wrote:

In T158360#3088438, @matmarex wrote:

If the problem is two different systems fighting… could we instead backport l10n update Git commits from master to wmf branches daily, and deploy that with Scap? Even if that couldn't run unattended (could it?), doing it daily-on-weekdays, or even daily-on-train-days, would still be far better than weekly.

I was having that thought last night--that's a possible route forward and removes the special-snowflake status of these deploys.

Yup, I was having the same thought as well. I propose a process that automatically generates a commit for the wmf branch(es) that effectively does what LocalisationUpdate currently does: merge compatible translations from the current master. This would then be a commit in Jenkins that we can roll out manually, either by one of the SWAT operators (as a default SWAT entry every day) or by the person doing the train (on 3/5 week days).
This also has the advantage of being publicly tracked in version control. Given that extensions are in separate repos, we'll probably want to auto-merge them in Jenkins.

This all sounds good

Perhaps the commits themselves can be drafted, submitted to Gerrit, and force-merged by a scap command? Similar to what translatewiki does once a day, except run from the deployment host by the person deploying it.

I'd rather it all be in LocalisationUpdate as a maintenance script there, not in scap. Otherwise no disagreements.

daniel moved this task from Inbox to In progress on the TechCom-RFC board.Mar 15 2017, 8:11 PM
Krinkle moved this task from Backlog to Krinkle on the TechCom-Has-shepherd board.

Does this proposal mean that the MediaWiki translations will also go away? There are cases when the software is translated at translatewiki in a generic way, which should make any 3rd party user happy, while on Wikipedia we adapt it to Wikimedia tools and local customs. How will those be treated in the future?

Reedy added a comment.Mar 17 2017, 3:51 PM

Does this proposal mean that the MediaWiki translations will also go away? There are cases when the software is translated at translatewiki in a generic way, which should make any 3rd party user happy, while on Wikipedia we adapt it to Wikimedia tools and local customs. How will those be treated in the future?

No. Just the nightly automated updates onto WMF wikis. At worst, they would become weekly.

Nemo_bis added a comment.EditedMar 19 2017, 9:45 AM

How will those be treated in the future?

I've clarified the wiki page for the RfC. When LocalisationUpdate isn't running, such cases of local translations increase, so there is a need for a cleanup process.

Reedy added a comment.EditedMar 24 2017, 2:27 AM

From operations just now, during an outage:

[02:21:11] <Krinkle> l10n update has the lock
[02:22:13] <RoanKattouw> Kill l10nupdate?
[02:22:25] <Krinkle> !log Hard-killed all l10nupdate processes and rm'ed scap lock

Granted, this was just timing. But no one was in control of it, and it happened to be running at the same time as an outage. That meant it had to be manually killed, which added to the time it took to fix the sites.

I like the idea of making l10nupdate more version-control-driven as @Krinkle suggested, and generally agree that it should be improved, particularly the deployment infrastructure around it. I don't think we should eliminate daily updates, although I would be OK with updates being more human-gated and/or updates not happening on Fridays, Saturdays and Sundays.

Removing this from the RFC board. After brief discussion, we (ArchCom) agreed that this is not really in our scope: it's up to rel-eng to decide whether they want to keep this or kill this, and how much they want to invest in making it better. There are no cross-cutting or strategic concerns here, and any decision on the issue is easy enough to revert.

(Note: ArchCom is currently working on clarifying the scope and purpose of RFCs; if you think we are wrong about this, let us know why)

demon added a comment.Apr 13 2017, 7:44 PM

it's up to rel-eng to decide whether they want to keep this or kill this, and how much they want to invest in making it better. There are no cross-cutting or strategic concerns here, and any decision on the issue is easy enough to revert.

I disagree--it's not solely up to RelEng or I would've just killed it and not bothered with an RfC to begin with. There are quite a few cross-cutting and strategic concerns here.

But if ArchCom doesn't want to be a part of it, I don't really care... I was just following what we thought to be the rules.

@demon I did not mean to say that RelEng should decide without consultation. But it's ultimately up to them. I also agree that the attention and broader discussion was useful. But we feel that the ArchCom-RFC process should not be the only (or the preferred) way of getting that attention, unless we are talking about strategic decisions, or cross-cutting issues, or decisions inherently hard to undo. Basically, we have little expertise on the issues discussed here.

I appreciate that you tried to follow due process here; sadly, the scope of ArchCom RFCs is not well defined. We are currently trying to fix that, and this ticket happens to be one of the first places where we apply our new experimental (and currently undocumented) understanding of scope.

Please consider this an experiment, and a learning opportunity for me and ArchCom. What are the cross-cutting or strategic concerns?

demon added a comment.Apr 13 2017, 8:13 PM

@demon I did not mean to say that RelEng should decide without consultation. But it's ultimately up to them.

I disagree, it's not up to us :)

Basically, we have little expertise on the issues discussed here.

Neither do we! That's half of the problem :)

What are the cross-cutting or strategic concerns?

Well, there are many stakeholders in up-to-date translations, as this discussion has shown. I consider that pretty cross-cutting--it's not only RelEng who has to deal with the consequences of decisions made here :)

All that being said, I think we've found some solutions here that can be acted upon.

All that being said, I think we've found some solutions here that can be acted upon.

I believe the "not cross-cutting" appearance was based on the idea of implementing this as a scap plugin that creates Git commits. As such, unlike the LocalisationUpdate extension, there aren't any background processes, silent deploys, or interaction with any other deployed service or product.

It would essentially just be like the localisation update commits we already merge and deploy. The daily update process would not be an active component in the MediaWiki platform. It would "Just work", passively.

Also, if the solution ends up being used with regular frequency, it wouldn't have much cross-cutting impact from a social perspective, given that we'd still effectively have nightly updates. The only exception would be Fridays and weekends. I suppose we could have a discussion about that, perhaps as an orthogonal proposal to not run the existing LocalisationUpdate process on those days.

Krinkle edited projects, added Deployments; removed Wikimedia-General-or-Unknown.
Krinkle renamed this task from RFC: Disabling LocalisationUpdate on WMF wikis to RFC: Reevaluate LocalisationUpdate extension for WMF.Apr 26 2017, 9:04 PM
Krinkle updated the task description. (Show Details)
Krinkle added a comment.EditedApr 26 2017, 9:09 PM

Per @demon's comment, we've re-added this to the TechCom-RFC board. While one of the proposed solutions isn't very cross-cutting in its implementation (the one involving scap creating git commits that are deployed normally), we acknowledge that it still impacts users and developers. We can therefore still help facilitate this proposal and work towards an approved solution that involves the different stakeholders.

  • Users and translators: No updates over the weekend.
  • Developer processes: Possible conflicts in wmf branches?
  • Operations: Where is this process going to run? How will the commits be submitted to Gerrit?
  • Release-Engineering (or Language-Engineering): How, when, and by whom will they be deployed?

See updated RFC at: https://www.mediawiki.org/wiki/Requests_for_comment/Reevaluate_LocalisationUpdate_extension

greg added a comment.Apr 26 2017, 11:24 PM

Another data point is this bug report: T163671: LocalisationUpdate not working since 2017-04-11

As you can see from the title, l10nupdate had not been working since 2017-04-11. That task was reported on 2017-04-24, almost two full weeks later. Which, I believe, re-opens the "what's wrong with weekly (at least) updates (with updates during a SWAT on an as-needed basis)?" question. I won't say more on that topic now, though.

One change that should be uncontroversial is to only have l10nupdate run during the work week. It currently runs at 2am UTC. That means we can probably have it run Mon-Fri at that time without much issue. In the Pacific timezone, that's Sunday night through Thursday night. Given the situation with l10nupdate today, I recommend we make this change ASAP. Such a small change, bringing it in line with our standard operating procedures for all deploys, should not require the rest of this RFC process to complete.
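For illustration, the schedule change could be as small as a crontab-style entry along these lines (the user, script path, and log path are guesses, not the actual Puppet-managed definition):

```
# Run l10nupdate at 02:00 UTC Monday through Friday only (day-of-week 1-5),
# instead of every day.
0 2 * * 1-5  l10nupdate  /usr/local/bin/l10nupdate-1 >> /var/log/l10nupdatelog/l10nupdate.log 2>&1
```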

Pikne added a subscriber: Pikne.May 4 2017, 7:54 AM

As you can see from the title, l10nupdate had not been working since 2017-04-11. That task was reported on 2017-04-24, almost two full weeks later. Which, I believe, re-opens the "what's wrong with weekly (at least) updates (with updates during a SWAT on an as-needed basis)?" question. I won't say more on that topic now, though.

It may be that it wasn't reported sooner because translators are used to complications with the updating process, and it's hard to tell whether it's broken, whether there's yet another temporary outage, or whether a delay to the daily update is intentional due to a message definition update or some other special condition related to a certain message or repo. This time I looked into it a little and reported it, but usually I'm just patient. I do expect interface texts to be updated in a timely manner, though. In any case, thanks for fixing it.

I did notice that messages weren't being updated, but I assumed it was related to the reduction of deployments around the DC switch, and there weren't any particularly important updates that I wanted to get fixed urgently.

I did report particular LU failures numerous times in the past.

demon added a comment.May 4 2017, 4:32 PM

I did notice that messages weren't being updated, but I assumed it was related to the reduction of deployments around the DC switch, and there weren't any particularly important updates that I wanted to get fixed urgently.

But LU runs even when we're not doing deployments (unless we explicitly cut it off). If the default assumption of "no updates" is "we aren't deploying code right now" -- how does this differ from letting things go with the train?

I did report particular LU failures numerous times in the past.

Yes, they do get reported. There tends to be a bit of a lag between failure & them getting noticed and filed. Not blaming you though, just my general observation.

Since it is hard to detect when LU is or isn't working, we could add one dummy message key to MediaWiki core which contains the timestamp of the last export in a given language. Then one could just look at the timestamp (and compare it to what is in git if necessary).
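As a sketch of how that could be checked from the outside, assuming a hypothetical marker message whose content is the timestamp of the last export for a language (the key name "last-translation-export" is invented here for illustration; nothing like it exists today):

```python
"""Illustrative only: read a hypothetical 'last export' marker message via the
standard allmessages API, so it can be compared with what is in git."""
import requests

def live_export_timestamp(api_url, lang):
    resp = requests.get(api_url, params={
        "action": "query",
        "meta": "allmessages",
        "ammessages": "last-translation-export",  # hypothetical message key
        "amlang": lang,
        "format": "json",
    })
    resp.raise_for_status()
    messages = resp.json()["query"]["allmessages"]
    # A message that does not exist on the wiki has no "*" content field.
    return messages[0].get("*") if messages else None

if __name__ == "__main__":
    print(live_export_timestamp("https://de.wikipedia.org/w/api.php", "de"))
```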

greg added a comment.May 5 2017, 4:59 PM

It's not hard at all: it logs success/failure in the SAL after each run.

Kghbln added a comment.May 5 2017, 5:02 PM

It's not hard at all: it logs success/failure in the SAL after each run.

It takes two weeks to check the logs for failures??? ;)

greg added a comment.May 5 2017, 5:05 PM

Exactly my point: the information is there for those who care to see it. Because we've had multiple occasions where it took multiple weeks for people to notice...

Kghbln added a comment.May 5 2017, 5:18 PM

Exactly my point: the information is there for those who care to see it. Because we've had multiple occasions where it took multiple weeks for people to notice...

Frankly, I do not believe that it is the task of the Wikipedia community to notice and report. On high-frequency wikis they just add the translations manually, which has happened multiple times in the past, and on low-frequency wikis people are more concerned with other issues. How can they be supported in getting their translations? I guess this is what this task is all about.

greg added a comment.May 5 2017, 5:26 PM

Exactly my point: the information is there for those who care to see it. Because we've had multiple occasions where it took multiple weeks for people to notice...

Frankly, I do not believe that it is the task of the Wikipedia community to notice and report.

I never said any such thing :)

As stated above, there is no clear owner of the l10nupdate code and process. For us (RelEng) to consider it something we would take ownership of, we would require some basic changes, like the update frequency.

How can they be supported in getting their translations? I guess this is what this task is all about.

I respectfully disagree. This task is about the fact that no one owns l10nupdate at the Foundation and thus it is not taken care of. Nowhere in any of the proposals would any wiki community lose support for getting their translations.

Reedy added a comment.May 5 2017, 5:27 PM

Surely it falls within the purview of the Language team to make sure things are running correctly? I'm not saying they should necessarily be the ones to fix it, but they could at least check on it and file tasks as appropriate.

Anyone is of course welcome to add improved logging, notifications, Ganglia or similar checks, etc., as deemed appropriate.

Reedy added a comment.May 5 2017, 5:29 PM

How can they be supported in getting their translations? I guess this is what this task is all about.

And no one has said we're going to stop exporting from translatewiki on a daily basis, nor that the changes wouldn't ride the train every week like they currently do

The task is about the automated deployment of these translations: how and when they happen.

Pikne added a comment.May 5 2017, 8:40 PM

Exactly my point: the information is there for those who care to see it. Because we've had multiple occasions where it took multiple weeks for people to notice...

Well, a failure in the SAL isn't the only reason why translations sometimes don't go through in a timely manner. As noted above, sometimes the problem is elsewhere, or for certain interface text the delay could be intentional (I usually suspect the latter). Most translators probably don't know where to check any of that. And of course, if the interface text isn't highly visible, then its translation not coming through may also go unnoticed by translators, since they are not expected to come back each time they have translated something to see if the update went smoothly.

Nonetheless, regardless of why it sometimes takes time to report failures, I believe it's fair to expect that the interface looks neat at any given time, and that users see as little untranslated text as possible, for as brief a time as possible.

It's not hard at all: it logs success/failure in the SAL after each run.

The Language team has received reports multiple times about LU being broken. What I proposed could help investigate the root cause of the user reports by eliminating a lot of uncertainty. The presence or absence of the log entry in the SAL alone doesn't tell much. The full logs of the runs are not visible as far as I know, so my suggestion would also make more information available to non-sysadmins.

But I'll bring up the larger question of how to do localisation updates. We might choose to do things one way for now, with the assumption that we will improve it later. Or, if we are certain there are no plans to work on this area in the near future, then we should aim for a good enough solution that avoids as many of the drawbacks listed in the discussion as possible.

I'm linking some ideas on work in this area:

What I am currently missing from the wiki page is a list of requirements/wishes for how things should work to be painless for deployments. My guess would be something like: reliable, fast, human-initiated. But should it go through git? Should it be like regular updates to i18n files, or can we just drop in an optimized blob that would be fast to deploy?

The Language team has received reports multiple times about LU being broken. What I proposed could help investigate the root cause of the user reports by eliminating a lot of uncertainty. The presence or absence of the log entry in the SAL alone doesn't tell much. The full logs of the runs are not visible as far as I know, so my suggestion would also make more information available to non-sysadmins.

AFAIK they only exist on disk at tin:/var/log/l10nupdatelog (or swap tin for whatever is the current deployment host in whichever datacenter). I don't think they make it into Logstash at all.

Nemo_bis updated the task description. (Show Details)Jun 21 2017, 7:19 AM
Krinkle updated the task description. (Show Details)
greg triaged this task as Normal priority.Nov 27 2018, 7:12 PM
greg moved this task from INBOX to Epics on the Release-Engineering-Team board.