See problem statement and proposed solutions at:
|Open||None||T220481 Reenable l10update in production|
|Open||None||T158360 RFC: Reevaluate LocalisationUpdate extension for WMF|
|Open||None||T206694 Determine desired architecture to update localization strings for Wikimedia|
A note from last night's ArchCom session: we talked about this briefly. @Krinkle raised concerns about the current implementation, which has caused outages in the past. He suggested overhauling or, better, rewriting LocalisationUpdate if we want to keep using it on the WMF cluster. This means the options are basically:
We also discussed whether this RFC is in ArchCom's scope at all. There seemed to be consensus that while ArchCom shouldn't decide whether LocalisationUpdate is used on the cluster (that's up to release engineering), it's ArchCom's job to make sure that if we use it, LocalisationUpdate is implemented in a way that is safe, scalable, and architecturally sound.
If the problem is two different systems fighting… could we instead backport l10n update Git commits from master to wmf branches daily, and deploy that with Scap? Even if that couldn't run unattended (could it?), doing it daily-on-weekdays, or even daily-on-train-days, would still be far better than weekly.
Side note: I think quick updates matter the most for newly introduced messages – if someone merges a change on Tuesday morning that replaces an established message (with translations) with a new one (naturally without translations), it's important for users who translate it on Wednesday not to have to wait a week to see their translation on a wiki in their language. But if there has been a typo in some translation for months, it's not critical for the correction to go live immediately.
That would potentially clash with security patches, however. Which returns us to the question: how can we make security releases easier?
Err, where is that even coming from? All talks related to deployment train that I'm aware of were about going faster, not slower.
(Edit: didn't finish my thought)
Yup, I was having the same thought as well. I propose to have a process that automatically generates a commit for the wmf branch(es) that effectively does what LocalisationUpdate currently does: merge compatible translations from the current master. This would then be a commit in Jenkins that we can roll out manually. Either by one of the SWAT operators (as a default SWAT entry every day), or by the person doing the train (on 3/5 week days).
This also has the advantage of being publicly tracked in version control. Given that extensions are in separate repos, we'll probably want to auto-merge them in Jenkins. Perhaps the commits themselves can be drafted, submitted-to-Gerrit, and force-merged by a scap command? Similar to what translatewiki does once a day. Except run from the deployment host by the person deploying it.
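For illustration, here is a minimal sketch of what the "merge compatible translations" step could look like: a translation is only taken from master when the English source message it belongs to is unchanged on the wmf branch, since an updated English definition may have changed the meaning. The function name and data shapes are illustrative assumptions, not actual LocalisationUpdate code.

```python
# Hypothetical sketch of the "merge compatible translations" rule.
# Each argument is a dict of message-key -> text, as read from one
# i18n JSON file (en.json for English, xx.json for some language).
def merge_compatible(branch_en, master_en, branch_xx, master_xx):
    merged = dict(branch_xx)
    for key, new_translation in master_xx.items():
        # Skip messages whose English definition changed between the
        # branch and master: the new translation may not match the
        # meaning of the string still deployed on the branch.
        if branch_en.get(key) != master_en.get(key):
            continue
        merged[key] = new_translation
    return merged
```

The result could then be written back to the branch's i18n files and turned into a single reviewable commit for the deployer to merge.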
This all sounds good
Perhaps the commits themselves can be drafted, submitted-to-Gerrit, and force-merged by a scap command? Similar to what translatewiki does once a day. Except run from the deployment host by the person deploying it.
I'd rather it all be in LocalisationUpdate as a maintenance script there, not in scap. Otherwise no disagreements.
Does this proposal mean that the MediaWiki translations will also go away? There are cases when the software is translated at translatewiki in a generic way, which should make any 3rd party user happy, while on Wikipedia we adapt it to Wikimedia tools and local customs. How will those be treated in the future?
From operations just now during an outage
[02:21:11] <Krinkle> l10n update has the lock
[02:22:13] <RoanKattouw> Kill l10nupdate?
[02:22:25] <Krinkle> !log Hard-killed all l10nupdate processes and rm'ed scap lock
Granted, this was just timing. But no one was in control of it, and it happened to be running at the same time an outage was happening. That meant it had to be manually killed, which meant it took more time to be able to fix the sites.
I like the idea of making l10nupdate more version-control-driven as @Krinkle suggested, and generally agree that it should be improved, particularly the deployment infrastructure around it. I don't think we should eliminate daily updates, although I would be OK with updates being more human-gated and/or updates not happening on Fridays, Saturdays and Sundays.
Removing this from the RFC board. After brief discussion, we (ArchCom) agreed that this is not really in our scope: it's up to rel-eng to decide whether they want to keep this or kill this, and how much they want to invest in making it better. There are no cross-cutting or strategic concerns here, and any decision on the issue is easy enough to revert.
(Note: ArchCom is currently working on clarifying the scope and purpose of RFCs; if you think we are wrong about this, let us know why)
I disagree--it's not solely up to RelEng or I would've just killed it and not bothered with an RfC to begin with. There are quite a few cross-cutting and strategic concerns here.
But if ArchCom doesn't want to be a part of it, I don't really care....was just following what we thought to be the rules.
@demon I did not mean to say that RelEng should decide without consultation. But it's ultimately up to them. I also agree that the attention and broader discussion was useful. But we feel that the ArchCom-RFC process should not be the only (or the preferred) way of getting that attention, unless we are talking about strategic decisions, or cross-cutting issues, or decisions inherently hard to undo. Basically, we have little expertise on the issues discussed here.
I appreciate that you tried to follow due process here; sadly, the scope of ArchCom RFCs is not well defined. We are currently trying to fix that, and this ticket happens to be one of the first places where we apply our new experimental (and currently undocumented) understanding of scope.
Please consider this an experiment, and a learning opportunity for me and ArchCom. What are the cross-cutting or strategic concerns?
I disagree, it's not up to us :)
Basically, we have little expertise on the issues discussed here.
Neither do we! That's half of the problem :)
What are the cross-cutting or strategic concerns?
Well, there's many stakeholders in up-to-date translations, as this discussion has shown. I consider that pretty cross-cutting--it's not only RelEng who has to deal with the consequences of decisions made here :)
All that being said, I think we've found some solutions here that can be acted upon.
I believe the "not cross-cutting" appearance was based on the idea of implementing this as a scap plugin that creates Git commits. As such, unlike the LocalisationUpdate extension, there aren't any background processes, silent deploys, or interaction with any other deployed service or product.
It would essentially just be like the localisation update commits we already merge and deploy. The daily update process would not be an active component in the MediaWiki platform. It would "Just work", passively.
Also, if the solution ends up being used with regular frequency, it wouldn't have much cross-cutting impact from a social perspective, given that we'd still effectively have nightly updates. The only exception being Fridays and weekends. I suppose we could have a discussion about that, perhaps as an orthogonal proposal to not run the existing LocalisationUpdate process on those days.
Per @demon's comment we've re-added this to the TechCom-RFC board. While one of the proposed solutions isn't very cross-cutting in its implementation (the solution involving scap creating git commits that will be deployed normally), we acknowledge that it still impacts users and developers. We can therefore still help facilitate this proposal and work towards an approved solution that involves the different stakeholders.
Another data point is this bug report: T163671: LocalisationUpdate not working since 2017-04-11
As you can see from the title, l10nupdate wasn't working since 2017-04-11. That task was reported on 2017-04-24, almost 2 full weeks later. Which, I believe, re-opens the "what's wrong with weekly (at least) updates (with updates during a SWAT on an as-needed basis)?" question. I won't say more on that topic now, though.
One change that should be uncontroversial is to have l10nupdate run only during the work week. It currently runs at 2am UTC. That means we can probably have it run Mon-Fri at that time without much issue. In the Pacific timezone that's Sunday night through Thursday night. Given the situation with l10nupdate today, I recommend we make this change ASAP. Such a small change, bringing it in line with our standard operating procedures for all deploys, should not require the rest of this RFC process to complete.
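To illustrate, the schedule change could be as small as a crontab tweak along these lines (the script path and the actual puppet wiring are assumptions, not the real config):

```
# Hypothetical crontab sketch: run l10nupdate at 02:00 UTC,
# Monday through Friday only (1-5 in the day-of-week field;
# previously it ran every day of the week).
# m  h  dom mon dow  command
  0  2  *   *   1-5  /usr/local/bin/l10nupdate >> /var/log/l10nupdate.log 2>&1
```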
It may be that it wasn't reported sooner because translators are used to complications with the updating process, and it's hard to tell whether it's broken, whether there's yet another temporary outage, or whether the delay to the daily update is intentional due to a message definition update or some other special condition related to a certain message or repo. This time I looked into it a little and reported it, but usually I'm just patient. I expect interface texts to be updated in a timely manner, though. In any case, thanks for fixing it.
I did notice that messages aren't updated, and assumed that it's related to the reduction of deployments related to the dc switch, and there weren't any particularly important updates that I wanted to get fixed urgently.
I did report particular LU failures numerous times in the past.
But LU runs even when we're not doing deployments (unless we explicitly cut it off). If the default assumption of "no updates" is "we aren't deploying code right now" -- how does this differ from letting things go with the train?
I did report particular LU failures numerous times in the past.
Yes, they do get reported. There tends to be a bit of a lag between failure & them getting noticed and filed. Not blaming you though, just my general observation.
Since it is hard to detect when LU is or isn't working, we could add one dummy message key to MediaWiki core which contains the timestamp of last export in a given language. Then one could just look at the timestamp (and compare to what is in git if necessary).
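To make that concrete, the dummy-key check could be done mechanically. A minimal sketch, assuming a hypothetical message key (say `l10nupdate-sync-timestamp`, not a real MediaWiki message) whose content is the ISO 8601 UTC time of the last export for that language:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical staleness check: compare the timestamp stored in the
# dummy message against the current time. The key name, ISO format,
# and two-day threshold are all illustrative assumptions.
def is_l10n_stale(timestamp_message, now=None, max_age_days=2):
    """timestamp_message: ISO 8601 UTC string exported with the translations."""
    now = now or datetime.now(timezone.utc)
    exported = datetime.fromisoformat(timestamp_message)
    return now - exported > timedelta(days=max_age_days)
```

A monitoring check (or a curious translator fetching the message via the API) could then compare the live value against what is in git, instead of guessing from the presence or absence of SAL entries.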
Frankly, I do not believe that it is the task of the Wikipedia community to notice and report. On high-traffic wikis they just add the translations manually, which has happened multiple times in the past, and on low-traffic wikis people are more concerned with other issues. How can they be supported to get their translations? I guess this is what this task is all about.
I never said any such thing :)
As stated above, there is no clear owner of the l10nupdate code and process. For us (RelEng) to consider it something we would take ownership of we would require some basic changes, like update frequency.
How can they be supported to get their translations? I guess this is what this task is all about.
I respectfully disagree. This task is about the fact that no one owns l10nupdate at the Foundation and thus it is not taken care of. Nowhere in any of the proposals would any wiki community lose support for getting their translations.
Surely it falls within the Language team's purview to make sure things are running correctly? I'm not saying they should necessarily be the ones to fix it, but at least check on it and file tasks as appropriate.
Anyone is of course welcome to add improved logging, notifications, ganglia or similar checks etc as deemed appropriate
And no one has said we're going to stop exporting from translatewiki on a daily basis, nor that the changes wouldn't ride the train every week like they currently do
The task is about the automated deployment of these translations, how and when they happen
Well, a failure in the SAL isn't the only reason why translations sometimes won't go through in a timely manner. As noted above, sometimes the problem is elsewhere, or for certain interface text the delay could be intentional (I usually suspect the latter). Most translators probably don't know where to check any of that. And of course, if the interface text isn't highly visible, then its translation not coming through may also go unnoticed by translators, since they are not expected to come back each time they have translated something to see if the update went smoothly.
Nonetheless, regardless of the reason why it sometimes takes time to report failures, I believe it's fair to expect the interface to look neat at any given time, with users seeing as few untranslated texts as possible, for as short a time as possible.
The Language team has received reports multiple times about LU being broken. What I proposed could help to investigate the root cause of the user reports by eliminating a lot of uncertainty. The presence or absence of the log entry in SAL alone doesn't tell much. The full logs of the runs are not visible as far as I know, so my suggestion would also make more information available to non-sysadmins.
But I'll bring up the larger question of how to do localisation updates. We might choose to do things in one way for now, with the assumption that we will improve it later. Or if we are certain there are no plans to work on this area in near future, then we should aim for a good enough solution that avoids as many of the drawbacks listed in the discussion as possible.
I'm linking some ideas on work on this area:
What I am currently missing from the wiki page is a list of requirements/wishes for how things should work to be painless for deployments. My guess would be something like: reliable, fast, human-initiated. But should it go through git? Should it be like regular updates to i18n files, or can we just drop in an optimized blob that would be fast to deploy?
AFAIK they only exist on disk at tin:/var/log/l10nupdatelog (or swap tin for whatever is the current deployment host in whichever datacenter). I don't think that they make it into logstash at all