
s1 (enwiki) wiki replicas replication has stopped
Closed, Resolved · Public

Event Timeline

And now it's 18 hours behind.

Up to 28 hours now. @Marostegui, it looks like you're the one who has solved similar issues in the past, can you look at this? Or do you know who else to ping?

And now there is a 34 hour lag! How can a system catch up with a lag longer than a day?

Don't worry about catching up; the systems are not normally lagged, so they will recover once replication is working again. The DBAs and service owners could also choose to reclone from another database.

Ladsgroup subscribed.

hmm, there is no need to worry. This is because of a schema change on the revision table (T298560) getting propagated to the cloud. It takes roughly twice the time of the alter (plus some time to recover), and usually the schema changes are fast enough that users don't notice (like five minutes), but the revision table of enwiki is ... special. It took 16-ish hours to finish and it's catching up. It'll reach the current time soon (check orchestrator.wikimedia.org).

Alter table on db1154 (sanitarium) is ongoing:

Slave_SQL_Running_State: Copy to tmp table
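
For anyone curious, that status line is what the replica's SQL thread reports while the ALTER copies rows. A minimal sketch of polling it, assuming direct DBA access to the sanitarium host (which regular users don't have) and placeholder connection details:

```
# Hypothetical sketch: poll the SQL-thread state on the sanitarium replica.
# Host name and credentials file are placeholders; only DBAs can do this.
import time
import pymysql
import pymysql.cursors

conn = pymysql.connect(
    host="db1154.eqiad.wmnet",                  # placeholder host
    read_default_file="~/.my.cnf",              # placeholder credentials
    cursorclass=pymysql.cursors.DictCursor,
)
with conn.cursor() as cur:
    while True:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
        print(status["Slave_SQL_Running_State"], status["Seconds_Behind_Master"])
        if status["Slave_SQL_Running_State"] != "Copy to tmp table":
            break                               # the ALTER finished copying rows
        time.sleep(60)
conn.close()
```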

Thank you for updating the task.

Orchestrator isn't public as far as I can see, so I'm not sure who that part of the comment was aimed at.

It was for Kunal. I'm sure it'll catch up really soon.

We always worry :-) but thank you for the explanation, it's appreciated and I'll add orchestrator to my list of things to check in the future.

On @RhinosF1's suggestion, I sent a heads-up to the cloud list about this: https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/5QG3EHJ4C22LJHD3M2J3GGFLDGBLHRFF/

Ladsgroup says "soon" and Legoktm's message says "should recover by Monday". Some clarity would be helpful for those of us who process daily (or more frequent) reports that are not updating. Thanks for anything you can provide.

My message was relaying what Ladsgroup said on IRC, that it should recover by Monday :-)

Monday is not "soon", at least not to me on this Friday night. But I appreciate any specific estimate.

I'm with Jonesey95: there are those of us who use bot reports and database reports throughout the day, so things that need to be fixed are going to pile up for four days.

I think the part of this problem that concerns me is that it existed for most of Thursday, and when I posted about it to en-WP:VPT there were no messages about it on that noticeboard. I would think a time lag this serious would have caught the eye of developers who work on the database systems, and the fact that a very non-technical person like myself caused a phab ticket to be filed is not reassuring! I don't know the world you work in, but aren't these processes checked on a daily basis, so that if anything is awry, the folks who can fix it find out about it? Just curious. Thanks to everyone trying to resolve this problem.

As Amir said, this is expected downtime.

The WikiReplicas are a best effort service.

Please keep in mind that the WikiReplicas do not have lag in most cases. However, it is never guaranteed that this service will be up or without lag. It is best effort, and we try to treat it as production as much as we can, but it is not considered production.

In this particular case, the lag is unavoidable. If we deploy schema changes, they will reach the WikiReplicas and they will get lagged, like production does (but there we can depool the hosts).
We have no way to avoid this, and there is no way we can speed up schema changes there. Sometimes they last a few minutes, sometimes they last days, because we are touching very big tables.
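
For tool authors who want to check the replica lag for themselves rather than wait for updates here, a minimal sketch from a Toolforge account, assuming the heartbeat_p.heartbeat view exposed on the Wiki Replicas and the usual replica.my.cnf credentials file (the hostname may differ in your setup):

```
# Minimal sketch: read the current s1 (enwiki) lag from the Wiki Replicas.
# The analytics hostname and credentials path are assumptions; adjust as needed.
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # assumed endpoint
    database="heartbeat_p",
    read_default_file="~/replica.my.cnf",            # Toolforge-style credentials
)
with conn.cursor() as cur:
    cur.execute("SELECT shard, last_updated, lag FROM heartbeat WHERE shard = 's1'")
    for shard, last_updated, lag in cur.fetchall():
        print(f"{shard}: about {lag} seconds behind (as of {last_updated})")
conn.close()
```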

Monday is the worst-case estimate; the likely scenario is tomorrow morning (EU time), but the actual time depends on many factors, such as hardware, so I can't give an exact time. From what I'm seeing in orchestrator, sanitarium is already recovering and will be up to speed in two to three hours, but the clouddbs (the actual user-facing DBs) need more time, likely by Sunday morning, worst case Monday morning.

The good news is that such alter tables are rather rare. We need to break down and compact large tables (including, but not limited to, the revision table and templatelinks), and this is happening; it will just take time to get there. Thank you for your patience.

If this level of service is to be expected, volunteer editors can probably learn to live with it. It would be helpful to have an announcement when a key system on which we rely is expected to be down for a few days. Regardless of whether this system is considered "production" or not, it is apparently used to generate many hourly and daily reports on the English Wikipedia, none of which have been updated since sometime Thursday or early Friday UTC. If we should be using some other (production) system to generate our reports, please let us know what it is so that we can make the appropriate adjustments.

The only thing I have to add is that we probably need to make some effort again to turn the more important database reports into proper MediaWiki special pages so they are treated as production quality, since it seems that editors are relying on them that much. I'll start a subthread on VPT about this.

I don't understand the difference between what is "production" and what is not (and what difference that makes to the developers) but, again, I'm with Jonesey95. All information that is communicated is valuable to us volunteer editors, including what you have just shared here today. Most of us have our daily editing routines and handling problems that show up on bot or database reports is part of our daily work. When they go down, we understandably have questions.

But I also understand that some issues are unavoidable (our power just went out last night, and that was both unexpected and unavoidable when a power line goes down). Still, advance notice would be useful. It seems like there used to be messages on our Watchlist page about expected downtimes, and sometimes a warning would even pop up an hour or so before the English Wikipedia basically went "read-only". Then we could adjust our expectations.

Thank you for helping to resolve this issue and for sharing this information with us.

"Production" doesn't mean "for readers rather than editors". Readers come first, but editing tools help us to serve them well. One action by an editor may help thousands of readers.

Most reports work better in their current implementation than as special pages. Reports let us access the usual page tools. For example, I routinely compare one report with yesterday's version to identify new entries, but special pages lack history. Any competent bot operator can modify a report, but enhancing or fixing a special page needs developer time, and many such changes are still queued after several years.
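
As an aside, the "compare with yesterday's version" step is easy to script against the action API; a rough sketch, with a placeholder report title, that diffs the two most recent revisions of a report page:

```
# Rough sketch: find entries added in the latest revision of a bot report.
# The report title is a placeholder; any regularly updated report page works.
import requests

API = "https://en.wikipedia.org/w/api.php"
TITLE = "Wikipedia:Database reports/Example report"   # placeholder title

data = requests.get(API, params={
    "action": "query",
    "prop": "revisions",
    "titles": TITLE,
    "rvlimit": 2,                      # latest revision plus the previous one
    "rvprop": "content",
    "rvslots": "main",
    "format": "json",
    "formatversion": 2,
}).json()

revs = data["query"]["pages"][0]["revisions"]
latest = revs[0]["slots"]["main"]["content"].splitlines()
previous = set(revs[1]["slots"]["main"]["content"].splitlines())

# Lines present now but not in the previous snapshot are the new entries.
for line in latest:
    if line not in previous:
        print(line)
```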

Production has nothing to do with readers vs. editors. It's a set of services inside a tightly closed network. Basically, any service that ends with wikimedia.org, wikipedia.org, etc. is production, and anything ending with wmcloud.org, toolforge.org, or wmflabs.org is the cloud (formerly known as labs). We have many tools just for editors in production. For example, citoid.wikimedia.org is what powers reference generation in visual editor, and it's in production. Content translation is production, ORES is production, etc.

I do understand this could be frustrating (I have tools that rely on the wikireplicas myself: wdvd.toolforge.org) and I will make sure to announce such work beforehand next time (also best effort), but again:

  • These events are rare. The only tables that take this amount of time to alter are the revision tables of enwiki and wikidata, and by a massive margin: the next largest alter we did recently took 14 hours, while this one is taking 25 hours (you need to multiply by roughly three to get how much lag will happen because of it; see the worked arithmetic after this list).
  • If something is heavily depended on by the community, it should be migrated to production. Doing so requires some effort, but it is not impossible. For example, Magnus's tool for querying Wikidata eventually became query.wikidata.org.
  • One big reason we have these problems at the moment is that our database schemas didn't get much attention in the past twenty years, and we are addressing their issues right now. For example, this alter was done to make the schema of the revision table consistent with the rest of the infrastructure, i.e. the revision table of English Wikipedia had a different schema than the revision table of Wikidata. Such inconsistencies in production can lead to nasty bugs, and we need to fix these drifts to avoid them.
    • If we had attended to our DB schemas' tech debt sooner, they wouldn't have grown to such a monstrous size in the first place, but right now we need to make changes to make sure they stabilize and don't grow into an even larger problem, and that can lead to cases like this.
  • Note that these changes are done mostly automatically these days. On paper, we could tell the script to run the alter one by one in the cloud as well, but that would break replication to the cloud, because it uses row-based replication (unlike production, which uses statement-based replication; see the sketch right after this list). And why is it RBR? Because that lets us filter out private data. There is a reason we have to do it this way.
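
To make the RBR point in the last bullet a bit more concrete, this is roughly how one could confirm which binlog format a MariaDB host is using (connection details are placeholders; the actual filtering of private data involves much more than this one setting):

```
# Illustrative only: check whether a host logs row-based or statement-based
# changes for its replicas. Host and credentials are placeholders, not the
# real production setup.
import pymysql

conn = pymysql.connect(host="clouddb.example.org", read_default_file="~/.my.cnf")
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL VARIABLES LIKE 'binlog_format'")
    print(cur.fetchone())   # e.g. ('binlog_format', 'ROW')
conn.close()
```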

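The "multiply by three" rule of thumb from the first bullet lines up with the replag figure reported below; as a back-of-the-envelope check:

```
# Back-of-the-envelope check of the "multiply by three" rule of thumb above:
# a ~25 hour ALTER leaves the user-facing replicas roughly 75 hours behind
# before they start catching up.
alter_hours = 25
expected_lag_hours = 3 * alter_hours
print(expected_lag_hours)   # 75
```
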
Looking now at the replag page, the lag is going down, but it has to catch up 75 hours (as of right now).

So the tools and reports should start to get newer data.

clouddb1017 is now unblocked and recovering; the other one (clouddb1013) will join shortly.

Well, I noticed that some bots, like AnomieBOT III, are back to issuing reports, while others, like SDZeroBot, are still affected. I guess it takes a while for everything to return to normal.