"Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis
Closed, ResolvedPublic

Description

With a large number of watched articles (more than 700), the watchlist is loaded for a very long time. If articles are more than 1000, it does not load.

"Database error: An error has occurred in the database, this may indicate a software error. An unrecoverable exception of the type" Wikimedia\Rdbms\DBQueryError "

Related Objects

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 382990 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Add config var for disabling inject RC records

https://gerrit.wikimedia.org/r/382990

Reedy added a comment.Oct 7 2017, 6:11 PM

I've done this as a revert in https://gerrit.wikimedia.org/r/#/c/382989/ as we still use the config in wmf-config for the repo wikis

Change 382990 abandoned by Ladsgroup:
Add config var for disabling inject RC records

https://gerrit.wikimedia.org/r/382990

Let's merge and deploy yours in Monday

@jcrespo I'm sorry. I wasn't aware how bad things are. This shouldn't have happened.
Amir and Marius are on it now from our side. The next step is T177707.

Masti added a subscriber: Masti.Oct 8 2017, 7:19 PM
saper added a subscriber: saper.Oct 8 2017, 7:55 PM
Finavon added a subscriber: Finavon.Oct 9 2017, 6:30 AM

BTW, the decision was already mentioned as the 4th option of @Catrope suggestions "Disable Wikidata RC on large wikis until we have a more scalable implementation of the feature". I think nobody predicted how bad things were at the time.

jcrespo closed this task as Resolved.Oct 9 2017, 8:24 AM
jcrespo added a project: User-notice.

Notification for users: We are going to disable wikidata recentchanges (meaning, changes on pages on other wikis coming from changes done on wikidata; the recentchanges at wikidata is not a problem) due to performance concerns. Once those have been fixed, the functionality could be enabled again.

jcrespo reopened this task as Open.Oct 9 2017, 8:24 AM
jcrespo moved this task from Archive to To Triage on the User-notice board.

Change 383093 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikidata@wmf/1.31.0-wmf.2] Re-instate $wgWBClientSettings['injectRecentChanges']

https://gerrit.wikimedia.org/r/383093

Change 383093 merged by jenkins-bot:
[mediawiki/extensions/Wikidata@wmf/1.31.0-wmf.2] Re-instate $wgWBClientSettings['injectRecentChanges']

https://gerrit.wikimedia.org/r/383093

Mentioned in SAL (#wikimedia-operations) [2017-10-09T10:38:33Z] <ladsgroup@tin> Synchronized php-1.31.0-wmf.2/extensions/Wikidata/extensions/Wikibase/client: Re-instate $wgWBClientSettings[injectRecentChanges] (T171027) (duration: 00m 56s)

Change 383100 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Disable injecing RC records for commonswiki and ruwiki

https://gerrit.wikimedia.org/r/383100

Change 383100 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable injecing RC records for commonswiki and ruwiki

https://gerrit.wikimedia.org/r/383100

Mentioned in SAL (#wikimedia-operations) [2017-10-09T10:50:52Z] <ladsgroup@tin> Synchronized wmf-config/Wikibase-production.php: Disable injecing RC records for commonswiki and ruwiki (T171027) (duration: 00m 47s)

Change 383107 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/mediawiki-config@master] Disable inject recent changes on all client wikis

https://gerrit.wikimedia.org/r/383107

Thanks @jcrespo , @Reedy and @Ladsgroup for taking action.

Notification for users: We are going to disable wikidata recentchanges (meaning, changes on pages on other wikis coming from changes done on wikidata; the recentchanges at wikidata is not a problem) due to performance concerns. Once those have been fixed, the functionality could be enabled again.

To clarify: it looks like WD RC is only disabled on commonswiki and ruwiki right now, and while @Reedy has submitted a patch to disable it everywhere else, that hasn't been deployed yet. Is that planned for the future?

Risker added a subscriber: Risker.Oct 10 2017, 2:35 AM

@Catrope- discussion ongoing, feel free to weight in. As a DBA I required these 2 now because very large tables and high error rate.

I forgot to say we suspect the same thing happens on other s3 hosts, but these 2 previous wikis create so many errors that it is difficult to say until we fix these 2.

Elitre added a subscriber: Elitre.Oct 10 2017, 8:21 AM
Stryn added a subscriber: Stryn.Oct 10 2017, 12:48 PM
Reedy added a comment.Oct 10 2017, 1:17 PM

To clarify: it looks like WD RC is only disabled on commonswiki and ruwiki right now, and while @Reedy has submitted a patch to disable it everywhere else, that hasn't been deployed yet. Is that planned for the future?

I definitely think that until the feature is fixed properly (T177707 at least; there's probably more that can/should be done), we should be disabling it everywhere based on what has been suggested in T177702

On those grounds alone, that is good enough. Anyone that "relies" on the feature; well, unfortunately, it's a case of tough. Security and performance are generally accepted reasons to break features like this if they're causing (or have the potential to cause) bigger problems.

The original patch (after being fixed) was enough to stem the immediate problem, letting @jcrespo confirm that the fix has the intended consequences (stopping all the rows entering the recentchanges table; as done in batches, it's not immediately possible to tell)

I think we should go ahead and deploy this, and then when an appropriate fix is in place, slowly roll the feature back out to various wikis, to test the impact

Framawiki renamed this task from "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis to "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis.Oct 10 2017, 5:10 PM

This is a hugely political issue. Let's please not do this unless necessary. Daniel has provided a patch for T177707 and Marius has been working on various other improvements.

Nirmos added a subscriber: Nirmos.Oct 10 2017, 7:52 PM

This is a hugely political issue. Let's please not do this unless necessary. Daniel has provided a patch for T177707 and Marius has been working on various other improvements.

I'm not sure how political it is at this point; it was an issue, certainly, when use of Wikidata was first being considered on some projects, and a regression to remove it from recent changes and/or watchlists may have negative impacts on a few projects. On the other hand, removing it entirely, with noise coming from communities about it, will in fact focus the minds to address this issue - not only for Wikidata, but also for ORES and for Commons changes as well. Two are in place now and are having an impact, and the third has been a much-desired feature for as long as I can remember. Getting this right is important and worthy of the investment of time of many knowledgeable developers, not just one or two people trying to do a quick "repair" without digging into the underlying issues. Including all of these things in some sort of easily accessible recent changes/watchlist profile is very important, particularly for projects with a larger community; however, the way we are doing things now (particularly on projects where the recent changes origins are heavily weighted toward Wikidata) is creating major bottlenecks. The methodology for including this data deserves some rethinking and testing of other options. We can't just keep tacking more tables onto watchlists/recent changes, if the impact is rendering RC/watchlists slow to the point of unusable or flooding the servers.

It strikes me that this is an unforeseen effect of the remarkable success of Wikidata. If not for its extensive usage on certain projects, the root causes of this bug would not have been identified so quickly. There is every reason to believe that if other projects continue to expand their use of Wikidata, they too will encounter the same issues; the same is true if a similar methodology is used to include Commons changes. This problem isn't really about Wikidata; it's about the effects of the ever-expanding use of recent changes. It's just particularly obvious with Wikidata because it has been so successful.

jmatazzoni added a comment.EditedOct 11 2017, 3:29 PM

In T171027#3673060, @Lydia_Pintscher wrote:

This is a hugely political issue. Let's please not do this unless necessary.

Right now, ORES is unusable on Watchlist for English, Russian and probably some other big wikis. It causes delays of 30 seconds to a minute when users have more than 3 days set. I'm told Wikidata is a likely culprit. If that's correct (I can't say it is or isn't), then by that standard, some kind of change appears to be "necessary," and, as Risker says, some kind of escalation appears both justified and required.

In T171027#3673060, @Lydia_Pintscher wrote:

This is a hugely political issue. Let's please not do this unless necessary.

Right now, ORES is unusable on Watchlist for English, Russian and probably some other big wikis. It causes delays of 30 seconds to a minute when users have more than 3 days set. I'm told Wikidata is a likely culprit. If that's correct (I can't say it is or isn't), then by that standard, some kind of change appears to be "necessary," and, as Risker says, some kind of escalation appears both justified and required.

Well wikidata is almost certainly a contributing factor, (particularly for russian) I would hesitate to blame it solely for slow ores on watchlist speeds, without more evidence. Especially on english wikipedia with its large recentchanges table and relatively low usage of wikidata.

Bawolff added a comment.EditedOct 11 2017, 7:42 PM

Right now, ORES is unusable on Watchlist for English, Russian and probably some other big wikis. It causes delays of 30 seconds to a minute when users have more than 3 days set. I'm told Wikidata is a likely culprit. If that's correct (I can't say it is or isn't), then by that standard, some kind of change appears to be "necessary," and, as Risker says, some kind of escalation appears both justified and required.

Well wikidata is almost certainly a contributing factor, (particularly for russian) I would hesitate to blame it solely for slow ores on watchlist speeds, without more evidence. Especially on english wikipedia with its large recentchanges table and relatively low usage of wikidata.

So addendum to this. For the enwiki case, and talking about watchlist (Not Special:RecentChanges) all the timeouts I could find - https://logstash.wikimedia.org/goto/3183d6a077082f6b4469255e51e60105 (This are just the severe read timeouts. I had trouble making a logstash query for the query longer than 5 seconds warning) had STRAIGHT_JOIN specified, including people with small watchlists. This probably makes sense for Special:Recentchanges case, but for the watchlist case, when the user has a small watchlist, this will cause the query to be much slower then needed. I'd reccomend removing the STRAIGHT_JOIN in the watchlist case for ORES

Catrope added a subscriber: awight.Oct 11 2017, 8:28 PM

Thanks, I'll experiment with removing the STRAIGHT_JOIN, as well as with @awight's suggestion at T164796#3560530 of breaking out the change tags rollup.

Alsee added a subscriber: Alsee.Oct 11 2017, 10:29 PM

This is a hugely political issue.

It doesn't matter whether Political Issues come from the community, a project manager, or President of the Planet. They wait in line when a database administrator declares an emergency:
we are close of running out of space on several main database servers, breaking all of Wikipedia.

Thanks, I'll experiment with removing the STRAIGHT_JOIN, as well as with @awight's suggestion at T164796#3560530 of breaking out the change tags rollup.

@Catrope, Just a heads up that I was unable to duplicate the same drop in query time when leaving out the change tags clause in T176456, probably for the reason mentioned, that parts of the query had been cached or paged-in. I probably made that same mistake in T164796 as well, so maybe check that before breaking out the subquery.

jcrespo added a comment.EditedOct 12 2017, 11:14 AM

This is the result on query error rate after ruwiki has been purged:

Please note that these are preliminary results, and that tables have yet to be optimized/analyzed, but I think this confirms we are in the right track.

Commons has both more rows to purge and more load, so it will take more time to finish the purging (still ongoing).

In fact, none of the small number of error appear to be related to watchlist or recentchanges, unlike the old ones.

Thanks, I'll experiment with removing the STRAIGHT_JOIN, as well as with @awight's suggestion at T164796#3560530 of breaking out the change tags rollup.

This is tracked over at T176456: ORES on Watchlist causes big slowdown—especially with 'Last revision' filter turned on, but just to close the loop here: removing the STRAIGHT_JOIN (on watchlist but not RC) appears to work. I wrote a patch for that, it was merged, and I'm SWATting it to wmf.3 in the upcoming SWAT window in 15 minutes.

Johan added a comment.Oct 12 2017, 6:03 PM

@jcrespo Is this an accurate description right now?

There is a problem with watchlists that show a large number of pages. Until it has been fixed Wikidata edits will not be shown in recent changes on Commons or Russian Wikipedia. If necessary Wikidata edits will be removed from watchlists on more wikis.

Hi, @Johan. Recent changes or watchlists?

Hi, @Johan. Recent changes or watchlists?

Both. But the statement should really make that clearer. It uses RC and WL as if they're interchangeable, and in the code they kind of are because they use the same underlying infrastructure, but most non-technical people won't know that.

The initial scope- query issues on ruwiki and commonswiki, I think has been successfully patched. These are the statistics I have now:

I would ask users affected to confirm the issues are gone- probably not 100% of them will disappear forever, but I am confident most will- to levels of 3 months ago (not better, not worse). On separate tickets we can handle the followups of how to reenable the hosts, checking other potential projects affected and reclaiming disk space.

jcrespo removed jcrespo as the assignee of this task.Oct 13 2017, 3:49 PM

For a separate ticket, other potential projects to keep an eye on:

  • ca.wikipedia.org
  • fr.wikipedia.org
  • ro.wikipedia.org
jcrespo lowered the priority of this task from Unbreak Now! to Normal.Oct 13 2017, 3:53 PM

Lowering priority unless we get more reports from other ruwiki or commons users.

jcrespo rescinded a token.Oct 13 2017, 3:53 PM
jcrespo awarded a token.
Jwh added a comment.Oct 13 2017, 7:10 PM

Works fine now.
Thank you so much to have solved this problem!

jcrespo closed this task as Resolved.Oct 16 2017, 1:00 PM
jcrespo assigned this task to Bawolff.

I am going to declare the initial issue as resolves, as the only feedback I got was positive. There is still a lot of followup: check the issue on other wikis and have a more long-term solution for Wikidata T178063, both will be handled separately.

Thanks for help! It works now!

Change 383107 abandoned by Reedy:
Disable inject recent changes on all client wikis

https://gerrit.wikimedia.org/r/383107