Linter multiple database issues
Closed, Resolved · Public

Description

It seems the Linter extension's database workflow has many issues, prompting user complaints. Based on the Wikimedia logs, I see two kinds of issues:

  • Count queries that take over 60 seconds to run and are therefore killed by the query killer. Those should either be cached and run as maintenance jobs, or every query should take less than 1 second to execute, no matter the size of the table.
  • Database contention due to multiple rows being written at the same time, producing "Lock wait timeout" errors.

Those seem to be caused by multiple executions of job runners, which are probably retried, making the problem worse. A way of coordinating multiple executions over the same data should be considered. For actual examples of both issues, look at:

https://logstash.wikimedia.org/goto/a5d9a903670cebc33251939f42e8c292 The errors seem to be increasing. The good news is that they do not seem to be affecting other functionalities, but it could end up affecting the reliability of the host if multiple connections are stuck.

At least one wiki user complained about delays of the jobs, probably related to these issues.

jcrespo created this task. Jan 5 2018, 2:55 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jan 5 2018, 2:55 PM

Thanks @jcrespo. @Legoktm ... can we touch base Monday about this?

Looks like this is primarily caused by commonswiki, with over 70M linter entries (a large portion of which come from the lower-priority categories). Perhaps we need special handling for Commons altogether, including suppressing display/update of those lower-priority categories as a temporary measure if we cannot find a good solution promptly enough.

To understand the impact for triage reasons: linter-related errors are now the top SQL-related error (including all servers), with a ratio of 54 linter-related to 134 non-linter-related errors. https://logstash.wikimedia.org/goto/9c2b588325062b7454a7676ff9e6cbb9

ssastry triaged this task as High priority. Jan 9 2018, 3:37 PM

As a stopgap measure, we can rate-limit the speed at which the insert jobs run with $wgJobBackoffThrottling. I need to look at the caching for the count() queries.
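$wgJobBackoffThrottling caps how many jobs of a given type each runner will execute per second; a full window yields a back-off delay rather than an immediate retry, which is what keeps retries from amplifying contention. A rough sketch of that idea in Python (the `BackoffThrottle` class and its window-based accounting are illustrative, not MediaWiki's actual implementation):

```python
import time

class BackoffThrottle:
    """Caps executions of a job type to `rate` per second per runner.

    When the cap is exceeded, the caller is told how long to back off,
    mirroring the idea behind MediaWiki's $wgJobBackoffThrottling.
    """

    def __init__(self, rate):
        self.rate = rate          # allowed jobs per second
        self.window_start = 0.0   # start of the current 1-second window
        self.count = 0            # jobs run in the current window

    def acquire(self, now=None):
        """Return 0 if the job may run now, else seconds to back off."""
        now = time.monotonic() if now is None else now
        if now - self.window_start >= 1.0:
            # A new 1-second window has started; reset the counter.
            self.window_start = now
            self.count = 0
        if self.count < self.rate:
            self.count += 1
            return 0
        # Window is full: back off until the next window opens.
        return self.window_start + 1.0 - now
```

In MediaWiki itself this is just a configuration entry mapping a job type to a jobs-per-second cap; the sketch only shows why a throttle with back-off, rather than immediate retries, reduces write contention.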

Looks like things have calmed down over the last few days. I had filed the now-abandoned https://gerrit.wikimedia.org/r/#/c/403581/, but @Legoktm thinks we should be able to suppress these high-volume, low-priority categories on Commons by editing the mediawiki-config repo. So we have that option available.

Another alternative is to increase $wgLinterStatsdSampleFactor beyond 1000 for commonswiki, which should reduce the number of aggregate queries for that wiki. Right now I see it set to 1000, which for Commons (given how common lint errors are there, and the rate of edits) probably works out to one aggregate query every 5-10 minutes. So we could bump that to 5000.
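A sample factor like this means only about 1 in N lint recordings triggers the expensive aggregate query, so raising the factor proportionally cuts query volume. A sketch of the idea (whether the extension samples randomly or deterministically is not shown in this task, so `should_send_stats` is purely illustrative):

```python
import random

def should_send_stats(sample_factor):
    """Return True for roughly 1 in every `sample_factor` calls.

    Raising the factor (e.g. 1000 -> 5000) proportionally reduces how
    often the expensive aggregate stats query runs, at the cost of
    coarser-grained statistics.
    """
    return random.randint(1, sample_factor) == 1
```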

Change 404233 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[operations/mediawiki-config@master] Update linter stats for commonswiki less frequently

https://gerrit.wikimedia.org/r/404233

Change 404233 merged by jenkins-bot:
[operations/mediawiki-config@master] Update linter stats for commonswiki less frequently

https://gerrit.wikimedia.org/r/404233

Mentioned in SAL (#wikimedia-operations) [2018-01-18T19:20:05Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404233|Update linter stats for commonswiki less frequently]] T184280 (duration: 01m 13s)

ssastry lowered the priority of this task from High to Normal. Jan 19 2018, 5:26 PM

That patch seems to have done the trick in terms of reducing the number of errors. However, the core problem still exists. I am going to lower the priority since the immediate issue is ameliorated.

Something seems to have been done about the contention, but not about the long-running queries (first bullet); see T173943.

The queries are cached, but right now they are run online (when a request comes in and the cached info needs a refresh), not as a maintenance job. @jcrespo, do you have any suggestions for how to optimize the count query?

how to optimize the count query

If the query takes more than 60 seconds (actually, more than 1 second) it cannot be run in the foreground (here foreground means synchronous to a client-initiated request). The optimization is to stop doing that query, whatever it takes. That is a pure code issue, not a database optimization problem.

@performance-team @security This is not your responsibility, just a heads up that if the issue continues, I will be forced to disable the extensions for fear of enabling a DOS towards the database, bringing all wikis down.

Addshore removed a subscriber: Addshore.Feb 1 2018, 4:36 PM

Sorry, I didn't explain this fully. The extension caches query results. Most of the time, the results are updated by a background job (RecordLintJob). Occasionally, however, the results are updated in the foreground (when there is an API query, or when someone opens the Special:LintErrors page) if the cached results are stale, based on the cache refresh rate. So synchronous foreground queries should be uncommon, but @Legoktm and I will review the code to make sure there are no holes in that logic.

I fixed the second issue you identified by decreasing the cache refresh rate for commonswiki.

My question to you was: given that we mostly update the results in the background, with only occasional synchronous queries (all other requests get results from the cache), is there any query optimization trick we are missing in our code or in our schema that could also reduce the query time? But you are saying that it is not possible.

Does this clarify the situation?

jcrespo added a comment. Edited · Feb 1 2018, 4:50 PM

Correct, COUNT(*) is not optimizable, only pre-cacheable more aggressively. It can even be cached in a MySQL table, in addition to other places. This is not a "new" problem; it has been "solved" for a long time on things like Special:Statistics for counting articles. We shouldn't invent new methods, as those normally suffer from contention or pileups very easily, as happens here: when the cache expires, all requests start asking for a refresh at the same time, which can lead to a DOS.
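The pileup described here, where every request triggers a refresh the moment the cache expires, is the classic thundering-herd ("dogpile") problem. One standard mitigation is to let a single caller refresh while everyone else keeps serving the stale value. A minimal single-process sketch (the `CountCache` class and its field names are illustrative, not the extension's code):

```python
import time

class CountCache:
    """Serve cached counts; let only one caller refresh an expired entry.

    Losers of the refresh race keep serving the stale value instead of
    piling onto the database — expiry triggers one recomputation, not
    one per request. Illustrative sketch only.
    """

    def __init__(self, ttl, compute):
        self.ttl = ttl
        self.compute = compute      # the expensive COUNT query
        self.value = None
        self.expires_at = 0.0
        self.refreshing = False     # acts as a non-blocking lock

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        if self.value is not None and now < self.expires_at:
            return self.value       # fresh hit: no DB work at all
        if self.refreshing and self.value is not None:
            return self.value       # stale, but no extra DB load
        self.refreshing = True
        try:
            self.value = self.compute()
            self.expires_at = now + self.ttl
        finally:
            self.refreshing = False
        return self.value
```

In a real multi-server deployment the `refreshing` flag would have to be a shared lock (for example an atomic add in memcached), but the shape of the logic is the same.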

Are you still concerned about bullet item #1 after my clarification above?

Never mind. I didn't notice you had updated your comment.

But just one minor clarification: this is not a commonly accessed page, so pileups should be rare. Your point is taken, though; we'll discuss what follow-up is needed here.

I do not think this is urgent, but I worry because of the DOS possibility. Moving things to a maintenance job does not normally require a lot of code changes, just moving it to a cron [sic, for lack of a better mechanism]. I can help with the Puppet side, no problem.

I am also worried because users complained to me on #wm-operations about getting 5XX errors when loading this page; I only filed this on their behalf. You should be happy you have worried users who already depend on this working every time :-D.

Imarlier moved this task from Inbox to Radar on the Performance-Team board. Feb 1 2018, 5:38 PM
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.

Change 407511 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/extensions/Linter@master] Use estimateRowCount() instead of actually counting everything

https://gerrit.wikimedia.org/r/407511

Change 407511 merged by jenkins-bot:
[mediawiki/extensions/Linter@master] Use estimateRowCount() instead of actually counting everything

https://gerrit.wikimedia.org/r/407511

Once the above patch is deployed this week, the very slow COUNT queries should disappear.
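estimateRowCount() reads the row estimate from MySQL's EXPLAIN output, which comes from index statistics in O(1) instead of scanning the table. A sketch of the resulting hybrid strategy, consistent with the later observation that small categories still show precise counts (the threshold and the fall-back-to-exact-count behavior are assumptions for illustration, not necessarily what the extension does):

```python
def get_category_total(estimate_rows, count_rows, threshold=1000):
    """Return a total for a lint category.

    estimate_rows: cheap callable, e.g. the `rows` column of
                   EXPLAIN SELECT * FROM linter WHERE linter_cat = ...
    count_rows:    exact but potentially very slow COUNT(*) callable.

    Large categories get the O(1) estimate (possibly inflated); small
    ones are cheap to count exactly, so the estimate is only used to
    decide whether the exact count is affordable.
    """
    estimate = estimate_rows()
    if estimate < threshold:
        return count_rows()   # few rows: exact scan is cheap
    return estimate           # big table: accept the approximation
```

This explains why counts on large wikis can appear inflated: index-statistics estimates are often off by a sizeable margin, while small categories still go through the exact path.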

@Johan @Elitre FYI. See https://www.mediawiki.org/wiki/Topic:U7bdlw783vtf38fg

While we figure out how to tackle this, note that wikis are going to see their linter counts artificially inflated since we are no longer reporting exact counts. Could this be reported in tech news so that wikis are aware and don't freak out?

Elitre added a comment. Edited · Feb 11 2018, 6:13 PM

@Johan @Elitre FYI. See https://www.mediawiki.org/wiki/Topic:U7bdlw783vtf38fg

While we figure out how to tackle this, note that wikis are going to see their linter counts artificially inflated since we are no longer reporting exact counts. Could this be reported in tech news so that wikis are aware and don't freak out?

Yes, but I believe it's too late for the one that goes out on Monday. As a temporary measure, a note could be dropped to wikitech + ambassadors as an FYI, if you think it's important. (On-wiki example: https://www.mediawiki.org/wiki/Topic:U7fnmknredgujpo4)

From what I have seen, this only seems to affect wikis that have linter categories with large error counts. In other cases, the estimates seem to use precise counts. So, no rush at this point. Those who care seem to be finding us at mw:Help_talk:Extension:Linter already.

Make sure this is reflected in your suggested phrasing for Tech News, so that Johan can add it to next week's issue. You could already do it as well, BTW :)

Is this issue related? Both of these counts are equal on lvwiki:

Missing end tag (37 064 errors)
Obsolete HTML tags (37 064 errors)

https://lv.wikipedia.org/wiki/Special:LintErrors

Now we have multiple warnings:

Warning: Unable to record MySQL stats with: EXPLAIN /* MediaWiki\Linter\Database::getTotalsEstimate */ SELECT * FROM `linter` WHERE linter_cat = '12' in /srv/mediawiki/php-1.31.0-wmf.22/includes/libs/rdbms/database/DatabaseMysqli.php on line 47

That's an HHVM bug :( https://github.com/facebook/hhvm/pull/8139

Can I close this now? Anything left here?

I am not seeing linter issues lately; I will open a new task if they come back.

Krinkle closed this task as Resolved. Jul 11 2018, 2:04 AM
Krinkle claimed this task.
Krinkle added a subscriber: Krinkle.

No longer appears in recent logs. Seems fixed. Please re-open if I misunderstood.