Page MenuHomePhabricator

Increase in DBPerformance warnings since 1.36.0-wmf.13
Closed, ResolvedPublic

Description

There has been a large increase in warnings with normalized message: Expectation ({measure} <= {max}) by {by} not met (actual: {actual}): {query} since wmf.13 (deployed starting 13 October 2020):

image.png (238×1 px, 31 KB)

There are lots of variants, e.g.:

  • Expectation (masterConns <= 0) by MediaWiki::main not met (actual: 1): [...]
  • Expectation (writes <= 0) by ApiMain::setRequestExpectations not met (actual: 1): query-m: INSERT INTO cu_changes [...]
  • Expectation (writes <= 0) by MediaWiki::restInPeace not met (actual: 5): query-m: INSERT IGNORE INTO watchlist[...]
  • Expectation (writes <= 0) by MediaWiki::restInPeace not met (actual: 3): query-m: UPDATE site_stats [...]

More examples at: https://logstash.wikimedia.org/goto/e39f841693dfdd3023ac54be22636b89

Could something have gone wrong somewhere with setting the expectations?

Event Timeline

@Krinkle Is there anyone in particular who should be tagged on this?

In the DBPerformance channel in the mediawiki Logstash dashboard, from Oct 9 to Oct 19, I don't see a noticable increase or other change in volume:

https://logstash.wikimedia.org/goto/f80e112edf4485e6d4835d7d6420997f

a.png (870×1 px, 68 KB)

Looking at some of the individual violations, such as the unexpected cu_changes INSERT query during ApiClientLogin from api_appserver hosts, I see those both before and after Oct 13.

What did change is the normalized_message template.
On 1.36.0-wmf.20, this is: Expectation ({measure} <= {max}) by {by} not met (actual: {actual}): {query}.
On 1.36.0-wmf.10 and earlier, it was: Expectation ({measure} <= {max}) by {by} not met (actual: {actual}): {query} #0 /srv/mediawiki/php-1.36.0-wmf.10/includes/libs/rdbms/TransactionProfiler.php(248) Object(Wikimedia\Rd

The last part was a cut off portion of a stack trace, which was causing the messages to be fragmented all over and thus hard to filter for. It also meant that searches for exception.trace did not find these, as they were stored within the message and (cut-off) within the normalised_message fields instead.

This was fixed in https://gerrit.wikimedia.org/r/628190 (3ac199cabd3), and is most likely why they have started to appear on your team's Logstash dasboard.

If you want to individually triage/filter these, a filter on part of the trace should work. E.g. exception.trace NOT "CheckUserHooks::onAuthManagerLoginAuthenticateAudit". Or, to restore the previous status quo and not monitor/triage these on that particular Logstash dasboard, exclude the normalized_message Expectation ({measure} <= {max}) by {by} not met (actual: {actual}): {query}.

Krinkle reassigned this task from Krinkle to Tchanders.
Krinkle triaged this task as Medium priority.
Krinkle added a project: Anti-Harassment.
Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.

Expectation (writes <= 0) by ApiMain::setRequestExpectations not met (actual: 1): query-m: INSERT INTO cu_changes [...]

I was looking again at these ones today on your team's Logstash dashboard. What I find interesting about these is that they come from a POST request to the API for logging in (ApiClientLogin). As such, one would think that master connections and writes are allowed for that reason alone. While login might not be as obvious as an "Edit" that it does db wwrites, it's not that strange either. Afaik we sometimes update the user object (user_touched) during login, and indeed CheckUser has a use case for temporarily storing client IP details in the database.

This should be fine since it's a POST request, and would indeed be routed to the primary. I think the issue might be that ApiClientLogin.php doesn't set isWriteMode() to true, and so it actually ends up registered as a query module that only supports POST for the purposes of authenticating the its query string via POST payload (e.g. if it doesn't fit in the query string, or to hide it from URL), but is otherwise treated the same as a GET request internally and so triggers these performance warnings.

The question then is, did this recently change? Well, ApiClientLogin.php has not substantially changed in 4 years. This didn't regress there. But, maybe somewhere else the lack of isWriteMode() was previously ignored and now taken into account? More to come...

The question then is, did this recently change? Well, ApiClientLogin.php has not substantially changed in over 4 years. This didn't regress there. But maybe somewhere else the lack of isWriteMode() was previously ignored and now taken into account? [...]

@Krinkle Is there evidence that these warnings are the result of a recent regression? Some of the other root causes of similar warnings look as though they've been there for many years (see subtasks)...

I've summed up T267945#6875458 in a new subtask - feel free to modify: T283394

Closing in favour of sub tasks. I think the higher-level conclusion is that a bunch of stuff slipped through and this captures a fairly arbitrary subset of things. Good to have separately, but afaik nothing to track here specifically.