
Provide train-blocking alerts for cross-DC mysql traffic spikes
Closed, Resolved (Public)

Description

Treat 3X increases in non-exempt GET reqs/sec that do master updates as a deploy blocker

Aside from ?action=rollback, Special:CentralAuth, and Special:CentralAutoLogin (which will have special multi-DC CDN routing logic to treat GET like POST), we should not be seeing increases in these write patterns. An appropriate Logstash query should be constructed against the DBPerformance channel. It should look for:

+channel:DBPerformance +http_method:GET +"writes <= 0" +"(actual: 1)"

... among other things.
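
As a rough illustration of what the query above selects, the same four required terms can be applied to already-parsed log records. This is only a sketch: the `message` field name and the sample message text are assumptions made for the example, not confirmed Logstash schema, and the real query may need further terms.

```python
def matches_query(event: dict) -> bool:
    """Approximate +channel:DBPerformance +http_method:GET
    +"writes <= 0" +"(actual: 1)" on a parsed log record."""
    message = event.get("message", "")  # field name is an assumption
    return (
        event.get("channel") == "DBPerformance"
        and event.get("http_method") == "GET"
        and "writes <= 0" in message
        and "(actual: 1)" in message
    )


# Illustrative records only; the message wording is hypothetical.
sample_events = [
    {"channel": "DBPerformance", "http_method": "GET",
     "message": "Expectation (writes <= 0) not met (actual: 1)"},
    {"channel": "DBPerformance", "http_method": "POST",
     "message": "Expectation (writes <= 0) not met (actual: 1)"},
    {"channel": "exception", "http_method": "GET",
     "message": "unrelated error"},
]

matching = [e for e in sample_events if matches_query(e)]
print(len(matching))
```

Only the first record passes, since the second is a POST and the third is on a different channel.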

  • Make sure known things we want to exempt are in fact exempted in code (that is, they don't issue warnings to Logstash).
  • Use redefineExpectations(…POST) instead of silence*() for post-send deferred updates. This means that such events are still logged but can be filtered out as needed. The rate of such events should be low, but does not have to be as low as the pre-send case.
  • Emit statsd metrics for requests that have unexpected DB_PRIMARY connections/writes
  • Alert when too many such events happen (first violation per request)
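
The "first violation per request" idea in the list above can be sketched as a small de-duplication step before counting. The `request_id` field and the sample `measure` values are hypothetical names chosen for the example; this is not how TransactionProfiler itself is implemented.

```python
from collections import Counter


def first_violations(events):
    """Yield only the first violation seen for each request."""
    seen = set()
    for event in events:
        req = event["request_id"]  # hypothetical field name
        if req not in seen:
            seen.add(req)
            yield event


# Illustrative events: request "a" violates two expectations,
# request "b" violates one.
violations = [
    {"request_id": "a", "measure": "writes"},
    {"request_id": "a", "measure": "masterConns"},
    {"request_id": "b", "measure": "writes"},
]

# One count per request, keyed by the measure that tripped first.
counts = Counter(e["measure"] for e in first_violations(violations))
print(dict(counts))
```

Counting one event per request keeps a single noisy request from inflating the alerting rate.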

Maybe in the future:

  • Include DBPerformance pattern in the Scap monitoring query
  • Include DBPerformance in the mediawiki-errors and mediawiki-new-errors dashboards.

Event Timeline

aaron renamed this task from Treat 3X increases in non-exempt GET reqs/sec that do master updates as a deploy blocker to Setup train-blocking alerts for cross-DC mediawiki/mysql traffic spikes. Oct 5 2022, 6:58 PM
aaron renamed this task from Setup train-blocking alerts for cross-DC mediawiki/mysql traffic spikes to Train-blocking alerts for cross-DC mysql traffic spikes.
aaron updated the task description.

Change 838887 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] objectcache: suppress TransactionProfiler in occasionallyGarbageCollect()

https://gerrit.wikimedia.org/r/838887

According to:

+channel:DBPerformance +http_method:GET +measure:writes +by:"MediaWiki::main"

There are 2,773 events in 24 hours, or about 0.03/sec.
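
For reference, the per-second figure follows directly from dividing the 24-hour event count by the number of seconds in a day:

```python
events_per_day = 2773
seconds_per_day = 24 * 60 * 60  # 86,400

rate_per_sec = events_per_day / seconds_per_day
print(f"{rate_per_sec:.3f} events/sec")
```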

Change 839553 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: limit the effects of TransactionProfiler::silenceForScope()

https://gerrit.wikimedia.org/r/839553

Change 839553 merged by jenkins-bot:

[mediawiki/core@master] rdbms: improve TransactionProfiler::silenceForScope()

https://gerrit.wikimedia.org/r/839553

Jdforrester-WMF renamed this task from Train-blocking alerts for cross-DC mysql traffic spikes to Provide train-blocking alerts for cross-DC mysql traffic spikes. Nov 17 2022, 2:20 PM

Change 838887 merged by jenkins-bot:

[mediawiki/core@master] objectcache: suppress TransactionProfiler in occasionallyGarbageCollect()

https://gerrit.wikimedia.org/r/838887

I'm not seeing the rate exceed 1/sec in the last 7 days. Maybe the threshold could be 3/sec for >1hr.
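
A minimal sketch of that kind of rule, assuming rate samples arrive as time-ordered (timestamp in seconds, events/sec) pairs. The function name and sample layout are made up for illustration; a real alert would more likely be expressed in the monitoring system's own rule language.

```python
def sustained_breach(samples, threshold=3.0, min_duration_s=3600):
    """Return True if the rate stays above `threshold` for longer than
    `min_duration_s`. `samples` is a time-ordered list of
    (timestamp_seconds, rate_per_sec) pairs."""
    breach_start = None
    for ts, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current breach window
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # any dip below threshold resets the window
    return False


# One sample per minute.
long_spike = [(i * 60, 4.0) for i in range(62)]              # above 3/sec for 61 min
short_spike = [(i * 60, 4.0) for i in range(30)] + [(1800, 0.5)]
quiet_week = [(i * 60, 0.9) for i in range(120)]             # never above 1/sec

print(sustained_breach(long_spike), sustained_breach(short_spike),
      sustained_breach(quiet_week))
```

Requiring the breach to persist avoids paging on short bursts while still catching a regression that a deploy introduces.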

Change 859641 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: add statsd metrics to TransactionProfiler

https://gerrit.wikimedia.org/r/859641

Change 859641 merged by jenkins-bot:

[mediawiki/core@master] rdbms: add statsd metrics to TransactionProfiler

https://gerrit.wikimedia.org/r/859641

aaron updated the task description.