Page MenuHomePhabricator

Investigate burst of DBReadOnlyError during switchover test
Closed, ResolvedPublic

Description

During the live test, there was a burst of DBReadOnlyError reported in Logstash. At the same time, alerts fired for session_loss and also a small spike in (presumed user-facing) errors for eqiad DBs being read-only.

In theory, putting the secondary datacentre in read-only should be a safe thing to do - this is a core assumption of the live-test behaviour, but these errors run counter to that. We need to determine whether there are some behaviours that are attempting to unexpectedly write, whether these are new behaviours and how to make them safe. It seems that these are at least a subset of requests, as broad write failure on a read only database would cause a much higher number of errors.

Additionally, we should determine whether these session_loss errors are related, or if they are a side-effect.

Event Timeline

Follow-up, the session_loss errors are more or less completely disjoint from the RO errors we're seeing in logstash and can be treated as a completely separate problem (or a non-problem, still discerning the source of those)

Of 203 DBReadOnlyError exceptions in total, 195 contain stack traces traversing SqlBagOStuff, all of which are deferred updates calling deleteServerObjectsExpiringBefore. The SqlBagOStuff suggests either mainstash or parsercache?

Edit: On second thought, given that discovering DC read-only state in Wikimedia\Rdbms\Database in this way would need to happen via the LB_READ_ONLY_REASON in the lbInfo array, but parsercache doesn't use a loadbalancer-backed SqlBagOStuff, this would presumably be mainstash? Ah, and ot's only mainstash where we set a non-zero purgePeriod, which would explain how we get in the (deferred) closure that calls deleteServerObjectsExpiringBefore.

Of the 8 remaining requests, the majority (6) implicate writes by the ContentTranslation extension.

There is one other interesting outlier: nearly all exceptions are GET requests, except for one: this PUT to the REST API for a handler associated with the CampaignEvents extension.

That latter one would suggest that we need to update multi-dc.lua to also shunt PUT to the primary, similar to our treatment for POST.

On the session loss errors - I believe these are actually other operations that resulted in errors like Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::commit critical section while session state is out of sync. The numbers don't quite match up but I think these are the same errors.

Change #1123625 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: send PUTs to the write datacentre

https://gerrit.wikimedia.org/r/1123625

In theory, putting the read only datacentre in read only should be a fine thing to do

I don't think this was ever even remotely true - there are plenty of edge cases where a write happens on GET and there is no clear pattern to the request that could be detected by the edge cache. User autocreation, SqlBagOStuff cache misses, CheckUser logs, I'm sure there's more.

Specifically for deleteServerObjectsExpiringBefore, it should be straightforward to add a check of whether the DB is readonly - there is no harm in delaying the purge, it will ignore expired keys anyway. There are other situations though where more important data is written on GET.

Krinkle renamed this task from Investigate burst of read only errors during live test to Investigate burst of Uncaught DBReadOnlyError during switchover test.Mar 2 2025, 1:45 AM
Krinkle renamed this task from Investigate burst of Uncaught DBReadOnlyError during switchover test to Investigate burst of DBReadOnlyError during switchover test.Mar 2 2025, 1:48 AM
Krinkle updated the task description. (Show Details)

I'll to clarify some things that might have generated some confusion. Under normal circumstances for all core DB sections but parsercache and x2, we do keep al DBs in RO mode except the masters in the primary DC. In addition there are Icinga checks to ensure that each DB based on their section and role is in the expected RO/RW mode.
That's why a live test of the datacenter switchover is safe, because from the DB point of view is designed to be a noop. It's setting RO DBs that are already in RO mode.

Just to echo / build on what @Volans is saying for extra clarity: The issue in this task is entirely limited to setting read-only at the application layer (02-set-readonly.py) - i.e., the $DC/ReadOnly keys in mwconfig in etcd - not at the DB layer (03-set-db-readonly.py), which as @Volans points out, is a noop in the context of the live-test, whereas the former is not.

In theory, putting the read only datacentre in read only should be a fine thing to do.
I don't think this was ever even remotely true. there are plenty of edge cases where a write happens on GET […]

You're right from an internal MediaWiki perspective, but this might be confusing from an operational/SRE perspective. The terms "write" and "GET" have a subtly different connotation there.

For SREs:

It is safe to put MediaWiki databases in read-only mode. This is why, for example, MediaWiki regularly and autonomously places itself in read-only mode when it detects high replag (e.g. after high write traffic), under the assumption that doing so does not degrade functionality or display error messages to users that browse the site, nor should it cause HTTP 500 responses on "read" requests to mw-web or mw-api clusters. The twice-yearly switchover excercises confirm this, and catch potential regressions in how MediaWiki features work or in how we configure monitoring alerts (false positives).

When databases in the primary DC are put in read-only mode, user-initiated write actions fail with a descriptive error. This applies to actions that we believe end-users understand as "changing a page" or "changing an account", and HTTP requests that perform such actions are routed to the primary DC. The secondary DC is thus safe to put in read-only mode.

Disambiguation:

Personally, I'd "MediaWiki does not perform writes on GET requests". In this context, "GET" is short-hand referring to "HTTP read requests" that we route to the secondary data center, and "writes" refers to cases where MediaWiki incurs a hard dependency on a write-capable connection to a MediaWiki database in the primary data center.

What's the difference? Well, there are a few cases where a "write" happens on what is technically a "GET" request. They fall in one of these categories:

  1. Non-database writes to local services that are not the "MediaWiki database".
    • This includes writes to php-apcu, Memcached, Cassandra, ParserCache, Kafka/JobQueue, and MainStash. Yes, Cassandra is a kind of database. Yes, ParserCache happens to literally use a MySQL database (via MediaWiki's SqlBagOStuff class). However, these are local services in each data center, their writes don't involve the primary data center, and are thus safe to use in the secondary data center. They exist independent of the MediaWiki database, and are unaffected by read-only mode.
  2. Certain write actions implemented as GET, and routed by our CDN the same as a POST request, to the primary DC.
    • This includes loginwiki redirects, OAuth requests, and rollback requests (T91820). These are GET requests for technical or historical reasons, but recognised as "write" requests and routed as such. Unaffected by read-only mode in the secondary data center.
    • Conversely, large API "read" queries, may transmit their query as a POST body due to parameter-size restrictions. These POST requests are marked with the Promise-Non-Write-API-Action header and routed the same way as GET and other HTTP "read" requests, to the secondary data center.
    • Nonetheless, we often use GET/POST as shorthand for "HTTP read requests routable to secondary DC" and "HTTP write requests routed to the primary DC".
  3. Rare and optional writes that are outside the web response.
    • For example, when a logged-in user reads an article, and that article was previously added to their watchlist, and the page is at a revision they haven't seen before (i.e. it was edited since they last read it), then we used to queue a post-send deferred that tries to update their wachlist marker in the MediaWiki database, to mark this revision as "seen". These are low in frequency, given the logged-in preconditions. They are also optional — silently skipped in read-only mode. And happen post-send, so when they do run and fail, it does not affect the web response. This feature has since been optimized to use the job queue instead.
    • I couldn't find examples of this still in our code today, but there are probably one or two cases like this, where a write is post-send, best-effort, low-frequency, and skipped or ignored when in read-only mode.
  4. Auto-creation.

Auto-creation is a notable exception to the principle that read-only mode isn't noticable when browsing. It is relatively rare, as it applies only to logged-in users, who visit multiple wikis, when they visit a new wiki for the first time. But, it is definitely something we don't present as a "change" or "write" action, and that failure is surprising. The impact, though, is limited. They still get the page they requested, but rendered logged-out instead of being welcomed with a new auto-created and auto-logged account that they didn't ask for. Depending on their intent, that might even be fine. And afaik this happens without displaying error messages or raising HTTP 500. The auto-creation is skipped in readonly mode (which I've confirmed via WikimediaDebug in readonly mode, and using KrinkleSock to visit a new wiki), and when it does try it, any failures are silently caught (Logstash result suggests this works. One hit in 7 days.)

The reason that this is surprising to end-users, is that Wikimedia Global accounts (i.e. SUL/CentralAuth) are designed to feel like a single account. However, it does involve a local stub account, created when you first login. This isn't unlike StackOverflow where you have a central account, with clearly distinct local accounts on each site that connect to the central one, which aren't created until you login. The difference is that they require you to manually decide and choose which communities you join, with the option to e.g. create a separate account instead. It is presented quite differently. We discourage this, and instead automatically and transparently automate this whenever possible.

@hnowlan wrote in the task description:

During the live test, there was a burst of DBReadOnlyError reported in Logstash. […] and also a small spike in (presumed user-facing) errors for eqiad DBs being read-only.

In theory, putting the secondary datacentre in read-only should be a safe thing to do […], but these errors run counter to that.

TLDR: This is all a false-positive for the purposes of live switchover test.

What's going:

  1. This Logstash query does not (only) report HTTP-500-inducing uncaught errors.

The above Logstash query is for type:mediawiki AND channel:exception. This provides MediaWiki developers with information about uncaught errors from their code, as detected by MediaWiki itself. These are usually "fatal errors", i.e. uncaught errors while making the HTTP response, which would cause HTTP 500 instead of what the user asked for. As such, it is a natural place to look, if you know there is a rise in HTTP errors on a specific MW cluster.

If there is no rise in HTTP errors, or if you query Logstash in this way without selecting the affected MW cluster, then this should not be taken as a signal that there are HTTP errors.

This primarily addresses MediaWiki developers, and so reports all uncaught errors from our code, not just errors during a user-facing web response. It includes jobs, CLI scripts, and most importantly for this task: uncaught errors in deferred updates. These are isolated from the web response, and from each other. The above-linked burst is 99% from such "deferreds", which look as follows:

url: /w/load.php?…
message: Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: Datacenter MediaWiki switchover live-test
exception.trace:
from /srv/mediawiki/php-1.44.0-wmf.17/includes/libs/rdbms/database/Database.php(686)
#1 /srv/mediawiki/php-1.44.0-wmf.17/includes/libs/rdbms/database/Database.php(1738): Wikimedia\Rdbms\Database->query()
#2 /srv/mediawiki/php-1.44.0-wmf.17/includes/libs/rdbms/database/DBConnRef.php(127): Wikimedia\Rdbms\Database->delete()
…
#5 /srv/mediawiki/php-1.44.0-wmf.17/includes/objectcache/SqlBagOStuff.php(1524): Wikimedia\Rdbms\DeleteQueryBuilder->execute()
#6 /srv/mediawiki/php-1.44.0-wmf.17/includes/objectcache/SqlBagOStuff.php(1370): SqlBagOStuff->deleteServerObjectsExpiringBefore()
…
#17 /srv/mediawiki/php-1.44.0-wmf.17/includes/MediaWikiEntryPoint.php(674): MediaWiki\Deferred\DeferredUpdates::doUpdates()
#18 /srv/mediawiki/php-1.44.0-wmf.17/includes/MediaWikiEntryPoint.php(496): MediaWiki\MediaWikiEntryPoint->restInPeace()
#19 /srv/mediawiki/php-1.44.0-wmf.17/includes/MediaWikiEntryPoint.php(454): MediaWiki\MediaWikiEntryPoint->doPostOutputShutdown()
…
#22 /srv/mediawiki/w/load.php(3): require(string)

"doPostOutputShutdown" indicates that this is a post-send deferred update.

  1. Does a failed deferred matter?

To end-users, no. It can't cause HTTP 500 (because it is written to happen post-send), which means the code owner decided it doesn't matter to the user.

To us? That depends. Let's see what this deferred is for. When a MediaWiki component writes data to the MainStash, we periodically (1:100 random) queue a call to SqlBagOStuff->deleteServerObjectsExpiringBefore to perform some garbage collection, i.e. remove expired rows. This is non-critical.

  1. Why did it happen?

The MainStash should be disabled when in read-only mode, because it operates independently from the MediaWiki databases. This is not intentional, and a regression since last time. I filed a task to fix this: T387654: Re-evaluate whether WMF's MainStash config should use LBFactory.

This means in the future, these updates will still happen, but they will succeed instead of artifically fail.

@hnowlan wrote in the task description:

During the live test, there was a burst of DBReadOnlyError reported in Logstash. […] and also a small spike in (presumed user-facing) errors for eqiad DBs being read-only.

In theory, putting the secondary datacentre in read-only should be a safe thing to do […], but these errors run counter to that.

TLDR: This is all a false-positive for the purposes of live switchover test.

Thank you for such a comprehensive explanation! I think this provides enough information to close this for now.

Change #1123625 merged by Hnowlan:

[operations/puppet@production] trafficserver: send PUTs to the write datacentre

https://gerrit.wikimedia.org/r/1123625

Change #1155198 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver::multi-dc: route PATCH and DELETE to primary DC

https://gerrit.wikimedia.org/r/1155198

Change #1155198 merged by Hnowlan:

[operations/puppet@production] trafficserver::multi-dc: route PATCH and DELETE to primary DC

https://gerrit.wikimedia.org/r/1155198