
Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: The database is read-only until replication lag decreases.
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: The database is read-only until replication lag decreases.
exception.trace
from /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/Database.php(1055)
#0 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/Database.php(1342): Wikimedia\Rdbms\Database->assertIsWritablePrimary()
#1 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/Database.php(1293): Wikimedia\Rdbms\Database->executeQuery(string, string, integer)
#2 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/Database.php(2636): Wikimedia\Rdbms\Database->query(string, string, integer)
#3 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/DBConnRef.php(69): Wikimedia\Rdbms\Database->update(string, array, string, string)
#4 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/DBConnRef.php(381): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#5 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/LinksUpdate/LinksUpdate.php(600): Wikimedia\Rdbms\DBConnRef->update(string, array, array, string)
#6 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/LinksUpdate/LinksUpdate.php(301): MediaWiki\Deferred\LinksUpdate\LinksUpdate->updateLinksTimestamp()
#7 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/LinksUpdate/LinksUpdate.php(251): MediaWiki\Deferred\LinksUpdate\LinksUpdate->doIncrementalUpdate()
#8 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/DeferredUpdates.php(536): MediaWiki\Deferred\LinksUpdate\LinksUpdate->doUpdate()
#9 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/RefreshSecondaryDataUpdate.php(103): DeferredUpdates::attemptUpdate(MediaWiki\Deferred\LinksUpdate\LinksUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#10 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/DeferredUpdates.php(536): RefreshSecondaryDataUpdate->doUpdate()
#11 /srv/mediawiki/php-1.38.0-wmf.21/includes/Storage/DerivedPageDataUpdater.php(1778): DeferredUpdates::attemptUpdate(RefreshSecondaryDataUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#12 /srv/mediawiki/php-1.38.0-wmf.21/includes/page/WikiPage.php(2211): MediaWiki\Storage\DerivedPageDataUpdater->doSecondaryDataUpdates(array)
#13 /srv/mediawiki/php-1.38.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(210): WikiPage->doSecondaryDataUpdates(array)
#14 /srv/mediawiki/php-1.38.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(137): RefreshLinksJob->runForTitle(Title)
#15 /srv/mediawiki/php-1.38.0-wmf.21/extensions/EventBus/includes/JobExecutor.php(79): RefreshLinksJob->run()
#16 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#17 {main}
Impact

more than 300 errors in the last 30 min

Notes

Event Timeline

It seems enwiki has been emitting read-only errors quite a lot (https://logstash.wikimedia.org/goto/e9929ea16d86318dffd089d9837e90e6), but that's wmf.20 so it's not related to this (probably some of the maintenance work or an increase in reads has caused it).

For commons, it seems it's also going read-only for short periods. Strangely, the next one is elwiktionary. Something in wmf.21 is probably causing issues and it needs inspection before moving forward.

This was noted in #wikimedia-operations yesterday, before the train rolled to group0. I've been anecdotally seeing more of these than usual since last week (the first one in my bot's logs was Feb 3).

[16:56:30] <dancy>	 hmm.. DBReadOnlyError's happening more than usual.  708 in the last 15 minutes.
[16:56:38] <dancy>	 FIXME: Figure out what "usual" is
[16:57:10] <dancy>	 Fading away now
[16:57:12] <AntiComposite>	 approximately never
[16:57:27] <dancy>	 hehe.  I see them all the time
...
[20:17:41] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.21  refs T300197

https://wm-bot.wmcloud.org/logs/%23wikimedia-operations/20220208.txt

For the enwiki read-only (slightly unrelated): something is writing a lot there; rows written hitting 50k/s is just too much: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&from=1644439845024&to=1644449878742 I'm running the actor migration there, but it takes a two-second break after writing 200 rows, so it can't be it. I will dig into the binlogs.

For other wikis (wmf.21), I'm not sure what's going on.

It seems CategoryMembership changes are adding a lot of pressure atm: https://logstash.wikimedia.org/goto/1a277f0c08b841e2a9ba2bc2f458f8db This can probably be moved to onTransactionCommitOrIdle (if not already) or moved to a job?
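
A minimal sketch of what deferring that write could look like, assuming $dbw is a primary-DB handle the calling code already has; the table and values are placeholders, not the actual CategoryMembership code:

// Option 1: run the write only after the surrounding transaction commits,
// so it no longer extends the main transaction's write time.
$dbw->onTransactionCommitOrIdle( function () use ( $dbw ) {
	$dbw->update(
		'some_table',            // placeholder table name
		[ /* fields to set */ ],
		[ /* conditions */ ],
		__METHOD__
	);
}, __METHOD__ );

// Option 2: take it off the request's critical path entirely via a deferred
// update (which could in turn enqueue a job).
DeferredUpdates::addCallableUpdate( function () {
	// enqueue a job or perform the write here
} );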

For the s1 issue: If you look at https://logstash.wikimedia.org/goto/f9ab8e9a2d1a6dc3e8e1cecf317ed905 and toggle url, almost all are /w/api.php?format=json&formatversion=2&errorformat=html&errorsuselocal=1&action=options. Is that something with DIP?

binlog of s1 has a LOT of this:

DELETE /* MediaWiki\User\UserOptionsManager::saveOptionsInternal  */ FROM `user_properties` WHERE up_user = X AND up_property IN ('echo-subscriptions-push-edit-user-talk','echo-subscriptions-push-login-fail','echo-subscriptions-push-mention','echo-subscriptions-push-thank-you-edit','echo-subscriptions-push-reverted','echo-subscriptions-push-edit-thank','echo-cross-wiki-notifications')

And FlaggedRevs is making a lot of writes there as well, which is weird. I'll investigate that. But someone needs to check the user options flood.

Change 761512 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/FlaggedRevs@master] Short circut updating stats when the page is not reviewable

https://gerrit.wikimedia.org/r/761512

Change 761512 merged by jenkins-bot:

[mediawiki/extensions/FlaggedRevs@master] Short circut updating stats when the page is not reviewable

https://gerrit.wikimedia.org/r/761512

Change 761408 had a related patch set uploaded (by Jforrester; author: Amir Sarabadani):

[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.21] Short circut updating stats when the page is not reviewable

https://gerrit.wikimedia.org/r/761408

Change 761409 had a related patch set uploaded (by Jforrester; author: Amir Sarabadani):

[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.20] Short circut updating stats when the page is not reviewable

https://gerrit.wikimedia.org/r/761409

For the s1 issue: If you look at https://logstash.wikimedia.org/goto/f9ab8e9a2d1a6dc3e8e1cecf317ed905 and toggle url, almost all are /w/api.php?format=json&formatversion=2&errorformat=html&errorsuselocal=1&action=options. Is that something with DIP?

As of writing, 100% of these are debug-level messages of the form: Transaction spent {time_ms}ms in writes, under the 3s limit. This seems to simply be recording every single database change, no matter what, so long as it took non-zero milliseconds of time, and remained under the warning limit. (Source code). These probably don't need to be enabled in production, but maybe I'm missing something.
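
For illustration, the decision being logged is roughly this (a sketch only, with made-up variable names; this is not the actual Rdbms/TransactionProfiler code):

// Anything with a non-zero write time that stays under the limit still emits
// a debug-level message, so effectively every write transaction gets logged.
$writeTime = 0.012;   // seconds spent in write queries (hypothetical value)
$maxWriteTime = 3.0;  // the "3s limit" mentioned in the message
if ( $writeTime > $maxWriteTime ) {
	$logger->warning( 'Transaction spent {time_ms}ms in writes, exceeding the limit',
		[ 'time_ms' => 1000 * $writeTime ] );
} elseif ( $writeTime > 0 ) {
	$logger->debug( 'Transaction spent {time_ms}ms in writes, under the 3s limit',
		[ 'time_ms' => 1000 * $writeTime ] );
}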

Having said that, it is odd that these are exclusively from POST /w/api.php?..&errorsuselocal=1&action=options. I don't see anything in the Rdbms logging code that indicates a reason not to log e.g. all edits, logins and other normal db changes, so there must be something special about these that I'm missing, but they're not slow.

I checked Codesearch for errorsuselocal and found none that coincide with action=options. I checked the Global Search tool for on-wiki use of errorsuselocal, and found none there either.

I checked the Logstash mediawiki dashboard with the query channel:DBPerformance AND normalized_message:"Transaction spent" over the last 30 days, and then added url:"action=options" to the query, which yielded the same results. So in fact 100% of these are from this weird API query: all 3 million a day, for at least 90 days, across different wikis.

Screenshot 2022-02-10 at 04.34.11.png (574×2 px, 122 KB)

I then checked api.log on mwlog1002 to get some samples, and noticed they are all exactly the same: change=echo-subscriptions-push-edit-user-talk%3D1%7Cecho-subscriptions-push-login-fail%3D1%7Cecho-subscriptions-push-mention%3D1%7Cecho-subscriptions-push-thank-you-edit%3D1%7Cecho-subscriptions-push-reverted%3D1%7Cecho-subscriptions-push-edit-thank%3D1%7Cecho-cross-wiki-notifications%3D1, albeit on different wikis and from different user names and IP addresses. These appear related to Echo, and match the DB binlog sample that @Ladsgroup posted above, confirming that it was related and representative of the same cause:

/srv/mw-log/api.log
krinkle@mwlog1002:/srv/mw-log$ ack POST api.log | fgrep errorsuselocal=1 | fgrep action=options | head
mw1362 zhwiki 1.38.0-wmf.20 api INFO: API POST ******* ******* action=options format=json errorformat=html errorsuselocal=1 change=echo-subscriptions-push-edit-user-talk%3D1%7Cecho-subscriptions-push-login-fail%3D1%7Cecho-subscriptions-push-mention%3D1%7Cecho-subscriptions-push-thank-you-edit%3D1%7Cecho-subscriptions-push-reverted%3D1%7Cecho-subscriptions-push-edit-thank%3D1%7Cecho-cross-wiki-notifications%3D1 token=[redacted] formatversion=2 
mw1412 arwiki 1.38.0-wmf.20 api INFO: API POST ******* ******* action=options format=json errorformat=html errorsuselocal=1 change=echo-subscriptions-push-edit-user-talk%3D1%7Cecho-subscriptions-push-login-fail%3D1%7Cecho-subscriptions-push-mention%3D1%7Cecho-subscriptions-push-thank-you-edit%3D1%7Cecho-subscriptions-push-reverted%3D1%7Cecho-subscriptions-push-edit-thank%3D1%7Cecho-cross-wiki-notifications%3D1 token=[redacted] formatversion=2  
..

I then tapped into the raw webrequest stream to capture a few samples, hoping to find a representative referer in order to understand what might be causing it, given that a codesearch led nowhere. Perhaps the code is obfuscated or coming from something third-party.

I got no referer; however, there was a telling user-agent instead:

kafkacat webrequest_text
krinkle at stat1004.eqiad.wmnet in ~
$ kafkacat -C -b 'kafka-jumbo1005.eqiad.wmnet' -o '-100000000' -t webrequest_text | fgrep errorsuselocal | fgrep '"POST"' | head
{
  "http_method":"POST","uri_host":"en.wikipedia.org",
  "uri_path":"/w/api.php","uri_query":"?format=json&formatversion=2&errorformat=html&errorsuselocal=1&action=options"
  ***,
  "user_agent":"WikipediaApp/2.7.50392-r-2022-01-24 (Android ***",
  ***
}

$ kafkacat -C -b 'kafka-jumbo1005.eqiad.wmnet' -o '-100000000' -t webrequest_text | fgrep errorsuselocal | fgrep '"POST"' | head | jq -r .user_agent
WikipediaApp/2.7.50392-r-2022-01-24 (Android 11; ***
WikipediaApp/2.7.50382-r-2021-10-25 (Android 11; ***
WikipediaApp/2.7.50392-r-2022-01-24 (Android 11; ***
WikipediaApp/2.7.50392-r-2022-01-24 (Android 9; ***

I then checked GitHub search errorsuselocal org:wikimedia and noticed one result that wasn't in our Codesearch: apps-android-wikipedia:/dataclient/Service.kt.

The relevant query appears to be in apps-android-wikipedia:/push/WikipediaFirebaseMessagingService.kt

\cc @cooltey @Dbrant Can you check whether there is something wrong here that may explain why the Wikipedia for Android app is sending 3 million identical uncached API action=options DB write queries a day, almost on par with the number of page views it serves? (~7 million a day per stats). It seems like there is likely a bug causing the app to be repeatedly setting or changing user preferences that either are or should be the default. In particular, change apps-android-wikipedia#2780 looks like it may be a culprit.

My vague memory of the push notification architecture is that there is a dedicated API endpoint for the apps to send a token, which the service then uses to send messages via the intermediate gateways (wikitech, mw.org, T252899). I'm not aware of it using the core databases or user preference system as part of its hot code path.

Change 761408 merged by jenkins-bot:

[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.21] Short circut updating stats when the page is not reviewable

https://gerrit.wikimedia.org/r/761408

Mentioned in SAL (#wikimedia-operations) [2022-02-10T10:58:37Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.21/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:761408|Short circut updating stats when the page is not reviewable (T301433)]] (duration: 00m 50s)

Change 761409 merged by jenkins-bot:

[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.20] Short circut updating stats when the page is not reviewable

https://gerrit.wikimedia.org/r/761409

Mentioned in SAL (#wikimedia-operations) [2022-02-10T11:01:16Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.20/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:761409|Short circut updating stats when the page is not reviewable (T301433)]] (duration: 00m 49s)

Before and after deployment of the FR fix:

Before:

root@db2112:/srv/sqldata# tail db2112-bin.004905 --lines 5000 | grep -a enwiki | wc -l
1516
root@db2112:/srv/sqldata# tail db2112-bin.004905 --lines 5000 | grep -a enwiki | grep -a -i FlaggedRevs | wc -l
190

After:

root@db2112:/srv/sqldata# tail db2112-bin.004905 --lines 5000 | grep -a enwiki | wc -l
1536
root@db2112:/srv/sqldata# tail db2112-bin.004905 --lines 5000 | grep -a enwiki | grep -a -i FlaggedRevs | wc -l
1

Note that to get the actual number of FR queries, you need to multiply it by three, because the other two query types don't have "FlaggedRevs" in them; they have "Flaggable" instead.

Thank you for the investigation of the other issue, @Krinkle. Awesome work!

To recap:

  • We have several code paths putting pressure on the primary, causing it to go read-only from time to time. The first and the second are just big floods of fast write queries; the third is not massive but it's really slow.
  • We also have code paths that are putting pressure on replicas (which, because of pt-heartbeat, can come back and cause issues for the primary).
  • When it goes read-only for that specific request, it triggers a lot of exceptions (in the linked req-id, we have 40 errors), but they're all still within one request, so the raw error count is not very representative.

I think it's okay to move forward with the train because these are not new code paths, but this issue in itself is UBN and must be fixed ASAP.

Thanks for the heads up! I do indeed see some code paths that would lead to this API call being made unnecessarily. We'll make an update to the app and get it released asap.
Pull request: https://github.com/wikimedia/apps-android-wikipedia/pull/3117

Dbrant triaged this task as High priority.
thcipriani raised the priority of this task from High to Unbreak Now!. Feb 10 2022, 4:22 PM
thcipriani subscribed.

(UBN! since it's a train blocker)

We've deployed an updated version of the app, however it will take a little while to be approved in the Play Store, and propagate to our install base. (I would estimate a few hours for approval, and then ~24 hours for an appreciable fraction of the install base to get the update and see a reduction in these API calls.) Sorry for any inconvenience.

Thanks for the quick action @Dbrant, much appreciated.


I'm taking this to mean that we can proceed to deploy wmf.21 to all wikis today. Please let me know if there are any objections.


There is no need to trigger deletes if the values are already the default values; see T301506.
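
Roughly, the idea behind that (a sketch of the premise only, not the actual UserOptionsManager code; getDefaultOption() is the only real method assumed here):

// Drop any option whose new value already equals the site default; storing it
// only produces the DELETE (and re-INSERT) pairs flooding user_properties.
$toWrite = [];
foreach ( $changedOptions as $key => $value ) {
	if ( (string)$value === (string)$userOptionsLookup->getDefaultOption( $key ) ) {
		continue;
	}
	$toWrite[$key] = $value;
}
if ( $toWrite === [] ) {
	return; // nothing actually changed, so skip touching user_properties entirely
}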

The commons read-only hasn't improved yet. I think this has something to do with the links update refactor: https://logstash.wikimedia.org/goto/5cda69d4ac689da8e2d7f340201e89a0

@Marostegui can it be semi-sync? I found these:

|  <redacted> | wikiuser        | 10.64.16.151:xxxx   | commonswiki | Query       |        0 | Waiting for semi-sync ACK from slave                                  | UPDATE /* UserEditCountUpdate::doUpdate  */  `user` SET user_editcount=user_editcount+1 WHERE user_i |    0.000 |

That would explain the sudden read-only errors we see. Maybe we need to tune the parameters.

That's probably a consequence of the replication lag we are seeing often on the replicas.
Keep in mind that semi-sync config hasn't been changed in years (and we haven't had a master swap in months for commons).
Do you have a timestamp for that event? We can try to see whether it happened at the same time as one of the hosts having replication lag.

On Friday I pretty often saw 10 seconds of lag on lots of slaves, which I would assume comes from a burst of writes on the master, but I wasn't able to identify what kind of writes they were.

Change 762128 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] WikiPage: Cast the category values to string in updateCategoryCounts

https://gerrit.wikimedia.org/r/762128

I found what's causing this issue; the bad news is that it's been causing data corruption.

This happens when a category whose name is only a number is added or removed. In the case I found, it was Category:1941. You can see that the query becomes this: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2022.02.13?id=q_yR834B-VMqbJtQXJdY and https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2022.02.13?id=86yR834BoAyk87sqX8bY

	UPDATE /* WikiPage::updateCategoryCounts  */  `category` SET cat_pages = cat_pages - 1,cat_files = cat_files - 1 WHERE cat_title = 1941

MySQL interprets cat_title = 1941 (as opposed to cat_title = '1941') by coercing cat_title to a number for the comparison, so it matches any category whose title starts with 1941, which means about 2000 other categories got updated as well. This is because addQuotes() doesn't cast the value to a string if it is an int.
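
To make the two halves of that concrete (hypothetical snippet, not the actual WikiPage/LinksTable code; addQuotes() is the real Database method mentioned above):

// 1) PHP silently turns the numeric-string array key '1941' into int(1941).
$categories = [ '1941' => true ];
foreach ( $categories as $catTitle => $_ ) {
	var_dump( $catTitle ); // int(1941), not string(4) "1941"
}

// 2) addQuotes() leaves integers unquoted, so the condition is built as
//    cat_title = 1941, and MySQL coerces cat_title to a number for the
//    comparison, which also matches '1941 births', '1941 films', and so on.
//    Casting first restores a plain string comparison:
$conds = [ 'cat_title' => (string)$catTitle ]; // becomes cat_title = '1941'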

Why is an int being sent to WikiPage::updateCategoryCounts()? Looking at the usages, all are coming from the CategoryLinksTable class, which has had a refactor recently. So cc @tstarling and @aaron for finding the underlying problem.

I tested this patch in mwdebug and confirmed that it fixes the issue.

Change 761755 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.21] WikiPage: Cast the category values to string in updateCategoryCounts

https://gerrit.wikimedia.org/r/761755

Change 761755 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.21] WikiPage: Cast the category values to string in updateCategoryCounts

https://gerrit.wikimedia.org/r/761755

Change 762128 merged by jenkins-bot:

[mediawiki/core@master] WikiPage: Cast the category values to string in updateCategoryCounts

https://gerrit.wikimedia.org/r/762128

There was a strval() call before the refactor from https://gerrit.wikimedia.org/r/c/mediawiki/core/+/555563, but no tests for it.
The new patch is deeper in the stack and runs just before the update happens, which sounds better.

Mentioned in SAL (#wikimedia-operations) [2022-02-13T19:26:37Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.21/includes/page/WikiPage.php: Backport: [[gerrit:761755|WikiPage: Cast the category values to string in updateCategoryCounts (T301433)]] (duration: 00m 49s)

Further work:

  • Add regression test
  • Find the underlying cause and possibly check for similar issues in other places
  • Increase the level of the rowsAffected > 1000 warnings to error so they show up during train deployments.

The underlying cause is PHP's array key storage model, which doesn't preserve types but attempts to reconstruct them on access:

php > $categories = [];
php > $categories['1941'] = true;
php > foreach ( $categories as $name => $value ) var_dump( $name );
int(1941)

ParserOutput::getCategories() returns the category names in the array keys, so every caller has to cast to string to recover the correct type.

I have a CategoryLinksTable patch that redundantly fixes the bug; I will upload it as soon as I figure out how to write a test for it.

Change 762141 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] LinksTable: Cast all array keys to string

https://gerrit.wikimedia.org/r/762141

Change 762141 merged by jenkins-bot:

[mediawiki/core@master] LinksTable: Cast all array keys to string

https://gerrit.wikimedia.org/r/762141