
After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error"
Closed, Resolved, Public, PRODUCTION ERROR

Description

Error

MediaWiki version: 1.35.0-wmf.5

message
Cannot access the database: Unknown error (10.64.48.153)

Impact

Saw a surge of these and similar errors while rolling 1.35.0-wmf.8 to group1. This led to cascading failures and a massive spike in errors.

Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20191204-MediaWiki

Notes

Several folks have theories as to what's going on here. It's possible this duplicates something else or doesn't really touch on the underlying cause; I'm just getting it filed ASAP.

Details

Request ID
XehDagpAMEwAACTO@lYAAAHM
Request URL
https://es.wikipedia.org/wiki/Anexo:Canales_de_televisi%C3%B3n_en_M%C3%A9xico
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/loadbalancer/LoadBalancer.php(932): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/loadbalancer/LoadBalancer.php(899): Wikimedia\Rdbms\LoadBalancer->getServerConnection(integer, string, integer)
#2 /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1029): Wikimedia\Rdbms\LoadBalancer->getConnection(integer, array, string, integer)
#3 /srv/mediawiki/php-1.35.0-wmf.5/extensions/GlobalUserPage/includes/GlobalUserPage.php(203): Wikimedia\Rdbms\LoadBalancer->getConnectionRef(integer, array, string)
#4 /srv/mediawiki/php-1.35.0-wmf.5/extensions/GlobalUserPage/includes/GlobalUserPage.php(178): MediaWiki\GlobalUserPage\GlobalUserPage::getCentralTouched(User)
#5 /srv/mediawiki/php-1.35.0-wmf.5/extensions/GlobalUserPage/includes/Hooks.php(57): MediaWiki\GlobalUserPage\GlobalUserPage::shouldDisplayGlobalPage(Title)
#6 /srv/mediawiki/php-1.35.0-wmf.5/includes/Hooks.php(174): MediaWiki\GlobalUserPage\Hooks::onTitleIsAlwaysKnown(Title, NULL)
#7 /srv/mediawiki/php-1.35.0-wmf.5/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#8 /srv/mediawiki/php-1.35.0-wmf.5/includes/Title.php(4179): Hooks::run(string, array)
#9 /srv/mediawiki/php-1.35.0-wmf.5/includes/Title.php(4221): Title->isAlwaysKnown()
#10 /srv/mediawiki/php-1.35.0-wmf.5/includes/skins/Skin.php(1269): Title->isKnown()
#11 /srv/mediawiki/php-1.35.0-wmf.5/includes/skins/SkinTemplate.php(197): Skin::makeUrlDetails(string)
#12 /srv/mediawiki/php-1.35.0-wmf.5/includes/skins/SkinTemplate.php(265): SkinTemplate->setupTemplateForOutput()
#13 /srv/mediawiki/php-1.35.0-wmf.5/includes/skins/SkinTemplate.php(215): SkinTemplate->prepareQuickTemplate()
#14 /srv/mediawiki/php-1.35.0-wmf.5/includes/OutputPage.php(2626): SkinTemplate->outputPage()
#15 /srv/mediawiki/php-1.35.0-wmf.5/includes/MediaWiki.php(974): OutputPage->output(boolean)
#16 /srv/mediawiki/php-1.35.0-wmf.5/includes/MediaWiki.php(987): MediaWiki->{closure}()
#17 /srv/mediawiki/php-1.35.0-wmf.5/includes/MediaWiki.php(530): MediaWiki->main()
#18 /srv/mediawiki/php-1.35.0-wmf.5/index.php(46): MediaWiki->run()
#19 /srv/mediawiki/w/index.php(3): require(string)
#20 {main}

Event Timeline

I'm not sure, but I don't think the db host being unreachable is caused by the train, because it continued even after the rollback of wmf.8; we're seeing it on wmf.5 now.

I spotted that during the first train attempt today and reported it at T239874: MediaWiki: "host db1062 is unreachable" (Connection refused).

Having said that, while it continues to affect wmf.5 now post-rollback, it being elevated with wmf.8 is certainly suspicious. It might be cache-related, where any branch/version change causes the error rate to increase and fail to settle. That would mean that if the relevant cache starts to expire soon, we would see the same increase on wmf.5 even if left alone.

The related issue at T239874 about db1062 (10.64.48.15, s7 wikis + centralauth) was much louder than the one reported here and turned out to be unrelated to the train. The s7 host had been decommissioned but not depooled, which @colewhite has now done. I've closed that task now.

But the issue from this task remains uninvestigated. Unlike the db1062 errors, the ones for db1094 and db1136 (this task) were mainly (or only) on wmf.8, and we don't yet know whether they will return if we roll out again, nor do we know why they happened.

[…] I also see MW errors in Logstash about two other s7 hosts:

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBConnectionError from line 1415 of /srv/mediawiki/php-1.35.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php:
Cannot access the database: Unknown error (10.64.48.109)

10.64.48.109 is the IP of host db1136 (also part of s7)

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBConnectionError from line 1415 of /srv/mediawiki/php-1.35.0-wmf.8/includes/libs/rdbms/loadbalancer/LoadBalancer.php:
Cannot access the database: Unknown error (10.64.48.153)

10.64.48.153 is the IP of host db1094 (also part of s7).

That seems rather concerning: one s7 host (db1062) was fully unreachable, and two others are also (sometimes) unreachable?

I believe db1094 and db1136 being unreachable was a consequence and not the cause.
Both hosts had a huge spike in connections:
db1094: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1575476805856&to=1575515716494&var-dc=eqiad%20prometheus%2Fops&var-server=db1094&var-port=9104&fullscreen&panelId=37
db1136: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1575476805856&to=1575515716494&fullscreen&panelId=37&var-dc=eqiad%20prometheus%2Fops&var-server=db1136&var-port=9104

The second one is basically the same spike on both hosts, which looks like too much of a coincidence. Such a spike could have caused the hosts to go unreachable for a few minutes.

Those two spikes also match the errors for db1094: https://logstash.wikimedia.org/goto/070dd0dc0ecdbee6ee5b4a505479784d
And db1136: https://logstash.wikimedia.org/goto/748b67a507f591ba0d5c9a974029f93b

The rest of the hosts in s7 (with less weight on the LB than db1094 or db1136) also had those spikes:
db1101: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1575476805856&to=1575515716494&var-dc=eqiad%20prometheus%2Fops&var-server=db1101&var-port=13317&fullscreen&panelId=3
db1098: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1575476805856&to=1575515716494&fullscreen&panelId=3&var-dc=eqiad%20prometheus%2Fops&var-server=db1098&var-port=13317
db1079: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1575476805856&to=1575515716494&var-dc=eqiad%20prometheus%2Fops&var-server=db1079&var-port=9104&fullscreen&panelId=3

So I believe the train caused that massive spike, which made the hosts with more weight on the LB (db1136 and db1094) go unreachable for a few minutes (as shown on the graphs):

"db1079": 200,
"db1090:3317": 1,
"db1094": 500,
"db1098:3317": 150,
"db1101:3317": 150,
"db1136": 400

I have not been able to find errors for those two hosts outside of the spikes, which I believe means they are normally working fine and only had issues during those two spikes, which match the train.
The first spike starts at 20:15 (per logstash https://logstash.wikimedia.org/goto/cebe10b2507c977c10fd5bace04d937c), which matches:

20:14 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.8 (duration: 01m 29s)

The errors cease at 20:28 (per logstash: https://logstash.wikimedia.org/goto/cebe10b2507c977c10fd5bace04d937c), which matches the revert:

20:28 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.35.0-wmf.5"

The second spike is at 23:30 (per logstash https://logstash.wikimedia.org/goto/11b3b928d15ac70d676e78c7e678867b) which matches:

23:30 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.8 (duration: 01m 01s)
23:29 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.8

Outside of those windows, the hosts are giving no errors. Are we releasing something that introduces new queries or touches centralauth (which lives in s7)?

Given the dbs (in s7) I highly doubt it's wikidata, but I also want to mention that s7 has only frwiktionary and metawiki as its group1 wikis. Is there anything special about those?

These queries showed up at the time of the incident, and I cannot see them happening before it:
{P9822}

I have not been able to see those slow queries before 4th Dec 20:21 UTC, so they might be new and could potentially be causing issues. They are definitely very slow and need to be looked at.

Looks like we were writing at the same time. Group1 also has hewiki and cawiki (along with metawiki); see my previous comment with some possibly new heavy queries arriving at those three wikis.

Note that s7 is metawiki + centralauth, which are involved in requests for pretty much all wikis all the time. That includes group0, but their traffic may be too small to cause issues and/or to notice them. So the sheer volume of group1 (commons, wikidata) switching to the new branch seems like a possible trigger, regardless of whether s7 wikis and group1 intersect.

I note all of those queries include titles formatted as integers rather than as strings, i.e. the opposite of the problem in T238378. It's probably caused by a combination of 673d496 and the fact that PHP's associative arrays convert integer-like strings used as keys to actual integers. Adjusting the queries to pass them as strings seems to make them significantly faster.
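
To illustrate the key coercion being described, here is a minimal standalone PHP sketch (not MediaWiki code; the array below is made up):

<?php
// PHP silently converts integer-like string keys to integers,
// so a set of page titles keyed by title loses the string type.
$titles = [ '123' => true, 'Foo' => true ];

var_dump( array_keys( $titles ) );
// array(2) { [0]=> int(123) [1]=> string(3) "Foo" }
//            ^ the key '123' has become int(123)

// If such keys are later interpolated into a WHERE clause without being
// cast back to strings, the generated SQL compares a string column
// against an integer (page_title = 123 instead of page_title = '123').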

I can patch the code path creating these three queries easily enough, but how many other code paths that do this exist?

Change 554925 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] rdbms: Have Database::makeWhereFrom2d assume $subKey is string-based

https://gerrit.wikimedia.org/r/554925
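
For readers following along, a minimal sketch of the idea behind that change (an illustration only, not the actual patch; the function below is invented for the example):

<?php
// Sketch: cast array keys back to strings before quoting them into SQL,
// so a string column is compared against quoted literals, not bare integers.
// $quote stands in for the driver's string-escaping function.
function whereInFromKeys( array $map, string $column, callable $quote ): string {
    $values = [];
    foreach ( array_keys( $map ) as $key ) {
        // PHP may have turned the key '123' into int(123); undo that here.
        $values[] = $quote( (string)$key );
    }
    return $column . ' IN (' . implode( ',', $values ) . ')';
}

// Example: both keys end up as quoted strings in the generated condition.
echo whereInFromKeys(
    [ '123' => true, 'Foo' => true ],
    'page_title',
    function ( $s ) {
        return "'" . addslashes( $s ) . "'";
    }
);
// page_title IN ('123','Foo')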

I bet that WHERE foo=1 bypasses the index and scans every single title string, converting it to an integer to cover the cases where the title might be (say) '01' or '1.', whereas WHERE foo='1' uses the db index as you'd expect.
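
A hedged way to see that difference locally (assuming a MySQL instance with a page table indexed on page_title; the connection details below are placeholders):

<?php
// Compare the plans MySQL chooses for an integer vs. a string literal.
$pdo = new PDO( 'mysql:host=127.0.0.1;dbname=wiki', 'user', 'password' );

$queries = [
    // Integer literal: the column value is converted to a number for the
    // comparison, so the index on page_title cannot be used (full scan).
    "EXPLAIN SELECT page_id FROM page WHERE page_title = 1",
    // String literal: a normal index lookup.
    "EXPLAIN SELECT page_id FROM page WHERE page_title = '1'",
];

foreach ( $queries as $sql ) {
    echo $sql, "\n";
    foreach ( $pdo->query( $sql ) as $row ) {
        // type, key and rows show whether the index was used.
        echo '  type=', $row['type'],
             ' key=', ( $row['key'] ?? 'NULL' ),
             ' rows=', $row['rows'], "\n";
    }
}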

Change 554925 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Have Database::makeWhereFrom2d assume $subKey is string-based

https://gerrit.wikimedia.org/r/554925

Change 555468 had a related patch set uploaded (by Mobrovac; owner: Anomie):
[mediawiki/core@wmf/1.35.0-wmf.8] rdbms: Have Database::makeWhereFrom2d assume $subKey is string-based

https://gerrit.wikimedia.org/r/555468

Change 555468 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.8] rdbms: Have Database::makeWhereFrom2d assume $subKey is string-based

https://gerrit.wikimedia.org/r/555468

Mentioned in SAL (#wikimedia-operations) [2019-12-06T17:43:53Z] <jforrester@deploy1001> Synchronized php-1.35.0-wmf.8/includes/libs/rdbms/database/Database.php: T239877 Have Database::makeWhereFrom2d assume $subKey is string-based (duration: 01m 11s)

Deployed (to group0); all seems quiet, but we're not going to roll the train on a Friday (especially as it caused a site outage last time).

Maybe when using strict SQL mode this results in an error rather than a numeric conversion of all rows of that column.

See https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/555563/ for an issue on my localhost with numeric category names.

What's the plan for rolling the train forward now? Are we just skipping last week's train entirely, or are we going to try to roll everyone to wmf.8 today (Monday)?

I suggest the latter option.

We're planning to roll wmf.8 forward today; we'll sort out exactly when shortly. It would be good for one of Dan Duvall or James F. to be around, as they're running the train for this week's wmf.10.

Rolling out now (~17:00 UTC).

Declaring this fixed. Thanks, all, and especially Anomie, Marostegui, and Krinkle.