
2021-09-04 enwiki was down at 10:44 (UTC)
Closed, Resolved · Public · BUG REPORT

Description

What happens?:

As with T279809 in the past, the outage was too short for me to run more tests.

Event Timeline

Restricted Application added a subscriber: Aklapper.
taavi subscribed.

There was indeed an alert for text request volume that matches your timing:

10:45:57 <jinxer-wm> (VarnishTrafficDrop) firing: 62% GET drop in text@ during the past 30 minutes  - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
10:50:57 <jinxer-wm> (VarnishTrafficDrop) resolved: 67% GET drop in text@ during the past 30 minutes  - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
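
To make the timing match concrete, here is a minimal Python sketch that parses the two alert lines above and compares them with the reported 10:44 (UTC) outage time. The regular expression and the `parse_alert` helper are illustrative assumptions, not part of any Wikimedia tooling, and the trailing URLs are omitted from the sample lines for brevity.

```python
import re
from datetime import datetime

# The two alert lines quoted above, with the trailing dashboard URLs trimmed.
ALERT_LINES = [
    "10:45:57 <jinxer-wm> (VarnishTrafficDrop) firing: 62% GET drop in text@ during the past 30 minutes",
    "10:50:57 <jinxer-wm> (VarnishTrafficDrop) resolved: 67% GET drop in text@ during the past 30 minutes",
]

# Hypothetical pattern for this specific line shape: time, bot nick, alert name, state, percentage.
ALERT_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) <(?P<bot>[^>]+)> "
    r"\((?P<alert>\w+)\) (?P<state>firing|resolved): (?P<pct>\d+)% "
)


def parse_alert(line):
    """Extract time, alert name, state and drop percentage from one alert line."""
    match = ALERT_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised alert line: {line!r}")
    return {
        "time": datetime.strptime(match["time"], "%H:%M:%S"),
        "alert": match["alert"],
        "state": match["state"],
        "pct": int(match["pct"]),
    }


firing, resolved = (parse_alert(line) for line in ALERT_LINES)

# How long the alert stayed active, and how long after the reported time it fired.
duration = resolved["time"] - firing["time"]
reported = datetime.strptime("10:44", "%H:%M")
lag = firing["time"] - reported

print(f"alert active for {duration}, fired {lag} after the reported time")
```

Under these assumptions the alert fired a little under two minutes after the reported time and stayed active for five minutes.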

https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1630736814684&to=1630758414684&var-site=codfw&var-group=core&var-shard=s1&var-role=All shows a drop in enwiki queries and an increase in max response time after that (although it recovered suspiciously quickly)
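
The `from` and `to` parameters in the dashboard link above are Unix timestamps in milliseconds. As a rough sketch (the `grafana_window` helper is a hypothetical name, not a Grafana API), they can be decoded to confirm that the graph covers the reported outage time:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

# Dashboard link copied from the comment above.
URL = (
    "https://grafana.wikimedia.org/d/000000278/mysql-aggregated"
    "?orgId=1&from=1630736814684&to=1630758414684"
    "&var-site=codfw&var-group=core&var-shard=s1&var-role=All"
)


def grafana_window(url):
    """Return the (start, end) of the dashboard time range as UTC datetimes."""
    query = parse_qs(urlparse(url).query)

    def to_utc(ms_text):
        # Epoch milliseconds -> whole seconds -> timezone-aware UTC datetime.
        return datetime.fromtimestamp(int(ms_text) // 1000, tz=timezone.utc)

    return to_utc(query["from"][0]), to_utc(query["to"][0])


start, end = grafana_window(URL)
outage = datetime(2021, 9, 4, 10, 44, tzinfo=timezone.utc)

print(start.isoformat(), end.isoformat())
print("covers reported outage:", start <= outage <= end)
```

The decoded window runs from 06:26:54 to 12:26:54 UTC on 2021-09-04, so the reported 10:44 falls inside it.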

Marostegui subscribed.

I'm on my phone, but just to mention that the drop in queries is probably a consequence of something else and not the cause. There's a huge spike and then the drop, which makes me think we had something else failing in between.

Marostegui added a subscriber: Ladsgroup.

From what I can see, all the API enwiki hosts got this query around the reported time, which is pretty crazy (hiding the query for obvious reasons): {P17211}

Pasting the explain as that is fine - we can see it is doing a full table scan:

+------+-------------+-----------+------+----------------------+---------+---------+---------------------+----------+----------------------------------------------+
| id   | select_type | table     | type | possible_keys        | key     | key_len | ref                 | rows     | Extra                                        |
+------+-------------+-----------+------+----------------------+---------+---------+---------------------+----------+----------------------------------------------+
|    1 | SIMPLE      | page      | ALL  | PRIMARY              | NULL    | NULL    | NULL                | 52516309 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | pagelinks | ref  | PRIMARY,pl_namespace | PRIMARY | 4       | enwiki.page.page_id | 28       | Using where; Using index                     |
+------+-------------+-----------+------+----------------------+---------+---------+---------------------+----------+----------------------------------------------+
2 rows in set (0.037 sec)

That caused a massive spike on all of them and may have made them unavailable: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&from=1630744072686&to=1630753109255&var-server=db2092&var-port=9104
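
For readers less familiar with EXPLAIN output, the following Python sketch parses the table pasted above and flags full table scans, meaning rows where `type` is `ALL` and no key is used. The helper names are hypothetical. The row-product figure is only the usual rough heuristic for how many row combinations the server may examine, not a measured value.

```python
# Header and data rows from the EXPLAIN output above (border lines omitted).
EXPLAIN_OUTPUT = """\
| id   | select_type | table     | type | possible_keys        | key     | key_len | ref                 | rows     | Extra                                        |
|    1 | SIMPLE      | page      | ALL  | PRIMARY              | NULL    | NULL    | NULL                | 52516309 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | pagelinks | ref  | PRIMARY,pl_namespace | PRIMARY | 4       | enwiki.page.page_id | 28       | Using where; Using index                     |
"""


def parse_explain(text):
    """Turn pipe-delimited EXPLAIN output into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blanks
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def full_scans(rows):
    """Rows where the whole table is read without using any index."""
    return [row for row in rows if row["type"] == "ALL" and row["key"] == "NULL"]


rows = parse_explain(EXPLAIN_OUTPUT)
scans = full_scans(rows)

# Rough upper-bound heuristic: multiply the per-table row estimates.
estimated_combinations = 1
for row in rows:
    estimated_combinations *= int(row["rows"])

for row in scans:
    print(f"full scan on {row['table']}: ~{int(row['rows']):,} rows ({row['Extra']})")
print(f"estimated row combinations: ~{estimated_combinations:,}")
```

Here the only flagged table is `page`, which is read in full (about 52.5 million rows) before `pagelinks` is looked up by primary key for each row.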

Peachey88 renamed this task from enwiki was down at 10:44 (UTC) to 2021-09-04 enwiki was down at 10:44 (UTC). Sep 5 2021, 5:19 AM

fyi there was a brief TCP/ACK DDoS toward esams at that time.

I connect to esams.

Reedy closed subtask Restricted Task as Resolved. Sep 25 2021, 11:48 PM

The private task T290394 has been resolved after the patch to mitigate this was pushed. Closing this task too.

Change 725056 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_31] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725056

Change 725056 merged by jenkins-bot:

[mediawiki/core@REL1_31] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725056

Change 725062 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_35] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725062

Change 725062 merged by jenkins-bot:

[mediawiki/core@REL1_35] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725062

Change 725067 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_36] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725067

Change 725072 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_37] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725072

Change 725075 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@master] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725075

Change 725067 merged by jenkins-bot:

[mediawiki/core@REL1_36] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725067

Change 725072 merged by jenkins-bot:

[mediawiki/core@REL1_37] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725072

Change 725075 merged by jenkins-bot:

[mediawiki/core@master] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725075