2021-09-04 enwiki was down at 10:44 (UTC)
Closed, Resolved · Public · BUG REPORT

Description

What happens?:

As with T279809 in the past, the outage was too short for me to run more tests.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Majavah added a subscriber: Majavah.

There indeed was an alert for text request volume that matches your timing:

10:45:57 <jinxer-wm> (VarnishTrafficDrop) firing: 62% GET drop in text@ during the past 30 minutes  - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
10:50:57 <jinxer-wm> (VarnishTrafficDrop) resolved: 67% GET drop in text@ during the past 30 minutes  - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org

https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1630736814684&to=1630758414684&var-site=codfw&var-group=core&var-shard=s1&var-role=All shows a drop in enwiki queries and an increase in max response time after that (although it recovered suspiciously quickly)

Marostegui added a subscriber: Marostegui.

I'm on my phone, but just to mention that the drop in queries is probably a consequence of something else and not the cause. There's a huge spike and then the drop, which makes me think we had something else failing in between.

Marostegui added a subscriber: Ladsgroup.

From what I can see, all the enwiki API hosts received this query around the reported time, which is pretty crazy (hiding the query for obvious reasons): {P17211}

Pasting the EXPLAIN, as that is fine to share. We can see it is doing a full table scan:

+------+-------------+-----------+------+----------------------+---------+---------+---------------------+----------+----------------------------------------------+
| id   | select_type | table     | type | possible_keys        | key     | key_len | ref                 | rows     | Extra                                        |
+------+-------------+-----------+------+----------------------+---------+---------+---------------------+----------+----------------------------------------------+
|    1 | SIMPLE      | page      | ALL  | PRIMARY              | NULL    | NULL    | NULL                | 52516309 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | pagelinks | ref  | PRIMARY,pl_namespace | PRIMARY | 4       | enwiki.page.page_id | 28       | Using where; Using index                     |
+------+-------------+-----------+------+----------------------+---------+---------+---------------------+----------+----------------------------------------------+
2 rows in set (0.037 sec)
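
As a rough illustration of how the EXPLAIN output above can be read mechanically, the following Python sketch parses the pasted table and flags rows whose access type is `ALL` (a full table scan) over a large estimated row count. The function names and the `min_rows` threshold are hypothetical and chosen only for this example; the table values are copied from the output above.

```python
# Sketch only: parse a MariaDB-style EXPLAIN table and flag full table scans.
# parse_explain, full_scans and min_rows are hypothetical names for illustration.

EXPLAIN_OUTPUT = """
| id | select_type | table     | type | possible_keys        | key     | key_len | ref                 | rows     | Extra                                        |
|  1 | SIMPLE      | page      | ALL  | PRIMARY              | NULL    | NULL    | NULL                | 52516309 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | pagelinks | ref  | PRIMARY,pl_namespace | PRIMARY | 4       | enwiki.page.page_id | 28       | Using where; Using index                     |
"""


def parse_explain(text):
    """Turn the pipe-delimited EXPLAIN table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip blank lines and +----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def full_scans(rows, min_rows=1_000_000):
    """Return table names scanned with type=ALL over at least min_rows estimated rows."""
    return [
        row["table"]
        for row in rows
        if row["type"] == "ALL" and int(row["rows"]) >= min_rows
    ]


print(full_scans(parse_explain(EXPLAIN_OUTPUT)))  # ['page']
```

Here `page` is flagged because its `type` is `ALL` with no `key` chosen and an estimate of roughly 52.5 million rows, while `pagelinks` is read through its `PRIMARY` index.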

That caused a massive spike on all of them and may have made them unavailable: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&from=1630744072686&to=1630753109255&var-server=db2092&var-port=9104

Peachey88 renamed this task from enwiki was down at 10:44 (UTC) to 2021-09-04 enwiki was down at 10:44 (UTC). Sep 5 2021, 5:19 AM

fyi there was a brief TCP/ACK DDoS toward esams at that time.

I connect to esams.

Reedy closed subtask Restricted Task as Resolved. Sep 25 2021, 11:48 PM

The private task T290394 has been resolved after the patch to mitigate this was pushed. Closing this task too.

Change 725056 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_31] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725056

Change 725056 merged by jenkins-bot:

[mediawiki/core@REL1_31] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725056

Change 725062 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_35] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725062

Change 725062 merged by jenkins-bot:

[mediawiki/core@REL1_35] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725062

Change 725067 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_36] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725067

Change 725072 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_37] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725072

Change 725075 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@master] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725075

Change 725067 merged by jenkins-bot:

[mediawiki/core@REL1_36] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725067

Change 725072 merged by jenkins-bot:

[mediawiki/core@REL1_37] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725072

Change 725075 merged by jenkins-bot:

[mediawiki/core@master] SECURITY: Add straight join to ApiQueryBacklinks

https://gerrit.wikimedia.org/r/725075
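
For readers unfamiliar with the term in the patch titles above: in MySQL and MariaDB, `STRAIGHT_JOIN` makes the optimizer join tables in the order they are written instead of picking its own order. The actual query and patch contents are not shown in this task, so the sketch below is only an analogy, with several assumptions: it uses SQLite (via Python's `sqlite3` module) rather than MariaDB, it uses SQLite's `CROSS JOIN`, which SQLite likewise never reorders, in place of `STRAIGHT_JOIN`, and the two-table schema and the queries are simplified, hypothetical stand-ins for `page` and `pagelinks`.

```python
# Sketch only: SQLite's CROSS JOIN stands in for MariaDB's STRAIGHT_JOIN.
# The schema and queries are simplified, hypothetical stand-ins, not the real ones.
import sqlite3


def outer_loop(conn, sql):
    """Return the EXPLAIN QUERY PLAN detail text for the outermost (first) loop."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return plan[0][3]  # the fourth column holds the human-readable detail


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER,
        page_title TEXT
    );
    CREATE TABLE pagelinks (
        pl_from INTEGER,
        pl_namespace INTEGER,
        pl_title TEXT,
        PRIMARY KEY (pl_from, pl_namespace, pl_title)
    );
    """
)

# Join order pinned so that pagelinks is read first, then page is looked up.
links_first = (
    "SELECT page_id FROM pagelinks CROSS JOIN page ON page_id = pl_from "
    "WHERE pl_title = 'Example'"
)

# Join order pinned the other way: page is walked first.
page_first = (
    "SELECT page_id FROM page CROSS JOIN pagelinks ON pl_from = page_id "
    "WHERE pl_title = 'Example'"
)

print(outer_loop(conn, links_first))
print(outer_loop(conn, page_first))
```

The first plan starts with `pagelinks` and the second with `page`, regardless of what the planner would otherwise prefer. Pinning the order by hand is a trade-off: it avoids a bad plan such as the full scan of `page` seen in the EXPLAIN, but it also stops the optimizer from adapting if the data distribution changes.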