Page MenuHomePhabricator

Potential enwiki DOS due to slow WatchedItemStore::countVisitingWatchersMultiple
Closed, ResolvedPublic

Description

CVE: CVE-2019-12473

If this start going above a tipping point we will have a propoer DOS, for now the query killer is holding on the 2 dbs they are happening (so far recentchanges DBs):

Screenshot_20180918_171422.png (961×1 px, 160 KB)

Event Timeline

jcrespo triaged this task as Unbreak Now! priority.Sep 18 2018, 3:16 PM

@daniel @Addshore any idea of what could be causing this on enwiki hosts (so far). I am pinging you guys because I saw you are part of T129478 and we are trying to look for people who could maybe help to troubleshoot this a bit further?

Query and query plan:

SELECT /* WatchedItemStore::countVisitingWatchersMultiple  */  wl_namespace,wl_title,COUNT(*) AS `watchers`  FROM `watchlist`     GROUP BY wl_namespace,wl_title HAVING COUNT(*) >= 30


root@db2088.codfw.wmnet[enwiki]>  show explain for 207583907;
+------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+
| id   | select_type | table     | type  | possible_keys | key             | key_len | ref  | rows      | Extra                       |
+------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+
|    1 | SIMPLE      | watchlist | index | NULL          | namespace_title | 261     | NULL | 170180999 | Using index; Using filesort |
+------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+
1 row in set, 1 warning (0.03 sec)

I have setup a query killer killling all WatchedItemStore::countVisitingWatchersMultiple queries longer than 1 second, that should contain the issue until someone understand what is going on. BUT IT IS CAUSING AN OUTAGE TO WHATEVER THE SERVICE is, so not undoing the unbreak now.

I don't see anything relevant (from the SAL) around that time, and it's a pretty sharp/immediate increase. The train wasn't deployed until 15:00 UTC today. And there was no train last week (DC switchover).

It is only hitting recentchanges slaves at the moment

Could it be someone on the API?

Yes, that would be my working theory, a bad performance on an uncommonly run query, which should be limited now that it is run by someone once per second.

We are now killing over 1 query per second on rc changes, which is not an ok solution:

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2085&var-port=13311&panelId=10&fullscreen&from=now-1h&to=now

For now it should be easily fixable by moving the load to vslow, similar to this: https://github.com/wikimedia/mediawiki/blob/master/includes/changetags/ChangeTags.php#L1573 until there is a proper fix.

by "fix" I mean removing the potential DoS vector point. Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?

Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?

I checked the watchlist table and it is the same in eqiad an codfw for the recentchanges hosts.
Also we'd have seen that probably as soon as we got codfw active I guess :-(

SELECT /* WatchedItemStore::countVisitingWatchersMultiple  */  wl_namespace,wl_title,COUNT(*) AS `watchers`  FROM `watchlist`     GROUP BY wl_namespace,wl_title HAVING COUNT(*) >= 30

No WHERE clause? That's certainly not good.

It looks like if an empty array is passed as $targetsWithVisitThresholds to countVisitingWatchersMultiple(), it'll happily query all titles in the database instead of querying no titles as it would seem to be intended. That should be easy enough to fix.

Looking at the errors in Kibana, they all seem to be going via that code path. The few URLs I spot-checked are passing invalid titles, which would give an empty $targetsWithVisitThresholds.

by "fix" I mean removing the potential DoS vector point. Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?

There is no index that can optimize a SELECT without a WHERE, I think that query shouldn't be able to run on a synchronous webrequest (it could run on a cached cron like some other slow queries).

Looking at the code path this will only be hit with action=query prop=info and with the visitingwatchers option.

My crappy command shows a single UA calling this:

addshore@stat1005:~$ kafkacat -b kafka-jumbo1002.eqiad.wmnet -t webrequest_text -q | grep "en.wikipedia" | grep "api.php" | grep "action=query" | grep "prop=info" | grep "visitingwatchers"

Sample request:

{P7564}

Oh, this got filed as a security bug, because of the DOS potential. See https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/461153 for a fix.

For roots: and (non root to request the query killer to be disabled)- pt-kill is running locally on the rc hosts as root on enwiki codfw, make sure to drop the screens if/when that is fixed.

I'v +2ed the change posted by @Anomie and will put it on the deployed branches shortly (just waiting on jenkins)

I'v +2ed the change posted by @Anomie and will put it on the deployed branches shortly (just waiting on jenkins)

Cool - let me know here or on IRC when merged, so I can kill the query killers and see if we are good
Thanks a lot guys

Patches now deployed to both branches:

2018-09-18
16:45 addshore@deploy1001: Synchronized php-1.32.0-wmf.22/includes/watcheditem/WatchedItemStore.php: WatchedItemStore::countVisitingWatchersMultiple() fix T204729 (duration: 00m 57s)
16:43 addshore@deploy1001: Synchronized php-1.32.0-wmf.20/includes/watcheditem/WatchedItemStore.php: WatchedItemStore::countVisitingWatchersMultiple() fix T204729 (duration: 00m 59s)

I have killed query killers on db2088 and db2085. Now monitoring the graphs to see if this is fixed.

This looks solved so far.

db2085: https://grafana.wikimedia.org/render/dashboard-solo/db/mysql?panelId=3&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2085&var-port=13311&from=1537278729923&to=1537289529923&width=1000&height=500&tz=UTC%2B02%3A00

db2088: https://grafana.wikimedia.org/render/dashboard-solo/db/mysql?panelId=3&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2088&var-port=13311&from=1537278754322&to=1537289554322&width=1000&height=500&tz=UTC%2B02%3A00

And for the record, these were the query killers used:

while true; do pt-kill --victims=all --kill --print --busy-time 1 --match-info='WatchedItemStore::countVisitingWatchersMultiple' --match-user=wikiuser S=/run/mysqld/mysqld.s1.sock; sleep 100; done

I am going to give it some more more hours before closing the ticket as resolved just in case, but everything looks good so far now.
Thanks a lot @Addshore and @Anomie for getting this quickly tackled

Marostegui assigned this task to Anomie.

In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?

Yeah, this may need followup for software security backporting- being filled by DBAs and fixed by technology, the focus was mostly on wikimedia DOS vector and not on the software bug. This probably need further consideration for the product side of things.

In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?

It was introduced in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/277488/
Based on githubs info https://github.com/wikimedia/mediawiki/commit/7d4e225cb90ac275c6077e830d8b63678c3ea9dc
It is included in the following tags: master 1.31.0 1.31.0-rc.2 1.31.0-rc.1 1.31.0-rc.0 1.30.0 1.30.0-rc.0 1.29.2 1.29.1 1.29.0 1.29.0-rc.1 1.29.0-rc.0 1.28.3 1.28.2 1.28.1 1.28.0 1.28.0-rc.1 1.28.0-rc.0 1.27.4 1.27.3 1.27.2 1.27.1 1.27.0 1.27.0-rc.1 1.27.0-rc.0
I guess we should drop it on REL 131, 130, 129, 128, 127?

In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?

It was introduced in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/277488/
Based on githubs info https://github.com/wikimedia/mediawiki/commit/7d4e225cb90ac275c6077e830d8b63678c3ea9dc
It is included in the following tags: master 1.31.0 1.31.0-rc.2 1.31.0-rc.1 1.31.0-rc.0 1.30.0 1.30.0-rc.0 1.29.2 1.29.1 1.29.0 1.29.0-rc.1 1.29.0-rc.0 1.28.3 1.28.2 1.28.1 1.28.0 1.28.0-rc.1 1.28.0-rc.0 1.27.4 1.27.3 1.27.2 1.27.1 1.27.0 1.27.0-rc.1 1.27.0-rc.0
I guess we should drop it on REL 131, 130, 129, 128, 127?

1.28 is EOL already, so we don't need to worry about that. I don't really want to squeeze this into this week's security release since the prep work for that is basically already done, at which point 1.29 is EOL as well. So we could cherry-pick it next week for 1.31, 1.30, and 1.27, after the release? cc @Reedy

Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)".Jun 6 2019, 4:00 PM