CVE: CVE-2019-12473
If this start going above a tipping point we will have a propoer DOS, for now the query killer is holding on the 2 dbs they are happening (so far recentchanges DBs):
jcrespo | |
Sep 18 2018, 3:16 PM |
F26021946: Screenshot_20180918_171422.png | |
Sep 18 2018, 3:16 PM |
CVE: CVE-2019-12473
If this start going above a tipping point we will have a propoer DOS, for now the query killer is holding on the 2 dbs they are happening (so far recentchanges DBs):
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Stalled | None | T302086 Set scap minimum python version to 3.7 | |||
Open | None | T247045 Migrate all of production metal and VMs to Buster or later | |||
Resolved | akosiaris | T249724 Track and remove jessie based container images from production | |||
Resolved | Jdforrester-WMF | T224908 Drop jessie testing support | |||
Resolved | Jdforrester-WMF | T224907 Drop php55 testing support | |||
Resolved | Reedy | T205039 Release MediaWiki 1.27.6/1.30.2/1.31.2/1.32.2 | |||
Resolved | Reedy | T205041 Tracking bug for 1.27.6/1.30.2/1.31.2/1.32.2 security release | |||
Resolved | Anomie | T204729 Potential enwiki DOS due to slow WatchedItemStore::countVisitingWatchersMultiple |
https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2088&var-port=13311&from=now-24h&to=now can this be related to the last train?
There is a big change at 1AM
Query and query plan:
SELECT /* WatchedItemStore::countVisitingWatchersMultiple */ wl_namespace,wl_title,COUNT(*) AS `watchers` FROM `watchlist` GROUP BY wl_namespace,wl_title HAVING COUNT(*) >= 30 root@db2088.codfw.wmnet[enwiki]> show explain for 207583907; +------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+ | 1 | SIMPLE | watchlist | index | NULL | namespace_title | 261 | NULL | 170180999 | Using index; Using filesort | +------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+ 1 row in set, 1 warning (0.03 sec)
I have setup a query killer killling all WatchedItemStore::countVisitingWatchersMultiple queries longer than 1 second, that should contain the issue until someone understand what is going on. BUT IT IS CAUSING AN OUTAGE TO WHATEVER THE SERVICE is, so not undoing the unbreak now.
I don't see anything relevant (from the SAL) around that time, and it's a pretty sharp/immediate increase. The train wasn't deployed until 15:00 UTC today. And there was no train last week (DC switchover).
This code hasn't changed in a while, and isn't called by much.
https://codesearch.wmflabs.org/search/?q=countVisitingWatchersMultiple&i=nope&files=&repos= reveals a single use:
Could it be someone on the API?
Could it be someone on the API?
Yes, that would be my working theory, a bad performance on an uncommonly run query, which should be limited now that it is run by someone once per second.
We are now killing over 1 query per second on rc changes, which is not an ok solution:
For now it should be easily fixable by moving the load to vslow, similar to this: https://github.com/wikimedia/mediawiki/blob/master/includes/changetags/ChangeTags.php#L1573 until there is a proper fix.
by "fix" I mean removing the potential DoS vector point. Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?
I checked the watchlist table and it is the same in eqiad an codfw for the recentchanges hosts.
Also we'd have seen that probably as soon as we got codfw active I guess :-(
No WHERE clause? That's certainly not good.
It looks like if an empty array is passed as $targetsWithVisitThresholds to countVisitingWatchersMultiple(), it'll happily query all titles in the database instead of querying no titles as it would seem to be intended. That should be easy enough to fix.
Looking at the errors in Kibana, they all seem to be going via that code path. The few URLs I spot-checked are passing invalid titles, which would give an empty $targetsWithVisitThresholds.
There is no index that can optimize a SELECT without a WHERE, I think that query shouldn't be able to run on a synchronous webrequest (it could run on a cached cron like some other slow queries).
Looking at the code path this will only be hit with action=query prop=info and with the visitingwatchers option.
My crappy command shows a single UA calling this:
addshore@stat1005:~$ kafkacat -b kafka-jumbo1002.eqiad.wmnet -t webrequest_text -q | grep "en.wikipedia" | grep "api.php" | grep "action=query" | grep "prop=info" | grep "visitingwatchers"
Sample request:
{P7564}
Oh, this got filed as a security bug, because of the DOS potential. See https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/461153 for a fix.
For roots: and (non root to request the query killer to be disabled)- pt-kill is running locally on the rc hosts as root on enwiki codfw, make sure to drop the screens if/when that is fixed.
I'v +2ed the change posted by @Anomie and will put it on the deployed branches shortly (just waiting on jenkins)
Cool - let me know here or on IRC when merged, so I can kill the query killers and see if we are good
Thanks a lot guys
Patches now deployed to both branches:
2018-09-18 16:45 addshore@deploy1001: Synchronized php-1.32.0-wmf.22/includes/watcheditem/WatchedItemStore.php: WatchedItemStore::countVisitingWatchersMultiple() fix T204729 (duration: 00m 57s) 16:43 addshore@deploy1001: Synchronized php-1.32.0-wmf.20/includes/watcheditem/WatchedItemStore.php: WatchedItemStore::countVisitingWatchersMultiple() fix T204729 (duration: 00m 59s)
I have killed query killers on db2088 and db2085. Now monitoring the graphs to see if this is fixed.
This looks solved so far.
And for the record, these were the query killers used:
while true; do pt-kill --victims=all --kill --print --busy-time 1 --match-info='WatchedItemStore::countVisitingWatchersMultiple' --match-user=wikiuser S=/run/mysqld/mysqld.s1.sock; sleep 100; done
I am going to give it some more more hours before closing the ticket as resolved just in case, but everything looks good so far now.
Thanks a lot @Addshore and @Anomie for getting this quickly tackled
In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?
Yeah, this may need followup for software security backporting- being filled by DBAs and fixed by technology, the focus was mostly on wikimedia DOS vector and not on the software bug. This probably need further consideration for the product side of things.
It was introduced in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/277488/
Based on githubs info https://github.com/wikimedia/mediawiki/commit/7d4e225cb90ac275c6077e830d8b63678c3ea9dc
It is included in the following tags: master 1.31.0 1.31.0-rc.2 1.31.0-rc.1 1.31.0-rc.0 1.30.0 1.30.0-rc.0 1.29.2 1.29.1 1.29.0 1.29.0-rc.1 1.29.0-rc.0 1.28.3 1.28.2 1.28.1 1.28.0 1.28.0-rc.1 1.28.0-rc.0 1.27.4 1.27.3 1.27.2 1.27.1 1.27.0 1.27.0-rc.1 1.27.0-rc.0
I guess we should drop it on REL 131, 130, 129, 128, 127?
1.28 is EOL already, so we don't need to worry about that. I don't really want to squeeze this into this week's security release since the prep work for that is basically already done, at which point 1.29 is EOL as well. So we could cherry-pick it next week for 1.31, 1.30, and 1.27, after the release? cc @Reedy