Maniphest T204729

Potential enwiki DOS due to slow WatchedItemStore::countVisitingWatchersMultiple
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• jcrespo
	Sep 18 2018, 3:16 PM

Description

CVE: CVE-2019-12473

If this start going above a tipping point we will have a propoer DOS, for now the query killer is holding on the 2 dbs they are happening (so far recentchanges DBs):

Screenshot_20180918_171422.png (961×1 px, 160 KB)

Details

Subject	Repo	Branch	Lines +/-
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for none	mediawiki/core	REL1_31	+13 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for none	mediawiki/core	REL1_30	+13 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for none	mediawiki/core	REL1_27	+13 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for none	mediawiki/core	master	+5 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for none	mediawiki/core	wmf/1.32.0-wmf.20	+5 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for none	mediawiki/core	wmf/1.32.0-wmf.22	+5 -0

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Stalled	None	T302086 Set scap minimum python version to 3.7
Resolved	None	T247045 Migrate all of production metal and VMs to Buster or later
Resolved	akosiaris	T249724 Track and remove jessie based container images from production
Resolved	Jdforrester-WMF	T224908 Drop jessie testing support
Resolved	Jdforrester-WMF	T224907 Drop php55 testing support
Resolved	Reedy	T205039 Release MediaWiki 1.27.6/1.30.2/1.31.2/1.32.2
Resolved	Reedy	T205041 Tracking bug for 1.27.6/1.30.2/1.31.2/1.32.2 security release
Resolved	Anomie	T204729 Potential enwiki DOS due to slow WatchedItemStore::countVisitingWatchersMultiple

Event Timeline

• jcrespo created this task.Sep 18 2018, 3:16 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 18 2018, 3:16 PM

• jcrespo triaged this task as Unbreak Now! priority.Sep 18 2018, 3:16 PM

• Marostegui subscribed.Sep 18 2018, 3:18 PM

https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2088&var-port=13311&from=now-24h&to=now can this be related to the last train?
There is a big change at 1AM

• Marostegui updated the task description. (Show Details)Sep 18 2018, 3:26 PM

@daniel @Addshore any idea of what could be causing this on enwiki hosts (so far). I am pinging you guys because I saw you are part of T129478 and we are trying to look for people who could maybe help to troubleshoot this a bit further?

• Marostegui added a subscriber: daniel.Sep 18 2018, 3:35 PM

Query and query plan:

SELECT /* WatchedItemStore::countVisitingWatchersMultiple  */  wl_namespace,wl_title,COUNT(*) AS `watchers`  FROM `watchlist`     GROUP BY wl_namespace,wl_title HAVING COUNT(*) >= 30


root@db2088.codfw.wmnet[enwiki]>  show explain for 207583907;
+------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+
| id   | select_type | table     | type  | possible_keys | key             | key_len | ref  | rows      | Extra                       |
+------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+
|    1 | SIMPLE      | watchlist | index | NULL          | namespace_title | 261     | NULL | 170180999 | Using index; Using filesort |
+------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+
1 row in set, 1 warning (0.03 sec)

I have setup a query killer killling all WatchedItemStore::countVisitingWatchersMultiple queries longer than 1 second, that should contain the issue until someone understand what is going on. BUT IT IS CAUSING AN OUTAGE TO WHATEVER THE SERVICE is, so not undoing the unbreak now.

In T204729#4594275, @Marostegui wrote:

https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2088&var-port=13311&from=now-24h&to=now can this be related to the last train?
There is a big change at 1AM

I don't see anything relevant (from the SAL) around that time, and it's a pretty sharp/immediate increase. The train wasn't deployed until 15:00 UTC today. And there was no train last week (DC switchover).

• jcrespo edited projects, added MediaWiki-General, Performance Issue, MediaWiki-libs-Rdbms; removed Release-Engineering-Team.Sep 18 2018, 3:41 PM

This code hasn't changed in a while, and isn't called by much.

https://codesearch.wmflabs.org/search/?q=countVisitingWatchersMultiple&i=nope&files=&repos= reveals a single use:

ApiQueryInfo - https://gerrit.wikimedia.org/g/mediawiki/core/+/8fb14da2d28a2b39197c83db8c66daad0e397298/includes/api/ApiQueryInfo.php#904

Could it be someone on the API?

In T204729#4594401, @Addshore wrote:

This code hasn't changed in a while, and isn't called by much.

https://codesearch.wmflabs.org/search/?q=countVisitingWatchersMultiple&i=nope&files=&repos= reveals a single use:

ApiQueryInfo - https://gerrit.wikimedia.org/g/mediawiki/core/+/8fb14da2d28a2b39197c83db8c66daad0e397298/includes/api/ApiQueryInfo.php#904

Could it be someone on the API?

It is only hitting recentchanges slaves at the moment

• Banyek subscribed.Sep 18 2018, 3:50 PM

Could it be someone on the API?

Yes, that would be my working theory, a bad performance on an uncommonly run query, which should be limited now that it is run by someone once per second.

We are now killing over 1 query per second on rc changes, which is not an ok solution:

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2085&var-port=13311&panelId=10&fullscreen&from=now-1h&to=now

For now it should be easily fixable by moving the load to vslow, similar to this: https://github.com/wikimedia/mediawiki/blob/master/includes/changetags/ChangeTags.php#L1573 until there is a proper fix.

by "fix" I mean removing the potential DoS vector point. Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?

In T204729#4594445, @Ladsgroup wrote:

Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?

I checked the watchlist table and it is the same in eqiad an codfw for the recentchanges hosts.
Also we'd have seen that probably as soon as we got codfw active I guess :-(

In T204729#4594347, @Marostegui wrote:

SELECT /* WatchedItemStore::countVisitingWatchersMultiple  */  wl_namespace,wl_title,COUNT(*) AS `watchers`  FROM `watchlist`     GROUP BY wl_namespace,wl_title HAVING COUNT(*) >= 30

No WHERE clause? That's certainly not good.

It looks like if an empty array is passed as $targetsWithVisitThresholds to countVisitingWatchersMultiple(), it'll happily query all titles in the database instead of querying no titles as it would seem to be intended. That should be easy enough to fix.

In T204729#4594401, @Addshore wrote:

This code hasn't changed in a while, and isn't called by much.

https://codesearch.wmflabs.org/search/?q=countVisitingWatchersMultiple&i=nope&files=&repos= reveals a single use:

ApiQueryInfo - https://gerrit.wikimedia.org/g/mediawiki/core/+/8fb14da2d28a2b39197c83db8c66daad0e397298/includes/api/ApiQueryInfo.php#904

Could it be someone on the API?

Looking at the errors in Kibana, they all seem to be going via that code path. The few URLs I spot-checked are passing invalid titles, which would give an empty $targetsWithVisitThresholds.

In T204729#4594445, @Ladsgroup wrote:

by "fix" I mean removing the potential DoS vector point. Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?

There is no index that can optimize a SELECT without a WHERE, I think that query shouldn't be able to run on a synchronous webrequest (it could run on a cached cron like some other slow queries).

Looking at the code path this will only be hit with action=query prop=info and with the visitingwatchers option.

My crappy command shows a single UA calling this:

addshore@stat1005:~$ kafkacat -b kafka-jumbo1002.eqiad.wmnet -t webrequest_text -q | grep "en.wikipedia" | grep "api.php" | grep "action=query" | grep "prop=info" | grep "visitingwatchers"

Sample request:

{P7564}

Oh, this got filed as a security bug, because of the DOS potential. See https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/461153 for a fix.

For roots: and (non root to request the query killer to be disabled)- pt-kill is running locally on the rc hosts as root on enwiki codfw, make sure to drop the screens if/when that is fixed.

I'v +2ed the change posted by @Anomie and will put it on the deployed branches shortly (just waiting on jenkins)

In T204729#4594487, @Addshore wrote:

I'v +2ed the change posted by @Anomie and will put it on the deployed branches shortly (just waiting on jenkins)

Cool - let me know here or on IRC when merged, so I can kill the query killers and see if we are good
Thanks a lot guys

Addshore added a subscriber: WMDE-leszek.Sep 18 2018, 4:36 PM

Patches now deployed to both branches:

2018-09-18
16:45 addshore@deploy1001: Synchronized php-1.32.0-wmf.22/includes/watcheditem/WatchedItemStore.php: WatchedItemStore::countVisitingWatchersMultiple() fix T204729 (duration: 00m 57s)
16:43 addshore@deploy1001: Synchronized php-1.32.0-wmf.20/includes/watcheditem/WatchedItemStore.php: WatchedItemStore::countVisitingWatchersMultiple() fix T204729 (duration: 00m 59s)

I have killed query killers on db2088 and db2085. Now monitoring the graphs to see if this is fixed.

Addshore added a project: User-Addshore.Sep 18 2018, 4:49 PM

Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.

This looks solved so far.

db2085: https://grafana.wikimedia.org/render/dashboard-solo/db/mysql?panelId=3&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2085&var-port=13311&from=1537278729923&to=1537289529923&width=1000&height=500&tz=UTC%2B02%3A00

db2088: https://grafana.wikimedia.org/render/dashboard-solo/db/mysql?panelId=3&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2088&var-port=13311&from=1537278754322&to=1537289554322&width=1000&height=500&tz=UTC%2B02%3A00

And for the record, these were the query killers used:

while true; do pt-kill --victims=all --kill --print --busy-time 1 --match-info='WatchedItemStore::countVisitingWatchersMultiple' --match-user=wikiuser S=/run/mysqld/mysqld.s1.sock; sleep 100; done

I am going to give it some more more hours before closing the ticket as resolved just in case, but everything looks good so far now.
Thanks a lot @Addshore and @Anomie for getting this quickly tackled

• Marostegui closed this task as Resolved.Sep 18 2018, 7:26 PM

• Marostegui assigned this task to Anomie.

In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?

Yeah, this may need followup for software security backporting- being filled by DBAs and fixed by technology, the focus was mostly on wikimedia DOS vector and not on the software bug. This probably need further consideration for the product side of things.

In T204729#4595739, @Legoktm wrote:

In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?

It was introduced in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/277488/
Based on githubs info https://github.com/wikimedia/mediawiki/commit/7d4e225cb90ac275c6077e830d8b63678c3ea9dc
It is included in the following tags: master 1.31.0 1.31.0-rc.2 1.31.0-rc.1 1.31.0-rc.0 1.30.0 1.30.0-rc.0 1.29.2 1.29.1 1.29.0 1.29.0-rc.1 1.29.0-rc.0 1.28.3 1.28.2 1.28.1 1.28.0 1.28.0-rc.1 1.28.0-rc.0 1.27.4 1.27.3 1.27.2 1.27.1 1.27.0 1.27.0-rc.1 1.27.0-rc.0
I guess we should drop it on REL 131, 130, 129, 128, 127?

In T204729#4595925, @Addshore wrote:

In T204729#4595739, @Legoktm wrote:

In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?

It was introduced in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/277488/
Based on githubs info https://github.com/wikimedia/mediawiki/commit/7d4e225cb90ac275c6077e830d8b63678c3ea9dc
It is included in the following tags: master 1.31.0 1.31.0-rc.2 1.31.0-rc.1 1.31.0-rc.0 1.30.0 1.30.0-rc.0 1.29.2 1.29.1 1.29.0 1.29.0-rc.1 1.29.0-rc.0 1.28.3 1.28.2 1.28.1 1.28.0 1.28.0-rc.1 1.28.0-rc.0 1.27.4 1.27.3 1.27.2 1.27.1 1.27.0 1.27.0-rc.1 1.27.0-rc.0
I guess we should drop it on REL 131, 130, 129, 128, 127?

1.28 is EOL already, so we don't need to worry about that. I don't really want to squeeze this into this week's security release since the prep work for that is basically already done, at which point 1.29 is EOL as well. So we could cherry-pick it next week for 1.31, 1.30, and 1.27, after the release? cc @Reedy

Addshore moved this task from Watching 👀 to Closing ✔️ on the User-Addshore board.Sep 20 2018, 1:09 PM

Cherry picked

https://gerrit.wikimedia.org/r/461800
https://gerrit.wikimedia.org/r/461799
https://gerrit.wikimedia.org/r/461794

Legoktm added a parent task: T205041: Tracking bug for 1.27.6/1.30.2/1.31.2/1.32.2 security release.Sep 23 2018, 7:58 PM

Legoktm mentioned this in T205048: Obtain CVEs for 1.27.6/1.30.2/1.31.2/1.32.2 security releases.Apr 7 2019, 11:48 PM

Reedy mentioned this in T205041: Tracking bug for 1.27.6/1.30.2/1.31.2/1.32.2 security release.May 28 2019, 2:50 PM

Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)".Jun 6 2019, 4:00 PM

• jcrespo awarded a token.Jun 6 2019, 4:01 PM

Reedy updated the task description. (Show Details)Jun 17 2019, 5:30 PM

• chasemp added a project: Security.Feb 10 2020, 10:50 PM

• chasemp removed a project: acl*security.Feb 20 2020, 8:11 PM

Aklapper removed subscribers: Anomie, • Banyek.Oct 16 2020, 5:38 PM