
Spam graphite metrics from MediaWiki.objectcache
Closed, ResolvedPublicSecurity

Description

Starting around Nov 10th there is significant metric spam created in graphite:

10/11/2022 23:59:47 :: creating database metric MediaWiki.objectcache.5573333_a2cee473_6153_11ed_b4fb_1866da9950fa.set_call_rate.mean (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=None)
10/11/2022 23:59:49 :: creating database metric MediaWiki.objectcache.5469095_9d719f01_6153_11ed_bede_4cd98f892406.set_call_rate.rate (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=None)
10/11/2022 23:59:49 :: creating database metric MediaWiki.objectcache.0_9054aaf3_6153_11ed_b4fb_1866da9950fa.get_hit_rate.sum (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.0 agg=sum)
10/11/2022 23:59:49 :: creating database metric MediaWiki.objectcache.0_9054aaf3_6153_11ed_b4fb_1866da9950fa.get_hit_rate.count (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.0 agg=sum)
10/11/2022 23:59:49 :: creating database metric MediaWiki.objectcache.5508934_a793970e_6153_11ed_8808_2cea7f7fc983.set_call_rate.upper (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=max)
10/11/2022 23:59:49 :: creating database metric MediaWiki.objectcache.5508934_a793970e_6153_11ed_8808_2cea7f7fc983.set_call_rate.mean (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=None)
10/11/2022 23:59:49 :: creating database metric MediaWiki.objectcache.5508934_a793970e_6153_11ed_8808_2cea7f7fc983.set_call_rate.count (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.0 agg=sum)
10/11/2022 23:59:49 :: creating database metric MediaWiki.objectcache.5565942_b8dcaa01_6153_11ed_bede_4cd98f892406.set_bytes_sent.upper (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=max)
10/11/2022 23:59:50 :: creating database metric MediaWiki.objectcache.3878227_a06c89ae_6153_11ed_ad1d_4cd98faf5af0.set_bytes_sent.sum (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.0 agg=sum)
10/11/2022 23:59:50 :: creating database metric MediaWiki.objectcache.3878227_a06c89ae_6153_11ed_ad1d_4cd98faf5af0.set_bytes_sent.count (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.0 agg=sum)
10/11/2022 23:59:50 :: creating database metric MediaWiki.objectcache.3818463_a5313d4e_6153_11ed_897a_4cd98f8911b3.set_call_rate.upper (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=max)
10/11/2022 23:59:50 :: creating database metric MediaWiki.objectcache.3818463_a5313d4e_6153_11ed_897a_4cd98f8911b3.set_call_rate.lower (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=min)
10/11/2022 23:59:50 :: creating database metric MediaWiki.objectcache.5572906_b5e1b330_6153_11ed_ad4d_1866da99521a.set_bytes_sent.rate (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=None)
...

Using 40% of disk space in 10d:

2022-11-18-101302_1294x939_scrot.png (939×1 px, 61 KB)
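To put the disk usage in context: the archive spec in the creation log lines fixes the size of every new Whisper file up front. A rough sketch of the cost per metric, using the standard Whisper layout (16-byte metadata header, 12 bytes per archive header, 12 bytes per datapoint):

```python
# Estimate the on-disk size of one Whisper file created with the
# archive layout shown in the carbon log lines above.
# Whisper stores a 16-byte metadata header, a 12-byte header per
# archive, and 12 bytes per datapoint (4-byte timestamp + 8-byte double).

ARCHIVES = [(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)]

POINT_SIZE = 12
METADATA_SIZE = 16
ARCHIVE_HEADER_SIZE = 12

def whisper_file_size(archives):
    points = sum(count for _, count in archives)
    return METADATA_SIZE + ARCHIVE_HEADER_SIZE * len(archives) + POINT_SIZE * points

size = whisper_file_size(ARCHIVES)
print(f"{size} bytes (~{size / 1024 / 1024:.2f} MiB) per metric file")  # → 331000 bytes, ~0.32 MiB
```

At ~0.32 MiB per file, and with each spammy cache key fanning out into several metric files (`.mean`, `.rate`, `.count`, `.upper`, ...), the disk fills up quickly.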

I'm not sure about the root cause, but looking at the commit history I see your changes to objectcache, @aaron, which seem to line up timing-wise with this. Could you take a look? cc @colewhite too

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-11-18T09:21:52Z] <godog> nuke MediaWiki.objectcache.*_11ed_* - T323357
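For the record, a cleanup like the SAL entry above boils down to deleting the matching Whisper directories under the metric hierarchy. A hypothetical Python sketch (the whisper root path and the `nuke` helper are assumptions for illustration, not the actual command that was run):

```python
import fnmatch
import os
import shutil

# Hypothetical sketch of the cleanup behind the SAL entry above:
# remove Whisper data for metric nodes matching MediaWiki.objectcache.*_11ed_*.
# WHISPER_ROOT is an assumption; adjust to the host's actual carbon storage dir.
WHISPER_ROOT = "/var/lib/carbon/whisper"

def nuke(root, pattern="*_11ed_*"):
    # Each metric node is a directory under MediaWiki/objectcache/;
    # delete every node whose name matches the spam pattern.
    base = os.path.join(root, "MediaWiki", "objectcache")
    removed = []
    for name in sorted(os.listdir(base)):
        if fnmatch.fnmatch(name, pattern):
            shutil.rmtree(os.path.join(base, name))
            removed.append(name)
    return removed
```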

colewhite set Security to Software security bug.Nov 18 2022, 10:22 PM
colewhite added projects: Security, Security-Team.
colewhite changed the visibility from "Public (No Login Required)" to "Custom Policy".
colewhite changed the subtype of this task from "Task" to "Security Issue".

I was able to locate the source of the [01]_[a-z0-9]{8}_[a-z0-9]{4}_11ed_[a-z0-9]{4}_[a-z0-9]{12} metrics: they come from SimpleParsoidOutputStash:49. These metrics can be generated by requesting data from the DiscussionTools extension API endpoint discussiontoolspageinfo.
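The `11ed` segment is what gives these away as version-1 UUIDs (the version nibble plus the high timestamp bits land on `1ed...` for UUIDs generated in late 2022), with underscores substituted for the hyphens. A quick check on one of the key segments from the log above:

```python
import uuid
from datetime import datetime, timezone

# One of the key segments from the carbon log above, with the
# underscores turned back into hyphens after the leading id.
segment = "0_9054aaf3_6153_11ed_b4fb_1866da9950fa"
u = uuid.UUID(segment.split("_", 1)[1].replace("_", "-"))

# "11ed" carries the version nibble (1) plus the high bits of a
# UUIDv1 timestamp that falls in late 2022.
assert u.version == 1

# Convert the 100-nanosecond Gregorian-epoch timestamp to Unix time.
GREGORIAN_OFFSET = 0x01B21DD213814000
unix_seconds = (u.time - GREGORIAN_OFFSET) / 1e7
print(datetime.fromtimestamp(unix_seconds, tz=timezone.utc))
```

Since a fresh UUID is generated per request, every request mints a brand-new metric name, which is exactly the unbounded-cardinality behaviour Graphite can't cope with.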

I'm guessing it's important to generate BagOStuff cache keys through a call to makeKey() so that the cache correctly tags the cache keys.
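To illustrate the point (a simplified, hypothetical analogue, not MediaWiki's actual implementation): makeKey() puts a fixed collection name in a known position of the key, and the stats code can use that segment as the metric name. An ad-hoc key that puts a per-request UUID in that position becomes its own metric:

```python
# A simplified, hypothetical analogue of how a BagOStuff-style cache
# might derive a stats metric name from a cache key. Keys built via
# make_key() share one "collection" segment, so all operations roll up
# under a single metric; ad-hoc keys that embed a per-request UUID in
# that position create a brand-new metric per key.

def make_key(collection, *components):
    # Loosely mirrors BagOStuff::makeKey(): a fixed collection name
    # first, variable parts (page id, render id, ...) after it.
    return ":".join(["global", collection, *components])

def stats_name(key):
    # Take the metric name from the collection segment of the key;
    # a key with no recognizable structure is used as-is.
    return key.split(":")[1] if ":" in key else key

good = make_key("ParsoidOutputStash", "12345", "9054aaf3-6153-11ed-b4fb-1866da9950fa")
bad = "12345_9054aaf3_6153_11ed_b4fb_1866da9950fa"  # ad-hoc concatenation

print(stats_name(good))  # → ParsoidOutputStash (one stable metric)
print(stats_name(bad))   # the whole ad-hoc key becomes the metric name
```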

Marking security as it's a Graphite DoS vector.

colewhite added a project: Parsoid.

Great investigation @colewhite, thank you! @daniel does the above ring a bell? we're getting graphite spammed with metrics like MediaWiki.objectcache.3818463_a5313d4e_6153_11ed_897a_4cd98f8911b3.set_bytes_sent

> Great investigation @colewhite, thank you! @daniel does the above ring a bell? we're getting graphite spammed with metrics like MediaWiki.objectcache.3818463_a5313d4e_6153_11ed_897a_4cd98f8911b3.set_bytes_sent

I'm surprised by the log spam, but if it comes from SimpleParsoidOutputStash, then the config change you mention will have triggered it.

I'm surprised at the volume, though. How does DiscussionTools trigger this? Do you have a full stack trace? DiscussionTools shouldn't be using the "stash" flag...

So it appears that DiscussionTools uses a method defined by VisualEditor for loading page HTML content: ApiParsoidTrait::requestRestbasePageHtml. This method is intended for fetching HTML for editing; however, DiscussionTools is using it to get rendered HTML for any purpose.

The problem is that ApiParsoidTrait::requestRestbasePageHtml has a side effect: it "stashes" the rendered HTML, so that when we later save and need to convert the HTML back to wikitext, we have the exact rendering the edit is based on. VisualEditor needs this, and DiscussionTools needs it for editing.

But DiscussionTools doesn't need the stash for showing page info or diffs or just returning rendered content. It should not be using VisualEditor's method for getting the rendered HTML in these cases. By doing so, it floods the "stash" with renderings we never need again. It's not just the log: this also puts pressure on the actual cache backend, causing other content to be evicted.

I suppose this isn't new, it just used to be invisible: VisualEditor was (and still is, on all wikis except mediawiki.org and testwiki) backed by RESTbase, and the stashing happens in RESTbase. I guess that stash was always being flooded by DiscussionTools there, but we never noticed.

The solution is to make DiscussionTools use a different mechanism for fetching HTML.
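For clarity, the stash pattern described above can be sketched like this (hypothetical Python, not the actual SimpleParsoidOutputStash code):

```python
# A hypothetical sketch of the stash pattern described above: the
# rendered HTML is stashed only when fetched *for editing*, so the
# exact rendering can be matched up again at save time. Fetching the
# same HTML just to display it should bypass the stash entirely.

class ParsoidStash:
    def __init__(self):
        self.stash = {}  # render_id -> html; stands in for the cache backend

    def fetch_html(self, page, for_edit=False):
        render_id = f"{page}-render"   # placeholder for a real ETag/render id
        html = f"<html>{page}</html>"  # placeholder for a real Parsoid render
        if for_edit:
            # Only editing flows need the stash; stashing on every read
            # floods the backend and evicts entries that are still needed.
            self.stash[render_id] = html
        return render_id, html

    def save_edit(self, render_id, new_wikitext):
        # At save time, recover the exact rendering the edit was based on
        # so the HTML-to-wikitext conversion round-trips cleanly.
        original_html = self.stash.pop(render_id)
        return original_html, new_wikitext
```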

Thank you for digging deep on this @daniel! I don't have the full stack trace, though I'd like to clarify that this is metric spam (i.e. new graphite metrics created quite frequently) with a modicum of log spam on the graphite hosts (minimal in comparison to the metrics). In short, to my untrained eye it looks like we're spamming the MediaWiki.objectcache hierarchy with ever-changing cache keys, for which graphite happily creates a new metric each time a new key shows up. HTH!

> Thank you for digging deep on this @daniel! I don't have the full stack trace though I'd like to clarify this is metric spam (i.e. new graphite metrics created quite frequently) with a modicum of log spam on graphite hosts (though that's minimal in comparison to the metrics).

The ever-changing hashes in the metric keys are odd indeed. That issue is unrelated to the one we identified in DiscussionTools; it seems like both issues are coming together to create the metric spam. I will have a deeper look at that one.

Are you sure that this code path is at fault? There is also a call in the persistRevisionThreadItems.php script – currently running: T315510#8392683 – to HookUtils::parseRevisionParsoidHtml(), which uses the same RESTBase API calls I believe (through ParsoidClient).

(I'm writing another patch for this case. If this part turns out to be the problem, we'll need to stop and re-start the script. The timing doesn't exactly match though, it was started on Nov 14.)

The "top level" issue (bad cache keys) should be fixed by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/859045

Please review.

Btw, public patches seem ok to me: it's not obvious from the patch how to exploit this, and we should have it fixed in a couple of hours.

I can confirm the metric spam is gone and we have a single MediaWiki.objectcache.ParsoidOutputStash metric as expected. I'll leave the task open for the follow-ups; feel free to resolve as needed. Thanks again @colewhite and @daniel for your help here!

> (I'm writing another patch for this case. If this part turns out to be the problem, we'll need to stop and re-start the script. The timing doesn't exactly match though, it was started on Nov 14.)

For the record, we did NOT re-start the script yet. The Graphite issue is fixed without that. If it turns out to be necessary, it can happen later.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/859045 has greatly reduced the number of metric namespaces generated. Thank you!

The deploy completed at 2022-11-21 @ 14:34. Since then we've seen three more uuid metrics created:

21/11/2022 15:42:57 :: new metric MediaWiki.objectcache.5496602_01495c06_69b3_11ed_95c8_d7cf03334e91.set_call_rate.rate matched schema zzdefault
21/11/2022 15:42:57 :: new metric MediaWiki.objectcache.5496602_01495c06_69b3_11ed_95c8_d7cf03334e91.set_call_rate.rate matched aggregation schema zzdefault
21/11/2022 15:42:57 :: creating database metric MediaWiki.objectcache.5496602_01495c06_69b3_11ed_95c8_d7cf03334e91.set_call_rate.rate (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=None)
21/11/2022 15:44:50 :: new metric MediaWiki.objectcache.5196869_52595006_69b3_11ed_95c8_d7cf03334e91.set_bytes_sent.lower matched schema zzdefault
21/11/2022 15:44:50 :: new metric MediaWiki.objectcache.5196869_52595006_69b3_11ed_95c8_d7cf03334e91.set_bytes_sent.lower matched aggregation schema lower
21/11/2022 15:44:50 :: creating database metric MediaWiki.objectcache.5196869_52595006_69b3_11ed_95c8_d7cf03334e91.set_bytes_sent.lower (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=min)
21/11/2022 15:44:52 :: new metric MediaWiki.objectcache.5196869_52595006_69b3_11ed_95c8_d7cf03334e91.set_call_rate.mean matched schema zzdefault
21/11/2022 15:44:52 :: new metric MediaWiki.objectcache.5196869_52595006_69b3_11ed_95c8_d7cf03334e91.set_call_rate.mean matched aggregation schema zzdefault
21/11/2022 15:44:52 :: creating database metric MediaWiki.objectcache.5196869_52595006_69b3_11ed_95c8_d7cf03334e91.set_call_rate.mean (archive=[(60, 10080), (300, 4032), (900, 2880), (3600, 8760), (86400, 1825)] xff=0.01 agg=None)

I think https://gerrit.wikimedia.org/r/c/mediawiki/core/+/859050 will sufficiently mitigate the cause.

Krinkle subscribed.

Regarding Graphite spam due to messy cache keys, we also noticed something in WikibaseMediaInfo:
T323340: Stop producing and cleanup wbmi-mediasearch-entities_* spam on WANObjectCache grafana dashboard

> Btw public patches seem ok to me, it's not obvious from the patch how to exploit this, and we should have it fixed in a couple of hours.

Yes, this is/was completely fine.

No new uuid entries have appeared in the namespace for a week. Should we consider this task resolved?

It would be nice if someone could make this task public, as I seem to lack permissions to do it.

fgiunchedi changed the visibility from "Custom Policy" to "Public (No Login Required)".Nov 30 2022, 8:52 AM

> It would be nice if someone could make this task public, as I seem to lack permissions to do it.

{{done}}