
WANCache stats missing for 'CirrusSearchParserOutputPageProperties'
Closed, Resolved · Public

Description

Change 924569 in the CirrusSearch extension added use of caching to its jobs, using the name CirrusSearchParserOutputPageProperties for its cache key group.

The change was deployed last week on May 30th. Based on flame graphs from RunSingleJob, we know that the code in question is in fact running in production. The graph shows the modified code of ParserOutputPageProperties::finalize, which calls WANObjectCache->getWithSetCallback.

Problem: The key is not available from the dropdown menu on Grafana: WANObjectCache stats by key group.

What I've checked off so far:

  • It's not a Grafana-related caching issue. When entering CirrusSearchParserOutputPageProperties directly in the dropdown menu as freeform text and pressing return, it shows no data.
  • It's not a Graphite API or replication issue. I checked the server-side at krinkle@graphite1005:/var/lib/carbon/whisper/MediaWiki/wanobjectcache and there is no directory there named CirrusSearchParserOutputPageProperties, nor anything that looks like it.
  • Other stats from CirrusSearch cache keys are working fine, however, and new cache keys from other MW components also appear to work fine.
  • The specific call in MediaWiki code for extracting the key group from the key string seems to work fine:
krinkle@mwmaint1002.eqiad.wmnet$ mwscript eval.php --wiki commonswiki
> $cache = MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
> $key = $cache->makeKey('CirrusSearchParserOutputPageProperties', 100, 100, '2023');
commonswiki:CirrusSearchParserOutputPageProperties:100:100:2023
> print explode( ':', $key, 3 )[1]; // WANObjectCache::determineKeyClassForStats
CirrusSearchParserOutputPageProperties

Event Timeline

Next step in tracing this down: how are the stats sent?

The stats from the WANObjectCache class, like everything in MediaWiki, are emitted using the StatsdDataFactory service. This is injected in ServiceWiring.php via $params['stats']. What's special, though, is that unlike for anything else in MW, the service is injected conditionally here, based on $wgCommandLineMode. This cites T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007:

WANObjectCache.php
class WANObjectCache {
	public function __construct( array $params ) {
		$this->stats = $params['stats'] ?? new NullStatsdDataFactory();
ServiceWiring.php
		$wanParams = $mainConfig->get( MainConfigNames::WANObjectCache ) + [
			'cache' => $store,
			'logger' => $logger,
			'secret' => $mainConfig->get( MainConfigNames::SecretKey ),
		];
		if ( !$GLOBALS[ 'wgCommandLineMode' ] ) {
			// Send the statsd data post-send on HTTP requests; avoid in CLI mode (T181385)
			$wanParams['stats'] = $services->getStatsdDataFactory();

I suspect this is obsolete now, given that later in the same task, a more general solution was found:

[mediawiki/core@master] Try to opportunistically flush statsd data in maintenance scripts
https://gerrit.wikimedia.org/r/394779

Likewise, this would affect jobs as well: they don't run in command-line mode, but they can be long-running and exhaust memory too if we never flush anything. Anyway, while this is tech debt to clean up, it is not an immediate concern. But we can at least verify whether or not it plays a role here.
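For reference, a rough sketch of what that opportunistic flushing could look like in a long-running loop, assuming MediaWiki core's static MediaWiki::emitBufferedStatsdData() helper (illustrative only; this is not the actual jobrunner or maintenance-script code, and $itemsToProcess / processItem() are placeholders):

$services = MediaWiki\MediaWikiServices::getInstance();
$stats = $services->getStatsdDataFactory();
$config = $services->getMainConfig();

foreach ( $itemsToProcess as $item ) {
	processItem( $item ); // placeholder for the real per-item work

	// Send and clear whatever metrics accumulated so far, so the buffer
	// does not grow without bound over the lifetime of the process.
	MediaWiki::emitBufferedStatsdData( $stats, $config );
}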

The way we run jobs right now is a little weird (due to unfinished work from the former Core Platform Team, ref T175146, T246389). We run them over HTTP via /rpc/RunSingleJob.php (View in Gerrit), which is defined in wmf-config. To my surprise, it does make reference to $wgCommandLineMode, although in theory that is only meant to be used as part of an error handler to force printing of plain text instead of HTML.

I decided to log into a jobrunner in production and empirically confirm what the value of get_class( WANObjectCache->stats ) is at runtime, so as to remove any doubt. I picked the server by checking the list (you can use Pybal, or pools.json in Etcd, or Puppet hieradata via Codesearch). I selected mw1439.eqiad.wmnet.

krinkle@mw1439:/srv/mediawiki/php-1.41.0-wmf.22/includes/libs/objectcache/wancache$ sudo -u mwdeploy vim WANObjectCache.php
WANObjectCache.php
  	$this->stats = $params['stats'] ?? new NullStatsdDataFactory();
+ 	wfDebugLog( 'AdHocDebug', __CLASS__ . ' using stats instanceof ' . get_class( $this->stats ) );

After restarting php-fpm (per wikitech:Debugging in production), the following started showing up in the Logstash: mediawiki dashboard, when filtered for channel:AdHocDebug:

Aug 17, 2023 @ 18:27:07.760  AdHocDebug  mw1439  enwiki        WANObjectCache using stats instanceof BufferingStatsdDataFactory
Aug 17, 2023 @ 18:27:07.763  AdHocDebug  mw1439  commonswiki   WANObjectCache using stats instanceof NullStatsdDataFactory
Aug 17, 2023 @ 18:27:07.804  AdHocDebug  mw1439  euwiki        WANObjectCache using stats instanceof BufferingStatsdDataFactory
Aug 17, 2023 @ 18:27:07.821  AdHocDebug  mw1439  enwiktionary  WANObjectCache using stats instanceof BufferingStatsdDataFactory
Aug 17, 2023 @ 18:27:07.822  AdHocDebug  mw1439  ruwiktionary  WANObjectCache using stats instanceof NullStatsdDataFactory
Aug 17, 2023 @ 18:27:07.862  AdHocDebug  mw1439  enwiki        WANObjectCache using stats instanceof NullStatsdDataFactory
Aug 17, 2023 @ 18:27:08.222  AdHocDebug  mw1439  hewiki        WANObjectCache using stats instanceof NullStatsdDataFactory

Screenshot 2023-08-18 at 00.33.07.png (1×2 px, 271 KB)

It seems to be rather unpredictable. Something is causing this to be nondeterministic.

One possible explanation would be something emitting the statsd data in a shutdown handler or object destructor, which would run after the mediawiki-config hack has changed the CLI mode flag.


If I understand correctly this would mean:

  1. A job fails, RPC's top-level catch block handles the exception and sets $wgCommandLineMode = true.
  2. During shutdown, the main service container is somehow told to remove its reference to the WANObjectCache instance.
  3. During shutdown, something makes use of WANObjectCache without having stored a reference to it before, so it calls MediaWikiServices::getMainWANObjectCache, which then re-creates this service with a Null statsd object, given that wgCommandLineMode is now true.
  4. During shutdown, that something also calls wanCache->getWithSetCallback().

Point 2 seems least plausible here, but otherwise this seems theoretically possible.

However, this can't explain the issue in this task, because we're specifically looking for the metrics for CirrusSearchParserOutputPageProperties, which are only produced during the CirrusSearch ElasticaWrite job. The flamegraph linked in the task description shows this logic is part of the job's main business logic, not part of a deferred, shutdown, or destructor callback.

I ran the same instrumentation as T338189#9100702 a second time, this time including a stack trace, to see if I could spot a case where the service is constructed at a time other than during WebStart.php/Setup.php.
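The added line was roughly the following (the trace-capture detail is my reconstruction rather than a quote from the server; it assumes wfDebugLog() plus a throwaway exception to get a backtrace string):

WANObjectCache.php
  	$this->stats = $params['stats'] ?? new NullStatsdDataFactory();
+ 	wfDebugLog( 'AdHocDebug', __CLASS__ . ' using stats instanceof ' . get_class( $this->stats )
+ 		. "\n" . ( new RuntimeException() )->getTraceAsString() );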

After 1 hour, there were still only log entries for the "good" case:

WANObjectCache using stats instanceof BufferingStatsdDataFactory

Screenshot 2023-09-13 at 18.34.43.png (1×2 px, 266 KB)

I realized that I had only updated the "next" WMF branch on disk (php-1.41.0-wmf.26), and as of writing that branch is only on group0+group1. Still, it's interesting that there wasn't a single case of it going wrong.

Instrumenting php-1.41.0-wmf.25 as well didn't make a difference, though.

Screenshot 2023-09-13 at 18.39.32.png (1×2 px, 254 KB)

Given I now have stack traces, and the messages include a reqId, I decided to look for one of the jobs that we expect to run the CirrusSearch code in question, and see if anything else is logged by the same process.

krinkle@mwlog1002:/srv/mw-log$ tail -n1000 JobExecutor.log | fgrep mw1439 | fgrep ElasticaWrite
2023-09-14 01:43:52.558521 [e29e794b-f523-4fb0-98ba-c1439b31be09] mw1439 dewiki 1.41.0-wmf.25 JobExecutor INFO: Finished job execution {"job":"cirrusSearchElasticaWrite
…
\"page_id\":12181,\"namespace\":0,\"namespace_text\":\"\",\"title\":\"Sophienkirche (Dresden)\",
…
"job_type":"cirrusSearchElasticaWrite","job_status":true,"job_duration":0.05435299873352051}

And searching for that reqId on Logstash: mediawiki dashboard yields:

message                                                              count
WANObjectCache using stats instanceof BufferingStatsdDataFactory    134
Parsing {title} was slow, took {time} seconds                        1

Next, instrumenting the class to confirm it really really is called:

@@@ php-1.41.0-wmf.25/extensions/CirrusSearch/includes/BuildDocument/ParserOutputPageProperties.php
@@@ public function finalizeReal(
+ wfDebugLog( 'AdHocDebug', __METHOD__ . ' called T338189' );
  $fieldContent = $wanCache->getWithSetCallback(

And searching for the next relevant reqId, we find:

host:mw1439 AND reqId:"8278535d-850b-47bb-8efe-aba42d37f701"

message                                                                              count
WANObjectCache using stats instanceof BufferingStatsdDataFactory                     748
CirrusSearch\BuildDocument\ParserOutputPageProperties::finalizeReal called T338189   69

To also see the underlying cache traffic for this code path, I temporarily enabled debug logging for the memcached channel on this server:

mw1439:/srv/mediawiki/wmf-config/InitialiseSettings.php
- 'objectcache' => 'warning',
+ 'memcached' => [ 'logstash' => 'debug' ]

And the next log samples:

  • memcached DEBUG WANObjectCache using stats instanceof BufferingStatsdDataFactory
  • memcached DEBUG MemcachedPeclBagOStuff::initializeClient: initializing new client instance.
  • AdHocDebug INFO CirrusSearch\BuildDocument\ParserOutputPageProperties::finalizeReal calls dewiki:CirrusSearchParserOutputPageProperties:544852:224780923:20230913180705:v2
  • memcached DEBUG MemcachedPeclBagOStuff debug: getMulti(WANCache:dewiki:CirrusSearchParserOutputPageProperties:544852:224780923:20230913180705:v2|#|v)

And to close out today's investigation, let's see what ends up in the stats buffer, in case anything doesn't make it there for some reason:

mw1439:/srv/mediawiki/php-1.41.0-wmf.25/includes/libs/Stats/BufferingStatsdDataFactory.php
@@@ class BufferingStatsdDataFactory
  public function timing( $key, $time ) {
    if ( !$this->enabled ) {
      return;
    }
+   if ( str_contains( $key, 'CirrusSearchParserOutputPageProperties' ) ) {
+     wfDebugLog( 'AdHocDebug', __METHOD__ . " $key $time" );
+   }
    $this->buffer[] = [ $key, $time, StatsdDataInterface::STATSD_METRIC_TIMING ];
  }

And the Logstash results:

message:*CirrusSearchParserOutputPageProperties*
CirrusSearch\BuildDocument\ParserOutputPageProperties::finalizeReal calls kowiki:CirrusSearchParserOutputPageProperties:435135:33301359:20230913235350:v2

MemcachedPeclBagOStuff debug: getMulti(WANCache:kowiki:CirrusSearchParserOutputPageProperties:435135:33301359:20230913235350:v2|#|v)

MemcachedPeclBagOStuff debug: add(WANCache:kowiki:CirrusSearchParserOutputPageProperties:435135:33301359:20230913235350:v2|#|v)

fetchOrRegenerate(kowiki:CirrusSearchParserOutputPageProperties:435135:33301359:20230913235350:v2): miss, new value computed

BufferingStatsdDataFactory::timing wanobjectcache.CirrusSearchParserOutputPageProperties.hit.good 0.3960132598877
BufferingStatsdDataFactory::timing wanobjectcache.CirrusSearchParserOutputPageProperties.regen_walltime 233.36410522461
BufferingStatsdDataFactory::timing wanobjectcache.CirrusSearchParserOutputPageProperties.regen_set_delay 233.62493515015

BufferingStatsdDataFactory::timing wanobjectcache.CirrusSearchParserOutputPageProperties.miss.compute 234.08079147339

BufferingStatsdDataFactory::timing wanobjectcache.CirrusSearchParserOutputPageProperties.hit.good 0.81586837768555

Long story short, everything appears to be working fine, but the stats still aren't any closer to appearing on graphite1005:/srv/carbon/whisper/MediaWiki/wanobjectcache or indeed Grafana.

Alright, one more thing then. Let's see whether it actually hits the network socket. I've documented this approach at https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Logging:

krinkle@mw1439:/srv/mediawiki/php-1.41.0-wmf.25$ sudo tcpdump -i any -l -A -s0 port 8125 | head
…
MediaWiki.wanobjectcache.CirrusSearchParserOutputPageProperties.hit.good:0.61202049255371|ms
MediaWiki.CirrusSearch.cloudelastic.updates.all.doc_size:13848|ms
MediaWiki.CirrusSearch.cloudelastic.updates.all.sent:1|c
…

The fraction there definitely looks suspect, so I figured: maybe we've hit some kind of limit that makes statsd ignore the packet completely, not even treating it as a counter or truncating it, but just discarding it? Let's see.

krinkle@mwdebug1001:~$ php -a
Interactive mode enabled

php > $sock = socket_create( AF_INET, SOCK_DGRAM, SOL_UDP );
php > $statsd_host = '10.64.16.81';
php > $statsd_port = 8125;
php > $stat = 'MediaWiki.T338189.wanobjectcache.CirrusSearchParserOutputPageProperties.hit.good:0.1|ms';
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );

After waiting a few seconds for the next 60s flush window to come around, the stats are received and stored without issue:

krinkle@graphite1005:/srv/carbon/whisper/MediaWiki/T338189/wanobjectcache/CirrusSearchParserOutputPageProperties/hit/good$ ls -l
Sep 14 03:07 count.wsp
krinkle@graphite1005:/srv/carbon/whisper/MediaWiki/T338189/wanobjectcache/CirrusSearchParserOutputPageProperties/hit/good$ whisper-dump lower.wsp  | grep -v '0$'
Meta data:
  aggregation method: min
…
Archive 0 data:
0: 1694660820, 0.1000000000000000055511151231257827
1: 1694660880, 0.1000000000000000055511151231257827

So the depth of the metric name, the length of the labels, and the closeness to 0 are all fine. What's left is the length of the fraction.

php > $stat = 'MediaWiki.T338189.wanobjectcache.CirrusSearchParserOutputPageProperties.hit.good:0.61202049255371|ms';
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
krinkle@graphite1005:/srv/carbon/whisper/MediaWiki/T338189/wanobjectcache/CirrusSearchParserOutputPageProperties/hit/good$ whisper-dump lower.wsp  | grep -v '0$'
Meta data:
  aggregation method: min
…
Archive 0 data:
0: 1694660820, 0.1000000000000000055511151231257827
1: 1694660880, 0.1000000000000000055511151231257827
4: 1694661060, 0.61202000000000000845545855554519221

Well, that all seems to work just fine. I'm not sure what to consider next. Maybe the jobrunners are firewalled off from statsd for some reason?

krinkle@mw1439:~$ ping graphite1005.eqiad.wmnet
PING graphite1005.eqiad.wmnet(graphite1005.eqiad.wmnet (2620:0:861:102:10:64:16:81)) 56 data bytes
64 bytes from graphite1005.eqiad.wmnet (2620:0:861:102:10:64:16:81): icmp_seq=1 ttl=63 time=0.862 ms
64 bytes from graphite1005.eqiad.wmnet (2620:0:861:102:10:64:16:81): icmp_seq=2 ttl=63 time=0.265 ms
64 bytes from graphite1005.eqiad.wmnet (2620:0:861:102:10:64:16:81): icmp_seq=3 ttl=63 time=0.196 ms

Nope, that's not it either!

So, after talking to @Joe, we realized there are three things I hadn't done yet.

  1. Do the above manual socket_sendto() call from mw1439's CLI rather than from mwdebug. While a complete firewall is ruled out by the ping (I think?), this could rule out something interfering with (some?) statsd messages specifically and only from jobrunners.
  2. Use the original stat name exactly. So far I've been avoiding interfering with the real data, but I can always rm it from /srv/carbon later. If there is something interfering not just with metrics like this, but with *exactly* this metric for some reason, then it would make sense that my variant with the extra "T338189" segment would not trigger the problem.
  3. I ran tcpdump on the jobrunner, which confirms things are dispatched, but it doesn't prove that anything was actually received. I should run the same tcpdump in reverse on the graphite host. I'm not very fluent in tcpdump, so I don't know off-hand how to do that, but I'm going to guess that, in the same way I can pick network interfaces and ports on the way out, it also has a way of filtering by sender host (or at least sender IP) for incoming packets. That should definitively show whether packets are arriving normally from the jobrunner all the way into statsd.

Manual socket from mw1439 CLI

krinkle@mw1439:~$ php -a
Interactive mode enabled

php > $sock = socket_create( AF_INET, SOCK_DGRAM, SOL_UDP ); $statsd_host = '10.64.16.81'; $statsd_port = 8125;
php > $stat = 'MediaWiki.T338189.wanobjectcache.CirrusSearchParserOutputPageProperties.hit.good:0.1|ms';
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );

Same as yesterday, these resulted in an extra data point being added to the Graphite database without issue. So it's definitely not a general firewall or something else dropping statsd packets that look roughly like this one.

krinkle@graphite1005:/srv/carbon/whisper/MediaWiki/T338189/wanobjectcache/CirrusSearchParserOutputPageProperties/hit/good$ whisper-dump lower.wsp  | grep -v '0$'
Meta data:
  aggregation method: min
…
Archive 0 data:
0: 1694660820, 0.1000000000000000055511151231257827
1: 1694660880, 0.1000000000000000055511151231257827
4: 1694661060, 0.61202000000000000845545855554519221
693: 1694702400, 0.1000000000000000055511151231257827

Increment the actual production metric

So far, I've inserted an extra "T338189" segment into the metric name, but let's try without it. Before I do, confirming once more that there exists no data file at all for the "CirrusSearchParserOutputPageProperties" metric:

krinkle@graphite1005:/srv/carbon/whisper/MediaWiki/wanobjectcache$ ls | grep -i '^c'
CacheAwarePropertyInfoStore
categorytree_html_ajax
centralauth_user
centralnotice
CentralNoticeChoiceData
ChangesListSpecialPage_changeTagListSummary
changeslistspecialpage_changetags
cirrussearch
cirrussearch_boost_templates
cirrussearch_interwiki_matrix
codereview_authors
codereview_stats
codereview_tags
confirmedit
krinkle@mw1439:~$ php -a
Interactive mode enabled

php > $sock = socket_create( AF_INET, SOCK_DGRAM, SOL_UDP ); $statsd_host = '10.64.16.81'; $statsd_port = 8125;
php > $stat = 'MediaWiki.wanobjectcache.CirrusSearchParserOutputPageProperties.hit.good:0.61202049255371|ms';
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );
php > socket_sendto( $sock, $stat, strlen( $stat ), 0, $statsd_host, $statsd_port );

And... that didn't come through. So, if there was any doubt that we're dealing with something oddly specific: it is very oddly specific indeed.

My last resort at this point is to just search all the things and hope to find something related to this somewhere. The phrase wanobjectcache should be fairly unique in our codebases, since we don't really refer to metrics much by name, and most code references use the (case-sensitive) "WANObjectCache", "WANCache", or "MainWANObjectCache".

https://codesearch.wmcloud.org/search/?q=wanobjectcache&files=&excludeFiles=&repos=

Three results:

mediawiki/core: /includes/libs/objectcache/README.md
### wanobjectcache.{kClass}.{cache_action_and_result}

Documentation, okay good.

mediawiki/core: /includes/libs/objectcache/wancache/WANObjectCache.php
$this->stats->timing( "wanobjectcache.$keygroup.hit.good",

The code that is our raison d'être.

And lastly, puppet: graphite/production.pp

operations/puppet: /modules/profile/manifests/graphite/production.pp
# wanobjectcache spams metrics with hex hashes - T178531
['^MediaWiki\.wanobjectcache\.[a-zA-Z0-9]{32}', 'blackhole'],


So, let's untangle this. The comment refers to T178531, which represents a 2017 incident relating to the Graphite service.

The background here is that in Graphite, each metric gets its own database file. This is somewhat similar to how in MySQL each row must go to one of the defined database tables. In Graphite, each data point goes into a metric database file (a so-called "whisper" file), and each metric (the full name, including all dots and segments) corresponds to one such database file. The file is automatically created when the first data point is received for a new/unknown metric.

In Prometheus, data is stored differently, but the same fundamental truth holds: it's important for metric names to be finite and have low cardinality, as otherwise you explode the number of database files/timeseries that have to be created. In other words, a highly variable string like someone's user name, a page title, or a revision ID must not be used as part of a metric name.
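To illustrate the mapping (my own sketch, not code from Graphite or statsd): each full metric name corresponds to one whisper file under the carbon data directory, and statsd expands a timing metric into sub-metrics such as .count and .lower, which is why unbounded metric names translate directly into unbounded files.

// Illustrative sketch only: metric name to whisper file path
// (base directory as seen earlier on graphite1005).
function whisperPathFor( string $metric ): string {
	return '/var/lib/carbon/whisper/' . str_replace( '.', '/', $metric ) . '.wsp';
}

echo whisperPathFor( 'MediaWiki.wanobjectcache.cirrussearch.hit.good.count' );
// /var/lib/carbon/whisper/MediaWiki/wanobjectcache/cirrussearch/hit/good/count.wsp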

In 2017, the BagOStuff::makeKey() method worked differently than it does today. It tolerated special characters in all colon-separated segments, including the first one (which we call the "keygroup"). For example, you might cache something under a key like this:

$cache->getWithSetCallback(
	$cache->makeKey( 'user-editcount', 'v2', $userID, $wikiDomain ),
	$cache::TTL_DAY,
	static function () use ( $userID, $wikiDomain ) {
		// … compute and return the value to cache …
	}
);

In this case, we refer to user-editcount as the keygroup. The other parts may be variable and are automatically hashed if they are too long or otherwise can't be safely encoded into a cache key. The WANObjectCache service then keeps statistics for the Grafana: WANObjectCache by keygroup dashboard by incrementing counters like MediaWiki.wanobjectcache.$keygroup.{ hit, miss } and such. If someone accidentally wrote code such that a variable ended up as the first part of makeKey(), that would cause a potentially unbounded number of metric database files to be created. And that's exactly what happened in 2017, so this problem was "temporarily" fixed (obligatory Tim Starling quote).
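A sketch of that failure mode, with hypothetical variable names (this is not the actual 2017 caller):

// A variable in the first makeKey() segment becomes the stats keygroup,
// producing an unbounded number of metric names (and thus whisper files).
$badKey  = $cache->makeKey( md5( $pageTitle ), 'html' ); // keygroup = a hex hash, different for every page
$goodKey = $cache->makeKey( 'page-html', $pageTitle );   // keygroup = 'page-html', a fixed name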

How? By configuring Graphite to automatically discard all incoming metrics that look like they put the result of md5( $variable ) in a metric name. What does it mean to "look like a hash"? Well, MD5 hashes are always made of hexadecimal characters (a-f and 0-9) and are exactly 32 characters long.

MediaWiki.wanobjectcache.1522a7ce9d46392176d6a7da1a308e0a.hit.good

This can't always be distinguished from real words that happen to use only those letters, such as "deadbeef" or "cafe". But such a word would have to be used as a keygroup in some MediaWiki feature, be fully lowercase, and be exactly 32 characters long. That seems pretty narrow, so perhaps good enough as a "temporary" mitigation.
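For illustration, the intended check boils down to something like this (my sketch, not the actual puppet code):

// A keygroup "looks like an MD5 hash" if it is exactly 32 lowercase hex characters.
var_dump( (bool)preg_match( '/^[0-9a-f]{32}$/', md5( 'Some page title' ) ) );                   // bool(true)
var_dump( (bool)preg_match( '/^[0-9a-f]{32}$/', 'CirrusSearchParserOutputPageProperties' ) );   // bool(false)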

But the blackhole regex configured above made two mistakes:

  1. It matches [a-zA-Z0-9] instead of [a-f0-9], which means clearly non-hex names such as "ResourceLoader" or "ParserCache" can also match.
  2. It lacks a length constraint. {32} means exactly 32 matching characters, but there is no $ or \. after it to enforce that the name isn't longer than that.

The result is that basically any keygroup made of normal letters and digits will be blackholed if it's 32 characters or longer. CirrusSearchParserOutputPageProperties is 38 characters long.
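To make that concrete, a quick check of the current pattern against the metric from this task, alongside the stricter hex-only variant proposed below (my own reproduction in PHP, not output from Graphite):

$metric = 'MediaWiki.wanobjectcache.CirrusSearchParserOutputPageProperties.hit.good';

// Current blackhole pattern: the first 32 characters of the keygroup satisfy
// [a-zA-Z0-9]{32}, so the metric matches and is silently discarded.
var_dump( (bool)preg_match( '/^MediaWiki\.wanobjectcache\.[a-zA-Z0-9]{32}/', $metric ) ); // bool(true)

// Hex-only pattern with a trailing dot: 'C' is not in [a-f0-9], so the metric is kept.
var_dump( (bool)preg_match( '/^MediaWiki\.wanobjectcache\.[a-f0-9]{32}\./', $metric ) );  // bool(false)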

Next steps

One option would be to change the blackhole pattern to be more precise, rejecting only 32-character hex strings and not other names. This is a fairly safe and conservative thing to do, and it would close this task. But it assumes that the 2017 keygroup bug still exists in MediaWiki today, and that we still need protection at the Graphite layer in 2023 for this bug, in this place and only in this place.

- ['^MediaWiki\.wanobjectcache\.[a-zA-Z0-9]{32}', 'blackhole'],
+ ['^MediaWiki\.wanobjectcache\.[a-f0-9]{32}\.',  'blackhole'],

Something a bit more proactive would be to move this protection into the MediaWiki code itself. We'd remove the above line from Graphite's puppet configuration, and add something like the following in MediaWiki's WANObjectCache.php:

includes/libs/objectcache/WANObjectCache.php
  private function determineKeyGroupForStats( $key ) {
  	$parts = explode( ':', $key, 3 );
  	$keygroup = strtr( $parts[1] ?? $parts[0], '.', '_' );

+  	if ( strlen( $keygroup ) === 32 && preg_match( '/^[0-9a-f]{32}$/', $keygroup ) ) {
+ 		return 'WANCache_hex_keygroup';
+ 	}

  	return $keygroup;
  }

Lastly, I looked at how the makeKey() method and its internal helpers have evolved over the years via git-blame, and I see that in static analysis we treat $keygroup as its own parameter. In the case of Memcached, it does still get considered for hashing if it's too long, so the bug is not impossible to re-introduce. But running tcpdump for an hour today on graphite1005 convinced me that, just like the last four comments at T178531#3806072 said in 2017, we no longer have anything violating this in production today.

And, if something does regress at some point, it could just as easily happen in any of the hundreds of other MediaWiki components and extensions that aren't WANObjectCache.

Change 957797 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] graphite: Remove temporary blackhole for wanobjectcache hex-like stats

https://gerrit.wikimedia.org/r/957797

Change 957797 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: Remove temporary blackhole for wanobjectcache hex-like stats

https://gerrit.wikimedia.org/r/957797