This is a follow-up from https://wikitech.wikimedia.org/wiki/Incident_documentation/20200522-thumbnails.
Various bits of information about media files are expensive to compute and/or require cross-database connections. As such, they are cached on demand in Memcached. Because media files are used from multiple wiki contexts (e.g. Commons), these cache keys include the wiki they are about, and are stored in the wikifarm cache namespace (aka "global" cache keys).
As part of the above incident, it was found that these cache keys were built with outdated formatting logic, which hardcoded how shared keys work, using a format that has been unsupported for some years. It was similar enough to the supported format that it still worked under normal conditions, but it broke when we did routine Memcached maintenance.
To avoid this in the future, the logic should be audited, understood, and updated. It should then be simplified to make future maintenance easier, so that the "local" and "foreign" key formats match not by accident but explicitly, through sharing the same code path.
The other issue we found is that the order of segments in these cache keys is currently confusing our cache monitoring. The key group must be the first segment so that statistics can be attributed to it and extracted from it. Right now, these statistics are attributed to 900+ disparate key groups named after the wiki ID (which happens to be the first segment). See Grafana dashboard: WANObjectCache for an example.
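To illustrate the monitoring problem, here is a minimal Python sketch (not the MediaWiki code). It assumes keys are colon-separated and that a stats collector takes the segment right after the keyspace prefix as the key group. The names `make_global_key`, `key_group_of`, the `global` prefix, and the `filerepo-file` group are hypothetical and chosen only for illustration.

```python
from collections import Counter

KEYSPACE = "global"  # hypothetical prefix for wikifarm-wide keys


def make_global_key(key_group, *components):
    """Build a colon-separated key: keyspace, key group, then components."""
    return ":".join([KEYSPACE, key_group, *components])


def key_group_of(key):
    """Return the segment right after the keyspace, as a stats collector might."""
    return key.split(":")[1]


wikis = ["commonswiki", "enwiki", "dewiki"]

# Confusing order: the wiki ID sits where the key group is expected.
old_keys = [make_global_key(wiki, "file", "abc123") for wiki in wikis]

# Intended order: a fixed key group first, then the wiki ID.
new_keys = [make_global_key("filerepo-file", wiki, "abc123") for wiki in wikis]

# Count how many distinct key groups the collector would see in each case.
old_groups = Counter(key_group_of(key) for key in old_keys)
new_groups = Counter(key_group_of(key) for key in new_keys)
```

With the wiki ID first, the collector sees one key group per wiki. With a fixed key group first, all keys are attributed to a single group regardless of how many wikis exist.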
- rMW96ad0db6b482: filerepo: use makeGlobalKey() in ForeignDBViaLBRepo::getSharedCacheKey() / https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598118
- rMW2d1c2154fa8a: filerepo: bump LocalFile::VERSION following 88e17d3f7c78 / https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598190
- rMW88e17d3f7c78: filerepo: make LocalRepo::getSharedCacheKey() use makeGlobalKey() / https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598182
- Update ForeignDBFileRepo logic to match that of ForeignDBViaLBRepo. 58c94afe2e / https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598183
- Move shared logic to the shared base class (currently duplicated in multiple places).
- Fix statsd metrics to use standard namespacing (currently pollutes the WANObjectCache stats dashboard with wiki IDs).
- Clean up statsd pollution
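
As a rough illustration of the "shared base class" item, the following Python sketch shows the general shape: the key format lives in one method on a base class, so local and foreign repos produce matching keys by construction rather than by coincidence. All class and method names here are hypothetical stand-ins, not the actual MediaWiki classes, and the key layout is assumed.

```python
class FileRepoSketch:
    """Hypothetical base class that owns the single shared-key code path."""

    def __init__(self, wiki_id):
        self.wiki_id = wiki_id

    def get_shared_cache_key(self, kind, *components):
        # Assumed layout: keyspace, key group, wiki ID, then the remaining parts.
        return ":".join(["global", f"filerepo-{kind}", self.wiki_id, *components])


class LocalRepoSketch(FileRepoSketch):
    """Repo for files on the current wiki; inherits the key logic unchanged."""


class ForeignRepoSketch(FileRepoSketch):
    """Repo for files from another wiki; inherits the same key logic."""


# Commons viewing its own file, and another wiki viewing the same Commons file.
local_key = LocalRepoSketch("commonswiki").get_shared_cache_key("file", "abc123")
foreign_key = ForeignRepoSketch("commonswiki").get_shared_cache_key("file", "abc123")
```

Because neither subclass overrides the method, a change to the key format applies to both at once.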