Page MenuHomePhabricator

Strengthen the shared cache key logic in FileRepo classes
Closed, ResolvedPublic

Description

This is follow-up from https://wikitech.wikimedia.org/wiki/Incident_documentation/20200522-thumbnails.


Different bits of information about media files require expensive computes and/or cross-database connections. As such, this is cached on-demand in Memcached and as media files are used from multiple wiki contexts (e.g. Commons), these cache keys include the wiki they are about, and then stored in the wikifarm cache namespace (aka "global" cache keys).

As part of the above incident it was found that these cache keys were outdated formatting logic, which hardcoded how shared keys work, using a format that has been unsupported for some years. It was similar enough that it still worked under normal conditions, but broke when we did routine Memcached maintenance.

To avoid this in the future, the logic should be audited, understood, and updated. And then simplified so as to make maintenance easier in the future and for the relation between "local" and "foreign" not accidentally the same, but explicitly so through sharing the same code path.

The other issue we found is that the order of key hashing segments is currently confusing our cache monitoring. The key group must be the first segment so that statistics can be attributes and extracted from it. Right now, these stastistics are attributed to 900+ disparate key groups named after the wiki-id (which happens to be the first segment). See Grafana dashboard: WANObjectCache for an example.

Incident mitigation:

  1. rMW96ad0db6b482: filerepo: use makeGlobalKey() in ForiegnDBViaLBRepo::getSharedCacheKey() / https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598118.
  2. rMW2d1c2154fa8a: filerepo: bump LocalFile::VERSION following 88e17d3f7c78 / https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598190
  3. rMW88e17d3f7c78: filerepo: make LocalRepo::getSharedCacheKey() use makeGlobalKey() / https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598182

Proper clean-up:

  • Update ForeignDBFileRepo logic to match that of ForiegnDBViaLBRepo. 58c94afe2e / https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598183
  • Move shared logic to the shared base class (currently duplicated in multiple places).
  • Fix statsd metrics to use standard namespacing (currently pollutes the WANObjectCache stats dashboard with wiki IDs).
  • Clean up statsd pollution

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 520854 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] filerepo: move bits of logic from subclasses up to LocalRepo

https://gerrit.wikimedia.org/r/520854

Change 604161 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] filerepo: clean up shared cache keys to avoid key class metrics clutter

https://gerrit.wikimedia.org/r/604161

I've tagged two old draft commits that Aaron wrote that might serve as a useful starting point. Always happy to help explain and spread knowledge as-needed :)

Change 604161 merged by jenkins-bot:
[mediawiki/core@master] filerepo: clean up shared cache keys to avoid key metrics clutter

https://gerrit.wikimedia.org/r/604161

I guess we're doing this within the team. I'd recommend that for the remaining two points we pair in code review with someone from SD so that we can get a sense of what is and isn't familiar currently and can start some knowledge sharing between us on this area.

After some digging around Special:Newfiles searching for a pattern in the files that fail, and I think I found a smoking gun:

for file in `curl -s https://commons.wikimedia.org/wiki/Special:NewFiles | perl -ne 'print "$1\n" if /href="\/wiki\/File\:(.*?.jpg)".*galleryfilename galleryfilename-truncate/;'`; do curl -sLI https://test.wikipedia.org/wiki/File:"$file" | grep '^HTTP/2' | grep -v ' 301' | sed 's!HTTP/2 !!' | tr -d '\r' | tr '\n' '\t' ;         echo -e "\t\thttps://commons.wikimedia.org/wiki/File:$file" ; done

From the output I checked all files returning a 404 for inclusion, and they all turn out to be uploaded with https://commons.wikimedia.org/wiki/Commons:Cross-wiki_media_upload_tool

I'm sure the tool is fine, but its use invovlement is consistent with the root causes behind T253405 and T267668, which in both cases was a split-brain scenario where Commons and Wikipedia were not in sync on what the shared Memcached key looks like for certain pieces of meta-data. Thus causing English Wikipedia to store its knowledge of a file not existing under a certain cache key, and then Commons not able to purge it during the upload process.

In theory the issue could be triggered for any upload so long as the non-existent file description page would be viewed on a Wikipedia site prior to upload, but this seems very unlikely. In practice it'll only happen when the upload itself is happening across wikis where naturally both sites are involved and perform a lookup at some point.

In T253405, this happened because in wmf-config the Memcached sharding was different on commons as for enwiki. Which should have been fine since we only upgraded the sharing for wiki-local cache keys. However it went wrong because the FileRepo class code for that one Memcached key did something unusual causing it to be treated as a local-wiki cache key.

In T267668, this happened because the LocalRepo and Foreign…Repo classes had their cache key method run out of sync. Ironically, this happened as part of making the methods no longer duplicated. However in de-duplicating the methods (by making them both inherit from the same class), a dynamic instance variable as introduced in the code in question, which was set differently between the two classes. This has since been fixed, and we'll add an integration test to further avoid this in the future. (We didn't do that last time, because at the time it was a configuration issue that can't feasibly be caught by an integration test today.)

aaron updated the task description. (Show Details)
aaron updated the task description. (Show Details)