hhvm memcached and php7 memcached extensions do not play well together
Closed, Resolved · Public

Description

HHVM memcached extension does not set compression flags properly, see upstream bug https://github.com/facebook/hhvm/issues/8028
php_memcached 3.x expects this flag to be set and fails to decompress values saved with the HHVM extension.
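For anyone who wants to see the failure mode in isolation, here is a minimal sketch of a lenient reader versus a strict reader. It is a simplified model, not the code of either extension: the flag names and bit values are hypothetical, and the real extensions use their own constants and payload layout. The assumption is that the writer marks the value as compressed but omits the compression-type flag.

```python
import zlib
from typing import Tuple

# Hypothetical bit values, chosen for illustration only. The real
# extensions define their own constants and bit layout.
FLAG_COMPRESSED = 1 << 0
FLAG_COMPRESSION_ZLIB = 1 << 1


def store_value(raw: bytes, set_type_flag: bool) -> Tuple[bytes, int]:
    """Compress a value and return (payload, flags) as a writer would."""
    flags = FLAG_COMPRESSED
    if set_type_flag:
        flags |= FLAG_COMPRESSION_ZLIB
    return zlib.compress(raw), flags


def lenient_read(payload: bytes, flags: int) -> bytes:
    """Reader that assumes zlib whenever the value is marked compressed."""
    if flags & FLAG_COMPRESSED:
        return zlib.decompress(payload)
    return payload


def strict_read(payload: bytes, flags: int) -> bytes:
    """Reader that requires an explicit compression-type flag."""
    if not flags & FLAG_COMPRESSED:
        return payload
    if flags & FLAG_COMPRESSION_ZLIB:
        return zlib.decompress(payload)
    raise ValueError("compressed value carries no compression-type flag")
```

In this model, a value written with `set_type_flag=False` is readable through `lenient_read` but raises in `strict_read`, while a value written with the flag is readable by both.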

The long-term fix is to get the upstream HHVM patch merged. The short-term fix is to build our own packages with that patch. That still leaves all the existing keys in the caches, which were written without the flag. If the longest-lived of those take too long to expire, we should also build our own php_memcached packages for 3.0.x and deploy those until all affected keys are gone.

ArielGlenn triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Jan 13 2018, 5:31 PM

More information on this issue is at https://phabricator.wikimedia.org/T184258#3895249

Patch for php_memcached attached in case we go that route, but also for information purposes.

Adding @aaron who once knew how long the longest-lived memcached key is. Also, if there's still no persistent storage for these caches, maybe this is a moot point.

@aaron Do you have an idea what the longest cache expiration times are? (See Ariel's question above)

Note that at the SRE team meeting we proposed maintaining a separate memcached pool for php7 appservers when it gets to that. We'd start with a tiny pool of one or two instances in ganeti for the dumps, no patches needed.

aaron added a comment. Feb 7 2018, 10:07 PM

Lots of keys use no TTL value, 0, or TTL_INDEFINITE (all of which mean infinite), so there will be a lot of old keys.

If you want the oldest key actually in a server, I'm not sure how to get that; using KEYS is not a great idea, and only WANCache always stores "asOf" timestamps, not BagOStuff, though I suppose most caching uses the former.

What did you want to do with the cache key age information?

Just to know how long it would take before the oldest keys in one of the caches go away, if we were to start putting keys with the right compression flags in these caches from a certain point on.

aaron added a comment. Feb 7 2018, 10:23 PM

I see, hhvm works with and without the flags, so they could be set in the background.

For WANObjectCache keys, the use of "hotTTR" (on by default) means that keys with a 1 Hz access rate would clear out in 15 minutes. A key getting 1 hit per hour (1/3600 Hz) would take (900 * 3600 / 86400) = 37.5 days to refresh. Since it's randomized, some keys would be refreshed later than that, but those are the expected values. Given that, I'd assume that keys in any real use would be silently updated before any php7 switch if given a month or so. We'd want to double-check that, of course.
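The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the linear model implied by the comment (expected refresh time = hotTTR divided by access rate), not of how WANObjectCache actually schedules refreshes. The function name and the 900-second default are illustrative, with the default taken from the 15-minute figure quoted above.

```python
SECONDS_PER_DAY = 86400
HOT_TTR_SECONDS = 900  # 15 minutes, the figure quoted in the comment


def expected_refresh_days(hits_per_second: float,
                          hot_ttr: float = HOT_TTR_SECONDS) -> float:
    """Expected time in days until a key is refreshed, assuming the
    expected refresh time scales as hot_ttr / access rate."""
    if hits_per_second <= 0:
        raise ValueError("access rate must be positive")
    return hot_ttr / hits_per_second / SECONDS_PER_DAY


# A key hit once per second: about 15 minutes, expressed in days.
print(expected_refresh_days(1.0))
# A key hit once per hour: roughly 37.5 days.
print(expected_refresh_days(1 / 3600))
```

Keys that are never read have no access rate, so this model says nothing about them; those only go away through their TTL, eviction, or a restart.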

BagOStuff callers (to memcached) are trickier when they have high nominal TTLs. There are a lot of set() callers, so it would take a while to check them all in my IDEA (though I can see the list easily enough).

Change 409138 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/ShortUrl@master] Add TTL to set() call

https://gerrit.wikimedia.org/r/409138

Change 409088 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/CloseWikis@master] Add TTL to set() call

https://gerrit.wikimedia.org/r/409088

Change 409088 merged by jenkins-bot:
[mediawiki/extensions/CloseWikis@master] Add TTL to set() call

https://gerrit.wikimedia.org/r/409088

Change 409133 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/OpenStackManager@master] Add TTL to set() call

https://gerrit.wikimedia.org/r/409133

Change 409138 merged by jenkins-bot:
[mediawiki/extensions/ShortUrl@master] Add TTL to set() call

https://gerrit.wikimedia.org/r/409138

Change 409133 merged by jenkins-bot:
[mediawiki/extensions/OpenStackManager@master] Add TTL to set() call

https://gerrit.wikimedia.org/r/409133

Imarlier moved this task from Inbox to Radar on the Performance-Team board. Feb 26 2018, 9:36 PM
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.
Paladox added a subscriber: Paladox. Apr 9 2018, 4:30 PM

Mentioned in SAL (#wikimedia-releng) [2018-04-10T13:41:24Z] <moritzm> upgraded HHVM on mediawiki-deployment04/05/06 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)

Mentioned in SAL (#wikimedia-operations) [2018-04-10T13:41:27Z] <moritzm> upgraded HHVM on mediawiki-deployment04/05/06 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)

I've run some tests on snapshot01 in deployment-prep. I made sure I have the vanilla php-memcached installed. The errors I used to see, e.g.

PHP Warning: Memcached::getMulti(): could not decompress value: unrecognised encryption type

are now gone, for abstracts dumps and page content dumps. That's pretty conclusive. Thumbs up!

Mentioned in SAL (#wikimedia-operations) [2018-04-16T14:12:21Z] <moritzm> upgraded HHVM on mediawiki-deployment-09 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)

Mentioned in SAL (#wikimedia-operations) [2018-04-16T15:25:12Z] <moritzm> upgraded HHVM on mediawiki-deployment-07 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)

Mentioned in SAL (#wikimedia-operations) [2018-04-16T17:11:43Z] <moritzm> upgraded HHVM on mediawiki-jobrunner03 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)

Joe added a subscriber: Joe. Apr 17 2018, 10:28 AM

I performed various functional tests in deployment-prep.

Test conditions:

  • deployment-mediawiki-07 (mw07) running an unpatched HHVM on stretch
  • deployment-mediawiki-09 (mw09) running a patched HHVM on stretch
  • Both servers running a Debian vanilla php 7.0

Tests:

  • Set 100 keys from mw07 via HHVM, successfully read all of them from mw09 with HHVM, and failed to read them with php7.0
  • Set 100 keys from mw09 via HHVM, successfully read all of them from mw07 with HHVM and php7.0
  • Set 100 keys from mw09 via PHP 7.0, successfully read all of them from mw07 and mw09 with HHVM

All of those tests passed correctly, so I expect the interactions between the different versions to behave as expected, and we should proceed with the upgrade.
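For reference, the test matrix above can be modelled with a small in-memory sketch. Everything here is a stand-in: the dict replaces memcached, the writer and reader names are labels rather than real clients, and the model assumes that unpatched HHVM omits the compression-type flag, that patched HHVM and PHP 7.0 set it, and that only the PHP 7.0 reader requires it.

```python
from typing import Dict, Tuple

# key -> (value, has_type_flag); a stand-in for a memcached instance
Store = Dict[str, Tuple[str, bool]]

# Assumed behaviour of each client in this simplified model.
WRITERS = {"unpatched-hhvm": False, "patched-hhvm": True, "php7": True}
READERS = {"hhvm": False, "php7": True}  # True = needs the type flag


def set_keys(store: Store, prefix: str, count: int,
             writes_type_flag: bool) -> None:
    """Write `count` keys under `prefix`, recording whether the flag is set."""
    for i in range(count):
        store[f"{prefix}:{i}"] = (f"value-{i}", writes_type_flag)


def count_readable(store: Store, prefix: str, count: int,
                   needs_type_flag: bool) -> int:
    """Count how many of the keys under `prefix` a given reader can decode."""
    readable = 0
    for i in range(count):
        entry = store.get(f"{prefix}:{i}")
        if entry is None:
            continue
        _value, has_type_flag = entry
        if has_type_flag or not needs_type_flag:
            readable += 1
    return readable


def run_matrix(count: int = 100) -> Dict[Tuple[str, str], int]:
    """Return {(writer, reader): readable key count} for every combination."""
    store: Store = {}
    for writer, writes_flag in WRITERS.items():
        set_keys(store, writer, count, writes_flag)
    return {
        (writer, reader): count_readable(store, writer, count, needs_flag)
        for writer in WRITERS
        for reader, needs_flag in READERS.items()
    }
```

Under these assumptions, the only combination in `run_matrix()` with unreadable keys is the unpatched HHVM writer paired with the php7 reader, which matches the results reported above.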

Once we've upgraded all of production, we should slowly rolling-restart the memcached cluster, in order to clean up old cache values with 100% certainty.

Mentioned in SAL (#wikimedia-operations) [2018-04-18T13:17:12Z] <moritzm> uploaded HHVM 3.18.5+dfsg-1+wmf7+deb9u1 to apt.wikimedia.org/stretch-wikimedia (includes a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854))

Mentioned in SAL (#wikimedia-operations) [2018-04-18T13:44:51Z] <moritzm> uploaded HHVM 3.18.5+dfsg-1+wmf7+icu57 to apt.wikimedia.org/jessie-wikimedia (includes a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854))

Mentioned in SAL (#wikimedia-operations) [2018-04-19T08:14:07Z] <moritzm> upgrading app server canaries to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build (T184854)

Mentioned in SAL (#wikimedia-operations) [2018-04-19T09:03:20Z] <moritzm> upgrading API server canaries to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build (T184854)

Mentioned in SAL (#wikimedia-operations) [2018-04-23T08:08:57Z] <_joe_> restarting memcached in codfw (T184854)

All hosts running HHVM (with the exception of snapshot100[5-7]) have been upgraded to the new HHVM builds.

Mentioned in SAL (#wikimedia-operations) [2018-04-23T09:05:48Z] <_joe_> AMEND: restart memcached on mc1019 (T184854)

Mentioned in SAL (#wikimedia-operations) [2018-04-23T09:56:30Z] <_joe_> restarting memcached on mc1020-1036 at 1 hour intervals - T184854

Joe added a comment. Apr 24 2018, 5:07 AM

The rolling restart of all memcacheds is done. This ticket might be considered resolved.

Joe closed this task as Resolved. Apr 24 2018, 5:07 AM
ArielGlenn moved this task from Watching now to Done on the User-ArielGlenn board. Aug 7 2018, 10:00 AM