Page MenuHomePhabricator

MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off)
Closed, ResolvedPublic

Description

During 2022-06-30's UTC late deployment, a config patch was synced:

Abridged deployment transcript
[thcipriani@deploy1002 mediawiki-staging (master u=)]$ scap sync-file wmf-config/InitialiseSettings.php 'Config: [[gerrit:809165|Enable Wikistories on idwiki (T311143)]]'
...
20:55:35 Started sync-masters
sync-masters: 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
sync-pull-masters: 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
sync-testservers: 100% (in-flight: 0; ok: 4; fail: 0; left: 0)
sync-canaries: 100% (in-flight: 0; ok: 9; fail: 0; left: 0)
20:55:46 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 9 host(s)  [canaries]
sync-proxies: 100% (in-flight: 0; ok: 8; fail: 0; left: 0)
sync-apaches: 100% (in-flight: 0; ok: 348; fail: 0; left: 0)
20:56:36 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 307 host(s)
20:59:05 Finished php-fpm-restarts (duration: 02m 29s)
20:59:05 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:809165|Enable Wikistories on idwiki (T311143)]] (duration: 03m 31s)

(full transcript is in P30698)

However, a couple of minutes later, it was observed the change isn't fully live:

21:12 <stephanebisson> thcipriani somehow my patch doesn't seem to be sync'd everywhere. When I refresh, sometimes the code is there sometimes it isn't. Is there a long replication delay or could there be a problem?

This affected at least mw1414 and mw1369:

21:16 <urbanecm> stephanebisson: would you mind sharing which mw server works correctly and which one doesn't? should be available in the `server` header in your devtools
21:18 <stephanebisson> urbanecm mw1414.eqiad.wmnet (not updated)
21:18 <stephanebisson> urbanecm mw1413.eqiad.wmnet (updated)
21:20 <stephanebisson> urbanecm mw1369 also not up to date
StateHosts
goodmw1413
badmw1414, mw1369

For the record mw1414 is an appserver canary, but mw1369 and mw1413 are not canaries.

The code on all of those servers is the same:

urbanecm@notebook  ~
$ ssh mw1413.eqiad.wmnet md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php
cab73e4a083586b3f97260f3634a1414  /srv/mediawiki/wmf-config/InitialiseSettings.php
urbanecm@notebook  ~
$ ssh mw1414.eqiad.wmnet md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php
cab73e4a083586b3f97260f3634a1414  /srv/mediawiki/wmf-config/InitialiseSettings.php
urbanecm@notebook  ~
$ ssh mw1369.eqiad.wmnet md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php
cab73e4a083586b3f97260f3634a1414  /srv/mediawiki/wmf-config/InitialiseSettings.php
urbanecm@notebook  ~
$

This is likely caused by some issue in the PHP-fpm restarts (either the restart command not arriving at all necessary hosts, or the restart script not working under certain circumstances).

The same issue also happened on 2022-06-29 (see logs from https://wm-bot.wmcloud.org/browser/index.php?start=06%2F29%2F2022&end=06%2F29%2F2022&display=%23wikimedia-operations, search for dancy and MatmaRex after 14:35 on that day).

Event Timeline

Urbanecm renamed this task from scap does not fully deploy its code in some cases to scap does not fully deploy MW code in some cases.Jun 30 2022, 9:33 PM
dancy updated the task description. (Show Details)

checking /var/log/php7.2-fpm on mw1369 I see a restart around deployment time:

Jun 30 20:57:41 mw1369 php7.2-fpm[31932]: [NOTICE] Terminating ...
Jun 30 20:57:41 mw1369 php7.2-fpm[31932]: [NOTICE] exiting, bye-bye!
Jun 30 20:57:41 mw1369 php7.2-fpm[28511]: [NOTICE] fpm is running, pid 28511

on mw1413:

Jun 30 20:56:45 mw1413 php7.2-fpm[8816]: [NOTICE] Terminating ...
Jun 30 20:56:45 mw1413 php7.2-fpm[8816]: [NOTICE] exiting, bye-bye!
Jun 30 20:56:45 mw1413 php7.2-fpm[31601]: [NOTICE] fpm is running, pid 31601

on mw1414:

Jun 30 20:56:25 mw1414 php7.2-fpm[38105]: [NOTICE] Terminating ...
Jun 30 20:56:25 mw1414 php7.2-fpm[38105]: [NOTICE] exiting, bye-bye!
Jun 30 20:56:25 mw1414 php7.2-fpm[45005]: [NOTICE] fpm is running, pid 45005
Jun 30 20:56:25 mw1414 php7.2-fpm[45005]: [NOTICE] ready to handle connections
dancy triaged this task as High priority.Jun 30 2022, 10:51 PM
dancy added subscribers: Krinkle, Joe.

@Joe @Krinkle Jeena's evidence above shows that the php-fpm restart happened on the afflicted hosts. At this point I don't know what's going on but we're definitely in a state of decreased deployment confidence.

I think deployment of 809961 experienced a similar issue.

It's working on mwdebug1002 but not on production.

At a moment it worked but later it didn't so I'm assuming that I checked the debug server by mistake.

Krinkle renamed this task from scap does not fully deploy MW code in some cases to MW wmf-config tmp cache stays outdated after Scap deploy.Jun 30 2022, 11:40 PM
Krinkle renamed this task from MW wmf-config tmp cache stays outdated after Scap deploy to MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off).
Krinkle claimed this task.

Okay, let's consider the following timeline for a typical config change as it was prior to last week. This means prior to doing php restarts and prior to turning off live opcache revalidation.

  1. Deployment server: The config file holds version A on disk.
  2. Web server: Disk file = A, opcache = A, tmp-cache = A.
  3. Web request:
    • executes code A from opcache,
    • mtime of disk file returns A,
    • tmp-cache returns A, disk.mtime == tmp.mtime at this point, we decide tmp-cache is valid,
    • all is well!
  4. Deployment server: Run git pull. Disk file = B.
  5. Deployment server: Run scap sync-file.
  6. Web server: Receives sync. Disk file = B, opcache = A, tmp-cache = A.
  7. Web request:
    • executes code A from opcache,
    • mtime of disk file returns B,
    • tmp-cache returns A, disk.time > tmp.mtime but disk.time + opcache.revalidate_freq > now() still in the future so we don't write to tmp-cache with the understanding that if we were to compute config now we get result A since we are running code A from opcache.
  8. Web request after a few seconds have passed:
    • executes code B from opcache (since revalidation_freq has passed, so we recompiled by now).
    • mtime of disk file returns B,
    • tmp-cache returns A, disk.time > tmp.mtime and disk.time + opcache.revalidate_freq < now() so it's safe to recompute and save to cache.

The tmp-cache logic resides in multiversion/MWConfigCacheGenerator.php. It is my objective in T169821 to remove the need for this cache entirely.

With the above information, and the knowledge that "turning off live opcache revalidation" means setting revalidate_freq = 0, it becomes logical then that we are now computing and storing indefinitely, an outdated version of the config right after a sync.

Okay, let's consider the following timeline for a typical config change as it was prior to last week. This means prior to doing php restarts and prior to turning off live opcache revalidation.

  1. Deployment server: The config file holds version A on disk.
  2. Web server: Disk file = A, opcache = A, tmp-cache = A.
  3. Web request:
    • executes code A from opcache,
    • mtime of disk file returns A,
    • tmp-cache returns A, disk.mtime == tmp.mtime at this point, we decide tmp-cache is valid,
    • all is well!
  4. Deployment server: Run git pull. Disk file = B.
  5. Deployment server: Run scap sync-file.
  6. Web server: Receives sync. Disk file = B, opcache = A, tmp-cache = A.
  7. Web request:
    • executes code A from opcache,
    • mtime of disk file returns B,
    • tmp-cache returns A, disk.time > tmp.mtime but disk.time + opcache.revalidate_freq > now() still in the future so we don't write to tmp-cache with the understanding that if we were to compute config now we get result A since we are running code A from opcache.
  8. Web request after a few seconds have passed:
    • executes code B from opcache (since revalidation_freq has passed, so we recompiled by now).
    • mtime of disk file returns B,
    • tmp-cache returns A, disk.time > tmp.mtime and disk.time + opcache.revalidate_freq < now() so it's safe to recompute and save to cache.

The tmp-cache logic resides in multiversion/MWConfigCacheGenerator.php. It is my objective in T169821 to remove the need for this cache entirely.

With the above information, and the knowledge that "turning off live opcache revalidation" means setting revalidate_freq = 0, it becomes logical then that we are now computing and storing indefinitely, an outdated version of the config right after a sync.

If we want to preserve the same working logic as before, we'd have to increase the time delay (previously set to revalidate_freq) to cover the amount of time needed for FPM to have restarted itself. That's a messy thing to try to predict and setting it too high means that 1) you can't test things intuitively on mwdebug 2) you can't verify things intuitively in prod immediately and 3) we'd be running new opcache code after FPM restart but continue to use old possibly incompatible config.

The solution as such is not immediately obvious to me, other than to touch IS.php and sync again. For now I'll expedite T169821 instead.

Thanks for the analysis Krinkle.

A 10 seconds grace period, which was the previous value for revalidate_freq, would be ok I think in the meantime.

Mentioned in SAL (#wikimedia-operations) [2022-07-11T20:12:53Z] <krinkle@deploy1002> Synchronized wmf-config/InitialiseSettings.php: I82262ef6773ab228 try again ref T311788 (duration: 03m 07s)

Change 815317 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds

https://gerrit.wikimedia.org/r/815317

Change 815317 merged by jenkins-bot:

[operations/mediawiki-config@master] MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds

https://gerrit.wikimedia.org/r/815317

Mentioned in SAL (#wikimedia-operations) [2022-07-19T21:26:47Z] <dancy@deploy1002> Synchronized multiversion/MWConfigCacheGenerator.php: Config: [[gerrit:815317|MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds (T311788)]] (duration: 02m 59s)

The same situation happened today during a deployment, so we're not out of the woods yet.

Clarifying for the record: Opcache revalidation is turned off by setting opcache.validate_timestamps to 0. opcache.revalidate_freq still has a value of 10. So, the change I made in https://gerrit.wikimedia.org/r/815317 is effectively a non-change.

If we want to preserve the same working logic as before, we'd have to increase the time delay (previously set to revalidate_freq) to cover the amount of time needed for FPM to have restarted itself. That's a messy thing to try to predict and setting it too high means that 1) you can't test things intuitively on mwdebug 2) you can't verify things intuitively in prod immediately and 3) we'd be running new opcache code after FPM restart but continue to use old possibly incompatible config.

After looking at MWConfigCacheGenerator.php more carefully I think that using a generous (1-2 minute) grace period would be OK. According to the code, if the mtime field of the cache doesn't match $confActualMtime (the max mtime of InitialiseSettings.php, logos.php and includes/Defines.php), the cached config will be ignored and it will generate a fresh array of information and use it for the request. The cache is simply ineffective during the grace period. Once the grace period expires it will update the cache. As long as that is well after php-fpm restart, we're good. @Krinkle lemme know what you think.

@dancy As long as it means incurring cache misses instead of serving stale - Sounds good to me! We can afford a minute of forced cache misses, as we're already quite close to removing the cache anyway. I'm waiting for one more optimisation to the cache miss path, after which it will be faster than a cache hit (T169821). This because cache hits involve reading and parsing a JSON file.

Change 816022 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] MWConfigCacheGenerator.php: Use grace period of 3 minutes

https://gerrit.wikimedia.org/r/816022

I landed on a grace period of 3 minutes because the php-fpm-restart phase of scap currently takes about 2.5 minutes to complete.

Change 816022 merged by jenkins-bot:

[operations/mediawiki-config@master] MWConfigCacheGenerator.php: Use grace period of 3 minutes

https://gerrit.wikimedia.org/r/816022

Change 579653 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] multiversion: Remove use of the $globals temporary JSON cache file

https://gerrit.wikimedia.org/r/579653

Change 579653 merged by jenkins-bot:

[operations/mediawiki-config@master] multiversion: Remove use of the $globals temporary JSON cache file

https://gerrit.wikimedia.org/r/579653