MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off)
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Urbanecm
	Jun 30 2022, 9:29 PM

Description

During 2022-06-30's UTC late deployment, a config patch was synced:

Abridged deployment transcript

[thcipriani@deploy1002 mediawiki-staging (master u=)]$ scap sync-file wmf-config/InitialiseSettings.php 'Config: [[gerrit:809165|Enable Wikistories on idwiki (T311143)]]'
...
20:55:35 Started sync-masters
sync-masters: 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
sync-pull-masters: 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
sync-testservers: 100% (in-flight: 0; ok: 4; fail: 0; left: 0)
sync-canaries: 100% (in-flight: 0; ok: 9; fail: 0; left: 0)
20:55:46 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 9 host(s)  [canaries]
sync-proxies: 100% (in-flight: 0; ok: 8; fail: 0; left: 0)
sync-apaches: 100% (in-flight: 0; ok: 348; fail: 0; left: 0)
20:56:36 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 307 host(s)
20:59:05 Finished php-fpm-restarts (duration: 02m 29s)
20:59:05 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:809165|Enable Wikistories on idwiki (T311143)]] (duration: 03m 31s)

(full transcript is in P30698)

However, a couple of minutes later, it was observed the change isn't fully live:

21:12 <stephanebisson> thcipriani somehow my patch doesn't seem to be sync'd everywhere. When I refresh, sometimes the code is there sometimes it isn't. Is there a long replication delay or could there be a problem?

This affected at least mw1414 and mw1369:

21:16 <urbanecm> stephanebisson: would you mind sharing which mw server works correctly and which one doesn't? should be available in the `server` header in your devtools
21:18 <stephanebisson> urbanecm mw1414.eqiad.wmnet (not updated)
21:18 <stephanebisson> urbanecm mw1413.eqiad.wmnet (updated)
21:20 <stephanebisson> urbanecm mw1369 also not up to date

State	Hosts
good	mw1413
bad	mw1414, mw1369

For the record mw1414 is an appserver canary, but mw1369 and mw1413 are not canaries.

The code on all of those servers is the same:

urbanecm@notebook  ~
$ ssh mw1413.eqiad.wmnet md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php
cab73e4a083586b3f97260f3634a1414  /srv/mediawiki/wmf-config/InitialiseSettings.php
urbanecm@notebook  ~
$ ssh mw1414.eqiad.wmnet md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php
cab73e4a083586b3f97260f3634a1414  /srv/mediawiki/wmf-config/InitialiseSettings.php
urbanecm@notebook  ~
$ ssh mw1369.eqiad.wmnet md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php
cab73e4a083586b3f97260f3634a1414  /srv/mediawiki/wmf-config/InitialiseSettings.php
urbanecm@notebook  ~
$

This is likely caused by some issue in the PHP-fpm restarts (either the restart command not arriving at all necessary hosts, or the restart script not working under certain circumstances).

The same issue also happened on 2022-06-29 (see logs from https://wm-bot.wmcloud.org/browser/index.php?start=06%2F29%2F2022&end=06%2F29%2F2022&display=%23wikimedia-operations, search for dancy and MatmaRex after 14:35 on that day).

Details

Subject	Repo	Branch	Lines +/-
multiversion: Remove use of the $globals temporary JSON cache file	operations/mediawiki-config	master	+6 -37
MWConfigCacheGenerator.php: Use grace period of 3 minutes	operations/mediawiki-config	master	+6 -10
MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds	operations/mediawiki-config	master	+8 -2

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	Krinkle	T212460 Adopt static array files for local disk storage of values (epic)
Open	None	T99740 Use static php array files for l10n cache at WMF (instead of CDB)
Resolved	Krinkle	T245183 PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.)
Resolved	Krinkle	T253673 Avoid php-opcache corruption in WMF production
Resolved	Joe	T266055 Update Scap to perform rolling restart for all MW deploy
Resolved	Krinkle	T311788 MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off)

Event Timeline

Urbanecm created this task.Jun 30 2022, 9:29 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 30 2022, 9:29 PM

brennen subscribed.Jun 30 2022, 9:30 PM

Zabe subscribed.Jun 30 2022, 9:31 PM

Urbanecm renamed this task from scap does not fully deploy its code in some cases to scap does not fully deploy MW code in some cases.Jun 30 2022, 9:33 PM

dancy subscribed.Jun 30 2022, 9:34 PM

Stang subscribed.Jun 30 2022, 9:48 PM

dancy updated the task description. (Show Details)Jun 30 2022, 9:53 PM

dancy updated the task description. (Show Details)Jun 30 2022, 10:14 PM

dancy updated the task description. (Show Details)

dancy updated the task description. (Show Details)Jun 30 2022, 10:23 PM

checking /var/log/php7.2-fpm on mw1369 I see a restart around deployment time:

Jun 30 20:57:41 mw1369 php7.2-fpm[31932]: [NOTICE] Terminating ...
Jun 30 20:57:41 mw1369 php7.2-fpm[31932]: [NOTICE] exiting, bye-bye!
Jun 30 20:57:41 mw1369 php7.2-fpm[28511]: [NOTICE] fpm is running, pid 28511

on mw1413:

Jun 30 20:56:45 mw1413 php7.2-fpm[8816]: [NOTICE] Terminating ...
Jun 30 20:56:45 mw1413 php7.2-fpm[8816]: [NOTICE] exiting, bye-bye!
Jun 30 20:56:45 mw1413 php7.2-fpm[31601]: [NOTICE] fpm is running, pid 31601

on mw1414:

Jun 30 20:56:25 mw1414 php7.2-fpm[38105]: [NOTICE] Terminating ...
Jun 30 20:56:25 mw1414 php7.2-fpm[38105]: [NOTICE] exiting, bye-bye!
Jun 30 20:56:25 mw1414 php7.2-fpm[45005]: [NOTICE] fpm is running, pid 45005
Jun 30 20:56:25 mw1414 php7.2-fpm[45005]: [NOTICE] ready to handle connections

dancy updated the task description. (Show Details)Jun 30 2022, 10:33 PM

dancy added a parent task: T266055: Update Scap to perform rolling restart for all MW deploy.Jun 30 2022, 10:47 PM

@Joe @Krinkle Jeena's evidence above shows that the php-fpm restart happened on the afflicted hosts. At this point I don't know what's going on but we're definitely in a state of decreased deployment confidence.

I think deployment of 809961 experienced a similar issue.

It's working on mwdebug1002 but not on production.

At a moment it worked but later it didn't so I'm assuming that I checked the debug server by mistake.

DDeSouza mentioned this in T311015: Deploy QuickSurvey on Japanese Wikipedia.Jun 30 2022, 11:33 PM

Okay, let's consider the following timeline for a typical config change as it was prior to last week. This means prior to doing php restarts and prior to turning off live opcache revalidation.

Deployment server: The config file holds version A on disk.
Web server: Disk file = A, opcache = A, tmp-cache = A.
Web request:
- executes code A from opcache,
- mtime of disk file returns A,
- tmp-cache returns A, disk.mtime == tmp.mtime at this point, we decide tmp-cache is valid,
- all is well!
Deployment server: Run git pull. Disk file = B.
Deployment server: Run scap sync-file.
Web server: Receives sync. Disk file = B, opcache = A, tmp-cache = A.
Web request:
- executes code A from opcache,
- mtime of disk file returns B,
- tmp-cache returns A, disk.time > tmp.mtime but disk.time + opcache.revalidate_freq > now() still in the future so we don't write to tmp-cache with the understanding that if we were to compute config now we get result A since we are running code A from opcache.
Web request after a few seconds have passed:
- executes code B from opcache (since revalidation_freq has passed, so we recompiled by now).
- mtime of disk file returns B,
- tmp-cache returns A, disk.time > tmp.mtime and disk.time + opcache.revalidate_freq < now() so it's safe to recompute and save to cache.

The tmp-cache logic resides in multiversion/MWConfigCacheGenerator.php. It is my objective in T169821 to remove the need for this cache entirely.

With the above information, and the knowledge that "turning off live opcache revalidation" means setting revalidate_freq = 0, it becomes logical then that we are now computing and storing indefinitely, an outdated version of the config right after a sync.

In T311788#8042832, @Krinkle wrote:

Okay, let's consider the following timeline for a typical config change as it was prior to last week. This means prior to doing php restarts and prior to turning off live opcache revalidation.

Deployment server: The config file holds version A on disk.

Web server: Disk file = A, opcache = A, tmp-cache = A.

Web request:

executes code A from opcache,

mtime of disk file returns A,

tmp-cache returns A, disk.mtime == tmp.mtime at this point, we decide tmp-cache is valid,

all is well!

Deployment server: Run git pull. Disk file = B.

Deployment server: Run scap sync-file.

Web server: Receives sync. Disk file = B, opcache = A, tmp-cache = A.

Web request:

executes code A from opcache,

mtime of disk file returns B,

tmp-cache returns A, disk.time > tmp.mtime but disk.time + opcache.revalidate_freq > now() still in the future so we don't write to tmp-cache with the understanding that if we were to compute config now we get result A since we are running code A from opcache.

Web request after a few seconds have passed:

executes code B from opcache (since revalidation_freq has passed, so we recompiled by now).

mtime of disk file returns B,

tmp-cache returns A, disk.time > tmp.mtime and disk.time + opcache.revalidate_freq < now() so it's safe to recompute and save to cache.

The tmp-cache logic resides in multiversion/MWConfigCacheGenerator.php. It is my objective in T169821 to remove the need for this cache entirely.

With the above information, and the knowledge that "turning off live opcache revalidation" means setting revalidate_freq = 0, it becomes logical then that we are now computing and storing indefinitely, an outdated version of the config right after a sync.

If we want to preserve the same working logic as before, we'd have to increase the time delay (previously set to revalidate_freq) to cover the amount of time needed for FPM to have restarted itself. That's a messy thing to try to predict and setting it too high means that 1) you can't test things intuitively on mwdebug 2) you can't verify things intuitively in prod immediately and 3) we'd be running new opcache code after FPM restart but continue to use old possibly incompatible config.

The solution as such is not immediately obvious to me, other than to touch IS.php and sync again. For now I'll expedite T169821 instead.

Thanks for the analysis Krinkle.

A 10 seconds grace period, which was the previous value for revalidate_freq, would be ok I think in the meantime.

RhinosF1 subscribed.Jul 1 2022, 3:47 PM

Krinkle moved this task from Inbox, needs triage to Doing: Prio Interrupt on the Performance-Team board.Jul 5 2022, 6:28 PM

Mentioned in SAL (#wikimedia-operations) [2022-07-11T20:12:53Z] <krinkle@deploy1002> Synchronized wmf-config/InitialiseSettings.php: I82262ef6773ab228 try again ref T311788 (duration: 03m 07s)

Lucas_Werkmeister_WMDE subscribed.Jul 14 2022, 1:31 PM

Change 815317 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds

https://gerrit.wikimedia.org/r/815317

gerritbot added a project: Patch-For-Review.Jul 19 2022, 3:29 PM

Change 815317 merged by jenkins-bot:

[operations/mediawiki-config@master] MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds

https://gerrit.wikimedia.org/r/815317

Mentioned in SAL (#wikimedia-operations) [2022-07-19T21:26:47Z] <dancy@deploy1002> Synchronized multiversion/MWConfigCacheGenerator.php: Config: [[gerrit:815317|MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds (T311788)]] (duration: 02m 59s)

Maintenance_bot removed a project: Patch-For-Review.Jul 19 2022, 9:31 PM

dancy closed this task as Resolved.Jul 19 2022, 9:33 PM

The same situation happened today during a deployment, so we're not out of the woods yet.

Clarifying for the record: Opcache revalidation is turned off by setting opcache.validate_timestamps to 0. opcache.revalidate_freq still has a value of 10. So, the change I made in https://gerrit.wikimedia.org/r/815317 is effectively a non-change.

In T311788#8042839, @Krinkle wrote:

If we want to preserve the same working logic as before, we'd have to increase the time delay (previously set to revalidate_freq) to cover the amount of time needed for FPM to have restarted itself. That's a messy thing to try to predict and setting it too high means that 1) you can't test things intuitively on mwdebug 2) you can't verify things intuitively in prod immediately and 3) we'd be running new opcache code after FPM restart but continue to use old possibly incompatible config.

After looking at MWConfigCacheGenerator.php more carefully I think that using a generous (1-2 minute) grace period would be OK. According to the code, if the mtime field of the cache doesn't match $confActualMtime (the max mtime of InitialiseSettings.php, logos.php and includes/Defines.php), the cached config will be ignored and it will generate a fresh array of information and use it for the request. The cache is simply ineffective during the grace period. Once the grace period expires it will update the cache. As long as that is well after php-fpm restart, we're good. @Krinkle lemme know what you think.

@dancy As long as it means incurring cache misses instead of serving stale - Sounds good to me! We can afford a minute of forced cache misses, as we're already quite close to removing the cache anyway. I'm waiting for one more optimisation to the cache miss path, after which it will be faster than a cache hit (T169821). This because cache hits involve reading and parsing a JSON file.

Change 816022 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] MWConfigCacheGenerator.php: Use grace period of 3 minutes

https://gerrit.wikimedia.org/r/816022

gerritbot added a project: Patch-For-Review.Jul 21 2022, 5:44 PM

I landed on a grace period of 3 minutes because the php-fpm-restart phase of scap currently takes about 2.5 minutes to complete.

Change 816022 merged by jenkins-bot:

[operations/mediawiki-config@master] MWConfigCacheGenerator.php: Use grace period of 3 minutes

https://gerrit.wikimedia.org/r/816022

Maintenance_bot removed a project: Patch-For-Review.Jul 21 2022, 7:30 PM

dancy closed this task as Resolved.Jul 21 2022, 10:07 PM

Stang unsubscribed.Jul 25 2022, 12:05 AM

Change 579653 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] multiversion: Remove use of the $globals temporary JSON cache file

https://gerrit.wikimedia.org/r/579653

Change 579653 merged by jenkins-bot:

[operations/mediawiki-config@master] multiversion: Remove use of the $globals temporary JSON cache file

https://gerrit.wikimedia.org/r/579653

Maintenance_bot removed a project: Patch-For-Review.Aug 1 2022, 1:31 AM

Krinkle mentioned this in T169821: Try to make wmf-config/wgConf's per-wiki configuration cache redundant.Sep 5 2023, 11:18 PM

MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off)Closed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off)
Closed, ResolvedPublic
Actions

Related Objects
Search...