Page MenuHomePhabricator

Cache of wmf-config/InitialiseSettings often 1 step behind
Closed, ResolvedPublic

Description

It has happened to me 5 times out of 5 deployments I did in the past 30 days that when changing a value in InitialiseSettings.php, and syncing it out, that about half the servers (possibly more) the change does not apply.

But, it works fine on mwdebug1002. And a second touch+sync it always works on the rest of the fleet as well.

This suggests there might be a resurfaced old bug and/or new bug caused by recent changes to MWConfigCacheGenerator.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Krinkle updated the task description. (Show Details)Oct 21 2019, 9:30 PM
Krinkle updated the task description. (Show Details)Oct 21 2019, 9:32 PM
jijiki added a subscriber: jijiki.Oct 22 2019, 10:25 AM
CDanis added a subscriber: CDanis.Oct 22 2019, 12:23 PM
CDanis added a subscriber: MaxSem.Oct 22 2019, 12:26 PM

This happened yesterday, during firefighting work on {T235949}

All appservers had the same config:

1❌cdanis@cumin1001.eqiad.wmnet ~ πŸ•ŸπŸ΅ sudo !!
2sudo cumin A:mw 'md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php'
3126 hosts will be targeted:
4mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1238-1258,1261-1275,1319-1333].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
5Confirm to continue [y/n]? y
6===== NODE GROUP =====
7(126) mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1238-1258,1261-1275,1319-1333].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
8----- OUTPUT of 'md5sum /srv/medi...liseSettings.php' -----
9b5a01965593c5a02207d93e70c630832 /srv/mediawiki/wmf-config/InitialiseSettings.php
10================
11PASS: |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 100% (126/126) [00:01<00:00, 77.68hosts/s]
12FAIL: | | 0% (0/126) [00:01<?, ?hosts/s]
13100.0% (126/126) success ratio (>= 100.0% threshold) for command: 'md5sum /srv/medi...liseSettings.php'.
14100.0% (126/126) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

And yet a third of the fleet was still broken

20:39:15	<@cdanis>	(78) mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1240,1273,1322].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet                                     
20:39:17	<@cdanis>	these appservers seem fine
20:39:20	<@cdanis>	(48) mw[1238-1239,1241-1258,1261-1272,1274-1275,1319-1321,1323-1333].eqiad.wmnet                                                                                                              
20:39:22	<@cdanis>	these appservers are not

@MaxSem touched InitialiseSettings.php and did another sync-file and that fixed it.

How reliable is the filemtime function in a scap world? Does scap / rsync always migrate the mtime from the host to the target machine?

How reliable is the filemtime function in a scap world? Does scap / rsync always migrate the mtime from the host to the target machine?

rsync certainly can, but won't by default unless you pass -t / --times. (-a / --archive includes this in the set of options it enables.)

I see --archive included in scap/tasks.py DEFAULT_RSYNC_ARGS, so I assume that scap DTRT here?

Krinkle added a comment.EditedOct 24 2019, 3:08 PM

I don't know off-hand whether the mtime comes from the deployment server (set when the last git command saved the file to disk), or whether it is set by the destination server at the time of local sync.

However, we don't currently depend on either model as far as I know. That is, so long as the mtime changes in some way, the code for config cache will handle it. It doesn't assume equality across the cluster and doesn't assume it goes up.

Having said that, it seems most likely that something in the config cache code is broken because it's the only part of this that has changed recently. I've looked at ed5297c107f38d and subsequent refactors by James many times but I don't see anything obvious stand out...

Anomie added a subscriber: Anomie.Nov 18 2019, 3:25 PM

Happened to me today when deploying a config change for T198312. Touching InitialiseSettings.php and re-syncing did fix it.

Addshore added subscribers: Addshore, Daimona, Catrope, Ladsgroup.

Also happened to me twice in the past few days.

about half the servers (possibly more) the change does not apply.

For some of my syncs I was lucky if I could find 1 or 2 servers with the config actually loaded correctly.

How reliable is the filemtime function in a scap world? Does scap / rsync always migrate the mtime from the host to the target machine?

rsync certainly can, but won't by default unless you pass -t / --times. (-a / --archive includes this in the set of options it enables.)

I see --archive included in scap/tasks.py DEFAULT_RSYNC_ARGS, so I assume that scap DTRT here?

Those should be the rsync arguments that scap uses in this case; however, after each sync scap also touches the wmf-config/InitialiseSettings.php file to trigger the re-cache of settings.

This happened yesterday, during firefighting work on {T235949}

All appservers had the same config:

1❌cdanis@cumin1001.eqiad.wmnet ~ πŸ•ŸπŸ΅ sudo !!
2sudo cumin A:mw 'md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php'
3126 hosts will be targeted:
4mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1238-1258,1261-1275,1319-1333].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
5Confirm to continue [y/n]? y
6===== NODE GROUP =====
7(126) mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1238-1258,1261-1275,1319-1333].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
8----- OUTPUT of 'md5sum /srv/medi...liseSettings.php' -----
9b5a01965593c5a02207d93e70c630832 /srv/mediawiki/wmf-config/InitialiseSettings.php
10================
11PASS: |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 100% (126/126) [00:01<00:00, 77.68hosts/s]
12FAIL: | | 0% (0/126) [00:01<?, ?hosts/s]
13100.0% (126/126) success ratio (>= 100.0% threshold) for command: 'md5sum /srv/medi...liseSettings.php'.
14100.0% (126/126) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

And yet a third of the fleet was still broken

20:39:15	<@cdanis>	(78) mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1240,1273,1322].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet                                     
20:39:17	<@cdanis>	these appservers seem fine
20:39:20	<@cdanis>	(48) mw[1238-1239,1241-1258,1261-1272,1274-1275,1319-1321,1323-1333].eqiad.wmnet                                                                                                              
20:39:22	<@cdanis>	these appservers are not

@MaxSem touched InitialiseSettings.php and did another sync-file and that fixed it.

So the actual rsyncing of the file appears to work just fine; however, the file is not loaded from the config correctly.

The next step of troubleshooting is to figure out if this is a problem with fixing the race condition in T217830 (or some subsequent refactor); or something entirely new to do with php caching. I guess that would mean comparing the mtime of /srv/mediawiki/wmf-config/InitialiseSettings.php with /tmp/mw-cache-[versions]/conf2-[wiki].json. If, on a misbehaving appserver, the cache has an older mtime than the IS.php file, then we know regeneration was not triggered. e.g.,

thcipriani@mwdebug1001:~$ sudo -u www-data jq '.mtime' /tmp/mw-cache-1.35.0-wmf.10/conf2-enwiki.json 
1576527000
thcipriani@mwdebug1001:~$ sudo -u www-data php -a
Interactive mode enabled

php > var_dump(filemtime('/srv/mediawiki/wmf-config/InitialiseSettings.php'));
int(1576527000)
php > ^Dthcipriani@mwdebug1001:~$ stat --format '%Y (%y) %n' /srv/mediawiki/wmf-config/InitialiseSettings.php 
1576527000 (2019-12-16 20:10:00.935866457 +0000) /srv/mediawiki/wmf-config/InitialiseSettings.php

Mentioned in SAL (#wikimedia-operations) [2020-01-06T12:31:00Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:562083|Don’t check constraints on P6685 statements]] Bypassing T236104 (duration: 00m 55s)

Catrope added a comment.EditedJan 22 2020, 1:58 AM

I encountered this bug again today, when deploying this config change. My teammates reported that the feature we'd just enabled was working about half the time, which likely means that about half (or at least a significant fraction, much more than 10%) of the app servers had an outdated config, I tried to use dsh to count how many app servers were affected, but I couldn't get keyholder to work. It gave me a "Too many authentication failures" error message for each host.

I asked a teammate to keep refreshing until they got a broken pageview (with the feature disabled), then run mw.config.get( 'wgHostname' ) in the browser console and tell me the output. This gave me the server name of a broken server (mw1319). Using the same process we found that mw1256 was not broken.

I compared the "good" server (mw1256) with the "bad" server (mw1319) and found that:

  • Running echo 'var_dump($wgGEHomepageSuggestedEditsTopicsRequiresOptIn)' | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=cswiki returned false (correct) on the good server, and true (incorrect) on the bad server. This is what I expected.
  • InitialiseSettings.php had the correct value for the config var on both servers. In fact, it was identical on both servers (same sha1sum), and had the same mtime on both
  • The config cache file (/tmp/mw-cache-1.35.0-wmf.15/conf2-cswiki.json) had the correct value on the good server, but had the wrong value on the bad server. However, the cache file did have the same mtime as InitialiseSettings.php (in the same minute that is, I forgot to use stat to look at it with second precision)

I'm not sure how this could have happened. Maybe the cache file was regenerated for other reasons in the same minute as InitialiseSettings? Or maybe the mtime of InitialiseSettings was updated before the file contents were? Or for some other reason the PHP code that regenerated the cache file saw the new mtime but the old file contents for InitialiseSettings.php, perhaps due to the php-fpm cache?

I have touched and resynced InitialiseSettings to fix the buggy deployment, which destroyed all the evidence, so unfortunately I can't investigate further at this point.

For next time:

  • Look at the file mtimes of InitialiseSettings.php and the cache file with second precision, using stat
  • Look at the mtime field in the cache file itself (at the top), which also has second precision
  • Cross reference these times with the second-precision times reported by scap for when it began/finished syncing and when it purged the php-fpm cache, and see if there's a difference between broken and non-broken servers
  • Figure out how to get dsh+keyholder to work, so that we can run debugging/investigation commands on all servers at once

Thanks to Giuseppe I got dsh+keyholder to work: SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -r ssh -F 20 -M -g mediawiki-installation -o -i/etc/keyholder.d/mwdeploy -o -oUser=mwdeploy -o -oBatchMode=yes -- 'echo "echo \$wgGEHomepageSuggestedEditsTopicsRequiresOptIn ? \"BAD\" : \"good\";" | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=cswiki | head -n 1'

  • InitialiseSettings.php had the correct value for the config var on both servers. In fact, it was identical on both servers (same sha1sum), and had the same mtime on both

This seems to indicate that the actual syncing is working correctly, so that's good.

  • The config cache file (/tmp/mw-cache-1.35.0-wmf.15/conf2-cswiki.json) had the correct value on the good server, but had the wrong value on the bad server. However, the cache file did have the same mtime as InitialiseSettings.php (in the same minute that is, I forgot to use stat to look at it with second precision)

This seems to implicate that in fixing the race condition from T217830 we have somehow created a *new* race condition OR something else has changed in the system.

Having the mtime that is embedded in the cache file would help; we now store the mtime of InitialiseSettings.php in the cache. That is, we read the cache, compare the mtime stored there with the mtime of InitialiseSettings.php and update if they are not ===.

If the mtime in the cache matched InitialiseSettings.php BUT had old values then there's probably something strange going on at the PHP level.

If the mtime in the cache didn't match InitialiseSettings.php and also had old values, then there's probably a race condition in mtime comparison/something strange at the filesystem level.

Maybe the cache file was regenerated for other reasons in the same minute as InitialiseSettings?

It would have to be in the same second -- IIRC that's the granularity of this filesystem/system. This is actually the race condition we were (trying) to control for by storing the mtime of InitialiseSettings in the cache; however, it's possible that if two syncs happened in the same second: that could be the problem. I don't think that was the case here?

I laid out the race condition we were solving for in T217830#5009234; however, that is likely not the only race condition here. Using mtime of a file to invalidate a cache that is generated by webtraffic hits is a complicated system and is easy to get wrong (I link to https://apenwarr.ca/log/20181113 a lot in this context -- tl;dr: a lot can go wrong :)) -- perhaps moving preemptively to something like a checksum stored in the cache would be easier? A checksum of stat info if that's less resource intensive?

Or maybe the mtime of InitialiseSettings was updated before the file contents were?

IIRC, from digging into how rsync works with the --delay-updates flag, rsync creates a .~tmp~ directory with the new file and uses rename(2) to move the file into place. This one seems least likely.

Or for some other reason the PHP code that regenerated the cache file saw the new mtime but the old file contents for InitialiseSettings.php, perhaps due to the php-fpm cache?

That's an interesting idea that I hadn't considered. Comparing the mtime inside the cache with the mtime of IS.php would help determine if this is the case.

For next time:

  • Look at the file mtimes of InitialiseSettings.php and the cache file with second precision, using stat
  • Look at the mtime field in the cache file itself (at the top), which also has second precision
  • Cross reference these times with the second-precision times reported by scap for when it began/finished syncing and when it purged the php-fpm cache, and see if there's a difference between broken and non-broken servers
  • Figure out how to get dsh+keyholder to work, so that we can run debugging/investigation commands on all servers at once

+1 thank you for all this!

Mentioned in SAL (#wikimedia-operations) [2020-01-22T23:46:29Z] <James_F> <RoanKattouw> T236104 happened again, and this time I'm leaving it broken so I can investigate. Please don't use do any MW deployments (use scap) for now

It happened again today, with this patch. This time I left it broken for a while so I could investigate.

I got dsh working, and it tells me that about 61 of the 336 hosts (18%) in mediawiki-installation are affected:

$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -r ssh -F 20 -M -g mediawiki-installation -o -i/etc/keyholder.d/mwdeploy -o -oUser=mwdeploy -o -oBatchMode=yes -- 'echo "echo \$wgGEHomepageEnabled ? \"good\" : \"BAD\";" | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=ukwiki | head -n 1' | tee ~/homepage-hosts
[...]
$ cat homepage-hosts | grep good | wc -l
275
$ cat homepage-hosts | grep BAD | wc -l
61

However, if you filter to just eqiad app servers, it's an exact 50%/50% split, which is interesting:

$ cat homepage-hosts | grep mw1 | grep good | wc -l
61
$ cat homepage-hosts | grep mw1 | grep BAD | wc -l
61

I didn't see any patterns offhand, but the list of good vs bad eqiad servers had a lot of clustering. I've put it in a paste so I can look back at it later (and perhaps compare it with future incidents):

1mw1221.eqiad.wmnet: BAD
2mw1222.eqiad.wmnet: BAD
3mw1223.eqiad.wmnet: BAD
4mw1224.eqiad.wmnet: good
5mw1225.eqiad.wmnet: BAD
6mw1226.eqiad.wmnet: good
7mw1227.eqiad.wmnet: good
8mw1228.eqiad.wmnet: good
9mw1229.eqiad.wmnet: BAD
10mw1230.eqiad.wmnet: BAD
11mw1231.eqiad.wmnet: BAD
12mw1232.eqiad.wmnet: good
13mw1233.eqiad.wmnet: BAD
14mw1234.eqiad.wmnet: BAD
15mw1235.eqiad.wmnet: good
16mw1238.eqiad.wmnet: good
17mw1239.eqiad.wmnet: good
18mw1240.eqiad.wmnet: BAD
19mw1241.eqiad.wmnet: good
20mw1242.eqiad.wmnet: good
21mw1243.eqiad.wmnet: good
22mw1244.eqiad.wmnet: good
23mw1245.eqiad.wmnet: good
24mw1246.eqiad.wmnet: good
25mw1247.eqiad.wmnet: BAD
26mw1248.eqiad.wmnet: good
27mw1249.eqiad.wmnet: BAD
28mw1250.eqiad.wmnet: BAD
29mw1251.eqiad.wmnet: BAD
30mw1252.eqiad.wmnet: good
31mw1253.eqiad.wmnet: good
32mw1254.eqiad.wmnet: good
33mw1255.eqiad.wmnet: BAD
34mw1256.eqiad.wmnet: BAD
35mw1257.eqiad.wmnet: good
36mw1258.eqiad.wmnet: good
37mw1261.eqiad.wmnet: BAD
38mw1262.eqiad.wmnet: BAD
39mw1263.eqiad.wmnet: BAD
40mw1264.eqiad.wmnet: BAD
41mw1265.eqiad.wmnet: good
42mw1266.eqiad.wmnet: BAD
43mw1267.eqiad.wmnet: BAD
44mw1268.eqiad.wmnet: BAD
45mw1269.eqiad.wmnet: BAD
46mw1270.eqiad.wmnet: BAD
47mw1271.eqiad.wmnet: BAD
48mw1272.eqiad.wmnet: good
49mw1273.eqiad.wmnet: BAD
50mw1274.eqiad.wmnet: good
51mw1275.eqiad.wmnet: BAD
52mw1276.eqiad.wmnet: BAD
53mw1277.eqiad.wmnet: BAD
54mw1278.eqiad.wmnet: good
55mw1279.eqiad.wmnet: BAD
56mw1280.eqiad.wmnet: BAD
57mw1281.eqiad.wmnet: good
58mw1282.eqiad.wmnet: BAD
59mw1283.eqiad.wmnet: good
60mw1284.eqiad.wmnet: BAD
61mw1285.eqiad.wmnet: good
62mw1286.eqiad.wmnet: BAD
63mw1287.eqiad.wmnet: BAD
64mw1288.eqiad.wmnet: BAD
65mw1289.eqiad.wmnet: good
66mw1290.eqiad.wmnet: good
67mw1293.eqiad.wmnet: BAD
68mw1294.eqiad.wmnet: good
69mw1295.eqiad.wmnet: good
70mw1296.eqiad.wmnet: good
71mw1297.eqiad.wmnet: good
72mw1298.eqiad.wmnet: good
73mw1299.eqiad.wmnet: good
74mw1300.eqiad.wmnet: good
75mw1301.eqiad.wmnet: good
76mw1302.eqiad.wmnet: good
77mw1303.eqiad.wmnet: BAD
78mw1304.eqiad.wmnet: good
79mw1305.eqiad.wmnet: good
80mw1306.eqiad.wmnet: good
81mw1307.eqiad.wmnet: good
82mw1308.eqiad.wmnet: good
83mw1309.eqiad.wmnet: good
84mw1310.eqiad.wmnet: good
85mw1311.eqiad.wmnet: good
86mw1312.eqiad.wmnet: good
87mw1313.eqiad.wmnet: BAD
88mw1314.eqiad.wmnet: good
89mw1315.eqiad.wmnet: good
90mw1316.eqiad.wmnet: good
91mw1317.eqiad.wmnet: BAD
92mw1318.eqiad.wmnet: good
93mw1319.eqiad.wmnet: BAD
94mw1320.eqiad.wmnet: good
95mw1321.eqiad.wmnet: BAD
96mw1322.eqiad.wmnet: BAD
97mw1323.eqiad.wmnet: BAD
98mw1324.eqiad.wmnet: BAD
99mw1325.eqiad.wmnet: BAD
100mw1326.eqiad.wmnet: good
101mw1327.eqiad.wmnet: BAD
102mw1328.eqiad.wmnet: good
103mw1329.eqiad.wmnet: BAD
104mw1330.eqiad.wmnet: BAD
105mw1331.eqiad.wmnet: BAD
106mw1332.eqiad.wmnet: BAD
107mw1333.eqiad.wmnet: BAD
108mw1334.eqiad.wmnet: good
109mw1335.eqiad.wmnet: good
110mw1336.eqiad.wmnet: good
111mw1337.eqiad.wmnet: good
112mw1338.eqiad.wmnet: BAD
113mw1339.eqiad.wmnet: BAD
114mw1340.eqiad.wmnet: good
115mw1341.eqiad.wmnet: BAD
116mw1342.eqiad.wmnet: good
117mw1343.eqiad.wmnet: BAD
118mw1344.eqiad.wmnet: BAD
119mw1345.eqiad.wmnet: good
120mw1346.eqiad.wmnet: BAD
121mw1347.eqiad.wmnet: BAD
122mw1348.eqiad.wmnet: BAD

On a "bad" server:

catrope@mw1240:~$ stat /srv/mediawiki/wmf-config/InitialiseSettings.php 
  File: /srv/mediawiki/wmf-config/InitialiseSettings.php
  Size: 806624    	Blocks: 1576       IO Block: 4096   regular file
Device: 801h/2049d	Inode: 20981413    Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (  499/mwdeploy)   Gid: (  499/mwdeploy)
Access: 2020-01-22 22:39:07.153319007 +0000
Modify: 2020-01-22 22:39:00.673380465 +0000
Change: 2020-01-22 22:39:00.673380465 +0000
 Birth: -

catrope@mw1240:~$ stat /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json 
  File: /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json
  Size: 75700     	Blocks: 152        IO Block: 4096   regular file
Device: 802h/2050d	Inode: 7125        Links: 1
Access: (0600/-rw-------)  Uid: (   33/www-data)   Gid: (   33/www-data)
Access: 2020-01-22 22:39:01.765370109 +0000
Modify: 2020-01-22 22:39:01.281374699 +0000
Change: 2020-01-22 22:39:01.281374699 +0000
 Birth: -

catrope@mw1240:~$ sudo -u www-data head -n 2 /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json 
{
    "mtime": 1579732740,

catrope@mw1240:~$ date -d @1579732740
Wed Jan 22 22:39:00 UTC 2020

On a good server:

catrope@mw1241:~$ stat /srv/mediawiki/wmf-config/InitialiseSettings.php 
  File: /srv/mediawiki/wmf-config/InitialiseSettings.php
  Size: 806624    	Blocks: 1576       IO Block: 4096   regular file
Device: 801h/2049d	Inode: 6946842     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (  499/mwdeploy)   Gid: (  499/mwdeploy)
Access: 2020-01-22 22:39:01.011376233 +0000
Modify: 2020-01-22 22:38:58.727418502 +0000
Change: 2020-01-22 22:38:58.727418502 +0000
 Birth: -

catrope@mw1241:~$ stat /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json 
  File: /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json
  Size: 75810     	Blocks: 152        IO Block: 4096   regular file
Device: 802h/2050d	Inode: 6750        Links: 1
Access: (0600/-rw-------)  Uid: (   33/www-data)   Gid: (   33/www-data)
Access: 2020-01-22 22:39:13.831139084 +0000
Modify: 2020-01-22 22:39:10.747196117 +0000
Change: 2020-01-22 22:39:10.747196117 +0000
 Birth: -

catrope@mw1241:~$ sudo -u www-data head -n 2 /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json 
{
    "mtime": 1579732738,

catrope@mw1241:~$ date -d @1579732738
Wed Jan 22 22:38:58 UTC 2020

Nothing super obvious jumps out here: on both servers, the mtime value encoded in the JSON file matches the mtime of the PHP file, and the mtime of the JSON file is after the mtime of the PHP file. The only difference I see is that the good server has a pretty big gap (12 seconds) between InitialiseSettings.php being written and the cache file being written, while on the bad server it's pretty small (0.6 seconds). So let's see if this is a common pattern:

$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -r ssh -F 20 -M -g mediawiki-installation -o -i/etc/keyholder.d/mwdeploy -o -oUser=mwdeploy -o -oBatchMode=yes -- awk "\"BEGIN {x = \$(stat /srv/mediawiki/wmf-config/InitialiseSettings.php --printf=%y | cut -d ':' -f 3 | cut -d ' ' -f 1) ; y = \$(stat /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json --printf=%y | cut -d ':' -f 3 | cut -d ' ' -f 1) ; if ( y > x ) { printf \\\"%f\\\",y - x; } else { printf \\\"%f\\\",60 + y - x; }}\"" \; 'echo "echo \$wgGEHomepageEnabled ? \" good\" : \" BAD\";" | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=ukwiki | head -n 1' | tee homepage-hosts-timing
$ cat homepage-hosts-timing | grep mw1 | sort -n -k 2

1mw1256.eqiad.wmnet: 0.048000 BAD
2mw1266.eqiad.wmnet: 0.052000 BAD
3mw1348.eqiad.wmnet: 0.059999 BAD
4mw1282.eqiad.wmnet: 0.072000 BAD
5mw1268.eqiad.wmnet: 0.084000 BAD
6mw1346.eqiad.wmnet: 0.084000 BAD
7mw1225.eqiad.wmnet: 0.116000 BAD
8mw1271.eqiad.wmnet: 0.163999 BAD
9mw1332.eqiad.wmnet: 0.236000 BAD
10mw1290.eqiad.wmnet: 0.243998 good
11mw1270.eqiad.wmnet: 0.260000 BAD
12mw1221.eqiad.wmnet: 0.275997 BAD
13mw1232.eqiad.wmnet: 0.295995 good
14mw1263.eqiad.wmnet: 0.315999 BAD
15mw1293.eqiad.wmnet: 0.352000 BAD
16mw1222.eqiad.wmnet: 0.415995 BAD
17mw1313.eqiad.wmnet: 0.443997 BAD
18mw1340.eqiad.wmnet: 0.447997 good
19mw1341.eqiad.wmnet: 0.448000 BAD
20mw1240.eqiad.wmnet: 0.607994 BAD
21mw1235.eqiad.wmnet: 0.615990 good
22mw1284.eqiad.wmnet: 0.643995 BAD
23mw1319.eqiad.wmnet: 0.643997 BAD
24mw1249.eqiad.wmnet: 0.660004 BAD
25mw1321.eqiad.wmnet: 0.671999 BAD
26mw1287.eqiad.wmnet: 0.675998 BAD
27mw1277.eqiad.wmnet: 0.748000 BAD
28mw1231.eqiad.wmnet: 0.787994 BAD
29mw1261.eqiad.wmnet: 0.827999 BAD
30mw1303.eqiad.wmnet: 0.871993 BAD
31mw1344.eqiad.wmnet: 0.923999 BAD
32mw1255.eqiad.wmnet: 1.179990 BAD
33mw1342.eqiad.wmnet: 1.311993 good
34mw1262.eqiad.wmnet: 1.332002 BAD
35mw1316.eqiad.wmnet: 1.340001 good
36mw1327.eqiad.wmnet: 1.379994 BAD
37mw1325.eqiad.wmnet: 1.403995 BAD
38mw1339.eqiad.wmnet: 1.411998 BAD
39mw1245.eqiad.wmnet: 1.455990 good
40mw1233.eqiad.wmnet: 1.467983 BAD
41mw1333.eqiad.wmnet: 1.612005 BAD
42mw1278.eqiad.wmnet: 1.716002 good
43mw1248.eqiad.wmnet: 1.723970 good
44mw1276.eqiad.wmnet: 1.792006 BAD
45mw1330.eqiad.wmnet: 1.831999 BAD
46mw1317.eqiad.wmnet: 1.903990 BAD
47mw1326.eqiad.wmnet: 2.028009 good
48mw1272.eqiad.wmnet: 2.039985 good
49mw1251.eqiad.wmnet: 2.239968 BAD
50mw1329.eqiad.wmnet: 2.340003 BAD
51mw1279.eqiad.wmnet: 2.363989 BAD
52mw1258.eqiad.wmnet: 2.371986 good
53mw1229.eqiad.wmnet: 2.447984 BAD
54mw1269.eqiad.wmnet: 2.480001 BAD
55mw1250.eqiad.wmnet: 2.579971 BAD
56mw1264.eqiad.wmnet: 2.583983 BAD
57mw1273.eqiad.wmnet: 2.608000 BAD
58mw1323.eqiad.wmnet: 2.619995 BAD
59mw1230.eqiad.wmnet: 2.639979 BAD
60mw1289.eqiad.wmnet: 2.843987 good
61mw1322.eqiad.wmnet: 2.880003 BAD
62mw1286.eqiad.wmnet: 3.171989 BAD
63mw1223.eqiad.wmnet: 3.183969 BAD
64mw1331.eqiad.wmnet: 3.191997 BAD
65mw1280.eqiad.wmnet: 3.255998 BAD
66mw1338.eqiad.wmnet: 3.564018 BAD
67mw1247.eqiad.wmnet: 3.687944 BAD
68mw1267.eqiad.wmnet: 3.755987 BAD
69mw1324.eqiad.wmnet: 3.795992 BAD
70mw1275.eqiad.wmnet: 3.931984 BAD
71mw1288.eqiad.wmnet: 4.083977 BAD
72mw1312.eqiad.wmnet: 4.099987 good
73mw1343.eqiad.wmnet: 4.175979 BAD
74mw1347.eqiad.wmnet: 4.287990 BAD
75mw1281.eqiad.wmnet: 4.511991 good
76mw1298.eqiad.wmnet: 4.639545 good
77mw1265.eqiad.wmnet: 4.831998 good
78mw1228.eqiad.wmnet: 5.099983 good
79mw1234.eqiad.wmnet: 5.267916 BAD
80mw1285.eqiad.wmnet: 5.387982 good
81mw1345.eqiad.wmnet: 5.868002 good
82mw1227.eqiad.wmnet: 6.419879 good
83mw1320.eqiad.wmnet: 6.435935 good
84mw1243.eqiad.wmnet: 7.016120 good
85mw1226.eqiad.wmnet: 7.039915 good
86mw1315.eqiad.wmnet: 7.451945 good
87mw1274.eqiad.wmnet: 7.523965 good
88mw1297.eqiad.wmnet: 7.883965 good
89mw1252.eqiad.wmnet: 7.915915 good
90mw1314.eqiad.wmnet: 8.155992 good
91mw1328.eqiad.wmnet: 8.619966 good
92mw1311.eqiad.wmnet: 8.651820 good
93mw1283.eqiad.wmnet: 9.287967 good
94mw1238.eqiad.wmnet: 9.323960 good
95mw1335.eqiad.wmnet: 9.420090 good
96mw1242.eqiad.wmnet: 10.351987 good
97mw1253.eqiad.wmnet: 10.543800 good
98mw1244.eqiad.wmnet: 10.599887 good
99mw1241.eqiad.wmnet: 12.019778 good
100mw1254.eqiad.wmnet: 12.083807 good
101mw1257.eqiad.wmnet: 12.515929 good
102mw1246.eqiad.wmnet: 16.556147 good
103mw1224.eqiad.wmnet: 17.403888 good
104mw1301.eqiad.wmnet: 17.483877 good
105mw1294.eqiad.wmnet: 20.183906 good
106mw1306.eqiad.wmnet: 20.587876 good
107mw1336.eqiad.wmnet: 20.864022 good
108mw1300.eqiad.wmnet: 21.859921 good
109mw1299.eqiad.wmnet: 23.803954 good
110mw1310.eqiad.wmnet: 23.911903 good
111mw1239.eqiad.wmnet: 24.191688 good
112mw1304.eqiad.wmnet: 24.207962 good
113mw1296.eqiad.wmnet: 25.447901 good
114mw1309.eqiad.wmnet: 27.652113 good
115mw1334.eqiad.wmnet: 28.035787 good
116mw1337.eqiad.wmnet: 34.923871 good
117mw1295.eqiad.wmnet: 35.275967 good
118mw1308.eqiad.wmnet: 40.231992 good
119mw1318.eqiad.wmnet: 42.620378 good
120mw1302.eqiad.wmnet: 50.883745 good
121mw1305.eqiad.wmnet: 52.787839 good
122mw1307.eqiad.wmnet: 57.911558 good

This isn't very satisfying: the bad servers do cluster near the top with deltas below 1 second, but there are a few good servers with <1s too, and some bad servers with deltas over 3 seconds. There does seem to be some sort of a correlation, but this is by no means a smoking gun.

Given that a scap takes ~60 seconds, should we just change the granularity of the stored/compared mtime down to the minute, if that would that avoid this?

To throw in my own filesystem fiddling which shows much the same as above:

thcipriani@mw1341:~$ stat --format=%Y /srv/mediawiki/wmf-config/InitialiseSettings.php
1579732739
thcipriani@mw1341:~$ sudo -u www-data grep mtime /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json
    "mtime": 1579732739,

The previous sync happened 9 minutes earlier, so this is a cache for this version of the file.

thcipriani@mw1341:~$ sudo -u www-data grep wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json
        "wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage": 0,
thcipriani@mw1341:~$ grep -A10 wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage /srv/mediawiki/wmf-config/InitialiseSettings.php
'wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage' => [
        'default' => 0,
        'arwiki' => 50,
        'cswiki' => 50,
        'kowiki' => 50,
        'viwiki' => 50,
        'ukwiki' => 50,
        'huwiki' => 50,
        'hywiki' => 50,
],

However ^ shows that the new version of the file wasn't cached correctly.

Recapping what I just talked with @Catrope about in IRC: somehow the information that php pulled into memory via require_once was not the information that was on disk in InitialiseSettings.php even though the php process require_onced that file after it had been updated (or so the code in CommonSettings.php would have us believe, anyway).

I don't think that that would help. I was measuring mtime deltas not because I think there's a bug of the type you're alluding to, but as a proxy for finding out whether the bad servers generated their cache files very shortly after the scap happened. If the cache file is only a tenth of a second younger than the settings file, that might indicate some sort of race condition, but at 0.25 seconds that's less plausible, and at 3 seconds it's definitely not plausible.

My remaining theories are:

  • Somehow a different code path does require_once "InitialiseSettings.php" before the mtime+cache check, and so when the cache refresh code path does it again, that reuses the stale version of IS.php that was loaded before
  • The opcache caches a stale version of IS.php, somehow

I'm going to comb through the initialization code in CommonSettings.php to see if I can exclude the former.

Mentioned in SAL (#wikimedia-operations) [2020-02-18T22:58:00Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: cirrus: Enable ores_articletopics field creation for all wikis (extra sync for T236104) (duration: 01m 04s)

Simplified Recreation

I was able to recreate this issue with a simplified example.

This is a simplified version of how the cache is generated by CommonSettings.php

test.php
<?php

$cache_path = '/tmp/cache.json';
$confActualMtime = filemtime('InitialiseSettings.php');

$globals = json_decode(file_get_contents($cache_path), true);

# If the cache has an older mtime, it's invalid
if ( $globals['timestamp'] !== $confActualMtime ) {
        $globals = [];
}

# An invalid cache means a re-read from disk
if ( ! $globals ) {
        require_once "InitialiseSettings.php";

        $globals = [
                "timestamp" => $confActualMtime,
                "foo" => $foo,
        ];

        file_put_contents($cache_path, json_encode($globals));
}

var_dump($globals);

Then I made an extremely simplified InitialiseSettings.php

InitialiseSettings.php
<?php

$foo = 1;

The first time I call the script, the cache is created with:

/tmp/cache.json
{"timestamp":1582521524,"foo":1}

Then I update InitialiseSettings.php to have $foo = 2;; hitting test.php in the browser results in:

/tmp/cache.json
{"timestamp":1582521648,"foo":1}

Which is exactly what we're seeing above: php is able to see the mtime of InitialiseSettings.php has changed and therefore rebuilds /tmp/cache.json with a new "timestamp" value but the old variables from InitialiseSettings.php.

We use the following relevant opcache settings in production:

php.ini
opcache.validate_timestamps = 1
opcache.revalidate_freq = 10

That is, php-fpm will check for updated scripts if the key in opcache is older than 10 seconds.

That is, in the example above the opcache value of InitialiseSettings.php is not expired according to the opcache.revalidate_freq (it was compiled < 10 seconds ago) and so when we called require_once php got the old value of $foo from the not-yet-expired-but-nonetheless-invalid opcache.

What's worse is that since we're looking at the mtime of InitialiseSettings.php to regenerate this cache file the problem will not be corrected when the opcache for the InitialiseSettings.php opcache key expires: only touching InitialiseSettings.php AFTER the 10 second window in which the opcache is valid for that file will fix this problem.


The first awful fix that came to mind was:

test.php
if ( ! $globals ) {
    symlink('InitialiseSettings.php', "InitialiseSettings-${confActualMtime}.php");
    require_once "InitialiseSettings-${confActualMtime}.php";
    unlink("InitialiseSettings-${confActualMtime}.php");
}

This causes the opcache to use a new key for require_once. The nice thing to do would be to use opcache_compile_file('InitialiseSettings.php');; however, I worry that causes subtle corruption of the opcache that we've seen with other invalidation methods.

Another possibility is to restart php-fpm whenever we update InitialiseSettings.php; although that may be costly.

Joe added a comment.Feb 24 2020, 6:49 AM

Another possibility is to restart php-fpm whenever we update InitialiseSettings.php; although that may be costly.

{{cn}}

I've never observed a real problem when restarting php-fpm en masse.

Having said that, I think a better strategy would be to have some "release identifier" (which in an ideal world would be a git sha1, for example of mediawiki-config) and use that to build the path for the cache file.

The idea of using mtime was always to avoid having to inspect the contents of the IS file itself. Maybe it's not a necessary optimisation, however? If so, and we always load IS.php, then we can just md5() the file contents or whatever, and compare to that cached key?

Krinkle added a comment.EditedFeb 25 2020, 3:38 AM

If I understand @thcipriani 's conclusion correctly, the problem is that code execution from disk (opcache) and mtimes from disk (filemtime) are coming from two different realities.

Code execution happens in a lagged reality that can be upto 10 seconds behind (via opcache), with its 10s loop of checking for updates. But file operations such as filemtime, even if performed on PHP files that opcache has available, are not responded to by the same lagged reality. Instead, the are responded to by the underlying file system directly. Is that right?

  • Is this is documented and intended behaviour from upstream PHP? It doesn't seam unreasable to keep these separate. But, I also vaguely recall PHP caching mtimes, realpath resolution and other information about PHP files in its opcache and (intending to be?) using that. If so, then this might be a regresion or other bug, and not intended by upstream. Possibly fixable in the short-term.
  • If we can't fix it easily and/or is intended by upstream, then we may want to use opcache_invalidate('./InitialiseSettings.php'); in our else branch that repopulates the cache right before require-ing it. To make sure we don't populate it with stale data from opcache that predates the disk sync. However, given the bad race conditions we know opcache is vunlerable to surrounding opcache invalidation, I am not 100% sure whether that method is safe to call in our production environment. If opcache_invalidate() ends up doing the same that opcache would normally do after 10 seconds on-demand, then it might be safe. On the other hand, if it is more like the resets that the php CLI can do (which we used to use), or the ones from the memory pressure trigger, then it could be a problem.
Joe added a comment.Feb 25 2020, 8:55 AM

If I understand @thcipriani 's conclusion correctly, the problem is that code execution from disk (opcache) and mtimes from disk (filemtime) are coming from two different realities.

Code execution happens in a lagged reality that can be upto 10 seconds behind (via opcache), with its 10s loop of checking for updates. But file operations such as filemtime, even if performed on PHP files that opcache has available, are not responded to by the same lagged reality. Instead, the are responded to by the underlying file system directly. Is that right?

yes and no. While your description is generally correct, there is also a realpath cache that can come into play, see https://engineering.facile.it/blog/eng/realpath-cache-is-it-all-php-opcache-s-fault/ for example.

What happens is:

  • We check the mtime of InitialiseSettings.php, which is a syscall and returns a valid value
  • We then re-include InitialiseSettings.php, but we get the file cached in opcache before that mtime
  • Is this is documented and intended behaviour from upstream PHP? It doesn't seam unreasable to keep these separate. But, I also vaguely recall PHP caching mtimes, realpath resolution and other information about PHP files in its opcache and (intending to be?) using that. If so, then this might be a regresion or other bug, and not intended by upstream. Possibly fixable in the short-term.

It's the intended behaviour. Now the obvious fix would be

  • If we can't fix it easily and/or is intended by upstream, then we may want to use opcache_invalidate('./InitialiseSettings.php'); in our else branch that repopulates the cache right before require-ing it. To make sure we don't populate it with stale data from opcache that predates the disk sync. However, given the bad race conditions we know opcache is vunlerable to surrounding opcache invalidation, I am not 100% sure whether that method is safe to call in our production environment. If opcache_invalidate() ends up doing the same that opcache would normally do after 10 seconds on-demand, then it might be safe. On the other hand, if it is more like the resets that the php CLI can do (which we used to use), or the ones from the memory pressure trigger, then it could be a problem.

which we did when doing scap, and we stopped doing because calling opcache_invalidate corrupts opcache under heavy load.

I propose an alternate solution in the next comment.

Joe added a comment.Feb 25 2020, 9:00 AM

I think we need to go at this another way. We should refresh the json cache only when filemtime('InitialiseSettings.php') is different from the $globals['timestamp'] and the current time is larger than filemtime('InitialiseSettings.php') + ini_get('opcache.revalidate_freq').

This way, we'll ensure not only that we do refresh the cache when things change, but also that we are getting the refreshed version of InitialiseSettings.php.

  • If we can't fix it easily and/or is intended by upstream, then we may want to use opcache_invalidate('./InitialiseSettings.php'); in our else branch that repopulates the cache right before require-ing it. […]

which we did when doing scap, and we stopped doing because calling opcache_invalidate corrupts opcache under heavy load.

Are you sure that was opcache_invalidate(), and not opcache_reset()?

My assumptions:

  • No matter how broken opcache is, there seems to be a way currently that opcache does recompile source code after a file has changed on disk. Specifically, within about 10 seconds from a file changing, we believe opcache recompiles the file.
  • We believe that this on-demand recompilation during web requests, when it finds that a source file has changed, does not cause corruption.
  • Other means of invalidating, clearing or resetting opcache (e.g. when it detects it runs out of memory, or when we use php7adm to clear opcache in its entirety) seem to cause corruption. It was my understanding that these both happen "externally". That is, php7adm is invoked outside any web request. And the out of memory detection might be in some kind of parent process rather than the web request thread, but I'm not sure that is actually the case. I assume it is in a parent process because that is how I've explained the fact that "normal" source updates don't cause corruption. It seems that calls from within web request under normal operation seem to do something right that these "external" calls get wrong or are unable to.

We should refresh the json cache only when filemtime('InitialiseSettings.php') is different from the $globals['timestamp'] and the current time is larger than filemtime('InitialiseSettings.php') + ini_get('opcache.revalidate_freq').

This way, we'll ensure not only that we do refresh the cache when things change, but also that we are getting the refreshed version of InitialiseSettings.php.

This means we'll have a 10 second delay between scap sync-file completing (or on debug servers, scap pull), and it actually taking effect. We'll likely want to build in some kind of delay in these commands to avoid giving the illusion that deployment is done. We'll also need to factor this into our canary checks somehow, so that they wait an additional 10 seconds compared to how long they wait today so that we keep the same size monitoring/comparison window.

I'd like to avoid that if possible as it makes the overal system a lot more difficult to reason about. Especially considering multi-file syncing where some files are behind the cache and some aren't. If resolving this bug in a way that doesn't involve the above workaround is considered possible, but non-trivial, then I'd rather we invest that time and effort into T169821 instead. With a bit of work from us, and possibly some scap-driven or pre-commit step, we could eliminate this system entirely.

Joe added a comment.Feb 25 2020, 4:00 PM
  • If we can't fix it easily and/or is intended by upstream, then we may want to use opcache_invalidate('./InitialiseSettings.php'); in our else branch that repopulates the cache right before require-ing it. […]

which we did when doing scap, and we stopped doing because calling opcache_invalidate corrupts opcache under heavy load.

Are you sure that was opcache_invalidate(), and not opcache_reset()?

Pretty sure, having written the code myself:
https://github.com/wikimedia/puppet/blob/36527ba432ea39086f0474373b2c51b4259e58f2/modules/profile/files/mediawiki/php/admin/lib.php#L371

My assumptions:

  • No matter how broken opcache is, there seems to be a way currently that opcache does recompile source code after a file has changed on disk. Specifically, within about 10 seconds from a file changing, we believe opcache recompiles the file.
  • We believe that this on-demand recompilation during web requests, when it finds that a source file has changed, does not cause corruption.
  • Other means of invalidating, clearing or resetting opcache (e.g. when it detects it runs out of memory, or when we use php7adm to clear opcache in its entirety) seem to cause corruption. It was my understanding that these both happen "externally". That is, php7adm is invoked outside any web request.

php7adm is a tool I wrote and just performs a webrequest to php-fpm, to a special, access-restricted admin endpoint. So no big difference there.

To sum up: I'll have to check php's source code, but I expect the revalidation procedure to look more like what opcache_compile_file does instead.

We also have the possibility of doing what I proposed originally for deployments: disable opcache revalidation completely, and restart php-fpm for every release. This will give us a series of advantages, including guaranteeing atomicity of releases on each server.

If we want to go down that road, I'll probably have to add some safeguards to the restart scripts, but given we'll have to go down that route once we move to containers anyways, I think it's a route worth pursuing.

I'd also like to add we've done rolling restarts of php-fpm and we've barely seen any regression in the performance metrics afterwards.

Example code above for anyone interested in fiddling with this: https://github.com/thcipriani/T236104

Joe added a comment.EditedFeb 26 2020, 8:51 AM

I just made a tiny change to your test.php file

https://github.com/lavagetto/T236104/commit/e32162717a79e09a1aa62b07d86b018f71385f88

To only save the file if the grace period is passed.

The result is that even if the bug is triggered (so we're reading a stale version of InitialiseSettings.php) it recovers as soon as opcache is revalidated, like for every other file. We will be able to use the cache only after 11 seconds (the grace period)

We also have the possibility of doing what I proposed originally for deployments: disable opcache revalidation completely, and restart php-fpm for every release. This will give us a series of advantages, including guaranteeing atomicity of releases on each server.

If we want to go down that road, I'll probably have to add some safeguards to the restart scripts, but given we'll have to go down that route once we move to containers anyways, I think it's a route worth pursuing.

I'd also like to add we've done rolling restarts of php-fpm and we've barely seen any regression in the performance metrics afterwards.

Since our efforts to clear the opcache have been foiled by php itself, I think it makes sense to revisit restarting php as part of deployment. I have been hesitant as I want to preserve fast rollback: that will have to be a design consideration. We'd need to be sure to touch InitialiseSettings.php after php restart (otherwise we'll just end-up using the value cache in the json).

Is there a sane workaround we could put in place in the interim? Either opcache_compile_file or the ugly symlink solution I posted above (or a smarter variation of that)?

Joe added a comment.Feb 26 2020, 5:36 PM

Is there a sane workaround we could put in place in the interim? Either opcache_compile_file or the ugly symlink solution I posted above (or a smarter variation of that)?

See my proposal above:

  • We read the file from cache, if it's stale we require InitialiseSettings.php
  • We only save the file if a grace period is passed, so that we're sure we will save in cache a version of globals that is updated.

Change 575469 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Do not update the globals cache file while opcache needs regeneration

https://gerrit.wikimedia.org/r/575469

I admit I don't understand PHP or MW, but... would a simpler approach be for scap to wait 10s before it touches the settings file to update its mtime, and rsyncs the file file to the appservers?

[…] would a simpler approach be for scap to wait 10s before it touches the settings file to update its mtime, and rsyncs the file file to the appservers?

Each PHP file has its compilation cached by PHP7 for 10s before it revalidates against the source file. This repeating 10s window isn't coordinated between servers. If the last deployment was an hour ago, there will have been many revalidation checks (during which nothing happeneds if no deployment happened). It is not until 10 seconds after the local rsync completes that we know it's been picked up, because the last revalidation attempt may've completed just before we rsync started.

If I understand your idea correctly, waiting longer on the deployment server wouldn't help in this case. Does that make sense?

Krinkle added a comment.EditedFeb 28 2020, 2:00 PM

I'm wondering if it'd be easier if we simply generate this cache upfront the way we do for localisation cache. I can look into writing a script that a Scap plugin could call which would produce the necessary config files.

Okay, there's clearly things here I'm not getting yet. Scratch the idea of having scap wait ten seconds.

Has anyone tried Giuseppe's change in Gerrit? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/575469

Joe added a comment.Feb 28 2020, 2:48 PM

I'm wondering if it'd be easier if we simply generate this cache upfront the way we do for localisation cache. I can look into writing a script that a Scap plugin could call which would produce the necessary config files.

That would work for sure, but it means adding yet another process in the hands of scap pre-deployment, vs having such things generated at merge time for instance.

I thought there were further reasons not to distribute the json files with the code itself. If not, let's generate them as part of mediawiki-config merge process instead.

I'm wondering if it'd be easier if we simply generate this cache upfront the way we do for localisation cache. I can look into writing a script that a Scap plugin could call which would produce the necessary config files.

Indeed, and if you recall I built this and you have concerns about deploying it. :-)

Deploys would be composer buildConfigCache && scap sync-dir wmf-config/config-cache. No need to change anything in scap to get this done.

Stalled on working out what to do about some calls to SiteConfiguration wanting to know everything about every wiki, not just one. Otherwise we'll just make the wrong-file-loaded issue happen only for some kinds of request.

FWIW, ran some tests with apachebench today and the patchset on this task and it didn't seem to make requests significantly slower or load significantly higher in my rudimentary testing:

$ ab -n 1000 -c 10 -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' https://en.wikipedia.org/wiki/Main_page
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking en.wikipedia.org (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        mwdebug1002.eqiad.wmnet
Server Hostname:        en.wikipedia.org
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-CHACHA20-POLY1305,256,256
Server Temp Key:        X25519 253 bits
TLS Server Name:        en.wikipedia.org

Document Path:          /wiki/Main_page
Document Length:        79269 bytes

Concurrency Level:      10
Time taken for tests:   48.676 seconds
Complete requests:      1000
Failed requests:        2
   (Connect: 0, Receive: 0, Length: 2, Exceptions: 0)
Total transferred:      80405609 bytes
HTML transferred:       79269000 bytes
Requests per second:    20.54 [#/sec] (mean)
Time per request:       486.758 [ms] (mean)
Time per request:       48.676 [ms] (mean, across all concurrent requests)
Transfer rate:          1613.14 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       75   83   3.6     82     119
Processing:   263  398  82.9    378    1582
Waiting:      203  337  82.5    317    1522
Total:        347  481  83.1    462    1667

Percentage of the requests served within a certain time (ms)
  50%    462
  66%    487
  75%    507
  80%    527
  90%    575
  95%    613
  98%    653
  99%    695
 100%   1667 (longest request)
~ 49s
$ ab -n 1000 -c 10 -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' https://en.wikipedia.org/wiki/Main_page
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking en.wikipedia.org (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        mwdebug1002.eqiad.wmnet
Server Hostname:        en.wikipedia.org
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-CHACHA20-POLY1305,256,256
Server Temp Key:        X25519 253 bits
TLS Server Name:        en.wikipedia.org

Document Path:          /wiki/Main_page
Document Length:        79311 bytes

Concurrency Level:      10
Time taken for tests:   51.275 seconds
Complete requests:      1000
Failed requests:        2
   (Connect: 0, Receive: 0, Length: 2, Exceptions: 0)
Total transferred:      80447873 bytes
HTML transferred:       79310998 bytes
Requests per second:    19.50 [#/sec] (mean)
Time per request:       512.749 [ms] (mean)
Time per request:       51.275 [ms] (mean, across all concurrent requests)
Transfer rate:          1532.18 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       75   84   7.6     83     145
Processing:   236  421  84.8    402    1177
Waiting:      178  358  82.9    337    1117
Total:        315  506  86.1    484    1265

Percentage of the requests served within a certain time (ms)
  50%    484
  66%    520
  75%    550
  80%    566
  90%    623
  95%    670
  98%    740
  99%    777
 100%   1265 (longest request)
thcipriani triaged this task as High priority.

Looks like this just needs deployment, assigning to @Jdforrester-WMF to do the needful.

Change 575469 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Do not update the globals cache file while opcache needs regeneration

https://gerrit.wikimedia.org/r/575469

If I understand correctly, this means forcing a 10 second period after every deployment of unconditionall cache miss for the configuration. Regenerating on every request thread during that time. […]

This approach favours consistency (the version of the code you get is always consistent with how $globals is generated) at the cost of some more computing power. […] I'd like to quantify the cost of /not/ caching in terms of cpu load and response times to our users.

Some weeks later now, having been deep into this mess as part of T169821. I too now prefer the simple approach by @Joe. The latency impact of having to re-compute this over and over again used to be quite expensive, but ever since the switch to HHVM and more recently PHP 7.2 the hit has become relatively small in the grand scheme of things (about 10-20 ms).

That is still quite a significant amount of time to be spending on the CPU given all it does is give us some configuration variables to initialise MediaWiki. It is expensive mainly because it involves reading several non-PHP files off disk. And then performing many hundreds of complex iterations over each of the thousands of config settings.

This kind of time spend is comparable to performing 10 Memcached queries, connecting to a master DB and fetching rows from a few simple queries.

It is also comparable to the entire backend response time of some load.php requests (when the config is cached). So for an SLO of endpoints that aim for 50-100ms, an additional 10-20ms is huge.

But given the limited options at our disposal, at least for the short-term, we should make this stuff work first, and a latency regression for a few seconds after every deploy seems acceptable for the short term.

For the mid-term, I have already started work on T169821 which so far has reduced this cost down to ~ 4ms. And my next step there is to simply remove this caching mechanism entirely pretty soon, which isn't too much harder given that a "cache hit" scenario involves synchronous disk I/O to read a JSON file and then parse it on the CPU, whereas a "cache miss" scenario right now is doing a lot of CPU computations in pure memory (no disk I/O).

And for the mid-to-long term, as part of T223602, the new cache we need for the YAML expansions would be handled by a Scap build step of sorts and thus not be something for the run-time to be concerned about.

awight added a subscriber: awight.Apr 15 2020, 11:06 AM

Change 575469 merged by jenkins-bot:
[operations/mediawiki-config@master] Do not update the globals cache file while opcache needs regeneration

https://gerrit.wikimedia.org/r/575469

Mentioned in SAL (#wikimedia-operations) [2020-04-20T18:14:21Z] <jforrester@deploy1001> Synchronized wmf-config/CommonSettings.php: T236104 Wait to update the globals cache file for opcache regeneration (duration: 01m 02s)

OK, this is deployed, and I made a quick patch and used SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -r ssh -F 20 -M -g mediawiki-installation -o -i/etc/keyholder.d/mwdeploy -o -oUser=mwdeploy -o -oBatchMode=yes -- 'echo "echo \$wgTmhUseBetaFeatures ? \"BAD\" : \"good\";" | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=testwiki | head -n 1' | tee ~/wgTmhUseBetaFeatures-testwiki-hosts to verify that every machine had the correct value. Success.

Jdforrester-WMF closed this task as Resolved.Apr 20 2020, 6:40 PM
Jdforrester-WMF reassigned this task from Jdforrester-WMF to Joe.

I’ve removed it again.

Aklapper removed a subscriber: Anomie.Fri, Oct 16, 5:40 PM