
Cache of wmf-config/InitialiseSettings often 1 step behind
Closed, Resolved · Public

Description

It has happened to me in 5 out of the 5 deployments I did in the past 30 days: after changing a value in InitialiseSettings.php and syncing it out, the change does not apply on about half the servers (possibly more).

But it works fine on mwdebug1002, and after a second touch+sync it always works on the rest of the fleet as well.

This suggests there might be a resurfaced old bug and/or a new bug caused by recent changes to MWConfigCacheGenerator.

Related Objects

Event Timeline


This happened yesterday, during firefighting work on {T235949}

All appservers had the same config:

❌ cdanis@cumin1001.eqiad.wmnet ~ $ sudo !!
sudo cumin A:mw 'md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php'
126 hosts will be targeted:
mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1238-1258,1261-1275,1319-1333].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(126) mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1238-1258,1261-1275,1319-1333].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
----- OUTPUT of 'md5sum /srv/medi...liseSettings.php' -----
b5a01965593c5a02207d93e70c630832 /srv/mediawiki/wmf-config/InitialiseSettings.php
================
PASS: |████████████████████████████████████████| 100% (126/126) [00:01<00:00, 77.68hosts/s]
FAIL: |                                        |   0% (0/126) [00:01<?, ?hosts/s]
100.0% (126/126) success ratio (>= 100.0% threshold) for command: 'md5sum /srv/medi...liseSettings.php'.
100.0% (126/126) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

And yet a third of the fleet was still broken:

20:39:15	<@cdanis>	(78) mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1240,1273,1322].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet                                     
20:39:17	<@cdanis>	these appservers seem fine
20:39:20	<@cdanis>	(48) mw[1238-1239,1241-1258,1261-1272,1274-1275,1319-1321,1323-1333].eqiad.wmnet                                                                                                              
20:39:22	<@cdanis>	these appservers are not

@MaxSem touched InitialiseSettings.php and did another sync-file and that fixed it.

How reliable is the filemtime function in a scap world? Does scap / rsync always migrate the mtime from the host to the target machine?


rsync certainly can, but won't by default unless you pass -t / --times. (-a / --archive includes this in the set of options it enables.)

I see --archive included in scap/tasks.py DEFAULT_RSYNC_ARGS, so I assume that scap DTRT here?

I don't know off-hand whether the mtime comes from the deployment server (set when the last git command saved the file to disk), or whether it is set by the destination server at the time of local sync.

However, we don't currently depend on either model as far as I know. That is, so long as the mtime changes in some way, the code for config cache will handle it. It doesn't assume equality across the cluster and doesn't assume it goes up.

Having said that, it seems most likely that something in the config cache code is broken, because it's the only part of this that has changed recently. I've looked at ed5297c107f38d and subsequent refactors by James many times, but nothing obvious stands out...

Happened to me today when deploying a config change for T198312. Touching InitialiseSettings.php and re-syncing did fix it.

Addshore added subscribers: Addshore, Daimona, Catrope, Ladsgroup.

Also happened to me twice in the past few days.

the change does not apply on about half the servers (possibly more).

For some of my syncs I was lucky if I could find 1 or 2 servers with the config actually loaded correctly.

How reliable is the filemtime function in a scap world? Does scap / rsync always migrate the mtime from the host to the target machine?

rsync certainly can, but won't by default unless you pass -t / --times. (-a / --archive includes this in the set of options it enables.)

I see --archive included in scap/tasks.py DEFAULT_RSYNC_ARGS, so I assume that scap DTRT here?

Those should be the rsync arguments that scap uses in this case; however, after each sync scap also touches the wmf-config/InitialiseSettings.php file to trigger the re-cache of settings.


So the actual rsyncing of the file appears to work just fine; however, the file is not loaded from the config correctly.

The next step of troubleshooting is to figure out if this is a problem with fixing the race condition in T217830 (or some subsequent refactor); or something entirely new to do with php caching. I guess that would mean comparing the mtime of /srv/mediawiki/wmf-config/InitialiseSettings.php with /tmp/mw-cache-[versions]/conf2-[wiki].json. If, on a misbehaving appserver, the cache has an older mtime than the IS.php file, then we know regeneration was not triggered. e.g.,

thcipriani@mwdebug1001:~$ sudo -u www-data jq '.mtime' /tmp/mw-cache-1.35.0-wmf.10/conf2-enwiki.json 
1576527000
thcipriani@mwdebug1001:~$ sudo -u www-data php -a
Interactive mode enabled

php > var_dump(filemtime('/srv/mediawiki/wmf-config/InitialiseSettings.php'));
int(1576527000)
php > ^D
thcipriani@mwdebug1001:~$ stat --format '%Y (%y) %n' /srv/mediawiki/wmf-config/InitialiseSettings.php 
1576527000 (2019-12-16 20:10:00.935866457 +0000) /srv/mediawiki/wmf-config/InitialiseSettings.php

Mentioned in SAL (#wikimedia-operations) [2020-01-06T12:31:00Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:562083|Don’t check constraints on P6685 statements]] Bypassing T236104 (duration: 00m 55s)

I encountered this bug again today, when deploying this config change. My teammates reported that the feature we'd just enabled was working about half the time, which likely means that about half (or at least a significant fraction, much more than 10%) of the app servers had an outdated config. I tried to use dsh to count how many app servers were affected, but I couldn't get keyholder to work; it gave me a "Too many authentication failures" error message for each host.

I asked a teammate to keep refreshing until they got a broken pageview (with the feature disabled), then run mw.config.get( 'wgHostname' ) in the browser console and tell me the output. This gave me the server name of a broken server (mw1319). Using the same process we found that mw1256 was not broken.

I compared the "good" server (mw1256) with the "bad" server (mw1319) and found that:

  • Running echo 'var_dump($wgGEHomepageSuggestedEditsTopicsRequiresOptIn)' | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=cswiki returned false (correct) on the good server, and true (incorrect) on the bad server. This is what I expected.
  • InitialiseSettings.php had the correct value for the config var on both servers. In fact, it was identical on both servers (same sha1sum), and had the same mtime on both
  • The config cache file (/tmp/mw-cache-1.35.0-wmf.15/conf2-cswiki.json) had the correct value on the good server, but the wrong value on the bad server. However, the cache file did have the same mtime as InitialiseSettings.php (to minute precision, that is; I forgot to use stat to check it with second precision)

I'm not sure how this could have happened. Maybe the cache file was regenerated for other reasons in the same minute as InitialiseSettings? Or maybe the mtime of InitialiseSettings was updated before the file contents were? Or for some other reason the PHP code that regenerated the cache file saw the new mtime but the old file contents for InitialiseSettings.php, perhaps due to the php-fpm cache?

I have touched and resynced InitialiseSettings to fix the buggy deployment, which destroyed all the evidence, so unfortunately I can't investigate further at this point.

For next time:

  • Look at the file mtimes of InitialiseSettings.php and the cache file with second precision, using stat
  • Look at the mtime field in the cache file itself (at the top), which also has second precision
  • Cross reference these times with the second-precision times reported by scap for when it began/finished syncing and when it purged the php-fpm cache, and see if there's a difference between broken and non-broken servers
  • Figure out how to get dsh+keyholder to work, so that we can run debugging/investigation commands on all servers at once

Thanks to Giuseppe I got dsh+keyholder to work: SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -r ssh -F 20 -M -g mediawiki-installation -o -i/etc/keyholder.d/mwdeploy -o -oUser=mwdeploy -o -oBatchMode=yes -- 'echo "echo \$wgGEHomepageSuggestedEditsTopicsRequiresOptIn ? \"BAD\" : \"good\";" | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=cswiki | head -n 1'

  • InitialiseSettings.php had the correct value for the config var on both servers. In fact, it was identical on both servers (same sha1sum), and had the same mtime on both

This seems to indicate that the actual syncing is working correctly, so that's good.

  • The config cache file (/tmp/mw-cache-1.35.0-wmf.15/conf2-cswiki.json) had the correct value on the good server, but the wrong value on the bad server. However, the cache file did have the same mtime as InitialiseSettings.php (to minute precision, that is; I forgot to use stat to check it with second precision)

This seems to indicate that in fixing the race condition from T217830 we have somehow created a *new* race condition, OR that something else has changed in the system.

Having the mtime that is embedded in the cache file would help: we now store the mtime of InitialiseSettings.php in the cache. That is, we read the cache, compare the mtime stored there with the mtime of InitialiseSettings.php, and update if they are not ===.

If the mtime in the cache matched InitialiseSettings.php BUT had old values then there's probably something strange going on at the PHP level.

If the mtime in the cache didn't match InitialiseSettings.php and also had old values, then there's probably a race condition in mtime comparison/something strange at the filesystem level.
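To make the comparison concrete, here is a minimal sketch of the check described above. The file paths and variable names are illustrative, not the actual MWConfigCacheGenerator code, and the two branches map onto the two diagnostic cases just mentioned.

<?php
$settingsFile = '/srv/mediawiki/wmf-config/InitialiseSettings.php';
$cacheFile    = '/tmp/mw-cache-1.35.0-wmf.15/conf2-cswiki.json';

$confActualMtime = filemtime( $settingsFile );
$cached = json_decode( file_get_contents( $cacheFile ), true );

if ( !is_array( $cached ) || $cached['mtime'] !== $confActualMtime ) {
	// mtimes differ: the cache is treated as stale and gets rebuilt.
	// Stale mtime + old values would point at a race in the mtime
	// comparison or something strange at the filesystem level.
} else {
	// mtimes match: the cache is treated as fresh.
	// Matching mtime + old values would point at something at the
	// PHP level feeding stale contents into the rebuild.
}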

Maybe the cache file was regenerated for other reasons in the same minute as InitialiseSettings?

It would have to be in the same second -- IIRC that's the granularity of this filesystem/system. This is actually the race condition we were (trying) to control for by storing the mtime of InitialiseSettings in the cache; however, if two syncs happened in the same second, that could be the problem. I don't think that was the case here, though.

I laid out the race condition we were solving for in T217830#5009234; however, that is likely not the only race condition here. Using mtime of a file to invalidate a cache that is generated by webtraffic hits is a complicated system and is easy to get wrong (I link to https://apenwarr.ca/log/20181113 a lot in this context -- tl;dr: a lot can go wrong :)) -- perhaps moving preemptively to something like a checksum stored in the cache would be easier? A checksum of stat info if that's less resource intensive?

Or maybe the mtime of InitialiseSettings was updated before the file contents were?

IIRC, from digging into how rsync works with the --delay-updates flag, rsync creates a .~tmp~ directory with the new file and uses rename(2) to move the file into place. This one seems least likely.

Or for some other reason the PHP code that regenerated the cache file saw the new mtime but the old file contents for InitialiseSettings.php, perhaps due to the php-fpm cache?

That's an interesting idea that I hadn't considered. Comparing the mtime inside the cache with the mtime of IS.php would help determine if this is the case.

For next time:

  • Look at the file mtimes of InitialiseSettings.php and the cache file with second precision, using stat
  • Look at the mtime field in the cache file itself (at the top), which also has second precision
  • Cross reference these times with the second-precision times reported by scap for when it began/finished syncing and when it purged the php-fpm cache, and see if there's a difference between broken and non-broken servers
  • Figure out how to get dsh+keyholder to work, so that we can run debugging/investigation commands on all servers at once

+1 thank you for all this!

Mentioned in SAL (#wikimedia-operations) [2020-01-22T23:46:29Z] <James_F> <RoanKattouw> T236104 happened again, and this time I'm leaving it broken so I can investigate. Please don't do any MW deployments (using scap) for now

It happened again today, with this patch. This time I left it broken for a while so I could investigate.

I got dsh working, and it tells me that 61 of the 336 hosts (18%) in mediawiki-installation are affected:

$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -r ssh -F 20 -M -g mediawiki-installation -o -i/etc/keyholder.d/mwdeploy -o -oUser=mwdeploy -o -oBatchMode=yes -- 'echo "echo \$wgGEHomepageEnabled ? \"good\" : \"BAD\";" | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=ukwiki | head -n 1' | tee ~/homepage-hosts
[...]
$ cat homepage-hosts | grep good | wc -l
275
$ cat homepage-hosts | grep BAD | wc -l
61

However, if you filter to just eqiad app servers, it's an exact 50%/50% split, which is interesting:

$ cat homepage-hosts | grep mw1 | grep good | wc -l
61
$ cat homepage-hosts | grep mw1 | grep BAD | wc -l
61

I didn't see any patterns offhand, but the list of good vs bad eqiad servers had a lot of clustering. I've put it in a paste so I can look back at it later (and perhaps compare it with future incidents):

1mw1221.eqiad.wmnet: BAD
2mw1222.eqiad.wmnet: BAD
3mw1223.eqiad.wmnet: BAD
4mw1224.eqiad.wmnet: good
5mw1225.eqiad.wmnet: BAD
6mw1226.eqiad.wmnet: good
7mw1227.eqiad.wmnet: good
8mw1228.eqiad.wmnet: good
9mw1229.eqiad.wmnet: BAD
10mw1230.eqiad.wmnet: BAD
11mw1231.eqiad.wmnet: BAD
12mw1232.eqiad.wmnet: good
13mw1233.eqiad.wmnet: BAD
14mw1234.eqiad.wmnet: BAD
15mw1235.eqiad.wmnet: good
16mw1238.eqiad.wmnet: good
17mw1239.eqiad.wmnet: good
18mw1240.eqiad.wmnet: BAD
19mw1241.eqiad.wmnet: good
20mw1242.eqiad.wmnet: good
21mw1243.eqiad.wmnet: good
22mw1244.eqiad.wmnet: good
23mw1245.eqiad.wmnet: good
24mw1246.eqiad.wmnet: good
25mw1247.eqiad.wmnet: BAD
26mw1248.eqiad.wmnet: good
27mw1249.eqiad.wmnet: BAD
28mw1250.eqiad.wmnet: BAD
29mw1251.eqiad.wmnet: BAD
30mw1252.eqiad.wmnet: good
31mw1253.eqiad.wmnet: good
32mw1254.eqiad.wmnet: good
33mw1255.eqiad.wmnet: BAD
34mw1256.eqiad.wmnet: BAD
35mw1257.eqiad.wmnet: good
36mw1258.eqiad.wmnet: good
37mw1261.eqiad.wmnet: BAD
38mw1262.eqiad.wmnet: BAD
39mw1263.eqiad.wmnet: BAD
40mw1264.eqiad.wmnet: BAD
41mw1265.eqiad.wmnet: good
42mw1266.eqiad.wmnet: BAD
43mw1267.eqiad.wmnet: BAD
44mw1268.eqiad.wmnet: BAD
45mw1269.eqiad.wmnet: BAD
46mw1270.eqiad.wmnet: BAD
47mw1271.eqiad.wmnet: BAD
48mw1272.eqiad.wmnet: good
49mw1273.eqiad.wmnet: BAD
50mw1274.eqiad.wmnet: good
51mw1275.eqiad.wmnet: BAD
52mw1276.eqiad.wmnet: BAD
53mw1277.eqiad.wmnet: BAD
54mw1278.eqiad.wmnet: good
55mw1279.eqiad.wmnet: BAD
56mw1280.eqiad.wmnet: BAD
57mw1281.eqiad.wmnet: good
58mw1282.eqiad.wmnet: BAD
59mw1283.eqiad.wmnet: good
60mw1284.eqiad.wmnet: BAD
61mw1285.eqiad.wmnet: good
62mw1286.eqiad.wmnet: BAD
63mw1287.eqiad.wmnet: BAD
64mw1288.eqiad.wmnet: BAD
65mw1289.eqiad.wmnet: good
66mw1290.eqiad.wmnet: good
67mw1293.eqiad.wmnet: BAD
68mw1294.eqiad.wmnet: good
69mw1295.eqiad.wmnet: good
70mw1296.eqiad.wmnet: good
71mw1297.eqiad.wmnet: good
72mw1298.eqiad.wmnet: good
73mw1299.eqiad.wmnet: good
74mw1300.eqiad.wmnet: good
75mw1301.eqiad.wmnet: good
76mw1302.eqiad.wmnet: good
77mw1303.eqiad.wmnet: BAD
78mw1304.eqiad.wmnet: good
79mw1305.eqiad.wmnet: good
80mw1306.eqiad.wmnet: good
81mw1307.eqiad.wmnet: good
82mw1308.eqiad.wmnet: good
83mw1309.eqiad.wmnet: good
84mw1310.eqiad.wmnet: good
85mw1311.eqiad.wmnet: good
86mw1312.eqiad.wmnet: good
87mw1313.eqiad.wmnet: BAD
88mw1314.eqiad.wmnet: good
89mw1315.eqiad.wmnet: good
90mw1316.eqiad.wmnet: good
91mw1317.eqiad.wmnet: BAD
92mw1318.eqiad.wmnet: good
93mw1319.eqiad.wmnet: BAD
94mw1320.eqiad.wmnet: good
95mw1321.eqiad.wmnet: BAD
96mw1322.eqiad.wmnet: BAD
97mw1323.eqiad.wmnet: BAD
98mw1324.eqiad.wmnet: BAD
99mw1325.eqiad.wmnet: BAD
100mw1326.eqiad.wmnet: good
101mw1327.eqiad.wmnet: BAD
102mw1328.eqiad.wmnet: good
103mw1329.eqiad.wmnet: BAD
104mw1330.eqiad.wmnet: BAD
105mw1331.eqiad.wmnet: BAD
106mw1332.eqiad.wmnet: BAD
107mw1333.eqiad.wmnet: BAD
108mw1334.eqiad.wmnet: good
109mw1335.eqiad.wmnet: good
110mw1336.eqiad.wmnet: good
111mw1337.eqiad.wmnet: good
112mw1338.eqiad.wmnet: BAD
113mw1339.eqiad.wmnet: BAD
114mw1340.eqiad.wmnet: good
115mw1341.eqiad.wmnet: BAD
116mw1342.eqiad.wmnet: good
117mw1343.eqiad.wmnet: BAD
118mw1344.eqiad.wmnet: BAD
119mw1345.eqiad.wmnet: good
120mw1346.eqiad.wmnet: BAD
121mw1347.eqiad.wmnet: BAD
122mw1348.eqiad.wmnet: BAD

On a "bad" server:

catrope@mw1240:~$ stat /srv/mediawiki/wmf-config/InitialiseSettings.php 
  File: /srv/mediawiki/wmf-config/InitialiseSettings.php
  Size: 806624    	Blocks: 1576       IO Block: 4096   regular file
Device: 801h/2049d	Inode: 20981413    Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (  499/mwdeploy)   Gid: (  499/mwdeploy)
Access: 2020-01-22 22:39:07.153319007 +0000
Modify: 2020-01-22 22:39:00.673380465 +0000
Change: 2020-01-22 22:39:00.673380465 +0000
 Birth: -

catrope@mw1240:~$ stat /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json 
  File: /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json
  Size: 75700     	Blocks: 152        IO Block: 4096   regular file
Device: 802h/2050d	Inode: 7125        Links: 1
Access: (0600/-rw-------)  Uid: (   33/www-data)   Gid: (   33/www-data)
Access: 2020-01-22 22:39:01.765370109 +0000
Modify: 2020-01-22 22:39:01.281374699 +0000
Change: 2020-01-22 22:39:01.281374699 +0000
 Birth: -

catrope@mw1240:~$ sudo -u www-data head -n 2 /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json 
{
    "mtime": 1579732740,

catrope@mw1240:~$ date -d @1579732740
Wed Jan 22 22:39:00 UTC 2020

On a good server:

catrope@mw1241:~$ stat /srv/mediawiki/wmf-config/InitialiseSettings.php 
  File: /srv/mediawiki/wmf-config/InitialiseSettings.php
  Size: 806624    	Blocks: 1576       IO Block: 4096   regular file
Device: 801h/2049d	Inode: 6946842     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (  499/mwdeploy)   Gid: (  499/mwdeploy)
Access: 2020-01-22 22:39:01.011376233 +0000
Modify: 2020-01-22 22:38:58.727418502 +0000
Change: 2020-01-22 22:38:58.727418502 +0000
 Birth: -

catrope@mw1241:~$ stat /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json 
  File: /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json
  Size: 75810     	Blocks: 152        IO Block: 4096   regular file
Device: 802h/2050d	Inode: 6750        Links: 1
Access: (0600/-rw-------)  Uid: (   33/www-data)   Gid: (   33/www-data)
Access: 2020-01-22 22:39:13.831139084 +0000
Modify: 2020-01-22 22:39:10.747196117 +0000
Change: 2020-01-22 22:39:10.747196117 +0000
 Birth: -

catrope@mw1241:~$ sudo -u www-data head -n 2 /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json 
{
    "mtime": 1579732738,

catrope@mw1241:~$ date -d @1579732738
Wed Jan 22 22:38:58 UTC 2020

Nothing super obvious jumps out here: on both servers, the mtime value encoded in the JSON file matches the mtime of the PHP file, and the mtime of the JSON file is after the mtime of the PHP file. The only difference I see is that the good server has a pretty big gap (12 seconds) between InitialiseSettings.php being written and the cache file being written, while on the bad server it's pretty small (0.6 seconds). So let's see if this is a common pattern:

$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -r ssh -F 20 -M -g mediawiki-installation -o -i/etc/keyholder.d/mwdeploy -o -oUser=mwdeploy -o -oBatchMode=yes -- awk "\"BEGIN {x = \$(stat /srv/mediawiki/wmf-config/InitialiseSettings.php --printf=%y | cut -d ':' -f 3 | cut -d ' ' -f 1) ; y = \$(stat /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json --printf=%y | cut -d ':' -f 3 | cut -d ' ' -f 1) ; if ( y > x ) { printf \\\"%f\\\",y - x; } else { printf \\\"%f\\\",60 + y - x; }}\"" \; 'echo "echo \$wgGEHomepageEnabled ? \" good\" : \" BAD\";" | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=ukwiki | head -n 1' | tee homepage-hosts-timing
$ cat homepage-hosts-timing | grep mw1 | sort -n -k 2

1mw1256.eqiad.wmnet: 0.048000 BAD
2mw1266.eqiad.wmnet: 0.052000 BAD
3mw1348.eqiad.wmnet: 0.059999 BAD
4mw1282.eqiad.wmnet: 0.072000 BAD
5mw1268.eqiad.wmnet: 0.084000 BAD
6mw1346.eqiad.wmnet: 0.084000 BAD
7mw1225.eqiad.wmnet: 0.116000 BAD
8mw1271.eqiad.wmnet: 0.163999 BAD
9mw1332.eqiad.wmnet: 0.236000 BAD
10mw1290.eqiad.wmnet: 0.243998 good
11mw1270.eqiad.wmnet: 0.260000 BAD
12mw1221.eqiad.wmnet: 0.275997 BAD
13mw1232.eqiad.wmnet: 0.295995 good
14mw1263.eqiad.wmnet: 0.315999 BAD
15mw1293.eqiad.wmnet: 0.352000 BAD
16mw1222.eqiad.wmnet: 0.415995 BAD
17mw1313.eqiad.wmnet: 0.443997 BAD
18mw1340.eqiad.wmnet: 0.447997 good
19mw1341.eqiad.wmnet: 0.448000 BAD
20mw1240.eqiad.wmnet: 0.607994 BAD
21mw1235.eqiad.wmnet: 0.615990 good
22mw1284.eqiad.wmnet: 0.643995 BAD
23mw1319.eqiad.wmnet: 0.643997 BAD
24mw1249.eqiad.wmnet: 0.660004 BAD
25mw1321.eqiad.wmnet: 0.671999 BAD
26mw1287.eqiad.wmnet: 0.675998 BAD
27mw1277.eqiad.wmnet: 0.748000 BAD
28mw1231.eqiad.wmnet: 0.787994 BAD
29mw1261.eqiad.wmnet: 0.827999 BAD
30mw1303.eqiad.wmnet: 0.871993 BAD
31mw1344.eqiad.wmnet: 0.923999 BAD
32mw1255.eqiad.wmnet: 1.179990 BAD
33mw1342.eqiad.wmnet: 1.311993 good
34mw1262.eqiad.wmnet: 1.332002 BAD
35mw1316.eqiad.wmnet: 1.340001 good
36mw1327.eqiad.wmnet: 1.379994 BAD
37mw1325.eqiad.wmnet: 1.403995 BAD
38mw1339.eqiad.wmnet: 1.411998 BAD
39mw1245.eqiad.wmnet: 1.455990 good
40mw1233.eqiad.wmnet: 1.467983 BAD
41mw1333.eqiad.wmnet: 1.612005 BAD
42mw1278.eqiad.wmnet: 1.716002 good
43mw1248.eqiad.wmnet: 1.723970 good
44mw1276.eqiad.wmnet: 1.792006 BAD
45mw1330.eqiad.wmnet: 1.831999 BAD
46mw1317.eqiad.wmnet: 1.903990 BAD
47mw1326.eqiad.wmnet: 2.028009 good
48mw1272.eqiad.wmnet: 2.039985 good
49mw1251.eqiad.wmnet: 2.239968 BAD
50mw1329.eqiad.wmnet: 2.340003 BAD
51mw1279.eqiad.wmnet: 2.363989 BAD
52mw1258.eqiad.wmnet: 2.371986 good
53mw1229.eqiad.wmnet: 2.447984 BAD
54mw1269.eqiad.wmnet: 2.480001 BAD
55mw1250.eqiad.wmnet: 2.579971 BAD
56mw1264.eqiad.wmnet: 2.583983 BAD
57mw1273.eqiad.wmnet: 2.608000 BAD
58mw1323.eqiad.wmnet: 2.619995 BAD
59mw1230.eqiad.wmnet: 2.639979 BAD
60mw1289.eqiad.wmnet: 2.843987 good
61mw1322.eqiad.wmnet: 2.880003 BAD
62mw1286.eqiad.wmnet: 3.171989 BAD
63mw1223.eqiad.wmnet: 3.183969 BAD
64mw1331.eqiad.wmnet: 3.191997 BAD
65mw1280.eqiad.wmnet: 3.255998 BAD
66mw1338.eqiad.wmnet: 3.564018 BAD
67mw1247.eqiad.wmnet: 3.687944 BAD
68mw1267.eqiad.wmnet: 3.755987 BAD
69mw1324.eqiad.wmnet: 3.795992 BAD
70mw1275.eqiad.wmnet: 3.931984 BAD
71mw1288.eqiad.wmnet: 4.083977 BAD
72mw1312.eqiad.wmnet: 4.099987 good
73mw1343.eqiad.wmnet: 4.175979 BAD
74mw1347.eqiad.wmnet: 4.287990 BAD
75mw1281.eqiad.wmnet: 4.511991 good
76mw1298.eqiad.wmnet: 4.639545 good
77mw1265.eqiad.wmnet: 4.831998 good
78mw1228.eqiad.wmnet: 5.099983 good
79mw1234.eqiad.wmnet: 5.267916 BAD
80mw1285.eqiad.wmnet: 5.387982 good
81mw1345.eqiad.wmnet: 5.868002 good
82mw1227.eqiad.wmnet: 6.419879 good
83mw1320.eqiad.wmnet: 6.435935 good
84mw1243.eqiad.wmnet: 7.016120 good
85mw1226.eqiad.wmnet: 7.039915 good
86mw1315.eqiad.wmnet: 7.451945 good
87mw1274.eqiad.wmnet: 7.523965 good
88mw1297.eqiad.wmnet: 7.883965 good
89mw1252.eqiad.wmnet: 7.915915 good
90mw1314.eqiad.wmnet: 8.155992 good
91mw1328.eqiad.wmnet: 8.619966 good
92mw1311.eqiad.wmnet: 8.651820 good
93mw1283.eqiad.wmnet: 9.287967 good
94mw1238.eqiad.wmnet: 9.323960 good
95mw1335.eqiad.wmnet: 9.420090 good
96mw1242.eqiad.wmnet: 10.351987 good
97mw1253.eqiad.wmnet: 10.543800 good
98mw1244.eqiad.wmnet: 10.599887 good
99mw1241.eqiad.wmnet: 12.019778 good
100mw1254.eqiad.wmnet: 12.083807 good
101mw1257.eqiad.wmnet: 12.515929 good
102mw1246.eqiad.wmnet: 16.556147 good
103mw1224.eqiad.wmnet: 17.403888 good
104mw1301.eqiad.wmnet: 17.483877 good
105mw1294.eqiad.wmnet: 20.183906 good
106mw1306.eqiad.wmnet: 20.587876 good
107mw1336.eqiad.wmnet: 20.864022 good
108mw1300.eqiad.wmnet: 21.859921 good
109mw1299.eqiad.wmnet: 23.803954 good
110mw1310.eqiad.wmnet: 23.911903 good
111mw1239.eqiad.wmnet: 24.191688 good
112mw1304.eqiad.wmnet: 24.207962 good
113mw1296.eqiad.wmnet: 25.447901 good
114mw1309.eqiad.wmnet: 27.652113 good
115mw1334.eqiad.wmnet: 28.035787 good
116mw1337.eqiad.wmnet: 34.923871 good
117mw1295.eqiad.wmnet: 35.275967 good
118mw1308.eqiad.wmnet: 40.231992 good
119mw1318.eqiad.wmnet: 42.620378 good
120mw1302.eqiad.wmnet: 50.883745 good
121mw1305.eqiad.wmnet: 52.787839 good
122mw1307.eqiad.wmnet: 57.911558 good

This isn't very satisfying: the bad servers do cluster near the top with deltas below 1 second, but there are a few good servers with <1s too, and some bad servers with deltas over 3 seconds. There does seem to be some sort of a correlation, but this is by no means a smoking gun.

Given that a scap takes ~60 seconds, should we just change the granularity of the stored/compared mtime down to the minute? Would that avoid this?

To throw in my own filesystem fiddling which shows much the same as above:

thcipriani@mw1341:~$ stat --format=%Y /srv/mediawiki/wmf-config/InitialiseSettings.php
1579732739
thcipriani@mw1341:~$ sudo -u www-data grep mtime /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json
    "mtime": 1579732739,

The previous sync happened 9 minutes earlier, so this is a cache for this version of the file.

thcipriani@mw1341:~$ sudo -u www-data grep wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage /tmp/mw-cache-1.35.0-wmf.15/conf2-ukwiki.json
        "wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage": 0,
thcipriani@mw1341:~$ grep -A10 wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage /srv/mediawiki/wmf-config/InitialiseSettings.php
'wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage' => [
        'default' => 0,
        'arwiki' => 50,
        'cswiki' => 50,
        'kowiki' => 50,
        'viwiki' => 50,
        'ukwiki' => 50,
        'huwiki' => 50,
        'hywiki' => 50,
],

However ^ shows that the new version of the file wasn't cached correctly.

Recapping what I just talked with @Catrope about in IRC: somehow the information that php pulled into memory via require_once was not the information that was on disk in InitialiseSettings.php even though the php process require_onced that file after it had been updated (or so the code in CommonSettings.php would have us believe, anyway).

I don't think that (rounding the stored mtime down to the minute) would help. I was measuring mtime deltas not because I think there's a bug of the type you're alluding to, but as a proxy for finding out whether the bad servers generated their cache files very shortly after the scap happened. If the cache file is only a tenth of a second younger than the settings file, that might indicate some sort of race condition, but at 0.25 seconds that's less plausible, and at 3 seconds it's definitely not plausible.

My remaining theories are:

  • Somehow a different code path does require_once "InitialiseSettings.php" before the mtime+cache check, and so when the cache refresh code path does it again, that reuses the stale version of IS.php that was loaded before
  • The opcache caches a stale version of IS.php, somehow

I'm going to comb through the initialization code in CommonSettings.php to see if I can exclude the former.
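For reference, the first theory hinges on require_once being a no-op the second time the same file is included within a request. A tiny standalone illustration (the file name here is hypothetical):

<?php
// Write a settings file and include it once.
file_put_contents( '/tmp/settings-demo.php', '<?php $foo = 1;' );
require_once '/tmp/settings-demo.php';
echo $foo, "\n"; // 1

// Update the file on disk and "re-include" it.
file_put_contents( '/tmp/settings-demo.php', '<?php $foo = 2;' );
require_once '/tmp/settings-demo.php'; // no-op: already included once
echo $foo, "\n"; // still 1 -- a later cache-refresh include in the same
                 // request would never see the updated contents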

Mentioned in SAL (#wikimedia-operations) [2020-02-18T22:58:00Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: cirrus: Enable ores_articletopics field creation for all wikis (extra sync for T236104) (duration: 01m 04s)

Simplified Recreation

I was able to recreate this issue with a simplified example.

This is a simplified version of how the cache is generated by CommonSettings.php

test.php
<?php

$cache_path = '/tmp/cache.json';
$confActualMtime = filemtime('InitialiseSettings.php');

$globals = json_decode(file_get_contents($cache_path), true);

# If the cache has a different mtime, it's invalid
if ( $globals['timestamp'] !== $confActualMtime ) {
        $globals = [];
}

# An invalid cache means a re-read from disk
if ( ! $globals ) {
        require_once "InitialiseSettings.php";

        $globals = [
                "timestamp" => $confActualMtime,
                "foo" => $foo,
        ];

        file_put_contents($cache_path, json_encode($globals));
}

var_dump($globals);

Then I made an extremely simplified InitialiseSettings.php

InitialiseSettings.php
<?php

$foo = 1;

The first time I call the script, the cache is created with:

/tmp/cache.json
{"timestamp":1582521524,"foo":1}

Then I update InitialiseSettings.php to set $foo = 2; hitting test.php in the browser results in:

/tmp/cache.json
{"timestamp":1582521648,"foo":1}

Which is exactly what we're seeing above: php is able to see the mtime of InitialiseSettings.php has changed and therefore rebuilds /tmp/cache.json with a new "timestamp" value but the old variables from InitialiseSettings.php.

We use the following relevant opcache settings in production:

php.ini
opcache.validate_timestamps = 1
opcache.revalidate_freq = 10

That is, php-fpm will check for updated scripts if the key in opcache is older than 10 seconds.

That is, in the example above the opcache entry for InitialiseSettings.php had not expired according to opcache.revalidate_freq (it was compiled < 10 seconds earlier), so when we called require_once, php got the old value of $foo from the not-yet-expired-but-nonetheless-invalid opcache.

What's worse is that, since we're looking at the mtime of InitialiseSettings.php to decide whether to regenerate this cache file, the problem will not be corrected when the opcache entry for InitialiseSettings.php eventually expires: only touching InitialiseSettings.php AFTER the 10-second window in which the opcache entry is still considered valid will fix the problem.
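One way to observe this window on a live server would be to compare opcache's view of the file with the filesystem's. A rough sketch, assuming opcache_get_status() may be called in this context and that the per-script timestamp field is populated (it should be when opcache.validate_timestamps is enabled):

<?php
$file = '/srv/mediawiki/wmf-config/InitialiseSettings.php';

$status = opcache_get_status( true ); // include per-script information
$script = is_array( $status ) ? ( $status['scripts'][ $file ] ?? null ) : null;

clearstatcache( true, $file );
$onDisk = filemtime( $file );

if ( $script && $script['timestamp'] < $onDisk ) {
	echo "opcache still holds a copy compiled before the last sync\n";
} else {
	echo "opcache entry matches the file on disk (or the file is not cached here)\n";
}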


The first awful fix that came to mind was:

test.php
if ( ! $globals ) {
    symlink('InitialiseSettings.php', "InitialiseSettings-${confActualMtime}.php");
    require_once "InitialiseSettings-${confActualMtime}.php";
    unlink("InitialiseSettings-${confActualMtime}.php");
}

This causes the opcache to use a new key for require_once. The nice thing to do would be to use opcache_compile_file('InitialiseSettings.php'); however, I worry that causes subtle corruption of the opcache that we've seen with other invalidation methods.

Another possibility is to restart php-fpm whenever we update InitialiseSettings.php; although that may be costly.

Another possibility is to restart php-fpm whenever we update InitialiseSettings.php; although that may be costly.

{{cn}}

I've never observed a real problem when restarting php-fpm en masse.

Having said that, I think a better strategy would be to have some "release identifier" (which in an ideal world would be a git sha1, for example of mediawiki-config) and use that to build the path for the cache file.

The idea of using mtime was always to avoid having to inspect the contents of the IS file itself. Maybe it's not a necessary optimisation, however? If so, and we always load IS.php, then we can just md5() the file contents or whatever, and compare to that cached key?
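As a sketch of what that could look like, here is a checksum-based variant of the simplified test.php from above (the field and file names are illustrative). Note that it still loads the settings via require_once, so by itself it only replaces mtime as the cache key -- it sidesteps mtime quirks such as same-second syncs, but not the opcache window:

<?php
$cache_path = '/tmp/cache.json';
$actualHash = md5( file_get_contents( 'InitialiseSettings.php' ) );

$globals = json_decode( @file_get_contents( $cache_path ), true );

// Invalidate on content change rather than on mtime.
if ( !is_array( $globals ) || $globals['hash'] !== $actualHash ) {
	require_once 'InitialiseSettings.php';

	$globals = [
		'hash' => $actualHash,
		'foo'  => $foo,
	];
	file_put_contents( $cache_path, json_encode( $globals ) );
}

var_dump( $globals );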

If I understand @thcipriani 's conclusion correctly, the problem is that code execution from disk (opcache) and mtimes from disk (filemtime) are coming from two different realities.

Code execution happens in a lagged reality that can be up to 10 seconds behind (via opcache), with its 10s loop of checking for updates. But file operations such as filemtime, even if performed on PHP files that opcache has available, are not answered by the same lagged reality. Instead, they are answered by the underlying file system directly. Is that right?

  • Is this documented and intended behaviour from upstream PHP? It doesn't seem unreasonable to keep these separate. But I also vaguely recall PHP caching mtimes, realpath resolution and other information about PHP files in its opcache and (intending to be?) using that. If so, then this might be a regression or other bug, and not intended by upstream. Possibly fixable in the short term.
  • If we can't fix it easily and/or it is intended by upstream, then we may want to use opcache_invalidate('./InitialiseSettings.php'); in our else branch that repopulates the cache, right before require-ing it, to make sure we don't populate it with stale data from opcache that predates the disk sync. However, given the bad race conditions we know opcache is vulnerable to surrounding opcache invalidation, I am not 100% sure whether that method is safe to call in our production environment. If opcache_invalidate() ends up doing the same thing that opcache would normally do after 10 seconds on demand, then it might be safe. On the other hand, if it is more like the resets that the php CLI can do (which we used to use), or the ones from the memory pressure trigger, then it could be a problem.

If I understand @thcipriani 's conclusion correctly, the problem is that code execution from disk (opcache) and mtimes from disk (filemtime) are coming from two different realities.

Code execution happens in a lagged reality that can be up to 10 seconds behind (via opcache), with its 10s loop of checking for updates. But file operations such as filemtime, even if performed on PHP files that opcache has available, are not answered by the same lagged reality. Instead, they are answered by the underlying file system directly. Is that right?

Yes and no. While your description is generally correct, there is also a realpath cache that can come into play; see https://engineering.facile.it/blog/eng/realpath-cache-is-it-all-php-opcache-s-fault/ for example.

What happens is:

  • We check the mtime of InitialiseSettings.php, which is a syscall and returns a valid value
  • We then re-include InitialiseSettings.php, but we get the file cached in opcache before that mtime
  • Is this documented and intended behaviour from upstream PHP? It doesn't seem unreasonable to keep these separate. But I also vaguely recall PHP caching mtimes, realpath resolution and other information about PHP files in its opcache and (intending to be?) using that. If so, then this might be a regression or other bug, and not intended by upstream. Possibly fixable in the short term.

It's the intended behaviour. Now the obvious fix would be

  • If we can't fix it easily and/or it is intended by upstream, then we may want to use opcache_invalidate('./InitialiseSettings.php'); in our else branch that repopulates the cache, right before require-ing it, to make sure we don't populate it with stale data from opcache that predates the disk sync. However, given the bad race conditions we know opcache is vulnerable to surrounding opcache invalidation, I am not 100% sure whether that method is safe to call in our production environment. If opcache_invalidate() ends up doing the same thing that opcache would normally do after 10 seconds on demand, then it might be safe. On the other hand, if it is more like the resets that the php CLI can do (which we used to use), or the ones from the memory pressure trigger, then it could be a problem.

which we did when doing scap, and we stopped doing because calling opcache_invalidate corrupts opcache under heavy load.

I propose an alternate solution in the next comment.

I think we need to go at this another way. We should refresh the json cache only when filemtime('InitialiseSettings.php') is different from the $globals['timestamp'] and the current time is larger than filemtime('InitialiseSettings.php') + ini_get('opcache.revalidate_freq').

This way, we'll ensure not only that we do refresh the cache when things change, but also that we are getting the refreshed version of InitialiseSettings.php.
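Applied to the simplified test.php above, the proposed guard would look roughly like this (this is not the actual CommonSettings.php patch; the names follow the example). On a mismatch the settings are still recomputed for the current request, but the cache file is only written once the revalidation window has passed, so stale opcache contents can never be frozen into it:

<?php
$cache_path      = '/tmp/cache.json';
$confActualMtime = filemtime( 'InitialiseSettings.php' );
$revalidateFreq  = (int)ini_get( 'opcache.revalidate_freq' );

$globals = json_decode( @file_get_contents( $cache_path ), true );

if ( !is_array( $globals ) || $globals['timestamp'] !== $confActualMtime ) {
	// Stale or missing cache: recompute for this request as before...
	require_once 'InitialiseSettings.php';
	$globals = [ 'timestamp' => $confActualMtime, 'foo' => $foo ];

	// ...but only persist it once the opcache grace period has passed,
	// so values read from a not-yet-revalidated opcache entry are never
	// written back into the cache file.
	if ( time() > $confActualMtime + $revalidateFreq ) {
		file_put_contents( $cache_path, json_encode( $globals ) );
	}
}

var_dump( $globals );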

  • If we can't fix it easily and/or is intended by upstream, then we may want to use opcache_invalidate('./InitialiseSettings.php'); in our else branch that repopulates the cache right before require-ing it. […]

which we did when doing scap, and we stopped doing because calling opcache_invalidate corrupts opcache under heavy load.

Are you sure that was opcache_invalidate(), and not opcache_reset()?

My assumptions:

  • No matter how broken opcache is, there seems to be a way currently that opcache does recompile source code after a file has changed on disk. Specifically, within about 10 seconds from a file changing, we believe opcache recompiles the file.
  • We believe that this on-demand recompilation during web requests, when it finds that a source file has changed, does not cause corruption.
  • Other means of invalidating, clearing or resetting opcache (e.g. when it detects it is running out of memory, or when we use php7adm to clear opcache in its entirety) seem to cause corruption. It was my understanding that these both happen "externally". That is, php7adm is invoked outside any web request, and the out-of-memory detection might be in some kind of parent process rather than the web request thread, though I'm not sure that is actually the case. I assume it is in a parent process because that is how I've explained the fact that "normal" source updates don't cause corruption. Calls made from within a web request under normal operation seem to do something right that these "external" calls get wrong or are unable to do.

We should refresh the json cache only when filemtime('InitialiseSettings.php') is different from the $globals['timestamp'] and the current time is larger than filemtime('InitialiseSettings.php') + ini_get('opcache.revalidate_freq').

This way, we'll ensure not only that we do refresh the cache when things change, but also that we are getting the refreshed version of InitialiseSettings.php.

This means we'll have a 10 second delay between scap sync-file completing (or on debug servers, scap pull), and it actually taking effect. We'll likely want to build in some kind of delay in these commands to avoid giving the illusion that deployment is done. We'll also need to factor this into our canary checks somehow, so that they wait an additional 10 seconds compared to how long they wait today so that we keep the same size monitoring/comparison window.

I'd like to avoid that if possible, as it makes the overall system a lot more difficult to reason about, especially considering multi-file syncing where some files are behind the cache and some aren't. If resolving this bug in a way that doesn't involve the above workaround is considered possible, but non-trivial, then I'd rather we invest that time and effort into T169821 instead. With a bit of work from us, and possibly some scap-driven or pre-commit step, we could eliminate this system entirely.

  • If we can't fix it easily and/or is intended by upstream, then we may want to use opcache_invalidate('./InitialiseSettings.php'); in our else branch that repopulates the cache right before require-ing it. […]

which we did when doing scap, and we stopped doing because calling opcache_invalidate corrupts opcache under heavy load.

Are you sure that was opcache_invalidate(), and not opcache_reset()?

Pretty sure, having written the code myself:
https://github.com/wikimedia/puppet/blob/36527ba432ea39086f0474373b2c51b4259e58f2/modules/profile/files/mediawiki/php/admin/lib.php#L371

My assumptions:

  • No matter how broken opcache is, there seems to be a way currently that opcache does recompile source code after a file has changed on disk. Specifically, within about 10 seconds from a file changing, we believe opcache recompiles the file.
  • We believe that this on-demand recompilation during web requests, when it finds that a source file has changed, does not cause corruption.
  • Other means of invalidating, clearing or resetting opcache (e.g. when it detects it runs out of memory, or when we use php7adm to clear opcache in its entirety) seem to cause corruption. It was my understanding that these both happen "externally". That is, php7adm is invoked outside any web request.

php7adm is a tool I wrote and just performs a webrequest to php-fpm, to a special, access-restricted admin endpoint. So no big difference there.

To sum up: I'll have to check php's source code, but I expect the revalidation procedure to look more like what opcache_compile_file does instead.

We also have the possibility of doing what I proposed originally for deployments: disable opcache revalidation completely, and restart php-fpm for every release. This will give us a series of advantages, including guaranteeing atomicity of releases on each server.

If we want to go down that road, I'll probably have to add some safeguards to the restart scripts, but given we'll have to go down that route once we move to containers anyways, I think it's a route worth pursuing.

I'd also like to add we've done rolling restarts of php-fpm and we've barely seen any regression in the performance metrics afterwards.

Example code above for anyone interested in fiddling with this: https://github.com/thcipriani/T236104

I just made a tiny change to your test.php file

https://github.com/lavagetto/T236104/commit/e32162717a79e09a1aa62b07d86b018f71385f88

To only save the file if the grace period is passed.

The result is that even if the bug is triggered (so we're reading a stale version of InitialiseSettings.php), it recovers as soon as opcache is revalidated, like for every other file. We will be able to use the cache only after 11 seconds (the grace period).

We also have the possibility of doing what I proposed originally for deployments: disable opcache revalidation completely, and restart php-fpm for every release. This will give us a series of advantages, including guaranteeing atomicity of releases on each server.

If we want to go down that road, I'll probably have to add some safeguards to the restart scripts, but given we'll have to go down that route once we move to containers anyways, I think it's a route worth pursuing.

I'd also like to add we've done rolling restarts of php-fpm and we've barely seen any regression in the performance metrics afterwards.

Since our efforts to clear the opcache have been foiled by php itself, I think it makes sense to revisit restarting php as part of deployment. I have been hesitant as I want to preserve fast rollback: that will have to be a design consideration. We'd need to be sure to touch InitialiseSettings.php after the php restart (otherwise we'll just end up using the values cached in the JSON).

Is there a sane workaround we could put in place in the interim? Either opcache_compile_file or the ugly symlink solution I posted above (or a smarter variation of that)?

Is there a sane workaround we could put in place in the interim? Either opcache_compile_file or the ugly symlink solution I posted above (or a smarter variation of that)?

See my proposal above:

  • We read the file from cache; if it's stale, we require InitialiseSettings.php
  • We only save the file once the grace period has passed, so that we're sure the version of the globals we cache is up to date.

Change 575469 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Do not update the globals cache file while opcache needs regeneration

https://gerrit.wikimedia.org/r/575469

I admit I don't understand PHP or MW, but... would a simpler approach be for scap to wait 10s before it touches the settings file to update its mtime, and rsyncs the file to the appservers?

[…] would a simpler approach be for scap to wait 10s before it touches the settings file to update its mtime, and rsyncs the file to the appservers?

Each PHP file has its compilation cached by PHP7 for 10s before it revalidates against the source file. This repeating 10s window isn't coordinated between servers. If the last deployment was an hour ago, there will have been many revalidation checks (during which nothing happens if no deployment happened). It is not until 10 seconds after the local rsync completes that we know it's been picked up, because the last revalidation attempt may have completed just before our rsync started.

If I understand your idea correctly, waiting longer on the deployment server wouldn't help in this case. Does that make sense?

I'm wondering if it'd be easier if we simply generate this cache upfront the way we do for localisation cache. I can look into writing a script that a Scap plugin could call which would produce the necessary config files.

Okay, there's clearly things here I'm not getting yet. Scratch the idea of having scap wait ten seconds.

Has anyone tried Giuseppe's change in Gerrit? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/575469

I'm wondering if it'd be easier if we simply generate this cache upfront the way we do for localisation cache. I can look into writing a script that a Scap plugin could call which would produce the necessary config files.

That would work for sure, but it means adding yet another process in the hands of scap pre-deployment, vs having such things generated at merge time for instance.

I thought there were further reasons not to distribute the json files with the code itself. If not, let's generate them as part of the mediawiki-config merge process instead.

I'm wondering if it'd be easier if we simply generate this cache upfront the way we do for localisation cache. I can look into writing a script that a Scap plugin could call which would produce the necessary config files.

Indeed, and if you recall I built this and you have concerns about deploying it. :-)

Deploys would be composer buildConfigCache && scap sync-dir wmf-config/config-cache. No need to change anything in scap to get this done.

Stalled on working out what to do about some calls to SiteConfiguration wanting to know everything about every wiki, not just one. Otherwise we'll just make the wrong-file-loaded issue happen only for some kinds of request.

FWIW, ran some tests with apachebench today and the patchset on this task and it didn't seem to make requests significantly slower or load significantly higher in my rudimentary testing:

$ ab -n 1000 -c 10 -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' https://en.wikipedia.org/wiki/Main_page
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking en.wikipedia.org (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        mwdebug1002.eqiad.wmnet
Server Hostname:        en.wikipedia.org
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-CHACHA20-POLY1305,256,256
Server Temp Key:        X25519 253 bits
TLS Server Name:        en.wikipedia.org

Document Path:          /wiki/Main_page
Document Length:        79269 bytes

Concurrency Level:      10
Time taken for tests:   48.676 seconds
Complete requests:      1000
Failed requests:        2
   (Connect: 0, Receive: 0, Length: 2, Exceptions: 0)
Total transferred:      80405609 bytes
HTML transferred:       79269000 bytes
Requests per second:    20.54 [#/sec] (mean)
Time per request:       486.758 [ms] (mean)
Time per request:       48.676 [ms] (mean, across all concurrent requests)
Transfer rate:          1613.14 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       75   83   3.6     82     119
Processing:   263  398  82.9    378    1582
Waiting:      203  337  82.5    317    1522
Total:        347  481  83.1    462    1667

Percentage of the requests served within a certain time (ms)
  50%    462
  66%    487
  75%    507
  80%    527
  90%    575
  95%    613
  98%    653
  99%    695
 100%   1667 (longest request)
~ 49s
$ ab -n 1000 -c 10 -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' https://en.wikipedia.org/wiki/Main_page
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking en.wikipedia.org (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        mwdebug1002.eqiad.wmnet
Server Hostname:        en.wikipedia.org
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-CHACHA20-POLY1305,256,256
Server Temp Key:        X25519 253 bits
TLS Server Name:        en.wikipedia.org

Document Path:          /wiki/Main_page
Document Length:        79311 bytes

Concurrency Level:      10
Time taken for tests:   51.275 seconds
Complete requests:      1000
Failed requests:        2
   (Connect: 0, Receive: 0, Length: 2, Exceptions: 0)
Total transferred:      80447873 bytes
HTML transferred:       79310998 bytes
Requests per second:    19.50 [#/sec] (mean)
Time per request:       512.749 [ms] (mean)
Time per request:       51.275 [ms] (mean, across all concurrent requests)
Transfer rate:          1532.18 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       75   84   7.6     83     145
Processing:   236  421  84.8    402    1177
Waiting:      178  358  82.9    337    1117
Total:        315  506  86.1    484    1265

Percentage of the requests served within a certain time (ms)
  50%    484
  66%    520
  75%    550
  80%    566
  90%    623
  95%    670
  98%    740
  99%    777
 100%   1265 (longest request)
thcipriani triaged this task as High priority.

Looks like this just needs deployment, assigning to @Jdforrester-WMF to do the needful.

Change 575469 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Do not update the globals cache file while opcache needs regeneration

https://gerrit.wikimedia.org/r/575469

If I understand correctly, this means forcing a 10 second period after every deployment of unconditionall cache miss for the configuration. Regenerating on every request thread during that time. […]

This approach favours consistency (the version of the code you get is always consistent with how $globals is generated) at the cost of some more computing power. […] I'd like to quantify the cost of /not/ caching in terms of cpu load and response times to our users.

Some weeks later now, having been deep into this mess as part of T169821, I too now prefer the simple approach by @Joe. The latency impact of having to re-compute this over and over again used to be quite significant, but ever since the switch to HHVM and more recently PHP 7.2 the hit has become relatively small in the grand scheme of things (about 10-20 ms).

That is still quite a significant amount of time to be spending on the CPU, given that all it does is give us some configuration variables to initialise MediaWiki. It is expensive mainly because it involves reading several non-PHP files off disk and then performing many hundreds of complex iterations over each of the thousands of config settings.

That amount of time is comparable to performing 10 Memcached queries, or to connecting to a master DB and fetching rows from a few simple queries.

It is also comparable to the entire backend response time of some load.php requests (when the config is cached). So for endpoints whose SLO targets 50-100 ms, an additional 10-20 ms is huge.

But given the limited options at our disposal, at least for the short-term, we should make this stuff work first, and a latency regression for a few seconds after every deploy seems acceptable for the short term.

For the mid-term, I have already started work on T169821 which so far has reduced this cost down to ~ 4ms. And my next step there is to simply remove this caching mechanism entirely pretty soon, which isn't too much harder given that a "cache hit" scenario involves synchronous disk I/O to read a JSON file and then parse it on the CPU, whereas a "cache miss" scenario right now is doing a lot of CPU computations in pure memory (no disk I/O).

And for the mid-to-long term, as part of T223602, the new cache we need for the YAML expansions would be handled by a Scap build step of sorts and thus not be something for the run-time to be concerned about.

Change 575469 merged by jenkins-bot:
[operations/mediawiki-config@master] Do not update the globals cache file while opcache needs regeneration

https://gerrit.wikimedia.org/r/575469

Mentioned in SAL (#wikimedia-operations) [2020-04-20T18:14:21Z] <jforrester@deploy1001> Synchronized wmf-config/CommonSettings.php: T236104 Wait to update the globals cache file for opcache regeneration (duration: 01m 02s)

OK, this is deployed, and I made a quick patch and used SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -r ssh -F 20 -M -g mediawiki-installation -o -i/etc/keyholder.d/mwdeploy -o -oUser=mwdeploy -o -oBatchMode=yes -- 'echo "echo \$wgTmhUseBetaFeatures ? \"BAD\" : \"good\";" | sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki=testwiki | head -n 1' | tee ~/wgTmhUseBetaFeatures-testwiki-hosts to verify that every machine had the correct value. Success.