
Cache of wmf-config/InitialiseSettings often 1 step behind
Open, Needs Triage, Public

Description

It has happened to me in 5 out of 5 deployments I did in the past 30 days: after changing a value in InitialiseSettings.php and syncing it out, the change does not apply on about half the servers (possibly more).

But it works fine on mwdebug1002. And with a second touch+sync it always works on the rest of the fleet as well.

This suggests there might be a resurfaced old bug and/or a new bug caused by recent changes to MWConfigCacheGenerator.
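
For reference, the second touch+sync that makes it apply looks roughly like this on the deployment server (a sketch; the staging path and log message are assumptions):

# Bump the mtime of the staged copy, then sync the file out again.
# (Staging path and sync message are assumptions, not taken from an actual deploy.)
touch /srv/mediawiki-staging/wmf-config/InitialiseSettings.php
scap sync-file wmf-config/InitialiseSettings.php 'Re-sync InitialiseSettings.php to force config cache regeneration'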

Event Timeline

Krinkle created this task. Oct 21 2019, 9:30 PM
Krinkle updated the task description.
Krinkle updated the task description. Oct 21 2019, 9:32 PM
jijiki added a subscriber: jijiki. Oct 22 2019, 10:25 AM
CDanis added a subscriber: CDanis. Oct 22 2019, 12:23 PM
CDanis added a subscriber: MaxSem. Oct 22 2019, 12:26 PM

This happened yesterday, during firefighting work on {T235949}

All appservers had the same config:

❌cdanis@cumin1001.eqiad.wmnet ~ πŸ•ŸπŸ΅ sudo !!
sudo cumin A:mw 'md5sum /srv/mediawiki/wmf-config/InitialiseSettings.php'
126 hosts will be targeted:
mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1238-1258,1261-1275,1319-1333].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(126) mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1238-1258,1261-1275,1319-1333].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
----- OUTPUT of 'md5sum /srv/medi...liseSettings.php' -----
b5a01965593c5a02207d93e70c630832 /srv/mediawiki/wmf-config/InitialiseSettings.php
================
PASS: |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 100% (126/126) [00:01<00:00, 77.68hosts/s]
FAIL: | | 0% (0/126) [00:01<?, ?hosts/s]
100.0% (126/126) success ratio (>= 100.0% threshold) for command: 'md5sum /srv/medi...liseSettings.php'.
100.0% (126/126) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

And yet a third of the fleet was still broken

20:39:15	<@cdanis>	(78) mw[2163-2199,2224-2242,2254-2258,2268-2277].codfw.wmnet,mw[1240,1273,1322].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet                                     
20:39:17	<@cdanis>	these appservers seem fine
20:39:20	<@cdanis>	(48) mw[1238-1239,1241-1258,1261-1272,1274-1275,1319-1321,1323-1333].eqiad.wmnet                                                                                                              
20:39:22	<@cdanis>	these appservers are not

@MaxSem touched InitialiseSettings.php and did another sync-file and that fixed it.
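
Since the md5sums agree everywhere, a follow-up check next time could be whether the on-disk mtimes also agree across the fleet (a sketch reusing the same cumin alias; the stat invocation mirrors the one used further down in this task):

# cumin groups hosts by identical output, so any host with a differing mtime
# will show up as a separate node group.
sudo cumin A:mw "stat --format '%Y %n' /srv/mediawiki/wmf-config/InitialiseSettings.php"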

How reliable is the filemtime function in a scap world? Does scap / rsync always migrate the mtime from the host to the target machine?


rsync certainly can, but won't by default unless you pass -t / --times. (-a / --archive includes this in the set of options it enables.)

I see --archive included in scap/tasks.py DEFAULT_RSYNC_ARGS, so I assume that scap DTRT here?
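
A quick local illustration of what -t / --times changes (a toy sketch with throwaway files in /tmp):

# Without --times the copy gets a fresh mtime; with --times the source mtime is preserved.
touch -d '2019-10-01 00:00:00' /tmp/src.php
rsync /tmp/src.php /tmp/copy-without-times.php
rsync --times /tmp/src.php /tmp/copy-with-times.php
stat --format '%Y %n' /tmp/src.php /tmp/copy-without-times.php /tmp/copy-with-times.php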

Krinkle added a comment. Edited Oct 24 2019, 3:08 PM

I don't know off-hand whether the mtime comes from the deployment server (set when the last git command saved the file to disk), or whether it is set by the destination server at the time of local sync.

However, we don't currently depend on either model as far as I know. That is, so long as the mtime changes in some way, the code for config cache will handle it. It doesn't assume equality across the cluster and doesn't assume it goes up.

Having said that, it seems most likely that something in the config cache code is broken, because it's the only part of this that has changed recently. I've looked at ed5297c107f38d and the subsequent refactors by James many times, but nothing obvious stands out...
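
To make that last point concrete, under that model even moving the mtime backwards counts as a change (a toy sketch with a throwaway file, not the real cache code):

# Record an mtime, then move the file's mtime backwards; the recorded value no longer
# matches, which on the model described above is enough to invalidate the cache.
touch -d '2019-10-01 12:00:00' /tmp/IS-example.php
recorded=$(stat --format '%Y' /tmp/IS-example.php)
touch -d '2019-09-01 12:00:00' /tmp/IS-example.php
current=$(stat --format '%Y' /tmp/IS-example.php)
[ "$recorded" != "$current" ] && echo "mtime changed, cache would be regenerated"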

Anomie added a subscriber: Anomie. Nov 18 2019, 3:25 PM

Happened to me today when deploying a config change for T198312. Touching InitialiseSettings.php and re-syncing did fix it.

Addshore added subscribers: Addshore, Daimona, Catrope, Ladsgroup.

Also happened to me twice in the past few days.

the change does not apply on about half the servers (possibly more).

For some of my syncs I was lucky if I could find 1 or 2 servers with the config actually loaded correctly.
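
One way to see which servers actually picked up a change is to grep the generated cache file for the new value rather than checking the source file (a sketch; the search string, wiki, and version in the path are placeholders for whatever is being deployed):

# Hosts whose generated config cache does not yet contain the new value land in
# cumin's FAIL group, because grep exits non-zero when it finds nothing.
sudo cumin A:mw "grep -c 'wgSomeNewSetting' /tmp/mw-cache-1.35.0-wmf.10/conf2-enwiki.json"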

How reliable is the filemtime function in a scap world? Does scap / rsync always migrate the mtime from the host to the target machine?

rsync certainly can, but won't by default unless you pass -t / --times. (-a / --archive includes this in the set of options it enables.)
I see --archive included in scap/tasks.py DEFAULT_RSYNC_ARGS, so I assume that scap DTRT here?

Those should be the rsync arguments that scap uses in this case; however, after each sync scap also touches the wmf-config/InitialiseSettings.php file to trigger the re-cache of settings.

This happened yesterday, during firefighting work on {T235949}
All appservers had the same config [...]
And yet a third of the fleet was still broken [...]
@MaxSem touched InitialiseSettings.php and did another sync-file and that fixed it.

So the actual rsyncing of the file appears to work just fine; however, the updated file is not loaded into the config correctly.

The next step of troubleshooting is to figure out whether this is a problem with the fix for the race condition in T217830 (or some subsequent refactor), or something entirely new to do with PHP caching. I guess that would mean comparing the mtime of /srv/mediawiki/wmf-config/InitialiseSettings.php with /tmp/mw-cache-[versions]/conf2-[wiki].json. If, on a misbehaving appserver, the cache has an older mtime than the IS.php file, then we know regeneration was not triggered. e.g.,

thcipriani@mwdebug1001:~$ sudo -u www-data jq '.mtime' /tmp/mw-cache-1.35.0-wmf.10/conf2-enwiki.json 
1576527000
thcipriani@mwdebug1001:~$ sudo -u www-data php -a
Interactive mode enabled

php > var_dump(filemtime('/srv/mediawiki/wmf-config/InitialiseSettings.php'));
int(1576527000)
php > ^D
thcipriani@mwdebug1001:~$ stat --format '%Y (%y) %n' /srv/mediawiki/wmf-config/InitialiseSettings.php
1576527000 (2019-12-16 20:10:00.935866457 +0000) /srv/mediawiki/wmf-config/InitialiseSettings.php
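
A small loop along the same lines could flag every stale cache file on a given appserver (a sketch; the paths follow the layout shown above):

# Compare the mtime recorded in each generated config cache against the current
# mtime of InitialiseSettings.php and report any mismatch.
is_mtime=$(stat --format '%Y' /srv/mediawiki/wmf-config/InitialiseSettings.php)
for cache in /tmp/mw-cache-*/conf2-*.json; do
    cached=$(sudo -u www-data jq -r '.mtime' "$cache")
    if [ "$cached" != "$is_mtime" ]; then
        echo "STALE: $cache (recorded $cached, file $is_mtime)"
    fi
done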

Mentioned in SAL (#wikimedia-operations) [2020-01-06T12:31:00Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:562083|Don’t check constraints on P6685 statements]] Bypassing T236104 (duration: 00m 55s)