Page MenuHomePhabricator

Figure out why HHVM kept running stale code for hours
Open, Needs TriagePublic

Description

Follows T181773: Deployment to change wgFragmentMode not consistent on all app servers

After a change to InitialiseSettings.php, for at least 5 hours HHVM was still running with outdated code.

It is unknown how long it would've continued like this, but it stands to reason that if it lasted 5 hours, it probably just missed the change and wasn't gonna detect it later. It was worked around by performing another deployment, which appears to have been detected consistently by at least the servers that previously didn't detect it.

We should figure out:

  • What caused it?
  • How can we avoid it in the future?
  • How can we detect it?

The detection is imho most important because the rest can be worked around, and there might in fact be more than one scenario/cause that could lead to stale HHVM byte code. A generic detection is more important than fixing or avoiding this particular case.

Note that it would be wrong to yesterday's deployment was exceptional in not reaching all HHVM instances. It's quite likely the secondary fix-up sync also didn't reach all instances. The only thing we know is that the subset of instances that didn't get it the first time, got it the second time.

Lastly, we should also keep an eye out for the possibility that this isn't caused by HHVM's byte code cache, but instead a bug in our own application-layer caches. Because for the specific case of wmf-config settings, we do cache those in some way as well.

Event Timeline

Krinkle created this task.Dec 1 2017, 6:45 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 1 2017, 6:45 PM
Legoktm added a subscriber: Legoktm.Dec 2 2017, 7:15 AM

Are we sure this is an HHVM problem? It could have also been a problem with the InitialiseSettings.php cache not being invalidated?

mmodell added a subscriber: mmodell.Dec 7 2017, 5:08 AM

This (or similar) seems to happen pretty regularly. Is there even any real justification for the mess that is InitializeSettings cache? It seems to me that it wouldn't provide much performance benefit but it definitely can have a significant cost (e.g. causing problems like this)

greg assigned this task to thcipriani.Dec 11 2017, 10:08 PM
greg added subscribers: thcipriani, greg.

@thcipriani will take a stab at some diagnosis here and report back.

thcipriani removed thcipriani as the assignee of this task.May 3 2018, 6:24 PM

From the deployment tooling standpoint there are certainly cases where the value in /tmp/mw-cache-[version]/conf-[dbname] might be incorrect and caching old values. For instance, since the deployment tooling uses rsync with --delay-updates if multiple files are synced the order of sync of those files is not guaranteed. There is a corner case possible where:

  1. rsync renames wmf-config/.~tmp~/InitialiseSettings.php -> wmf-config/InitialiseSettings.php
  2. An http request is received by the server which triggers this code in CommonSettings.php:
CommonSettings.php
if ( @filemtime( $filename ) >= filemtime( "$wmfConfigDir/InitialiseSettings.php" ) ) {
	$cacheRecord = @file_get_contents( $filename );
	if ( $cacheRecord !== false ) {
		$globals = unserialize( $cacheRecord );
	}
}
if ( !$globals ) {
	# Get configuration from SiteConfiguration object
	require "$wmfConfigDir/InitialiseSettings.php";
        # [...]
       # Save cache
	@mkdir( '/tmp/mw-cache-' . $wmgVersionNumber );
	$tmpFile = tempnam( '/tmp/', "conf-$wmgVersionNumber-$wgDBname" );
	if ( $tmpFile && file_put_contents( $tmpFile, serialize( $globals ) ) ) {
		if ( !rename( $tmpFile, $filename ) ) {
			// T136258: Rename failed, cleanup temp file
			unlink( $tmpFile );
		};
	}
# [...]
}
  1. rsync finishes possibly after renaming additional files, i.e., wmf-config/.~tmp~/CommonSettings.php -> wmf-config/CommonSettings.php
  2. Scap touches InitialiseSettings.php, but the touch happens the same second as when the rsync ends, and since filemtime returns a unix timestamp which doesn't have sub-second resolution a subsequent recache of InitialiseSettings.php is never triggered.

This is one possible race-condition scenario I could see happening.


However, in this instance, the above would not be a valid explanation since only InitialiseSettings.php was synced. Scap touches InitialiseSettings.php after sync is complete so it should have re-triggered CommonSettings.php cache. Further pulling down changes via git on tin should have updated the mtime of the file, so rsync should have changed the file mtime even if scap failed to touch InitialiseSettings.php (which I don't think I've actually seen happen looking at the logs). Rsync changes file mtime before moving the file into its final location so that doesn't account for the problem either (that is, the timestamp should never change before the file content).

There could be something being cached by hhvm that caused this; however, a subsequent touch and re-sync solving the problem seems to point to something in the interplay of rsync and the cache/cache invalidation in CommonSettings.php.

I'm unassigning myself from this task since I initially suspected some race-condition to be the problem, I can't see how that is the case after some investigation :(

greg added a comment.Nov 19 2018, 11:41 PM

I don't know how much further we can go with this after so long... propose to decline.

Can we think about ways to detect the problem within short period of time? I'd like to see that happen at least.