For both performance reasons (we have some long-running scripts on these hosts that would benefit) and to get rid of the remaining PHP 5.3 hosts.
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +2 -1 | enabling hhvm for dumps
operations/puppet | production | +7 -16 | Don't install mcelog on machines with AMD processors
Event Timeline
@ArielGlenn: What's the status of this? MediaWiki is not going to support PHP 5.3 forever; at some point these hosts are going to break.
MediaWiki will support 5.3 as long as it takes to get the cluster off 5.3...the hosts aren't going to arbitrarily break.
To nudge this along, I applied the roles and classes currently applied to the snapshot hosts to osmium, which is running Trusty. With the exception of the need to add new snapshot hosts to datasets1001:/etc/exports, I did not run into any issues. Puppet ran to completion and made all requisite state changes.
The way we run the dumps now, we have a (scheduled) break in between runs, which should come up in a week and last for 5 days. This will be enough time to rebuild one more library, reimage one host, test it, deal with any fallout and then do the same for the rest. So expect action on this around September 15, earlier if the run of the dumps finishes earlier.
The idea is that we have two runs a month, with some dead time after each run to deal with security updates or anything else desired.
We've only just got rid of the last 10.04 host, 5.5 years after its release.
Ariel has also been ill, so this probably has been delayed a little bit.
Progress is happening, and other blockers are being found. For example, the video scalers aren't faring well under hhvm; see T113284
What's the major urgency to use > PHP 5.3 features? Better, cleaner, less verbose code? Super performance?
snapshot1002 was reinstalled with trusty last week.
The utf8 normal extension had already been rebuilt by ori for trusty, so I figured we were good to go. But it had not been packaged and built for HHVM. For a while I was fighting with that, which was no fun, since it relies on SWIG for PHP, which is not supported for HHVM. But it turns out that MW now uses the intl library, so I was able to just remove all the dependencies: https://gerrit.wikimedia.org/r/#/c/241537/ and https://gerrit.wikimedia.org/r/#/c/241540/, now merged.
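As an illustration only (not the actual MediaWiki code path), something along these lines shows why the intl extension makes the old utf8 normal module unnecessary for NFC normalization:

<?php
// Sketch: NFC-normalize a decomposed string with the intl extension's
// Normalizer rather than the old php utfnormal module.
$decomposed = "e\xCC\x81"; // 'e' followed by a combining acute accent (U+0301)
$composed = Normalizer::normalize( $decomposed, Normalizer::FORM_C );
echo bin2hex( $composed ), "\n"; // prints c3a9, the precomposed 'é'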
The remaining known blocker is that our HHVM seems to be built without compress.bzip2 stream support (see blocking tasks). If that's not fixed before October I will have the dumps fall back to running php5 (https://gerrit.wikimedia.org/r/#/c/241612/)
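In the meantime, a quick sanity check along these lines (just a sketch, not part of the dumps code) tells us whether a given runtime registers the compress.bzip2 stream wrapper at all:

<?php
// List the stream wrappers the current runtime (php5 or hhvm) registers and
// report whether bzip2-compressed files can be opened via compress.bzip2://.
$wrappers = stream_get_wrappers();
if ( in_array( 'compress.bzip2', $wrappers ) ) {
	echo "compress.bzip2 stream wrapper available\n";
} else {
	echo "compress.bzip2 stream wrapper missing; dumps would need to fall back to php5\n";
}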
I'm going to go ahead and convert snapshot1004 and snapshot1001 over to trusty today. Because snapshot1003 runs daily cron jobs, I will wait on it until Oct 1, when I'll make the decision about whether to fall back for the month or not.
Upgraded snapshot1001 to trusty. Puppet runs OK with one exception: it is unable to finish the installation of mcelog because:
root@snapshot1001:~# mcelog --foreground
mcelog: AMD Processor family 16: Please load edac_mce_amd module.
: Success
CPU is unsupported
root@snapshot1001:~# lsmod | grep mce
edac_mce_amd 22617 1 amd64_edac_mod
mcelog doesn't run on more recent AMD CPUs (http://www.mcelog.org/faq.html#18). We need to remove it from modules/base/manifests/standard-packages.pp for hosts with AMD CPUs, I guess.
Change 241736 had a related patch set uploaded (by Ori.livneh):
Don't install mcelog on machines with AMD processors
Change 241736 merged by Ori.livneh:
Don't install mcelog on machines with AMD processors
Strictly speaking we only need to exclude it from hosts with AMD processor family >= 16, but since there's only one host on the cluster that has AMD processors, meh. Thanks, Ori! After a puppet run and manual removal of that package, snapshot1001 looks good.
snapshot1004 has been reinstalled. Pending: the HHVM upstream bug, and converting snapshot1003 as soon as I know whether we'll have a fix by Oct 1 or not.
All snapshot hosts are now converted to trusty. Joe found two differences in php5 between precise and trusty: no igbinary and no apc extensions. But we use neither of those, so no problem. A test run on a small wiki worked just fine with php5. Conversion to hhvm is stalled until the upstream bug (see blocking task) is fixed.
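For the record, a minimal check like this (my own sketch, not part of the migration itself) is enough to confirm on a trusty host that neither extension is present or needed:

<?php
// Report whether the two extensions that disappeared between precise and
// trusty are loaded; the dump scripts rely on neither of them.
foreach ( array( 'igbinary', 'apc' ) as $ext ) {
	printf( "%s: %s\n", $ext, extension_loaded( $ext ) ? 'loaded' : 'not loaded' );
}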
T65899 is not a blocked task; all snapshots are converted to trusty. That's why I removed this from the blockers.
Just curious, did this change have an effect on the dumps generation process? It seems like the probability of getting failed jobs is higher in the first October 2015 run.
Moving to trusty and PHP 5.5 would presumably bring some improvements, but I suspect that going up to hhvm in the near future will be a bigger net win...
And yeah, it's possible both parts of the "upgrade" will cause problems, but I'm fairly sure Ariel will be keeping a close eye on them to begin with
Moving to trusty and php 5.5 didn't particularly impact the dumps. There was an issue caused by my pylint changes, unrelated to the upgrade; but the way stages go, there's a catch-all stage that reruns all jobs that failed earlier, and you can see that this has been happening for the "big" wikis.
Thanks for the information; hopefully we are moving towards a faster dump generation process. Thanks!
After a chat on IRC with ori: The current dump run will likely get done in about 12 days, and we'll know a bit more about the stability of the new hhvm version by then too. So I'm leaving this as 'stalled' until then.
Change 281547 had a related patch set uploaded (by ArielGlenn):
enabling hhvm for dumps
I've done this for the new snapshot hosts and run a test dump of a wiki; it looked fine. I'll keep this open until the misc cron jobs are moved over, however.
During the full run some jobs produce "Lost parent" messages from hhvm; so far I have only seen this for the articles dumps, i.e. those that spawn a separate php script to get revisions. The job itself runs to completion with fully produced files, but the manager sees errors and believes it to have failed. It also spams the status html page for users. Needs more investigation.
The LightProcessCount: Lost parent message is just signaling the exit of a child php process.
So it's most definitely not an error.
You might want to configure the spawned php processes so that they have a light process count of zero; that can be done by using, for example, a separate ini file for those (see the sketch below).
In general, though, spawning a php process from within a php process instead of using fork() feels like the wrong thing to do.
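A minimal sketch of that suggestion, assuming the revision-fetch script is launched via a shell command; the ini file path, its contents and the script name here are illustrative assumptions, not what we actually ship:

<?php
// Hypothetical sketch: launch the spawned fetch script with its own ini file
// so the child process runs with no light processes at all.
// Assumed contents of /etc/hhvm/no-lightprocs.ini:
//   hhvm.server.light_process_count = 0
$cmd = '/usr/bin/hhvm -c /etc/hhvm/no-lightprocs.ini fetchRevisions.php';
$pipe = popen( $cmd, 'r' );
while ( !feof( $pipe ) ) {
	echo fgets( $pipe );
}
pclose( $pipe );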
Some investigation from the command line shows me that these messages for dumps are only produced when there's a segfault causing the parent indeed to exit, and this is connected to using bzip2-compressed files for the prefetch, even without spawn. I.e.:
datasets@snapshot1002:~/testing$ /usr/bin/php /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=lawikibooks --stub=file:/home/datasets/testing/stubs.xml --prefetch=bzip2:/home/datasets/testing/prefetch.xml.bz2 --report=1000 --current > /dev/null
with only one page in stubs.xml, one page in prefetch.xml (not necessarily the one to retrieve), and output to stdout or anywhere else, gives:
2016-04-11 11:20:18: lawikibooks (ID 24566) 1 pages (34.2|34.2/sec all|curr), 1 revs (34.2|34.2/sec all|curr), 0.0%|0.0% prefetched (all|curr), ETA 2016-04-11 11:24:06 [max 7794]
Segmentation fault
datasets@snapshot1002:~/testing$ [Mon Apr 11 11:20:18 2016] [hphp] [24570:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24568:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24571:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24569:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24572:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
The output is the perfectly formed header, retrieved page and footer, so it's just fine.
Note that if the prefetch file has 0 pages (header and footer only) instead of one page, it succeeds.
Note also that this is without spawning a subprocess to fetch revisions from the db.
I suspect a bad interaction with hhvm/bzip2 (compress streams)/XMLReader but have yet to be able to isolate it. Still investigating.
I've got a tiny program that fails, now trying to get the equivalent with gzip set up for comparison.
This runs successfully with a 0 exit code:
<?php
ini_set( 'display_errors', 'stderr' );
$reader = null;
$reader = new XMLReader();
$url = "compress.zlib:///home/datasets/testing/prefetch.xml.gz";
// this also fails: $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
$reader->read();
print( "so much for that\n" );
// $reader->close();
This fails with seg fault at exit, right after the print (only difference is the bz2 input):
<?php
ini_set( 'display_errors', 'stderr' );
$reader = null;
$reader = new XMLReader();
$url = "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2";
// this also fails: $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
$reader->read();
print( "so much for that\n" );
// $reader->close();
for a tiny prefetch.xml file with valid mw xml in it. For comparison, the same bz2 script with the XMLReader explicitly closed before exit runs cleanly:
<?php
ini_set( 'display_errors', 'stderr' );
$reader = null;
$reader = new XMLReader();
$url = "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2";
// this also fails: $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
$reader->read();
print( "so much for that\n" );
$reader->close();
<?php
$infile = "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2"; $me = fopen($infile, "r"); echo fgets($me); // fclose($me); print "yay\n";
?>
That runs successfully btw. bzip2, no fclose. BUT no XMLReader implicated.
Have backed off of hhvm use for dumps for now. The current run will continue with regular php5.
Not really:
Start Addr      End Addr        Size        Offset      objfile
0x400000        0x282c000       0x242c000   0x0         /usr/bin/hhvm
0x282d000       0x2830000       0x3000      0x242c000   /usr/bin/hhvm
0x2830000       0x2854000       0x24000     0x242f000   /usr/bin/hhvm
0x2854000       0xf600000       0xcdac000   0x0         [heap]
0x7fffde000000  0x7fffe1c00000  0x3c00000   0x0
0x7fffe1c00000  0x7fffe2c00000  0x1000000   0x0         /tmp/tcygjm4T (deleted)
0x7fffe2c00000  0x7fffe3000000  0x400000    0x0
0x7fffe3059000  0x7fffe3322000  0x2c9000    0x0         /usr/lib/locale/locale-archive
...
But it's clear that the script proper has exited and it's in the hhvm cleanup that it's barfing.
Perhaps just installing the debug symbols (package hhvm-dbg) will be enough to resolve the location.
Ori had problems reproducing this so I am putting the exact script and data file with step by step "how to reproduce" here, which will make it easier for anyone else as well.
- copy this script into a php file:
<?php
ini_set( 'display_errors', 'stderr' );
$reader = new XMLReader();
// with gz compression it works, with bz2 it does not
// $url = "compress.zlib://xmlstuff.gz";
$url = "compress.bzip2://xmlstuff.bz2";
// with a regular fopen and fgets and no XMLReader it works (with bz2)
// $me = fopen($url, "r");
// with either type of open, it fails
// $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
// echo fgets($me);
$reader->read();
print( "so much for that\n" );
// with close, it works
// $reader->close();
- copy this text into a file called "xmlstuff" in the same directory: --------
<mediawiki>
<page>
  <title>Formula:Stages</title>
  <ns>10</ns>
  <id>922</id>
  <revision>
    <id>6998</id>
    <contributor>
      <username>UV</username>
      <id>48</id>
    </contributor>
    <text xml:space="preserve">just some junk</text>
    <sha1>r0wg7039eqns9sz9m1o2j43el1xce5m</sha1>
  </revision>
</page>
</mediawiki>
- bzip2 xmlstuff
- php name-of-php-script-here
I did these as the 'ariel' user on snapshot1002 just because that happened to be where I was at the time. It runs trusty, nothing particularly special about it. Also it is set up to dump core files for the moment.
Sample output:
ariel@snapshot1002:~$ php xmltest.php
so much for that
[Fri May 6 22:33:24 2016] [hphp] [12322:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
[Fri May 6 22:33:24 2016] [hphp] [12320:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
Segmentation fault (core dumped)
ariel@snapshot1002:~$ [Fri May 6 22:33:24 2016] [hphp] [12323:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
[Fri May 6 22:33:24 2016] [hphp] [12321:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
[Fri May 6 22:33:24 2016] [hphp] [12319:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
The comments in the php script file show things that I have tried that work (i.e. gz compression works properly; bz2 compression without the XMLReader is ok; if the XMLReader is closed before exit it runs ok). Only this specific combination fails.
Well I now see why people couldn't reproduce: my nice copy-pastes were transformed by phab into weirdness, with comments stripped out. Grrr. Instead, see upstream bug with attachments, filed here:
https://github.com/facebook/hhvm/issues/7075 has just been closed after this commit: https://github.com/facebook/hhvm/commit/9d2be6c30b5b6dadf414692d0e7fbab5f9105b5f so now we need this to make it into a release and then eventually into a WMF package.
The parent task has been declined, with the view that we're moving to PHP7 instead. Should this task be re-titled to reflect that?
Let's leave it for now. I've been following that discussion closely, but if I get to update the snapshot boxes to jessie before stretch + php7 is available, I'll likely want to move to hhvm like all of the app servers, migrating off when we move all of the production cluster. The good news is that I have been running against php7 for local tests for months now and it's working fine.
This is probably going to turn into "update to stretch and php7", but we're waiting on the RFC last call, due to close Jan 10th, before I decline or merge the task elsewhere.