Convert snapshot hosts to use HHVM and trusty
Closed, Declined · Public

Description

Convert the snapshot hosts both for performance reasons (they run some long-running scripts that would benefit from HHVM) and to get rid of the remaining PHP 5.3 hosts.

matmarex removed a subscriber: matmarex. Mar 28 2015, 2:31 AM
Restricted Application added a subscriber: Matanya. Jun 30 2015, 5:44 AM

@ArielGlenn: What's the status of this? MediaWiki is not going to support PHP 5.3 forever, at some point these hosts are going to break.

demon added a comment. Sep 4 2015, 4:52 PM

@ArielGlenn: What's the status of this? MediaWiki is not going to support PHP 5.3 forever, at some point these hosts are going to break.

MediaWiki will support 5.3 as long as it takes to get the cluster off 5.3...the hosts aren't going to arbitrarily break.

ori added a comment. Sep 5 2015, 12:13 AM

To nudge this along, I applied the roles and classes currently applied to the snapshot hosts to osmium, which is running Trusty. With the exception of the need to add new snapshot hosts to datasets1001:/etc/exports, I did not run into any issues. Puppet ran to completion and made all requisite state changes.

The way we run the dumps now, we have a (scheduled) break in between runs, which should come up in a week and last for 5 days. This will be enough time to rebuild one more library, reimage one host, test it, deal with any fallout and then do the same for the rest. So expect action on this around September 15, earlier if the run of the dumps finishes earlier.

The idea is that we have two runs a month, with some dead time after each run to deal with security updates or anything else desired.

jcrespo added a subscriber: jcrespo.

Ariel generously accepted to remind the rest of ops about this by owning this.

ori added a comment. Sep 16 2015, 7:16 PM

So expect action on this around September 15, earlier if the run of the dumps finishes earlier.

September 15 has come and gone.

September itself is about to be gone...

September itself is about to be gone...

We've only just got rid of the last 10.04 host; 5.5 years after release.

Ariel has also been ill, so this probably has been delayed a little bit.

Progress is happening, and other blockers are being found. For example the video scalers aren't faring well under hhvm, see T113284

What's the major urgency to use > PHP 5.3 features? Better, cleaner, less verbose code? Super performance?

snapshot1002 was reinstalled with trusty last week.

The utf8 normal extension had already been rebuilt by ori for trusty, so I figured we were good to go. But it had not been packaged and built for HHVM. For a while I was fighting with that, not fun since it relies on swig for php, not supported for hhvm. But it turns out that MW now uses the intl library so I was able to just remove all dependencies: https://gerrit.wikimedia.org/r/#/c/241537/ and https://gerrit.wikimedia.org/r/#/c/241540/ now merged.

The remaining known blocker is that our HHVM seems to be built without compress.bzip2 stream support (see blocking tasks). If that's not fixed before October I will have the dumps fall back to running php5 (https://gerrit.wikimedia.org/r/#/c/241612/)
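
Whether a given runtime was built with the compress.bzip2 wrapper can be checked from PHP itself. The following is a minimal sketch, not taken from the dumps code: `hasStreamWrapper()` is a hypothetical helper name, and it relies only on the standard `stream_get_wrappers()` function.

```php
<?php
// Hypothetical helper (not part of the dumps scripts): report whether a
// stream wrapper such as compress.bzip2 is registered in this runtime.
function hasStreamWrapper( $name ) {
    return in_array( $name, stream_get_wrappers(), true );
}

// Print the status of the two compression wrappers discussed in this task.
foreach ( array( 'compress.zlib', 'compress.bzip2' ) as $wrapper ) {
    printf(
        "%s: %s\n",
        $wrapper,
        hasStreamWrapper( $wrapper ) ? 'available' : 'missing'
    );
}
```

Running this under both php5 and hhvm on the same host should make any difference in wrapper support visible without touching the dump scripts.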

I'm going to go ahead and convert over snapshot1004 and snapshot1001 to trusty today. Because snapshot1003 runs daily cron jobs, I will wait on it until Oct 1, when I'll make the decision about fallback for the month or not.

September itself is about to be gone...

We've only just got rid of the last 10.04 host; 5.5 years after release.

From a misc (non-MediaWiki) host. Not quite the same thing.

mark raised the priority of this task from Normal to High. Sep 28 2015, 1:16 PM
mark added a subscriber: mark.

upgraded snapshot1001 to trusty. puppet runs ok with one exception, unable to finish installation of mcelog because:

root@snapshot1001:~# mcelog --foreground
mcelog: AMD Processor family 16: Please load edac_mce_amd module.
: Success
CPU is unsupported
root@snapshot1001:~# lsmod | grep mce
edac_mce_amd 22617 1 amd64_edac_mod

mcelog doesn't run on more recent AMD cpus. http://www.mcelog.org/faq.html#18 Need to remove it from modules/base/manifests/standard-packages.pp for hosts with AMD cpus I guess.

with certain AMD cpus.

Change 241736 had a related patch set uploaded (by Ori.livneh):
Don't install mcelog on machines with AMD processors

https://gerrit.wikimedia.org/r/241736

Change 241736 merged by Ori.livneh:
Don't install mcelog on machines with AMD processors

https://gerrit.wikimedia.org/r/241736

Strictly speaking we only need to exclude from hosts with AMD processor family >= 16, but since there's only one host on the cluster that has AMD processors, meh. Thanks, Ori! After a puppet run and manual removal of that package, snapshot1001 looks good.

snapshot1004 reinstalled. pending are: hhvm upstream bug, snapshot1003 to be converted as soon as I know if we'll have a fix by Oct 1 or not

ArielGlenn changed the task status from Open to Stalled. Sep 30 2015, 5:16 PM

all snapshot hosts now converted to use trusty. Joe found two differences in php5 from precise to trusty, no igbinary and no apc extensions. but we use neither of those so no problem. a test run on a small wiki worked just fine with php5. stalled on conversion to hhvm until the upstream bug (see blocking task) is fixed.

T65899 is not a blocked task; all snapshots are converted to trusty. that's why I removed this from blockers.

Hydriz added a subscriber: Hydriz. Oct 8 2015, 1:38 PM

Just curious, did this change have an effect on the dumps generation process? It seems like the probability of getting failed jobs is higher in the first October 2015 run.

Reedy added a comment. Oct 8 2015, 1:55 PM

Just curious, did this change have an effect on the dumps generation process? It seems like the probability of getting failed jobs is higher in the first October 2015 run.

Moving to trusty and PHP 5.5 would presumably bring some improvements, but I suspect that going up to hhvm in the near future will be a bigger net win...

And yeah, it's possible both parts of the "upgrade" will cause problems, but I'm fairly sure Ariel will be keeping a close eye on them to begin with

Moving to trusty and php 5.5 didn't impact the dumps particularly. There was an issue caused by my pylint changes, unrelated to the upgrade, but the way stages go, there's a catch-all stage which will rerun all jobs that failed earlier, which you can see has been happening for the "big" wikis.

Thanks for the information, hopefully we are progressing towards a faster dump generation process. Thanks!

Hydriz removed a subscriber: Hydriz. Oct 28 2015, 4:26 PM

After chat on IRC with ori: The current dump run will likely get done in about 12 days, and we'll know a bit more about the stability of the new hhvm version by then too. So leaving as 'stalled' til then.

Change 281547 had a related patch set uploaded (by ArielGlenn):
enabling hhvm for dumps

https://gerrit.wikimedia.org/r/281547

Change 281547 merged by ArielGlenn:
enabling hhvm for dumps

https://gerrit.wikimedia.org/r/281547

I've done this for the new snapshot hosts and run a test dump of a wiki; it looked fine. I'll keep this open til the misc cron jobs are moved over however.

During the full run some jobs produce "Lost parent" from hhvm; so far I have only seen this for articles dumps, i.e. those that spawn a separate php script to get revisions. The job itself runs to completion with fully produced files, but the manager sees errors and believes it to have failed. It also spams the status html page for users. Needs more investigation.

Joe added a subscriber: Joe. Apr 8 2016, 2:21 PM

the LightprocessCount: Lost parent message is just signaling the exit of a child php process.

So it's most definitely not an error.

You might want to set the spawned php processes so that those have lightprocess count equal to zero. it can be done by using for example a separate ini file for those.

In general, though, spawning a php process from within a php process instead of using fork() feels like the wrong thing to do.
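
One way the suggestion about spawned processes might look is sketched below. This is an illustration under assumptions, not the dumps code: `buildChildCommand()` and `childScript.php` are hypothetical names, and the ini name `hhvm.server.light_process_count` and the `-d` override flag should be checked against the installed HHVM version before relying on them.

```php
<?php
// Hypothetical helper: build a shell command for a child PHP process that
// overrides the light process count to zero via a -d ini definition.
// Assumption: the ini name is hhvm.server.light_process_count.
function buildChildCommand( $binary, $script, array $args ) {
    $parts = array_merge(
        array( $binary, '-d', 'hhvm.server.light_process_count=0', $script ),
        $args
    );
    // Quote every part so the result is safe to hand to a shell.
    return implode( ' ', array_map( 'escapeshellarg', $parts ) );
}

// Illustrative call; childScript.php stands in for whatever script is spawned.
echo buildChildCommand( 'php', 'childScript.php', array( '--wiki=lawikibooks' ) ), "\n";
```

A separate ini file containing the same setting, passed to the child with `-c`, would presumably achieve the same thing and matches the suggestion above more literally.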

ArielGlenn added a comment. Edited Apr 11 2016, 11:22 AM

Some investigation from the command line shows me that these messages for dumps are only produced when there's a segfault causing the parent indeed to exit, and this is connected to using bzip2 compressed files for the prefetch, even without spawn. I.e.:

datasets@snapshot1002:~/testing$ /usr/bin/php /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=lawikibooks --stub=file:/home/datasets/testing/stubs.xml --prefetch=bzip2:/home/datasets/testing/prefetch.xml.bz2 --report=1000 --current > /dev/null

with only one page in stubs.xml, one page in prefetch.xml (not necessarily the one to retrieve), and output to stdout or anywhere else, gives:

2016-04-11 11:20:18: lawikibooks (ID 24566) 1 pages (34.2|34.2/sec all|curr), 1 revs (34.2|34.2/sec all|curr), 0.0%|0.0% prefetched (all|curr), ETA 2016-04-11 11:24:06 [max 7794]
Segmentation fault
datasets@snapshot1002:~/testing$ [Mon Apr 11 11:20:18 2016] [hphp] [24570:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24568:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24571:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24569:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24572:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting

Output is the perfectly formed output of the header, retrieved page and the footer, so it's just fine.

Note that if the prefetch file is 0 pages (header and footer only) instead of one page, it succeeds.

Note also that this is without spawning a subprocess to fetch revisions from the db.

I suspect a bad interaction with hhvm/bzip2 (compress streams)/XMLReader but have yet to be able to isolate it. Still investigating.

I've got a tiny program that fails, now trying to get the equivalent with gzip set up for comparison.

This runs successfully with a 0 exit code:

<?php

ini_set( 'display_errors', 'stderr' );
$reader = null;
$reader = new XMLReader();
$url = "compress.zlib:///home/datasets/testing/prefetch.xml.gz";
// this also fails: $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
$reader->read();
print( "so much for that\n" );
// $bpre->reader->close();

This fails with seg fault at exit, right after the print (only difference is the bz2 input):

<?php

ini_set( 'display_errors', 'stderr' );
$reader = null;
$reader = new XMLReader();
$url = "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2";
// this also fails: $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
$reader->read();
print( "so much for that\n" );
// $bpre->reader->close();

for a tiny prefetch.xml file with valid mw xml in it.

This is the sample xml file I used. So hhvm somehow doesn't properly clean up compress.bzip2 streams if the php process leaves them open. This script (bzip2 stream, but with a close at the end) works properly every time.

<?php

ini_set( 'display_errors', 'stderr' );
$reader = null;
$reader = new XMLReader();
$url = "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2";
// this also fails: $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
$reader->read();
print( "so much for that\n" );
$reader->close();

<?php

$infile = "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2";
$me = fopen($infile, "r");
echo fgets($me);
// fclose($me);
print "yay\n";

?>

That runs successfully btw. bzip2, no fclose. BUT no XMLReader implicated.
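
Since closing the reader before exit avoids the crash in these tests, one possible stopgap is to guarantee the close on every exit path. The sketch below is only an illustration of that idea: `openReaderWithCleanup()` is a hypothetical helper, not something in MediaWiki or the dumps scripts, and it has not been verified against the affected HHVM build.

```php
<?php
// Hypothetical helper: open an XMLReader on $url and register a shutdown
// function that closes it, so the reader is never left open at exit.
function openReaderWithCleanup( $url ) {
    $reader = new XMLReader();
    if ( !$reader->open( $url ) ) {
        // open() failed (missing file, unsupported wrapper, ...)
        return null;
    }
    register_shutdown_function( function () use ( $reader ) {
        // Runs when the script ends, including after exit() calls.
        $reader->close();
    } );
    return $reader;
}
```

A caller would use it in place of `new XMLReader()` plus `open()`, for example `$reader = openReaderWithCleanup( "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2" );`, and can still close the reader explicitly when done.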

Have backed off of hhvm use for dumps for now. The current run will continue with regular php5.

This seems to be the best I can do for a stack trace; what am I missing?

Does info proc mappings help at least in identifying the coarse location?

Not really:

    Start Addr           End Addr       Size     Offset objfile
      0x400000          0x282c000  0x242c000        0x0 /usr/bin/hhvm
     0x282d000          0x2830000     0x3000  0x242c000 /usr/bin/hhvm
     0x2830000          0x2854000    0x24000  0x242f000 /usr/bin/hhvm
     0x2854000          0xf600000  0xcdac000        0x0 [heap]
0x7fffde000000     0x7fffe1c00000  0x3c00000        0x0 
0x7fffe1c00000     0x7fffe2c00000  0x1000000        0x0 /tmp/tcygjm4T (deleted)
0x7fffe2c00000     0x7fffe3000000   0x400000        0x0 
0x7fffe3059000     0x7fffe3322000   0x2c9000        0x0 /usr/lib/locale/locale-archive

...

But it's clear that the script proper has exited and it's in the hhvm cleanup that it's barfing.

Perhaps installing the debug symbols will just work (package hhvm-dbg) to resolve the location.

It's already installed and provides hhvm-gdb which I used above.

Ori had problems reproducing this so I am putting the exact script and data file with step by step "how to reproduce" here, which will make it easier for anyone else as well.


  1. copy this script into a php file:

<?php

ini_set( 'display_errors', 'stderr' );

$reader = new XMLReader();

// with gz compression it works, with bz2 it does not
// $url = "compress.zlib://xmlstuff.gz";
$url = "compress.bzip2://xmlstuff.bz2";

// with a regular fopen and fgets and no XMLReader it works (with bz2)
// $me = fopen($infile, "r");
// with either type of open, it fails
// $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );

// echo fgets($me);
$reader->read();
print( "so much for that\n" );

// with close, it works
// $reader->close();


  2. copy this text into a file called "xmlstuff" in the same directory:

<mediawiki>

<page>
  <title>Formula:Stages</title>
  <ns>10</ns>
  <id>922</id>
  <revision>
    <id>6998</id>
    <contributor>
      <username>UV</username>
      <id>48</id>
    </contributor>
    <text xml:space="preserve">just some junk</text>
    <sha1>r0wg7039eqns9sz9m1o2j43el1xce5m</sha1>
  </revision>
</page>

</mediawiki>

  3. bzip2 xmlstuff
  4. php name-of-php-script-here

I did these as the 'ariel' user on snapshot1002 just because that happened to be where I was at the time. It runs trusty, nothing particularly special about it. Also it is set up to dump core files for the moment.
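
The reproduction steps above can also be scripted, which makes it easier to compare the gzip and bzip2 cases side by side. This is a rough sketch under assumptions: `writeFixtures()` and `runScript()` are hypothetical helpers, the file names follow the steps above, and the bzip2 fixture is only written when the bz2 extension is available.

```php
<?php
// Hypothetical helper: write gzip and bzip2 copies of the sample XML
// into $dir and return the list of files written.
function writeFixtures( $dir, $xml ) {
    $written = array();
    // gzip variant (reported to work)
    file_put_contents( "$dir/xmlstuff.gz", gzencode( $xml ) );
    $written[] = "$dir/xmlstuff.gz";
    // bzip2 variant (reported to crash at exit); needs the bz2 extension
    if ( function_exists( 'bzcompress' ) ) {
        file_put_contents( "$dir/xmlstuff.bz2", bzcompress( $xml ) );
        $written[] = "$dir/xmlstuff.bz2";
    }
    return $written;
}

// Hypothetical helper: run a PHP script with the given binary and return
// its exit status; a nonzero status indicates a failure such as a crash.
function runScript( $phpBinary, $script ) {
    $output = array();
    $status = 0;
    $cmd = escapeshellarg( $phpBinary ) . ' ' . escapeshellarg( $script ) . ' 2>&1';
    exec( $cmd, $output, $status );
    return $status;
}
```

Calling `writeFixtures()` on the sample XML and then `runScript( 'php', 'xmltest.php' )` mirrors the last two steps; checking the returned status avoids having to spot the "Segmentation fault" line by eye.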

Sample output:
ariel@snapshot1002:~$ php xmltest.php
so much for that
[Fri May 6 22:33:24 2016] [hphp] [12322:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
[Fri May 6 22:33:24 2016] [hphp] [12320:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
Segmentation fault (core dumped)
ariel@snapshot1002:~$ [Fri May 6 22:33:24 2016] [hphp] [12323:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
[Fri May 6 22:33:24 2016] [hphp] [12321:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
[Fri May 6 22:33:24 2016] [hphp] [12319:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting

The comments in the php script file show things that I have tried that work (i.e. gz compression works properly; bz2 compression without the XMLReader is ok; if the XMLReader is closed before exit it runs ok). Only this specific combination fails.

Well I now see why people couldn't reproduce: my nice copy-pastes were transformed by phab into weirdness, with comments stripped out. Grrr. Instead, see upstream bug with attachments, filed here:

https://github.com/facebook/hhvm/issues/7075

https://github.com/facebook/hhvm/issues/7075 has just been closed after this commit: https://github.com/facebook/hhvm/commit/9d2be6c30b5b6dadf414692d0e7fbab5f9105b5f so now we need this to make it into a release and then eventually into a WMF package.

The parent task has been declined, with the view that we're moving to PHP7 instead. Should this task be re-titled to reflect that?

Let's leave it for now. I've been following that discussion closely, but if I get to update the snapshot boxes to jessie before stretch + php7 is available, I'll likely want to move to hhvm like all of the app servers, migrating off when we move all of the production cluster. The good news is that I have been running against php7 for local tests for months now and it's working fine.

This is probably going to turn into "update to stretch and php7", but we're waiting on the RFC last call, due to close Jan 10th, before I decline or merge the task elsewhere.

ArielGlenn closed this task as Declined. Jan 10 2018, 9:35 PM

Officially declining, move to php7 has been approved, see T176370
I've been working on a dump instance in deployment-prep already on stretch/php7, see T184258

Restricted Application removed a subscriber: Liuxinyu970226. Jan 10 2018, 9:35 PM