Convert snapshot hosts to use HHVM and trusty
Closed, Declined · Public

Assigned To

Authored By

	hoo
	Mar 27 2015, 10:36 PM

Description

For both performance reasons (we have some long running scripts on these that can benefit from this) and to get rid of the remaining PHP 5.3 hosts.

Details

	Subject	Repo	Branch	Lines +/-
	enabling hhvm for dumps	operations/puppet	production	+2 -1
	Don't install mcelog on machines with AMD processors	operations/puppet	production	+7 -16


Related Objects

Status	Assigned	Task
Resolved	Legoktm	T75901 Drop PHP 5.3 support
Declined	• demon	T91590 [Spike] Try out hack (<?hh) for mediawiki-config
Resolved	Joe	T104147 can we get rid of rsvg security patch?
Resolved	Reedy	T94149 Get rid of Zend 5.5 tests for wmf branches
Resolved	None	T86081 Complete the use of HHVM over Zend PHP on the Wikimedia cluster
Declined	hashar	T64519 Upgrade git > 1.7
Resolved	• brooke	T69951 Vagrant+TMH: webm transcodes don't play in Firefox (upstream Ubuntu/avconv issue)
Resolved	None	T34317 PDF export extension fails to render Arabic characters in monospace font
Declined	None	T65899 Upgrade Wikimedia servers to Ubuntu Trusty (14.04)
Declined	ArielGlenn	T94277 Convert snapshot hosts to use HHVM and trusty
Resolved	ArielGlenn	T111586 Snapshot hosts need to be manually added to dataset1001's exports
Resolved	ori	T113932 HHVM does not recognize compress.bzip2 streams
Declined	None	T117534 DCAT-AP: XML produces invalid output with HHVM
Resolved	• MoritzMuehlenhoff	T143648 Merge facebook/hhvm@9d2be6c30b into build of next hhvm release

Event Timeline


matmarex unsubscribed. Mar 28 2015, 2:31 AM

JanZerebecki subscribed. Apr 1 2015, 2:37 PM

hoo added a parent task: T86081: Complete the use of HHVM over Zend PHP on the Wikimedia cluster. Apr 9 2015, 5:16 PM

Ricordisamoa subscribed. Apr 20 2015, 5:16 PM

Krenair added a parent task: T65899: Upgrade Wikimedia servers to Ubuntu Trusty (14.04). May 30 2015, 1:40 PM

jeremyb subscribed. Jun 30 2015, 5:44 AM

Restricted Application added a subscriber: Matanya. Jun 30 2015, 5:44 AM

Krenair removed a parent task: T75901: Drop PHP 5.3 support. Jul 4 2015, 12:41 AM

@ArielGlenn: What's the status of this? MediaWiki is not going to support PHP 5.3 forever, at some point these hosts are going to break.

In T94277#1605165, @Krenair wrote:

@ArielGlenn: What's the status of this? MediaWiki is not going to support PHP 5.3 forever, at some point these hosts are going to break.

MediaWiki will support 5.3 as long as it takes to get the cluster off 5.3...the hosts aren't going to arbitrarily break.

ori added a subtask: T111586: Snapshot hosts need to be manually added to dataset1001's exports. Sep 5 2015, 12:06 AM

To nudge this along, I applied the roles and classes currently applied to the snapshot hosts to osmium, which is running Trusty. With the exception of the need to add new snapshot hosts to dataset1001:/etc/exports, I did not run into any issues. Puppet ran to completion and made all requisite state changes.

The way we run the dumps now, we have a (scheduled) break in between runs, which should come up in a week and last for 5 days. This will be enough time to rebuild one more library, reimage one host, test it, deal with any fallout and then do the same for the rest. So expect action on this around September 15, earlier if the run of the dumps finishes earlier.

The idea is that we have two runs a month, with some dead time after each run to deal with security updates or anything else desired.

Ariel generously agreed to own this and to remind the rest of ops about it.

Krenair mentioned this in T86081: Complete the use of HHVM over Zend PHP on the Wikimedia cluster. Sep 14 2015, 12:07 PM

In T94277#1615702, @ArielGlenn wrote:

So expect action on this around September 15, earlier if the run of the dumps finishes earlier.

September 15 has come and gone.

September itself is about to be gone...

In T94277#1679712, @JeroenDeDauw wrote:

September itself is about to be gone...

We've only just got rid of the last 10.04 host; 5.5 years after release.

Ariel has also been ill, so this probably has been delayed a little bit.

Progress is happening, and other blockers are being found. For example the video scalers aren't faring well under hhvm, see T113284

What's the major urgency to use > PHP 5.3 features? Better, cleaner, less verbose code? Super performance?

ArielGlenn added a subtask: T113932: HHVM does not recognize compress.bzip2 streams. Sep 28 2015, 9:47 AM

snapshot1002 was reinstalled with trusty last week.

The utf8 normal extension had already been rebuilt by ori for trusty, so I figured we were good to go. But it had not been packaged and built for HHVM. For a while I was fighting with that, not fun since it relies on swig for php, not supported for hhvm. But it turns out that MW now uses the intl library so I was able to just remove all dependencies: https://gerrit.wikimedia.org/r/#/c/241537/ and https://gerrit.wikimedia.org/r/#/c/241540/ now merged.

The remaining known blocker is that our HHVM seems to be built without compress.bzip2 stream support (see blocking tasks). If that's not fixed before October I will have the dumps fall back to running php5 (https://gerrit.wikimedia.org/r/#/c/241612/)

I'm going to go ahead and convert over snapshot1004 and snapshot1001 to trusty today. Because snapshot1003 runs daily cron jobs, I will wait on it until Oct 1 when I'll make the decision about fallback for the month or not.

ArielGlenn moved this task from Backlog to Active on the Datasets-General-or-Unknown board. Sep 28 2015, 10:11 AM

In T94277#1679736, @Reedy wrote:

In T94277#1679712, @JeroenDeDauw wrote:

September itself is about to be gone...

We've only just got rid of the last 10.04 host; 5.5 years after release.

From a misc (non-MediaWiki) host. Not quite the same thing.

mark raised the priority of this task from Medium to High. Sep 28 2015, 1:16 PM

mark subscribed.

upgraded snapshot1001 to trusty. puppet runs ok with one exception, unable to finish installation of mcelog because:

root@snapshot1001:~# mcelog --foreground
mcelog: AMD Processor family 16: Please load edac_mce_amd module.
: Success
CPU is unsupported
root@snapshot1001:~# lsmod | grep mce
edac_mce_amd 22617 1 amd64_edac_mod

mcelog doesn't run on more recent AMD cpus. http://www.mcelog.org/faq.html#18 Need to remove it from modules/base/manifests/standard-packages.pp for hosts with AMD cpus I guess.

with certain AMD cpus.

Change 241736 had a related patch set uploaded (by Ori.livneh):
Don't install mcelog on machines with AMD processors

https://gerrit.wikimedia.org/r/241736

gerritbot added a project: Patch-For-Review. Sep 28 2015, 6:40 PM

Change 241736 merged by Ori.livneh:
Don't install mcelog on machines with AMD processors

https://gerrit.wikimedia.org/r/241736

Strictly speaking we only need to exclude from hosts with AMD processor family >= 16, but since there's only one host on the cluster that has AMD processors, meh. Thanks, Ori! After a puppet run and manual removal of that package, snapshot1001 looks good.
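
For reference, the narrower condition mentioned here (AMD vendor and processor family >= 16) could be checked roughly as below. This is only a sketch: it reads the `vendor_id` and `cpu family` fields as they appear in Linux's /proc/cpuinfo, and the function name `mcelog_supported` and its optional file argument are made up for illustration. The merged puppet change does not do this; it simply skips mcelog on all AMD hosts.

```shell
# Hypothetical helper: exit 0 if mcelog is expected to work on this CPU.
# Reads /proc/cpuinfo by default; another file can be passed for testing.
mcelog_supported() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "vendor_id" and "cpu family" entries.
    vendor=$(awk -F': *' '/^vendor_id/ { print $2; exit }' "$cpuinfo")
    family=$(awk -F': *' '/^cpu family/ { print $2; exit }' "$cpuinfo")
    if [ "$vendor" = "AuthenticAMD" ] && [ "${family:-0}" -ge 16 ]; then
        return 1   # AMD family >= 16: mcelog reports "CPU is unsupported"
    fi
    return 0
}
```

Something like `mcelog_supported || echo "skip mcelog on this host"` would then express the stricter rule.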

ori mentioned this in rOPUP66994b04da30: Don't install mcelog on machines with AMD processors. Sep 28 2015, 7:12 PM

snapshot1004 reinstalled. pending are: hhvm upstream bug, snapshot1003 to be converted as soon as I know if we'll have a fix by Oct 1 or not

all snapshot hosts now converted to use trusty. Joe found two differences in php5 from precise to trusty, no igbinary and no apc extensions. but we use neither of those so no problem. a test run on a small wiki worked just fine with php5. stalled on conversion to hhvm until the upstream bug (see blocking task) is fixed.

ArielGlenn removed a parent task: T65899: Upgrade Wikimedia servers to Ubuntu Trusty (14.04). Sep 30 2015, 5:17 PM

ArielGlenn moved this task from Active to Blocked on the Datasets-General-or-Unknown board.

Krenair added a parent task: T65899: Upgrade Wikimedia servers to Ubuntu Trusty (14.04). Sep 30 2015, 5:21 PM

T65899 is not a blocked task; all snapshots are converted to trusty. that's why I removed this from blockers.

Just curious, did this change have an effect on the dumps generation process? It seems like the probability of getting failed jobs is higher in the first October 2015 run.

In T94277#1712386, @Hydriz wrote:

Just curious, did this change have an effect on the dumps generation process? It seems like the probability of getting failed jobs is higher in the first October 2015 run.

Moving to trusty and PHP 5.5 would presumably bring some improvements, but I suspect that going up to hhvm in the near future will be a bigger net win...

And yeah, it's possible both parts of the "upgrade" will cause problems, but I'm fairly sure Ariel will be keeping a close eye on them to begin with

Moving to trusty and php 5.5 didn't impact the dumps particularly. There was an issue caused by my pylint changes, unrelated to the upgrade, but the way stages go, there's a catch-all stage which will rerun all jobs that failed earlier, which you can see has been happening for the "big" wikis.
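
To illustrate the catch-all idea only: the sketch below runs each job once and then reruns just the ones that failed. It is not the actual dumps scheduler, whose stages are not shown in this task; the function name `run_with_catchall` and the convention of passing each job as a command string are assumptions made for the example.

```shell
# Hypothetical two-pass runner: every argument is one job's command line.
# Pass 1 runs all jobs; the catch-all pass reruns only those that failed.
run_with_catchall() {
    n=$#
    for job in "$@"; do
        # Append failed jobs to the end of the argument list for later.
        sh -c "$job" || set -- "$@" "$job"
    done
    shift "$n"   # drop the original list, keeping only the failed jobs

    status=0
    for job in "$@"; do
        if ! sh -c "$job"; then
            echo "still failing: $job" >&2
            status=1
        fi
    done
    return "$status"
}
```

A job that fails once and then succeeds on the rerun leaves the overall result as success, which matches what the status pages showed for the "big" wikis.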

Thanks for the information, hopefully we are moving towards a faster dump generation process. Thanks!

Hydriz unsubscribed. Oct 28 2015, 4:26 PM

ArielGlenn added a subtask: T117534: DCAT-AP: XML produces invalid output with HHVM. Nov 9 2015, 9:48 AM

ArielGlenn edited projects, added Dumps-Generation; removed Datasets-General-or-Unknown. Nov 23 2015, 4:58 PM

ArielGlenn set Security to None.

ArielGlenn moved this task from Backlog to Blocked/Stalled/Waiting for event on the Dumps-Generation board.

hashar mentioned this in T94149: Get rid of Zend 5.5 tests for wmf branches. Feb 19 2016, 9:14 PM

ori closed subtask T113932: HHVM does not recognize compress.bzip2 streams as Resolved. Mar 11 2016, 7:51 PM

After chat on IRC with ori: The current dump run will likely get done in about 12 days, and we'll know a bit more about the stability of the new hhvm version by then too. So leaving as 'stalled' til then.

ArielGlenn closed subtask T111586: Snapshot hosts need to be manually added to dataset1001's exports as Resolved. Mar 18 2016, 8:41 PM

ArielGlenn added a parent task: T131113: deploy new snapshot hosts in eqiad. Mar 28 2016, 11:04 PM

ArielGlenn moved this task from Blocked/Stalled/Waiting for event to Active on the Dumps-Generation board. Mar 29 2016, 10:37 AM

Change 281547 had a related patch set uploaded (by ArielGlenn):
enabling hhvm for dumps

https://gerrit.wikimedia.org/r/281547

Change 281547 merged by ArielGlenn:
enabling hhvm for dumps

https://gerrit.wikimedia.org/r/281547

I've done this for the new snapshot hosts and run a test dump of a wiki; it looked fine. I'll keep this open til the misc cron jobs are moved over however.

During the full run some jobs produce "Lost parent" from hhvm; so far I have only seen this for articles dumps, i.e. those that spawn a separate php script to get revisions. The job itself runs to completion with fully produced files, but the manager sees errors and believes it to have failed. It also spams the status html page for users. Needs more investigation.

the LightprocessCount: Lost parent message is just signaling the exit of a child php process.

So it's most definitely not an error.
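
If the message really is benign, one way a job manager could cope is to judge a job by its exit status and drop these notices from stderr. A rough sketch follows; the wrapper name `run_ignoring_lightprocess` is hypothetical and this is not how the dumps manager is actually written. The exit status stays authoritative, so a real crash would still count as a failure.

```shell
# Hypothetical wrapper: run a command, filter the LightProcess notices out of
# its stderr, and return the command's own exit status.
run_ignoring_lightprocess() {
    errlog=$(mktemp) || return 1
    status=0
    "$@" 2>"$errlog" || status=$?
    # Pass through everything on stderr except the "Lost parent" notices.
    grep -v 'Lost parent, LightProcess exiting' "$errlog" >&2 || true
    rm -f "$errlog"
    return "$status"
}
```

Usage would be along the lines of `run_ignoring_lightprocess php some-dump-script.php`, where the script name is a placeholder.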

You might want to set the spawned php processes so that those have lightprocess count equal to zero. it can be done by using for example a separate ini file for those.
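
A sketch of what the separate ini file could look like. The setting name `hhvm.server.light_process_count` is assumed here to be the ini spelling of the LightProcessCount option and should be checked against the installed HHVM version; the helper name `write_child_ini` and any file paths are made up for the example.

```shell
# Hypothetical helper: write an ini file for spawned HHVM CLI processes that
# turns off the light process pool.
write_child_ini() {
    # Assumption: this is the ini name of the LightProcessCount option.
    printf 'hhvm.server.light_process_count = 0\n' > "$1"
}

# Assumed usage, not run here:
#   write_child_ini /path/to/child.ini
#   hhvm -c /path/to/child.ini some-child-script.php
```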

In general, though, spawning a php process from within a php process instead of using fork() feels like the wrong thing to do.

Some investigation from the command line shows me that these messages for dumps are only produced when there's a segfault that does indeed cause the parent to exit, and this is connected to using bzip2 compressed files for the prefetch, even without spawn. I.e.:

datasets@snapshot1002:~/testing$ /usr/bin/php /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=lawikibooks --stub=file:/home/datasets/testing/stubs.xml --prefetch=bzip2:/home/datasets/testing/prefetch.xml.bz2 --report=1000 --current > /dev/null

with only one page in stubs.xml, one page in prefetch.xml (not necessarily the one to retrieve), and output to stdout or anywhere else, gives:

2016-04-11 11:20:18: lawikibooks (ID 24566) 1 pages (34.2|34.2/sec all|curr), 1 revs (34.2|34.2/sec all|curr), 0.0%|0.0% prefetched (all|curr), ETA 2016-04-11 11:24:06 [max 7794]
Segmentation fault
datasets@snapshot1002:~/testing$ [Mon Apr 11 11:20:18 2016] [hphp] [24570:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24568:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24571:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24569:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting
[Mon Apr 11 11:20:18 2016] [hphp] [24572:7fc3c1977100:0:000001] [] Lost parent, LightProcess exiting

Output is the perfectly formed output of the header, retrieved page and the footer, so it's just fine.

Note that if the prefetch file is 0 pages (head and footer only) instead of one page, it succeeds.

Note also that this is without spawn of a subprocess to fetch revisions from the db.

I suspect a bad interaction with hhvm/bzip2 (compress streams)/XMLReader but have yet to be able to isolate it. Still investigating.

I've got a tiny program that fails, now trying to get the equivalent with gzip set up for comparison.

This runs successfully with a 0 exit code:

<?php

ini_set( 'display_errors', 'stderr' );
$reader = null;
$reader = new XMLReader();
$url = "compress.zlib:///home/datasets/testing/prefetch.xml.gz";
// this also fails: $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
$reader->read();
print( "so much for that\n" );
// $bpre->reader->close();

This fails with seg fault at exit, right after the print (only difference is the bz2 input):

<?php

ini_set( 'display_errors', 'stderr' );
$reader = null;
$reader = new XMLReader();
$url = "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2";
// this also fails: $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
$reader->read();
print( "so much for that\n" );
// $bpre->reader->close();

for a tiny prefetch.xml file with valid mw xml in it.

prefetch.xml (3 KB, attachment)

This is the sample xml file I used. So hhvm somehow doesn't properly clean up compress.bzip2 streams if the php process leaves them open. This script (bzip2 stream but close at the end) works properly every time.

<?php

ini_set( 'display_errors', 'stderr' );
$reader = null;
$reader = new XMLReader();
$url = "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2";
// this also fails: $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );
$reader->read();
print( "so much for that\n" );
$reader->close();

<?php

$infile = "compress.bzip2:///home/datasets/testing/prefetch.xml.bz2";
$me = fopen($infile, "r");
echo fgets($me);
// fclose($me);
print "yay\n";

That runs successfully btw. bzip2, no fclose. BUT no XMLReader implicated.

Have backed off of hhvm use for dumps for now. The current run will continue with regular php5.

gbd.txt (2 KB, attachment)

This seems to be the best I can do for a stack trace; what am I missing?

Does info proc mappings help at least identifying the coarse location?

Not really:

    Start Addr           End Addr       Size     Offset objfile
      0x400000          0x282c000  0x242c000        0x0 /usr/bin/hhvm
     0x282d000          0x2830000     0x3000  0x242c000 /usr/bin/hhvm
     0x2830000          0x2854000    0x24000  0x242f000 /usr/bin/hhvm
     0x2854000          0xf600000  0xcdac000        0x0 [heap]
0x7fffde000000     0x7fffe1c00000  0x3c00000        0x0 
0x7fffe1c00000     0x7fffe2c00000  0x1000000        0x0 /tmp/tcygjm4T (deleted)
0x7fffe2c00000     0x7fffe3000000   0x400000        0x0 
0x7fffe3059000     0x7fffe3322000   0x2c9000        0x0 /usr/lib/locale/locale-archive

...

But it's clear that the script proper has exited and it's in the hhvm cleanup that it's barfing.

Perhaps installing the debug symbols will just work (package hhvm-dbg) to resolve the location.

It's already installed and provides hhvm-gdb which I used above.

ArielGlenn moved this task from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board. Apr 26 2016, 2:22 PM

Ori had problems reproducing this so I am putting the exact script and data file with step by step "how to reproduce" here, which will make it easier for anyone else as well.

copy this script into a php file:

<?php

ini_set( 'display_errors', 'stderr' );

$reader = new XMLReader();

// with gz compression it works, with bz2 it does not
// $url = "compress.zlib://xmlstuff.gz";
$url = "compress.bzip2://xmlstuff.bz2";

// with a regular fopen and fgets and no XMLReader it works (with bz2)
// $me = fopen($infile, "r");
// with either type of open, it fails
// $reader->open( $url, null, LIBXML_PARSEHUGE );
$reader->open( $url, null );

// echo fgets($me);
$reader->read();
print( "so much for that\n" );

// with close, it works
// $reader->close();

copy this text into a file called "xmlstuff" in the same directory:

<page>
  <title>Formula:Stages</title>
  <ns>10</ns>
  <id>922</id>
  <revision>
    <id>6998</id>
    <contributor>
      <username>UV</username>
      <id>48</id>
    </contributor>
    <text xml:space="preserve">just some junk</text>
    <sha1>r0wg7039eqns9sz9m1o2j43el1xce5m</sha1>
  </revision>
</page>

</mediawiki>

bzip2 xmlstuff
php name-of-php-script-here
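
For anyone who wants both compressed variants side by side for comparison, the preparation step could be scripted roughly as follows. This is a sketch: the function name `make_fixtures` is made up, and unlike a plain `bzip2 xmlstuff` it keeps the uncompressed file around by writing to stdout with `-c`.

```shell
# Hypothetical helper: from the uncompressed sample file, produce the .bz2
# input that triggers the crash and a .gz copy for comparison.
make_fixtures() {
    src="$1"
    bzip2 -c "$src" > "$src.bz2" || return 1
    gzip -c "$src" > "$src.gz" || return 1
}
```

Then `make_fixtures xmlstuff` followed by `php name-of-php-script-here` should give the same result as the two commands above.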

I did these as the 'ariel' user on snapshot1002 just because that happened to be where I was at the time. It runs trusty, nothing particularly special about it. Also it is set up to dump core files for the moment.

Sample output:
ariel@snapshot1002:~$ php xmltest.php
so much for that
[Fri May 6 22:33:24 2016] [hphp] [12322:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
[Fri May 6 22:33:24 2016] [hphp] [12320:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
Segmentation fault (core dumped)
ariel@snapshot1002:~$ [Fri May 6 22:33:24 2016] [hphp] [12323:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
[Fri May 6 22:33:24 2016] [hphp] [12321:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting
[Fri May 6 22:33:24 2016] [hphp] [12319:7fa4b62af100:0:000001] [] Lost parent, LightProcess exiting

The comments in the php script file show things that I have tried that work (i.e. gz compression works properly; bz2 compression without the XMLReader is ok; if the XMLReader is closed before exit it runs ok). Only this specific combination fails.

Well I now see why people couldn't reproduce: my nice copy-pastes were transformed by phab into weirdness, with comments stripped out. Grrr. Instead, see upstream bug with attachments, filed here:

https://github.com/facebook/hhvm/issues/7075

https://github.com/facebook/hhvm/issues/7075 has just been closed after this commit: https://github.com/facebook/hhvm/commit/9d2be6c30b5b6dadf414692d0e7fbab5f9105b5f so now we need this to make it into a release and then eventually into a WMF package.

ArielGlenn removed a parent task: T131113: deploy new snapshot hosts in eqiad. Jul 30 2016, 9:10 AM

ArielGlenn created subtask T143648: Merge facebook/hhvm@9d2be6c30b into build of next hhvm release. Aug 23 2016, 10:01 AM

• MoritzMuehlenhoff closed subtask T143648: Merge facebook/hhvm@9d2be6c30b into build of next hhvm release as Resolved. Mar 6 2017, 2:39 PM

WMDE-leszek subscribed. Jul 11 2017, 2:57 PM

WMDE-leszek mentioned this in T170281: Raise PHP version requirement of Wikibase (and its related extensions) to 5.6. Jul 11 2017, 3:16 PM

The parent task has been declined, with the view that we're moving to PHP7 instead. Should this task be re-titled to reflect that?

Let's leave it for now. I've been following that discussion closely, but if I get to update the snapshot boxes to jessie before stretch + php7 is available, I'll likely want to move to hhvm like all of the app servers, migrating off when we move all of the production cluster. The good news is that I have been running against php7 for local tests for months now and it's working fine.

Jdforrester-WMF mentioned this in T178538: Bump PHP requirement to 5.6 in 1.31. Oct 18 2017, 10:26 PM

hoo closed subtask T117534: DCAT-AP: XML produces invalid output with HHVM as Resolved. Nov 16 2017, 12:33 PM

Liuxinyu970226 subscribed. Nov 24 2017, 11:56 AM

nichtich reopened subtask T117534: DCAT-AP: XML produces invalid output with HHVM as Open. Nov 27 2017, 8:34 AM

This is probably going to turn into "update to stretch and php7", but we're waiting on the RFC last call, due to close Jan 10th, before I decline or merge the task elsewhere.

Officially declining, move to php7 has been approved, see T176370
I've been working on a dump instance in deployment-prep already on stretch/php7, see T184258