Page MenuHomePhabricator

HHVM hangs on the API cluster
Closed, ResolvedPublic

Description

We've been hit by a new bug in HHVM, again happening mostly on the API cluster which is a different beast than the bug we reported a few weeks ago (https://phabricator.wikimedia.org/T182568):

  • some threads are locked into starving out one cpu at a time
  • strace shows no syscalls being made
  • inspecting a thread with perf shows most of the time is spent in HPHP::Class::getDeclPropIndex

I didn't go much further with my debugging session, as that part of the codebase is full of abstractions that are quite hard to follow.

Event Timeline

Joe triaged this task as High priority.Jan 3 2018, 8:24 AM
Joe created this task.
elukey added a subscriber: elukey.Jan 3 2018, 8:30 AM

Change 401714 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::appserver::api: add load monitoring

https://gerrit.wikimedia.org/r/401714

Change 401714 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::appserver::api: add load monitoring

https://gerrit.wikimedia.org/r/401714

@Joe Anything that you need from us on this? Or are you all set as far as further tracing and getting a bug reported?

Joe added a comment.Jan 4 2018, 8:34 AM

@Imarlier no I think there isn't much we can do until we have a reproduction case. For now I'm focusing on mitigations for this issue as 1) we're not on the latest HHVM LTS anyways and it's likely this bug might be resolved (and new ones pop up), and we're not investing on the very long term on HHVM.

Imarlier moved this task from Inbox to Radar on the Performance-Team board.Jan 8 2018, 3:48 PM
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.

Change 413271 had a related patch set uploaded (by Brion VIBBER; owner: Brion VIBBER):
[mediawiki/core@master] Fix infinite loop in JpegMetadataExtractor

https://gerrit.wikimedia.org/r/413271

Joe added a comment.Feb 21 2018, 10:52 PM

we finally tracked this down to JpegMetadataExtractor::segmentSplitter where a infinite loop can happen in case the jpeg is broken:

https://github.com/wikimedia/mediawiki/blame/89843b44ce94bcbb75b69f25c00c30f0ecc12752/includes/media/JpegMetadataExtractor.php#L85-L88

repro case:

<?php
$fh = fopen( 'broken.jpg', "rb" );
$buffer = fread( $fh, 2 );
while ( !feof( $fh ) ) {
    $buffer = fread( $fh, 1 );
    while ( $buffer !== "\xFF"  ) { #without feof, this loops forevah
        $buffer = fread( $fh, 1 );
    }
    echo "Out of inner loop!";
}

this hangs hhvm with the image fragment that was causing the issue.

This also happens with php5, of course, as the error is ours.

Change 413272 had a related patch set uploaded (by Chad; owner: Brion VIBBER):
[mediawiki/core@wmf/1.31.0-wmf.21] Fix infinite loop in JpegMetadataExtractor

https://gerrit.wikimedia.org/r/413272

Change 413272 abandoned by Chad:
Fix infinite loop in JpegMetadataExtractor

https://gerrit.wikimedia.org/r/413272

Change 413274 had a related patch set uploaded (by Chad; owner: Brion VIBBER):
[mediawiki/core@wmf/1.31.0-wmf.22] Fix infinite loop in JpegMetadataExtractor

https://gerrit.wikimedia.org/r/413274

Change 413272 restored by Chad:
Fix infinite loop in JpegMetadataExtractor

https://gerrit.wikimedia.org/r/413272

Change 413272 had a related patch set uploaded (by Chad; owner: Brion VIBBER):
[mediawiki/core@wmf/1.31.0-wmf.21] Fix infinite loop in JpegMetadataExtractor

https://gerrit.wikimedia.org/r/413272

Change 413271 merged by jenkins-bot:
[mediawiki/core@master] Fix infinite loop in JpegMetadataExtractor

https://gerrit.wikimedia.org/r/413271

Change 413272 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.21] Fix infinite loop in JpegMetadataExtractor

https://gerrit.wikimedia.org/r/413272

Change 413274 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.22] Fix infinite loop in JpegMetadataExtractor

https://gerrit.wikimedia.org/r/413274

Mentioned in SAL (#wikimedia-operations) [2018-02-22T00:12:14Z] <demon@tin> Synchronized php-1.31.0-wmf.21/includes/media/JpegMetadataExtractor.php: T184048 (duration: 01m 21s)

Mentioned in SAL (#wikimedia-operations) [2018-02-22T00:13:38Z] <demon@tin> Synchronized php-1.31.0-wmf.22/includes/media/JpegMetadataExtractor.php: T184048 (duration: 01m 13s)

Jdforrester-WMF added a subscriber: Jdforrester-WMF.

So… theoretically this should now be fixed, and we'll see fewer hung processes on the API servers.

elukey merged a task: Restricted Task.Feb 22 2018, 3:59 PM
elukey added a subscriber: MoritzMuehlenhoff.
Joe closed this task as Resolved.Feb 26 2018, 11:15 AM
Joe claimed this task.