
Frequent job timeouts on HHVM video scalers
Closed, Resolved (Public)

Description

Since updating two of the video scalers to HHVM (T104747) and leaving the remaining old one out of rotation, we've seen that things work sometimes but also frequently time out without properly recording the failure.

The jobrunner.log lists 503 errors from HHVM. That may indicate we're hitting HHVM's generic timeout, which is likely much too short for the long-running transcode processes. It looks like this kills the job process immediately, without a chance to write an update to the transcode table, so Special:TimedMediaHandler and the transcode tables on File: pages still claim the transcodes are running.

We need to investigate with Joe exactly what's going on and whether we can adjust the timeout to handle the longer-running processes better.
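To make the symptom concrete: a transcode whose job was killed never gets an end or error timestamp, so it looks like it is running forever. The following is a minimal sketch of how such rows could be spotted, assuming a simplified record layout. The `TranscodeRow` class, its field names, and `find_stale` are hypothetical illustrations, not the real transcode table schema or any TimedMediaHandler API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TranscodeRow:
    # Hypothetical fields; the real transcode table may be laid out differently.
    key: str
    started: Optional[datetime]
    finished: Optional[datetime]
    failed: Optional[datetime]


def find_stale(rows: List[TranscodeRow], now: datetime,
               max_age: timedelta) -> List[TranscodeRow]:
    """Return rows that started but never recorded success or failure
    and have been 'running' longer than max_age."""
    return [
        r for r in rows
        if r.started is not None
        and r.finished is None
        and r.failed is None
        and now - r.started > max_age
    ]
```

The `max_age` threshold is a judgment call: it would need to be longer than the longest legitimate transcode, otherwise healthy jobs get flagged too.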

Event Timeline

brion raised the priority of this task from to Needs Triage.
brion updated the task description.
brion added subscribers: brion, Joe.

This should be a bit better now given I raised a few HHVM timeouts.

@Joe I still see a lot of failures, but now they come with a giant WMF error page:

2015-09-25T19:50:40+0000: Runner loop 0 process in slot 3 gave status '0':
curl -XPOST -s -a 'http://127.0.0.1:9005/rpc/RunJobs.php?wiki=commonswiki&type=webVideoTranscode&maxtime=30&maxmem=300M'
	Encoding to codec: vp8
Running cmd: 

'/usr/bin/ffmpeg' -y -i '/tmp/localcopy_4118d5722782-1.webm' -threads 2 -skip_threshold 0 -bufsize 6000k -rc_init_occupancy 4000 -qmin 1 -qmax 51 -vb '1024000' -vcodec libvpx -g '128' -keyint_min '128' -f webm -s 854x480 -an -pass '1' -passlogfile '/tmp/transcode_480p.webm88f646e0f3e8-1.webm.log' /dev/null

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>Wikimedia Error</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <meta name="author" content="Mark Ryan, with translation by many people; see http://meta.wikimedia.org/wiki/Multilingual_error_messages"/>
  <meta name="copyright" content="(c) 2005-2007 Mark Ryan and others. Text licensed under the GNU Free Documentation License. http://www.gnu.org/licenses/fdl.txt"/>

  <style type="text/css"><!--
   body {
     background-color: #dbe5df;
     font-family: "Gill Sans MT", "Gill Sans", "Trebuchet MS", Helvetica, sans-serif;
     margin-left: 0px;
     margin-right: 0px;
    }
   .TechnicalStuff {
     font-style: italic;
     text-align: center;
     font-size: 0.8em;
     padding-bottom: 0.8em;
    }
   .BottomStrip {
     background: #9fbfd8;
     text-align: center;
     font-size: 0.85em;
    }
   .RightToLeft {
     direction: rtl;
    }
   .Lines {
     width: 100%;
     height: 1px;
     overflow: hidden;
     font-size: 0.5px;
    }
   .ContentArea {
     background-color: white;
     padding-left: 10%;
     padding-right: 10%;
     padding-top: 0.8em;
     font-size: 1.0em;
    }
   a:hover {
     color: red;
    }
   a.BottomLinks {
     color: #000000;
     text-decoration: none;
    }
   a.BottomLinks:hover {
     color: red;
     text-decoration: none;
    }
   h1, h2 {
     margin: 0px;
     font-size: 1.0em;
    }
   h3.LanguageHeading {
     font-weight: bold
    }
   #ErrorTitleDiv {
     background: #9fbfd8;
     font-size: 1.2em;
     font-weight: bold;
     text-align: center;
    }
   #FoundationNameDiv {
     background: #dbe5df;
     font-size: 1.5em;
     font-family: "Gill Sans MT", "Gill Sans", Helvetica, Humanist, sans-serif;
     font-weight: bold;
     text-transform: uppercase;
     text-align: center;
     width: 100%;
     padding-top:0.8em;
    }
   #TopLinks {
     text-align: center;
     font-size: 0.8em
    }
   -->
  </style>

  <script type="text/javascript"><!-- Begin

   // The first column of this array is for the local language name of the Wikimedia Foundation
   // ('Wikimedia Foundation' should be used for all Latin-based languages)
   // The second column of the array is the localised language word for 'Error'.
   var LanguageDetails = new Array();
   LanguageDetails['ar'] = new Array( "مؤسسة ويكيميديا", "خطأ" );
   LanguageDetails['cs'] = new Array( "Wikimedia Foundation", "Chyba" );
   LanguageDetails['da'] = new Array( "Wikimedia Foundation", "Fejl" );
   LanguageDetails['de'] = new Array( "Wikimedia Foundation", "Fehler" );
   LanguageDetails['el'] = new Array( "Ίδρυμα Wikimedia", "Σφάλμα" );
   LanguageDetails['en'] = new Array( "Wikimedia Foundation", "Error" );
   LanguageDetails['es'] = new Array( "Wikimedia Foundation", "Error" );
   LanguageDetails['et'] = new Array( "Wikimedia Foundation", "Viga" );
   LanguageDetails['fa'] = new Array( "بنیاد ویکی‌مدیا", "خطا" );
   LanguageDetails['fi'] = new Array( "Wikimedia Foundation", "Virhe" );
   LanguageDetails['fr'] = new Array( "Wikimedia Foundation", "Erreur" );
   LanguageDetails['he'] = new Array( "קרן ויקימדיה", "שגיאה" );
   LanguageDetails['id'] = new Array( "Wikimedia Foundation", "Error" );
   LanguageDetails['it'] = new Array( "Wikimedia Foundation", "Errore" );
   LanguageDetails['ja'] = new Array( "ウィキメディア財団", "エラー" );
   LanguageDetails['ko'] = new Array( "위키미디어 재단", "오류" );
   LanguageDetails['no'] = new Array( "Wikimedia Foundation", "Feil" );
   LanguageDetails['nl'] = new Array( "Wikimedia Foundation", "Fout" );
...

Hmm, maxtime=60? Do we really want that in the URL? :)
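For anyone checking which limits a given request carried, the query parameters can be pulled straight out of a logged RPC URL. This is a small sketch using the URL from the log excerpt earlier in this task. The `job_params` helper is a hypothetical name, and what the server does with each parameter is not shown here.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the jobrunner log excerpt in this task.
RPC_URL = ("http://127.0.0.1:9005/rpc/RunJobs.php"
           "?wiki=commonswiki&type=webVideoTranscode&maxtime=30&maxmem=300M")


def job_params(url: str) -> dict:
    """Return the query parameters of a job runner RPC URL as a flat dict."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; each key appears once here.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = job_params(RPC_URL)
print(params["type"], params["maxtime"], params["maxmem"])
```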

@Paladox yes, those are linked on the duplicate bug report.

OK, but I mean I have re-run them and they are still taking a while.

@Paladox please stop re-running transcodes; having other people reset things unexpectedly interferes with our ability to track what's going on and fix the problem.

Oh, sorry, I didn't know I shouldn't have done that.

@Paladox @brion I think Ori might have found the problem and fixed it: we were setting an override on max_execution_time in mediawiki-config if not running on CLI.

It should go much better from now on, please let me know.
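Why this override would hit the video scalers: the job runner invokes RunJobs.php over HTTP (see the curl call in the log), so those jobs do not run on CLI and would pick up the web limit. Here is a minimal sketch of that rule, assuming the override simply replaces the configured limit for non-CLI requests. The function name, the `sapi` strings, and the numbers in the usage lines are hypothetical, not the actual mediawiki-config code or values.

```python
def effective_time_limit(sapi: str, configured: int, web_override: int) -> int:
    """Return the execution time limit (seconds) that ends up applying.

    Assumed rule: CLI runs keep the configured limit; everything else
    (including jobs invoked over HTTP) gets the web override.
    """
    if sapi == "cli":
        return configured
    return web_override


# Hypothetical numbers, for illustration only.
print(effective_time_limit("cli", 86400, 180))   # CLI keeps the long limit
print(effective_time_limit("http", 86400, 180))  # HTTP-invoked job is capped
```

Under this assumption, raising HHVM's own timeouts alone would not help, because the shorter override still applies to every HTTP-invoked job.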

Joe set Security to None.

It might be better for some cases, but the Lila Tetrikov file from T113532 still appears to have trouble. Possibly that's another issue and the ticket should be unduped?

Thank you @Joe and @ori for fixing the problem.

\o/ Resolving this, and reimaging the remaining videoscaler!

@TheDJ I'll look into that specific bug today.

Joe triaged this task as High priority.