Frequent job timeouts on HHVM video scalers
Closed, Resolved · Public

Description

Since updating two of the video scalers to HHVM (T104747) and leaving the remaining old one out of rotation, we've seen that things work sometimes but also frequently time out without properly recording the failure.

jobrunner.log lists 503 errors from HHVM, which may indicate we're hitting HHVM's generic request timeout; that limit is likely much too short for long-running transcode processes. It looks like the timeout kills the job process immediately, before it can write an update to the transcode table, so Special:TimedMediaHandler and the transcode tables on File: pages still claim the transcodes are running.
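
To make the symptom concrete, here is a minimal Python sketch of how such orphaned transcodes could be spotted: any record that has a start time, no recorded success or error, and has been "running" longer than a plausible maximum. The `TranscodeRow` type, its field names, and the 30-minute threshold in the usage note are illustrative assumptions, not the actual TimedMediaHandler schema or configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TranscodeRow:
    # Hypothetical, simplified view of one transcode record.
    name: str
    started: Optional[datetime]   # when a runner picked the job up
    finished: Optional[datetime]  # when success OR failure was recorded


def find_stale(rows: List[TranscodeRow], now: datetime,
               max_runtime: timedelta) -> List[str]:
    """Return names of transcodes that claim to be running for too long.

    A job killed by a hard timeout never records a finish time, so it
    stays in the 'started but not finished' state indefinitely.
    """
    return [
        r.name
        for r in rows
        if r.started is not None
        and r.finished is None
        and now - r.started > max_runtime
    ]
```

For example, with a threshold of `timedelta(minutes=30)`, a record started an hour ago with no finish time would be reported, while one started ten minutes ago would not.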

Need to investigate with Joe exactly what's going on and whether we can adjust it to handle the longer-running processes better.

Event Timeline

brion created this task. Sep 21 2015, 7:57 PM
brion updated the task description.
brion raised the priority of this task from to Needs Triage.
brion added subscribers: brion, Joe.
Restricted Application added subscribers: Matanya, Aklapper. Sep 21 2015, 7:57 PM
Joe added a comment. Sep 23 2015, 7:08 AM

This should be a bit better now that I have raised a few HHVM timeouts.

brion added a comment. Sep 25 2015, 7:52 PM

@Joe I still see a lot of failures, but now they come with a giant WMF error page:

2015-09-25T19:50:40+0000: Runner loop 0 process in slot 3 gave status '0':
curl -XPOST -s -a 'http://127.0.0.1:9005/rpc/RunJobs.php?wiki=commonswiki&type=webVideoTranscode&maxtime=30&maxmem=300M'
	Encoding to codec: vp8
Running cmd: 

'/usr/bin/ffmpeg' -y -i '/tmp/localcopy_4118d5722782-1.webm' -threads 2 -skip_threshold 0 -bufsize 6000k -rc_init_occupancy 4000 -qmin 1 -qmax 51 -vb '1024000' -vcodec libvpx -g '128' -keyint_min '128' -f webm -s 854x480 -an -pass '1' -passlogfile '/tmp/transcode_480p.webm88f646e0f3e8-1.webm.log' /dev/null

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>Wikimedia Error</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <meta name="author" content="Mark Ryan, with translation by many people; see http://meta.wikimedia.org/wiki/Multilingual_error_messages"/>
  <meta name="copyright" content="(c) 2005-2007 Mark Ryan and others. Text licensed under the GNU Free Documentation License. http://www.gnu.org/licenses/fdl.txt"/>

  <style type="text/css"><!--
   body {
     background-color: #dbe5df;
     font-family: "Gill Sans MT", "Gill Sans", "Trebuchet MS", Helvetica, sans-serif;
     margin-left: 0px;
     margin-right: 0px;
    }
   .TechnicalStuff {
     font-style: italic;
     text-align: center;
     font-size: 0.8em;
     padding-bottom: 0.8em;
    }
   .BottomStrip {
     background: #9fbfd8;
     text-align: center;
     font-size: 0.85em;
    }
   .RightToLeft {
     direction: rtl;
    }
   .Lines {
     width: 100%;
     height: 1px;
     overflow: hidden;
     font-size: 0.5px;
    }
   .ContentArea {
     background-color: white;
     padding-left: 10%;
     padding-right: 10%;
     padding-top: 0.8em;
     font-size: 1.0em;
    }
   a:hover {
     color: red;
    }
   a.BottomLinks {
     color: #000000;
     text-decoration: none;
    }
   a.BottomLinks:hover {
     color: red;
     text-decoration: none;
    }
   h1, h2 {
     margin: 0px;
     font-size: 1.0em;
    }
   h3.LanguageHeading {
     font-weight: bold
    }
   #ErrorTitleDiv {
     background: #9fbfd8;
     font-size: 1.2em;
     font-weight: bold;
     text-align: center;
    }
   #FoundationNameDiv {
     background: #dbe5df;
     font-size: 1.5em;
     font-family: "Gill Sans MT", "Gill Sans", Helvetica, Humanist, sans-serif;
     font-weight: bold;
     text-transform: uppercase;
     text-align: center;
     width: 100%;
     padding-top:0.8em;
    }
   #TopLinks {
     text-align: center;
     font-size: 0.8em
    }
   -->
  </style>

  <script type="text/javascript"><!-- Begin

   // The first column of this array is for the local language name of the Wikimedia Foundation
   // ('Wikimedia Foundation' should be used for all Latin-based languages)
   // The second column of the array is the localised language word for 'Error'.
   var LanguageDetails = new Array();
   LanguageDetails['ar'] = new Array( "مؤسسة ويكيميديا", "خطأ" );
   LanguageDetails['cs'] = new Array( "Wikimedia Foundation", "Chyba" );
   LanguageDetails['da'] = new Array( "Wikimedia Foundation", "Fejl" );
   LanguageDetails['de'] = new Array( "Wikimedia Foundation", "Fehler" );
   LanguageDetails['el'] = new Array( "Ίδρυμα Wikimedia", "Σφάλμα" );
   LanguageDetails['en'] = new Array( "Wikimedia Foundation", "Error" );
   LanguageDetails['es'] = new Array( "Wikimedia Foundation", "Error" );
   LanguageDetails['et'] = new Array( "Wikimedia Foundation", "Viga" );
   LanguageDetails['fa'] = new Array( "بنیاد ویکی‌مدیا", "خطا" );
   LanguageDetails['fi'] = new Array( "Wikimedia Foundation", "Virhe" );
   LanguageDetails['fr'] = new Array( "Wikimedia Foundation", "Erreur" );
   LanguageDetails['he'] = new Array( "קרן ויקימדיה", "שגיאה" );
   LanguageDetails['id'] = new Array( "Wikimedia Foundation", "Error" );
   LanguageDetails['it'] = new Array( "Wikimedia Foundation", "Errore" );
   LanguageDetails['ja'] = new Array( "ウィキメディア財団", "エラー" );
   LanguageDetails['ko'] = new Array( "위키미디어 재단", "오류" );
   LanguageDetails['no'] = new Array( "Wikimedia Foundation", "Feil" );
   LanguageDetails['nl'] = new Array( "Wikimedia Foundation", "Fout" );
...
brion added a comment. Sep 25 2015, 8:07 PM

Hmm maxtime=60 ? Do we really want that in the URL? :)

brion added a comment. Sep 25 2015, 8:10 PM

@Paladox yes, those are linked on the duplicate bug report.

OK, but I mean I have re-run them and they are still taking a while.

brion added a comment. Sep 25 2015, 8:14 PM

@Paladox please stop re-running transcodes; having other people reset things unexpectedly interferes with our ability to track what's going on and fix the problem.

Oh sorry, I didn't know I shouldn't have done that.

Joe added a comment. Sep 28 2015, 7:47 AM

@Paladox @brion I think Ori might have found the problem and fixed it: we were setting an override on max_execution_time in mediawiki-config if not running on CLI.

It should go much better from now on, please let me know.
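
As a rough illustration of why such an override would hurt the video scalers, the Python sketch below models a time limit that is replaced by a short web default whenever the request does not come from the CLI. The function names, the SAPI labels, and all numeric values are hypothetical; the real logic lived in mediawiki-config and is not reproduced here.

```python
def effective_time_limit(sapi: str, requested: int, web_override: int) -> int:
    """Return the execution time limit (seconds) that would be enforced.

    Models an override applied to every non-CLI request: the limit the
    job asked for is ignored and the short web default wins.
    """
    if sapi != "cli":
        return web_override
    return requested


def would_be_killed(runtime: int, limit: int) -> bool:
    """True if a job needing `runtime` seconds exceeds `limit` seconds."""
    return runtime > limit


# Hypothetical numbers: a transcode needing 30 minutes, a job that asked
# for an hour, and a web override of 60 seconds.
web_limit = effective_time_limit("server", requested=3600, web_override=60)
cli_limit = effective_time_limit("cli", requested=3600, web_override=60)
```

Under these assumed values, a job dispatched over HTTP (as the job runner does via `RunJobs.php`) would be cut off after 60 seconds, while the same job run from the CLI would have the full hour.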

Joe claimed this task. Sep 28 2015, 7:47 AM
Joe set Security to None.
TheDJ added a subscriber: TheDJ. Sep 28 2015, 10:42 AM

Might be better for some cases, but the Lila Tetrikov file from T113532 still appears to have trouble. Possibly that's another issue and the ticket should be unduped?

Paladox added a subscriber: ori. Sep 28 2015, 11:55 AM

Thank you @Joe and @ori for fixing the problem.

Joe added a comment. Sep 29 2015, 12:41 PM

\o/ Resolving this, and reimaging the remaining videoscaler!

@TheDJ I'll look into that specific bug today.

Joe closed this task as Resolved. Sep 29 2015, 12:46 PM
Joe triaged this task as High priority.