
Commons video transcoders have over 6500 tasks in the backlog.
Closed, Resolved (Public)

Description

Please hold off on more server-side uploads until the transcoders can catch up. The bug in how tasks are run (the runner keeps starting tasks until memory runs out, far in excess of the CPU count) means these large uploads prevent any transcodes from completing successfully unless tasks are manually reset to keep the load on the server under control.
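
As an illustration of the missing safeguard, here is a minimal sketch in Python; the real jobrunner is PHP inside MediaWiki, and the command handling below is a placeholder, not the actual implementation. The point is simply to bound concurrency by the CPU count instead of spawning jobs until memory runs out.

# Sketch only, not the MediaWiki jobrunner: cap concurrent transcode jobs at
# the CPU count so the host never starts more work than it can actually run.
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_transcode(cmd):
    """Run one transcode command line (placeholder) and return its exit code."""
    return subprocess.run(cmd, check=False).returncode

def drain_queue(job_cmds):
    max_workers = os.cpu_count() or 2   # never exceed the number of CPUs
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_transcode, job_cmds))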

https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=7&fullscreen&var-jobType=webVideoTranscode&from=now-30d&to=now-5m
https://commons.wikimedia.org/wiki/Special:TimedMediaHandler


Event Timeline


Change 328345 merged by Elukey:
Move mw116[89] to a single partition recipe

https://gerrit.wikimedia.org/r/328345

Mentioned in SAL (#wikimedia-operations) [2016-12-20T21:48:17Z] <mutante> depooled mw1168 videoscaler for reinstall (part of T153488)

mw1168 and mw1169 have been reinstalled with the new partman recipe and now have more space in / (no separate /tmp)

Change 328473 had a related patch set uploaded (by Elukey):
Increase hhvm threads and transcode capabilities on mw116[89]

https://gerrit.wikimedia.org/r/328473

Change 328473 merged by Elukey:
Increase hhvm threads and transcode capabilities on mw116[89]

https://gerrit.wikimedia.org/r/328473

Change 328479 had a related patch set uploaded (by Elukey):
Add hhvm timeouts overrides to hiera for mw116[89]

https://gerrit.wikimedia.org/r/328479

Change 328479 merged by Elukey:
Add hhvm timeouts overrides to hiera for mw116[89]

https://gerrit.wikimedia.org/r/328479

Change 328194 merged by Giuseppe Lavagetto:
mediawiki::scaler: check orphaned HHVM threads

https://gerrit.wikimedia.org/r/328194

elukey lowered the priority of this task from High to Medium. (Dec 22 2016, 9:26 AM)

The queue, as far as I can see, has been processing tasks during the past day without hitting any major issue. I believe that after this action the video scalers hit the apache/hhvm timeout bug, causing the load congestion. The queue backlog is still big, but I believe we'll need to wait a few days to let the jobs be processed.

@Revent, please let us know if you see anything weird still going on.

Jobs seem to be processing without problems... I have seen tasks completing successfully in significant numbers (hundreds), and very few recent failures.

The only thing keeping the queue high is that those pesky users keep uploading videos. :P

Mentioned in SAL (#wikimedia-labs) [2016-12-22T12:10:16Z] <zhuyifei1999_> depooling v2c workers at encoding02 and encoding03 in order to make transcoding slower and have production videoscalers catch up (T153488)

As of now there are

  • 30 running transcodes
  • 10,571 queued transcodes
  • 361,961 failed transcodes

Also, there are a large number of videos being uploaded daily.
Not only does it seem that it will take ages to clear the backlog, but the queue is not shrinking right now.
So it seems this bug is still quite serious, isn't it?

@Yann I have the impression that more action will be taken after the holidays.

So it seems this bug is still quite serious, isn't it?

Thanks a lot for following up; the overall work is indeed not complete yet. Up to now we have:

  • fixed the immediate blocker, namely the apache/hhvm timeout that was causing high load/utilization and tons of re-submitted jobs.
  • doubled the capacity, from 2 to 4 hosts dedicated to video transcoding.

We wanted to wait some time before taking any other action because of the holiday season, and also because we wanted to see how the backlog grows with a healthy processing pipeline. As you pointed out, the queue is growing, but we need to figure out exactly what is currently in it. For example, are we seeing the effects of this action, or is it only newly uploaded videos? Is it really a bug, or just simple congestion?

The job queue wait time trend is not really giving us good news, but my point is that we can't simply throw hardware at it without understanding what is happening.

I had a quick glance at the Apache response timings (i.e. the overall time a videoscaler's jobrunner spends to complete a job): most of the requests are served in less than 5-10 minutes (small jobs), but some take up to 4 hours.
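
For reference, a check like that could be scripted roughly as below, assuming the videoscalers' Apache LogFormat ends with the request duration in microseconds (%D); the log path and format here are assumptions, not the actual configuration.

# Rough sketch: bucket jobrunner request durations from an Apache access log.
# Assumes the LogFormat ends with %D (duration in microseconds); the path is
# a placeholder, not the real videoscaler log location.
from collections import Counter

def duration_buckets(path="/var/log/apache2/access.log"):
    buckets = Counter()
    with open(path) as log:
        for line in log:
            try:
                micros = int(line.rsplit(None, 1)[-1])
            except ValueError:
                continue  # skip lines without a numeric duration field
            minutes = micros / 60_000_000
            if minutes < 10:
                buckets["< 10 min (small jobs)"] += 1
            elif minutes < 60:
                buckets["10-60 min"] += 1
            else:
                buckets["> 1 hour"] += 1
    return buckets

print(duration_buckets())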

So in my opinion the next step is to figure out what's currently inside the queue and work from that result.
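
One way to do that, sketched below, would be to break down the queued-but-not-started entries by the month they were enqueued. This assumes direct read access to the commonswiki database; the table and column names come from TimedMediaHandler's transcode table and should be double-checked, and the connection details are placeholders.

# Sketch: summarize queued (not started, not failed) transcodes by the month
# they were enqueued and by target key. Host and credentials are placeholders.
import pymysql

QUERY = """
SELECT LEFT(transcode_time_addjob, 6) AS yearmonth,
       transcode_key,
       COUNT(*) AS queued
FROM transcode
WHERE transcode_time_startwork IS NULL
  AND transcode_time_error IS NULL
  AND transcode_time_success IS NULL
GROUP BY yearmonth, transcode_key
ORDER BY queued DESC
"""

conn = pymysql.connect(host="db-replica.example.org", user="reader",
                       password="...", database="commonswiki")
with conn.cursor() as cur:
    cur.execute(QUERY)
    for yearmonth, key, queued in cur.fetchall():
        print(yearmonth, key, queued)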

Anybody who has experience and wants to chime in, please do! :)

@elukey Just to be clear, I have not (other than possibly incidentally) been putting old failed transcodes back on the queue since the backlog exploded (my comment there was back in mid-November). I have been resetting the transcodes with 'broken' (as in, both queued and errored) DB statuses that were created when the servers were reset on the 19th, but those don't 'add anything' to the backlog.... they were already queued.

@zhuyifei1999, the main issue seems to be a ton of jobs submitted to the queue that we don't really know anything about. Up to a month ago we were running with a cluster of two hosts without any issues, while now four of them can't keep up with the work. I don't think this is all related to new video uploads. We can add a bit more throughput to the hosts, but it will not help resolve the main problem.

@brion - do you have any idea what is happening to the queue? Is it related to the major re-encoding that you mentioned in T150067?

Let me point out that the queue is hardly growing anymore, that we're in the middle of the holiday freeze, and that I think at this point this is not a capacity issue, unless there is a really large inflow of transcode requests.

I just looked at the trends in job waiting times, and they don't really make sense unless we're experiencing a very large backlog due to very heavy activity from video2commons.

@Matanya can we stop all video2commons activity until this is resolved?

In the meantime, I'll try to fine-tune the number of available workers, and maybe add a few more videoscalers for the time being; but I strongly suspect this is more than just backlog, and it might have something to do with the aftermath of the crisis we had until about a week ago.

Clearly, if we can run 30-40 transcodes in parallel, and most of those are large videos that take hours to transcode, it's pretty hard to get through the huge backlog we're accumulating.

Change 329447 had a related patch set uploaded (by Giuseppe Lavagetto):
videoscaler: bump up the number of running transcodes (again)

https://gerrit.wikimedia.org/r/329447

Something really curious about the queue: Special:TimedMediaHandler seems to report that many files from 2014/2015 are in it, e.g. File:Kaefer_-_2014-07-19.webm (2014), File:Klara_Golicina_video.ogv (2014), Ispod_Čučina,_pogled_na_cestu_102626.webm (2015), File:JornadaTic2014Secundario6Jujuy.webm (2014), besides the uploads from this month. Looks like some sort of batch rerun?

Regarding rETMH497c3532e787, the patch was merged in March, long before the cluster became overloaded this November/December. Yes, VP9 encoding is crazily slow (one report said it's 50x slower than VP8, iirc). Can you check on the cluster whether there are VP9-encoding ffmpegs running? They should have -vcodec libvpx-vp9 in their command-line arguments.
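
A quick way to check that on a scaler host (a sketch; pgrep -af libvpx-vp9 should give the same answer from a shell) is to scan the process command lines for the VP9 codec argument:

# Sketch: list ffmpeg processes currently encoding VP9 by scanning
# /proc/*/cmdline for the 'libvpx-vp9' argument (Linux only).
import glob

def vp9_ffmpeg_pids():
    pids = []
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            with open(path, "rb") as f:
                argv = f.read().split(b"\0")
        except OSError:
            continue  # process exited while we were scanning
        if argv and b"ffmpeg" in argv[0] and b"libvpx-vp9" in argv:
            pids.append(int(path.split("/")[2]))
    return pids

print(vp9_ffmpeg_pids())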

And yeah, we do have some floods of videos this month, especially from video2commons; see Special:RecentChanges. The count of possibly chunk-uploaded files (> 100MB) from the tool skyrocketed from 1374 on Nov 26 to 2229 on Dec 24.

Change 329447 merged by Giuseppe Lavagetto:
videoscaler: bump up the number of running transcodes (again)

https://gerrit.wikimedia.org/r/329447

Change 329487 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki::jobrunner: rotate log files weekly

https://gerrit.wikimedia.org/r/329487

Change 329487 merged by Giuseppe Lavagetto:
mediawiki::jobrunner: rotate log files weekly

https://gerrit.wikimedia.org/r/329487

So, after I fine-tuned the video scalers' load, we're now processing 73 transcodes at the same time. From what I can see on https://commons.wikimedia.org/wiki/Special:TimedMediaHandler, it's mostly running transcodes of White House press briefings; it seems like someone enqueued re-transcodes of all these files, which go as far back as 2013 (see e.g. https://commons.wikimedia.org/wiki/File:12-6-13-_White_House_Press_Briefing.webm). While we're churning through them at an unprecedented rate, there is clearly a legitimate queue of those files being transcoded.

The only way I see to clean up the backlog is to remove these transcodes from the queue and favour more recent videos.

Adding some info about https://bz.apache.org/bugzilla/show_bug.cgi?id=56188, which we are pretty sure is the root cause of the timeout issues.

I checked the FCGI spec at https://web.archive.org/web/20150129002720/http://www.fastcgi.com/drupal/node/6?q=node/22#S5.4:

5.4 FCGI_ABORT_REQUEST

The Web server sends a FCGI_ABORT_REQUEST record to abort a request. After receiving {FCGI_ABORT_REQUEST, R}, the application responds as soon as possible with {FCGI_END_REQUEST, R, {FCGI_REQUEST_COMPLETE, appStatus}}. This is truly a response from the application, not a low-level acknowledgement from the FastCGI library.

A Web server aborts a FastCGI request when an HTTP client closes its transport connection while the FastCGI request is running on behalf of that client. The situation may seem unlikely; most FastCGI requests will have short response times, with the Web server providing output buffering if the client is slow. But the FastCGI application may be delayed communicating with another system, or performing a server push.

When a Web server is not multiplexing requests over a transport connection, the Web server can abort a request by closing the request's transport connection. But with multiplexed requests, closing the transport connection has the unfortunate effect of aborting all the requests on the connection.

Apache httpd does not support connection multiplexing with the FCGI backend (and afaik nobody does), so in theory a simple connection drop would be enough for httpd to signal the client connection abort (last paragraph of the spec excerpt above). As far as I have seen, this is what most web server implementations do.
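
For concreteness, the abort record the spec is talking about is just the standard 8-byte FastCGI header with record type 2 (FCGI_ABORT_REQUEST) and an empty body; a small sketch of encoding and decoding it:

# Sketch of the FCGI_ABORT_REQUEST record from section 5.4 of the spec: the
# standard 8-byte header (version, type, requestId, contentLength,
# paddingLength, reserved) with record type 2 and no body.
import struct

FCGI_VERSION_1 = 1
FCGI_ABORT_REQUEST = 2  # record type defined by the FastCGI spec

def abort_record(request_id):
    # '!BBHHBx' = version, type, requestId, contentLength, paddingLength, pad
    return struct.pack("!BBHHBx", FCGI_VERSION_1, FCGI_ABORT_REQUEST,
                       request_id, 0, 0)

def parse_header(data):
    # returns (version, type, requestId, contentLength, paddingLength)
    return struct.unpack("!BBHHBx", data[:8])

assert parse_header(abort_record(1))[1] == FCGI_ABORT_REQUEST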

HHVM offers only very limited support for FCGI_ABORT:

https://github.com/facebook/hhvm/blob/HHVM-3.12/hphp/runtime/server/fastcgi/fastcgi-session.cpp#L475-L480

void FastCGISession::onRecordImpl(const fcgi::abort_record* rec) {
  if (!m_requestId || rec->requestId != m_requestId) {
    // Garbage record
    dropConnection();
    return;
  }

If I got it correctly, HHVM does not support the FCGI_ABORT functionality as the FCGI spec indicates ("This is truly a response from the application, not a low-level acknowledgement from the FastCGI library").

Last but not least, there is an important note at http://php.net/manual/en/function.ignore-user-abort.php:

PHP will not detect that the user has aborted the connection until an attempt is made to send information to the client.

I didn't check the PHP code that runs the jobs in depth, but I don't expect it to flush much to stdout.

Someone just reset a ton of 720p webm White House Press Briefing videos (which, with the current configuration, are destined to time out and fail), causing an increased backlog yesterday & today. Please don't do this again before the backlog is clear.

Just to update, the transcoders are still (as of now) mainly processing old multi-GB files.

Mentioned in SAL (#wikimedia-labs) [2017-01-11T05:46:48Z] <zhuyifei1999_> re-enabling v2c frontend & pooling worker at encoding02; transcode queue is < 500 now, and mostly containing relatively small files (T153488)

(Un-stalled the requests for server side uploads)

I wonder whether http://stackoverflow.com/questions/30801590/ffmpeg-performance-issue-when-transcoding-to-webm is related. I ask because this looks like a performance problem.

Should be easily fixed with ffmpeg's -speed option.
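
For what it's worth, the relevant knob in ffmpeg's libvpx wrapper is -speed (also exposed as -cpu-used): higher values trade some quality per bit for much faster encoding. An illustrative invocation, with placeholder file names and bitrate rather than TimedMediaHandler's actual settings:

# Illustrative only: a faster VP9 encode using libvpx's speed/cpu-used knob.
# File names and bitrate are placeholders, not TimedMediaHandler's settings.
import subprocess

cmd = [
    "ffmpeg", "-i", "input.webm",
    "-c:v", "libvpx-vp9",
    "-b:v", "1M",
    "-speed", "2",   # higher = faster encode, somewhat lower quality per bit
    "output_720p.vp9.webm",
]
subprocess.run(cmd, check=True)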

If I'm reading the graph data correctly, the queue is now caught up; I'm seeing actual transcodes go through quickly today, so it all looks good.

I'd strongly recommend deploying the initial changes for T155098 before restarting mass uploads via video2commons; we should be good to do that next week... never deploy on a Friday during an all-hands meeting ;)

@brion Yes, the 'pending' queue is now empty, other than brief spikes (mainly due to spurts of large uploads, or me throwing old failed stuff back through) and it re-empties itself in a reasonable time.

Yeah, I think we're good to close this one out; improvements to the queue handling are tracked in a separate ticket.