Certain videos on Commons do not start at all or have interruptions
Closed, Resolved, Public

Description

We do have a somewhat urgent situation here:

On January 1st, a campaign is going to start on DE-WP, based on videos which we uploaded to Commons. We are expecting at least 1,000 page views of the wiki page that includes the videos.

It seems that there are some performance problems with playing the videos (interruptions or not starting at all). They are all webm files and should have been converted properly. I am talking about all video files in the category: https://commons.wikimedia.org/wiki/Category:Neue_Ehrenamtliche

This task is very likely linked with T153488: Commons video transcoders have over 6500 tasks in the backlog.

Does anyone have an idea on this? We want to avoid using a platform other than Commons.

Event Timeline

What makes you think the two problems are related?

It was mentioned by another Wikipedian in another discussion. I am no expert but it sounds like a plausible reason for the performance problem. Do you have other ideas?

@Verena transcoding is an async process, so a video has either been transcoded or not; in terms of user-facing playback, having a transcode queued does not really affect playback.

What could be related: the video player by default uses a smaller-resolution transcode, and if none is available yet it plays the original, which can be slow because of the high bandwidth needed for playing the video.

The playback at the original resolution shouldn't be affected in any way by the other issue; that's why I was asking a sincere question :)

Aklapper renamed this task from Commons videos: performance problems to Certain videos on Commons do not start at all or have interruptions. Dec 28 2016, 11:56 AM

Unrelated: I see some HTTP 500 thumbnail creation issues for e.g. Machmit_Intro_Videos.webm in the "network" tab of the "Developer Tools" of my web browser (for more info on how to find such problems, please see: Firefox ≥24; Internet Explorer; Google Chrome; Apple Safari).

Longest videos seem to be https://commons.wikimedia.org/wiki/File:Tutorial_04_Artikel_verbessern.webm and https://commons.wikimedia.org/wiki/File:ADA-Video_Mach_mit_bei_Wikipedia.webm . I'd expect an HTTP 206 (Partial content) in the network tab.
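To make the HTTP 206 expectation concrete, here is a minimal sketch of how a response to a ranged request could be classified. The helper name `classifyRangeResponse` and the sample header values are hypothetical and not part of MediaWiki; the status codes themselves follow the standard HTTP range-request behaviour.

```php
<?php
// Hypothetical helper (not part of MediaWiki): classify how a server answered
// a request that carried a "Range: bytes=0-1023" header.
function classifyRangeResponse(int $status, array $headers): string
{
    // Header names are case-insensitive, so normalise them before the lookup.
    $headers = array_change_key_case($headers, CASE_LOWER);

    if ($status === 206 && isset($headers['content-range'])) {
        return 'partial'; // the server honoured the Range header
    }
    if ($status === 200) {
        return 'full';    // Range was ignored or stripped; the whole file is sent
    }
    return 'error';       // e.g. 416 Range Not Satisfiable, or a 5xx
}

// Made-up sample responses for illustration.
echo classifyRangeResponse(206, ['Content-Range' => 'bytes 0-1023/524288000']), "\n";
echo classifyRangeResponse(200, ['Content-Length' => '524288000']), "\n";
```

A 'full' result for a ranged request would suggest that something between the browser and the file storage is dropping the Range header, so the player has to fetch the whole file before it can seek.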

I just looked into this, and for me in Firefox 50, (some of) the videos only play once their entirety has been downloaded twice(!).

Even when downloading with about 80-100Mbps, this means that we need to wait over a minute for some videos to properly start.

Thanks for the reply. Is that an issue of Commons or an issue of the video file itself?

hoo added subscribers: Tgr, brooke, TheDJ.

Thanks for the reply. Is that an issue of Commons or an issue of the video file itself?

I think we have two issues here: The player seems to be very inefficient (why is it loading the files twice?), and we also don't have any reasonably sized transcodes available, so people have to load huge chunks of data (up to 500MiB) before some videos even start playing.
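As a rough sanity check of the figures mentioned in this thread (files of up to 500 MiB, a connection of 80-100 Mbps, the file fetched twice), the waiting time can be estimated as below. This is only back-of-the-envelope arithmetic; the function name `downloadSeconds` is made up for illustration, and protocol overhead is ignored.

```php
<?php
// Illustrative arithmetic only; sizes and rates are the figures quoted in this thread.
function downloadSeconds(float $sizeMiB, float $rateMbps, int $passes = 1): float
{
    $bits = $sizeMiB * 1024 * 1024 * 8;          // MiB -> bits
    return $passes * $bits / ($rateMbps * 1e6);  // 1 Mbps = 10^6 bits per second
}

printf("%.1f s\n", downloadSeconds(500, 100, 2)); // 83.9 s
printf("%.1f s\n", downloadSeconds(500, 80, 2));  // 104.9 s
```

Both results are well over a minute, which matches the delay observed above for the largest files.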

The videos not starting instantaneously could be due to some files not being streamable (although they seem streamable in VLC)?

The videos in question are in the transcode queue, but the transcoders will take at least another couple of days to reach them. In theory we could probably kick these jobs off by hand on an idle misc server (do they have the dependencies for video transcodes?), but I'm not sure that's desirable.

1: If your browser doesn't support webm, the films might not start, because there is no ogg transcode available yet.
2: If you have a slow internet connection, then you might experience stuttering, because lower resolution transcodes are not available yet.
3: If the file was not properly prepared (and you are playing the original because of 1 or 2), it might be downloaded in full or otherwise inefficiently used.
4: If there is a network problem that strips or otherwise causes loss of Range headers or range capabilities, then you will also have a problem (and this could also be inside our backend landscape, because I think we do quite a bit of magic with swift and varnish).

I'd try to use YouTube whenever possible, if you need any guarantees.

The videos not starting instantaneously could be due to some files not being streamable (although they seem streamable in VLC)?

VLC is smart at recovery. It does many things that most simpler video playback engines won't do.

That sounds sensible, thanks @TheDJ. Shall we open a new bug about the player loading the files twice?

Shall we open a new bug about the player loading the files twice?

Yes, that seems sensible. I have not been able to reproduce that problem btw. For me it all downloads immediately in 1.5MB chunks when I use FF.

And since it's native playback, you should probably see the same behavior if you open the original http url of the file in your browser standalone.

Oh and I just noticed https://commons.wikimedia.org/wiki/File:Machmit_Intro_Videos.webm claiming to be 41 hours long (also in VLC). That indicates a significant problem with the original file.

With something like the following, we could manually run the transcodes in question on one of the video scalers:

<?php
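// Intended to be run by hand on a video scaler, with MediaWiki already loaded.
$jobQueueGroup = JobQueueGroup::singleton();
$jobQueue = $jobQueueGroup->get( 'webVideoTranscode' );

// Find the first queued transcode job for one of the tutorial videos.
$job = null;
foreach ( $jobQueue->getAllQueuedJobs() as $cJob ) {
        // getPrefixedDBkey() keeps the underscores that the pattern expects;
        // getFullText() would return the title with spaces instead.
        if ( $cJob instanceof WebVideoTranscodeJob
                && preg_match( '/^File:Tutorial_\d/', $cJob->getTitle()->getPrefixedDBkey() )
        ) {
                $job = $cJob;
                break;
        }
}
if ( $job === null ) {
        die( "Nothing found, seems we are done!\n" );
}

echo 'Chose: ' . $job->toString() . "\n\n\n";

// Run the transcode synchronously and acknowledge the job only on success.
$success = $job->run();
if ( $success ) {
        $jobQueueGroup->ack( $job );
        echo "Success.";
} else {
        echo "Something went wrong :(";
}

echo "\n\n\n";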

@aaron @Reedy Do you see a better way to go about this?

@hoo FYI, due to a recent operations issue, all of the backlogged video transcodes were booted out of the queue, and have to be restarted (at a sane rate, ofc)... I'm booting your videos back through first.

Is this still a problem? I tried a few of the videos and they played properly. Was this just caused by missing transcodes?

Actually not. The campaign is over and all videos are transcoded. Unfortunately the transcoding process did not happen in time for the campaign. In addition there were some problems with the videos in Safari and Internet Explorer, but this seems to be a general Commons problem.

There is an issue that occurs sometimes (such as when the servers are restarted) where transcodes end up with both a 'success' and a 'failure' status in the SQL... they are shown on the file pages as successfully transcoded, but it seems that, in general, they were not.

The transcodes affected by this show up in https://quarry.wmflabs.org/query/14916 (as well as exactly 35 transcodes from 2013 that are not due to this and are 'orphaned' because their files were renamed; the bug that created those seems to be long fixed).

Right now there are not 'any' listed here, because I have run them back through, over time. I'm reasonably sure that this was the cause of the videos that would not play successfully... that the transcode was somehow 'aborted', but the partially transcoded video was treated as successful.

If this occurs again (hopefully, in smaller numbers) I'll preserve them, and open an actual bug, but in the case of hundreds (mainly new uploads) it seemed important to simply get them rerun.

@Revent, I really wish you had saved the SQL output before rerunning everything. At least a few rows. The logic of determining if there was a failure is quite intricate :(

So you say that they have a 'success' timestamp, but also an error timestamp or an error, or that they generally just don't have a proper result file?

@TheDJ The search checks for the existence of 'not null' in both transcode_time_success and transcode_time_error. The ones that were originally in the report, other than the 35 there now, were all from 'years ago', and the ones I checked at the time were rather 'obviously' deleted shortly after being uploaded. As it stands now, if the file is deleted or renamed shortly after upload, the entries in the transcode table go away, so that seems to have been long ago fixed.
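For readers without access to the Quarry query, the check described here (both a success timestamp and an error timestamp set on the same transcode row) can be sketched roughly as follows. The rows are made-up sample data shaped like the transcode table, and `findContradictoryTranscodes` is a hypothetical name; the actual query may well differ.

```php
<?php
// Hypothetical rows shaped like the `transcode` table; all values are made up.
$rows = [
    ['transcode_image_name' => 'Example_A.webm', 'transcode_key' => '360p.webm',
     'transcode_time_success' => '20161219120000', 'transcode_time_error' => null],
    ['transcode_image_name' => 'Example_B.webm', 'transcode_key' => '360p.webm',
     'transcode_time_success' => '20161219120500', 'transcode_time_error' => '20161219121000'],
    ['transcode_image_name' => 'Example_C.webm', 'transcode_key' => '480p.webm',
     'transcode_time_success' => null, 'transcode_time_error' => '20161219123000'],
];

// Keep only rows that claim both a success and an error, as described above.
function findContradictoryTranscodes(array $rows): array
{
    return array_values(array_filter($rows, function (array $row): bool {
        return $row['transcode_time_success'] !== null
            && $row['transcode_time_error'] !== null;
    }));
}

foreach (findContradictoryTranscodes($rows) as $row) {
    echo $row['transcode_image_name'], ' ', $row['transcode_key'], "\n";
}
```

With the sample data above, only the Example_B.webm row is reported, since it is the only one carrying both timestamps.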

When the servers were restarted in December (the 19th I think), what appeared to be all the 'overloaded' tasks started due to the apache bug appeared in that report... it was, iirc, about 1200.

When the servers were restarted again this month, on the 11th or 12th, about 450 tasks (what appeared to be the entire content of the queue at that time) were dumped into that report.

What I 'have' noticed is that if a long-running transcode is reset 'while it is running' it ends up in that report.... IIRC, with an error that indicates that the working file went away. If you want, I can probably quite easily cause it to happen again a few times.... we have plenty of big files that need to be run.

Closing this out as the backlog issue is over; there may still be general issues with transcodes that get reset at weird times, so people should open a new specific bug if they can reproduce them. (Will be doing more maintenance on the transcode queue and how it's handled later this spring, probably.)