Page MenuHomePhabricator

"Error, running transcode job, for job that has already started" notices
Open, LowPublic

Description

Lots of these in runJobs.log:

2015-02-19 18:32:49 tmh1002 commonswiki: webVideoTranscode File:СВ-ДНР-360._Обстрелы_со_стороны_Песок._Донецк.ogv transcodeMode=derivative t
ranscodeKey=160p.ogv (uuid=f6aa4c5cba62491d889a79f85701df0b,QueuePartition=rdb2) t=2594 error=Error, running transcode job, for job that has
already started

Event Timeline

aaron created this task.Feb 19 2015, 7:19 PM
aaron raised the priority of this task from to Needs Triage.
aaron updated the task description. (Show Details)
aaron added subscribers: aaron, MarkTraceur.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 19 2015, 7:19 PM
Bawolff set Security to None.
Restricted Application added a project: Multimedia. · View Herald TranscriptFeb 19 2015, 7:20 PM
Aklapper triaged this task as High priority.Feb 20 2015, 10:32 AM
Gilles edited a custom field.Feb 25 2015, 9:17 AM
Gilles claimed this task.Mar 4 2015, 10:49 AM

More recent examples:

2015-03-02 10:23:55 tmh1001 commonswiki: webVideoTranscode File:Las_Chapas_a_Confrides.ogg transcodeMode=derivative transcodeKey=360p.webm (uuid=eaa830854744403e9280c8a790ae292d,QueuePartition=rdb2) t=440 error=Error, running transcode job, for job that has already started
2015-03-02 19:00:53 tmh1002 commonswiki: webVideoTranscode File:SPRING_SPREE_spending_patterns_in_spain_during_easter_2011_hd1080.webm transcodeMode=derivative transcodeKey=360p.webm (uuid=db7f5b33eaf149bb932806cc13c9692a,QueuePartition=rdb2) t=660 error=Error, running transcode job, for job that has already started
2015-03-02 21:05:44 tmh1002 commonswiki: webVideoTranscode File:Gent_-_Omloop_Het_Nieuwsblad,_28_februari_2015_(E13A).ogv transcodeMode=derivative transcodeKey=360p.ogv (uuid=5d83bb8a8ab84a2491fe84010388bd3a,QueuePartition=rdb2) t=1248 error=Error, running transcode job, for job that has already started
2015-03-02 21:05:45 tmh1002 commonswiki: webVideoTranscode File:Gent_-_Omloop_Het_Nieuwsblad,_28_februari_2015_(E13A).ogv transcodeMode=derivative transcodeKey=480p.ogv (uuid=b4489859b74b4940b2e2b208af973b45,QueuePartition=rdb2) t=1827 error=Error, running transcode job, for job that has already started
2015-03-02 21:05:45 tmh1002 commonswiki: webVideoTranscode File:Gent_-_Omloop_Het_Nieuwsblad,_28_februari_2015_(E13A).ogv transcodeMode=derivative transcodeKey=160p.ogv (uuid=dc54092ef1504d3f8c81f47f29dccff4,QueuePartition=rdb2) t=1829 error=Error, running transcode job, for job that has already started
2015-03-04 10:25:13 tmh1001 commonswiki: webVideoTranscode File:Las_Chapas_a_Confrides.ogg transcodeMode=derivative transcodeKey=360p.webm (uuid=eaa830854744403e9280c8a790ae292d,QueuePartition=rdb2) t=405 error=Error, running transcode job, for job that has already started

Looking up the value of transcode_time_startwork for those in the DB:

СВ-ДНР-360._Обстрелы_со_стороны_Песок._Донецк.ogv 160p.ogv 20150217183249
Las_Chapas_a_Confrides.ogg 360p.webm 20150302102355
SPRING_SPREE_spending_patterns_in_spain_during_easter_2011_hd1080.webm 360p.webm 20150301190037 (had a transcode error)
Gent_-_Omloop_Het_Nieuwsblad,_28_februari_2015_(E13A).ogv 360p.ogv 20150228205805
Gent_-_Omloop_Het_Nieuwsblad,_28_februari_2015_(E13A).ogv 480p.ogv 20150228205805
Gent_-_Omloop_Het_Nieuwsblad,_28_februari_2015_(E13A).ogv 160p.ogv 20150228205805

It seems like the transcodes are re-attempted sometimes several days after they already ran and completed successfully.

Gilles added a comment.Mar 4 2015, 1:54 PM

Looking at this one specifically:

2015-03-02 10:23:55 tmh1001 commonswiki: webVideoTranscode File:Las_Chapas_a_Confrides.ogg transcodeMode=derivative transcodeKey=360p.webm (uuid=c52ee93cdd2c4d3f8cc68012736b40b3,QueuePartition=rdb2) STARTING
2015-03-02 10:23:55 tmh1001 commonswiki: webVideoTranscode File:Las_Chapas_a_Confrides.ogg transcodeMode=derivative transcodeKey=360p.webm (uuid=eaa830854744403e9280c8a790ae292d,QueuePartition=rdb2) STARTING
2015-03-02 10:23:55 tmh1001 commonswiki: webVideoTranscode File:Las_Chapas_a_Confrides.ogg transcodeMode=derivative transcodeKey=360p.webm (uuid=eaa830854744403e9280c8a790ae292d,QueuePartition=rdb2) t=440 error=Error, running transcode job, for job that has already started
2015-03-02 10:25:49 tmh1001 commonswiki: webVideoTranscode File:Las_Chapas_a_Confrides.ogg transcodeMode=derivative transcodeKey=360p.webm (uuid=c52ee93cdd2c4d3f8cc68012736b40b3,QueuePartition=rdb2) t=114122 good

Then the same happens the next day:

2015-03-04 10:25:12 tmh1001 commonswiki: webVideoTranscode File:Las_Chapas_a_Confrides.ogg transcodeMode=derivative transcodeKey=360p.webm (uuid=eaa830854744403e9280c8a790ae292d,QueuePartition=rdb2) STARTING
2015-03-04 10:25:13 tmh1001 commonswiki: webVideoTranscode File:Las_Chapas_a_Confrides.ogg transcodeMode=derivative transcodeKey=360p.webm (uuid=eaa830854744403e9280c8a790ae292d,QueuePartition=rdb2) t=405 error=Error, running transcode job, for job that has already started

It seems like the failed job will retry running again (normal behavior for errored jobs, I imagine) and keep running in to the same issue, creating those repeats.

The original issues seems to be that the same job, with 2 different uuids gets picked up in rapid succession. There might be a race condition that allows 2 jobs with the same settings to be pushed to the queue.

Gilles added a comment.Mar 4 2015, 2:09 PM

I've read the history of the file, it seems like it's the remaining race conditions that don't get caught by the memcache lock.

If the goal of updateJobQueue is to insert a job if there isn't already one, I wonder why basing this on the DB insert's success/failure isn't enough? It seems to me like bandaids were added, but the original logic was flawed.

I think this tells the story: https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandler/commit/8d87f6048a03044062aa328974d9526c37e2b6cb

Originally, the logic was to insert the job regardless, presumably whoever wrote the original code thought it was harmless to run the job more than once. With that logic it made sense to either insert if no job had been inserted before, or update the values if this was a repeat.

Since then, we've been trying to avoid duplicate jobs, but this insert or update logic is still there, and doesn't make sense in this paradigm. If we don't want to insert the same job twice, we should just use the DB's insert mechanism to avoid re-inserting a job that's already there.

Change 194333 had a related patch set uploaded (by Gilles):
Improve transcode job insertion deduping

https://gerrit.wikimedia.org/r/194333

Change 194333 merged by jenkins-bot:
Improve transcode job insertion deduping

https://gerrit.wikimedia.org/r/194333

I've seen an occurrence of that error in today's logs, will have to keep an eye on it in the following days to see if it's still frequent.

Gilles lowered the priority of this task from High to Low.Mar 25 2015, 3:04 PM

It's not very frequent, I see one occurence in the last 15 hours, but it's still happening.

Tgr added a subscriber: Tgr.Apr 20 2015, 6:05 PM

On vagrant this happens sometimes when reuploading deleted files (or, presumably, when uploading a new revision) - there is already a row in the transcode table, corresponding to the old version of the file, so the new job is not added. Couldn't reproduce on live though. Might be a race condition between the hook that cleans up the transcode table on delete/new upload and the new transcode. Might also be the cause for T63401.

Gilles removed Gilles as the assignee of this task.Apr 27 2015, 8:03 AM
Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board.Sep 4 2015, 5:53 PM
Restricted Application added a subscriber: Matanya. · View Herald TranscriptSep 4 2015, 5:53 PM
TheDJ moved this task from Backlog to Transcoding on the TimedMediaHandler board.Oct 21 2015, 7:25 PM
TheDJ removed the point value for this task.May 3 2016, 8:26 PM