Page MenuHomePhabricator

Very high number of transcode failure on Commons
Closed, ResolvedPublicPRODUCTION ERROR

Description

https://commons.wikimedia.org/wiki/Special:TimedMediaHandler

901,065 transcodes
138,852 Web streamable WebM (360P)
117,040 Web streamable WebM (480P)
96,513 High quality downloadable WebM (720P)
83,799 Full HD downloadable WebM (1080P)
138,650 Low bandwidth Ogg video (160P)
6,719 Web streamable Ogg video (240P)
123,882 Web streamable Ogg video (360P)
118,516 Web streamable Ogg video (480P)
77,094 Ogg Vorbis

This seems higher than it should be (but I don't know what is normal).

Looking through recent failures using api requests like these: https://commons.wikimedia.org/w/api.php?action=query&titles=File:Crohn_disease_video.webm&prop=transcodestatus&format=jsonfm
Most of these seem to be timeouts.

Event Timeline

Hashar reports: hashar> A connection error occured. Query: COMMIT Function: WebVideoTranscodeJob::run Error: 2006 MySQL server has gone away (10.64.48.23)

A few failed WebVideoTranscodeJob's:

1webVideoTranscode File:Doli_-_2016-01-26.webm transcodeMode=derivative transcodeKey=1080p.webm (uuid=2f8a7492f60a4540a3cbff6d3ff6555a,timestamp=1455570039,attempts=3) status=abandoned
2webVideoTranscode File:DSC_0189.webm transcodeMode=derivative transcodeKey=1080p.webm (uuid=b13e72eac6b04998969a5535b80d9835,timestamp=1455617518,attempts=3) status=abandoned
3webVideoTranscode File:Pust_16_5.webm transcodeMode=derivative transcodeKey=720p.webm (uuid=7c84663d83ca4544b916f5622cc86671,timestamp=1455289049,attempts=3) status=abandoned
4webVideoTranscode File:Doli_-_2016-01-26.webm transcodeMode=derivative transcodeKey=720p.webm (uuid=409e56ca0e614e03a07bcd50725dc15c,timestamp=1455570039,attempts=3) status=abandoned
5webVideoTranscode File:The_Window_2011.webm transcodeMode=derivative transcodeKey=360p.ogv (uuid=e82aa6ba3e09471b8c2300d185becb8b,timestamp=1455443898,attempts=3) status=abandoned
6webVideoTranscode File:The_Window_2011.webm transcodeMode=derivative transcodeKey=360p.webm (uuid=8d72dc38b0644ced92cfcb0aaae2a9ca,timestamp=1455443898,attempts=3) status=abandoned
7webVideoTranscode File:The_Window_2011.webm transcodeMode=derivative transcodeKey=720p.webm (uuid=967d8bd209a44928b971ee3460d980ba,timestamp=1455443898,attempts=3) status=abandoned

Might be related to: T97641: WebVideoTranscodeJob is keeping database connections open for several minutes on s4 master ?

TheDJ added a project: Commons.
TheDJ added a subscriber: hashar.
hashar added subscribers: aaron, brion.

Looking in https://logstash.wikimedia.org/ (private) for WebVideoTranscodeJob I have chain of messages such as:

DatabaseBase::query: lost connection to 10.64.48.23; reconnected:
Database.php line 875 calls wfBacktrace()
Database.php line 2694 calls DatabaseBase->query()
Database.php line 2677 calls DatabaseBase->doCommit()
LoadBalancer.php line 1048 calls DatabaseBase->commit()
LBFactory.php line 206 calls LoadBalancer->commitAll()
LBFactoryMulti.php line 395 calls Closure$LBFactory::forEachLBCallMethod()
LBFactory.php line 209 calls LBFactoryMulti->forEachLB()
LBFactory.php line 222 calls LBFactory->forEachLBCallMethod()
WebVideoTranscodeJob.php line 256 calls LBFactory->commitAll()
JobRunner.php line 262 calls WebVideoTranscodeJob->run()
JobRunner.php line 176 calls JobRunner->executeJob()
RunJobs.php line 47 calls JobRunner->run()

/rpc/RunJobs.php?wiki=commonswiki&type=webVideoTranscode&maxtime=30&maxmem=300M

webVideoTranscode File:Imbecille_Pericoloso.webm transcodeMode=derivative transcodeKey=480p.webm (uuid=27ee5eb63d694f519b07c5365a510b96,timestamp=1455877787,QueuePartition=rdb1-6380) t=338505 error=DBQueryError: A connection error occured. 
Query: COMMIT
Function: WebVideoTranscodeJob::run
Error: 2006 MySQL server has gone away (10.64.48.23)
A connection error occured. 
Query: COMMIT
Function: WebVideoTranscodeJob::run
Error: 2006 MySQL server has gone away (10.64.48.23)
WebVideoTranscodeJob::run	10.64.48.23	2006	MySQL server has gone away (10.64.48.23)	COMMIT

You can easily lookup such chain of messages by filtering on the uid field.


On commonswiki job queue we have:

$ mwscript showJobs.php --wiki=commonswiki --group --type=webVideoTranscode
webVideoTranscode: 3 queued; 129 claimed (121 active, 8 abandoned); 0 delayed

Each job has a timestamp associated to hit which is I believe the time at which they got enqueued. Using magic Unix command we can have list of human friendly start times of claimed jobs via:

mwscript showJobs.php --wiki=commonswiki --list --type webVideoTranscode --filter claimed|grep -o -P 'timestamp=\d+'|cut -d\= -f2|xargs -I {} date --date=@{} +'%F %T'|sort -n

1$ mwscript showJobs.php --wiki=commonswiki --list --type webVideoTranscode --filter claimed|grep -o -P 'timestamp=\d+'|cut -d\= -f2|xargs -I {} date --date=@{} +'%F %T'|sort -n
22016-02-12 14:57:29
32016-02-13 12:23:11
42016-02-14 09:58:18
52016-02-14 09:58:18
62016-02-14 09:58:18
72016-02-15 21:00:39
82016-02-15 21:00:39
92016-02-16 10:11:58
102016-02-16 11:14:37
112016-02-16 11:14:37
122016-02-18 10:52:00
132016-02-18 10:52:00
142016-02-18 10:52:00
152016-02-18 11:01:49
162016-02-18 11:01:49
172016-02-18 11:42:10
182016-02-18 11:42:10
192016-02-18 11:44:01
202016-02-18 14:10:04
212016-02-18 14:11:28
222016-02-18 14:28:50
232016-02-18 14:31:08
242016-02-18 14:31:08
252016-02-18 14:31:08
262016-02-18 14:31:08
272016-02-18 14:31:08
282016-02-18 14:31:08
292016-02-18 14:31:08
302016-02-18 14:38:23
312016-02-18 14:38:23
322016-02-18 14:38:23
332016-02-18 15:02:23
342016-02-18 15:03:56
352016-02-18 15:03:56
362016-02-18 15:03:56
372016-02-18 15:03:56
382016-02-18 15:03:56
392016-02-18 15:07:18
402016-02-18 15:07:18
412016-02-18 15:07:18
422016-02-18 15:36:02
432016-02-18 15:37:50
442016-02-18 15:48:02
452016-02-18 15:48:02
462016-02-18 15:48:02
472016-02-18 16:19:43
482016-02-18 16:20:20
492016-02-18 16:21:36
502016-02-18 17:00:24
512016-02-18 17:19:58
522016-02-18 17:19:58
532016-02-18 17:19:58
542016-02-18 17:19:58
552016-02-18 17:19:58
562016-02-18 17:19:58
572016-02-18 17:23:28
582016-02-18 17:23:28
592016-02-18 17:32:50
602016-02-18 17:32:50
612016-02-18 17:41:58
622016-02-18 17:41:58
632016-02-18 17:41:58
642016-02-18 17:41:58
652016-02-18 17:41:58
662016-02-18 17:41:58
672016-02-18 17:44:55
682016-02-18 17:44:55
692016-02-18 17:44:56
702016-02-18 17:48:36
712016-02-18 17:48:36
722016-02-18 17:48:36
732016-02-18 17:53:25
742016-02-18 17:53:25
752016-02-18 17:53:25
762016-02-18 18:17:03
772016-02-18 18:19:12
782016-02-18 18:33:11
792016-02-18 18:33:11
802016-02-18 18:33:11
812016-02-18 18:33:11
822016-02-18 18:33:11
832016-02-19 01:09:50
842016-02-19 01:09:50
852016-02-19 01:09:50
862016-02-19 01:10:01
872016-02-19 01:10:01
882016-02-19 06:46:54
892016-02-19 08:08:43
902016-02-19 08:08:43
912016-02-19 08:16:31
922016-02-19 09:10:45
932016-02-19 09:10:45
942016-02-19 09:13:23
952016-02-19 09:13:23
962016-02-19 09:19:05
972016-02-19 09:19:05
982016-02-19 09:19:05
992016-02-19 09:19:05
1002016-02-19 09:19:05
1012016-02-19 09:20:20
1022016-02-19 09:23:16
1032016-02-19 09:38:24
1042016-02-19 09:43:38
1052016-02-19 09:43:38
1062016-02-19 09:43:38
1072016-02-19 09:43:38
1082016-02-19 09:43:38
1092016-02-19 09:43:38
1102016-02-19 09:53:09
1112016-02-19 10:09:40
1122016-02-19 10:09:40
1132016-02-19 10:16:35
1142016-02-19 10:16:35
1152016-02-19 10:16:35
1162016-02-19 10:24:25
1172016-02-19 10:24:25
1182016-02-19 10:29:47
1192016-02-19 10:29:47
1202016-02-19 10:34:13
1212016-02-19 10:47:25
1222016-02-19 10:47:25
1232016-02-19 10:47:25
1242016-02-19 10:47:43
1252016-02-19 10:47:43
1262016-02-19 10:47:43
1272016-02-19 10:47:43
1282016-02-19 10:47:43
129

There are a few old jobs (more than a day old).

For the File:Imbecille_Pericoloso.webm file mwscript showJobs.php --wiki=commonswiki --list --type webVideoTranscode|grep -i Imbecile I ran the command at 10:57:

transcodeModetranscodeKeyuuidtimestamphumanattemptsstatus
derivative720p.web5fa99334029d40649ecce4174314bc9d14558777872016-02-19 10:29:471claimed
derivative480p.web27ee5eb63d694f519b07c5365a510b9614558777872016-02-19 10:29:471claimed

Server examples: mw1259 mw1260. The way the jobs are run is the jobrunner daemon hit /rpc/runJobs.php entries served by HHVM daemon. webTranscodeVideo then execute shell script to do the transcoding.

I suspect that:

  • the job/MediaWiki starts a transaction
  • triggers the transcoding
  • after X minutes the database kill the idle transaction
  • MediaWiki flag a database error occured
  • MediaWiki reconnects a new connection
  • transaction commit fails because the db connection has gone away

Might be a general issues with rpcJobs.

Looks like ffmpegEncode() triggers some slave query (and thus transaction via DBO_TRX) which wait the whole time and fails in the commitAll() call.

A few things would help:
a) Disabling DBO_TRX at that point and restoring it later
b) Suppressing exceptions when trying to commit slave transactions if the connection is lost (the snapshot is flushed anyway, so there is nothing to do that didn't happen anyway)

Change 271940 had a related patch set uploaded (by Aaron Schulz):
Ignore noop DB transactions on connection loss as if there was none

https://gerrit.wikimedia.org/r/271940

Change 271940 merged by jenkins-bot:
Ignore noop DB transactions errors on connection loss

https://gerrit.wikimedia.org/r/271940

Poyekhali added a subscriber: Poyekhali.

@aaron so did it improve after your patch ?

aaron claimed this task.

Nothing in the last 7 days of logs.

Thank you aaron, great work !

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM