Page MenuHomePhabricator

Thumbor 500 while thumbnailing some webm files
Closed, ResolvedPublic

Description

Noticed this while looking at 500s, also visible with this query: https://logstash.wikimedia.org/goto/9da0da7323b4a0f168d1ee1eba36b478

All webm files I spot checked seem to play fine in the browser and we successfully transcoded them, though we're failing to generate their thumbs. e.g. this:

GET https://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Cape_Town_under_the_clouds.webm/220px--Cape_Town_under_the_clouds.webm.jpg

The original exists and is playable: https://upload.wikimedia.org/wikipedia/commons/7/73/Cape_Town_under_the_clouds.webm

None of its thumbnails exist in swift:

root@ms-fe1005:~# swift list wikipedia-commons-local-thumb.73 --prefix '7/73/Cape_Town_under_the_clou'
root@ms-fe1005:~#

And thumbor errors out with an exiftool error (logstash id AWC2LpIjyCJQER2bKpce)

	[ExiftoolRunner] error: 'Error: Unknown file type /srv/thumbor/tmp/thumbor@8827/tmpjeSV2T\n'

Though downloading the file manually and running exiftool succeeds:

thumbor1003:~$ curl -Os https://upload.wikimedia.org/wikipedia/commons/7/73/Cape_Town_under_the_clouds.webm
thumbor1003:~$ exiftool Cape_Town_under_the_clouds.webm 
ExifTool Version Number         : 9.74
File Name                       : Cape_Town_under_the_clouds.webm
Directory                       : .
File Size                       : 9.2 MB
File Modification Date/Time     : 2018:01:02 10:54:05+00:00
File Access Date/Time           : 2018:01:02 10:40:49+00:00
File Inode Change Date/Time     : 2018:01:02 10:54:05+00:00
File Permissions                : rw-r--r--
File Type                       : WEBM
MIME Type                       : video/webm
EBML Version                    : 1
EBML Read Version               : 1
Doc Type                        : webm
Doc Type Version                : 4
Doc Type Read Version           : 2
Timecode Scale                  : 1 ms
Muxing App                      : Lavf55.26.100
Writing App                     : Lavf55.26.100
Duration                        : 0:11:45
Codec ID                        : A_VORBIS
Audio Channels                  : 2
Audio Sample Rate               : 44100
Track Number                    : 2
Track Language                  : und
Audio Codec ID                  : V_VP8
Track Type                      : Video
Video Frame Rate                : 29.97
Image Width                     : 2704
Image Height                    : 1524
Display Width                   : 2704
Display Height                  : 1524
Image Size                      : 2704x1524
thumbor1003:~$

Revisions and Commits

rTHMBREXT Thumbor Plugins
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision

Event Timeline

Cape_Town_under_the_clouds.webm is playable, but for me it skips straight to 11:12 when I open it in Chrome. Which probably suggests something unusual about the file or its keyframes. On VLC it plays for 33 seconds, instead of the announced 11+ minutes duration. Which is consistent with the approximate amount of time it plays in Chrome. This file looks corrupt.

I don't think that exiftool is running on the video, it's probably running (needlessly) on the extracted frame. I imagine no frame, or only a corrupted frame, could be extracted by a ffmpeg, which snowballs into the exiftool error. It's possible that there's some error handling improvements to be made to get a more meaningful error related to the actual problem.

Testing this on thumbor1001, seeking for the midpoint with ffmpeg fails silently. The status code is 0, but no screenshot is generated. If the fallback of seeking to 0 kicked in, it would work.

Currently the video loader trusts the ffmpeg status code to determine whether or not the screenshot rendering worked (both in the case of the fallback and for final result status). I think this shows that it shouldn't be trusted, and that we should check for the existence of the requested screenshot file instead.

Gilles triaged this task as Medium priority.Jan 2 2018, 2:51 PM
Gilles added a revision: Restricted Differential Revision.Jan 3 2018, 12:00 PM

This currently fixes some of the files showing up in logstash, but not all.

Change 401725 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.8

https://gerrit.wikimedia.org/r/401725

Change 401725 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.8

https://gerrit.wikimedia.org/r/401725

Mentioned in SAL (#wikimedia-operations) [2018-01-03T14:39:27Z] <godog> rollout python-thumbor-wikimedia 1.8 - T183907

Mentioned in SAL (#wikimedia-operations) [2018-01-03T15:01:07Z] <godog> roll-restart thumbor in eqiad after upgrade - T183907

The percent character without being an encoded character code seems like the likely culprit for most of them.

Actually, from your last comment, the 1st and 3rd one come out as 400s for me right now, and the 2nd one works. It's possible that the ones with % in them shouldn't be 400s, but I wonder why they showed up in the Varnish logs as 500s.

Gilles added a revision: Restricted Differential Revision.Jan 8 2018, 1:58 PM

Change 402906 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.10

https://gerrit.wikimedia.org/r/402906

Change 402906 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.10

https://gerrit.wikimedia.org/r/402906

Doesn't seem like it fixed the issue with the percentage character in production. Oh well, back to the drawing board...

Looking for one of those particular files in the error log, it 500s without any other error log entry beforehand (the previous message is unrelated):

Jan 24 12:04:36 thumbor1004 thumbor@8831[11770]: 2018-01-24 12:04:36,111 8831 thumbor:ERROR Error reading image metadata: Failed to read image data
Jan 24 12:04:44 thumbor1004 thumbor@8831[11770]: 2018-01-24 12:04:44,232 8831 tornado.access:ERROR 500 GET /wikipedia/commons/thumb/c/ce/Dramatische_daling_insectenpopulatie%2C_75%25_is_verdwenen.webm/800px--Dramatische_daling_insectenpopulatie%2C_75%25_is_verdwenen.webm.jpg (10.64.32.221) 187.71ms

I'll check if there's insufficient logging of error conditions/caught exceptions in the video loader.

Gilles added a revision: Restricted Differential Revision.Jan 24 2018, 1:04 PM

Change 407412 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.11

https://gerrit.wikimedia.org/r/407412

Change 407412 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.11

https://gerrit.wikimedia.org/r/407412

Issue with percentage seems to be gone. I'll double check that archived files work fine, as the latest instance is one like that. Could be the container rehashing of archived files not working for videos.

Gilles lowered the priority of this task from Medium to Low.Feb 5 2018, 1:43 PM

Looking at the remaining hits, Wikimaps_Nordic_Kickoff.webm is a legitimate failure. The error message could be better, but that file is unplayable in Chrome and VLC, and can't be transcoded.

Then the rest is a bunch of temp files, including chunked uploads.

Ah, the temp files are a permissions problem. The video loader tries to access those via https (not swift) to leverage the smart loading of video, and those files required authed access.

I'm going to create a specific task for that and close this one, since we no longer have other bugs that can be found looking at those 500s for videos.