Page MenuHomePhabricator

Thumbor 500 while thumbnailing some webm files
Closed, ResolvedPublic

Description

Noticed this while looking at 500s, also visible with this query: https://logstash.wikimedia.org/goto/9da0da7323b4a0f168d1ee1eba36b478

All webm files I spot checked seem to play fine in the browser and we successfully transcoded them, though we're failing to generate their thumbs. e.g. this:

GET https://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Cape_Town_under_the_clouds.webm/220px--Cape_Town_under_the_clouds.webm.jpg

The original exists and is playable: https://upload.wikimedia.org/wikipedia/commons/7/73/Cape_Town_under_the_clouds.webm

None of its thumbnails exist in swift:

root@ms-fe1005:~# swift list wikipedia-commons-local-thumb.73 --prefix '7/73/Cape_Town_under_the_clou'
root@ms-fe1005:~#

And thumbor errors out with an exiftool error (logstash id AWC2LpIjyCJQER2bKpce)

	[ExiftoolRunner] error: 'Error: Unknown file type /srv/thumbor/tmp/thumbor@8827/tmpjeSV2T\n'

Though downloading the file manually and running exiftool succeeds:

thumbor1003:~$ curl -Os https://upload.wikimedia.org/wikipedia/commons/7/73/Cape_Town_under_the_clouds.webm
thumbor1003:~$ exiftool Cape_Town_under_the_clouds.webm 
ExifTool Version Number         : 9.74
File Name                       : Cape_Town_under_the_clouds.webm
Directory                       : .
File Size                       : 9.2 MB
File Modification Date/Time     : 2018:01:02 10:54:05+00:00
File Access Date/Time           : 2018:01:02 10:40:49+00:00
File Inode Change Date/Time     : 2018:01:02 10:54:05+00:00
File Permissions                : rw-r--r--
File Type                       : WEBM
MIME Type                       : video/webm
EBML Version                    : 1
EBML Read Version               : 1
Doc Type                        : webm
Doc Type Version                : 4
Doc Type Read Version           : 2
Timecode Scale                  : 1 ms
Muxing App                      : Lavf55.26.100
Writing App                     : Lavf55.26.100
Duration                        : 0:11:45
Codec ID                        : A_VORBIS
Audio Channels                  : 2
Audio Sample Rate               : 44100
Track Number                    : 2
Track Language                  : und
Audio Codec ID                  : V_VP8
Track Type                      : Video
Video Frame Rate                : 29.97
Image Width                     : 2704
Image Height                    : 1524
Display Width                   : 2704
Display Height                  : 1524
Image Size                      : 2704x1524
thumbor1003:~$

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 2 2018, 10:54 AM
Gilles claimed this task.Jan 2 2018, 11:04 AM
Gilles added a project: Performance-Team.
Gilles added a comment.EditedJan 2 2018, 2:40 PM

Cape_Town_under_the_clouds.webm is playable, but for me it skips straight to 11:12 when I open it in Chrome. Which probably suggests something unusual about the file or its keyframes. On VLC it plays for 33 seconds, instead of the announced 11+ minutes duration. Which is consistent with the approximate amount of time it plays in Chrome. This file looks corrupt.

I don't think that exiftool is running on the video, it's probably running (needlessly) on the extracted frame. I imagine no frame, or only a corrupted frame, could be extracted by a ffmpeg, which snowballs into the exiftool error. It's possible that there's some error handling improvements to be made to get a more meaningful error related to the actual problem.

Gilles added a comment.Jan 2 2018, 2:50 PM

Testing this on thumbor1001, seeking for the midpoint with ffmpeg fails silently. The status code is 0, but no screenshot is generated. If the fallback of seeking to 0 kicked in, it would work.

Currently the video loader trusts the ffmpeg status code to determine whether or not the screenshot rendering worked (both in the case of the fallback and for final result status). I think this shows that it shouldn't be trusted, and that we should check for the existence of the requested screenshot file instead.

Gilles triaged this task as Normal priority.Jan 2 2018, 2:51 PM

This currently fixes some of the files showing up in logstash, but not all.

Change 401725 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.8

https://gerrit.wikimedia.org/r/401725

Change 401725 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.8

https://gerrit.wikimedia.org/r/401725

Mentioned in SAL (#wikimedia-operations) [2018-01-03T14:39:27Z] <godog> rollout python-thumbor-wikimedia 1.8 - T183907

Mentioned in SAL (#wikimedia-operations) [2018-01-03T15:01:07Z] <godog> roll-restart thumbor in eqiad after upgrade - T183907

Gilles added a comment.Jan 4 2018, 5:32 PM

The percent character without being an encoded character code seems like the likely culprit for most of them.

Gilles added a comment.Jan 8 2018, 9:37 AM

Actually, from your last comment, the 1st and 3rd one come out as 400s for me right now, and the 2nd one works. It's possible that the ones with % in them shouldn't be 400s, but I wonder why they showed up in the Varnish logs as 500s.

Change 402906 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.10

https://gerrit.wikimedia.org/r/402906

Gilles moved this task from Inbox to Doing on the Performance-Team board.Jan 8 2018, 9:14 PM

Change 402906 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.10

https://gerrit.wikimedia.org/r/402906

Mentioned in SAL (#wikimedia-operations) [2018-01-09T14:36:03Z] <godog> upgrade and roll-restart thumbor in codfw/eqiad - T182656 T183907 T169144

Rollout/upgrade complete

Gilles added a comment.Jan 9 2018, 6:37 PM

Doesn't seem like it fixed the issue with the percentage character in production. Oh well, back to the drawing board...

Looking for one of those particular files in the error log, it 500s without any other error log entry beforehand (the previous message is unrelated):

Jan 24 12:04:36 thumbor1004 thumbor@8831[11770]: 2018-01-24 12:04:36,111 8831 thumbor:ERROR Error reading image metadata: Failed to read image data
Jan 24 12:04:44 thumbor1004 thumbor@8831[11770]: 2018-01-24 12:04:44,232 8831 tornado.access:ERROR 500 GET /wikipedia/commons/thumb/c/ce/Dramatische_daling_insectenpopulatie%2C_75%25_is_verdwenen.webm/800px--Dramatische_daling_insectenpopulatie%2C_75%25_is_verdwenen.webm.jpg (10.64.32.221) 187.71ms

I'll check if there's insufficient logging of error conditions/caught exceptions in the video loader.

Change 407412 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.11

https://gerrit.wikimedia.org/r/407412

Change 407412 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.11

https://gerrit.wikimedia.org/r/407412

Issue with percentage seems to be gone. I'll double check that archived files work fine, as the latest instance is one like that. Could be the container rehashing of archived files not working for videos.

Gilles lowered the priority of this task from Normal to Low.Feb 5 2018, 1:43 PM
Gilles added a comment.Feb 5 2018, 2:58 PM

Looking at the remaining hits, Wikimaps_Nordic_Kickoff.webm is a legitimate failure. The error message could be better, but that file is unplayable in Chrome and VLC, and can't be transcoded.

Gilles added a comment.Feb 5 2018, 3:02 PM

Then the rest is a bunch of temp files, including chunked uploads.

Gilles closed this task as Resolved.Feb 5 2018, 3:06 PM

Ah, the temp files are a permissions problem. The video loader tries to access those via https (not swift) to leverage the smart loading of video, and those files required authed access.

I'm going to create a specific task for that and close this one, since we no longer have other bugs that can be found looking at those 500s for videos.