Page MenuHomePhabricator

Requesting an out of bounds page from a PDF results in the first page being rendered
Open, Needs TriagePublic

Description

Example page 1
Example page 5 (valid)
Example page 10 (invalid, actually renders page 1)

Thumbor debug output:

2023-03-20 12:18:19,343 ???? thumbor:DEBUG [SWIFT_LOADER] return_contents: /tmp/tmp_tyxelpc
2023-03-20 12:18:19,343 ???? thumbor:DEBUG [Proxy] load: '.pdf'
2023-03-20 12:18:19,343 ???? thumbor:DEBUG [Proxy] Looking for a pdf engine
2023-03-20 12:18:19,350 ???? thumbor:DEBUG [BWE] Found source file in context
2023-03-20 12:18:19,356 ???? thumbor:DEBUG [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '59', '/usr/bin/gs', '-sDEVICE=jpeg', '-dJPEG=90', '-sstdout=%stderr', '-sOutputFile=%stdout', '-dFirstPage=221', '-dLastPage=221', '-r150', '-dUseCropBox', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-q', '-f/tmp/tmp_tyxelpc']
2023-03-20 12:18:19,463 ???? thumbor:DEBUG [ShellRunner] Stdout: b''
2023-03-20 12:18:19,472 ???? thumbor:DEBUG [ShellRunner] Stderr: b'\nRequested FirstPage is greater than the number of pages in the file: 7\n   No pages will be processed (FirstPage > LastPage).\n'
2023-03-20 12:18:19,479 ???? thumbor:DEBUG [ShellRunner] Return code: 0
2023-03-20 12:18:19,485 ???? thumbor:DEBUG [ShellRunner] Duration: 95.238
2023-03-20 12:18:19,491 ???? thumbor:DEBUG [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '59', '/usr/bin/gs', '-sDEVICE=jpeg', '-dJPEG=90', '-sstdout=%stderr', '-sOutputFile=%stdout', '-dFirstPage=1', '-dLastPage=1', '-r150', '-dUseCropBox', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-q', '-f/tmp/tmp_tyxelpc']
2023-03-20 12:18:19,685 ???? thumbor:DEBUG [ShellRunner] Stdout: <too long to display (490094 bytes)>
2023-03-20 12:18:19,695 ???? thumbor:DEBUG [ShellRunner] Stderr: b''
2023-03-20 12:18:19,701 ???? thumbor:DEBUG [ShellRunner] Return code: 0
2023-03-20 12:18:19,707 ???? thumbor:DEBUG [ShellRunner] Duration: 183.35399999999998
2023-03-20 12:18:19,714 ???? thumbor:DEBUG [IM] Dumping buffer into temp file

It appears we do this deliberately, catching this specific error case. I'd like to revisit this behaviour as serving valid content for invalid URLs seems a bit broken to me. However, if this is designed for a specific historic case then we can drop it.

Event Timeline

It was added in {D306}, I couldn't find any documentation for why, https://wikitech.wikimedia.org/wiki/Thumbor/PDF and the code comments just says that it does. I'm trying to think of what cases could require this, and the only thing I can think of would be cached page numbers/URLs after a reupload that changes the page count. However, I don't think that serving the first page makes more sense than sending an error. Someone more familiar with ProofreadPage
might be able to thing of a situation where this makes sense.

We would still need to catch the error, the behavior would just need to be changed to raise an exception instead of rerendering.

It was added in {D306}, I couldn't find any documentation for why, https://wikitech.wikimedia.org/wiki/Thumbor/PDF and the code comments just says that it does. I'm trying to think of what cases could require this, and the only thing I can think of would be cached page numbers/URLs after a reupload that changes the page count. However, I don't think that serving the first page makes more sense than sending an error. Someone more familiar with ProofreadPage
might be able to thing of a situation where this makes sense.

We would still need to catch the error, the behavior would just need to be changed to raise an exception instead of rerendering.

In the case of a reupload users could just purge the caches although that is a bit arduous. In a world where we error the negative caching will be relatively brief as opposed to longterm caching on invalid-but-served incorrect pages. Happy to take instruction from anyone with interests in ProofreadPage, but generally I think a 404 is appropriate in this case as the client is requesting a resource that doesn't exist.