pdf files quite often in eqiad
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Joe
	May 28 2023, 12:20 PM

Description

As you can see here https://grafana.wikimedia.org/d/FMKakEyVz/hnowlan-thumbor-k8s?orgId=1&viewPanel=61&from=now-7d&to=now

the rate of 429s emitted by thumbor in eqiad rose by about one order of magnitude on may 24th.

This seems to be mostly localized to rendering of multipage documents and in particular djvu files.

The impact of this bug is that work on wikisource is heavily disrupted:

https://en.wikisource.org/wiki/Wikisource:Scriptorium#ocrtoy-no-text
https://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#Changes_in_the_past_week,_affect_the_Page_namespace.

I have verified that the 429s are emitted by thumbor itself and that thumbor reports rate-limiting by poolcounter. I don't have enough confidence with thumbor on k8s to pinpoint what the problem is, though.

Details

Subject	Repo	Branch	Lines +/-
thumbor: disable expensive throttling	operations/deployment-charts	master	+5 -4
jobqueue: reduce ThumbnailRender concurrency	operations/deployment-charts	master	+1 -1
jobqueue, thumbor: attempt to limit impact of thumbnailrender job	operations/deployment-charts	master	+2 -2
thumbor: include workers for djvu files, limit thumbnailrender concurrency	operations/deployment-charts	master	+2 -2
file: Make pre-gen rendering of multi-page files (pdf, ...) serial	mediawiki/core	wmf/1.41.0-wmf.13	+37 -12
file: Make pre-gen rendering of multi-page files (pdf, ...) serial	mediawiki/core	master	+37 -12
images: log key limited by poolcounter	operations/software/thumbor-plugins	master	+1 -1
handler.images: remove async from poolcounter release	operations/software/thumbor-plugins	master	+3 -3
thumbor: include poolcounter.failure metric	operations/deployment-charts	master	+3 -1
thumbor: split expensive format poolcounter buckets	operations/deployment-charts	master	+1 -1
poolcounter: use per-format throttling key	operations/software/thumbor-plugins	master	+1 -1
Thumbor: deploy various poolcounter fixes	operations/deployment-charts	master	+1 -1
Also add Poolcounter.release() to on_finish	operations/software/thumbor-plugins	master	+1 -0
Poolcounter.release: don't reconnect if the stream is lost	operations/software/thumbor-plugins	master	+2 -3
poolcounter: Make it release before closing connection	operations/software/thumbor-plugins	master	+1 -0
thumbor: allow changing poolcounter's release timeout	operations/deployment-charts	master	+9 -2
thumbor: add more expensive workers	operations/deployment-charts	master	+5 -0
thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable	operations/deployment-charts	master	+9 -5
thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable	operations/deployment-charts	master	+9 -5

Related Objects
Search...

Status	Assigned	Task
Resolved	hnowlan	T337649 Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad
Open	Joe	T338297 Revisit thumbor's poolcounter integration
Open	None	T339863 Thumbor's use of poolcounter is rate limiting Kubernetes IPs
Open	None	T376828 Thumbor's use of the `expensive` poolcounter queue can break rendering formats

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Quiddity mentioned this in T338765: Image 429 errors for most images on private wikis.Jun 11 2023, 8:40 PM

akosiaris subscribed.Jun 12 2023, 4:53 PM

Change 929392 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] Thumbor: deploy various poolcounter fixes

https://gerrit.wikimedia.org/r/929392

gerritbot added a project: Patch-For-Review.Jun 12 2023, 5:00 PM

Change 929394 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] poolcounter: use per-format throttling key

https://gerrit.wikimedia.org/r/929394

Change 929392 merged by jenkins-bot:

[operations/deployment-charts@master] Thumbor: deploy various poolcounter fixes

https://gerrit.wikimedia.org/r/929392

Change 929394 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] poolcounter: use per-format throttling key

https://gerrit.wikimedia.org/r/929394

Maintenance_bot removed a project: Patch-For-Review.Jun 13 2023, 10:10 AM

Change 929676 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: split expensive format poolcounter buckets

https://gerrit.wikimedia.org/r/929676

gerritbot added a project: Patch-For-Review.Jun 13 2023, 11:05 AM

Change 930001 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] handler.images: await poolcounter release

https://gerrit.wikimedia.org/r/930001

Change 930158 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: include poolcounter.failure metric

https://gerrit.wikimedia.org/r/930158

Change 929676 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: split expensive format poolcounter buckets

https://gerrit.wikimedia.org/r/929676

Change 930158 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: include poolcounter.failure metric

https://gerrit.wikimedia.org/r/930158

Change 930001 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] handler.images: remove async from poolcounter release

https://gerrit.wikimedia.org/r/930001

Maintenance_bot removed a project: Patch-For-Review.Jun 15 2023, 10:10 AM

Change 930664 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] images: log key limited by poolcounter

https://gerrit.wikimedia.org/r/930664

gerritbot added a project: Patch-For-Review.Jun 15 2023, 4:28 PM

Change 930664 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] images: log key limited by poolcounter

https://gerrit.wikimedia.org/r/930664

Maintenance_bot removed a project: Patch-For-Review.Jun 16 2023, 10:30 AM

A few of the fixes recently have fixed some behaviours but there are still some pretty chronic behaviours present (and also some worrying side-effects from poolcounter rate limit behaviour that indicates something much more broken than just maintaining locks for a long time).

However, something notable happened at around 6AM UTC today (or perhaps a bit earlier with a knockon effect that had a long tail in thumbor). For the first time in a week, poolcounter times dropped to what would be a more normal range, rate limit responses dropped quite a bit into what would be a more normal range and times spent throttled also dropped precipitously. Does anyone tagged on this ticket know of any kind of batches or large jobs (OCR, translation etc) that might have triggered this behaviour?

Poolcounter drops:

ghostscript QPS and 429 rate dropping:

Further update: there was a sustained and unprecedentedly long-running period of activity for the ThumbnailRender job:

Not sure what triggered this, I'd be very curious to see what kicked it off. I'm guessing this was mostly PDFs

oh I actually might know what's happening. When someone uploads a multi-page file, like a pdf or djvu, then to produce the pregen sizes, mw creates a job for each size and each page (e.g. if we have five pregen sizes and a 300 page pdf, you get 1500 jobs) and pushes all of them at the same time to the job queue which I assume they start around the same time... Thankfully it has the limit of up to first 50 pages but that's still 4*50 jobs at the same time. I think I can put out a solution to this. Basically make it similar to refreshlinks. Give me a minute.

Change 930884 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] file: Make pre-gen rendering of multi-page files (pdf, ...) serial

https://gerrit.wikimedia.org/r/930884

gerritbot added a project: Patch-For-Review.Jun 16 2023, 10:29 PM

Reviews of this would be appreciated ^

In T337649#8939812, @Ladsgroup wrote:

When someone uploads a multi-page file, like a pdf or djvu, then to produce the pregen sizes, mw creates a job for each size and each page (e.g. if we have five pregen sizes and a 300 page pdf, you get 1500 jobs) and pushes all of them at the same time to the job queue which I assume they start around the same time... Thankfully it has the limit of up to first 50 pages but that's still 4*50 jobs at the same time. I think I can put out a solution to this. Basically make it similar to refreshlinks. Give me a minute.

Bigger but relevant issue: does it make sense in the first place to pregen the "thumb"-sized thumb sizes for PDF/DjVu files? Fæ alone uploaded north of a million of these two years ago as a backup of IA (worried they'd get sued into oblivion), most of which are never going to get used for anything. Of the fraction of extant PDF/DjVu files, the usage outside of Wikisourcen is limited to a few (mainly Wikipedias) showing a single page (the title page, or rarely a plate or illustration). I don't have hard data but I imagine the fraction of those pregen thumbs that ever gets used is infinitesimal and hard to justify expending the resources on, especially if we run into issues such as the current one.

The place where page images actually are needed is on Wikisource, because they access all the pages sequentially (and it's in the hot path of the workflow), but that doesn't use the standard pregen sizes. The access pattern, coarsly, is also going to be that the page images are accessed when proofreading (transcribing) a book, but once that's done it's going to be years before someone needs it. The brute-force strategy is pre-gen'ing all the page images at PRP-usable sizes (not the thumb sizes); the smart strategy would be to not kick off that job until someone first accesses a page from the DjVu/PDF.

I put this out there because there's a need for highly tuning the specific subset of relevance to Wikisource, but right now the strategy is optimized for a use case that is uncommon and that it is very hard to actually optimize for (for example, plates and illustrations are no more likely to be in the first 50 pages than in the subsequent several hundred).

(PS. a more ambitious, but better long term, approach would probably be to extend the stack such that you could upload raw .jp2 images into a magic "Category:"—or other virtual container—that tied them together without having to physically collect them inside a PDF/DjVu container. Even .zip would be a better container for this. Combined with a derived slot where we could store (and access) the OCR text layer this would be a much more elegant solution. Thumb generation would then no longer have to shell out to ghostscript/ddjvu, and apart from needing PRP-usable "thumb" sizes, you could treat this use case the same way as every other single-image use case. It would also have several other technical and functional benefits. Please hit me up if anybody has even the slightest interest in exploring that idea!)

I would suggest keeping ladsgroups patch, but limiting the max to 10, instead of 50.

In books, there are often empty pages at the start of an book, just after the title. When a user is uploading a PDF/DjVu for wikisource, it is common for them to create those pages, in the "Page" namespace, just after creation. Wikisource users do review OCR and move to the next page, going through a empty page naturally takes shorter amount of time. Wikisource users do work in sequence, so the old system of trying to get 50 thumbs at once is overkill. Admittedly, edit-in-sequence at T308098 does make this particular pre-gen a lot less important than it is, but then again, that feature has not rolled out.

I agree with Xovers part of explaining how PDFs/DjVus are used outside of Wikisource.

In T309114#7958592, @Krinkle wrote:

This issue was introduced in MW 1.37 as part of change 698367 (ref T284416) which added support for pre-rendering the thumbnail for multi-page files, which was previously not functional (not even the first page, presumably) .

Completely anecdotally, subjectively, and dependent on my hazy memory: that time frame can coincide with generally worse performance of loading page images on Wikisource. If it is accurate that prior to change 698367 no thumbs were pregenerated for multipage files, it is entirely possible that with the current implementation this pregen'ing is actually making performance worse and triggering (making visible) all these other problems. If that's the case then MAX_PAGE_RENDER_JOBS is just a bandaid that hides the underlying problem.

So dumb question time: do we have the logs to, and is it feasible in practice to analyse them to, get some hard data on how many of those pregen'ed standard thumbs are actually accessed and used for anything? Can we tell with any kind of objective metric whether the current strategy for it is actually doing anything beneficial?

Oh, also, the linked fork of PdfHandler claims performance gains of 10x by using pdftocairo instead of ghostscript+imagemagick. At the scale we're talking here, 10x improved performance for each thumb generated—iff the pregen is found beneficial in the first place—may be a worthwhile investment. It'd dramatically reduce the pressure for the pregen jobs, and improve interactive performance whenever a non-cached thumb is requested (which, as mentioned, seems to be most of them for the Wikisource use case right now). That it also has much better font rasterization is pure bonus.

Change 930884 merged by jenkins-bot:

[mediawiki/core@master] file: Make pre-gen rendering of multi-page files (pdf, ...) serial

https://gerrit.wikimedia.org/r/930884

Maintenance_bot removed a project: Patch-For-Review.Jun 18 2023, 2:10 AM

ReleaseTaggerBot added a project: MW-1.41-notes (1.41.0-wmf.15; 2023-06-27).Jun 18 2023, 3:00 AM

Looked at my browsers networking tab, loaded the same wikipage in "Page" namespace on wikisource and switched my resolution around, with these results:

Resolution	Image fetched	Image for next+prev page
1920x1080	1535px	1024px
1600x900	1024px	1024px
1280x720	1024px	1024px

1024px is not pre-generated as of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/912837 but it probaby does not need to be, since next and prev page thumbnails are fetched anyway.

Tried turning prefetch off in my browser, did not affect the next+prev page loading, seems to be an WMF thing.

It would be helpful though to pre-gen the first page in 1024px, and then visiting the first page generates page 2, and so on and so forth. It would be an PDF/DjVu only thing, 1024px is seldom used outside of that.

In T337649#8944872, @Xover wrote:

Oh, also, the linked fork of PdfHandler claims performance gains of 10x by using pdftocairo instead of ghostscript+imagemagick. At the scale we're talking here, 10x improved performance for each thumb generated—iff the pregen is found beneficial in the first place—may be a worthwhile investment. It'd dramatically reduce the pressure for the pregen jobs, and improve interactive performance whenever a non-cached thumb is requested (which, as mentioned, seems to be most of them for the Wikisource use case right now). That it also has much better font rasterization is pure bonus.

If this is true then this could clearly be a huge benefit for us - I've created T339845 to investigate this. We might need to do a bit of our own work to compare engines as lots of the comparisons and claims I can see about various engines are fairly old, but this is definitely work worth considering. Thanks for pointing it out!

Change 931073 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.41.0-wmf.13] file: Make pre-gen rendering of multi-page files (pdf, ...) serial

https://gerrit.wikimedia.org/r/931073

gerritbot added a project: Patch-For-Review.Jun 19 2023, 1:04 PM

Change 931073 merged by jenkins-bot:

[mediawiki/core@wmf/1.41.0-wmf.13] file: Make pre-gen rendering of multi-page files (pdf, ...) serial

https://gerrit.wikimedia.org/r/931073

Mentioned in SAL (#wikimedia-operations) [2023-06-19T14:19:24Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:931073|file: Make pre-gen rendering of multi-page files (pdf, ...) serial (T337649)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-19T14:20:48Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:931073|file: Make pre-gen rendering of multi-page files (pdf, ...) serial (T337649)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet

Maintenance_bot removed a project: Patch-For-Review.Jun 19 2023, 2:30 PM

Mentioned in SAL (#wikimedia-operations) [2023-06-19T14:39:32Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:931073|file: Make pre-gen rendering of multi-page files (pdf, ...) serial (T337649)]] (duration: 20m 07s)

ReleaseTaggerBot edited projects, added MW-1.41-notes (1.41.0-wmf.13; 2023-06-13); removed MW-1.41-notes (1.41.0-wmf.15; 2023-06-27).Jun 19 2023, 3:00 PM

hnowlan mentioned this in T339863: Thumbor's use of poolcounter is rate limiting Kubernetes IPs.Jun 19 2023, 3:43 PM

Thanks to Amir's change and a change in jobqueue's concurrency for Thumbnail generation I think much of the rate limiting people have been seeing should have subsided. We've surfaced multiple other issues that need to be resolved or have already been resolved while doing this though. If anyone is still seeing the specific original issue regularly please mention in this ticket

Snaevar added a comment.Jun 22 2023, 5:18 PM

This comment was removed by Snaevar.

In T337649#8939812, @Ladsgroup wrote:

oh I actually might know what's happening. When someone uploads a multi-page file, like a pdf or djvu, then to produce the pregen sizes, mw creates a job for each size and each page (e.g. if we have five pregen sizes and a 300 page pdf, you get 1500 jobs) and pushes all of them at the same time to the job queue which I assume they start around the same time... Thankfully it has the limit of up to first 50 pages but that's still 4*50 jobs at the same time. I think I can put out a solution to this. Basically make it similar to refreshlinks. Give me a minute.

I had requested that we would pre-generate images just for the first N pages in the past, and IIRC it was implemented at least for PDFs.

Yeah yeah, For the first 50 pages. The problem was that it was queuing all pages in all pre-gen sizes all at once.

Xover mentioned this in T341916: Some images not being downloaded by Tesseract (HTTP 503).Jul 15 2023, 9:50 AM

Xover mentioned this in T341918: Pregenerate Wikisource-useful thumbnail sizes for multi-page media.Jul 15 2023, 10:35 AM

In T337649#8984769, @Ladsgroup wrote:

Yeah yeah, For the first 50 pages. The problem was that it was queuing all pages in all pre-gen sizes all at once.

Since this task is more narrowly focussed on keeping the stack from melting, I've opened T341918 as a separate task. But I will note here that the essence of it is that the static 50-pages limit is an extremely blunt instrument that is not serving the Wikisource community well, but does point out an opportunity for a significant optimization in their core workflow.

Over the last few days (early Sunday UTC was when I consciously noted it, but that's unlikely to be very accurate timing) I'm seeing variable but worsening image load times and increasing number of 429 responses. In a quick test just now, two out of three thumbs loaded from the same DjVu returned 429 (example in case you want to look for it in logs).

Clement_Goubert moved this task from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.Aug 1 2023, 11:54 AM

And the current status is that it takes 5–10 reloads, with a timeout in the 10–15s range, to load each page. That is, an average total effective thumbnail render time of 1–2 minutes. From my perspective "Commons is down".

429 response codes in codfw seems to have started jumping some time during July 31; briefly fell back during August 2nd, only to go through the roof again and stay there later that day.

Screenshot 2023-08-03 at 10.25.30.png (514×1 px, 275 KB)

NOTE: These graphs have a logarithmic vertical axis: 429 response codes have jumped from an average of around 0.5 per second to around 20 per second, an increase by a factor of 40. We are now on average seeing equal or higher number of 429 responses as normal 200 responses. If we factor in the normal noise floor of 404s that graphs shows twice as many error response codes as normal 200 response codes.

This comports with poolcounter times in codfw that hit the roof, briefly fell back to normal, and then was pegged to the 5s timeout in the same pattern.

Screenshot 2023-08-03 at 10.26.09.png (594×1 px, 459 KB)

The HTTP response code rate does not follow the same pattern, but as of yesterday the error responses have increased by a factor of 3–4.

Screenshot 2023-08-03 at 10.27.05.png (594×1 px, 142 KB)

This is probably also related to why HAProxy current sessions has more than doubled.

Screenshot 2023-08-03 at 10.27.24.png (590×1 px, 418 KB)

Looking back as far as the Thumbor dashboard has data (April-ish) this does not obviously appear to be a normal seasonal pattern, and both HTTP response code rate of 4xx codes and HAProxy current sessions are currently the highest they have been for as long as we have measurements.

Screenshot 2023-08-03 at 10.45.29.png (594×1 px, 155 KB)

Screenshot 2023-08-03 at 10.45.47.png (590×1 px, 306 KB)

Change 945553 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: include workers for djvu files, limit thumbnailrender concurrency

https://gerrit.wikimedia.org/r/945553

gerritbot added a project: Patch-For-Review.Aug 3 2023, 10:04 AM

Fabfur unsubscribed.Aug 3 2023, 10:05 AM

In T337649#9065655, @Xover wrote:

429 response codes in codfw seems to have started jumping some time during July 31; briefly fell back during August 2nd, only to go through the roof again and stay there later that day.

This comports with poolcounter times in codfw that hit the roof, briefly fell back to normal, and then was pegged to the 5s timeout in the same pattern.

The Thumbor logs indicate that at least some throttling in codfw is legitimate since 02:00 on August 2nd. However, the issues reported before that are unrelated and I'm attempting to look into that now - it appears your rate limiting is being caused by eqiad. There is some overlap from a thumbnailing job that is using up workers for expensive formats, I'm going to reduce concurrency there and also increase the threshold for workers on expensive formats to help remedy the issues you're seeing.

Change 945553 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: include workers for djvu files, limit thumbnailrender concurrency

https://gerrit.wikimedia.org/r/945553

Maintenance_bot removed a project: Patch-For-Review.Aug 3 2023, 10:10 AM

In T337649#9065813, @hnowlan wrote:

I'm going to reduce concurrency there and also increase the threshold for workers on expensive formats to help remedy the issues you're seeing.

Current user experience is "pretty slow" rather than "broken". Page images take somewhere north of 5s to load, but in my relatively small sample size the images do all load on the first try.

“So, Xover, how do you contribute to the Wikimedia projects?”

“Well, mostly I wait around while Thumbor stares at me blankly and mumbles ‘429? 429?’.”

“That doesn't sound like a very productive way to contribute?”

“Well, it's not, but you have to understand… Look, it's like the Washington press corps, alright?”

“I don't follow?”

“It's like journalists in Washington sitting around watching an aging politician zone way out on the podium. They're not getting much done beside waiting, and eventually the aides will manage to recall him to the present so they can ask some questions and do some writing. It's horribly inefficient and a waste of everyone's time, but we still need a free and independent press.”

“So you're comparing the importance of your contributions to a fundamental tenet of a free society?”

“No, no. I'm just saying it's about as much use waiting for Thumbor to fix itself as hoping dementia will magically go away...”

Oh, this is Phabricator? I thought it was the submission form for my creative writing class. My bad… 😎

Change 956370 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] jobqueue: limit thumbnailrender job concurrency further

https://gerrit.wikimedia.org/r/956370

gerritbot added a project: Patch-For-Review.Sep 11 2023, 9:33 AM

Hi @Xover,

I can sense the frustration pretty clearly and I appreciate the effort to illustrate it via this story, avoiding lashing out. As a data point, users aren't the only ones frustrated with the situation. Engineers (developers, software engineers, SREs) are frustrated (and have been for a long time) too.

Looking at the larger picture, I think it's safe for me to say that we are admittedly and unfortunately, as a movement, paying the price of years of neglect and under investment in this critical part of the infrastructure. This neglect wasn't limited to just Thumbor, but as a component, Thumbor was affected more. There are ongoing efforts to address the general problem. Perhaps moving slower that we 'd like but at least moving. Thumbor has finally been upgraded to a newer version, it's been moved to a new platform and in the process a variety of small bugs and problems here and there have been fixed. Engineers are participating in this exact task, trying to somehow fix the reported problems. There's obviously still a long road ahead, but there is some momentum.

Why do I say all that? Because I fear that venting frustration, in a forum of technical people (Phabricator), that are already all well aware of the frustration, never mind sharing it, will do more harm than good by causing people to become disengaged.

In your shoes, I would now obviously ask, "if not here, where?". I can only offer suggestions unfortunately. Reaching out to management via email would be my go to.

In T337649#9162778, @akosiaris wrote:

I can sense the frustration pretty clearly […]

Oh, if that came across as venting frustration I have to apologize. It was an attempt at a funny way to communicate 1) that Thumbor needs kicking again (as it has needed periodically since the opening of this task), and 2) that tweaking the parameters for parallelism, queue size, and timeouts is probably not going to be sufficient since it keeps recurring (iow, we'll be back here in a couple of weeks, wasting Hnowlan's time with yet another temporary workaround).

Change 956370 merged by jenkins-bot:

[operations/deployment-charts@master] jobqueue, thumbor: attempt to limit impact of thumbnailrender job

https://gerrit.wikimedia.org/r/956370

In T337649#9162878, @Xover wrote:

In T337649#9162778, @akosiaris wrote:

I can sense the frustration pretty clearly […]

Oh, if that came across as venting frustration I have to apologize. It was an attempt at a funny way to communicate 1) that Thumbor needs kicking again (as it has needed periodically since the opening of this task), and 2) that tweaking the parameters for parallelism, queue size, and timeouts is probably not going to be sufficient since it keeps recurring (iow, we'll be back here in a couple of weeks, wasting Hnowlan's time with yet another temporary workaround).

Apologies then for misinterpreting your intent. Thank you for clarifying it.

Snaevar unsubscribed.Sep 13 2023, 2:02 PM

MatthewVernon mentioned this in T291137: The file "XXX" is in an inconsistent state within the internal storage backends.Oct 10 2023, 8:05 AM

Samwilson mentioned this in T307787: Error: 429, Too Many Requests while creating thumbnail.Dec 4 2023, 5:15 AM

Another update by a layman user: not only the page previews for PDF indices on Wikisource have been completely down for the past few days, but the bug also starts affecting page previews on Commons. Some of my latest uploads are completely unable to be previewed, yet everything is completely normal when I click into the involving PDF file. Not only that, OCR is also available to the affected pages.

Pursuant to this advice on the Commons Help Desk thread, I'm posting my observations here.

Change 979942 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] jobqueue: reduce ThumbnailRender concurrency

https://gerrit.wikimedia.org/r/979942

Change 979942 merged by jenkins-bot:

[operations/deployment-charts@master] jobqueue: reduce ThumbnailRender concurrency

https://gerrit.wikimedia.org/r/979942

We recently re-raised concurrency on the thumbnailrender job as it appeared backlogs had dropped causing initial issues. This had a knock-on effect related to T338297 which caused erroneous rate limits to be applied. This has hopefully reduced somewhat, please let us know.

Samwilson merged a task: T307787: Error: 429, Too Many Requests while creating thumbnail.Dec 5 2023, 2:22 AM

In T337649#9378880, @hnowlan wrote:

We recently re-raised concurrency on the thumbnailrender job as it appeared backlogs had dropped causing initial issues. This had a knock-on effect related to T338297 which caused erroneous rate limits to be applied. This has hopefully reduced somewhat, please let us know.

The current situation is so bad such that the pages on Wikisource won't even properly display the corresponding pages (see this page, for example). As mentioned before, this time the OCR is surprisingly functional. Can someone explain why this problem and the previous OCR service down bug are interrelated?

Change 980413 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: disable expensive throttling

https://gerrit.wikimedia.org/r/980413

Change 980413 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: disable expensive throttling

https://gerrit.wikimedia.org/r/980413

We've removed the expensive file format throttling entirely from Thumbor for the time being. This has reduced throttling across the board for the time being. We're trying to allocate time to a more holistic solution for this issue. Please let me know if there is an improvement in the situation or if the problem persists.

Can someone explain why this problem and the previous OCR service down bug are interrelated?

This problem and the OCR service being down are (as far as I know) independent, Thumbor doesn't make a distinction between files being OCRed and any other thumbnail.

One relationship between OCR and Thumbor is because there seems to be a difference in response from Thumbor based on where a request is made from. So people have been seeing a) a thumbnail load correctly in Wikisource, followed by b) Google OCR then complaining that it can't find that thumbnail (the opposite is also happening, where a user can't see load the thumbnail but Google can, but that's less common because people generally don't request OCR on an image they can't see). So it's nothing really to do with OCR, a similar issue would be seen when requesting the same thumbnail from multiple places, it's just that that happens more visibly with the OCR tool.

We've worked around the Google issue for now by making the OCR tool catch that case and resubmit the OCR with the actual image data rather than a URL. But if the thumbnails now load correctly then that workaround will just not be used, which is great!

In my testing this morning, it looks like things are vastly improved compared with the last couple of weeks.

There has been considerable improvement for most of the files recently but some of the files are still not loading like this one.

In T337649#9388968, @Bodhisattwa wrote:

There has been considerable improvement for most of the files recently but some of the files are still not loading like this one.

This file failing to load is a separate issue due to the size of the file causing us to hit resource limits.

Midleading mentioned this in T345334: Cache thumbs in our caching infrastructure (e.g. ATS).Jan 19 2024, 4:31 AM

Maintenance_bot removed a project: Patch-For-Review.Jan 19 2024, 5:30 AM

I believe this specific issue has been resolved as a result of per-format throttling. Feel free to reopen if this specific issue reoccurs, otherwise please file a new ticket

Poslovitch mentioned this in T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC.Aug 14 2024, 12:57 PM

jeremyb-phone subscribed.Aug 16 2024, 6:23 AM

	F37161853: Screenshot 2023-08-03 at 10.45.47.png
	Aug 3 2023, 9:01 AM

	F37161852: Screenshot 2023-08-03 at 10.45.29.png
	Aug 3 2023, 9:01 AM

	F37161840: Screenshot 2023-08-03 at 10.27.24.png
	Aug 3 2023, 9:01 AM

	F37161838: Screenshot 2023-08-03 at 10.27.05.png
	Aug 3 2023, 9:01 AM

	F37161836: Screenshot 2023-08-03 at 10.26.09.png
	Aug 3 2023, 9:01 AM

	F37161833: Screenshot 2023-08-03 at 10.25.30.png
	Aug 3 2023, 9:01 AM

	F37107247: image.png
	Jun 16 2023, 2:59 PM

	F37107058: image.png
	Jun 16 2023, 11:53 AM

Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiadClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad
Closed, ResolvedPublic
Actions

Related Objects
Search...