
Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad
Open, High, Public

Description

As you can see here https://grafana.wikimedia.org/d/FMKakEyVz/hnowlan-thumbor-k8s?orgId=1&viewPanel=61&from=now-7d&to=now

the rate of 429s emitted by Thumbor in eqiad rose by about one order of magnitude on May 24th.

This seems to be mostly localized to rendering of multipage documents and in particular djvu files.

The impact of this bug is that work on Wikisource is heavily disrupted:

https://en.wikisource.org/wiki/Wikisource:Scriptorium#ocrtoy-no-text
https://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#Changes_in_the_past_week,_affect_the_Page_namespace.

I have verified that the 429s are emitted by thumbor itself and that thumbor reports rate-limiting by poolcounter. I don't know thumbor on k8s well enough to pinpoint what the problem is, though.

Details

Project                              Branch             Lines +/-
operations/deployment-charts         master             +2 -2
operations/deployment-charts         master             +2 -2
mediawiki/core                       wmf/1.41.0-wmf.13  +37 -12
mediawiki/core                       master             +37 -12
operations/software/thumbor-plugins  master             +1 -1
operations/software/thumbor-plugins  master             +3 -3
operations/deployment-charts         master             +3 -1
operations/deployment-charts         master             +1 -1
operations/software/thumbor-plugins  master             +1 -1
operations/deployment-charts         master             +1 -1
operations/software/thumbor-plugins  master             +1 -0
operations/software/thumbor-plugins  master             +2 -3
operations/software/thumbor-plugins  master             +1 -0
operations/deployment-charts         master             +9 -2
operations/deployment-charts         master             +5 -0
operations/deployment-charts         master             +9 -5
operations/deployment-charts         master             +9 -5

Event Timeline

There are a very large number of changes, so older changes are hidden.

I was asked by @jijiki to take a look.

PRP + OSD seems to be causing 2 thumb loads per Page:, both of different sizes. Can that be turned into just one? […]

My understanding is that we are actually requesting 3 images/thumbnails from the server on every page load as a result of this line in the php source code (Linker::processResponsiveImages() should ask for a 1.0 image, a 1.5 size image and a 2.0 image). […]

I don't know what ProofreadPage (PRP) and Openseadragon (OSD) are doing. But, in terms of how web browsers and HiDPi computer screens interact with <img srcset> HTML, I can say: No, browsers do not make three different requests.

The purpose of <img srcset> is to give the browser a choice of several image URLs. The browser will then select one of these. It selects the smallest variant that it knows will satisfy the pixel density of the user's computer screen. It would be "simpler" if we only offered one URL, the largest 2x one, and specified it using <img src>, with no srcset complexity. The reason srcset was invented in the HTML standard is as a performance optimisation, so that browsers can save download cost and time by choosing the smallest appropriate version.
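For illustration, the markup in question looks roughly like this (file name and URLs invented for the example):

<img src="//upload.example.org/thumb/Foo.djvu/page1-640px-Foo.djvu.jpg"
     srcset="//upload.example.org/thumb/Foo.djvu/page1-960px-Foo.djvu.jpg 1.5x,
             //upload.example.org/thumb/Foo.djvu/page1-1280px-Foo.djvu.jpg 2x"
     width="640" height="1024">

The browser evaluates the candidates against the device pixel ratio and fetches exactly one of them.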

[operations/deployment-charts@master] thumbor: make POOLCOUNTER_CONFIG_EXPENSIVE configurable. https://gerrit.wikimedia.org/r/927712

@charts/thumbor/values.yaml
   poolcounter:
     enabled: true
     server: "localhost" # poolcounter1004.eqiad.wmnet / poolcounter2003.codfw.wmnet in prod
+    config:
+      expensive:
+        workers: 16
+        maxqueue: 120
+        timeout: 10

[operations/deployment-charts@master] thumbor: add more expensive workers. https://gerrit.wikimedia.org/r/927730

@helmfile.d/services/thumbor/values.yaml
+  poolcounter:
+    config:
+      expensive:
+        workers: 24
+        maxqueue: 100

It's not obvious to me how these interact, but I'm going to guess that the latter overrides the former in a way that recursively merges the arrays, preserving keys on the inner arrays that aren't set (i.e. timeout: 10). Is that correct? If so, is there a way I could have found this out through documentation? And/or is there a way to see, publicly or from e.g. the deploy1002 command line, what the effective "real" configuration of a service is? (And is that documented?)
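For illustration, if the merge is indeed recursive, I would expect the effective values to end up as follows (my guess, not verified against the rendered chart):

poolcounter:
  enabled: true
  server: "localhost"  # overridden per-environment in production
  config:
    expensive:
      workers: 24      # from the helmfile.d override
      maxqueue: 100    # from the helmfile.d override
      timeout: 10      # inherited from the chart default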

Personally while I can see and appreciate the intention behind the POOLCOUNTER_CONFIG_EXPENSIVE value, having global limits for a system that distributes work across so many workers seems somewhat flawed to me. […]

I tend to agree. To my knowledge, the previous incarnation of this system (MediaWiki imagescalers) had no such global restriction. What we did have, and what I believe Thumbor also still has in addition to this, is a per-IP rate limit on thumbnail cache misses (renderfile ping limit for standard sizes, and renderfile-nonstandard for custom widths).

Based on Wikitech edits it looks like it was introduced in 2017, although the docs don't specify or explain why this exists or what problem it solves/solved.

At the risk of sounding naive and uninformed, and coming from an old-school CGI/imagescaler background, my intuition is that, generally speaking, we should "just" respond to requests directly and have sufficient workers to do so. I.e. either no queueing, or queueing such that it is virtually invisible to end-users (queuing at the cluster-level load balancer or local kernel accept queue for <1ms, to deal with concurrency, is fine of course). It feels like today we often:

  • cause users to literally sit in line waiting for multiple unrelated thumbnails to be resized synchronously, under "normal" circumstances and "normal" load.
  • serve HTTP 429 errors to real users interacting with our own first-party applications.

If that feeling is correct, I think that means Thumbor is performing too much throttling and queuing, which may justify raising, disabling, or removing some of these thresholds. Alternatively, I could also take comfort in learning more about our Thumbor plugins and the reasons behind these custom measures.

All expensive thumbnails have only one queue. It can be that some reqs failed to finish in time and thus never bumped the semaphore (= released the lock). […] Given that poolcounter's uptime is 92 days, I think zombie locks piled up to the point of throttling the whole system. Thumbor's poolcounter logic doesn't have a finally block set up to actually release anything in case of an error/timeout somewhere else, which is a bug […]. In fact, I'm struggling to find any release in the code at all.

I haven't independently verified this observation in code or in poolcounter network packets. But as far as general PoolCounter understanding is concerned, it is indeed required that clients release their lock. I don't actually know whether PoolCounter has an automatic lock release mechanism. I'm fairly certain it doesn't do it automatically (i.e. not like MySQL locks, which are tied to a connection/session). I believe this is an important design decision in the case of PoolCounter, not unlike the recent LoadMonitor discussion (T314020), where we are dealing with a similar load effect: we want to avoid overloading servers, and the simplest and most effective way to do that seems to be to assume by default that the work is either still happening, or that the problem is a side-effect of trying to do the work; either way we want the lock to stay. The problem with eager lock release is that it errs toward overload, which is fine for server orchestration logic (as we often do in MW for WAN/FileBackend/MySQL/Memc), but for external demand (like ParserCache and MySQL) you probably want to assume that resources are taken until you explicitly hear otherwise. Timeout tolerances tend to be higher than the time you would want to delay an automatic response to outages.

PoolCounter docs say that ACQ4ANY takes a <timeout> parameter. Lacking explanation to the contrary, my guess would be that this is how long the lock is held for, and thus that upon connection loss it will be implicitly released after that amount of time. In any event, it would indeed significantly reduce apparent capacity if we always wait for the timeout instead of explicitly releasing the lock in Thumbor. That seems like a good thing to fix, and should result in fewer Thumbor errors.
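For reference, my reading of the protocol shape is roughly the following (pool key invented for the example; the reply strings should be double-checked against the daemon):

ACQ4ME thumbor-expensive 16 120 10    <- key, workers, maxqueue, timeout
LOCKED                                <- granted; this connection now holds the lock
...                                   <- do the actual work
RELEASE                               <- must be sent explicitly (or the connection closed)
RELEASED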

I was asked by @jijiki to take a look.

PRP + OSD seems to be causing 2 thumb loads per Page:, both of different sizes. Can that be turned into just one? […]

My understanding is that we are actually requesting 3 images/thumbnails from the server on every page load as a result of this line in the php source code (Linker::processResponsiveImages() should ask for a 1.0 image, a 1.5 size image and a 2.0 image). […]

I don't know what ProofreadPage (PRP) and Openseadragon (OSD) are doing. But, in terms of how web browsers and HiDPi computer screens interact with <img srcset> HTML, I can say: No, browsers do not make three different requests.

The purpose of <img srcset> is to give the browser a choice of several image URLs. The browser will then select one of these. It selects the smallest variant that it knows will satisfy the pixel density of the user's computer screen. It would be "simpler" if we only offered one URL, the largest 2x one, and specified it using <img src>, with no srcset complexity. The reason srcset was invented in the HTML standard is as a performance optimisation, so that browsers can save download cost and time by choosing the smallest appropriate version.

I meant asking the Thumbor backend to generate three different images on every page load. Over on the browser side, we should be making exactly one request originating from Openseadragon (if caching is disabled and the user is on an extremely slow (Slow 3G) connection, the <img srcset> might race Openseadragon, which generates two requests; but even then it should not approach 20 sec, since the second image load is expected to be cached by Thumbor).

Throttling hasn't improved (in the past six hours):

image.png (319×855 px, 21 KB)

Mostly guessing/thinking out loud.
All expensive thumbnails have only one queue. It can be that some reqs failed to finish in time and thus never bumped the semaphore (= released the lock). (The TIMEOUT command in poolcounter doesn't set a timeout for releasing the lock; it's about throttling.) Given that poolcounter's uptime is 92 days, I think zombie locks piled up to the point of throttling the whole system. Thumbor's poolcounter logic doesn't have a finally block set up to actually release anything in case of an error/timeout somewhere else, which is a bug, but thumbor is not known for being bug-free. In fact, I'm struggling to find any release in the code at all.

One way to test this hypothesis is to restart poolcounter and see if things improve, but I don't know how risky it is to do such a thing, especially given that it's shared with all of MediaWiki.
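For clarity, the shape I would expect to see somewhere in the plugin is roughly this (sketch only; `pc` stands in for whatever client object thumbor uses, and acq4me/release are hypothetical method names):

class Throttled(Exception):
    """Raised when PoolCounter refuses the slot (maps to a 429 upstream)."""

def render_with_lock(pc, key, render):
    # Acquire a slot for this key, or give up and let the caller return 429.
    if not pc.acq4me(key):
        raise Throttled(key)
    try:
        return render()
    finally:
        # The missing piece: release the lock no matter how render() ends
        # (exception, timeout, killed subprocess, ...).
        pc.release()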

First of all - the throttling you see there has little to nothing to do with PDF/DJVU thumbnailing - the traffic for such images is low enough that the general numbers won't be affected.

We tried this hypothesis the safe way: restarting thumbor across a full DC does fix the issue for a day or two. The poolcounter code in thumbor has several logical flaws, one of which is the lack of a finally stanza for timeouts/kills/OOMs; another one is (IIRC) that it has some logical bugs in treating nested poolcounter calls - which is what I think happens for expensive files.

I wanted to find the smoking gun yesterday, which would be a hanging connection between a thumbor instance and PC, but then we kept de-deploying thumbor to try to stop the bleeding so I could not.

I think the next thing we can try is to either raise the limit for expensive processing to ~80% of the worker pool, or - better - just try to disable it, at least temporarily.

As a side note - I would urge all community members commenting here to try to stick to their experiences using the sites, rather than trying to figure out what's wrong from graphs - it's not obvious how the bug you're seeing is related to a specific graph, if at all.

I should add - we do have a correctly written Python poolcounter client, but its code uses blocking I/O, so it's not well suited for Tornado.

We could:

  1. Port that code to Tornado
  2. Just swap in that library for the current thumbor poolcounter library, given we're explicitly avoiding sending more than one request at a time to the Tornado worker (something I, to this day, don't understand as a design decision, given we're mostly shelling out to other programs and could easily serve more than one scaling request per worker); a rough sketch of this follows below.
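A rough sketch of what option 2 could look like, assuming a hypothetical blocking client with acquire(key)/release(key) methods (names invented for the example):

from concurrent.futures import ThreadPoolExecutor
from tornado.ioloop import IOLoop

_executor = ThreadPoolExecutor(max_workers=4)

async def with_poolcounter(client, key, work):
    loop = IOLoop.current()
    # Run the blocking acquire on a thread so the IOLoop stays responsive.
    locked = await loop.run_in_executor(_executor, client.acquire, key)
    if not locked:
        raise RuntimeError("throttled")  # translate to a 429 in the handler
    try:
        return await work()
    finally:
        # Always release, even if the work coroutine raised or timed out.
        await loop.run_in_executor(_executor, client.release, key)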

Change 927981 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] thumbor: allow changing poolcounter's release timeout

https://gerrit.wikimedia.org/r/927981

Change 927978 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/software/thumbor-plugins@master] Poolcounter.release: don't reconnect if the stream is lost

https://gerrit.wikimedia.org/r/927978

Change 927979 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/software/thumbor-plugins@master] Also add Poolcounter.release() to on_finish

https://gerrit.wikimedia.org/r/927979

Change 927981 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: allow changing poolcounter's release timeout

https://gerrit.wikimedia.org/r/927981

For the time being we have set release_timeout: 59, which is the maximum time a lock on poolcounter will be held. From the graphs, it appears that we are still serving more 429s than we want:

image.png (510×1 px, 201 KB)
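(For the record, the chart value in question presumably sits under the poolcounter section of the thumbor values; the exact nesting below is my guess from the patch subject, not verified:)

poolcounter:
  release_timeout: 59   # seconds before the connection is force-closed, releasing the lock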

All expensive thumbnails have only one queue. It can be that some reqs failed to finish in time and thus never bumped the semaphore (= released the lock). […] Given that poolcounter's uptime is 92 days, I think zombie locks piled up to the point of throttling the whole system. Thumbor's poolcounter logic doesn't have a finally block set up to actually release anything in case of an error/timeout somewhere else, which is a bug […]. In fact, I'm struggling to find any release in the code at all.

I haven't independently verified this observation in code or in poolcounter network packets. But as far as general PoolCounter understanding is concerned, it is indeed required that clients release their lock. I don't actually know whether PoolCounter has an automatic lock release mechanism. I'm fairly certain it doesn't do it automatically (i.e. not like MySQL locks, which are tied to a connection/session).

@jijiki sniffed the traffic of a thumbor pod, it never sent any release command.

PoolCounter docs say that ACQ4ANY takes a <timeout> parameter. Lacking explanation to the contrary, my guess would be that this is how long the lock is held for, and thus that upon connection loss it will be implicitly released after that amount of time. In any event, it would indeed significantly reduce apparent capacity if we always wait for the timeout instead of explicitly releasing the lock in Thumbor. That seems like a good thing to fix, and should result in fewer Thumbor errors.

My understanding from the doc (which could be wrong; I should double-check it against the code) is that the timeout in ACQ4ME (it calls ACQ4ME) is the timeout for waiting for a lock to be freed so it can be grabbed, not a timeout on the lock being released. My guess is that it just keeps the lock until it gets restarted.

[…] PoolCounter docs say that ACQ4ANY takes a <timeout> parameter. Lacking explanation to the contrary, my guess would be that this is how long the lock is held for, and thus that upon connection loss it will be implicitly released after that amount of time. […]

My understanding from the doc […] is that the timeout in ACQ4ME […] is the timeout for waiting for a lock to be freed so it can be grabbed, not a timeout on the lock being released. My guess is that it just keeps the lock until it gets restarted.

You're right. It's indeed how long to block-wait whilst attempting to acquire a lock.

Looking at the code, there does appear to be code relating to the concept of a "Disconnected client", which calls free_client_data, which among other things releases any locks the connection held. Having said that, my C knowledge is very limited, so I don't know if this covers all or most scenarios in which a connection can be closed or lost.

Hmm, maybe because of envoy or some other networking magic in k8s, the connections are getting reused inside the thumbor pod, making poolcounter think it needs to hold the lock (given that thumbor doesn't release them explicitly); that would explain a lot. Regardless, we need to make the releasing explicit and get it deployed to test this.

Change 927767 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] poolcounter: Make it release before closing connection

https://gerrit.wikimedia.org/r/927767

Change 927978 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] Poolcounter.release: don't reconnect if the stream is lost

https://gerrit.wikimedia.org/r/927978

Change 927979 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] Also add Poolcounter.release() to on_finish

https://gerrit.wikimedia.org/r/927979

Hmm, maybe because of envoy or some other networking magic in k8s, the connections are getting reused inside the thumbor pod, making poolcounter think it needs to hold the lock (given that thumbor doesn't release them explicitly); that would explain a lot. Regardless, we need to make the releasing explicit and get it deployed to test this.

Just for the record, there is no envoy in the thumbor pods, nor am I aware of any k8s networking property that would cause what you describe.

Also, as far as I can tell, no "zombie" TCP connection was left between thumbor and poolcounter.

While the patches @Ladsgroup and I created would kind of work to a degree, the real issue here is structural:

  • Poolcounter locks are acquired in sequence, and then we fire a release after N seconds. Well, to be more precise, we try to close the connection. If we complete serving a request, we close the connection immediately instead.
  • This means that if the process mangling the image gets OOM-killed or times out or throws an exception, we'll wait for the full time of this timeout (currently set to 59 seconds in production, since yesterday) to close the connection, which releases the lock.
  • As a consequence, if we're trying to render some file of the "expensive" kind that consistently fails because of the tighter memory limits in kubernetes, we are in fact saying "we can't have more than N failures to render an expensive file within M seconds", which causes the visible problem.

What we've done in the last few days is increase N from 16 to 24, and reduce M from 120 to 59 seconds. So this gives us some more wiggle room with this limit, but doesn't really solve the problem that our poolcounter integration for thumbor is fundamentally broken.

My personal take at this point is that we should just disable the limitation for large files for now, and consider reintroducing it if it becomes a recurring problem.
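In terms of the chart values shown earlier in this task, either option would presumably be a small change along these lines (purely illustrative; the numbers are placeholders, and I don't know the real per-pod worker count nor whether the chart tolerates an empty config block):

# Option A: raise the "expensive" bucket towards ~80% of the worker pool
poolcounter:
  config:
    expensive:
      workers: 32     # placeholder: ~80% of a hypothetical 40-worker pool
      maxqueue: 100
      timeout: 10
---
# Option B: drop the dedicated "expensive" bucket for now
poolcounter:
  config: {}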

As a normal user who knows nothing about the underlying coding, I'd like to report that the problem seems to have been alleviated over the past few days. At least I don't need to push the button some 10 or 15 times; in general it only takes 2-3 tries for Google OCR to identify the text. Did you change something in the OCR gadget?

As a normal user who knows nothing about the underlying coding, I'd like to report that the problem seems to have been alleviated over the past few days. At least I don't need to push the button some 10 or 15 times; in general it only takes 2-3 tries for Google OCR to identify the text. Did you change something in the OCR gadget?

The button is not going to work until the thumbnail is in place. The gadget works by having an image of the page in question and performing OCR (https://en.wikipedia.org/wiki/Optical_character_recognition) on it. There is also an OCR layer (text layer) in many Wikimedia Commons PDFs and DjVus. Those files have already been OCRed, and the result is stored in the file itself. In both cases, the file needs to be opened first.

Change 929392 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] Thumbor: deploy various poolcounter fixes

https://gerrit.wikimedia.org/r/929392

Change 929394 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] poolcounter: use per-format throttling key

https://gerrit.wikimedia.org/r/929394

Change 929392 merged by jenkins-bot:

[operations/deployment-charts@master] Thumbor: deploy various poolcounter fixes

https://gerrit.wikimedia.org/r/929392

Change 929394 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] poolcounter: use per-format throttling key

https://gerrit.wikimedia.org/r/929394

Change 929676 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: split expensive format poolcounter buckets

https://gerrit.wikimedia.org/r/929676

Change 930001 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] handler.images: await poolcounter release

https://gerrit.wikimedia.org/r/930001

Change 930158 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: include poolcounter.failure metric

https://gerrit.wikimedia.org/r/930158

Change 929676 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: split expensive format poolcounter buckets

https://gerrit.wikimedia.org/r/929676

Change 930158 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: include poolcounter.failure metric

https://gerrit.wikimedia.org/r/930158

Change 930001 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] handler.images: remove async from poolcounter release

https://gerrit.wikimedia.org/r/930001

Change 930664 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] images: log key limited by poolcounter

https://gerrit.wikimedia.org/r/930664

Change 930664 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] images: log key limited by poolcounter

https://gerrit.wikimedia.org/r/930664

A few of the recent fixes have addressed some behaviours, but there are still some pretty chronic behaviours present (and also some worrying side-effects from poolcounter rate-limit behaviour that indicate something much more broken than just maintaining locks for a long time).

However, something notable happened at around 6 AM UTC today (or perhaps a bit earlier, with a knock-on effect that had a long tail in thumbor). For the first time in a week, poolcounter times dropped to what would be a more normal range, rate limit responses dropped quite a bit towards a more normal range, and time spent throttled also dropped precipitously. Does anyone tagged on this ticket know of any kind of batches or large jobs (OCR, translation, etc.) that might have triggered this behaviour?

Poolcounter drops:

image.png (1×1 px, 220 KB)

ghostscript QPS and 429 rate dropping:

image.png (1×1 px, 227 KB)

Further update: there was a sustained and unprecedentedly long-running period of activity for the ThumbnailRender job:

image.png (620×1 px, 76 KB)

Not sure what triggered this; I'd be very curious to see what kicked it off. I'm guessing this was mostly PDFs.

Oh, I actually might know what's happening. When someone uploads a multi-page file, like a pdf or djvu, then to produce the pregen sizes, mw creates a job for each size and each page (e.g. if we have five pregen sizes and a 300-page pdf, you get 1500 jobs) and pushes all of them to the job queue at the same time, which I assume means they all start around the same time... Thankfully it has a limit of up to the first 50 pages, but that's still 4*50 jobs at the same time. I think I can put out a solution to this. Basically make it similar to refreshlinks. Give me a minute.
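Roughly the shape I have in mind (conceptual sketch in Python for brevity; the actual change is to the thumbnail pre-render job handling in MediaWiki core, which is PHP):

MAX_PAGE_RENDER_JOBS = 50  # existing cap on how many pages get pre-rendered

def enqueue_all_at_once(push_job, file_name, page_count, widths):
    # Current behaviour: one job per (width, page) pair, all pushed at upload
    # time, so a single upload lands many near-simultaneous renders on Thumbor.
    for width in widths:
        for page in range(1, min(page_count, MAX_PAGE_RENDER_JOBS) + 1):
            push_job({"file": file_name, "width": width, "page": page,
                      "page_count": page_count})

def enqueue_serial(push_job, file_name, page_count, widths):
    # Proposed behaviour: push only page 1 for each width...
    for width in widths:
        push_job({"file": file_name, "width": width, "page": 1,
                  "page_count": page_count})

def run_serial_job(push_job, render, job):
    # ...and have each job re-queue itself for the next page once it is done,
    # so at most one render per width is in flight for a given file.
    render(job["file"], job["width"], job["page"])
    if job["page"] < min(job["page_count"], MAX_PAGE_RENDER_JOBS):
        push_job({**job, "page": job["page"] + 1})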

Change 930884 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] file: Make pre-gen rendering of multi-page files (pdf, ...) serial

https://gerrit.wikimedia.org/r/930884

Reviews of this would be appreciated ^

When someone uploads a multi-page file, like a pdf or djvu, then to produce the pregen sizes, mw creates a job for each size and each page (e.g. if we have five pregen sizes and a 300-page pdf, you get 1500 jobs) and pushes all of them to the job queue at the same time, which I assume means they all start around the same time... Thankfully it has a limit of up to the first 50 pages, but that's still 4*50 jobs at the same time. I think I can put out a solution to this. Basically make it similar to refreshlinks. Give me a minute.

Bigger but relevant issue: does it make sense in the first place to pregen the standard "thumb" sizes for PDF/DjVu files? Fæ alone uploaded north of a million of these two years ago as a backup of IA (worried they'd get sued into oblivion), most of which are never going to get used for anything. Of the extant PDF/DjVu files that do get used, usage outside of the Wikisources is limited to a few (mainly Wikipedias) showing a single page (the title page, or rarely a plate or illustration). I don't have hard data, but I imagine the fraction of those pregen thumbs that ever gets used is infinitesimal and hard to justify expending the resources on, especially if we run into issues such as the current one.

The place where page images actually are needed is on Wikisource, because they access all the pages sequentially (and it's in the hot path of the workflow), but that doesn't use the standard pregen sizes. The access pattern, coarsely, is also going to be that the page images are accessed when proofreading (transcribing) a book, but once that's done it's going to be years before someone needs them. The brute-force strategy is pre-gen'ing all the page images at PRP-usable sizes (not the thumb sizes); the smart strategy would be to not kick off that job until someone first accesses a page from the DjVu/PDF.

I put this out there because there's a need to highly tune the specific subset that is relevant to Wikisource, but right now the strategy is optimized for a use case that is uncommon and very hard to actually optimize for (for example, plates and illustrations are no more likely to be in the first 50 pages than in the subsequent several hundred).

(PS. a more ambitious, but better long term, approach would probably be to extend the stack such that you could upload raw .jp2 images into a magic "Category:"—or other virtual container—that tied them together without having to physically collect them inside a PDF/DjVu container. Even .zip would be a better container for this. Combined with a derived slot where we could store (and access) the OCR text layer this would be a much more elegant solution. Thumb generation would then no longer have to shell out to ghostscript/ddjvu, and apart from needing PRP-usable "thumb" sizes, you could treat this use case the same way as every other single-image use case. It would also have several other technical and functional benefits. Please hit me up if anybody has even the slightest interest in exploring that idea!)

I would suggest keeping Ladsgroup's patch, but limiting the max to 10 instead of 50.

In books, there are often empty pages at the start of a book, just after the title. When a user is uploading a PDF/DjVu for Wikisource, it is common for them to create those pages, in the "Page" namespace, just after creation. Wikisource users review OCR and move to the next page, and going through an empty page naturally takes a shorter amount of time. Wikisource users work in sequence, so the old system of trying to get 50 thumbs at once is overkill. Admittedly, edit-in-sequence at T308098 does make this particular pre-gen a lot less important, but then again, that feature has not rolled out yet.

I agree with Xover's explanation of how PDFs/DjVus are used outside of Wikisource.

This issue was introduced in MW 1.37 as part of change 698367 (ref T284416), which added support for pre-rendering the thumbnail for multi-page files, which was previously not functional (not even for the first page, presumably).

Completely anecdotally, subjectively, and dependent on my hazy memory: that time frame may well coincide with generally worse performance of loading page images on Wikisource. If it is accurate that prior to change 698367 no thumbs were pregenerated for multipage files, it is entirely possible that with the current implementation this pregen'ing is actually making performance worse and triggering (making visible) all these other problems. If that's the case then MAX_PAGE_RENDER_JOBS is just a bandaid that hides the underlying problem.

So dumb question time: do we have the logs to, and is it feasible in practice to analyse them to, get some hard data on how many of those pregen'ed standard thumbs are actually accessed and used for anything? Can we tell with any kind of objective metric whether the current strategy for it is actually doing anything beneficial?

Oh, also, the linked fork of PdfHandler claims performance gains of 10x by using pdftocairo instead of ghostscript+imagemagick. At the scale we're talking here, 10x improved performance for each thumb generated—iff the pregen is found beneficial in the first place—may be a worthwhile investment. It'd dramatically reduce the pressure for the pregen jobs, and improve interactive performance whenever a non-cached thumb is requested (which, as mentioned, seems to be most of them for the Wikisource use case right now). That it also has much better font rasterization is pure bonus.
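For concreteness, the two approaches being compared are roughly the following (flags taken from the respective man pages rather than from the linked fork, and untested here; the production path additionally resizes with ImageMagick):

# Ghostscript: rasterise page 5 of a PDF at 150 DPI
gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dFirstPage=5 -dLastPage=5 -r150 \
   -o page5.jpg input.pdf

# pdftocairo (Poppler): the same page, scaled to 1024px wide
pdftocairo -jpeg -f 5 -l 5 -scale-to-x 1024 -scale-to-y -1 -singlefile \
   input.pdf page5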

Change 930884 merged by jenkins-bot:

[mediawiki/core@master] file: Make pre-gen rendering of multi-page files (pdf, ...) serial

https://gerrit.wikimedia.org/r/930884

Looked at my browser's networking tab, loaded the same wiki page in the "Page" namespace on Wikisource, and switched my resolution around, with these results:

Resolution   Image fetched   Image for next+prev page
1920x1080    1535px          1024px
1600x900     1024px          1024px
1280x720     1024px          1024px

1024px is not pre-generated as of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/912837 but it probably does not need to be, since next and prev page thumbnails are fetched anyway.

Tried turning prefetch off in my browser; it did not affect the next+prev page loading, so it seems to be a WMF thing.

It would be helpful, though, to pre-gen the first page at 1024px, and then have visiting the first page generate page 2, and so on and so forth. It would be a PDF/DjVu-only thing; 1024px is seldom used outside of that.

Oh, also, the linked fork of PdfHandler claims performance gains of 10x by using pdftocairo instead of ghostscript+imagemagick. At the scale we're talking here, 10x improved performance for each thumb generated—iff the pregen is found beneficial in the first place—may be a worthwhile investment. It'd dramatically reduce the pressure for the pregen jobs, and improve interactive performance whenever a non-cached thumb is requested (which, as mentioned, seems to be most of them for the Wikisource use case right now). That it also has much better font rasterization is pure bonus.

If this is true then this could clearly be a huge benefit for us - I've created T339845 to investigate this. We might need to do a bit of our own work to compare engines as lots of the comparisons and claims I can see about various engines are fairly old, but this is definitely work worth considering. Thanks for pointing it out!

Change 931073 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.41.0-wmf.13] file: Make pre-gen rendering of multi-page files (pdf, ...) serial

https://gerrit.wikimedia.org/r/931073

Change 931073 merged by jenkins-bot:

[mediawiki/core@wmf/1.41.0-wmf.13] file: Make pre-gen rendering of multi-page files (pdf, ...) serial

https://gerrit.wikimedia.org/r/931073

Mentioned in SAL (#wikimedia-operations) [2023-06-19T14:19:24Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:931073|file: Make pre-gen rendering of multi-page files (pdf, ...) serial (T337649)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-19T14:20:48Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:931073|file: Make pre-gen rendering of multi-page files (pdf, ...) serial (T337649)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-19T14:39:32Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:931073|file: Make pre-gen rendering of multi-page files (pdf, ...) serial (T337649)]] (duration: 20m 07s)

Thanks to Amir's change and a change in the jobqueue's concurrency for thumbnail generation, I think much of the rate limiting people have been seeing should have subsided. We've surfaced multiple other issues along the way that either need to be resolved or have already been resolved. If anyone is still seeing the specific original issue regularly, please mention it in this ticket.

This comment was removed by Snaevar.

oh I actually might know what's happening. When someone uploads a multi-page file, like a pdf or djvu, then to produce the pregen sizes, mw creates a job for each size and each page (e.g. if we have five pregen sizes and a 300 page pdf, you get 1500 jobs) and pushes all of them at the same time to the job queue which I assume they start around the same time... Thankfully it has the limit of up to first 50 pages but that's still 4*50 jobs at the same time. I think I can put out a solution to this. Basically make it similar to refreshlinks. Give me a minute.

I had requested in the past that we pre-generate images just for the first N pages, and IIRC it was implemented at least for PDFs.

Yeah yeah, for the first 50 pages. The problem was that it was queuing all pages in all pre-gen sizes at once.

Yeah yeah, for the first 50 pages. The problem was that it was queuing all pages in all pre-gen sizes at once.

Since this task is more narrowly focussed on keeping the stack from melting, I've opened T341918 as a separate task. But I will note here that the essence of it is that the static 50-page limit is an extremely blunt instrument that is not serving the Wikisource community well, but it does point out an opportunity for a significant optimization in their core workflow.

Over the last few days (early Sunday UTC was when I consciously noted it, but that's unlikely to be very accurate timing) I'm seeing variable but worsening image load times and an increasing number of 429 responses. In a quick test just now, two out of three thumbs loaded from the same DjVu returned 429 (example in case you want to look for it in logs).

And the current status is that it takes 5–10 reloads, with a timeout in the 10–15s range, to load each page. That is, an average total effective thumbnail render time of 1–2 minutes. From my perspective "Commons is down".

429 response codes in codfw seem to have started jumping some time during July 31; they briefly fell back during August 2nd, only to go through the roof again and stay there later that day.

Screenshot 2023-08-03 at 10.25.30.png (514×1 px, 275 KB)

NOTE: These graphs have a logarithmic vertical axis: 429 response codes have jumped from an average of around 0.5 per second to around 20 per second, an increase by a factor of 40. We are now, on average, seeing an equal or higher number of 429 responses than normal 200 responses. If we factor in the normal noise floor of 404s, the graph shows twice as many error response codes as normal 200 response codes.

This comports with poolcounter times in codfw that hit the roof, briefly fell back to normal, and then were pegged to the 5s timeout in the same pattern.

Screenshot 2023-08-03 at 10.26.09.png (594×1 px, 459 KB)

The HTTP response code rate does not follow the same pattern, but as of yesterday the error responses have increased by a factor of 3–4.

Screenshot 2023-08-03 at 10.27.05.png (594×1 px, 142 KB)

This is probably also related to why HAProxy current sessions have more than doubled.

Screenshot 2023-08-03 at 10.27.24.png (590×1 px, 418 KB)

Looking back as far as the Thumbor dashboard has data (April-ish), this does not obviously appear to be a normal seasonal pattern, and both the 4xx HTTP response code rate and HAProxy current sessions are currently the highest they have been for as long as we have measurements.

Screenshot 2023-08-03 at 10.45.29.png (594×1 px, 155 KB)

Screenshot 2023-08-03 at 10.45.47.png (590×1 px, 306 KB)

Change 945553 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: include workers for djvu files, limit thumbnailrender concurrency

https://gerrit.wikimedia.org/r/945553

429 response codes in codfw seem to have started jumping some time during July 31; they briefly fell back during August 2nd, only to go through the roof again and stay there later that day.

This comports with poolcounter times in codfw that hit the roof, briefly fell back to normal, and then were pegged to the 5s timeout in the same pattern.

Screenshot 2023-08-03 at 10.26.09.png (594×1 px, 459 KB)

The Thumbor logs indicate that at least some throttling in codfw has been legitimate since 02:00 on August 2nd. However, the issues reported before that are unrelated, and I'm attempting to look into that now - it appears your rate limiting is being caused by eqiad. There is some overlap from a thumbnailing job that is using up workers for expensive formats; I'm going to reduce concurrency there and also increase the threshold for workers on expensive formats to help remedy the issues you're seeing.

Change 945553 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: include workers for djvu files, limit thumbnailrender concurrency

https://gerrit.wikimedia.org/r/945553

I'm going to reduce concurrency there and also increase the threshold for workers on expensive formats to help remedy the issues you're seeing.

Current user experience is "pretty slow" rather than "broken". Page images take somewhere north of 5s to load, but in my relatively small sample size the images do all load on the first try.

“So, Xover, how do you contribute to the Wikimedia projects?”

“Well, mostly I wait around while Thumbor stares at me blankly and mumbles ‘429? 429?’.”

“That doesn't sound like a very productive way to contribute?”

“Well, it's not, but you have to understand… Look, it's like the Washington press corps, alright?”

“I don't follow?”

“It's like journalists in Washington sitting around watching an aging politician zone way out on the podium. They're not getting much done beside waiting, and eventually the aides will manage to recall him to the present so they can ask some questions and do some writing. It's horribly inefficient and a waste of everyone's time, but we still need a free and independent press.”

“So you're comparing the importance of your contributions to a fundamental tenet of a free society?”

“No, no. I'm just saying it's about as much use waiting for Thumbor to fix itself as hoping dementia will magically go away...”

Oh, this is Phabricator? I thought it was the submission form for my creative writing class. My bad… 😎

Change 956370 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] jobqueue: limit thumbnailrender job concurrency further

https://gerrit.wikimedia.org/r/956370

Hi @Xover,

I can sense the frustration pretty clearly and I appreciate the effort to illustrate it via this story, avoiding lashing out. As a data point, users aren't the only ones frustrated with the situation. Engineers (developers, software engineers, SREs) are frustrated (and have been for a long time) too.

Looking at the larger picture, I think it's safe for me to say that we are admittedly and unfortunately, as a movement, paying the price of years of neglect and underinvestment in this critical part of the infrastructure. This neglect wasn't limited to just Thumbor, but as a component, Thumbor was affected more. There are ongoing efforts to address the general problem. Perhaps moving slower than we'd like, but at least moving. Thumbor has finally been upgraded to a newer version, it's been moved to a new platform, and in the process a variety of small bugs and problems here and there have been fixed. Engineers are participating in this exact task, trying to somehow fix the reported problems. There's obviously still a long road ahead, but there is some momentum.

Why do I say all that? Because I fear that venting frustration in a forum of technical people (Phabricator), who are already all well aware of the frustration, never mind sharing it, will do more harm than good by causing people to become disengaged.

In your shoes, I would now obviously ask, "if not here, where?". I can only offer suggestions, unfortunately. Reaching out to management via email would be my go-to.

I can sense the frustration pretty clearly […]

Oh, if that came across as venting frustration I have to apologize. It was an attempt at a funny way to communicate 1) that Thumbor needs kicking again (as it has needed periodically since the opening of this task), and 2) that tweaking the parameters for parallelism, queue size, and timeouts is probably not going to be sufficient since it keeps recurring (iow, we'll be back here in a couple of weeks, wasting Hnowlan's time with yet another temporary workaround).

Change 956370 merged by jenkins-bot:

[operations/deployment-charts@master] jobqueue, thumbor: attempt to limit impact of thumbnailrender job

https://gerrit.wikimedia.org/r/956370

I can sense the frustration pretty clearly […]

Oh, if that came across as venting frustration I have to apologize. It was an attempt at a funny way to communicate 1) that Thumbor needs kicking again (as it has needed periodically since the opening of this task), and 2) that tweaking the parameters for parallelism, queue size, and timeouts is probably not going to be sufficient since it keeps recurring (iow, we'll be back here in a couple of weeks, wasting Hnowlan's time with yet another temporary workaround).

Apologies then for misinterpreting your intent. Thank you for clarifying it.