
Add job queue system for generating ebooks
Open, Needs Triage · Public · Feature

Description

Currently, we have no way to throttle the rate of ebook generation, leading to reliability problems with WS Export.

We should implement a queuing system, for example with the Symfony Messenger component.

Event Timeline


@Samwilson, if you think you won't manage to do it in the next few months, I think I can take some time to work on it.

A proposal:

  • restrict the number of parallel tasks
  • prioritise cheap tasks (e.g. epub) and throttle more expensive tasks (e.g. pdf); see the sketch below.
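Messenger can express both of these with multiple transports consumed in priority order. A sketch, where the transport names and the Doctrine DSN are assumptions rather than actual config:

```yaml
# config/packages/messenger.yaml
framework:
    messenger:
        transports:
            # Two queues on the same Doctrine (DB) transport.
            exports_cheap: 'doctrine://default?queue_name=cheap'
            exports_expensive: 'doctrine://default?queue_name=expensive'
```

```
$ ./bin/console messenger:consume exports_cheap exports_expensive
```

A worker started like that drains exports_cheap before touching exports_expensive, and the number of parallel tasks is simply the number of workers we run. Routing epub and pdf requests to different queues would need either separate message classes or choosing the transport at dispatch time.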

What do you think about it?

That sounds great. I do have a bit of time to look at this, so I'd be happy to make a start and maybe you can help me figure it out?

I was also wondering if we shouldn't be thinking again about caching the generated books? We get quite a lot of requests for the same output, e.g.:

$ zcat -f access.log* | awk '{ print $5 }' | sort | uniq -c | sort -nr | head -30
   8064 /robots.txt
   3896 /opds/en/Ready_for_export.xml
   3884 /favicon.ico
   3492 "GET
   2950 /styles.css
   2904 /img/Wikisource-logo.svg
   2641 /
   1374 /img/previous-ltr.svg
    755 /?lang=en&format=rtf&page=Little_Elephant%27s_Christmas
    721 /?format=pdf&lang=zh&page=%E5%89%AA%E7%87%88%E9%A4%98%E8%A9%B1
    648 /?format=pdf&lang=zh&page=%E5%A4%B7%E5%A0%85%E5%BF%97
    519 /?format=pdf&lang=zh&page=%E8%B1%94%E7%95%B0%E7%B7%A8
    435 /tool/toolinfo.json
    429 /?format=pdf&lang=zh&page=%E6%B8%85%E5%B9%B3%E5%B1%B1%E5%A0%82%E8%A9%B1%E6%9C%AC
    345 /?format=pdf&lang=zh&page=%E6%AD%B7%E4%BB%A3%E8%88%88%E8%A1%B0%E6%BC%94%E7%BE%A9
    324 /apple-touch-icon.png
    318 /tool/book.php?lang=es&page=Bail%C3%A9n+%28Versi%C3%B3n+para+imprimir%29&format=pdf-a5
    307 /?format=pdf&lang=zh&page=%E5%AE%A3%E5%AE%A4%E5%BF%97
    302 /apple-touch-icon-precomposed.png
    291 /?format=pdf&lang=zh&page=%E8%A5%BF%E6%BC%A2%E6%BC%94%E7%BE%A9
    290 /?format=pdf&lang=zh&page=%E6%B0%91%E5%9C%8B%E6%BC%94%E7%BE%A9
    283 /?format=pdf&lang=zh&page=%E7%9B%A4%E5%8F%A4%E8%87%B3%E5%94%90%E8%99%9E%E5%82%B3
    274 /?format=pdf&lang=zh&page=%E7%9B%8A%E6%99%BA%E9%8C%84
    271 /?format=pdf&lang=zh&page=%E7%8E%8B%E9%99%BD%E6%98%8E%E9%9D%96%E4%BA%82%E9%8C%84
    269 /?format=pdf&lang=zh&page=%E6%B0%B4%E6%BB%B8%E5%BE%8C%E5%82%B3
    268 /?format=pdf&lang=zh&page=%E8%8B%B1%E7%83%88%E5%82%B3
    264 /?format=pdf&lang=zh&page=%E5%A4%A7%E6%98%8E%E5%A5%87%E4%BF%A0%E5%82%B3
    248 /?format=pdf&lang=it&page=Canti_(Leopardi_-_Donati)%2FXXXIV._La_ginestra
    241 /?format=pdf&lang=zh&page=%E6%8B%AC%E7%95%B0%E8%AA%8C

Anyway, my idea for the job queue is something like this:

  1. When the user submits an export request, a CreateBookMessage is created in the ExportController and dispatched to the message bus (i.e. some info is saved to the DB or Redis).
  2. The worker (running ./bin/console messenger:consume async) will consume the message and create the ebook, writing it out to a file.
  3. The controller (or an API request if we want to get fancy and have some sort of progress indicator) waits for that file to exist, then reads it and sends it to the user.
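In code, steps 1 and 2 might look roughly like this (a sketch assuming PHP 8; class and property names are illustrative, and the classes would live in separate files):

```php
use Symfony\Component\Messenger\Handler\MessageHandlerInterface;
use Symfony\Component\Messenger\MessageBusInterface;

// 1. The message: a plain, serializable value object; the transport is what
//    saves it to the DB or Redis.
class CreateBookMessage
{
    public function __construct(
        public string $lang,
        public string $title,
        public string $format,
    ) {
    }
}

// ...dispatched from the ExportController:
class ExportController
{
    public function __construct(private MessageBusInterface $bus)
    {
    }

    public function export(string $lang, string $title, string $format)
    {
        $this->bus->dispatch(new CreateBookMessage($lang, $title, $format));
        // ...then wait for the output file to appear (step 3).
    }
}

// 2. The handler, run by `./bin/console messenger:consume async`:
class CreateBookHandler implements MessageHandlerInterface
{
    public function __invoke(CreateBookMessage $message): void
    {
        // Generate the ebook and write it where the web server can read it,
        // e.g. file_put_contents($outputPath, $generatedBook);
    }
}
```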

Ideally we'd write the files to the new object store so they'd be accessible from any server. That would mean we could run multiple worker servers, each consuming as many messages as it can, and the web server would always fetch the final files from the same place.
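With something like Flysystem (an assumption on my part; any S3-compatible client would do), the worker and web server wouldn't even need to share a disk:

```php
use Aws\S3\S3Client;
use League\Flysystem\AwsS3V3\AwsS3V3Adapter;
use League\Flysystem\Filesystem;

// Sketch only: region, bucket name, and credentials are placeholders.
$s3 = new S3Client(['version' => 'latest', 'region' => 'region-here' /* endpoint, creds */]);
$store = new Filesystem(new AwsS3V3Adapter($s3, 'ws-export-books'));

// Worker side: upload the finished book.
$store->write($filename, $bookContents);

// Web side: check for it, then stream it out.
if ($store->fileExists($filename)) {
    $stream = $store->readStream($filename);
}
```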

(I'm sure there's more to this, but does that make sense roughly?)

Draft PR here: https://github.com/wikimedia/ws-export/pull/489

The parts I'm not really sure about are:

  • The controller loop that waits for the book to be generated. Maybe it should do something better (a rough sketch of the naive version follows this list).
  • If there are multiple requests for the same book, the first one gets the result. It looks like the fix for this is to add a request ID to the filename hash, but the other way is to not delete the file after download. For the latter, we'd want to touch the file on download and set up a stale-file deletion job (which we already do for Calibre tmp files, so it wouldn't be hard).
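For the first point, the naive version would be something like this in the controller (timeout values are arbitrary):

```php
use Symfony\Component\HttpFoundation\BinaryFileResponse;

// Poll for the worker's output file, with a deadline so a dead worker
// doesn't hang the request forever.
$deadline = time() + 120;
while (!file_exists($path)) {
    if (time() > $deadline) {
        throw new \RuntimeException('Timed out waiting for the export worker.');
    }
    usleep(500000); // half a second between checks
}
return new BinaryFileResponse($path);
```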

The queue processes one book at a time, so if we want to have parallel jobs (which we do) then we'll set up multiple worker processes with systemd.
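A systemd template unit would make the parallelism just a matter of how many instances we enable (paths and user here are guesses):

```ini
# /etc/systemd/system/ws-export-worker@.service
[Unit]
Description=WS Export queue worker %i

[Service]
ExecStart=/usr/bin/php /var/www/ws-export/bin/console messenger:consume async --time-limit=3600
Restart=always
User=www-data

[Install]
WantedBy=multi-user.target
```

```
$ sudo systemctl enable --now ws-export-worker@{1..4}.service
```

The --time-limit makes each worker exit after an hour and get restarted by systemd, which avoids long-running PHP processes accumulating state.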

> The controller loop that waits for the book to be generated. Maybe it should do something better.

Is there a way to get "acks" when a message is processed? It seems it might be possible with "stamps", but I'm not sure if it properly blocks or just returns the current stamps: https://symfony.com/doc/current/messenger.html#getting-results-from-your-handlers
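As far as I can tell from that page, HandledStamp is only added when the message is handled synchronously; with an async transport, dispatch() returns as soon as the message is queued, so it never blocks:

```php
use Symfony\Component\Messenger\Stamp\HandledStamp;

$envelope = $bus->dispatch(new CreateBookMessage($lang, $title, $format));

// Null with an async transport: the worker hasn't run yet, and dispatch()
// doesn't wait for it. Only a synchronously-handled message carries a result.
$stamp = $envelope->last(HandledStamp::class);
$result = $stamp ? $stamp->getResult() : null;
```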

> If there are multiple requests for the same book, the first one gets the result. It looks like the fix for this is to add a request ID to the filename hash, but the other way is to not delete the file after download. For the latter, we'd want to touch the file on download and set up a stale-file deletion job (which we already do for Calibre tmp files, so it wouldn't be hard).

Yes, it looks like a great idea! With that we might also avoid job submission entirely if a file is already available for the same book. But then we run into the cache-invalidation problem.
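Something like this in the controller, before dispatching, would cover the "already available" case while still honouring the existing nocache=1 parameter (the filename scheme and TTL are invented for the example):

```php
// Derive a stable cache path from the book's identity.
$path = $cacheDir . '/' . hash('sha256', "$lang|$title|$format") . '.' . $format;
$maxAge = 86400; // one day, say

if (!$nocache && file_exists($path) && time() - filemtime($path) < $maxAge) {
    touch($path); // reset the clock for the stale-file cleanup job
    return new BinaryFileResponse($path);
}

$this->bus->dispatch(new CreateBookMessage($lang, $title, $format));
// ...then wait for $path as usual.
```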

> But then we run into the cache-invalidation problem.

Indeed. The only way to test a text for export is by exporting it. Your first instinct when that export is broken is to (tweak the wikitext and) try the download button again. If there's last-accessed caching, there needs to be some fairly obvious or discoverable way to force regeneration.

Cached corrupted data was what took down the phetools OCR tool back around 2019 (or whenever it was). Making sure such problems resolve themselves over time is a very good idea.

I totally agree that it should be possible to regenerate a book whenever wanted, but we also don't want to make that the default, because most readers (I assume!) don't mind whether the book they get was generated just now or a day ago. The existing nocache=1 parameter will be passed through as part of the job and obeyed accordingly, for people who do need the latest.

(Talking of exporting for testing purposes, I still want to work on T258909: Add direct display of epubcheck (validation) output, which would bypass the cache by default.)

I'm starting to wonder whether it would be better here to not use Symfony's Messenger system at all, and just create our own database table for storing job info. That way, the job-runner could update that table with the status (or error info), and the record of the job could persist after the book has been generated, serving as our means of determining whether it should be regenerated, etc.
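The table wouldn't need much; something like this (MySQL-flavoured, with the columns invented for the sketch):

```sql
CREATE TABLE export_job (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    lang VARCHAR(20) NOT NULL,
    title VARCHAR(255) NOT NULL,
    format VARCHAR(10) NOT NULL,
    status ENUM('pending', 'running', 'done', 'failed') NOT NULL DEFAULT 'pending',
    error TEXT DEFAULT NULL,
    created_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL,
    UNIQUE KEY job_identity (lang, title, format)
);
```

The unique key is what would let a second request for the same book find the existing job (or finished file) instead of queuing a duplicate.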

I don't think handing the generation to a background process and then checking in a loop every second whether it has finished would be the proper way.

I would continue creating the book in the main process, but use a PoolCounter to limit the number of exports that may run simultaneously, similar to what we do with wikitext.

This would also solve the waiting issue and simultaneous requests for the same book.
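For the record, PoolCounter itself is a MediaWiki service, so WS Export would need its own equivalent. A minimal sketch of the same idea with the Symfony Lock component, assuming a shared store such as Redis behind the LockFactory (slot count, naming, and TTL are invented here):

```php
use Symfony\Component\Lock\LockFactory;
use Symfony\Component\Lock\LockInterface;

function acquireExportSlot(LockFactory $factory, int $maxConcurrent = 4): ?LockInterface
{
    // Try each of N named "slots" without blocking; the TTL guards against
    // a crashed export holding a slot forever.
    for ($i = 0; $i < $maxConcurrent; $i++) {
        $lock = $factory->createLock("export-slot-$i", 300.0);
        if ($lock->acquire()) {
            return $lock; // release it when the export finishes
        }
    }

    return null; // pool full: caller can wait and retry, or tell the user
}
```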