Currently, we have no way to throttle the rate of ebook generation, leading to reliability problems with WS Export.
We should implement a queuing system, such as with the Symfony Messenger component.
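Wiring up a Messenger queue is mostly configuration. A minimal sketch, assuming the Doctrine transport (so no extra broker is needed) and a hypothetical `GenerateBookRequest` message class — not the actual WS Export config:

```yaml
# config/packages/messenger.yaml (sketch)
framework:
    messenger:
        transports:
            # Store queued jobs in the existing database.
            ebooks:
                dsn: 'doctrine://default?queue_name=ebooks'
        routing:
            # Hypothetical message class representing one export request.
            'App\Message\GenerateBookRequest': ebooks
```

Handlers would then be ordinary services tagged as message handlers, and workers would consume the `ebooks` transport.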
> Currently, we have no way to throttle the rate of ebook generation, leading to reliability problems with WS Export.
> We should implement a queuing system, such as with the Symfony Messenger component.
@Samwilson if you think you won't manage to do it in the next few months, I think I can take some time to work on it.
A proposal:
What do you think about it?
That sounds great. I do have a bit of time to look at this, so I'd be happy to make a start and maybe you can help me figure it out?
I was also wondering if we shouldn't be thinking again about caching the generated books? We get quite a lot of requests for the same output, e.g.:
```
$ zcat -f access.log* | awk '{ print $5 }' | sort | uniq -c | sort -nr | head -30
   8064 /robots.txt
   3896 /opds/en/Ready_for_export.xml
   3884 /favicon.ico
   3492 "GET
   2950 /styles.css
   2904 /img/Wikisource-logo.svg
   2641 /
   1374 /img/previous-ltr.svg
    755 /?lang=en&format=rtf&page=Little_Elephant%27s_Christmas
    721 /?format=pdf&lang=zh&page=%E5%89%AA%E7%87%88%E9%A4%98%E8%A9%B1
    648 /?format=pdf&lang=zh&page=%E5%A4%B7%E5%A0%85%E5%BF%97
    519 /?format=pdf&lang=zh&page=%E8%B1%94%E7%95%B0%E7%B7%A8
    435 /tool/toolinfo.json
    429 /?format=pdf&lang=zh&page=%E6%B8%85%E5%B9%B3%E5%B1%B1%E5%A0%82%E8%A9%B1%E6%9C%AC
    345 /?format=pdf&lang=zh&page=%E6%AD%B7%E4%BB%A3%E8%88%88%E8%A1%B0%E6%BC%94%E7%BE%A9
    324 /apple-touch-icon.png
    318 /tool/book.php?lang=es&page=Bail%C3%A9n+%28Versi%C3%B3n+para+imprimir%29&format=pdf-a5
    307 /?format=pdf&lang=zh&page=%E5%AE%A3%E5%AE%A4%E5%BF%97
    302 /apple-touch-icon-precomposed.png
    291 /?format=pdf&lang=zh&page=%E8%A5%BF%E6%BC%A2%E6%BC%94%E7%BE%A9
    290 /?format=pdf&lang=zh&page=%E6%B0%91%E5%9C%8B%E6%BC%94%E7%BE%A9
    283 /?format=pdf&lang=zh&page=%E7%9B%A4%E5%8F%A4%E8%87%B3%E5%94%90%E8%99%9E%E5%82%B3
    274 /?format=pdf&lang=zh&page=%E7%9B%8A%E6%99%BA%E9%8C%84
    271 /?format=pdf&lang=zh&page=%E7%8E%8B%E9%99%BD%E6%98%8E%E9%9D%96%E4%BA%82%E9%8C%84
    269 /?format=pdf&lang=zh&page=%E6%B0%B4%E6%BB%B8%E5%BE%8C%E5%82%B3
    268 /?format=pdf&lang=zh&page=%E8%8B%B1%E7%83%88%E5%82%B3
    264 /?format=pdf&lang=zh&page=%E5%A4%A7%E6%98%8E%E5%A5%87%E4%BF%A0%E5%82%B3
    248 /?format=pdf&lang=it&page=Canti_(Leopardi_-_Donati)%2FXXXIV._La_ginestra
    241 /?format=pdf&lang=zh&page=%E6%8B%AC%E7%95%B0%E8%AA%8C
```
Anyway, my idea for the job queue is something like this:
Ideally we'd write the files to the new object store, so they'd be accessible by any server. That would mean we could run multiple worker servers, each consuming as many messages as it can, and the web server would always fetch the final files from the same place.
(I'm sure there's more to this, but does that make sense roughly?)
Draft PR here: https://github.com/wikimedia/ws-export/pull/489
The parts I'm not really sure about are:
The queue processes one book at a time, so if we want to have parallel jobs (which we do) then we'll set up multiple processing workers with systemd.
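The multiple-workers idea could look something like the following systemd template unit; the unit name, paths, and queue name are all assumptions, not the real deployment:

```ini
# /etc/systemd/system/ws-export-worker@.service (sketch)
[Unit]
Description=WS Export queue worker %i

[Service]
# --time-limit makes each worker restart periodically, picking up new code
# and releasing any leaked memory.
ExecStart=/usr/bin/php /var/www/ws-export/bin/console messenger:consume ebooks --time-limit=3600
Restart=always

[Install]
WantedBy=multi-user.target
```

Then something like `systemctl enable --now ws-export-worker@{1..4}.service` would give four parallel consumers.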
The controller loop that waits for the book to be generated. Maybe it should do something better.
Is there a way to get "acks" when a message is processed? It seems it might be possible with "stamps", but I'm not sure whether it properly blocks or just returns the current stamps: https://symfony.com/doc/current/messenger.html#getting-results-from-your-handlers
If there are multiple requests for the same book, the first one gets the result. It looks like the fix for this is to add a request ID into the filename hash, but the other way is to not delete the file after download. For the latter, we'd want to touch the file on download and set up a stale-file deletion job (which we already do for Calibre tmp files, so it wouldn't be hard).
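The touch-on-download plus stale-file-deletion idea is simple enough to sketch; the cache location, age limit, and function names here are all hypothetical:

```python
# Sketch of a stale-file cleanup for cached ebooks: bump the mtime whenever
# a book is downloaded, and periodically delete anything untouched for too
# long. Paths and constants are illustrative, not the real WS Export setup.
import os
import time
from pathlib import Path

CACHE_DIR = Path("/tmp/ws-export-cache")  # hypothetical location
MAX_AGE = 24 * 3600                        # delete files untouched for a day

def touch_on_download(path: Path) -> None:
    """Bump the mtime so a popular book stays cached."""
    os.utime(path, None)

def delete_stale(cache_dir: Path = CACHE_DIR, max_age: int = MAX_AGE) -> list[Path]:
    """Remove cached files whose mtime is older than max_age seconds."""
    cutoff = time.time() - max_age
    removed = []
    for f in cache_dir.glob("*"):
        if f.is_file() and f.stat().st_mtime < cutoff:
            f.unlink()
            removed.append(f)
    return removed
```

In practice this would run from a cron job or systemd timer, just like the existing Calibre tmp-file cleanup.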
Yes, it looks like a great idea! With that we might also avoid job submission if a file is already ready for the same book. But then we enter into the cache invalidation problem.
> But then we enter into the cache invalidation problem.
Indeed. The only way to test a text for export is by exporting it. Your first instinct when that export is broken is to (tweak the wikitext and) try the download button again. If there's last-accessed caching, there needs to be some fairly obvious or discoverable way to force regeneration.
Cached corrupted data was what downed phetools OCR back ca. 2019 (or whenever it was). Making sure such problems will resolve themselves over time is a very good idea.
I totally agree that it should be possible to re-generate a book absolutely whenever wanted, but we also want to not make that the default because most readers (I assume!) don't mind if the book they get was generated just now or a day ago. The existing nocache=1 parameter will be passed through as part of the job and obeyed accordingly, for people who do need the latest.
(Talking of exporting for testing purposes, I still want to work on T258909: Add direct display of epubcheck (validation) output, which would bypass the cache by default.)
I'm starting to wonder if it would be better here not to use Symfony's Messenger system at all, and just create our own database table for storing job info. That way, the job runner could update that table with the status (or error info), and the record about the job could persist after the book has been generated, serving as our means of determining whether it should be re-generated, etc.
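A home-grown job table might look something like this (all names are illustrative, and the unique key is what would let us both deduplicate simultaneous requests and answer "is there already a fresh copy?"):

```sql
-- Sketch of a job-tracking table; columns and names are assumptions.
CREATE TABLE export_job (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    lang VARCHAR(20) NOT NULL,
    title VARCHAR(255) NOT NULL,
    format VARCHAR(20) NOT NULL,
    status ENUM('pending', 'running', 'done', 'failed') NOT NULL DEFAULT 'pending',
    error_message TEXT DEFAULT NULL,
    file_path VARCHAR(255) DEFAULT NULL,
    created_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL,
    UNIQUE KEY job_key (lang, title, format)
);
```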
I don't think passing the generation to a background process and checking in a loop every second whether it was generated would be the proper way.
I would continue creating it in the main process, but use a PoolCounter to limit the number of exports that may run simultaneously, similar to what we do with wikitext.
This would also handle waiting and simultaneous requests for the same book.