
Store generated ebooks
Open, Needs Triage · Public

Description

A few times (T253283, T265660, etc.) we've discussed caching the fully generated ebooks. So far, we've settled on only caching some components of the books, e.g. API request results. But there are still many issues with the tool's stability and the speed of rendering. To improve these, we should investigate storing the generated ebooks forever (and refreshing them on demand).

We may still want to move to a job queue (T345406), and so having books stored in a place that can be written to by multiple independent instances will be useful.

  • Use Cloud VPS object storage
  • Use a library such as Flysystem (via league/flysystem-bundle) that can use the local filesystem in development and test, and Swift (which is S3 compatible) for the public tool
  • Store in a single container, with paths matching the book's wiki page and prefixed with the subdomain of the Wikisource (using mul for wikisourge.org)
  • Add a new database table, e.g. books_stored { lang, title, format, images, credits, font, generated_time, start_time, last_accessed } with a unique constraint on lang, title, format, images, credits, font.
  • When a book is requested, if it isn't already in the table, add it. If it is already there, update last_accessed and redirect the user to download it.
  • A continuous Toolforge job (or more than one) will run; when it finds a row with no generated_time, it will claim that job by setting its start_time.
  • The book will be generated and stored in the object storage, and the generated_time updated.
  • The original request will wait for a short amount of time (~10 sec) in case the book is generated quickly (which in most cases it will be).
  • Subsequent requests for the same book are idempotent (other than updating the last_accessed), and will return the book once it's been generated.
  • If no book is generated in time, a 202 response will be returned.
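The request/worker protocol above can be sketched as follows. This is a hedged illustration only: the real tool is PHP/Symfony, so this Python sketch uses an in-memory SQLite database as a stand-in for ToolsDB, and the function names (`request_book`, `worker_tick`) and HTTP-style return codes are hypothetical.

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE books_stored (
        lang TEXT, title TEXT, format TEXT,
        images INTEGER, credits INTEGER, font TEXT,
        generated_time REAL, start_time REAL, last_accessed REAL,
        UNIQUE (lang, title, format, images, credits, font)
    )
""")

# WHERE clause matching the unique constraint from the description.
KEY = "lang = ? AND title = ? AND format = ? AND images = ? AND credits = ? AND font = ?"

def request_book(lang, title, fmt, images=1, credits=1, font="", wait=10):
    """Handle a download request: enqueue the book if new, then wait briefly."""
    key = (lang, title, fmt, images, credits, font)
    # Idempotent enqueue: the unique constraint makes repeat requests no-ops.
    db.execute(
        "INSERT OR IGNORE INTO books_stored"
        " (lang, title, format, images, credits, font, last_accessed)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)", key + (time.time(),))
    db.execute(f"UPDATE books_stored SET last_accessed = ? WHERE {KEY}",
               (time.time(),) + key)
    deadline = time.time() + wait
    while True:
        row = db.execute(
            f"SELECT generated_time FROM books_stored WHERE {KEY}", key).fetchone()
        if row[0] is not None:
            return 302  # redirect the client to the stored file
        if time.time() >= deadline:
            return 202  # still generating; the client should retry later
        time.sleep(0.1)

def worker_tick():
    """One pass of a continuous job: claim an unstarted row and 'generate' it."""
    row = db.execute(
        "SELECT rowid FROM books_stored WHERE start_time IS NULL LIMIT 1").fetchone()
    if row is None:
        return False
    # Claim the job by setting start_time, as per the design above.
    db.execute("UPDATE books_stored SET start_time = ? WHERE rowid = ?",
               (time.time(), row[0]))
    # ... generate the ebook and upload it to object storage here ...
    db.execute("UPDATE books_stored SET generated_time = ? WHERE rowid = ?",
               (time.time(), row[0]))
    return True
```

A first request with a short wait returns 202; once a worker pass has run, the same request returns 302 and only touches last_accessed.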

Cleaning up:

  • Dead jobs can be identified and purged, based on having a too-old start_time.
  • Unused stored files can be deleted based on last_accessed being older than some threshold.

Outstanding questions:

  • Store EPUBs only, to reduce how much gets stored? No — we can store everything for now, and revisit if it gets too big.
  • How much will be stored?
  • Reduce the number of things that can split the cache — currently the options that change the output (other than format) are: images, fonts, and the credits page (we could do away with the last of these via T276672). The tool also has an option to include categories, but that's only used in the AtomGenerator so doesn't apply here.
  • Does Swift have a way of evicting the oldest objects (if we, e.g., update the modification time on access)? If we store last_accessed in the database, we can evict based on that.
  • For access, do we stream the file through the web server, or redirect the client to the (non-stable) public bucket URL (i.e. https://object.eqiad1.wikimediacloud.org/PROJECT:BUCKET/…)? Redirect to the latter, and make it clear that people shouldn't treat that URL as stable. This also means that we need to make the filename as stored there be a useful one for people to end up downloading, because we can't set a Content-Disposition header to define a filename.
  • How to provide bulk download of all books (by language)? Looks likely to be possible based on the above design. Details left for a separate task, if anyone actually asks for it (which they haven't yet).
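Given the redirect-to-bucket answer above, the stored filename has to double as the download filename. A sketch of the path scheme described in this task — the function name and the exact subdomain mapping (treating an empty or "www" language as the multilingual "mul") are assumptions:

```python
def storage_path(lang: str, title: str, fmt: str) -> str:
    """Build the object-storage key for a stored book.

    Books are keyed by wiki page name, prefixed with the Wikisource
    subdomain ("mul" for the multilingual wikisource.org), and end with a
    real extension so the public URL yields a sensible download filename
    even though we can't set a Content-Disposition header.
    """
    subdomain = "mul" if lang in ("", "mul", "www") else lang  # assumed mapping
    return f"{subdomain}/{title.replace(' ', '_')}.{fmt}"
```

For example, `storage_path("la", "Lorem Ipsum", "epub")` gives `la/Lorem_Ipsum.epub`, which also satisfies the extension requirement raised in the comments below for Kindle downloads.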

Details

Related Changes in GitLab:
Draft: Add job queue database and processing · toolforge-repos/wsexport!2 · samwilson · stored-books → main

Event Timeline

Restricted Application added a subscriber: Aklapper.
  1. Consider flushing outdated books from the cache periodically, probably based on a query against RecentChanges and the dependency tables (outdated = the pages the book depends on were modified by users after the book was generated).
  2. Kindles refuse to download an ebook if its base URL does not end with a supported extension; it would be nice to support that if technically possible.

That's an interesting point about Kindle wanting a foo.epub URL. Can you raise that as a separate task? It shouldn't be too hard.

For flushing based on RC, that's a good idea, but it could be tricky: for example, we don't have a solid way to map from Page NS edits to main NS, so an edit to Page:Lorem.pdf/23 would need to somehow purge the stored Lorem_Ipsum/Vol_1 book. The usual problem with this sort of thing is that by the time you've determined something should be purged from the cache, you could have already rebuilt the new value — so you might as well just wait until it's requested!

I've started a PR for this: https://github.com/wikimedia/ws-export/pull/546 (not ready for review yet, but ideas very welcome). One oddity (described as "t.b.d." in this task's description) is that when you request a book from e.g. /?page=Lorem&lang=la it currently will send you the EPUB, but with this patch it'll a) submit a job; b) wait for up to 10 seconds; and if the job queue is processed fast enough it'll send the EPUB within that time, but if not then c) any subsequent request will try the same and once the book is available it'll send it. This works fine when the queue is being processed fast enough, and we can add multiple queue processing servers to speed it up, and drop things that have been queued for too long. But yeah, open to ideas!