A few times (T253283, T265660, etc.) we've discussed caching the fully generated ebooks. So far, we've settled on only caching some components of the books, e.g. API request results. But there are still many issues with the tool's stability and the speed of rendering. To improve these, we should investigate storing the generated ebooks forever (and refreshing them on demand).
We may still want to move to a job queue (T345406), and so having books stored in a place that can be written to by multiple independent instances will be useful.
- Use Cloud VPS object storage
- Use a library such as Flysystem (via league/flysystem-bundle) that can use the local filesystem in development and test, and Swift (which is S3-compatible) for the public tool
- Store in a single container, with paths matching the book's wiki page and prefixed with the subdomain of the Wikisource (using mul for wikisource.org)
- Add a new database table, e.g. books_stored { lang, title, format, images, credits, font, generated_time, start_time, last_accessed } with a unique constraint on lang, title, format, images, credits, font.
- When a book is requested, if it isn't already in the table, add it. If it is already there and has been generated, update last_accessed and redirect the user to download it.
- One or more continuous Toolforge jobs will run; when there's a row with no generated_time, a job will claim it by setting its start_time.
- The book will be generated and stored in the object storage, and the generated_time updated.
- The original request will wait for a short amount of time (~10 sec) in case the book is generated quickly (which in most cases it will be).
- Subsequent requests for the same book are idempotent (other than updating the last_accessed), and will return the book once it's been generated.
- If no book is generated in time, a 202 response will be returned.
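As a rough sketch of the table and the request/claim workflow described above (SQLite and Python are used here purely for illustration; the column types, the actual DBMS, and the function names are all assumptions, not decisions):

```python
import sqlite3
import time

# Hypothetical schema matching the proposed books_stored table. Types are
# guesses; images and credits are treated as booleans stored as integers.
SCHEMA = """
CREATE TABLE books_stored (
    lang TEXT NOT NULL,
    title TEXT NOT NULL,
    format TEXT NOT NULL,
    images INTEGER NOT NULL,
    credits INTEGER NOT NULL,
    font TEXT NOT NULL,
    generated_time INTEGER,
    start_time INTEGER,
    last_accessed INTEGER NOT NULL,
    UNIQUE (lang, title, format, images, credits, font)
);
"""

def request_book(db, lang, title, fmt, images, credits, font):
    """Web side: insert the row if new, else touch last_accessed.
    Returns True only once the book has been generated."""
    now = int(time.time())
    key = (lang, title, fmt, images, credits, font)
    row = db.execute(
        "SELECT generated_time FROM books_stored WHERE lang=? AND title=?"
        " AND format=? AND images=? AND credits=? AND font=?", key).fetchone()
    if row is None:
        db.execute(
            "INSERT INTO books_stored"
            " (lang, title, format, images, credits, font, last_accessed)"
            " VALUES (?, ?, ?, ?, ?, ?, ?)", key + (now,))
        return False
    db.execute(
        "UPDATE books_stored SET last_accessed=? WHERE lang=? AND title=?"
        " AND format=? AND images=? AND credits=? AND font=?", (now,) + key)
    return row[0] is not None

def claim_next_job(db):
    """Worker side: claim one ungenerated, unclaimed row by setting start_time."""
    now = int(time.time())
    row = db.execute(
        "SELECT rowid FROM books_stored"
        " WHERE generated_time IS NULL AND start_time IS NULL LIMIT 1").fetchone()
    if row is None:
        return None
    db.execute("UPDATE books_stored SET start_time=? WHERE rowid=?", (now, row[0]))
    return row[0]
```

After generating and uploading the book, the worker would set generated_time on the claimed row, at which point request_book starts returning True and the web side can redirect to the stored file.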
Cleaning up:
- Dead jobs can be identified and purged based on a start_time that is older than some threshold.
- Unused stored files can be deleted based on last_accessed being older than some threshold.
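The two cleanup rules above could be a single periodic pass, something like the following (the thresholds are placeholder values, and a real implementation would also delete the corresponding files from object storage):

```python
import sqlite3
import time

# Placeholder thresholds; the actual values are still to be decided.
DEAD_JOB_AGE = 60 * 60        # assume a job still unfinished after an hour is dead
UNUSED_FILE_AGE = 90 * 86400  # assume 90 days without a download means unused

def clean_up(db: sqlite3.Connection) -> None:
    now = int(time.time())
    # Purge dead jobs: claimed (start_time set) long ago but never generated.
    db.execute(
        "DELETE FROM books_stored"
        " WHERE generated_time IS NULL AND start_time < ?",
        (now - DEAD_JOB_AGE,))
    # Purge unused stored books; the Swift object would be deleted here too.
    db.execute(
        "DELETE FROM books_stored"
        " WHERE generated_time IS NOT NULL AND last_accessed < ?",
        (now - UNUSED_FILE_AGE,))
```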
Outstanding questions:
- Store epubs only, in order to reduce how much gets stored? No, we can store everything for now, and revisit if it gets too big.
- How much will be stored?
- Reduce the number of things that can split the cache. Currently the options that will change the output (other than format) are: images, fonts, and the credits page (we could do away with the last of these with T276672). The tool also has the option of including categories, but that's only used in the AtomGenerator so doesn't apply here.
- Is there a Swift system for evicting the oldest files (if we e.g. update the modification time when accessing)? If we store last_accessed in the database, we can evict based on that.
- For access, do we stream the file through the web server, or redirect the client to the (non-stable) public bucket URL (i.e. https://object.eqiad1.wikimediacloud.org/PROJECT:BUCKET/…)? Redirect to the latter, and make it clear that people shouldn't treat that URL as stable. This also means that we need to make the filename as stored there be a useful one for people to end up downloading, because we can't set a Content-Disposition header to define a filename.
- How to provide bulk download of all books (by language)? Looks likely to be possible based on the above design. Details left for a separate task, if anyone actually asks for it (which they haven't yet).
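Given the single-container layout and the need for download-friendly filenames, the object key construction might look something like this (the exact scheme is an assumption, and how the images/credits/font variants are distinguished in the key is left open here):

```python
# Sketch of the proposed key layout: keys are prefixed with the Wikisource
# subdomain ('mul' standing in for wikisource.org) and end in a filename that
# is sensible when downloaded directly, since we can't set a
# Content-Disposition header on the public bucket URL.
def storage_key(lang: str, title: str, fmt: str) -> str:
    subdomain = lang or "mul"  # empty/multilingual falls back to 'mul'
    filename = title.replace(" ", "_") + "." + fmt
    return f"{subdomain}/{filename}"
```

For example, storage_key("en", "A Christmas Carol", "epub") gives "en/A_Christmas_Carol.epub", so a user following the redirect ends up with a reasonably named file.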