A few times (T253283, T265660, etc.) we've discussed caching the fully generated ebooks. So far, we've settled on only caching some components of the books, e.g. API request results. But there are still many issues with the tool's stability and the speed of rendering. To improve these, we should investigate storing the generated ebooks forever (and refreshing them on demand).
We may still want to move to a job queue (T345406), and so having books stored in a place that can be written to by multiple independent instances will be useful.
- Use Cloud VPS object storage
- Use a library such as Flysystem (via league/flysystem-bundle) that can use the local filesystem in development and test, and Swift (which is S3 compatible) for the public tool
- Store in a single container, with paths matching the book's wiki page and prefixed with the subdomain of the Wikisource (using mul for wikisource.org); see the path sketch after this list.
- Add a new database table, e.g. books {subdomain, pagename, format, font, images, credits, generated_time, start_time, last_accessed} (actual names t.b.d. of course).
- When a book is requested, if it isn't already in the table, add it; if it is already there, update last_accessed and redirect the user to download it (sketched after this list).
- One or more continuous Toolforge jobs will run; when a row has no generated_time, a job picks it up by setting its start_time (also sketched after this list).
- The book will be generated and stored in the object storage, and the generated_time updated.
- t.b.d. ''(how the web interface finds out when the book is ready)''
- Dead jobs can be identified and purged, based on having a too-old start_time (and no generated_time).
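
To make the storage layout concrete, here's a minimal sketch of the path scheme and the Flysystem setup described above. The option encoding, container name, environment variable names, and region string are all placeholders; in the real tool the adapters would be wired up through league/flysystem-bundle configuration rather than constructed by hand.

```lang=php
<?php
// Sketch only: names here are placeholders, not the final design.

use Aws\S3\S3Client;
use League\Flysystem\AwsS3V3\AwsS3V3Adapter;
use League\Flysystem\Filesystem;
use League\Flysystem\Local\LocalFilesystemAdapter;

/**
 * Object key for a generated book: subdomain, then the wiki page name,
 * then the options that split the cache (format, font, images, credits).
 */
function bookStoragePath(
    string $subdomain,  // 'en', 'fr', …, or 'mul' for wikisource.org
    string $pagename,
    string $format,
    string $font,
    bool $images,
    bool $credits
): string {
    $options = sprintf('%s_%s_img%d_cr%d', $format, $font !== '' ? $font : 'none', $images, $credits);
    return "$subdomain/$pagename/$options";
}

// Local filesystem in development and test, Swift (via its S3-compatible API,
// using league/flysystem-aws-s3-v3) for the public tool.
$storage = ($_ENV['APP_ENV'] ?? 'dev') === 'prod'
    ? new Filesystem(new AwsS3V3Adapter(
        new S3Client([
            'endpoint' => 'https://object.eqiad1.wikimediacloud.org',
            'region' => 'eqiad1', // placeholder
            'version' => 'latest',
            'use_path_style_endpoint' => true,
            'credentials' => ['key' => $_ENV['SWIFT_S3_KEY'], 'secret' => $_ENV['SWIFT_S3_SECRET']],
        ]),
        'ws-export-books' // the single container
    ))
    : new Filesystem(new LocalFilesystemAdapter('/tmp/ws-export-books'));

// After a book has been generated locally, push it into the container.
$path = bookStoragePath('en', 'The_Time_Machine', 'epub-3', 'FreeSerif', true, true);
$storage->writeStream($path, fopen('/tmp/generated-book.epub', 'rb'));
```

The useful property is that the object key is fully determined by the same columns as the proposed books table, so the web front-end and the generator jobs can compute it independently.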
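A rough sketch of the "when a book is requested" step, assuming Doctrine DBAL and the proposed column names, plus the bookStoragePath() helper from the previous sketch. What exactly the web interface does while the book is still being generated is the t.b.d. point above; the 202 placeholder responses here are just one possibility.

```lang=php
<?php
// Sketch of the request flow against the proposed `books` table.

use Doctrine\DBAL\Connection;
use Symfony\Component\HttpFoundation\RedirectResponse;
use Symfony\Component\HttpFoundation\Response;

function handleBookRequest(
    Connection $db,
    string $subdomain,
    string $pagename,
    string $format,
    string $font,
    bool $images,
    bool $credits
): Response {
    $key = [
        'subdomain' => $subdomain,
        'pagename' => $pagename,
        'format' => $format,
        'font' => $font,
        'images' => (int)$images,
        'credits' => (int)$credits,
    ];
    $row = $db->fetchAssociative(
        'SELECT * FROM books
         WHERE subdomain = ? AND pagename = ? AND format = ? AND font = ? AND images = ? AND credits = ?',
        array_values($key)
    );

    if ($row === false) {
        // First request for this book/options combination: queue it for a
        // generator job by inserting a row with no generated_time.
        $db->insert('books', $key + ['last_accessed' => date('Y-m-d H:i:s')]);
        return new Response('Your book is being generated…', Response::HTTP_ACCEPTED);
    }

    $db->update('books', ['last_accessed' => date('Y-m-d H:i:s')], $key);

    if ($row['generated_time'] === null) {
        // A row exists but no job has finished it yet; how the UI waits or
        // polls is the t.b.d. point above.
        return new Response('Your book is still being generated…', Response::HTTP_ACCEPTED);
    }

    // Redirect to the (non-stable) public object URL; PROJECT:BUCKET is the
    // placeholder from the questions below.
    return new RedirectResponse(
        'https://object.eqiad1.wikimediacloud.org/PROJECT:BUCKET/'
        . bookStoragePath($subdomain, $pagename, $format, $font, $images, $credits)
    );
}
```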
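And a sketch of one iteration of the continuous Toolforge job. The claim is a compare-and-set on start_time, so several workers (or a future job queue, T345406) can share the table without generating the same book twice. Here a dead claim is simply released (start_time reset) rather than deleted, and the one-hour cutoff is an arbitrary placeholder.

```lang=php
<?php
// Sketch of one generator-job iteration. $generateBook stands in for the
// existing book-generation code plus the object-storage write shown above.

use Doctrine\DBAL\Connection;

function runOnce(Connection $db, callable $generateBook): void
{
    // Release dead jobs: claimed long ago but never finished, so that they
    // get picked up again on a later iteration.
    $db->executeStatement(
        'UPDATE books SET start_time = NULL
         WHERE generated_time IS NULL AND start_time < DATE_SUB(NOW(), INTERVAL 1 HOUR)'
    );

    // Find a pending book: no generated_time and not currently claimed.
    $row = $db->fetchAssociative(
        'SELECT * FROM books WHERE generated_time IS NULL AND start_time IS NULL LIMIT 1'
    );
    if ($row === false) {
        return; // nothing to do
    }

    $key = [$row['subdomain'], $row['pagename'], $row['format'], $row['font'], $row['images'], $row['credits']];

    // Claim it by setting start_time; if another worker won the race, zero
    // rows are affected and we just wait for the next iteration.
    $claimed = $db->executeStatement(
        'UPDATE books SET start_time = NOW()
         WHERE subdomain = ? AND pagename = ? AND format = ? AND font = ? AND images = ? AND credits = ?
           AND start_time IS NULL',
        $key
    );
    if ($claimed !== 1) {
        return;
    }

    // Generate the ebook, write it to object storage, record the completion time.
    $generateBook($row);
    $db->executeStatement(
        'UPDATE books SET generated_time = NOW()
         WHERE subdomain = ? AND pagename = ? AND format = ? AND font = ? AND images = ? AND credits = ?',
        $key
    );
}
```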
Outstanding questions:
- Store epubs only, in order to reduce how much gets stored?
  - No, we can store everything for now, and revisit if it gets too big.
- How much will be stored?
- Reduce the number of things that can split the cache: currently the options that will change the output (other than format) are images, fonts, and the credits page (we could do away with the last of these with T276672). There is also the option of including categories, but that's only used in the AtomGenerator so doesn't apply here.
- Is there a Swift mechanism for evicting the oldest objects (if we e.g. update the modification time when accessing)?
  - If we store last_accessed in the database, we can evict based on that (sketched below).
- For access, do we stream the file through the web server, or redirect the client to the (non-stable) public bucket URL (i.e. https://object.eqiad1.wikimediacloud.org/PROJECT:BUCKET/…)?
  - Redirect to the latter, and make it clear that people shouldn't treat that URL as stable.
- How to provide bulk download of all books (by language)?
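
For the eviction question, a sketch of what last_accessed-based cleanup could look like. The 90-day cutoff is arbitrary and could equally be driven by total storage size; bookStoragePath() is the path helper sketched earlier.

```lang=php
<?php
// Sketch of eviction based on last_accessed, since Swift won't do this for us.

use Doctrine\DBAL\Connection;
use League\Flysystem\FilesystemOperator;

function evictStaleBooks(Connection $db, FilesystemOperator $storage): void
{
    $stale = $db->fetchAllAssociative(
        'SELECT * FROM books WHERE last_accessed < DATE_SUB(NOW(), INTERVAL 90 DAY)'
    );
    foreach ($stale as $row) {
        // Delete the stored object, then forget the row so the book will be
        // regenerated on the next request.
        $storage->delete(bookStoragePath(
            $row['subdomain'],
            $row['pagename'],
            $row['format'],
            $row['font'],
            (bool)$row['images'],
            (bool)$row['credits']
        ));
        $db->executeStatement(
            'DELETE FROM books
             WHERE subdomain = ? AND pagename = ? AND format = ? AND font = ? AND images = ? AND credits = ?',
            [$row['subdomain'], $row['pagename'], $row['format'], $row['font'], $row['images'], $row['credits']]
        );
    }
}
```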