
Wikisource Ebooks: Investigate cache generated ebooks [8H]
Closed, Resolved · Public · Oct 21 2020

Description

As a Wikisource user, I would like the potential benefits and options related to cached generated ebooks (in the server) to be investigated, so it can be determined a) how this can be done, b) how much of an improvement a user could see, and c) the level of time/complexity that such work requires.

Background: With this work, we can make it so that, if someone downloads Book A and then someone else wants to download Book A, the generated ebook is already cached. This task is a Phabricator placeholder for https://github.com/wsexport/tool/issues/38, which states: "It would be nice if you cached the files you produce for at least some days, as [[mw:OCG]] / Collection does. Not only that could save some processing time, but most importantly we could serve books faster to our users." This might be as simple as caching the full generated epub/pdf/etc. files, or might involve caching various parts of the ebook construction process (e.g. contributor fetching). Please refer to the Github link for more details. This could be helpful in many cases, such as when many people are downloading featured books of the month, or when many people are downloading common/popular books that are always present on a wiki (such as on Bengali Wikisource), among other cases. You can see data on recently popular ebook downloads on the Wikisource Stats page.

Acceptance Criteria:

  • Read the relevant discussion on Github to receive the full context (see Samwilson's comment below for the discussion)
  • Investigate what we can cache to improve ebook export reliability
  • Investigate for how long we can have this cache in the server, generally speaking
  • Investigate the primary work that would need to be done in order to cache generated ebooks
  • Investigate the main challenges, risks, and dependencies associated with such work
  • Investigate if/how we could give users the option of skipping the cache
  • Provide a general estimate/idea, if possible, of the potential impact it may have on ebook export reliability
  • Provide a general estimate/rough sense of the level of difficulty and effort required to do such work
  • Think about storage options and disk space

Details

Due Date
Oct 21 2020, 4:00 AM

Event Timeline

Restricted Application added a project: Community-Tech. · View Herald Transcript · May 10 2019, 4:28 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Samwilson updated the task description. (Show Details) · May 10 2019, 5:31 AM
ifried renamed this task from Cache generated ebooks to Wikisource Ebooks: Investigate cache generated ebooks. · Jun 11 2020, 10:48 PM
ifried added a project: Wikisource.
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details) · Jun 11 2020, 10:57 PM
ifried updated the task description. (Show Details) · Jun 16 2020, 9:21 PM
ARamirez_WMF renamed this task from Wikisource Ebooks: Investigate cache generated ebooks to Wikisource Ebooks: Investigate cache generated ebooks [8H]. · Jun 19 2020, 12:10 AM
ifried updated the task description. (Show Details) · Aug 28 2020, 9:36 PM
dmaza updated the task description. (Show Details) · Sep 22 2020, 6:50 PM
ifried updated the task description. (Show Details) · Sep 24 2020, 3:37 PM
ifried updated the task description. (Show Details)
ifried added a subscriber: ifried. · Sep 24 2020, 6:17 PM

@Samwilson Hey there! Do you know where the conversation in https://github.com/wsexport/tool/issues/38 turned up in Phabricator during the Github to Phab migration?

ifried updated the task description. (Show Details) · Sep 24 2020, 6:18 PM

I'm not sure, but here's the full comment thread.


Cache produced files #38

nemobis opened this issue on 13 Nov 2014 · 12 comments

nemobis commented on 13 Nov 2014

It would be nice if you cached the files you produce for at least some days, as [[mw:OCG]] / Collection does. Not only that could save some processing time, but most importantly we could serve books faster to our users.
Currently, the waiting times are often rather extreme...

samwilson commented on 13 Nov 2014

Sounds like a good idea.

It'd be nice to be able to override the cache as well though, for when one is trying to debug some issue with a book's HTML and how it converts to epub.

Aubreymcfato commented on 29 May 2015

This is a great idea. The ideal system, to me, would be for the tool to "know" when a book has been modified (e.g. a page proofread or corrected) so that the cache can be overridden.

jberkel commented on 17 Feb 2016

Right now the slowest part of the book generation is fetching the credit information for all the pages because it needs to go through all of the individual source pages. If this could be cached it would already speed things up considerably.

A more general caching solution would be as @Aubreymcfato suggested to keep generated books with the timestamp of the most recent page, and then query the API for more recent modifications of pages. If none exist, serve the cached version.

samwilson commented on 10 May 2019

At the moment we query https://tools.wmflabs.org/phetools/credits.py to get the names of contributors. That script queries the database directly. We send batches of 50 page titles at once; this could probably be increased, to reduce the number of requests.

I think this script is used rather than the API, in order to save an API call — because you'd first have to request the contributors for a given set of pages, and then find the groups of those users (so we can sort bots to the bottom).

As for determining a 'last modified' date for a work... does anyone have any ideas about that? It seems that we'd need to traverse the whole book (all subpages and Page NS pages) to see if any were newer than when we saw them last, and we would still have to parse each page as we go in order to find the links to the other pages in the book. By the end of that, we've done the whole book and so knowing that it's the same as last time doesn't really help. :)

The simplest thing might just be to cache the entire final generated file for an hour (and add a new nocache parameter that can be used while debugging a book).

jberkel commented on 10 May 2019

oh, completely forgot about this ticket…

@Samwilson agreed, we can give this a try, in practice the books probably don't change much after proofreading. and it's very expensive to get the last modified date, and it's not just content pages but in theory also templates and modules that influence the book rendering. it's really a whole graph of dependencies.

samwilson commented on 11 May 2019

Gosh, yeah I forgot about the chain of dependent pages! Too hard. :-)

Okay then, let's:

  • Increase the number of pages in each batch of credits-getting; and
  • Store every generated file for a minimum of 1 hour, with a nocache parameter permitted (which would also be an additional form field, I guess).

jberkel commented on 11 May 2019

To make things really fast we could also directly connect to the database to get the credit information. I'm not sure what the query actually does, and can't find the sources of phetools at the moment, it's not listed on toolforge.

samwilson commented on 11 May 2019

I was thinking that too. It'd go well with this idea of storing the logs in the tools' database (I mean, both would require wsexport to have a MariaDB database connection).

The source for the phetools script is here and it queries both the Wikisource of the book and Commons for the contributors of images. From a cursory look, I think the queries could be made more efficient, and anyway doing them directly in wsexport is going to be faster than the extra HTTP requests to get their results.

jberkel commented on 11 May 2019

Just had a quick look at the credits.py: I don't understand why we need to pass in all the pages when the service could just take a book parameter and compute everything in one single query?

It does accept a book name and there is some logic to get the pages for a book title (get_credit.py#L81) but it doesn't seem to work properly:

curl 'https://tools.wmflabs.org/phetools/credits.py?lang=en&cmd=history&book=Una_and_the_Lion&format=json'

returns nothing.

samwilson commented on 11 May 2019

Yeah, I guess there's a bug with the book parameter. Passing the pages in works, e.g.

curl 'https://tools.wmflabs.org/phetools/credits.py?lang=en&cmd=history&page=Una_and_the_Lion&format=json'

I think the reason for passing pages (including Page NS ones) is that then only contributors to the actual pages used in the book are listed. We don't currently make the assumption that every subpage is part of the exported book, nor that every Page NS page of an Index is transcluded. I do wonder if actually we could make these assumptions, and err on the side of including too many contributors' names, because doing that might allow a much more efficient query.

I might look at adding credits-querying to the wikisource/api PHP library, then using that in wsexport.

jberkel commented on 11 May 2019

Moving the credit logic into wsexport is an option but I think it would be preferable to have an external service handle this efficiently.

I don't understand how the PHP library would make things faster, as you'd still have to make many separate queries?

samwilson commented on 5 Aug

The topics here are now being tracked on Phabricator.

ARamirez_WMF set Due Date to Oct 7 2020, 4:00 AM. · Sep 25 2020, 9:27 PM
ARamirez_WMF changed the subtype of this task from "Task" to "Deadline".
ifried updated the task description. (Show Details) · Oct 1 2020, 3:13 PM
ifried updated the task description. (Show Details) · Oct 1 2020, 3:21 PM
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details) · Oct 1 2020, 4:36 PM

For some quick numbers, I looked at the month of September 2020:

Number of exports:

SELECT COUNT(*) FROM books_generated WHERE MONTH(time) = 9 AND YEAR(time) = 2020;

Result: 85536

Unique exports (title + lang + format):

SELECT COUNT(*) FROM (SELECT DISTINCT lang, title, format FROM books_generated WHERE MONTH(time) = 9 AND YEAR(time) = 2020) a;

Result: 37979

Unique exports (title + lang):

SELECT COUNT(*) FROM (SELECT DISTINCT lang, title FROM books_generated WHERE MONTH(time) = 9 AND YEAR(time) = 2020) a;

Result: 30380

This suggests re-exporting of the same work is very common. The difference between formats is comparatively small, and since the format conversion isn't the slow part anyway, it's reasonable to focus on caching just the epubs.
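As a rough upper bound on the benefit, the September 2020 numbers above imply these cache hit rates, assuming an unlimited cache lifetime (a real 10-minute cache would hit less often):

```python
# Upper-bound cache hit rates implied by the September 2020 query results above.
total_exports = 85_536
unique_title_lang_format = 37_979
unique_title_lang = 30_380

hit_rate_per_format = (total_exports - unique_title_lang_format) / total_exports
hit_rate_per_title = (total_exports - unique_title_lang) / total_exports

print(f"per title+lang+format: {hit_rate_per_format:.1%}")  # ≈ 55.6%
print(f"per title+lang:        {hit_rate_per_title:.1%}")   # ≈ 64.5%
```

So at most a bit over half of exports could have been served from a format-keyed cache.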

I examined random samples of the access logs, and indeed there are times when a single user exports the same book over and over. Whether or not this is bot traffic is, I think, outside the scope of this investigation. Let's just assume all traffic is human for now (and even if some is from bots, perhaps we don't care, because serving from the cache is cheap). That said, my inclination is to cache for a very brief amount of time, say just 10 minutes.


Inline answers to the acceptance criteria:

Investigate what we can cache to improve ebook export reliability

It was said in T222936#6493260 that fetching credit information is the slowest part. That is the easiest thing to cache (using Symfony's Cache Component, preferably via the APCu adapter), so we could just do that. Then we don't need to worry about storage concerns.
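The real implementation would use Symfony's Cache Component in PHP; as a language-neutral sketch of the same get-or-compute pattern (all names and the sample data are hypothetical), in Python:

```python
import time

class TtlCache:
    """Minimal in-memory TTL cache, analogous to what an APCu-backed
    Symfony cache pool would provide."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.time()
        entry = self.store.get(key)
        if entry and entry[0] > now:
            return entry[1]          # cache hit: skip the expensive work
        value = compute()            # cache miss: fetch/compute the value
        self.store[key] = (now + self.ttl, value)
        return value

# Hypothetical usage: cache a credits lookup for 10 minutes.
credits_cache = TtlCache(ttl_seconds=600)
credits = credits_cache.get_or_compute(
    "credits:en:Una_and_the_Lion",
    lambda: {"Samwilson": 12},  # stand-in for the real phetools query
)
```

The cache key would encode everything the query depends on (language, title, namespace set), just as Symfony cache item keys do.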

Caching entire epubs on disk I think is totally doable too, though. We could use a cron to regularly delete stale files.

Investigate for how long we can have this cache in the server, generally speaking

Based on what I'm seeing, I think caching for a brief time will bring benefits and won't cost us much. We can always experiment with the cache duration and see what works best, but I would think starting with a very brief time is best (say 10 minutes). This also means there is less concern about serving out-of-date information.

Investigate the primary work that would need to be done in order to cache generated ebooks

For caching SQL queries (the credits information), it's fairly simple using Symfony's Cache Component. XTools and Event Metrics both have extensive examples of how to do this (and in fact we don't even need the Symfony framework to use the Cache component on its own).

For caching epubs on disk -- my first thought is to store each file with a unique name, say [title]-[lang]-[font].[format] (similar to the cache key we'd use for SQL queries). A cron job would run every 10 minutes and delete files more than 10 minutes old. When a request comes in, we simply check whether the file exists: if so, serve it; otherwise generate a new one.
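A minimal sketch of that scheme (the cache directory, function names, and the generate callback are all hypothetical; the real tool is PHP):

```python
import os
import time

TTL = 600  # 10 minutes, matching the cron interval described above

def cache_path(cache_dir, title, lang, font, fmt):
    # Unique filename per export, e.g. Una_and_the_Lion-en-freeserif.epub
    return os.path.join(cache_dir, f"{title}-{lang}-{font}.{fmt}")

def get_ebook(cache_dir, title, lang, font, fmt, generate, ttl=TTL):
    """Serve the cached file if it exists and is fresh; otherwise regenerate."""
    path = cache_path(cache_dir, title, lang, font, fmt)
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < ttl:
        return path                # fresh cached copy
    data = generate()              # expensive: build the ebook
    with open(path, "wb") as f:
        f.write(data)
    return path

def prune_stale(cache_dir, ttl=TTL):
    """What the periodic cron job would do: delete files older than the TTL."""
    cutoff = time.time() - ttl
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
```

Using the file's mtime as the freshness marker means no extra bookkeeping beyond the files themselves.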

Investigate the main challenges, risks, and dependencies associated with such work

Caching SQL queries -- very low risk, medium-high potential for gain.

Caching epubs -- probably not very risky, unless we increase the cache time, in which case we need to make sure we don't fill up the disk too quickly. Even with, say, an hour-long cache, we'll probably be fine with our current VPS instance. Going much longer might mean we need a larger instance, a separate instance just for storage, etc. I think we should avoid that unless we're really sure it's worth our while.

Investigate if/how we could give users the option of skipping the cache

Our endpoint for exporting could take a new query string parameter, say by passing in ?purge=1. How we expose this option to the user on the wiki is outside the scope of this investigation, I think. It may be that we'll want to keep it a "secret", sort of like MediaWiki does with action=purge. Ideally, it should be very rare that you'd want/need to do this unless you're testing something. But regardless, if our caching time is very brief (10 minutes), then out-of-date information shouldn't be much of a concern anyway.
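The bypass logic itself is tiny; a sketch (handler and callback names hypothetical):

```python
def handle_export(params, read_cache, regenerate):
    """Hypothetical request handler: ?purge=1 skips the cache read,
    mirroring how MediaWiki's action=purge forces a fresh render."""
    if params.get("purge") == "1":
        return regenerate()          # bypass: always rebuild
    cached = read_cache()            # returns None on a cache miss
    return cached if cached is not None else regenerate()
```

Note the purged result would still be written to the cache, so a purge also refreshes the copy served to everyone else.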

Provide a general estimate/idea, if possible, of the potential impact it may have on ebook export reliability
Provide a general estimation/rough sense of the level of difficulty of effort required in doing such work

Brief, 10-minute caching by itself could reduce a lot of strain and cost us very little, especially if we just cache the credit information.


Now, if we did want long-term caching, I'd like to respond to this point:

A more general caching solution would be, as @Aubreymcfato suggested, to keep generated books with the timestamp of the most recent page, and then query the API for more recent modifications of pages. If none exist, serve the cached version.

Indeed if we cache for a long period, we'd need to ensure we bypass the cache automatically when changes are made. For this, we could go by the page.page_touched field. This timestamp represents when MediaWiki's page cache was last purged, which is even better than looking for the last edits because it takes into account changes to templates that the pages might use. I ran a test query to get the maximum touched timestamp for an entire work (https://en.wikisource.org/wiki/Philochristus , ~34 pages):

SELECT MAX(page_touched) FROM enwikisource_p.page WHERE page_title LIKE 'Philochristus/%' AND page_namespace = 0;

The result came back in 0.24 seconds, and should be only a little slower for a book with many hundreds of pages. It's certainly much, much faster than using the API. The only problem is that the replicas are sometimes lagged, but we're apparently using the replicas for credit information anyway, so I'm guessing this would be acceptable.
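The validity check that would sit on top of that query is simple: compare the time the cached ebook was generated with the maximum page_touched value. A sketch, assuming the query result arrives as a MediaWiki binary(14) timestamp:

```python
from datetime import datetime

def cache_is_valid(cached_at, max_page_touched):
    """A long-lived cache entry stays valid only if no page of the work has
    been touched since the ebook was generated. max_page_touched is the
    MAX(page_touched) result, a MediaWiki timestamp like b'20201001123456'."""
    touched = datetime.strptime(max_page_touched.decode(), "%Y%m%d%H%M%S")
    return cached_at >= touched

# Hypothetical usage: ebook generated Oct 2, work last touched Oct 1 -> valid.
cached_at = datetime(2020, 10, 2)
print(cache_is_valid(cached_at, b"20201001123456"))  # True
print(cache_is_valid(cached_at, b"20201003000000"))  # False: touched after caching
```

Because page_touched is bumped by template changes too, this sidesteps the dependency-graph problem discussed earlier, at least for on-wiki dependencies.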

I think it may also be good to see renewed interest before considering any long-term caching. Many of the comments at T222936#6493260 predate the other reliability work we've done on Wikisource Export. In my experience, most works export pretty quickly.


Summary

Based on everything I've read, caching the credit information for a brief amount of time (10 minutes) is a good starting point. It's easy to implement, should work seamlessly on our current VPS (no need for more disk space), and the short cache period means we will be serving a version of the work that's no more than 10 minutes old. Implementing this caching can be done with or without T257886.

We can instead cache the epubs themselves, provided it is also for a brief time (10 mins). This will require a little more work, though.

There are definitely some works that are especially popular (A Simplified Grammar of the Swedish Language, for example), but according to the access logs these are usually a series of rapid export requests by the same user, further suggesting that long-term caching won't actually help that much.

Bypassing the cache can be done via a query string parameter (e.g. purge=1).

ARamirez_WMF changed Due Date from Oct 7 2020, 4:00 AM to Oct 21 2020, 4:00 AM. · Oct 8 2020, 10:09 PM

Task for caching credits info: T265660

ifried closed this task as Resolved. · Dec 15 2020, 11:31 PM

We have proceeded with the cache work in T265660, so I'm now marking this investigation as Done.