
Wikisource Ebooks: Investigate job queue for more efficient ebook generation [16H]
Closed, ResolvedPublicNov 4 2020

Description

As a Wikisource user, I want the team to investigate a queue-running system, so it can be determined whether a) such a system would improve reliability to a meaningful degree, and b) the work would be manageable and within scope for the team.

Background: We should investigate adding a queue-running system, similar to what we built for #eventmetrics, for WSExport. This would mean that users would submit a request for an ebook, it'd be added to the queue, and they'd get a status page indicating the progress. The queue would first generate the epub, which would then be available for download, and then it'd generate the derivative forms (PDF etc.) and make those available when done. This would help with errors such as T250614. This system would also effectively give us a cache (e.g. if two people request the same ebook, only one queue process would need to run). Task for that: T222936.

Acceptance Criteria:

  • Investigate the primary work that would need to be done in order to implement a queue-running system, similar to what we built for #eventmetrics, for WSExport.
  • Investigate the main challenges, risks, and possible dependencies associated with implementing such a system
  • Provide a general estimate/idea, if possible, of the potential impact it may have on ebook export reliability.
    • In other words, do we have a strong hunch that this could, indeed, improve reliability (and in a considerable way)? Why or why not?
  • Provide a general estimation/rough sense of the level of difficulty of effort required in doing such work

NOTES:

  • We will also want to see how often multiple people are downloading at once (i.e., how often there are issues that implementing a job queue could address). This could potentially be data we work with Jennifer to get.
  • We will investigate UX questions related to informing users of the status of a potential download (i.e., whether it is almost done, whether there are errors and they should try again, etc.) in a separate investigation. Refer to the T256707 parent task to see relevant design & UX for the ebook export process.

Details

Due Date
Nov 4 2020, 5:00 AM

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ifried renamed this task from Add job queue for more efficient ebook generation to Wikisource Ebooks: Add job queue for more efficient ebook generation.Jun 11 2020, 10:39 PM
ifried renamed this task from Wikisource Ebooks: Add job queue for more efficient ebook generation to Wikisource Ebooks: Investigate job queue for more efficient ebook generation.
ifried updated the task description. (Show Details)
ifried added a project: Wikisource.
ARamirez_WMF renamed this task from Wikisource Ebooks: Investigate job queue for more efficient ebook generation to Wikisource Ebooks: Investigate job queue for more efficient ebook generation [8H}.Jun 19 2020, 12:03 AM
ARamirez_WMF moved this task from To Be Estimated/Discussed to Estimated on the Community-Tech board.
ARamirez_WMF renamed this task from Wikisource Ebooks: Investigate job queue for more efficient ebook generation [8H} to Wikisource Ebooks: Investigate job queue for more efficient ebook generation [8H].Aug 20 2020, 11:29 PM
ifried updated the task description. (Show Details)
ARamirez_WMF renamed this task from Wikisource Ebooks: Investigate job queue for more efficient ebook generation [8H] to Wikisource Ebooks: Investigate job queue for more efficient ebook generation [16H].Sep 24 2020, 5:57 PM
ARamirez_WMF moved this task from To Be Estimated/Discussed to Estimated on the Community-Tech board.
ARamirez_WMF changed the subtype of this task from "Task" to "Deadline".

An important distinction to be made with Event Metrics – ultimately, the "report" data in Event Metrics gets pre-stored in a database, indefinitely. This isn't a problem in terms of storage because the reports are just numbers. In addition, Event Metrics stores a timestamp of when the report was generated, and you as the user always get that version of the report until you ask for an updated one. In our case, we end up with an epub (or other format), which is not as cheap to store, and we also want to automatically ensure the user is served the latest possible version. So the two systems won't work exactly the same.

In its simplest form, the purpose of a job queue would be to ensure that only a limited number of resource-intensive exports are running at the same time. Beyond that, we also need to cache the exported file for some period of time. For now I'm going to go off of the investigation at T222936 and recommend a brief period, say 10 minutes.

Borrowing from the system we use for Event Metrics, I envision the system working something like this:

First we need a table in the database to keep track of the jobs. The schema could look something like:

  • id – unique ID of the job
  • filename – unique filename for the exported book, something like [title]-[lang]-[font].[format]
  • submitted – when the job was submitted
  • status – status of the job, stored as a smallint but in English it's one of:
    • queued – waiting to be spawned by cron
    • started – currently processing
    • failed_timeout – timed out (we'll come up with a maximum period of time that any job should run; Sam and others probably know what a good value would be)
    • failed_unknown – failed for some other, unknown reason

There's no "completed" status because the system will refer to the file system to determine whether a job is needed.
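To make the schema above concrete, here is a minimal sketch using SQLite in Python. This is only an illustration: WSExport itself is a PHP app and would presumably use MariaDB, and the column types, status codes, and example filename are all assumptions on my part, not a committed design.

```python
import sqlite3

# Status codes for the smallint `status` column (the numeric values are illustrative).
STATUS_QUEUED = 0
STATUS_STARTED = 1
STATUS_FAILED_TIMEOUT = 2
STATUS_FAILED_UNKNOWN = 3

SCHEMA = """
CREATE TABLE job (
    id        INTEGER PRIMARY KEY,          -- unique ID of the job
    filename  TEXT NOT NULL UNIQUE,         -- e.g. [title]-[lang]-[font].[format]
    submitted TEXT NOT NULL,                -- when the job was submitted (UTC timestamp)
    status    INTEGER NOT NULL DEFAULT 0    -- one of the STATUS_* codes above
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Queue a hypothetical export job.
conn.execute(
    "INSERT INTO job (filename, submitted, status) VALUES (?, datetime('now'), ?)",
    ("The_Time_Machine-en-default.epub", STATUS_QUEUED),
)
row = conn.execute("SELECT filename, status FROM job").fetchone()
print(row)
```

Note there is no `completed` status in the table, matching the design above: a finished job simply deletes its row, and the existence of the file on disk is the signal that the work is done.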

So the whole pipeline might look something like:

  1. Request comes in
  2. Check if a file already exists for the desired work/format
    1. If it exists, serve it. Nothing else needs to be done.
  3. (no file exists) Check the job table to see if a job is pending for the desired work/format.
    1. If a job exists, the client goes back to step 1 (continually pinging the server until a file becomes available, with some reasonable timeout before failing gracefully)
  4. (no job exists) A row is added to the job table and set to "queued"
  5. A cron runs every minute and spawns queued jobs (setting the status to "started"), never allowing more than N jobs to run at the same time (N can probably be a fairly high threshold)
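The pipeline above can be sketched as follows. Everything here is hypothetical (in-memory dicts stand in for the job table and the cache directory, and the function names, threshold, and polling interval are my own placeholders), but it shows the control flow: serve from cache, otherwise wait on a pending job, otherwise enqueue.

```python
import os

MAX_RUNNING_JOBS = 10   # "N" in step 5; the real threshold is still to be decided
POLL_INTERVAL = 5       # seconds between client pings (assumption)

# In-memory stand-ins for the job table and the export cache directory.
jobs = {}               # filename -> status string
cache_dir = "/tmp/wsexport-cache"

def handle_request(filename):
    """Steps 1-4: serve a cached file, wait on a pending job, or enqueue a new one."""
    path = os.path.join(cache_dir, filename)
    if os.path.exists(path):            # step 2: file already exists, serve it
        return ("serve", path)
    if filename in jobs:                # step 3: a job is pending, client re-polls
        return ("wait", POLL_INTERVAL)
    jobs[filename] = "queued"           # step 4: no file, no job -- enqueue
    return ("wait", POLL_INTERVAL)

def cron_tick():
    """Step 5: spawn queued jobs, never exceeding MAX_RUNNING_JOBS at once."""
    running = sum(1 for s in jobs.values() if s == "started")
    for filename, status in jobs.items():
        if running >= MAX_RUNNING_JOBS:
            break
        if status == "queued":
            jobs[filename] = "started"  # real code would fork an export process here
            running += 1
```

A first request for a book enqueues a job and tells the client to poll; once `cron_tick` runs, the job moves to "started", and once the export writes its file, subsequent requests are served directly from disk.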

Under this system, we will never be exporting the same thing twice at the same time. This doesn't cover the caching part (more on that below); it just ensures any unique combination of title/format/font is only exported once at a time, and ensures that at any given time we have enough resources (RAM and such) to properly export a work. I don't think we've confirmed that RAM and/or CPU overhead is actually a major problem, but the job queue nonetheless can ensure it stays that way.

Now, more about this cron job. Its workflow could be something like:

  1. Start processing
  2. Once complete, the file now exists which is the first thing that gets checked, so the corresponding row in the job table can be deleted. The next request will serve the actual file.
  3. If the job fails, we set the status to indicate this (whether it was due to timeout or something unknown). This needs to be persisted so that on the next ping from the client, we can tell them the export failed.
    1. We don't have emailing of errors set up yet for WSExport, but when we do, it could email us when this happens.
  4. Delete files that are over 10 minutes old

#4 is effectively the caching bit. It could be done by a separate cron, but I worry about them competing with each other (one is trying to serve a book and, just as it's about to, the other deletes it).
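The worker workflow above could look something like this sketch. Again, all names are hypothetical; note in particular that checking elapsed time after the export finishes is not a real timeout (actual code would need to kill a long-running export, e.g. via a process limit), it's only here to show where `failed_timeout` would be set.

```python
import os
import time

CACHE_TTL = 600  # 10 minutes, per the caching recommendation above
cache_dir = "/tmp/wsexport-cache"
os.makedirs(cache_dir, exist_ok=True)

def run_job(job, export_fn, timeout):
    """Steps 1-3: run the export; drop the row on success, persist failures."""
    try:
        start = time.monotonic()
        export_fn(job["filename"])             # step 1: start processing
        if time.monotonic() - start > timeout:
            job["status"] = "failed_timeout"   # step 3: persist the failure so...
        else:
            job["status"] = "deleted"          # step 2: file exists now; delete the row
    except Exception:
        job["status"] = "failed_unknown"       # step 3: ...the next client ping sees it

def purge_stale_files(now=None):
    """Step 4, the caching bit: delete exported files older than CACHE_TTL."""
    now = now or time.time()
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if now - os.path.getmtime(path) > CACHE_TTL:
            os.remove(path)
```

Running `purge_stale_files` from within the same worker loop (rather than a separate cron) is one way to avoid the race described above, since deletion then never overlaps with serving.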

A few things I'm unsure about:

  • Should we cache just the epub, the final format, or both?
  • For small books that export quickly, the job queue system could slow down the experience significantly because you have to wait up to a minute for the next cron run. Perhaps the queue can be bypassed if we can programmatically determine the work isn't very expensive to export (say by number of pages). This is basically what we did for Event Metrics. We also have the option of simulating a cron rate of less than a minute by having multiple cron jobs and putting a sleep in there. Definitely a hack but from quick research this isn't unheard of.

I'm going to stop here for now and let the other engineers read this and we will discuss tomorrow in our meeting. I'm not certain if the above is the best approach but it will allow us to steal a lot of already-written code from Event Metrics.

ARamirez_WMF changed Due Date from Oct 21 2020, 4:00 AM to Nov 4 2020, 5:00 AM.Oct 22 2020, 7:39 PM