
[16 hour spike] Investigate issues related to failures of Wikisource e-book export
Closed, Resolved · Public · Spike

Description

Wikisource users regularly report that e-book export fails. The problem appears to be intermittent and affects all of the export formats, including EPUB, MOBI, and PDF (the three we're mainly interested in).

Your assignment is to investigate this issue and come up with feasible ideas for ways to strengthen the WSexport tool so that it works reliably.

(Below is a list of tickets, in GitHub and Phabricator, that relate to this issue. They're offered only as clues/starting points. Many of them are bound to be duplicates, stemming from a common set of core problems.)

Event Timeline

Restricted Application added a subscriber: Aklapper.

List of (probably duplicate) export issues to start off your investigation

Don't feel you have to solve or address each of these. They are offered as clues and examples of the issue. I've looked into many of them and found that I could often download the books that the user couldn't. This merely confirms that the problem appears to be intermittent.

jmatazzoni renamed this task from Investigate issues related to failures of e-book export out of Wikisource to Investigate issues related to failures of Wikisource e-book export. Mar 26 2019, 9:49 PM
jmatazzoni updated the task description.
jmatazzoni updated the task description.
jmatazzoni moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board.
jmatazzoni renamed this task from Investigate issues related to failures of Wikisource e-book export to [16 hour spike] Investigate issues related to failures of Wikisource e-book export. Mar 26 2019, 11:31 PM

A few notes from our meeting with @Tpt:

  • The number one concern is likely that the service workers compiling the EPUB are being killed for exceeding memory limits in the Labs environment.
  • All of the information used to create the EPUB, including images, is pulled into memory. Tpt says it might make sense to write images into the EPUB directly instead of holding them in memory.
  • The version of Calibre is whatever ships with Debian Stretch on the Labs platform.
  • We could potentially upgrade Calibre to see some improvement, but there is no specific change in newer versions that would obviously help.
  • WSexport extracts metadata from the wiki text using a microformat of sorts. It relies on specific HTML classes being present so that it can find the correct content.
  • The input is MediaWiki parser HTML; we may want to switch to Parsoid HTML, which might be easier to work with.
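
To make the image suggestion above concrete, here is a minimal sketch of writing images straight into the archive in fixed-size chunks, so that no whole image (let alone all of them) has to sit in memory at once. It is written in Python purely for illustration and is not WSexport's code. The function name `write_epub_streaming`, the `opener` callbacks, and the `OEBPS/images/` path are all assumptions made for the example, and the output is only a skeletal container (no `container.xml` or package document), not a complete EPUB.

```python
import io
import shutil
import zipfile
from typing import BinaryIO, Callable, Iterable, Tuple

CHUNK_SIZE = 64 * 1024  # copy in 64 KiB pieces to keep peak memory bounded


def write_epub_streaming(
    out: BinaryIO,
    images: Iterable[Tuple[str, Callable[[], BinaryIO]]],
) -> None:
    """Write a skeletal EPUB container, streaming each image into the archive.

    `images` yields (file_name, opener) pairs. Each opener is a hypothetical
    callback that returns a readable binary stream, for example an open file
    or an HTTP response body.
    """
    with zipfile.ZipFile(out, "w") as epub:
        # The EPUB container format expects `mimetype` as the first entry,
        # stored without compression.
        epub.writestr(
            zipfile.ZipInfo("mimetype"),
            "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        for name, opener in images:
            info = zipfile.ZipInfo(f"OEBPS/images/{name}")
            # Image formats are usually compressed already, so store as-is.
            info.compress_type = zipfile.ZIP_STORED
            # Only one chunk of the image is held in memory at any time.
            with opener() as src, epub.open(info, "w") as dst:
                shutil.copyfileobj(src, dst, CHUNK_SIZE)


if __name__ == "__main__":
    buf = io.BytesIO()
    fake_images = [("cover.png", lambda: io.BytesIO(b"fake image bytes"))]
    write_epub_streaming(buf, fake_images)
    with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
        print(z.namelist())
```

Whether this helps in practice depends on where the memory actually goes; if most of it is consumed during the later Calibre conversion step, streaming the images alone would not be enough.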

I probably have a few more details about these things if there are questions. I think this info from Tpt is very helpful for our investigation into this work.


The above analysis uncovers the first layer of problems. Once we fix them, we may see more problems, but for now this is the direction of work.

I took a look at the new tasks that Max created, and they look like a good start on digging through the first layer of issues with the service.

Restricted Application changed the subtype of this task from "Task" to "Spike". Jul 23 2019, 5:39 AM