Maniphest T219330

[16 hour spike] Investigate issues related to failures of Wikisource e-book export
Closed, Resolved · Public · Spike
Authored By

	• jmatazzoni
	Mar 26 2019, 9:32 PM

Description

Wikisource users regularly report that e-book export fails. This problem, which appears to be intermittent, affects all of the export options we're mainly interested in: EPUB, MOBI, and PDF.

Your assignment is to investigate this issue and come up with feasible ideas for ways to strengthen the WSexport tool so that it works reliably.

(Below is a list of tickets, in GitHub and Phabricator, that relate to this issue. They're offered only as clues/starting points. Many of them are bound to be duplicates, stemming from a common set of core problems.)

Related Objects

Mentioned In: T244307: Request creation of wikisource VPS project
Mentioned Here: T221332: [4 hour spike] Investigate wsexport webservice crashes/restarts
T221337: Bump PHP memory_limit

Event Timeline

• jmatazzoni created this task. Mar 26 2019, 9:32 PM

Restricted Application added a project: Community-Tech. Mar 26 2019, 9:32 PM

Restricted Application added a subscriber: Aklapper.

List of (probably duplicate) export issues to start off your investigation

Don't feel you have to solve or address each of these. They are offered as clues and examples of the issue. I've looked into many of them and found that I could often download the books that the user couldn't. This merely confirms that the problem appears to be intermittent.

Unable to use wsexport from telugu wikisource due to 502 bad gateway error
[[ https://phabricator.wikimedia.org/T178803 | T178803 qsub sync -y jobs failing on Grid Engine with "range_list containes no elements" error ]]
- This is specifically mentioned in the wish
Conversion to pdf failed
Downloading using EPUB, MOBI and "Choose format" options causes '502 Bad Gateway' error
Book don't generate (math problem?)
T166337 wsexport tool leaking files in /tmp
Conversion to pdf/mobi/txt failed
Creation of ePub with a lot of images fails
Server error: 503 on some book
Failure when there are too many too big images

• jmatazzoni updated the task description. Mar 26 2019, 9:37 PM

• jmatazzoni renamed this task from Investigate issues related to failures of e-book export out of Wikisource to Investigate issues related to failures of Wikisource e-book export. Mar 26 2019, 9:49 PM

• jmatazzoni updated the task description.

• jmatazzoni moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board.

• jmatazzoni renamed this task from Investigate issues related to failures of Wikisource e-book export to [16 hour spike] Investigate issues related to failures of Wikisource e-book export. Mar 26 2019, 11:31 PM

MBinder_WMF added a project: Spike. Mar 26 2019, 11:32 PM

• jmatazzoni moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board. Mar 26 2019, 11:35 PM

A few notes from our meeting with @Tpt:

The number one concern is likely that the service workers compiling the ePub are being killed when they hit memory limits in the Labs environment.
All of the information used to create the ePub, including images, is pulled into memory. Tpt says that it might make sense to write images into the ePub directly instead of holding them in memory.
The version of Calibre is whatever ships with Debian Stretch on the Labs platform.
- The version in the default repo is older than the one in the backports repository.
- Current version on labs: https://packages.debian.org/stretch/calibre-bin
- New version on backport: https://packages.debian.org/stretch-backports/calibre-bin
- Related task: https://phabricator.wikimedia.org/T219307
We could potentially upgrade Calibre and see some improvement, but no specific change in the newer version stands out as an obvious fix.
Wsexport extracts metadata from the wikitext using a microformat of sorts. It relies on specific HTML classes being present so it can find the correct content.
- Microformat details: https://wikisource.org/wiki/Wikisource:Microformat
The input is MediaWiki parser HTML; we may want to switch to Parsoid HTML, which might be easier to work with.

I probably have a few more details about these things if there are questions. I think this info from Tpt is very helpful for our investigation into this work.
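To illustrate the idea of writing images into the ePub directly rather than holding them all in memory, here is a minimal sketch in Python. It is not Wsexport's code (the tool itself is written in PHP); the function `write_epub_streaming` and the `fetch_image` callback are hypothetical names invented for this example. It relies only on the fact that an EPUB is a ZIP container whose first entry is an uncompressed `mimetype` file, and it omits the other files a valid EPUB needs (such as `META-INF/container.xml` and the package document).

```python
import zipfile
from typing import Callable, Iterable, Tuple


def write_epub_streaming(
    out_path: str,
    documents: Iterable[Tuple[str, str]],
    image_names: Iterable[str],
    fetch_image: Callable[[str], bytes],
) -> None:
    """Write a partial EPUB container, adding images one at a time.

    `fetch_image` is a hypothetical callback that returns the bytes of a
    single image. Only one image is held in memory at any moment.
    """
    with zipfile.ZipFile(out_path, "w") as epub:
        # The EPUB container format requires "mimetype" to be the first
        # entry and to be stored without compression.
        epub.writestr(
            zipfile.ZipInfo("mimetype"),
            "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        for name, xhtml in documents:
            epub.writestr(name, xhtml, compress_type=zipfile.ZIP_DEFLATED)
        for name in image_names:
            data = fetch_image(name)  # fetch a single image
            epub.writestr("images/" + name, data,
                          compress_type=zipfile.ZIP_DEFLATED)
            del data  # drop the reference before fetching the next image
```

With this shape, peak memory is bounded by the largest single image rather than by the sum of all images in the book, which is the property the note above is after.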

• jmatazzoni added a project: Community-Tech-Sprint. Apr 15 2019, 6:46 PM

Niharika moved this task from Up Next (June 3-21) to In Sprint 🏃‍♀️🏃‍♂️ on the Community-Tech board. Apr 16 2019, 7:16 PM

MaxSem claimed this task. Apr 16 2019, 11:15 PM

MaxSem moved this task from Ready to In Development on the Community-Tech-Sprint board.

• jmatazzoni moved this task from Backlog to In sprint on the WS Export board. Apr 17 2019, 6:38 PM

Unable to use wsexport from telugu wikisource due to 502 bad gateway error
- Judging by service.log, the webservice crashes or has to be restarted often; to be investigated in T221332.
[[ https://phabricator.wikimedia.org/T178803 | T178803 qsub sync -y jobs failing on Grid Engine with "range_list containes no elements" error ]]
- Has been closed as invalid since then, but really looks resolved
Conversion to pdf failed
- Dupe of the previous one?
Downloading using EPUB, MOBI and "Choose format" options causes '502 Bad Gateway' error
- Also a dupe?
Conversion to pdf/mobi/txt failed
- Also a dupe?
Creation of ePub with a lot of images fails
- Also a dupe?
Server error: 503 on some book
- Guess what, also a dupe.
Book don't generate (math problem?)
- PHP timeout on the MediaWiki side while parsing https://fr.wikisource.org/wiki/Recherches_arithm%C3%A9tiques/Section_cinqui%C3%A8me ; the timeout does indeed occur inside Math, while processing the metadata of this file.
T166337 wsexport tool leaking files in /tmp
- A cron job to clean it up has already been installed; verify that it works and close.
Failure when there are too many too big images
- OOM because the memory limit is only 128M. Double or quadruple it as a quick fix, then investigate how to avoid storing all images in memory. T221337

The above analysis uncovers the first layer of problems. Once we fix them, we may see more problems, but for now this is the direction of work.
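For the /tmp leak item, the kind of cleanup a scheduled job performs can be sketched as follows. This is only an illustration in Python under stated assumptions: the actual cron job installed for the tool is not shown in this task, and the function name `remove_stale_files`, the age threshold, and the filename prefix are all hypothetical.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_files(
    directory: str,
    max_age_seconds: float,
    prefix: str = "",
    now: Optional[float] = None,
) -> List[str]:
    """Delete regular files in `directory` older than `max_age_seconds`.

    Only files whose names start with `prefix` are considered, so that
    temporary files belonging to other tools are left alone. Returns the
    sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        # Skip subdirectories and files that do not match the prefix.
        if not path.is_file() or not path.name.startswith(prefix):
            continue
        if now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Verifying that the job works then amounts to checking that no matching files older than the threshold remain after a run.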

I took a look at the new tasks that Max created, and they look like a good start on digging through the first layer of issues with the service.

• aezell closed this task as Resolved. Apr 19 2019, 3:43 PM

• aezell moved this task from Needs Review/Feedback to Done on the Community-Tech-Sprint board.

Siddhant subscribed. Jul 23 2019, 5:39 AM

Restricted Application changed the subtype of this task from "Task" to "Spike". Jul 23 2019, 5:39 AM

MusikAnimal mentioned this in T244307: Request creation of wikisource VPS project. Feb 5 2020, 12:00 AM
