As a Wikisource user, I want the team to investigate the "Improve export of electronic books" wish, so that they can consider the various options and risks of this top Wikisource wish.
Background: In the 2020 Community Wishlist Survey, the #1 wish was to "Improve export of electronic books." This was also the #4 wish in the 2019 Community Wishlist Survey. It has been a repeated pain point for the Wikisource community: although we have done work to improve the process, users are still experiencing issues. For this reason, we want to take the time to investigate in depth the options available for improving reliability and formatting for users.
Relevant Resources:
- 2020 Wish from Community Wishlist Survey
- 2019 Wish from Community Wishlist Survey
- T178803, T242760
- WSExport Issues
Acceptance Criteria:
- Review the wishes from 2020 and 2019, as well as relevant Phabricator tasks (see links above)
- Provide an analysis of potential risks associated with this project from a technical perspective
- Provide an analysis of potential dependencies associated with this project from a technical perspective
- Provide a recommendation for implementation of this change
- Provide a rough estimate/sense of difficulty or effort required by this project
- Investigate various options outlined in the All Hands brainstorms doc, which includes:
Investigation
This wish focuses on two key aspects of the export tool: uptime/reliability and ebook formatting.
1. Reliability
We dealt with uptime a fair bit last year, and we're in the process of moving the tool to its own VPS so that it has more resources and is less affected by Toolforge maintenance.
The other big thing we can do to improve reliability is to move to a job queue system, so that the book generation processes are handled separately from both the web frontend and each other. This is a large refactor, but I think one we understand reasonably well (it's similar to what we built for #EventMetrics).
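To make the job queue idea concrete, here is a minimal in-process sketch in Python. It is only an illustration of the separation between requesting and generating, not a proposal for the actual implementation: the names `request_export`, `job_status` and `generate_book` are hypothetical, and a real setup would presumably use a persistent queue rather than in-memory structures.

```python
import queue
import threading
import uuid

# Hypothetical in-memory stores; a real deployment would need something persistent.
jobs = {}                  # job_id -> {"status": ..., "result": ...}
jobs_lock = threading.Lock()
job_queue = queue.Queue()


def generate_book(title, fmt):
    """Placeholder for the slow book-generation step (assumption, not WSExport code)."""
    return f"{title}.{fmt}"


def request_export(title, fmt="epub"):
    """Called by the web frontend: enqueue the job and return immediately."""
    job_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[job_id] = {"status": "queued", "result": None}
    job_queue.put((job_id, title, fmt))
    return job_id


def job_status(job_id):
    """Called by the web frontend to poll for the finished book."""
    with jobs_lock:
        return dict(jobs[job_id])


def worker():
    """Runs apart from the frontend; one failing book does not affect the others."""
    while True:
        item = job_queue.get()
        if item is None:          # sentinel: stop the worker
            job_queue.task_done()
            break
        job_id, title, fmt = item
        with jobs_lock:
            jobs[job_id]["status"] = "running"
        try:
            update = {"status": "done", "result": generate_book(title, fmt)}
        except Exception as exc:
            update = {"status": "failed", "result": str(exc)}
        with jobs_lock:
            jobs[job_id] = update
        job_queue.task_done()
```

The frontend only ever calls `request_export` and `job_status`; all generation work happens in `worker`, which could be scaled to several workers or moved to separate processes.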
Possible actions:
- T242760: Move WSExport to VPS
- Set up a job queue for generating epubs, with the web interface only handling requesting and delivering them.
- T222936: Wikisource Ebooks: Investigate cache generated ebooks [8H]. The tricky part of this is, of course, cache invalidation, because ideally we'd want to regenerate whenever any page in the book changes. A fixed cache duration might be easier, with a nocache URL parameter for overriding it.
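The fixed-duration cache with a nocache override could look roughly like the following Python sketch. The 24-hour duration, the `get_book` name and the in-memory dictionary are all assumptions for illustration; the real tool would cache files on disk and read `nocache` from the request URL.

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # assumed duration, not a decided value

_cache = {}  # (title, fmt) -> (created_at, payload)


def get_book(title, fmt, generate, nocache=False, now=None):
    """Return a cached book if it is fresh enough, otherwise regenerate it.

    `generate` is a callable(title, fmt) that does the real work.
    `nocache` mirrors the proposed URL parameter and forces regeneration.
    `now` can be passed in to make the expiry logic easy to test.
    """
    now = time.time() if now is None else now
    key = (title, fmt)
    entry = _cache.get(key)
    if entry is not None and not nocache:
        created_at, payload = entry
        if now - created_at < CACHE_TTL_SECONDS:
            return payload
    payload = generate(title, fmt)
    _cache[key] = (now, payload)  # a forced regeneration also refreshes the cache
    return payload
```

This avoids tracking edits to every page of a book, at the cost of serving a copy that can be up to one cache duration out of date unless the user asks for a fresh one.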
2. Formatting
A wide range of formatting errors have been reported, such as:
- Missing text at end of page or beginning of page (in plain text or in table) T244825
- Duplication of text at end of page or beginning of page
- Table titles don't appear
- Table alignment in a page (centered) not respected
- Text alignment in table cell not respected
- Style in table not respected in MOBI format
There are four main places where we're getting formatting errors in ebooks:
- The original HTML of the wiki, from things such as misnested tags or incorrect CSS in templates etc.
- How we process the wiki HTML into epub XHTML.
- Calibre's internal conversion of the epub into the secondary output formats, such as PDF.
- Ereader rendering of epubs.
The first two are the only ones we can do much about.
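As an illustration of the first category, misnested tags in the wiki HTML can be detected mechanically. The Python sketch below uses the standard library's `html.parser`; the class and function names are hypothetical, and the check is deliberately strict (it treats every non-void element as needing an explicit closing tag, which is stricter than HTML itself requires).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Tracks open tags on a stack and records closing tags that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        elif tag in self.stack:
            self.problems.append(
                f"</{tag}> closed while <{self.stack[-1]}> was still open")
            # Remove the most recent matching opener so checking can continue.
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx]
        else:
            self.problems.append(f"</{tag}> has no matching opening tag")


def find_nesting_problems(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    problems = list(checker.problems)
    problems.extend(f"<{tag}> was never closed" for tag in checker.stack)
    return problems
```

For example, `find_nesting_problems("<p><b><i>x</b></i></p>")` reports that `</b>` was closed while `<i>` was still open. A check like this could help separate errors that originate in templates from errors introduced by our own processing.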
Possible actions:
- T244837: Upgrade Calibre on wsexport VPSs. This has been done as part of the move to VPS: Toolforge has Calibre 2.75.1 and the VPS has 3.39.1. The latest Calibre is 4.10.1, so we should upgrade further.
- Come up with simple example pages that demonstrate each of the formatting issues.
- Fix easy errors in templates that are widely used.
- Add direct display of epubcheck output (GH #190) and fix prevalent issues (such as T244694 and T244448).
Misc.
- Switching to Pandoc for epub generation: this is unlikely to give us enough control over the epub contents (crucially the ToC, but also other metadata).
- Switching to Pandoc for converting epubs to other formats: compare, for example, a PDF generated from an epub by Calibre (F31607711) and by Pandoc (F31607712). Calibre is better in almost all ways, but books with lots of tables, for example, might fare better with Pandoc (e.g. F31607741). Pandoc would also give us lots of other output formats that aren't supported by Calibre. Perhaps we could just add a parameter so that people could choose which conversion system they want, but this is not something that we should worry about too much.
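If we did add a parameter for choosing the conversion system, the selection itself would be small. A minimal Python sketch, assuming a hypothetical `converter` parameter and the basic command-line forms of Calibre's `ebook-convert` and of `pandoc` (no extra options, and Pandoc's PDF output additionally needs a PDF engine installed):

```python
def build_convert_command(epub_path, output_path, converter="calibre"):
    """Return the command (as an argument list) for the chosen conversion system.

    `converter` is a hypothetical request parameter; Calibre stays the default.
    """
    if converter == "calibre":
        # ebook-convert infers the output format from the file extension.
        return ["ebook-convert", epub_path, output_path]
    if converter == "pandoc":
        # pandoc also infers the output format from the -o file extension.
        return ["pandoc", epub_path, "-o", output_path]
    raise ValueError(f"Unknown converter: {converter!r}")
```

The returned list could be passed to something like `subprocess.run(cmd, check=True)`; keeping it as a list rather than a shell string avoids quoting problems with book titles in file names.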