
Wikisource: Investigate using Parsoid HTML for Wsexport [8h]
Closed, Resolved · Public · Oct 7 2020

Description

As a Wikisource user, I want the team to investigate using Parsoid HTML for Wsexport, so it can be determined a) whether such a change would improve reliability to a meaningful degree, and b) whether the work would be manageable and within scope for the team.

Background: WSexport currently generates its ePubs from the MediaWiki parser HTML output, obtained with ?action=render. This was the only HTML available when the tool was created. However, Parsoid HTML is now available and provides much richer data. It might be worth migrating to Parsoid HTML to make the tool more "future proof" and hopefully simplify some HTML transformations and cleanups (footnotes, mathematical formulas...).

Acceptance Criteria:

  • Investigate the primary work that would need to be done in order to use Parsoid HTML for Wsexport
  • Provide a general rundown of pros/cons of using Parsoid HTML for Wsexport
  • Investigate the main challenges, risks, and possible dependencies associated with implementing such a change
  • Provide a general estimate/idea, if possible, of the potential impact it may have on ebook export reliability.
    • In other words, do we have a strong hunch that this could, indeed, improve reliability (and in a considerable way)? Why or why not?
  • Provide a general estimate/rough sense of the difficulty and effort required to do such work

Details

Due Date
Oct 7 2020, 4:00 AM

Event Timeline

Restricted Application added a subscriber: Aklapper.
ifried renamed this task from Investigate using Parsoid HTML for Wsexport to Wikisource: Investigate using Parsoid HTML for Wsexport. · Jul 16 2020, 3:54 PM
ifried updated the task description.
ifried updated the task description.
ARamirez_WMF set the point value for this task to 8. · Sep 24 2020, 6:29 PM
ARamirez_WMF moved this task from Needs Discussion to Up Next on the Community-Tech board.
ARamirez_WMF renamed this task from Wikisource: Investigate using Parsoid HTML for Wsexport to Wikisource: Investigate using Parsoid HTML for Wsexport [8h]. · Sep 24 2020, 6:40 PM
ARamirez_WMF removed the point value for this task.
ARamirez_WMF changed the subtype of this task from "Task" to "Deadline".

The Parsoid API (https://www.mediawiki.org/wiki/Parsoid/API) is meant to be integrated into core MediaWiki, with the goal of replacing MediaWiki's current native parser. It is therefore a good idea to make the transition sooner rather than later to improve reliability.

There are some parameters (like rvparse) that are being used in WSexport and have already been deprecated according to https://www.mediawiki.org/wiki/API:Revisions#API_documentation.

WSexport has the flexibility to accommodate a new API endpoint, so the transition would simply require updating the domain and options based on the system's requirements.
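To illustrate the kind of endpoint change involved, here is a minimal sketch in Python (not WSexport's own code) that builds the two URLs side by side. The function names are hypothetical. The `/w/index.php` script path and the `/api/rest_v1/page/html/{title}` route are assumptions about how a Wikimedia wiki exposes the legacy and Parsoid output, so check both against the target wiki before relying on them.

```python
from urllib.parse import quote, urlencode


def legacy_render_url(domain: str, title: str) -> str:
    """Build the URL for the legacy parser HTML (?action=render).

    Assumes the wiki's script path is /w/index.php.
    """
    query = urlencode({"title": title.replace(" ", "_"), "action": "render"})
    return f"https://{domain}/w/index.php?{query}"


def parsoid_html_url(domain: str, title: str) -> str:
    """Build the URL for Parsoid HTML through the REST API.

    The route layout below is an assumption, not taken from this task.
    """
    # The title is a single path segment, so "/" must be percent-encoded too.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{domain}/api/rest_v1/page/html/{encoded}"


if __name__ == "__main__":
    print(legacy_render_url("en.wikisource.org", "The Time Machine"))
    print(parsoid_html_url("en.wikisource.org", "The Time Machine"))
```

Only the URL construction differs here. The HTML returned by the two endpoints is structured differently, so any downstream cleanup still needs its own review.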

Migrating to Parsoid will require updating the HTML cleaning code (here and here).

@Tpt Thank you for pointing that out! We will proceed with the migration, since we've concluded that it needs to be done if we want to improve reliability and make the tool more "future proof". We'll be creating another ticket, T264788, to track the transition work. Please let us know if you have any other feedback.

ifried subscribed.

We have discussed this as a team, and we have decided to implement the changes in T264788, due to the findings shared in this investigation. For this reason, I'm marking this investigation as Done.