
Generalise the Parsoid structure and internal representations.
Closed, Declined · Public

Description

So Molly and I are at LibrePlanet talking about bug 37933.

We decided that while Wikitext --> HTML --> LaTeX is possible, it's not terribly useful and doesn't really take advantage of Parsoid's structure. Chaining through HTML adds an extra stage to the parse, which could add to the time the parse takes; we'd rather have a generalized token stream and only start converting to a target format once the token stream is actually ready.
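
To make that concrete, here's a rough sketch in JavaScript - all names are made up and the tokenizer is a trivial stand-in, not Parsoid's actual pipeline. The point is just that the shared tokenize stage runs once, and each target format plugs in its own emitter:

```
'use strict';

// Stand-in tokenizer: the real one is Parsoid's PEG tokenizer; this one
// just splits paragraphs so the sketch runs end to end.
function tokenize(wikitext) {
	return wikitext.split(/\n{2,}/).map(function (text) {
		return { type: 'paragraph', text: text };
	});
}

// One emitter per output format, all consuming the same token stream.
var emitters = {
	html: function (tokens) {
		return tokens.map(function (t) {
			return '<p>' + t.text + '</p>';
		}).join('\n');
	},
	latex: function (tokens) {
		return tokens.map(function (t) {
			return t.text + '\n\n';
		}).join('');
	}
};

function convert(wikitext, format) {
	var tokens = tokenize(wikitext); // shared stage, runs once per parse
	if (!emitters[format]) {
		throw new Error('No emitter for format: ' + format);
	}
	return emitters[format](tokens); // format-specific stage starts here
}

console.log(convert('Hello\n\nworld', 'latex'));
```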

Obviously this means a few big things, potentially:

  1. The DOM post-processor needs to either run before the HTML5 tree builder (on tokens or some other structure), or be emulated for each format. I'm leaning towards the former: if we're going to export to multiple formats, it makes more sense to have one file per format that builds the export from a token structure than two files per format, one building the export and one doing the postprocessing.
  2. Because we won't necessarily be dealing with HTML in the end, we shouldn't be talking about tokens with HTML-specific tag names. We could probably just use canonical Parsoid-specific names - something like http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary - or maybe something similar to the *_NODE attributes on DOM nodes, with a mapper to canonical integer values defined in the base Token class (see the sketch after this list).
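
For (2), this is the kind of thing I mean by canonical integer values on a base Token class - hypothetical names throughout, in the style of the DOM's *_NODE constants; nothing like this exists in Parsoid yet:

```
'use strict';

function Token(type) {
	this.type = type;
}

// Canonical, format-neutral type constants, DOM *_NODE style.
Token.PARAGRAPH_TOKEN = 1;
Token.HEADING_TOKEN = 2;
Token.LIST_ITEM_TOKEN = 3;
Token.LINK_TOKEN = 4;

// Mapper from HTML-specific tag names to canonical types, so nothing
// downstream has to know about HTML at all.
Token.fromHtmlTag = function (tagName) {
	var map = {
		p: Token.PARAGRAPH_TOKEN,
		h1: Token.HEADING_TOKEN,
		h2: Token.HEADING_TOKEN,
		li: Token.LIST_ITEM_TOKEN,
		a: Token.LINK_TOKEN
	};
	var type = map[tagName.toLowerCase()];
	if (type === undefined) {
		throw new Error('Unmapped tag name: ' + tagName);
	}
	return new Token(type);
};

console.log(Token.fromHtmlTag('p').type === Token.PARAGRAPH_TOKEN); // true
```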

Footnote: As I was thinking about this and trying to come up with how I wanted it to look, I realized the problem was that I was framing it as converting between WT and either LaTeX or HTML. But if we follow our long-term plan, "LaTeX export" would also require HTML-to-LaTeX, because HTML would be our storage mechanism. So I think it might be better to rewrite each part of our system to convert each format to and from a canonical internal representation, rather than to and from any one other format.
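
In other words, something shaped like this - a toy with made-up registrations, but it shows why the canonical form wins: with N formats you write 2N converters instead of N*(N-1) pairwise ones:

```
'use strict';

var formats = {};

function registerFormat(name, toCanonical, fromCanonical) {
	formats[name] = { toCanonical: toCanonical, fromCanonical: fromCanonical };
}

// Any-to-any conversion is always two hops through the canonical form.
function convert(input, from, to) {
	var canonical = formats[from].toCanonical(input);
	return formats[to].fromCanonical(canonical);
}

// Toy registrations (the canonical form here is just plain text) so the
// sketch runs end to end.
registerFormat('html',
	function (html) { return html.replace(/<[^>]+>/g, ''); },
	function (text) { return '<p>' + text + '</p>'; });
registerFormat('latex',
	function (tex) { return tex; },
	function (text) { return text + '\n'; });

console.log(convert('<p>hi</p>', 'html', 'latex')); // -> "hi\n"
```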

I'm posting here because I want thoughts and feedback. It should be noted that bug 37934 would also benefit from any work we do on the generalisation problem - and we could probably open a tracking bug to figure out all of these things more generally.


Version: unspecified
Severity: normal

Details

Reference
bz46516

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 1:15 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz46516.

I just ran a quick serialize to see how the timing compared to a parse - in case we decide to convert HTML into an intermediary format that would translate into whatever other format we want to use. I think the two seconds it takes to serialize wouldn't be greatly affected by an extra step in the serializer process, and the benefits of having a general solution for converting between formats would be considerable.

Regarding 1: the DOM post-processor cannot run before the DOM is built; it requires information that is only available after the DOM is fully built.

We're chatting on IRC currently, and I think the rough consensus is something like... "HTML should be the common format, and the rest can deal with it." This makes some amount of sense, though it does mean that, temporarily at least, we'll have to deal with parsing WT to HTML before going to other formats. Since anything beyond that is relatively fast, performance-wise, we can live with that issue for now.

So Molly, we'll throw out the insane charts and maps we've drawn up (which really just consist of one chalk scribble at Harvard) and see what HTML can offer as an intermediary format - to that end, I guess gwicke's suggestion of pandoc is a good place to start.
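
For anyone following along, a first experiment could look roughly like this - assuming pandoc is installed and on the PATH, and page.html is just a placeholder file name:

```
'use strict';
var execFile = require('child_process').execFile;

// Ask pandoc to read HTML and write LaTeX; the result arrives on stdout.
execFile('pandoc', ['-f', 'html', '-t', 'latex', 'page.html'],
	function (err, stdout, stderr) {
		if (err) {
			throw err;
		}
		console.log(stdout); // the LaTeX rendering of page.html
	});
```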

molly.white5 wrote:

Alright, that sounds fine as long as you're not concerned about the speed. I'm not familiar with pandoc, but I'll read up on it. I'm curious to see how it handles wikitext <-> LaTeX and HTML <-> LaTeX.

Have you discussed whether or not we'll want to be able to roundtrip wikitext and LaTeX?

I think we could call that "of dubious usefulness", but you're the one who's going to have to implement it, so I probably shouldn't dictate anything. In any case, I think LaTeX and HTML map to each other much more closely 1:1 than WT and HTML do, so it should be much easier to add, even if it winds up being an afterthought.

And as ever, we're in IRC if you'd like any help or guidance :)

molly.white5 wrote:

I agree that LaTeX to wikitext would not be terribly useful. I wonder if the community would benefit from having some other formats be two-way, though. I'm sure there are people out there who would rather write their articles using Markdown... Then again, if we're trying to make the switch to the VisualEditor anyway, the ability to write articles in Markdown would be of only temporary value.

I agree. If we wind up supporting Markdown, it will be because we've found a way to do HTML <--> Markdown, or at least one direction or t'other. Converting to HTML is simple enough, at least.

molly.white5 wrote:

From my brief look at pandoc, it appears that it can do HTML <--> Markdown. As always, it's the roundtripping that's the issue.
