So Molly and I are at LibrePlanet talking about bug 37933.
We decided that while Wikitext --> HTML --> LaTeX is possible, it isn't terribly useful and doesn't really take advantage of Parsoid's structure. We see it as adding an extra stage to the parse, which could make the parse take longer. The alternative is to have a generalized token stream and only start converting to an output format once that token stream is actually ready.
Obviously this potentially means a few big things:
- The DOM post-processor needs to either run before the HTML5 tree builder, on tokens or some other structure, or it needs to be emulated for each format. I'm leaning towards the former, because if we're going to export to multiple formats it would make more sense to have one file per format that builds the export from a token structure, rather than two files per format: one that builds the export and one that does the post-processing.
- Because we won't necessarily be dealing with HTML in the end, we shouldn't be talking about tokens with HTML-specific tag names. We could probably just use canonical Parsoid-specific names - something like http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary - or maybe something similar to the *_NODE attributes in DOM nodes, with a mapper to some canonical integer values that are defined in the base Token class.
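As a rough illustration of the two points above, here is a minimal sketch of a format-neutral token stream with one small table per output format. It is written in Python purely for brevity (Parsoid itself is JavaScript), and every name in it (`NodeType`, `Token`, `HTML_MARKUP`, `LATEX_MARKUP`, `serialize`) is hypothetical, not part of Parsoid. Escaping and post-processing are left out entirely.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Iterable


class NodeType(IntEnum):
    """Hypothetical canonical node kinds, not tied to any output format."""
    TEXT = 1
    EMPHASIS_START = 2
    EMPHASIS_END = 3
    PARAGRAPH_START = 4
    PARAGRAPH_END = 5


@dataclass(frozen=True)
class Token:
    """A format-neutral token: a canonical type plus optional text."""
    type: NodeType
    text: str = ""


# One table per output format; adding a format means adding one table
# (or, in a real system, one file) and nothing else.
HTML_MARKUP: Dict[NodeType, str] = {
    NodeType.EMPHASIS_START: "<i>",
    NodeType.EMPHASIS_END: "</i>",
    NodeType.PARAGRAPH_START: "<p>",
    NodeType.PARAGRAPH_END: "</p>",
}

LATEX_MARKUP: Dict[NodeType, str] = {
    NodeType.EMPHASIS_START: r"\emph{",
    NodeType.EMPHASIS_END: "}",
    NodeType.PARAGRAPH_START: "",
    NodeType.PARAGRAPH_END: "\n\n",
}


def serialize(tokens: Iterable[Token], markup: Dict[NodeType, str]) -> str:
    """Walk the finished token stream once and emit the target format.

    Assumption: text is emitted verbatim; a real serializer would
    escape it for the target format.
    """
    parts = []
    for tok in tokens:
        if tok.type is NodeType.TEXT:
            parts.append(tok.text)
        else:
            parts.append(markup[tok.type])
    return "".join(parts)
```

The same token list can then be passed to `serialize` with either table, so no format-specific tag name ever appears in the tokens themselves.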
Footnote: As I was thinking about this and trying to come up with how I wanted it to look, I realized that the problem was that I was looking at it as wanting to convert between WT and either LaTeX or HTML. But if we end up following our long-term plan, "LaTeX export" would also require HTML-to-LaTeX, because HTML would be our storage mechanism. So I think it might be better to rewrite each bit of our system to convert each format to and from a canonical internal representation, rather than to and from any one other format.
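The "to and from a canonical internal representation" idea can be sketched as a pair of registries: one reader and one writer per format, with every conversion routed through the shared representation. This is only a sketch under stated assumptions. The `Document` shape, the `READERS`/`WRITERS` registries, the `convert` function, and the toy "lines" input format are all made up for illustration, and nothing here is Parsoid's API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical canonical representation: a flat list of (kind, text) nodes.
Node = Tuple[str, str]
Document = List[Node]


def read_lines(source: str) -> Document:
    """Toy reader: every non-empty line becomes one paragraph node."""
    return [("paragraph", line) for line in source.splitlines() if line.strip()]


def write_html(doc: Document) -> str:
    """Toy writer (no escaping): paragraphs become <p> elements."""
    return "".join(f"<p>{text}</p>" for kind, text in doc if kind == "paragraph")


def write_latex(doc: Document) -> str:
    """Toy writer (no escaping): paragraphs are separated by blank lines."""
    return "".join(f"{text}\n\n" for kind, text in doc if kind == "paragraph")


# Each format registers at most one reader and one writer.
READERS: Dict[str, Callable[[str], Document]] = {"lines": read_lines}
WRITERS: Dict[str, Callable[[Document], str]] = {
    "html": write_html,
    "latex": write_latex,
}


def convert(source: str, src_format: str, dst_format: str) -> str:
    """Convert by going through the canonical representation."""
    return WRITERS[dst_format](READERS[src_format](source))
```

With this shape, supporting N formats takes at most N readers and N writers, rather than a dedicated converter for every ordered pair of formats. An HTML reader added to `READERS` would give HTML-to-LaTeX without touching the LaTeX writer.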
I'm posting here because I want thoughts and feedback. It should be noted that bug 37934 would also benefit from any work we do on the generalisation problem, and we could probably open a tracking bug to figure out all of these things more generally.