
Parsoid: Output MathJax rendering for Math tags instead of images
Closed, Invalid (Public)

Description

When MathJax is enabled we should render a <span> containing TeX, instead of an image.


Version: unspecified
Severity: normal

Details

Reference
bz51698

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:51 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz51698.

With VE's JS requirements we can just force everyone to use the MathJax rendering when editing, which makes things a lot simpler at both ends and means we don't need to round-trip to re-render.

From IRC discussions:

It appears that supporting this on the Parsoid end would require an update to the MediaWiki API so we can pass in options to the extension (something like action=callextensiontag&attributes=...&options=...).

In addition, if the math extension doesn't accept output rendering options (except via config settings), the extension itself will need an update to support this.

Once both are present, Parsoid can emit MathML.
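As a rough illustration of the request shape floated above, here is a minimal sketch that builds such a query string. The callextensiontag action does not exist; it is only the name suggested in this comment, and the tag, body and format parameters as well as the JSON encoding of attributes and options are assumptions made for the example.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_extension_tag_query(tag, body, attributes, options):
    """Build a query string for the hypothetical callextensiontag action."""
    params = {
        "action": "callextensiontag",  # hypothetical action name from this thread
        "format": "json",
        "tag": tag,    # assumed parameter: name of the extension tag, e.g. "math"
        "body": body,  # assumed parameter: the tag contents (TeX source)
        # Assumed encoding: attributes and options are passed as JSON objects.
        "attributes": json.dumps(attributes, sort_keys=True),
        "options": json.dumps(options, sort_keys=True),
    }
    return urlencode(params)


# Ask the (hypothetical) endpoint to render a fraction as MathML.
query = build_extension_tag_query("math", r"\frac{a}{b}", {}, {"output": "mathml"})
decoded = parse_qs(query)
```

Passing the rendering options per request, rather than reading them from site configuration, is what would let Parsoid ask for MathML regardless of user preferences.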

Have we filed a bug against the mw api to support this?

From IRC discussion with pkrautzberger it seems that we could actually use MathJax directly to convert TeX to MathML without calling the API at all. Some work is still needed though:

[17:03] <gwicke> do you have a node package?
[17:03] <pkrautzberger> ah. no. The problem is that MathJax requires a dom right now. But we plan to liberate parts.
[17:03] <gwicke> we have a dom
[17:03] <pkrautzberger> (a lot functionality doesn't make sense outside the DOM).
[17:04] <gwicke> domino currently
[17:04] <pkrautzberger> but TeX2MathML conversion could be isolated easily (MathML to SVG not as easily). BTW our internal format is MathML.
[17:04] <gwicke> it only implements DOM4 in case that makes a difference
[17:04] <pkrautzberger> I'm not sure if dom4 would be a problem.
[17:04] <pkrautzberger> but we've seen people use dom runners.
[17:04] <gwicke> we could also use JSDom for math if that helps
[17:05] <gwicke> supports the other levels
[17:05] <gwicke> and script tag execution
[17:05] <pkrautzberger> well, ideally we could just isolate that from the dom.
[17:05] <pkrautzberger> but yes, I know someone who got jsdom to work with MathJax.
[17:06] <gwicke> ah, that sounds promising

Thanks for pointing me to this thread, Gabriel.

Let me add https://bugzilla.wikimedia.org/show_bug.cgi?id=48036.

If there's PNG+TeX in the page, then MathJax can replace the image on the fly. That will produce a nice user experience as the math will always be visible and only improve once MathJax is done (cf this Chrome extension https://chrome.google.com/webstore/detail/wikipedia-with-mathjax/fhomhkjcommffnlajeemenejemmegcmi).

For the OP, I should point out that MathJax preprocessing will remove the spans and insert its script tags, so you might want to insert those directly; see http://docs.mathjax.org/en/latest/model.html#how-mathematics-is-stored-in-the-page.
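A minimal sketch of emitting those script tags directly, assuming the MathJax v2 storage convention described on the linked page (type "math/tex" for inline math, "math/tex; mode=display" for display math). The function name mathjax_script is made up for this example.

```python
def mathjax_script(tex, display=False):
    """Wrap TeX source in the script element MathJax v2 uses to store math."""
    # Script content is not entity-decoded by browsers, so the TeX is emitted
    # verbatim. A literal "</script" would end the element early; reject it.
    if "</script" in tex.lower():
        raise ValueError("TeX source must not contain '</script'")
    mime = "math/tex; mode=display" if display else "math/tex"
    return '<script type="{}">{}</script>'.format(mime, tex)


inline = mathjax_script("x^2")
block = mathjax_script(r"\sum_{i=1}^n i", display=True)
```

Emitting the script element up front would skip the preprocessing pass that otherwise finds the spans and rewrites them.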

We currently render as PNG with the TeX available in an attribute. It would be nicer both for semantic information / indexing purposes and copy / pasting to render as MathML instead. MathJax could be used server-side to generate the MathML from TeX, and client-side as a polyfill to provide HTML+CSS (or image?) rendering for browsers that don't support MathML well.

Copy / pasting native MathML seemed to work quite well the last time I tried it in Firefox and vanilla contenteditable. MathJax-inserted HTML+CSS will likely not do so well unless we preserve the MathML and especially the TeX in the data-mw attribute. At least entire formulas could be copy/pastable that way though.

Very relevant: http://arxiv.org/abs/1304.5475

How does the MathML generated by MathJax compare to the 'content MathML' generated by LaTeXML? Is it purely presentational MathML?

Just had a meeting with Gerardo Capiel in which we also talked about math. He was involved in a node / MathJax prototype that renders to SVG plus a textual description of the formula suitable for screen readers. It might be possible to use that to avoid the need for client-side rendering altogether:

  • Content MathML as primary rendering for indexing and good rendering in Firefox
  • Server-generated SVG fallback for fast rendering on other browsers
  • Textual description for screen readers
  • TeX in data-mw for editing

The fall-back selection needs to be worked out, and might depend on JS. Compared to client-side rendering from TeX with MathJax this should be pretty fast. Ideally the DOM will not be modified, so that copy&pasting from a read-only page into an editor preserves all information.
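To make the proposed structure concrete, here is a minimal sketch of one possible wrapper. Everything about the markup is an assumption for illustration: the span element, the typeof value, the data-mw JSON shape (name plus body.extsrc), and carrying the textual description in the alt attribute of the SVG fallback image. The thread does not settle any of these.

```python
import json
import xml.etree.ElementTree as ET


def build_math_wrapper(tex, mathml, svg_url, description):
    """Combine MathML, an SVG fallback, a description and the TeX source."""
    wrapper = ET.Element("span", {
        "typeof": "mw:Extension/math",  # assumed type marker
        # Assumed data-mw shape: the TeX source is kept for editing.
        "data-mw": json.dumps({"name": "math", "body": {"extsrc": tex}}),
    })
    # Primary rendering: the MathML tree, parsed from a string.
    wrapper.append(ET.fromstring(mathml))
    # Fallback rendering: server-generated SVG; alt text serves screen readers.
    ET.SubElement(wrapper, "img", {"src": svg_url, "alt": description})
    return ET.tostring(wrapper, encoding="unicode")


html = build_math_wrapper(
    r"\frac{1}{2}",
    "<math><mfrac><mn>1</mn><mn>2</mn></mfrac></math>",
    "half.svg",  # placeholder URL
    "one half",
)
parsed = ET.fromstring(html)
```

Because all four representations sit in one static subtree, nothing has to modify the DOM at view time, which is what keeps copy&paste lossless.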

MathJax only works with Presentation MathML:
http://www.mathjax.org/resources/faqs/#problem-content

It also accepts only Presentation MathML as input.

We might actually be better off using LaTeXML (https://www.mediawiki.org/wiki/Extension:Math#LaTeXML) to generate Content MathML. LaTeXML provides a web service that we could probably use directly. SVG can be generated with dvisvgm from the TeX source. AFAIK this is already used in the math extension.

Arg -- I had responded twice, twice it was lost... Trying again.

On Sept 4 (after the IRC chat log) I tried to post:

Obviously, I'd love to see MathML + MathJax on Wikipedia. That would be a huge step forward for math on the web, accessibility, and education.

But on the wikitech-l thread I started a while ago, there was a bit of uneasiness when it comes to MathJax performance, especially on mobile. I got the feeling that fallback images will be required for a while. Perhaps SVG might be better though and MathJax could help there, too.

MathJax is modular on input, internal format and output, which is sometimes confusing in discussions; so yes, we have an HTML/CSS output and an SVG output. At the same time, the texlive+texvc backend is a bit horrible. Personally, I think LaTeXML is a great tool for converting full LaTeX documents but I worry that you'll need another texvc to limit it -- it's too powerful. MathJax might just fit better because of its restricted syntax (and it is extensible through JavaScript). Obviously, I'm terribly biased. For the record, LaTeXML is miles better than texvc. I'm meeting Martin Schubotz (the author of the arXiv link) over the next few days, so I hope to learn more about his projects (and he's coming up to WMF after that I hear so that's awesome).

Copy&paste is tricky. Yes, it works in FF, but often OS clipboards do not know how to handle it, apps sanitize it away etc. MathJax offers a context menu to access TeX & MathML source (and in our upcoming release any annotation-xml); cumbersome but it works everywhere. We are considering web components / shadow dom, but given the state of support that's for the future (current implementations have some funky copy&paste behavior).

I did get the strong impression on wikitech-l that wikitext should keep TeX as its internal format, so I'm wondering what you have in mind for pasting MathML. MathML isn't semantically rich enough to produce human readable TeX.

Regarding Content MathML, that's a topic of debate. I'm not an expert on Content MathML but I've heard relatively negative things about it from a semantic point of view. A case in point is that accessibility tools don't do better on Content MathML than on Presentation MathML -- they build their own semantic structures on top of it anyway. In any case, you don't see a lot of Content MathML in the wild since no one can render it.

From a search point of view there doesn't seem to be much difference (but of course a specific search technology might prefer Content, Presentation, or TeX).

It's more important to produce high quality Presentation MathML instead of low quality Content MathML. MathML today is a bit like HTML 1 -- we have the language, some basic rendering, that's it. MathML has missed out on 20 years of web development (although it's the de-facto standard in publishing and technical writing workflows). MathML on Wikipedia would be important to push things forward but small steps in what's currently possible would be better.

Yesterday I tried to post:

cc'ing Fred and Moritz who have been actively working on the math extension recently.

@GABRIEL I'm a bit confused by your last two messages. [[well, less so after seeing that mine didn't get through]]

Are you just collecting thoughts on this? Are you thinking about long term or short term? Is the topic now the back end or is it still the front end (as the issue title suggests)?

Anyway, here are a few more thoughts, trying to provide some outside input.

  • Content MathML won't help on the front end -- you need Presentation MathML on the front end and use polyfills where necessary. MathJax works on all current browsers and while older machines and older Android devices may still see performance issues, those will continue to improve. Replacing images (PNG or SVG) on the fly is a progressive enhancement on all systems.
  • the prototype that Gerardo mentioned combines MathJax and ChromeVox, so you'll run into the same problem for MathML support.
  • generating static speech strings is the lowest form of a11y, especially when you could use MathJax which math accessibility tools support.
  • generating static images of any kind will remove all the advantages of reflowable and accessible content.
  • the math extension does not yet use LaTeXML but Fred and Martin are working on that.

I can't help but point out that there are also a number of serious issues with Wikipedia's math that are more important than Content MathML. For example, there's no display math mode, which is an incredible shortcoming. There's also poor Unicode support and poor RTL support. None of this will improve by switching to Content MathML -- garbage in, garbage out would be the result.

(In reply to comment #11)

Peter, thanks for your input!

Are you just collecting thoughts on this? Are you thinking about long term or short term? Is the topic now the back end or is it still the front end (as the issue title suggests)?

In the Parsoid project we are developing a long-term HTML storage format for Wikipedia content. The VisualEditor uses our HTML, but also has the freedom to cut some corners for display in the shorter term. However, our long-term goal is to use Parsoid HTML also for regular page views. This is why we are now thinking about issues like copy&paste from read-only pages into a VisualEditor instance while refining the DOM spec at http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec.

Because of its rich metadata, Parsoid HTML is also useful for researchers and search engines. Ideally we'd also like to expose math in a way that works well for both content analysis / indexing *and* display.

I'm meeting Martin Schubotz (the author of the arXiv link) over the next few days, so I hope to learn more about his projects (and he's coming up to WMF after that I hear so that's awesome).

I'm also looking forward to learning more about LaTeXML vs. MathJax options. No browser supports Content MathML currently. The question then is if it is still useful to produce it for search while always rendering via Presentation MathML and/or server-generated SVG.

From a search point of view there doesn't seem to be much difference (but of course a specific search technology might prefer Content, Presentation, or TeX).

That might be, although Martin suggests that Content is better for search in his paper.

Copy&paste is tricky. Yes, it works in FF, but often OS clipboards do not know how to handle it, apps sanitize it away etc. MathJax offers a context menu to access TeX & MathML source (and in our upcoming release any annotation-xml); cumbersome but it works everywhere. We are considering web components / shadow dom, but given the state of support that's for the future (current implementations have some funky copy&paste behavior).

Copy&pasting entire formulas should be possible as long as our data-mw attribute on the outer wrapper node is preserved. That has the TeX source, which can be used to re-render the contents from scratch. This will enable copy&pasting of entire sections including formulas.

I also get the impression that copy&pasting parts of a formula directly might not be feasible. It works with Presentation MathML in FF, but that would not be useful as we don't have a way to convert that back to TeX. It might make more sense to let users copy the TeX and insert that in a new formula using a widget.
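A minimal sketch of the recovery step this relies on: scan pasted HTML for wrapper nodes that still carry data-mw and pull the TeX back out, so the formula can be re-rendered from scratch. The data-mw JSON shape (body.extsrc holding the TeX) is an assumption for the example, and TexCollector and recover_tex are made-up names.

```python
import json
from html.parser import HTMLParser


class TexCollector(HTMLParser):
    """Collect TeX sources from data-mw attributes in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        data_mw = dict(attrs).get("data-mw")
        if data_mw is None:
            return  # attribute missing, or stripped by the clipboard or app
        try:
            payload = json.loads(data_mw)
        except ValueError:
            return  # not valid JSON; skip rather than guess
        # Assumed shape: {"body": {"extsrc": "<TeX source>"}}
        body = payload.get("body") if isinstance(payload, dict) else None
        tex = body.get("extsrc") if isinstance(body, dict) else None
        if isinstance(tex, str):
            self.sources.append(tex)


def recover_tex(html):
    """Return the TeX source of every formula wrapper found in html."""
    collector = TexCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


pasted = (
    "<p>Pythagoras: "
    "<span data-mw='{\"name\":\"math\",\"body\":{\"extsrc\":\"a^2+b^2=c^2\"}}'>"
    "<img src=\"f.svg\"></span></p>"
)
formulas = recover_tex(pasted)
```

If the paste path strips data-mw, the list comes back empty and the formula cannot be reconstructed, which is why preserving the attribute on the outer wrapper matters.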

  • generating static speech strings is the lowest form of a11y, especially when you could use MathJax which math accessibility tools support.

Are there popular screen readers that handle math? I agree that MathML is probably better in the longer term, but for current screen readers a speech string could still be a useful fall-back. At least if that would not prevent a plugin like MathPlayer from using the MathML instead.

  • generating static images of any kind will remove all the advantages of reflowable and accessible content.

I think nobody is suggesting to generate *only* SVG. I'm not sure about the need for reflow in Wikipedia, as the limits of texvc seem to have motivated authors to handle this manually in TeX.

(In reply to comment #12)

Thanks, Gabriel! That's very helpful.

I'm also looking forward to learning more about LaTeXML vs. MathJax options. No browser supports Content MathML currently. The question then is if it is still useful to produce it for search while always rendering via Presentation MathML and/or server-generated SVG.

That's a good question. Especially, if Content MathML from TeX can be good enough (with "random" authoring instead of firm semantic guidelines).

Copy&pasting entire formulas should be possible as long as our data-mw attribute on the outer wrapper node is preserved. That has the TeX source, which can be used to re-render the contents from scratch. This will enable copy&pasting of entire sections including formulas.

That would be awesome. Subexpressions seem impossible right now -- but one day, with shadow DOM and a lot of great heuristics it might just work...

Are there popular screen readers that handle math? I agree that MathML is probably better in the longer term, but for current screen readers a speech string could still be a useful fall-back. At least if that would not prevent a plugin like math player from using the MathML instead.

A static speech string is never a bad idea for legacy screen readers. There are only two math accessibility solutions, MathPlayer and ChromeVox. AFAIK a number of screen readers ship MathPlayer but I'm not an expert on screen readers.

Since ChromeVox is mostly JavaScript (and open source), an obvious idea is to create a MathJax extension based on its technology.

The thing is that accessibility is about more than aural rendering; in particular synchronized highlighting is extremely important for learning and other non-vision disabilities. That can't work with static strings.

I think nobody is suggesting to generate *only* SVG.

I'm glad I misunderstood you :)

I'm not sure about the need for reflow in Wikipedia, as the limits of texvc seem to have motivated authors to handle this manually in TeX.

I seem to have a very different experience :( Anyway, right now PNGs are the only option on mobile (bug 45816).

MathJax is interesting for ZIM files and Kiwix because it would reduce bandwidth/storage usage. Hope this is still on the Parsoid roadmap.

Arlolra set Security to None.
Esanders claimed this task.

MathJax support is discontinued; see T99369.

Esanders changed the task status from Resolved to Invalid. Sep 17 2015, 1:51 PM