
Fix formulas in HTML extracts
Closed, Declined · Public · 5 Story Points

Description

Right now, TextExtracts:

  • Strips the fallback images from the output of the Math extension; and
  • Flattens the <span style="display: none;"> container element that wraps the math element and its children.

The result is that HTML and plain text extracts contain the alt text of the MathML, e.g.

https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&prop=extracts&titles=Planck%20constant

and

"Complex numbers are numbers of the form {\displaystyle x+iy} x+iy, where {\displaystyle x} x and {\displaystyle y} y are real numbers, {\displaystyle i} i"
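
To reproduce the first example for another page, the query URL can be built programmatically. This is only a sketch: `build_extract_url` is a hypothetical helper name, and the parameters are exactly the ones in the URL quoted above, nothing more.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the example URL in this task.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, endpoint=API_ENDPOINT):
    """Build a TextExtracts query URL for one page title (hypothetical helper)."""
    params = {
        "format": "jsonfm",
        "action": "query",
        "prop": "extracts",
        "titles": title,
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return endpoint + "?" + urlencode(params, quote_via=quote)


print(build_extract_url("Planck constant"))
```

Opening the resulting URL in a browser shows the extract, including the leaked alt text, for whichever title is passed in.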

AC

  • math elements (and their children) are stripped.
  • img.mwe-math-fallback-image and img.mwe-math-fallback-image-inline elements aren't stripped from HTML extracts.
    • Whitelist should be configurable
  • Plain text extracts behave as before, i.e. the alt text of the MathML markup is still rendered in plain text extracts.
  • Audio and video tags (other media tags) are stripped as before.
  • Announce the change on the mediawiki-api-announce mailing list.
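
The filtering part of these criteria could look roughly like the sketch below. It is written in Python with the standard library purely for illustration; TextExtracts itself is a MediaWiki extension, and none of the names here (`ExtractFilter`, `filter_extract_html`, `DEFAULT_IMG_CLASS_WHITELIST`) come from its code. The sketch assumes reasonably well-formed input, drops comments and doctype declarations, and does not address the flattened `display: none` container.

```python
from html.parser import HTMLParser

# Hypothetical default; the task only says the whitelist should be configurable.
DEFAULT_IMG_CLASS_WHITELIST = frozenset({
    "mwe-math-fallback-image",
    "mwe-math-fallback-image-inline",
})

# Elements that are removed together with all of their children.
STRIPPED_CONTAINERS = frozenset({"math", "audio", "video"})


class ExtractFilter(HTMLParser):
    """Illustrative filter: strip math/audio/video, keep whitelisted img tags."""

    def __init__(self, img_class_whitelist=DEFAULT_IMG_CLASS_WHITELIST):
        super().__init__(convert_charrefs=False)
        self.img_class_whitelist = frozenset(img_class_whitelist)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a stripped container

    def _keep_img(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return any(c in self.img_class_whitelist for c in classes)

    def _emit_start(self, tag, attrs):
        if self.skip_depth or (tag == "img" and not self._keep_img(attrs)):
            return
        # Re-emit the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())

    def handle_starttag(self, tag, attrs):
        if tag in STRIPPED_CONTAINERS:
            self.skip_depth += 1
            return
        self._emit_start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <img ... />; never opens a container.
        if tag not in STRIPPED_CONTAINERS:
            self._emit_start(tag, attrs)

    def handle_endtag(self, tag):
        if tag in STRIPPED_CONTAINERS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def filter_extract_html(html, img_class_whitelist=DEFAULT_IMG_CLASS_WHITELIST):
    parser = ExtractFilter(img_class_whitelist)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Passing a different `img_class_whitelist` is how the "configurable" criterion is modelled here; in the extension this would presumably be a configuration variable instead.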

Testing criteria

Expected behavior: the Lie algebra page preview should display all mathematical expressions in every browser.

Event Timeline

MaxSem created this task. Feb 26 2017, 5:27 AM
Restricted Application added a subscriber: Aklapper. Feb 26 2017, 5:27 AM
ovasileva triaged this task as Normal priority. Mar 2 2017, 3:48 PM
ovasileva moved this task from To Triage to Needs Analysis on the Readers-Web-Backlog board.

This could be of interest: http://asciimath.org/

Make the formulas text readable by converting them to ascii for the extract

Jdlrobson added subscribers: Niki4, Jdlrobson.

I've merged the description with the much more thorough one from @Niki4

Jdlrobson changed the task status from Open to Stalled. May 2 2017, 4:59 PM

Until we have a solid proposal in T113094

Jdlrobson changed the task status from Stalled to Open.

So basically the problem is that TextExtracts is currently wired to remove all img tags from the HTML output.

We can stop doing this by changing ExtractFormatter to not call setRemoveMedia and instead, as part of filterContent, manually pull out img tags that are not math images, e.g. by maintaining a whitelist which contains mwe-math-fallback-image-inline.

However, for browsers which support mathml we may need to think of a way to make sure the img tag is hidden to those users (otherwise they will see it twice)

@Jdlrobson - can we convert to the fallback images when we get the extract (prior to mathml conversion)?

ovasileva raised the priority of this task from Normal to High. Jun 26 2017, 3:45 PM
ovasileva renamed this task from Decide what to do with formulas in HTML extracts to Fix formulas in HTML extracts. Jun 26 2017, 5:17 PM
phuedx added a subscriber: phuedx. Edited Jun 26 2017, 5:56 PM

However, for browsers which support mathml we may need to think of a way to make sure the img tag is hidden to those users (otherwise they will see it twice).

Interestingly, when you navigate to https://en.wikipedia.org/wiki/Lie_group#Definitions_and_examples in Firefox, you'll see the fallback images are rendered, because the <math> tag is wrapped in a <span style="display: none;"> container element. Note well that this container element is flattened by the {HTML,Extract}Formatter.

We'll need to strip math tags while whitelisting img elements whose class list contains 'mwe-math-fallback-image{-inline}'.
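
One detail worth pinning down is what "class list contains" means: the check should compare whole class tokens, not substrings, so that an unrelated class which merely embeds the string is not kept. A minimal sketch of such a predicate follows; the function name and the default whitelist tuple are illustrative assumptions, not existing code.

```python
# Hypothetical whitelist: the two classes named in this task.
MATH_IMG_CLASSES = (
    "mwe-math-fallback-image",
    "mwe-math-fallback-image-inline",
)


def has_whitelisted_class(class_attr, whitelist=MATH_IMG_CLASSES):
    """Return True if any whitespace-separated class token is whitelisted."""
    tokens = (class_attr or "").split()
    return any(token in whitelist for token in tokens)
```

An img element would then be kept only when this predicate is true for its class attribute, and removed otherwise.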

phuedx updated the task description. Jun 26 2017, 6:28 PM
ovasileva updated the task description. Jun 26 2017, 6:32 PM
phuedx updated the task description. Jun 27 2017, 8:33 AM
ovasileva updated the task description. Jun 27 2017, 4:13 PM
ovasileva set the point value for this task to 5. Jun 27 2017, 4:17 PM
phuedx updated the task description. Jun 27 2017, 4:18 PM
phuedx removed the point value for this task.
phuedx set the point value for this task to 5. Jun 27 2017, 4:22 PM
phuedx updated the task description. Jun 28 2017, 5:14 PM

Change 362297 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[HtmlFormatter@master] Allow certain media elements not to be removed

https://gerrit.wikimedia.org/r/362297

Change 362298 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/TextExtracts@master] Keep math images while removing formulas from HTML output

https://gerrit.wikimedia.org/r/362298

Jdlrobson moved this task from 2016-17 Q4 to Needs Analysis on the Readers-Web-Backlog board.
Jdlrobson changed the task status from Open to Stalled.

We talked about this task and T168329 during prioritisation/standup/goals time and decided we were working on this a little prematurely and thus feeling pain. We plan to wait on decisions inside T113094 that will tell us how we continue maintaining TextExtracts/the new services endpoint and how we want to sustain this going forward.

bmansurov removed bmansurov as the assignee of this task. Jul 5 2017, 7:55 PM
bmansurov added a subscriber: bmansurov.

Change 362297 abandoned by Bmansurov:
Allow certain elements not to be removed

Reason:
Talk is going on about doing this in MCS.

https://gerrit.wikimedia.org/r/362297

Change 362298 abandoned by Bmansurov:
Keep math images while removing formulas from HTML output

Reason:
Talk is going on about doing this in MCS.

https://gerrit.wikimedia.org/r/362298

Jdlrobson closed this task as Declined. Jul 13 2017, 6:48 PM

We're not going to fix this in TextExtracts. TextExtracts omits all img tags and we don't plan to add them in this mode given the complexity of the problem. Thoughts on what we should do with the math elements output in the API result are appreciated in T170617.