Page MenuHomePhabricator

MathML tags are missing xmlns attribute
Closed, ResolvedPublic

Description

Ex: enwiki:Light, enwiki:MathML

----- JS:[34267, 34347] -----
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="{\displaystyle x+5}">

+++++ PHP:[34181, 34218] +++++
<math alttext="{\displaystyle x+5}">

Event Timeline

ssastry triaged this task as Medium priority.Oct 11 2019, 6:35 PM
ssastry created this task.

RemexHTML seems to be dropping the xmlns attribute (see output when I added a log output to ExtensionHandler.php)

[subbu@earth:~/work/wmf/parsoid] echo '<math>1</math>' | php bin/parse.php --body_only --dump extoutput
HTML STRING: <span class="mwe-math-element"><span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML"  alttext="{\displaystyle 1}">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mn>1</mn>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle 1}</annotation>
  </semantics>
</math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/92d98b82a3778f043108d4e20960a9193df57cbf" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.338ex; width:1.162ex; height:2.176ex;" alt="1"/></span>
[dump/extoutput] ================================================================================
[dump/extoutput] EXTENSION INPUT: <math>1</math>
[dump/extoutput] ================================================================================
[dump/extoutput] EXTENSION OUTPUT (outerHTML of body of parsed html doc): 
[dump/extoutput] <body><span class="mwe-math-element"><span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math alttext="{\displaystyle 1}">   <semantics>     <mrow class="MJX-TeXAtom-ORD">       <mstyle displaystyle="true" scriptlevel="0">         <mn>1</mn>       </mstyle>     </mrow>     <annotation encoding="application/x-tex">{\displaystyle 1}</annotation>   </semantics> </math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/92d98b82a3778f043108d4e20960a9193df57cbf" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.338ex; width:1.162ex; height:2.176ex;" alt="1"/></span></body>
[dump/extoutput] --------------------------------------------------------------------------------

I bet the culprit is the workaround for T217708 (Remex commit 33de7ba9746fce0aaaeb9314a7a78460f2a28122), although that was *supposed* to affect only HTML elements, not xmlns="http://www.w3.org/1998/Math/MathML".

Hm. Apparently the PHP DOM just ignores setAttribute and setAttributeNS when the name is xmlns. This is regardless of whether the element is created with createElement or createElementNS. I haven't figured out a workaround yet.

Grumble grumble PHP DOM grumble grumble.

Change 548913 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/libs/RemexHtml@master] WIP: Hack Remex to work around PHP DOM "special case" of xmlns attributes

https://gerrit.wikimedia.org/r/548913

Change 548937 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Workaround for missing xmlns attributes on DOMElement

https://gerrit.wikimedia.org/r/548937

Change 548913 abandoned by C. Scott Ananian:
WIP: Hack Remex to work around PHP DOM "special case" of xmlns attributes

Reason:
Abandoned in favor of I7ef44ad9c1996749e9cf4306ac4d89cc5b8cc6e0

https://gerrit.wikimedia.org/r/548913

Change 548937 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Workaround for missing xmlns attributes on DOMElement

https://gerrit.wikimedia.org/r/548937

Change 549132 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Remove T235295 normalization from html diffing script

https://gerrit.wikimedia.org/r/549132

Change 549132 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Remove T235295 normalization from html diffing script

https://gerrit.wikimedia.org/r/549132

Change 549619 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Put the 'fake' xmlns attribute first, in an attempt to better match JS

https://gerrit.wikimedia.org/r/549619

Change 549655 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Reorder attributes in Parsoid/JS to put xmlns first

https://gerrit.wikimedia.org/r/549655

Change 549619 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Put the 'fake' xmlns attribute first, in an attempt to better match JS

https://gerrit.wikimedia.org/r/549619

Change 549655 abandoned by C. Scott Ananian:
Reorder attributes in Parsoid/JS to put xmlns first

Reason:
Not worth doing.

https://gerrit.wikimedia.org/r/549655

I think PHP's behaviour is defensible. According to the XHTML serialization part of the HTML spec, if there is "an Attr node with no namespace whose local name is the lowercase string "xmlns"", the serializer is required to throw.

The W3C DOM Parsing and Serialization draft specifies that when serializing an element that is in a different namespace to the default, an xmlns attribute should be emitted. If an xmlns attribute actually existed in the DOM, you would end up with a duplicate xmlns attribute and an invalid resulting document. So you can see why it would make sense to throw an exception in that case.

The HTML spec states that when coercing an HTML document to an XML infoset,

If the XML API doesn't support attributes in no namespace that are named "xmlns", attributes whose names start with "xmlns:", or attributes in the XMLNS namespace, then the tool may drop such attributes.

So PHP's behaviour is explicitly permitted.

The correct way to serialize an element to XML is by emitting an xmlns attribute at serialization time if the namespace of the element requires it. Parsoid's XMLSerializer was incorrect in attempting to simply iterate over $element->attributes.

So I see DOMCompat::attributes() is a hack to work around Parsoid's incorrect XML namespace serialization, not as a workaround for a PHP bug.

Change 734058 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/services/parsoid@master] Make DOMCompat::attributes() return string[]

https://gerrit.wikimedia.org/r/734058

Change 734058 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Replace DOMCompat::attributes() with DOMUtils::attributes()

https://gerrit.wikimedia.org/r/734058