
cscott (C. Scott Ananian)
Parser whisperer

User Details

User Since: Oct 21 2014, 6:47 PM (264 w, 7 h)
Availability: Available
IRC Nick: cscott
LDAP User: Unknown
MediaWiki User: Cscott [ Global Accounts ]

Editor since 2005; WMF developer since 2013. I work on Parsoid and OCG, and dabble with VE, real-time collaboration, and OOjs.

On GitHub: https://github.com/cscott

See https://en.wikipedia.org/wiki/User:cscott for more.

Recent Activity

Yesterday

cscott added a comment to T197879: Fix mw:DisplaySpace to match PHP "armorFrenchSpaces".

See also T14752: Space before/after »guillemets« (»/«) converted to non-breaking space (&nbsp;) (French spaces), T60529: Non-breaking thin spaces before double punctuation marks (French spaces).

Tue, Nov 12, 3:58 PM · Parsoid-Read-Views, Parsoid-Rendering
cscott added a comment to T14752: Space before/after »guillemets« (»/«) converted to non-breaking space (&nbsp;) (French spaces).

I suspect I fixed this in T197902: Be more selective in applying French Space armoring; »quote« shouldn't add &nbsp; anymore.

Tue, Nov 12, 3:57 PM · MediaWiki-extension-requests, Wikisource, MediaWiki-Parser
cscott added a subtask for T197879: Fix mw:DisplaySpace to match PHP "armorFrenchSpaces": T60529: Non-breaking thin spaces before double punctuation marks (French spaces).
Tue, Nov 12, 3:49 PM · Parsoid-Read-Views, Parsoid-Rendering
cscott added a parent task for T60529: Non-breaking thin spaces before double punctuation marks (French spaces): T197879: Fix mw:DisplaySpace to match PHP "armorFrenchSpaces".
Tue, Nov 12, 3:49 PM · MediaWiki-Parser

Fri, Nov 8

cscott added a subtask for T235217: Parsoid should use protocol-relative URLs for media: T237754: Adjust TimedMedia url handling (getAPIData) to match legacy parser.
Fri, Nov 8, 6:53 PM · Patch-For-Review, Parsoid-PHP
cscott added a parent task for T237754: Adjust TimedMedia url handling (getAPIData) to match legacy parser: T235217: Parsoid should use protocol-relative URLs for media.
Fri, Nov 8, 6:53 PM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Patch-For-Review, TimedMediaHandler, Parsoid-PHP
cscott renamed T237754: Adjust TimedMedia url handling (getAPIData) to match legacy parser from Adjust media handling to match legacy parser to Adjust TimedMedia url handling (getAPIData) to match legacy parser.
Fri, Nov 8, 6:52 PM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Patch-For-Review, TimedMediaHandler, Parsoid-PHP
cscott created T237754: Adjust TimedMedia url handling (getAPIData) to match legacy parser.
Fri, Nov 8, 6:47 PM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Patch-For-Review, TimedMediaHandler, Parsoid-PHP

Thu, Nov 7

cscott added a comment to T234280: Edits made via VE on a translatable page removes untouched content in <translate> tags.

You're right, I didn't notice it in the example, and I missed it in @matmarex's screenshot, because data-mw was abbreviated in both cases. But it's there, on the <h2> tag in your example, and on the top <h2> in F31045488.

Thu, Nov 7, 6:44 PM · VisualEditor (Current work), VisualEditor-MediaWiki, Parsoid
cscott added a comment to T234280: Edits made via VE on a translatable page removes untouched content in <translate> tags.

@subbu -- the data-mw attribute necessary to round-trip the template is only occurring on the <section> tag. That's true for both the template example in https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping#Examples and the extension example in @matmarex's F31045488.

Thu, Nov 7, 6:05 PM · VisualEditor (Current work), VisualEditor-MediaWiki, Parsoid
cscott added a comment to T234280: Edits made via VE on a translatable page removes untouched content in <translate> tags.

Seems like the bug is in VE's section-stripping code? It should probably not strip the section tag iff it has typeof="mw:Transclusion"; that will probably be handled properly by the existing VE "template-affected tag" mechanisms (which just look at typeof and ignore the tag name AFAIK).
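
A minimal sketch of that suggestion, assuming a simplified node shape ({ name, attrs, children }) rather than VE's real data model. The function names are hypothetical and only illustrate "unwrap plain sections, keep sections typed as mw:Transclusion":

```javascript
// Hypothetical node shape: { name: string, attrs: object, children: array }.
// Returns true when the node's typeof attribute lists mw:Transclusion.
function hasTransclusionType(node) {
  const typeOf = (node.attrs && node.attrs.typeof) || "";
  return typeOf.split(/\s+/).includes("mw:Transclusion");
}

// Recursively unwrap <section> nodes, except those that carry
// typeof="mw:Transclusion" (their data-mw is needed to round-trip).
function stripSections(node) {
  const children = [];
  for (const child of node.children || []) {
    const stripped = stripSections(child);
    if (stripped.name === "section" && !hasTransclusionType(stripped)) {
      children.push(...stripped.children); // plain section: keep only its contents
    } else {
      children.push(stripped); // anything else, including template-affected sections
    }
  }
  return { ...node, children };
}
```

Whether VE's existing template handling then copes with a retained section is the open question raised in the comment; the sketch only shows the filtering rule.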

Thu, Nov 7, 4:06 PM · VisualEditor (Current work), VisualEditor-MediaWiki, Parsoid

Wed, Nov 6

cscott claimed T235217: Parsoid should use protocol-relative URLs for media.
Wed, Nov 6, 5:13 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T237326: Make Parsoid/PHP cluster read-write to ensure lints discovered by Parsoid/PHP are stored in the DB.

If it were possible to direct writes at some *other* database before pointing them at the live database, that might be a good "step #0".

Wed, Nov 6, 5:09 PM · Core Platform Team, User-WDoran, Parsoid-PHP
cscott created T237538: Merge Disambiguation in core or add hook.
Wed, Nov 6, 2:56 PM · MediaWiki-extensions-Disambiguator, Parsoid-PHP
cscott created T237535: Fix inconsistency between mediawiki-title and Title in core.
Wed, Nov 6, 2:48 PM · Parsoid-PHP

Tue, Nov 5

cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

It gets worse:

>>> $doc = DOMDocument::loadHTML('<p>foo');  null;
=> null
>>> $node = $doc->createElement('math'); null;
=> null
>>> $node->setAttribute('xmlns', 'foo');
=> true
>>> $node->getAttribute('xmlns')
=> "foo"
>>> $node->getAttributeNode('xmlns')
=> DOMNameSpaceNode {#2310
     +nodeName: "xmlns",
     +nodeValue: "",
     +nodeType: XML_NAMESPACE_DECL_NODE,
     +prefix: "",
     +localName: "xmlns",
     +namespaceURI: "foo",
     +ownerDocument: DOMDocument {#2305 …},
     +parentNode: null,
   }
>>> $node->setAttribute('x','y');
=> DOMAttr {#2316
...
   }
>>> $node->attributes
=> DOMNamedNodeMap {#2318
     +length: 1,
   }
>>> $node->attributes->item(0)
=> DOMAttr {#2316
...
   }
>>> $node->getAttributeNode('x')
=> DOMAttr {#2316
     +nodeName: "x",
     +nodeValue: "",
     +nodeType: XML_ATTRIBUTE_NODE,
     +parentNode: DOMElement {#2321 …},
[...]
     +namespaceURI: null,
     +prefix: "",
     +localName: "x",
     +baseURI: null,
     +textContent: "",
     +name: "x",
     +specified: true,
     +value: "y",
     +ownerElement: DOMElement {#2321 …},
     +schemaTypeInfo: null,
   }
>>> $node->getAttributeNode('xmlns') instanceof \DOMAttr
=> false
>>>

Note that the attribute node for the xmlns attribute is a DOMNameSpaceNode, and the value of the attribute is stored in namespaceURI, not value. All the other attributes are instances of DOMAttr with value in value.

Tue, Nov 5, 11:44 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

(The reason PHP's magic behavior is awful is that the attribute "exists" in some sense, but it's completely invisible to DOMNode::setAttribute/DOMNode::hasAttribute/etc. So even if we hacked our serializer to behave the same way that $doc->saveXML does and magically 'revive' the attribute, I'd have to go through and audit every setAttribute/hasAttribute etc in Parsoid to make sure that a hidden xmlns attribute isn't going to break things by being invisible yet present...)

Tue, Nov 5, 10:07 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

Yeah, but that's not how the DOM is supposed to work.

Tue, Nov 5, 10:04 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T235295: MathML tags are missing xmlns attribute.

https://github.com/php/php-src/blob/php-7.2.24/ext/dom/element.c#L411

Tue, Nov 5, 9:50 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

Another one: T235295: MathML tags are missing xmlns attribute -- PHP's DOM refuses to accept an attribute named xmlns for some reason:

$ psysh 
Psy Shell v0.9.9 (PHP 7.3.4-2 — cli) by Justin Hileman
>>> $doc = new \DOMDocument(); null
=> null
>>> $node = $doc->createElement('math'); null;
=> null
>>> $node->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns', 'http://www.w3.org/1998/Math/MathML');
=> null
>>> $node->attributes
=> DOMNamedNodeMap {#2335
     +length: 0,
   }
>>> $node->setAttribute('xmlns', 'xyz');
=> false
>>> $node->attributes
=> DOMNamedNodeMap {#2333
     +length: 0,
   }
>>> $node->setAttribute('x', 'xyz'); null;
=> null
>>> $node->attributes
=> DOMNamedNodeMap {#2322
     +length: 1,
   }
>>> $node2 = $doc->createElementNS('http://www.w3.org/1999/xhtml', 'math'); null;
Tue, Nov 5, 9:45 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T235295: MathML tags are missing xmlns attribute.

Hm. Apparently the PHP DOM just ignores setAttribute and setAttributeNS when the name is xmlns. This is regardless of whether the element is created with createElement or createElementNS. I haven't figured out a workaround yet.

Tue, Nov 5, 7:54 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T235295: MathML tags are missing xmlns attribute.

I bet the culprit is the workaround for T217708 (Remex commit 33de7ba9746fce0aaaeb9314a7a78460f2a28122), although that was *supposed* to affect only HTML elements, not xmlns="http://www.w3.org/1998/Math/MathML".

Tue, Nov 5, 7:26 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T228346: PHP 7.2 garbage collector segfault.

No one has audited core or extensions for this particular usage pattern. So we assume there is at least *some* code in core/extensions which would be affected.

Tue, Nov 5, 7:23 PM · MW-1.35-release, Upstream, MediaWiki-General, PHP 7.2 support
cscott added a comment to T236810: Make private methods of Parser.php actually private.

@cscott I was wondering if there is a preferred migration path away from Parser::replaceLinkHolders for code that relied on it. I see that the mLinkHolders field in Parser is public, but I imagine that won't stay that way either, right? Thanks in advance!

Tue, Nov 5, 6:38 PM · Parsoid, MW-1.34-notes, MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), Core Platform Team Workboards (Clinic Duty Team), MediaWiki-Parser
cscott committed rEMLI5d5c849a554a: Replace use of Parser::disableCache(), deprecated in MW 1.28 (authored by cscott).
Tue, Nov 5, 12:53 AM
cscott committed rERQT1e4a301cd2ad: Replace use of Parser::disableCache(), deprecated in MW 1.28 (authored by cscott).
Tue, Nov 5, 12:40 AM

Mon, Nov 4

cscott committed rERPG885325715006: Replace use of Parser::disableCache(), deprecated in MW 1.28 (authored by cscott).
Mon, Nov 4, 11:22 PM
cscott committed rEMYV4c34a2e64acd: Replace use of Parser::disableCache(), deprecated in MW 1.28 (authored by cscott).
Mon, Nov 4, 10:37 PM
cscott committed rEBTXdbd964503d5f: Replace use of Parser::disableCache(), deprecated in MW 1.28 (authored by cscott).
Mon, Nov 4, 8:28 PM
cscott committed rELIW278d1ecf9931: Replace use of Parser::disableCache(), deprecated in MW 1.28 (authored by cscott).
Mon, Nov 4, 8:28 PM
cscott committed rECNSbc2d5f086132: Replace use of Parser::disableCache(), deprecated in MW 1.28 (authored by cscott).
Mon, Nov 4, 8:27 PM

Tue, Oct 29

cscott added a parent task for T236813: Magic word implementations should be moved out of Parser.php: T236809: Refactor Parser.php to allow alternate parser (Parsoid).
Tue, Oct 29, 4:22 PM · Parsoid, MediaWiki-Parser
cscott added a subtask for T236809: Refactor Parser.php to allow alternate parser (Parsoid): T236813: Magic word implementations should be moved out of Parser.php.
Tue, Oct 29, 4:22 PM · Parsoid, MediaWiki-Parser
cscott created T236813: Magic word implementations should be moved out of Parser.php.
Tue, Oct 29, 4:22 PM · Parsoid, MediaWiki-Parser
cscott added a subtask for T236811: Parser creation should always use factory: T236812: Parser.php should be split into a base class and a parser implementation.
Tue, Oct 29, 4:18 PM · Parsoid, MediaWiki-Parser
cscott added a parent task for T236812: Parser.php should be split into a base class and a parser implementation: T236811: Parser creation should always use factory.
Tue, Oct 29, 4:18 PM · Parsoid, MediaWiki-Parser
cscott created T236812: Parser.php should be split into a base class and a parser implementation.
Tue, Oct 29, 4:18 PM · Parsoid, MediaWiki-Parser
cscott added a subtask for T236809: Refactor Parser.php to allow alternate parser (Parsoid): T236811: Parser creation should always use factory.
Tue, Oct 29, 4:16 PM · Parsoid, MediaWiki-Parser
cscott added a parent task for T236811: Parser creation should always use factory: T236809: Refactor Parser.php to allow alternate parser (Parsoid).
Tue, Oct 29, 4:16 PM · Parsoid, MediaWiki-Parser
cscott created T236811: Parser creation should always use factory.
Tue, Oct 29, 4:16 PM · Parsoid, MediaWiki-Parser
cscott added a project to T236809: Refactor Parser.php to allow alternate parser (Parsoid): MediaWiki-Parser.
Tue, Oct 29, 4:12 PM · Parsoid, MediaWiki-Parser
cscott added a subtask for T236809: Refactor Parser.php to allow alternate parser (Parsoid): T236810: Make private methods of Parser.php actually private.
Tue, Oct 29, 4:12 PM · Parsoid, MediaWiki-Parser
cscott added a parent task for T236810: Make private methods of Parser.php actually private: T236809: Refactor Parser.php to allow alternate parser (Parsoid).
Tue, Oct 29, 4:12 PM · Parsoid, MW-1.34-notes, MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), Core Platform Team Workboards (Clinic Duty Team), MediaWiki-Parser
cscott created T236810: Make private methods of Parser.php actually private.
Tue, Oct 29, 4:12 PM · Parsoid, MW-1.34-notes, MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), Core Platform Team Workboards (Clinic Duty Team), MediaWiki-Parser
cscott created T236809: Refactor Parser.php to allow alternate parser (Parsoid).
Tue, Oct 29, 4:10 PM · Parsoid, MediaWiki-Parser
cscott added a comment to T197902: Be more selective in applying French Space armoring.

@Od1n could you do some quick benchmarks to satisfy the reviewer on the patch above?

Tue, Oct 29, 3:43 PM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), MediaWiki-Parser

Fri, Oct 25

cscott added a comment to T235656: Ref fragments remain unexpanded in Image:Frameless, mw:ExpandedAttrs, mw:LanguageVariant nodes.

Hm. Where should the ref go -- does PHP actually generate an entry in <references/> for this (gosh, I hope so...)?

Fri, Oct 25, 9:19 PM · Parsoid

Thu, Oct 24

cscott added a comment to T236213: Utils::bcp47 code needs to be ported.

I think this was a function which is in core and was ported to JS?

Thu, Oct 24, 2:28 PM · Parsoid-PHP

Wed, Oct 23

cscott added a comment to T235661: section id differences.

We have <section id="0">....</section><section id="981">.... and they appear to count up from there, but they must jump because we end with <section id="1330">....</section><section id="-1">....</section>.

Wed, Oct 23, 4:35 PM · Parsoid-PHP
cscott added a comment to T235661: section id differences.

Hm, that's strange: the section ids in the legacy parser only go up to 31:

Wed, Oct 23, 4:19 PM · Parsoid-PHP

Tue, Oct 22

cscott added a comment to T235691: Big DSR diff.

So the main body of the gallery has sensible DSRs:

<body data-parsoid='{"dsr":[0,84,0,0]}'
<ul class="gallery mw-gallery-traditional" typeof="mw:Extension/gallery" data-parsoid='{"dsr":[0,83,9,10]}'
<div class="gallerytext" data-parsoid='{"dsr":[23,72,0,0]}'
<sup class="mw-ref" id="cite_ref-1" typeof="mw:Extension/ref" data-parsoid='{"dsr":[34,72,5,6]}'

And the outer wrapper of the auto-inserted <references/> is zero-width, but it contains the link with the bogus DSR (which fwiw is outside the boundary of its parent):

<div class="mw-references-wrap" typeof="mw:Extension/references" data-parsoid='{"dsr":[84,84,0,0]}'
<a rel="mw:ExtLink" href="https://cscott.net" class="external text" data-parsoid='{"dsr":[46,73,20,1]}'>

in PHP that last is:

<a rel="mw:ExtLink" href="https://cscott.net" class="external text" data-parsoid='{"dsr":[46,73,18,1]}'

...but both are bogus. The string at [46,73] is //cscott.net foo💩 ]</ref>\n, and neither the first 18 nor the first 20 characters of that is a sensible "open tag" width.

Tue, Oct 22, 8:42 PM · Parsoid-PHP
cscott added a comment to T236183: Link trail differences between Parsoid/JS & Parsoid/PHP.

This turns out to be an instance of the dreaded "PHP regex D modifier" bug.
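
For background, a sketch of the general semantics involved (the comment does not spell out which regex was affected, so this is not the actual fix). In PCRE, `$` without the D modifier also matches just before a single trailing newline, whereas JavaScript's `$` without the `m` flag matches only at the very end of input. The second pattern below is an illustrative emulation, not Parsoid code:

```javascript
// JavaScript `$` (no `m` flag): end of input only. This corresponds to
// PCRE's `$` with the D (dollar-end-only) modifier.
const jsDollar = /^[a-z]+$/;

// Emulation of PCRE's default `$` (no D modifier): end of input, or the
// position just before one final newline. `(?![\s\S])` means "no more input".
const pcreDefaultDollar = /^[a-z]+(?=\n?(?![\s\S]))/;

console.log(jsDollar.test("abc\n"));          // false
console.log(pcreDefaultDollar.test("abc\n")); // true
```

A pattern ported from JS to PHP without adding D can therefore accept input with a trailing newline that the JS original rejects.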

Tue, Oct 22, 7:39 PM · Parsoid-PHP
cscott created T236205: Clean up Parsoid metrics.
Tue, Oct 22, 7:20 PM · Parsoid
cscott added a comment to T236183: Link trail differences between Parsoid/JS & Parsoid/PHP.

Ok, I can reproduce it (parsertest in gerrit above), and confirmed that this is a Parsoid/PHP bug. Now to figure out how to fix it....

Tue, Oct 22, 4:46 PM · Parsoid-PHP
cscott added a comment to T236183: Link trail differences between Parsoid/JS & Parsoid/PHP.

for lnwiki, looks like the link*trail* is getting stuck on as a prefix? Is that what that looks like to you?

Tue, Oct 22, 4:11 PM · Parsoid-PHP
cscott added a comment to T236112: Missing contentmodel handlers for everything but wikitext.

Probably missing implementations of the JSON and ProofRead page extensions...

Tue, Oct 22, 1:32 AM · Parsoid-PHP

Mon, Oct 21

cscott added a comment to T235691: Big DSR diff.

I can reproduce (at least part of) the DSR differences with the following wikitext:

<gallery>
File:Foo.jpg|💩 caption <ref>[https://cscott.net foo💩 ]</ref>
</gallery>
Mon, Oct 21, 9:50 PM · Parsoid-PHP
cscott added a comment to T235563: Link prefix differences between Parsoid/JS & Parsoid/PHP.

from kawiki is \u2013; from arwiki is \u2014. Both of these are in the link prefix range of \x80-\x10ffff, but even if we're not propagating the u modifier across, it would be parsed as [\x80-\xDBFF\xDFFF] and \u2013/\u2014 ought to be in that range.
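
A quick way to check that reasoning in JavaScript, as a sketch: the two character classes below are written by hand to mirror the ranges named above and are not copied from Parsoid. With the `u` flag the class spans code points up to U+10FFFF; without it, the upper bound splits into the surrogate pair \uDBFF\uDFFF and the class works on UTF-16 code units.

```javascript
// With the `u` flag: code-point semantics, range U+0080..U+10FFFF.
const withUFlag = /^[a-zA-Z\u{80}-\u{10ffff}]$/u;

// Without the `u` flag: U+10FFFF decomposes into \uDBFF\uDFFF, so the
// class is effectively [a-zA-Z\u0080-\uDBFF\uDFFF] over code units.
const withoutUFlag = /^[a-zA-Z\u0080-\uDBFF\uDFFF]$/;

for (const ch of ["\u2013", "\u2014"]) {
  console.log(withUFlag.test(ch), withoutUFlag.test(ch)); // true true
}
```

Both dashes match either way, which supports the point that a missing `u` modifier alone would not explain the difference. The two forms only diverge for astral characters such as 💩, which is a single code point but two code units.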

Mon, Oct 21, 2:42 PM · Parsoid-PHP

Sat, Oct 19

cscott added a comment to T235563: Link prefix differences between Parsoid/JS & Parsoid/PHP.

Link prefix/trail on arwiki is:

"linkprefixcharset": "a-zA-Z\\x{80}-\\x{10ffff}",
"linkprefix": "/^((?>.*[^a-zA-Z\\x{80}-\\x{10ffff}]|))(.+)$/sDu",
"linktrail": "/^([a-z\u0621-\u064a]+)(.*)$/sDu",

Link prefix/trail on kawiki is:

"linkprefixcharset": "a-zA-Z\\x{80}-\\x{10ffff}",
"linkprefix": "/^((?>.*[^a-zA-Z\\x{80}-\\x{10ffff}]|))(.+)$/sDu",
"linktrail": "/^([a-z\u10d0\u10d1\u10d2\u10d3\u10d4\u10d5\u10d6\u10d7\u10d8\u10d9\u10da\u10db\u10dc\u10dd\u10de\u10df\u10e0\u10e1\u10e2\u10e3\u10e4\u10e5\u10e6\u10e7\u10e8\u10e9\u10ea\u10eb\u10ec\u10ed\u10ee\u10ef\u10f0\u201c\u00bb]+)(.*)$/sDu"
Sat, Oct 19, 2:59 AM · Parsoid-PHP
cscott added a comment to T235392: I09a178e5c6938954edb2949f13660227d6a01fbc breaks extension Semantic MediaWiki.

Well, 1.34.0 is due in November, so it seems we don't actually have to wait that long to fix T228881. In fact, since 1.34 has already forked and @Reedy has done the backports, it could be landed on master now. I'd prefer waiting a few weeks just to let the deprecation code ride the train and get testing and make sure @Fomafix and I haven't inadvertently regressed something with our revert + deprecation patches, but early November seems reasonable to land https://gerrit.wikimedia.org/r/544249.

Sat, Oct 19, 2:09 AM · MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), MW-1.34-notes, Parsing-Team, MW-1.34-release, MediaWiki-Parser
cscott added a comment to T235392: I09a178e5c6938954edb2949f13660227d6a01fbc breaks extension Semantic MediaWiki.

This is all backported

Sat, Oct 19, 2:03 AM · MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), MW-1.34-notes, Parsing-Team, MW-1.34-release, MediaWiki-Parser

Fri, Oct 18

cscott added a comment to T229074: Preparing VisualEditor for Parsoid-PHP switch.

Here's my draft list of scenarios to test, from our hangout the other day:

Fri, Oct 18, 7:43 PM · Editing QA, Patch-For-Review, MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), VisualEditor (Current work), Core Platform Team, Parsoid-PHP

Thu, Oct 17

cscott added a comment to T235392: I09a178e5c6938954edb2949f13660227d6a01fbc breaks extension Semantic MediaWiki.

So to be concrete, I'm suggesting to merge @Fomafix's revert https://gerrit.wikimedia.org/r/543001 and then my deprecation patch on top of that https://gerrit.wikimedia.org/r/543903 but *not* the hasTitle and return type hint patches https://gerrit.wikimedia.org/r/543002 and https://gerrit.wikimedia.org/r/543003.

Thu, Oct 17, 5:03 PM · MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), MW-1.34-notes, Parsing-Team, MW-1.34-release, MediaWiki-Parser
cscott added a comment to T235392: I09a178e5c6938954edb2949f13660227d6a01fbc breaks extension Semantic MediaWiki.

What about checking for $mTitle === null and issuing a deprecation warning?

Thu, Oct 17, 4:51 PM · MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), MW-1.34-notes, Parsing-Team, MW-1.34-release, MediaWiki-Parser

Wed, Oct 16

cscott added a comment to T235684: id and fallback id differences.

It's easy enough in this instance, and if it helps reduce noise in the HTML diffs it helps increase our confidence in deploying Parsoid/PHP so I think it's still worthwhile.

Wed, Oct 16, 10:17 PM · Chinese-Sites, Parsoid-PHP
cscott added a comment to T235684: id and fallback id differences.

Ok, tracked it down to:

Sanitizer.normalizeSectionIdWhiteSpace = function(id) {
	return id.replace(/[ _]+/g, ' ').trim();
};

vs

public static function normalizeSectionIdWhiteSpace( string $id ): string {
		return trim( preg_replace( '/[ _]+/', ' ', $id ) );
}
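
One plausible reading of why these two differ, sketched in JavaScript: String.prototype.trim() strips Unicode whitespace such as U+00A0, while PHP's trim() with its default character list strips only ASCII space, tab, newline, carriage return, NUL and vertical tab. The phpLikeNormalize helper is a hypothetical emulation, and the assumption that a leading non-breaking space is what triggers the diff in this task is mine, not stated above.

```javascript
// Mirrors the JS implementation quoted above.
function jsNormalize(id) {
  return id.replace(/[ _]+/g, " ").trim();
}

// Hypothetical emulation of the PHP version: same replacement, but the
// trim step only removes PHP's default trim() characters.
function phpLikeNormalize(id) {
  return id
    .replace(/[ _]+/g, " ")
    .replace(/^[ \t\n\r\0\x0B]+|[ \t\n\r\0\x0B]+$/g, "");
}

const sample = "\u00A0中国大陆}"; // leading NBSP, as an &nbsp; entity would produce
console.log(JSON.stringify(jsNormalize(sample)));
console.log(JSON.stringify(phpLikeNormalize(sample)));
```

With ASCII-only whitespace the two helpers agree, so a difference would only show up on input like the sample.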
Wed, Oct 16, 9:51 PM · Chinese-Sites, Parsoid-PHP
cscott added a comment to T235684: id and fallback id differences.

Minimum repro:

$ echo '==={{CHNML}}}===' | php bin/parse.php --domain zh.wikipedia.org --body_only
<h3 id="_中国大陆}" data-parsoid='{"dsr":[0,16,3,3]}'><span id="_.E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86.7D" typeof="mw:FallbackId" data-parsoid='{"dsr":[3,3,null,null]}'></span><span class="flagicon" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","dsr":[3,12,null,null],"pi":[[]]}' data-mw='{"parts":[{"template":{"target":{"wt":"CHNML","href":"./Template:CHNML"},"params":{},"i":0}}]}'><figure-inline class="mw-image-border" typeof="mw:Image"><span><img alt="" resource="./File:Flag_of_the_People's_Republic_of_China.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/22px-Flag_of_the_People%27s_Republic_of_China.svg.png" data-file-width="900" data-file-height="600" data-file-type="drawing" height="15" width="22" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/44px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/33px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x"/></span></figure-inline><span typeof="mw:Entity"> </span></span><a rel="mw:WikiLink" href="./中国大陆" title="中国大陆" about="#mwt1" data-parsoid='{"stx":"piped","a":{"href":"./中国大陆"},"sa":{"href":"中国大陆"}}'>中国大陆</a>}</h3>

-vs-

$ echo '==={{CHNML}}}===' | bin/parse.js --domain zh.wikipedia.org --body_only
<h3 id="中国大陆}" data-parsoid='{"dsr":[0,16,3,3]}'><span id=".E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86.7D" typeof="mw:FallbackId" data-parsoid='{"dsr":[3,3]}'></span><span class="flagicon" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","dsr":[3,12,null,null],"pi":[[]]}' data-mw='{"parts":[{"template":{"target":{"wt":"CHNML","href":"./Template:CHNML"},"params":{},"i":0}}]}'><figure-inline class="mw-image-border" typeof="mw:Image"><span><img alt="" resource="./File:Flag_of_the_People's_Republic_of_China.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/22px-Flag_of_the_People%27s_Republic_of_China.svg.png" data-file-width="900" data-file-height="600" data-file-type="drawing" height="15" width="22"/></span></figure-inline><span typeof="mw:Entity"> </span></span><a rel="mw:WikiLink" href="./中国大陆" title="中国大陆" about="#mwt1" data-parsoid='{"stx":"piped","a":{"href":"./中国大陆"},"sa":{"href":"中国大陆"}}'>中国大陆</a>}</h3>
Wed, Oct 16, 9:11 PM · Chinese-Sites, Parsoid-PHP
cscott added a comment to T235653: Template data-parsoid "spc" property differences.

Ah. Is the value always '' in cases where the diff appears? It seemed like you were saying that in some cases the unicode space wasn't being properly (?) stripped from the value. (That seems to be the case in T235684, for example.)

Wed, Oct 16, 8:03 PM · Parsoid-PHP
cscott added a comment to T235653: Template data-parsoid "spc" property differences.

Which of these (from your second example) is actually correct?

{"k":"div 1","named":true,"spc":["","","","   \n"]},
{"k":"div 1","named":true,"spc":["","","  ","\n"]},

Either the whitespace appears both before *and* after div 1 in the wikitext (PHP) or it only appears after div 1 (JS). One of those has to be wrong, and we should fix it.

Wed, Oct 16, 7:52 PM · Parsoid-PHP
cscott renamed T235684: id and fallback id differences from Fallback id differences to id and fallback id differences.
Wed, Oct 16, 5:54 PM · Chinese-Sites, Parsoid-PHP
cscott added a comment to T235684: id and fallback id differences.

the regular id attribute seems to be different as well -- looks like we're not doing whitespace stripping on the left hand side appropriately.

Wed, Oct 16, 5:50 PM · Chinese-Sites, Parsoid-PHP
cscott committed rMLLCd33b49fad560: Update documentation in preparation for 0.1.0 release. (authored by cscott).
Wed, Oct 16, 3:17 PM

Tue, Oct 15

cscott added a comment to T235552: Nested DSR offsets aren't converted from byte to ucs2 offsets by the convertOffsets code.

hm, it *should* be converted. I wrote the recursive code. Maybe there's a case missing somewhere. Will look into it...
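
For readers unfamiliar with the problem, a minimal sketch of a recursive byte-to-UCS-2 conversion. This is not Parsoid's convertOffsets code: both function names are hypothetical, byte offsets are assumed to fall on character boundaries, and as a simplification only the first two dsr entries (start and end) are converted while the width entries are left untouched.

```javascript
// Map a UTF-8 byte offset into `text` to a UTF-16 code-unit offset.
function byteToUcs2Offset(text, byteOffset) {
  let bytes = 0;
  let units = 0;
  for (const ch of text) { // for...of iterates by code point
    if (bytes >= byteOffset) break;
    const cp = ch.codePointAt(0);
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    units += ch.length; // 1 code unit for BMP, 2 for astral characters
  }
  return units;
}

// Walk an arbitrarily nested object/array and convert every `dsr` array
// found, in place. Nested dsr values are reached through the recursion.
function convertDsrOffsets(node, text) {
  if (Array.isArray(node)) {
    node.forEach((child) => convertDsrOffsets(child, text));
  } else if (node !== null && typeof node === "object") {
    for (const [key, value] of Object.entries(node)) {
      if (key === "dsr" && Array.isArray(value)) {
        for (let i = 0; i < 2 && i < value.length; i++) {
          if (typeof value[i] === "number") {
            value[i] = byteToUcs2Offset(text, value[i]);
          }
        }
      } else {
        convertDsrOffsets(value, text);
      }
    }
  }
  return node;
}
```

For the text "a💩b", the emoji occupies 4 bytes but 2 code units, so a byte range [1,5] becomes [1,3]. A "missing case" of the kind mentioned above would be any container the walker fails to descend into.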

Tue, Oct 15, 7:53 PM · Parsoid-PHP
cscott added a comment to T234966: Decide on HTML format for machine-readable signatures.

So...

<span vocab="http://schema.org" typeof="Comment" class="mw-signature">
    <span rel="creator" resource="/wiki/User:cscott"> <!-- machine readable link to username in span attributes -->
       <!-- But note that everything in this <span> is customizable text, don't try to parse inside the tag -->
       <a href="/wiki/User:cscott" title="User:cscott">C. Scott Ananian</a> (<a href="/wiki/User_talk:cscott" title="User talk:cscott">talk</a>)
     </span>
    <time property="dateCreated" datetime="2007-03-29T18:07Z">18:07, 29 March 2007 (UTC)</time>
</span>

?
(You could also add the userid as an attribute to the <span> if that was thought useful.)

Tue, Oct 15, 4:19 PM · OWC2020

Oct 11 2019

cscott added a comment to T234979: Tracking task for addressing HTML string diffs between Parsoid/JS & Parsoid/PHP.

These diffs should be normalized away

ssastry@scandium:/srv/deployment/parsoid/deploy/src/bin$ node diff.html.js diff.yaml fr.wikipedia.org Paris
DIFFS FOR fr.wikipedia.org:Paris
...
----- JS:[94991, 95143] -----
<li class="gallerybox" style="width: 477.3333333333333px;" data-parsoid="{}">
<div class="thumb" style="width: 475.3333333333333px;" data-parsoid="{}">
+++++ PHP:[94895, 95043] +++++
<li class="gallerybox" style="width: 477.33333333333px;" data-parsoid="{}">
<div class="thumb" style="width: 475.33333333333px;" data-parsoid="{}">
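
One way such diffs could be normalized away, as a sketch: round long decimal fractions to a fixed number of digits on both sides before comparing. The function name, the 7-digit threshold and the 6-digit precision are all assumptions for illustration; the comment above does not say how the normalization should be done.

```javascript
// Round any number with 7 or more fractional digits to `digits` decimals,
// so that differing float precision in JS and PHP output compares equal.
function normalizeLongDecimals(html, digits = 6) {
  return html.replace(/\d+\.\d{7,}/g, (match) => Number(match).toFixed(digits));
}

const jsLine = '<li class="gallerybox" style="width: 477.3333333333333px;" data-parsoid="{}">';
const phpLine = '<li class="gallerybox" style="width: 477.33333333333px;" data-parsoid="{}">';
console.log(normalizeLongDecimals(jsLine) === normalizeLongDecimals(phpLine)); // true
```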
Oct 11 2019, 4:00 PM · Parsoid-PHP
cscott moved T235273: Remove PHPUtils::jsSort routine (or make it no-op) once Parsoid/JS is retired from Backlog to Porting Tech Debt Redressal on the Parsoid-PHP board.
Oct 11 2019, 3:14 PM · Parsoid-PHP
cscott edited projects for T235273: Remove PHPUtils::jsSort routine (or make it no-op) once Parsoid/JS is retired, added: Parsoid-PHP; removed Parsoid.
Oct 11 2019, 3:12 PM · Parsoid-PHP
cscott created T235273: Remove PHPUtils::jsSort routine (or make it no-op) once Parsoid/JS is retired.
Oct 11 2019, 3:11 PM · Parsoid-PHP

Oct 10 2019

cscott created T235217: Parsoid should use protocol-relative URLs for media.
Oct 10 2019, 8:38 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T235179: Implement workarounds in RESTBase and Flow to hit Parsoid/PHP REST API endpoints without an oldid for titles containing ".".

subbu: would percent-encoding the . as %2E help, or is this IE6/7 hack done after percent-decoding?

Oct 10 2019, 7:26 PM · Core Platform Team Workboards (Clinic Duty Team), StructuredDiscussions, RESTBase, Growth-Team, Parsoid-PHP
cscott added a comment to T235179: Implement workarounds in RESTBase and Flow to hit Parsoid/PHP REST API endpoints without an oldid for titles containing ".".

To be clear (and I'll edit this if I turn out to be wrong), this is a *future* not *current* problem. It will break when we switch to Parsoid/PHP from Parsoid/JS.

Oct 10 2019, 6:43 PM · Core Platform Team Workboards (Clinic Duty Team), StructuredDiscussions, RESTBase, Growth-Team, Parsoid-PHP

Oct 8 2019

cscott updated the task description for T234966: Decide on HTML format for machine-readable signatures.
Oct 8 2019, 5:07 PM · OWC2020
cscott updated the task description for T234966: Decide on HTML format for machine-readable signatures.
Oct 8 2019, 5:06 PM · OWC2020
cscott updated the task description for T234966: Decide on HTML format for machine-readable signatures.
Oct 8 2019, 5:03 PM · OWC2020
cscott updated the task description for T234966: Decide on HTML format for machine-readable signatures.
Oct 8 2019, 5:03 PM · OWC2020
cscott added a subtask for T230653: Use a parser function to encapsulate signatures: T234966: Decide on HTML format for machine-readable signatures.
Oct 8 2019, 4:55 PM · Patch-For-Review, OWC2020, MediaWiki-Parser
cscott added a parent task for T234966: Decide on HTML format for machine-readable signatures: T230653: Use a parser function to encapsulate signatures.
Oct 8 2019, 4:55 PM · OWC2020
cscott added a project to T234966: Decide on HTML format for machine-readable signatures: OWC2020.
Oct 8 2019, 4:55 PM · OWC2020
cscott created T234966: Decide on HTML format for machine-readable signatures.
Oct 8 2019, 4:52 PM · OWC2020
cscott created T234932: Parsoid srcset is inconsistent with core.
Oct 8 2019, 2:59 PM · Parsoid

Oct 7 2019

cscott added a comment to T234863: Devise a suitable parser or phpunit test for commit 4b9344af71c69a49735eddae3278e4d6460532e4.

Quoting the commit down here in comments so it autolinks properly and I can click through to see what it is: 4b9344af71c69a49735eddae3278e4d6460532e4

Oct 7 2019, 9:45 PM · Parsoid-PHP

Oct 4 2019

cscott added a comment to T197879: Fix mw:DisplaySpace to match PHP "armorFrenchSpaces".

To restate, I'm proposing that we take the DisplaySpace hack *out* of the tokenizer, and instead run it as a DOMPostProcessor pass, with a corresponding preprocessor in the html2wt side to reverse that transformation.

Oct 4 2019, 4:53 PM · Parsoid-Read-Views, Parsoid-Rendering

Oct 3 2019

cscott added a comment to T120085: RFC: Serve Main Page of Wikimedia wikis from a consistent URL.

From Parsoid's perspective:

Oct 3 2019, 9:14 PM · CommRel-Specialists-Support, Readers-Web-Backlog (Tracking), MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), Fundraising-Backlog, Editing-team, Parsing-Team, User-notice, MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), Core Platform Team, Patch-For-Review, Performance-Team, Operations, Traffic, TechCom-RFC, SEO, Wikimedia-Site-requests
cscott added a comment to T232182: Parsoid/PHP performance benchmarking on scandium / eqiad cluster.

What about splitting the hosts, and directing JS traffic to one half of them and PHP traffic to the other half? Then by changing exactly what machines are in each pool we can reallocate resources as necessary, since every machine would be capable of responding to either PHP or JS requests.

Oct 3 2019, 5:57 PM · Patch-For-Review, Performance-Team (Radar), Performance Issue, Parsoid-PHP
cscott added a comment to T234549: "Properly" address missing srcText issues in PageConfigFrame.

See also T233818: Call to a member function getContent() on null.

Oct 3 2019, 4:50 PM · Parsoid
cscott added a project to T234548: Serializer should use Frame, not SelserData: Parsoid.
Oct 3 2019, 4:46 PM · Parsoid, Patch-For-Review
cscott created T234549: "Properly" address missing srcText issues in PageConfigFrame.
Oct 3 2019, 4:46 PM · Parsoid
cscott created T234548: Serializer should use Frame, not SelserData.
Oct 3 2019, 4:41 PM · Parsoid, Patch-For-Review

Sep 30 2019

cscott added a comment to T233818: Call to a member function getContent() on null.

I think the "proper" solution would be to carry around some sort of parameter -- maybe even all the way from the REST API in terms of a special path or query parameter -- to allow us to distinguish the case where a revision not being found is "expected" (ie, new page creation), from where revision not being found should be "tolerated" (ie, it's always possible there's a race between saving a new revision and someone deleting the page -- but ideally in this case we'll have the old revision ourselves and pass it in to the API, so that we can generate proper wikitext w/o having to fetch the original revision), from where revision not being found/supplied is definitely a bug that should generate a 500 error.
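
The three cases could be sketched roughly as follows. Every name here is hypothetical and nothing is taken from Parsoid's actual API. Falling back to an empty string in the "tolerated" case when no old text was supplied is also an assumption.

```javascript
// Hypothetical policy values for what a missing revision means.
const MissingRevisionPolicy = Object.freeze({
  EXPECTED: "expected",   // new page creation: no previous revision exists
  TOLERATED: "tolerated", // e.g. the page was deleted while an edit was in flight
  ERROR: "error",         // a missing revision indicates a bug
});

// fetchedText: source of the original revision, or null if it was not found.
// suppliedText: old revision text passed in by the caller, if any.
function resolveOriginalSource(fetchedText, policy, suppliedText = null) {
  if (typeof fetchedText === "string") {
    return { status: 200, text: fetchedText };
  }
  switch (policy) {
    case MissingRevisionPolicy.EXPECTED:
      return { status: 200, text: "" }; // page starts from scratch
    case MissingRevisionPolicy.TOLERATED:
      // Prefer the caller-supplied old text; otherwise fall back to empty.
      return { status: 200, text: typeof suppliedText === "string" ? suppliedText : "" };
    default:
      return { status: 500, text: null }; // definitely a bug
  }
}
```

Carrying the policy as an explicit parameter keeps the decision with the caller, which is the point of threading it through from the REST layer.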

Sep 30 2019, 9:40 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T233818: Call to a member function getContent() on null.

Well, we have to be able to create a page from scratch (with no previous revision). So a zero-length string seems reasonable to me as a fallback, in both the Api config and the integrated config.

Sep 30 2019, 6:10 PM · Patch-For-Review, Parsoid-PHP