cscott (C. Scott Ananian)
Parser whisperer

User Details

User Since: Oct 21 2014, 6:47 PM (230 w, 3 d)
Availability: Available
IRC Nick: cscott
LDAP User: Unknown
MediaWiki User: Cscott [ Global Accounts ]

Editor since 2005; WMF developer since 2013. I work on Parsoid and OCG, and dabble with VE, real-time collaboration, and OOjs.

On GitHub: https://github.com/cscott

See https://en.wikipedia.org/wiki/User:cscott for more.

Recent Activity

Thu, Mar 21

cscott updated subscribers of T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).
Thu, Mar 21, 8:59 PM · PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott renamed T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature) from MWException when viewing or comparing certain pages with PHP7 beta feature to MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).
Thu, Mar 21, 8:58 PM · PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott updated subscribers of T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).
Thu, Mar 21, 8:58 PM · PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

Interesting. Probably not actually related to PHP 7 but to Preprocessor_DOM -- I believe HHVM still runs Preprocessor_Hash. I've got an outstanding request to reduce the code duplication: T204945: Deprecate one of the Preprocessor implementations for 1.33.

Thu, Mar 21, 8:58 PM · PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott added a comment to T218702: ParserIntegrationTest for Scribunto failing in Wikibase CI.

Yep, most likely related. @Arlolra will probably write you a patch once he wakes up/gets online this am, but if you're impatient just removing the newlines from the indicated places in parserTests.txt will get you going again.

Thu, Mar 21, 2:27 PM · Patch-For-Review, MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), SyntaxHighlight, User-Addshore, Wikimedia-production-error (Shared Build Failure), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-extensions-Scribunto, MediaWiki-Core-Testing
Mill <mill@mail.com> committed rEPFM5b55c1c2a2aa: %26dbaaaaaaaaaaa (authored by cscott).
%26dbaaaaaaaaaaa
Thu, Mar 21, 12:24 AM
cscott committed rMLZE8fc57991b857: Update README.md (authored by cscott).
Update README.md
Thu, Mar 21, 12:22 AM
cscott committed rMLZE023d6c4ac2a9: Update README.md (authored by cscott).
Update README.md
Thu, Mar 21, 12:22 AM
cscott committed rMLZE322afbbd7ee8: Update README.md (authored by cscott).
Update README.md
Thu, Mar 21, 12:22 AM
cscott added a comment to T218817: PHP Warning: count(): Parameter must be an array or an object that implements Countable.

It would be a good idea to add some parser tests which exercise both modes of StringUtils::explode. If this hadn't thrown an exception for being not Countable, we probably wouldn't have noticed that the array keys shift on large articles until it made production.

Thu, Mar 21, 12:08 AM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), User-zeljkofilipin, Patch-For-Review, Core Platform Team, Parsing-Team, MediaWiki-Parser, Wikimedia-production-error

Wed, Mar 20

cscott added a comment to T218816: MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam wants null even if type is nullable.

Because the type hint just below it will say ?string, not string|null. We should be using the one notation that PHP itself uses, not making up our own.

Wed, Mar 20, 10:16 PM · MediaWiki-Codesniffer
cscott added a comment to T218817: PHP Warning: count(): Parameter must be an array or an object that implements Countable.

ExplodeIterator::key() returns $this->curPos, *not* the line number. So with the 497863 patch it fixes the exceptions (and so should be fine to backport to wmf.22) but for wikitexts over a thousand lines it will effectively "never be on the last line" and so you'll get an extra trailing newline.
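The failure mode described above can be modelled in a few lines. This is a JavaScript sketch, not MediaWiki's PHP code: `explodeByOffset` and `findLastLine` are hypothetical names, and the generator only imitates an iterator whose key is the current position rather than the line number. It compares a "last line" test that trusts the key with one that counts lines itself.

```javascript
// Model only (not MediaWiki's ExplodeIterator): yields [key, piece] pairs
// where key is the character offset of the piece, not its ordinal.
// Assumes a non-empty separator.
function* explodeByOffset(separator, subject) {
  let pos = 0;
  for (;;) {
    const end = subject.indexOf(separator, pos);
    if (end === -1) {
      yield [pos, subject.slice(pos)];
      return;
    }
    yield [pos, subject.slice(pos, end)];
    pos = end + separator.length;
  }
}

// Returns the pieces each test believes to be "the last line".
function findLastLine(text) {
  const lineCount = text.split('\n').length;
  const byKey = [];
  const byOrdinal = [];
  let ordinal = 0;
  for (const [key, line] of explodeByOffset('\n', text)) {
    if (key === lineCount - 1) byKey.push(line); // treats the key as a line number
    if (ordinal === lineCount - 1) byOrdinal.push(line); // counts lines itself
    ordinal++;
  }
  return { byKey, byOrdinal };
}

// For this input the key-based test never fires; the ordinal test finds 'cc'.
console.log(findLastLine('aa\nbb\ncc'));
```

Keeping a separate counter avoids depending on what the key means in either iteration mode.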

Wed, Mar 20, 9:02 PM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), User-zeljkofilipin, Patch-For-Review, Core Platform Team, Parsing-Team, MediaWiki-Parser, Wikimedia-production-error
cscott added a comment to T218324: MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam wants redundant "mixed|null".

See also T218816: MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam wants null even if type is nullable.

Wed, Mar 20, 7:42 PM · Patch-For-Review, MediaWiki-Codesniffer
cscott created T218816: MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam wants null even if type is nullable.
Wed, Mar 20, 7:41 PM · MediaWiki-Codesniffer

Tue, Mar 19

cscott added a comment to T218702: ParserIntegrationTest for Scribunto failing in Wikibase CI.

Seems to be fixed. Watching builds of https://gerrit.wikimedia.org/r/464096 https://gerrit.wikimedia.org/r/497471 and https://gerrit.wikimedia.org/r/497320 to confirm.

Tue, Mar 19, 6:43 PM · Patch-For-Review, MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), SyntaxHighlight, User-Addshore, Wikimedia-production-error (Shared Build Failure), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-extensions-Scribunto, MediaWiki-Core-Testing
cscott added a comment to T218358: Add data-title attribute to anchors.

Yes, the intention is certainly that title (in normalized form, ie spaces converted) can always be derived by removing the relative path. Furthermore, in modern Parsoid (not historically, but we can ignore that) that relative path is *always* ./. So just strip the first two characters and you've got your title.
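A minimal sketch of that derivation, assuming hrefs of the form `./Title` as described above. The helper name `titleFromHref` is hypothetical, and the sketch deliberately does nothing about fragments or percent-encoding.

```javascript
// Hypothetical helper (not part of Parsoid): derive the normalized title
// from an href, assuming the relative path prefix is always "./".
function titleFromHref(href) {
  const prefix = './';
  if (!href.startsWith(prefix)) {
    return null; // not in the expected form; the caller decides what to do
  }
  return href.slice(prefix.length);
}

console.log(titleFromHref('./Barack_Obama')); // prints "Barack_Obama"
```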

Tue, Mar 19, 6:38 PM · Readers-Web-Backlog (Tracking), Parsing-Team, MediaWiki-Parser, Internet-Archive, Parsoid, Technical-Debt
cscott added a comment to T218702: ParserIntegrationTest for Scribunto failing in Wikibase CI.

See https://gerrit.wikimedia.org/r/#/q/topic:trail+(status:open+OR+status:merged) for the set of patches merged; scribunto probably needs a patch somewhat like https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ParserFunctions/+/494939/

Tue, Mar 19, 5:16 PM · Patch-For-Review, MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), SyntaxHighlight, User-Addshore, Wikimedia-production-error (Shared Build Failure), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-extensions-Scribunto, MediaWiki-Core-Testing
cscott added a comment to T218702: ParserIntegrationTest for Scribunto failing in Wikibase CI.

Yeah, we should just fix the Scribunto tests, not revert patches. This was a set of 9 or so dependent patches to merge, so unrolling would be quite a chore, and the Scribunto fix should just be to remove some trailing newlines from the parser tests....

Tue, Mar 19, 5:14 PM · Patch-For-Review, MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), SyntaxHighlight, User-Addshore, Wikimedia-production-error (Shared Build Failure), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-extensions-Scribunto, MediaWiki-Core-Testing

Fri, Mar 15

cscott added a comment to T218378: Flaky test Wikibase\Repo SetAliasesTest::testUserCannotSetAliasesWhenTheyLackPermission.

There are a few patches which have actually (apparently) passed this test and gotten merged, e.g. https://gerrit.wikimedia.org/r/496080

Fri, Mar 15, 9:10 PM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), Patch-For-Review, Wikidata-Campsite, Wikidata, MediaWiki-extensions-WikibaseRepository, Wikimedia-production-error (Shared Build Failure)
cscott committed rMLZE35dffc7d807c: Allow passing options to Remex in the loadHtml/parseHtml test helpers (authored by cscott).
Allow passing options to Remex in the loadHtml/parseHtml test helpers
Fri, Mar 15, 5:22 AM

Thu, Mar 14

cscott added a comment to T218183: Audit uses of PHP DOM in Wikimedia software.

Yeah, there's a more-or-less standard-but-ugly workaround that involves using mb_encode to replace everything above U+007F with an HTML entity: https://github.com/wikimedia/html-formatter/blob/5e33e3bbb327b3e0d685cc595837ccb024b72f57/src/HtmlFormatter.php#L71
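The idea behind that workaround (replace everything above U+007F with an entity so that only ASCII reaches the parser) can be sketched as follows. This is a JavaScript analogue written for illustration, not the PHP code at the link; `encodeNonAscii` is a hypothetical name.

```javascript
// Replace every code point above U+007F with a decimal numeric character
// reference, leaving ASCII (including existing markup) untouched.
function encodeNonAscii(html) {
  // The "u" flag makes the class match whole code points, so characters
  // outside the Basic Multilingual Plane become a single reference.
  return html.replace(/[^\x00-\x7F]/gu, (ch) => '&#' + ch.codePointAt(0) + ';');
}

console.log(encodeNonAscii('<p>caf\u00E9</p>')); // prints <p>caf&#233;</p>
```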

Thu, Mar 14, 11:28 PM · TechCom, MediaWiki-General-or-Unknown, Parsoid-PHP
cscott added a comment to T217850: Remex could use some helper/utility classes.

I'd say there's one other use case, and it's what tidy does (AIUI): mutate a string representation of an HTML document in a "safe" way, without ever building the complete DOM tree in memory. That is, "safe" string-to-string transformations. There are probably lots of weird things you could do here, but I would love to see a basic "insert X into Y" (like innerHTML) or "append X to Y" utility, done in a safe way that respected tag boundaries etc. The API of https://github.com/wikimedia/html-formatter/blob/master/src/HtmlFormatter.php could be a guide, just imagine doing it string-to-string without creating an intermediate DOM.

Thu, Mar 14, 7:01 PM · RemexHtml
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

We don't use this (I don't think) but be aware that if you create an attribute named 'xmlns' in PHP's DOM everything breaks: https://marc.info/?l=php-internals&m=155249142123136&w=2

Thu, Mar 14, 6:48 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size.

A performance update is at T204595#5024206

Thu, Mar 14, 6:42 PM · Patch-For-Review, Performance, RemexHtml
cscott added a comment to T218183: Audit uses of PHP DOM in Wikimedia software.

In general I would be happy to see an interface that allowed components to pass around DOM subtrees instead of strings, stringifying the tree only where needed for a legacy API. Then "composition" is (eventually) just subtree assembly, and we don't have to worry about poorly-constructed components leaking open tags into the rest of the content...

Thu, Mar 14, 3:42 PM · TechCom, MediaWiki-General-or-Unknown, Parsoid-PHP
cscott added a comment to T204595: Evaluate and document performance of RemexHtml vs Domino.

Testing with the following script (from the zest library home dir) with xdebug off:

$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> require('./tests/ZestTest.php');
=> 1
>>> $html100 = file_get_contents('./obama.html'); strlen($html100);
=> 2592386
>>> timeit -n10 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html100, [ 'suppressHtmlNamespace' => true, 'ignoreErrors' => true ]) && true;
=> true
Command took 0.389295 seconds on average (0.376101 median; 3.892954 total) to complete.
Thu, Mar 14, 3:00 PM · RemexHtml, Parsoid-PHP
cscott added a comment to T204595: Evaluate and document performance of RemexHtml vs Domino.

Could you post your test scripts somewhere? To be fair, we should probably factor process startup and file read times out of the measurements (the 350ms overhead you're measuring). It seems like we should dig further into the slow Remex performance on large documents, though, to figure out whether there are some O(N^2) tree-mutation algorithms we need to kill and, if so, how hard they will be to fix (i.e., are the bugs in Remex or in the PHP DOM extension?).
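One way to keep startup and file reads out of the numbers is to time only the call under test, inside an already-running process, and report the average and median over several runs. The sketch below is a JavaScript illustration of that approach. `timeIt` and `median` are hypothetical helpers and the workload is a stand-in, not a real parse.

```javascript
function median(values) {
  const sorted = [...values].sort((x, y) => x - y);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Time only fn() itself; process startup and any file reads happen outside.
function timeIt(fn, runs) {
  const seconds = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    fn();
    seconds.push(Number(process.hrtime.bigint() - start) / 1e9);
  }
  const total = seconds.reduce((sum, s) => sum + s, 0);
  return { runs, total, average: total / runs, median: median(seconds) };
}

// Stand-in workload (hypothetical); a real run would load the HTML once,
// up front, and pass the parse call as fn.
const input = 'x'.repeat(100000);
const stats = timeIt(() => input.split('').reverse().join(''), 10);
console.log(stats);
```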

Thu, Mar 14, 4:18 AM · RemexHtml, Parsoid-PHP
cscott closed T124762: parsoid is trying to use relative addressing of images, even though wikitext doesn't. as Resolved.
Thu, Mar 14, 1:43 AM · Patch-For-Review, OCG-General, Parsoid

Wed, Mar 13

cscott committed rMLZE3c64054177a7: Rename package to `wikimedia/zest-css` in composer.json (authored by cscott).
Rename package to `wikimedia/zest-css` in composer.json
Wed, Mar 13, 6:30 PM
cscott closed T217708: Remex should offer an option to not set namespaceURI as Resolved.
Wed, Mar 13, 3:41 PM · Patch-For-Review, RemexHtml
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

I like this section from Symfony's page, which describes why it's nicer for long-term maintenance to use CSS selectors in the code:

Wed, Mar 13, 3:19 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T218183: Audit uses of PHP DOM in Wikimedia software.

I'm also finding performance issues with the PHP DOM implementation in T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size; we could watch for those in the audit too. (In particular, setting the namespace on a node in the PHP DOM seems to trigger a nonlinear slowdown.)

Wed, Mar 13, 3:05 PM · TechCom, MediaWiki-General-or-Unknown, Parsoid-PHP
cscott updated subscribers of T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size.

Some notes copied from chat:

Wed, Mar 13, 3:03 PM · Patch-For-Review, Performance, RemexHtml

Sun, Mar 10

cscott added a comment to T217850: Remex could use some helper/utility classes.

T217849: Remex needs documentation of how to use its API as well. The spec requires \r stripping but IIRC MW also does newline stripping so we're guaranteed that any article we fetch from the DB already has newlines stripped, which is why there's an optimization in Remex to avoid unnecessary work. I wonder if the time savings are actually significant enough to merit the developer cost of maintaining a separate option. In any case, we need to document this stuff better.
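For reference, the newline preprocessing in question is small. A sketch in JavaScript, assuming the rule is the HTML parsing spec's newline normalization (CRLF pairs and lone CRs become LF); both function names are hypothetical and this is not Remex's code.

```javascript
// Normalize newlines: CRLF pairs and lone CRs both become a single LF.
function normalizeNewlines(text) {
  return text.replace(/\r\n?/g, '\n');
}

// Cheap check a caller could use to skip the pass when input is known clean.
function needsNormalization(text) {
  return text.includes('\r');
}

console.log(JSON.stringify(normalizeNewlines('a\r\nb\rc\nd'))); // prints "a\nb\nc\nd"
```

Whether skipping the pass saves a meaningful amount of time is the open question raised above; the sketch only shows what is being skipped.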

Sun, Mar 10, 9:16 PM · RemexHtml

Sat, Mar 9

cscott added a comment to T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size.

The relative # of calls appear to scale linearly between the 10% and 100% benchmarks:


xdebug says that Tokenizer::handleAttribsAndClose accounts for 15.9 of the 20.6 total time units parsing [[en:Barack Obama]] now (after tweaking Remex to immediately initialize the LazyAttributes so the cost is accounted properly). That 15.9 is split about evenly between Tokenizer::consumeAttribs (6.6 units, most of this in Tokenizer::interpretAttribMatches via LazyAttributes::init()) and TreeBuilder\Dispatcher::startTag (6.5 units, most of this in DOMBuilder::insertElement).

Sat, Mar 9, 7:03 PM · Patch-For-Review, Performance, RemexHtml
cscott added a comment to T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size.

Found another slowdown:

<cscott> huh, we're also spending a lot of time in CachingStack->dump() -- that's not right, that's pure debug code, we should never be in there (!)
<cscott> oh no, $this->trace( "AFE\n" . $afe->dump() . "STACK\n" . $stack->dump() );
<cscott>         private function trace( $msg ) {
<cscott>                 // print "[AAA] $msg\n";
<cscott>         }
<cscott> sigh
<cscott> removing the debugging code from TreeBuilder::adoptionAgency brings us down to 6.16s.  getting there! one more second shaved, five more to go.
Sat, Mar 9, 5:21 PM · Patch-For-Review, Performance, RemexHtml
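The pattern in the chat excerpt above costs time because arguments are evaluated before the call, even when the callee does nothing with them. The original is PHP; the JavaScript sketch below shows the same effect with hypothetical names, plus one possible alternative: pass a function and call it only when tracing is enabled.

```javascript
let dumpCalls = 0;

// Stand-in for an expensive debug dump (hypothetical).
function dumpStack() {
  dumpCalls++;
  return 'STACK ...';
}

// No-op tracer, like the commented-out print in the excerpt.
function trace(msg) {
  // console.log('[AAA] ' + msg);
}

// Alternative: accept a function and call it only when tracing is on.
const TRACE_ENABLED = false;
function lazyTrace(makeMsg) {
  if (TRACE_ENABLED) {
    console.log('[AAA] ' + makeMsg());
  }
}

trace('STACK\n' + dumpStack());           // dumpStack() runs although trace() is a no-op
const afterEager = dumpCalls;             // 1
lazyTrace(() => 'STACK\n' + dumpStack()); // dumpStack() is never called
const afterLazy = dumpCalls;              // still 1
console.log(afterEager, afterLazy);
```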
cscott added a comment to T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size.

Part of the slowness appears to be the call to dom_reconcile_ns in node.c -- callgrind seems to confirm that the xmlSearchNsByHref call is slow. Hacking around things by changing createElementNS to createElement in DOMBuilder::createNode (which is, incidentally, something requested as a short-term workaround by T217708: Remex should offer an option to not set namespaceURI) makes DOMElement->insertBefore() disappear from the hot functions list, and seems to reduce time spent parsing [[en:Barack Obama]] from 6.8s to 4.8s on my machine.

Sat, Mar 9, 4:20 PM · Patch-For-Review, Performance, RemexHtml
cscott added a comment to T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size.

Profiling [[en:Barack Obama]] with xdebug seems to confirm that DOMElement->insertBefore() is the slow call; I just need to figure out how to get a C-level profile to get a little more insight into which path through https://github.com/php/php-src/blob/master/ext/dom/node.c#L925 is the slow one.

Sat, Mar 9, 2:04 AM · Patch-For-Review, Performance, RemexHtml

Fri, Mar 8

cscott added a comment to T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size.

DOMBuilder.php uses $parent->insertBefore( $node, $refNode ) which appears to call xmlAddPrevSibling in libxml which ought to be constant time (nodes are stored as a linked list). The quadratic performance makes me think that something in here is operating on the child node *array* instead of a linked list, but I can't find it (yet).
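To make the constant-time expectation concrete, here is a sketch of insertBefore on a doubly linked sibling list. It is a hypothetical JavaScript model (`ListNode` is not libxml's or PHP's structure) and assumes the node being inserted is detached.

```javascript
// Hypothetical model of a DOM-like node whose children form a doubly linked
// list (this is not libxml's or PHP's actual data structure).
class ListNode {
  constructor(name) {
    this.name = name;
    this.parent = null;
    this.prev = null;
    this.next = null;
    this.firstChild = null;
    this.lastChild = null;
  }

  // Insert a detached node before ref (or append when ref is null).
  // Only a fixed number of pointers change, regardless of sibling count.
  insertBefore(node, ref) {
    node.parent = this;
    node.next = ref;
    node.prev = ref ? ref.prev : this.lastChild;
    if (node.prev) node.prev.next = node; else this.firstChild = node;
    if (ref) ref.prev = node; else this.lastChild = node;
    return node;
  }

  childNames() {
    const names = [];
    for (let c = this.firstChild; c; c = c.next) names.push(c.name);
    return names;
  }
}

const body = new ListNode('body');
const p = body.insertBefore(new ListNode('p'), null);
body.insertBefore(new ListNode('h1'), p);
body.insertBefore(new ListNode('footer'), null);
console.log(body.childNames()); // [ 'h1', 'p', 'footer' ]
```

Any per-insert work that walks siblings or ancestors, on top of these pointer updates, would be enough to turn a linear build into a quadratic one.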

Fri, Mar 8, 9:34 PM · Patch-For-Review, Performance, RemexHtml
cscott updated the task description for T217708: Remex should offer an option to not set namespaceURI.
Fri, Mar 8, 9:19 PM · Patch-For-Review, RemexHtml
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

@Catrope: https://github.com/wikimedia/parsoid/blob/34964e38e431238ecada0f052b4b81a5a19db84d/src/Wt2Html/XMLSerializer.php but @ssastry says it's not terribly fast right now. We should look into that, maybe there are performance tweaks that are possible.

Fri, Mar 8, 8:49 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T217867: Port domino (or another spec-compliant DOM library) to PHP.

An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package.

AIUI most of the bugs come from libxml, not PHP directly, so that wouldn't improve the situation much.

Fri, Mar 8, 8:42 PM · Core Platform Team Backlog (Attic), Parsoid-PHP

Thu, Mar 7

cscott added a comment to T217867: Port domino (or another spec-compliant DOM library) to PHP.

Another useful note, while I'm brain-dumping. Part of the task would be to define an appropriate PHP binding to WebIDL. There's a good start in packagist -- https://packagist.org/packages/esperecyan/webidl -- but it's implementation-based. Someone should write a brief document describing how WebIDL maps to PHP. Unfortunately, the only non-JavaScript language that appears to have a formal WebIDL binding description is Java, and they have "stopped work" on it and published it as a W3C note.

Thu, Mar 7, 10:48 PM · Core Platform Team Backlog (Attic), Parsoid-PHP
cscott added a comment to T217766: Flow\Exception\WikitextException: ParseEntityRef: no name.

@Catrope: yes, what @Tgr said -- your test case should be "<pre>\n\nfoo\n\nbar\n\n</pre>". (The middle and trailing ones are just for completeness, it's really the leading newlines that would be mangled.)
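A small model of why the leading newlines matter, assuming the mangling comes from the HTML parser rule that drops a single newline immediately after the <pre> start tag. Both functions are hypothetical and handle only the text content, with no escaping.

```javascript
// Model of the parser rule: one newline right after <pre> is dropped.
function parsePreText(rawContent) {
  return rawContent.startsWith('\n') ? rawContent.slice(1) : rawContent;
}

// Serializer that compensates by emitting an extra newline when the text
// itself begins with one, so that parse(serialize(text)) === text.
function serializePreText(text) {
  return text.startsWith('\n') ? '\n' + text : text;
}

const preText = '\n\nfoo\n\nbar\n\n';
const naive = parsePreText(preText);                          // one leading newline lost
const roundTripped = parsePreText(serializePreText(preText)); // preserved
console.log(JSON.stringify(naive), JSON.stringify(roundTripped));
```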

Thu, Mar 7, 10:41 PM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), Growth-Team (Current Sprint), StructuredDiscussions, Parsoid, Wikimedia-production-error
cscott added a comment to T217867: Port domino (or another spec-compliant DOM library) to PHP.

One note about porting domino in particular -- it uses meta-programming to generate classes corresponding to all the different HTML element types (HTMLAnchorElement, etc) from a compact specification (in htmlelts.js). Since neither PHP nor C support that kind of metaprogramming (eval doesn't count), that part of domino would have to be rewritten as a code-generator which runs during the build-phase instead. Probably not a huge deal, just something to keep in mind.
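For readers unfamiliar with the technique, the sketch below shows the general shape of runtime class generation from a compact table in JavaScript. The spec format, `BaseElement`, and `defineElementClasses` are hypothetical and much simpler than what domino does; a PHP or C port would emit equivalent class source at build time instead.

```javascript
class BaseElement {
  constructor(tagName) {
    this.tagName = tagName;
    this.attributes = new Map();
  }
  getAttribute(name) {
    return this.attributes.has(name) ? this.attributes.get(name) : null;
  }
  setAttribute(name, value) {
    this.attributes.set(name, String(value));
  }
}

// Hypothetical compact spec; domino's htmlelts.js uses its own format.
const elementSpec = {
  HTMLAnchorElement: { tag: 'a', reflect: ['href', 'target'] },
  HTMLParagraphElement: { tag: 'p', reflect: ['align'] },
};

function defineElementClasses(spec) {
  const classes = {};
  for (const [className, { tag, reflect }] of Object.entries(spec)) {
    const cls = class extends BaseElement {
      constructor() {
        super(tag);
      }
    };
    // Give the generated class its spec name.
    Object.defineProperty(cls, 'name', { value: className });
    // Add one reflected property per listed attribute.
    for (const attr of reflect) {
      Object.defineProperty(cls.prototype, attr, {
        get() {
          const v = this.getAttribute(attr);
          return v === null ? '' : v;
        },
        set(v) {
          this.setAttribute(attr, v);
        },
      });
    }
    classes[className] = cls;
  }
  return classes;
}

const classes = defineElementClasses(elementSpec);
const a = new classes.HTMLAnchorElement();
a.href = './Main_Page';
console.log(classes.HTMLAnchorElement.name, a.tagName, a.getAttribute('href'));
```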

Thu, Mar 7, 10:36 PM · Core Platform Team Backlog (Attic), Parsoid-PHP
cscott added a comment to T217867: Port domino (or another spec-compliant DOM library) to PHP.

In theory (with infinite resources, etc) the best of all possible worlds would be a pure PHP implementation coupled with a "native" extension with more speed. Since (again in theory) both are implementing the exact same DOM API anyway, this would allow us to avoid adding an extension to mediawiki's required dependencies.

Thu, Mar 7, 10:33 PM · Core Platform Team Backlog (Attic), Parsoid-PHP
cscott updated the task description for T217867: Port domino (or another spec-compliant DOM library) to PHP.
Thu, Mar 7, 10:31 PM · Core Platform Team Backlog (Attic), Parsoid-PHP
cscott created T217867: Port domino (or another spec-compliant DOM library) to PHP.
Thu, Mar 7, 10:11 PM · Core Platform Team Backlog (Attic), Parsoid-PHP
cscott added a comment to T217766: Flow\Exception\WikitextException: ParseEntityRef: no name.

We've got our own XMLSerializer for Parsoid, that is probably part of the long-term solution.

Thu, Mar 7, 9:41 PM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), Growth-Team (Current Sprint), StructuredDiscussions, Parsoid, Wikimedia-production-error
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

Weird. Luckily (?) for Parsoid, this seems to be a bug in saveHTML(), which I don't think we're actually using? setAttribute/getAttribute handle the "true" (unencoded) value of the attribute fine. PHP:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> $doc = new DOMDocument();
=> DOMDocument {#2308}
>>> $doc->loadHTML('<p>Hello</p>');
=> true
>>> $el = $doc->getElementsByTagName('p')->item(0);
=> DOMElement {#2325}
>>> $el->setAttribute('data-foo', '"<!-- foo&bar -->');
=> DOMAttr {#2311}
>>> $el->getAttribute('data-foo');
=> ""<!-- foo&bar -->"
>>> $doc->saveHTML($el);
=> "<p data-foo='"<!-- foo&bar -->'>Hello</p>"

vs domino (which should match browsers and the spec):

$ node
> var domino = require('./');
undefined
> var doc = domino.createDocument('<p>Hello</p>');
undefined
> el = doc.querySelector('p');
HTMLParagraphElement {}
> el.setAttribute('data-foo', '"<!-- foo&bar -->');
undefined
> el.getAttribute('data-foo');
'"<!-- foo&bar -->'
> el.outerHTML;
'<p data-foo="&quot;<!-- foo&amp;bar -->">Hello</p>'
Thu, Mar 7, 9:35 PM · Patch-For-Review, Parsoid-PHP
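The domino output above is consistent with escaping the attribute value before wrapping it in double quotes. A minimal sketch of that step, with hypothetical function names; it handles only "&", U+00A0 and the double quote, and the serialization spec may define further replacements.

```javascript
// Sketch of attribute-value escaping for HTML serialization.
// Covers "&", U+00A0 and '"'; other replacements are out of scope here.
function escapeAttributeValue(value) {
  return value
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/\u00A0/g, '&nbsp;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: serialize one attribute with double quotes.
function serializeAttribute(name, value) {
  return name + '="' + escapeAttributeValue(value) + '"';
}

console.log(serializeAttribute('data-foo', '"<!-- foo&bar -->'));
// prints data-foo="&quot;<!-- foo&amp;bar -->"
```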
cscott added a comment to T217766: Flow\Exception\WikitextException: ParseEntityRef: no name.

The HTMLFormatter library itself is pretty general purpose and might be the basis of a more general lib, especially if it were updated (eventually) to use Remex and Zest (T217360: Replace libxml/xpath in HtmlFormatter with Remex/zest). I think Remex still has a performance problem to fix (T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size) which is probably going to prevent it from replacing DOMDocument::loadHTML in the immediate short term (sigh).

Thu, Mar 7, 9:28 PM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), Growth-Team (Current Sprint), StructuredDiscussions, Parsoid, Wikimedia-production-error
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

Seems to come from libxml: https://gitlab.gnome.org/GNOME/libxml2/blob/master/HTMLparser.c#L2323

Thu, Mar 7, 8:26 PM · Patch-For-Review, Parsoid-PHP
cscott updated subscribers of T199332: PHP Warning: count(): Parameter must be an array or an object that implements Countable in Serializer.php.

Maybe relevant, maybe not:
@Tgr said "Turns out DOMNamedNodeMap did not implement the Countable interface before PHP 7.2; also pre 7.2 count() does not warn if you give it a non-countable value, it just shrugs and returns 1. That made the attribute walking logic throw up.)" on a different bug (https://gerrit.wikimedia.org/r/#/c/mediawiki/services/parsoid/+/486835/).
and @Anomie added on IRC, "I do see that https://secure.php.net/manual/en/class.domnodelist.php says Countable wasn't implemented until 7.2.0."

Thu, Mar 7, 5:01 PM · Core Platform Team Kanban (Blocked Externally), Core Platform Team (Security, stability, performance and scalability (TEC1)), RemexHtml
cscott updated the task description for T217849: Remex needs documentation of how to use its API.
Thu, Mar 7, 4:49 PM · RemexHtml
cscott updated the task description for T217850: Remex could use some helper/utility classes.
Thu, Mar 7, 4:45 PM · RemexHtml
cscott created T217850: Remex could use some helper/utility classes.
Thu, Mar 7, 4:45 PM · RemexHtml
cscott created T217849: Remex needs documentation of how to use its API.
Thu, Mar 7, 4:41 PM · RemexHtml
cscott added a member for RemexHtml: cscott.
Thu, Mar 7, 4:38 PM
cscott added a watcher for RemexHtml: cscott.
Thu, Mar 7, 4:38 PM
cscott added a comment to T217708: Remex should offer an option to not set namespaceURI.

Yeah, that sounds right. Option to DOMBuilder.

Thu, Mar 7, 4:35 PM · Patch-For-Review, RemexHtml
cscott committed rMLZE061863332d2e: Add test cases for DOM loading; workaround wrong root document nodeType (authored by cscott).
Add test cases for DOM loading; workaround wrong root document nodeType
Thu, Mar 7, 4:17 PM
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

Another fun bug -- if you use DOMDocument::loadHTML(), the top-level node (DOMDocument) has nodeType of 13 (which is not defined in any spec ever, not even DOM level 1) instead of 9 (which is what the spec says it should be).

Thu, Mar 7, 3:58 PM · Patch-For-Review, Parsoid-PHP
cscott committed rMLZEa302fcfacfd3: Factor out `getElementsById` helper (authored by cscott).
Factor out `getElementsById` helper
Thu, Mar 7, 3:40 PM
cscott committed rMLZEa78eb78085a6: Fix comma combinator (authored by cscott).
Fix comma combinator
Thu, Mar 7, 3:40 PM
cscott committed rMLZE9f6ed92a8011: Fix corner cases with fast-path matching of unusual tag/class names (authored by cscott).
Fix corner cases with fast-path matching of unusual tag/class names
Thu, Mar 7, 3:40 PM
cscott committed rMLZEc973f6275dd9: Optimize class attribute search using XPath (authored by cscott).
Optimize class attribute search using XPath
Thu, Mar 7, 3:40 PM
cscott committed rMLZE0bb6839c23e3: Optimize ID attribute search, if the DOM implementation has indexed it (authored by cscott).
Optimize ID attribute search, if the DOM implementation has indexed it
Thu, Mar 7, 3:40 PM
cscott committed rMLZE459622063d97: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName() (authored by cscott).
Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
Thu, Mar 7, 3:40 PM
cscott committed rMLZE57fb2aa6db8e: Namespace the test suite (authored by cscott).
Namespace the test suite
Thu, Mar 7, 3:40 PM
cscott committed rMLZEc7f8ac5077ec: Fix comma combinator (authored by cscott).
Fix comma combinator
Thu, Mar 7, 3:40 PM
cscott committed rMLZEa055da57b0fe: Factor out `getElementsById` helper (authored by cscott).
Factor out `getElementsById` helper
Thu, Mar 7, 3:40 PM
cscott committed rMLZE2a08fe0819ec: Fix corner cases with fast-path matching of unusual tag/class names (authored by cscott).
Fix corner cases with fast-path matching of unusual tag/class names
Thu, Mar 7, 3:40 PM
cscott committed rMLZE77781621046a: Optimize class attribute search using XPath (authored by cscott).
Optimize class attribute search using XPath
Thu, Mar 7, 3:40 PM
cscott committed rMLZE5552f635c0e8: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName() (authored by cscott).
Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
Thu, Mar 7, 3:40 PM
cscott committed rMLZE8d22c0ce9397: Optimize ID attribute search, if the DOM implementation has indexed it (authored by cscott).
Optimize ID attribute search, if the DOM implementation has indexed it
Thu, Mar 7, 3:40 PM
cscott committed rMLZEd8f49dba3f2a: Namespace the test suite (authored by cscott).
Namespace the test suite
Thu, Mar 7, 3:40 PM
cscott committed rMLZEd4941251dc72: Enable phan (authored by cscott).
Enable phan
Thu, Mar 7, 3:40 PM
cscott added a comment to T199849: VisualEditor manipulation based on TemplateData source code formatting does not handle newlines before and after correctly.

Looks like the implementation is missing the "or the template is at the start of the output" clause from the spec.

Thu, Mar 7, 3:18 PM · VisualEditor, Patch-For-Review, Parsoid, TemplateData
cscott added a comment to T217766: Flow\Exception\WikitextException: ParseEntityRef: no name.

@Catrope also look at the HTMLFormatter library; the mobile team already have figured out a bunch of weird workarounds for PHP's DOM bugs.

Thu, Mar 7, 3:16 PM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), Growth-Team (Current Sprint), StructuredDiscussions, Parsoid, Wikimedia-production-error

Wed, Mar 6

cscott added a comment to T217708: Remex should offer an option to not set namespaceURI.

Sigh.

Wed, Mar 6, 9:07 PM · Patch-For-Review, RemexHtml
cscott committed rMLZEb82597a3bba7: Fix comma combinator (authored by cscott).
Fix comma combinator
Wed, Mar 6, 8:42 PM
cscott committed rMLZEddeb0a0cf55b: Factor out `getElementsById` helper (authored by cscott).
Factor out `getElementsById` helper
Wed, Mar 6, 6:36 PM
cscott committed rMLZEdd0539106b60: Fix corner cases with fast-path matching of unusual tag/class names (authored by cscott).
Fix corner cases with fast-path matching of unusual tag/class names
Wed, Mar 6, 6:36 PM
cscott committed rMLZE235ca62237b6: Optimize class attribute search using XPath (authored by cscott).
Optimize class attribute search using XPath
Wed, Mar 6, 6:36 PM
cscott committed rMLZE63a19879b1f8: Optimize ID attribute search, if the DOM implementation has indexed it (authored by cscott).
Optimize ID attribute search, if the DOM implementation has indexed it
Wed, Mar 6, 6:36 PM
cscott committed rMLZE829003598c0a: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName() (authored by cscott).
Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
Wed, Mar 6, 6:36 PM
cscott committed rMLZE22f2b09781d6: Namespace the test suite (authored by cscott).
Namespace the test suite
Wed, Mar 6, 6:36 PM
cscott committed rMLZE83c846d8235f: Optionally use mb_string instead of intl extension (authored by cscott).
Optionally use mb_string instead of intl extension
Wed, Mar 6, 6:36 PM
cscott committed rMLZE50ccd68cac02: Enable phan (authored by cscott).
Enable phan
Wed, Mar 6, 6:36 PM
cscott added a comment to T216102: Determine which PHP version to target with Parsoid.

We should probably bump Parsoid to 7.2 since (a) the zest port is using 7.2 and (b) mediawiki is planning to skip 7.1 anyway.

Wed, Mar 6, 6:17 PM · Patch-For-Review, Parsoid-PHP
cscott committed rMLZEe905adcee40d: Factor out `getElementsById` helper (authored by cscott).
Factor out `getElementsById` helper
Wed, Mar 6, 6:06 PM
cscott committed rMLZEa17ae1b0bda5: Fix corner cases with fast-path matching of unusual tag/class names (authored by cscott).
Fix corner cases with fast-path matching of unusual tag/class names
Wed, Mar 6, 6:06 PM
cscott committed rMLZEf9bdd5d5735a: Optimize class attribute search using XPath (authored by cscott).
Optimize class attribute search using XPath
Wed, Mar 6, 6:06 PM
cscott committed rMLZEd31425336579: Optimize ID attribute search, if the DOM implementation has indexed it (authored by cscott).
Optimize ID attribute search, if the DOM implementation has indexed it
Wed, Mar 6, 6:06 PM
cscott committed rMLZE238521ed469c: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName() (authored by cscott).
Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
Wed, Mar 6, 6:06 PM
cscott committed rMLZE62846a369eed: Namespace the test suite (authored by cscott).
Namespace the test suite
Wed, Mar 6, 6:06 PM
cscott committed rMLZE9a596c3021ce: Optionally use mb_string instead of intl extension (authored by cscott).
Optionally use mb_string instead of intl extension
Wed, Mar 6, 6:06 PM
cscott committed rMLZEfe7585e3e0ce: Enable phan (authored by cscott).
Enable phan
Wed, Mar 6, 6:06 PM
cscott committed rMLZE44ac637142eb: Factor out `getElementsById` helper (authored by cscott).
Factor out `getElementsById` helper
Wed, Mar 6, 5:16 PM
cscott committed rMLZEedd918856f6c: Fix corner cases with fast-path matching of unusual tag/class names (authored by cscott).
Fix corner cases with fast-path matching of unusual tag/class names
Wed, Mar 6, 5:04 PM