
cscott (C. Scott Ananian)
Parser whisperer

User Details

User Since
Oct 21 2014, 6:47 PM (238 w, 6 d)
Availability
Available
IRC Nick
cscott
LDAP User
Unknown
MediaWiki User
Cscott [ Global Accounts ]

Editor since 2005; WMF developer since 2013. I work on Parsoid and OCG, and dabble with VE, real-time collaboration, and OOjs.

On github: https://github.com/cscott

See https://en.wikipedia.org/wiki/User:cscott for more.

Recent Activity

Wed, May 15

cscott created T223411: PHP regression when switching from extTagWidths to extTagOffsets.
Wed, May 15, 8:11 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T223296: proposal: new &? !? and ~? predicates.

If you look at Antlr ( https://www.antlr.org/ ) for inspiration, they have instead a single operator which means "return this!".

Wed, May 15, 6:09 PM · Patch-For-Review, Parsoid

Sat, May 11

Pppery awarded T112987: Separating infoboxes and navboxes from article content a Dislike token.
Sat, May 11, 6:40 PM · Wikidata, Community-Wishlist-Survey-2015, Wikimedia-Developer-Summit-2016
cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

expansionDepth is a static variable in PPFrame_Hash::expand() -- as such it could be incremented by 1 by some PHP7-specific parse earlier in the process lifetime. For example, in the header which says "this is an experimental service running PHP7" or some other bit of wikitext which appears only on the PHP7 machines. I think that's a red herring.

Sat, May 11, 4:06 AM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott added a comment to T222992: Upgrading remex-html to 2.0.2 causes 18 test failures.

Chasing it down...
MWTidy::tidy() -> RemexDriver::tidy() creates

		$tokenizer = new Tokenizer( $dispatcher, $text, [
			'ignoreErrors' => true,
			'ignoreCharRefs' => true,
			'ignoreNulls' => true,
			'skipPreprocess' => true,
		] );

and then calls $tokenizer->execute() -- and it appears that the remex patch in question isn't honoring ignoreCharRefs and is decoding the entities regardless (at least in certain situations).

Sat, May 11, 3:50 AM · Patch-For-Review, RemexHtml, Parsing-Team, MediaWiki-Parser
cscott added a comment to T222992: Upgrading remex-html to 2.0.2 causes 18 test failures.

...though it seems impossible, I've confirmed that 3058052c756ac7b69ead21b4b237a1cd6714de8a in Remex causes these failures. (If only I knew why!)

I cannot find that commit in gerrit or git log. Is that the right hash?

Sat, May 11, 3:06 AM · Patch-For-Review, RemexHtml, Parsing-Team, MediaWiki-Parser
cscott added a comment to T222992: Upgrading remex-html to 2.0.2 causes 18 test failures.

...though it seems impossible, I've confirmed that 3058052c756ac7b69ead21b4b237a1cd6714de8a in Remex causes these failures. (If only I knew why!)

Sat, May 11, 2:57 AM · Patch-For-Review, RemexHtml, Parsing-Team, MediaWiki-Parser
cscott added a comment to T222992: Upgrading remex-html to 2.0.2 causes 18 test failures.

These failures all have to do with how entities are escaped in attributes, which in theory shouldn't be affected by any of the changes between remex 2.0.1 and 2.0.2, which are about *parsing* (specifically tokenization) not *serialization*.

Sat, May 11, 2:24 AM · Patch-For-Review, RemexHtml, Parsing-Team, MediaWiki-Parser

Fri, May 10

cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

Yeah, it's interesting that the various ppvisitednodes etc. counts confirm that HHVM and PHP seem to be doing exactly the same work; i.e., there's no weird PHP-7 behavior which is causing it to generate a subtly different graph. Assuming that the generated objects are exactly the same (which again, the counts seem to confirm), slight representation differences for objects *probably* wouldn't account for such a large difference. So here are my two theories:

Fri, May 10, 5:17 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error

Thu, May 9

cscott added a comment to T222856: WikiPEG & Parsoid cache rule optimization.

This computation is done at grammar compile time; it has no effect on runtime (unless I'm missing something).

Thu, May 9, 2:06 PM · Patch-For-Review, Parsoid

Wed, May 8

cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).
Does this mean HHVM is using less memory for the same task than PHP 7? Or maybe it's measuring/enforcing it differently?
Wed, May 8, 8:11 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error

Fri, May 3

cscott added a comment to T222419: Incorrect section numbering after unclosed subst.

I suspect what is happening is that we are incrementing section ID when we attempt to parse, and then failing to decrement when we backtrack out of the parse.
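
The suspected failure mode can be illustrated with a toy model. This is only a sketch under assumptions, not Parsoid's code: the class ToyParser, its sectionId counter, and both method names are hypothetical, chosen to show a counter that leaks across a failed parse attempt and one that is restored on backtrack.

```javascript
// Toy model of a backtracking parser that hands out section ids.
// All names here are hypothetical; this is not Parsoid's implementation.
class ToyParser {
  constructor() {
    this.sectionId = 0;
  }

  // Buggy: the counter is bumped before we know whether the parse succeeds,
  // and it is never rolled back when the attempt fails.
  attemptBuggy(succeeds) {
    this.sectionId++;
    return succeeds ? this.sectionId : null;
  }

  // Fixed: remember the counter and restore it when backtracking.
  attemptFixed(succeeds) {
    const saved = this.sectionId;
    this.sectionId++;
    if (!succeeds) {
      this.sectionId = saved;
      return null;
    }
    return this.sectionId;
  }
}

const buggy = new ToyParser();
buggy.attemptBuggy(false); // failed attempt, e.g. an unclosed construct
const buggyId = buggy.attemptBuggy(true);

const fixed = new ToyParser();
fixed.attemptFixed(false);
const fixedId = fixed.attemptFixed(true);

console.log({ buggyId, fixedId });
```

In the buggy variant the failed attempt consumes an id, so the next successful section skips a number; in the fixed variant the ids stay consecutive.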

Fri, May 3, 1:35 PM · Parsoid-Read-Views, Patch-For-Review, Reading-Infrastructure-Team-Backlog, Page Content Service
cscott added a comment to T222419: Incorrect section numbering after unclosed subst.

Reproduced:

$ (echo '==Foo==' ; echo '{{subst: a surname' ; echo '==Bar==') | bin/parse.js --wrapSections --normalize=parsoid
<section data-mw-section-id="0"></section><section data-mw-section-id="1">
<h2 id="Foo">Foo</h2>
<p>{{subst: a surname</p>
</section><section data-mw-section-id="3">
<h2 id="Bar">Bar</h2>
</section>
Fri, May 3, 1:34 PM · Parsoid-Read-Views, Patch-For-Review, Reading-Infrastructure-Team-Backlog, Page Content Service

Thu, May 2

cscott added a comment to T222328: [extlink] parsing - link cannot contain language variant or extension tags.

This is a weird corner case of legacy parser behavior; it's not entirely clear Parsoid should follow it. But it's not clear that what Parsoid is doing is totally consistent either.

Thu, May 2, 1:50 PM · Chinese-Sites, Parsoid
cscott added a comment to T221920: [extlink] parsing - don't validate protocol twice on simple link.

It's not either-or -- wikitext like [http://example.com/{{1x|path/to/resource}} blah] is possible too, which your rewrite doesn't capture. We could strip the redundant checks and fall back on the final check which is done after template expansion, but in general we prefer doing the early checks, if they are simple and fast (as the protocol check is), in order to avoid wasting too much time on a potential parse which will ultimately be rejected. The sooner we can reject it the better, hence the three-fold check in the current code.

Thu, May 2, 2:43 AM · Parsoid

Wed, May 1

cscott closed T221920: [extlink] parsing - don't validate protocol twice on simple link as Invalid.

You've confused the precedence of the / operator a little bit. In fact, the addr can be the empty string "" which is why we need a secondary check. In fact, I believe there's a third-string check after templates are expanded, as well.

Wed, May 1, 4:31 PM · Parsoid
cscott added a comment to T222266: Edge case difference processing templated styles in table cells.

Current parsoid output for the test case wikitext above:

$ echo -n | bin/parse.js --pageName User:cscott/T222266 --normalize=parsoid
Wed, May 1, 3:25 PM · Parsoid-Read-Views
cscott added a comment to T222266: Edge case difference processing templated styles in table cells.

The <nowiki> in the style attribute was perhaps an attempt to work around T5158: Parser inserts invalid &nbsp; in the middle of style attribute (French spaces)/T197902: Be more selective in applying French Space armoring, which was fixed in July 2018?

Wed, May 1, 3:19 PM · Parsoid-Read-Views
cscott updated subscribers of T222266: Edge case difference processing templated styles in table cells.
Wed, May 1, 3:15 PM · Parsoid-Read-Views

Fri, Apr 26

cscott added a comment to T221907: Security Concept Review For Parsoid-PHP.

Yes, indeed. I just finished digging through git log to find that patch, and came here and you'd beaten me to it. ;)

Fri, Apr 26, 3:22 PM · Security-Team-Review-Active, Parsoid-PHP
cscott added a comment to T221907: Security Concept Review For Parsoid-PHP.

I recall that we did have the issue with Parsoid being exposed to sensitive content, specifically usernames and deleted revisions. The details are foggy, but I vaguely recall that I had to remove some information from the <head> of the Parsoid document because it could expose information about deleted revisions.

Fri, Apr 26, 3:05 PM · Security-Team-Review-Active, Parsoid-PHP

Thu, Apr 25

cscott added a comment to T90902: Non-breaking space in header ID breaks anchor.

@Pols12 that seems like possibly an interaction with French space armoring (T197902).

Thu, Apr 25, 7:25 PM · Community-Tech, MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), Patch-For-Review, MediaWiki-Parser, Parsoid
cscott added a comment to T221876: Pick one html5 source (whatwg or w3c) and update all code documentation links to point to that source.

On the other hand, stating "we follow WHATWG HTML5" implies an intention to follow WHATWG as HTML5 grows/changes. We haven't stated that as an explicit goal in the past, but our community seems to assume it, insofar as, historically, as soon as new elements were added to HTML (<section>, etc.) there were folks who wanted to start using them in wikitext. Not every HTML feature intersects with wikitext, but tag and entity names are the most direct points of contact and historically they've tended to follow HTML's evolution closely.

Thu, Apr 25, 6:22 PM · Documentation, Parsoid-PHP
cscott added a comment to T221876: Pick one html5 source (whatwg or w3c) and update all code documentation links to point to that source.

...except for the Japanese ruby issue, where the W3C has philosophical differences which apparently conflict with the desires of jawiki. (I used to actually understand the differences, but I paged all that out long ago.) According to MDN, only Firefox implements <rtc>, but all browsers except IE implement <rb>. Of course wikitext doesn't *have* to exactly follow anyone, but it's slightly less confusing to just say "wikitext follows the WHATWG HTML5 spec" than to say "wikitext follows the W3C HTML5 spec, except in the case of <ruby> elements".

Thu, Apr 25, 5:28 PM · Documentation, Parsoid-PHP
cscott added a comment to T221876: Pick one html5 source (whatwg or w3c) and update all code documentation links to point to that source.

This seems pretty low priority; W3C is following the WHATWG spec so there's no real difference. WHATWG tends to lead, and then their decisions are ratified by W3C sometime later.

Thu, Apr 25, 5:14 PM · Documentation, Parsoid-PHP
cscott created T221872: composer-package-php73-docker seems to fail often on Parsoid builds.
Thu, Apr 25, 4:01 PM · Parsoid, Continuous-Integration-Infrastructure, Jenkins
cscott closed T219943: Create a composer library for wikipeg as Resolved.
Thu, Apr 25, 2:46 PM · Patch-For-Review, Core Platform Team (Parsoid PHP (CDP2)), Parsoid-PHP
cscott created T221858: Release wikipeg 2.0.3 through npm.
Thu, Apr 25, 2:21 PM · Patch-For-Review, Parsoid

Wed, Apr 24

cscott created T221790: Parsoid extension API should use DOM fragments, not documents.
Wed, Apr 24, 4:16 PM · Parsoid-PHP

Tue, Apr 23

cscott added a comment to T126618: $wgAllowImageTag should also block/allow <video>/<audio> tags (and be renamed to $wgAllowMediaTags ?).

As a counter-argument: $wgAllowImageTag is not turned on for any WMF production wikis. You could argue that wgAllowImageTag should be deprecated and removed, rather than supported and expanded.

Tue, Apr 23, 6:42 PM · Multimedia, MediaWiki-Parser
cscott added a comment to T221684: Some third-party wikis allow hotlinked images, but these aren't shown in preview.

Dup of T127884: Respect $wgAllowImageTag wiki configuration flag in the Sanitizer.

Tue, Apr 23, 6:41 PM · Parsoid, VisualEditor, VisualEditor-MediaWiki-2017WikitextEditor
cscott created T221677: Sanitizer::validateAttributes is not as efficient as it could be.
Tue, Apr 23, 5:14 PM · Parsoid-PHP, MediaWiki-Parser

Mon, Apr 22

cscott added a comment to T118520: Use <figure-inline> instead of <span> for inline figures..

The spec has this explanation, https://www.mediawiki.org/wiki/Specs/HTML/1.5.0#Images

The outer <figure> element needs to become a <span> element when the figure is rendered inline, since otherwise the HTML5 parser will interrupt a surrounding block context. The inner <figcaption> element is rendered as a data-mw attribute in this case (since block content in an invisible caption would otherwise break parsing).

So, the difference is in how a browser will render it. Hope that helps.

Mon, Apr 22, 5:01 PM · MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), MW-1.31-release-notes (WMF-deploy-2017-09-26 (1.31.0-wmf.1)), Patch-For-Review, Parsoid-DOM
cscott added a comment to T118517: [RFC] Use <figure> for media.

Adding a few details to @ssastry's update: Parsoid was changed to use <figure> and <figure-inline> in c9f404761cd288e7b58b89623ac459bbb2901a7d (T118520). The remaining work to be done is to transition core to use this same markup. The original plan was to do this in two steps: first convert block markup to use <figure>, and then as a follow-up convert inline markup to use <figure-inline>. Arlo has core patches written (linked above), but actually deploying them will take a careful process of communicating w/ local communities, linting, etc, which we do not plan to tackle until after the Parsoid port to PHP is complete.

Mon, Apr 22, 4:59 PM · Patch-For-Review, Accessibility, Parsing-Team, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, MediaWiki-Parser, TechCom-RFC

Apr 20 2019

cscott updated the task description for T221491: Parsoid: change Cite to use \s+ as a separator instead of [ \t\n].
Apr 20 2019, 2:20 AM · Parsoid-Read-Views, Cite
cscott created T221491: Parsoid: change Cite to use \s+ as a separator instead of [ \t\n].
Apr 20 2019, 2:20 AM · Parsoid-Read-Views, Cite
cscott created T221490: Parsoid: Cite silently ignores all parameters in <ref> with more than two parameters.
Apr 20 2019, 2:11 AM · Parsoid-Read-Views
cscott created T221489: Parsoid: Extra spaces are meaningful in sfn (differs from article text behavior).
Apr 20 2019, 2:09 AM · Parsoid-Read-Views, Cite
cscott created T221488: Add "decoding=async" change to Parsoid image markup.
Apr 20 2019, 2:06 AM · Parsoid-Read-Views
cscott added a comment to T212124: Consider adding decoding=async to our img tags.

I thought about Parsoid, but first let's see if it's beneficial at all for merely reading articles rendered by MediaWiki on a web browser?

Apr 20 2019, 2:01 AM · MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), MediaWiki-General-or-Unknown, MediaWiki-extensions-General, Parsing-Team, Performance-Team

Apr 19 2019

cscott added a comment to T221041: Convert Parsoid to dependency injection.

In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever.

Having multiple wikitext parsers and having different parsers for different markups are two entirely different layers of abstraction. The latter is already mostly possible via the content handler mechanism, and the extent to which it is not possible has less to do with the parser and more with baked-in assumptions about certain things (like system messages) always using wikitext. I don't see much overlap between wider support for non-wikitext markup and the Parsoid-PHP project.

Apr 19 2019, 5:06 PM · User-Daniel, Core Platform Team (Decoupling (CDP2)), Parsoid-PHP, Technical-Debt

Apr 18 2019

cscott added a comment to T220018: html2wt should escape [ when it precedes an external link or autolink.

Note that we have a similar issue with {{ }}, as shown in our test case for T70421:

$ echo '{{1x|{{ }}}}' | bin/parse.js --wt2wt
{{1x|{{ }<nowiki>}</nowiki>}}

But this doesn't round trip:

$ echo '{{1x|{{ }<nowiki>}</nowiki>}}' | php maintenance/parse.php 
<p>{{1x|{{ }}}}
</p>
Apr 18 2019, 8:07 PM · Parsoid-Edit-Support, Parsoid-Nowiki
cscott added a comment to T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.

Oh, there is one buglet on JS, I wonder if that's how I managed to get this to reproduce:

$ node
> [null, undefined].join(':')
':'
> [null, undefined,32].join(':')
'::32'

That is, null and undefined are treated as interchangeable in the cache key. That's an issue in the PHP port as well:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.3-1 — cli) by Justin Hileman
>>> implode(':', [null, 32, '', false])
=> ":32::"

It would be safer to use json_encode() for PHP, probably, but JSON.stringify doesn't distinguish null and undefined on JS.
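
The collision is easy to reproduce on the JS side. The following is a minimal sketch: joinKey mirrors the join(':') behavior shown above, while taggedKey is a hypothetical alternative (not wikipeg's actual code) that assumes the values are primitives or undefined.

```javascript
// Array.prototype.join renders null, undefined and '' identically,
// so distinct inputs can produce the same key.
const joinKey = (values) => values.join(':');

// Hypothetical alternative: tag undefined explicitly and JSON-encode the rest.
// Assumes values are null, booleans, numbers, strings, or undefined.
const taggedKey = (values) =>
  values.map((v) => (v === undefined ? 'u' : JSON.stringify(v))).join(':');

const a = [null, undefined, 32];
const b = [undefined, null, 32];

console.log(joinKey(a) === joinKey(b));                // true: the keys collide
console.log(JSON.stringify(a) === JSON.stringify(b));  // true: undefined becomes null in arrays
console.log(taggedKey(a) === taggedKey(b));            // false: the keys differ
```

The tag 'u' cannot clash with JSON output for the assumed value types, since JSON.stringify never emits a bare `u` for them.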

Apr 18 2019, 7:08 PM · Patch-For-Review, Parsoid
cscott renamed T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined from Wikipeg cache is unsafe when rule variables are in use to Wikipeg cache is unsafe when rule variables are set to null/undefined.
Apr 18 2019, 7:00 PM · Patch-For-Review, Parsoid
cscott added a comment to T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.

No, I'm an idiot. The above is close, but the original value of the variable is in fact stored in the cache key.

Apr 18 2019, 6:37 PM · Patch-For-Review, Parsoid
cscott added a comment to T221028: Parsoid incorrectly parses links to pages starting with a slash '/' in namespaces that can have subpages.

As I said over in T110413, .//Foo is pretty dubious as an href. But then again, https://en.wikipedia.org/wiki//e/_(operating_system) exists on enwiki, so I guess that's legit.

Apr 18 2019, 5:07 PM · Parsoid-Read-Views
cscott added a comment to T110413: Relative links to subpages are treated as links to the mainspace that start with / until after saving.

.//Foo is pretty dubious as an href. I'd expect ./%2FFoo (if subpages aren't enabled) or ./CurrentPageTitle/Foo (if they are). The base href is always the root of the wiki.

Apr 18 2019, 4:50 PM · Patch-For-Review, VisualEditor (Current work), VisualEditor-MediaWiki-Links
cscott added a comment to T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.

OK, pretty sure I found the bug. In the cache code, we save and restore reference values like so:

			start: [
				`$key = ${key};`,
				'$bucket = $this->currPos;',
				`$cached = $this->cache[$bucket][$key] ?? null;`,
				'if ($cached) {',
				'  $this->currPos = $cached[\'nextPos\'];',
				opts.loadRefs,
				'  return $cached[\'result\'];',
				'}',
				opts.saveRefs,
			].join('\n'),
			store: [
				`$cached = ['nextPos' => $this->currPos, 'result' => ${opts.result}];`,
				opts.storeRefs,
				`$this->cache[$bucket][$key] = $cached;`

Where opts.storeRefs typically looks something like this:

if ($saved_preproc !== $param_preproc) $cached['refs']["preproc"] = $param_preproc;
if ($saved_th !== $param_th) $cached['refs']["th"] = $param_th;
Apr 18 2019, 4:43 PM · Patch-For-Review, Parsoid
cscott updated the task description for T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.
Apr 18 2019, 4:29 PM · Patch-For-Review, Parsoid
cscott updated subscribers of T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.
Apr 18 2019, 3:56 PM · Patch-For-Review, Parsoid
cscott created T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.
Apr 18 2019, 3:56 PM · Patch-For-Review, Parsoid

Apr 17 2019

cscott created T221238: Re-evaluate Parsoid's error handling strategy once it is integrated into MediaWiki.
Apr 17 2019, 2:55 PM · Parsoid-PHP

Apr 16 2019

cscott added a comment to T221041: Convert Parsoid to dependency injection.

I know @ssastry and I disagree on this, but I'd like for us to move toward a place where Parsoid is an extension. In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever. I'd like to reduce the dependencies between mediawiki core and any one particular parser implementation.

Apr 16 2019, 6:11 PM · User-Daniel, Core Platform Team (Decoupling (CDP2)), Parsoid-PHP, Technical-Debt

Apr 11 2019

cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

I'd recommend merge of 502567 ^. @Krinkle, if you agree C+1 and we can get this scheduled for SWAT?

Apr 11 2019, 2:32 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error

Apr 9 2019

cscott added a comment to T219901: Default to Preprocessor_Hash for PHP 7.

To be conservative, let's configure it for PHP7 first -- since that's an experimental deployment anyway. Then we can talk about changing the defaults once we are certain that we don't come across anything unexpected in production use. (T216664 was unexpected, for instance, at least to me. I'd think the likelihood of being surprised is even lower with a shift to Preprocessor_Hash since we're already using it in production on HHVM... but that's the thing about unknowns: you never know them in advance...)

Apr 9 2019, 5:35 PM · MW-1.34-release, Performance-Team (Radar), MediaWiki-Parser
cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

I was sort of waiting to see if @tstarling had anything to say about this, as (a) he wrote both Preprocessor_DOM and Preprocessor_Hash (AFAIK) and (b) he raised the one objection to the previous plan (deprecating _Hash in favor of _DOM).

Apr 9 2019, 5:30 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott committed rMLZEffc83014b1d7: Bug fix in :first-child selector (unique to PHP port) (authored by cscott).
Bug fix in :first-child selector (unique to PHP port)
Apr 9 2019, 4:13 PM
cscott committed rMLZE33335142718c: Bug fix in :first-child selector (unique to PHP port) (authored by cscott).
Bug fix in :first-child selector (unique to PHP port)
Apr 9 2019, 3:38 PM

Apr 8 2019

cscott added a comment to T217867: Port domino (or another spec-compliant DOM library) to PHP.
Apr 8 2019, 9:40 PM · Core Platform Team Backlog (Attic), Parsoid-PHP

Apr 6 2019

Mill <mill@mail.com> committed R1907:2da18fafd017: d#baaaaaaaaaaa (authored by cscott).
d#baaaaaaaaaaa
Apr 6 2019, 2:35 AM

Apr 5 2019

Mill <mill@mail.com> committed rELINT3edfb7275584: wsbaaaaaaaaaaa (authored by cscott).
wsbaaaaaaaaaaa
Apr 5 2019, 10:42 PM
Mill <mill@mail.com> committed rELINT9ef59fcdfceb: yzbaaaaaaaaaaa (authored by cscott).
yzbaaaaaaaaaaa
Apr 5 2019, 10:42 PM
Mill <mill@mail.com> committed rELINT41c861adb738: xzbaaaaaaaaaaa (authored by cscott).
xzbaaaaaaaaaaa
Apr 5 2019, 10:42 PM

Apr 4 2019

cscott added a comment to T220055: Internal links surrounded by square brackets should be parsed correctly.

This is consistent between Parsoid and PHP:

$ echo '[[[Main Page]]]' | bin/parse.js --normalize
<p>[[[Main Page]]]</p>

-vs-

$ echo '[[[Main Page]]]' | php maintenance/parse.php 
<p>[[[Main Page]]]
</p>
Apr 4 2019, 1:08 PM · Parsoid
Krinkle awarded T219901: Default to Preprocessor_Hash for PHP 7 an Orange Medal token.
Apr 4 2019, 12:37 AM · MW-1.34-release, Performance-Team (Radar), MediaWiki-Parser

Apr 3 2019

cscott created T220018: html2wt should escape [ when it precedes an external link or autolink.
Apr 3 2019, 5:14 PM · Parsoid-Edit-Support, Parsoid-Nowiki

Apr 2 2019

cscott created T219901: Default to Preprocessor_Hash for PHP 7.
Apr 2 2019, 5:18 PM · MW-1.34-release, Performance-Team (Radar), MediaWiki-Parser
cscott added a comment to T204945: Deprecate one of the Preprocessor implementations for 1.33.

I was wrong, Preprocessor_DOM is a lie. At least the "DOM" part of it is -- it just runs DOMDocument::loadXML to wrap a thin veneer of DOMness around the result at the end. Furthermore, it has issues (like T216664) and the DOM extension has other further issues (T215000). Let's just kill it and keep Preprocessor_Hash instead.

Apr 2 2019, 5:15 PM · MW-1.33-release, Technical-Debt (Deprecation), MediaWiki-Parser, Patch-For-Review
cscott updated the task description for T204945: Deprecate one of the Preprocessor implementations for 1.33.
Apr 2 2019, 5:13 PM · MW-1.33-release, Technical-Debt (Deprecation), MediaWiki-Parser, Patch-For-Review
cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

Alternatively we should use Preprocessor_Hash on both HHVM and PHP 7 and deprecate/remove Preprocessor_DOM (see T204945: Deprecate one of the Preprocessor implementations for 1.33).

Apr 2 2019, 5:10 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

Node counting was added to Preprocessor_DOM in 2caa7829fcc6a0ab45f91c4346c0d5a9100ef4dc by @tstarling. It's not strictly needed in Preprocessor_Hash (as I understand it) because libdom allocates memory in a different way than used by Preprocessor_Hash, but perhaps it should be implemented for consistency -- so that we don't get pages which are only editable if you use Preprocessor_Hash not Preprocessor_DOM.

Apr 2 2019, 4:48 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error

Mar 29 2019

cscott added a comment to T219069: Reconcile byte offsets coming from Tokenizer with unicode char offsets used by rest of ported code.

We definitely want strategy 1, at least long-term. The mb_* functions all take O(length of string) time, since you need to scan the string from the beginning in order to count codepoints.
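
The cost of translating between the two offset kinds can be sketched as follows. The comment does not spell out the strategies, so this only illustrates the linear scan it mentions, in Node rather than PHP; byteToCodePointOffset is a hypothetical helper, and it assumes the byte offset falls on a character boundary.

```javascript
// Convert a UTF-8 byte offset into a code point offset.
// Assumes byteOffset falls on a character boundary.
function byteToCodePointOffset(text, byteOffset) {
  const prefix = Buffer.from(text, 'utf8')
    .subarray(0, byteOffset)
    .toString('utf8');
  // Array.from iterates by code point, so this walks the whole prefix:
  // the cost grows with the offset, not with a constant.
  return Array.from(prefix).length;
}

const text = 'h\u00e9llo'; // "é" takes two bytes in UTF-8
console.log(Buffer.byteLength(text, 'utf8'));  // 6 bytes
console.log(byteToCodePointOffset(text, 3));   // 2 code points ("h", "é")
```

Doing this conversion for every token repeats the scan each time, which is why keeping a single offset convention throughout avoids the overhead.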

Mar 29 2019, 9:11 PM · Patch-For-Review, Parsoid-PHP

Mar 28 2019

cscott added a comment to T208139: Georgian words are automatically (incorrectly) capitalized when entered.

Seems like we should probably make a plan to proactively manage Unicode transitions on our three platforms (browsers, PHP, server-side JS).

Mar 28 2019, 1:01 AM · I18n, Wikidata, MediaWiki-extensions-WikibaseClient

Mar 21 2019

cscott updated subscribers of T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).
Mar 21 2019, 8:59 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott renamed T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature) from MWException when viewing or comparing certain pages with PHP7 beta feature to MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).
Mar 21 2019, 8:58 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott updated subscribers of T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).
Mar 21 2019, 8:58 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

Interesting. Probably not actually related to PHP 7 but to Preprocessor_DOM -- I believe HHVM still runs Preprocessor_Hash. I've got an outstanding request to reduce the code duplication: T204945: Deprecate one of the Preprocessor implementations for 1.33.

Mar 21 2019, 8:58 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott added a comment to T218702: ParserIntegrationTest for Scribunto failing in Wikibase CI.

Yep, most likely related. @Arlolra will probably write you a patch once he wakes up/gets online this am, but if you're impatient just removing the newlines from the indicated places in parserTests.txt will get you going again.

Mar 21 2019, 2:27 PM · Patch-For-Review, MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), SyntaxHighlight, User-Addshore, Wikimedia-production-error (Shared Build Failure), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-extensions-Scribunto, MediaWiki-Core-Testing
Mill <mill@mail.com> committed rEPFM5b55c1c2a2aa: %26dbaaaaaaaaaaa (authored by cscott).
%26dbaaaaaaaaaaa
Mar 21 2019, 12:24 AM
cscott committed rMLZE8fc57991b857: Update README.md (authored by cscott).
Update README.md
Mar 21 2019, 12:22 AM
cscott committed rMLZE023d6c4ac2a9: Update README.md (authored by cscott).
Update README.md
Mar 21 2019, 12:22 AM
cscott committed rMLZE322afbbd7ee8: Update README.md (authored by cscott).
Update README.md
Mar 21 2019, 12:22 AM
cscott added a comment to T218817: PHP Warning: count(): Parameter must be an array or an object that implements Countable.

It would be a good idea to add some parser tests which exercise both modes of StringUtils::explode. If this hadn't thrown an exception for being not Countable, we probably wouldn't have noticed that the array keys shift on large articles until it made production.

Mar 21 2019, 12:08 AM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), User-zeljkofilipin, Patch-For-Review, Core Platform Team, Parsing-Team, MediaWiki-Parser, Wikimedia-production-error

Mar 20 2019

cscott added a comment to T218816: MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam wants null even if type is nullable.

Because the type hint just below it will say ?string not string|null? We should be using the one markup language which is used by PHP, not making up our own?

Mar 20 2019, 10:16 PM · MediaWiki-Codesniffer
cscott added a comment to T218817: PHP Warning: count(): Parameter must be an array or an object that implements Countable.

ExplodeIterator::key() returns $this->curPos, *not* the line number. So the 497863 patch fixes the exceptions (and so should be fine to backport to wmf.22), but for wikitext over a thousand lines it will effectively "never always be on the last line" and so you'll get an extra trailing newline.

Mar 20 2019, 9:02 PM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), User-zeljkofilipin, Patch-For-Review, Core Platform Team, Parsing-Team, MediaWiki-Parser, Wikimedia-production-error
cscott added a comment to T218324: MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam wants redundant "mixed|null".

See also T218816: MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam wants null even if type is nullable.

Mar 20 2019, 7:42 PM · Patch-For-Review, MediaWiki-Codesniffer
cscott created T218816: MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam wants null even if type is nullable.
Mar 20 2019, 7:41 PM · MediaWiki-Codesniffer

Mar 19 2019

cscott added a comment to T218702: ParserIntegrationTest for Scribunto failing in Wikibase CI.

Seems to be fixed. Watching builds of https://gerrit.wikimedia.org/r/464096 https://gerrit.wikimedia.org/r/497471 and https://gerrit.wikimedia.org/r/497320 to confirm.

Mar 19 2019, 6:43 PM · Patch-For-Review, MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), SyntaxHighlight, User-Addshore, Wikimedia-production-error (Shared Build Failure), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-extensions-Scribunto, MediaWiki-Core-Testing
cscott added a comment to T218358: Add data-title attribute to anchors.

Yes, the intention is certainly that title (in normalized form, ie spaces converted) can always be derived by removing the relative path. Furthermore, in modern Parsoid (not historically, but we can ignore that) that relative path is *always* ./. So just strip the first two characters and you've got your title.
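
A minimal sketch of that stripping step, assuming the href always starts with ./ as described above. The helper name titleFromHref is hypothetical and not a Parsoid API; fragments and percent-encoding are deliberately not handled.

```php
<?php
// Hypothetical helper, not a Parsoid API: derive the normalized title
// from a link href, assuming the relative path prefix is always "./".
function titleFromHref( string $href ): ?string {
	if ( substr( $href, 0, 2 ) !== './' ) {
		return null; // not a modern Parsoid-style relative link
	}
	// Drop the two-character "./" prefix; the rest is the title.
	return substr( $href, 2 );
}

$title = titleFromHref( './Barack_Obama' ); // 'Barack_Obama'
```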

Mar 19 2019, 6:38 PM · Readers-Web-Backlog (Tracking), Parsing-Team, MediaWiki-Parser, Internet-Archive, Parsoid, Technical-Debt
cscott added a comment to T218702: ParserIntegrationTest for Scribunto failing in Wikibase CI.

See https://gerrit.wikimedia.org/r/#/q/topic:trail+(status:open+OR+status:merged) for the set of patches merged; scribunto probably needs a patch somewhat like https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ParserFunctions/+/494939/

Mar 19 2019, 5:16 PM · Patch-For-Review, MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), SyntaxHighlight, User-Addshore, Wikimedia-production-error (Shared Build Failure), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-extensions-Scribunto, MediaWiki-Core-Testing
cscott added a comment to T218702: ParserIntegrationTest for Scribunto failing in Wikibase CI.

Yeah, we should just fix the Scribunto tests, not revert patches. This was a set of 9 or so dependent patches to merge; unrolling would be quite a chore, and the Scribunto fix should just be to remove some trailing newlines from the parser tests....

Mar 19 2019, 5:14 PM · Patch-For-Review, MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), SyntaxHighlight, User-Addshore, Wikimedia-production-error (Shared Build Failure), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-extensions-Scribunto, MediaWiki-Core-Testing

Mar 15 2019

cscott added a comment to T218378: Flaky test Wikibase\Repo SetAliasesTest::testUserCannotSetAliasesWhenTheyLackPermission [4h].

There are a few patches which have actually (apparently) passed this test and gotten merged, eg https://gerrit.wikimedia.org/r/496080. But only about two in the past hour AFAICT. So that's UBN territory for me...

Mar 15 2019, 9:10 PM · MW-1.34-notes (1.34.0-wmf.1; 2019-04-16), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), Patch-For-Review, Wikidata, MediaWiki-extensions-WikibaseRepository, Wikimedia-production-error (Shared Build Failure)
cscott committed rMLZE35dffc7d807c: Allow passing options to Remex in the loadHtml/parseHtml test helpers (authored by cscott).
Allow passing options to Remex in the loadHtml/parseHtml test helpers
Mar 15 2019, 5:22 AM

Mar 14 2019

cscott added a comment to T218183: Audit uses of PHP DOM in Wikimedia software.

Yeah, there's a more-or-less standard-but-ugly workaround that involves using mb_encode to replace everything above U+007F with an HTML entity: https://github.com/wikimedia/html-formatter/blob/5e33e3bbb327b3e0d685cc595837ccb024b72f57/src/HtmlFormatter.php#L71
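
One way to do that replacement is with mb_encode_numericentity. The following is a sketch only and not necessarily what the linked HtmlFormatter code does; the helper name encodeNonAscii is hypothetical, and it assumes UTF-8 input with the mbstring and dom extensions available.

```php
<?php
// Sketch of the workaround: turn every code point above U+007F into a
// numeric entity so the input handed to DOMDocument is pure ASCII.
function encodeNonAscii( string $html ): string {
	// convmap: [ first code point, last code point, offset, mask ]
	$convmap = [ 0x80, 0x10FFFF, 0, 0x1FFFFF ];
	return mb_encode_numericentity( $html, $convmap, 'UTF-8' );
}

$doc = new DOMDocument();
$doc->loadHTML( encodeNonAscii( '<p>café</p>' ) );
// The entity is decoded again by the HTML parser.
$text = $doc->getElementsByTagName( 'p' )->item( 0 )->textContent;
```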

Mar 14 2019, 11:28 PM · TechCom, MediaWiki-General-or-Unknown, Parsoid-PHP
cscott added a comment to T217850: Remex could use some helper/utility classes.

I'd say there's one other use case, and it's what tidy does (AIUI): mutate a string representation of an HTML document in a "safe" way, without ever building the complete DOM tree in memory. That is, "safe" string-to-string transformations. There are probably lots of weird things you could do here, but I would love to see a basic "insert X into Y" (like innerHTML) or "append X to Y" utility, done in a safe way that respected tag boundaries etc. The API of https://github.com/wikimedia/html-formatter/blob/master/src/HtmlFormatter.php could be a guide, just imagine doing it string-to-string without creating an intermediate DOM.

Mar 14 2019, 7:01 PM · RemexHtml
cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

We don't use this (I don't think) but be aware that if you create an attribute named 'xmlns' in PHP's DOM everything breaks: https://marc.info/?l=php-internals&m=155249142123136&w=2

Mar 14 2019, 6:48 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size.

A performance update is at T204595#5024206

Mar 14 2019, 6:42 PM · Patch-For-Review, Performance, RemexHtml
cscott added a comment to T218183: Audit uses of PHP DOM in Wikimedia software.

In general I would be happy to see an interface that allowed components to pass around DOM subtrees instead of strings, stringifying the tree only where needed for a legacy API. Then "composition" is (eventually) just subtree assembly, and we don't have to worry about poorly-constructed components leaking open tags into the rest of the content...

Mar 14 2019, 3:42 PM · TechCom, MediaWiki-General-or-Unknown, Parsoid-PHP
cscott added a comment to T204595: Evaluate and document performance of RemexHtml vs Domino.

Testing with the following script (from the zest library home dir) with xdebug off:

$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> require('./tests/ZestTest.php');
=> 1
>>> $html100 = file_get_contents('./obama.html'); strlen($html100);
=> 2592386
>>> timeit -n10 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html100, [ 'suppressHtmlNamespace' => true, 'ignoreErrors' => true ]) && true;
=> true
Command took 0.389295 seconds on average (0.376101 median; 3.892954 total) to complete.
Mar 14 2019, 3:00 PM · RemexHtml, Parsoid-PHP
cscott added a comment to T204595: Evaluate and document performance of RemexHtml vs Domino.

Could you post your test scripts somewhere? To be fair we should probably factor process startup and file read times out of the measurements (the 350ms overhead you're measuring). It seems like we should dig into the slow Remex performance on large documents more, though, to figure out if there are some O(N^2) tree-mutation algorithms we need to kill, and if so figure out how hard they will be to fix (ie, are the bugs in Remex, or in the PHP DOM extension).
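
A sketch of what factoring out that overhead could look like: read the input once, then time only the parse call in-process. The function averageSeconds is a hypothetical harness, and strlen stands in for the real parser callable purely so the example is self-contained.

```php
<?php
// Hypothetical harness: time only the parse call, in-process, so that
// interpreter startup and the file read are excluded from the result.
function averageSeconds( callable $parse, string $html, int $runs = 10 ): float {
	$totalNs = 0;
	for ( $i = 0; $i < $runs; $i++ ) {
		$start = hrtime( true ); // monotonic nanoseconds, PHP 7.3+
		$parse( $html );
		$totalNs += hrtime( true ) - $start;
	}
	return $totalNs / $runs / 1e9;
}

// Stand-in for a document read from disk before timing starts.
$html = str_repeat( '<p>x</p>', 1000 );
// 'strlen' is a placeholder for the parser under test.
$avg = averageSeconds( 'strlen', $html );
```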

Mar 14 2019, 4:18 AM · RemexHtml, Parsoid-PHP