cscott (C. Scott Ananian)
Parser whisperer

User Details

User Since
Oct 21 2014, 6:47 PM (246 w, 6 d)
Availability
Available
IRC Nick
cscott
LDAP User
Unknown
MediaWiki User
Cscott [ Global Accounts ]

Editor since 2005; WMF developer since 2013. I work on Parsoid and OCG, and dabble with VE, real-time collaboration, and OOjs.

On github: https://github.com/cscott

See https://en.wikipedia.org/wiki/User:cscott for more.

Recent Activity

Yesterday

cscott committed rMLLCf2ae1eab696c: Update jsdoc-wmf-theme to 0.0.3 (authored by cscott).
Update jsdoc-wmf-theme to 0.0.3
Mon, Jul 15, 10:11 PM
cscott committed rJWTHee6b131bca4a: Bump version after release. (authored by cscott).
Bump version after release.
Mon, Jul 15, 8:19 PM
cscott committed rJWTH7d1b73775773: Release 0.0.3 (authored by cscott).
Release 0.0.3
Mon, Jul 15, 8:19 PM

Thu, Jul 11

cscott committed rMLAL0d90bc9a7707: Add new wiki homepage for the library; tabify composer.json (authored by cscott).
Add new wiki homepage for the library; tabify composer.json
Thu, Jul 11, 3:56 PM

Wed, Jul 10

cscott committed rMLAL440911602cc4: Linting and minor improvements (authored by cscott).
Linting and minor improvements
Wed, Jul 10, 10:30 PM
cscott committed rMLAL15b7de3a19c5: Fix indentation in a doc comment (authored by cscott).
Fix indentation in a doc comment
Wed, Jul 10, 10:30 PM
cscott committed rMLAL99e75df3b68b: Test cases (authored by cscott).
Test cases
Wed, Jul 10, 10:30 PM
cscott committed rMLAL02497e2613bf: Fix phan and phpunit configuration (authored by cscott).
Fix phan and phpunit configuration
Wed, Jul 10, 10:30 PM
cscott committed rMLALffa518eec3a5: Skeleton package files and README (authored by cscott).
Skeleton package files and README
Wed, Jul 10, 10:29 PM
cscott committed rMLALf8ab73427242: Initial port (authored by cscott).
Initial port
Wed, Jul 10, 10:29 PM
cscott created T227693: Parsoid should strip the hash fragment for non-existent pages.
Wed, Jul 10, 5:47 PM · Parsoid

Tue, Jul 9

cscott added a comment to T206497: Enable $wgMFNoindexPages for: Italian, Dutch, Korean, Arabic, Chinese, and Hindi Wikipedias.

I find it puzzling that "we know they cannot". We know for a fact that they are using a special pipeline for handling wiki content, and we have worked with them to build it. We can see their User-Agent in our logs ("MediaWikiCrawler-Google/2.0"). I've talked to Google engineers personally. They come to Wikimania. I can understand that Google is a big company, and sometimes it can be hard to find the right person at Google who actually knows how things work. But talking to the wrong person at Google isn't proof of anything.

Tue, Jul 9, 5:01 PM · Performance-Team (Radar), Readers-Web-Backlog, Wikimedia-Site-requests, Hindi-Sites, Chinese-Sites, SEO

Mon, Jul 8

cscott added a comment to T227352: Set up extension tests for Parsoid repo.

We should update our developer docs as well to reflect best practices for installing Parsoid & running PHP tests. Should Parsoid be installed inside core/extensions/Parsoid and/or is setting $MW_INSTALL_PATH sufficient? Ideally composer extension-test or something would Do The Right Thing assuming proper setup (whether that's installing Parsoid inside the extension folder or setting MW_INSTALL_PATH or whatever else the Right Thing needs to be).

Mon, Jul 8, 3:33 PM · Continuous-Integration-Config, Parsoid-PHP

Fri, Jul 5

dbarratt awarded T207168: Provide JSON-LD support for Wikidata a Party Time token.
Fri, Jul 5, 1:46 PM · Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-extensions-WikibaseRepository

Wed, Jul 3

cscott created T227240: VE isn't sending the original page's revision ID when a template is inserted/edited.
Wed, Jul 3, 11:52 PM · Parsoid-DOM, VisualEditor
cscott added a comment to T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags.

Looks like the DSR offsets are getting scrambled and/or the base text is changing, and selser is substituting regions from a different wikitext and/or from an offset region. Seems to require that a parameter of the ref is changed, so that selser isn't skipping over the entire ref but is instead trying to do a finer-grained selective serialization of each parameter.

Wed, Jul 3, 9:43 PM · User-Ryasmeen, Parsoid, VisualEditor
cscott added a comment to T227205: Provide a self-close param for #tag: to enable generation of pseudo tags that self-close.

Right, I'm proposing to make it do that (if it doesn't already). As far as I know, extension tag parsing in wikitext (as opposed to HTML) should allow self-closed tags anywhere and treat it completely equivalently to an open/close tag pair with a zero-length content, so there shouldn't be any problem with always emitting self-closed tags when the content is empty.
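
The rule proposed here can be sketched in a few lines. This is a minimal illustration only: `serializeExtTag` is a hypothetical helper, not the actual `#tag:` implementation in ParserFunctions, and the attribute escaping is deliberately simplistic.

```javascript
// Hypothetical helper (not the real #tag: code): emit a self-closed tag
// whenever the content is zero-length, otherwise an open/close pair.
function serializeExtTag(name, attrs, content) {
  const attrText = Object.entries(attrs)
    .map(([key, value]) => {
      // Simplistic escaping, enough for the illustration.
      const escaped = String(value).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
      return ` ${key}="${escaped}"`;
    })
    .join('');
  if (content === '') {
    return `<${name}${attrText} />`;
  }
  return `<${name}${attrText}>${content}</${name}>`;
}

console.log(serializeExtTag('references', { group: 'note' }, ''));
// <references group="note" />
console.log(serializeExtTag('ref', { name: 'a' }, 'text'));
// <ref name="a">text</ref>
```

Because the decision depends only on the content length, no attribute name has to be reserved to request self-closing.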

Wed, Jul 3, 8:26 PM · ParserFunctions
cscott added a comment to T227205: Provide a self-close param for #tag: to enable generation of pseudo tags that self-close.

The general practice seems to be that if the content is zero-length, then a self-closed tag would be generated. Would that be sufficient? I don't think we need to add an extra parameter. (What if you wanted to define an attribute named 'selfclose'?)

Wed, Jul 3, 4:40 PM · ParserFunctions

Tue, Jul 2

cscott added a comment to T223969: "PHP fatal error: entire web request took longer than 60 seconds and timed out" on zh.wikisource page.

With gerrit 520293 the time taken drops from:

83.21user 267.41system 5:52.38elapsed

to

6.60user 0.08system 0:06.72elapsed

Adding gerrit 520294 on top further reduces runtime to

6.02user 0.06system 0:06.11elapsed

on this zh.wikisource page.

Tue, Jul 2, 6:39 PM · MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Parsing-Team, Wikimedia-production-error, MediaWiki-Language-converter, Performance, Chinese-Sites
cscott added a comment to T223969: "PHP fatal error: entire web request took longer than 60 seconds and timed out" on zh.wikisource page.

Wikitext source, for reference:

(I think this has an extra trailing newline appended, if it matters)

Tue, Jul 2, 5:23 PM · MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Parsing-Team, Wikimedia-production-error, MediaWiki-Language-converter, Performance, Chinese-Sites

Thu, Jun 27

cscott added a comment to T215000: Fill gaps in PHP DOM's functionality.

Another difference we found: PHP's Node::normalize() doesn't actually remove zero-length text nodes, like the spec says it should (DOM level 2: https://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-normalize ; latest DOM spec: https://dom.spec.whatwg.org/#dom-node-normalize ).
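
For reference, the behaviour the spec describes can be sketched over a toy tree. This is an assumption-laden illustration: the `{ type, data, children }` node shape is made up for the example and is not PHP's DOM extension or any Parsoid API.

```javascript
// Toy normalize(): drops zero-length text nodes and merges adjacent text
// nodes, recursing into element children. Node shape is hypothetical.
function normalize(node) {
  if (node.type !== 'element') {
    return node;
  }
  const kept = [];
  for (const child of node.children) {
    if (child.type === 'text') {
      if (child.data === '') {
        continue; // the step PHP's Node::normalize() reportedly skips
      }
      const prev = kept[kept.length - 1];
      if (prev && prev.type === 'text') {
        prev.data += child.data; // merge with the preceding text node
        continue;
      }
      kept.push(child);
    } else {
      kept.push(normalize(child));
    }
  }
  node.children = kept;
  return node;
}

const tree = {
  type: 'element',
  name: 'p',
  children: [
    { type: 'text', data: 'a' },
    { type: 'text', data: '' },
    { type: 'text', data: 'b' },
    { type: 'element', name: 'b', children: [{ type: 'text', data: '' }] },
  ],
};
normalize(tree);
console.log(tree.children.length); // 2
```

A port that relies on normalized trees could run a pass like this after calling the built-in method, at the cost of one extra traversal.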

Thu, Jun 27, 8:54 PM · Patch-For-Review, Parsoid-PHP

Jun 14 2019

cscott closed T219901: Default to Preprocessor_Hash for PHP 7 as Resolved.

I'm closing this bug since https://gerrit.wikimedia.org/r/502567 was merged. Technically, we ought to change the shipping defaults to match our production configuration, but we can handle that over in T204945.

Jun 14 2019, 4:27 PM · MW-1.34-release, Performance-Team (Radar), MediaWiki-Parser
cscott added a comment to T204945: Deprecate one of the Preprocessor implementations for 1.34.

I updated the patch ( https://gerrit.wikimedia.org/r/502578 ) to resolve merge conflicts since it was originally written for 1.33. Would be a good idea to deploy this (or something like it) since WMF is running Preprocessor_Hash on PHP7 now but our official release still defaults to Preprocessor_DOM on that platform.

Jun 14 2019, 4:25 PM · MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Technical-Debt (Deprecation), MediaWiki-Parser
cscott updated the task description for T204945: Deprecate one of the Preprocessor implementations for 1.34.
Jun 14 2019, 4:02 PM · MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Technical-Debt (Deprecation), MediaWiki-Parser
cscott added a comment to T204945: Deprecate one of the Preprocessor implementations for 1.34.

Yes, now that https://gerrit.wikimedia.org/r/502567 has been merged, this task is no longer stalled.

Jun 14 2019, 4:02 PM · MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Technical-Debt (Deprecation), MediaWiki-Parser
cscott updated the task description for T204945: Deprecate one of the Preprocessor implementations for 1.34.
Jun 14 2019, 3:56 PM · MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Technical-Debt (Deprecation), MediaWiki-Parser
cscott added a comment to T185093: Template transclusion of a few list items extends to the entire list in VE, making it impossible to edit visually.

This *might* be related to Parsoid's slightly-incorrect handling of initial newlines in wikitext (T175421: Parsoid doesn't insert same spacers when article text starts with two or more newlines.) -- I think the "insert leading newline if the contents begin with *" part is actually done by the preprocessor, and so the result Parsoid gets back from the legacy preprocessor should have this hack (T14974, T2529) already applied.

Jun 14 2019, 3:55 PM · VisualEditor-MediaWiki, Parsoid, VisualEditor

Jun 10 2019

cscott added a comment to T180611: Figure captions can serialize to valid options.

The bug here is related to the semantics of image options where "anything that doesn't look like a valid option is treated as a caption". Ie:

[[File:Foo.jpg|right|Caption|200px]]

Has the caption Caption because 200px looks like an image option. But:

[[File:Foo.jpg|right|Caption|x200px]]

...has the caption x200px.
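
A toy model of that rule, to make the serialization hazard concrete. Everything here is an assumption for illustration: `looksLikeOption` recognizes only a handful of keywords plus a plain `NNNpx` width (the real option grammar is much larger), and "the last non-option part wins" is inferred from the two examples above.

```javascript
// Deliberately tiny, hypothetical option recognizer.
const looksLikeOption = (s) =>
  /^(left|right|center|none|thumb|frame|frameless|border)$/.test(s) ||
  /^\d+px$/.test(s);

// Split [[File:...|a|b|c]] into target, options and caption.
function splitImageOptions(wikitext) {
  const inner = wikitext.replace(/^\[\[|\]\]$/g, '');
  const [target, ...parts] = inner.split('|');
  const options = [];
  let caption = null;
  for (const part of parts) {
    if (looksLikeOption(part.trim())) {
      options.push(part.trim());
    } else {
      caption = part; // assumed: the last non-option part is the caption
    }
  }
  return { target, options, caption };
}

console.log(splitImageOptions('[[File:Foo.jpg|right|Caption|200px]]').caption);
// Caption
console.log(splitImageOptions('[[File:Foo.jpg|right|Caption|x200px]]').caption);
// x200px
```

Under this model, a caption whose text is literally `200px` cannot be written out unescaped: `[[File:Foo.jpg|200px]]` comes back with no caption and a width option instead.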

Jun 10 2019, 5:52 PM · Parsoid-Edit-Support, Parsoid-Serializer

Jun 7 2019

cscott added a comment to T225217: VE is removing spaces (dirty diffs) on some wikis (wikitech, officewiki).

Hm! Turns out that we've been worrying about the complex caching semantics required of RESTbase during edits, while all along merrily ignoring caching altogether....

Jun 7 2019, 3:20 PM · User-Ryasmeen, Patch-For-Review, Parsoid-Edit-Support, VisualEditor

Jun 5 2019

cscott added a comment to T213980: For every ported file, audit all regular expressions for subtle mismatches (especially around \s usage and those ending with $ and might require a D regexp modifier).

js2php's regexps are correct, it is just a bit conservative about escaping $ in double-quoted strings. It doesn't technically need the backslash escape so long as the next character isn't a valid PHP identifier start or {. I can tweak js2php, but there's no bug that needs to be fixed per se.
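
A less conservative escaper might look roughly like the sketch below. `escapeDollars` is a hypothetical name, not js2php's actual code, and it handles only `$`: backslashes and double quotes would need their own escaping. Any non-ASCII character is treated as an identifier start, on the assumption that its UTF-8 bytes fall in the range PHP accepts there.

```javascript
// Escape `$` for a PHP double-quoted string only where PHP would otherwise
// start variable interpolation: before an identifier start or before `{`.
function escapeDollars(s) {
  return s.replace(/\$(?=[A-Za-z_{]|[^\x00-\x7f])/g, () => '\\$');
}

console.log(escapeDollars('^foo$'));  // ^foo$   (end-of-pattern anchor untouched)
console.log(escapeDollars('a$b'));    // a\$b
console.log(escapeDollars('x${y}'));  // x\${y}
console.log(escapeDollars('$1'));     // $1      (a digit cannot start an identifier)
```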

Jun 5 2019, 6:01 PM · Parsoid-PHP

May 30 2019

cscott added a comment to T221872: composer-package-php73-docker seems to fail often on Parsoid builds.

This is now *mostly* fixed. Getting spurious failures on other jobs, eg ENOMEM on https://integration.wikimedia.org/ci/job/parsoidsvc-npm-run-roundtrip-node-6-docker/5677/console

May 30 2019, 9:18 PM · phan-taint-check-plugin, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (Kanban), phan, Parsoid

May 29 2019

cscott added a comment to T221872: composer-package-php73-docker seems to fail often on Parsoid builds.

We could probably also tweak the .phan/config.php to exclude more stuff in vendor/, including phan itself.

May 29 2019, 7:34 PM · phan-taint-check-plugin, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (Kanban), phan, Parsoid

May 28 2019

cscott added a comment to T224377: Properly implement trace, dump, debug log support via possibly a LoggingUtils.

See https://www.mediawiki.org/wiki/Manual:How_to_debug#Creating_custom_log_groups -- mediawiki core already has a lot of logging support; hopefully we could reuse this rather than reinvent it.

May 28 2019, 3:45 PM · Parsoid-PHP
cscott added a comment to T222419: Incorrect section numbering after unclosed subst.

The PHP parser does it in the preprocessor, and there are edge cases that mean if we want compatibility it's best to continue to do so. I moved it to the tokenizer to fix these edge cases; if we move it back we're going to regress on those.

May 28 2019, 2:47 PM · Parsoid-Read-Views, Patch-For-Review, Reading-Infrastructure-Team-Backlog, Page Content Service

May 26 2019

cscott added a comment to T224357: Tag a new release of RemexHTML.

You're right, never mind. Tag freely!

May 26 2019, 3:57 AM · Parsing-Team, RemexHtml
cscott added a comment to T224357: Tag a new release of RemexHTML.

I have one more patch outstanding I was going to merge ( https://gerrit.wikimedia.org/r/508037 ) but I can do that in a subsequent point release. Note that wikipeg is a dual JS/PHP package now, you should probably release to both npm and composer. That involves bumping the version in package.json before tagging a commit. I can do this Tuesday if that's easier.

May 26 2019, 3:22 AM · Parsing-Team, RemexHtml

May 23 2019

cscott added a comment to T224227: php-1.34.0-wmf.6/includes/TemplateParser.php(149) : eval()'d code: syntax error, unexpected '=>' (T_DOUBLE_ARROW), expecting ')'.

I don't think that anyone currently on @Parsing-Team knows anything about the mustache template code. I think all those folk were moved back into Platform.

May 23 2019, 2:55 PM · Core Platform Team Backlog (Watching / External), MediaWiki-HTML-Templating, Readers-Web-Backlog, Vector, Parsing-Team, PHP 7.2 support, Wikimedia-production-error

May 21 2019

cscott added a comment to T222964: WikiPEG character class optimization.

It would be nice to support proposed changes like this with benchmarks. My strong suspicion is that character class optimizations are already done by the javascript regex runtime, so these changes are unlikely to produce any actual performance improvement.

May 21 2019, 2:57 PM · WikiPEG, Patch-For-Review

May 15 2019

cscott created T223411: PHP regression when switching from extTagWidths to extTagOffsets.
May 15 2019, 8:11 PM · Patch-For-Review, Parsoid-PHP
cscott added a comment to T223296: proposal: new &? !? and ~? predicates.

If you look at Antlr ( https://www.antlr.org/ ) for inspiration, they have instead a single operator which means "return this!".

May 15 2019, 6:09 PM · Patch-For-Review, Parsoid

May 11 2019

Pppery awarded T112987: Separating infoboxes and navboxes from article content a Dislike token.
May 11 2019, 6:40 PM · Wikidata, Community-Wishlist-Survey-2015, Wikimedia-Developer-Summit-2016
cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

expansionDepth is a static variable in PPFrame_Hash::expand() -- as such it could be incremented by 1 by some PHP7-specific parse earlier in the process lifetime. For example, in the header which says "this is an experimental service running PHP7" or some other bit of wikitext which appears only on the PHP7 machines. I think that's a red herring.

May 11 2019, 4:06 AM · Core Platform Team Workboards (Done with CPT), serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott added a comment to T222992: Upgrading remex-html to 2.0.2 causes 18 test failures.

Chasing it down...
MWTidy::tidy() -> RemexDriver::tidy() creates

		$tokenizer = new Tokenizer( $dispatcher, $text, [
			'ignoreErrors' => true,
			'ignoreCharRefs' => true,
			'ignoreNulls' => true,
			'skipPreprocess' => true,
		] );

and then calls $tokenizer->execute() -- and it appears that the remex patch in question isn't honoring ignoreCharRefs and is decoding the entities regardless (at least in certain situations).

May 11 2019, 3:50 AM · RemexHtml, Parsing-Team, MediaWiki-Parser
cscott added a comment to T222992: Upgrading remex-html to 2.0.2 causes 18 test failures.

...though it seems impossible, I've confirmed that 3058052c756ac7b69ead21b4b237a1cd6714de8a in Remex causes these failures. (If only I knew why!)

I cannot find that commit in gerrit or git log. Is that the right hash?

May 11 2019, 3:06 AM · RemexHtml, Parsing-Team, MediaWiki-Parser
cscott added a comment to T222992: Upgrading remex-html to 2.0.2 causes 18 test failures.

...though it seems impossible, I've confirmed that 7b208b709161456b5a5a255e0bf249e57e2d5c97 in Remex causes these failures. (If only I knew why!)

May 11 2019, 2:57 AM · RemexHtml, Parsing-Team, MediaWiki-Parser
cscott added a comment to T222992: Upgrading remex-html to 2.0.2 causes 18 test failures.

These failures all have to do with how entities are escaped in attributes, which in theory shouldn't be affected by any of the changes between remex 2.0.1 and 2.0.2, which are about *parsing* (specifically tokenization) not *serialization*.

May 11 2019, 2:24 AM · RemexHtml, Parsing-Team, MediaWiki-Parser

May 10 2019

cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

Yeah, it's interesting that the various ppvisitednodes etc counts confirm that HHVM and PHP seem to be doing exactly the same work; ie, there's no weird PHP-7 behavior which is causing it to generate a subtly different graph. Assuming that the generated objects are exactly the same (which again, the counts seem to confirm), slight representation differences for objects *probably* wouldn't account for such a large difference. So here are my two theories:

May 10 2019, 5:17 PM · Core Platform Team Workboards (Done with CPT), serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error

May 9 2019

cscott added a comment to T222856: WikiPEG & Parsoid cache rule optimization.

This computation is done at grammar compile time; it has no effect on runtime (unless I'm missing something).

May 9 2019, 2:06 PM · WikiPEG, Patch-For-Review

May 8 2019

cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

Does this mean HHVM is using less memory for the same task than PHP 7? Or maybe it's measuring/enforcing it differently?

May 8 2019, 8:11 PM · Core Platform Team Workboards (Done with CPT), serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error

May 3 2019

cscott added a comment to T222419: Incorrect section numbering after unclosed subst.

I suspect what is happening is that we are incrementing section ID when we attempt to parse, and then failing to decrement when we backtrack out of the parse.
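
The suspected failure mode, and the usual remedy, can be illustrated with a toy backtracking helper. This is not Parsoid's tokenizer: `ParserState`, `attempt` and `sectionRule` are hypothetical names invented for the sketch.

```javascript
// Toy state with a section counter that rules bump as a side effect.
class ParserState {
  constructor() {
    this.sectionId = 0;
  }
}

// Run a rule; if it fails, restore the counter so the abandoned attempt
// leaves no trace. Omitting the restore is the suspected bug.
function attempt(state, rule, input) {
  const saved = state.sectionId;
  const result = rule(state, input);
  if (result === null) {
    state.sectionId = saved;
  }
  return result;
}

// Hypothetical rule: allocates an id first, then may fail on unclosed input.
const sectionRule = (state, input) => {
  state.sectionId++;
  return input.endsWith('}}') ? { id: state.sectionId } : null;
};

const state = new ParserState();
attempt(state, sectionRule, '{{subst: a surname'); // fails, counter restored
const ok = attempt(state, sectionRule, '{{1x}}');  // succeeds
console.log(ok.id); // 1
```

Without the restore in `attempt`, the second call would return id 2, which matches the shape of the gap in the reproduction for this task (section ids 0, 1, 3).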

May 3 2019, 1:35 PM · Parsoid-Read-Views, Patch-For-Review, Reading-Infrastructure-Team-Backlog, Page Content Service
cscott added a comment to T222419: Incorrect section numbering after unclosed subst.

Reproduced:

$ (echo '==Foo==' ; echo '{{subst: a surname' ; echo '==Bar==') | bin/parse.js --wrapSections --normalize=parsoid
<section data-mw-section-id="0"></section><section data-mw-section-id="1">
<h2 id="Foo">Foo</h2>
<p>{{subst: a surname</p>
</section><section data-mw-section-id="3">
<h2 id="Bar">Bar</h2>
</section>
May 3 2019, 1:34 PM · Parsoid-Read-Views, Patch-For-Review, Reading-Infrastructure-Team-Backlog, Page Content Service

May 2 2019

cscott added a comment to T222328: [extlink] parsing - link cannot contain language variant or extension tags.

This is a weird corner case of legacy parser behavior; it's not entirely clear Parsoid should follow it. But it's not clear that what Parsoid is doing is totally consistent either.

May 2 2019, 1:50 PM · Chinese-Sites, Parsoid
cscott added a comment to T221920: [extlink] parsing - don't validate protocol twice on simple link.

It's not either-or -- wikitext like [http://example.com/{{1x|path/to/resource}} blah] is possible too, which your rewrite doesn't capture. We could strip the redundant checks and fall back on the final check which is done after template expansion, but in general we prefer doing the early checks, if they are simple and fast (as the protocol check is), in order to avoid wasting too much time on a potential parse which will ultimately be rejected. The sooner we can reject it the better, hence the three-fold check in the current code.

May 2 2019, 2:43 AM · Parsoid

May 1 2019

cscott closed T221920: [extlink] parsing - don't validate protocol twice on simple link as Invalid.

You've confused the precedence of the / operator a little bit. In fact, the addr can be the empty string "" which is why we need a secondary check. In fact, I believe there's a third-string check after templates are expanded, as well.

May 1 2019, 4:31 PM · Parsoid
cscott added a comment to T222266: Edge case difference processing templated styles in table cells.

Current parsoid output for the test case wikitext above:

$ echo -n | bin/parse.js --pageName User:cscott/T222266
<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="http://en.wikipedia.org/wiki/Special:Redirect/revision/895030248"><head prefix="mwr: http://en.wikipedia.org/wiki/Special:Redirect/"><meta charset="utf-8"/><meta property="mw:pageNamespace" content="2"/><meta property="mw:pageId" content="60633826"/><link rel="dc:replaces" resource="mwr:revision/0"/><meta property="dc:modified" content="2019-05-01T15:20:40.000Z"/><meta property="mw:revisionSHA1" content="68d05a01c3a5c6743d96a854d9754428a3b4fcc1"/><meta property="mw:html:version" content="2.1.0"/><link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/User%3Acscott/T222266"/><title>User:Cscott/T222266</title><base href="//en.wikipedia.org/wiki/"/><link rel="stylesheet" href="//en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Csite.styles%7Cext.cite.style%7Cext.cite.styles%7Cmediawiki.page.gallery.styles&amp;only=styles&amp;skin=vector"/><!--[if lt IE 9]><script src="//en.wikipedia.org/w/load.php?modules=html5shiv&amp;only=scripts&amp;skin=vector&amp;sync=1"></script><script>html5.addElements('figure-inline');</script><![endif]--><meta http-equiv="content-language" content="en"/><meta http-equiv="vary" content="Accept"/></head><body data-parsoid='{"dsr":[0,240,0,0]}' lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><table data-parsoid='{"dsr":[0,107,2,2]}'>
<tbody data-parsoid='{"dsr":[3,105,0,0]}'><tr data-parsoid='{"autoInsertedStart":true,"dsr":[3,104,0,0]}'><td about="#mwt18" typeof="mw:ExpandedAttrs" data-parsoid='{"a":{"{{Grade II colour}}":null},"sa":{"{{Grade II colour}}":""},"dsr":[3,25,21,0]}' data-mw='{"attribs":[[{"txt":"style=\"background-color: \"","html":"&lt;span about=\"#mwt1\" typeof=\"mw:Transclusion\" data-parsoid=&apos;{\"pi\":[[]],\"dsr\":[4,23,null,null]}&apos; data-mw=&apos;{\"parts\":[{\"template\":{\"target\":{\"wt\":\"Grade II colour\",\"href\":\"./Template:Grade_II_colour\"},\"params\":{},\"i\":0}}]}&apos;>style=\"background-color: &lt;/span>&lt;span typeof=\"mw:Nowiki\" about=\"#mwt1\" data-parsoid=\"{}\">\n#ACE1AF&lt;/span>&lt;span about=\"#mwt1\" data-parsoid=\"{}\">\"&lt;/span>"},{"html":""}]]}'>X</td>
<td style="color:red" about="#mwt4" typeof="mw:ExpandedAttrs" data-parsoid='{"a":{"style":"color:red"},"sa":{"style":"color:&lt;nowiki>red&lt;/nowiki>"},"dsr":[26,63,36,0]}' data-mw='{"attribs":[[{"txt":"style"},{"html":"color:&lt;span typeof=\"mw:Nowiki\" data-parsoid=&apos;{\"dsr\":[40,60,8,9]}&apos;>red&lt;/span>"}]]}'>Y</td>
<td style="color:
green" about="#mwt7" typeof="mw:ExpandedAttrs" data-parsoid='{"a":{"style":"color:\ngreen"},"sa":{"style":"color:&lt;nowiki>\ngreen&lt;/nowiki>"},"dsr":[64,104,39,0]}' data-mw='{"attribs":[[{"txt":"style"},{"html":"color:&lt;span typeof=\"mw:Nowiki\" data-parsoid=&apos;{\"dsr\":[78,101,8,9]}&apos;>\ngreen&lt;/span>"}]]}'>Z</td></tr>
</tbody></table>
May 1 2019, 3:25 PM · Parsoid-Read-Views
cscott added a comment to T222266: Edge case difference processing templated styles in table cells.

The <nowiki> in the style attribute was perhaps an attempt to work around T5158: Parser inserts invalid &nbsp; in the middle of style attribute (French spaces)/T197902: Be more selective in applying French Space armoring, which was fixed in July 2018?

May 1 2019, 3:19 PM · Parsoid-Read-Views
cscott updated subscribers of T222266: Edge case difference processing templated styles in table cells.
May 1 2019, 3:15 PM · Parsoid-Read-Views

Apr 26 2019

cscott added a comment to T221907: Security Concept Review For Parsoid-PHP.

Yes, indeed. I just finished digging through git log to find that patch, and came here and you'd beaten me to it. ;)

Apr 26 2019, 3:22 PM · Security-Team-Reviews, Parsoid-PHP
cscott added a comment to T221907: Security Concept Review For Parsoid-PHP.

I recall that we did have the issue with Parsoid being exposed to sensitive content, specifically usernames and deleted revisions. The details are foggy, but I vaguely recall that I had to remove some information from the <head> of the Parsoid document because it could expose information about deleted revisions.

Apr 26 2019, 3:05 PM · Security-Team-Reviews, Parsoid-PHP

Apr 25 2019

cscott added a comment to T90902: Non-breaking space in header ID breaks anchor.

@Pols12 that seems like possibly an interaction with french space armoring (T197902).

Apr 25 2019, 7:25 PM · Community-Tech, MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), Patch-For-Review, MediaWiki-Parser, Parsoid
cscott added a comment to T221876: Pick one html5 source (whatwg or w3c) and update all code documentation links to point to that source.

On the other hand, stating "we follow WHATWG HTML5" implies an intention to follow WHATWG as HTML5 grows/changes. We haven't stated that as an explicit goal in the past, but our community seems to assume it, insofar as historically, as soon as new elements were added to HTML (<section>, etc), there were folks who wanted to start using them in wikitext. Not every HTML feature intersects with wikitext, but tag and entity names are the most direct points of contact and historically they've tended to follow HTML's evolution closely.

Apr 25 2019, 6:22 PM · Documentation, Parsoid-PHP
cscott added a comment to T221876: Pick one html5 source (whatwg or w3c) and update all code documentation links to point to that source.

...except for the Japanese ruby issue, where the W3C has philosophical differences which apparently conflict with the desires of jawiki. (I used to actually understand the differences, but I paged all that out long ago.) According to MDN, only Firefox implements <rtc>, but all browsers except IE implement <rb>. Of course wikitext doesn't *have* to exactly follow anyone, but it's slightly less confusing to just say "wikitext follows the WHATWG HTML5 spec" than to say "wikitext follows the W3C HTML5 spec, except in the case of <ruby> elements".

Apr 25 2019, 5:28 PM · Documentation, Parsoid-PHP
cscott added a comment to T221876: Pick one html5 source (whatwg or w3c) and update all code documentation links to point to that source.

This seems pretty low priority; W3C is following the WHATWG spec so there's no real difference. WHATWG tends to lead, and then their decisions are ratified by W3C sometime later.

Apr 25 2019, 5:14 PM · Documentation, Parsoid-PHP
cscott created T221872: composer-package-php73-docker seems to fail often on Parsoid builds.
Apr 25 2019, 4:01 PM · phan-taint-check-plugin, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (Kanban), phan, Parsoid
cscott closed T219943: Create a composer library for wikipeg as Resolved.
Apr 25 2019, 2:46 PM · Patch-For-Review, Core Platform Team (Parsoid PHP (CDP2)), Parsoid-PHP
cscott created T221858: Release wikipeg 2.0.3 through npm.
Apr 25 2019, 2:21 PM · Patch-For-Review, Parsoid

Apr 24 2019

cscott created T221790: Parsoid extension API should use DOM fragments, not documents.
Apr 24 2019, 4:16 PM · Parsoid-PHP

Apr 23 2019

cscott added a comment to T126618: $wgAllowImageTag should also block/allow <video>/<audio> tags (and be renamed to $wgAllowMediaTags ?).

As a counter-argument: $wgAllowImageTag is not turned on for any WMF production wikis. You could argue that wgAllowImageTag should be deprecated and removed, rather than supported and expanded.

Apr 23 2019, 6:42 PM · Multimedia, MediaWiki-Parser
cscott added a comment to T221684: Some third-party wikis allow hotlinked images, but these aren't shown in preview.

Dup of T127884: Respect $wgAllowImageTag wiki configuration flag in the Sanitizer.

Apr 23 2019, 6:41 PM · Parsoid, VisualEditor, VisualEditor-MediaWiki-2017WikitextEditor
cscott created T221677: Sanitizer::validateAttributes is not as efficient as it could be.
Apr 23 2019, 5:14 PM · MW-1.34-notes (1.34.0-wmf.11; 2019-06-26), Parsoid-PHP, MediaWiki-Parser

Apr 22 2019

cscott added a comment to T118520: Use <figure-inline> instead of <span> for inline figures..

The spec has this explanation, https://www.mediawiki.org/wiki/Specs/HTML/1.5.0#Images

The outer <figure> element needs to become a <span> element when the figure is rendered inline, since otherwise the HTML5 parser will interrupt a surrounding block context. The inner <figcaption> element is rendered as a data-mw attribute in this case (since block content in an invisible caption would otherwise break parsing).

So, the difference is in how a browser will render it. Hope that helps.

Apr 22 2019, 5:01 PM · MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), MW-1.31-release-notes (WMF-deploy-2017-09-26 (1.31.0-wmf.1)), Patch-For-Review, Parsoid-DOM
cscott added a comment to T118517: [RFC] Use <figure> for media.

Adding a few details to @ssastry's update: Parsoid was changed to use <figure> and <figure-inline> in c9f404761cd288e7b58b89623ac459bbb2901a7d (T118520). The remaining work to be done is to transition core to use this same markup. The original plan was to do this in two steps: first convert block markup to use <figure>, and then as a follow-up convert inline markup to use <figure-inline>. Arlo has core patches written (linked above), but actually deploying them will take a careful process of communicating w/ local communities, linting, etc, which we do not plan to tackle until after the Parsoid port to PHP is complete.

Apr 22 2019, 4:59 PM · Patch-For-Review, Accessibility, Parsing-Team, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, MediaWiki-Parser, TechCom-RFC

Apr 20 2019

cscott updated the task description for T221491: Parsoid: change Cite to use \s+ as a separator instead of [ \t\n].
Apr 20 2019, 2:20 AM · Parsoid-Read-Views, Cite
cscott created T221491: Parsoid: change Cite to use \s+ as a separator instead of [ \t\n].
Apr 20 2019, 2:20 AM · Parsoid-Read-Views, Cite
cscott created T221490: Parsoid: Cite silently ignores all parameters in <ref> with more than two parameters.
Apr 20 2019, 2:11 AM · Parsoid-Read-Views
cscott created T221489: Parsoid: Extra spaces are meaningful in sfn (differs from article text behavior).
Apr 20 2019, 2:09 AM · Parsoid-Read-Views, Cite
cscott created T221488: Add "decoding=async" change to Parsoid image markup.
Apr 20 2019, 2:06 AM · Parsoid-Read-Views
cscott added a comment to T212124: Consider adding decoding=async to our img tags.

I thought about Parsoid, but first let's see if it's beneficial at all for merely reading articles rendered by MediaWiki on a web browser?

Apr 20 2019, 2:01 AM · MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), MediaWiki-General-or-Unknown, MediaWiki-extensions-General, Parsing-Team, Performance-Team

Apr 19 2019

cscott added a comment to T221041: Convert Parsoid to dependency injection.

In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever.

Having multiple wikitext parsers and having different parsers for different markups are two entirely different layers of abstraction. The latter is already mostly possible via the content handler mechanism, and the extent to which it is not possible has less to do with the parser and more with baked-in assumptions about certain things (like system messages) always using wikitext. I don't see much overlap between wider support for non-wikitext markup and the Parsoid-PHP project.

Apr 19 2019, 5:06 PM · User-Daniel, Core Platform Team (Decoupling (CDP2)), Parsoid-PHP, Technical-Debt

Apr 18 2019

cscott added a comment to T220018: html2wt should escape [ when it precedes an external link or autolink.

Note that we have a similar issue with {{ }}, as shown in our test case for T70421:

$ echo '{{1x|{{ }}}}' | bin/parse.js --wt2wt
{{1x|{{ }<nowiki>}</nowiki>}}

But this doesn't round trip:

$ echo '{{1x|{{ }<nowiki>}</nowiki>}}' | php maintenance/parse.php 
<p>{{1x|{{ }}}}
</p>
Apr 18 2019, 8:07 PM · Parsoid-Edit-Support, Parsoid-Nowiki
cscott added a comment to T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.

Oh, there is one buglet on JS, I wonder if that's how I managed to get this to reproduce:

$ node
> [null, undefined].join(':')
':'
> [null, undefined,32].join(':')
'::32'

That is, null and undefined are treated as interchangeable in the cache key. That's an issue in the PHP port as well:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.3-1 — cli) by Justin Hileman
>>> implode(':', [null, 32, '', false])
=> ":32::"

It would be safer to use json_encode() for PHP, probably, but JSON.stringify doesn't distinguish null and undefined on JS.
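
One way to sidestep the collision on the JS side is to tag each value before serializing it. The following is only a minimal sketch, not Wikipeg's actual cache code: `cacheKey` is a hypothetical helper name, and it assumes the rule variables are plain JSON-serializable values (or undefined).

```javascript
// Hypothetical helper, not part of Wikipeg: builds a cache key in which
// null, undefined, '' and false all stay distinct.
function cacheKey(values) {
	// Give undefined its own tag; JSON.stringify alone would serialize
	// an undefined array element as null.
	const tagged = values.map((v) => (v === undefined ? ['u'] : ['v', v]));
	return JSON.stringify(tagged);
}

console.log([null, undefined].join(':')); // prints :
console.log(cacheKey([null, undefined])); // prints [["v",null],["u"]]
```

The tagging only covers top-level values; an undefined nested inside an object or array would still be dropped or turned into null by JSON.stringify.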

Apr 18 2019, 7:08 PM · Patch-For-Review, Parsoid
cscott renamed T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined from Wikipeg cache is unsafe when rule variables are in use to Wikipeg cache is unsafe when rule variables are set to null/undefined.
Apr 18 2019, 7:00 PM · Patch-For-Review, Parsoid
cscott added a comment to T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.

No, I'm an idiot. The above is close, but the original value of the variable is in fact stored in the cache key.

Apr 18 2019, 6:37 PM · Patch-For-Review, Parsoid
cscott added a comment to T221028: Parsoid incorrectly parses links to pages starting with a slash '/' in namespaces that can have subpages.

As I said over in T110413, .//Foo is pretty dubious as an href. But then again, https://en.wikipedia.org/wiki//e/_(operating_system) exists on enwiki, so I guess that's legit.

Apr 18 2019, 5:07 PM · Parsoid-Read-Views
cscott added a comment to T110413: Relative links to subpages are treated as links to the mainspace that start with / until after saving.

.//Foo is pretty dubious as an href. I'd expect ./%2FFoo (if subpages aren't enabled) or ./CurrentPageTitle/Foo (if they are). The base href is always the root of the wiki.
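
The two expected hrefs can be illustrated with a small sketch. `subpageHref` is a hypothetical function, not Parsoid or VisualEditor code, and it deliberately ignores the rest of MediaWiki title normalization (spaces, namespaces, and so on); it only shows the leading-slash handling described above.

```javascript
// Hypothetical helper illustrating the two expected hrefs for [[/Foo]].
function subpageHref(target, currentTitle, subpagesEnabled) {
	if (subpagesEnabled && target.startsWith('/')) {
		// Relative subpage link: resolve against the current page title.
		return './' + currentTitle + target;
	}
	// Otherwise the leading slash is part of the title, so it is escaped.
	return './' + encodeURIComponent(target);
}

console.log(subpageHref('/Foo', 'CurrentPageTitle', true));  // prints ./CurrentPageTitle/Foo
console.log(subpageHref('/Foo', 'CurrentPageTitle', false)); // prints ./%2FFoo
```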

Apr 18 2019, 4:50 PM · Patch-For-Review, VisualEditor (Current work), VisualEditor-MediaWiki-Links
cscott added a comment to T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.

OK, pretty sure I found the bug. In the cache code, we save and restore reference values like so:

			start: [
				`$key = ${key};`,
				'$bucket = $this->currPos;',
				`$cached = $this->cache[$bucket][$key] ?? null;`,
				'if ($cached) {',
				'  $this->currPos = $cached[\'nextPos\'];',
				opts.loadRefs,
				'  return $cached[\'result\'];',
				'}',
				opts.saveRefs,
			].join('\n'),
			store: [
				`$cached = ['nextPos' => $this->currPos, 'result' => ${opts.result}];`,
				opts.storeRefs,
				`$this->cache[$bucket][$key] = $cached;`

Where opts.storeRefs typically looks something like this:

if ($saved_preproc !== $param_preproc) $cached['refs']["preproc"] = $param_preproc;
if ($saved_th !== $param_th) $cached['refs']["th"] = $param_th;
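
The save/restore pattern in the excerpt above can be modelled in plain JavaScript. This is a simplified sketch under stated assumptions, not the generated Wikipeg parser: `memoize`, `state`, and `refs` are hypothetical names, and the reference variables are assumed to hold only JSON-serializable primitives.

```javascript
// Simplified, hypothetical model of a memoized parser rule whose
// reference parameters (refs) can be modified as a side effect.
function memoize(rule) {
	const cache = new Map();
	return function (state, refs) {
		// The key uses the position and the *original* ref values.
		const key = state.pos + ':' + JSON.stringify(refs);
		const cached = cache.get(key);
		if (cached !== undefined) {
			state.pos = cached.nextPos;
			Object.assign(refs, cached.refs); // replay recorded ref changes
			return cached.result;
		}
		const saved = Object.assign({}, refs); // snapshot before running
		const result = rule(state, refs);
		// Store only the refs that the rule actually changed.
		const changed = {};
		for (const name of Object.keys(refs)) {
			if (refs[name] !== saved[name]) {
				changed[name] = refs[name];
			}
		}
		cache.set(key, { nextPos: state.pos, result: result, refs: changed });
		return result;
	};
}

// Usage: the second call is a cache hit but still updates pos and refs.
let calls = 0;
const rule = memoize(function (state, refs) {
	calls++;
	state.pos += 1;
	refs.th = true;
	return 'cell';
});
rule({ pos: 0 }, { th: false });
const state2 = { pos: 0 };
const refs2 = { th: false };
rule(state2, refs2);
console.log(calls, state2.pos, refs2.th); // prints 1 1 true
```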
Apr 18 2019, 4:43 PM · Patch-For-Review, Parsoid
cscott updated the task description for T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.
Apr 18 2019, 4:29 PM · Patch-For-Review, Parsoid
cscott updated subscribers of T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.
Apr 18 2019, 3:56 PM · Patch-For-Review, Parsoid
cscott created T221384: Wikipeg cache is unsafe when rule variables are set to null/undefined.
Apr 18 2019, 3:56 PM · Patch-For-Review, Parsoid

Apr 17 2019

cscott created T221238: Re-evaluate Parsoid's error handling strategy once it is integrated into MediaWiki.
Apr 17 2019, 2:55 PM · Parsoid-PHP

Apr 16 2019

cscott added a comment to T221041: Convert Parsoid to dependency injection.

I know @ssastry and I disagree on this, but I'd like for us to move toward a place where Parsoid is an extension. In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever. I'd like to reduce the dependencies between mediawiki core and any one particular parser implementation.

Apr 16 2019, 6:11 PM · User-Daniel, Core Platform Team (Decoupling (CDP2)), Parsoid-PHP, Technical-Debt

Apr 11 2019

cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

I'd recommend merging 502567 ^. @Krinkle, if you agree, could you C+1 so we can get this scheduled for SWAT?

Apr 11 2019, 2:32 PM · Core Platform Team Workboards (Done with CPT), serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error

Apr 9 2019

cscott added a comment to T219901: Default to Preprocessor_Hash for PHP 7.

To be conservative, let's configure it for PHP7 first -- since that's an experimental deployment anyway. Then we can talk about changing the defaults once we are certain that we don't come across anything unexpected in production use. (T216664 was unexpected, for instance, at least to me. I'd think the likelihood of being surprised is even lower with a shift to Preprocessor_Hash since we're already using it in production on HHVM... but that's the thing about unknowns: you never know them in advance...)

Apr 9 2019, 5:35 PM · MW-1.34-release, Performance-Team (Radar), MediaWiki-Parser
cscott added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

I was sort of waiting to see if @tstarling had anything to say about this, as (a) he wrote both Preprocessor_DOM and Preprocessor_Hash (AFAIK) and (b) he was the one who objected to the previous plan (deprecating _Hash in favor of _DOM).

Apr 9 2019, 5:30 PM · Core Platform Team Workboards (Done with CPT), serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
cscott committed rMLZEffc83014b1d7: Bug fix in :first-child selector (unique to PHP port) (authored by cscott).
Bug fix in :first-child selector (unique to PHP port)
Apr 9 2019, 4:13 PM
cscott committed rMLZE33335142718c: Bug fix in :first-child selector (unique to PHP port) (authored by cscott).
Bug fix in :first-child selector (unique to PHP port)
Apr 9 2019, 3:38 PM

Apr 8 2019

cscott added a comment to T217867: Port domino (or another spec-compliant DOM library) to PHP.
Apr 8 2019, 9:40 PM · Core Platform Team (Parsoid PHP (CDP2)), Parsoid-PHP

Apr 6 2019

Mill <mill@mail.com> committed R1907:2da18fafd017: d#baaaaaaaaaaa (authored by cscott).
d#baaaaaaaaaaa
Apr 6 2019, 2:35 AM

Apr 5 2019

Mill <mill@mail.com> committed rELINT3edfb7275584: wsbaaaaaaaaaaa (authored by cscott).
wsbaaaaaaaaaaa
Apr 5 2019, 10:42 PM
Mill <mill@mail.com> committed rELINT9ef59fcdfceb: yzbaaaaaaaaaaa (authored by cscott).
yzbaaaaaaaaaaa
Apr 5 2019, 10:42 PM