Page MenuHomePhabricator

Bugs in PHP port of LanguageConverter
Closed, ResolvedPublic

Description

We deployed Parsoid/PHP on srwiki and zhwiki and found some bugs in LanguageConverter which hadn't been found with our test suite.
Here are the notices from logstash.

  • PHP Notice: Undefined property: stdClass::$disabled
  • PHP Notice: Undefined property: stdClass::$oneway
  • PHP Notice: Undefined property: stdClass::$twoway
  • PHP Notice: Undefined property: stdClass::$name
  • PHP Notice: Trying to get property 'epsEdge' of non-object
  • PHP Notice: Trying to get property 'blen' of non-object
  • PHP Notice: Trying to get property 'outpos' of non-object
  • PHP Notice: Trying to get property 'idx' of non-object

Event Timeline

Change 554113 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Bug fixes for PHP port of ConversionTraverser

https://gerrit.wikimedia.org/r/554113

ssastry triaged this task as High priority.Dec 2 2019, 6:18 PM
ssastry updated the task description. (Show Details)

These should be handled by https://gerrit.wikimedia.org/r/554113:

PHP Notice: Undefined property: stdClass::$disabled
PHP Notice: Undefined property: stdClass::$oneway
PHP Notice: Undefined property: stdClass::$twoway
PHP Notice: Undefined property: stdClass::$name

I believe these are the result of BacktrackState being omitted from the repo (https://gerrit.wikimedia.org/r/554085):

PHP Notice: Trying to get property 'epsEdge' of non-object
PHP Notice: Trying to get property 'blen' of non-object
PHP Notice: Trying to get property 'outpos' of non-object
PHP Notice: Trying to get property 'idx' of non-object

Mentioned in SAL (#wikimedia-operations) [2019-12-03T11:58:02Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@b346ebf]: Mirror html2html traffic to Parsoid/PHP - T229015 T239643

Mentioned in SAL (#wikimedia-operations) [2019-12-03T11:58:10Z] <mobrovac@deploy1001> deploy aborted: Mirror html2html traffic to Parsoid/PHP - T229015 T239643 (duration: 00m 00s)

Mentioned in SAL (#wikimedia-operations) [2019-12-03T11:58:33Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@b346ebf]: Mirror html2html traffic to Parsoid/PHP - T229015 T239643

Mentioned in SAL (#wikimedia-operations) [2019-12-03T12:12:02Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@b346ebf]: Mirror html2html traffic to Parsoid/PHP - T229015 T239643 (duration: 13m 29s)

Change 554276 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] TEMP: Log all html2html errors coming from Parsoid/PHP

https://gerrit.wikimedia.org/r/554276

Change 554276 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] TEMP: Log all html2html errors coming from Parsoid/PHP

https://gerrit.wikimedia.org/r/554276

Mentioned in SAL (#wikimedia-operations) [2019-12-03T12:28:08Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@41bb230]: Log all html2html errors coming from Parsoid/PHP - T239643

Mentioned in SAL (#wikimedia-operations) [2019-12-03T12:42:49Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@41bb230]: Log all html2html errors coming from Parsoid/PHP - T239643 (duration: 14m 41s)

RESTBase is now mirroring all html2html requests to Parsoid/PHP.

Mentioned in SAL (#wikimedia-operations) [2019-12-03T13:11:40Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@92acf1e]: Revert mirroring html2html traffic to PHP - T239643

I believe these are the result of BacktrackState being omitted from the repo (https://gerrit.wikimedia.org/r/554085):

PHP Notice: Trying to get property 'epsEdge' of non-object
PHP Notice: Trying to get property 'blen' of non-object
PHP Notice: Trying to get property 'outpos' of non-object
PHP Notice: Trying to get property 'idx' of non-object

No, these errors still exist. See logstash from this morning after Marko mirrored traffic. These are all still present in large volumes.
Ex: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.12.03/parsoid-php?id=AW7L6Ynkh3Uj6x1zlmba&_g=h@6fb875b

Change 554113 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Bug fixes for PHP port of ConversionTraverser

https://gerrit.wikimedia.org/r/554113

Mentioned in SAL (#wikimedia-operations) [2019-12-03T13:22:23Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@92acf1e]: Revert mirroring html2html traffic to PHP - T239643 (duration: 10m 43s)

Mirroring had to be disabled due to too many errors being emitted by Parsoid/PHP. Once the fixes for it have been deployed, RESTBase can de re-set up to mirror the traffic. I have prepared everything for it, all that needs to be done is to deploy restbase with scap deploy -f from deploy1001.

I believe these are the result of BacktrackState being omitted from the repo (https://gerrit.wikimedia.org/r/554085):

No, these errors still exist. See logstash from this morning after Marko mirrored traffic. These are all still present in large volumes.

Hm. I'll have to think about that more. The only thing which is added to the stack are BacktrackResult objects. Maybe this is a stack underflow of some kind, although I'd expect to have a crash when null is dereferenced....

It's a stack underflow; I can confirm that PHP emits warnings instead of crashing when you dereference null here (weird!).

I wasn't able to reproduce locally, but I fixed a number of other issues and added a debugging message and an assertion to this path to make it easier to reproduce (hopefully) from the log messages. See https://gerrit.wikimedia.org/r/554385

Change 554385 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/libs/LangConv@master] Add pFST filename to assertion messages for easier debugging

https://gerrit.wikimedia.org/r/554385

Well, one reason I'm having trouble reproducing this is that the logged errors don't (apparently) contain the actual page title being processed; see T239768: All RestBase mirrored html2html language conversion pages have page title set to Main Page.

I could probably work around this in Parsoid by catching thrown errors and looking at the <title> element of the html that we're processing, but let's see if this is easy to fix on the RESTBase side first.

Change 554385 merged by jenkins-bot:
[mediawiki/libs/LangConv@master] Add pFST filename to assertion messages for easier debugging

https://gerrit.wikimedia.org/r/554385

Mentioned in SAL (#wikimedia-operations) [2019-12-04T15:12:41Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@f4b752e]: Parsoid: Set title when sending html2html reqs; Mirror 6% of html2html reqs to Parsoid/PHP - T239768 T239643

Mentioned in SAL (#wikimedia-operations) [2019-12-04T15:28:42Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@f4b752e]: Parsoid: Set title when sending html2html reqs; Mirror 6% of html2html reqs to Parsoid/PHP - T239768 T239643 (duration: 16m 02s)

Ah! Can now reproduce:

$php bin/parse.php --restURL '/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle/%E8%89%BE%E5%A1%9E%E5%88%97%E5%85%AC%E7%88%B5' --htmlVariantLanguage zh-hans
Wikimedia\Assert\InvariantException from line 217 of /home/cananian/Projects/Wikimedia/Parsoid/vendor/wikimedia/assert/src/Assert.php: Invariant failed: /home/cananian/Projects/Wikimedia/langconv/src/../fst/brack-zh-hant-noop.pfst
#0 /home/cananian/Projects/Wikimedia/langconv/src/FST.php(149): Wikimedia\Assert\Assert::invariant(false, '/home/cananian/...')
#1 /home/cananian/Projects/Wikimedia/langconv/src/FST.php(208): Wikimedia\LangConv\FST->Wikimedia\LangConv\{closure}()
#2 /home/cananian/Projects/Wikimedia/langconv/src/ReplacementMachine.php(92): Wikimedia\LangConv\FST->run('\xE7\xBE\x85\xE6\xAF\x94; zh-hk:\xE7...', 0, 46, true)
#3 /home/cananian/Projects/Wikimedia/langconv/src/ReplacementMachine.php(187): Wikimedia\LangConv\ReplacementMachine->countBrackets('\xE7\xBE\x85\xE6\xAF\x94; zh-hk:\xE7...', 'zh-hant', 'zh-hant')
[...]

It's an invalid UTF-8 string, which is then causing the FST to fail (because it doesn't have edges corresponding to invalid UTF-8 sequences). The string in stack trace #2 there looks like a bogus fragment -- that's some text from the middle of -{ }- markup but the boundaries don't look right to me -- I'm guessing that's the underlying problem.

cscott claimed this task.

Fixed in langconv 0.3.2 (commit 5ea2fe04fe4ace60c7b6f0fc96c77dc4e38c640f).