
Parsoid/PHP errors
Closed, Duplicate · Public

Description

As of midnight UTC (2019-12-04T00:00Z), RESTBase has been observing parsing errors from Parsoid/PHP at a steady rate of 10-20 per minute. Given that this was not happening before, it should be investigated ASAP. The errors as perceived by RESTBase can be seen on this dashboard.

Event Timeline

mobrovac created this task.

Can't immediately reproduce this on the command line in API mode; e.g., this produces no errors:

$ php bin/parse.php --restURL 'https://ar.wikipedia.org/api/rest_v1/page/html/%D8%A8%D9%88%D8%A7%D8%A8%D8%A9%3A%D8%B9%D9%84%D9%85_%D8%A7%D9%84%D9%88%D8%B1%D8%A7%D8%AB%D8%A9%2F%D8%A5%D8%B6%D8%A7%D8%A1%D8%A9_%D8%B9%D9%84%D9%89?redirect=false'

Maybe something related to the integrated mode, or a resource limit of some kind?

I should note that most of the errors appearing there originate from ChangePropagation, so they seem like genuine ones (given that Parsoid/JS doesn't emit an error for 99% of these).

There are no new errors reported in logstash, so this is an increase in one of the filtered errors. I unfiltered 'WMFTimeoutException' and see a spike in that count, and possibly a spike in T236912 as well (not yet verified). There is a lot of logstash noise from the langvariant crashers; once we deploy that fix and filter out all those errors, it will be easier to look at this.

No, this is not the T236912 spike (although there was a spike in that as well in the last 24 hours). The shape of the spike seen in RESTBase is very highly correlated with the shape of the spike in WMFTimeoutException events. So, yes, these are all parse timeouts in Parsoid/PHP.

Note that Parsoid/JS has a 2-min limit whereas Parsoid/PHP has a 60-sec limit.

There is T236289: Pathological test case for Parsoid wt->html, which may be one source of the additional parse timeouts in Parsoid/PHP.

brennen claimed this task.
brennen raised the priority of this task from High to Unbreak Now!.

Adding this as a blocker for 1.35.0-wmf.8, as there's work underway to mitigate Icinga noise before the train rolls forward.

brennen lowered the priority of this task from Unbreak Now! to High. (Dec 4 2019, 9:51 PM)

For comparison with Parsoid/PHP, look at the Parsoid/JS fatals and timeouts in the last 24 hours: https://logstash.wikimedia.org/app/kibana#/dashboard/parsoid?_g=h@1251ff0&_a=h@616633c. So this isn't Parsoid/PHP-specific by any means. After someone else independently validates this, I am inclined to close this as Declined.

Scratch this. This was a red herring caused by the train rollout which severely impacted parsoid/js. Anyway, will leave this to @Arlolra to look at on scandium (to directly hit Parsoid there) as discussed on Wednesday.

I wonder if this was potentially triggered by edits to highly popular templates.

Parsoid/JS does better here:

λ (master) time node bin/parse.js --domain fi.wikipedia.org --oldid 18604768 --pageName "Wikiprojekti:Wiki_Loves_Monuments/Puuttuvat_paikkatiedot" < /dev/null > /dev/null

real	0m19.082s
user	0m15.103s
sys	0m0.958s
λ (master) time php bin/parse.php --domain fi.wikipedia.org --oldid 18604768 --pageName "Wikiprojekti:Wiki_Loves_Monuments/Puuttuvat_paikkatiedot" < /dev/null > /dev/null

real	4m22.276s
user	0m48.295s
sys	0m5.579s

Both do poorly, but the DOM is large:

https://uk.wikipedia.org/w/index.php?title=%D0%92%D1%96%D0%BA%D1%96%D0%BF%D0%B5%D0%B4%D1%96%D1%8F:%D0%A8%D0%B0%D1%84%D0%B0&oldid=25192337

Again, Parsoid/JS does better:

λ (master) time node bin/parse.js --domain zh.wikipedia.org --oldid 57074657 --pageName "哈利波特角色列表" < /dev/null > /dev/null
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}

real	0m23.371s
user	0m20.422s
sys	0m0.687s
λ (master) time php bin/parse.php --domain zh.wikipedia.org --oldid 57074657 --pageName "哈利波特角色列表" < /dev/null > /dev/null

real	1m51.259s
user	0m19.672s
sys	0m1.619s

Parsoid/PHP crasher:

/rest.php/ca.wikipedia.org/v3/page/html/Marc_Corneli_Cetege_(c%C3%B2nsol_204_aC)/21730350 Exception from line 59 of /src/Utils/PHPUtils.php: JSON encoding failed.

Backtrace:

#0 /src/Utils/DOMDataUtils.php(200): Parsoid\Utils\PHPUtils::jsonEncode(stdClass)
#1 /src/Utils/DOMDataUtils.php(641): Parsoid\Utils\DOMDataUtils::setJSONAttribute(DOMElement, string, stdClass)
#2 /src/Wt2Html/PP/Handlers/CleanUp.php(275): Parsoid\Utils\DOMDataUtils::storeDataAttribs(DOMElement, array)
...
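For context on a "JSON encoding failed" exception like the one above: `json_encode()` returns `false` on failure rather than throwing, so a wrapper such as `PHPUtils::jsonEncode` has to detect that and raise. A minimal sketch of one common trigger (this repro is an assumption, not the actual input that crashed; invalid UTF-8 in the data is just one plausible cause):

```php
<?php
// Sketch: json_encode() fails (returns false) when given a string
// that is not valid UTF-8, e.g. a raw Latin-1 byte sequence.
$invalid = "caf\xE9"; // "café" encoded as Latin-1; invalid as UTF-8

$result = json_encode($invalid);
var_dump($result);                // bool(false)
echo json_last_error_msg(), "\n"; // "Malformed UTF-8 characters, ..."
```

A wrapper that throws on `false`, as the backtrace suggests `PHPUtils::jsonEncode` does, turns this silent failure into the exception seen above.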

This one is more interesting:

λ (master) time node bin/parse.js --domain ca.wikipedia.org --oldid 22114421 --pageName "Gai_Sulpici_Camerí" < /dev/null > /dev/null

real	0m11.812s
user	0m10.654s
sys	0m0.329s
λ (master) time php bin/parse.php --domain ca.wikipedia.org --oldid 22114421 --pageName "Gai_Sulpici_Camerí" < /dev/null > /dev/null

real	0m17.944s
user	0m6.331s
sys	0m0.646s

but, from scandium:

arlolra@scandium:~$ curl -x scandium.eqiad.wmnet:80 http://ca.wikipedia.org/w/rest.php/ca.wikipedia.org/v3/page/html/Gai_Sulpici_Camerí/22114421 -I
HTTP/1.1 500 Internal Server Error
Date: Tue, 17 Dec 2019 22:54:22 GMT
Server: scandium.eqiad.wmnet
X-Powered-By: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
X-Content-Type-Options: nosniff
Cache-control: no-cache
P3P: CP="See https://ca.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
Vary: Accept-Encoding
Backend-Timing: D=3289850 t=1576623262838380
Connection: close
Content-Type: text/html; charset=UTF-8

This was from https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-restbase-2019.12.17/restbase?id=AW8WBmararkxubcmtM5b&_g=h@44136fa

Similarly, as noted above in T239806#5712800:

λ (master) time node bin/parse.js --domain ar.wikipedia.org --oldid 21242337 --pageName "بوابة:علم_الوراثة/مجال_مختار" < /dev/null > /dev/null

real	0m10.381s
user	0m9.034s
sys	0m0.312s
λ (master) time php bin/parse.php --domain ar.wikipedia.org --oldid 21242337 --pageName "بوابة:علم_الوراثة/مجال_مختار" < /dev/null > /dev/null

real	0m30.652s
user	0m7.316s
sys	0m0.662s

arlolra@scandium:~$ curl -x scandium.eqiad.wmnet:80 -I http://ar.wikipedia.org/w/rest.php/ar.wikipedia.org/v3/page/html/%D8%A8%D9%88%D8%A7%D8%A8%D8%A9%3A%D8%B9%D9%84%D9%85_%D8%A7%D9%84%D9%88%D8%B1%D8%A7%D8%AB%D8%A9%2F%D9%85%D8%AC%D8%A7%D9%84_%D9%85%D8%AE%D8%AA%D8%A7%D8%B1/21242337
HTTP/1.1 500 Internal Server Error
Date: Tue, 17 Dec 2019 23:06:29 GMT
Server: scandium.eqiad.wmnet
X-Powered-By: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
X-Content-Type-Options: nosniff
Cache-control: no-cache
P3P: CP="See https://ar.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
Vary: Accept-Encoding
Backend-Timing: D=2791112 t=1576623989040500
Connection: close
Content-Type: text/html; charset=UTF-8

From https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-restbase-2019.12.17/restbase?id=AW8WGEc6KWrIH1QRPhzM&_g=h@44136fa

Looking at the parser-tests dashboard, I see:

PHP Fatal Error: Allowed memory size of 692060160 bytes exhausted (tried to allocate 20480 bytes)

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.12.17/parsoid-tests?id=AW8WKegbKWrIH1QRQl-n&_g=h@44136fa

and

Allowed memory size of 692060160 bytes exhausted (tried to allocate 20480 bytes)

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.12.17/parsoid-tests?id=AW8WKevQh3Uj6x1zoEXe&_g=h@44136fa

From that last attempt at parsing the arwiki page.

Locally, my memory_limit is 8G because of the langconv requirements when running composer test.

On scandium, php -i says it's getting its config from /etc/php/7.2/cli/php.ini, which has memory_limit = 500M:

arlolra@scandium:~$ php -i | grep memory_limit
memory_limit => 500M => 500M
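As a side note, reproducing the production limit locally doesn't require editing php.ini: `php -d memory_limit=500M bin/parse.php ...` overrides it for a single invocation, and the same can be done at runtime. A minimal sketch:

```php
<?php
// Sketch: inspect and override memory_limit at runtime. The CLI
// equivalent is `php -d memory_limit=500M bin/parse.php ...`, which
// avoids touching php.ini just to reproduce scandium's limit.
echo "before: ", ini_get('memory_limit'), "\n";

ini_set('memory_limit', '500M'); // the value scandium's php.ini sets
echo "after:  ", ini_get('memory_limit'), "\n"; // prints "after:  500M"
```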

If I set that limit locally, I now get:

λ (master) time php bin/parse.php --domain ar.wikipedia.org --oldid 21242337 --pageName "بوابة:علم_الوراثة/مجال_مختار" < /dev/null > /dev/null
PHP Fatal error:  Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /src/Wt2Html/Grammar.php on line 3560
Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /src/Wt2Html/Grammar.php on line 3560

real	0m3.040s
user	0m1.543s
sys	0m0.264s

There does, however, seem to be a $wmgMemoryLimitParsoid setting, from T236833#5716665.

That seems to be configured as:

'wmgMemoryLimitParsoid' => [
	'default' => 760 * 1024 * 1024, // 760MB
],

From https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L18675-L18677

Removing the -"tried to allocate" filter from the Parsoid/PHP dashboard seems to match up the fatals with the above-mentioned RESTBase dashboard.

Change 558737 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[operations/mediawiki-config@master] Bump Parsoid/PHP cluster memory_limit

https://gerrit.wikimedia.org/r/558737

Change 558737 merged by jenkins-bot:
[operations/mediawiki-config@master] Bump Parsoid/PHP cluster memory_limit

https://gerrit.wikimedia.org/r/558737

Mentioned in SAL (#wikimedia-operations) [2019-12-18T19:27:59Z] <niharika29@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump Parsoid/PHP cluster memory_limit - T239806, T236833 (duration: 01m 01s)

Bumping to 1 GiB seems to have reduced the number of OOMs, but we still need to compare the rate with that of Parsoid/JS. Node's default heap limit on 64-bit platforms is 1400 MB; note also that we've disabled GC for Parsoid/PHP.
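The disabled-GC point matters for the memory comparison: with PHP's cycle collector off, objects kept alive only by reference cycles (e.g. parent/child links in a large DOM tree) are not reclaimed until the process exits, so peak memory runs higher and a fixed memory_limit is reached sooner. A minimal sketch of the toggle involved:

```php
<?php
// Sketch: gc_disable() turns off PHP's cycle collector. While it is
// off, cyclic garbage accumulates for the lifetime of the request,
// which raises peak memory for cycle-heavy workloads like large DOMs.
gc_disable();
var_dump(gc_enabled()); // bool(false)

gc_enable();
var_dump(gc_enabled()); // bool(true)
```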

Bumping to 1 GiB seems to have reduced the number of OOMs

Not significantly.

Change 564805 merged by jenkins-bot:
[operations/mediawiki-config@master] Bump Parsoid/PHP cluster memory_limit again

https://gerrit.wikimedia.org/r/564805

Mentioned in SAL (#wikimedia-operations) [2020-01-24T00:32:15Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump Parsoid/PHP cluster memory_limit again (T239806, T236833) (duration: 01m 05s)