As of midnight UTC (2019-12-04T00:00Z), RESTBase has been observing parsing errors coming from Parsoid/PHP at a steady rate of 10-20 per minute. Given that this was not happening before, it should be investigated ASAP. The errors as perceived by RESTBase can be seen on this dashboard.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | ssastry | T229015 Tracking: Direct live production traffic at Parsoid/PHP |
| Duplicate | | Arlolra | T239806 Parsoid/PHP errors |
Event Timeline
Can't immediately reproduce this on the command line in API mode; e.g., this produces no errors:
$ php bin/parse.php --restURL 'https://ar.wikipedia.org/api/rest_v1/page/html/%D8%A8%D9%88%D8%A7%D8%A8%D8%A9%3A%D8%B9%D9%84%D9%85_%D8%A7%D9%84%D9%88%D8%B1%D8%A7%D8%AB%D8%A9%2F%D8%A5%D8%B6%D8%A7%D8%A1%D8%A9_%D8%B9%D9%84%D9%89?redirect=false'
Maybe something related to the integrated mode, or a resource limit of some kind?
Hitting
https://ar.wikipedia.org/api/rest_v1/page/html/%D8%A8%D9%88%D8%A7%D8%A8%D8%A9%3A%D8%B9%D9%84%D9%85_%D8%A7%D9%84%D9%88%D8%B1%D8%A7%D8%AB%D8%A9%2F%D9%85%D8%AC%D8%A7%D9%84_%D9%85%D8%AE%D8%AA%D8%A7%D8%B1?redirect=false
directly gives a 500 error, though, and it doesn't take too long (so probably not a timeout). So something is definitely wrong somewhere...
I should note that most of the errors appearing there originate from ChangePropagation, so they seem like genuine ones (given that Parsoid/JS doesn't spit out an error for 99% of these).
There are no new errors reported in logstash, so it is an increase in one of the filtered errors. I unfiltered 'WMFTimeoutException' and I see a spike in that count, and possibly a spike in T236912 (not yet verified). There is a lot of logstash noise from the langvariant crashers, so once we deploy that fix and filter out all those errors, it will be easier to look at this.
No, this is not T236912 spike (although there was a spike in this as well in the last 24 hours). The shape of the spike seen in restbase is very highly correlated with the shape of the spike in WMFTimeoutException events. So, yes, this is all parse timeouts in Parsoid/PHP.
Note that Parsoid/JS has a 2-min limit whereas Parsoid/PHP has a 60-sec limit.
There is T236289: Pathological test case for Parsoid wt->html which may potentially be one source of additional parse timeouts in Parsoid/PHP.
Adding this as a blocker for 1.35.0-wmf.8 as there's work underway to mitigate Icinga noise before train rolls forward.
For comparison with Parsoid/PHP, look at the Parsoid/JS fatals and timeouts in the last 24 hours: https://logstash.wikimedia.org/app/kibana#/dashboard/parsoid?_g=h@1251ff0&_a=h@616633c. So this isn't Parsoid/PHP-specific by any means. After someone else independently validates this, I am inclined to close this as declined.
Scratch this. This was a red herring caused by the train rollout which severely impacted parsoid/js. Anyway, will leave this to @Arlolra to look at on scandium (to directly hit Parsoid there) as discussed on Wednesday.
Parsoid/JS does better here,
```
λ (master) time node bin/parse.js --domain fi.wikipedia.org --oldid 18604768 --pageName "Wikiprojekti:Wiki_Loves_Monuments/Puuttuvat_paikkatiedot" < /dev/null > /dev/null

real	0m19.082s
user	0m15.103s
sys	0m0.958s

λ (master) time php bin/parse.php --domain fi.wikipedia.org --oldid 18604768 --pageName "Wikiprojekti:Wiki_Loves_Monuments/Puuttuvat_paikkatiedot" < /dev/null > /dev/null

real	4m22.276s
user	0m48.295s
sys	0m5.579s
```
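As a quick sanity check (a sketch, using the limit values quoted earlier in this thread rather than read from config), the wall-clock times above can be compared against the production parse timeouts:

```python
# Compare the measured wall-clock times against the service timeouts
# quoted earlier in this task (Parsoid/JS: 120 s, Parsoid/PHP: 60 s).
JS_LIMIT_S = 120
PHP_LIMIT_S = 60

js_fiwiki = 19.082             # "real 0m19.082s" for node bin/parse.js
php_fiwiki = 4 * 60 + 22.276   # "real 4m22.276s" for php bin/parse.php

print(f"JS:  {js_fiwiki:8.3f}s  over limit: {js_fiwiki > JS_LIMIT_S}")
print(f"PHP: {php_fiwiki:8.3f}s  over limit: {php_fiwiki > PHP_LIMIT_S}")
```

The PHP parse takes 262.276 s, far beyond the 60 s production limit, so in production this request would presumably surface as a WMFTimeoutException.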
Both do poorly, but the DOM is large,
Again, Parsoid/JS does better,
```
λ (master) time node bin/parse.js --domain zh.wikipedia.org --oldid 57074657 --pageName "哈利波特角色列表" < /dev/null > /dev/null
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}
[warn/api/etimedout][zhwiki/哈利波特角色列表?oldid=57074657] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}

real	0m23.371s
user	0m20.422s
sys	0m0.687s

λ (master) time php bin/parse.php --domain zh.wikipedia.org --oldid 57074657 --pageName "哈利波特角色列表" < /dev/null > /dev/null

real	1m51.259s
user	0m19.672s
sys	0m1.619s
```
Parsoid/PHP crasher,
```
/rest.php/ca.wikipedia.org/v3/page/html/Marc_Corneli_Cetege_(c%C3%B2nsol_204_aC)/21730350

Exception from line 59 of /src/Utils/PHPUtils.php: JSON encoding failed.
Backtrace:
#0 /src/Utils/DOMDataUtils.php(200): Parsoid\Utils\PHPUtils::jsonEncode(stdClass)
#1 /src/Utils/DOMDataUtils.php(641): Parsoid\Utils\DOMDataUtils::setJSONAttribute(DOMElement, string, stdClass)
#2 /src/Wt2Html/PP/Handlers/CleanUp.php(275): Parsoid\Utils\DOMDataUtils::storeDataAttribs(DOMElement, array)
...
```
This one is more interesting,
```
λ (master) time node bin/parse.js --domain ca.wikipedia.org --oldid 22114421 --pageName "Gai_Sulpici_Camerí" < /dev/null > /dev/null

real	0m11.812s
user	0m10.654s
sys	0m0.329s

λ (master) time php bin/parse.php --domain ca.wikipedia.org --oldid 22114421 --pageName "Gai_Sulpici_Camerí" < /dev/null > /dev/null

real	0m17.944s
user	0m6.331s
sys	0m0.646s
```
but, from scandium,
```
arlolra@scandium:~$ curl -x scandium.eqiad.wmnet:80 http://ca.wikipedia.org/w/rest.php/ca.wikipedia.org/v3/page/html/Gai_Sulpici_Camerí/22114421 -I
HTTP/1.1 500 Internal Server Error
Date: Tue, 17 Dec 2019 22:54:22 GMT
Server: scandium.eqiad.wmnet
X-Powered-By: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
X-Content-Type-Options: nosniff
Cache-control: no-cache
P3P: CP="See https://ca.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
Vary: Accept-Encoding
Backend-Timing: D=3289850 t=1576623262838380
Connection: close
Content-Type: text/html; charset=UTF-8
```
This was from https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-restbase-2019.12.17/restbase?id=AW8WBmararkxubcmtM5b&_g=h@44136fa
Similarly, what was noted above in T239806#5712800,
```
λ (master) time node bin/parse.js --domain ar.wikipedia.org --oldid 21242337 --pageName "بوابة:علم_الوراثة/مجال_مختار" < /dev/null > /dev/null

real	0m10.381s
user	0m9.034s
sys	0m0.312s

λ (master) time php bin/parse.php --domain ar.wikipedia.org --oldid 21242337 --pageName "بوابة:علم_الوراثة/مجال_مختار" < /dev/null > /dev/null

real	0m30.652s
user	0m7.316s
sys	0m0.662s

arlolra@scandium:~$ curl -x scandium.eqiad.wmnet:80 -I http://ar.wikipedia.org/w/rest.php/ar.wikipedia.org/v3/page/html/%D8%A8%D9%88%D8%A7%D8%A8%D8%A9%3A%D8%B9%D9%84%D9%85_%D8%A7%D9%84%D9%88%D8%B1%D8%A7%D8%AB%D8%A9%2F%D9%85%D8%AC%D8%A7%D9%84_%D9%85%D8%AE%D8%AA%D8%A7%D8%B1/21242337
HTTP/1.1 500 Internal Server Error
Date: Tue, 17 Dec 2019 23:06:29 GMT
Server: scandium.eqiad.wmnet
X-Powered-By: PHP/7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1
X-Content-Type-Options: nosniff
Cache-control: no-cache
P3P: CP="See https://ar.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
Vary: Accept-Encoding
Backend-Timing: D=2791112 t=1576623989040500
Connection: close
Content-Type: text/html; charset=UTF-8
```
Looking at the parser-tests dashboard, I see,
```
PHP Fatal Error: Allowed memory size of 692060160 bytes exhausted (tried to allocate 20480 bytes)
```

and

```
Allowed memory size of 692060160 bytes exhausted (tried to allocate 20480 bytes)
```
From that last attempt at parsing the arwiki page.
Locally, my memory_limit = 8G because of the langconv requirements when running composer test
On scandium, php -i says it's getting its conf from /etc/php/7.2/cli/php.ini which has memory_limit = 500M
```
arlolra@scandium:~$ php -i | grep memory_limit
memory_limit => 500M => 500M
```
If I set that locally, I now get,
```
λ (master) time php bin/parse.php --domain ar.wikipedia.org --oldid 21242337 --pageName "بوابة:علم_الوراثة/مجال_مختار" < /dev/null > /dev/null
PHP Fatal error:  Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /src/Wt2Html/Grammar.php on line 3560

Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /src/Wt2Html/Grammar.php on line 3560

real	0m3.040s
user	0m1.543s
sys	0m0.264s
```
That seems to be configured with,
```
'wmgMemoryLimitParsoid' => [
	'default' => 760 * 1024 * 1024, // 760MB
],
```
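As a sanity check on the byte counts quoted in this thread (plain arithmetic, nothing Parsoid-specific):

```python
MiB = 1024 * 1024

# Local repro after setting memory_limit = 500M:
assert 500 * MiB == 524288000   # matches the "524288000 bytes exhausted" fatal

# The cluster fatals report 692060160 bytes, i.e. exactly 660 MiB:
assert 692060160 == 660 * MiB

# The wmgMemoryLimitParsoid default shown above, in bytes:
print(760 * MiB)  # 796917760
```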
Removing the -"tried to allocate" filter from the Parsoid/PHP dashboard makes the fatals match up with the RESTBase dashboard mentioned above.
Change 558737 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[operations/mediawiki-config@master] Bump Parsoid/PHP cluster memory_limit
Change 558737 merged by jenkins-bot:
[operations/mediawiki-config@master] Bump Parsoid/PHP cluster memory_limit
Mentioned in SAL (#wikimedia-operations) [2019-12-18T19:27:59Z] <niharika29@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump Parsoid/PHP cluster memory_limit - T239806, T236833 (duration: 01m 01s)
Bumping to 1 GiB seems to have reduced the number of OOMs, but we still need to compare the rate with that of Parsoid/JS. Node's default heap size on 64-bit platforms is 1400 MB, and note that we've disabled GC for Parsoid/PHP.
Change 564805 merged by jenkins-bot:
[operations/mediawiki-config@master] Bump Parsoid/PHP cluster memory_limit again
Mentioned in SAL (#wikimedia-operations) [2020-01-24T00:32:15Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump Parsoid/PHP cluster memory_limit again (T239806, T236833) (duration: 01m 05s)