
Parsoid API not able to deliver HTML (HTTP 504) for certain (big) articles
Open, Medium, Public, Bug Report

Description

Here is an example:

$ curl --connect-timeout 120 -I "https://ru.wikipedia.org/api/rest_v1/page/mobile-sections/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D1%83%D0%B3%D1%80%D0%BE%D0%B6%D0%B0%D0%B5%D0%BC%D1%8B%D1%85_%D0%B2%D0%B8%D0%B4%D0%BE%D0%B2_%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D1%85_%D1%80%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D0%B9"
HTTP/2 504 
content-length: 24
content-type: text/plain
date: Sat, 14 Aug 2021 14:49:24 GMT
server: envoy
age: 67
x-cache: cp3054 miss, cp3054 miss
x-cache-status: miss
server-timing: cache;desc="miss", host;desc="cp3054"
strict-transport-security: max-age=106384710; includeSubDomains; preload
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
permissions-policy: interest-cohort=()
set-cookie: WMF-Last-Access=14-Aug-2021;Path=/;HttpOnly;secure;Expires=Wed, 15 Sep 2021 12:00:00 GMT
set-cookie: WMF-Last-Access-Global=14-Aug-2021;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 15 Sep 2021 12:00:00 GMT
x-client-ip: 2a02:168:6008:0:1592:e08c:305a:2fd8
set-cookie: GeoIP=CH:ZH:Zurich:47.37:8.57:v4; Path=/; secure; Domain=.wikipedia.org

But there are many such pages across different Wikipedias.

First reported at https://github.com/openzim/mwoffliner/issues/1523

Event Timeline

Kelson renamed this task from Parsoid API not able to deliver for certain (big) article to Parsoid API not able to deliver HTML (HTTP 504) for certain (big) articles. Aug 14 2021, 2:53 PM
Kelson moved this task from TRIAGE to TOP on the affects-Kiwix-and-openZIM board.
Kelson added a subscriber: Arlolra.

@Arlolra The HTTP error code is not the same, but otherwise this looks to me like a problem similar to https://phabricator.wikimedia.org/T280381

It was noted in T280381#7185326 that this ruwiki page times out. That particular page might benefit from T251624 or T214662

The frwiki page can probably be filed as a dupe of T206040

Parsoid/JS used to have a timeout of 110 s, but Parsoid/PHP is down to 1 min:
https://github.com/wikimedia/mediawiki-services-parsoid-deploy/blob/master/scap/templates/config.yaml.j2#L63-L67
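
For anyone wanting to compare a failing request's wall-clock time against these limits, here is a minimal sketch using curl's `-w` timing output. The helper names `timed_status` and `hit_limit` and the `MAX_TIME` variable are made up for illustration, and the 60 s default only mirrors the 1 min figure quoted above; it is not read from the Parsoid config.

```shell
#!/bin/sh
# Hypothetical helper: print "<http_status> <seconds>" for a HEAD request.
# MAX_TIME is an assumed client-side cap, not a server setting.
timed_status() {
    curl -sI -o /dev/null -w '%{http_code} %{time_total}\n' \
        --max-time "${MAX_TIME:-180}" "$1"
}

# Hypothetical helper: succeed if the elapsed time (seconds, possibly
# fractional) reached the limit. The limit defaults to 60, an assumption
# based on the 1 min figure mentioned in this thread.
hit_limit() {
    awk -v t="$1" -v limit="${2:-60}" 'BEGIN { exit !(t + 0 >= limit + 0) }'
}
```

Usage would be something like `timed_status "<page URL>"`, then passing the second field to `hit_limit` to see whether the response took at least as long as the assumed timeout.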

@Arlolra The HTTP error code is not the same, but otherwise this looks to me like a problem similar to https://phabricator.wikimedia.org/T280381

T280381 was about removing arbitrary complexity limits that Parsoid had in place that differed from the legacy parser.

These pages are failing for performance reasons. In order for Parsoid to be the main parser in use, we're going to have to address them. It's helpful to collect examples here for when we get to that work, which I believe may get started in the next few quarters.

Arlolra triaged this task as Medium priority. Aug 18 2021, 4:43 PM
Arlolra moved this task from Needs Triage to Performance on the Parsoid board.

@Arlolra Sounds good. The overall situation with wiki timeouts/errors has improved over the past months, but we still have trouble fully scraping many wikis, mostly the big ones. I will keep attaching our MWoffliner tickets to Phabricator tickets so the info keeps percolating up to you :)

Another case:

$ time curl -sI "https://zh.wikisource.org/api/rest_v1/page/html/%E6%98%8E%E6%9C%AC%E6%8E%92%E5%AD%97%E4%B9%9D%E7%B6%93%E7%9B%B4%E9%9F%B3_(%E5%9B%9B%E5%BA%AB%E5%85%A8%E6%9B%B8%E6%9C%AC)%2F%E5%8D%B7%E4%B8%8B" | grep 504
HTTP/2 504 

real	2m33.215s
user	0m0.035s
sys	0m0.006s

Another case:

$ time curl -sI "https://de.wikisource.org/api/rest_v1/page/html/Schwere%2C_Elektricit%C3%A4t_und_Magnetismus%2FErster_Theil" | grep 504
HTTP/2 504 

real	1m5.362s
user	0m0.036s
sys	0m0.012s

This one looks like T275505

Another case:

$ time curl -sI "https://id.wikipedia.org/api/rest_v1/page/mobile-sections/Daftar_tokoh_Wales" | grep 504
HTTP/2 504 

real	2m29.723s
user	0m0.019s
sys	0m0.018s

Another case with WPFR:

$ time curl -sI "https://fr.wikipedia.org/api/rest_v1/page/mobile-sections/Liste_des_cantons_fran%C3%A7ais_depuis_2015" | grep 504
HTTP/2 504 

real	1m9.307s
user	0m0.020s
sys	0m0.005s

I have rechecked the 4 URLs posted in my latest comments 6 months ago. I can confirm that all four of them still fail with an HTTP 504 error.
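
A recheck like this could be scripted so it is easy to repeat. The following is only a sketch: `http_status` and `recheck` are hypothetical names, `MAX_TIME` is an assumed client-side cap, and the URL list is read from stdin, one URL per line.

```shell
#!/bin/sh
# Hypothetical helper: print only the HTTP status code of a HEAD request.
# curl prints 000 when no response arrives before --max-time expires.
http_status() {
    curl -sI -o /dev/null -w '%{http_code}' --max-time "${MAX_TIME:-180}" "$1"
}

# Read URLs from stdin, print "<status> <url>" for each one,
# and return 1 if at least one of them answered with HTTP 504.
recheck() {
    failed=0
    while IFS= read -r url; do
        # Skip blank lines in the input list.
        [ -n "$url" ] || continue
        status=$(http_status "$url") || status=${status:-000}
        printf '%s %s\n' "$status" "$url"
        [ "$status" = 504 ] && failed=1
    done
    return "$failed"
}
```

With the four URLs above saved in a file (say `urls.txt`, a made-up name), `recheck < urls.txt` would list the status for each and exit non-zero while any of them still returns a 504.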