
413 error while trying to fetch using the desktop REST API
Closed, Resolved · Public

Description

While trying to fetch an article from https://hi.wikibooks.org using the desktop REST API, I received a 413 error. Here is the URL I used -> https://hi.wikibooks.org/api/rest_v1/page/html/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AA%E0%A5%81%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A4%95%3A%E0%A4%9A%E0%A5%8C%E0%A4%AA%E0%A4%BE%E0%A4%B2%2F%E0%A4%B8%E0%A4%AE%E0%A4%BE%E0%A4%9A%E0%A4%BE%E0%A4%B0

article name -> विकिपुस्तक:चौपाल/समाचार

Downstream MWoffliner bug https://github.com/openzim/mwoffliner/issues/1424

Event Timeline

(This is Wikimedia's RESTBase API; the MediaWiki REST API would be under https://hi.wikibooks.org/w/rest.php/v1/ )
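(Note for readers: the two endpoint families mentioned here look roughly as follows. This is only a sketch, with {title} standing in for the percent-encoded page title, and the /w/rest.php/v1/page/{title}/html path taken from the core REST API documentation rather than verified on this wiki.)

# RESTBase endpoint (the one used in the original report):
$ curl -sI "https://hi.wikibooks.org/api/rest_v1/page/html/{title}" | head -1

# MediaWiki core REST API equivalent:
$ curl -sI "https://hi.wikibooks.org/w/rest.php/v1/page/{title}/html" | head -1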

@Aklapper Do you mean I am using the wrong URL? If so, why does https://hi.wikibooks.org/api/rest_v1/ mention the use of page/html to get content?

No, I only corrected project tags.

@Aklapper This bug is a serious one for the Kiwix team, as it impacts many (prominent) Wikimedia wikis and brings our whole scraping to a halt because the backend does not deliver. Any chance someone could look into why such URLs simply fail in the backend?

@Arlolra I allow myself to ping you on this as the impact is super high for us and I don't know who else to ping. That said, I'm not sure whether this is an error in Parsoid or in the API service itself.

Looks like these are pages running up against size limits which differ between Parsoid and the legacy parser,

> php bin/parse.php --domain de.wikipedia.org --pageName "Chronik_der_COVID-19-Pandemie_in_den_Vereinigten_Staaten_2020" < /dev/null
... wt2html: wikitextSize limit exceeded: 1766176 ...

> php bin/parse.php --domain hi.wikibooks.org --pageName "विकिपुस्तक:चौपाल/समाचार" < /dev/null
... wt2html: wikitextSize limit exceeded: 1132702 ...

We have a few open issues in this area, T211946 / T254522 / T239841. I'll bring it up with the team on Monday.
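(A quick way to confirm the raw wikitext size that Parsoid is comparing against its limit is to ask the action API for the revision size, which is a close proxy for the wikitextSize count above; a sketch:)

$ curl -sG "https://hi.wikibooks.org/w/api.php" \
    --data-urlencode "action=query" \
    --data-urlencode "prop=revisions" \
    --data-urlencode "rvprop=size" \
    --data-urlencode "format=json" \
    --data-urlencode "titles=विकिपुस्तक:चौपाल/समाचार"
# The "size" field in the response is the revision's byte size.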

@Arlolra Thank you so much for your effort. I'm not sure if this (kind of) problem is old or if it is a regression, because at the same time we have been making MWoffliner stricter. What is sure is that it impacts maybe 40% of all wikis and that we can barely scrape a big Wikipedia fully anymore.

Arlolra triaged this task as Medium priority.
Arlolra moved this task from Needs Triage to Bugs & Crashers on the Parsoid board.

Change 690029 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Enforce wikitext limits like in the legacy parser

https://gerrit.wikimedia.org/r/690029

@Arlolra Any problem with your patch? It seems to have been stuck for two weeks.

@Arlolra Any problem with your patch? It seems to have been stuck for two weeks.

Sorry, last week the Product department at the WMF was having an event. I'll try and get it reviewed now so it can ship next week.

@Arlolra Thx. Sounds great. Just wanted to make sure it doesn't get forgotten.

I'll try and get it reviewed now so it can ship next week.

Sorry, there's no train this week

In an attempt to help you diagnose the problem (or verify that the patch works fine), here is a list of other URLs suffering from the same symptom:

This is helpful. While the patch fixes the original report of hiwikibooks, most of these pages seem to be failing for other reasons.

curl -v -x scandium.eqiad.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/List_of_modern_Hangul_characters_in_ISO%2FIEC_2022–compliant_national_character_set_standards/983235746

{"message":"wt2html: tableCell limit exceeded: 30001","httpCode":413,"httpReason":"Request Entity Too Large"}

curl -v -x scandium.eqiad.wmnet:80 http://fr.wikipedia.org/w/rest.php/fr.wikipedia.org/v3/page/html/Liste_des_membres_du_magistrat_de_Bruxelles/182643422

{"message":"wt2html: tableCell limit exceeded: 30001","httpCode":413,"httpReason":"Request Entity Too Large"}

curl -v -x scandium.eqiad.wmnet:80 http://de.wikipedia.org/w/rest.php/de.wikipedia.org/v3/page/html/Ewige_Tabelle_der_UEFA_Europa_League_und_des_UEFA-Cups/212415959

{"message":"wt2html: transclusion limit exceeded: 10001","httpCode":413,"httpReason":"Request Entity Too Large"}

curl -v -x scandium.eqiad.wmnet:80 http://ru.wikipedia.org/w/rest.php/ru.wikipedia.org/v3/page/html/Список_угрожаемых_видов_цветковых_растений/108390695

{"message":"wt2html: transclusion limit exceeded: 10001","httpCode":413,"httpReason":"Request Entity Too Large"}

Good and bad news ;)

@Arlolra It seems we have two scenarios:

Did I get that right?

Both those scenarios will be alleviated by addressing T211946: Bump parsoid resource limits in production.

No, this isn't T206040.

I'm not 100% clear what the next step is here. Should we just wait for the next Parsoid PROD rollout and check again?

Change 698841 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Stop throwing on resource limits

https://gerrit.wikimedia.org/r/698841

I'm not 100% clear what the next step is here. Should we just wait for the next Parsoid PROD rollout and check again?

Unfortunately, that patch didn't get reviewed on time to make it into this week's train. And next week there is again no train because of Wikimedia's All-Hands event. Maybe we can try to SWAT something in later in the week?

The above two patches should stop Parsoid hard-failing when it reaches limits and instead have it fall back to adding inline error text. That's more in line with how the legacy parser works and will at least produce something. However, there's a larger issue here: the two parsers are inevitably going to have different complexity limits and might render some pages differently. Part of the performance work that's necessary before Parsoid becomes the default parser will help identify such pages, determine appropriate limits to set in the parser, and, I guess, communicate the content changes that necessitates.
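(To see the behavioural difference described above, one can compare the two parse paths for the same page. A sketch, with the expected status codes noted as comments rather than captured output: the legacy parser via action=parse should answer 200 even when it degrades, while the Parsoid/RESTBase path was answering 413 before these patches.)

# Legacy parser (action API): degrades output instead of failing, so expect HTTP 200.
$ curl -sG -o /dev/null -w "%{http_code}\n" "https://hi.wikibooks.org/w/api.php" \
    --data-urlencode "action=parse" \
    --data-urlencode "format=json" \
    --data-urlencode "prop=text" \
    --data-urlencode "page=विकिपुस्तक:चौपाल/समाचार"

# Parsoid via RESTBase: was returning HTTP 413 on this page before the patches above.
$ curl -s -o /dev/null -w "%{http_code}\n" "https://hi.wikibooks.org/api/rest_v1/page/html/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AA%E0%A5%81%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A4%95%3A%E0%A4%9A%E0%A5%8C%E0%A4%AA%E0%A4%BE%E0%A4%B2%2F%E0%A4%B8%E0%A4%AE%E0%A4%BE%E0%A4%9A%E0%A4%BE%E0%A4%B0"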

@Arlolra Thx for the update. I'm slowly realising that we will have to live for quite a long time with the backend sporadically generating HTTP 413 errors... and that I should soon start implementing a workaround for this on the MWoffliner side. Am I right?

Well, the conclusion above was that the legacy parser doesn't return an HTTP error when hitting complexity limits; it degrades the response by leaving some wikitext syntax unparsed.

Parsoid has been taking a different approach and I'm sort of proposing we stop doing that with those patches. Hopefully others agree and we can do away with those 413s.

The difference, though, is that Parsoid and the legacy parser will inevitably hit different limits, so pages that previously parsed fine might no longer do so, and vice versa.

But if you mean quite a long time in that we're slow getting out these patches then, yes, it could be a few weeks.

@Arlolra If this kind of HTTP code could disappear within a month, the timing would be good for me. I was under the impression that many different things would need to be fixed and that it could take many months to fully solve the problem. That said, the prospect of having pages half parsed is not really appealing. I guess this is a complex problem to solve and a compromise we will have to live with.
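(For reference, a minimal client-side fallback along the lines discussed. This is a hypothetical sketch, not what MWoffliner actually does: if the Parsoid endpoint answers 413, fetch the degraded legacy-parser rendering instead so the scrape can continue.)

#!/bin/sh
# Hypothetical fallback: prefer Parsoid HTML, fall back to the legacy parser on 413.
TITLE_ENC="%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AA%E0%A5%81%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A4%95%3A%E0%A4%9A%E0%A5%8C%E0%A4%AA%E0%A4%BE%E0%A4%B2%2F%E0%A4%B8%E0%A4%AE%E0%A4%BE%E0%A4%9A%E0%A4%BE%E0%A4%B0"
STATUS=$(curl -s -o page.html -w "%{http_code}" \
    "https://hi.wikibooks.org/api/rest_v1/page/html/${TITLE_ENC}")
if [ "$STATUS" = "413" ]; then
    # The legacy parser degrades instead of failing; its HTML is in the JSON "text" field.
    curl -sG "https://hi.wikibooks.org/w/api.php" \
        --data-urlencode "action=parse" \
        --data-urlencode "format=json" \
        --data-urlencode "prop=text" \
        --data-urlencode "page=विकिपुस्तक:चौपाल/समाचार" > page.json
fi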

Change 690029 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Enforce wikitext limits like in the legacy parser

https://gerrit.wikimedia.org/r/690029

Surprisingly, this is still failing.

$ curl -s -I "https://hi.wikibooks.org/api/rest_v1/page/html/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AA%E0%A5%81%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A4%95%3A%E0%A4%9A%E0%A5%8C%E0%A4%AA%E0%A4%BE%E0%A4%B2%2F%E0%A4%B8%E0%A4%AE%E0%A4%BE%E0%A4%9A%E0%A4%BE%E0%A4%B0" | grep HTTP
HTTP/2 413 

Surprisingly, this is still failing.

It hasn't been deployed yet

Change 701949 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump parsoid to 0.14.0-a6

https://gerrit.wikimedia.org/r/701949

Change 701949 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.14.0-a6

https://gerrit.wikimedia.org/r/701949

Change 698841 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Stop throwing on arbitrary resource limits

https://gerrit.wikimedia.org/r/698841

With ee1b6d7f554d4548b5fd27a836a24ef19f9f6094 from T280381#7184361 running, I tried reparsing some of the pages from T280381#7135508 on scandium,

$ time curl -x scandium.eqiad.wmnet:80 http://hi.wikibooks.org/w/rest.php/hi.wikibooks.org/v3/page/html/विकिपुस्तक%3Aचौपाल%2Fसमाचार/73946
real	0m16.218s
user	0m0.021s
sys	0m0.108s

$ time curl -x scandium.eqiad.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/List_of_modern_Hangul_characters_in_ISO%2FIEC_2022–compliant_national_character_set_standards/983235746
real	0m18.624s
user	0m0.026s
sys	0m0.074s

$ time curl -x scandium.eqiad.wmnet:80 http://fr.wikipedia.org/w/rest.php/fr.wikipedia.org/v3/page/html/Liste_des_membres_du_magistrat_de_Bruxelles/182643422
real	0m25.152s
user	0m0.015s
sys	0m0.116s

$ time curl -x scandium.eqiad.wmnet:80 http://de.wikipedia.org/w/rest.php/de.wikipedia.org/v3/page/html/Ewige_Tabelle_der_UEFA_Europa_League_und_des_UEFA-Cups/212415959
real	0m44.816s
user	0m0.009s
sys	0m0.156s

This last one timed out (>60s) though,

$ time curl -x scandium.eqiad.wmnet:80 http://ru.wikipedia.org/w/rest.php/ru.wikipedia.org/v3/page/html/Список_угрожаемых_видов_цветковых_растений/108390695
real	1m0.455s
user	0m0.010s
sys	0m0.014s

@Arlolra Many scrapes pass now. This is great, thank you very much. Should I open a new ticket if WPRU (and maybe others) still fails because of this?

Only the first patch was deployed yesterday. The second patch won't go out until two weeks from now (next week is a WMF holiday, so I think the train is cancelled).

Filing new tickets is good but hold off until everything here has been deployed, thanks.

Change 704142 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/vendor@master] Bump parsoid to 0.14.0-a7

https://gerrit.wikimedia.org/r/704142

Filing new tickets is good but hold off until everything here has been deployed, thanks.

Ok, that's done

https://www.mediawiki.org/wiki/Parsoid/Deployments#July_13-15:_V0.14.0-a7_as_part_of_1.37.0-wmf14