
wt2html: Out of memory crashers
Open, Medium, Public

Description

/w/rest.php/en.wikipedia.org/v3/page/pagebundle/Template%3ASyrian_Civil_War_detailed_map/854623668
/w/rest.php/fr.wikipedia.org/v3/page/pagebundle/Portail%3AHerp%C3%A9tologie%2FPage_au_hasard%2FListe/163976305
...
See the Parsoid-PHP dashboard @ https://logstash.wikimedia.org/app/kibana#/dashboard/AW4Y6bumP44edBvO7lRc

Event Timeline

Joe subscribed.

We can raise the memory limit for parsoid-php a bit.

We first need to find out if such requests use a reasonable amount of memory or not.

The Syrian civil war template is a known issue on-wiki even with the legacy parser. We've had to remove details from the template at least once before.

Just fyi.

The memory_limit set in php-fpm config is 500M.

When trying to curl from restbase1017 to wtp1025, the error shows a limit of 660M (692060160 bytes = 660 * 1024 * 1024), though.

[restbase1017:~] $ curl -H "Host: en.wikipedia.org" https://parsoid-php.discovery.wmnet/w/rest.php/en.wikipedia.org/v3/page/pagebundle/Template%3ASyrian_Civil_War_detailed_map/854623668 wtp1025.eqiad.wmnet

PHP fatal error: <br/> Allowed memory size of 692060160 bytes exhausted (tried to allocate 20480 bytes)

The limit of 660MB does not show up in php-fpm config but it does appear in MediaWiki's InitialiseSettings.php:

18562 'wmgMemoryLimit' => [
18563     'default' => 660 * 1024 * 1024, // 660MB
18564 ],
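For context on why the fatal error reports 692060160 bytes rather than php-fpm's 500M: 660 * 1024 * 1024 = 692060160, and MediaWiki raises the limit at request time from this setting, so the php-fpm value never appears in the OOM message. A minimal sketch of that pattern (the variable handling here is illustrative, not the actual wmf-config plumbing):

<?php
// Sketch: a wmgMemoryLimit-style setting overriding the php-fpm memory_limit
// at request time. Names are illustrative; the real plumbing lives in wmf-config.
$wmgMemoryLimit = 660 * 1024 * 1024; // 692060160 bytes, matching the fatal error

ini_set( 'memory_limit', (string)$wmgMemoryLimit );

echo ini_get( 'memory_limit' ), "\n"; // 692060160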

I live hacked /srv/mediawiki/wmf-config/InitialiseSettings.php on wtp1025 to raise it a bit and restarted php-fpm and apache2, but the memory limit in the error message did not change.

Ok, the curl command above was wrong; to actually talk to wtp1025 (and not the cluster) while avoiding the cert issue, this works:

curl --header "Host: en.wikipedia.org" --resolve 'parsoid-php.discovery.wmnet:443:10.64.0.239' https://parsoid-php.discovery.wmnet/w/rest.php/en.wikipedia.org/v3/page/pagebundle/Template%3ASyrian_Civil_War_detailed_map/854623668 wtp1025.eqiad.wmnet
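For completeness, the same direct-to-backend test can be scripted; below is a sketch using PHP's curl bindings, where CURLOPT_RESOLVE pins the service hostname to the backend IP used above (assumed to be wtp1025) so the TLS certificate still matches:

<?php
// Sketch: hit one specific backend through the service hostname.
$url = 'https://parsoid-php.discovery.wmnet/w/rest.php/en.wikipedia.org/v3/page/pagebundle/'
    . 'Template%3ASyrian_Civil_War_detailed_map/854623668';

$ch = curl_init( $url );
curl_setopt_array( $ch, [
    // Pin the hostname to the backend under test (IP assumed to be wtp1025).
    CURLOPT_RESOLVE        => [ 'parsoid-php.discovery.wmnet:443:10.64.0.239' ],
    CURLOPT_HTTPHEADER     => [ 'Host: en.wikipedia.org' ],
    CURLOPT_RETURNTRANSFER => true,
] );
curl_exec( $ch );
echo curl_getinfo( $ch, CURLINFO_HTTP_CODE ), "\n";
curl_close( $ch );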

I raised the limit step by step but always got the "exhausted" error even when doubling it.

PHP fatal error: <br/> Allowed memory size of 1321205760 bytes exhausted

Over 1321205760 (= 1260 MB) there seems to be a new limit kicking in.

When I tried to double it again (~2560M) I got a different error: an error page with <title>Internal error - Wikipedia</title> which includes mw.config.set({"wgBackendResponseTime":60032,"wgHostname":"wtp1025"

So this looks like it can't be fixed:

Syrian war template:

memory_limit (MB) | result
------------------|---------------
660               | exhausted
1460              | exhausted
1960              | exhausted
2360              | exhausted
2460              | internal error

The next page, on the other hand, would be fixed at about 980M:

Portail:Herpétologie

memory_limit (MB) | result
------------------|----------
660               | exhausted
970               | exhausted
980               | works
1060              | works
1160              | works
1260              | works
1460              | works
1960              | works
2360              | works
2460              | works

Let us not bump the memory limit quite yet. I am curious to see how many instances of these we run into and we can then test these pages on scandium and determine what might be a reasonable limit.

> Let us not bump the memory limit quite yet. I am curious to see how many instances of these we run into and we can then test these pages on scandium and determine what might be a reasonable limit.

I take that back. Looks like we have ~300 instances of this error in the last 10 hours. So, maybe we need to bump this up. Parsoid/JS probably uses 1GB or more, I imagine. In any case, I'll need to programmatically collect these urls from logstash somehow, and we can then run them on scandium after bumping the memory limit there first.

Okay, I have the full set of about 110 urls that were failing on Parsoid/PHP with OOMs in the last 24 hours, and I'm ready to test them on scandium. @Mutante, can you bump the limit to 850 MB on scandium? I can then test which of those urls pass.

Some of those urls were being repeatedly retried (many tens of times over the last 24 hours, probably because of T236838), which is why there are only ~100 unique urls even though there were over 500 OOMs.
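For the record, the de-duplication step is straightforward once the OOM events are exported from logstash. A sketch, assuming a newline-delimited JSON export with a "url" field (both the filename and the field name are assumptions, not the actual logstash schema):

<?php
// Sketch: collapse repeated OOM log entries down to unique request urls.
$urls = [];
foreach ( file( 'parsoid-ooms.ndjson', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES ) as $line ) {
    $record = json_decode( $line, true );
    if ( isset( $record['url'] ) ) {
        $urls[ $record['url'] ] = true; // array keys deduplicate the retries
    }
}
echo count( $urls ), " unique urls\n";
foreach ( array_keys( $urls ) as $url ) {
    echo $url, "\n";
}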

Mentioned in SAL (#wikimedia-operations) [2019-10-30T22:29:47Z] <mutante> scandium - live hack /srv/mediawiki/wmf-config/InitialiseSettings.php - set wmgMemoryLimit to 850 (*1024 *1024), restart php7.2-fpm (T236833)

> @Mutante, can you bump the limit to 850 MB on scandium?

@ssastry I changed it to 850MB on scandium in MediawikiConfig as logged above.

(Please use the @Dzahn user here on Phabricator. I was already on the ticket and need to get rid of the other account.)

With that change, 40% of the urls don't OOM anymore.

Later next month, we are also going to relax some resource limit constraints which we imported from Parsoid/JS (and which we have wanted to relax for a while now but never got around to). At that time, other larger pages which have so far been failing with HTTP 413 will start being processed. I imagine some fraction of those new requests will also need more memory.

40% of failing requests passing, as I reported above, is a substantial improvement. So, how about we bump memory in 100MB increments, as long as we know we have sufficient memory on the cluster for all the appserver processes (and parsoid/js live traffic server processes) and it lets more requests be processed?
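To make the "sufficient memory on the cluster" check concrete, here is a back-of-the-envelope sketch; every number in it is a placeholder, not the real worker count or host size:

<?php
// Worst case: every php-fpm worker hits the memory limit at the same time.
$workersPerHost   = 40;                      // pm.max_children (placeholder)
$memoryLimitBytes = 760 * 1024 * 1024;       // proposed per-request limit
$hostRamBytes     = 64 * 1024 * 1024 * 1024; // host RAM (placeholder)
$reservedBytes    = 8 * 1024 * 1024 * 1024;  // OS, opcache/APCu, other daemons

$worstCase = $workersPerHost * $memoryLimitBytes;
printf(
    "worst case %.1f GB of %.1f GB available: %s\n",
    $worstCase / 2 ** 30,
    ( $hostRamBytes - $reservedBytes ) / 2 ** 30,
    $worstCase <= ( $hostRamBytes - $reservedBytes ) ? 'fits' : 'does not fit'
);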

@Dzahn: @Joe and I chatted and we decided to bump the memory limit by 100MB for the parsoid cluster. Can you prepare the patches for this? Thanks!

Change 548923 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/mediawiki-config@master] raise wmgMemoryLimit from 660MB to 760MB

https://gerrit.wikimedia.org/r/548923

To keep things easy to reason about, I think it would be preferred to apply this to all app servers (more in comments at https://gerrit.wikimedia.org/r/548923 for why).

@Dzahn @Joe Is that okay as-is, or do we want to do some calculations about what an acceptable limit would be given the current server count and traffic projections, compared to when the limit was set in 2015?

> To keep things easy to reason about, I think it would be preferred to apply this to all app servers (more in comments at https://gerrit.wikimedia.org/r/548923 for why).

Given that Parsoid operates on the DOM, it is likely Parsoid will always need more memory than the core parser for the same page. How should we handle this asymmetry so that when we eventually switch everything to Parsoid, we don't have previously rendered pages stop rendering? Or, should we rely on other resource limits (wikitext size, expand size, etc.) as proxies for memory usage? We don't have to resolve this issue now for this task, but we plan to keep bumping memory limits for Parsoid/PHP up to 900MB if necessary.
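If we go the proxy route, it could look something like the sketch below: reject (or route differently) parses whose input size exceeds a budget, instead of relying only on memory_limit. The function name and threshold are hypothetical, purely to illustrate the idea:

<?php
// Hypothetical pre-flight check: use input size as a proxy for memory use.
function checkWikitextSizeLimit( string $wikitext, int $maxBytes = 2 * 1024 * 1024 ): void {
    if ( strlen( $wikitext ) > $maxBytes ) {
        // Comparable in spirit to the HTTP 413 responses mentioned above.
        throw new RuntimeException( 'Input too large to parse within the memory budget' );
    }
}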

Change 548944 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/mediawiki-config@master] allow different memory limit settings for parsoid-php servers

https://gerrit.wikimedia.org/r/548944

I uploaded 2 changes: one to simply change it on all servers, and a second one that is an attempt to allow different settings for parsoid-php servers vs. the other appservers.
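Conceptually, the second change boils down to something like the sketch below (the real patch is https://gerrit.wikimedia.org/r/548944; the cluster-detection check here is made up for illustration):

<?php
// Sketch of the two-setting approach: a separate limit that only the
// parsoid-php cluster applies.
$wmgMemoryLimit        = 660 * 1024 * 1024; // regular appservers
$wmgMemoryLimitParsoid = 760 * 1024 * 1024; // parsoid-php servers

$isParsoidCluster = ( getenv( 'SERVERGROUP' ) === 'parsoid' ); // hypothetical check
ini_set( 'memory_limit', (string)( $isParsoidCluster ? $wmgMemoryLimitParsoid : $wmgMemoryLimit ) );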

> How should we handle this asymmetry so that when we eventually switch everything to Parsoid, we don't have previously rendered pages stop rendering?

I don't think there is a good way to solve that. As such, I think we should keep the limit identical on both types of servers.

In the short term, these servers are special in that they are the only place where the experimental Parsoid/PHP code runs. But that won't be the case for long. I consider identical memory settings to be a blocker to wider deployment and parser unification. So unless there is a short-term need for more memory that we expect to change, I would recommend we increase it more generally instead.

Summary from various comments in code review:

From https://gerrit.wikimedia.org/r/548923 ("Raise wmgMemoryLimit from 660MB to 760MB")
Krinkle wrote:

I'd recommend we do keep this applied to all app servers if possible. We'll want to double check that this is acceptable across the fleet, … The roles are mainly for optimising use of limited APCu capacity and for traffic/availability. They should all be capable of doing the same, and may all in edge cases be tasked with the same [work].


Giuseppe Lavagetto wrote:

While I agree with Krinkle's sentiment, I think we should be able to vary the memory limit depending on the called endpoint.


Mobrovac wrote:

While I agree with Giuseppe's sentiment, I think this is outside the scope of this PS, as varying per endpoint is slightly more complex and involved to achieve in our current config set-up.

Indeed. We might have a future where more parts of MediaWiki are mutually exclusive by endpoint (e.g. where some code has a dedicated endpoint and where that code isn't invoked directly by other endpoints). But something like "parsing wikitext" is used throughout the code base and triggered from quite a few endpoints. I'm not sure we can (or should) reasonably isolate that. This is also intimately related with Parsoid being unified with the core Parser, but that's a separate topic.

If we do want to give "parsing" separate resource constraints from the rest of core, I suppose we could introduce a memory_limit setting separate from wgMemoryLimit that is only enacted by core (and, for now, by Parsoid) when a "significant amount of wikitext" is being parsed.
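A rough sketch of what that could look like; the setting name, threshold, and placement are all hypothetical:

<?php
// Hypothetical: a parse-specific memory budget, separate from wgMemoryLimit,
// applied only when a "significant amount of wikitext" is being parsed.
$wgParseMemoryLimit = 900 * 1024 * 1024;

function raiseLimitForLargeParse( string $wikitext, int $parseLimitBytes ): void {
    $significant = 100 * 1024; // 100 KB of wikitext; arbitrary cut-off
    // Leave small parses (and an unlimited setting of -1) alone.
    if ( strlen( $wikitext ) >= $significant && ini_get( 'memory_limit' ) !== '-1' ) {
        ini_set( 'memory_limit', (string)$parseLimitBytes );
    }
}

// e.g. before handing a large page to the wikitext engine:
// raiseLimitForLargeParse( $wikitext, $wgParseMemoryLimit );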

From https://gerrit.wikimedia.org/r/548944 ("Bump memory_limit on parsoid servers")
Krinkle wrote:

If this is meant to be a proof of concept or other temporary measure then LGTM, but from what I understand we already tried it on [the parsoid servers].
If we want to roll this out more gradually across the fleet, I recommend iterating on https://gerrit.wikimedia.org/r/548923 …, e.g. a few servers at a time, slowly.


Subramanya Sastry wrote:

For any reasonable memory limit, Parsoid will handle a subset of the pages that will render with Parser.php. With different limits on the two wikitext engines, there is a possibility of reducing that gap.
Once the parsers are unified and run on the same cluster, the Parsoid limits would become the new limits anyway, so that is not a problem for parser unification. And, until that time, while Parsoid and Parser.php run on different clusters, I don't yet understand the objection to having different memory limits for the two.


Krinkle wrote:

OK. Sounds good to me. Syncing the memory_limit (or having Parsoid itself enable it at run-time through a separate config var) is imho a blocker for exposing Parsoid on other servers.

I think it's practical, in the current moment and in general, to be able to vary the memory limits by payload and/or cluster.

Not all clusters have the same setup: the jobrunners, for instance, have a quite different apache configuration compared to the other servers.

So I would suggest we do temporarily raise the memory limit for parsoid now, and consider raising it globally later if needed. We do have a few OOMs from other parts of the stack too.

Change 548923 abandoned by Dzahn:
Raise PHP memory_limit from 660MB to 760MB

Reason:
per comments on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/548944, the consensus seems to be to only change it on the parsoid servers and not on all of them

https://gerrit.wikimedia.org/r/548923

@Joe @Dzahn can that memory bump patch ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/548944 ) be deployed next week? Or, are we waiting on any further clarifications?

jcrespo subscribed.

This is ongoing, so I'm adding the production error tag:

PHP Fatal error:  Allowed memory size of 692060160 bytes exhausted (tried to allocate 1052672 bytes) in /srv/deployment/parsoid/deploy-cache/revs/e7faa19d48c07f7322de767e8848f81470091658/src/src/Wt2Html/TT/TokenHandler.php on line 258

~~Aside from the error itself, this is interesting, because mw parser OOMs seem to create an exception, while wtp seem to output an error.~~

I was wrong: It also creates exceptions: https://logstash.wikimedia.org/goto/72fa55079810779f699aba27d484ef2c

Change 548944 merged by jenkins-bot:
[operations/mediawiki-config@master] allow different memory limit settings for parsoid-php servers

https://gerrit.wikimedia.org/r/548944

Patches are merged and the memory limit is now at 760 MB, as confirmed by the current OOMs.

> This is ongoing, so I'm adding the production error tag.

This will be more or less ongoing forever. At best, we can reduce the # of pages that OOM by adjusting memory limits and other tweaks. But, there will always be pages that will OOM, and until we address the Parsoid and Parser.php resource differences, Parsoid will OOM on a bigger subset of pages.

So, to clarify expectations: if the expectation is that the production error tag will give this higher priority compared to other Parsoid bugs, that is not going to be the case right now because of the reality of the Parsoid vs Parser.php differences. But, if the tag is just an indicator that this is an exception raised on the production cluster, then that is fine.

> Patches are merged and the memory limit is now at 760 MB, as confirmed by the current OOMs.

Not sure how much of a difference this made in practice (judging by logstash charts). We can maybe bump it by another 100MB later this week, once the dust from other deployments has settled down.

> If the expectation is that the production error tag will give this higher priority compared to other Parsoid bugs, that is not going to be the case right now because of the reality of the Parsoid vs Parser.php differences. But, if the tag is just an indicator that this is an exception raised on the production cluster, then that is fine.

Mostly the second: it allows searching for the error once detected, avoids duplicate reports, and makes it easy to filter out unrelated issues. The first is only if the error was not known and the impact uncertain. If it is known, as seems to be the case, but "difficult/impractical to fix", that is more than ok, as long as the task stays open (and maybe low priority?), although I would suggest, as an actionable, clarifying that in the header of this task so people understand the context quickly and you don't get unhelpful reports like mine :-D

This was assigned to me to (make it possible to) raise the memory limit. That has happened now.

Of the 2 patches linked, one has been abandoned because it affected all servers and not just parsoid-php. But it's now easy to change the value again.

Since that has been requested for next week and I will be on vacation, I'm unassigning it from myself.

Dzahn removed Dzahn as the assignee of this task. Dec 5 2019, 7:59 PM

If you want to change the memory_limit for just the Parsoid servers again, that is now:

18606 'wmgMemoryLimitParsoid' => [
18607         'default' => 760 * 1024 * 1024, // 760MB

in InitialiseSettings.php in the mediawiki-config repo. So it is not a puppet change that necessarily needs SRE.
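For illustration only, the next bump would just be another edit to that stanza; the 860MB figure below is an example, not a value that was actually deployed:

'wmgMemoryLimitParsoid' => [
    'default' => 860 * 1024 * 1024, // 860MB (example value)
],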

Change 558737 had a related patch set uploaded (by C. Scott Ananian; owner: Arlolra):
[operations/mediawiki-config@master] Bump Parsoid/PHP cluster memory_limit

https://gerrit.wikimedia.org/r/558737

Change 558737 merged by jenkins-bot:
[operations/mediawiki-config@master] Bump Parsoid/PHP cluster memory_limit

https://gerrit.wikimedia.org/r/558737

Mentioned in SAL (#wikimedia-operations) [2019-12-18T19:27:59Z] <niharika29@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump Parsoid/PHP cluster memory_limit - T239806, T236833 (duration: 01m 01s)

Change 564805 merged by jenkins-bot:
[operations/mediawiki-config@master] Bump Parsoid/PHP cluster memory_limit again

https://gerrit.wikimedia.org/r/564805

Mentioned in SAL (#wikimedia-operations) [2020-01-24T00:32:15Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump Parsoid/PHP cluster memory_limit again (T239806, T236833) (duration: 01m 05s)

@ssastry please let us know if there is anything more to be done in this task; if not, we can resolve it.

Actually, we want to keep some task around to do another sprint on tackling more of our memory usage related issues at some point. Do you prefer I create a new phab task OR do you want to untag serviceops? Doesn't matter to me either way.

I am removing the tag for now, please add it back if you need help!