
wt2html: Out of memory crashers
Open, Normal, Public

Description

/w/rest.php/en.wikipedia.org/v3/page/pagebundle/Template%3ASyrian_Civil_War_detailed_map/854623668
/w/rest.php/fr.wikipedia.org/v3/page/pagebundle/Portail%3AHerp%C3%A9tologie%2FPage_au_hasard%2FListe/163976305
...
See the Parsoid-PHP dashboard @ https://logstash.wikimedia.org/app/kibana#/dashboard/AW4Y6bumP44edBvO7lRc

Details

Related Gerrit Patches:
operations/mediawiki-config (master): Raise PHP memory_limit from 660MB to 760MB
operations/mediawiki-config (master): allow different memory limit settings for parsoid-php servers

Event Timeline

ssastry triaged this task as Normal priority. · Tue, Oct 29, 7:11 PM
ssastry created this task.
Restricted Application added a subscriber: Aklapper. · Tue, Oct 29, 7:11 PM
Joe added a subscriber: Joe.

We can raise the memory limit for parsoid-php a bit.

We first need to find out if such requests use a reasonable amount of memory or not.

ssastry moved this task from Backlog to Performance on the Parsoid-PHP board. · Tue, Oct 29, 8:59 PM
Izno added a subscriber: Izno. · Tue, Oct 29, 9:09 PM

The Syrian civil war template is a known issue on-wiki even with the legacy parser. We've had to remove details from the template at least once before.

Just fyi.

Dzahn added a subscriber: Dzahn. (Edited) · Wed, Oct 30, 12:05 AM

The memory_limit set in php-fpm config is 500M.

When trying to curl from restbase1017 to wtp1025 the error shows a limit of 660M though.

[restbase1017:~] $ curl -H "Host: en.wikipedia.org" https://parsoid-php.discovery.wmnet/w/rest.php/en.wikipedia.org/v3/page/pagebundle/Template%3ASyrian_Civil_War_detailed_map/854623668 wtp1025.eqiad.wmnet

PHP fatal error: <br/> Allowed memory size of 692060160 bytes exhausted (tried to allocate 20480 bytes)

The limit of 660MB does not show up in php-fpm config but it does appear in MediaWiki's InitialiseSettings.php:

'wmgMemoryLimit' => [
    'default' => 660 * 1024 * 1024, // 660MB
],
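As a sanity check, this 660 MB config value corresponds exactly to the byte count in the fatal error above:

```shell
# 660 MB, computed the same way as in InitialiseSettings.php
echo $((660 * 1024 * 1024))
# prints 692060160, the exact figure from the "Allowed memory size" fatal error
```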

I live hacked /srv/mediawiki/wmf-config/InitialiseSettings.php on wtp1025 to raise it a bit and restarted php-fpm and apache2, but the memory limit in the error message did not change.

Ok, that was the wrong curl command to actually talk to wtp1025 (rather than the cluster) while avoiding a cert issue. This works:

curl --header "Host: en.wikipedia.org" --resolve 'parsoid-php.discovery.wmnet:443:10.64.0.239' https://parsoid-php.discovery.wmnet/w/rest.php/en.wikipedia.org/v3/page/pagebundle/Template%3ASyrian_Civil_War_detailed_map/854623668 wtp1025.eqiad.wmnet

I raised the limit step by step but always got the "exhausted" error even when doubling it.

PHP fatal error: <br/> Allowed memory size of 1321205760 bytes exhausted

Over 1321205760 there seems to be a new limit kicking in.
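Converting the byte count from that error back to megabytes shows which configured limit was actually in effect (assuming the usual MB = 1024 * 1024 bytes):

```shell
echo $((1321205760 / 1024 / 1024))
# prints 1260, i.e. a 1260 MB memory_limit was in effect for this error
```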

When I tried to double it again (~2560M) I got a different error: an error page with <title>Internal error - Wikipedia</title> which includes {mw.config.set({"wgBackendResponseTime":60032,"wgHostname":"wtp1025"

So this looks like it can't be fixed:

Syrian war template:

| memory_limit (MB) | error |
| 660 | exhausted |
| 1460 | exhausted |
| 1960 | exhausted |
| 2360 | exhausted |
| 2460 | internal error |

While this one, on the other hand, would be fixed at about 980M:

Portail:Herpétologie

| memory_limit (MB) | error |
| 660 | exhausted |
| 970 | exhausted |
| 980 | works |
| 1060 | works |
| 1160 | works |
| 1260 | works |
| 1460 | works |
| 1960 | works |
| 2360 | works |
| 2460 | works |

Let us not bump the memory limit quite yet. I am curious to see how many instances of these we run into and we can then test these pages on scandium and determine what might be a reasonable limit.


I take that back. Looks like we have ~300 instances of this error in the last 10 hours. So, maybe we need to bump this up. Parsoid/JS probably uses 1GB or more, I imagine. In any case I'll need to programmatically collect these urls from logstash somehow and we can then run them on scandium after bumping memory limit there first.

Okay, I have the full set of about 110 urls (that were failing with Parsoid/PHP with OOMs in the last 24 hours) and am ready to test these urls on scandium. @Mutante, can you bump the limit to 850 MB on scandium? I can then test these urls on scandium to see which of those urls pass.

Some of those urls were being repeatedly retried (many tens of times over the last 24 hours, probably because of T236838), which is why there are only ~100 unique urls even though there were over 500 OOMs.
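Deduplicating the retried requests can be sketched with standard tools; `oom-urls.txt` here is a hypothetical one-URL-per-line export from logstash, not an actual file from this investigation:

```shell
# Most-retried URLs first, with their retry counts
# (oom-urls.txt: hypothetical logstash export, one URL per line)
sort oom-urls.txt | uniq -c | sort -rn | head

# The set of unique failing URLs, and how many there are
sort -u oom-urls.txt > unique-oom-urls.txt
wc -l < unique-oom-urls.txt
```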

Dzahn removed a subscriber: Mutante. · Wed, Oct 30, 10:26 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-30T22:29:47Z] <mutante> scandium - live hack /srv/mediawiki/wmf-config/InitialiseSettings.php - set wmgMemoryLimit to 850 (*1024 *1024), restart php7.2-fpm (T236833)

Dzahn added a subscriber: Mutante. · Wed, Oct 30, 10:34 PM

> @Mutante, can you bump the limit to 850 MB on scandium?

@ssastry I changed it to 850MB on scandium in MediawikiConfig as logged above.

(Please use the @Dzahn user here on Phabricator. I was already on the ticket and need to get rid of the other account.)

Dzahn removed a subscriber: Mutante. · Wed, Oct 30, 10:37 PM

With that change, 40% of the urls don't OOM anymore.

Later next month, we are also going to relax some resource limit constraints which we imported from Parsoid/JS (and which we have wanted to relax for a while now but never got around to). At that time, other larger pages which were so far failing with HTTP 413 will now start being processed. I imagine some fraction of those new requests will also need more memory.

The 40% of previously failing requests now passing, as reported above, is a substantial improvement. So, how about we bump memory in 100 MB increments, as long as we know we have sufficient memory on the cluster for all the appserver processes (and the parsoid/js live traffic server processes) and it lets more requests be processed?
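A back-of-the-envelope capacity check for such a bump might look like this; the worker count is a purely illustrative assumption, not an actual fleet value:

```shell
workers=48      # php-fpm workers per server (assumed, illustrative)
limit_mb=760    # proposed memory_limit in MB
# Worst case if every worker simultaneously hit the limit:
echo "$((workers * limit_mb / 1024)) GB committed per server"
```

As long as that worst-case figure comfortably fits in per-server RAM alongside everything else on the box, the bump should be safe.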

@Dzahn: @Joe and I chatted and we decided to bump the memory limit by 100MB for the parsoid cluster. Can you prepare the patches for this? Thanks!

Change 548923 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/mediawiki-config@master] raise wmgMemoryLimit from 660MB to 760MB

https://gerrit.wikimedia.org/r/548923

To keep things easy to reason about, I think it would be preferred to apply this to all app servers (more in comments at https://gerrit.wikimedia.org/r/548923 for why).

@Dzahn @Joe Is that okay as-is, or do we want to do some calculations about what an acceptable limit would be given current server count and traffic projections - compared to when the limit was set in 2015.

Krinkle assigned this task to Dzahn. · Tue, Nov 5, 11:50 PM

> To keep things easy to reason about, I think it would be preferred to apply this to all app servers (more in comments at https://gerrit.wikimedia.org/r/548923 for why).

Given that Parsoid operates on the DOM, it is likely Parsoid will always need more memory than the core parser for the same page. How should we handle this asymmetry so that when we eventually switch everything to Parsoid, we don't have previously rendered pages stop rendering? Or, should we rely on other resource limits (wikitext size, expand size, etc.) to be proxies for memory usage? We don't have to resolve this issue now for this task, but we plan to keep bumping memory limits for Parsoid/PHP up to 900 MB if necessary.

Change 548944 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/mediawiki-config@master] allow different memory limit settings for parsoid-php servers

https://gerrit.wikimedia.org/r/548944

Dzahn added a comment. · Wed, Nov 6, 1:33 AM

I uploaded two changes: one to simply change it on all servers, and a second one that attempts to allow different settings for parsoid-php servers vs. the other appservers.
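The second change could look roughly like the following sketch; the `SERVERGROUP` check is an assumption about how the server role might be detected, not the actual mechanism used in change 548944:

```php
// Hypothetical sketch only -- the role check and values are assumptions,
// not the actual implementation in https://gerrit.wikimedia.org/r/548944.
$wmgMemoryLimit = ( $_SERVER['SERVERGROUP'] ?? '' ) === 'parsoid'
	? 760 * 1024 * 1024  // 760MB on parsoid-php servers
	: 660 * 1024 * 1024; // 660MB on all other appservers
```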

> How should we handle this asymmetry so that when we eventually switch everything to Parsoid, we don't have previously rendered pages stop rendering?

I don't think there is a good way to solve that. As such, I think we should keep the limit identical on both types of servers.

In the short term, these servers are special in that they are the only place where the experimental Parsoid/PHP code runs. But that won't be the case for long. I consider identical memory settings to be a blocker to wider deployment and parser unification. So unless there is a short-term need for more memory that we expect to change, I would recommend we increase it more generally instead.

Summary from various comments in code review:

From https://gerrit.wikimedia.org/r/548923 ("Bump wgMemoryLimit from 660MB to 760MB")
Krinkle wrote:

I'd recommend we do keep this applied to all app servers if possible. We'll want to double check that this is acceptable across the fleet, … The roles are mainly for optimising use of limited APCu capacity and for traffic/availability. They should all be capable of doing the same, and may all in edge cases be tasked with the same [work].


Giuseppe Lavagetto wrote:

While I agree with Krinkle's sentiment, I think we should be able to vary the memory limit depending on the called endpoint.


Mobrovac wrote:

While I agree with Giuseppe's sentiment, I think this is outside the scope of this PS, as varying per end point is slightly more complex and involved to achieve in our current config set-up.

Indeed. We might have a future where more parts of MediaWiki are mutually exclusive by endpoint (e.g. where some code has a dedicated endpoint and where that code isn't invoked directly by other endpoints). But something like "parsing wikitext" is used throughout the code base and triggered from quite a few endpoints. I'm not sure we can (or should) reasonably isolate that. This is also intimately related with Parsoid being unified with the core Parser, but that's a separate topic.

If we do want to give "parsing" separate resource constraints from the rest of core, I suppose we could introduce a memory_limit setting separate from wgMemoryLimit that is only enacted by core (and, for now, by Parsoid) when a "significant amount of wikitext" is being parsed.

From https://gerrit.wikimedia.org/r/548944 ("Bump memory_limit on parsoid servers")
Krinkle wrote:

If this is meant to be a proof of concept or other temporary measure then LGTM, but from what I understand we already tried it on [the parsoid servers].
If we want to roll this out more gradually across the fleet, I recommend iterating on https://gerrit.wikimedia.org/r/548923 …, e.g. a few servers at a time, slowly across the fleet.


Subramanya Sastry wrote:

For any reasonable memory limit, Parsoid will handle a subset of pages that will render with Parser.php. With different limits on the two wikitext engines, there is a possibility of reducing that gap.
Once the parsers are unified and run on the same cluster, the Parsoid limits would become the new limits anyway, so that is not a problem for parser unification. And, till that time, while Parsoid and Parser.php run on different clusters, I don't yet understand the objection to having different memory limits for the two.


Krinkle wrote:

OK. Sounds good to me. Syncing the memory_limit (or making it enabled at run-time by Parsoid itself through a separate config var) is imho a blocker for exposing Parsoid on other servers.

Joe added a comment. · Fri, Nov 8, 8:07 AM

I think it's practical, both right now and in general, to be able to vary the memory limits by payload and/or cluster.

Not all clusters have the same setup: for instance, the jobrunners have a quite different apache configuration from the other servers.

So I would suggest we do temporarily raise the memory limit for parsoid now, and consider raising it globally later if needed. We do have a few OOMs from other parts of the stack too.

Change 548923 abandoned by Dzahn:
Raise PHP memory_limit from 660MB to 760MB

Reason:
per comments on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/548944, the consensus seems to be to change it only on parsoid servers, not on all of them

https://gerrit.wikimedia.org/r/548923