
wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
normalized_message
Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php
exception.trace
#0 [internal function]: unknown()
#1 /srv/mediawiki/php-1.38.0-wmf.12/includes/OutputHandler.php(163): gzencode()
#2 /srv/mediawiki/php-1.38.0-wmf.12/includes/OutputHandler.php(78): MediaWiki\OutputHandler::handleGzip()
#3 [internal function]: MediaWiki\OutputHandler::handle()
Notes

333 errors like this one in the last hour.

https://logstash.wikimedia.org/goto/92bd66749f1813b1e88d89fc7b80fa42

Details

Request URL
https://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/pagebundle/South_Atlantic_states/1058540524

Event Timeline

dancy renamed this task from "wtp1025: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php" to "wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php". Dec 10 2021, 8:10 PM
dancy updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-12-10T20:38:11Z] <rzl> rzl@wtp1025:~$ sudo restart-php7.2-fpm - T297517 - rolling restart to follow

Mentioned in SAL (#wikimedia-operations) [2021-12-10T20:43:20Z] <rzl> sudo cumin -b2 -s10 -p0 'A:parsoid and not P{wtp1025.eqiad.wmnet}' restart-php7.2-fpm - T297517

This task is a blocker for the wmf.13 train (next week). The rate at which memory usage increases on mediawiki servers began ramping up at the start of the wmf.12 train.

See https://grafana.wikimedia.org/goto/TBsGlYhnk
Click on "Show deployments". See the usual sawtoothish pattern of growth and reset. The rate of growth starts increasing after 2021-12-06 12:04 UTC. It increases even further after 2021-12-09 16:55 UTC.

Mentioned in SAL (#wikimedia-operations) [2021-12-10T21:09:42Z] <rzl> rzl@mw1414:~$ sudo depool - preserving for investigation, T297517

21:10:42 <rzl> !log sudo cumin -b7 -s10 -p0 'A:mw-eqiad and not P{mw1414.eqiad.wmnet}' restart-php7.2-fpm

taavi triaged this task as Unbreak Now! priority. Dec 10 2021, 10:16 PM

Recap from today:

  • The OOMs originally flagged in this ticket were all on wtp1025, but that turned out to be the leading indicator of high memory usage due to a leak affecting all parsoid hosts and appservers (but not API appservers). wtp1025 had the highest memory usage and so started to fail requests first, but others would have followed.
  • A rolling restart of both clusters relieved the immediate pressure, but growth resumed, and we would have had to perform another rolling restart within about 30-40 hours (sometime Sunday). For investigation, canary appserver mw1414 was depooled and not restarted.
  • @dancy rolled back the train to wmf.9, which seemed to do the trick: it looks like there was already a much slower leak, but the urgent issue we saw today was resolved by the rollback. Given the current rate of memory growth, we now expect to get through the weekend with no further problems. A couple of us from serviceops will check in periodically and do another rolling restart if needed, but it's unlikely. Given the results of the train rollback (and the set of hosts affected) we also don't believe any of the security patches from T297322 are implicated.
  • @Joe and @Ladsgroup dug into mw1414 and made some progress but didn't find a conclusive cause yet; they'll take another look during the workday. From looking at graphs, the slower leak has been going on since at least September (when the eqiad graphs begin, because of the DC switchover from codfw) but opcache-driven restarts have kept the problem from being user-impacting. We should also be able to look at code changes in wmf.12 to try to find what changed.

The memory dump of APCu is in my home directory on mw1414. I'm planning to take a look on Monday; the problem is that it's 7GB.
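
For anyone else who wants a rough picture of live APCu usage, a minimal sketch along these lines can help (it must run through the web SAPI, since APCu is normally not shared with the CLI; the colon-separated key layout and the top-20 cut-off are assumptions, and the 7GB dump itself will need offline tooling):

<?php
// Minimal sketch (not what was actually run): summarize live APCu usage
// by key prefix using apcu_cache_info(), which returns a 'cache_list'
// with one entry per key, including its 'mem_size'.
$info = apcu_cache_info();
$byPrefix = [];
foreach ( $info['cache_list'] as $entry ) {
    // Group on the first key segment, e.g. "enwiki:foo:bar" -> "enwiki"
    $prefix = explode( ':', $entry['info'], 2 )[0];
    $byPrefix[$prefix] = ( $byPrefix[$prefix] ?? 0 ) + $entry['mem_size'];
}
arsort( $byPrefix );
print_r( array_slice( $byPrefix, 0, 20, true ) );   // top 20 prefixes by size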

Another note: most of the OOMs on appservers seem to be coming from rest.php (is rest.php being served from the main appservers? Shouldn't it be served from the api cluster?)

Note that for Parsoid itself, errors like this Allowed memory size of 1468006400 bytes exhausted (tried to allocate 20480 bytes) in Grammar.php or Allowed memory size of 1468006400 bytes exhausted (tried to allocate 65536 bytes) in OutputHandler.php for a particular request have been a known issue since the very beginning (Dec 2019, when we first brought Parsoid-PHP online). On some subset of large pages, Parsoid uses too much memory. But those are per-request OOMs, not a memory leak afaik (see https://logstash.wikimedia.org/goto/53533ab05e2c78f191a36680152e9508); I thought I would point this out as an FYI.

Are rest.php requests on the app servers to the Parsoid REST API? I assume the answer is no. But, if yes, why are they hitting the app servers and not RESTBase (and through that to the Parsoid wtp* cluster)? Or, rather if yes, who is issuing those requests to the app servers?

rest.php has an endpoint that serves Parsoid HTML. I'm not sure if it's in active use - if it is, it's probably by the iOS app. I recall that we initially served rest.php from regular app servers, but I thought that had changed since.

By the way, @tstarling has recently been working on optimizing Parsoid. If it's Parsoid related, he may have an idea of what's going on here.

The message probably indicates that mmap() returned NULL when PHP tried to allocate memory. I don't know why that would happen. You would expect oom-killer to be invoked if the system is out of memory. The message probably occurs randomly on a stressed system, rather than on a process that is using a lot of memory.

tstarling lowered the priority of this task from Unbreak Now! to High. Dec 13 2021, 1:04 AM

So the wtp* servers were indeed out of memory, as reported at T296098. There were other consequences, for example:

Dec 10 20:06:01 wtp1025 prometheus-mcrouter-exporter[28439]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Dec 10 20:06:01 wtp1025 prometheus-mcrouter-exporter[28439]: SIGABRT: abort
...
Dec 10 20:06:01 wtp1025 systemd[1]: prometheus-phpfpm-statustext-textfile.service: Failed to fork: Cannot allocate memory
...
Dec 10 20:18:04 wtp1025 systemd[25163]: gpgconf: running /usr/bin/gpg-agent failed (exitcode=2): General error
Dec 10 20:18:04 wtp1025 systemd[25163]: gpgconf: fatal error (exit status 1)

However, the oom-killer was not invoked. This is not an appropriate failure mode. I would suggest increasing vm.min_free_kbytes. On wtp1025 it is 90112 KB, or 0.14% of physical memory. We could quadruple it to 360448 KB, then test what happens when there is memory pressure.
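
For reference, a minimal sketch of how that tuning could be tried on a single host (360448 being the quadrupled value suggested above; the drop-in file name is arbitrary, and a real change would of course go through configuration management):

# Check the current value (reported in kB)
sysctl vm.min_free_kbytes

# Apply the proposed value at runtime for testing
sudo sysctl -w vm.min_free_kbytes=360448

# Persist it across reboots if the test holds up
echo 'vm.min_free_kbytes = 360448' | sudo tee /etc/sysctl.d/60-vm-min-free-kbytes.conf
sudo sysctl --system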

In any case, it's not happening now, so it's not a UBN.

brennen raised the priority of this task from High to Unbreak Now!. Dec 13 2021, 4:42 PM

In any case, it's not happening now, so it's not a UBN.

We're currently on 1.38.0-wmf.9, and this remains a blocker to rolling forward both wmf.12 and this week's wmf.13. Re-raising to UBN!.

This can be similar to T297236: timeouts and memory limits on translatewiki.net. I suggest we deploy the fix and roll with wmf.12 to see if it still happens. Do you have a good way to monitor issues caused by this memory problem?

This can be similar to T297236: timeouts and memory limits on translatewiki.net. I suggest we deploy the fix and roll with wmf.12 to see if it still happens.

Sounds like a plan. Looks like you already prepared the cherry-pick. Feel free to deploy and notify me when you're ready to roll forward.

Do you have a good way to monitor issues caused by this memory problem?

Grafana links:
https://grafana.wikimedia.org/goto/Y1ifZC27k
https://grafana.wikimedia.org/goto/emofZC27k
These are 6-month views that show both good and bad behavior.

The underlying memory leak has not been fixed, but so far it looks good with the changes that are being backported. I will wait a bit more and then check more in depth.

We're currently on 1.38.0-wmf.9, and this remains a blocker to rolling forward both wmf.12 and this week's wmf.13. Re-raising to UBN!.

This bug is a duplicate of T296098 which is marked as resolved. Per my latest comment there, I think there is a memory leak in mysqli which was always present but was exacerbated by T296063 (also marked as resolved). I don't see how a bug can be a duplicate of two resolved bugs and yet be UBN. I don't know what I'm meant to do with it to fix it. I'm going to work on the memory leak, I'll file a bug about it once it's fully isolated, but it has probably been present for years so doesn't seem like it should block the train.

The only thing unique to this report as compared to T296098 and T296063 is the failure mode, i.e. mmap() failure, which as I've said should be fixed by tuning the kernel. Is tuning the kernel the thing that you want unbroken now? Again, it has probably been broken for years.

Is tuning the kernel the thing that you want unbroken now? Again, it has probably been broken for years.

We were using this ticket to track the acute memory issues seen on Friday. (See T297517#7564396 above.) Procedurally, anything with a train as a parent task is UBN. I get the feeling that tracking of this issue (these issues?) has been a bit scattered, and a lot of the folks involved in responding have been working too many hours of late, so comms have been a bit fuzzier than ideal.

It looks like wmf.12 was able to go to all wikis today after backporting of 746909: DeprecationHelper: avoid closures. I'm removing this from the wmf.12 blockers, since that's been deployed. I think this should mean that wmf.13 is unblocked as well, but I'm going to defer to someone with a slightly better handle on this situation than I have to verify.

Since the train was rolled forward from wmf.9 -> wmf.12 today, the slope of memory increase on the appservers has doubled again. So I don't know if those two tasks you mentioned above were resolved prematurely, but the graph I linked above indicates that something in wmf.12 is probably causing it, and this doubling matches what SRE saw on Thu & Fri (which is why the train was rolled back to wmf.9).

So this not-yet-diagnosed wmf.12 issue (or a not-completely-fixed issue from the other two tasks), coupled with the mysqli leak, is probably exacerbated by the kernel issue, since that prevents the servers from recovering from the memory leaks.

It is possible that this time around, the wtp* servers recover (we won't know till we get there), but the appserver memory graph indicates that wtp* memory usage will also escalate.

I filed T297667 for the PHP bug which I'm working on.
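
For anyone following along, a toy sketch of the kind of check that distinguishes a native-side (mysqli/mysqlnd) leak from a PHP-side one: PHP's own allocator stays flat while the process RSS keeps climbing across repeated queries. The connection parameters and the query are placeholders, and reading /proc/self/status is Linux-specific.

<?php
// Illustrative sketch only: a leak inside mysqli/mysqlnd does not show up
// in memory_get_usage(), so watch the process RSS instead.
function rssKb(): int {
    preg_match( '/^VmRSS:\s+(\d+) kB/m', file_get_contents( '/proc/self/status' ), $m );
    return (int)$m[1];
}

$db = mysqli_connect( 'localhost', 'user', 'pass', 'testdb' );   // placeholder credentials
for ( $i = 1; $i <= 100000; $i++ ) {
    $res = mysqli_query( $db, 'SELECT 1' );   // stand-in for the extra reads added in wmf.12
    mysqli_free_result( $res );
    if ( $i % 10000 === 0 ) {
        // PHP-managed memory should stay flat; steadily growing RSS points at a native leak
        printf( "%6d queries: php=%d bytes, rss=%d kB\n", $i, memory_get_usage(), rssKb() );
    }
}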

It indeed looks like wmf.12 has increased db traffic: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=2&orgId=1&from=1639381034224&to=1639455640440

I sorta like this memleak: it forces us to reduce db load. I actually have a couple of patches I'm planning to roll out as part of reducing the load.

We're currently on 1.38.0-wmf.9, and this remains a blocker to rolling forward both wmf.12 and this week's wmf.13. Re-raising to UBN!.

This bug is a duplicate of T296098 which is marked as resolved. Per my latest comment there, I think there is a memory leak in mysqli which was always present but was exacerbated by T296063 (also marked as resolved). I don't see how a bug can be a duplicate of two resolved bugs and yet be UBN. I don't know what I'm meant to do with it to fix it. I'm going to work on the memory leak, I'll file a bug about it once it's fully isolated, but it has probably been present for years so doesn't seem like it should block the train.

The problem is that, with the current code in wmf.12, the memory leak (which btw shows up on appservers and parsoid and not elsewhere) is so severe that we get to global memory pressure in less than 48 hours from a php-fpm restart.

Thus, it causes production instability and the need for constant monitoring, so yes, it should block the train, and yes, it is UBN!, unless you consider the stability of the website secondary.

The only thing unique to this report as compared to T296098 and T296063 is the failure mode, i.e. mmap() failure, which as I've said should be fixed by tuning the kernel. Is tuning the kernel the thing that you want unbroken now? Again, it has probably been broken for years.

I think you might have misunderstood both the symptoms and the scope of the issue. Killing off individual php workers, or even php-fpm via OOM, is not only unacceptable as an experience for users but also buggy and unpredictable, and it won't prevent the performance degradation we would hit well before the kernel OOM killer even comes into play.

The issue appeared with wmf.12 which is fully deployed now and it does not seem we will roll it back. May I deploy wmf.13 assuming it is not going to make the situation any worse than it currently is?

The issue appeared with wmf.12 which is fully deployed now and it does not seem we will roll it back. May I deploy wmf.13 assuming it is not going to make the situation any worse than it currently is?

This is not my call, but just an opinion based on monitoring the tasks. On T297669, there has been talk of reverting wmf.12. Maybe wait to see how @Ladsgroup 's fixes impact memory pressure. It is looking promising in the 10 mins since his fixes rolled out.

So I have been working on this on several fronts (with Daniel and Tim). The next patch that goes in should bring it back to wmf.9 levels; then we need to improve further, including the loadslot info, the mysqli memleak, etc. I'm backporting everything to wmf.13, so on test wikis it should be fine.

I need to go to a meeting, but after that I'll run a rolling restart.

Documenting my understanding of this problem after reading this task (along with T297669 and T297667).

  1. There is a known memory leak in mysqli
  2. The cause of the memory leak is being investigated in T297667. So far there are some obvious flaws around allocations, but what exactly is being leaked and where is not known at this time.
  3. There is an observed increase in memory usage, first seen with wmf.9 (tracked in T296063); the outcome was to revert a risky patch (T293950#7516808). This patch was not reverted from master, only from wmf.9
  4. wmf.12 contained the same risky patch and @Ladsgroup found and fixed the issue in T297669#7569560 (evidently) <3<3<3

Opinions:

  1. It's unreasonable to make T297667: mysqli/mysqlnd memory leak a train blocker since it has nothing to do with a deployment
  2. There is a known cause in code that is exacerbating the mysqli/mysqlnd memory leak, that is the blocker
  3. That cause has, we think, been fixed

Opinions welcome:

  • I think the work that remains is to verify the fix. Is that correct?
  • If the fix did not mitigate the memory leak, is it possible to revert the patch in wmf.12?
  • I would prefer we either (a) abandon wmf.12 and roll forward wmf.13 or (b) hold wmf.13 while we verify fixes for wmf.12. Having three versions in production will cause more confusion. Any strong opinions against that or for either (a) or (b)?

I don't have strong opinions but I think wmf.12 issues are "mitigated" (but not resolved) and wmf.13 is already blocked on other things so let's keep the status quo for now?

FWIW, I wholeheartedly agree with @thcipriani's opinions above.

As for the remaining work: we need to run .12 on group 2 for a day or so to tell whether we mitigated the memleak, which is *anyways there* and has been since forever. If the slope of the memory increase is similar enough to the pre-wmf.12 profile, we can say it's mitigated.

If T297667 is somehow solved, that would be great as well, but I think in a way this bug has helped us focus on the fact that increasing the read query rate significantly in production should always be considered a train blocker.

  • I would prefer we either (a) abandon wmf.12 and roll forward wmf.13 or (b) hold wmf.13 while we verify fixes for wmf.12. Having three versions in production will cause more confusion. Any strong opinions against that or for either (a) or (b)?

@ihurbain flagged this for me. What does it mean to abandon wmf.12? Does that mean that if wmf.13 had to be rolled back, it would roll back all the way to wmf.9, not wmf.12? If so, that is potentially problematic. With wmf.12, we rolled out forward-compatibility code in Parsoid that handles content from wmf.13 (HTML content version 2.4.0) in the event of a rollback. But wmf.9 won't have that capability. While the risk is mild (see the reasoning and analysis on the wmf.12 blocker task) and we have options (purge RESTBase cache selectively, though we'd need to verify that a selective purge is possible, or simply purge the cache on pages with dirty diffs), we would ideally avoid this scenario altogether!

So, as far as we are concerned, (a) is not the best option.

[...] does that mean that if wmf.13 had to be rolled back, it would roll back all the way to wmf.9, not wmf.12? If so, that is potentially problematic. [...]

I have to agree with the statement in general. If wmf.12 is skipped on such short notice, this definitely needs to be communicated; otherwise there may be problems. In CentralAuth, for example, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/720438 would have to be reverted in wmf.13.

(This is based on my understanding that that decision would mean we roll back to wmf.9 first in order to then do a regular train to wmf.13.)

We will promote testwikis (group 0) to wmf.13 in a few minutes.

Tomorrow evening we will have had wmf.12 running on group 2 for ~48 hours, which would let us confirm that the memleak is worked around. If that is the case, we can promote group 1 wikis to wmf.13; else we hold and see what we can do. In any case, I don't think wmf.13 is going to make anything worse on that front.

After some deployment issue, 1.38.0-wmf.13 has reached group 0 wikis.

Krinkle assigned this task to Ladsgroup.
Krinkle subscribed.

The immediate issue appears resolved, as evidenced by 1.38.0-wmf.13 and 1.38.0-wmf.16 both having passed without further revert.

If I understand correctly, the OOM was caused by a pre-existing issue with php-mysqli (T297667 or T298573) being newly triggered in a big way by a recent change in MW. That recent change was also noticed through a major db load increase, and was reverted/fixed up in tickets T297669 and T296063.