Page MenuHomePhabricator

Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state
Closed, ResolvedPublic

Description

Page loads are inconsistently loaded or timing out.

apache-status output:

GGGGGGKGGGGGGGGGGGGGGGGGGGGGKGGGGGGGGGGGGGGGGGGGGGGGGKGGKGGGGGGG
KGGGGGGGGGGGKGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG_GGGGGGGGGGGGGGGG
KKGGGGGGGGGGGGGKGGGGGGGGGGGGGGG_GG_GGGGGGGGGGGKGGGGKGGGGGGGKGGGG
GGGGGKGGKKGKGGGGGGGGGGGGGG_GKGKKKGKGGGKKKGKKWGGG_GG__G__GGG__GGK
G__G_GGGGG_KG___...G.G...KGGGGGGG...GG.....G.GG._GG.G..G.G.G.G..
.....G......G.G.G..........G....K..G..........G.................
................................................................
..

  Scoreboard Key:
  "_" Waiting for Connection, "S" Starting up, "R" Reading Request,
  "W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
  "C" Closing connection, "L" Logging, "G" Gracefully finishing,
  "I" Idle cleanup of worker, "." Open slot with no current process

Outstanding issues:

  • Each httpd process stuck in G state shows the following stacktrace: https://phabricator.wikimedia.org/T182832#3940279. This seems to be due to how Phabricator handles very long json strings. Upgrade Phabricator to the next upstream release seems to be a possible solution due to some recent activity related to json string handling.

Related Objects

StatusSubtypeAssignedTask
Resolvedmmodell
DuplicateNone
Resolvedjcrespo
Resolvedmmodell
ResolvedPaladox
ResolvedDzahn
Resolvedmmodell
ResolvedRobH
ResolvedMoritzMuehlenhoff
ResolvedDzahn
InvalidNone
DeclinedDzahn
Resolvedmmodell
DeclinedNone
Resolvedmmodell
Resolvedmmodell
Declinedmmodell
ResolvedRequestCmjohnson
DeclinedDzahn
ResolvedMoritzMuehlenhoff

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@elukey what about backporting php7.1 from buster into wikimedia-stretch component/php71?

I am running buster's php7.1 packages on phab-stretch.wmflabs.org and it works.

@Paladox: how many packages from buster were required to get it working?

Change 410245 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Phabricator: Support php 7.1 under stretch

https://gerrit.wikimedia.org/r/410245

Looks like we still have workers in the "G" state:

WWG__._G_GGWGWK___.GGW___W.__G.._G._GGGG.GGG_..GG__..G_G____K_._
G__K______G._W.G_W_G_.._..G._._.____......__.G._._..K._.........
......................._........................................
................................................................
................................................................
................................................................
................................................................

@mmodell full list

1root@phabricator-stretch1:/home/paladox# apt list --installed | grep php
2
3WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
4
5libapache2-mod-php7.1/now 7.1.13-1 amd64 [installed,local]
6php-apcu/now 5.1.9+4.0.11-1+b1 amd64 [installed,local]
7php-cli/stable,now 1:7.0+49 all [installed]
8php-common/stable,now 1:49 all [installed,automatic]
9php-curl/stable,now 1:7.0+49 all [installed]
10php-dev/stable,now 1:7.0+49 all [installed]
11php-gd/stable,now 1:7.0+49 all [installed]
12php-json/stable,now 1:7.0+49 all [installed]
13php-ldap/stable,now 1:7.0+49 all [installed]
14php-mailparse/now 3.0.2+2.1.6-12-gae1ef14-1+b1 amd64 [installed,local]
15php-mysql/stable,now 1:7.0+49 all [installed]
16php7.0-cli/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
17php7.0-common/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
18php7.0-curl/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
19php7.0-dev/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
20php7.0-gd/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
21php7.0-json/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
22php7.0-ldap/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
23php7.0-mysql/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
24php7.0-opcache/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
25php7.0-phpdbg/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
26php7.0-readline/stable,now 7.0.27-0+deb9u1 amd64 [installed,automatic]
27php7.1/now 7.1.13-1 all [installed,local]
28php7.1-cli/now 7.1.13-1 amd64 [installed,local]
29php7.1-common/now 7.1.13-1 amd64 [installed,local]
30php7.1-curl/now 7.1.13-1 amd64 [installed,local]
31php7.1-dev/now 7.1.13-1 amd64 [installed,local]
32php7.1-gd/now 7.1.13-1 amd64 [installed,local]
33php7.1-json/now 7.1.13-1 amd64 [installed,local]
34php7.1-mbstring/now 7.1.13-1 amd64 [installed,local]
35php7.1-mysql/now 7.1.13-1 amd64 [installed,local]
36php7.1-opcache/now 7.1.13-1 amd64 [installed,local]
37php7.1-readline/now 7.1.13-1 amd64 [installed,local]

Got a backtrace of one process in G state, seems the same issue:

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007fc2f16735fb in _L_lock_11305 () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fc2f1671758 in __GI___libc_realloc (oldmem=0x7fc2f199a620 <main_arena>, bytes=bytes@entry=64) at malloc.c:3025
#3  0x00007fc2e44578e0 in printbuf_extend (min_size=<optimized out>, p=0x5628ef5467a0) at printbuf.c:72
#4  printbuf_memappend (p=0x5628ef5467a0,
    buf=buf@entry=0x5628f0a6c6bf "/srv/phab/libext/translations/src\"], \"maniphest.custom-field-definitions\": {\"bug.version\": {\"description\": \"Affected software version\", \"fulltext\": true, \"name\": \"Version\", \"search\": true, \"type\": \"te"..., size=size@entry=33) at printbuf.c:82
#5  0x00007fc2e44561a6 in json_tokener_parse_ex (tok=0x5628ef548780,
    str=0x5628f0a6c6e0 "\"], \"maniphest.custom-field-definitions\": {\"bug.version\": {\"description\": \"Affected software version\", \"fulltext\": true, \"name\": \"Version\", \"search\": true, \"type\": \"text\"}, \"deadline.due\": {\"copy\": tr"..., len=8739) at json_tokener.c:434
#6  0x00007fc2e4661e21 in php_json_decode_ex () from /usr/lib/php5/20131226/json.so
#7  0x00007fc2e4661fa7 in ?? () from /usr/lib/php5/20131226/json.so
#8  0x00007fc2edf31dca in dtrace_execute_internal (execute_data_ptr=<optimized out>, fci=<optimized out>, return_value_used=<optimized out>)
    at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_dtrace.c:97
#9  0x00007fc2edff27f0 in zend_do_fcall_common_helper_SPEC (execute_data=0x7fc2f238fae0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:560
#10 0x00007fc2edf80b40 in execute_ex (execute_data=0x7fc2f238fae0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:363
#11 0x00007fc2edf31c68 in dtrace_execute_ex (execute_data=0x7fc2f238fae0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_dtrace.c:73
#12 0x00007fc2edff2d33 in zend_do_fcall_common_helper_SPEC (execute_data=0x7fc2f238f988) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:592
#13 0x00007fc2edf80b40 in execute_ex (execute_data=0x7fc2f238f988) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:363
#14 0x00007fc2edf31c68 in dtrace_execute_ex (execute_data=0x7fc2f238f988) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_dtrace.c:73
#15 0x00007fc2edff2d33 in zend_do_fcall_common_helper_SPEC (execute_data=0x7fc2f238f868) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:592
#16 0x00007fc2edf80b40 in execute_ex (execute_data=0x7fc2f238f868) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:363
#17 0x00007fc2edf31c68 in dtrace_execute_ex (execute_data=0x7fc2f238f868) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_dtrace.c:73
#18 0x00007fc2edff2d33 in zend_do_fcall_common_helper_SPEC (execute_data=0x7fc2f238f688) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:592
#19 0x00007fc2edf80b40 in execute_ex (execute_data=0x7fc2f238f688) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:363
#20 0x00007fc2edf31c68 in dtrace_execute_ex (execute_data=0x7fc2f238f688) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_dtrace.c:73
#21 0x00007fc2edff2d33 in zend_do_fcall_common_helper_SPEC (execute_data=0x7fc2f238f4e0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:592
#22 0x00007fc2edf80b40 in execute_ex (execute_data=0x7fc2f238f4e0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:363
#23 0x00007fc2edf31c68 in dtrace_execute_ex (execute_data=0x7fc2f238f4e0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_dtrace.c:73
#24 0x00007fc2edff2d33 in zend_do_fcall_common_helper_SPEC (execute_data=0x7fc2f238f3e8) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:592
#25 0x00007fc2edf80b40 in execute_ex (execute_data=0x7fc2f238f3e8) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:363
#26 0x00007fc2edf31c68 in dtrace_execute_ex (execute_data=0x7fc2f238f3e8) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_dtrace.c:73
#27 0x00007fc2edff2d33 in zend_do_fcall_common_helper_SPEC (execute_data=0x7fc2f238f1b8) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:592
#28 0x00007fc2edf80b40 in execute_ex (execute_data=0x7fc2f238f1b8) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:363
#29 0x00007fc2edf31c68 in dtrace_execute_ex (execute_data=0x7fc2f238f1b8) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_dtrace.c:73
#30 0x00007fc2edff2d33 in zend_do_fcall_common_helper_SPEC (execute_data=0x7fc2f238f0a0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:592
#31 0x00007fc2edf80b40 in execute_ex (execute_data=0x7fc2f238f0a0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_vm_execute.h:363
#32 0x00007fc2edf31c68 in dtrace_execute_ex (execute_data=0x7fc2f238f0a0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend_dtrace.c:73
#33 0x00007fc2edf44850 in zend_execute_scripts (type=-241588704, type@entry=8, retval=0x80, retval@entry=0x0, file_count=87, file_count@entry=3)
    at /build/php5-m8fq6G/php5-5.6.33+dfsg/Zend/zend.c:1341
#34 0x00007fc2ededf940 in php_execute_script (primary_file=primary_file@entry=0x7ffed74915d0) at /build/php5-m8fq6G/php5-5.6.33+dfsg/main/main.c:2613
#35 0x00007fc2edff441a in php_handler (r=<optimized out>) at /build/php5-m8fq6G/php5-5.6.33+dfsg/sapi/apache2handler/sapi_apache2.c:667
#36 0x00005628ee1298a0 in ap_run_handler ()
#37 0x00005628ee129de9 in ap_invoke_handler ()
#38 0x00005628ee13fcc2 in ap_process_async_request ()
This comment was removed by Paladox.

Looks like we still have workers in the "G" state:

WWG__._G_GGWGWK___.GGW___W.__G.._G._GGGG.GGG_..GG__..G_G____K_._
G__K______G._W.G_W_G_.._..G._._.____......__.G._._..K._.........
......................._........................................
................................................................
................................................................
................................................................
................................................................

@mmodell worth to open an upstream task about the issue?

Yeah it's especially odd that it's happening on the exact same json string every time. I'll see if upstream has heard of anything like it

Interesting: I think I may have found a relevant (and long standing) bug report for the deadlock: https://bugs.php.net/bug.php?id=31749, it's even got comments from our own @tstarling who points to https://www.mediawiki.org/wiki/Special:Code/MediaWiki/103433#140

@mmodell "This is finally fixed in PHP-7.1 with "safe timeout handling."

I fixed the server status page. it's available again now.

What was the issue? I can see a big hole in https://grafana.wikimedia.org/dashboard/db/phabricator?orgId=1&panelId=7&fullscreen&from=now-30d&to=now :(

The result was that the server-status.conf was in conf-available where it was expected but because the apache module had put it there and there was no puppet code to purge it / it wasn't done manually. The symlink from conf-enabled to conf-available though was removed by puppet and it reloaded Apache, breaking the status page.

I can see this in syslog:

Feb 13 06:28:25 phab1001 diamond[828]: Collector failed!

That is a remnant from when it still used the Apache module but the httpd module doesn't have it on purpose because we want to remove the diamond collector.

I saw that you added the prometheus collector in https://gerrit.wikimedia.org/r/#/c/410740/ which is the real fix , yep :)

If it's really as simple as importing the packages from buster and deploying them, then that seems like a fairly ideal solution to me. We would need to do some testing beyond simply installing them on a vps instance and declaring it works!!, however, I don't think there will be any huge issues with compatibility, phabricator has supported 7.1 for a full year now.

Mentioned in SAL (#wikimedia-operations) [2018-02-17T17:33:38Z] <twentyafterfour> restarting apache on phab1001 to clear deadlocked workers. refs T182832

@elukey, @Dzahn: do you think that setting MaxConnectionsPerChild would be worth a try? I somehow doubt it will help but it seems like it might improve the situation.

@elukey, @Dzahn: do you think that setting MaxConnectionsPerChild would be worth a try? I somehow doubt it will help but it seems like it might improve the situation.

The main issue is that when the deadlock occurs and the httpd process is stuck in G state it will never reach MaxConnectionsPerChild, so no hope (from my point of view) to get some relief in this case. It seem that the (only) permanent solution is to migrate to PHP 7.1..

The main issue is that when the deadlock occurs and the httpd process is stuck in G state it will never reach MaxConnectionsPerChild, so no hope (from my point of view) to get some relief in this case. It seem that the (only) permanent solution is to migrate to PHP 7.1..

I'll look into creating a component/php71 this week.

Thanks @MoritzMuehlenhoff! Please let me know if there is anything I can do to help.

Change 410245 merged by Dzahn:
[operations/puppet@production] Phabricator: Support php 7.2 under stretch

https://gerrit.wikimedia.org/r/410245

Change 410767 abandoned by Elukey:
phabricator: disable opcache.fastshutdown

Reason:
Old code review

https://gerrit.wikimedia.org/r/410767

[phab1001:~] $ sudo apache-status
               Apache Server Status for 127.0.0.1 (via 127.0.0.1)

   Server Version: Apache/2.4.10 (Debian) PHP/5.6.33-0+deb8u1

   Server MPM: prefork

   Server Built: Sep 20 2017 12:05:38

     ----------------------------------------------------------------------

   Current Time: Tuesday, 03-Apr-2018 22:19:21 UTC

   Restart Time: Sunday, 01-Apr-2018 23:15:12 UTC

   Parent Server Config. Generation: 3

   Parent Server MPM Generation: 2

   Server uptime: 1 day 23 hours 4 minutes 8 seconds

   Server load: 11.64 12.49 11.45

   Total accesses: 1143441 - Total Traffic: 194.7 GB

   CPU Usage: u5746.4 s450.4 cu740.81 cs156.01 - 4.19% CPU load

   6.75 requests/sec - 1.2 MB/second - 178.5 kB/request

   96 requests currently being processed, 50 idle workers

 GWWGGGGGG.WWWWWGW.WGGWW._GGGG_.GGGGW_GWGGG.W_.WWGGGGGW_.__K_W.GG
 .GG__GGG_GGW_G..__G.GG__G_.G__.K.WGWG_._KK_.W_..W.GG__WK_.GGG.GG
 _W....._..W._._GW._W_._WK__.W_.._W___W__._K_...W...._._WK...._..
 __.......................G....._..................W.............
 ................................................................
 ................................................................
 ................................................................
 ..

   Scoreboard Key:
   "_" Waiting for Connection, "S" Starting up, "R" Reading Request,
   "W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
   "C" Closing connection, "L" Logging, "G" Gracefully finishing,
   "I" Idle cleanup of worker, "." Open slot with no current process

Mentioned in SAL (#wikimedia-operations) [2018-04-03T22:25:02Z] <mutante> restarting Apache on phab1001 - T182832

This happened again even with the cron that restarts it every sunday. Im not sure if we have time for codfw to get the databases setup by the end of 2018.

with the restart above, it fixed it, but it will only fill up like before.

The plan is to reimage phab1001 to Stretch in T190568, unblocking the PHP 7.2 deployment that should fix both issues and get us to a normal/more-stable state.

So apparently this is resolved... I'm not sure what changed on June 10th but check out this graph:
https://grafana.wikimedia.org/dashboard/db/phabricator?orgId=1&from=now-30d&to=now

apache-idle-workers.png (292×514 px, 36 KB)

Could it be traffic? But also it could happen again.

And now it's back again on July 1st:

Screenshot from 2018-07-18 11-49-22.png (235×758 px, 37 KB)

I guess that may have been the period where the rate limiter was enabled?

I guess that may have been the period where the rate limiter was enabled?

Not quite, the timing is close but not exact.

Unassigning but I'm still tracking this on my personal workboard (User-MModell)

Should we raise the frequency of the restarts?

Change 407958 abandoned by Paladox:
phabricator: Replace mod_php with php-fpm

https://gerrit.wikimedia.org/r/407958

Change 407958 restored by Paladox:
phabricator: Replace mod_php with php-fpm

Reason:
Re opening per _joe_ comment at https://phabricator.wikimedia.org/T208108#4701107

https://gerrit.wikimedia.org/r/407958

Change 407958 abandoned by Paladox:
phabricator: Replace mod_php with php-fpm

Reason:
I've did it in https://gerrit.wikimedia.org/r/#/c/operations/puppet/ /476985/

https://gerrit.wikimedia.org/r/407958

While merging and confirming https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476985/ which adds php-fpm support on stretch (so should not affect jessie prod yet), i noticed that /etc/apache2/mods-available/mpm_prefork.conf was changed from package defaults to custom tuned values we had set in the past. The latter changes were written in ./conf-available' as opposed to 'mods-available'. (as apache::conf class used to do it). There were 2 competing configs. Since the main config loads "mods-available" after "conf-available" the one in mods-available should have been the relevant one. That means our custom tuning config wasn't actually applied for a while. Now they are again.

19:13 < mutante> !log re-enabling puppet on phabricator, applying change that adds php-fpm support on stretch ..which doesnt affect phab1001 (prod) on jessie.. BUT re-adds tuning config from the past for mpm_prefork.conf (more SpareServers etc) that was not actually applied due to a bug

19:15 < mutante> !log MPM prefork tweaks for high load systems are applied again (apparently they were not since a change in the past that resulted in 2 competing configs in mods-enabled and conf-enabled with the latter one being loaded last and containing the package defaults

We think this happened somewhere around:

https://github.com/wikimedia/puppet/commit/5256156a2832d5bb07884c4f847f41e5842a0a0c#diff-544574f97d57acf91c3517080f5e1b49
https://github.com/wikimedia/puppet/commit/9d22da9cfc0aef4ba6ee9b6e619878e735aec2df#diff-544574f97d57acf91c3517080f5e1b49R8

Dzahn changed the task status from Open to Stalled.Apr 17 2019, 10:07 PM

(hopefully) just blocked by T190568 which is blocked on T215335.

Change 510597 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: enable php-fpm

https://gerrit.wikimedia.org/r/510597

mmodell claimed this task.

It looks like this one is resolved[1], either by the php 7.2 upgrade or use of php-fpm sapi.

(ambiguitytape)
ambiguitytape

[1] See flat line on the right side of the graph:

Screenshot from 2019-05-23 11-05-11.png (407×951 px, 43 KB)

https://grafana.wikimedia.org/d/000000587/phabricator?orgId=1&panelId=7&fullscreen&from=now-7d&to=now

Updated image for phab1003:

Screenshot 2019-05-23 at 22.40.57.png (1×2 px, 1 MB)

It looks much much more healthier than phab1001.

Change 510597 merged by Dzahn:
[operations/puppet@production] phabricator: enable php-fpm in Hiera on both hosts

https://gerrit.wikimedia.org/r/510597

now also phab2001 has been switched to php-fpm and worker . it matches phab1003 now. ^