Page MenuHomePhabricator

Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array"
Closed, DuplicatePublic

Description

Phabricator on phab1001 started to return 503s today from 16:50 UTC (icinga alarming time), this is the related stacktrace:

[Tue Feb 06 16:56:52.502222 2018] [:error] [pid 28700] [client 94.174.160.93:26097] [2018-02-06 16:56:52] EXCEPTION: (Exception) Failed to `proc_open()`: proc_open() expects parameter 2 to be array, unknown given at [<phutil>/src/future/exec/ExecFuture.php:678]
[Tue Feb 06 16:56:52.502884 2018] [:error] [pid 28700] [client 94.174.160.93:26097] arcanist(), phabricator(), phabricator-translations(), phutil(), security(), sprint(), wmf-ext-misc()
[Tue Feb 06 16:56:52.502899 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #0 <#2> ExecFuture::isReady() called at [<phutil>/src/future/Future.php:37]
[Tue Feb 06 16:56:52.502904 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #1 <#2> Future::resolve() called at [<phutil>/src/future/exec/execx.php:37]
[Tue Feb 06 16:56:52.502909 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #2 <#2> exec_manual(string, string) called at [<phutil>/src/filesystem/Filesystem.php:993]
[Tue Feb 06 16:56:52.502913 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #3 <#2> Filesystem::resolveBinary(string) called at [<phutil>/src/filesystem/Filesystem.php:962]
[Tue Feb 06 16:56:52.502918 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #4 <#2> Filesystem::binaryExists(string) called at [<phabricator>/src/applications/config/check/PhabricatorPygmentSetupCheck.php:13]
[Tue Feb 06 16:56:52.502923 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #5 <#2> PhabricatorPygmentSetupCheck::executeChecks() called at [<phabricator>/src/applications/config/check/PhabricatorSetupCheck.php:63]
[Tue Feb 06 16:56:52.502928 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #6 <#2> PhabricatorSetupCheck::runSetupChecks() called at [<phabricator>/src/applications/config/check/PhabricatorSetupCheck.php:258]
[Tue Feb 06 16:56:52.502932 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #7 <#2> PhabricatorSetupCheck::runNormalChecks() called at [<phabricator>/src/applications/config/engine/PhabricatorSetupEngine.php:26]
[Tue Feb 06 16:56:52.502937 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #8 <#2> PhabricatorSetupEngine::execute() called at [<phabricator>/src/applications/config/check/PhabricatorSetupCheck.php:194]
[Tue Feb 06 16:56:52.502941 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #9 <#2> PhabricatorSetupCheck::willProcessRequest() called at [<phabricator>/src/aphront/configuration/AphrontApplicationConfiguration.php:137]
[Tue Feb 06 16:56:52.502946 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #10 <#2> AphrontApplicationConfiguration::runHTTPRequest(AphrontPHPHTTPSink) called at [<phabricator>/webroot/index.php:17]
[Tue Feb 06 16:56:52.502950 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #11 phlog(Exception) called at [<phabricator>/src/aphront/response/AphrontUnhandledExceptionResponse.php:20]
[Tue Feb 06 16:56:52.502955 2018] [:error] [pid 28700] [client 94.174.160.93:26097]   #12 AphrontUnhandledExceptionResponse::setException(Exception) called at [<phabricator>/webroot/index.php:21]

The first occurrence of this error in the phabricator apache error log is at:

[Tue Feb 06 16:14:27.176766 2018] [:error] [pid 18114] [client 220.181.125.124:10109] [2018-02-06 16:14:27] EXCEPTION: (Exception) Failed to `proc_open()`: proc_open() expects parameter 2 to be array, integer given {>} (Exception) Failed to `proc_open()`: proc_open() expects parameter 2 to be array, integer given at [<phutil>/src/future/exec/ExecFuture.php:678]

This issue was "fixed" with an apache2 restart.

Event Timeline

elukey triaged this task as High priority.Feb 6 2018, 5:17 PM
elukey created this task.
elukey renamed this task from Phabricator down due to to Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array".Feb 6 2018, 5:17 PM
elukey updated the task description. (Show Details)

I think this is probably the same root cause as T182832: that one is an unsolved mystery. Something is causing apache workers to get into a state where they are neither recycled to do work nor closed to reclaim their resources. It's a slow leak which eventually leads to resource exhaustion. The symptom is the same in the above task, the only difference is that the in the case of T182832, apache's worker pool size limit is being reached rather than actually exhausting an underlying system resource.

I've been away for 2 weeks due to our team offsite, travel and vacation time. I think the way to address this in the short term is to have a cron job restart apache periodically. This is what @Dzahn and I discussed at all-hands. It doesn't address the underlying problem but currently that problem is unidentified and difficult to pinpoint.

I've been away for 2 weeks due to our team offsite, travel and vacation time. I think the way to address this in the short term is to have a cron job restart apache periodically. This is what @Dzahn and I discussed at all-hands. It doesn't address the underlying problem but currently that problem is unidentified and difficult to pinpoint.

Sure, we could go for the restart script but as outlined in the task we have precise stack traces about what's happening, so I think it is worth to open a issue/bug/something to upstream and ask their opinion in addition to the cron?

I think upstream test with php 7.1 now (even though php 5.2+ is supported) so maybe they didnt catch this?

The php stack trace points to phabricator setup checks which should only happen once and then cache their results for a long time. I don't understand why this would cause an outage.

I guess it must be related to the parent task - it's in ExecFuture which is phabricator's async request dispatcher. So this supports my suspicion that it's some upstream change to the async code which is hitting a race somewhere in C-land. I'm going to dig through the upstream changelogs to see if I can come up with any specific suspects.

@mmodell let's remember that https://wikitech.wikimedia.org/wiki/Incident_documentation/20180206-Phabricator is still to complete, Daniel did the groundwork but it needs to finalized :)