On beta cluster: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754
Closed, ResolvedPublic

Description

http://shinken.wmflabs.org/problems?search=deployment shows alarms for the deployment-mediawikiXX instances that no more respond to HTTP requests. That causes beta to be unavailable.

From deployment-mediawiki04 syslog, it seems that is logrotate that restarted hhvm but an invalid configuration prevent the service from starting.

Nov 16 06:32:45 deployment-mediawiki04 systemd[1]: Stopping HHVM PHP/Hack runtime...
Nov 16 06:33:10 deployment-mediawiki04 diamond[455]: Failed to collect metrics
Nov 16 06:34:10 deployment-mediawiki04 diamond[455]: HTTPError: HTTP Error 503: Service Unavailable
Nov 16 06:34:15 deployment-mediawiki04 systemd[1]: hhvm.service stop-sigterm timed out. Killing.
Nov 16 06:34:15 deployment-mediawiki04 systemd[1]: hhvm.service: main process exited, code=killed, status=9/KILL
Nov 16 06:34:15 deployment-mediawiki04 systemd[1]: Ignoring invalid environment 'RUN_AS_GROUP=www-data
Nov 16 06:34:15 deployment-mediawiki04 systemd[1]: ## Add additional arguments to the hhvm service start up that you can't put in CONFIG_FILE for some reason.
Nov 16 06:34:15 deployment-mediawiki04 systemd[1]: # ADDITIONAL_ARGS=
Nov 16 06:34:15 deployment-mediawiki04 systemd[1]: ': /etc/default/hhvm
Nov 16 06:34:15 deployment-mediawiki04 systemd[1]: Unit hhvm.service entered failed state.

Nov 16 06:34:27 deployment-mediawiki04 kernel: [1242254.202289] hhvm[12342]: segfault at 8 ip 0000000000e6dbb1 sp 00007f7646ff9468 error 4 in hhvm[400000+2500000]

Note the segfault.

How to reproduce:

On a deployment-mediawiki instance: furl http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page

See /var/log/hhvm/error.log and you get either one of:

Core dumped: Segmentation fault

Or

Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754
hashar created this task.Nov 16 2016, 9:12 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 16 2016, 9:12 AM

And there is a nice stack overflow:

Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 183
# systemctl status hhvm -l
● hhvm.service - HHVM PHP/Hack runtime
   Loaded: loaded (/lib/systemd/system/hhvm.service; enabled)
  Drop-In: /etc/systemd/system/hhvm.service.d
           └─puppet-override.conf
   Active: active (running) since Wed 2016-11-16 09:15:15 UTC; 4min 39s ago
 Main PID: 7579 (hhvm)
   CGroup: /system.slice/hhvm.service
           ├─7579 /usr/bin/hhvm --mode server --config /etc/hhvm/server.ini --user www-data
           ├─7582 /usr/bin/hhvm --mode server --config /etc/hhvm/server.ini --user www-data
           ├─7583 /usr/bin/hhvm --mode server --config /etc/hhvm/server.ini --user www-data
           ├─7584 /usr/bin/hhvm --mode server --config /etc/hhvm/server.ini --user www-data
           ├─7585 /usr/bin/hhvm --mode server --config /etc/hhvm/server.ini --user www-data
           └─7587 /usr/bin/hhvm --mode server --config /etc/hhvm/server.ini --user www-data

Nov 16 09:15:15 deployment-mediawiki05 systemd[1]: Started HHVM PHP/Hack runtime.
Nov 16 09:18:24 deployment-mediawiki05 hhvm[7579]: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 183
Nov 16 09:18:24 deployment-mediawiki05 hhvm[7579]: [Wed Nov 16 09:18:24 2016] [hphp] [7579:7fdcc7bff700:17:000001] [] \nFatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 183

I have upgraded HHVM on the three deployment-mediawikiXX instances.

hashar changed the title from "beta cluster app servers no more respond to http request / beta web access is down" to "On beta cluster: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754".Nov 16 2016, 9:36 AM

On deployment-mediawiki04 under /var/log/hhvm

HHVM has a coredump

/var/log/hhvm/error.log
Nov 16 09:33:45 deployment-mediawiki04 hhvm[555]: [Wed Nov 16 09:33:45 2016] [hphp] [555:7ff0767ff700:13:000002] [] Core dumped: Segmentation fault
Nov 16 09:33:45 deployment-mediawiki04 hhvm[555]: [Wed Nov 16 09:33:45 2016] [hphp] [555:7ff0767ff700:13:000003] [] Stack trace in /var/log/hhvm/stacktrace.555.log

Error log shows stack overflow:

/var/log/hhvm/error.log
Nov 16 09:35:24 deployment-mediawiki04 hhvm: #012Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754
Nov 16 09:35:24 deployment-mediawiki04 hhvm[4169]: [Wed Nov 16 09:35:24 2016] [hphp] [4169:7f9da2fff700:3:000001] [] \nFatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754

From dmesg -T

[Wed Nov 16 09:34:15 2016] hhvm[4152]: segfault at 44 ip 0000000000c9ecab sp 00007f97193f7a80 error 6 in hhvm[400000+2503000]
[Wed Nov 16 09:37:02 2016] hhvm[4193]: segfault at 6a ip 0000000001e44576 sp 00007f9da2ff85c0 error 4 in hhvm[400000+2503000]

reproducible

On a deployment-mediawiki instance: furl http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page

hashar triaged this task as "Unbreak Now!" priority.Nov 16 2016, 9:44 AM
hashar edited the task description. (Show Details)
Restricted Application added subscribers: Jay8g, Luke081515, TerraCodes. · View Herald TranscriptNov 16 2016, 9:44 AM

Mentioned in SAL (#wikimedia-releng) [2016-11-16T09:51:13Z] <hashar> marking deployment-tin offline so I can live hack mediawiki code / scap for T150833 and T15034

From logstash, the first stack overflow occurred on 2016-11-16T05:16:16

The Jenkins job that update beta ran at 05:13 [ https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/130347/console console ] and shows:

05:13:00 INFO:mwcore:cwd: /srv/mediawiki-staging/php-master
05:13:00 INFO:mwcore:running: git pull
05:13:03 From https://gerrit.wikimedia.org/r/p/mediawiki/core
05:13:03    ea42d90..32f3a99  master     -> origin/master
05:13:03 Updating ea42d90..32f3a99
05:13:03 Fast-forward
05:13:03  includes/ServiceWiring.php                         |   5 +-
05:13:03  includes/db/MWLBFactory.php                        |  19 ++++
05:13:03  includes/libs/objectcache/WANObjectCache.php       |  31 ++++++
05:13:03  includes/libs/objectcache/WinCacheBagOStuff.php    |  13 ---
05:13:03  includes/libs/rdbms/database/DBConnRef.php         |   8 ++
05:13:03  includes/libs/rdbms/database/Database.php          |  73 ++++++++++---
05:13:03  includes/libs/rdbms/database/IDatabase.php         |  20 ++++
05:13:03  includes/libs/rdbms/lbfactory/ILBFactory.php       |  28 +++++
05:13:03  includes/libs/rdbms/lbfactory/LBFactory.php        |  42 ++++++++
05:13:03  includes/libs/rdbms/loadbalancer/ILoadBalancer.php |  28 +++++
05:13:03  includes/libs/rdbms/loadbalancer/LoadBalancer.php  |  44 +++++++-
05:13:03  .../libs/rdbms/loadbalancer/LoadBalancerSingle.php |  13 +++
05:13:03  tests/phpunit/includes/db/LBFactoryTest.php        | 119 ++++++++++++++++-----
05:13:03  .../libs/objectcache/WANObjectCacheTest.php        |  31 ++++++
05:13:03  14 files changed, 419 insertions(+), 55 deletions(-)
05:13:03 INFO:mwcore:Exit code: 0

05:14:57 Triggering a new build of beta-scap-eqiad
  • 32f3a99 Merge "objectcache: Remove broken cas() method from WinCacheBagOStuff"
  • d1b53e3 objectcache: Remove broken cas() method from WinCacheBagOStuff
  • b47ce21 objectcache: detect default getWithSetCallback() set options

Mentioned in SAL (#wikimedia-releng) [2016-11-16T10:26:44Z] <hashar> Reverting mediawiki/core b47ce21cec3a4340dd37c773210a514350f10297 on beta cluster T150833

Change 321857 had a related patch set uploaded (by Hashar):
Revert "objectcache: detect default getWithSetCallback() set options"

https://gerrit.wikimedia.org/r/321857

I have reverted b47ce21cec3a4340dd37c773210a514350f10297 on the beta cluster and that restored the service. Hence reverting the change in mediawiki/core with https://gerrit.wikimedia.org/r/321857

Change 321857 merged by jenkins-bot:
Revert "objectcache: detect default getWithSetCallback() set options"

https://gerrit.wikimedia.org/r/321857

hashar closed this task as "Resolved".Nov 16 2016, 11:04 AM
hashar claimed this task.

Manually triggered the beta cluster code update (console) at 10:57:14

beta-scap completed at 11:02:50.

Beta cluster works again