jobrunner/jobchron services fail in codfw
Closed, Declined · Public

Description

Icinga started to show several "CRITICAL - degraded: The system is operational but one or more units failed" alerts on multiple mw appservers, all of them in codfw.

I looked at the output of systemctl and saw that the failed units causing it were all jobrunner (plus a second service, "jobchron").

They were in "Active: failed" status when checked with systemctl status jobrunner.
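A minimal sketch of how the failed units could be picked out of systemctl's unit listing, assuming the usual whitespace-separated UNIT / LOAD / ACTIVE / SUB / DESCRIPTION columns. The function name and the sample text (including the descriptions) are hypothetical and not taken from the affected hosts:

```python
from typing import List


def failed_units(systemctl_output: str) -> List[str]:
    """Return unit names whose ACTIVE column reads 'failed'."""
    units = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        # Some systemd versions prefix failed units with a status bullet.
        if fields and fields[0] in ("●", "*", "×"):
            fields = fields[1:]
        # Columns: UNIT LOAD ACTIVE SUB DESCRIPTION...
        if len(fields) >= 4 and fields[2] == "failed":
            units.append(fields[0])
    return units


# Hypothetical sample resembling a unit listing on an affected host.
sample = """\
jobchron.service  loaded failed failed  hypothetical description
jobrunner.service loaded failed failed  hypothetical description
ssh.service       loaded active running hypothetical description
"""

print(failed_units(sample))
```

Both jobrunner and jobchron have to be running again before the "degraded" state clears, which is why restarting only jobrunner did not produce an Icinga recovery.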

I saw in syslog that puppet made an upgrade just before this:

(/Stage[main]/Mediawiki::Jobrunner/Package[jobrunner]/ensure) ensure changed ...
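The full form of this syslog line appears in the IRC paste further down. As a rough sketch, the old and new package revisions could be pulled out of such a line with a regular expression. The pattern and the function name below are assumptions made for illustration, matched only against the one line quoted in this task:

```python
import re

# Assumed shape of a puppet-agent "ensure changed" line, based on the
# single example quoted in this task.
ENSURE_CHANGED = re.compile(
    r"\(/Stage\[main\]/(?P<resource>[^)]+)/ensure\) ensure changed "
    r"'(?P<old>[^']+)' to '(?P<new>[^']+)'"
)


def parse_ensure_change(line):
    """Return resource, old and new values, or None if the line does not match."""
    match = ENSURE_CHANGED.search(line)
    return match.groupdict() if match else None


line = (
    "Mar 10 02:17:30 mw2249 puppet-agent[98949]: "
    "(/Stage[main]/Mediawiki::Jobrunner/Package[jobrunner]/ensure) ensure changed "
    "'a0e821661a107b5dbf4616b0f3570fdd93346010' to "
    "'a1eb96c2f30b31cd05f1ef42e61cdfd1421f505a'"
)

change = parse_ensure_change(line)
print(change["resource"], change["old"], "->", change["new"])
```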

The last SAL entry before that was: 02:51 AaronSchulz: Restarted job services for 5101424 (statsd batching) after monitoring mw1161

That is https://gerrit.wikimedia.org/r/#/c/259660/

At first it seemed I could simply start the services, but they failed again shortly after.

The error at that point is "LightProcess::closeShadow failed due to exception: Failed in afdt::sendRaw: Broken pipe".


19:56 < mutante> ah, an update failed. puppet ensured package upgrade and then the service failed
19:59 < AaronSchulz> mutante: why would a package upgrade trigger?
19:59 < mutante> puppet-agent did it
20:00 < mutante> Mar 10 02:17:30 mw2249 puppet-agent[98949]: (/Stage[main]/Mediawiki::Jobrunner/Package[jobrunner]/ensure) ensure changed 'a0e821661a107b5dbf4616b0f3570fdd93346010' to 'a1eb96c2f30b31cd05f1ef42e61cdfd1421f505a'
20:00 < mutante> Mar 10 02:17:30 mw2249 puppet-agent[98949]: (/Package[jobrunner]) Scheduling refresh of Service[jobrunner]
20:00 < mutante> Mar 10 02:17:30 mw2249 puppet-agent[98949]: (/Stage[main]/Mediawiki::Jobrunner/Base::Service_unit[jobrunner]/Service[jobrunner]) Triggered 'refresh' from 1 events
20:00 < mutante> Mar 10 03:16:38 mw2249 systemd[1]: jobrunner.service: main process exited, code=exited, status=143/n/a
20:00 < mutante> Mar 10 03:16:38 mw2249 systemd[1]: Unit jobrunner.service entered failed state.

20:01 < mutante> AaronSchulz: does it make any sense that it would be related to that deploy?
20:02 < mutante> started about an hour ago
20:03 < AaronSchulz> which was during the salt restart of the two services, but well after the git deploy
20:04 < mutante> looks like i can simply start it on this one host
20:04 < mutante> want me to just start them?

20:05 < mutante> !log mw2249 systemctl start jobrunner - now Active: active (running)
20:05 < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
20:05 < AaronSchulz> sure
20:06 < AaronSchulz> the running the program with --verbose itself looks fine on 2250 (as it does in eqiad)
20:08 < mutante> mw2155, was: Active: failed but a simple "start" and it's working
20:09 < mutante> icinga recovery would be nice now
20:09 < mutante> ah, there is "jobchron" service too
20:10 < mutante> and that is still failed, in the output of systemctl , which makes icinga unhappy


20:14 < mutante> !log more mw appservers ... - systemctl start jobchron, systemctl start jobrunner (both were failed but are now active (running)

20:21 < icinga-wm> RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational
20:22 < icinga-wm> RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational
20:22 < icinga-wm> RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational

But then the alerts returned:

20:28 < icinga-wm> PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
20:28 < icinga-wm> PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.

04:29 mutante: codfw mw jobrunner: they start but then fail again shortly after: mw2248 jobrunner[67314]: [Fri Mar 10 04:23:07 2017] [hphp] [67314:7f6a34b746c0:0:000024] [] LightProcess::closeShadow failed due to exception: Failed in afdt::sendRaw: Broken pipe
04:12 mutante: more codfw appservers ... - systemctl start jobchron, systemctl start jobrunner (both were failed but are now active (running)
04:09 mutante: mw2155 - systemctl start jobchron, systemctl start jobrunner (both were failed but are now active (running)
04:02 mutante: mw2249 systemctl start jobrunner - now Active: active (running)
03:56 mutante: codfw appserver jobrunner service fail related to https://gerrit.wikimedia.org/r/#/c/259660/ ?
03:54 mutante: codfw appservers showing "systemd degraded" alerts are failed jobrunner service unit. after puppet-agent "Mediawiki::Jobrunner/Package[jobrunner]/ensure) ensure changed..." ..then jobrunner.service: main process exited, code=exited, status=143/n/a
02:51 AaronSchulz: Restarted job services for 5101424 (statsd batching) after monitoring mw1161
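A note on the "status=143/n/a" in the systemd lines above: by common Unix convention, an exit code above 128 often encodes 128 plus a signal number, so 143 would correspond to signal 15 (SIGTERM). That would be consistent with the service being stopped or restarted from outside rather than crashing on its own, but it is a convention and not a guarantee. A small sketch of the decoding (the function name is hypothetical):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"exit status {status}: above 128, but {signum} is not a known signal"
        return f"exit status {status}: likely terminated by {name}"
    return f"exit status {status}: ordinary exit code"


print(describe_exit_status(143))
```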

Event Timeline

Dzahn created this task. Mar 10 2017, 4:57 AM
Restricted Application added a subscriber: Aklapper. Mar 10 2017, 4:57 AM
aaron added a comment. Mar 10 2017, 5:22 AM

Probably trebuchet/puppet breakage. I wonder if https://phabricator.wikimedia.org/T129148 would handle this.

Dzahn added a comment. Mar 10 2017, 5:59 PM

This looks fixed now in Icinga, but there is nothing in SAL or on this ticket that would explain how it got fixed.

Dzahn triaged this task as Low priority. Mar 28 2017, 12:20 AM
Krinkle moved this task from Untriaged to Legacy infra on the WMF-JobQueue board. Jul 11 2018, 3:03 AM
Krinkle closed this task as Declined. Jul 11 2018, 3:05 AM
Krinkle added a subscriber: Krinkle.

Closing this out, as it seems specific to the old Redis-based JobQueue and JobRunner, which are no longer in use as of last week. Please re-open if it still applies.

(See T198220 and T157088)