Investigate why cobalt went down for 1 minute on 2017-02-05 and then again 4 minutes later
Closed, DeclinedPublic

Description

Investigate why gerrit went down for one minute on 2017-02-05

[01:06:16]  <icinga-wm>	PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:06:17]  <icinga-wm>	PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:06:26]  <icinga-wm>	PROBLEM - Check whether ferm is active by checking the default input chain on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

Then the checks recovered a minute later:

[01:07:07]  <icinga-wm>	RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:07:07]  <icinga-wm>	RECOVERY - configured eth on cobalt is OK: OK - interfaces up
[01:07:16]  <icinga-wm>	RECOVERY - Check whether ferm is active by checking the default input chain on cobalt is OK: OK ferm input default policy is set

After I created this task, these warnings started:

[01:10:16]  <icinga-wm>	PROBLEM - dhclient process on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:16]  <icinga-wm>	PROBLEM - salt-minion processes on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:17]  <icinga-wm>	PROBLEM - Check systemd state on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:17]  <icinga-wm>	PROBLEM - DPKG on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:17]  <icinga-wm>	PROBLEM - MD RAID on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

Then the checks recovered a minute later:

[01:11:07]  <icinga-wm>	RECOVERY - dhclient process on cobalt is OK: PROCS OK: 0 processes with command name dhclient
[01:11:07]  <icinga-wm>	RECOVERY - salt-minion processes on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:11:07]  <icinga-wm>	RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational
[01:11:07]  <icinga-wm>	RECOVERY - DPKG on cobalt is OK: All packages OK
[01:11:07]  <icinga-wm>	RECOVERY - MD RAID on cobalt is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0

Times are in UTC+0 (UK time).
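For reference, both PROBLEM-to-RECOVERY windows above span the same interval; with GNU date (a throwaway calculation, not part of any tooling) the gap works out to 51 seconds each time:

```shell
# Duration of the first outage window (01:06:16 -> 01:07:07 UTC).
# Requires GNU date; the 2017-02-05 date is taken from the task title.
start=$(date -u -d "2017-02-05 01:06:16" +%s)
end=$(date -u -d "2017-02-05 01:07:07" +%s)
echo $((end - start))   # → 51

# The second window (01:10:16 -> 01:11:07) is also exactly 51 seconds,
# which suggests the recoveries simply came on the next scheduled check.
```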

Event Timeline

Paladox renamed this task from Gerrit went down for 1minute on 05/01/17 to Gerrit went down for 1minute on 05/02/17.Feb 5 2017, 1:12 AM
Paladox updated the task description. (Show Details)
JustBerry subscribed.

Very important (if not unbreak now).

Paladox raised the priority of this task from High to Needs Triage.Feb 5 2017, 1:15 AM
Paladox triaged this task as High priority.
Paladox updated the task description. (Show Details)
Paladox updated the task description. (Show Details)
Paladox renamed this task from Gerrit went down for 1minute on 05/02/17 to Gerrit: Investigate why gerrit went down for 1minute on 05/02/17 and then again 4 minute later.Feb 5 2017, 1:22 AM
Paladox raised the priority of this task from High to Unbreak Now!.EditedFeb 5 2017, 1:37 AM

Setting unbreak as it appears loading gerrit changes is taking longer than usual. Unless it's just me, but that's unlikely, as google loads fast and gerrit's main page loads as it normally does, but loading changes is taking longer than usual.

This needs to be checked as soon as possible, since those warnings need to be looked into.

This would indicate something is wrong.

Paladox updated the task description. (Show Details)
JustBerry renamed this task from Gerrit: Investigate why gerrit went down for 1minute on 05/02/17 and then again 4 minute later to Gerrit: Investigate why gerrit went down for 1minute on 05/02/17 (dd/mm/yy) and then again 4 minute later.Feb 5 2017, 2:48 AM
Jay8g renamed this task from Gerrit: Investigate why gerrit went down for 1minute on 05/02/17 (dd/mm/yy) and then again 4 minute later to Gerrit: Investigate why gerrit went down for 1 minute on Feburary 5 and then again 4 minutes later.Feb 5 2017, 5:25 AM
Peachey88 renamed this task from Gerrit: Investigate why gerrit went down for 1 minute on Feburary 5 and then again 4 minutes later to Investigate why colbolt went down for 1 minute on Feburary 5 and then again 4 minutes later.Feb 5 2017, 8:58 AM
Peachey88 renamed this task from Investigate why colbolt went down for 1 minute on Feburary 5 and then again 4 minutes later to Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later.
Peachey88 lowered the priority of this task from Unbreak Now! to Needs Triage.

Setting unbreak as it appears loading gerrit changes is taking longer then usual to load.

Seems fine to me, speed-wise. Also lowering back to triage, since cobalt (the machine) is still running, which is what all the warnings relate to, and so is Gerrit, the service running on it.

Thanks, we should then set it back to high to investigate why those warnings went off.

Cobalt seemed to be having higher than normal CPU levels; it's showing the system using a lot of CPU. (This is after the warnings started.)

Screen Shot 2017-02-05 at 11.56.42.png (504×794 px, 116 KB)

Paladox triaged this task as High priority.Feb 5 2017, 11:59 AM

Changing to high as this needs investigation as soon as possible. @Peachey88, if you change it back to triage I won't change the priority any more; I just thought the warnings don't come out of nowhere, so maybe something is wrong on the system.

[01:06:17] <icinga-wm> PROBLEM - configured eth on cobalt is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

This does not have to mean there is anything wrong with cobalt itself. It can also just mean that the Icinga server itself was too busy. Are you sure it was _only_ cobalt that icinga-wm was reporting about and not just a coincidence?
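One way to tell the two apart would be to run the NRPE check by hand from the Icinga host; a sketch, assuming check_nrpe sits at the usual Debian plugin path and that check_raid is the remote command name configured for the MD RAID check (both are assumptions, not confirmed from this task):

```shell
# Run the same kind of check Icinga runs, directly against cobalt.
# -H target host, -c remote command name (hypothetical here), -t timeout in seconds.
/usr/lib/nagios/plugins/check_nrpe -H cobalt.wikimedia.org -c check_raid -t 10

# If this answers promptly while icinga-wm is reporting socket timeouts,
# the bottleneck is more likely on the Icinga server side than on cobalt.
```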

Nope, pretty sure it was cobalt. At the time I tried to access gerrit.wikimedia.org, which did not load until a minute later, which is when it said recovered. Then some new warnings showed and a minute later it said it recovered.

The huge user CPU spike around 21:00 UTC is me doing maintenance on Zuul git repositories. I went on scandium.eqiad.wmnet and issued a git remote prune origin on all repos. I monitored cobalt via Grafana and CPU went back to normal afterwards.

Seems it is all normal right now.

atop might help dig into the history of CPU usage.
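atop keeps daily raw logs that can be replayed; a sketch, assuming the default Debian log location and that atop's logging service was running on cobalt at the time (neither is confirmed from this task):

```shell
# Replay atop's recorded samples for 5 February 2017,
# starting from around the time of the alerts (UTC).
# -r read a raw log file, -b begin time (HH:MM).
atop -r /var/log/atop/atop_20170205 -b 01:05
```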

Oh, was that yesterday or today? Looks like it was today, as I see a huge spike around that time.

Meh, not worth investigating... seems as though it was transient.

Aklapper renamed this task from Investigate why cobalt went down for 1 minute on Feburary 5 and then again 4 minutes later to Investigate why cobalt went down for 1 minute on 2017-02-05 and then again 4 minutes later.Feb 6 2017, 1:10 PM