Fri, Jun 15
Thu, Jun 14
Thanks a lot @elukey!
@Joe ack to all your replies, thanks for integrating the suggestions!
Wed, Jun 13
Quick first feedback/questions on the proposal:
Tue, Jun 12
This happened again today, unfortunately. And because I don't see any logs of spurious passive checks from other frack hosts, I guess we have to discard the hypothesis that they might have been the cause of the issue.
Mon, Jun 11
Sat, Jun 9
From a quick test the slowest one-letter search was ~1s, and it was for less common letters like z or q. As of now I cannot repro the issue; feel free to resolve the task if you think that the new version has solved it too. It can be re-opened in case we find a repro.
Fri, Jun 8
Thu, Jun 7
Tue, Jun 5
@Cmjohnson which disk it is is a tricky question in this case.
While investigating the possible root causes for this, I discovered that some new frack hosts installed just last week were sending metrics even though they were not yet fully configured. In particular, they are not present in the Icinga host list, so Icinga discards those messages and in theory that shouldn't do any harm. But to avoid having too many variables in play, I asked @Jgreen if it was possible to not send the metrics at all until the hosts are fully configured, and he very kindly accepted and has already implemented it.
Thu, May 31
Having a look around in the system utility (ESC+9) I found that:
Tue, May 29
@jynus got it, thanks for the info. FYI if you want to test your workaround solution, there is another DB missing: frimpressions. I didn't re-create it though, as I have no context on it. I would have told you tomorrow ;)
@jcrespo FYI I was deploying debmonitor today and replication broke on db1065 and db1117 because of the missing debmonitor database.
I now see a SAL entry from @akosiaris:
11:18 akosiaris: powercycling ms-be1034, box is unresposive, tons of logs "sd 0:1:0:1: rejecting I/O to offline device"
Actually it seems that this has already recovered: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
@Cmjohnson is this controller really missing the battery, or is it a software problem and the battery is just not recognized?
Ack, I propose to leave it as is for now and re-evaluate once Filippo is also back. Resolving for now, feel free to re-open.
Thu, May 24
Forgot to mention that the above message and output were taken on labvirt1020, as I cannot SSH to 1019 right now.
I've double-checked both the report script that populates this task and the Icinga check script that raised the alarm. The issue here seems to be that the controller in Slot 1 (the P840 actually in use) doesn't have/recognize the battery, hence the CRITICAL:
I just discovered that this host is planned for reimage in the next few days. I'm not bothering to fix the md array, as the host doesn't see the replaced disk and might need a reboot anyway; going directly for the reimage at this point.
It looks to me that the battery is broken/not recognized.
@jcrespo naos has been reimaged to deploy2001.codfw.wmnet, so I guess it can now be added to the grants. Mentioning it here just so we don't forget; there is absolutely no hurry to do it.
The proposed approach doesn't take into account hosts installed for the first time. Detecting a newly added host in the Icinga configuration is not trivial at all, and the same goes for disabling notifications, which as of now requires a commit to Hiera. Unless that part is moved to more dynamic storage, I don't see an easy fix going down that path.
Wed, May 23
@akosiaris the current EDAC check is sum(increase($metric[4d])), so it checks the increase over the last 4 days; I'd say it's not time-sensitive at all.
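For reference, a rough sketch of how the same kind of expression could be evaluated by hand against the Prometheus HTTP API; the hostname and the metric name (the node_exporter EDAC counter) are just examples here, the check fills in $metric itself:

  # query Prometheus directly to see the 4-day increase the check is based on
  curl -sG 'http://prometheus.example.org/api/v1/query' \
    --data-urlencode 'query=sum(increase(node_edac_correctable_errors_total[4d]))'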
Tue, May 22
The CPU usage is already back to 40%; we can decide tomorrow whether we want to increase the check_interval further.
To keep everyone in the loop: I chatted with @Catrope the other day about this and we debugged it a bit together.
For reference, last month CPU trend with the two clear increases:
Thanks @ArielGlenn for re-opening this. From a quick look we had two big increases, one on May 2nd and one on May 8th. I think they are related to these two changes, each of which basically adds a check for every host:
May 18 2018
@Dzahn That usually happens if the alarm flaps on Icinga for some reason: the handler opens a new task for each CRITICAL/HARD state triggered by Icinga.
May 14 2018
Yeah, puppetdb1001 will probably just generate some transient spam on IRC for failing Puppet runs.
May 3 2018
Great! Thanks a lot.
May 1 2018
Apr 30 2018
As discussed in the monitoring meeting, here is some feedback:
@jcrespo ack, no blocker for me, I'm actually not using it.
Set up DNS, DHCP, netboot and created 2 VMs on Ganeti: debmonitor001.
I have an additional question: what is the expected behaviour in the following failure scenarios for each option?
Apr 29 2018
I've downtimed db1098 on Icinga until Wed mid EU day and disabled notifications.
Apr 28 2018
Apr 26 2018
Apr 24 2018
Sorry, I didn't mention the multi-DC setup :)
Apr 23 2018
Personally I've never used it, +1 to drop it.
Apr 19 2018
To summarize the work done recently: I've audited the existing checks and fixed/improved those that had clear errors or needed to be updated. @chasemp has very kindly offered to review the WMCS-related checks, users and groups.
Apr 18 2018
@Joe no it would not be super easy to solve in a DRY way, I agree.
Apr 12 2018
I've fixed it; it was a case of password misalignment, see one of the cases described in T150160.
Apr 10 2018
Reporting it here too for future reference: to fix it, it's sufficient to replace the --diff in the above command with --commit, and then re-run with --diff to ensure that this time it shows no error.
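In other words, the pattern is (with a placeholder name, since the actual command isn't quoted here):

  <the-command-from-above> --commit   # same invocation with --diff replaced by --commit, applies the change
  <the-command-from-above> --diff     # re-run to verify it now reports no error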
Apr 9 2018
Patch updated to overcome this problem; once reviewed and merged it should solve the issue.