Thu, Feb 14
@fgiunchedi with the bandaid the problem doesn't show up anymore; do we want to keep this open for tracking, or is it ok to resolve it?
The potential source of the data could be https://tools.wmflabs.org/openstack-browser/puppetclass/
Removing Operations-Software-Development for now, feel free to re-add if you need anything from us on this.
I went ahead and tried the naming convention approach, adding a column to that table and adding my Phabricator username where relevant. I've actually added a link to the Phabricator profile; probably just the name is enough if we specify that those should be Phabricator account names.
Wed, Feb 13
Sorry, I have to amend what I said above: both PD0 and PD3 are missing. I'm sending a patch to improve the get-raid-status-megacli script; with it, the new output would have been:
It seems that PD3 is totally gone, from sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli -a:
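A minimal sketch of the kind of slot-gap check the improved script could perform (the helper name, regex, and sample output are illustrative assumptions, not the actual script):

```python
import re

def missing_slots(output: str, expected: int) -> list[int]:
    """Return physical-drive slot numbers absent from the first `expected` slots.

    Hypothetical helper: assumes megacli-style output with one
    "Slot Number: N" line per detected physical drive.
    """
    found = {int(m) for m in re.findall(r"Slot Number:\s*(\d+)", output)}
    return sorted(set(range(expected)) - found)

# Made-up sample mimicking a controller that no longer reports PD0 and PD3
sample = """
Slot Number: 1
Slot Number: 2
Slot Number: 4
Slot Number: 5
"""
print(missing_slots(sample, 6))  # → [0, 3]
```

Reporting the gaps explicitly, rather than only listing the drives that are present, is what makes a silently missing PD show up in the check output.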
I've merged the other auto-generated tasks into this one. FYI, as you can see on some of the others (where NRPE didn't timeout) the RAID was rebuilding.
@RobH thanks a lot for the tests and follow up!
Just to clarify a detail regarding the "normal" reboot messages: on at least a couple of occasions (for the others I'm not fully sure), the host rebooted itself (triggered either by the mgmt board or the kernel) and was not manually rebooted by us.
Tue, Feb 12
@RobH I actually disagree: the host had already crashed 2 times before it was even in production, so without any Icinga-related load; then at least once a couple of months ago, and I think 4 times in the last 3 weeks. So I wouldn't call it stable at all 😉
Mon, Feb 11
Isn't this due to the PDU issue we had that affected prometheus1003?
Sun, Feb 10
The host has already re-crashed; I'm leaving it as is for now. I've ack'ed the alerts on Icinga.
The host is stuck again (no ping, no ssh, nothing in the console but [ 2451.381422] m, nothing new in getsel or getraclog); forcing a reboot.
As an additional datapoint from T210108, which I've just merged into this task: we had 2 reboots that showed the same symptoms during provisioning, I think even before the Icinga software was running.
I'm merging this with T214760 as those are now clearly just two different manifestations of the same issue (stuck and reboot), and we have the same entries in getraclog:
icinga1001 was stuck again today, but in a slightly different way that gave us some additional information.
No ping, no ssh, and no Icinga web were working, and no racadm errors were logged. But after attaching myself to the console, although unable to get a prompt, I was able to capture the following, which was repeated every few seconds:
I also got an email alert from our external monitoring today, and upon checking icinga1001 I noticed that the uptime coincides with a reboot around that time. See T214760#4941030 for more details. Unclear if related at this point.
The host crashed again today and got rebooted; nothing in getsel, and from getraclog I just got:
Tue, Feb 5
Mon, Feb 4
I'd like to add that you should think about it in the general sense of any OpenStack installation, not only ours. Cumin is generic software with no WMF-specific code, and we'd like to keep it that way.
That's why I strongly think we should keep it agnostic and allow specifying the region in the query.
Sat, Jan 26
Forgot to mention that, as a quick fix, passing the `--rename-mgmt` option explicitly should avoid the problem.
The problem was misdiagnosed here; we actually have two issues. One is related to Icinga, is considered a warning anyway, and doesn't make the script stop; the other is the real error.
Fri, Jan 25
With the latest version we're deploying, we no longer ship netmiko: it came in as a dependency of napalm, and we removed that dependency. So all good here, resolving.
Thu, Jan 24
@hashar what are you suggesting to do here? 3.13 is the latest upstream release.
Wed, Jan 23
Jan 17 2019
Bump, this is still happening.
Jan 15 2019
FYI the host is currently down due to a partial power issue in that rack.
Jan 14 2019
@hashar thanks a lot, CI is finally running successfully for Debmonitor (i.e. https://gerrit.wikimedia.org/r/c/operations/software/debmonitor/+/483131 )
Jan 10 2019
@jcrespo you could first try any of the known/listed remedies in https://wikitech.wikimedia.org/wiki/Management_Interfaces (aliased from IPMI), and of course feel free to expand it if it's something different.
Jan 9 2019
Jan 8 2019
It should be resolved with the above patch, feel free to reopen if not.
Jan 7 2019
@Dzahn it's reported as degraded by megacli:
Jan 5 2019
Patch deployed, resolving for now. Please re-open if that doesn't fix it.
The RAID handler had the old command-file path, which was only valid on jessie.
Jan 3 2019
Jan 2 2019
Fix applied, it required an Icinga restart. See the commit message for more details.
Removed the parent task as the stretch part of the goal was not done, but keeping the task around as we want to create this backend anyway.
Resolving as the quarter is over and the goal-specific tasks have been completed.
Migration has been completed and cumin001 has been fully in service for a few weeks now. Resolving.
Dec 31 2018
Thanks @elukey, I'll have a look in the coming days; at first glance it seems that:
- the logrotate config is the default one from the uwsgi package and doesn't point to where we actually log and rotate independently (/srv/log/netbox).
- we don't use options like log-reopen in uwsgi; something to look into to fine-tune the configuration so that it doesn't trigger this error.
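A minimal sketch of the tuning meant by the second point (the file name and path are assumptions, not our actual config):

```ini
; uwsgi snippet: with log-reopen enabled, uwsgi reopens its log file
; after logrotate moves it, instead of keeping a handle to the deleted file
[uwsgi]
logto = /srv/log/netbox/netbox.log
log-reopen = true
```

The alternative would be teaching logrotate to signal uwsgi (or use copytruncate), but reopening from the uwsgi side keeps the rotation config simple.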
Dec 29 2018
Dec 24 2018
If I may, it seems much more natural to return a dict of hostname: value 😉 (not to mention the past issues with top-level JSON arrays).
Dec 19 2018
@CDanis I have a quick and dirty solution that seems to do 95% of the work; given that it's a one-shot, I think it might be near-usable. Just my 2c, feel free to ignore ;)
Current status is:
- ~95% of the library migrated
- documentation done
- other wmf-* scripts partially done (pending merge)
- wmf-auto-reimage script TODO
Some of the other scripts have been migrated and the CRs are ready, pending the deploy of the new 0.0.10 release of Spicerack.
Given that the newer Spicerack package was not deployed today and we decided to postpone it to the first week after the holidays, I've added a -2 to those CRs to avoid accidental merges.
This is overdue with respect to the goal and will be tackled right after the holidays.