Those are a good start, but if I'm not mistaken, all of them are local on the host, right?
I think we might need one for netbox.wikimedia.org too, run from the Icinga host itself, to check that the service is reachable and, potentially, that it has some data in it.
The last part might be a bit more difficult given that it is behind a login, unless Netbox exposes a "check" URL that runs some internal checks and exposes the result.
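A minimal sketch of what such a remote check could look like, assuming a plain HTTP probe that follows the usual Nagios/Icinga plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names, the mapping from HTTP status to severity, and the optional "expected text" argument are all hypothetical; whether Netbox offers a suitable unauthenticated URL is still the open question above.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios/Icinga-style HTTP reachability check (hypothetical)."""
import sys
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status, body, expected=None):
    """Map an HTTP status and body to an (exit_code, message) pair.

    The severity mapping below is an assumption, not an agreed policy.
    """
    if status is None:
        return CRITICAL, "CRITICAL: no HTTP response"
    if status >= 500:
        return CRITICAL, f"CRITICAL: HTTP {status}"
    if status >= 400:
        return WARNING, f"WARNING: HTTP {status}"
    if expected is not None and expected not in body:
        return WARNING, f"WARNING: HTTP {status} but expected text is missing"
    return OK, f"OK: HTTP {status}"


def fetch(url, timeout=10):
    """Return (status, body); status is None when the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code, ""
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, TLS error, ...
        return None, ""


# Only probe when a URL is given: check.py URL [EXPECTED_TEXT]
if __name__ == "__main__" and len(sys.argv) > 1:
    expected_text = sys.argv[2] if len(sys.argv) > 2 else None
    code, message = classify(*fetch(sys.argv[1]), expected=expected_text)
    print(message)
    sys.exit(code)
```

Note that a login page would still come back as HTTP 200 here, so the reachability part works as-is, while the "has some data in it" part would need either the expected-text match against an unauthenticated page or a dedicated check URL.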
Cumin 2.0.0 with support for Puppet API v4 was released. Debdeploy was updated accordingly.
Both debdeploy and cumin were released into production.
Fri, Jan 19
Nice! So I guess that our puppetization is not correct and should restart Postgres after the first configuration change to ensure that the new data directory is used from the start.
Thu, Jan 18
I've run clean + deactivate for cp4018 as part of cleanup of stale puppet certs.
Wed, Jan 17
Tue, Jan 16
Fri, Jan 12
@Andrew yes those are the Puppet compiler instances that Jenkins uses. We can agree that the name of the project was not chosen to be very future-proof ;) but the hosts are very much in use.
Thu, Jan 11
This was a misunderstanding on my side, @Dzahn actually stopped it manually.
TL;DR: Everything is back to einsteinium now, and everything is working. Resolving.
While trying to fix the issues after the reboot for the kernel upgrade, I've opened T184634.
But now it seems that the Postgres DB is empty (no tables in the netbox DB). I'm not sure whether it was emptied as part of some of the tests above, or whether the reboot combined with the broken Puppet run might have done this.
Wed, Jan 10
Confirmed that on tegmen it works fine after failing over the active Icinga server to it.
The links are properly rendered and the ampersands are not dropped, as opposed to what happens on einsteinium.
Tue, Jan 9
Mon, Jan 8
Thu, Jan 4
Wed, Jan 3
Thanks to @cwdent for notifying us.
Tue, Jan 2
Tue, Dec 26
pdfrender on all eqiad hosts required restarts tonight (UTC), see SAL. Thanks @madhuvishy for taking care of it.
Dec 22 2017
Dec 21 2017
To summarize the current status, everything is deployed and works as expected, except one small detail: the ampersands are removed from the dashboard URLs, making them mostly useless :/
Deployed a workaround that makes it fail completely so that systemd restarts it. Resolving it for now.
Dec 19 2017
All done, resolving.
@RobH FYI I've ack'ed the Icinga alert of the host down and set it to downtime until Fri UTC morning.
Dec 18 2017
The reimage scripts should be back on track and work as expected. They were tested today with a couple of reimages. I cannot exclude that we'll find some other corner cases with less-used OS versions and with Puppet 4 clients, but from my side this could be resolved.
Powercycled ganeti1005, unable to ssh, console unresponsive.
Dec 13 2017
I just noticed that in late_command.sh we have a special case for cp* that I guess will need to be updated to include eqsin too.
Mentioning it here because it's not a common place to look and might be missed.
Dec 11 2017
I'm sorry the test didn't help.
Digging a bit more, it seems that the controller that we have (Smart Array P440ar) supports HBA (Host Bus Adapter) mode, which, according to the HP manual: