Wed, Sep 13
Tue, Sep 12
For the record they were reimaged correctly; the new reimage script hit a small bug in the post-reimage part. I've already re-run it for the "failed" host to complete the post-reimage steps.
Fri, Sep 8
With the refactoring for T166300 this problem will be solved naturally by moving from a shellout to confctl to using the conftool library. Resolving.
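For reference, a minimal sketch of what the current shellout side looks like from Python; the confctl invocation and its output handling here are assumptions for illustration, and the conftool library would replace this with in-process calls:

```
# Hedged sketch: shelling out to confctl, which is what the refactor moves
# away from. The exact confctl arguments and output format are assumptions.
import subprocess


def get_host_state(hostname):
    """Shell out to confctl; callers must parse free-form text output."""
    cmd = ['confctl', 'select', 'name={}'.format(hostname), 'get']
    # With the conftool library this would become an in-process call,
    # with real exceptions instead of exit codes and text to parse.
    return subprocess.check_output(cmd).decode('utf-8')
```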
The configuration has looked stable for a while, resolving.
Issue fixed in the master branch; leaving the task open so we don't forget to revert the hotfix done in the debian branch in https://gerrit.wikimedia.org/r/#/c/373513/
Thu, Sep 7
False positive, I'll check why it was not blacklisted.
Wed, Sep 6
Tue, Sep 5
Mon, Sep 4
Sun, Sep 3
I might have a good candidate for what is causing it: mdadm checkarray
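To help confirm the correlation, a minimal sketch (assuming the standard md sysfs layout) of how to see whether any array is in the middle of a check, which is what mdadm checkarray triggers:

```
# Hedged sketch: report md arrays whose sync_action is currently 'check'.
import glob


def md_arrays_checking():
    """Return md devices currently running a check (e.g. from checkarray)."""
    checking = []
    for path in glob.glob('/sys/block/md*/md/sync_action'):
        with open(path) as f:
            if f.read().strip() == 'check':
                checking.append(path.split('/')[3])  # e.g. 'md0'
    return checking


if __name__ == '__main__':
    print(md_arrays_checking())
```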
Sat, Sep 2
Wed, Aug 30
Mon, Aug 28
Adding ops-eqiad, looks like we'll probably end up replacing the disk
Fri, Aug 25
+1 from me, I see this almost as a noop. Over 2y it's more likely that the user changes the physical device (in particular if mobile) than that HSTS expires 😉
The check_ping on our Icinga hosts doesn't seem to have an option to set the equivalent of ping's -i. Reducing from 5 to 3 packets halves the time per check to 2s, but I'm not sure it's worth it given the increased risk of false positives (although with 3 packets it should be low enough inside our prod network).
I'm curious too about what the current issue/bottleneck is.
An alternative option could be to make this check passive, with a freshness threshold of ~35m and the data pushed directly by run-puppet-agent/puppet-run after each run. If the check is stale (no data received by Icinga within the threshold) then an active check can be performed automatically, allowing (I think) us to keep the current warning/critical logic for the last puppet run.
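A minimal sketch of the Icinga side of the submission, assuming the standard external command pipe; the command file path and service description are assumptions, and getting the result from the puppet host to the Icinga host (e.g. via NSCA or similar) is left out:

```
# Hedged sketch: push a passive check result to the Icinga command pipe
# after a puppet run, via the PROCESS_SERVICE_CHECK_RESULT external command.
import time

ICINGA_CMD_FILE = '/var/lib/icinga/rw/icinga.cmd'  # assumed path


def submit_puppet_result(host, return_code, output,
                         service='puppet last run'):
    """Write a PROCESS_SERVICE_CHECK_RESULT line to the command pipe."""
    line = '[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{svc};{rc};{out}\n'.format(
        ts=int(time.time()), host=host, svc=service,
        rc=return_code, out=output)
    with open(ICINGA_CMD_FILE, 'w') as f:
        f.write(line)
```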
I agree; the only drawback I see in having them bundled together is that we couldn't use stalking to tell them apart, given that the temperature will change on each check.
What is the status of this server? I can see it's all red in Icinga; trying to SSH gives the key-changed warning, but Puppet is not aware of the new key.
Thu, Aug 24
Thanks @thcipriani for the answers. With my limited knowledge of the Zuul-Jenkins relationship and (in)direct variable settings, it seems to me a fairly normal requirement to be able to configure a repository to run a CI job and specify/set some parameters for it.
The above should not be needed on megaraid hosts, where smartctl --scan-open works well AFAICT (see ms-be2014).
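For reference, a minimal sketch of relying on --scan-open output directly; the exact shape of the megaraid device lines is an assumption here:

```
# Hedged sketch: list devices as reported by 'smartctl --scan-open', which
# on megaraid hosts should already include the '-d megaraid,N' handles.
import subprocess


def smart_devices():
    """Return (device, driver) tuples parsed from smartctl --scan-open."""
    out = subprocess.check_output(['smartctl', '--scan-open']).decode('utf-8')
    devices = []
    for line in out.splitlines():
        line = line.split('#', 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        parts = line.split()
        # Expected shape: '/dev/sda -d scsi' or '/dev/bus/0 -d megaraid,0'
        devices.append((parts[0], parts[-1]))
    return devices
```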
Wed, Aug 23
Tue, Aug 22
This works but is super ugly:
At first sight it might just be that the update frequency of the data and the smallest retention period set in Graphite don't match each other, the retention period being much smaller than the update frequency.
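A quick way to check one reading of this (finest archive resolution smaller than how often the data is pushed, leaving most points null); a minimal sketch using the whisper library, with a hypothetical metric path and an assumed update interval:

```
# Hedged sketch: compare the finest whisper archive resolution against the
# expected update interval of the metric.
import whisper  # the graphite whisper library

WSP_FILE = '/var/lib/carbon/whisper/example/metric.wsp'  # hypothetical path
UPDATE_INTERVAL = 300  # seconds, assumed push frequency of the data


def finest_resolution(path):
    """Return secondsPerPoint of the highest-resolution archive."""
    info = whisper.info(path)
    return min(a['secondsPerPoint'] for a in info['archives'])


if __name__ == '__main__':
    res = finest_resolution(WSP_FILE)
    if res < UPDATE_INTERVAL:
        print('Archive resolution %ss is finer than the %ss update interval: '
              'most points will be null' % (res, UPDATE_INTERVAL))
```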
I think so too, but it might need a parameter or Hiera value to mark those as "provisioning", given that they will already have the production MariaDB role but will not be fully provisioned. So if I'm understanding it correctly, yes it's possible, but it will require an additional commit to remove the "provisioning" param/Hiera value once provisioning is completed.
My answers to the above questions are: YES, YES, YES (but I'd like them to be separated in the UI, unfortunately this is not possible in Icinga), NO
Aug 3 2017
I've started working on this; I hoped to be able to finish it by today, but the list of checks is long. I will complete it when I'm back.
Aug 2 2017
Jul 31 2017
The added PyBal IPVS diff check is flapping a bit with UNKNOWN for some hosts (lvs100[3,6,9], lvs200[3,6]) with message:
HTTPConnectionPool(host='localhost', port=9090): Read timed out. (read timeout=1.0)
$ grep -c "PyBal IPVS diff check" icinga.log
34
When specifying the timeout in Requests you can use a tuple to set different values for the connect and read timeouts. My guess is that sometimes PyBal on those hosts is not able to reply within the 1s timeout, and we might need a larger one.
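A minimal sketch of what the check could do (tuple timeouts are supported since Requests 2.4, IIRC); the URL and the values are assumptions for illustration:

```
# Hedged sketch: separate connect and read timeouts, with a more generous
# read timeout for slow PyBal replies instead of the current flat 1.0s.
import requests

POOLS_URL = 'http://localhost:9090/pools'  # assumed PyBal endpoint

# (connect timeout, read timeout): keep connects snappy, give PyBal a few
# seconds to serialize its reply.
resp = requests.get(POOLS_URL, timeout=(1.0, 5.0))
resp.raise_for_status()
print(resp.text)
```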
Jul 26 2017
So we had a small hiccup today in which puppetdb responded with 503s 28 times between 16:20:13 and 16:20:39 UTC; of those, 17 were POSTs to update host facts, and we had a bit of failure spam on IRC. It recovered by itself.
This is s4 master.
Jul 25 2017
Jul 24 2017
I've disabled (if not already) and removed files for the following users:
Soufianehamouda
Houssamista
Marama12
Oussama177