Sat, May 20
Wed, May 17
@faidon let me know if you want the Icinga RAID handler to also open tasks for warnings; these include the above and the predictive drive failures for HP controllers.
Tue, May 16
- Etcd not listening (iptables REJECT)
- Host not responding (iptables DROP)
- Host unreachable (iptables REJECT with icmp-host-unreachable)
- No DNS response for the SRV record (NXDOMAIN)
- DNS SRV record(s) return an invalid name (NXDOMAIN)
- Etcd slow to respond (reaches the configured timeout)
- High packet loss between MediaWiki and Etcd (i.e. when the master is in the other DC and there is an issue in the cross-DC connection)
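For reference, here is a minimal sketch of how a client-side probe could tell some of these scenarios apart. It assumes a plain TCP connection attempt and relies on the usual Linux mapping of ICMP errors to socket errors. The names `classify_failure` and `probe` and the returned labels are hypothetical and are not taken from MediaWiki or any existing tool. A slow Etcd and high packet loss would both most likely surface as a timeout, so this sketch cannot distinguish them.

```python
import errno
import socket


def classify_failure(exc: BaseException) -> str:
    """Map a connection exception to a coarse failure label (hypothetical labels)."""
    # gaierror is checked first because it is a subclass of OSError.
    if isinstance(exc, socket.gaierror):
        return "dns-failure"        # e.g. NXDOMAIN for the record or its target
    if isinstance(exc, ConnectionRefusedError):
        return "not-listening"      # e.g. iptables REJECT with its default ICMP type
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"            # e.g. iptables DROP, slow server, packet loss
    if isinstance(exc, OSError) and exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        return "host-unreachable"   # e.g. REJECT with icmp-host-unreachable
    return "unknown"


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Try a TCP connection and return "ok" or a failure label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify_failure(exc)
```

Calling `probe(host, port)` against a test host while each iptables rule is active is one way to check that the expected label comes back; the host and port to use depend on the setup under test.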
Wed, May 10
Tue, May 9
Mon, May 8
@Papaul thanks for letting me know. I understand the problem, given the particular nature of the haze host, although after a quick check I didn't see a way to get the physical location of the drive from megacli. If you know an easy way to get this information I can modify the script to check/include it when available.
Fri, May 5
@BBlack Thanks for opening this feature request, because right now it's totally implementation dependent and actually I realized this is neither clear nor explained in the docs / readme.
Thu, May 4
Wed, May 3
Resolving this after a successful MediaWiki switchover to codfw and switchback to eqiad using the automation software Switchdc (operations-switchdc on gerrit). The tracking task for improvements is T163363.
@EddieGP I agree with you, I closed it because this one was targeting this specific rollout and switchdc, and I didn't want to leave it open until the next switch.
Tasks implemented and tested. They were later reverted because etcd was not activated in MediaWiki. Resolving.
Tue, May 2
Mon, May 1
Is there an easy way I could check which version and/or value of an Etcd-driven MW-config variable is actually loaded/cached by the running application?
The manual change + commit + deploy of the MW configuration might actually not be needed anymore; it depends on T163398. If that change lands in production before the switchback, the related tasks in Switchdc will be updated to use conftool to change those values, hence that hardcoded part will go away anyway.
Sat, Apr 29
Fri, Apr 28
Regarding the implementation of the MW configuration, in particular CR https://gerrit.wikimedia.org/r/#/c/347537 (current patchset is #8), I think that we should first agree on the failure model, because I've seen different comments and approaches.
Thu, Apr 27
Please also ensure that remote IPMI is working, applying the fix in T150160 if needed, because right now it is not:
@Cmjohnson: great! Thanks a lot!
I've added a few more that I saw today in https://puppet-compiler.wmflabs.org/6247/
Wed, Apr 26
Tue, Apr 25
From the audit I got the same results as the tables in T163196#3206314 except for the following ones, and everything looks good now for the ipaddress6_primary version:
Mon, Apr 24
Comparison between ipaddress6 and ipaddress6_primary. All the ones with an issue are marked in bold and have a number in square brackets that refers to the list of details at the bottom. For all the others the correct one seems to be ipaddress6_primary to me; it also matches the DNS record when present:
Comparison between ipaddress and ipaddress_primary. For all the ones that differ, the correct one seems to be ipaddress_primary to me; it also matches the DNS record for the host: