wdqs1004 suddenly went down.. loss of network..
then we powercycled it and it came back at first,
only to die again a few minutes later.
then we depooled it and it recovered again, but something is definitely wrong with it.
16:07 < icinga-wm> PROBLEM - Host wdqs1004 is DOWN: PING CRITICAL - Packet loss = 100%
16:08 < gehel> I'm on the console on wdqs1004, it looks reasonnably well except that I cant reach network
16:11 < gehel> !log powercycling wdqs1004 (complete loss of network)
16:13 < icinga-wm> RECOVERY - Check size of conntrack table on wdqs1004 is OK: OK: nf_conntrack is 0 % full
16:13 < icinga-wm> RECOVERY - Host wdqs1004 is UP: PING WARNING - Packet loss = 64%, RTA = 0.22 ms
16:14 < icinga-wm> RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 435 bytes in 0.031 second response time
16:23 < XioNoX> mutante: any idea what caused that wdqs1004 issue?
16:34 < icinga-wm> PROBLEM - configured eth on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
16:34 < icinga-wm> PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
16:36 < icinga-wm> PROBLEM - Blazegraph process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
16:36 < icinga-wm> PROBLEM - puppet last run on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
16:37 <+logmsgbot> !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1004.eqiad.wmnet
16:37 < icinga-wm> RECOVERY - Check whether ferm is active by checking the default input chain on wdqs1004 is OK: OK ferm input default policy is set
16:37 < icinga-wm> RECOVERY - DPKG on wdqs1004 is OK: All packages OK
16:41 < icinga-wm> PROBLEM - SSH on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 22: Connection refused
16:34 < mutante> XioNoX: there are java errors but all that stuff seems like red herrings and normal before the incident as well.. all i can really see is it ... stopped working
16:36 < XioNoX> Network interface Carrier transitions: 649, there is definitively something wrong with that host
Change 413485 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: move LDF endpoint to wdqs1005 since wdqs1004 is having hardware issues
Change 413485 merged by Dzahn:
[operations/puppet@production] wdqs: move LDF endpoint to wdqs1005 since wdqs1004 is having hardware issues
I ran puppet on all wdqs* via cumin after the merge, and smalyshev confirmed the LDF server seems fine.
@Cmjohnson This server has been depooled. Could you please run hardware diagnostics on it? You can do it anytime. Thanks!
Mentioned in SAL (#wikimedia-operations) [2018-02-27T08:42:34Z] <gehel> powercycling wdqs1004 - T188045
Hardware diagnostics have completed with no error. It is now up again, catching up on updates, but still depooled. I'll keep an eye on it, and if it looks stable and has caught up on updates, I'll repool it (and still keep an eye on it).
Mentioned in SAL (#wikimedia-operations) [2018-02-27T17:38:36Z] <gehel> restarting wdqs-updater on wdqs1004 - T188045
wdqs1004 still does not look well. The logs show a number of UnknownHostException (T188413) and updates are not processed. SSH connections sometimes freeze but do recover, or connecting ends up in "Connection refused" but works again a few seconds later.
Prometheus shows a number of ICMP errors.
This looks to me very much like network connectivity failing for whatever reason. And it looks like I'm going to need some help to get to the bottom of this.
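To pin down whether this is a hard outage or intermittent loss, a small watch loop could be left running on the host. This is just a sketch; the one-second interval is arbitrary and 10.64.0.1 is a placeholder for the row gateway, not something verified here:

#!/bin/bash
# Minimal watch loop: log a timestamp whenever DNS resolution or a ping fails,
# so intermittent drops show up clearly in the output.
while true; do
  ts=$(date --utc +%FT%TZ)
  host -W 2 www.wikidata.org >/dev/null 2>&1 || echo "$ts DNS lookup failed"
  # 10.64.0.1 is a placeholder gateway address for this sketch
  ping -c 1 -W 2 10.64.0.1 >/dev/null 2>&1 || echo "$ts ping to gateway failed"
  sleep 1
done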
@RobH : suggestions on the next steps are welcomed!
I wonder if it's possible to use one of the new servers we're getting in T187766 to restore full capacity if debugging what is going on with 1004 takes time. Would it be a good thing to do?
There isn't much impact on response time or even load on the cluster at this point, so I would not worry too much yet. If we lose another node, this is going to be an issue, but let's not borrow trouble yet. Note that we have already received new servers in codfw, but wdqs1004 is in eqiad, so that does not help. We can re-discuss that once the new servers are available (they are configured exactly the same, so moving them between clusters should be trivial).
If losing one server out of 4 is an issue, then we have a sizing problem for that cluster. Is that the case? I can't find evidence in the utilization metrics for those servers.
If losing one server out of 4 is an issue
One out of 3. We don't have a problem right now, but if we have to take down another one - for maintenance or another issue - we will. If 1004 comes back within a short timeframe, it's not a concern, but since it has now been dead for almost a week, I'm starting to think about options in case bringing it back to life takes longer.
A few things to check (thanks for the pointers from my fellow ops):
@Cmjohnson this looks like an issue with the physical connection. Could you try moving the cable to another port on the switch so that we can eliminate that possibility?
I replaced the ethernet cable and was able to login
Debian GNU/Linux 8 auto-installed on Wed Sep 6 12:24:33 UTC 2017.
cmjohnson@wdqs1004:~$
Thanks Chris! First I wasn't able to log in, then I was and could start a puppet run. Let's see if Icinga clears up and stays green for a while.
Mentioned in SAL (#wikimedia-operations) [2018-03-01T17:46:16Z] <mutante> re-enabling icinga notifications for wdqs1004 services, ethernet cable has been replaced (T188045)
Mentioned in SAL (#wikimedia-operations) [2018-03-01T17:47:27Z] <gehel> restarting wdqs-updater on wdqs1004 -T188045
Still getting this:
Mar 01 17:48:34 wdqs1004 bash[1382]: Exception in thread "main" java.lang.RuntimeException: java.net.UnknownHostException: www.wikidata.org
looks like the problem is still there.
Ack, I was able to connect to it and run ping for a while, but .. it's like it's freezing sometimes and then continues. Also Gehel could confirm there is still an issue. :/
Can we do the "try other switch port" step next @Cmjohnson ?
It's intermittent. Most of the time it works, but then it doesn't...
[wdqs1004:~] $ host www.wikidata.org
www.wikidata.org has address 208.80.154.224
www.wikidata.org has IPv6 address 2620:0:861:ed1a::1
from dmesg:
[203721.906779] tg3 0000:02:00.0 eth0: Link is down
[203725.760204] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[203725.760214] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[203725.760218] tg3 0000:02:00.0 eth0: EEE is disabled
[203782.391437] tg3 0000:02:00.0 eth0: Link is down
[203798.154227] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[203798.154236] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[203798.154239] tg3 0000:02:00.0 eth0: EEE is disabled
[206269.350612] tg3 0000:02:00.0 eth0: Link is down
[206289.517967] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[206289.517976] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[206289.517980] tg3 0000:02:00.0 eth0: EEE is disabled
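For what it's worth, the flapping can also be counted from the host side without digging through dmesg; a quick sketch using standard sysfs and iproute2, assuming eth0 and a kernel recent enough to expose carrier_changes:

# Carrier transition counter maintained by the kernel; a value that keeps
# increasing over time confirms the link is flapping.
cat /sys/class/net/eth0/carrier_changes

# Per-interface statistics, including RX/TX errors and drops.
ip -s link show eth0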
A ticket has been placed with Dell: "You have successfully submitted request SR961706803."
The dell technician rescheduled for Thursday 8 March. Tentatively scheduled for 1800UTC.
Change 417311 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] dhcp mac address change wdqs1004
Change 417311 merged by Cmjohnson:
[operations/puppet@production] dhcp mac address change wdqs1004
The system board has been swapped, and the BIOS and iDRAC have been updated. Also updated the MAC address in puppet. Assigning to @RobH for reinstall.
Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:
['wdqs1004.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201803091357_gehel_7493.log.
Change 417899 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: migrate to stretch
Change 417899 merged by Gehel:
[operations/puppet@production] wdqs: migrate to stretch
Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:
['wdqs1004.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201803091444_gehel_17927.log.
Bad news... I just finished reimaging wdqs1004, and I still have trouble. SSH session suddenly / randomly freezes. I can't see the same link flapping that @Dzahn saw in dmesg (at least not yet).
@RobH / @Cmjohnson : any idea how to proceed?
Note that Icinga also has random failed checks (size of conntrack table, ferm, dpkg, ...) all with "Return code of 255 is out of bounds". The service (blazegraph) related checks are also failing, but that's expected, since the data isn't loaded yet.
So I can see @Gehel asked:
So it looks like Chris swapped the cable, it worked, and the port wasn't changed? It is unlikely to be a bad EX4200 port (I can only recall this happening once in my time here), but it is possible.
We are still seeing bad SSH connections, and it's likely ALL networking is affected and SSH is just the most obvious symptom.
I'd like to ask @Cmjohnson if the port was swapped on this? If not, can we go ahead and swap over the port?
@Cmjohnson did message me about changing the port as well, so that's probably not it.
Side note: if it wasn't clear from the gerrit change above, the OS has also been upgraded from jessie to stretch as part of the reimage.
@Cmjohnson I'll be on vacation starting March 18, and I would be more relaxed if I knew our wdqs eqiad cluster has its usual 3 nodes. Could you already rack one of the new cluster nodes (T188432) so that we can cannibalize one until wdqs1004 is healthy again?
faidon@re0.cr1-eqiad> show arp no-resolve | match 10.64.0.17
78:2b:cb:2d:fa:e6  10.64.0.17  ae1.1017  none

faidon@re0.cr2-eqiad> show arp no-resolve | match 10.64.0.17
d0:94:66:2f:ae:02  10.64.0.17  ae1.1017  none
and
faidon@asw-a-eqiad> show ethernet-switching table | match d0:94:66:2f:ae:02
private1-a-eqiad  d0:94:66:2f:ae:02  Learn  0  ge-6/0/37.0

faidon@asw-a-eqiad> show ethernet-switching table | match 78:2b:cb:2d:fa:e6
private1-a-eqiad  78:2b:cb:2d:fa:e6  Learn  0  ge-4/0/18.0
ge-6/0/37 is wdqs1004.
ge-4/0/18 has no description; how did that happen? What is connected there, why is the port enabled without a description, why did it use wdqs1004's IP and how come it's been idling like that for 20 days? @Cmjohnson could you debug this further ASAP?
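For future reference, a duplicate IP like this can also be spotted from any other host on the same subnet with arping's duplicate address detection mode; a sketch, assuming iputils-arping is installed and eth0 is the host's primary interface:

# -D: duplicate address detection; replies are printed with the answering MAC,
# and the exit code is non-zero if another host answers for the address.
sudo arping -D -I eth0 -c 3 10.64.0.17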
Ok, I had misparsed this, but after chatting with Faidon (and Arzhel) via IRC, I understand what has happened.
rcs1002 was decommissioned on T170157, and I did the steps. Somehow I checked off and updated the task with the port information for both hosts (so I was 100% on the switch stack) and then failed to commit the port disabling. At some point this host got powered back on, and its disks are NOT wiped, and it took back over this IP.
Yes, this is the exact reason we have the checklist (that I wrote). So this is 100% on me. I've disabled the rogue server's port, and re-audited the steps on T170157.
wdqs1004 is now responsive.
Well, if the server itself is needed, it will have to do its work with a different IP address than wdqs1004's, since it would have been suffering the same connection issues as wdqs1004 (just the other half of the time).
Thanks @RobH! Created T189548: reload data on wdqs1004 for loading the data back. @Gehel if you don't see anything else wrong, then this one can be resolved. Since it's kind of a generic issue, maybe we should also write an incident report?
Yay! Thanks @faidon for finding the issue!
@RobH / @Papaul : the symptoms of wdqs2006 mgmt interface look vaguely similar (T189318). Any chance that the cause would be similar?
@Smalyshev : I'll take care of the data reload on T189548
I'm closing this task as it looks very much resolved, I'll reopen if there is more trouble.
So post-mortem, I think there are 4 different things here:
My apologies for all the wasted engineering time -- this is a pretty unfortunate and pretty basic issue that continues to bite us :(