
wdqs1004 broken
Closed, Resolved · Public

Description

wdqs1004 suddenly went down with a complete loss of network.

We powercycled it and it came back at first, only to die again a few minutes later.

We then depooled it and it recovered again, but something is definitely wrong with it.

Details

Related Gerrit Patches:
[operations/puppet@production] wdqs: migrate to stretch
[operations/puppet@production] dhcp mac address change wdqs1004
[operations/puppet@production] wdqs: move LDF endpoint to wdqs1005 since wdqs1004 is having hardware issues

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 22 2018, 9:42 PM
Dzahn added a comment.Feb 22 2018, 9:50 PM

16:07 < icinga-wm> PROBLEM - Host wdqs1004 is DOWN: PING CRITICAL - Packet loss = 100%
16:08 < gehel> I'm on the console on wdqs1004, it looks reasonnably well except that I cant reach network
16:11 < gehel> !log powercycling wdqs1004 (complete loss of network)
16:13 < icinga-wm> RECOVERY - Check size of conntrack table on wdqs1004 is OK: OK: nf_conntrack is 0 % full
16:13 < icinga-wm> RECOVERY - Host wdqs1004 is UP: PING WARNING - Packet loss = 64%, RTA = 0.22 ms

16:14 < icinga-wm> RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 435 bytes in 0.031 second response time
16:23 < XioNoX> mutante: any idea what caused that wdqs1004 issue?
16:34 < icinga-wm> PROBLEM - configured eth on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
16:34 < icinga-wm> PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: Return code of 255 is out of bounds

16:36 < icinga-wm> PROBLEM - Blazegraph process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
16:36 < icinga-wm> PROBLEM - puppet last run on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
16:37 <+logmsgbot> !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1004.eqiad.wmnet
16:37 < icinga-wm> RECOVERY - Check whether ferm is active by checking the default input chain on wdqs1004 is OK: OK ferm input default policy is set
16:37 < icinga-wm> RECOVERY - DPKG on wdqs1004 is OK: All packages OK

16:41 < icinga-wm> PROBLEM - SSH on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 22: Connection refused

16:34 < mutante> XioNoX: there are java errors but all that stuff seems like red herrings and normal before the incident as well.. all i can really see is it ... stopped working

16:36 < XioNoX> Network interface Carrier transitions: 649, there is definitively something wrong with that host

Change 413485 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: move LDF endpoint to wdqs1005 since wdqs1004 is having hardware issues

https://gerrit.wikimedia.org/r/413485

Restricted Application added projects: Wikidata, Discovery. · View Herald TranscriptFeb 22 2018, 9:52 PM
Mbch331 moved this task from incoming to blocked on others on the Wikidata board.Feb 22 2018, 9:55 PM

Change 413485 merged by Dzahn:
[operations/puppet@production] wdqs: move LDF endpoint to wdqs1005 since wdqs1004 is having hardware issues

https://gerrit.wikimedia.org/r/413485

I ran puppet on all wdqs* via cumin after the merge, and smalyshev confirmed the LDF server seems fine

https://query.wikidata.org/bigdata/ldf

Dzahn added a subscriber: Papaul.EditedFeb 22 2018, 10:14 PM

@Cmjohnson This server has been depooled. Could you please run hardware diagnostics on it? You can do it anytime. Thanks!

Dzahn edited projects, added ops-eqiad; removed Patch-For-Review.Feb 22 2018, 10:23 PM
Dzahn added a subscriber: Cmjohnson.
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board.Feb 26 2018, 4:04 PM
Gehel added a subscriber: Gehel.Feb 26 2018, 6:11 PM
Gehel claimed this task.Feb 26 2018, 6:27 PM

Hardware diagnostic is running, I'll report back with the results when completed.

Mentioned in SAL (#wikimedia-operations) [2018-02-27T08:42:34Z] <gehel> powercycling wdqs1004 - T188045

Gehel added a comment.Feb 27 2018, 8:50 AM

Hardware diagnostics have completed with no errors. The host is now up again, catching up on updates, but still depooled. I'll keep an eye on it, and if it looks stable and has caught up on updates, I'll repool it (and still keep an eye on it).

Mentioned in SAL (#wikimedia-operations) [2018-02-27T17:38:36Z] <gehel> restarting wdqs-updater on wdqs1004 - T188045

Gehel added a subscriber: RobH.Feb 27 2018, 6:00 PM

wdqs1004 still does not look well. The logs show a number of UnknownHostException errors (T188413) and updates are not being processed. SSH connections sometimes freeze but do recover, or connecting ends in "Connection refused" but works again a few seconds later.

Prometheus shows a number of ICMP errors.

This looks to me very much like network connectivity failing for whatever reason. And it looks like I'm going to need some help to get to the bottom of this.

@RobH : suggestions on the next steps are welcome!

I wonder if it's possible to use one of the new servers we're getting in T187766 to restore full capacity if debugging what is going on with 1004 takes time. Would it be a good thing to do?

Smalyshev triaged this task as High priority.Feb 27 2018, 6:26 PM
Gehel added a comment.Feb 27 2018, 7:28 PM

There isn't much impact on response time or even load on the cluster at this point, so I would not worry too much yet. If we lose another node, this is going to be an issue, but let's not borrow trouble yet. Note that we have already received new servers in codfw, but wdqs1004 is in eqiad, so that does not help. We can re-discuss that once the new servers are available (they are configured exactly the same, so moving them between clusters should be trivial).

Joe added a subscriber: Joe.Feb 28 2018, 8:06 AM

I wonder if it's possible to use one of the new servers we're getting in T187766 to restore full capacity if debugging what is going on with 1004 takes time. Would it be a good thing to do?

If losing one server out of 4 is an issue, then we have a sizing problem for that cluster. Is that the case? I can't find evidence in the utilization metrics for those servers.

Smalyshev added a comment.EditedFeb 28 2018, 8:09 AM

If losing one server out of 4 is an issue

One out of 3. We don't have a problem right now, but if we have to take down another one (for maintenance or another issue) we will. If 1004 comes back within a short timeframe, it's not a concern, but since it's now been dead for almost a week, I'm starting to think about options in case bringing it back to life takes longer.

Gehel added a comment.Feb 28 2018, 9:57 AM

A few things to check (thanks for the pointers from my fellow ops):

  • is another server stealing its IP
    • check DNS -> nothing suspicious
    • check DHCP leases
    • check ARP caches -> sudo cumin -b200 -p 80 'A:eqiad' "arp -n | grep ^10.64.0.17\ | awk '{print \$3}'" shows only 18:66:da:ac:49:d5 which is the correct MAC of wdqs1004
  • check if the network is downgraded to 10Mb -> ethtool says: Speed: 1000Mb/s, Duplex: Full
  • try moving wdqs1004 to another port on the switch
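The first check in the list above (collecting `arp -n` output across the fleet and comparing it against the expected MAC) can be sketched as a small script. This is an illustrative sketch only; the sample `arp -n` rows below mirror the format of the real output but are not taken from the incident:

```python
# Given `arp -n` text dumps collected from several hosts (e.g. via cumin),
# list every MAC address claiming a given IP. A healthy result is exactly
# one MAC; more than one means an address collision.

def macs_for_ip(arp_outputs, ip):
    """arp_outputs: iterable of `arp -n` text dumps; returns the set of MACs seen for ip."""
    macs = set()
    for output in arp_outputs:
        for line in output.splitlines():
            fields = line.split()
            # `arp -n` rows look like: <ip> <hwtype> <mac> <flags> <iface>
            if len(fields) >= 3 and fields[0] == ip:
                macs.add(fields[2].lower())
    return macs

if __name__ == "__main__":
    # Hypothetical dumps from two different hosts' ARP caches.
    sample = [
        "Address HWtype HWaddress Flags Mask Iface\n"
        "10.64.0.17 ether 18:66:da:ac:49:d5 C eth0\n",
        "10.64.0.17 ether 78:2b:cb:2d:fa:e6 C eth0\n",
    ]
    seen = macs_for_ip(sample, "10.64.0.17")
    if len(seen) > 1:
        print("collision:", sorted(seen))
```

In the actual incident every host's cache happened to hold the correct MAC at the time of the check, which is why this pass came up clean; a collision only shows up if the caches are sampled while the rogue host is winning the ARP race.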
Gehel reassigned this task from Gehel to Cmjohnson.Feb 28 2018, 9:59 AM

@Cmjohnson this looks like an issue with the physical connection. Could you try moving the cable to another port on the switch so that we can eliminate that possibility?

I replaced the ethernet cable and was able to login

Debian GNU/Linux 8 auto-installed on Wed Sep 6 12:24:33 UTC 2017.
cmjohnson@wdqs1004:~$

Dzahn added a comment.Mar 1 2018, 5:42 PM

Thanks Chris! First I wasn't able to log in, then I was and could start a puppet run. Let's see if Icinga clears up and stays green for a while.

Mentioned in SAL (#wikimedia-operations) [2018-03-01T17:46:16Z] <mutante> re-enabling icinga notifications for wdqs1004 services, ethernet cable has been replaced (T188045)

Mentioned in SAL (#wikimedia-operations) [2018-03-01T17:47:27Z] <gehel> restarting wdqs-updater on wdqs1004 -T188045

Still getting this:

Mar 01 17:48:34 wdqs1004 bash[1382]: Exception in thread "main" java.lang.RuntimeException: java.net.UnknownHostException: www.wikidata.org

Looks like the problem is still there.

Dzahn added a comment.Mar 1 2018, 5:55 PM

Ack, I was able to connect to it and run ping for a while, but it's like it freezes sometimes and then continues. Also, Gehel could confirm there is still an issue. :/

Can we do the "try other switch port" step next @Cmjohnson ?

Dzahn added a comment.Mar 1 2018, 5:56 PM

It's intermittent. Most of the time it works, but then it doesn't...

[wdqs1004:~] $ host www.wikidata.org
www.wikidata.org has address 208.80.154.224
www.wikidata.org has IPv6 address 2620:0:861:ed1a::1
Dzahn added a comment.Mar 1 2018, 6:12 PM

from dmesg:

[203721.906779] tg3 0000:02:00.0 eth0: Link is down
[203725.760204] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[203725.760214] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[203725.760218] tg3 0000:02:00.0 eth0: EEE is disabled
[203782.391437] tg3 0000:02:00.0 eth0: Link is down
[203798.154227] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[203798.154236] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[203798.154239] tg3 0000:02:00.0 eth0: EEE is disabled
[206269.350612] tg3 0000:02:00.0 eth0: Link is down
[206289.517967] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[206289.517976] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[206289.517980] tg3 0000:02:00.0 eth0: EEE is disabled
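The flapping in the dmesg excerpt above can be quantified instead of eyeballed: count the down events and measure how long each outage lasted from the kernel `[seconds]` timestamps. A rough sketch, assuming input in the same format as the tg3 lines above:

```python
import re

# Count link up/down transitions for eth0 in tg3 dmesg output, and compute
# how long each outage lasted from the [seconds] kernel timestamps.
LINE = re.compile(r"\[\s*(\d+\.\d+)\] tg3 .* eth0: Link is (down|up)")

def link_flaps(dmesg_text):
    """Returns (number of link-down events, list of outage durations in seconds)."""
    events = [(float(ts), state) for ts, state in LINE.findall(dmesg_text)]
    downs = [ts for ts, s in events if s == "down"]
    outages = []
    down_at = None
    for ts, state in events:
        if state == "down":
            down_at = ts
        elif down_at is not None:
            outages.append(ts - down_at)
            down_at = None
    return len(downs), outages
```

Applied to the excerpt above this would report three flaps with outages of roughly 4, 16, and 20 seconds, which matches the "Carrier transitions: 649" counter XioNoX saw on the switch side.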

A ticket has been placed with Dell: "You have successfully submitted request SR961706803."

The system board for this is scheduled to be changed out Wednesday 07FEB18

yes march! I don't even know what month it is anymore..thx

The dell technician rescheduled for Thursday 8 March. Tentatively scheduled for 1800UTC.

Change 417311 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] dhcp mac address change wdqs1004

https://gerrit.wikimedia.org/r/417311

Change 417311 merged by Cmjohnson:
[operations/puppet@production] dhcp mac address change wdqs1004

https://gerrit.wikimedia.org/r/417311

Cmjohnson reassigned this task from Cmjohnson to RobH.Mar 8 2018, 5:37 PM

The system board has been swapped, and the BIOS and iDRAC have been updated. I also updated the MAC address in puppet. Assigning to @RobH for reinstall.

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201803091357_gehel_7493.log.

Change 417899 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: migrate to stretch

https://gerrit.wikimedia.org/r/417899

Change 417899 merged by Gehel:
[operations/puppet@production] wdqs: migrate to stretch

https://gerrit.wikimedia.org/r/417899

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201803091444_gehel_17927.log.

Completed auto-reimage of hosts:

['wdqs1004.eqiad.wmnet']

and were ALL successful.

Gehel added a comment.Mar 9 2018, 3:22 PM

Bad news... I just finished reimaging wdqs1004, and I still have trouble. SSH sessions suddenly/randomly freeze. I can't see the same link flapping that @Dzahn saw in dmesg (at least not yet).

@RobH / @Cmjohnson : any idea how to proceed?

Gehel added a comment.Mar 9 2018, 5:43 PM

Note that Icinga also has random failed checks (size of conntrack table, ferm, dpkg, ...) all with "Return code of 255 is out of bounds". The service (blazegraph) related checks are also failing, but that's expected, since the data isn't loaded yet.

RobH added a comment.Mar 9 2018, 5:43 PM

So I can see @Gehel asked:

@Cmjohnson this looks like an issue with the physical connection. Could you try moving the cable to another port on the switch so that we can eliminate that possibility?

I replaced the ethernet cable and was able to login
Debian GNU/Linux 8 auto-installed on Wed Sep 6 12:24:33 UTC 2017.
cmjohnson@wdqs1004:~$

So it looks like chris swapped the cable, it worked, and the port wasn't changed? It is unlikely to be a bad EX4200 port (I can only recall this happening once in my time here) but it is possible.

We are still seeing bad SSH connections, and it's likely ALL networking is affected; SSH is just the most obvious symptom.

I'd like to ask @Cmjohnson if the port was swapped on this? If not, can we go ahead and swap over the port?

Gehel added a comment.Mar 9 2018, 6:05 PM

@Cmjohnson did message me about changing the port as well, so that's probably not it.

Side note: if it wasn't clear from the gerrit change above, the OS has also been upgraded from jessie to stretch as part of the reimage.

Gehel added a comment.Mar 9 2018, 9:31 PM

@Cmjohnson I'll be on vacation starting March 18, and I would be more relaxed if I knew our wdqs eqiad cluster has its usual 3 nodes. Could you already rack one of the new cluster nodes (T188432) so that we can cannibalize one until wdqs1004 is healthy again?

faidon raised the priority of this task from High to Unbreak Now!.Mar 12 2018, 2:43 PM
faidon edited projects, added netops; removed Patch-For-Review.
faidon added a subscriber: faidon.
faidon@re0.cr1-eqiad> show arp no-resolve | match 10.64.0.17 
78:2b:cb:2d:fa:e6 10.64.0.17      ae1.1017                 none

faidon@re0.cr2-eqiad> show arp no-resolve | match 10.64.0.17 
d0:94:66:2f:ae:02 10.64.0.17      ae1.1017                 none

and

faidon@asw-a-eqiad> show ethernet-switching table | match d0:94:66:2f:ae:02    
  private1-a-eqiad  d0:94:66:2f:ae:02 Learn          0 ge-6/0/37.0

faidon@asw-a-eqiad> show ethernet-switching table | match 78:2b:cb:2d:fa:e6                       
  private1-a-eqiad  78:2b:cb:2d:fa:e6 Learn          0 ge-4/0/18.0

ge-6/0/37 is wdqs1004.

ge-4/0/18 has no description; how did that happen? What is connected there, why is the port enabled without a description, why did it use wdqs1004's IP and how come it's been idling like that for 20 days? @Cmjohnson could you debug this further ASAP?
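The diagnosis above boils down to two joins: the routers' ARP tables give the MACs claiming the IP, and the switch's ethernet-switching table maps each MAC to a physical port. Two distinct ports for one IP is exactly the rogue-host signature found here. A sketch of that lookup, assuming the simplified Junos output shapes quoted above:

```python
# Map MACs to switch ports from `show ethernet-switching table` output, then
# resolve every MAC claiming an IP (per the routers' ARP tables) to its port.

def mac_to_port(switching_table):
    """Parse rows of the form: <vlan> <mac> Learn <age> <port>."""
    mapping = {}
    for line in switching_table.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[2] == "Learn":
            mapping[fields[1].lower()] = fields[4]
    return mapping

def ports_claiming_ip(arp_macs, switching_table):
    """arp_macs: MACs the routers resolved the IP to. Returns {mac: port}."""
    table = mac_to_port(switching_table)
    return {mac.lower(): table.get(mac.lower(), "unknown") for mac in arp_macs}

if __name__ == "__main__":
    # The two rows quoted above from asw-a-eqiad.
    table = (
        "  private1-a-eqiad  d0:94:66:2f:ae:02 Learn          0 ge-6/0/37.0\n"
        "  private1-a-eqiad  78:2b:cb:2d:fa:e6 Learn          0 ge-4/0/18.0\n"
    )
    # cr1 and cr2 each cached a different MAC for 10.64.0.17.
    print(ports_claiming_ip(["d0:94:66:2f:ae:02", "78:2b:cb:2d:fa:e6"], table))
```

Because cr1 and cr2 each happened to cache a different winner of the ARP race, comparing their answers for the same IP was itself the tell: a healthy host shows the same MAC (and hence the same port) from every vantage point.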

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptMar 12 2018, 2:43 PM
RobH added a comment.EditedMar 12 2018, 11:08 PM

Ok, I had misparsed this, but after chatting with Faidon (and Arzhel) via IRC, I understand what has happened.

rcs1002 was decommissioned on T170157, and I did the steps. Somehow I checked off and updated the task with the port information for both hosts (so I was 100% on the switch stack) and then failed to commit the port disabling. At some point this host got powered back on, and its disks are NOT wiped, and it took back over this IP.

Yes, this is the exact reason we have the checklist (that I wrote). So this is 100% on me. I've disabled the rogue server's port, and re-audited the steps on T170157.

wdqs1004 is now responsive.

Well, if the server itself is needed, it will have to do its work with a different IP address than wdqs1004's, since it would have been suffering the same connection issues as wdqs1004 (just the other half of the time).

RobH added a comment.Mar 13 2018, 12:17 AM

Well, if the server itself is needed, it will have to do its work with a different IP address than wdqs1004's, since it would have been suffering the same connection issues as wdqs1004 (just the other half of the time).

The other server was decommissioned, not needed! So no worries.

Smalyshev added a comment.EditedMar 13 2018, 12:58 AM

Thanks @RobH! Created T189548: reload data on wdqs1004 for loading the data back. @Gehel, if you don't see anything else wrong then this one can be resolved. Since it's a kind of generic issue, maybe we should also make an incident report?

Gehel closed this task as Resolved.Mar 13 2018, 9:14 AM

Yay! Thanks @faidon for finding the issue!

@RobH / @Papaul : the symptoms of wdqs2006 mgmt interface look vaguely similar (T189318). Any chance that the cause would be similar?

@Smalyshev : I'll take care of the data reload on T189548

I'm closing this task as it looks very much resolved, I'll reopen if there is more trouble.

So post-mortem, I think there are 4 different things here:

  • T189519: Audit switch ports/descriptions/enable (and do this on an ongoing basis)
  • T189522: Detect IP address collisions
  • General enhancements on our server provisioning and decommissioning pipeline, which has a bunch of long-standing issues, but also requires a more dedicated long-term effort. I'm sure there's one or more tasks related to this, but more broadly, this work stream is something that has been incorporated into our (draft) annual plan as a major item next year.
  • (Tangential) Triage the decom queue in a more prompt way to avoid servers lingering for months after their service decom.
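The first follow-up (T189519, auditing switch ports) is exactly the check that would have caught the rogue rcs1002 port: an enabled port with no description should never exist. A toy version of that audit; the port-record shape here is a made-up assumption, not real Junos API output:

```python
# Flag any enabled switch port that has no description. In this incident,
# ge-4/0/18 was enabled, undescribed, and quietly answering ARP for
# wdqs1004's IP for 20 days.

def audit_ports(ports):
    """ports: iterable of dicts with 'name', 'enabled', 'description' keys.
    Returns the names of enabled ports missing a description."""
    return [p["name"] for p in ports
            if p["enabled"] and not p.get("description")]

if __name__ == "__main__":
    # Hypothetical inventory snapshot.
    inventory = [
        {"name": "ge-6/0/37", "enabled": True, "description": "wdqs1004"},
        {"name": "ge-4/0/18", "enabled": True, "description": ""},
        {"name": "ge-1/0/1", "enabled": False, "description": ""},
    ]
    for port in audit_ports(inventory):
        print("enabled port with no description:", port)
```

Run on an ongoing basis (as T189519 proposes), this turns a silent decommissioning mistake into an alert within one audit cycle rather than a multi-week debugging hunt.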

My apologies for all the wasted engineering time -- this is a pretty unfortunate and pretty basic issue that continues to bite us :(