Three ports on asw2-d-eqiad are not working as expected
Closed, Resolved (Public)

Description

During the debugging of T247561 the hosts kafka-jumbo1006 and stat1005 were moved to different switch ports that didn't really work.

Timeline:

  • stat1005 on ge-1/0/4 and kafka-jumbo1006 on ge-1/0/5 show up in icinga at the same time as DOWN
  • kafka-jumbo1006 is moved to ge-1/0/9 and regains connectivity
  • stat1005 is moved to ge-1/0/6 but still shows no connectivity
  • stat1005 is moved to ge-1/0/43 and regains connectivity

So at least 3 ports on asw2-d-eqiad are not working as expected: ge-1/0/4, ge-1/0/5 and ge-1/0/6

Arzhel suggested testing those ports with a laptop or similar to see whether they are really not working at all.

Event Timeline

I just attempted to use ge-1/0/6 and it did not work.

If they're dead:

  • Either we need them (e.g. because we are short on ports), in which case we need to replace the switch, which is a heavy operation.
  • Or we mark the ports as dead (with a mention of this task), disable them and call it a day.

If three ports have permanently failed, I'm not sure how we could ever trust that switch again. Perhaps it's better to do a painful but planned replacement rather than have it fail at some inconvenient time and have to rush a replacement then?

ayounsi changed the task status from Open to Stalled. May 19 2020, 7:29 AM
ayounsi removed Cmjohnson as the assignee of this task.
ayounsi triaged this task as Low priority.
ayounsi added a subscriber: Cmjohnson.

Sounds good! This will have to wait until we do, for example, T196487, and until we are outside of COVID times, as it's impactful and not urgent.

Change 623177 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] move dumps around on the snapshots in prep network upgrade work

https://gerrit.wikimedia.org/r/623177

Change 623177 merged by ArielGlenn:
[operations/puppet@production] move dumps around on the snapshots in prep for network upgrade work

https://gerrit.wikimedia.org/r/623177

@ayounsi I did some investigating on this today: servers have been plugged into 2 of the 3 ports (ge-1/0/5 and ge-1/0/6) for quite some time now with no issues. Maybe the issues we had were not related to the switch. Do you want to keep this open, or close it and re-visit if necessary?

ayounsi claimed this task.

Noted, thanks! Yeah fine to close for now, and re-open if any issues.