Audit switch ports/descriptions/enable
Closed, Resolved · Public

Description

In reviewing T189111#4044259, it seems we have a large number of switch ports lacking descriptions. Specifically, the mw appservers lack them for much of the switch stacks from the initial build out.

We should remedy this, so we need to audit the ports for use, and then compare what the switch software thinks is there against what is actually there (ethernet switch table should help do this without physical tracing.)

We should also figure out a way (an alert or something?) to keep this from recurring.

Please audit the following list and remove each table line once it has been dealt with (port removed from the switch config, or description added on the switch side).

Device | Interface | LLDP neighbor | Notes

Event Timeline

RobH triaged this task as Medium priority.Mar 12 2018, 6:21 PM
RobH created this task.
Restricted Application added a project: Operations. · View Herald TranscriptMar 12 2018, 6:21 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
faidon renamed this task from audit codfw switch ports/descriptions/enable to Audit switch ports/descriptions/enable.Mar 12 2018, 6:35 PM
faidon added a project: ops-eqiad.
faidon updated the task description. (Show Details)
faidon added a subscriber: faidon.

I just ran into a similar thing today in eqiad with T188045, so I reworded the task to make it generic and for both data centers. I also added a sentence to make sure this doesn't happen again, e.g. by adding an alert, or a Juniper slax script to make sure enabled ports always have a description.
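As a rough illustration of the "enabled ports always have a description" check, here is a minimal Python sketch. It is not the alert or SLAX script itself. The `Port` record, the function name, and the sample device and port names are all hypothetical. It assumes the port data has already been pulled from somewhere (switch config, LibreNMS, etc.).

```python
from dataclasses import dataclass


@dataclass
class Port:
    # Hypothetical record; field names are illustrative only.
    device: str
    name: str
    enabled: bool
    description: str = ""


def undescribed_enabled_ports(ports):
    """Return the enabled ports whose description is empty or whitespace."""
    return [p for p in ports if p.enabled and not p.description.strip()]


# Made-up sample data, not taken from any real switch.
sample_ports = [
    Port("asw-example", "ge-0/0/0", enabled=True, description="server-a"),
    Port("asw-example", "ge-0/0/1", enabled=True),    # enabled, no description
    Port("asw-example", "ge-0/0/2", enabled=False),   # disabled, so ignored
]

for port in undescribed_enabled_ports(sample_ports):
    print(f"{port.device} {port.name}: enabled but has no description")
```

Disabled ports are deliberately skipped, so a port that is both disabled and undescribed does not raise noise.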

ayounsi claimed this task.EditedMar 13 2018, 10:19 AM
ayounsi added subscribers: Papaul, Cmjohnson.

Only looking at the asw ports with link up for now, using LibreNMS:

@Papaul If the ports with LLDP neighbors are correct, I can mass add the descriptions.

@Papaul @Cmjohnson, all the other ports (with no neighbors) will need to be fixed manually.

I also added/tested an alert in LibreNMS (muted) that we can unmute/activate once the ports below are fixed.
EDIT: Moved the table to the task description.

For reference here is the MySQL query I used against the LibreNMS DB:

SELECT hostname, ifDescr, remote_hostname
FROM ports
LEFT JOIN links
  ON ports.port_id = links.local_port_id
INNER JOIN devices
  ON ports.device_id = devices.device_id
-- Only access-switch Ethernet ports that are up and whose alias is still the bare interface name
WHERE ifAdminStatus = 'up'
  AND ifOperStatus = 'up'
  AND ifSpeed != 0
  AND iftype = 'ethernetCsmacd'
  AND ifAlias REGEXP '^(g|x)e-[0-9]+/[0-9]+/[0-9]+$'
  AND hostname REGEXP '^asw'
ORDER BY location;
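
To show what the ifAlias pattern in the query above selects, here is a small Python sketch using the same regular expression. My reading, which is an assumption, is that an alias which is just the bare interface name means no description was set. The function name is hypothetical.

```python
import re

# Same pattern as the ifAlias REGEXP in the query above.
BARE_INTERFACE_NAME = re.compile(r"^(g|x)e-[0-9]+/[0-9]+/[0-9]+$")


def looks_undescribed(if_alias: str) -> bool:
    """True when the alias is only an interface name such as ge-2/0/8."""
    return BARE_INTERFACE_NAME.match(if_alias) is not None


for alias in ["ge-2/0/8", "xe-7/0/6", "db1009", "ae1"]:
    print(alias, looks_undescribed(alias))
```

Only `ge-` and `xe-` names with three numeric components match, so aggregate interfaces and real descriptions are left out.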
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board.Mar 14 2018, 7:11 PM
ayounsi updated the task description. (Show Details)Mar 23 2018, 5:09 PM
RobH added a comment.Mar 23 2018, 5:15 PM

Just changed on the asw switch stack:

robh@asw-a-eqiad# show | compare
[edit interfaces]
+   ge-2/0/0 {
+       description db1001;
+       disable;
+   }
+   ge-2/0/8 {
+       description db1009;
+   }
+   ge-2/0/10 {
+       description db1011;
+       disable;
+   }
+   ge-2/0/15 {
+       description db1016;
+       disable;
+   }

Added four descriptions, now that we know what is connected to those ports.

ayounsi updated the task description. (Show Details)Mar 23 2018, 5:21 PM

Removed them from the table in the description.

Slightly related:
I disabled all the interfaces on the access switches that are down and don't have a description, by adding them to interfaces interface-range disabled.

That makes the output of https://librenms.wikimedia.org/ports/state=down/location=CyrusOne%2C+Carrollton%2C+Texas%2C+USA/format=list_basic/ a lot easier to understand.
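
For the interface-range change mentioned above, here is a hedged Python sketch that builds Junos `set` commands for a list of ports. The range name `disabled` comes from the comment. The helper name and sample port names are hypothetical. The final `disable` statement on the range is my assumption about how the range disables its members, not a copy of the actual config.

```python
def disable_range_commands(ports, range_name="disabled"):
    """Build Junos set commands adding ports to an interface-range.

    Assumption: the range carries a 'disable' statement, so every
    member port is administratively disabled.
    """
    commands = [
        f"set interfaces interface-range {range_name} member {port}"
        for port in sorted(set(ports))  # de-duplicate, stable order
    ]
    if commands:
        commands.append(f"set interfaces interface-range {range_name} disable")
    return commands


# Made-up port names for illustration.
for line in disable_range_commands(["ge-0/0/7", "ge-0/0/5", "ge-0/0/5"]):
    print(line)
```

The output should be reviewed with `show | compare` before committing, as with any generated config.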

Ideally we would have an alert if a port has been down (and not disabled) for longer than X weeks, for cleanup or other actions.
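
A minimal sketch of that "down but not disabled for longer than X weeks" check, in Python. The dictionary keys, the function name, and the sample data are hypothetical. It assumes each port record carries its admin state, oper state, and the time of its last state change.

```python
from datetime import datetime, timedelta


def stale_down_ports(ports, now, max_weeks=4):
    """Return names of ports that are admin-up, oper-down, and unchanged
    for at least max_weeks."""
    cutoff = now - timedelta(weeks=max_weeks)
    return [
        p["name"]
        for p in ports
        if p["admin_up"] and not p["oper_up"] and p["last_change"] <= cutoff
    ]


# Made-up sample data for illustration.
NOW = datetime(2018, 7, 1)
sample = [
    {"name": "ge-0/0/1", "admin_up": True, "oper_up": False,
     "last_change": datetime(2018, 3, 1)},    # down for months: stale
    {"name": "ge-0/0/2", "admin_up": True, "oper_up": False,
     "last_change": datetime(2018, 6, 25)},   # went down recently
    {"name": "ge-0/0/3", "admin_up": False, "oper_up": False,
     "last_change": datetime(2018, 1, 1)},    # disabled on purpose
    {"name": "ge-0/0/4", "admin_up": True, "oper_up": True,
     "last_change": datetime(2018, 1, 1)},    # link is up
]

print(stale_down_ports(sample, NOW, max_weeks=4))
```

Passing `now` in explicitly keeps the check easy to test against fixed dates.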

Cmjohnson moved this task from Up next to Not urgent on the ops-eqiad board.Apr 24 2018, 3:09 PM
Cmjohnson updated the task description. (Show Details)Apr 25 2018, 3:38 PM
ayounsi updated the task description. (Show Details)Apr 25 2018, 3:52 PM
Cmjohnson updated the task description. (Show Details)Apr 25 2018, 3:54 PM
ayounsi updated the task description. (Show Details)Apr 25 2018, 4:01 PM
Cmjohnson updated the task description. (Show Details)Apr 25 2018, 5:12 PM
ayounsi removed ayounsi as the assignee of this task.Apr 25 2018, 7:49 PM
ayounsi updated the task description. (Show Details)May 30 2018, 7:32 AM
ayounsi updated the task description. (Show Details)Jul 3 2018, 10:05 PM
ayounsi assigned this task to Papaul.Jul 3 2018, 10:10 PM

I cleaned up the list in the description. It's now down to 5 ports in codfw and 5 in eqiad, so it should be quick to audit.

Tossed a coin and assigned it to @Papaul to start, please assign it to Chris when you're done and then back to me so I can enable the alerts.

Note that comment T189519#4102901 is still valid and those extra ports need to be audited too (most likely put in the disabled group).

Papaul updated the task description. (Show Details)Jul 9 2018, 3:06 PM

@ayounsi osm-web2001, db2021, db2022 and db2024 are not showing in Icinga, so I don't know the status of those servers.

Papaul reassigned this task from Papaul to Cmjohnson.Jul 9 2018, 3:19 PM

@Cmjohnson hey Chris assigning you the task so you can do your audit. Once done you can assign it to @ayounsi.

Thanks.

ayounsi updated the task description. (Show Details)Jul 9 2018, 3:33 PM

Thanks, switch port description updated.

ayounsi updated the task description. (Show Details)Jul 26 2018, 10:50 PM
Cmjohnson updated the task description. (Show Details)Aug 9 2018, 3:36 PM

@ayounsi assigning this to you. Everything has been updated on the switch; I verified what the ports were and disabled them.

asw-b-eqiad.mgmt.eqiad.wmnet ge-4/0/4 NULL Not configured at all, no traffic/MAC ___ this server is vanadium, which is on the decom list (T182015)
asw-b-eqiad.mgmt.eqiad.wmnet ge-4/0/11 NULL No MAC learned, in analytics vlan ___ this server is zinc, which is on the decom list (T191352)

asw2-d-eqiad.mgmt.eqiad.wmnet xe-7/0/6 NULL Only 1 way traffic, Decommissioned mc1017
asw2-d-eqiad.mgmt.eqiad.wmnet xe-7/0/7 NULL Only 1 way traffic, Decommissioned mc1018

@ayounsi is this okay to close?