
Audit switch ports/descriptions/enable
Closed, Resolved · Public

Description

In reviewing T189111#4044259, it seems we have a large number of switch ports lacking descriptions. Specifically, the mw appservers lack them for much of the switch stacks from the initial build out.

We should remedy this, so we need to audit the ports for use, and then compare what the switch software thinks is there against what is actually there (ethernet switch table should help do this without physical tracing.)

We should also figure out a way (an alert or something?) to keep this from recurring.

Please audit the following list and remove each table line once it has been dealt with (port removed from the switch config, or description added on the switch side).

Device | Interface | LLDP neighbor | Notes

Event Timeline

RobH triaged this task as Medium priority.
Restricted Application added a subscriber: Aklapper.
faidon renamed this task from "audit codfw switch ports/descriptions/enable" to "Audit switch ports/descriptions/enable". Mar 12 2018, 6:35 PM
faidon added a project: ops-eqiad.
faidon updated the task description.
faidon subscribed.

I just ran into a similar thing today in eqiad with T188045, so I reworded the task to make it generic and cover both data centers. I also added a sentence about making sure this doesn't happen again, e.g. by adding an alert, or a Juniper SLAX script that ensures enabled ports always have a description.

ayounsi added subscribers: Papaul, Cmjohnson.

Only looking at the asw ports with link up for now, using LibreNMS:

@Papaul If the ports with LLDP neighbors are correct, I can mass add the descriptions.

@Papaul @Cmjohnson, all the other ports (with no neighbors) will need to be fixed manually.

I also added/tested an alert in LibreNMS (muted) that we can unmute/activate once the ports below are fixed.
EDIT: Moved the table to the task description.

For reference here is the MySQL query I used against the LibreNMS DB:

SELECT hostname, ifDescr, remote_hostname
FROM ports
LEFT JOIN links
  ON ports.port_id = links.local_port_id
INNER JOIN devices
  ON ports.device_id = devices.device_id
WHERE ifAdminStatus = 'up'
  AND ifOperStatus = 'up'
  AND ifSpeed != 0
  AND ifType = 'ethernetCsmacd'
  -- ifAlias that looks like a bare interface name (presumably no description set)
  AND ifAlias REGEXP '^(g|x)e-[0-9]+/[0-9]+/[0-9]+$'
  -- access switches only
  AND hostname REGEXP '^asw'
ORDER BY location;
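
As a rough illustration of how the rows returned by this query could be split into the two groups mentioned above (ports with an LLDP neighbor that can be labelled in bulk, and ports without one that need a manual check), here is a minimal Python sketch. The row layout mirrors the three selected columns. The function name `split_ports`, the sample rows, and the choice to use the neighbor's short host name as the description are assumptions for illustration, not the workflow actually used here.

```python
def split_ports(rows):
    """Split (hostname, ifDescr, remote_hostname) rows into Junos 'set'
    commands for ports with an LLDP neighbor and a list of ports without one."""
    commands = []
    manual = []
    for hostname, if_descr, remote_hostname in rows:
        if remote_hostname:
            # Assumption: the short host name is used as the port description.
            label = remote_hostname.split(".")[0]
            commands.append(
                (hostname, f"set interfaces {if_descr} description {label}")
            )
        else:
            # The LEFT JOIN yields NULL (None here) when no neighbor is known.
            manual.append((hostname, if_descr))
    return commands, manual


# Hypothetical sample rows, shaped like the query output.
rows = [
    ("asw-a-eqiad.mgmt.eqiad.wmnet", "ge-2/0/8", "db1009.eqiad.wmnet"),
    ("asw-b-eqiad.mgmt.eqiad.wmnet", "ge-4/0/4", None),
]
commands, manual = split_ports(rows)
```

The generated commands would still need a human check against what is physically connected before being committed.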

Just changed on asw switch stack:

robh@asw-a-eqiad# show | compare
[edit interfaces]
+   ge-2/0/0 {
+       description db1001;
+       disable;
+   }
+   ge-2/0/8 {
+       description db1009;
+   }
+   ge-2/0/10 {
+       description db1011;
+       disable;
+   }
+   ge-2/0/15 {
+       description db1016;
+       disable;
+   }

Added 4 labels, as we now know what is connected to those ports.

Removed them from the table in the description.

Slightly related:
I disabled all the interfaces on the access switches that are down and don't have a description, by adding them to interfaces interface-range disabled.
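
For illustration, the set commands for a change like this could be generated with a short Python sketch such as the one below. The `interface-range ... member` and `disable` statements are ordinary Junos configuration, and the range name `disabled` comes from the comment above. The function name and the sample port list are hypothetical, and the output should be reviewed before it is loaded on a switch.

```python
def disabled_range_commands(ports, range_name="disabled"):
    """Build Junos 'set' commands that add ports to an interface-range
    and mark that whole range as disabled."""
    # De-duplicate and sort so the generated config is stable between runs.
    commands = [
        f"set interfaces interface-range {range_name} member {port}"
        for port in sorted(set(ports))
    ]
    if commands:
        # One 'disable' statement on the range applies to all members.
        commands.append(f"set interfaces interface-range {range_name} disable")
    return commands


# Hypothetical list of down, undescribed ports.
cmds = disabled_range_commands(["ge-2/0/15", "ge-2/0/10", "ge-2/0/15"])
```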

That makes the output of https://librenms.wikimedia.org/ports/state=down/location=CyrusOne%2C+Carrollton%2C+Texas%2C+USA/format=list_basic/ a lot easier to understand.

Ideally we would have an alert if a port has been down (and not disabled) for longer than X weeks, for cleanup or other actions.
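
The check behind such an alert could look roughly like the Python sketch below. This is not a LibreNMS alert rule: the dictionary keys (`name`, `admin_up`, `oper_up`, `last_change`) and the default of 4 weeks are assumptions chosen for illustration, standing in for whatever the monitoring system records and for the "X weeks" threshold above.

```python
from datetime import datetime, timedelta


def stale_down_ports(ports, now, max_weeks=4):
    """Return names of ports that are administratively enabled, operationally
    down, and whose last state change is older than max_weeks."""
    cutoff = now - timedelta(weeks=max_weeks)
    return [
        p["name"]
        for p in ports
        if p["admin_up"] and not p["oper_up"] and p["last_change"] < cutoff
    ]


# Hypothetical port records.
ports = [
    # Enabled, down for months: should be reported.
    {"name": "ge-2/0/0", "admin_up": True, "oper_up": False,
     "last_change": datetime(2018, 1, 1)},
    # Already disabled: ignored.
    {"name": "ge-2/0/10", "admin_up": False, "oper_up": False,
     "last_change": datetime(2018, 1, 1)},
    # Enabled but only recently down: ignored.
    {"name": "ge-2/0/15", "admin_up": True, "oper_up": False,
     "last_change": datetime(2018, 4, 1)},
]
stale = stale_down_ports(ports, now=datetime(2018, 4, 10))
```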

I cleaned up the list in the description; it's now down to 5 ports in codfw and 5 in eqiad, so it should be quick to audit.

Tossed a coin and assigned it to @Papaul to start, please assign it to Chris when you're done and then back to me so I can enable the alerts.

Note that comment T189519#4102901 is still valid and those extra ports need to be audited too (most likely put in the disabled group).

@ayounsi osm-web2001, db2021, db2022 and db2024 are not showing in Icinga, so I don't know the status of those servers.

@Cmjohnson hey Chris assigning you the task so you can do your audit. Once done you can assign it to @ayounsi.

Thanks.

Thanks, switch port description updated.

@ayounsi assigning this to you. Everything has been updated on the switch: I verified what the ports were and disabled them.

asw-b-eqiad.mgmt.eqiad.wmnet ge-4/0/4 NULL Not configured at all, no traffic/MAC. This server is vanadium, which is on the decom list (T182015).
asw-b-eqiad.mgmt.eqiad.wmnet ge-4/0/11 NULL No MAC learned, in analytics vlan. This server is zinc, which is on the decom list (T191352).

asw2-d-eqiad.mgmt.eqiad.wmnet xe-7/0/6 NULL Only 1 way traffic, Decommissioned mc1017
asw2-d-eqiad.mgmt.eqiad.wmnet xe-7/0/7 NULL Only 1 way traffic, Decommissioned mc1018