User Details
- User Since: Apr 3 2017, 6:23 PM (414 w, 5 d)
- Availability: Available
- IRC Nick: xionox
- LDAP User: Ayounsi
- MediaWiki User: AYounsi (WMF)
Wed, Mar 12
@dcaro you might find this useful https://gerrit.wikimedia.org/r/c/operations/alerts/+/1126966
Chatted about it with Cathal on IRC; the gNMIc daemon just needed a restart.
Tue, Mar 11
That's great, thanks for digging into it.
Mon, Mar 10
The RIPE Atlas hosts are in a special sandbox VLAN, which behaves like the internet (no special permissions into our network).
alert1002 has a public IP, so the RIPE Atlas host is reachable like any internet host. However, the Prometheus hosts are internal, so pings can't go out (or back in), as expected.
Obviously the HTTP proxies don't work for pings. I don't know what the best course of action is here.
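For context, a minimal sketch of what an ICMP check looks like when done via blackbox-exporter on the Prometheus side (module name, target and exporter address are illustrative, not our actual config). The ping is emitted by the exporter itself, so unlike HTTP probes there is no proxy it could go through; it only works from a host that can reach the target directly:
```
# blackbox-exporter module definition (separate file), illustrative:
modules:
  icmp:
    prober: icmp
    timeout: 5s

# Prometheus scrape job pointing at that module, illustrative:
scrape_configs:
  - job_name: ripe_atlas_icmp
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: ['ripe-atlas-anchor.example.org']  # hypothetical target
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # the local blackbox-exporter
```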
Nice!
Wed, Mar 5
I also added that metric to this dashboard as an example visualization: https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1
They're mandatory on long-distance links, as we've had issues with the interface status being up but the provider not forwarding traffic through said link. For local (direct) links I don't think BFD is needed, as it's unlikely that the interface would be up without forwarding data.
So what about:
- turnilo full dimensions - 1 month
- turnilo sanitized/reduced - 12 months
- everything else, in data lake?
Tue, Mar 4
Mostly to be able to see long term trends, for example per destination AS.
Let's see what other people think, but I think it would be fine to:
- Keep only 1 month of non-sanitized data, as that data is especially important for its real-time/short-term aspect
- Reduce the sanitized data, for example by sampling it even further; as long as we can keep historical trends, it's enough.
- Remove those 3 dimensions from the sanitized data: "parsed_comms", "as_name_src", "as_name_dst"
Not a strong feeling, but I usually try to steer towards the leaner option. So in that case it's to remove BFD between cr1/2-codfw.
Looking at https://github.com/wikimedia/operations-homer-public/blob/master/config/common.yaml#L164 it seems like only cr1/2-codfw is impacted, all the others already don't have BFD.
Automation-wise, we could probably automate "no metric = no BFD".
Mon, Mar 3
Wed, Feb 26
I'm wondering what the benefit is of having the additional metrics saved for other network devices? As in, what alerts will we have, and what downtime configuration, where that will help us? Consider a link from cr1-codfw to cr2-drmrs. Our normal alerting will fire for both of these if the session goes down. Is there a way to reconfigure that so we still always get alerted when we need to, but we can suppress any alerts if only one side is downtimed? Or is there another benefit to having them?
No strong feelings, but a part of it is to treat network devices a bit like servers, having fewer exceptions.
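To illustrate the suppression idea from the question above, here is a rough sketch using an Alertmanager inhibition rule; the alert names and the "device" label are hypothetical, and in practice a downtime may instead be an Alertmanager silence, which this sketch doesn't cover:
```
# Sketch only: while a (hypothetical) downtime alert is firing for a device,
# inhibit BGP session alerts that share the same device label.
inhibit_rules:
  - source_matchers:
      - 'alertname = DeviceDowntimed'  # hypothetical downtime indicator
    target_matchers:
      - 'alertname = BGPSessionDown'   # hypothetical BGP alert name
    equal: ['device']
```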
Step 1, create/trunk the vlan to the hypervisors - https://netbox.wikimedia.org/ipam/vlans/?q=sandbox1&site_id=11&site_id=9&site_id=6&site_id=8
Another question is how to name those new metrics?
One suggestion, to stay generic as well, is to do something like
gnmi_bgp_neighbor_session_state -> remote_bgp_neighbor_session_state (or peer_bgp_neighbor_session_state)
What do you think?
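As a rough sketch of what the proposed naming could look like with a Prometheus recording rule (the "neighbor_address" label used for the re-keying is an assumption, not necessarily what the exporter exposes):
```
groups:
  - name: remote_bgp_neighbor_sketch
    rules:
      # Re-expose the router's view of each BGP session under the proposed
      # remote_* name, keyed by the peer address so it can be matched to a host.
      - record: remote_bgp_neighbor_session_state
        expr: |
          label_replace(
            gnmi_bgp_neighbor_session_state,
            "instance", "$1", "neighbor_address", "(.*)"
          )
```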
I forked the discussion to T387287: Prometheus: attach host's BGP/interface remote side metrics and T387288: Add prometheus-bird-exporter sidecar to calico-node pods as that task was becoming more difficult to read.
Tue, Feb 25
All done! There was no diff, as expected in the best-case scenario.
And what happens if peer_descr is missing or empty?
Good question, in that case the instance label will be ":0" since $1 will be empty. Is a missing peer_descr something that can happen?
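For context, a relabel rule of roughly this shape would produce that behaviour (the actual rule and the ":0" suffix construction may differ):
```
# Sketch: build the instance label from peer_descr. With an empty peer_descr,
# $1 expands to nothing and the resulting instance is just ":0".
metric_relabel_configs:
  - source_labels: [peer_descr]
    regex: '(.*)'
    target_label: instance
    replacement: '$1:0'
```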
No more errors.
Mon, Feb 24
@MatthewVernon that's correct. Thanks!
Our datacenter engineering team has concluded the on-site activity, and no problems were found on our side. Could you please confirm if this has improved the situation and ceased the errors on your side?
We performed a fiber cleaning and checked all the physical components.
Enabling traceoptions shows a "no shared cipher" error on the switch:
Feb 24 09:33:58 ssl_transport_security.c:948: Handshake failed with fatal error SSL_ERROR_SSL: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher.
Feb 24 09:33:58 chttp2_server.c:83: Handshaking failed: {"created":"@1740389638.118250387","description":"Handshake failed","file":"../../../../../../../src/external/bsd/grpc/dist/src/core/lib/security/transport/security_handshaker.c","file_line":276,"tsi_code":10,"tsi_error":"TSI_PROTOCOL_FAILURE"}
Feb 24 09:34:02 ssl_transport_security.c:948: Handshake failed with fatal error SSL_ERROR_SSL: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher.
Feb 24 09:34:02 chttp2_server.c:83: Handshaking failed: {"created":"@1740389642.111017067","description":"Handshake failed","file":"../../../../../../../src/external/bsd/grpc/dist/src/core/lib/security/transport/security_handshaker.c","file_line":276,"tsi_code":10,"tsi_error":"TSI_PROTOCOL_FAILURE"}
The switch is running a Junos version that's too old for analytics-agent. I tried "cloudsw2-d5-eqiad> restart SDN-Telemetry gracefully" instead, but that didn't work.
Wed, Feb 19
Nope, thanks for the ping.
There is now T364092: Upgrade core routers to Junos 23.4R2
Thanks @bd808
Please detach 'Ayounsi' from SUL, rename it to 'AYounsi (WMF)', and reattach to SUL.
Thanks! You can remove the now obsolete references from the ospf section in https://github.com/wikimedia/operations-homer-public/blob/master/config/common.yaml
Feb 5 2025
Good idea regarding BFD. From https://supportportal.juniper.net/s/article/Observing-BGP-IO-ERROR-CLOSE-SESSION-error-logs-when-BGP-protocolgoes-down?language=en_US it seems like JTAC is the next step :(
Sure, as usual for power/console/mgmt.
Regarding production ports:
On the ssw1 side: use et-0/0/7 towards e8 and et-0/0/15 towards f8. They're currently used by the links to the Dell switches, so no need to re-run new cables.
Cool, let's start with pc2011 to validate the workflow, then if all good we can iterate faster on the other hosts.
Feb 4 2025
Just coming back, I'm also curious about the upcoming FR-tech changes, is that discussed somewhere?
Let's decom it and focus our efforts on spinning up VMs instead (T385560).
It needs to be removed from the list on https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common.yaml#L1881 as well as from RIPE's dashboard. Then shut down the switch port, reclaim the IP and hand it over to DCops to be recycled in the next DC visit. There is no confidential data on the box so no need for a wipe.
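For illustration, the hieradata change is just dropping the host from the relevant list, along these lines (the key and hostnames are made up, not the actual entries at that line):
```
# Hypothetical shape of the list in hieradata/common.yaml:
profile::netmon::ripe_atlas_anchors:
  - ripe-atlas-anchor-a.example.org
  # ripe-atlas-anchor-b.example.org removed as part of this decom
```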
For (1) we can have the sre.ganeti.addnode cookbook call the PuppetDBImport script towards the end. What do you and @MoritzMuehlenhoff think?
Feb 3 2025
I assume no IP changes would happen right?
That's correct.
Looks all good to me!
An alternative (or short-term solution until the above/cleaner one is live) is to not alert for sessions on the router side towards k8s nodes, and only monitor BGP from the k8s side.
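A sketch of what that short-term option could look like as an alerting rule; the metric and label names are assumptions for illustration, not what the exporter actually exposes:
```
groups:
  - name: router_bgp_sketch
    rules:
      # Alert on router-side BGP sessions being down, but skip peers whose
      # group marks them as Kubernetes nodes (monitored from the k8s side).
      - alert: RouterBGPSessionDown
        expr: bgp_peer_up{peer_group!~"(?i).*kubernetes.*"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'BGP session down on {{ $labels.instance }}'
```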
Nov 22 2024
Nov 21 2024
Ticket ID 30654682 has been successfully created.
Nov 18 2024
Nov 15 2024
Because of the various limitations listed in {T342673} (plus the ones from pygnmi) we're not going to proceed any further on Dell SONiC, focusing on {T371088} now.
Thanks for dictdiffer; because of a change in priorities and current limitations in pyGNMI, there is no longer a need to package it.
Going to close that task as we're not planning on using gNMI for automation any further, due to various shortcomings in the existing Python gNMI library. We're alternatively looking into JSON-RPC, see T371088#10272661 for example.
Cool, nothing urgent. In that case please let us know when you can which hosts you want to migrate (or the ones that are not worth it); we can then figure out a plan of attack.
Nov 14 2024
Nov 13 2024
Interesting idea, definitely worth a try. I'm particularly curious about how routing between VMs would work in that setup, and where to apply filtering. But not requiring multihop would be a plus.
Nov 12 2024
Updated :)
Nov 7 2024
If it's a bug on the switch it's probably worth opening a JTAC ticket. Even if it's not fixed in time for us, they could provide a workaround or fix it in the longer run.
Nov 5 2024
Another point, after running the script, the changelog on a problematic interface shows 3 changes (for that interface) in the same transaction:
- "updated" Post-Change Data looks like what we want (disabled, no vlans, no mtu, cable still attached).
- "updated" that "reverts" the values we don't want to keep <- that's the odd one
- "delete" that removes the cable termination, as expected
Oct 31 2024
@bking I think it's a question worth asking, but probably not in that task :) Could you open a dedicated one for the Procurement/DCops team?
Had a chat with Riccardo on IRC, here is the new list I came up with:
@Papaul 54, but that only included rows A and B; now C and D are also eligible for a free 10G upgrade when available.