User Details
- User Since
- Apr 3 2017, 6:23 PM (338 w, 6 d)
- Availability
- Available
- IRC Nick
- xionox
- LDAP User
- Ayounsi
- MediaWiki User
- AYounsi (WMF)
Wed, Sep 27
Thanks, I opened T347494: Remove static routes for anycast prefixes to get rid of them. You can use 10.3.0.2/32 for the NTP VIP.
Yeah, actually you can use 10.3.0.2/32 for NTP, I won't go through renumbering the syslog VIP.
Thanks, as this VIP won't be critical we can skip the static routes and only allocate 10.3.0.8/32.
FYI, the mgmt_junos bug (also present on the fasw) might not be fixed by an upgrade, but maybe by the solution described in https://www.reddit.com/r/Juniper/comments/mvq8hf/comment/j7gd6hq/
set interface em0.0 family inet address 10.XXX.XXX.XXX/XX master-only
To keep it somewhere for later, on Dell SONiC it should be on the /openconfig-qos:qos/interfaces path.
Grouping it by source/interface_interface-id/pfc-priority_dot1p and displaying it by "events", gNMI returns this:
{
  "name": "interfaces-states",
  "timestamp": 1695808554638790732,
  "tags": {
    "interface_interface-id": "Ethernet9",
    "pfc-priority_dot1p": "7",
    "source": "lsw1-e8-eqiad.mgmt.eqiad.wmnet:8080",
    "subscription-name": "interfaces-states"
  },
  "values": {
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/config/dot1p": 7,
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/config/enable": false,
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/dot1p": 7,
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/state/dot1p": 7,
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/state/enable": false,
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/state/statistics/pause-frames-rx": 0,
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/state/statistics/pause-frames-tx": 0
  }
}
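For later reference, here is a minimal sketch of how such events could be regrouped on the consumer side, keyed by source, interface and PFC priority. The function name `group_pfc_events` and the trimmed `sample_event` are hypothetical and only for illustration; this is not how gnmic or the production pipeline processes the data.

```python
import json
from collections import defaultdict

# Common prefix of the PFC leaves seen in the event above.
PFC_PREFIX = (
    "/openconfig-qos:qos/interfaces/interface/pfc/"
    "pfc-priorities/pfc-priority/"
)


def group_pfc_events(events):
    """Group PFC values by (source, interface, dot1p priority).

    Hypothetical helper: `events` is a list of dicts shaped like the
    gNMI "event" output shown above.
    """
    grouped = defaultdict(dict)
    for event in events:
        tags = event.get("tags", {})
        key = (
            tags.get("source"),
            tags.get("interface_interface-id"),
            tags.get("pfc-priority_dot1p"),
        )
        for path, value in event.get("values", {}).items():
            if path.startswith(PFC_PREFIX):
                # Keep only the part of the path after the shared prefix.
                grouped[key][path[len(PFC_PREFIX):]] = value
    return dict(grouped)


# Trimmed copy of the event above (assumption: same shape, fewer leaves).
sample_event = json.loads("""
{
  "name": "interfaces-states",
  "timestamp": 1695808554638790732,
  "tags": {
    "interface_interface-id": "Ethernet9",
    "pfc-priority_dot1p": "7",
    "source": "lsw1-e8-eqiad.mgmt.eqiad.wmnet:8080",
    "subscription-name": "interfaces-states"
  },
  "values": {
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/state/enable": false,
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/state/statistics/pause-frames-rx": 0,
    "/openconfig-qos:qos/interfaces/interface/pfc/pfc-priorities/pfc-priority/state/statistics/pause-frames-tx": 0
  }
}
""")

if __name__ == "__main__":
    for key, leaves in group_pfc_events([sample_event]).items():
        print(key, leaves)
```

Keying on the tags rather than on the full path keeps one entry per interface and priority, which should make the pause-frame counters easier to graph.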
It's great to see momentum on this recurring pain point!
Tue, Sep 26
I think we can close that one. @RobH did the audit afaik.
That's a great idea! Opened {T347403}
Thanks, I remembered there was a reason but forgot what it was!
My understanding is that you're one step ahead of Prod here as you're deriving host networking based on Netbox data (eg. rack from vlan, etc) so you might catch new issues.
We should look at provisioning from beginning to end so we can mutualise the efforts here.
We shouldn't alert on NIC saturation (or related counters) in the current state of things (unless we can redirect the alerts to the relevant teams). But we need to alert on errors caused by faulty NICs or faulty cables (anything L1) like we do for network devices.
Mon, Sep 25
This might need to be rolled back the day we start doing BGP unnumbered between spine and leaf as it seems to rely on it: https://www.theasciiconstruct.com/post/junos-bgp-and-bgp-unnumbered/#ipv6-configuration-for-bgp-unnumbered
Deployed
Fri, Sep 22
Personally I've no objection to the first option, just allowing it. But as you mention the policy and overall shape of things in terms of the "cross realm guidelines" needs to be considered. @ayounsi have you any thoughts here?
Prometheus monitors endpoints outside of WMF's network through the proxies, see T303803: Prometheus use of Squid proxies. Would that work for that usecase?
Thu, Sep 21
During the setup, the team asked for a couple more items. Can you also share the "aud" (audience) and "cid" (clientId) values from the ID token?
NTP automation:
Even if Debian Installer supports a comma-separated list of NTP servers (to be tested?), some special appliances (like PDUs) only support 1 or 2 NTP servers.
So while it's best to have many servers configured (see for example https://labs.ripe.net/author/christer-weinigel/best-practices-for-connecting-to-ntp-servers/ ), which Puppet does well, we need to have a "catch-all" option. For day-to-day maintenance there is no need to remove an NTP server from the Puppet-managed "timesyncd.conf" file.
FYI, the underlying IRC library seems to support proxies https://github.com/aatxe/irc#configuring-irc-clients
Indeed and hosts on public IPs have a much larger attack surface so they should be a last resort option. The ircbot might need to be audited too if it connects to servers outside of WMF.
@cmooney I think this can be closed?
This is done for now, more improvements to come from Dell, tracked in T342673.
Thanks, I spent a bit more time on that.
Tue, Sep 19
This triggered a Netbox report alert: ganeti2014 (WMF6747) mismatched serials: XXXXX (netbox) != YYYYY (puppetdb)
https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/
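To illustrate the kind of check behind that alert, here is a rough sketch that compares serials from two sources and formats mismatches the same way. The function `find_serial_mismatches` and its dict inputs are hypothetical; this is not the actual Netbox report code, and the serials are placeholders.

```python
def find_serial_mismatches(netbox_serials, puppetdb_serials):
    """Return one message per host whose serial differs between sources.

    Hypothetical helper: both arguments map a hostname to a serial string.
    Hosts present in only one source are ignored in this sketch.
    """
    mismatches = []
    for host in sorted(netbox_serials.keys() & puppetdb_serials.keys()):
        nb_serial = netbox_serials[host]
        pdb_serial = puppetdb_serials[host]
        if nb_serial != pdb_serial:
            mismatches.append(
                f"{host} mismatched serials: "
                f"{nb_serial} (netbox) != {pdb_serial} (puppetdb)"
            )
    return mismatches


if __name__ == "__main__":
    # Placeholder serials, as in the alert text above.
    print(find_serial_mismatches({"ganeti2014": "XXXXX"}, {"ganeti2014": "YYYYY"}))
```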
All good now.
Re-opening as the LibreNMS report needs to be updated to handle those discrepancies.
The support contract is different on the old vs. new licensing, so we need to be able to verify that the proper support is applied to our switches.
Mon, Sep 18
We had a quick chat on IRC.
The checklist is heavily oriented towards extensions and skins.
Yup, that was it.
Quick look makes me think that's related to devices that have been deleted.
Probably a combination of latency (distance between netmon1003 and eqsin) with an increasing number of BGP peers.
Based on https://librenms.wikimedia.org/graphs/type=device_poller_modules_perf/device=159/from=1694940900/ most time is spent on BGP peers. That is true for all routers, whereas switches spend most of it on ports, which makes sense.
Fri, Sep 15
@jbond from Juniper:
Thanks, that's related to T336511: Access port speed <= 100Mbps False positives and I just removed the alert.
I removed the alert as it was also causing problems in T346317: Alert "access port speed less 100mbit" and librenms upgrade.
Good point! That was done before the VXLAN deployment to have more predictability on the anycast traffic to the end servers.
Thu, Sep 14
@CDanis Is that still needed now that we have NEL?
Wed, Sep 13
Tue, Sep 12
FYI:
@andrea.denisse is there a task for this blocking issue? More and more people are going to upgrade to bookworm, so thanks for finding those bugs.
Mon, Sep 11
Unfortunately the errors are back. Even though there aren't many, it's still better to fix the issue.
Fri, Sep 8
Thu, Sep 7
Probably not, probably, probably not.
Thanks, we had a quick chat on IRC about that and indeed that's the current conclusion. The extra details you provided (and fix suggestions) are welcome too!
Please open a new task for that.
Wed, Sep 6
I thought that was not possible but it got introduced recently (in 16.1).
Thanks for the update, it all makes sense to me!
Tue, Sep 5
Sounds good to me!
Mon, Sep 4
This is now working in prod.
Sep 1 2023
Aug 31 2023
Thanks, I submitted the on-boarding form, let's see what happens now.
Rolled everywhere, another example, cr1-codfw:
Prefix                Nexthop    MED    Lclpref    AS path
* 185.15.56.0/24      Self                         ?
* 185.15.57.0/24      Self                         ?
* 185.71.138.0/24     Self                         I
* 198.35.27.0/24      Self                         I
* 198.73.209.0/24     Self                         11820 ?
* 208.80.152.0/23     Self                         I

Prefix                Nexthop    MED    Lclpref    AS path
* 185.15.57.0/24      Self                         ?
* 185.71.138.0/24     Self                         I
* 198.35.27.0/24      Self                         I
* 208.80.152.0/23     Self                         I
The SF office as well as eqiad WMCS range are gone, but codfw WMCS is still there.
@jbond from Juniper, does it make sense?
“If the customer would like to use OIDC they enter in their token for us to use and authenticate. The vast majority of users sign up requesting OAuth2.0 where we’ll build them credentials instead and share with the customer.
FYI there is now a pending diff for:
[edit forwarding-options dhcp-relay]
+   /* T337345 */
+   forward-snooped-clients non-configured-interfaces;
On the L3 switches. That's because the latest patch moves the statement outside of the if NOT l3_switch (else) block.
From my understanding that's the expected behavior, but as they've been working without it so far I'll leave it to you.
Aug 30 2023
Could we use forward-only everywhere once we move to DHCP option 97 with {T304677} ?
I rolled the certificate to all the cloudsw, cr, and asw devices.
I enabled gnmic on all the cloudsw and asw devices.
I configured gnmic to pull the data from all the asw devices.
Before running homer, the cookbook needs to call the sre.network.tls cookbook with the device's name as parameter to add the TLS cert required by the config pushed by Homer.
Never mind, it still doesn't work on the fasw.
Re-opening as the fasw got upgraded since, so we can enable mgmt_junos