User Details
- User Since: Apr 3 2017, 6:23 PM (376 w, 3 d)
- Availability: Available
- IRC Nick: xionox
- LDAP User: Ayounsi
- MediaWiki User: AYounsi (WMF)
Yesterday
Wed, Jun 19
Some notes before I forget, to make the sre.deploy.python-code work I had to:
Mon, Jun 17
We had a quick look at the network side and couldn't find any smoking gun.
Of course! Not planning on doing it today :) The task is there so we don't forget.
Yeah, I think that's what I was trying to say with:
We can also decide that batch means silently skipping any device that has a different diff, so as not to risk blocking the run midway if a device has local changes.
Basically, decide whether the batch behavior is (3) or (4), then stick to it. Four options seems a bit too much.
I tend to prefer (3), and would be ok to not support (4), especially as in a good state there should be no local changes.
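A minimal sketch of the "skip devices with a different diff" behavior described above. It is not Homer's actual API: `split_batch`, the device names, and the diff strings are all hypothetical, and it assumes the per-device diffs have already been computed.

```python
def split_batch(diffs: dict[str, str], approved_diff: str) -> tuple[list[str], list[str]]:
    """Return (devices to commit, devices to skip) for one batch run.

    A device is committed only if its diff is identical to the approved one;
    anything else (for example a device with local changes) is skipped
    instead of blocking the run.
    """
    to_commit, skipped = [], []
    for device, diff in sorted(diffs.items()):
        if diff == approved_diff:
            to_commit.append(device)
        else:
            skipped.append(device)
    return to_commit, skipped


# Hypothetical per-device diffs; switch-c carries an extra local change.
diffs = {
    "switch-a": "+ set system ntp server 192.0.2.1",
    "switch-b": "+ set system ntp server 192.0.2.1",
    "switch-c": "+ set system ntp server 192.0.2.1\n- delete local-change",
}
to_commit, skipped = split_batch(diffs, diffs["switch-a"])
print("commit:", to_commit)
print("skipped:", skipped)
```

Reporting the skipped list at the end of the run, rather than dropping it silently, would keep local changes visible without stopping the batch.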
I don't understand why they need to be moved to get upgraded to 10G. Taking wikikube-ctrl2001 as an example, the switch in rack B6 has plenty of available, ready-to-use 10G ports (for example 44-47).
Can we move the cables instead of moving the servers?
It's necessary to do the diff on all target devices anyway, so that behavior is fine.
Fri, Jun 7
Thu, Jun 6
Wed, Jun 5
Plan so far is to merge https://gerrit.wikimedia.org/r/1037784 to be able to have a puppetized test server compatible with the new deploy directory scheme (netbox-dev)
Then to merge https://gerrit.wikimedia.org/r/1038694 and check it out from /srv/deployment/netbox-dev/deploy
Then load a copy of the prod Netbox DB on the dev instance pbsql
Then run the deploy python code cookbook to have a working Netbox 4 setup (and fix any issue that could prevent it)
Then check if the DB migration went well
In parallel merge https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/905570/ and parent change to have better CI on netbox-extra ahead of fixing all the Netbox 4 breaking changes.
Then send/merge patches to fix those netbox-extra changes.
Then (non-blocker) update the sre.netbox.update-extras cookbook to account for those changes.
Then send Spicerack, Cookbooks and Homer patches to fix Netbox's breaking changes. Ideally by moving some of the Cookbook's Netbox API calls to Spicerack.
Mon, Jun 3
Last time we rolled out this change, it was simply updating modules/install_server/files/autoinstall/common.cfg. Do you have any other place in mind where this might need to be reconfigured? I am personally for removing this completely but it's not a big deal and we can keep it around as well.
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Opengear_Serial_Consoles
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/ServerTech
Our engineering team has now indicated that the compact json is not supported, due to hardware limitations with respect to compact json formatting. The feature will be deprecated in Junos 24.4. So, please do not use compact json to export data.
Moving the dynamic nature of NTP definition to some automated system instead of human or Puppet is a great idea :)
Human, as in right now for network devices the list is hard-coded: https://github.com/wikimedia/operations-homer-public/blob/master/config/common.yaml#L365
- i.e. one in Germany which will pick ns0 rather than lower latency ns2
Seems like the main one is adguard-dns.com, which picks them randomly.
https://w.wiki/AGmr
I think the difficult part is where to stop the overengineering. For example, it could make sense to use Liberica to healthcheck/advertise one of the NS anycast IPs, but it might not be worth using a different AuthDNS software on half the servers, or a different Puppet infra.
Before going full anycast we need to make sure we're covering all major failure scenarios, or alternatively make a call to keep some unicast, knowing that some places with broken/dumb implementations won't be the fastest, but that may be an OK tradeoff for better resiliency.
Fri, May 31
It's quite interesting to see the variation in tradeoffs, and it can be quite an (important) rabbit hole. Is the goal to figure it out before anycasting ns1, or to first anycast ns1 from anywhere and then figure out how to modify the setup for possibly better redundancy?
It could be useful to list all the failure scenarios, and whether we need to mitigate them or not (server or network misconfig, Bird bug, etc.). In other words, are we putting too many of our eggs in the same basket?
Thu, May 30
assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this
As we couldn't get a /24 from LACNIC for magru, we only have two free /24s
We have to decide between a few options:
- Allocate a whole new /24 for ns1 right now
  - Pro: quick turnaround, no added cost
  - Con: risk of lacking public v4 IPs for future projects (e.g. new POPs) or core site growth; this can be mitigated by applying for more prefixes in parallel, but there is no guarantee of getting them
- Apply (and pay) for more IPs at RIPE or ARIN (T288342) and wait for an allocation before anycasting ns1
  - Pro: limited cost, more flexible on IP usage
  - Con: long turnaround (months to years)
- Buy a /24 on the resale market
  - Pro: faster turnaround
  - Con: higher cost
- Don't anycast ns1
  - Listing it only for the sake of completeness, but not preferred. Even though there are diminishing returns after anycasting ns2, we believe anycasting one more NS would bring performance improvements to users
- Use the DoH anycast prefix for ns1
  - Pro: quick turnaround, no added cost
  - Con: risk of providers blocking ns1 as a side effect of blocking DoH, mitigated by having 2 other NS.
Tue, May 28
The Typha firewall service is now based on firewall::service and does dynamic name resolution on the puppet server side, let's see if this improves things with the next rename.
The issue didn't happen again, but we also did the move vlan in addition to the rename (so the IP changed too).
Mon, May 27
JTAC was able to confirm/duplicate the bug on 22.3R3-S2.4, they're escalating it to their engineering team.
Fri, May 24
Opened JTAC case 2024-0524-163553
Thu, May 23
sudo cookbook sre.network.tls --system lsw1-f8-eqiad
Before I forget, please notify DCops so they update the physical labels on the server.
Wed, May 22
Sounds good! I'd recommend first doing a rename followed by a normal re-image, then just a move-vlan, then, on a different host, testing both actions one after the other.
Tested on an MX204 running Junos 21.2 and 22.4R3.25, the returned JSON is invalid...
Diff:
@@ -100,7 +100,7 @@
 {
 },
 "address-family" :
-{
+[{
 "address-family-name" : "inet",
 "mtu" : "4456",
 "max-local-cache" : "100000",
@@ -142,7 +142,7 @@
 "internal-flags" : "0x0"
 },
 "interface-address" :
-{
+[{
 "ifa-flags" :
 {
 "ifaf-current-preferred" : "[null]",
@@ -174,6 +174,7 @@
 }
 }
 }
+]
 },
 {
 "address-family-name" : "multiservice",
@@ -182,9 +183,8 @@
 {
 "internal-flags" : "0x0"
 }
-}
+}]
 }
 }
 }
 }
Valid:
{ "interface-information" : { "physical-interface" : { "name" : "xe-0/1/2", "admin-status" : "up", "oper-status" : "up", "local-index" : "164", "snmp-index" : "536", "description" : "Transit: Arelion (IC-) {#1071}", "link-level-type" : "Ethernet", "sonet-mode" : "LAN-PHY", "mtu" : "4470", "mru" : "4478", "source-filtering" : "disabled", "speed" : "10Gbps", "bpdu-error" : "none", "ld-pdu-error" : "none", "l2pt-error" : "none", "loopback" : "none", "if-flow-control" : "enabled", "if-speed-cfg" : "Auto", "pad-to-minimum-frame-size" : "Disabled", "if-device-flags" : { "ifdf-present" : "[null]", "ifdf-running" : "[null]" }, "ifd-specific-config-flags" : { "internal-flags" : "0x100200" }, "if-config-flags" : { "iff-snmp-traps" : "[null]", "internal-flags" : "0x4000" }, "if-media-flags" : { "ifmf-none" : "[null]" }, "physical-interface-cos-information" : { "physical-interface-cos-hw-max-queues" : "8", "physical-interface-cos-use-max-queues" : "8", "physical-interface-schedulers" : "0" }, "current-physical-address" : "f0:4b:3a:ef:7e:45", "hardware-physical-address" : "f0:4b:3a:ef:7e:45", "interface-flapped" : "2023-03-09 08:13:40 UTC (62w6d 04:14 ago)", "traffic-statistics" : { "input-bps" : "7184824", "input-pps" : "8233", "output-bps" : "85449600", "output-pps" : "8470" }, "active-alarms" : { "interface-alarms" : { "alarm-not-present" : "[null]" } }, "active-defects" : { "interface-alarms" : { "alarm-not-present" : "[null]" } }, "ethernet-pcs-statistics" : { "bit-error-seconds" : "3", "errored-blocks-seconds" : "3" }, "interface-transmit-statistics" : "Disabled", "logical-interface" : { "name" : "xe-0/1/2.0", "local-index" : "343", "snmp-index" : "555", "if-config-flags" : { "iff-up" : "[null]", "iff-snmp-traps" : "[null]", "internal-flags" : "0x4004000" }, "encapsulation" : "ENET2", "policer-overhead" : { }, "traffic-statistics" : { "input-packets" : "1184552371726", "output-packets" : "1206955771514" }, "filter-information" : { }, "address-family" : [{ 
"address-family-name" : "inet", "mtu" : "4456", "max-local-cache" : "100000", "new-hold-limit" : "100000", "intf-curr-cnt" : "1", "intf-unresolved-cnt" : "0", "intf-dropcnt" : "0", "address-family-flags" : { "ifff-rpf-check" : "[null]", "ifff-rpf-loose-mode" : "[null]", "ifff-sendbcast-pkt-to-re" : "[null]", "internal-flags" : "0x0" }, "interface-address" : { "ifa-flags" : { "ifaf-current-preferred" : "[null]", "ifaf-current-primary" : "[null]" }, "ifa-destination" : "80.239.192.64/30", "ifa-local" : "80.239.192.66", "ifa-broadcast" : "80.239.192.67" } }, { "address-family-name" : "inet6", "mtu" : "4456", "max-local-cache" : "75000", "new-hold-limit" : "75000", "intf-curr-cnt" : "2", "intf-unresolved-cnt" : "0", "intf-dropcnt" : "0", "address-family-flags" : { "ifff-rpf-check" : "[null]", "ifff-rpf-loose-mode" : "[null]", "internal-flags" : "0x0" }, "interface-address" : [{ "ifa-flags" : { "ifaf-current-preferred" : "[null]", "ifaf-current-primary" : "[null]" }, "ifa-destination" : "2001:2000:3080:a9a::/64", "ifa-local" : "2001:2000:3080:a9a::2", "interface-address" : { "in6-addr-flags" : { "ifaf-none" : "[null]" } } }, { "ifa-flags" : { "ifaf-current-preferred" : "[null]", "internal-flags" : "0x800" }, "ifa-destination" : "fe80::/64", "ifa-local" : "fe80::f24b:3aff:feef:7e45", "interface-address" : { "in6-addr-flags" : { "ifaf-none" : "[null]" } } } ] }, { "address-family-name" : "multiservice", "mtu" : "Unlimited", "address-family-flags" : { "internal-flags" : "0x0" } }] } } } }
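A quick way to check whether a device's output parses at all is to feed it to a JSON parser and report where it fails. The sketch below uses only the Python standard library; the two sample strings are simplified stand-ins that mimic the missing array brackets shown in the diff, not real device captures, and `json_error` is a hypothetical helper name.

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return None if text parses as JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


# Simplified stand-ins: the invalid one lists two objects after a key
# without wrapping them in an array.
valid = '{"address-family": [{"address-family-name": "inet"}, {"address-family-name": "inet6"}]}'
invalid = '{"address-family": {"address-family-name": "inet"}, {"address-family-name": "inet6"}}'

print("valid sample:", json_error(valid))
print("invalid sample:", json_error(invalid))
```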
May 21 2024
https://trubka.network.cz/pipermail/bird-users/2024-May/017687.html
I already made it a feature request and plan to implement it.
May 17 2024
To block DHCPNAK packets while allowing other DHCP packets through using iptables, you need to create a rule that specifically matches the DHCPNAK packet type. DHCPNAK packets carry the value 6 in the DHCP Message Type option (option 53).
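To make the "option 53, value 6" part concrete, here is a minimal sketch that walks the options of a DHCP payload and extracts the message type. It is an illustration of the packet layout rather than an iptables rule: it assumes the input is the UDP payload (236-byte BOOTP header, 4-byte magic cookie, then options), ignores option overloading (option 52), and `build_payload` is a hypothetical helper that produces synthetic data for the example.

```python
from typing import Optional

MAGIC_COOKIE = bytes([0x63, 0x82, 0x53, 0x63])
DHCP_MESSAGE_TYPE = 53  # option code
DHCPNAK = 6             # option value


def dhcp_message_type(payload: bytes) -> Optional[int]:
    """Return the DHCP message type found in a UDP payload, or None."""
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        return None
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 255:  # End option
            break
        if code == 0:    # Pad option has no length byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break
        length = payload[i + 1]
        value = payload[i + 2 : i + 2 + length]
        if code == DHCP_MESSAGE_TYPE and len(value) == 1:
            return value[0]
        i += 2 + length
    return None


def build_payload(message_type: int) -> bytes:
    """Build a minimal synthetic payload for the example (not a real capture)."""
    options = bytes([DHCP_MESSAGE_TYPE, 1, message_type, 255])
    return bytes(236) + MAGIC_COOKIE + options


print("is DHCPNAK:", dhcp_message_type(build_payload(DHCPNAK)) == DHCPNAK)
```

Because options can appear in any order, a fixed-offset packet match only works when option 53 is the first option, which is common but not guaranteed.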
The Telxius community doesn't seem to have any effect so far. I'll wait for their reply; maybe it changed, or it needs to be enabled on their side first. I'll look at the other providers afterwards.
Cogent is a bit surprising, from EU or the US they route to magru.
Before advertising ns2, we need to do some traffic engineering. As Telxius is part of Spain's main ISP, Telefonica ES prefers magru to drmrs:
See https://w.wiki/A6qH
May 15 2024
As a data point, same error today with: cumin1002:~$ sudo cookbook sre.network.tls lsw1-d1-codfw
May 14 2024
@cmooney what do you think of duplicating the other POPs allocation scheme?
Taking eqiad as an example, keep 2a02:ec80:a000::/40 as "reserved for future growth"
Then use 2a02:ec80:a000::/48 for the existing WMCS eqiad infra
Then 2a02:ec80:a000::/56 for public, another /56 for private, /55 for the infra, 2a02:ec80:a000:ed1a::/64 for VIPs, etc
Or is the risk not being able to allocate a /64 for each VM? (A /48 holds 65536 /64s.)
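The subnet arithmetic above can be checked with Python's standard `ipaddress` module. This is only a sketch of the counting; `count_subnets` is a helper introduced for the example, and the prefixes are the ones quoted in the comment.

```python
import ipaddress

site = ipaddress.ip_network("2a02:ec80:a000::/48")
vips = ipaddress.ip_network("2a02:ec80:a000:ed1a::/64")


def count_subnets(network: ipaddress.IPv6Network, new_prefix: int) -> int:
    """Number of subnets of length new_prefix that fit in network."""
    return 2 ** (new_prefix - network.prefixlen)


first_56 = next(site.subnets(new_prefix=56))

print(count_subnets(site, 64))      # /64s in the /48: 65536
print(count_subnets(site, 56))      # /56s in the /48: 256
print(count_subnets(first_56, 64))  # /64s in one /56: 256
print(first_56)                     # 2a02:ec80:a000::/56
print(vips.subnet_of(site))         # True
```

So a single /56 already allows 256 per-VM /64s, and the /48 as a whole allows 65536.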
May 13 2024
All good now.
Thanks, indeed it looks very promising! If he has the time, maybe @MoritzMuehlenhoff can provide some pointers on the packaging steps.
May 10 2024
I'm not fond of setting up a password on unmanaged switches. For example, what is the procedure to rotate the password?
May 3 2024
Both Junos 22.2R3-Sx and Junos 22.4R3 are latest recommended. fyi, I went with 22.4R3 in magru.
May 2 2024
Apr 29 2024
Apr 28 2024
Good idea, worth trying! If it's enough, it would be less of a pain than changing the SSH port.
Apr 23 2024
Another question I think is "do we still have to go through text files ?"
It made sense back when we were manually editing the configuration, and still does for the few places where we do, but it seems sub-optimal to go from the Netbox database to text files to gdnsd.
Probably too simplistic, but could we generate a raw list of IP/FQDN pairs from Netbox and feed it to gdnsd without having to care about PTR and zone structures?
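As a rough illustration of why a raw IP/FQDN list could be enough: the PTR name is fully determined by the IP, so both forward and reverse records can be derived mechanically. The sketch below does not query Netbox and does not produce gdnsd's input format; the `pairs` list and the `derive_records` helper are hypothetical, and the addresses come from documentation ranges.

```python
import ipaddress


def derive_records(pairs: list[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Turn (ip, fqdn) pairs into (owner name, record type, value) tuples."""
    records = []
    for ip_text, fqdn in pairs:
        ip = ipaddress.ip_address(ip_text)
        rtype = "A" if ip.version == 4 else "AAAA"
        # Forward record: name -> address.
        records.append((f"{fqdn}.", rtype, str(ip)))
        # Reverse record: the PTR owner name is computed from the IP alone.
        records.append((f"{ip.reverse_pointer}.", "PTR", f"{fqdn}."))
    return records


# Hypothetical raw export.
pairs = [
    ("192.0.2.10", "host1.example.org"),
    ("2001:db8::1", "host1.example.org"),
]
records = derive_records(pairs)
for owner, rtype, value in records:
    print(owner, rtype, value)
```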
Apr 22 2024
Apr 15 2024
I have checked the logs and it looks like the slowness and reboots we are seeing on the device are the product of a brute-force SSH attack on the SRX.
The login attempts create processes on the SRX that sometimes don't close correctly or take longer to fully close. If enough of them stack up, it can cause a reboot.
To fix this we can set a firewall filter for the control plane of the SRX and use an allow list to limit the packets that actually reach the device.
We might have to re-prioritize this task because of T362522: mr1-eqsin performance issue
Done :)
I started implementing a fix for that but it quickly gets complex as it means shutting down a port, and fully setting up another one. Before going that way let's see if it's something we want/need to do.
Also, as I think you mentioned somewhere else, it would mess with @Papaul's rack U to switch port mapping.
Opened JTAC 2024-0415-128563 and attached logs/RSI/coredump.
Apr 12 2024
Prefixes assigned in Netbox: https://netbox.wikimedia.org/ipam/prefixes/?site_id=11
As a data point, Ganeti's iPXE sends 4 DHCP requests, doubling the timeout between each: 1s, 2s, 4s.
12:34:21.812153 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:22.831693 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:24.863614 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:28.928072 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
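The doubling can be read straight off the capture timestamps. A small sketch, using only the four timestamps from the tcpdump output above:

```python
from datetime import datetime

# Timestamps copied from the four DHCP requests in the capture.
timestamps = [
    "12:34:21.812153",
    "12:34:22.831693",
    "12:34:24.863614",
    "12:34:28.928072",
]

times = [datetime.strptime(t, "%H:%M:%S.%f") for t in timestamps]
# Gap in seconds between each request and the next one.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

print([round(g, 2) for g in gaps])  # [1.02, 2.03, 4.06]
```

That is roughly 7 seconds between the first and last request, so a DHCP reply arriving later than that after the first request would miss every retry.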
Apr 11 2024
Apr 8 2024
Netbox script is great, we can call it from a cookbook if needed later on.
Thanks. What I don't understand is that if they go through ZTP or manual basic setup, they will by definition be managed switches (with root password, IP, etc). I don't think we can have a middle ground where we have only some config.
Apr 5 2024
We first need to discuss if we want to start using managed switches for management switches (except the aggregation ones).
On the plus side it's convenient to have the extra visibility, but it adds a lot of management overhead to our automation, and I'm not sure we have the resources for that.
RFO: The unavailability of the link was due to problems with optical modules and cards at the Marseille and Paris, France locations on the Telxius network. The link returned to normal after the modules and cards were replaced.