Please note that the above diagram has a mistake: it shows both routers connecting to PP:15/16, when cr1:xe-0/1/1 actually connects to Tata's port 11/12.
Wed, Jan 26
Fri, Jan 21
Thu, Jan 20
This is done, opened T299640 for further improvements.
Wed, Jan 19
Not needed anymore.
Juniper bumped their recommended version to at least Junos 20 on a lot of platforms.
FYI the host is still set to "active" in Netbox.
Tue, Jan 18
Mon, Jan 17
Thu, Jan 13
@ayounsi corrected: et-0/1/0. Rolled the fiber; it has link now.
Nice! and LLDP shows msw2 as neighbor.
Wed, Jan 12
Tue, Jan 11
Mon, Jan 10
Relying on parsing a website is often asking for trouble. Maybe we can also ask our account rep for their recommendation (different API, etc.).
We usually use the FQDN for logging and NTP endpoints, see https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/ServerTech#Setting_up_the_Configuration
A quick look on GitHub shows two approaches:
- This one parses the firmware page: https://github.com/lateralblast/druid
- The other one uses the catalog (and NFS mount): https://github.com/RackHD/smi-service-dell-server-firmwareupdate
Tue, Jan 4
Mon, Jan 3
I updated Netbox to reflect reality (as required so automation can work), and pushed its initial config.
Could you connect the mgmt port (em0) to ge-0/0/0 (i.e. the switch to itself)?
Dec 23 2021
The interface msw1:et-0/1/0 is alerting about CRC errors.
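For reference, the error counters can be checked directly on the switch with something along these lines (the exact counter names in the output vary by platform):

```
show interfaces et-0/1/0 extensive | match CRC
```

A steadily incrementing CRC/Align count usually points at a dirty or faulty fiber/optic rather than a config issue.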
Dec 17 2021
In theory there shouldn't be any PII data, but it would be safer to sanitize it nonetheless.
Dec 15 2021
Cool, only ip_version and region are useful here.
Dec 14 2021
Tests are successful:
Dec 13 2021
Could we trunk the new vlan instead of using a 2nd physical port?
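For reference, trunking an additional VLAN on an existing port on Junos (ELS syntax) looks roughly like this; interface and VLAN names below are placeholders, not our actual config:

```
set interfaces ge-0/0/10 unit 0 family ethernet-switching interface-mode trunk
set interfaces ge-0/0/10 unit 0 family ethernet-switching vlan members [ existing-vlan new-vlan ]
```

This keeps the second physical port free and only needs the matching VLAN tag on the server side.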
Thanks, we're already at 250000 for those. We usually set a high limit from the get-go for route servers.
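On Junos that limit is set per peer or group with `prefix-limit`; a sketch with a placeholder group name (our actual group and family statements may differ):

```
set protocols bgp group IX-route-servers family inet unicast prefix-limit maximum 250000
set protocols bgp group IX-route-servers family inet unicast prefix-limit teardown
```

With `teardown` the session is dropped once the maximum is exceeded, instead of only logging a warning.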
Dec 7 2021
- we can use "internal_flows" (not "internal_netflow", as netflow is a protocol)
- can I start this anytime, or do we need to create the Kafka topic somewhere first?
Ok, the fix from T295672#7531535 sounds good to me then!
This is now set to alert to NOC through alertmanager.
As a general note we need to be careful with rolling out config fixes in reaction to unexpected issues.
Even if it's thoroughly tested, and I agree with your thorough proposal, it increases the config's complexity by tiny increments, making future changes (small or big) riskier.
As you pointed out, looking at our BGP confederation holistically is long overdue! (partially with T167841, possibly after looking at OSPF with T200277 to have sound foundations).
3.1 is out of beta, updated the task description accordingly.
Dec 6 2021
Alright, closing this for now then :)
Did you mean _not_ a hard requirement?
Yep, my bad :)
Waiting for Capirca upstream to merge PRs.
Latest Junos recommended is 20.4R3-S1.3
I downloaded it to apt1001:/srv/junos/jinstall-ex-4300-20.4R3-S1.3-signed.tgz
You can also find it on https://webdownload.juniper.net/swdl/dl/secure/site/1/record/140793.html?pf=EX4300 if you have a Juniper account (and if you don't, we should create one for you :) )
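Once the image has been copied from apt1001 to the switch (e.g. to /var/tmp), the upgrade itself is typically along these lines; exact flags and validation steps should be double-checked for the EX4300 before running:

```
request system software add /var/tmp/jinstall-ex-4300-20.4R3-S1.3-signed.tgz
request system reboot
```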
@JAllemandou This is great, thanks! Note that we can tune sampling to adapt.
Dec 2 2021
ge-6/0/26 Interface doesn't match its switch member: 5 on asw-b5-codfw
Interface ge-6/0/26 was configured on FPC5; I deleted it (the real ge-6/0/26 lives on FPC6), so it should be good to go now.
I agree the Homer error message could be clearer though!
Dec 1 2021
Thanks, I had a quick look and they are both healthy; all 8 interfaces show up as well.
Many sensors are now over threshold, see the red in: https://librenms.wikimedia.org/device/173/
Nov 30 2021
@BTullis thanks! Real-time, would be a nice plus, but a hard requirement (unlike netflow).
Nov 29 2021
Nov 26 2021
I went the "set a different sampling pipeline for internal flows" way with the above POC for the reasons mentioned in T263277#6491140.
Nov 25 2021
All 3 VMs got rebuilt with larger disks, but with the default Debian Buster.
Nov 24 2021
That looks like a faulty cable or interface, over to DCops for troubleshooting, let us know if you need Netops help.
Nov 22 2021
Codfw repooled, everything is back to normal.
The above command doesn't commit on a pre-provisioned VC.
Nov 21 2021
Nov 19 2021
Hopefully we won't need to, but if asw1-b2-codfw needs to be rebooted, here are the impacted servers:
moss-be2002 (not active)
- IPv6 is still broken on asw-b7-codfw (for traffic local and transiting through the switch)
- inet6 is disabled on cr2-codfw:ae2 (to row B)
- That means row B has uplink redundancy for v4 but not v6
- lvs2007 and codfw will stay depooled until Monday, when more intrusive remediation will be performed
- codfw can be repooled if needed (eg. eqiad issue)
- JTAC ticket can't be opened until T294792 is done
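The inet6 disable on cr2-codfw:ae2 mentioned above corresponds to something like the following (the unit number is a placeholder; whether we `deactivate` or `delete` depends on whether we want to keep the config around for re-enabling later):

```
deactivate interfaces ae2 unit 0 family inet6
commit
```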
Raising the priority to bring attention to this task, feel free to re-triage accordingly.
Nov 17 2021
For the record, there is also a link to lvs2007; after chatting with @BBlack on IRC, the usual procedure of disabling Puppet and then stopping PyBal should be done before the maintenance.
If you can take pictures of the front panels, that would also be useful for instructing remote hands when they get to drmrs.
This will cause a hard downtime for 6 servers (rack B7), for up to 1h, but most likely less:
mgmt ports to the mgmt switch please :)
Once we have this and console, we can check and upgrade them.