Mon, Apr 19
I killed that domain in 2014 (operations/dns 3a7f472cb3e9bcd03f0492cfdd8c0a2156f448d3). No one has complained since, to my knowledge, and I'd recommend not reintroducing this redirect at this point. It was confusing to begin with: before that transition, the main mail exchangers and the mailing list service were all on the same box; these days they are (thankfully) separate, but the side effect is that "mail" as a label is much more ambiguous. HTH!
@CDanis could you look at this soon? Thanks!
Fri, Apr 16
Mar 17 2021
@crusnov maybe you can have a look?
Mar 16 2021
I think I've implemented this -- it's been a while :)
Mar 5 2021
Mar 4 2021
(I'd suggest focusing on the nitty-gritty like SSH keys later -- I'm not the right person to ask about these either :)
Judging from the last two lines of that transcript, I've been summoned :)
Could you clarify the scope between:
- production hosts that currently have WMCS as the service team (cloudvirt, cloudcephosd, etc.)
- Cloud VPSes that the WMCS team currently semi-manages (i.e. that have other roots, possibly custom puppetmasters etc.)
- Cloud VPSes that the WMCS team is currently managing fully (operates config mgmt such as the puppetmaster), not necessarily exclusively (e.g. I think Toolforge has additional admins)
Mar 3 2021
I believe the Atlas is a PCEngines APU, so you'll need a null modem cable or adapter (RXD->TXD, TXD->RXD, etc.). If this is a Cisco rollover cable, it would do the trick, but your DB9<->RJ45 adapter should not be a crossover adapter, as that would apply the crossover twice end-to-end and the two swaps would cancel each other out :)
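For reference, the minimal three-wire DB9 null modem wiring is (from memory, so worth double-checking against the APU's serial docs):
  pin 2 (RXD) <-> pin 3 (TXD)
  pin 3 (TXD) <-> pin 2 (RXD)
  pin 5 (GND) <-> pin 5 (GND)
i.e. each side's TXD lands on the other side's RXD, with ground carried straight through.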
Mar 1 2021
Feb 13 2021
To clarify the task's scope here, and the need from a network operations angle: as a service provider offering effectively unrestricted IPv4 connectivity from our public cloud to the rest of the internet, we need, for various reasons, the ability to identify and/or block the source of traffic named in e.g. an incoming third-party report or request, and to be able to do so retroactively, with timestamps in the past, as well. (This is not a new requirement, nor the result of recent changes in cloud networking -- just something we're overdue for.)
Jan 19 2021
Dec 7 2020
It feels like there are multiple issues being discussed here, so perhaps it's worth breaking this down and talking about some of these issues separately? The last few comments seem to be about the IP numbering and assignment issue, so I'll focus on that below.
Dec 5 2020
Dec 4 2020
OK, to add a little more color:
- The VLAN configuration is not important. brctl addif brq7425e328-56 eno2np1 is enough to reproduce this behavior.
- I was wondering why the bridge would matter (originally thinking hwmode/EVB etc.). I had tried setting promisc mode with no effect, but with a clearer mind this morning I tried promisc + down/up and managed to reproduce it, without a bridge being involved: ip link set promisc on dev eno2np1; ip link set down dev eno2np1; ip link set up eno2np1 reproduces it, and ip link set promisc off dev eno2np1 restores connectivity.
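In other words, the minimal reproduction is just (the same commands as above, spelled out step by step; eno2np1 being the NIC in question):
  ip link set promisc on dev eno2np1   # enable promiscuous mode
  ip link set down dev eno2np1         # bounce the link...
  ip link set up eno2np1               # ...connectivity is now broken
  ip link set promisc off dev eno2np1  # turning promisc back off restores it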
Dec 3 2020
Arzhel nerd-sniped me with this.
Dec 2 2020
Nov 25 2020
Nov 23 2020
Thanks - can you file a procurement request to that effect (& then resolve this task)?
Per @ayounsi above, "Last missing info is cable IDs". I don't see that having happened yet, right? The Cables report is even emitting soft warnings about it (warnings that we should convert to errors once this work completes). Reopening the task, as it was probably resolved by mistake.
Oct 22 2020
Oct 19 2020
Yay, that's awesome! You can't imagine how much time this would save!
Oct 16 2020
Sep 24 2020
I wonder what kind of ASN these flows would show up as (esp. with confederations!), as well as whether we could have a dimension to differentiate between internet traffic and backhaul traffic. We'd also need a "site" dimension to be able to filter or slice traffic from esams to eqiad, as the parent task requires, right? Also see T254332, which also makes me wonder whether adding all of these different dimensions is going to start being a problem :)
Sep 21 2020
BTW, one dangerous impact of this (as with all ECMP!) is that it would be harder to notice a situation where we don't have enough capacity to carry regular amounts of traffic when one of the paths is down for whatever reason. We could perhaps mitigate this by tuning our monitoring to alert at 40-50% utilization, at least for the common cases of link redundancy (codfw/eqdfw, eqiad/codfw): with two equal-cost paths, sustained utilization above ~50% on each link means a single path failure would overload the surviving one. So this will still get us extra capacity for "abnormal" conditions (like edge in eqiad but MW & Swift in codfw etc.) but still alert us to the situation where we don't have enough capacity for normal levels of traffic.
Sep 17 2020
Sep 16 2020
Hey - this was brought to my attention, and we discussed it today at the I/F meeting. The outcome of our conversation was that @Volans and @jbond will do a final review pass and merge r621343 ~by the end of this week.
Sep 14 2020
Sep 11 2020
- We shouldn't have outstanding alerts open (or even acknowledged) for more than a few days. If there is an alert, it means there is an abnormal condition that requires fixing. If the issue requires a significant amount of work to address, then a task should be created and the alert acknowledged with the task in the comment while it's getting fixed. I'd expect the DC Ops teams to be primary for such alerts and act on them, but everyone in SRE is also expected to triage alerts, reach out to owners and file tasks about them (like @ayounsi did here).
- If there are frequent false positives, then this is something that we should fix. We probably need one or more separate tasks for this, describing the conditions under which an alert is triggered erroneously, so that we can fix them. I'd expect the DC Ops team to file those tasks, and I/F to change the report to meet the adjusted needs.
- The test_missing_assets_from_accounting report is already (and has always been) ignoring discrepancies for items whose purchase date is in the last 90 days. This is configurable and we can tune it further to some other value, but 90 days was picked as long enough for accounting to process invoices, yet not so long that things fall out of memory (or the vendor engagement is over, the team changes, etc.). If there is a persistent backlog in Finance beyond 90 days, it'd be good to know so we can adjust.
Sep 7 2020
@jcrespo & @akosiaris may I ask you to figure this out in a different task? This is a generic task about dozens of servers, so by discussing details about a couple of them we're going to lose the bigger picture :)
Aug 18 2020
Ping? Besides the issues identified by @ayounsi just above, I see that in an earlier comment @ayounsi mentioned "wipe the switch", but then I saw the switch was removed. @Cmjohnson, can you confirm the switch was wiped before (or after) its removal? (Any reason we didn't go the decom task route here like we normally do?)
Aug 17 2020
@wiki_willy, what's the latest here? What's blocking us from having decom tasks for all of the items above?
Aug 4 2020
Bump! What's the latest here?
Jul 22 2020
We still seem to have remnants of PIM-RP:
email@example.com> show configuration | display set | match 126.96.36.199
set interfaces lo0 unit 0 family inet address 188.8.131.52/32
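If that lo0 address only existed to act as the PIM RP, cleaning it up should just be something like the below (a sketch; please double-check first that nothing else still binds to that address, and that no protocols pim stanza is left behind either):
  delete interfaces lo0 unit 0 family inet address 188.8.131.52/32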
Jul 21 2020
It looks like both of these issues are resolved now! Boldly resolving :)
Jul 16 2020
To give a little more context: in response to us requesting an extension for the v2 anchors, the RIPE NCC team reached out to ask if they could run a test upgrade on one of our anchors (which I of course said OK to!).
Jul 2 2020
So - how do we make progress here? Any thoughts on who/how? :) Some of these features could really make a tremendous amount of difference to our network operations and future planning, so I'm super excited about seeing these come to fruition!
Jul 1 2020
I was bitten by this again today - ping!
Jun 26 2020
Jun 25 2020
To add to the above, I'm also wondering how difficult it would be to include AS *names* as well, e.g. coming from the MaxMind GeoIP ASN database. I think we've used that database before, maybe for pageview data? Could we perhaps use Druid lookups for this, to avoid adding another (identical) dimension to the data set?
Jun 24 2020
I took a look at that list above. It's really not very actionable -- most of these are very large networks that have a restrictive settlement-free peering policy. For the few that remain, we have either established peerings already or have sent unanswered peering requests, which mostly means that they are not actively peering or we are too small for them to care about.
Jun 18 2020
Jun 11 2020
Jun 4 2020
This is now set up on SFMIX's end and the port is up:
On your side please plumb 184.108.40.206/24 and 2001:504:30::ba01:4907:1/64. Usual sane BGP peering rules apply - no broadcast traffic (DHCP, CDP, etc), see https://sfmix.org/connect/guide.
We request at least one BGP session (to our looking glass), plus optional sessions to the route servers.
The looking glass is AS12276 at 220.127.116.11 and 2001:504:30::ba01:2276:1. You should announce all your routes to the looking glass, but expect no routes to be announced to you.
We'll push out configs to support these peers this evening.
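For the record, once those configs are out, verifying the required looking-glass session from the router side should just be the usual checks (addresses as quoted above; per their guide we should see our routes advertised towards them and nothing received back):
  show bgp neighbor 220.127.116.11
  show route advertising-protocol bgp 220.127.116.11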
Jun 3 2020
May 19 2020
Are there any updates to this task, and any particular reasons it's been held up? While this was never super urgent, we're now at the ~one year mark since this was ordered and delivered to the data center. Plus, since the upgrade seemed imminent at the time, we only bought support for the new switch and not the old one, so we're operating with unsupported HW right now. It'd be great if this were completed soon. Thanks!
May 15 2020
If three ports are permanently failed, I'm not sure how we could ever trust that switch again. Perhaps it's better to do a painful but planned replacement rather than have it fail at some inconvenient time and have to rush a replacement then?
May 12 2020
I know that historically MaxMind has claimed they update the data roughly on a weekly basis, and maybe in this case it was a normal weekly update and we're just misaligned with their weeks? In any case, the current geoipupdate seems to be smart enough to checksum the existing databases and not re-download pointless duplicates, so we could probably run it more often on the puppetmasters.
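For the "run it more often" part, a simple cron entry on the puppetmasters would probably do; a sketch (schedule and paths are illustrative, not what we actually have puppetized):
  # /etc/cron.d/geoipupdate -- hypothetical example; adjust config/database paths to ours
  30 4 * * *  root  /usr/bin/geoipupdate -f /etc/GeoIP.conf -d /usr/share/GeoIP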
May 8 2020
LoA received and cross-connect task created.
Apr 30 2020
I just submitted their form.
Apr 27 2020
Interesting idea! Couple of notes:
- What do you mean by "virtual links" and Netbox not supporting them? Is that VLANs for our transports over the PtMP VPLS?
- What do you envision the difference to be between "primary" and "preferred"? (I know you said TBD, but curious :)
- It'd be interesting to see what this would look like before we start adding the fields. That may help us figure out what the right values for those fields may be. Would it make sense to list our links in a Phaste or spreadsheet or something and figure out if the output makes sense?
Apr 14 2020
I think the original intention of this will be addressed by periodic audits that we'll eventually do. I'll decline this for the reasons I mentioned above, but if anyone feels strongly about this, feel free to reopen :)
So breaking down the (very reasonable!) ask, I think there are a few different things at play here:
- Access to iDRAC/iLO so that John can e.g. look at HW status and get reports that vendors ask for. This in turn requires:
- Access to the password store. There is already a "dcops" group with the right access, so we can have John added there. Should be simple, as far as I can tell.
- Access to the mgmt IP network remotely. Right now that's firewalled to the cumin hosts, access to which is tied to a bigger project (see below). However, that's perhaps an unnecessary dependency and maybe we can easily work around it (e.g. with a separate bastion for mgmt?). @MoritzMuehlenhoff, @jbond, any thoughts here?
- Access to execute cumin cookbooks, like reimaging. That right now is tied to global root, which is a privilege that we can't easily grant. Fixing that limitation has been on our radar, including the PoC work that was part of our Q3 OKRs (T244840). It's definitely not there yet and it's going to take a few months to fully materialize, unfortunately.
Apr 13 2020
If I understand it correctly, this task is specifically about a box that was returned to the spare pool and then reallocated for a new purpose, but kept its old data. We should definitely wipe in those cases. I think that has been standard practice in the past, but perhaps not well-documented or applied uniformly? I'm not sure -- something to dig into more for sure :)
Apr 11 2020
The master branch of operations/software/keyholder is not ready for a release at this time, so please don't tag, package or deploy it in this state. There are a bunch of changes that have been pending in Gerrit for about a year, plus more that I've queued up locally (because it's hard to manage dozens of dependent git commits with Gerrit…). If y'all are willing to review these, I can clean them up and prepare a release; if not, then I can pick this up and make some progress. Let me know!
Apr 8 2020
Apr 3 2020
Apr 2 2020
Ah! That's awesome to hear. May I suggest resolving this (and the associated "upgrade firmware" task?) then, and reopening if we have another one of these?
Apr 1 2020
What's the latest here? I haven't heard about these crashes lately but it may just be that I missed it. Do we know more about this now?
Mar 27 2020
@wiki_willy is finalizing the end of our leasing agreement. Once that's done, we'd be the "owner" of all of those assets, and thus we can remove the "owner" field from Netbox. Reassigning to Willy to let us know when that's done :)
Mar 26 2020
Mar 19 2020
Mar 18 2020
Reopening this per IRC, and given this is a prod/WMCS task affecting prod in major ways.
Mar 17 2020
Mar 15 2020
Mar 12 2020
Oh, that sounds perfect, let's do that :) We should also try a build with the right make flags etc. (something like TARGET=SKYLAKEX, as the FAQ says). Thanks all!
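Assuming this is OpenBLAS (which is what the TARGET=SKYLAKEX flag suggests), the build would be roughly the following (a sketch only; the install prefix is illustrative and the real thing should go through proper packaging):
  make -j"$(nproc)" TARGET=SKYLAKEX                    # build with the Skylake-X optimized kernels
  make TARGET=SKYLAKEX PREFIX=/opt/openblas install    # illustrative install prefix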
Mar 11 2020
OK, so to recap, I read two concerns:
Mar 6 2020
We have one global account, migrated from a previous system. I wasn't able to find a way to create individual accounts, so that will do, I guess :)