OK, so, after the efforts of the past few days, we're in much better shape! The PuppetDB report seems to be (almost?) entirely indicative of real issues and is actionable now - I will involve DC Ops to start fixing the cases that are known to be real errors, and we'll see if there are any false positives (I know of at least one, which is tough to handle!).
Apr 11 2019
Apr 9 2019
@RobH and @Cmjohnson, is this a forgotten decom?
Given T218025, can we resolve this?
According to Netbox, cp1099 is 2 years newer than cp1008, but is still a 6-year old server (purchased Mar 28, 2013). Can we just get rid of it? I'm concerned we're just spending cycles on a box that may die any day now and that we won't be able to repair...
Apr 8 2019
I think this is addressed by systemd's 9009d3b5c3b6d191be69215736be77583e0f23f9, included in v239 (stretch has v232, buster has v241).
Mar 21 2019
Mar 7 2019
For the per-site usage, LibreNMS, besides being clunky, is non-public and thus not accessible to everyone.
This seems like a duplicate (and subset of) T216088. I've added the custom field proposal as one of the many options listed in its task description and closing this as duplicate to keep the discussion in one place :)
Mar 5 2019
Mar 4 2019
I just merged a duplicate in. @Cmjohnson what's the status of this?
In T122144#4152079, @Dzahn wrote: or they are individual aliases (out of scope of this ticket)
Feb 20 2019
Feb 15 2019
If that's still needed, that's approved, and it takes priority over phab1002. And let's replenish our spare pool indeed!
@Dzahn that's all fine, but we should have that documented in a separate Phabricator task tracking this work, if one doesn't exist already :) Separately, I'd also really love having a permanent non-SPOF setup in each data center as well, whether that's multiple bare metal servers, multiple VMs or running Phabricator on k8s. This is too important of a service to run in one misc-type server per site.
Feb 14 2019
We discussed this a little bit yesterday, and T216088 was filed to further discuss this. Help there is welcome :)
The medium-term plan is for this data to be entered into Netbox after a server is racked but before it's provisioned or even powered up, and that data to be used by our tooling to configure and execute the provisioning itself (DHCP configuration, switchport, OS install etc.).
Feb 12 2019
Before these are delivered for implementation, let's make sure that the two systems have identical settings, especially given we've tested various things on them over the past few months. I reverted my SSD Smart Path setting on 1019, but there are still differences; the most important one that I noticed is that in cloudvirt1019 the P440ar is hidden (disabled in BIOS?) but in cloudvirt1020 it's visible. Maybe a factory reset and then manually reapplying the same settings in each?
Feb 7 2019
Feb 5 2019
Is there a task describing the plans for a secondary Phabricator system? How did we come up with those specs?
Feb 2 2019
Let's not wait for a meeting, approved!
Jan 26 2019
It's tricky, but I think the one we use is probably the right one and this should be declined. See T212674 for context.
Jan 25 2019
Netbox is now at 2.5 \o/ which allows us to import cable IDs, type, color etc. Let's start with importing eqsin's, with the data that we have in the spreadsheet, so that we can deprecate that? @RobH @ayounsi any takers?
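To make the eqsin import concrete: a minimal sketch of turning spreadsheet rows into Netbox cable payloads. The column names and field mapping here are assumptions for illustration; the actual spreadsheet layout and the exact cable fields we'd set (e.g. terminations) would need to be checked against our data and the Netbox 2.5 API.

```python
import csv
import io


def rows_to_cables(csv_text):
    """Convert spreadsheet rows into Netbox cable payload dicts.

    Column names (label, color, type) are illustrative assumptions
    about the spreadsheet, not its verified layout.
    """
    cables = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cables.append({
            "label": row["label"],
            "color": row["color"],
            "type": row["type"],
        })
    return cables


# Each payload would then be POSTed to the cables endpoint
# (available since Netbox 2.5), e.g. with requests:
#   requests.post(f"{NETBOX_URL}/api/dcim/cables/",
#                 json=payload,
#                 headers={"Authorization": f"Token {TOKEN}"})
```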
Jan 22 2019
Jan 21 2019
This is a super old server; it just crossed its 7-year mark (we typically refresh servers at 4.5-5 years), so we're way past its warranty and shelf life and I'm not sure if we have spare parts for it at this point... Not sure if we can do much here -- maybe try a different DIMM or something, if we have one, but I don't have high hopes (also, given the use case... faulty memory is scary). @Papaul, any thoughts?
Jan 18 2019
In T213748#4890195, @RobH wrote: Synced up with Chris via IRC:
All systems were able to come back up within a2 without incident. The spare PDU is in place, but it will also be replaced when rows A and B have PDU refresh this fiscal.
@fgiunchedi so could you describe in a bit more detail what is needed here and what were the challenges you faced with prometheus-snmp-exporter last time you attempted this?
Dec 21 2018
I pushed what I had written a while ago in Gerrit (see above). It needs to be hooked up to our monitoring, but it should be in a working condition. Leaving that to @ayounsi, assuming you think the code looks good as it is :) Happy to review any subsequent PS updates!
- On received routes: I don't think we should be doing this kind of community matching in BGP_community_actions. Rather, I think we should have ASnnnn_in policy-statements that map our upstream's communities into our own communities (e.g. UPSTREAM_CUST_US), and then have BGP_community_actions act on those. That would make reading this match more straightforward. Note that this follows what we've done with our other communities (e.g. see AS13030_in and the likes).
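A hypothetical Junos sketch of that structure — term, policy, and community names as well as the community values are purely illustrative (documentation ASNs), not our actual config:

```
policy-statement ASnnnn_in {
    term map-cust-us {
        /* translate the upstream's own community into ours on import */
        from community ASnnnn_CUST_US_ORIG;
        then community add UPSTREAM_CUST_US;
    }
}
/* illustrative values only */
community ASnnnn_CUST_US_ORIG members 64500:100;
community UPSTREAM_CUST_US members 64496:200;
```

BGP_community_actions would then only ever need to match on our own communities, regardless of which upstream the route came from.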
@Cmjohnson what's the status of this?
Dec 19 2018
Dec 18 2018
In T211750#4831642, @akosiaris wrote: I like black too, but from https://black.readthedocs.io/en/stable/installation_and_usage.html it's tied to having Python 3.6 installed:
"Black can be installed by running pip install black. It requires Python 3.6.0+ to run." With stretch shipping with 3.4 (what do the Mac OS X versions do?) it might be a bit too restrictive to require it.
In T211750#4817482, @Volans wrote: On my side I've done a test on the cumin codebase with black. The results are:
- all the ignore comments for pylint or any other validation tool were misplaced (moved to the last line when splitting) and required manually moving them back to the first line [one-off]
- it doesn't pass flake8:
- E203 whitespace before ':' (this seems to be a bug on their side; it's for a list slice, _ARGV[index + 1 :])
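A minimal reproduction of that E203 conflict (illustrative function, not cumin's actual code): black formats a slice with a complex lower bound as `argv[index + 1 :]`, which flake8 then flags.

```python
def tail(argv, index):
    """Return everything after position index+1 in argv."""
    # black itself inserts the space before the colon below;
    # flake8 reports it as "E203 whitespace before ':'"
    return argv[index + 1 :]
```

Per black's own docs, the recommended workaround is to make flake8 ignore E203 (e.g. `extend-ignore = E203` in setup.cfg), since black considers its slice formatting PEP 8-compliant.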
This task is great, and the table at the top is a very useful summary! The Q2 goal part of it has been completed indeed, so I can see the argument for the task being resolved.
There were a few notices on the 15th and 16th of December. Did these arrive at maint-announce@?
Dec 17 2018
This was quite complicated, but I've managed to forward-port 3.4 and backport 3.6 and 3.7 to stretch. These are now included in the component/pyall component of the stretch-wikimedia suite, and they can be installed like one would normally install Python (apt install python3.{4,5,6,7}-venv should do it).
Moritz was working on that.
Dec 14 2018
In T196507#4824817, @Andrew wrote: @faidon, who is 'please also construct a draft email' directed to?
In T205899#4824679, @crusnov wrote: In T205899#4824310, @faidon wrote: I would say to also check that all devices matching some criteria are present in PuppetDB and vice-versa. These criteria may be a combination of:
- Type: Server
- Status: Active or Staged
- Tenant: None (and then define and set tenants "frack" and "sandbox", i.e. RIPE Atlases?)
This might be a lot harder, since the reports can't make a log_failure without a record present in Netbox already. We could make log lines for that though.
Manufacturer, model and serial checks all sound good to me! Manufacturer may need some rewriting; I think there's "Dell, Inc." vs. "Dell" and differences like that.
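A sketch of the kind of normalization I mean — this helper and its suffix list are hypothetical, and the real set of vendor-name variants would have to come from what's actually in Netbox and PuppetDB:

```python
import re

# Corporate suffixes to strip so that e.g. "Dell, Inc." and "Dell"
# compare equal. Illustrative list; extend as real variants show up.
_SUFFIX_RE = re.compile(r'[,.]?\s*(inc|corp(oration)?|ltd)\.?$', re.IGNORECASE)


def normalize_manufacturer(name):
    """Normalize a manufacturer name for cross-system comparison."""
    return _SUFFIX_RE.sub('', name.strip()).strip(' ,.')
```

The report could then compare `normalize_manufacturer()` of both sides instead of the raw strings, so only genuine mismatches get logged.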
Dec 13 2018
Dec 12 2018
OK, I had a look at this. A few observations first of all:
- While not 100% sure, I don't think this is related to the controller having been swapped before. I don't think it fits.
- cloudvirt1019 & cloudvirt1020 exhibit different symptoms at the moment. 1019 (which @Cmjohnson has been focusing on) shows its battery count as 1 but its status as "recharging", while 1020 shows no battery at all (count = 0).
Makes sense, +1, go for it! A lot has happened since this task was filed in 2015 (e.g. not having precise anymore, T163196 etc.) and including interface::add_ip6_mapped { 'main': } everywhere should be easy, if not completely painless! :)
Has been implemented for all hosts starting with stretch and going forward for a long time now!
Dec 11 2018
What is the rationale behind trying to empty this address space and/or find a new /24?
Dec 10 2018
- It's been a while, but I believe an import statement in the neighbor block overrides the parent one in its entirety, and does not supplement it, so we'd have to repeat the whole import chain there.
- Would it make sense to have separate as-path groups for v4/v6? It's a bit unusual in our config, but it would address the issue with HE and avoid inadvertently down-prefing HE for IPv4 for no reason.
- If we're going to remove the local-preference setting from BGP_IXP_in and just rely on BGP_community_actions to apply based on communities (it's a good idea!), then we should probably do the same for BGP_Private_Peer_in for consistency.
- Nitpick: the non-RS policies are called BGP_IXP_…, so let's follow that naming scheme (i.e. "BGP_IXP_RS_in", not "IX")
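On the v4/v6 split, a hypothetical sketch of what I mean — the group and as-path names are made up (6939 is HE's ASN), and the actual regex would need to match our existing style:

```
policy-options {
    /* referenced only from the IPv6 import policies, so the
       down-pref never applies to IPv4 routes through HE */
    as-path-group transit-downpref-v6 {
        as-path he-6939 ".* 6939 .*";
    }
}
```

The IPv4 policies would simply not reference this group, instead of sharing one group and special-casing HE.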
They don't, these aren't PoE switches. I didn't know these cameras required PoE. So, two options I suppose:
- Use PoE injectors
- Hook them up to (old) EX4200s. Are we using any of them for mgmt switches yet? Cameras seem a better fit for the mgmt network than the production network anyway, right?
@Cmjohnson all of the ports show as "physical link down", could you have a look? Thanks!
Dec 7 2018
Can we add procurement task and purchase date immediately? It doesn't sound like there is an immediate blocker to this.
Dec 6 2018
Per @bd808 on IRC:
Dec 5 2018
Some thoughts here:
Dec 4 2018
The forward paths are nearly identical, but the reverse is not: reverse path selection is HE for IPv6 and NTT for IPv4, so different paths, and latency could be reasonably explained by that.
Any progress on this?
Nov 30 2018
In T210667#4789588, @chasemp wrote: In this case specifically, my thinking was that I had agreement and understanding with another Opsen, a manager in Tech, a director in Tech and a couple more knowledgeable and engaged parties in real time right before (as review of action). I installed the package with a !log so it would be recorded in the right place and a ping to one of the Opsen who works in that specific area.