
Fri, Sep 13

faidon reassigned T227425: codfw: 1 misc node for the Kerberos KDC service from faidon to RobH.

Approved.

Fri, Sep 13, 12:04 PM · hardware-requests, Operations, User-Elukey, Analytics
faidon reassigned T227288: eqiad: 1 misc node for the Kerberos KDC service from faidon to RobH.

It sounds like our spare pools are being drained, so if that's the case feel free to open a task to replenish them.

Fri, Sep 13, 12:03 PM · hardware-requests, Operations, User-Elukey, Analytics

Aug 14 2019

faidon reopened T211368: update PDUs for eqsin (asset tag and other info) as "Open".

Note this is now flagged in the Accounting report instead, as these are missing from Finance's spreadsheet - they have not been documented as assets, which is a problem in itself (not capitalized/depreciated etc., although I'm not sure if they meet the capitalization threshold). We'd have to notify Finance/Julianne, ideally with serial numbers...

Aug 14 2019, 12:59 PM · Operations, ops-eqsin
faidon added a comment to T167841: Cleanup confed BGP peerings and policies.

That's an awesome idea, nice!

Aug 14 2019, 12:28 PM · Operations, netops

Aug 2 2019

faidon added a comment to T223450: Triage and resolve all outstanding Netbox report errors.

I've checked some boxes in the task description, but note that it is no longer accurate, as we've had a number of regressions since, with 85 new failures (on a cursory look, they all appear to be real data errors).

Aug 2 2019, 12:25 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
Aug 2 2019, 12:23 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
faidon closed T221984: scs-a1-codfw: update serial in netbox, a subtask of T223450: Triage and resolve all outstanding Netbox report errors, as Resolved.
Aug 2 2019, 12:22 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
faidon closed T221984: scs-a1-codfw: update serial in netbox as Resolved.
Aug 2 2019, 12:22 PM · netbox, Operations, ops-codfw

Jul 31 2019

faidon assigned T226044: Prepare Phame to support heavy traffic for a Tech Department blog to JAufrecht.
Jul 31 2019, 5:14 PM · Release-Engineering-Team-TODO (201909), User-greg, Release-Engineering-Team (Development services), Operations, Traffic, Phabricator

Jul 30 2019

faidon added a comment to T151304: tmpreaper possible race condition.

We can start by responding to Debian bug #763858 with your fix and see if the maintainer is willing to incorporate this!

Jul 30 2019, 11:54 AM · serviceops, Operations

Jul 26 2019

faidon added a comment to T229101: Phase monitoring for new PDUs.

whereas ulsfo PDUs installed in T209101 are currently missing icinga phase monitoring checks (i.e. only ping checks)

Jul 26 2019, 11:35 AM · observability, DC-Ops, Operations

Jul 24 2019

faidon added a comment to T185337: rack spare switches in c1-eqiad.

These could be racked in any rack, including in row A. It would be useful to have a working lab out of our spares - this came up yesterday/today when we were wondering if we had QSFPs that were known to be working.

Jul 24 2019, 2:26 PM · Operations, netops, ops-eqiad

Jul 20 2019

faidon added a comment to T228533: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15.

Note that they do not say that we will stop getting updates but merely that we won't be able to benefit from this "security feature". It does sound scary on a first read, though -- I got confused myself.

Jul 20 2019, 3:34 AM · Analytics, Traffic, Operations

Jul 12 2019

faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
Jul 12 2019, 6:59 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
faidon created T227911: msw1-eqsin/msw2-eqsin missing serial number.
Jul 12 2019, 6:59 PM · ops-eqsin, Operations
faidon lowered the priority of T220639: Show IPs matching a list of IP subnets in Webrequest data from Normal to Low.

How do we run this with a venv so that we can include Pytricia?

Ideally if we had a deb package for this library we could deploy it on all the worker nodes and use it :)

Jul 12 2019, 12:07 PM · User-Elukey, Analytics

Jun 28 2019

faidon closed T187994: netfilter software at WMF: iptables vs nftables as Declined.

I think there's a bit of a confusion. AIUI, nftables can refer to two different things:

  1. The nf_tables kernel subsystem
  2. The nftables userspace tool, which interfaces with (1)
Jun 28 2019, 10:48 PM · cloud-services-team (Kanban), Operations
faidon added a comment to P8683 RPKI check for invalids.

The equivalent linear search (diff below) is ~800-850 times slower on my laptop:

Jun 28 2019, 2:58 PM · netops, Analytics
faidon edited P8683 RPKI check for invalids.
Jun 28 2019, 2:54 PM · netops, Analytics
faidon reopened T220639: Show IPs matching a list of IP subnets in Webrequest data as "Open".

So, a few things:

  • There is a better source for this kind of data, that is updated hourly rather than monthly: https://as286.net/data/ana-invalids.txt
  • For RPKI specifically we would also like to differentiate between three states: no match, match but with no alternative prefix (unreachable), and match but with an alternative prefix (invalid-but-reachable)
  • I'd like us to be able to see the evolution of that data over time, so as to track the percentage of traffic that we would lose if we were to move forward with rejecting RPKI invalids. Ideally that would be a Grafana graph or something, but if we have no such capabilities, there is no reason to add them - this would most likely be temporary (i.e. grab data for something like a month).
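
As a rough illustration of the three states described above, here is a minimal sketch using only Python's standard `ipaddress` module. It is not the code from P8683; the function name, the two prefix lists and the sample addresses (documentation ranges) are all hypothetical, and "alternative prefix" is assumed to mean any other announced prefix that still covers the address.

```python
import ipaddress
from collections import Counter

# Hypothetical sample data: prefixes flagged as RPKI-invalid, and other
# (non-invalid) prefixes that could still carry the traffic.
INVALID = [ipaddress.ip_network(p) for p in ("198.51.100.0/24", "203.0.113.0/25")]
ALTERNATIVE = [ipaddress.ip_network(p) for p in ("203.0.113.0/24",)]


def classify(ip, invalid_prefixes, alternative_prefixes):
    """Return one of 'no-match', 'unreachable', 'invalid-but-reachable'."""
    addr = ipaddress.ip_address(ip)
    # Simple linear scan, kept for clarity rather than speed.
    if not any(addr in net for net in invalid_prefixes):
        return "no-match"
    if any(addr in net for net in alternative_prefixes):
        return "invalid-but-reachable"
    return "unreachable"


def summarize(ips, invalid_prefixes, alternative_prefixes):
    """Count how many of the given IPs fall into each state."""
    return Counter(classify(ip, invalid_prefixes, alternative_prefixes) for ip in ips)
```

Calling `summarize()` on a day's worth of client IPs and dividing each count by the total would give the per-state percentage to track over time.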
Jun 28 2019, 2:15 PM · User-Elukey, Analytics
faidon created P8683 RPKI check for invalids.
Jun 28 2019, 2:06 PM · netops, Analytics

Jun 27 2019

faidon renamed T226769: consider running bastion Prometheis inside cgroups from consider running bastion Prometheus inside cgroups to consider running bastion Prometheis inside cgroups.
Jun 27 2019, 10:03 PM · Operations, observability
faidon renamed T226769: consider running bastion Prometheis inside cgroups from consider running bastion Prometheis inside cgroups to consider running bastion Prometheus inside cgroups.
Jun 27 2019, 10:00 PM · Operations, observability

Jun 26 2019

faidon added a comment to T220669: RPKI Validation.

Thanks @JobSnijders, appreciate the feedback very much :) Our goal is to reject all invalids everywhere indeed, just progressively so.

Jun 26 2019, 12:35 AM · Operations, netops

Jun 25 2019

faidon added a comment to T219486: Send peering requests to AS with the worst TTFB.

So I should do that for that list? Are you ok with me requesting peering from all of these AS?
Is there an existing email template?

Jun 25 2019, 10:12 PM · Traffic, Operations, Performance-Team

Jun 21 2019

faidon added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

Ceph is capable of saturating 10G links under heavy load
[...]
Rate-limiting traffic is likely to collapse the cluster.
[...]
I will add that plenty of people build new networks just for Ceph (partly to get jumbo frames).

Jun 21 2019, 9:37 AM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services

Jun 15 2019

faidon added a comment to T225713: CPU scaling governor audit.

So, I think there are two distinct problems discovered in the past few days

  • ondemand results in some really poor performance on the ms-be boxes. Going from 50% CPU util to 5% with an ondemand->performance switch probably means that this CPU scaling is not really scaling... on demand :) This may be specific to the workload of ms-bes, potentially affected by Meltdown/Spectre firmware updates, and/or it could be specific to HP hardware (or a subgeneration of it, like HP Gen9). These things generally tend to depend on the firmware, but note also that HPs use the pcc_cpufreq Linux module, unlike all other systems.
  • A lot of systems seem to have the governor set to powersave, which may result in poor performance, depending on the workload.
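
For a per-host view of the second point, a minimal sketch (not the audit tooling actually used for this task) can read the governor of every CPU from the Linux cpufreq sysfs files. The function name and the `sysfs_root` parameter are hypothetical; the parameter exists only so the function can be pointed at a test directory.

```python
from collections import Counter
from pathlib import Path


def governor_counts(sysfs_root="/sys/devices/system/cpu"):
    """Count CPUs per scaling governor, e.g. {'powersave': 40}."""
    counts = Counter()
    # cpu[0-9]* matches cpu0, cpu1, ... but not sibling dirs such as cpufreq/.
    for path in Path(sysfs_root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[path.read_text().strip()] += 1
    return counts
```

An empty result means the host exposes no cpufreq interface at all, which is worth reporting separately from a powersave or ondemand setting.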
Jun 15 2019, 1:23 PM · User-fgiunchedi, Operations
faidon renamed T225713: CPU scaling governor audit from CPU scaling governor on HP Gen9 hosts to CPU scaling governor audit.
Jun 15 2019, 1:09 PM · User-fgiunchedi, Operations

Jun 13 2019

faidon renamed T225713: CPU scaling governor audit from CPU scaling governor on ms-be hosts to CPU scaling governor on HP Gen9 hosts.
Jun 13 2019, 12:21 PM · User-fgiunchedi, Operations
faidon added a comment to T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.

So, the timeout patch above bumped the timeouts to 100s I think. On many hosts (e.g. ms-be1036, ms-be1037) these checks seemed to take about 1.5-3 minutes to run, so this issue would not be addressed by that. However, I also wondered why such a relatively simple thing would take such a long time to execute. The response seems to be two-fold:

Jun 13 2019, 12:41 AM · Patch-For-Review, User-fgiunchedi, Operations, observability

Jun 12 2019

Krinkle awarded T185319: IRC RecentChanges feed: code stewardship request an Orange Medal token.
Jun 12 2019, 9:33 PM · Tools, Operations, Analytics, Wikimedia-IRC-RC-Server, Code-Stewardship-Reviews
faidon assigned T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts to fgiunchedi.

Right now there are 14 outstanding alerts, or about 50% of all outstanding alerts:

Jun 12 2019, 9:05 AM · Patch-For-Review, User-fgiunchedi, Operations, observability

Jun 1 2019

faidon added a comment to T221507: Netbox report to validate network equipment data.

It seems like part of the challenge is identifying clustered equipment (i.e. asw stacks & pfw). In those cases, the device appears in LibreNMS as one device with the switches as FPC linecards (presumably as inventory?), while on the Netbox end they appear as separate, distinct devices. I haven't looked at this deeply, but I suppose a lot of the complexity in the report comes from there.

Jun 1 2019, 1:09 PM · netbox, User-crusnov, SRE-tools, Operations, netops
faidon added a comment to T187456: Decommission labstore100[123] and their disk shelves.

One note for @Cmjohnson for the upcoming decom which is apparently imminent: labstore1003-arrayN are one of the handful of cases that lack an asset tag in Netbox. Last time we talked about this (1+ year ago), I believe you had mentioned that the tag wasn't visible due to the way they are racked. Now that they are getting unracked, it'd be ideal to recover that asset tag and enter it in Netbox to have it on the records and keep it while these remain in storage. Thanks!

Jun 1 2019, 11:26 AM · decommission, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad
faidon awarded T209527: Set up scratch and maps NFS services on cloudstore1008/9 a Party Time token.
Jun 1 2019, 11:21 AM · Patch-For-Review, cloud-services-team (Kanban)

May 29 2019

faidon assigned T224535: Investigate cr2-eqord's disconnection from the rest of the network to ayounsi.
May 29 2019, 9:35 AM · Operations, netops
faidon updated the task description for T224535: Investigate cr2-eqord's disconnection from the rest of the network.
May 29 2019, 9:32 AM · Operations, netops
faidon updated subscribers of T224535: Investigate cr2-eqord's disconnection from the rest of the network.

OK, so the vendor "bounced the interface" and the eqiad<->eqord traffic has been restored. What they noticed -and I confirmed- is that this interface was not carrying traffic since May 24th.

May 29 2019, 9:27 AM · Operations, netops
faidon added a comment to T221507: Netbox report to validate network equipment data.
  • esams should be blacklisted for now indeed.
  • test_nb_inventory_in_librenms could use some improvement -- it didn't say which device, s/n or anything to identify them as far as I can tell?
  • On the device types errors, I can't help but think that we're looking at the wrong field? e.g. take cr1-eqsin as an example: the message says Netbox devtype=Juniper MX104, LibreNMS devtype=Juniper 750-062050, but LibreNMS does know this is an MX104 (see under "Hardware" here).
  • I don't know what these "duplicate serial numbers" are, and we'd need more information to understand if these are real errors or report errors.
  • The cr1-eqsin serial change is a bit odd. Netbox used to have a record of what Juniper reports as the "midplane" serial number, not the "chassis". This was changed, but the midplane was what we had from the invoice as well -- so note that the Accounting report is now error'ing out instead.
  • asw-N-eqiad serial changes above -- these are now inconsistent with what we have from the Accounting side (so the report fails now). This needs further investigation to determine which one is ground truth.
May 29 2019, 6:23 AM · netbox, User-crusnov, SRE-tools, Operations, netops
faidon added a comment to T224535: Investigate cr2-eqord's disconnection from the rest of the network.

So for the two that went down there was no planned maintenance, but we did get an email from the vendor ("00985243 Disturbance") suggesting that this was an unplanned event.

May 29 2019, 5:52 AM · Operations, netops
faidon updated the task description for T224535: Investigate cr2-eqord's disconnection from the rest of the network.
May 29 2019, 5:38 AM · Operations, netops
faidon triaged T224535: Investigate cr2-eqord's disconnection from the rest of the network as High priority.
May 29 2019, 5:36 AM · Operations, netops

May 28 2019

fgiunchedi awarded T93208: (U)EFI support a Love token.
May 28 2019, 8:17 AM · Operations
faidon closed T93208: (U)EFI support as Resolved.

OK, a few changes later, and we have a working EFI install in a VM (d-i-test) \o/

May 28 2019, 12:53 AM · Operations

May 27 2019

faidon added a comment to T93208: (U)EFI support.

So I just pushed a change that uses syslinux.efi above. This may prove to be short-lived, as we may switch to another PXE implementation (iPXE or GRUB, more on that later) but it should work. It /may/ require appending initrd=initrd.gz to the kernel command-line options.

May 27 2019, 3:50 PM · Operations
faidon moved T214024: Two test hosts for SREs from Pending Approval to Allocation/Ordering/Implementation on the hardware-requests board.

I don't know what the status of this is, it's been a while it seems. I see it was pending for my approval, which I've missed -- apologies! Approved now.

May 27 2019, 12:51 PM · Operations, hardware-requests

May 23 2019

faidon added a comment to T223628: Replace Camus with Kafka Connect for event data imports.

Unfortunately, I think this is one of the matters that we cannot fully discuss in a public task. I'll start a private email thread; if anyone reading this is interested to be part of this, ping me off-list and I can loop you in :)

May 23 2019, 10:05 PM · Analytics, EventBus
faidon added a comment to T222654: ms-be2043 'sdd' throwing lots of errors.

I'm not at all sure, but I don't see an LD 5 at all. Is it possible that instead of remaining as a degraded LD (with a failed disk) it got removed entirely somehow and that's what's causing the renumbering of LDs > 6 to smaller sd letters?

May 23 2019, 12:14 AM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations

May 22 2019

faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
May 22 2019, 7:26 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops

May 17 2019

faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
May 17 2019, 4:17 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
May 17 2019, 11:03 AM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
faidon renamed T223467: Cleanup/delete recycled and returned (lease tranche 1) hardware from Netbox from cleanup/delete sold off decom and lease hardware from netbox to Cleanup/delete recycled and returned (lease tranche 1) hardware from Netbox.
May 17 2019, 8:05 AM · DC-Ops, Operations

May 16 2019

faidon updated subscribers of T222922: wmf7622 wont powercycle (cannot be allocated from spares).

Also adding @Volans here who designed this for his input :)

May 16 2019, 10:28 PM · Operations, ops-eqiad
faidon updated subscribers of T221068: decom ms-be201[345].

Please note these show 'decommission' in netbox when they are still actively calling into puppet. So they should be active in netbox until they are added to the decommission queue and shifted to dc ops to decom them.
@fgiunchedi: I added in the decommission project so it's easier to find out why these are showing on the report listed here.
We should likely shift all those ms-be systems back to active in netbox.

May 16 2019, 9:54 PM · decommission, ops-codfw, media-storage, User-fgiunchedi, Operations
faidon added a subtask for T223450: Triage and resolve all outstanding Netbox report errors: T221984: scs-a1-codfw: update serial in netbox.
May 16 2019, 9:16 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
faidon added a parent task for T221984: scs-a1-codfw: update serial in netbox: T223450: Triage and resolve all outstanding Netbox report errors.
May 16 2019, 9:16 PM · netbox, Operations, ops-codfw
faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
May 16 2019, 9:15 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
faidon added a comment to T209425: Decommission rdb2001, rdb2002.

Sure, that sounds fine :)

May 16 2019, 4:46 PM · ops-codfw, User-jijiki, decommission, Operations
faidon reassigned T209425: Decommission rdb2001, rdb2002 from faidon to RobH.

I don't know why this needs my input? This sounds like a standard decom, unless I misunderstand it.

May 16 2019, 4:44 PM · ops-codfw, User-jijiki, decommission, Operations
faidon added a comment to T212878: Netbox racks consistency report.

This is the kind of thing that:

  • Removes flexibility from DC Ops
  • If it occurs, it's not affecting anyone else but the DC Ops person on the ground (compared to e.g. a documentation or operational error like missing consoles)
  • Is not the kind of thing that would go easily unnoticed by the person on the ground (like e.g. a "WNF1234" asset tag).
May 16 2019, 2:15 PM · netbox, Operations, netops
faidon triaged T223450: Triage and resolve all outstanding Netbox report errors as Normal priority.
May 16 2019, 1:34 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops

May 14 2019

faidon updated subscribers of T128592: Add redundancy to IRC recent changes service.

That's an old task! @Ottomata et al may have an opinion.

May 14 2019, 12:58 PM · Operations, Availability (MediaWiki-MultiDC), codfw-rollout
faidon added a comment to T213843: Juniper network device audit - all sites.

Update from IRC: Juniper's install base is actually missing a whole lot of our devices (e.g. only lists 9 EX4300s, out of... 52). @ayounsi is asking them, but this clearly needs more work :(

May 14 2019, 11:07 AM · DC-Ops, netops, Operations

May 13 2019

faidon closed T223100: Confirm asset tags for asw2-a6/a7/a8/b5-eqiad as Resolved.

Perfect, thank you!

May 13 2019, 3:17 PM · Operations, ops-eqiad, DC-Ops
faidon created T223100: Confirm asset tags for asw2-a6/a7/a8/b5-eqiad.
May 13 2019, 1:07 PM · Operations, ops-eqiad, DC-Ops
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

Just a note, admin_down does not seem to indicate anything particular about the machines that is useful to denote in Netbox as far as I can tell? It seems to reflect the *desired* state. To clarify, is there any situation where it would not match the op_state within a short period of time? AFAICT it is used to tell ganeti to down or up the machine but I may be incorrect here. I have implemented mirroring the op_state but if we truly do need an extra field for admin_state that'd be useful to know.

May 13 2019, 10:49 AM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools
faidon removed a project from T222424: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6): observability.
May 13 2019, 10:38 AM · Operations, netops

Apr 28 2019

faidon renamed T221984: scs-a1-codfw: update serial in netbox from scs-c1-codfw : update serial in netbox to scs-a1-codfw: update serial in netbox.
Apr 28 2019, 2:08 AM · netbox, Operations, ops-codfw

Apr 27 2019

faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.
  • We should add another check that checks the device type vs. facter's productname. It should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) Dell is self-reported by facter as "Dell Inc.", so we'd need to mangle that, b) HP was renamed to HPE at some point in their products, which is not represented by Netbox.
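
A minimal sketch of such a comparison follows, written as plain Python rather than against the Netbox report API. The alias table, the dictionary layout and the function names are assumptions for illustration; in particular, mapping "HPE" to "HP" is only one possible way to handle the rename.

```python
# Hypothetical alias table: vendor strings as self-reported by facter,
# mapped to the manufacturer names assumed to be used in Netbox.
MANUFACTURER_ALIASES = {"Dell Inc.": "Dell", "HPE": "HP"}


def normalize_manufacturer(name):
    """Strip whitespace and fold known vendor aliases into one name."""
    name = name.strip()
    return MANUFACTURER_ALIASES.get(name, name)


def find_mismatches(netbox_devices, puppetdb_facts):
    """Return (host, field, netbox_value, puppetdb_value) tuples.

    Both arguments are dicts keyed by hostname (an assumed layout).
    """
    mismatches = []
    for host, nb in sorted(netbox_devices.items()):
        facts = puppetdb_facts.get(host)
        if facts is None:
            continue  # hosts missing from PuppetDB belong to a different check
        if nb["device_type"] != facts["productname"]:
            mismatches.append(
                (host, "device_type", nb["device_type"], facts["productname"])
            )
        nb_vendor = normalize_manufacturer(nb["manufacturer"])
        pdb_vendor = normalize_manufacturer(facts["manufacturer"])
        if nb_vendor != pdb_vendor:
            mismatches.append(
                (host, "manufacturer", nb["manufacturer"], facts["manufacturer"])
            )
    return mismatches
```

Keeping the aliases in one table means a future vendor rename only needs a new entry rather than a change to the comparison logic.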
Apr 27 2019, 12:38 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools

Apr 26 2019

faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

(Not sure if I should be piling on this never-ending task!)

Apr 26 2019, 4:26 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools
faidon updated the task description for T220422: Netbox Reports: General Cleanup and Improvement.
Apr 26 2019, 4:20 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools
faidon merged T221964: RIPE Atlas data in Prometheus into T167689: Add RIPE atlas data to Prometheus.
Apr 26 2019, 2:33 PM · observability, Operations
faidon merged task T221964: RIPE Atlas data in Prometheus into T167689: Add RIPE atlas data to Prometheus.
Apr 26 2019, 2:33 PM · Traffic, Operations, observability

Apr 25 2019

faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

I found (and corrected) two devices yesterday that had a purchase date of 2020-MM-DD. Let's add a simple check for "purchase date is in the future" to catch and avoid those :)
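
The check itself is a one-line comparison. A minimal sketch, assuming each device is a dict with hypothetical `name` and `purchase_date` keys (this is not the actual report code):

```python
from datetime import date


def future_purchase_dates(devices, today=None):
    """Return names of devices whose purchase date lies after today."""
    today = today or date.today()
    return [
        d["name"]
        for d in devices
        # Devices without a purchase date are left to a separate check.
        if d.get("purchase_date") is not None and d["purchase_date"] > today
    ]
```

Passing `today` explicitly keeps the function deterministic in tests.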

Apr 25 2019, 11:01 AM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools
faidon added a comment to T221632: Storage capacity upgrade for WDQS.

I don't think it makes sense to perpetuate a vertical scaling model. Both of the options listed here (adding disks, RAID 0) are things that we generally do not do, due to the hidden costs and burdens for everyone involved. Taking machines offline and rebuilding them from scratch just because a disk failed or because we need more storage is really something that we need to avoid, and something that the data center operations team cannot really support with its existing staffing (esp. taking into account the failure rate of disks).

Apr 25 2019, 10:32 AM · Wikidata, Wikidata-Query-Service, Discovery-Search
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

For the PuppetDB report:

  • I wonder if we should exclude VMs that are ADMIN_down from the Ganeti<->Netbox sync (not just the report). The PuppetDB report has only 2 VMs outstanding right now across all checks (yay!), and one of them is d-i-test, which is by design. I'm on the fence myself.

Excluding from sync (as a missing machine) would prevent them from showing up in the report, it is true. Perhaps using that to set the machine status instead would be a better way, so the machine would be present just with a status that we could exclude.

Apr 25 2019, 1:43 AM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools

Apr 24 2019

faidon updated the task description for T205897: Netbox: fill network topology.
Apr 24 2019, 10:11 PM · netbox, Operations

Apr 22 2019

faidon updated the task description for T205897: Netbox: fill network topology.
Apr 22 2019, 11:44 PM · netbox, Operations
faidon updated the task description for T205897: Netbox: fill network topology.
Apr 22 2019, 11:43 PM · netbox, Operations
faidon added a comment to T215229: Keep Ganeti VMs synchronized in Netbox.

Should this be resolved?

Apr 22 2019, 11:42 PM · Patch-For-Review, User-crusnov, SRE-tools
faidon added a comment to T221506: Inventorize network equipment in Netbox.

OK for switches, this did the trick:

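#!/usr/bin/perl
# Convert a hardware listing read from STDIN into CSV rows of
# inventory items: device,"FRU name",manufacturer,model,serial
use strict;
use warnings;
# Device name template; "%" is replaced with the FPC number.
my $template = $ARGV[0];
my $device;
while (<STDIN>) {
        chomp;
        if (/FPC (\d)/) {
                # A new FPC starts: derive the device name for the rows below.
                my $fpc = $1;
                $device = $template;
                $device =~ s/%/$fpc/;
        } elsif (/BUILTIN/) {
                # Skip lines for built-in components.
                next;
        } elsif (/((?:Power Supply|PIC) \d) +R(?:EV|ev) \d\d + [\d-]+ +([^ ]+) +([^ ]+)$/) {
                # Captures: item name, then the last two columns (serial, model).
                my ($fru, $serial, $model) = ($1, $2, $3);
                # Drop a trailing "-A" suffix from the model name.
                $model =~ s/-A$//;
                print "$device,\"$fru\",Juniper,$model,$serial\n";
        }
}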
Apr 22 2019, 10:48 PM · DC-Ops, Operations, netops
faidon added a comment to T221506: Inventorize network equipment in Netbox.

Apparently Netbox allows for a CSV import even for inventory items.

Apr 22 2019, 9:28 PM · DC-Ops, Operations, netops
faidon added a comment to T221507: Netbox report to validate network equipment data.

All excellent points :) I especially like the PDU & scs suggestion!

Apr 22 2019, 7:58 PM · netbox, User-crusnov, SRE-tools, Operations, netops

Apr 20 2019

faidon created T221507: Netbox report to validate network equipment data.
Apr 20 2019, 9:29 PM · netbox, User-crusnov, SRE-tools, Operations, netops
faidon triaged T221506: Inventorize network equipment in Netbox as Normal priority.
Apr 20 2019, 9:18 PM · DC-Ops, Operations, netops
faidon merged Restricted Task into T213843: Juniper network device audit - all sites.
Apr 20 2019, 8:54 PM · DC-Ops, netops, Operations
faidon removed a parent task for T213843: Juniper network device audit - all sites: Unknown Object (Task).
Apr 20 2019, 8:53 PM · DC-Ops, netops, Operations
faidon added a project to T213843: Juniper network device audit - all sites: DC-Ops.
Apr 20 2019, 8:53 PM · DC-Ops, netops, Operations
faidon added a comment to T213843: Juniper network device audit - all sites.

I was looking at FY19-20 CapEx planning and ran an export of the Entitlement Report from Juniper's website. The output is... not very close to the truth. There are serials there that do not match any of our gear, there are devices with serial numbers that do not match anything we have, plus the locations are all weird and wrong...

Apr 20 2019, 8:53 PM · DC-Ops, netops, Operations
faidon added a comment to T211368: update PDUs for eqsin (asset tag and other info).

Can we add procurement task and purchase date immediately? It doesn't sound like there is an immediate blocker to this.

Apr 20 2019, 8:49 PM · Operations, ops-eqsin
faidon updated the task description for T220422: Netbox Reports: General Cleanup and Improvement.
Apr 20 2019, 8:46 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

OK, a few more comments:

Apr 20 2019, 8:40 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools
faidon reassigned T213128: Replace eqiad mgmt switches with EX4200s from Cmjohnson to ayounsi.

I've surfaced the idea myself in the past, but the more I think about it the more I think it's not such a great idea at this point...

Apr 20 2019, 7:27 PM · ops-eqiad, netops, Operations

Apr 19 2019

faidon renamed T201346: rack/setup/install cumin1001.eqiad.wmnet (new cumin master) from rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) to rack/setup/install cumin1001.eqiad.wmnet (new cumin master).
Apr 19 2019, 11:52 AM · ops-eqiad, SRE-tools, Operations

Apr 18 2019

faidon added a comment to T221290: wiki-mail DKIM failing.

It's been a while but if I recall correctly, the intention was to not allow (= not create a valid signature for) emails that had e.g. From: person@wikipedia.org (where person = jimmy for instance), when those emails originated from the MW appserver fleet.

Apr 18 2019, 7:48 PM · Patch-For-Review, Traffic, Operations, DNS, Mail
faidon added a comment to T221290: wiki-mail DKIM failing.

How did it work until now?

Apr 18 2019, 7:07 PM · Patch-For-Review, Traffic, Operations, DNS, Mail
faidon added a comment to T216088: Mapping of servers to stakeholders.

Thanks @colewhite for raising (and re-raising!) this issue. This is a tricky but important problem to solve for sure!

Apr 18 2019, 11:11 AM · Operations

Apr 12 2019

faidon updated the task description for T220422: Netbox Reports: General Cleanup and Improvement.
Apr 12 2019, 10:14 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

That makes sense, should be pretty straightforward. You want this in the coherence checks?

Apr 12 2019, 10:09 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

I forgot another one, the opposite of this:

We need a new method, to check for devices with Status: Offline, that have row/rack assigned. I'm sure there are plenty of those now.
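
A minimal sketch of what such a method could test, in plain Python rather than the Netbox report API. The dict keys (`name`, `status`, `rack`) are assumptions; a rack assignment is taken to imply a row as well.

```python
def offline_but_racked(devices):
    """Return names of devices marked Offline that still have a rack set."""
    return [
        d["name"]
        for d in devices
        if d["status"] == "Offline" and d.get("rack") is not None
    ]
```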

Apr 12 2019, 8:39 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, SRE-tools