faidon (Faidon Liambotis)
SRE

Projects (11)

User Details

User Since
Oct 7 2014, 10:21 AM (245 w, 6 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF) [ Global Accounts ]

Recent Activity

Fri, Jun 21

faidon added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

Ceph is capable of saturating 10G links under heavy load
[...]
Rate-limiting traffic is likely to collapse the cluster.
[...]
I will add that plenty of people build new networks just for Ceph (partly to get jumbo frames).

Fri, Jun 21, 9:37 AM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services

Sat, Jun 15

faidon added a comment to T225713: CPU scaling governor audit.

So, I think there are two distinct problems that were discovered in the past few days:

  • ondemand results in some really poor performance on the ms-be boxes. Going from 50% CPU utilization to 5% with an ondemand->performance switch probably means that this CPU scaling is not really scaling... on demand :) This may be specific to the workload of the ms-bes, potentially affected by Meltdown/Spectre firmware updates, and/or it could be specific to HP hardware (or a subgeneration of it, like HP Gen9). These things generally tend to depend on the firmware, but note also that the HPs use the pcc_cpufreq Linux module, unlike all other systems.
  • A lot of systems seem to have the governor set to powersave, which may result in poor performance, depending on the workload.
Sat, Jun 15, 1:23 PM · media-storage, Operations
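A governor audit along the lines described above could start from the standard Linux cpufreq sysfs files. The following is a minimal sketch, not the tooling actually used: it counts CPUs per `scaling_governor` value and reports the `scaling_driver` of cpu0. The function names are hypothetical, and the base directory is a parameter only so the code can be pointed at a fake tree.

```perl
use strict;
use warnings;

# Read the first line of a file (without its newline), or undef if unreadable.
sub read_line {
    my ($path) = @_;
    open(my $fh, '<', $path) or return;
    my $line = <$fh>;
    close($fh);
    chomp $line if defined $line;
    return $line;
}

# Count CPUs per cpufreq scaling governor (e.g. powersave, ondemand, performance).
# $base defaults to the standard sysfs location; override it only for testing.
sub governor_counts {
    my ($base) = @_;
    $base //= '/sys/devices/system/cpu';
    my %count;
    for my $file (glob("$base/cpu[0-9]*/cpufreq/scaling_governor")) {
        my $gov = read_line($file);
        $count{$gov}++ if defined $gov;
    }
    return \%count;
}

# Name of the cpufreq driver reported for cpu0 (undef if cpufreq is absent).
sub cpufreq_driver {
    my ($base) = @_;
    $base //= '/sys/devices/system/cpu';
    return read_line("$base/cpu0/cpufreq/scaling_driver");
}
```

Calling `governor_counts()` with no argument on a host would show, for instance, whether every core is on `powersave`; comparing `cpufreq_driver()` across the fleet would single out the hosts using the pcc_cpufreq driver.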
faidon renamed T225713: CPU scaling governor audit from CPU scaling governor on HP Gen9 hosts to CPU scaling governor audit.
Sat, Jun 15, 1:09 PM · media-storage, Operations

Thu, Jun 13

faidon renamed T225713: CPU scaling governor audit from CPU scaling governor on ms-be hosts to CPU scaling governor on HP Gen9 hosts.
Thu, Jun 13, 12:21 PM · media-storage, Operations
faidon added a comment to T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.

So, the timeout patch above bumped the timeouts to 100s, I think. On many hosts (e.g. ms-be1036, ms-be1037) these checks seemed to take about 1.5-3 minutes to run, so this issue would not be addressed by that. However, I also wondered why such a relatively simple thing would take such a long time to execute. The answer seems to be two-fold:

Thu, Jun 13, 12:41 AM · Patch-For-Review, User-fgiunchedi, Operations, observability

Wed, Jun 12

Krinkle awarded T185319: IRC RecentChanges feed: code stewardship request an Orange Medal token.
Wed, Jun 12, 9:33 PM · Tools, Operations, Analytics, Wikimedia-IRC-RC-Server, Code-Stewardship-Reviews
faidon assigned T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts to fgiunchedi.

Right now there are 14 outstanding alerts, or about 50% of all outstanding alerts:

Wed, Jun 12, 9:05 AM · Patch-For-Review, User-fgiunchedi, Operations, observability

Sat, Jun 1

faidon added a comment to T221507: Netbox report to validate network equipment data.

It seems like part of the challenge is identifying clustered equipment (i.e. asw stacks & pfw). In those cases, the device appears in LibreNMS as one device with the switches as FPC linecards (presumably as inventory?), while on the Netbox end they appear as separate, distinct devices. I haven't looked at this deeply, but I suppose a lot of the complexity in the report comes from there.

Sat, Jun 1, 1:09 PM · Patch-For-Review, netbox, User-crusnov, Operations-Software-Development, Operations, netops
faidon added a comment to T187456: Decommission labstore100[123] and their disk shelves.

One note for @Cmjohnson for the upcoming decom, which is apparently imminent: labstore1003-arrayN are among the handful of cases that lack an asset tag in Netbox. Last time we talked about this (1+ year ago), I believe you mentioned that the tag wasn't visible due to the way they are racked. Now that they are getting unracked, it'd be ideal to recover that asset tag and enter it in Netbox, so that it is on record while these remain in storage. Thanks!

Sat, Jun 1, 11:26 AM · decommission, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad
faidon awarded T209527: Set up scratch and maps NFS services on cloudstore1008/9 a Party Time token.
Sat, Jun 1, 11:21 AM · Patch-For-Review, cloud-services-team (Kanban)

Wed, May 29

faidon assigned T224535: Investigate cr2-eqord's disconnection from the rest of the network to ayounsi.
Wed, May 29, 9:35 AM · Operations, netops
faidon updated the task description for T224535: Investigate cr2-eqord's disconnection from the rest of the network.
Wed, May 29, 9:32 AM · Operations, netops
faidon updated subscribers of T224535: Investigate cr2-eqord's disconnection from the rest of the network.

OK, so the vendor "bounced the interface" and the eqiad<->eqord traffic has been restored. What they noticed, and I confirmed, is that this interface had not been carrying traffic since May 24th.

Wed, May 29, 9:27 AM · Operations, netops
faidon added a comment to T221507: Netbox report to validate network equipment data.
  • esams should be blacklisted for now, indeed.
  • test_nb_inventory_in_librenms could use some improvement: as far as I can tell, it didn't say which device, serial number, or anything else that would identify them.
  • On the device type errors, I can't help but think that we're looking at the wrong field? e.g. take cr1-eqsin as an example: the message says Netbox devtype=Juniper MX104, LibreNMS devtype=Juniper 750-062050, but LibreNMS does know this is an MX104 (see under "Hardware" here).
  • I don't know what these "duplicate serial numbers" are, and we'd need more information to understand whether these are real errors or report errors.
  • The cr1-eqsin serial change is a bit odd. Netbox used to have a record of what Juniper reports as the "midplane" serial number, not the "chassis" one. This was changed, but the midplane serial was also what we had from the invoice, so note that the Accounting report is now erroring out instead.
  • asw-N-eqiad serial changes above: these are now inconsistent with what we have from the Accounting side (so that report fails now). This needs further investigation to determine which one is the ground truth.
Wed, May 29, 6:23 AM · Patch-For-Review, netbox, User-crusnov, Operations-Software-Development, Operations, netops
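The serial-number points above (reports that don't identify the device, unexplained "duplicate serial numbers") suggest that every error line should carry the device name and serial. The following is a minimal sketch of such a cross-check under assumed, hypothetical inputs: a hash of Netbox device name => serial and a list of LibreNMS inventory rows. It is not the actual report code.

```perl
use strict;
use warnings;

# $netbox:   hashref, device name => serial (hypothetical export format)
# $librenms: arrayref of { device => ..., entity => ..., serial => ... } rows,
#            e.g. one row per FPC of a switch stack (hypothetical format)
# Returns human-readable error strings that always name the device and serial.
sub compare_serials {
    my ($netbox, $librenms) = @_;
    my (@errors, %nb_by_serial, %lnms_serial);
    for my $name (sort keys %$netbox) {
        my $serial = $netbox->{$name};
        if (!defined $serial || $serial eq '') {
            push @errors, "$name has no serial in Netbox";
            next;
        }
        push @{ $nb_by_serial{$serial} }, $name;
    }
    # The same serial recorded on more than one Netbox device.
    for my $serial (sort keys %nb_by_serial) {
        my @names = @{ $nb_by_serial{$serial} };
        push @errors, "duplicate serial $serial in Netbox: @names" if @names > 1;
    }
    # Serials seen by LibreNMS that Netbox does not know about.
    for my $row (@$librenms) {
        next unless defined $row->{serial};
        $lnms_serial{ $row->{serial} } = 1;
        push @errors, "serial $row->{serial} ($row->{device} $row->{entity}) in LibreNMS but not in Netbox"
            unless exists $nb_by_serial{ $row->{serial} };
    }
    # Serials in Netbox that LibreNMS never reported.
    for my $serial (sort keys %nb_by_serial) {
        next if $lnms_serial{$serial};
        push @errors, "serial $serial (@{ $nb_by_serial{$serial} }) in Netbox but not in LibreNMS";
    }
    return @errors;
}
```

Keying the comparison on serials rather than on device names also sidesteps the stack-vs-separate-devices mismatch between LibreNMS and Netbox.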
faidon added a comment to T224535: Investigate cr2-eqord's disconnection from the rest of the network.

So for the two that went down there was no planned maintenance, but we did get an email from the vendor ("00985243 Disturbance") suggesting that this was an unplanned event.

Wed, May 29, 5:52 AM · Operations, netops
faidon updated the task description for T224535: Investigate cr2-eqord's disconnection from the rest of the network.
Wed, May 29, 5:38 AM · Operations, netops
faidon triaged T224535: Investigate cr2-eqord's disconnection from the rest of the network as High priority.
Wed, May 29, 5:36 AM · Operations, netops

Tue, May 28

fgiunchedi awarded T93208: (U)EFI support a Love token.
Tue, May 28, 8:17 AM · Operations
faidon closed T93208: (U)EFI support as Resolved.

OK, a few changes later, and we have a working EFI install in a VM (d-i-test) \o/

Tue, May 28, 12:53 AM · Operations

Mon, May 27

faidon added a comment to T93208: (U)EFI support.

So I just pushed a change that uses syslinux.efi above. This may prove to be short-lived, as we may switch to another PXE implementation (iPXE or GRUB, more on that later), but it should work. It /may/ require appending initrd=initrd.gz to the kernel command-line options.

Mon, May 27, 3:50 PM · Operations
faidon moved T214024: Two test hosts for SREs from Pending Approval to Allocation/Ordering/Implementation on the hardware-requests board.

I don't know what the status of this is; it seems it's been a while. I see it was pending my approval, which I missed (apologies!). Approved now.

Mon, May 27, 12:51 PM · Operations, hardware-requests

May 23 2019

faidon added a comment to T223628: Replace Camus with Kafka Connect for event data imports.

Unfortunately, I think this is one of the matters that we cannot fully discuss in a public task. I'll start a private email thread; if anyone reading this is interested in being part of it, ping me off-list and I can loop you in :)

May 23 2019, 10:05 PM · Analytics, EventBus
faidon added a comment to T222654: ms-be2043 'sdd' throwing lots of errors.

I'm not entirely sure, but I don't see an LD 5 at all. Is it possible that instead of remaining as a degraded LD (with a failed disk) it somehow got removed entirely, and that's what's causing the renumbering of LDs > 6 to smaller sd letters?

May 23 2019, 12:14 AM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations

May 22 2019

faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
May 22 2019, 7:26 PM · ops-codfw, ops-eqiad, Operations, Operations-Software-Development, netbox, DC-Ops

May 17 2019

faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
May 17 2019, 4:17 PM · ops-codfw, ops-eqiad, Operations, Operations-Software-Development, netbox, DC-Ops
faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
May 17 2019, 11:03 AM · ops-codfw, ops-eqiad, Operations, Operations-Software-Development, netbox, DC-Ops
faidon renamed T223467: Cleanup/delete recycled and returned (lease tranche 1) hardware from Netbox from cleanup/delete sold off decom and lease hardware from netbox to Cleanup/delete recycled and returned (lease tranche 1) hardware from Netbox.
May 17 2019, 8:05 AM · DC-Ops, Operations

May 16 2019

faidon updated subscribers of T222922: wmf7622 wont powercycle (cannot be allocated from spares).

Also adding @Volans, who designed this, for his input :)

May 16 2019, 10:28 PM · Operations, ops-eqiad
faidon updated subscribers of T221068: decom ms-be201[345].

Please note these show 'decommission' in netbox when they are still actively calling into puppet. So they should be active in netbox until they are added to the decommission queue and shifted to dc ops to decom them.

@fgiunchedi: I added the decommission project so it's easier to find out why these are showing up on the report listed here.

We should likely shift all those ms-be systems back to active in netbox.

May 16 2019, 9:54 PM · decommission, ops-codfw, media-storage, User-fgiunchedi, Operations
faidon added a subtask for T223450: Triage and resolve all outstanding Netbox report errors: T221984: scs-a1-codfw: update serial in netbox.
May 16 2019, 9:16 PM · ops-codfw, ops-eqiad, Operations, Operations-Software-Development, netbox, DC-Ops
faidon added a parent task for T221984: scs-a1-codfw: update serial in netbox: T223450: Triage and resolve all outstanding Netbox report errors.
May 16 2019, 9:16 PM · netbox, Operations, ops-codfw
faidon updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
May 16 2019, 9:15 PM · ops-codfw, ops-eqiad, Operations, Operations-Software-Development, netbox, DC-Ops
faidon added a comment to T209425: Decommission rdb2001, rdb2002.

Sure, that sounds fine :)

May 16 2019, 4:46 PM · Patch-For-Review, ops-codfw, User-jijiki, decommission, Operations
faidon reassigned T209425: Decommission rdb2001, rdb2002 from faidon to RobH.

I don't know why this needs my input? This sounds like a standard decom, unless I misunderstand it.

May 16 2019, 4:44 PM · Patch-For-Review, ops-codfw, User-jijiki, decommission, Operations
faidon added a comment to T212878: Netbox racks consistency report.

This is the kind of thing that:

  • Removes flexibility from DC Ops
  • If it occurs, it doesn't affect anyone but the DC Ops person on the ground (compared to e.g. a documentation or operational error like missing consoles)
  • Is not the kind of thing that would easily go unnoticed by the person on the ground (as e.g. a "WNF1234" asset tag would).
May 16 2019, 2:15 PM · netbox, netops, Operations
faidon triaged T223450: Triage and resolve all outstanding Netbox report errors as Normal priority.
May 16 2019, 1:34 PM · ops-codfw, ops-eqiad, Operations, Operations-Software-Development, netbox, DC-Ops

May 14 2019

faidon updated subscribers of T128592: Add redundancy to IRC recent changes service.

That's an old task! @Ottomata et al may have an opinion.

May 14 2019, 12:58 PM · Operations, Availability (MediaWiki-MultiDC), codfw-rollout
faidon added a comment to T213843: Juniper network device audit - all sites.

Update from IRC: Juniper's install base is actually missing a whole lot of our devices (e.g. it only lists 9 EX4300s out of... 52). @ayounsi is asking them, but this clearly needs more work :(

May 14 2019, 11:07 AM · DC-Ops, netops, Operations

May 13 2019

faidon closed T223100: Confirm asset tags for asw2-a6/a7/a8/b5-eqiad as Resolved.

Perfect, thank you!

May 13 2019, 3:17 PM · Operations, ops-eqiad, DC-Ops
faidon created T223100: Confirm asset tags for asw2-a6/a7/a8/b5-eqiad.
May 13 2019, 1:07 PM · Operations, ops-eqiad, DC-Ops
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

Just a note: as far as I can tell, admin_down does not indicate anything about the machines that is useful to record in Netbox. It seems to reflect the *desired* state. To clarify, is there any situation where it would not match the op_state within a short period of time? AFAICT it is used to tell Ganeti to bring the machine down or up, but I may be incorrect here. I have implemented mirroring of the op_state, but if we truly do need an extra field for admin_state, that'd be useful to know.

May 13 2019, 10:49 AM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon removed a project from T222424: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6): observability.
May 13 2019, 10:38 AM · Operations, netops

Apr 28 2019

faidon renamed T221984: scs-a1-codfw: update serial in netbox from scs-c1-codfw : update serial in netbox to scs-a1-codfw: update serial in netbox.
Apr 28 2019, 2:08 AM · netbox, Operations, ops-codfw

Apr 27 2019

faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.
  • We should add another check that compares the device type against facter's productname. They should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) Dell is self-reported by facter as "Dell Inc.", so we'd need to mangle that, and b) HP was renamed to HPE at some point in their products, which is not reflected in Netbox.
Apr 27 2019, 12:38 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
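The proposed device-type and manufacturer checks need a normalization step for the two quirks mentioned ("Dell Inc." vs. Dell, HPE vs. HP). The following is a minimal sketch; the alias table, the case-insensitive comparison and the argument names are assumptions rather than the actual report implementation.

```perl
use strict;
use warnings;

# Assumed aliases: facter reports "Dell Inc." where Netbox has "Dell", and
# newer HP products report HPE while Netbox still says HP.
my %MANUFACTURER_ALIASES = (
    'dell inc.'                  => 'dell',
    'hpe'                        => 'hp',
    'hewlett packard enterprise' => 'hp',
);

# Lowercase, trim, and map known aliases to a canonical manufacturer name.
sub normalize_manufacturer {
    my ($name) = @_;
    my $key = lc($name // '');
    $key =~ s/^\s+|\s+$//g;
    return $MANUFACTURER_ALIASES{$key} // $key;
}

# Hypothetical named arguments: netbox_type, netbox_mfr, productname, facter_mfr.
# Returns a list of mismatch messages (empty if everything agrees).
sub check_device {
    my (%args) = @_;
    my @errors;
    push @errors, "device type mismatch: Netbox=$args{netbox_type} facter=$args{productname}"
        if lc $args{netbox_type} ne lc $args{productname};
    push @errors, "manufacturer mismatch: Netbox=$args{netbox_mfr} facter=$args{facter_mfr}"
        if normalize_manufacturer($args{netbox_mfr}) ne normalize_manufacturer($args{facter_mfr});
    return @errors;
}
```

Keeping the aliases in one table means that any new vendor rename would be a one-line change rather than a new special case in the check.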

Apr 26 2019

faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

(Not sure if I should be piling on this never-ending task!)

Apr 26 2019, 4:26 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon updated the task description for T220422: Netbox Reports: General Cleanup and Improvement.
Apr 26 2019, 4:20 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon merged T221964: RIPE Atlas data in Prometheus into T167689: Add RIPE atlas data to Prometheus.
Apr 26 2019, 2:33 PM · observability, Operations
faidon merged task T221964: RIPE Atlas data in Prometheus into T167689: Add RIPE atlas data to Prometheus.
Apr 26 2019, 2:33 PM · Traffic, Operations, observability

Apr 25 2019

faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

I found (and corrected) two devices yesterday that had a purchase date of 2020-MM-DD. Let's add a simple check for "purchase date is in the future" to catch and avoid those :)

Apr 25 2019, 11:01 AM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
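The "purchase date is in the future" check is simple enough to sketch directly. This version assumes dates are ISO 8601 `YYYY-MM-DD` strings (which order correctly as plain strings) and takes a hypothetical hash of device name => purchase date; devices without a date are skipped.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Return, sorted by name, the devices whose purchase date is after $today.
# $devices: hashref of device name => "YYYY-MM-DD" (or undef if unknown).
# $today defaults to the current UTC date in the same format.
sub future_purchase_dates {
    my ($devices, $today) = @_;
    $today //= strftime('%Y-%m-%d', gmtime);
    return grep { defined $devices->{$_} && $devices->{$_} gt $today }
           sort keys %$devices;
}
```

With a date such as 2020-01-15 recorded on 2019-04-25, that device would be flagged.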
faidon added a comment to T221632: Storage capacity upgrade for WDQS.

I don't think it makes sense to perpetuate a vertical scaling model. Both of the options listed here (adding disks, RAID 0) are things that we generally do not do, due to the hidden costs and burdens for everyone involved. Taking machines offline and rebuilding them from scratch just because a disk failed or because we need more storage is really something that we need to avoid, and something that the data center operations team cannot really support with its existing staffing (esp. taking into account the failure rate of disks).

Apr 25 2019, 10:32 AM · Wikidata, Wikidata-Query-Service, Discovery-Search
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

For the PuppetDB report:

  • I wonder if we should exclude VMs that are ADMIN_down from the Ganeti<->Netbox sync (not just the report). The PuppetDB report has only 2 VMs outstanding right now across all checks (yay!), and one of them is d-i-test, which is ADMIN_down by design. I'm on the fence myself.

It is true that excluding them from the sync (as a missing machine) would prevent them from showing up in the report. Perhaps using the ADMIN_down state to set the machine status instead would be better: the machine would still be present, just with a status that we could exclude.

Apr 25 2019, 1:43 AM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
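The "set the status instead of excluding the VM" idea could look roughly like the following sketch. The mapping from Ganeti instance status values (such as running or ADMIN_down) to Netbox VM statuses is a hypothetical choice; unmapped statuses are deliberately left undefined so that a human looks at them.

```perl
use strict;
use warnings;

# Hypothetical mapping from a Ganeti instance status to a Netbox VM status.
my %GANETI_TO_NETBOX = (
    running    => 'active',
    ADMIN_down => 'offline',
);

# Netbox status for a Ganeti status, or undef when there is no mapping.
sub netbox_status {
    my ($ganeti_status) = @_;
    return $GANETI_TO_NETBOX{$ganeti_status};
}

# VMs the report should still check: all synced VMs except those that are
# deliberately offline (e.g. a test VM kept ADMIN_down by design).
sub vms_for_report {
    my (%vms) = @_;    # VM name => Ganeti status
    return grep { (netbox_status($vms{$_}) // '') ne 'offline' } sort keys %vms;
}
```

The VM stays in Netbox either way; only the report filters on its status.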

Apr 24 2019

faidon updated the task description for T205897: Netbox: fill network topology.
Apr 24 2019, 10:11 PM · netbox, Operations

Apr 22 2019

faidon updated the task description for T205897: Netbox: fill network topology.
Apr 22 2019, 11:44 PM · netbox, Operations
faidon updated the task description for T205897: Netbox: fill network topology.
Apr 22 2019, 11:43 PM · netbox, Operations
faidon added a comment to T215229: Keep Ganeti VMs synchronized in Netbox.

Should this be resolved?

Apr 22 2019, 11:42 PM · Patch-For-Review, User-crusnov, Operations-Software-Development
faidon added a comment to T221506: Inventorize network equipment in Netbox.

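OK, for switches, this did the trick:

#!/usr/bin/perl
# Reads a Juniper chassis hardware listing (presumably "show chassis hardware"
# output) on STDIN and prints CSV rows for a Netbox inventory-item import:
#   device,"FRU name",Juniper,model,serial
# The first argument is a device-name template in which "%" is replaced by the
# (single-digit) FPC number of each virtual chassis member.
use strict;
use warnings;

my $template = $ARGV[0] // die "usage: $0 <device-name-template-with-%>\n";
my $device;
while (<STDIN>) {
        chomp;
        if (/FPC (\d)/) {
                # A new member switch: derive its device name from the template.
                my $fpc = $1;
                $device = $template;
                $device =~ s/%/$fpc/;
        } elsif (/BUILTIN/) {
                # Built-in components are skipped.
                next;
        } elsif (/((?:Power Supply|PIC) \d) +R(?:EV|ev) \d\d + [\d-]+ +([^ ]+) +([^ ]+)$/) {
                # Power supplies and PICs: capture the FRU name, serial and model.
                next unless defined $device;    # ignore rows seen before any FPC line
                my ($fru, $serial, $model) = ($1, $2, $3);
                $model =~ s/-A$//;              # strip a trailing "-A" suffix from the model
                print "$device,\"$fru\",Juniper,$model,$serial\n";
        }
}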
Apr 22 2019, 10:48 PM · DC-Ops, Operations, netops
faidon added a comment to T221506: Inventorize network equipment in Netbox.

Apparently Netbox allows for a CSV import even for inventory items.

Apr 22 2019, 9:28 PM · DC-Ops, Operations, netops
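Since Netbox accepts a CSV import for inventory items, rows like the ones printed by the switch script could be wrapped with a header line and properly quoted fields. The following is a minimal sketch; the column names in `@HEADER` are an assumption and should be checked against the import form of the Netbox version in use.

```perl
use strict;
use warnings;

# Quote a CSV field only when it contains a comma, a double quote or
# whitespace, doubling any embedded double quotes (RFC 4180 style).
sub csv_field {
    my ($v) = @_;
    return $v unless $v =~ /[",\s]/;
    $v =~ s/"/""/g;
    return qq{"$v"};
}

# Hypothetical column names for the inventory-item import.
my @HEADER = qw(device name manufacturer part_id serial);

# Build the full CSV text from hashrefs keyed by the @HEADER column names.
sub inventory_csv {
    my (@rows) = @_;
    my @lines = (join(',', @HEADER));
    for my $row (@rows) {
        push @lines, join(',', map { csv_field($row->{$_} // '') } @HEADER);
    }
    return join("\n", @lines) . "\n";
}
```

Generating the header and quoting in one place avoids hand-escaping FRU names such as "Power Supply 0".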
faidon added a comment to T221507: Netbox report to validate network equipment data.

All excellent points :) I especially like the PDU & scs suggestion!

Apr 22 2019, 7:58 PM · Patch-For-Review, netbox, User-crusnov, Operations-Software-Development, Operations, netops

Apr 20 2019

faidon created T221507: Netbox report to validate network equipment data.
Apr 20 2019, 9:29 PM · Patch-For-Review, netbox, User-crusnov, Operations-Software-Development, Operations, netops
faidon triaged T221506: Inventorize network equipment in Netbox as Normal priority.
Apr 20 2019, 9:18 PM · DC-Ops, Operations, netops
faidon merged Restricted Task into T213843: Juniper network device audit - all sites.
Apr 20 2019, 8:54 PM · DC-Ops, netops, Operations
faidon removed a parent task for T213843: Juniper network device audit - all sites: Unknown Object (Task).
Apr 20 2019, 8:53 PM · DC-Ops, netops, Operations
faidon added a project to T213843: Juniper network device audit - all sites: DC-Ops.
Apr 20 2019, 8:53 PM · DC-Ops, netops, Operations
faidon added a comment to T213843: Juniper network device audit - all sites.

I was looking at FY19-20 CapEx planning and ran an export of the Entitlement Report from Juniper's website. The output is... not very close to the truth. There are serials there that do not match any of our gear, there are devices with serial numbers that do not match anything we have, plus the locations are all weird and wrong...

Apr 20 2019, 8:53 PM · DC-Ops, netops, Operations
faidon added a comment to T211368: update PDUs for eqsin (asset tag and other info).

Can we add the procurement task and purchase date right away? There doesn't seem to be any blocker for this.

Apr 20 2019, 8:49 PM · Operations, ops-eqsin
faidon updated the task description for T220422: Netbox Reports: General Cleanup and Improvement.
Apr 20 2019, 8:46 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

OK, a few more comments:

Apr 20 2019, 8:40 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon reassigned T213128: Replace eqiad mgmt switches with EX4200s from Cmjohnson to ayounsi.

I've surfaced the idea myself in the past, but the more I think about it the more I think it's not such a great idea at this point...

Apr 20 2019, 7:27 PM · ops-eqiad, Operations, netops

Apr 19 2019

faidon renamed T201346: rack/setup/install cumin1001.eqiad.wmnet (new cumin master) from rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) to rack/setup/install cumin1001.eqiad.wmnet (new cumin master).
Apr 19 2019, 11:52 AM · ops-eqiad, Operations-Software-Development, Operations

Apr 18 2019

faidon added a comment to T221290: wiki-mail DKIM failing.

It's been a while, but if I recall correctly, the intention was to not allow (i.e. not create a valid signature for) emails originating from the MW appserver fleet that had e.g. From: person@wikipedia.org (where person = jimmy, for instance).

Apr 18 2019, 7:48 PM · Patch-For-Review, Traffic, Operations, DNS, Mail
faidon added a comment to T221290: wiki-mail DKIM failing.

How did it work until now?

Apr 18 2019, 7:07 PM · Patch-For-Review, Traffic, Operations, DNS, Mail
faidon added a comment to T216088: Mapping of servers to stakeholders.

Thanks @colewhite for raising (and re-raising!) this issue. This is a tricky but important problem to solve for sure!

Apr 18 2019, 11:11 AM · Operations
faidon added a comment to T221142: Willy Pao onboarding.

Let's add Willy to the datacenter-ops group. I don't think he necessarily needs to be in the ops group (which is really a misnomer at this point) for now.

Apr 18 2019, 12:25 AM · SRE-Access-Requests, Patch-For-Review, Operations, DC-Ops
faidon updated the task description for T221142: Willy Pao onboarding.
Apr 18 2019, 12:23 AM · SRE-Access-Requests, Patch-For-Review, Operations, DC-Ops

Apr 12 2019

faidon updated the task description for T220422: Netbox Reports: General Cleanup and Improvement.
Apr 12 2019, 10:14 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

That makes sense; it should be pretty straightforward. Do you want this in the coherence checks?

Apr 12 2019, 10:09 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

I forgot another one, the opposite of this:

We need a new method to check for devices with Status: Offline that have a row/rack assigned. I'm sure there are plenty of those now.

Apr 12 2019, 8:39 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
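The requested Offline-with-rack check reduces to a simple filter. The following is a minimal sketch over plain hashrefs; the keys (`name`, `status`, `rack`) are hypothetical stand-ins for however the report accesses Netbox device fields.

```perl
use strict;
use warnings;

# Return the names of devices whose status is Offline but that still have
# a rack assigned. Device records use hypothetical keys: name, status, rack.
sub offline_but_racked {
    my (@devices) = @_;
    return map  { $_->{name} }
           grep { lc($_->{status} // '') eq 'offline' && defined $_->{rack} }
           @devices;
}
```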

Apr 11 2019

faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

OK, so, after the efforts of the past few days, we're in much better shape! The PuppetDB report seems to be (almost?) entirely indicative of real issues and is actionable now. I will involve DC Ops to start fixing the cases that are known to be real errors, and we'll see if there are any false positives (I know of at least one that is tough to handle!).

Apr 11 2019, 9:27 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon updated the task description for T220422: Netbox Reports: General Cleanup and Improvement.
Apr 11 2019, 9:07 PM · netbox, Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development

Apr 9 2019

faidon updated subscribers of T214903: labsdb1002-array1: status clarification.

@RobH and @Cmjohnson, is this a forgotten decom?

Apr 9 2019, 1:38 PM · decommission, DC-Ops, cloud-services-team (Kanban)
faidon added a project to T214903: labsdb1002-array1: status clarification: decommission.
Apr 9 2019, 1:38 PM · decommission, DC-Ops, cloud-services-team (Kanban)
faidon added a comment to T214181: codfw: rename/relabel labtestneutron2001 to cloudnet2001-dev.

Given T218025, can we resolve this?

Apr 9 2019, 1:10 PM · Operations, DC-Ops, ops-codfw
faidon added a comment to T202966: Make cp1099 the new pinkunicorn.

According to Netbox, cp1099 is 2 years newer than cp1008, but is still a 6-year-old server (purchased Mar 28, 2013). Can we just get rid of it? I'm concerned we're spending cycles on a box that may die any day now and that we won't be able to repair...

Apr 9 2019, 12:54 PM · Patch-For-Review, Traffic, Operations

Apr 8 2019

faidon added a comment to T209707: tagged_interface sometimes exceeds IFNAMSIZ.

I think this is addressed by systemd's 9009d3b5c3b6d191be69215736be77583e0f23f9, included in v239 (stretch has v232, buster has v241).

Apr 8 2019, 11:08 PM · Traffic, Operations
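For context on the IFNAMSIZ limit: Linux interface names must fit in IFNAMSIZ (16) bytes including the terminating NUL, i.e. at most 15 characters, which a parent-name-plus-VLAN scheme can easily exceed. The following is a minimal sketch of a pre-flight check; the `<parent>.<vlan>` naming and the example interface names are hypothetical and not necessarily what tagged_interface generates.

```perl
use strict;
use warnings;

# IFNAMSIZ from Linux <net/if.h>: 16 bytes including the trailing NUL,
# so a usable interface name has at most 15 characters.
use constant IFNAMSIZ => 16;

# True if an interface name fits within IFNAMSIZ.
sub ifname_fits {
    my ($name) = @_;
    return length($name) < IFNAMSIZ;
}

# Hypothetical "<parent>.<vlan id>" naming for an 802.1Q tagged interface.
sub tagged_ifname {
    my ($parent, $vlan) = @_;
    return "$parent.$vlan";
}
```

With a long predictable NIC name such as `enp175s0f1np1`, appending `.1234` already produces 18 characters, which the kernel would reject.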

Mar 21 2019

Mill <mill@mail.com> committed rOSKEYHOLDER9fb7d69208e6: pyaaaaaaaaaaaa (authored by faidon).
pyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDERecc54f53f151: )yaaaaaaaaaaaa (authored by faidon).
)yaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER4688af2fc102: uyaaaaaaaaaaaa (authored by faidon).
uyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER97de3d4dad7c: yyaaaaaaaaaaaa (authored by faidon).
yyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDERa588fd6bfc05: vyaaaaaaaaaaaa (authored by faidon).
vyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER50927819b02d: tyaaaaaaaaaaaa (authored by faidon).
tyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER48048fa41119: xyaaaaaaaaaaaa (authored by faidon).
xyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER21db23b59d4d: ryaaaaaaaaaaaa (authored by faidon).
ryaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER6a68d2ba2a5e: wyaaaaaaaaaaaa (authored by faidon).
wyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER90fb5301b369: 0yaaaaaaaaaaaa (authored by faidon).
0yaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDERaa816fdf9682: qyaaaaaaaaaaaa (authored by faidon).
qyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER668563582e69: zyaaaaaaaaaaaa (authored by faidon).
zyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDEReb7bd673b43c: syaaaaaaaaaaaa (authored by faidon).
syaaaaaaaaaaaa
Mar 21 2019, 12:41 AM

Mar 7 2019

faidon added a comment to T214183: Setup graphs for power usage readings in Grafana.

For per-site usage, LibreNMS, besides being clunky, is non-public and not accessible to everyone.

Mar 7 2019, 2:53 PM · DC-Ops, observability
faidon triaged T214183: Setup graphs for power usage readings in Grafana as High priority.
Mar 7 2019, 1:20 PM · DC-Ops, observability
faidon added a comment to T217686: Document service owner in Netbox.

This seems like a duplicate (and subset) of T216088. I've added the custom field proposal as one of the many options listed in its task description, and I'm closing this as a duplicate to keep the discussion in one place :)

Mar 7 2019, 12:22 PM · Operations