
faidon (Faidon Liambotis)
SRE

User Details

User Since
Oct 7 2014, 10:21 AM (236 w, 4 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF) [ Global Accounts ]

Recent Activity

Yesterday

faidon renamed T201346: rack/setup/install cumin1001.eqiad.wmnet (new cumin master) from rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) to rack/setup/install cumin1001.eqiad.wmnet (new cumin master).
Fri, Apr 19, 11:52 AM · ops-eqiad, Operations-Software-Development, Operations

Thu, Apr 18

faidon added a comment to T221290: wiki-mail DKIM failing.

It's been a while, but if I recall correctly, the intention was to not allow (= not create a valid signature for) emails that had e.g. From: person@wikipedia.org (where person = jimmy for instance), when those emails originated from the MW appserver fleet.

Thu, Apr 18, 7:48 PM · Patch-For-Review, Traffic, Operations, DNS, Mail
faidon added a comment to T221290: wiki-mail DKIM failing.

How did it work until now?

Thu, Apr 18, 7:07 PM · Patch-For-Review, Traffic, Operations, DNS, Mail
faidon added a comment to T216088: Mapping of servers to stakeholders.

Thanks @colewhite for raising (and re-raising!) this issue. This is a tricky but important problem to solve for sure!

Thu, Apr 18, 11:11 AM · Operations
faidon added a comment to T221142: Willy Pao onboarding.

Let's add Willy to the group datacenter-ops. I don't think he needs to necessarily be in the group ops (which is really a misnomer at this point), for now.

Thu, Apr 18, 12:25 AM · Patch-For-Review, Operations, DC-Ops
faidon updated the task description for T221142: Willy Pao onboarding.
Thu, Apr 18, 12:23 AM · Patch-For-Review, Operations, DC-Ops

Fri, Apr 12

faidon updated the task description for T220422: Netbox Reports: General Cleanup and Improvement.
Fri, Apr 12, 10:14 PM · Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

That makes sense, should be pretty straightforward. You want this in the coherence checks?

Fri, Apr 12, 10:09 PM · Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

I forgot another one, the opposite of this:

We need a new method to check for devices with Status: Offline that have a row/rack assigned. I'm sure there are plenty of those now.
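
A minimal sketch of such a check, written in plain Python rather than against the Netbox report API. The `Device` record, its field names, and `offline_with_rack` are hypothetical stand-ins for whatever the real report would query.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    """Hypothetical, simplified view of a Netbox device record."""
    name: str
    status: str          # e.g. "Active", "Staged", "Offline"
    rack: Optional[str]  # None when no rack is assigned


def offline_with_rack(devices: List[Device]) -> List[str]:
    """Return names of devices marked Offline that still have a rack assigned."""
    return [d.name for d in devices if d.status == "Offline" and d.rack is not None]
```

Each returned name would correspond to one failure line in the report; the opposite check (an active device with no rack) is the same filter with the conditions inverted.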

Fri, Apr 12, 8:39 PM · Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development

Thu, Apr 11

faidon added a comment to T220422: Netbox Reports: General Cleanup and Improvement.

OK, so, after the efforts of the past few days, we're in much better shape! The PuppetDB report seems to be (almost?) entirely indicative of real issues and is actionable now. I will involve DC Ops to start fixing the cases that are known to be real errors, and we'll see if there are any false positives (I know of at least one that is tough to handle!).

Thu, Apr 11, 9:27 PM · Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development
faidon updated the task description for T220422: Netbox Reports: General Cleanup and Improvement.
Thu, Apr 11, 9:07 PM · Patch-For-Review, User-crusnov, DC-Ops, Operations-Software-Development

Tue, Apr 9

faidon updated subscribers of T214903: labsdb1002-array1: status clarification.

@RobH and @Cmjohnson, is this a forgotten decom?

Tue, Apr 9, 1:38 PM · decommission, DC-Ops, cloud-services-team (Kanban)
faidon added a project to T214903: labsdb1002-array1: status clarification: decommission.
Tue, Apr 9, 1:38 PM · decommission, DC-Ops, cloud-services-team (Kanban)
faidon added a comment to T214181: codfw: rename/relabel labtestneutron2001 to cloudnet2001-dev.

Given T218025, can we resolve this?

Tue, Apr 9, 1:10 PM · Operations, DC-Ops, ops-codfw
faidon added a comment to T202966: Make cp1099 the new pinkunicorn.

According to Netbox, cp1099 is 2 years newer than cp1008, but is still a 6-year-old server (purchased Mar 28, 2013). Can we just get rid of it? I'm concerned we're just spending cycles on a box that may die any day now and that we won't be able to repair...

Tue, Apr 9, 12:54 PM · Patch-For-Review, Operations, Traffic

Mon, Apr 8

faidon added a comment to T209707: tagged_interface sometimes exceeds IFNAMSIZ.

I think this is addressed by systemd's 9009d3b5c3b6d191be69215736be77583e0f23f9, included in v239 (stretch has v232, buster has v241).

Mon, Apr 8, 11:08 PM · Patch-For-Review, Traffic, Operations

Mar 21 2019

Mill <mill@mail.com> committed rOSKEYHOLDER9fb7d69208e6: pyaaaaaaaaaaaa (authored by faidon).
pyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDERecc54f53f151: )yaaaaaaaaaaaa (authored by faidon).
)yaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER4688af2fc102: uyaaaaaaaaaaaa (authored by faidon).
uyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER97de3d4dad7c: yyaaaaaaaaaaaa (authored by faidon).
yyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDERa588fd6bfc05: vyaaaaaaaaaaaa (authored by faidon).
vyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER50927819b02d: tyaaaaaaaaaaaa (authored by faidon).
tyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER48048fa41119: xyaaaaaaaaaaaa (authored by faidon).
xyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER21db23b59d4d: ryaaaaaaaaaaaa (authored by faidon).
ryaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER6a68d2ba2a5e: wyaaaaaaaaaaaa (authored by faidon).
wyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER90fb5301b369: 0yaaaaaaaaaaaa (authored by faidon).
0yaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDERaa816fdf9682: qyaaaaaaaaaaaa (authored by faidon).
qyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDER668563582e69: zyaaaaaaaaaaaa (authored by faidon).
zyaaaaaaaaaaaa
Mar 21 2019, 12:41 AM
Mill <mill@mail.com> committed rOSKEYHOLDEReb7bd673b43c: syaaaaaaaaaaaa (authored by faidon).
syaaaaaaaaaaaa
Mar 21 2019, 12:41 AM

Mar 7 2019

faidon added a comment to T214183: Setup graphs for power usage readings in Grafana.

For the per-site usage, LibreNMS, besides being clunky, is non-public and not accessible to all.

Mar 7 2019, 2:53 PM · DC-Ops, monitoring
faidon triaged T214183: Setup graphs for power usage readings in Grafana as High priority.
Mar 7 2019, 1:20 PM · DC-Ops, monitoring
faidon added a comment to T217686: Document service owner in Netbox.

This seems like a duplicate (and subset of) T216088. I've added the custom field proposal as one of the many options listed in its task description and closing this as duplicate to keep the discussion in one place :)

Mar 7 2019, 12:22 PM · Operations
faidon updated the task description for T216088: Mapping of servers to stakeholders.
Mar 7 2019, 12:21 PM · Operations
faidon merged T217686: Document service owner in Netbox into T216088: Mapping of servers to stakeholders.
Mar 7 2019, 12:19 PM · Operations

Mar 5 2019

Dvorapa awarded T191764: CI: run tests with multiple Python3 versions a Love token.
Mar 5 2019, 9:22 AM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure

Mar 4 2019

faidon raised the priority of T212010: Degraded RAID on sodium from Normal to High.

I just merged a duplicate in. @Cmjohnson what's the status of this?

Mar 4 2019, 1:03 PM · ops-eqiad, Operations
faidon merged T217356: Degraded RAID on sodium into T212010: Degraded RAID on sodium.
Mar 4 2019, 1:02 PM · ops-eqiad, Operations
faidon reopened T122144: Move most (all?) exim personal aliases to OIT as "Open".

or they are individual aliases (out of scope of this ticket)

Mar 4 2019, 12:00 PM · Mail, Operations

Feb 20 2019

Krinkle awarded T122144: Move most (all?) exim personal aliases to OIT a Orange Medal token.
Feb 20 2019, 12:16 AM · Mail, Operations

Feb 15 2019

faidon reassigned T215837: eqiad: requesting dual cpu misc host for icinga1001 replacement from faidon to RobH.

If that's still needed, that's approved, and it takes priority over phab1002. And let's replenish our spare pool indeed!

Feb 15 2019, 4:11 PM · Operations, hardware-requests
faidon reassigned T215335: requesting WMF7426 as phabricator system in eqiad from faidon to RobH.

@Dzahn that's all fine, but we should have that documented in a separate Phabricator task tracking this work, if one doesn't exist already :) Separately, I'd also really love having a permanent non-SPOF setup in each data center as well, whether that's multiple bare metal servers, multiple VMs or running Phabricator on k8s. This is too important of a service to run in one misc-type server per site.

Feb 15 2019, 4:10 PM · serviceops, Operations, hardware-requests

Feb 14 2019

faidon added a comment to T216133: Increase visibility of auto-generated tasks for RAID errors.

We discussed this a little bit yesterday, and T216088 was filed to further discuss this. Help there is welcome :)

Feb 14 2019, 3:27 PM · DC-Ops, Operations, Wikimedia-Incident, cloud-services-team (Kanban)
faidon added a comment to T205897: Netbox: fill network topology.

The medium-term plan is for this data to be entered into Netbox after a server is racked but before it's provisioned or even powered up, and that data to be used by our tooling to configure and execute the provisioning itself (DHCP configuration, switchport, OS install etc.).

Feb 14 2019, 11:21 AM · Operations

Feb 12 2019

faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

Before these are delivered for implementation, let's make sure that the two systems have identical settings, especially given we've tested various things on them over the past few months. I reverted my SSD Smart Path setting on 1019, but there are still differences; the most important one that I noticed is that in cloudvirt1019 the P440ar is hidden (disabled in BIOS?) but in cloudvirt1020 it's visible. Maybe a factory reset and then manually reapplying the same settings in each?

Feb 12 2019, 10:47 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations

Feb 7 2019

CDanis awarded T126989: MediaWiki logging & encryption a Love token.
Feb 7 2019, 1:41 PM · MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), Patch-For-Review, monitoring, Wikimedia-Logstash, MediaWiki-Debug-Logger, Operations

Feb 5 2019

faidon added a comment to T215335: requesting WMF7426 as phabricator system in eqiad.

Is there a task describing the plans for a secondary Phabricator system? How did we come up with those specs?

Feb 5 2019, 10:50 PM · serviceops, Operations, hardware-requests

Feb 2 2019

faidon changed the status of T214130: Requesting access to production for dsharpe from Stalled to Open.
Feb 2 2019, 9:21 PM · Operations, SRE-Access-Requests
faidon changed the status of T214130: Requesting access to production for dsharpe, a subtask of T213742: Onboarding David Sharpe to Security Team as Information Security Analyst, from Stalled to Open.
Feb 2 2019, 9:21 PM · Security-Team
faidon added a comment to T214130: Requesting access to production for dsharpe.

Let's not wait for a meeting, approved!

Feb 2 2019, 9:21 PM · Operations, SRE-Access-Requests

Jan 26 2019

faidon added a comment to T214762: WMF's Grafana installation does not follow Wikimedia's visual identity guidelines.

It's tricky, but I think the one we use is probably the right one and this should be declined. See T212674 for context.

Jan 26 2019, 9:58 PM · Operations, monitoring

Jan 25 2019

faidon updated subscribers of T205897: Netbox: fill network topology.

Netbox is now at 2.5 \o/ which allows us to import cable IDs, type, color etc. Let's start with importing eqsin's, with the data that we have in the spreadsheet, so that we can deprecate that? @RobH @ayounsi any takers?

Jan 25 2019, 3:11 AM · Operations

Jan 22 2019

faidon committed rOSKEYHOLDER0fcbce6cca70: Add tests for OSError when loading config files (authored by faidon).
Add tests for OSError when loading config files
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER7894f3de90d9: Add SshKeyBlob per RFC 4253 (authored by faidon).
Add SshKeyBlob per RFC 4253
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDERaeb38db0ca56: Make all SshAgentConfig's methods instance methods (authored by faidon).
Make all SshAgentConfig's methods instance methods
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER2511968dffe4: Add a (very basic) test using OpenSSH's ssh-add (authored by faidon).
Add a (very basic) test using OpenSSH's ssh-add
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER6c5da9d6e848: Test key and config file parsing using test data (authored by faidon).
Test key and config file parsing using test data
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER6e7d5b6ac489: Make all SshAgentConfig's methods instance methods (authored by faidon).
Make all SshAgentConfig's methods instance methods
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDERfa7b998a257b: Add a bunch more tests (authored by faidon).
Add a bunch more tests
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER2a82ff80c088: Add tests for OSError when loading config files (authored by faidon).
Add tests for OSError when loading config files
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER645766d49086: Add SshKeyBlob per RFC 4253 (authored by faidon).
Add SshKeyBlob per RFC 4253
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDERa86c5ae7a3d3: Add a (very basic) test using OpenSSH's ssh-add (authored by faidon).
Add a (very basic) test using OpenSSH's ssh-add
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDERa8cb31ccc3c3: Add a bunch more tests (authored by faidon).
Add a bunch more tests
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER071e33c0d7b0: Properly setup logging when /dev/log doesn't exist (authored by faidon).
Properly setup logging when /dev/log doesn't exist
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER2d614571d5c0: Test key and config file parsing using test data (authored by faidon).
Test key and config file parsing using test data
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDERc094dca54d67: Update tox.ini to facilitate parallel builds (authored by faidon).
Update tox.ini to facilitate parallel builds
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER8131d32632cd: Move tests/unit -> tests (authored by faidon).
Move tests/unit -> tests
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDERb616cb50ac3b: Add a tox environment for Construct 2.8.16 (authored by faidon).
Add a tox environment for Construct 2.8.16
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER74bfd74be76c: Bump minimum Python to 3.5; also test with 3.7 (authored by faidon).
Bump minimum Python to 3.5; also test with 3.7
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDERbbb61dab9f62: Add a pylint tox environment (authored by faidon).
Add a pylint tox environment
Jan 22 2019, 12:30 AM
faidon committed rOSKEYHOLDER067cc37cca29: protocol.compat: disable a couple of pylint errors (authored by faidon).
protocol.compat: disable a couple of pylint errors
Jan 22 2019, 12:30 AM

Jan 21 2019

faidon added a comment to T214313: Add new Tool Labs IPs to Varnish rate limit whitelist.

Per our earlier conversations (T208986, T174596, T209011), I think we should just use the WMCS public IP space to make these kinds of exceptions (which could also be dedicated to Toolforge), and not make rate-limit exceptions on 172.16.0.0/12 space.

Jan 21 2019, 8:07 PM · Toolforge, Wikimedia-Apache-configuration, Operations, Traffic
faidon updated subscribers of T214262: labstore2004 - memory error on DIMM A2.

This is a super old server; it just crossed its 7-year mark (we typically refresh servers at 4.5-5 years), so we're way past its warranty and shelf life and I'm not sure if we have spare parts for it at this point... Not sure if we can do much here -- maybe try a different DIMM or something, if we have one, but I don't have high hopes (also, given the use case... faulty memory is scary). @Papaul, any thoughts?

Jan 21 2019, 10:01 AM · cloud-services-team (Kanban), ops-codfw, Operations

Jan 18 2019

faidon added a comment to T213748: swap a2-eqiad PDU with on-site spare.

Synced up with Chris via IRC:

All systems were able to come back up within a2 without incident. The spare PDU is in place, but it will also be replaced when rows A and B have their PDU refresh this fiscal year.

Jan 18 2019, 4:03 PM · Patch-For-Review, DBA, Analytics, ops-eqiad, Operations
faidon added a comment to T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring.

@fgiunchedi so could you describe in a bit more detail what is needed here and what were the challenges you faced with prometheus-snmp-exporter last time you attempted this?

Jan 18 2019, 3:02 PM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, Operations, monitoring

Dec 21 2018

faidon reassigned T150264: Icinga check for VRRP from faidon to ayounsi.

I pushed what I had written a while ago in Gerrit (see above). It needs to be hooked up to our monitoring, but it should be in a working condition. Leaving that to @ayounsi, assuming you think the code looks good as it is :) Happy to review any subsequent PS updates!

Dec 21 2018, 12:06 PM · Patch-For-Review, netops, monitoring, Operations
faidon added a comment to T211930: Add eqsin routing special cases to jnt.
  1. On received routes: I don't think we should be doing this kind of community matching in BGP_community_actions. Rather, I think we should have ASnnnn_in policy-statements that map our upstream's communities into our own communities (e.g. UPSTREAM_CUST_US), and then have BGP_community_actions act on that. That would make reading this match more straightforward. Note that this follows what we've done with our other communities (e.g. see AS13030_in and the like).
Dec 21 2018, 11:41 AM · Operations, netops
faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

@Cmjohnson what's the status of this?

Dec 21 2018, 10:53 AM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations

Dec 19 2018

faidon edited P7931 Flask/PuppetDB PoC.
Dec 19 2018, 6:24 PM · Operations
faidon created P7931 Flask/PuppetDB PoC.
Dec 19 2018, 6:22 PM · Operations

Dec 18 2018

faidon added a comment to T211750: Introduce Python code formatters usage.

I like black too, but according to https://black.readthedocs.io/en/stable/installation_and_usage.html it is tied to having Python 3.6 installed.

Black can be installed by running pip install black. It requires Python 3.6.0+ to run

With stretch shipping with 3.4 (what do the Mac OS X versions do?) it might be a bit too restrictive to require it.

Dec 18 2018, 8:47 PM · Patch-For-Review, Operations, Operations-Software-Development
faidon added a comment to T211750: Introduce Python code formatters usage.

On my side I've done a test on the cumin codebase with black. The results are:

  • all the ignore comments for pylint or any other validation tool were misplaced (moved to the last line when splitting) and have to be moved back to the first line manually [one off]
  • it doesn't pass flake8:
    • E203 whitespace before ':' (this seems to be a bug on their side; it's for a list slice, _ARGV[index + 1 :])
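
To illustrate the E203 case above: both slice spellings below are valid Python and select the same elements, so the warning concerns style only. This is a small sketch; the contents of `_ARGV` and the value of `index` are made up for the example. As far as I know, black's own documentation suggests disabling E203 in flake8 for this reason.

```python
# Hypothetical argument list, for illustration only.
_ARGV = ["cumin", "--backend", "puppetdb", "query"]
index = 1

# black's output: a space on both sides of ':' when a bound is a complex expression.
black_style = _ARGV[index + 1 :]

# The spelling that flake8's E203 check expects.
compact_style = _ARGV[index + 1:]

# Same slice either way.
same = black_style == compact_style
```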
Dec 18 2018, 8:37 PM · Patch-For-Review, Operations, Operations-Software-Development
faidon reopened T205897: Netbox: fill network topology as "Open".

This task is great, and the table at the top is a very useful summary! The Q2 goal part of it has been completed indeed, so I can see the argument for the task being resolved.

Dec 18 2018, 8:38 AM · Operations
faidon reopened T205897: Netbox: fill network topology, a subtask of T205868: Expand Netbox usage - Q2 2018-19 Goal, as Open.
Dec 18 2018, 8:38 AM · Operations, Operations-Software-Development, Goal
faidon added a comment to T207140: Add maint-announce@ to Equinix's recipient list for eqsin incidents.

There were a few notices on the 15th and 16th of December. Did these arrive at maint-announce@?

Dec 18 2018, 7:47 AM · Wikimedia-Incident, Traffic, Operations

Dec 17 2018

faidon added a comment to T191764: CI: run tests with multiple Python3 versions.

This was quite complicated, but I've managed to forward-port 3.4 and backport 3.6 and 3.7 to stretch. These are now included in the component component/pyall of suite stretch-wikimedia and they can be installed like one would normally install python (apt install python3.{4,5,6,7}-venv should do it).

Dec 17 2018, 11:03 AM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure
faidon reassigned T212011: migrate netinsights from rhenium to sulfur from faidon to MoritzMuehlenhoff.

Moritz was working on that.

Dec 17 2018, 7:35 AM · netops, Operations

Dec 14 2018

faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

@faidon, who is 'please also construct a draft email' directed to?

Dec 14 2018, 8:46 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
faidon added a comment to T205899: Develop and deploy at least three Netbox reports to assist with data correctness and consistency.

I would say to also check that all devices matching some criteria are present in PuppetDB, and vice versa. These criteria may be a combination of:

  • Type: Server
  • Status: Active or Staged
  • Tenant: None (and then define and set tenants "frack" and "sandbox", i.e. RIPE Atlases?)

This might be a lot harder, since the reports can't make a log_failure without a record present in Netbox already. We could make log lines for that though.
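
A rough sketch of the cross-check described above, using plain dictionaries and sets instead of the Netbox and PuppetDB APIs. The `cross_check` function and the dictionary keys are hypothetical, and the sketch assumes host names have already been normalized to the same form on both sides.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Statuses that are expected to have a matching PuppetDB record.
EXPECTED_STATUSES = {"Active", "Staged"}


def cross_check(
    netbox_devices: Iterable[Dict[str, Optional[str]]],
    puppetdb_hosts: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Compare filtered Netbox devices against PuppetDB host names.

    Returns (missing_from_puppetdb, missing_from_netbox), both sorted.
    """
    expected = {
        d["name"]
        for d in netbox_devices
        if d["type"] == "Server"
        and d["status"] in EXPECTED_STATUSES
        and d["tenant"] is None
    }
    known = set(puppetdb_hosts)
    return sorted(expected - known), sorted(known - expected)
```

The first list maps onto per-device failures, since those records exist in Netbox. The second list has no Netbox record to attach a failure to, which is the case where plain log lines would be needed.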

Dec 14 2018, 8:07 PM · Patch-For-Review, Operations, Operations-Software-Development
faidon added a comment to T205899: Develop and deploy at least three Netbox reports to assist with data correctness and consistency.

Manufacturer, model and serial checks all sound good to me! Manufacturer may need some rewriting; I think there's "Dell, Inc." vs. "Dell" and differences like that.

Dec 14 2018, 5:54 PM · Patch-For-Review, Operations, Operations-Software-Development

Dec 13 2018

faidon added a watcher for Keyholder: faidon.
Dec 13 2018, 10:42 AM

Dec 12 2018

faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

OK, I had a look at this. A few observations first of all:

  • While not 100% sure, I don't think this is related to the controller having been swapped before. I don't think it fits.
  • cloudvirt1019 & cloudvirt1020 exhibit different symptoms at the moment. 1019 (which @Cmjohnson has been focusing on) shows its battery count as 1 but status as "recharging", while 1020 shows as having no battery (count = 0).
Dec 12 2018, 7:42 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
faidon added a comment to T102099: Fix IPv6 autoconf issues once and for all, across the fleet..

Makes sense, +1, go for it! A lot has happened since this task was filed in 2015 (e.g. not having precise anymore, T163196 etc.) and including interface::add_ip6_mapped { 'main': } everywhere should be easy, if not completely painless! :)

Dec 12 2018, 5:49 AM · Patch-For-Review, Traffic, netops, Operations, IPv6
faidon closed T158429: Switch to predictable network interface names? as Resolved.

Has been implemented for all hosts starting with stretch and going forward for a long time now!

Dec 12 2018, 5:48 AM · Patch-For-Review, Operations

Dec 11 2018

faidon added a comment to T211254: Free up 185.15.59.0/24.

What is the rationale behind trying to empty this address space and/or find a new /24?

Dec 11 2018, 7:32 PM · Patch-For-Review, Traffic, netops, Operations

Dec 10 2018

faidon added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.
  • It's been a while, but I believe an import statement in the neighbor block overrides the parent one in its entirety, and does not supplement it, so we'd have to repeat the whole import chain there.
  • Would it make sense to have separate as-path groups for v4/v6? It's a bit unusual in our config, but it would address the issue with HE and avoid inadvertently downprefing HE for IPv4 for no reason.
  • If we're going to remove the local-preference setting from BGP_IXP_in and just rely on BGP_community_actions to apply based on communities (it's a good idea!), then we should probably do the same for BGP_Private_Peer_in for consistency.
  • Nitpick: the non-RS policies are called BGP_IXP_…, so let's follow that naming scheme (i.e. "BGP_IXP_RS_in", not "IX")
Dec 10 2018, 11:31 PM · Operations, Traffic, netops
faidon added a comment to T207965: eqiad: Re-connect cage cameras .

They don't, these aren't PoE switches. I didn't know these cameras required PoE. So, two options I suppose:

  • Use PoE injectors
  • Hook them up to (old) EX4200s. Are we using any of them for mgmt switches yet? Cameras seem a better fit for the mgmt network than the production network anyway, right?
Dec 10 2018, 1:29 PM · Operations, ops-eqiad