faidon (Faidon Liambotis)
SRE

Projects (10)

User Details

User Since
Oct 7 2014, 10:21 AM (227 w, 6 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF) [ Global Accounts ]

Recent Activity

Fri, Feb 15

faidon reassigned T215837: eqiad: requesting dual cpu misc host for icinga1001 replacement from faidon to RobH.

If that's still needed, that's approved, and it takes priority over phab1002. And let's replenish our spare pool indeed!

Fri, Feb 15, 4:11 PM · Operations, hardware-requests
faidon reassigned T215335: requesting WMF7426 as phabricator system in eqiad from faidon to RobH.

@Dzahn that's all fine, but we should have that documented in a separate Phabricator task tracking this work, if one doesn't exist already :) Separately, I'd also really love having a permanent non-SPOF setup in each data center as well, whether that's multiple bare metal servers, multiple VMs or running Phabricator on k8s. This is too important of a service to run in one misc-type server per site.

Fri, Feb 15, 4:10 PM · Operations, hardware-requests

Thu, Feb 14

faidon added a comment to T216133: Increase visibility of auto-generated tasks for RAID errors.

We discussed this a little bit yesterday, and T216088 was filed to further discuss this. Help there is welcome :)

Thu, Feb 14, 3:27 PM · DC-Ops, Operations, Wikimedia-Incident, cloud-services-team (Kanban)
faidon added a comment to T205897: Netbox: fill network topology.

The medium-term plan is for this data to be entered into Netbox after a server is racked but before it's provisioned or even powered up, and that data to be used by our tooling to configure and execute the provisioning itself (DHCP configuration, switchport, OS install etc.).

Thu, Feb 14, 11:21 AM · Operations

Tue, Feb 12

faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

Before these are delivered for implementation, let's make sure that the two systems have identical settings, especially given we've tested various things on them over the past few months. I reverted my SSD Smart Path setting on 1019, but there are still differences; the most important one that I noticed is that in cloudvirt1019 the P440ar is hidden (disabled in BIOS?) but in cloudvirt1020 it's visible. Maybe a factory reset and then manually reapplying the same settings in each?

Tue, Feb 12, 10:47 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations

Thu, Feb 7

CDanis awarded T126989: MediaWiki logging & encryption a Love token.
Thu, Feb 7, 1:41 PM · monitoring, Wikimedia-Logstash, MediaWiki-Debug-Logger, Operations

Tue, Feb 5

faidon added a comment to T215335: requesting WMF7426 as phabricator system in eqiad.

Is there a task describing the plans for a secondary Phabricator system? How did we come up with those specs?

Tue, Feb 5, 10:50 PM · Operations, hardware-requests

Sat, Feb 2

faidon changed the status of T214130: Requesting access to production for dsharpe from Stalled to Open.
Sat, Feb 2, 9:21 PM · SRE-Access-Requests, Operations
faidon changed the status of T214130: Requesting access to production for dsharpe, a subtask of T213742: Onboarding David Sharpe to Security Team as Information Security Analyst, from Stalled to Open.
Sat, Feb 2, 9:21 PM · Security-Team
faidon added a comment to T214130: Requesting access to production for dsharpe.

Let's not wait for a meeting, approved!

Sat, Feb 2, 9:21 PM · SRE-Access-Requests, Operations

Sat, Jan 26

faidon added a comment to T214762: WMF's Grafana installation does not follow Wikimedia's visual identity guidelines.

It's tricky, but I think the one we use is probably the right one and this should be declined. See T212674 for context.

Sat, Jan 26, 9:58 PM · Operations, monitoring

Fri, Jan 25

faidon updated subscribers of T205897: Netbox: fill network topology.

Netbox is now at 2.5 \o/ which allows us to import cable IDs, type, color etc. Let's start with importing eqsin's, with the data that we have in the spreadsheet, so that we can deprecate that? @RobH @ayounsi any takers?

Fri, Jan 25, 3:11 AM · Operations

Tue, Jan 22

faidon committed rOSKEYHOLDER0fcbce6cca70: Add tests for OSError when loading config files (authored by faidon).
Add tests for OSError when loading config files
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER7894f3de90d9: Add SshKeyBlob per RFC 4253 (authored by faidon).
Add SshKeyBlob per RFC 4253
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDERaeb38db0ca56: Make all SshAgentConfig's methods instance methods (authored by faidon).
Make all SshAgentConfig's methods instance methods
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER2511968dffe4: Add a (very basic) test using OpenSSH's ssh-add (authored by faidon).
Add a (very basic) test using OpenSSH's ssh-add
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER6c5da9d6e848: Test key and config file parsing using test data (authored by faidon).
Test key and config file parsing using test data
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER6e7d5b6ac489: Make all SshAgentConfig's methods instance methods (authored by faidon).
Make all SshAgentConfig's methods instance methods
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDERfa7b998a257b: Add a bunch more tests (authored by faidon).
Add a bunch more tests
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER2a82ff80c088: Add tests for OSError when loading config files (authored by faidon).
Add tests for OSError when loading config files
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER645766d49086: Add SshKeyBlob per RFC 4253 (authored by faidon).
Add SshKeyBlob per RFC 4253
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDERa86c5ae7a3d3: Add a (very basic) test using OpenSSH's ssh-add (authored by faidon).
Add a (very basic) test using OpenSSH's ssh-add
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDERa8cb31ccc3c3: Add a bunch more tests (authored by faidon).
Add a bunch more tests
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER071e33c0d7b0: Properly setup logging when /dev/log doesn't exist (authored by faidon).
Properly setup logging when /dev/log doesn't exist
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER2d614571d5c0: Test key and config file parsing using test data (authored by faidon).
Test key and config file parsing using test data
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDERc094dca54d67: Update tox.ini to facilitate parallel builds (authored by faidon).
Update tox.ini to facilitate parallel builds
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER8131d32632cd: Move tests/unit -> tests (authored by faidon).
Move tests/unit -> tests
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDERb616cb50ac3b: Add a tox environment for Construct 2.8.16 (authored by faidon).
Add a tox environment for Construct 2.8.16
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER74bfd74be76c: Bump minimum Python to 3.5; also test with 3.7 (authored by faidon).
Bump minimum Python to 3.5; also test with 3.7
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDERbbb61dab9f62: Add a pylint tox environment (authored by faidon).
Add a pylint tox environment
Tue, Jan 22, 12:30 AM
faidon committed rOSKEYHOLDER067cc37cca29: protocol.compat: disable a couple of pylint errors (authored by faidon).
protocol.compat: disable a couple of pylint errors
Tue, Jan 22, 12:30 AM

Mon, Jan 21

faidon added a comment to T214313: Add new Tool Labs IPs to Varnish rate limit whitelist.

Per our earlier conversations (T208986, T174596, T209011), I think we should just use the WMCS public IP space to make these kinds of exceptions (which could also be dedicated to Toolforge), and not make rate-limit exceptions on 172.16.0.0/12 space.
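
A minimal sketch of that policy, assuming a hypothetical helper that decides whether a source address may be considered for a rate-limit exception. The public prefix below is a documentation range (198.51.100.0/24) standing in for whatever WMCS public space would actually be used; only 172.16.0.0/12 comes from the comment above.

```python
import ipaddress

# The private range named in the comment; addresses here should never be exempted.
PRIVATE_RANGE = ipaddress.ip_network("172.16.0.0/12")

# Assumption: placeholder documentation prefix, not a real WMCS allocation.
EXEMPT_PUBLIC_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]


def is_in_private_range(addr: str) -> bool:
    """Return True if addr falls inside 172.16.0.0/12."""
    return ipaddress.ip_address(addr) in PRIVATE_RANGE


def may_be_exempted(addr: str) -> bool:
    """Allow an exception only for listed public space, never for private space."""
    ip = ipaddress.ip_address(addr)
    if ip in PRIVATE_RANGE:
        return False
    return any(ip in net for net in EXEMPT_PUBLIC_NETWORKS)
```

The point of the split is that an exception keyed on public space identifies a specific, dedicated source, while one keyed on 172.16.0.0/12 would cover anything behind the private range.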

Mon, Jan 21, 8:07 PM · Toolforge, Wikimedia-Apache-configuration, Traffic, Operations
faidon updated subscribers of T214262: labstore2004 - memory error on DIMM A2.

This is a super old server; it just crossed its 7-year mark (we typically refresh servers at 4.5-5 years), so we're way past its warranty and shelf life and I'm not sure if we have spare parts for it at this point... Not sure if we can do much here -- maybe try a different DIMM or something, if we have one, but I don't have high hopes (also, given the use case... faulty memory is scary). @Papaul, any thoughts?

Mon, Jan 21, 10:01 AM · cloud-services-team (Kanban), ops-codfw, Operations

Jan 18 2019

faidon added a comment to T213748: swap a2-eqiad PDU with on-site spare.

Synced up with Chris via IRC:

All systems were able to come back up within a2 without incident. The spare PDU is in place, but it will also be replaced when rows A and B have PDU refresh this fiscal.

Jan 18 2019, 4:03 PM · Patch-For-Review, DBA, Analytics, ops-eqiad, Operations
faidon added a comment to T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring.

@fgiunchedi so could you describe in a bit more detail what is needed here and what were the challenges you faced with prometheus-snmp-exporter last time you attempted this?

Jan 18 2019, 3:02 PM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, Operations, monitoring

Dec 21 2018

faidon reassigned T150264: Icinga check for VRRP from faidon to ayounsi.

I pushed what I had written a while ago in Gerrit (see above). It needs to be hooked up to our monitoring, but it should be in a working condition. Leaving that to @ayounsi, assuming you think the code looks good as it is :) Happy to review any subsequent PS updates!

Dec 21 2018, 12:06 PM · Patch-For-Review, netops, monitoring, Operations
faidon added a comment to T211930: Add eqsin routing special cases to jnt.
  1. On received routes: I don't think we should be doing this kind of community matching in BGP_community_actions. Rather, I think we should have ASnnnn_in policy-statements that map our upstream's communities into our own communities (e.g. UPSTREAM_CUST_US), and then have BGP_community_actions act on that. That would make reading this much more straightforward. Note that this follows what we've done with our other communities (e.g. see AS13030_in and the likes).
Dec 21 2018, 11:41 AM · Operations, netops
faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

@Cmjohnson what's the status of this?

Dec 21 2018, 10:53 AM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations

Dec 19 2018

faidon edited P7931 Flask/PuppetDB PoC.
Dec 19 2018, 6:24 PM · Operations
faidon created P7931 Flask/PuppetDB PoC.
Dec 19 2018, 6:22 PM · Operations

Dec 18 2018

faidon added a comment to T211750: Introduce Python code formatters usage.

I like black too, but from https://black.readthedocs.io/en/stable/installation_and_usage.html it is tied to having Python 3.6 installed.

Black can be installed by running pip install black. It requires Python 3.6.0+ to run

With stretch shipping with 3.4 (what do the Mac OS X versions do?) it might be a bit too restrictive to require it.

Dec 18 2018, 8:47 PM · Operations, Operations-Software-Development
faidon added a comment to T211750: Introduce Python code formatters usage.

On my side I've done a test on the cumin codebase with black. The results are:

  • all the ignore comments for pylint or any other validation tool were misplaced (moved to the last line when splitting) and have to be moved back to the first line manually [one off]
  • it doesn't pass flake8:
    • E203 whitespace before ':' (this seems to be a bug on their side; it's for a list slice, _ARGV[index + 1 :])
Dec 18 2018, 8:37 PM · Operations, Operations-Software-Development
faidon reopened T205897: Netbox: fill network topology as "Open".

This task is great, and the table at the top is a very useful summary! The Q2 goal part of it has been completed indeed, so I can see the argument for the task being resolved.

Dec 18 2018, 8:38 AM · Operations
faidon reopened T205897: Netbox: fill network topology, a subtask of T205868: Expand Netbox usage - Q2 2018-19 Goal, as Open.
Dec 18 2018, 8:38 AM · Operations, Operations-Software-Development, Goal
faidon added a comment to T207140: Add maint-announce@ to Equinix's recipient list for eqsin incidents.

There were a few notices on the 15th and 16th of December. Did these arrive at maint-announce@?

Dec 18 2018, 7:47 AM · Wikimedia-Incident, Traffic, Operations

Dec 17 2018

faidon added a comment to T191764: CI: run tests with multiple Python3 versions.

This was quite complicated but I've managed to forward-port 3.4 and backport 3.6 and 3.7 to stretch. These are now included in the component/pyall component of the stretch-wikimedia suite and they can be installed like one would normally install python (apt install python3.{4,5,6,7}-venv should do it).

Dec 17 2018, 11:03 AM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure
faidon reassigned T212011: migrate netinsights from rhenium to sulfur from faidon to MoritzMuehlenhoff.

Moritz was working on that.

Dec 17 2018, 7:35 AM · netops, Operations

Dec 14 2018

faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

@faidon, who is 'please also construct a draft email' directed to?

Dec 14 2018, 8:46 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
faidon added a comment to T205899: Develop and deploy at least three Netbox reports to assist with data correctness and consistency.

I would say to also check that all devices matching some criteria, are present in PuppetDB and vice-versa. These criteria may be a combination of:

  • Type: Server
  • Status: Active or Staged
  • Tenant: None (and then define and set tenants "frack" and "sandbox", i.e. RIPE Atlases?)

This might be a lot harder, since the reports can't make a log_failure without a record present in Netbox already. We could make log lines for that though.
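
The cross-check described above can be sketched as a two-way set comparison. This is only an illustration under stated assumptions: the dictionary keys (name, role, status, tenant) and the function name are hypothetical and not taken from the Netbox or PuppetDB APIs; the selection criteria are the three listed in the comment.

```python
def compare_inventories(netbox_devices, puppetdb_hosts):
    """Return (missing_in_puppetdb, missing_in_netbox) as sorted lists.

    netbox_devices: iterable of dicts with hypothetical keys
                    'name', 'role', 'status', 'tenant'.
    puppetdb_hosts: iterable of host names known to PuppetDB.
    """
    # Criteria from the comment: servers, Active or Staged, no tenant.
    expected = {
        dev["name"]
        for dev in netbox_devices
        if dev["role"] == "server"
        and dev["status"] in {"active", "staged"}
        and dev["tenant"] is None
    }
    known = set(puppetdb_hosts)

    # Present in Netbox but not in PuppetDB: a report failure can be
    # attached to the existing Netbox record.
    missing_in_puppetdb = sorted(expected - known)
    # Present in PuppetDB but not in Netbox: no record to attach a failure
    # to, so these would have to be emitted as plain log lines.
    missing_in_netbox = sorted(known - expected)
    return missing_in_puppetdb, missing_in_netbox
```

Returning the two directions separately mirrors the asymmetry noted above: only the first list maps onto objects that already exist in Netbox.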

Dec 14 2018, 8:07 PM · Patch-For-Review, Operations, Operations-Software-Development
faidon added a comment to T205899: Develop and deploy at least three Netbox reports to assist with data correctness and consistency.

Manufacturer, model and serial checks all sound good to me! Manufacturer may need some rewriting; I think there's "Dell, Inc." vs. "Dell" and differences like that.
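
One possible shape for that rewriting, as a rough sketch: strip common corporate suffixes and compare case-insensitively. The function name and the suffix list are assumptions made for illustration, not something from the reports themselves.

```python
import re

# Assumption: a small, illustrative list of trailing corporate suffixes.
_SUFFIX = re.compile(r"[,\s]+(inc|corp|corporation|ltd|llc)\.?$", re.IGNORECASE)


def normalize_manufacturer(name: str) -> str:
    """Reduce a manufacturer string to a comparable form, e.g. 'Dell, Inc.' -> 'dell'."""
    cleaned = _SUFFIX.sub("", name.strip())
    return cleaned.strip().lower()


def same_manufacturer(a: str, b: str) -> bool:
    """Compare two manufacturer strings after normalization."""
    return normalize_manufacturer(a) == normalize_manufacturer(b)
```

With this, a check could compare normalized values from both sources instead of raw strings, so "Dell, Inc." and "Dell" would not be flagged as a mismatch.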

Dec 14 2018, 5:54 PM · Patch-For-Review, Operations, Operations-Software-Development

Dec 13 2018

faidon added a watcher for Keyholder: faidon.
Dec 13 2018, 10:42 AM

Dec 12 2018

faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

OK, I had a look at this. A few observations first of all:

  • While not 100% sure, I don't think this is related to the controller having been swapped before. I don't think it fits.
  • cloudvirt1019 & cloudvirt1020 exhibit different symptoms at the moment. 1019 (which @Cmjohnson has been focusing on) shows its battery count as 1 but status as "recharging", while 1020 shows no battery (count = 0).
Dec 12 2018, 7:42 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
faidon added a comment to T102099: Fix IPv6 autoconf issues once and for all, across the fleet..

Makes sense, +1, go for it! A lot has happened since this task was filed in 2015 (e.g. not having precise anymore, T163196 etc.) and including interface::add_ip6_mapped { 'main': } everywhere should be easy, if not completely painless! :)

Dec 12 2018, 5:49 AM · Patch-For-Review, Traffic, netops, Operations, IPv6
faidon closed T158429: Switch to predictable network interface names? as Resolved.

Has been implemented for all hosts starting with stretch and going forward for a long time now!

Dec 12 2018, 5:48 AM · Patch-For-Review, Operations

Dec 11 2018

faidon added a comment to T211254: Free up 185.15.59.0/24.

What is the rationale behind trying to empty this address space and/or find a new /24?

Dec 11 2018, 7:32 PM · Patch-For-Review, Traffic, Operations, netops

Dec 10 2018

faidon added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.
  • It's been a while, but I believe an import statement in the neighbor block overrides the parent one in its entirety, and does not supplement it, so we'd have to repeat the whole import chain there.
  • Would it make sense to have separate as-path groups for v4/v6? It's a bit unusual in our config, but it would address the issue with HE and avoid inadvertently downprefing HE for IPv4 for no reason.
  • If we're going to remove the local-preference setting from BGP_IXP_in and just rely on BGP_community_actions to apply based on communities (it's a good idea!), then we should probably do the same for BGP_Private_Peer_in for consistency.
  • Nitpick: the non-RS policies are called BGP_IXP_…, so let's follow that naming scheme (i.e. "BGP_IXP_RS_in", not "IX")
Dec 10 2018, 11:31 PM · Operations, Traffic, netops
faidon added a comment to T207965: eqiad: Re-connect cage cameras .

They don't, these aren't PoE switches. I didn't know these cameras required PoE. So, two options I suppose:

  • Use PoE injectors
  • Hook them up to (old) EX4200s. Are we using any of them for mgmt switches yet? Cameras seem a better fit for the mgmt network than the production network anyway, right?
Dec 10 2018, 1:29 PM · Operations, ops-eqiad
faidon added a comment to T207965: eqiad: Re-connect cage cameras .

@Cmjohnson all of the ports show as "physical link down", could you have a look? Thanks!

Dec 10 2018, 11:58 AM · Operations, ops-eqiad

Dec 7 2018

faidon added a comment to T211368: update PDUs for eqsin (asset tag and other info).

Can we add procurement task and purchase date immediately? It doesn't sound like there is an immediate blocker to this.

Dec 7 2018, 1:16 PM · Operations, ops-eqsin

Dec 6 2018

faidon updated subscribers of T187456: Decommission labstore100[123] and their disk shelves.

Per @bd808 on IRC:

Dec 6 2018, 6:52 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad
faidon renamed T187456: Decommission labstore100[123] and their disk shelves from Decommission labstore100[12] and their disk shelves to Decommission labstore100[123] and their disk shelves.
Dec 6 2018, 6:51 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad

Dec 5 2018

faidon added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.

Some thoughts here:

Dec 5 2018, 12:33 PM · Operations, Traffic, netops

Dec 4 2018

faidon added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.

The forward paths are nearly identical, but the reverse is not: reverse path selection is HE for IPv6 and NTT for IPv4, so different paths, and latency could be reasonably explained by that.

Dec 4 2018, 1:55 PM · Operations, Traffic, netops
faidon renamed T211079: IPv6 ~20ms higher ping than IPv4 to gerrit from IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop to IPv6 ~20ms higher ping than IPv4 to gerrit.
Dec 4 2018, 1:49 PM · Operations, Traffic, netops
faidon added a comment to T207965: eqiad: Re-connect cage cameras .

Any progress on this?

Dec 4 2018, 1:34 PM · Operations, ops-eqiad

Nov 30 2018

faidon added a comment to T210667: Can exfat be used in WMF production?.

In this case specifically, my thinking was that I had agreement and understanding with another Opsen, a manager in Tech, a director in Tech and a couple more knowledgeable and engaged parties in real time right before (as review of action). I installed the package with a !log so it would be recorded in the right place and a ping to one of the Opsen who works in that specific area.

Nov 30 2018, 5:36 PM · Security-Team, Analytics, Software-Licensing, WMF-Legal, Operations
faidon added a comment to T210667: Can exfat be used in WMF production?.

So I think this task raises a few different issues (and @Legoktm correct me if I'm wrong):

  1. Legal concerns about using this particular piece of software, and in general software in the same limbo status with regards to freedom-respecting copyright license, but enforced patents;
  2. Guiding principles / Wikimedia movement / free software movements concerns over using patent encumbered software
  3. Installing software outside of our regular processes (puppet, no code review etc.) and in contrast with the commitments we enumerate in L3.
Nov 30 2018, 3:22 PM · Security-Team, Analytics, Software-Licensing, WMF-Legal, Operations

Nov 26 2018

faidon added a comment to T209861: labvirt1007 predicted raid failure.

Sure sounds fine, but @Cmjohnson please file a procurement request so that we can proceed with that purchase :)

Nov 26 2018, 5:24 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)

Nov 23 2018

faidon added a comment to T203003: Keyholder phab repo duplicate work.

I guess we can close rKEYHOLDER. Seems to me keyholder code will be moved out of operations/puppet to operations/software/keyholder where development has been occurring recently.

Nov 23 2018, 3:33 PM · Release-Engineering-Team (Backlog), Operations

Nov 21 2018

faidon added a comment to T177959: Should VPS puppetmasters include labs-recursor0/ns-1 in their resolv.confs?.

If this is about labspuppetmaster1xxx, I have concerns with having a production host use a non-standard recursor, as well as having cross-realm DNS queries like that. I can't offer any practical attack vectors right now, but I'd like to ask to block this for now -- preferably until puppetmasters themselves move to WMCS and this gets implicitly fixed by extension :)

Nov 21 2018, 6:44 PM · Patch-For-Review, cloud-services-team (Kanban)
faidon updated subscribers of T205898: Netbox: explore NAPALM integration.

I think we have consensus on the NAPALM stuff :)

Nov 21 2018, 3:15 PM · Patch-For-Review, Operations
faidon added a comment to T208576: Netbox: Usage guidelines for WMCS .

The "cluster" feature is under the "virtualization" module; it's meant to be used to track where VMs run ("Physical devices may be associated with clusters as hosts. This allows users to track on which host(s) a particular VM may reside"). So in your example, cloudservices and cloudnet etc. wouldn't fit in this definition. cloudvirts... could in theory fit, but even that is a bit of a poor match because VMs in the cloud are in a separate admin domain and not tracked by Netbox. I wouldn't recommend it.

Nov 21 2018, 2:35 PM · Operations, cloud-services-team (Kanban)

Nov 20 2018

faidon added a comment to T208576: Netbox: Usage guidelines for WMCS .

Thanks @GTirloni and @aborrero, useful conversation to have for sure :)

Nov 20 2018, 9:57 PM · Operations, cloud-services-team (Kanban)

Nov 19 2018

faidon added a comment to T171188: Move the main WMCS puppetmaster into the Labs realm.

JFTR, I don't know what cloudinfra-puppetmaster-01 is. Maybe @Krenair or someone else set up that?

Nov 19 2018, 2:21 PM · cloud-services-team (Kanban), Cloud-Services, Puppet, Operations

Nov 16 2018

faidon updated subscribers of T209642: Remove labnodepool1001.eqiad.wmnet.

This specific HW is /very/ old and is already overdue for decommissioning (by 3 years no less).

Nov 16 2018, 2:39 PM · Patch-For-Review, DC-Ops, ops-eqiad, decommission, Operations

Nov 13 2018

faidon added a comment to T209011: Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis.

Thanks @bd808 and @MusikAnimal :)

Nov 13 2018, 11:15 AM · cloud-services-team (Kanban), Cloud-VPS

Nov 9 2018

faidon added a comment to T179050: setup bast4002/WMF7218.

Can this task be resolved, given we have T178592 to track the bast4001 decom?

Nov 9 2018, 8:06 PM · Patch-For-Review, Traffic, Operations, ops-ulsfo
faidon removed parent tasks for T196432: Configure interface damping on primary links: T189552: Rack/cable/configure ulsfo MX204, T174616: set up cr3-esams.
Nov 9 2018, 8:06 PM · Operations, Traffic, netops
faidon removed a subtask for T174616: set up cr3-esams: T196432: Configure interface damping on primary links.
Nov 9 2018, 8:06 PM · ops-esams, Operations, netops
faidon removed a subtask for T189552: Rack/cable/configure ulsfo MX204: T196432: Configure interface damping on primary links.
Nov 9 2018, 8:06 PM · Patch-For-Review, Operations, ops-ulsfo, netops, Traffic
faidon updated subscribers of T205898: Netbox: explore NAPALM integration.

So the aforementioned functionality was removed as obsolete due to NAPALM support replacing it and will not be part of the 2.5 release. The inventory data models remain in the tree AIUI, and one could write external scripts to populate those, that would either use SNMP or ncclient with public key auth etc. to fetch this information. I think it would be interesting to explore, and indeed, probably more interesting than NAPALM itself.

Nov 9 2018, 2:52 PM · Patch-For-Review, Operations
faidon added a comment to T199675: cp5001 unreachable since 2018-07-14 17:49:21.

Why is this still pending?

Nov 9 2018, 1:18 PM · Operations, ops-eqsin, Traffic
faidon added a comment to T209011: Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis.

A good PTR record and whois information for the IP(s) we use for SNAT should help. We really should already be concerned about that for the sake of external sites that may get a large amount of traffic from Cloud VPS/Toolforge hosts. We may also be able to mitigate some of this if hosts with public IPs (like the majority of the Toolforge job grid exec nodes) route directly instead of being consolidated with SNAT. The public IPs on Toolforge grid exec nodes today were added to help with Freenode connection limits which is a similar situation.

Nov 9 2018, 11:42 AM · cloud-services-team (Kanban), Cloud-VPS

Nov 8 2018

faidon added a comment to T209011: Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis.

Yup, T174596 is very much overlapping with, if not a duplicate of, this. As that task indicates, it's not even consistent right now, and source NATing depends on whether one hits a main or edge PoP, which in turn depends on the GeoDNS config... So it's something that needs to be addressed one way or another soon.

Nov 8 2018, 3:16 PM · cloud-services-team (Kanban), Cloud-VPS

Nov 6 2018

faidon closed T208630: Display remote port name in LLDP output as Resolved.

Cool, thanks :)

Nov 6 2018, 7:40 PM · Operations, netops
faidon added a comment to T208630: Display remote port name in LLDP output.

Mmmm OK, that's not super consistent :( It's possible to change the lldpd config and set configure lldp portidsubtype ifname, but it might be complex because of our Puppet facts and is probably not worth our time in general indeed.

Nov 6 2018, 7:25 PM · Operations, netops
faidon updated subscribers of T208622: Import recommendations into production database.

Hey @bmansurov -- stepping in for @mark while he's on vacation this week.

Nov 6 2018, 6:09 PM · Analytics, User-Banyek, Patch-For-Review, Operations, Research
faidon reopened T208630: Display remote port name in LLDP output as "Open".

Looks like an esthetic Juniper bug:
<snip>

Nov 6 2018, 5:44 PM · Operations, netops
faidon added a comment to T208630: Display remote port name in LLDP output.

That's great! +1 in deploying this more widely! :)

Nov 6 2018, 10:45 AM · Operations, netops

Nov 5 2018

faidon added a comment to T193655: rack/setup/install cloudstore1008 & cloudstore1009.

I've seen the same lockup effect in the past when there was contention between the BIOS and Linux for the serial port. This happened when the serial port redirect settings were misconfigured and e.g. set up for "redirect after boot" and directed to COM1, while Linux was also set up for ttyS0. I'd recommend verifying the BIOS settings against our docs on wikitech if you haven't already!

Nov 5 2018, 6:52 PM · cloud-services-team (Kanban), Patch-For-Review, ops-eqiad, Cloud-VPS, Operations
faidon added a comment to T207321: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster.

Ack, +1. Only thing I'd nitpick is that cloudvirtanalytics1001 may be too long for things like physical labels. I think dumps labvirts were just named "labvirts", could we go for that? If not, something shorter would be great. Maybe cloudvirt-an1001 or cloudvirt-dl1001 (for "data lake")?

Nov 5 2018, 6:44 PM · Analytics-Kanban, netops, Operations, Analytics
faidon added a comment to T208726: Access to network devices for Riccardo (volans).

Go for it.

Nov 5 2018, 2:20 PM · netops, Operations
faidon added a comment to T192532: Figure out a way to enable volunteers to use the puppet compiler.

While this is great, I fear that it will unnecessarily spam the commit messages with information that isn't really about the commit itself.

Nov 5 2018, 2:02 PM · Release-Engineering-Team (Backlog), Operations, Puppet, puppet-compiler, Continuous-Integration-Config
faidon added a comment to T208630: Display remote port name in LLDP output.

Hmmm, weird. In the previous generation of stacks, this was different; compare:

Chassis:
  [...]
  SysName:      asw2-c-eqiad
  [...]
Port:        
  PortID:       local 791
  PortDescr:    bast1002
  MFS:          9192

vs.

Chassis:
  [...]
  SysName:      asw-a-eqiad
  [...]
Port:        
  PortID:       local 950
  PortDescr:    ge-2/0/3.0
  MFS:          9192
Nov 5 2018, 1:50 PM · Operations, netops

Nov 2 2018

faidon updated subscribers of T192532: Figure out a way to enable volunteers to use the puppet compiler.

Thanks to Krenair bringing it up on IRC, I took a stab at implementing this. You can now comment "check experimental" on an operations/puppet patch and it'll trigger PCC.

To pass the list of hosts (so it doesn't take hours to run), you can specify it via the commit message, for example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/463519 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/471195

This is currently implemented via the https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/ job, which is a fork of the standard PCC job. Are there any usecases that triggering via Gerrit/zuul doesn't handle? I'd like to replace the current 'operations-puppet-catalog-compiler' job with the -test one.

Nov 2 2018, 8:42 AM · Release-Engineering-Team (Backlog), Operations, Puppet, puppet-compiler, Continuous-Integration-Config

Nov 1 2018

faidon assigned T208267: Requesting access to netbox for bd808 to MoritzMuehlenhoff.

Alright, let's do all of cn=wmf for now, and cross the cn=nda bridge when we come to it :)

Nov 1 2018, 12:19 PM · Patch-For-Review, LDAP-Access-Requests, Operations, SRE-Access-Requests

Oct 30 2018

faidon added a comment to T201247: Sporadic puppet failures.

Spoke too soon, got another failure overnight.

Oct 23 06:25:20 labvirt1017 puppet-agent[161569]: (/Stage[main]/Openstack::Nova::Common::Base/File[/etc/nova/policy.json]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/openstack/mitaka/nova/common/policy.json: end of file reached
Oct 30 2018, 11:40 PM · cloud-services-team (Kanban), Operations
faidon added a comment to T208281: Set up SPF, DKIM, etc. for new cloud MX servers.

Not necessarily! For what we're currently doing -just aliasing a handful of aliases to a few people- I think it's fine as it is (but if the cloud admin team wants that separate for some reason, that's their call of course). We're not crossing any prod/WMCS barriers as it is, so I don't consider this a security issue.

Oct 30 2018, 12:59 PM · Mail, Cloud-VPS