faidon (Faidon Liambotis)
SRE

Projects (10)

User Details

User Since
Oct 7 2014, 10:21 AM (218 w, 4 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF)

Recent Activity

Yesterday

faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

@faidon, who is 'please also construct a draft email' directed to?

Fri, Dec 14, 8:46 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
faidon added a comment to T205899: Develop and deploy at least three Netbox reports to assist with data correctness and consistency.

I would say to also check that all devices matching some criteria are present in PuppetDB and vice-versa. These criteria may be a combination of:

  • Type: Server
  • Status: Active or Staged
  • Tenant: None (and then define and set tenants "frack" and "sandbox", i.e. RIPE Atlases?)

This might be a lot harder, since the reports can't make a log_failure without a record present in Netbox already. We could make log lines for that though.
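
A minimal sketch of the two-way check described above, assuming the device list and the PuppetDB host list have already been fetched. The dictionary keys (name, role, status, tenant) and all host names are hypothetical and are not Netbox's or PuppetDB's actual API:

```python
def cross_check(netbox_devices, puppetdb_hosts):
    """Compare Netbox device names against hosts known to PuppetDB.

    netbox_devices: list of dicts with the hypothetical keys
        'name', 'role', 'status' and 'tenant'.
    puppetdb_hosts: iterable of host names.
    Returns (missing_from_puppetdb, missing_from_netbox) as sorted lists.
    """
    # Criteria from the comment: servers, Active or Staged, no tenant.
    expected = {
        d["name"]
        for d in netbox_devices
        if d["role"] == "server"
        and d["status"] in ("active", "staged")
        and d["tenant"] is None
    }
    actual = set(puppetdb_hosts)
    return sorted(expected - actual), sorted(actual - expected)


# Hypothetical sample data, for illustration only.
netbox = [
    {"name": "host1001", "role": "server", "status": "active", "tenant": None},
    {"name": "host1002", "role": "server", "status": "staged", "tenant": None},
    {"name": "atlas1001", "role": "server", "status": "active", "tenant": "sandbox"},
    {"name": "host1003", "role": "server", "status": "offline", "tenant": None},
]
puppetdb = ["host1001", "host1004"]

missing_from_puppetdb, missing_from_netbox = cross_check(netbox, puppetdb)
```

The first list maps onto existing Netbox records, so a report could flag each entry as a failure. The second list has no Netbox record to attach a failure to, which is why it would probably have to be emitted as plain log lines.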

Fri, Dec 14, 8:07 PM · Patch-For-Review, Operations, Operations-Software-Development
faidon added a comment to T205899: Develop and deploy at least three Netbox reports to assist with data correctness and consistency.

Manufacturer, model and serial checks all sound good to me! Manufacturer may need some rewriting, I think there's "Dell, Inc." vs. "Dell" and differences like that.
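
One possible way to do that rewriting, as a rough sketch: strip a trailing corporate suffix and compare case-insensitively. The suffix list and both function names are assumptions made for illustration, not something taken from the actual reports:

```python
import re

# Hypothetical list of corporate suffixes to drop before comparing.
_SUFFIXES = re.compile(r"[,\s]+(inc|corp|corporation|ltd|llc)\.?$", re.IGNORECASE)


def normalize_manufacturer(name):
    """Reduce a manufacturer string to a comparable, lower-case form."""
    cleaned = _SUFFIXES.sub("", name.strip())
    return cleaned.strip().lower()


def same_manufacturer(a, b):
    """True when two manufacturer strings match after normalization."""
    return normalize_manufacturer(a) == normalize_manufacturer(b)
```

With this, "Dell, Inc." and "Dell" compare as equal, while genuinely different vendors still differ.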

Fri, Dec 14, 5:54 PM · Patch-For-Review, Operations, Operations-Software-Development

Thu, Dec 13

faidon added a watcher for Keyholder: faidon.
Thu, Dec 13, 10:42 AM

Wed, Dec 12

faidon added a comment to T196507: Degraded RAID on cloudvirt1019.

OK, I had a look at this. A few observations first of all:

  • While not 100% sure, I don't think this is related to the controller having been swapped before. I don't think it fits.
  • cloudvirt1019 & cloudvirt1002 exhibit different symptoms at the moment. 1019 (which @Cmjohnson has been focusing on) shows its battery count as 1 but status as "recharging", while 1020 shows no battery (count = 0).
Wed, Dec 12, 7:42 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
faidon added a comment to T102099: Fix IPv6 autoconf issues once and for all, across the fleet..

Makes sense, +1, go for it! A lot has happened since this task was filed in 2015 (e.g. not having precise anymore, T163196 etc.) and including interface::add_ip6_mapped { 'main': } everywhere should be easy, if not completely painless! :)

Wed, Dec 12, 5:49 AM · Traffic, netops, Operations, IPv6
faidon closed T158429: Switch to predictable network interface names? as Resolved.

Has been implemented for all hosts starting with stretch and going forward for a long time now!

Wed, Dec 12, 5:48 AM · Patch-For-Review, Operations

Tue, Dec 11

faidon added a comment to T211254: Free up 185.15.59.0/24.

What is the rationale behind trying to empty this address space and/or find a new /24?

Tue, Dec 11, 7:32 PM · Traffic, Operations, netops

Mon, Dec 10

faidon added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.
  • It's been a while, but I believe an import statement in the neighbor block overrides the parent one in its entirety, and does not supplement it, so we'd have to repeat the whole import chain there.
  • Would it make sense to have separate as-path groups for v4/v6? It's a bit unusual in our config, but it would address the issue with HE and avoid inadvertently downprefing HE for IPv4 for no reason.
  • If we're going to remove the local-preference setting from BGP_IXP_in and just rely on BGP_community_actions to apply based on communities (it's a good idea!), then we should probably do the same for BGP_Private_Peer_in for consistency.
  • Nitpick: the non-RS policies are called BGP_IXP_…, so let's follow that naming scheme (i.e. "BGP_IXP_RS_in", not "IX")
Mon, Dec 10, 11:31 PM · Operations, Traffic, netops
faidon added a comment to T207965: eqiad: Re-connect cage cameras .

They don't, these aren't PoE switches. I didn't know these cameras required PoE. So, two options I suppose:

  • Use PoE injectors
  • Hook them up to (old) EX4200s. Are we using any of them for mgmt switches yet? Cameras seem a better fit for the mgmt network than the production network anyway, right?
Mon, Dec 10, 1:29 PM · ops-eqiad, Operations
faidon added a comment to T207965: eqiad: Re-connect cage cameras .

@Cmjohnson all of the ports show as "physical link down", could you have a look? Thanks!

Mon, Dec 10, 11:58 AM · ops-eqiad, Operations

Fri, Dec 7

faidon added a comment to T211368: update PDUs for eqsin (asset tag and other info).

Can we add procurement task and purchase date immediately? It doesn't sound like there is an immediate blocker to this.

Fri, Dec 7, 1:16 PM · Operations, ops-eqsin

Thu, Dec 6

faidon updated subscribers of T187456: Decommission labstore100[123] and their disk shelves.

Per @bd808 on IRC:

Thu, Dec 6, 6:52 PM · cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad
faidon renamed T187456: Decommission labstore100[123] and their disk shelves from Decommission labstore100[12] and their disk shelves to Decommission labstore100[123] and their disk shelves.
Thu, Dec 6, 6:51 PM · cloud-services-team (Kanban), Data-Services, Operations, DC-Ops, ops-eqiad

Wed, Dec 5

faidon added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.

Some thoughts here:

Wed, Dec 5, 12:33 PM · Operations, Traffic, netops

Tue, Dec 4

faidon added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.

The forward paths are nearly identical, but the reverse is not: reverse path selection is HE for IPv6 and NTT for IPv4, so different paths, and latency could be reasonably explained by that.

Tue, Dec 4, 1:55 PM · Operations, Traffic, netops
faidon renamed T211079: IPv6 ~20ms higher ping than IPv4 to gerrit from IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop to IPv6 ~20ms higher ping than IPv4 to gerrit.
Tue, Dec 4, 1:49 PM · Operations, Traffic, netops
faidon added a comment to T207965: eqiad: Re-connect cage cameras .

Any progress on this?

Tue, Dec 4, 1:34 PM · ops-eqiad, Operations

Fri, Nov 30

faidon added a comment to T210667: Can exfat be used in WMF production?.

In this case specifically, my thinking was that I had agreement and understanding with another Opsen, a manager in Tech, a director in Tech and a couple more knowledgeable and engaged parties in real time right before (as review of action). I installed the package with a !log so it would be recorded in the right place and a ping to one of the Opsen who works in that specific area.

Fri, Nov 30, 5:36 PM · Security-Team, Analytics, Software-Licensing, WMF-Legal, Operations
faidon added a comment to T210667: Can exfat be used in WMF production?.

So I think this task raises a few different issues (and @Legoktm correct me if I'm wrong):

  1. Legal concerns about using this particular piece of software, and in general software in the same limbo status with regards to freedom-respecting copyright license, but enforced patents;
  2. Guiding principles / Wikimedia movement / free software movements concerns over using patent encumbered software
  3. Installing software outside of our regular processes (puppet, no code review etc.) and in contrast with the commitments we enumerate in L3.
Fri, Nov 30, 3:22 PM · Security-Team, Analytics, Software-Licensing, WMF-Legal, Operations

Mon, Nov 26

faidon added a comment to T209861: labvirt1007 predicted raid failure.

Sure sounds fine, but @Cmjohnson please file a procurement request so that we can proceed with that purchase :)

Mon, Nov 26, 5:24 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)

Fri, Nov 23

faidon added a comment to T203003: Keyholder phab repo duplicate work.

I guess we can close rKEYHOLDER. Seems to me keyholder code will be moved out of operations/puppet to operations/software/keyholder where development has been occurring recently.

Fri, Nov 23, 3:33 PM · Release-Engineering-Team (Backlog), Operations

Wed, Nov 21

faidon added a comment to T177959: Should VPS puppetmasters include labs-recursor0/ns-1 in their resolv.confs?.

If this is about labspuppetmaster1xxx, I have concerns with having a production host use a non-standard recursor, as well as having cross-realm DNS queries like that. I can't offer any practical attack vectors right now, but I'd like to ask to block this for now -- preferably until puppetmasters themselves move to WMCS and this gets implicitly fixed by extension :)

Wed, Nov 21, 6:44 PM · Patch-For-Review, cloud-services-team (Kanban)
faidon updated subscribers of T205898: Netbox: explore NAPALM integration.

I think we have consensus on the NAPALM stuff :)

Wed, Nov 21, 3:15 PM · Patch-For-Review, Operations
faidon added a comment to T208576: Netbox: Usage guidelines for WMCS .

The "cluster" feature is under the "virtualization" module; it's meant to be used to track where VMs run ("Physical devices may be associated with clusters as hosts. This allows users to track on which host(s) a particular VM may reside"). So in your example, cloudservices and cloudnet etc. wouldn't fit in this definition. cloudvirts... could in theory fit, but even that is a bit of a poor match because VMs in the cloud are in a separate admin domain and not tracked by Netbox. I wouldn't recommend it.

Wed, Nov 21, 2:35 PM · Operations, cloud-services-team (Kanban)

Tue, Nov 20

faidon added a comment to T208576: Netbox: Usage guidelines for WMCS .

Thanks @GTirloni and @aborrero, useful conversation to have for sure :)

Tue, Nov 20, 9:57 PM · Operations, cloud-services-team (Kanban)

Mon, Nov 19

faidon added a comment to T171188: Move the main WMCS puppetmaster into the Labs realm.

JFTR, I don't know what cloudinfra-puppetmaster-01 is. Maybe @Krenair or someone else set up that?

Mon, Nov 19, 2:21 PM · cloud-services-team (Kanban), Cloud-Services, Puppet, Operations

Fri, Nov 16

faidon updated subscribers of T209642: Remove labnodepool1001.eqiad.wmnet.

This specific HW is /very/ old and is already overdue for decommissioning (by 3 years no less).

Fri, Nov 16, 2:39 PM · DC-Ops, ops-eqiad, decommission, Operations

Nov 13 2018

faidon added a comment to T209011: Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis.

Thanks @bd808 and @MusikAnimal :)

Nov 13 2018, 11:15 AM · cloud-services-team (Kanban), Cloud-VPS

Nov 9 2018

faidon added a comment to T179050: setup bast4002/WMF7218.

Can this task be resolved, given we have T178592 to track the bast4001 decom?

Nov 9 2018, 8:06 PM · Patch-For-Review, Traffic, Operations, ops-ulsfo
faidon removed parent tasks for T196432: Configure interface damping on primary links: T189552: Rack/cable/configure ulsfo MX204, T174616: set up cr3-esams.
Nov 9 2018, 8:06 PM · Operations, Traffic, netops
faidon removed a subtask for T174616: set up cr3-esams: T196432: Configure interface damping on primary links.
Nov 9 2018, 8:06 PM · ops-esams, Operations, netops
faidon removed a subtask for T189552: Rack/cable/configure ulsfo MX204: T196432: Configure interface damping on primary links.
Nov 9 2018, 8:06 PM · Patch-For-Review, Operations, ops-ulsfo, netops, Traffic
faidon updated subscribers of T205898: Netbox: explore NAPALM integration.

So the aforementioned functionality was removed as obsolete due to NAPALM support replacing it and will not be part of the 2.5 release. The inventory data models remain in the tree AIUI, and one could write external scripts to populate those, that would either use SNMP or ncclient with public key auth etc. to fetch this information. I think it would be interesting to explore, and indeed, probably more interesting than NAPALM itself.

Nov 9 2018, 2:52 PM · Patch-For-Review, Operations
faidon added a comment to T199675: cp5001 unreachable since 2018-07-14 17:49:21.

Why is this still pending?

Nov 9 2018, 1:18 PM · Operations, ops-eqsin, Traffic
faidon added a comment to T209011: Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis.

A good PTR record and whois information for the IP(s) we use for SNAT should help. We really should already be concerned about that for the sake of external sites that may get a large amount of traffic from Cloud VPS/Toolforge hosts. We may also be able to mitigate some of this if hosts with public IPs (like the majority of the Toolforge job grid exec nodes) route directly instead of being consolidated with SNAT. The public IPs on Toolforge grid exec nodes today were added to help with Freenode connection limits which is a similar situation.

Nov 9 2018, 11:42 AM · cloud-services-team (Kanban), Cloud-VPS

Nov 8 2018

faidon added a comment to T209011: Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis.

Yup, T174596 very much overlaps with this, if it is not a duplicate of it. As that task indicates, it's not even consistent right now, and source NATing depends on whether one hits a main or edge PoP, which in turn depends on the GeoDNS config... So it's something that needs to be addressed one way or another soon.

Nov 8 2018, 3:16 PM · cloud-services-team (Kanban), Cloud-VPS

Nov 6 2018

faidon closed T208630: Display remote port name in LLDP output as Resolved.

Cool, thanks :)

Nov 6 2018, 7:40 PM · Operations, netops
faidon added a comment to T208630: Display remote port name in LLDP output.

Mmmm OK, that's not super consistent :( It's possible to change the lldpd config and set configure lldp portidsubtype ifname, but it might be complex because of our Puppet facts and is probably not worth our time in general indeed.

Nov 6 2018, 7:25 PM · Operations, netops
faidon updated subscribers of T208622: Import recommendations into production database.

Hey @bmansurov -- stepping in for @mark while he's on vacation this week.

Nov 6 2018, 6:09 PM · Analytics, User-Banyek, Patch-For-Review, Operations, Research
faidon reopened T208630: Display remote port name in LLDP output as "Open".

Looks like an esthetic Juniper bug:
<snip>

Nov 6 2018, 5:44 PM · Operations, netops
faidon added a comment to T208630: Display remote port name in LLDP output.

That's great! +1 in deploying this more widely! :)

Nov 6 2018, 10:45 AM · Operations, netops

Nov 5 2018

faidon added a comment to T193655: rack/setup/install cloudstore1008 & cloudstore1009.

I've seen the same lockup effect in the past when there was contention between the BIOS and Linux for the serial port. This happened when the serial port redirect settings were misconfigured and e.g. set up for "redirect after boot" and directed to COM1, while Linux was also set up for ttyS0. I'd recommend verifying the BIOS settings against our docs on wikitech if you haven't already!

Nov 5 2018, 6:52 PM · cloud-services-team (Kanban), Patch-For-Review, ops-eqiad, Cloud-VPS, Operations
faidon added a comment to T207321: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster.

Ack, +1. Only thing I'd nitpick is that cloudvirtanalytics1001 may be too long for things like physical labels. I think dumps labvirts were just named "labvirts", could we go for that? If not, something shorter would be great. Maybe cloudvirt-an1001 or cloudvirt-dl1001 (for "data lake")?

Nov 5 2018, 6:44 PM · Analytics-Kanban, netops, Operations, Analytics
faidon added a comment to T208726: Access to network devices for Riccardo (volans).

Go for it.

Nov 5 2018, 2:20 PM · netops, Operations
faidon added a comment to T192532: Figure out a way to enable volunteers to use the puppet compiler.

While this is great, I fear that it will unnecessarily spam the commit messages with information that isn't really about the commit itself.

Nov 5 2018, 2:02 PM · Release-Engineering-Team (Backlog), Operations, Puppet, puppet-compiler, Continuous-Integration-Config
faidon added a comment to T208630: Display remote port name in LLDP output.

Hmmm, weird. In the previous generation of stacks, this was different; compare:

Chassis:
  [...]
  SysName:      asw2-c-eqiad
  [...]
Port:        
  PortID:       local 791
  PortDescr:    bast1002
  MFS:          9192

vs.

Chassis:
  [...]
  SysName:      asw-a-eqiad
  [...]
Port:        
  PortID:       local 950
  PortDescr:    ge-2/0/3.0
  MFS:          9192
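
A small sketch of how the two outputs above could be compared programmatically, assuming the "Key: value" layout shown. The function names and the Juniper-style interface pattern are assumptions for illustration; this is not lldpd's own tooling:

```python
import re


def parse_lldp_fields(text):
    """Collect 'Key: value' pairs from lldpctl-style text output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Skip section headers ('Chassis:') and elided lines ('[...]').
        if sep and key and value:
            fields[key] = value
    return fields


def looks_like_interface_name(descr):
    """Assumption: Juniper-style names such as ge-2/0/3.0."""
    return re.fullmatch(r"[a-z]{2}-\d+/\d+/\d+(\.\d+)?", descr) is not None


asw2_sample = """Chassis:
  [...]
  SysName:      asw2-c-eqiad
  [...]
Port:
  PortID:       local 791
  PortDescr:    bast1002
  MFS:          9192
"""

asw_sample = """Chassis:
  [...]
  SysName:      asw-a-eqiad
  [...]
Port:
  PortID:       local 950
  PortDescr:    ge-2/0/3.0
  MFS:          9192
"""

asw2_fields = parse_lldp_fields(asw2_sample)
asw_fields = parse_lldp_fields(asw_sample)
```

Checking PortDescr with looks_like_interface_name() separates the two cases: the asw-a-eqiad sample carries the interface name there, while the asw2-c-eqiad sample carries the port description instead.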
Nov 5 2018, 1:50 PM · Operations, netops

Nov 2 2018

faidon updated subscribers of T192532: Figure out a way to enable volunteers to use the puppet compiler.

Thanks to Krenair bringing it up on IRC, I took a stab at implementing this. You can now comment "check experimental" on an operations/puppet patch and it'll trigger PCC.

To pass the list of hosts (so it doesn't take hours to run), you can specify it via the commit message, for example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/463519 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/471195

This is currently implemented via the https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/ job, which is a fork of the standard PCC job. Are there any usecases that triggering via Gerrit/zuul doesn't handle? I'd like to replace the current 'operations-puppet-catalog-compiler' job with the -test one.

Nov 2 2018, 8:42 AM · Release-Engineering-Team (Backlog), Operations, Puppet, puppet-compiler, Continuous-Integration-Config

Nov 1 2018

faidon assigned T208267: Requesting access to netbox for bd808 to MoritzMuehlenhoff.

Alright, let's do all of cn=wmf for now, and cross the cn=nda bridge when we come to it :)

Nov 1 2018, 12:19 PM · Patch-For-Review, LDAP-Access-Requests, Operations, SRE-Access-Requests

Oct 30 2018

faidon added a comment to T201247: Sporadic puppet failures.

Spoke too soon, got another failure overnight.

Oct 23 06:25:20 labvirt1017 puppet-agent[161569]: (/Stage[main]/Openstack::Nova::Common::Base/File[/etc/nova/policy.json]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/openstack/mitaka/nova/common/policy.json: end of file reached
Oct 30 2018, 11:40 PM · cloud-services-team (Kanban), Operations
faidon added a comment to T208281: Set up SPF, DKIM, etc. for new cloud MX servers.

Not necessarily! For what we're currently doing -just aliasing a handful of aliases to a few people- I think it's fine as it is (but if the cloud admin team wants that separate for some reason, that's their call of course). We're not crossing any prod/WMCS barriers as it is, so I don't consider this a security issue.

Oct 30 2018, 12:59 PM · Mail, Cloud-VPS
faidon updated subscribers of T208267: Requesting access to netbox for bd808.

Netbox does have a piece of functionality called "secrets", but we're not currently using it. We may in the future, but I don't think it's super important to account for that right now and we'd need to deal with more granular access rights for that anyway.

Oct 30 2018, 11:49 AM · Patch-For-Review, LDAP-Access-Requests, Operations, SRE-Access-Requests
faidon added a comment to T208281: Set up SPF, DKIM, etc. for new cloud MX servers.

Yes, this was because of T137160. wmflabs.org inbound emails are being handled by prod MXes right now, with only a handful of aliases being defined.

Oct 30 2018, 11:03 AM · Mail, Cloud-VPS

Oct 29 2018

faidon added a comment to T208267: Requesting access to netbox for bd808.

+1 sounds good to me. I'd go as far as to say that we should just make this available to all NDA users? Thoughts?

Oct 29 2018, 9:16 PM · Patch-For-Review, LDAP-Access-Requests, Operations, SRE-Access-Requests
faidon added a comment to T208244: ntp broken in new region.

But then this all reminds me of something I should've thought about earlier: NTP and VMs haven't historically blended well anyways, so there might be other gremlins to look out for while you're staring at all of this. The issues have surely evolved since I last looked, and can vary a lot by what hypervisor you're using (KVM?). There's all kinds of conflicting advice on a quick google search right now, and it all looks more complicated than I can wade into at the moment. It sounds like to get really accurate time everywhere, you do want NTP inside your guests, but that there may be some (non-standard?) configuration tricks to make it work well (the various bits about pvclock and kvm-clock and whether you have a constant-time TSC in your host hardware, etc):

Oct 29 2018, 9:14 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, netops, Cloud-VPS
faidon added a comment to T208244: ntp broken in new region.

Can we set up a couple of NTP servers within VPS e.g. in the cloudinfra project instead? Should be just a couple of generic instances with role::ntp applied, right?

Oct 29 2018, 5:52 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, netops, Cloud-VPS

Oct 25 2018

faidon added a comment to T207321: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster.

So, this is quite the can of worms :) There are several pieces to this, and honestly, I feel like VLANs is kind of a secondary question, with the primary being the overall design of this new infrastructure especially from a security perspective. Questions such as "what services should we be opening up to the public (WMCS/Internet)", "how should data flow from the Analytics cluster", etc.

Oct 25 2018, 6:55 PM · Analytics-Kanban, netops, Operations, Analytics
faidon added a comment to T207775: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga.

Good idea! I think upstream fe006d2 and 08425ff are the fixes for this particular issue and they seem to apply cleanly on top of 3.0.1 with 1 file changed, 5 insertions(+), 3 deletions(-), so trivial enough.

Oct 25 2018, 2:06 PM · Patch-For-Review, monitoring, Operations
faidon reopened T207140: Add maint-announce@ to Equinix's recipient list for eqsin incidents as "Open".

I see emails for SG3 that (as far as I can tell) haven't made it to maint-announce, e.g.

Date: Thu, 25 Oct 2018 13:01:14 +0000 (UTC)
From: Equinix Maintenance NO-REPLY <no-reply@equinix.com>
To: EquinixMaintenance.SG@ap.equinix.com
Subject: REMINDER - Scheduled UPS Power Capacity Upgrade at the SG3 IBX  [5-168456259310]
Oct 25 2018, 1:07 PM · Wikimedia-Incident, Traffic, Operations
faidon reopened T207140: Add maint-announce@ to Equinix's recipient list for eqsin incidents, a subtask of T206861: Power incident in eqsin, as Open.
Oct 25 2018, 1:07 PM · Wikimedia-Incident, Traffic, Operations
faidon added a comment to T207536: Move various support services for Cloud VPS currently in prod into their own instances.

Ok, I think I understand this better now.

But we still have "supporting" services which are on the edge of what we would be able to move inside openstack, not only cloudvirts, but NFS servers for example. Our NFS problems are to be discussed in other tasks though :-)
Other services can be moved inside of openstack, like Wikireplicas, but because of the way they work, they still need direct access to prod infra (so little benefit in moving them to VMs).

Oct 25 2018, 12:46 PM · cloud-services-team (Kanban), Operations, Cloud-VPS
faidon added a comment to T207536: Move various support services for Cloud VPS currently in prod into their own instances.

Please @faidon confirm I'm understanding this right.

If I compare the last diagram with the one we are currently using:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#/media/File:Eqiad1_network_topology.png

  • what changes would we have to make, long term, to go from what we have right now to the ideal model?
  • would we need to reallocate all physical servers to a single subnet (different from prod) and probably to the same DC row?
  • our supporting servers would still need to reach puppetmasters, install servers, ldap servers, DNS servers, etc., i.e. they are still administered as any other prod server. Would you like to see a change in that aspect as well?
Oct 25 2018, 12:29 AM · cloud-services-team (Kanban), Operations, Cloud-VPS
faidon added a comment to T207536: Move various support services for Cloud VPS currently in prod into their own instances.

@faidon the complete separation seems like a great goal from a security perspective, but considering there's a lot of legacy code out there, it's potentially a big one too (apart from just changing addresses). Would these tasks be part of a larger project and if so, tracked as quarterly goals? In other words, do we have a timeline we should follow? One difficulty I had personally was to understand where all these little tasks fit in the grand scheme of things, thanks for all the background information so far.

Oct 25 2018, 12:14 AM · cloud-services-team (Kanban), Operations, Cloud-VPS

Oct 24 2018

faidon assigned T191362: decom promethium/WMF3571 to Andrew.

@Andrew, promethium's hostname, IP and MAC address are still referenced in a number of places in the puppet tree, including e.g. hardcoded in Python code (proxyleaks.py) that I'd rather not have DC Ops touch :)

Oct 24 2018, 10:23 PM · decommission, Operations, DC-Ops, ops-eqiad
faidon added a project to T207900: Enable csp-report-only mode everywhere : Operations.

Cool! Cc'ing @herron and @fgiunchedi here for awareness and their input. Logstash may or may not be happy about the extra load, depending on how much that would be (esp. for big wikis) :)

Oct 24 2018, 10:04 PM · Restricted Project, Operations, Wikimedia-Site-requests, Security-Team

Oct 23 2018

faidon added a comment to T207533: Move labs-recursors in WMCS.

I'm not sure why there would be a chicken-and-egg problem. Prod recursors run in prod, right? Why is this different?

Oct 23 2018, 7:54 PM · Patch-For-Review, Cloud-VPS, Operations
faidon added a comment to T207536: Move various support services for Cloud VPS currently in prod into their own instances.

My understanding of the problem is:

  • cloud supporting services in hardware (DNS, SMTP, DB replicas, NFS, monitoring, etc) share their addressing with production services, therefore they are considered part of the prod infra (or to be side-by-side).
  • we don't trust what runs on Cloud VPS instances, especially when it comes to interaction with prod infra
  • we have concerns regarding a possible VM --> supporting service --> prod escalation
Oct 23 2018, 7:24 PM · cloud-services-team (Kanban), Operations, Cloud-VPS
faidon added a comment to T207663: Renumber cloud-instance-transport1-b-eqiad to public IPs.

This is essentially part of T122406, which we resolved last week with the intention of making it more specific with this task (among others).

Oct 23 2018, 6:11 PM · cloud-services-team (Kanban), Patch-For-Review, netops, Operations

Oct 22 2018

faidon added a comment to T207533: Move labs-recursors in WMCS.

My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.

In theory having redundant recursors will work around the issue with us needing to occasionally restart/move VMs but it will still make things a lot more fragile. In general I'd prefer to address any concerns with the recursors directly.

Oct 22 2018, 10:08 PM · Patch-For-Review, Cloud-VPS, Operations
faidon added a comment to T207543: Move labmon (Graphite, StatsD) into a Cloud VPS.

+1 to this task.

Oct 22 2018, 10:01 PM · cloud-services-team (Kanban), Operations, Cloud-VPS
faidon added a comment to T207663: Renumber cloud-instance-transport1-b-eqiad to public IPs.

It's sad to hear that's a major disruption :( Would it make sense to do this now when it's early in the migration and relatively few projects have migrated over? If not, is there a specific timeframe where we can schedule this? Could this be done by, say, end of Q2? Thanks!

Oct 22 2018, 8:26 PM · cloud-services-team (Kanban), Patch-For-Review, netops, Operations
faidon added a comment to T207321: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster.

How many servers are we talking about both right now, as well as in the mid-term e.g. in the next year or two?

Oct 22 2018, 7:23 PM · Analytics-Kanban, netops, Operations, Analytics
faidon added a comment to T174596: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour.
  • I don't really understand what the curl query and its output mean. @Krenair could you please elaborate? I genuinely don't understand what's wrong with that, or what you would expect it to return, etc. Please advise.

When labs instances connect to prod, I think logically either prod hosts should see labs private IPs, or they should see labs public IPs. Right now we appear to have a bizarre situation where it depends which prod DC you connect to.
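
To illustrate how such an inconsistency can arise, here is a rough sketch that assumes dmz_cidr behaves as a list of destination ranges exempt from source NAT. The function name and every address below are hypothetical (documentation ranges), not the actual configuration:

```python
import ipaddress


def source_seen_by(dest_ip, private_src, public_src, exempt_ranges):
    """Return the source address a destination would observe.

    Assumption: destinations inside an exempt range see the instance's
    private address, everything else sees the SNAT (public) address.
    """
    dest = ipaddress.ip_address(dest_ip)
    for cidr in exempt_ranges:
        if dest in ipaddress.ip_network(cidr):
            return private_src
    return public_src


# Hypothetical: only one of two "prod" ranges is listed as exempt.
exempt = ["198.51.100.0/24"]
seen_by_listed_dc = source_seen_by("198.51.100.10", "10.0.0.5", "192.0.2.1", exempt)
seen_by_other_dc = source_seen_by("203.0.113.10", "10.0.0.5", "192.0.2.1", exempt)
```

Under that assumption, the same instance appears with its private address to one site and with the NAT address to the other, purely because only some ranges are listed.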

Oct 22 2018, 5:34 PM · cloud-services-team (Kanban), netops, Operations, Cloud-VPS

Oct 20 2018

faidon added a project to T207536: Move various support services for Cloud VPS currently in prod into their own instances: Operations.
Oct 20 2018, 11:58 AM · cloud-services-team (Kanban), Operations, Cloud-VPS
faidon updated the task description for T207533: Move labs-recursors in WMCS.
Oct 20 2018, 9:56 AM · Patch-For-Review, Cloud-VPS, Operations
faidon triaged T207533: Move labs-recursors in WMCS as Normal priority.
Oct 20 2018, 9:53 AM · Patch-For-Review, Cloud-VPS, Operations
faidon edited projects for T171188: Move the main WMCS puppetmaster into the Labs realm, added: Cloud-Services; removed Cloud-VPS.
Oct 20 2018, 9:42 AM · cloud-services-team (Kanban), Cloud-Services, Puppet, Operations
faidon added a comment to T171188: Move the main WMCS puppetmaster into the Labs realm.

Ping? Could we set up a couple of puppetmasters in the new "cloudinfra" project and see where that leads us? I was previously told that this is probably a 1-2 week project; is that still the current assessment, and if so, do you have an estimate on when this could be scheduled?

Oct 20 2018, 9:40 AM · cloud-services-team (Kanban), Cloud-Services, Puppet, Operations

Oct 19 2018

faidon closed T207387: Puppet failures on trusty due to libmonitoring-plugin-perl as Resolved.

@Andrew reports that this is fixed indeed, resolving.

Oct 19 2018, 5:09 PM · cloud-services-team
faidon closed T207328: es2017 and es2019 have an idrac ethernet interface in Linux as Resolved.

OK, you were right about the cause. I addressed the symptom, which was to go into iDRAC's web interface and, under Overview > iDRAC Settings > Network > OS to iDRAC Pass-through, select Disabled.

Oct 19 2018, 12:36 PM · ops-codfw, Operations

Oct 18 2018

faidon added a comment to T207387: Puppet failures on trusty due to libmonitoring-plugin-perl.

So OK, I gave it a shot so that we can move things forward and not waste everyone's time. Took me 10 minutes to spawn a trusty chroot, and... 3 minutes to do the backport (echo 9 > debian/compat; dch --bpo; dpkg-buildpackage -uc -us). I spent another... 2 minutes to copy files around and reprepro include the backport in trusty-wikimedia, which I think should unbreak the setup and address the issue mentioned in the task description.

Oct 18 2018, 6:19 PM · cloud-services-team
faidon added a comment to T207387: Puppet failures on trusty due to libmonitoring-plugin-perl.

Fully agreed on all of your points and desirables here! That 30-day window is possible, but it also means that we'll lose steam in a project that's well underway :/ Could you go with a cherry-picked reverted patch or just with sticking with an older puppet tree during that 30-day period?

Oct 18 2018, 5:51 PM · cloud-services-team
faidon added a comment to T41785: Create a Cloud VPS SMTP smarthost.

Chipping in because I'm not sure if @herron is aware: tools (i.e. Toolforge) has its own (very) special exim configuration. A comparison with a random non-tools VPS may be more appropriate, but even that may not be great -- I wouldn't assume it works correctly now :) What is the desired behavior of WMCS' mx-outs?

Oct 18 2018, 5:25 PM · User-herron, Patch-For-Review, Operations, Cloud-Services, Mail
faidon added a comment to T207387: Puppet failures on trusty due to libmonitoring-plugin-perl.

I think the alternatives are:

  • SRE holds off the upgrade of Icinga from jessie to stretch in production until Shinken maintainers get the chance to keep up. (I don't even know if said maintainers are you or other Foundation staff or volunteers.)
  • SRE introduces backwards compatibility for trusty in our code despite having no use for it ourselves or way to test it, and thus pays the cost for doing so, and for maintaining it for the next ~6 months.
  • SRE backports packages and/or upgrades VMs that someone else maintains, and that we know little about (and which in this case use software we have never used and don't know much about).
Oct 18 2018, 5:14 PM · cloud-services-team
faidon created P7692 Hiera files/lines in operations/puppet.
Oct 18 2018, 3:22 PM · Operations
faidon added a comment to T207387: Puppet failures on trusty due to libmonitoring-plugin-perl.

Modifying our checks to support both Nagios::Plugin and Monitoring::Plugin is very messy and we elected not to do so for the transition in prod. Adding these conditionals for what basically is now a 4½-year-old distro used in a VPS is not something we should do.

Oct 18 2018, 2:19 PM · cloud-services-team
faidon closed T122406: Consider renumbering Labs to separate address spaces as Resolved.

Perfect! As far as I can see, there are a few pending items, but they are or should probably be covered in other tasks.

Oct 18 2018, 1:46 PM · Cloud-Services, netops, Operations
faidon closed T122406: Consider renumbering Labs to separate address spaces, a subtask of T167293: Nova-network to Neutron migration, as Resolved.
Oct 18 2018, 1:46 PM · Patch-For-Review, Epic, Cloud-Services
faidon added a comment to T207138: Document eqsin power connections in Netbox.

Awesome, thanks! No field for Cable IDs or labels is a bit disappointing :( It doesn't look like we can do it with a custom field either, but I'm not 100% sure.

Oct 18 2018, 1:38 PM · Traffic, Operations

Oct 16 2018

MusikAnimal awarded T41785: Create a Cloud VPS SMTP smarthost a Cup of Joe token.
Oct 16 2018, 7:45 PM · User-herron, Patch-For-Review, Operations, Cloud-Services, Mail
faidon updated subscribers of T122406: Consider renumbering Labs to separate address spaces.

I think this is now done with Neutron, and while the old space remains for now, the migration is underway, so this task can be closed. @ayounsi, @aborrero, @chasemp?

Oct 16 2018, 9:49 AM · Cloud-Services, netops, Operations
faidon added a comment to T207138: Document eqsin power connections in Netbox.

This refers to power connections specifically as it's a subtask of the power incident, but that spreadsheet covers patches too, and we should probably document these as well.

Oct 16 2018, 9:19 AM · Traffic, Operations
faidon triaged T207140: Add maint-announce@ to Equinix's recipient list for eqsin incidents as High priority.
Oct 16 2018, 9:15 AM · Wikimedia-Incident, Traffic, Operations
faidon triaged T207138: Document eqsin power connections in Netbox as Normal priority.
Oct 16 2018, 9:11 AM · Traffic, Operations

Oct 15 2018

faidon added a comment to T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw).

I just looked briefly at T172459 and it looks like the last update there was to attempt this during the switchover period, which is obviously over :)

Oct 15 2018, 5:52 PM · Patch-For-Review, netops, Operations
faidon edited projects for T206861: Power incident in eqsin, added: Wikimedia-Incident; removed Patch-For-Review.
Oct 15 2018, 4:09 PM · Wikimedia-Incident, Operations, Traffic
faidon added a comment to T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw).

@ayounsi, what's the current status of this task? Last update is from over a year ago, but I think some of our latest woes with asw2-b-eqiad are very much related to this?

Oct 15 2018, 2:27 PM · Patch-For-Review, Operations, netops

Oct 14 2018

faidon added a comment to T206861: Power incident in eqsin.

It seems that the power has been restored, all the outstanding alarms have recovered and also the RIPE Atlas is back online.

Oct 14 2018, 8:03 PM · Wikimedia-Incident, Operations, Traffic
faidon added a comment to T199675: cp5001 unreachable since 2018-07-14 17:49:21.

@RobH ping? This has been pending since July, with the last update being Aug 27(!?)

Oct 14 2018, 7:59 PM · Operations, ops-eqsin, Traffic
faidon renamed T206861: Power incident in eqsin from 1 power feed down in eqsin to Power incident in eqsin.
Oct 14 2018, 7:57 PM · Wikimedia-Incident, Operations, Traffic