Check router ACLs for early install SSH access from puppet masters/cumin hosts
Closed, Resolved, Public

Description

Making a separate ticket out of https://phabricator.wikimedia.org/T220505#5433228:

For the initial (pre-Puppet run) SSH access we currently use install_console from the Cumin hosts or Puppet masters. Per @Andrew's comment on the task above, this doesn't work for connecting to e.g. cloudvirt. Is there any router ACL which grants SSH access from iron.wikimedia.org towards labs-hosts-b-eqiad1/labs-hosts-d-eqiad1 and which isn't present for puppetmaster*/cumin* hosts? If there is such a rule, we should carry it over to the ACLs for puppetmaster/cumin, as the eventual goal is to remove iron fully.

Event Timeline

Hosts in the cloud-hosts1-b-eqiad vlan are behind the labs-in4 firewall filter (applied on traffic going out of that vlan), which also includes the labs-instance-in4 firewall filter.

I'm not sure why we have a firewall filter for the host machines (we also have one for the VM traffic). And of course I can find only a few mentions of it (e.g. T199437, when we cleaned it up). My main guess is that it's there to protect the prod infra from a cloud user escaping their VM.

This firewall filter prevents cloud hosts from reaching private prod. Iron being a public host, there were no issues reaching it, but both puppetmaster and cumin are private/internal hosts.
As here we want a private host to connect to cloud hosts, only the return TCP flows need to be permitted (source: tcp/22).

The easiest fix is to add an exception in that firewall filter, but as with any exception we need to be sure it doesn't compromise the purpose of the filter and doesn't add debt down the road.

From my limited understanding, it looks fine, but I don't have the big picture.
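
To make the proposed exception concrete, here is a minimal sketch of the matching logic, not the actual router configuration. It assumes 10.64.20.0/24 is the cloud hosts VLAN (the prefix mentioned later in this thread); the private prod prefix, the management host address, and all names (`Packet`, `permit`, `MGMT_HOSTS`) are hypothetical and for illustration only.

```python
import ipaddress
from dataclasses import dataclass

# Prefix mentioned in this thread for the cloud hosts VLAN.
CLOUD_HOSTS = ipaddress.ip_network("10.64.20.0/24")
# Hypothetical stand-ins: the real private prod prefixes and the
# puppetmaster/cumin addresses are not given in this task.
PRIVATE_PROD = ipaddress.ip_network("10.0.0.0/8")
MGMT_HOSTS = {ipaddress.ip_address("10.2.0.10")}


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    proto: str
    sport: int
    dport: int
    # True when the segment belongs to an existing flow (ACK or RST set),
    # which is what a stateless filter can test for.
    tcp_established: bool = False


def permit(pkt: Packet, exception_enabled: bool) -> bool:
    """Decide whether a packet leaving the cloud hosts VLAN is allowed."""
    src = ipaddress.ip_address(pkt.src)
    dst = ipaddress.ip_address(pkt.dst)
    if src not in CLOUD_HOSTS or dst not in PRIVATE_PROD:
        return True  # not cloud-host-to-private-prod; outside this sketch
    if (
        exception_enabled
        and dst in MGMT_HOSTS
        and pkt.proto == "tcp"
        and pkt.sport == 22
        and pkt.tcp_established
    ):
        return True  # return traffic of an SSH session opened from prod
    return False  # everything else towards private prod stays blocked
```

The point of matching on source port 22 plus the established flag is that a cloud host still cannot open new connections towards private prod; only replies to SSH sessions initiated from the management hosts get through.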

Hosts in the cloud-hosts1-b-eqiad vlan are behind the labs-in4 firewall filter (applied on traffic going out of that vlan), which also includes the labs-instance-in4 firewall filter.

I'm not sure why we have a firewall filter for the host machines (we also have one for the VM traffic). And of course I can find only a few mentions of it (e.g. T199437, when we cleaned it up). My main guess is that it's there to protect the prod infra from a cloud user escaping their VM.

That would be my guess too.

This firewall filter prevents cloud hosts from reaching private prod. Iron being a public host, there were no issues reaching it, but both puppetmaster and cumin are private/internal hosts.
As here we want a private host to connect to cloud hosts, only the return TCP flows need to be permitted (source: tcp/22).

The easiest fix is to add an exception in that firewall filter, but as with any exception we need to be sure it doesn't compromise the purpose of the filter and doesn't add debt down the road.

From my limited understanding, it looks fine, but I don't have the big picture.

The exception is probably acceptable, but your point about adding technical debt and exceptions seems very valid to me. It appears to me that we also have a different, cleaner option:

  • We add a new profile::access_new_install_cloudsupport and apply it to one of the servers in cloud-hosts1-b-eqiad. This would enable install_console to work from that server and it doesn't compromise the separation between prod and cloud-supporting-prod (which we rather want to widen, not narrow!).

Looking at the 10.64.20.0/24 VLAN I see mostly cloudvirt (and some legacy labvirt) and cloudnet. How about we pick cloudnet1003 for this? @Andrew, comments?

Moving the iron exception to cloudnet1003 works for me -- presuming we mean adding it to 'wmcs::openstack::eqiad1::net'.

One thing I'm not clear on is how the automatic image scripts (e.g. wmf-auto-reimage) access the initial host. As I understand it they work already, so probably they're somehow unrelated?

Adding @aborrero for his thoughts about networking. He's on holiday for a few weeks though.

Moving the iron exception to cloudnet1003 works for me -- presuming we mean adding it to 'wmcs::openstack::eqiad1::net'.

Ack, that's what I meant.

One thing I'm not clear on is how the automatic image scripts (e.g. wmf-auto-reimage) access the initial host. As I understand it they work already, so probably they're somehow unrelated?

On the system/ferm level there's a fleet-wide Ferm rule which grants SSH access from Cumin masters.
On the router level, Arzhel needs to comment.

On the system/ferm level there's a fleet-wide Ferm rule which grants SSH access from Cumin masters.
On the router level, Arzhel needs to comment.

...does that mean we can just put install_console on the cumin hosts and avoid having a special case at all? Or does install_console somehow need different access from what's available on the Cumin masters?

On the system/ferm level there's a fleet-wide Ferm rule which grants SSH access from Cumin masters.
On the router level, Arzhel needs to comment.

...does that mean we can just put install_console on the cumin hosts and avoid having a special case at all? Or does install_console somehow need different access from what's available on the Cumin masters?

install_console is already available on the Cumin masters. Did you previously try it only from the puppetmasters? If so, it might simply work thanks to the router ACLs set up for Cumin. Could you give it a shot when reimaging the fixed cloudvirt1015?

install_console works just fine from cumin1001. So, no need for a special case here, we can just go ahead and decom iron.

Thanks all!
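
For anyone wanting to repeat this kind of check without running install_console itself, a plain TCP reachability test on port 22 is a rough approximation. This is only a sketch: the function name `ssh_port_reachable` is hypothetical, and it confirms only that the TCP handshake completes across the router ACLs and host firewall, not that an SSH login would succeed.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake, so it only
        # succeeds if both directions of the flow are permitted.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Run from a Cumin master against a freshly installed host, a `False` result would point at filtering or at the installer's SSH daemon not being up yet.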

Great, thanks. Closing the task, will proceed with iron decom.

@ayounsi: Could you double-check whether we have any remaining router/firewall rules for the IPs used by iron.wikimedia.org that need to be cleaned out (208.80.154.151 and 2620:0:861:2:208:80:154:151)?

The server is very old and had some special privileges in the past, so I just want to make sure we prune those when the server is decommissioned, to prevent surprises when its IP is reused.
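
A cleanup check like this can be partly automated by scanning exported router configuration text for the two addresses above. The following is a rough sketch, assuming the configuration is available as plain text; `find_references` and the token pattern are hypothetical helpers, not an existing tool. Parsing with `ipaddress` means that differently written forms of the same IPv6 address (zero-padded, with a prefix length) still match.

```python
import ipaddress
import re

# The two iron.wikimedia.org addresses quoted in this task.
IRON_ADDRESSES = {
    ipaddress.ip_address("208.80.154.151"),
    ipaddress.ip_address("2620:0:861:2:208:80:154:151"),
}

# Candidate tokens: runs of hex digits, dots and colons, with an
# optional prefix length. Most candidates will not be valid addresses.
TOKEN = re.compile(r"[0-9A-Fa-f:.]+(?:/\d+)?")


def find_references(config_text, addresses=IRON_ADDRESSES):
    """Return (line number, line) pairs that mention one of the addresses."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for token in TOKEN.findall(line):
            try:
                iface = ipaddress.ip_interface(token.rstrip("."))
            except ValueError:
                continue  # not an IP address, e.g. part of a keyword
            if iface.ip in addresses:
                hits.append((lineno, line.strip()))
                break  # one hit per line is enough
    return hits
```

This only finds exact host addresses; rules that cover iron through a wider prefix would still need a manual review.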

Mentioned in SAL (#wikimedia-operations) [2019-09-05T14:42:56Z] <XioNoX> remove iron from mr* routers - T231811