Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium)
Closed, Resolved, Public

Description

gallium has lost its disk and a recovery attempt is under way. Meanwhile a new server, contint1001.wikimedia.org (IP 208.80.154.17), has been installed. We need firewall rules that let servers in the labs support network reach the new server. The existing rules for gallium (208.80.154.135) should be kept.

Figure: CI Target 2016 - Overview.png

The diagram gives an overview; the three blue boxes can be considered services running on contint1001.wikimedia.org.

Flows were previously documented on https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Isolation#Security_matrix when we had labnodepool1001.eqiad.wmnet set up.

Flows going to contint1001

| Works? | Proto | Source host | Source IP | Dest host | Dest IP | Dest port | Description |
| | TCP | scandium | 10.64.4.12 | contint1001 | 208.80.154.17 | 4730 | zuul merger to zuul gearman server |
| | TCP | labnodepool1001 | 10.64.20.18 | contint1001 | 208.80.154.17 | 4730 | Nodepool to zuul gearman server |
| xxx | TCP | iridium | 10.64.32.150 | contint1001 | 208.80.154.17 | 4730 | Phabricator to zuul gearman server. Removed via https://gerrit.wikimedia.org/r/#/c/310039/ |
| | TCP | labnodepool1001 | 10.64.20.18 | contint1001 | 208.80.154.17 | 8888 | Nodepool to Jenkins ZeroMQ |
| | TCP | labnodepool1001 | 10.64.20.18 | contint1001 | 208.80.154.17 | 443 | Nodepool to Jenkins REST API |

Flows originating from contint1001

| Works? | Proto | Source host | Source IP | Dest host | Dest IP | Dest port | Description |
| | TCP | contint1001 | 208.80.154.17 | scandium | 10.64.4.12 | 9418 | Git connection to Zuul-merger git daemon |
| | TCP | contint1001 | 208.80.154.17 | contintcloud instances | 10.x.x.x/y | 22 | Jenkins server/client connection to slaves |
| | TCP | contint1001 | 208.80.154.17 | contintcloud instances | 10.x.x.x/y | 873 | rsync |
| ??? | UDP | contint1001 | 208.80.154.17 | statsd.eqiad.wmnet | ??? | 8125 | Zuul scheduler metrics to statsd |
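The two flow tables can also be kept as data, which makes it easy to list what has to be verified in each direction. The following is only a sketch: the `Flow` type and the helper names are hypothetical, the rows are copied from the tables in this description, IP addresses are left out for brevity, and the iridium row is omitted because it was removed.

```python
from typing import List, NamedTuple


class Flow(NamedTuple):
    """One row of the flow matrix (hypothetical helper type)."""
    proto: str
    src_host: str
    dst_host: str
    dst_port: int
    description: str


# Rows copied from the task description; the removed iridium flow is omitted.
FLOWS = [
    Flow("tcp", "scandium", "contint1001", 4730, "zuul merger to zuul gearman server"),
    Flow("tcp", "labnodepool1001", "contint1001", 4730, "Nodepool to zuul gearman server"),
    Flow("tcp", "labnodepool1001", "contint1001", 8888, "Nodepool to Jenkins ZeroMQ"),
    Flow("tcp", "labnodepool1001", "contint1001", 443, "Nodepool to Jenkins REST API"),
    Flow("tcp", "contint1001", "scandium", 9418, "Git connection to Zuul-merger git daemon"),
    Flow("tcp", "contint1001", "contintcloud instances", 22, "Jenkins connection to slaves"),
    Flow("tcp", "contint1001", "contintcloud instances", 873, "rsync"),
    Flow("udp", "contint1001", "statsd.eqiad.wmnet", 8125, "Zuul scheduler metrics to statsd"),
]


def flows_to(host: str) -> List[Flow]:
    """Flows whose destination is the given host."""
    return [f for f in FLOWS if f.dst_host == host]


def flows_from(host: str) -> List[Flow]:
    """Flows that originate from the given host."""
    return [f for f in FLOWS if f.src_host == host]
```

Calling `flows_to("contint1001")` lists the inbound flows and `flows_from("contint1001")` the outbound ones, mirroring the split between the two tables.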

Event Timeline

We might need some extra route on contint1001

If I read this correctly, amongst other things you're requesting firewall holes from (random?) Labs instances (contintcloud instances) to a server (contint1001) in one of the private vlans, right? We can't do that...

Sorry, it is not clear in the table. The last three entries have contint1001 as the source. The flow would be:

  • contint1001 establishes an SSH connection to the labs instance
  • Jenkins copies the Jenkins client .jar to the instance over ssh and drives it from there

Labs instances have no need to establish any connection to either gallium or contint1001.

I have split the table to separate the flows by direction.

For the flows going to contint1001 with dest port 4730, we already have:

ACCEPT     tcp  --  labnodepool1001.eqiad.wmnet  anywhere             tcp dpt:4730
ACCEPT     tcp  --  iridium.eqiad.wmnet  anywhere             tcp dpt:4730
ACCEPT     tcp  --  gallium.wikimedia.org  anywhere             tcp dpt:4730
ACCEPT     tcp  --  scandium.eqiad.wmnet  anywhere             tcp dpt:4730
ACCEPT     tcp  --  localhost            anywhere             tcp dpt:4730

This comes from modules/contint/manifests/firewall.pp, where a ferm rule uses srange => "(${zuul_merger_hosts_ferm})" and the hosts are defined in Hiera in hieradata/common/contint.yaml. It all got applied with the role, so nothing more has to be done for it.
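To illustrate how a host list can end up as the parenthesised value passed to srange, here is a minimal sketch. It assumes the list is simply joined with spaces; the function name `ferm_srange` is hypothetical, and the actual Puppet code may build the string differently. The example hosts are the ones visible in the iptables output above, not necessarily the exact Hiera contents.

```python
from typing import Iterable


def ferm_srange(hosts: Iterable[str]) -> str:
    """Join host names into a parenthesised, space-separated list.

    Assumption: this mirrors what "(${zuul_merger_hosts_ferm})" expands to;
    it is not taken from the real manifest.
    """
    return "(" + " ".join(hosts) + ")"


# Hosts taken from the ACCEPT rules shown above (illustrative only).
example_hosts = [
    "labnodepool1001.eqiad.wmnet",
    "iridium.eqiad.wmnet",
    "gallium.wikimedia.org",
    "scandium.eqiad.wmnet",
]
example_srange = ferm_srange(example_hosts)
```

The point of keeping the hosts in Hiera is that adding a host to the list changes the generated rule on every server carrying the role, without touching the manifest.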

Same for labnodepool to 8888, already done:

ACCEPT tcp -- labnodepool1001.eqiad.wmnet anywhere tcp dpt:8888

Just the 443 looks like it is missing. I'll add that. Basically the whole incoming part is just done through the power of puppet roles and hiera.

Change 293441 had a related patch set uploaded (by Dzahn):
contint: add firewall rule for nodepool to Jenkins API

https://gerrit.wikimedia.org/r/293441

Change 293441 merged by Dzahn:
contint: add firewall rule for nodepool to Jenkins API

https://gerrit.wikimedia.org/r/293441

| Proto | Source host | Source IP | Dest host | Dest IP | Dest port | Description | Status |
| TCP | scandium | 10.64.4.12 | contint1001 | 10.64.0.237 | 4730 | zuul merger to zuul gearman server | ✓ already existed |
| TCP | labnodepool1001 | 10.64.20.18 | contint1001 | 10.64.0.237 | 4730 | Nodepool to zuul gearman server | ✓ already existed |
| TCP | iridium | 10.64.32.150 | contint1001 | 10.64.0.237 | 4730 | Phabricator to zuul gearman server | ✓ already existed |
| TCP | labnodepool1001 | 10.64.20.18 | contint1001 | 10.64.0.237 | 8888 | Nodepool to Jenkins ZeroMQ | ✓ already existed |
| TCP | labnodepool1001 | 10.64.20.18 | contint1001 | 10.64.0.237 | 443 | Nodepool to Jenkins REST API | ✓ added; did not exist before on gallium, but now it does. Will be added on contint1001 once puppet gets re-enabled there |
| TCP | contint1001 | 10.64.0.237 | scandium | 10.64.4.12 | 9418 | Git connection to Zuul-merger git daemon | already existed, though it allows the entire 10.0.0.0/8 to connect, not just contint |

So I would say only labs and statsd are the remaining questions here.

Change 293449 had a related patch set uploaded (by Dzahn):
contint: limit access to zuul-merger git daemon

https://gerrit.wikimedia.org/r/293449

@Dzahn thanks, though all those rules are indeed present on hosts since they are provisioned by puppet classes / ferm::rules.

This task is really about enabling the rules on the firewalls (not iptables/ferm) between contint1001 and the hosts in labs support network.

hashar changed the task status from Open to Stalled. Jun 9 2016, 3:46 PM

From a quick chat with @mark: we don't want hosts in the private LAN to communicate with labs instances at all.

Will follow up on parent task T137358.

Change 293449 abandoned by Dzahn:
contint: limit access to zuul-merger git daemon

Reason:
per hashar's comments

https://gerrit.wikimedia.org/r/293449

While T137323#2365101 is important, I don't understand T137323#2368164. What exactly do you mean by private LAN?

@JanZerebecki - most of our production hosts are in private internal VLANs with no routing to public space. There's also a smaller subset of production hosts that are in public VLANs where they can be accessed from the rest of the world (given firewall rules allow). From prod networks' point of view, labs instances aren't generally considered to be any more trustworthy than the public Internet, so there are no direct firewall holes between labs networks and production's private internal VLANs. It wasn't a problem for gallium.wikimedia.org (on a public VLAN), but is a problem for contint1001.eqiad.wmnet (on a private VLAN).
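As a rough illustration of the distinction, the addresses mentioned in this task can be classified with Python's standard `ipaddress` module. This is only a sketch: RFC 1918 "private" address space is used here as a stand-in for "private VLAN", which is an assumption and not how the routers actually decide anything. The variable names are hypothetical; the addresses are the ones quoted in this task.

```python
import ipaddress

# Addresses quoted in this task.
GALLIUM = ipaddress.ip_address("208.80.154.135")
CONTINT1001_PRIVATE = ipaddress.ip_address("10.64.0.237")   # original address
CONTINT1001_PUBLIC = ipaddress.ip_address("208.80.154.17")  # after the move
SCANDIUM = ipaddress.ip_address("10.64.4.12")

# The broad range that the existing git-daemon rule on scandium allows.
TEN_SLASH_EIGHT = ipaddress.ip_network("10.0.0.0/8")


def in_private_space(addr: ipaddress.IPv4Address) -> bool:
    """True if the address is in private (non globally routable) space."""
    return addr.is_private


def covered_by_broad_rule(addr: ipaddress.IPv4Address) -> bool:
    """True if the address falls inside 10.0.0.0/8."""
    return addr in TEN_SLASH_EIGHT
```

With these values, `in_private_space(CONTINT1001_PRIVATE)` is true while `in_private_space(GALLIUM)` and `in_private_space(CONTINT1001_PUBLIC)` are false, which matches why gallium could be reached by labs-related flows and the first contint1001 address could not.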

This task is no longer relevant. I am keeping it open as a reference until Zuul, Jenkins, etc. are migrated to scandium.

Dropping team project tags so it no longer shows up on their workboards.

hashar renamed this task from Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) to Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium). Sep 7 2016, 8:25 PM
hashar changed the task status from Stalled to Open.
hashar updated the task description.

contint1001 has been moved to the production public network with fqdn contint1001.wikimedia.org and IP address 208.80.154.17.

I have updated the table with the new address. Also added rsync from contint1001 to instances on port 873.

Mentioned in SAL [2016-09-07T20:30:43Z] <hashar> Updated security group for contintcloud and integration labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323

Mentioned in SAL [2016-09-07T20:35:30Z] <hashar> Updated security group for deployment-prep labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323

Change 309153 had a related patch set uploaded (by Hashar):
contint: allow ssh from contint1001 to labs instance

https://gerrit.wikimedia.org/r/309153

Change 309154 had a related patch set uploaded (by Hashar):
contint: vary ssh from= for prod slave

https://gerrit.wikimedia.org/r/309154

contint1001 now has the default set of rules from contint::firewall.

Have to verify that all the flows are open.

hashar updated the task description.

twentyafterfour@iridium:~$ telnet 208.80.154.17 4730
Trying 208.80.154.17...
telnet: Unable to connect to remote host: Connection refused

Yup, the service is not enabled (it is the Gearman server embedded in Zuul, and the puppet class is not applied). To test, one would need to either run tcpdump on contint1001 (needs root) or listen there with nc -l -p 4730 208.80.154.17.

Ditto for port 443.

The iptables rule allowing iridium is present at least.
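The telnet result above can be reproduced with a small probe that distinguishes the possible outcomes. This is a minimal sketch; the function name `probe_tcp` is hypothetical. A refusal usually means the packet reached the host and nothing was listening (although a firewall configured to reject with a TCP reset looks the same), whereas a timeout more often points at packets being dropped along the way.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and classify the result.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # An RST came back: host reachable, nothing listening (or a reject rule).
        return "refused"
    except socket.timeout:
        # No answer at all: commonly a firewall silently dropping packets.
        return "timeout"
    except OSError as exc:
        # Anything else, for example no route to host or a DNS failure.
        return f"error: {exc}"
```

Run from iridium, `probe_tcp("208.80.154.17", 4730)` would be expected to return "refused" in the situation shown above, and "open" once something listens on the port.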

Change 309261 had a related patch set uploaded (by Hashar):
zuul::merger: allow contint1001

https://gerrit.wikimedia.org/r/309261

Status

I have verified labnodepool1001 -- contint1001 flows. They are all fine.

  • Need to check iridium to contint1001 on port 4730; see the comment above for how to test. Then again, maybe we no longer need access to Gearman from the Phabricator host? @mmodell isn't Harbormaster using the Jenkins REST API over https instead?
  • DONE: contint1001 to scandium for git-daemon was pending Gerrit 309261.
  • contint1001 to statsd.eqiad.wmnet: I don't know how to test it, but it should work, assuming statsd allows our public IP.
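For the statsd flow, one possible way to test is to send a single counter by hand and look for it on the receiving side. The sketch below uses the common statsd line format `name:value|c`; the function name and the metric name in the usage note are hypothetical. Because UDP is connectionless, a successful send only shows that the packet left the host; confirming arrival would require checking on the statsd side (for example with tcpdump there).

```python
import socket


def send_statsd_counter(host: str, port: int, metric: str, value: int = 1) -> int:
    """Send one statsd counter increment over UDP.

    Returns the number of bytes handed to the network stack. This does NOT
    prove delivery: UDP gives no acknowledgement.
    """
    payload = f"{metric}:{value}|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

From contint1001 this could be called as `send_statsd_counter("statsd.eqiad.wmnet", 8125, "test.firewall_probe")`, where the metric name is a made-up placeholder.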

Change 309153 merged by Muehlenhoff:
contint: allow ssh from contint1001 to labs instance

https://gerrit.wikimedia.org/r/309153

Change 309261 merged by Muehlenhoff:
zuul::merger: allow contint1001

https://gerrit.wikimedia.org/r/309261

Change 309154 merged by Giuseppe Lavagetto:
contint: vary ssh from= for prod slave

https://gerrit.wikimedia.org/r/309154

@mmodell do we still need Harbormaster on iridium to be able to talk to the CI Gearman server? My understanding is that jobs are now triggered using the Jenkins REST API. If that is the case I would rather drop the firewall rule.

Confirmed with @20after4: there is no longer any need for Harbormaster/iridium to talk to Gearman.

hashar claimed this task.

All rules have been tested and work.

The one for iridium is no longer needed and was removed with https://gerrit.wikimedia.org/r/#/c/310039/