
Allocate contint1001 to releng and allocate to a vlan
Closed, Resolved (Public)

Description

gallium had a disk/RAID failure a month ago and contint1001.eqiad.wmnet was allocated at that time as an emergency replacement in T137358. This was not formalized due to the emergency nature and timelines. All documentation and notables have now been changed to reflect the contint1001 identity at this point. This task is for catching loose ends on setup.

contint1001.eqiad.wmnet is:

  • fulfilling the CI services requirements in terms of memory/CPU and especially disk space (where we need at least 500GB - 1TB of space)
  • in practice already allocated to CI
  • offering room for growth and performance tuning and improvement for Jenkins
  • allowing us to phase out both gallium (Precise) and scandium.eqiad.wmnet, both machines 5+ years old.

Remaining steps:

  • Move contint1001 to production network with public IP (discussion below)
  • Change VLAN / IP / DNS
  • Reimage (Debian Jessie) to start fresh from emergency window changes
  • Service implementation with full puppetization
  • Put gallium && scandium back into the spare pool

Event Timeline

hashar created this task. Jul 13 2016, 4:06 PM
Restricted Application added a project: Operations. Jul 13 2016, 4:06 PM
Restricted Application added subscribers: Zppix, Aklapper.
chasemp triaged this task as High priority. Jul 13 2016, 4:16 PM
chasemp updated the task description.
RobH assigned this task to mark. Edited Jul 13 2016, 4:35 PM
RobH added a subscriber: RobH.

This all seems like a fine idea to me, so I'll just document the systems used so we can get @mark's approval.

Since this was allocated during an emergency situation for a specific use, it should indeed be reviewed before it is allocated to another purpose. Thank you for filing the task and documenting this!

Spare system wmf4746 was allocated as contint1001, as @hashar notes, during the downtime of gallium. Both gallium & scandium are very old systems.

gallium is an r310 purchased on 2011-01-27. scandium is an r610, reallocated to use from its original deployment as a squid system, purchased on 2011-01-27. I'd like to get permission (via this task escalation to @mark) to also decommission these two old hosts once they are fully reclaimed from CI use.

Since this is now going to shift role/setup/use slightly and be pushed into permanent service, we should get @mark's approval on this task.

As such, I've assigned and escalated to him for his review. Please attach comments/questions/approvals and I can follow up.

Thanks.

chasemp added a subscriber: faidon. Jul 14 2016, 4:49 PM

@faidon and I chatted about this for a few minutes today. We talked primarily about the logic behind putting contint1001 in labs-support vs continuing the model of public allocation, taking into account the co-location of current scandium functionality and current gallium functionality.

@faidon is going to think on it and put his thoughts here next week and we'll move forward from there. Either model should be doable so it is more about what is appropriate now.

chasemp renamed this task from "Allocate contint1001 to releng and reimage it in labs-support-network" to "Allocate contint1001 to releng and allocate to a vlan". Jul 14 2016, 5:23 PM

I've deliberated this a little bit and honestly my (slight) preference would be to not (ab)use labs-support for this but instead use a proper public IP like gallium currently has. labs-support sounds like it would just complicate things further with regards to how we've defined it (and how we protect it with ACLs) and I don't really see much benefit over using just a public subnet for it.

That said, I still have my reservations regarding the proposed architecture — but I will follow up on T133300 about that (I'm getting lost between all the different tasks — is T133150 essentially a duplicate of this one?)

mark removed mark as the assignee of this task. Aug 3 2016, 10:59 AM

The allocation of contint1001 is fine.

What's the current status on the discussion on which vlan to use?

I've deliberated this a little bit and honestly my (slight) preference would be to not (ab)use labs-support for this but instead use a proper public IP like gallium currently has.

RelEng has no strong preference for using labs-support vs a public IP.

We just need contint1001 to be able to talk to labs private instances in the integration and contintcloud openstack projects over ssh, serve https://integration.wikimedia.org/ and be reachable by labnodepool1001. Given those criteria, we were hoping to get some expertise from ops about where contint1001 fits in our architecture.

Some context on why labs-support is listed on all the tickets:

The choice for labs-support was rough-hewn from a few different things. IIRC @mark expressed a preference for not having a public IP as this box will need to talk to boxen with labs private IPs.

After the idea of using a public IP was abandoned, @hashar and I had a meeting with @chasemp in which we tried to find a network that was a good fit for a new jenkins-master given our criteria.

mark added a comment. Aug 30 2016, 10:23 AM

I've deliberated this a little bit and honestly my (slight) preference would be to not (ab)use labs-support for this but instead use a proper public IP like gallium currently has.

RelEng has no strong preference for using labs-support vs a public IP.
We just need contint1001 to be able to talk to labs private instances in the integration and contintcloud openstack projects over ssh, serve https://integration.wikimedia.org/ and be reachable by labnodepool1001. Given those criteria, we were hoping to get some expertise from ops about where contint1001 fits in our architecture.
Some context on why labs-support is listed on all the tickets:
The choice for labs-support was rough-hewn from a few different things. IIRC @mark expressed a preference for not having a public IP as this box will need to talk to boxen with labs private IPs.

I highly doubt I said that, as that doesn't make sense. I assume you're referring to https://phabricator.wikimedia.org/T137323#2365101 which states that we can't open arbitrary firewall holes between labs instance vlan and production (private) vlans, which would be a major security hole.

If on the other hand, labs instances would need to connect to a server in production (but not in the other direction), a solution would indeed be to put it in the public production vlan(s). That server would then be reachable from a Labs instance just like any other server "on the Internet", and its traffic would be Source-NATed like any other "external" destination.

we can't open arbitrary firewall holes between labs instance vlan and production (private) vlans, which would be a major security hole.
[...]
If on the other hand, labs instances would need to connect to a server in production (but not in the other direction), a solution would indeed be to put it in the public production vlan(s).

The labs instances (the jenkins nodes) don't need to connect to contint1001, but contint1001 will need to be able to connect to the jenkins nodes running on labs via SSH.

All the connections needed both to and from contint1001 are in the task description of T137323: Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium) (the "Flows going to"/"Flows going from" sections)

Any vlan allocation that is acceptable to ops that allows those connections works for us.

Regarding the use of a public IP: gallium had one for historical reasons, and now that all uses have been migrated (web behind misc-cache, ssh via bastion, etc.) it is no longer needed. There are rules in the firewall to let it reach labs instances.

We had contint1001 set up in an emergency on a server connected to the production private LAN, which cannot communicate with labs instances at all.

So we thought about putting the server in the labs-support network which has access to labs instances and would be reachable by the other hosts we have there (labnodepool and scandium).

I don't mind changing the flows one way or another. I think the biggest trouble is we have no clear view of all the networks and how they can communicate with each other.

T133300#2495369 has a summary of the various bits with links to various diagrams. Can we give a quick formal presentation followed by a brainstorming session with netops so we can sort it out?

mark added a comment. Edited Aug 31 2016, 11:47 AM

@hashar and I just had a long chat on IRC, where we clarified some things in both directions.

A few points of information:

  • Every VLAN discussed, except the VLAN of the actual Labs instances themselves, is in the "production" realm, including labs-support and labs-hosts (labs infra)
  • There's a firewall between Production and Labs realms, which lets Labs instances only talk to the "public internet", which includes the public production VLANs/IPs, but explicitly not private.
  • Communication between production and labs realm should be a) minimized, b) well defined.
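As an illustration only, the second point can be written down as a tiny reachability check. This is a sketch of the stated policy, not the actual firewall configuration; the `Zone` names and the `labs_instance_can_reach` helper are hypothetical and exist only for this example.

```python
from enum import Enum


class Zone(Enum):
    """Hypothetical labels for the destinations discussed in this task."""
    INTERNET = "public internet"
    PROD_PUBLIC = "production public VLAN"
    PROD_PRIVATE = "production private VLAN"


def labs_instance_can_reach(dest: Zone) -> bool:
    """Model of the stated policy for traffic leaving a Labs instance.

    Labs instances may only talk to the "public internet", which
    includes the public production VLANs/IPs but explicitly not the
    private production VLANs.
    """
    return dest in {Zone.INTERNET, Zone.PROD_PUBLIC}


if __name__ == "__main__":
    for zone in Zone:
        print(f"labs instance -> {zone.value}: {labs_instance_can_reach(zone)}")
```

The sketch covers only the Labs-to-outside direction described above; the opposite direction (production hosts reaching Labs instances) is discussed separately in this thread and is not modeled here.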

Regarding the use of a public IP: gallium had one for historical reasons, and now that all uses have been migrated (web behind misc-cache, ssh via bastion, etc.) it is no longer needed. There are rules in the firewall to let it reach labs instances.

I don't think that's accurate. Gallium is indeed on the public vlan, but doesn't need ACL rules to reach Labs instances directly. (That's not guaranteed to be the case in the future though.) In the other direction, Labs instances should always be able to connect to public Internet IPs, as part of "the Internet" (through Source NAT).

We had contint1001 set up in an emergency on a server connected to the production private LAN, which cannot communicate with labs instances at all.

Yes, that would be quite problematic. The firewall is explicitly there to prevent that.

So we thought about putting the server in the labs-support network which has access to labs instances and would be reachable by the other hosts we have there (labnodepool and scandium).

As we discussed on IRC, that would not actually solve much (other than a few firewall rules), but actually make things worse from a security and management perspective.

I think keeping the situation as is with gallium, i.e. have it in the public VLAN with ferm enabled, is actually better, for the moment. It will still be able to connect to Labs instance IPs (for now), and in the other direction it's not a problem in any case.

Longer-term it would be nice to look at other architectures entirely which are not a hacky mix of production&labs intertwined. (Ideally leaving Labs out of the equation entirely perhaps.)

I'd like @faidon and @chasemp to weigh in as well.

I am good with following in the footsteps of gallium here as the pragmatic choice. It seems like the most settled outcome.

faidon added a comment. Sep 7 2016, 1:28 PM

Yes, this is in line with what I've previously said and it sounds fine with me. This is really not something we should be discussing for a month; let's proceed with the other steps.

hashar updated the task description. Sep 7 2016, 1:38 PM

Looping in netops. We would need contint1001 to be moved to the public network with a public IPv4 assignment.

contint1001.eqiad.wmnet is in rack A4 with IP 10.64.0.237
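For reference, the address quoted above sits in the RFC 1918 private range 10.0.0.0/8, which is why the move to the public network requires a new IPv4 assignment. A minimal check using Python's standard `ipaddress` module (an illustrative sketch; only the address itself comes from this task):

```python
import ipaddress

# Current address of contint1001.eqiad.wmnet, as quoted in this task.
current = ipaddress.ip_address("10.64.0.237")

# RFC 1918 block that contains it.
rfc1918_block = ipaddress.ip_network("10.0.0.0/8")

print(current in rfc1918_block)  # membership in 10.0.0.0/8
print(current.is_private)        # not routable on the public internet
```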

Then we will want to reimage it (Jessie)

RobH added a comment. Sep 7 2016, 5:24 PM

So I can handle the vlan move and reimage. Just to confirm there is no data that is currently residing on contint1001, and I can begin this work at any time?

RobH claimed this task. Sep 7 2016, 5:30 PM

Checked in with release engineering; it's cool for me to reimage this now (after the vlan move), so doing so.

RobH updated the task description. Sep 7 2016, 5:58 PM

Change 309069 had a related patch set uploaded (by Hashar):
contint: drop roles from contint1001

https://gerrit.wikimedia.org/r/309069

hashar added a comment. Sep 7 2016, 6:59 PM

I confirm the server content on contint1001.eqiad.wmnet can be wiped out.

Before reimaging we need to remove the puppet roles applied to the host. https://gerrit.wikimedia.org/r/309069 takes care of that.

RobH reassigned this task from RobH to hashar. Sep 7 2016, 8:56 PM
RobH updated the task description.

contint1001.wikimedia.org is online with puppet and salt keys accepted.

Since there was no entry for contint1001.wikimedia.org in site.pp, it merely gave it the default items and is ready for further implementation.

Change 309069 merged by RobH:
sites.pp: rename contint1001 and drop role

https://gerrit.wikimedia.org/r/309069

hashar closed this task as Resolved. Sep 26 2016, 10:42 AM

We had contint1001 allocated. It has a public IP and DNS entry. We have the basic firewall rules and all network flows are working.

The next step is to add Jenkins/Zuul etc. on the host, which is better tracked in separate tasks.

Thanks for the network fix!