
install/setup/deploy cobalt as replacement for gallium
Closed, Resolved (Public)

Description

This is the tracking task for the deployment of system cobalt as a replacement to gallium. Reasoning for this replacement is noted on the hardware-requests task T95760.

System Deployment Steps:

  - mgmt dns entries created/updated (both asset tag & hostname)
  - system bios and mgmt setup and tested
  - network switch port description set - set from previous use
  - network switch port vlan assignment set - needs determination, see notes below
  - install-server module updated (dhcp and netboot/partitioning) [done via this task when on-site subtasks complete]
  - install OS - Jessie
  - accept/sign puppet/salt keys [done via this task post os-installation]
  - service implementation [done via this task post puppet/salt acceptance]

Event Timeline

RobH claimed this task.
RobH raised the priority of this task from to Medium.
RobH updated the task description.
RobH added subscribers: RobH, chasemp, hashar.

@chasemp will be chasing down the network requirements. Cobalt needs to talk to labs hosts, which means it would currently have to sit in the public vlan. There are questions about this (like why the entire public vlan can hit all labs hosts), and whether cobalt should instead be in the internal vlan, with specific routing allowed to labs hosts.

For whoever does the install-server update (I didn't do it yet, since we aren't yet certain of the FQDN):

NIC1 Ethernet = 78:2b:cb:08:a1:68
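
As a rough illustration of what that install-server change would involve, an ISC dhcpd host stanza along these lines (sketch only: the final FQDN is still undetermined, so the names below are placeholders, and the netboot filename/next-server details are left out):

  host cobalt {
      # MAC address of NIC1, as listed above
      hardware ethernet 78:2b:cb:08:a1:68;
      # Placeholder name: the final FQDN has not been decided yet
      fixed-address cobalt.wikimedia.org;
      option host-name "cobalt";
  }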

RobH set Security to None.

Do we have any 30,000-foot network diagrams of our VLANs / zones / whatever? That would assist in figuring out how machines can communicate. That is a blind spot for me :-/

I talked to @mark about this and how it relates to the labnodepool box placement. mark wants to go over things a bit more thoroughly to come up with a recommendation. My synopsis is that labhosts vlan has a bit of a scope creep problem so the determination needs to be made on whether to accept that and realign expectations or to go another direction in the labnodepool case, all of which affect the "new gallium".

tl;dr: will follow up once mark has some time to evaluate.

Should we start drawing a network diagram representing the different LANs / VLANs we have and the traffic flows between jenkins/zuul/gearman/nodepool? I myself have no idea how the network is arranged.

mark subscribed.

Maybe we could go ahead with the same setup we had on gallium, or else let's make an actual blocker for the network-ops side of things.

Added @netops: please specify which VLAN to use for cobalt.

So, gallium is in public1-b-eqiad (208.80.154.128/26). The story behind a public IP is a long one and has to do with labs VMs being unable to talk to non-public-IP (i.e. 10.0.0.0/8) production hosts due to ACLs on the routers. That's an arbitrary limitation enforced by us AND it does not apply to public-IP production hosts. The rule is usually referred to as "Labs doesn't talk to production", which is not fully true as there is this exception. It's a hole in that policy that we exploit for various things, like the labs LDAP servers/CI and so on. That puts them in a gray area where there are services like CI/LDAP that need to talk to both production and labs. It resembles in a way the notion of a DMZ ( https://en.wikipedia.org/wiki/DMZ_(computing)#/media/File:DMZ_network_diagram_1_firewall.svg ), albeit less clearly defined. We have a notion of labs-support-hosts, which right now mostly houses the labstore hosts, plus scandium (zuul merger) and nobelium (elasticsearch labs replication test); that could perhaps be extended to cover this case.

While a public IP will obviously work, if we feel like it we can try to get cobalt into that network. @faidon, what do you think?

gallium.wikimedia.org hosts a bunch of services which are exposed publicly via the misc-web Varnish cache. I am not sure whether we would accept public traffic entering the labs-support-hosts network. The services are:

Service                     | URL                                                | Backend
Jenkins graphical interface | https://integration.wikimedia.org/ci/              | Apache proxy - Jenkins daemon
Zuul metadata               | https://integration.wikimedia.org/zuul/status.json | Apache proxy - Zuul daemon
Generated docs              | https://doc.wikimedia.org/                         | Apache
CI website + artifacts      | https://integration.wikimedia.org/                 | Apache

The integration.wikimedia.org entry points to the misc-web Varnish cache, which relays the requests to the Apache server on gallium.wikimedia.org. Apache either serves content directly (the doc/integration websites) or proxies the requests to the backend daemons, Zuul and Jenkins.
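
For reference, a minimal sketch of what that Apache side could look like, assuming mod_proxy/mod_proxy_http are enabled and that Jenkins and the Zuul status webapp listen locally on ports 8080 and 8001 respectively (the ports, paths and DocumentRoot below are assumptions for illustration, not the actual production values, which live in Puppet):

  <VirtualHost *:80>
      ServerName integration.wikimedia.org

      # Static CI website + artifacts served straight from disk (path is a placeholder)
      DocumentRoot /srv/org/wikimedia/integration

      # Jenkins web UI, proxied to the local Jenkins daemon (port 8080 assumed)
      ProxyPass        /ci http://127.0.0.1:8080/ci
      ProxyPassReverse /ci http://127.0.0.1:8080/ci

      # Zuul status (e.g. /zuul/status.json), proxied to the Zuul webapp (port 8001 assumed)
      ProxyPass        /zuul http://127.0.0.1:8001
      ProxyPassReverse /zuul http://127.0.0.1:8001
  </VirtualHost>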

The zuul-merger process on gallium can be phased out; there is already one on scandium, so that is one less thing to deal with.

Would a diagram of the current flows involved be of any help? We could well reorganize the various services and migrate them to different servers as appropriate.

We have created a subproject in Phabricator: https://phabricator.wikimedia.org/project/view/1966/

First step is for Release-Engineering-Team to agree on an architecture via T133300 and propose it to SRE for validation.

Quote from IRC discussing T137265: / on gallium is read only, breaking jenkins replacement box/emergency spares boxes

<moritzm> seems T95959 has never been updated, cobalt is marked as decomissioned in racktables

hashar added a subscriber: Joe.

Due to gallium losing a disk (T137265), @Joe allocated a new server from the pool. We had it named contint1001.eqiad.wmnet (10.64.0.237), and after some Puppet work it has more or less the base packages provisioned.

The service implementation is not complete yet; it needs firewall rules to be sorted out (T137323) and Puppet updates to change various IPs.
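
As an illustration only of the kind of rules T137323 has to sort out (the real ruleset is managed via Puppet and will differ): allow the misc-web Varnish frontends to reach Apache, and keep the backend daemon ports local since Apache proxies to them. The address range and ports below are stand-ins, not production values:

  # Allow the misc-web Varnish frontends (placeholder range) to reach Apache
  iptables -A INPUT -p tcp -s 198.51.100.0/24 --dport 80 -j ACCEPT

  # Jenkins (assumed TCP 8080) and Zuul status (assumed TCP 8001) stay local only
  iptables -A INPUT -p tcp -i lo --dport 8080 -j ACCEPT
  iptables -A INPUT -p tcp --dport 8080 -j DROP
  iptables -A INPUT -p tcp -i lo --dport 8001 -j ACCEPT
  iptables -A INPUT -p tcp --dport 8001 -j DROP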

Via T133300: Target architecture without gallium.wikimedia.org we will discuss the best target for the CI services. Having everything on a single production / physical server might not be the best option.

Regardless, this task was really about assigning a server and I think that is fulfilled.