
install/setup/deploy cobalt as replacement for gallium
Closed, Resolved (Public)

Description

This is the tracking task for the deployment of system cobalt as a replacement to gallium. Reasoning for this replacement is noted on the hardware-requests task T95760.

System Deployment Steps:

  - mgmt dns entries created/updated (both asset tag & hostname)
  - system bios and mgmt setup and tested
  - network switch port description set - set from previous use
  - network switch port vlan assignment set - needs determination, see notes below
  - install-server module updated (dhcp and netboot/partitioning) [done via this task when on-site subtasks complete]
  - install OS - Jessie
  - accept/sign puppet/salt keys [done via this task post os-installation]
  - service implementation [done via this task post puppet/salt acceptance]

Event Timeline

RobH claimed this task.
RobH raised the priority of this task from to Medium.
RobH updated the task description.
RobH added subscribers: RobH, chasemp, hashar.

@chasemp will be chasing down the network requirements. Cobalt needs to talk to labs hosts, which means it would currently have to sit in the public vlan. There are questions about this (like why the entire public vlan can hit all labs hosts), and whether cobalt should instead be in the internal vlan, with specific routing allowed to labs hosts.

For whoever does the install-server update (I didn't do it yet, since we aren't yet certain of the FQDN):

NIC1 Ethernet = 78:2b:cb:08:a1:68
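
As a rough illustration of what that install-server change would involve, an ISC dhcpd host stanza along these lines (sketch only: the final FQDN is still undetermined, so the names below are placeholders, and the netboot filename/next-server details are left out):

  host cobalt {
      # MAC address of NIC1, as listed above
      hardware ethernet 78:2b:cb:08:a1:68;
      # Placeholder name: the final FQDN has not been decided yet
      fixed-address cobalt.wikimedia.org;
      option host-name "cobalt";
  }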

RobH set Security to None.

Do we have any 30,000-foot network diagrams of our VLANs / zones / whatever? That would assist in figuring out how machines can communicate. That is a blind spot for me :-/

I talked to @mark about this and how it relates to the labnodepool box placement. mark wants to go over things a bit more thoroughly to come up with a recommendation. My synopsis is that labhosts vlan has a bit of a scope creep problem so the determination needs to be made on whether to accept that and realign expectations or to go another direction in the labnodepool case, all of which affect the "new gallium".

tl;dr: will follow up once mark has some time to evaluate.

Should we start drawing a network diagram representing the different LANs / VLANs we have and the traffic flows between jenkins/zuul/gearman/nodepool? I myself have no idea how the network is arranged.

mark subscribed.

Maybe we could go ahead with the same setup we had on gallium, or else let's make an actual blocker for the network-ops side of things.

Added @netops: please specify which VLAN to use for cobalt.

So, gallium is in public1-b-eqiad (208.80.154.128/26). The story behind a public IP is a long one and has to do with labs VMs being unable to talk to non-public-IP (i.e. 10.0.0.0/8) production hosts due to ACLs on the routers. That's an arbitrary limitation enforced by us AND it does not apply to public-IP production hosts. The rule is usually referred to as "Labs doesn't talk to production", which is not fully true as there is this exception. It's a hole in that policy that we exploit for various things, like the labs LDAP servers/CI and so on. That puts them in a gray area where there are services like CI/LDAP that need to talk to both production and labs. It resembles in a way the notion of a DMZ ( https://en.wikipedia.org/wiki/DMZ_(computing)#/media/File:DMZ_network_diagram_1_firewall.svg ), albeit less clearly defined. We have a notion of labs-support-hosts, which right now mostly houses the labstore hosts, plus scandium (zuul merger) and nobelium (elasticsearch labs replication test); that could perhaps be extended to cover this case.

While a public IP will obviously work, if we feel like it we can try to get cobalt into that network. @faidon, what do you think?

gallium.wikimedia.org hosts a bunch of services which are exposed publicly via the misc-web Varnish cache. I am not sure whether we would accept public traffic entering the labs-support-hosts network. The services are:

Service                     | URL                                                | Backend
Jenkins graphical interface | https://integration.wikimedia.org/ci/              | Apache proxy - Jenkins daemon
Zuul metadata               | https://integration.wikimedia.org/zuul/status.json | Apache proxy - Zuul daemon
Generated docs              | https://doc.wikimedia.org/                         | Apache
CI website + artifacts      | https://integration.wikimedia.org/                 | Apache

The integration.wikimedia.org entry points to the misc-web Varnish cache, which relays the requests to the Apache server on gallium.wikimedia.org. Apache either serves content directly (the doc/integration websites) or proxies the requests to the backend daemons, Zuul and Jenkins.
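
For reference, a minimal sketch of what that Apache side could look like, assuming mod_proxy/mod_proxy_http are enabled and that Jenkins and the Zuul status webapp listen locally on ports 8080 and 8001 respectively (the ports, paths and DocumentRoot below are assumptions for illustration, not the actual production values, which live in Puppet):

  <VirtualHost *:80>
      ServerName integration.wikimedia.org

      # Static CI website + artifacts served straight from disk (path is a placeholder)
      DocumentRoot /srv/org/wikimedia/integration

      # Jenkins web UI, proxied to the local Jenkins daemon (port 8080 assumed)
      ProxyPass        /ci http://127.0.0.1:8080/ci
      ProxyPassReverse /ci http://127.0.0.1:8080/ci

      # Zuul status (e.g. /zuul/status.json), proxied to the Zuul webapp (port 8001 assumed)
      ProxyPass        /zuul http://127.0.0.1:8001
      ProxyPassReverse /zuul http://127.0.0.1:8001
  </VirtualHost>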

The zuul-merger process on gallium can be phased out; there is already one on scandium, so that is one less thing to deal with.

Would a diagram of the current flows involved be of any help? We could well reorganize the various services and migrate them to different servers as appropriate.

We have created a subproject in Phabricator: https://phabricator.wikimedia.org/project/view/1966/

First step is for Release-Engineering-Team to agree on an architecture via T133300 and propose it to SRE for validation.

Quote from IRC discussing T137265: / on gallium is read only, breaking jenkins replacement box/emergency spares boxes

<moritzm> seems T95959 has never been updated, cobalt is marked as decomissioned in racktables

hashar added a subscriber: Joe.

Due to gallium losing a disk (T137265), @Joe allocated a new server from the pool. We had it named contint1001.eqiad.wmnet (10.64.0.237), and after some Puppet work it has more or less the base packages provisioned.

The service implementation is not complete yet; it needs firewall rules to be sorted out (T137323) and Puppet updates to change various IPs.
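
As an illustration only of the kind of rules T137323 has to sort out (the real ruleset is managed via Puppet and will differ): allow the misc-web Varnish frontends to reach Apache, and keep the backend daemon ports local since Apache proxies to them. The address range and ports below are stand-ins, not production values:

  # Allow the misc-web Varnish frontends (placeholder range) to reach Apache
  iptables -A INPUT -p tcp -s 198.51.100.0/24 --dport 80 -j ACCEPT

  # Jenkins (assumed TCP 8080) and Zuul status (assumed TCP 8001) stay local only
  iptables -A INPUT -p tcp -i lo --dport 8080 -j ACCEPT
  iptables -A INPUT -p tcp --dport 8080 -j DROP
  iptables -A INPUT -p tcp -i lo --dport 8001 -j ACCEPT
  iptables -A INPUT -p tcp --dport 8001 -j DROP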

Via T133300: Target architecture without gallium.wikimedia.org we will discuss the best target for the CI services. Having everything on a single production / physical server might not be the best option.

Regardless, this task was really about assigning a server and I think that is fulfilled.