
eqiad: 2 hardware access request for CI isolation on labsnet
Closed, ResolvedPublic

Description

The CI isolation project aims at running tests on isolated machines. It reuses the wmflabs OpenStack system to spawn a pool of VMs, which are then consumed as Jenkins jobs are triggered.

The proposed architecture (overview on wiki) has two services in the labs subnet, each on its own hardware:

nodepool:

We will have a pool manager placed in the labs subnet and interacting with the OpenStack API to create images / spawn instances. We might later on move the Zuul scheduler server from gallium.wikimedia.org to that machine as well.

The server will also be responsible for bootstrapping images. Thus it will have slight CPU/IO spikes while generating them and a network spike when pushing the resulting image to labs OpenStack.

Nodepool will need connections to production machines ( [[ https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Isolation#Security_matrix | security matrix ]] ), namely ZeroMQ/HTTPS to gallium.wikimedia.org, MySQL to one of the db10xx servers, and statsd UDP packets.

zuul mergers:

A second machine will be in charge of preparing the code that will be tested. It takes the patches proposed in Gerrit and merges them on the tip of the branch. The result is then retrieved by the jobs over a git-daemon.
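
As a rough illustration of the merge step described above, the sketch below builds the git commands that would fetch one Gerrit patchset and merge it on the tip of a branch. It is not how zuul-merger is implemented; the helper names `gerrit_change_ref` and `merge_plan` are hypothetical, and only the `refs/changes/...` layout is Gerrit's documented ref naming.

```python
from typing import List


def gerrit_change_ref(change: int, patchset: int) -> str:
    """Return the Gerrit ref for a patchset.

    Gerrit publishes patchsets under
    refs/changes/<last two digits of change>/<change>/<patchset>.
    """
    return f"refs/changes/{change % 100:02d}/{change}/{patchset}"


def merge_plan(remote: str, branch: str, change: int, patchset: int) -> List[List[str]]:
    """Build (but do not run) the git commands for a speculative merge.

    Hypothetical helper for illustration only.
    """
    ref = gerrit_change_ref(change, patchset)
    return [
        ["git", "remote", "update"],                             # refresh all remotes (the slow, network-bound step)
        ["git", "checkout", "--detach", f"{remote}/{branch}"],  # start from the tip of the target branch
        ["git", "fetch", remote, ref],                           # fetch the proposed patchset
        ["git", "merge", "--no-edit", "FETCH_HEAD"],             # merge it on top of the branch tip
    ]


if __name__ == "__main__":
    for argv in merge_plan("origin", "master", 1234, 5):
        print(" ".join(argv))
```

Each command list could be passed to `subprocess.run(..., cwd=repo_path, check=True)` inside a clone of the repository; the plan is kept separate from execution so it can be inspected first.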

Zuul merger has a noticeable network delay (git remote update takes several seconds) when updating the repos, so we will have two zuul-merger instances running in parallel. Since git is heavily file based, each instance will act on its own SSD. No need for RAID: in case of hardware failure the data will be repopulated from Gerrit, and we can run with a single instance in the meantime.

The zuul-mergers will each establish a Gearman connection to gallium.wikimedia.org ( [[ https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Isolation#Security_matrix | security matrix ]] ) and send statsd UDP packets.
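
The outbound connections requested for both services can be written down as data, which may help when reviewing firewall rules. This is only a sketch: the port numbers are the usual defaults for each protocol and are assumptions (the task does not state them), the statsd host is not named in the task, and `Flow` / `flows_from` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Flow:
    source: str
    destination: str
    protocol: str
    port: int  # ASSUMPTION: common default port, not taken from the task


FLOWS = [
    Flow("nodepool", "gallium.wikimedia.org", "ZeroMQ", 8888),      # assumed Jenkins ZMQ publisher port
    Flow("nodepool", "gallium.wikimedia.org", "HTTPS", 443),
    Flow("nodepool", "db10xx (one server)", "MySQL", 3306),
    Flow("nodepool", "statsd host (not named)", "statsd/UDP", 8125),
    Flow("zuul-merger", "gallium.wikimedia.org", "Gearman", 4730),
    Flow("zuul-merger", "statsd host (not named)", "statsd/UDP", 8125),
]


def flows_from(source: str) -> List[Flow]:
    """Return every flow that originates from the given service."""
    return [flow for flow in FLOWS if flow.source == source]


if __name__ == "__main__":
    for flow in FLOWS:
        print(f"{flow.source} -> {flow.destination} ({flow.protocol}, port {flow.port})")
```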


Labs Project Tested: NodePool hasn't been tested yet. Needs access to an OpenStack API. zuul-merger has been in prod for a while on gallium.wikimedia.org
Site/Location: eqiad in lab subnet
Number of systems: 2
Service: Continuous Integration
Internal/External IP Address: internal IP in labs subnet
VLAN: _____

Nodepool

Upstream has an 8GB instance monitored via cacti: overview, CPU usage, Memory usage

Processor Requirements: 4 cores
Memory: 4GB
Disks: a few GB; not much is needed.
NIC(s): 1
Partitioning Scheme: LVM. A partition for /var
Other Requirements:

Zuul merger

It is merely doing git merges which are potentially disk I/O intensive and suffer from disk and network latency.

Processor Requirements: 2 cores; git operations are not very CPU intensive
Memory: 2GB
Disks: for the zuul-merger two 32+GB SSD. Actual consumption is just 7GB!
NIC(s): 1
Partitioning Scheme: LVM. Each SSD as one big partition mounted under /srv/ssd1 and /srv/ssd2. No RAID needed for the SSD.
Other Requirements:
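
As a quick sanity check of the figures requested above, the two specifications can be totalled and the SSD headroom computed. This is just arithmetic on the numbers in this task; the `Spec` type and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    cores: int
    memory_gb: int


# Figures copied from the request above.
NODEPOOL = Spec(cores=4, memory_gb=4)
ZUUL_MERGER = Spec(cores=2, memory_gb=2)


def combined(*specs: Spec) -> Spec:
    """Sum the core and memory requirements of several services."""
    return Spec(
        cores=sum(s.cores for s in specs),
        memory_gb=sum(s.memory_gb for s in specs),
    )


def ssd_headroom(capacity_gb: float, used_gb: float) -> float:
    """Ratio of SSD capacity to current consumption."""
    return capacity_gb / used_gb


if __name__ == "__main__":
    total = combined(NODEPOOL, ZUUL_MERGER)
    print(f"Total: {total.cores} cores, {total.memory_gb} GB RAM")
    print(f"SSD headroom: {ssd_headroom(32, 7):.1f}x current usage")
```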

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description. (Show Details)
hashar added subscribers: hashar, Andrew.
Restricted Application added a subscriber: Aklapper. · Mar 18 2015, 3:05 PM

+ Dan Duvall

If need be, we can have a hangout together to refine the procurement ticket.

RobH added a subscriber: RobH. · Mar 25 2015, 11:19 PM

These are two fairly low requirement systems, can they share a single host?

> These are two fairly low requirement systems, can they share a single host?

Totally, and we should do that. Looks like I have been missing the big picture. The Zuul merger / git daemon have fairly low CPU/memory needs. Their disk I/O would be contained to the SSD drives, which will have little impact on Nodepool processing. If we need more power later on, we can still split the services.

@RobH Do you know what kind of machines we have floating around?

fgiunchedi triaged this task as Medium priority. · Apr 1 2015, 10:54 AM
fgiunchedi added a subscriber: fgiunchedi.
RobH added a comment. · Apr 3 2015, 8:09 PM

My apologies for how long a reply has been pending:

https://wikitech.wikimedia.org/wiki/Server_Spares

The above page lists the current spare systems in the two primary locations. Please peruse and if any match the required spec, please say so.

If none are ideal, we can gather info and get a quote for a new system.

Chase / Andrew / I just had a meeting. We followed the discussion on IRC with RobH.

Seems we will get a machine in labs:

610, Single Intel Xeon E5640, 32GB Memory, Dual 120GB Intel SSD

Named labnodepool1001

That will be used as a sandbox to prepare the Nodepool installation. It will then be reinstalled from scratch.

The Zuul merger can be added to it since it has SSD disks, though we might end up installing the mergers somewhere else.

RobH added a comment. · Apr 3 2015, 8:35 PM

After an IRC discussion, we will be allocating two of the old squid systems for these tasks:

  • WMF3095 (in row c) as labnodepool1001
  • WMF3121 (in row a) as scandium

Let's get WMF3095 (in row c) installed as labnodepool1001. We need it to start the integration of Nodepool and play with it.

We don't know yet whether we will host the services on the same machine or if we will need a second machine in a different network. So let's keep an option on the second host, WMF3121 (in row a), as scandium. To be confirmed after our next meeting on April 10th.

labnodepool1001 has been installed and is ready for service implementation

scandium (zuul mergers) should land in the labs hosts subnet, but there are ongoing discussions about where exactly to place this machine. That is similar to T95959: install/setup/deploy cobalt as replacement for gallium, which is to land in the labs subnet as well.

So the second server is still on hold.

RobH assigned this task to hashar. · May 28 2015, 5:14 PM

Since the second server is on hold, rather than keep this unassigned (and end up with me checking it every day ;), I'm going to assign this to #hashar.

Once we are ready to move forward, you can assign it back to me. (So I'll simply skip past this task in my hardware-requests list while it's assigned to you.)

RobH closed this task as Resolved. · Oct 27 2015, 6:00 PM

I'm resolving this task, as the install of scandium is tracked on T95046.