
install/deploy scandium as zuul merger (ci) server
Closed, ResolvedPublic

Description

System Deployment Steps:

  • system mgmt setup and tested
  • system dns setup (both mgmt and production entries in labs support vlan)
  • network switch setup (port description & labs-support vlan)
  • install-server module updated (dhcp and netboot/partitioning)
  • install OS - Jessie
  • accept/sign puppet/salt keys
  • service implementation

Event Timeline

RobH claimed this task.
RobH raised the priority of this task to Medium.
RobH updated the task description.
RobH added a project: acl*sre-team.
RobH added subscribers: RobH, hashar, dduvall.
RobH changed the task status from Open to Stalled. Apr 3 2015, 8:51 PM

This is being stalled until next Friday (2015-04-10). Then we will discuss with @hashar and proceed as needed.

Following the Friday 2015-04-10 check-in, @chasemp talked to @mark about the labs VLANs.

During today's check-in, this task stayed stalled pending the outcome of the discussion at T95959#1206932 (cobalt, a replacement for gallium).

I'm going to assign this to chase, only while the discussion is pending about the networking. (Since he is discussing with @mark).

I just don't want to see this on my task list daily, and start to ignore it via repetition. As soon as the vlans are figured out, feel free to assign back to me and we can get this installed.

Thanks.

scandium is going to host the Zuul mergers.

On the isolation architecture overview, that is labci002 with the following flows:

source          dest            port  description
scandium        gallium (prod)  4730  Zuul merger to Zuul Gearman server
scandium        TBD             8125  Zuul merger metrics to statsd
labs instances  scandium        9418  Git connections from VMs to the Git daemon instances
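
The flows listed above could be spot-checked from the relevant hosts with a small script. This is only a sketch: the `Flow` tuple and the `tcp_reachable` helper are names made up for illustration, and it assumes the Gearman (4730) and Git daemon (9418) flows are plain TCP. statsd on 8125 usually speaks UDP, so a TCP connect test would not be meaningful for that row.

```python
import socket
from typing import NamedTuple


class Flow(NamedTuple):
    """One row of the flow table (hypothetical helper type)."""
    source: str
    dest: str
    port: int
    description: str


# Rows copied from the table; "TBD" stays as-is because the statsd
# destination had not been decided at the time.
FLOWS = [
    Flow("scandium", "gallium (prod)", 4730, "Zuul merger to Zuul Gearman server"),
    Flow("scandium", "TBD", 8125, "Zuul merger metrics to statsd"),
    Flow("labs instances", "scandium", 9418, "Git connections from VMs to the Git daemon"),
]


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Run on scandium, something like `tcp_reachable("gallium.wikimedia.org", 4730)` would indicate whether the firewall lets the merger reach the Gearman server.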

I would put it in the labs network just like labnodepool1001.eqiad.wmnet. It could connect to the Zuul server on gallium over the internet, and we can allow inbound Gearman connections from it to gallium.wikimedia.org.

The machine can thus be named labzuul1001.eqiad.wmnet

We actually decided on this at the offsite :)

This box can go in labs-support and there is no issue w/ current releng permissions translating. It will be accessible from prod and we will make allowance for Labs VMs.

Back atcha' @RobH

We are going labs-support :)

Note the zuul-merger process on scandium will need to be able to reach the Gearman server on gallium (production).

Should use the Jessie operating system.

RobH set Security to None.
RobH updated the task description.

I think the service implementation of this would belong to @hashar, so I am assigning this task to him for the final steps.

If this is not correct, please assign to the proper person.

Thanks!

RobH changed the task status from Stalled to Open. Oct 27 2015, 7:13 PM

Holy hell, how do you manage to install servers so fast ? :-}

Need to get rid of the Gerrit replication ( T86661 )

We will need to make sure all slaves (gallium.wikimedia.org and instances in contintcloud and integration) can reach the git-daemon on scandium.

Change 249387 had a related patch set uploaded (by Hashar):
multigit.sh: no more hardcode Zuul git URL

https://gerrit.wikimedia.org/r/249387

Change 249387 merged by jenkins-bot:
multigit.sh: no more hardcode Zuul git URL

https://gerrit.wikimedia.org/r/249387

Change 249389 had a related patch set uploaded (by Hashar):
contint: set Zuul URL based on server fqdn

https://gerrit.wikimedia.org/r/249389

Change 249389 merged by Andrew Bogott:
contint: set Zuul URL based on server fqdn

https://gerrit.wikimedia.org/r/249389

Change 249380 had a related patch set uploaded (by Hashar):
contint: scandium configuration

https://gerrit.wikimedia.org/r/249380

Change 249380 merged by Rush:
contint: scandium configuration

https://gerrit.wikimedia.org/r/249380

We got shell access thanks to ops reviews! Will now look at the network flows. Once happy we can apply the zuul::merger role and discard root access.

@RobH scandium has been installed with Trusty. Would need to reimage it to Jessie instead (sorry).

Some firewall rules have been added (T116975: Allow network flow between labs instance and scandium) so we probably want to keep the same IP address 10.64.4.12.

RobH updated the task description.

Reinstalled to jessie and has all keys accepted, ready for service implementation.

Change 252336 had a related patch set uploaded (by Hashar):
contint: setup zuul-merger on scandium.eqiad.wmnet

https://gerrit.wikimedia.org/r/252336

Change 252337 had a related patch set uploaded (by Hashar):
contint: pool in zuul-merger on scandium

https://gerrit.wikimedia.org/r/252337

I think most of the work has been done now. I have poked by email @Andrew and @chasemp to figure out with whom / when I can handle the deployment.

Scheduled for Tuesday 17th November

15:00–16:00 UTC
07:00–08:00 PST
16:00–17:00 UTC+1

Change 252336 merged by Andrew Bogott:
contint: setup zuul-merger on scandium.eqiad.wmnet

https://gerrit.wikimedia.org/r/252336

Change 253617 had a related patch set uploaded (by Andrew Bogott):
Ensure that the zuul-merger parent dir exists.

https://gerrit.wikimedia.org/r/253617

Change 253617 abandoned by Hashar:
Ensure that the zuul-merger parent dir exists.

Reason:
Been handled differently by using mkdir https://gerrit.wikimedia.org/r/#/c/253616/

https://gerrit.wikimedia.org/r/253617

Change 253893 had a related patch set uploaded (by Hashar):
contint: move iptables rule for zuul-merger git daemon

https://gerrit.wikimedia.org/r/253893

Change 252337 merged by Andrew Bogott:
contint: pool in zuul-merger on scandium

https://gerrit.wikimedia.org/r/252337

Change 253893 merged by Andrew Bogott:
contint: move iptables rule for zuul-merger git daemon

https://gerrit.wikimedia.org/r/253893

The last two puppet patches let the zuul-merger on scandium reach gallium AND let the slaves git clone from the scandium git-daemon.

The zuul-merger instance properly registered its function with the Gearman server on gallium which now shows two workers being available:

gallium$ zuul-gearman.py status|grep ^merger:
merger:update	0	0	2
merger:merge	0	0	2
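
A check like the one above could be automated. The sketch below assumes the status lines follow the Gearman admin `status` layout (function name, jobs queued, jobs running, available workers, separated by whitespace); `parse_gearman_status` and `FunctionStatus` are illustrative names, not part of Zuul.

```python
from typing import Dict, NamedTuple


class FunctionStatus(NamedTuple):
    queued: int
    running: int
    workers: int


def parse_gearman_status(text: str) -> Dict[str, FunctionStatus]:
    """Parse 'name queued running workers' lines into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        # The raw admin protocol terminates its output with a lone ".".
        if not line or line == ".":
            continue
        # Split from the right so the three counters are always last.
        name, queued, running, workers = line.rsplit(None, 3)
        result[name] = FunctionStatus(int(queued), int(running), int(workers))
    return result


# Sample taken from the output pasted above.
sample = "merger:update\t0\t0\t2\nmerger:merge\t0\t0\t2\n"
status = parse_gearman_status(sample)
mergers = {n: s for n, s in status.items() if n.startswith("merger:")}
```

With the sample above, both `merger:` functions report `workers == 2`, matching the two zuul-merger instances (gallium and scandium).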

More unpuppetized / badly puppetized stuff:

stderr: 'Cloning into '/srv/ssd/zuul/git/mediawiki/core'...
Warning: Identity file /var/lib/zuul/.ssh/id_rsa not accessible: No such file or directory.

The zuul-merger is unable to clone from Gerrit because it lacks the jenkins-bot ssh private key :-(
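
This kind of failure could be caught before the service starts with a small preflight check. A minimal sketch, assuming a Linux host; `identity_file_problems` is a hypothetical helper, and the permission test reflects the usual OpenSSH behaviour of refusing private keys readable by group or others.

```python
import os
import stat
from typing import List


def identity_file_problems(path: str) -> List[str]:
    """Return a list of human-readable problems with an ssh identity file."""
    problems = []
    if not os.path.isfile(path):
        problems.append(f"{path}: no such file")
        return problems
    if not os.access(path, os.R_OK):
        problems.append(f"{path}: not readable by the current user")
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        # Any group/other permission bit is set on the private key.
        problems.append(f"{path}: mode {mode:o} is accessible by group/others")
    return problems
```

Run as the zuul user, `identity_file_problems("/var/lib/zuul/.ssh/id_rsa")` would have reported the missing key shown in the warning above.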

Change 253925 had a related patch set uploaded (by Hashar):
zuul: support for zuul-merger gerrit ssh key

https://gerrit.wikimedia.org/r/253925

I copy-pasted the key from gallium to scandium, then manually established an ssh connection to Gerrit to accept its host key:

zuul@scandium$ ssh -p 29418 jenkins-bot@ytterbium.wikimedia.org

Worked. The "recheck" above used the zuul-merger process on scandium.

So this is essentially done. What is still pending is for the puppet private repo to hold the ssh key: https://gerrit.wikimedia.org/r/#/c/253925/

Change 253925 merged by Andrew Bogott:
zuul: support for zuul-merger gerrit ssh key

https://gerrit.wikimedia.org/r/253925

Service implementation is pretty much completed and has been running in prod for a few days now. What remains are the access rights ( T116921 and its child tasks ). Once cleared we can resolve this task and celebrate.

Dzahn subscribed.

the remaining blockers are closed. you can now *celebrate* @hashar