Secondary production Jenkins for CI
Closed, Resolved · Public

Description

Jenkins lives on a single production machine in eqiad (contint1001.eqiad.wmnet), which is a single point of failure and does not let us switch to codfw if the eqiad datacenter has an issue. We have no good way to test a Jenkins upgrade, and rolling one back would be challenging.

+ hotspare or active/active
+ production
+ one less SPOF
+ let us test Jenkins upgrades (eg T144106)

Setting up a secondary production machine hosting Jenkins in codfw would address these points. We can keep it as a hot spare, use it for testing upgrades, and later head toward an active/active setup.

Rough notes from @hashar and @thcipriani meeting:

New production machine in Dallas

Get a new production machine in Dallas (cont2001.codfw.wmnet?). We will want to request hardware allocation. Note: we need a public IP address.

Puppet work to be done:

  • Review classes and find out whether they are fully configurable via hiera
  • Identify potentially hardcoded IPs / hostnames
  • Review firewall rules
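
A rough way to approach the "hardcoded IP / hostname" bullet is a plain text scan over the manifests. The sketch below is only an illustration and not something from the meeting notes: the function name `find_hardcoded`, the two regexes, and the sample manifest (including the `contint::example` class and the `10.64.0.1` address) are all assumptions made up for the example.

```python
import re

# IPv4 dotted quads, and host names ending in the two domains seen in this task.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HOST_RE = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:wmnet|wikimedia\.org)\b")


def find_hardcoded(text):
    """Return (line_number, match) pairs for every IP or host name in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in (IPV4_RE, HOST_RE):
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(0)))
    return hits


# Hypothetical manifest content, used only to exercise the scan.
sample = """\
class contint::example {
    $master = 'contint1001.eqiad.wmnet'
    $listen = '10.64.0.1'
    $port   = 8080
}
"""

for lineno, value in find_hardcoded(sample):
    print(f"line {lineno}: {value}")
```

Anything the scan reports is only a candidate; each hit still has to be checked by hand to decide whether it should move to hiera.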

Culprits

  • Jenkins job would have to be run on both instances to have jobs in sync
  • Timed jobs (browsertests, beta cluster jobs, doc publishing) would run on both instances of Jenkins! A solution has to be figured out for that use case.
  • Not sure how Zuul will craft a report URL pointing to the proper Jenkins master

Future

Both Nodepool and Zuul can be connected to multiple Jenkins masters if we aim for an active/active setup.

Later on we might want to add a spare Zuul to the new machine.

Event Timeline

hashar created this task. Nov 15 2016, 4:48 PM
Restricted Application added a subscriber: Aklapper. Nov 15 2016, 4:48 PM
hashar updated the task description. Nov 15 2016, 5:08 PM
hashar updated the task description.
hashar added a subscriber: thcipriani.
hashar added a subscriber: Dzahn.
greg added a subscriber: greg. Nov 15 2016, 5:15 PM
Dzahn added a comment. Nov 15 2016, 8:47 PM

+1, I think it's a very good idea to have a warm standby server. This would be contint2001, to match contint1001, which replaced gallium.

hashar triaged this task as Medium priority. Nov 18 2016, 3:12 PM
RobH added a subscriber: RobH. Dec 15 2016, 9:15 PM

contint2001.wikimedia.org is now online in codfw and calling into puppet. It can be deployed into use as needed.

hashar updated the task description. Dec 15 2016, 9:50 PM

Change 327594 had a related patch set uploaded (by Hashar):
contint: provision the secondary CI master

https://gerrit.wikimedia.org/r/327594

Change 327649 had a related patch set uploaded (by Hashar):
zuul: manage service status from hiera

https://gerrit.wikimedia.org/r/327649

Change 327650 had a related patch set uploaded (by Hashar):
contint: add a disabled zuul server on contint2001

https://gerrit.wikimedia.org/r/327650

Change 327594 merged by Dzahn:
contint: provision the secondary CI master

https://gerrit.wikimedia.org/r/327594

compiled, checked it's noop on contint1001, merged provisioning change.

contint2001 is getting Apache, ferm rules and all that right now...

contint2001 now has:

  • contint-admin/roots users:
[contint2001:~] $ id hashar
uid=1010(hashar) gid=500(wikidev) groups=500(wikidev),720(contint-roots),719(contint-admins)
  • firewall rules
Chain INPUT (policy DROP)
target     prot opt source               destination         
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere             PKTTYPE = multicast
DROP       tcp  --  anywhere             anywhere             state NEW tcp flags:!FIN,SYN,RST,ACK/SYN
ACCEPT     icmp --  anywhere             anywhere            
ACCEPT     tcp  --  10.128.0.0/24        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.192.0.0/22        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.192.16.0/22       anywhere             tcp dpt:http
ACCEPT     tcp  --  10.192.20.0/24       anywhere             tcp dpt:http
ACCEPT     tcp  --  10.192.21.0/24       anywhere             tcp dpt:http
ACCEPT     tcp  --  10.192.32.0/22       anywhere             tcp dpt:http
ACCEPT     tcp  --  10.192.48.0/22       anywhere             tcp dpt:http
ACCEPT     tcp  --  10.20.0.0/24         anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.0.0/22         anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.16.0/22        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.20.0/24        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.21.0/24        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.32.0/22        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.36.0/24        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.37.0/24        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.4.0/24         anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.48.0/22        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.5.0/24         anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.52.0/24        anywhere             tcp dpt:http
ACCEPT     tcp  --  10.64.53.0/24        anywhere             tcp dpt:http
ACCEPT     tcp  --  198.35.26.0/28       anywhere             tcp dpt:http
ACCEPT     tcp  --  text-lb.ulsfo.wikimedia.org/27  anywhere             tcp dpt:http
ACCEPT     tcp  --  localnet/27          anywhere             tcp dpt:http
ACCEPT     tcp  --  text-lb.codfw.wikimedia.org/27  anywhere             tcp dpt:http
ACCEPT     tcp  --  208.80.153.32/27     anywhere             tcp dpt:http
ACCEPT     tcp  --  208.80.153.64/27     anywhere             tcp dpt:http
ACCEPT     tcp  --  208.80.153.96/27     anywhere             tcp dpt:http
ACCEPT     tcp  --  208.80.154.0/26      anywhere             tcp dpt:http
ACCEPT     tcp  --  208.80.154.128/26    anywhere             tcp dpt:http
ACCEPT     tcp  --  text-lb.eqiad.wikimedia.org/27  anywhere             tcp dpt:http
ACCEPT     tcp  --  208.80.154.64/26     anywhere             tcp dpt:http
ACCEPT     tcp  --  208.80.155.96/27     anywhere             tcp dpt:http
ACCEPT     tcp  --  91.198.174.0/25      anywhere             tcp dpt:http
ACCEPT     tcp  --  text-lb.esams.wikimedia.org/27  anywhere             tcp dpt:http
ACCEPT     tcp  --  helium.eqiad.wmnet   anywhere             tcp dpt:bacula-fd
ACCEPT     tcp  --  bast1001.wikimedia.org  anywhere             tcp dpt:ssh
ACCEPT     tcp  --  bast2001.wikimedia.org  anywhere             tcp dpt:ssh
ACCEPT     tcp  --  bast3001.wikimedia.org  anywhere             tcp dpt:ssh
ACCEPT     tcp  --  bast4001.wikimedia.org  anywhere             tcp dpt:ssh
ACCEPT     tcp  --  iron.wikimedia.org   anywhere             tcp dpt:ssh
ACCEPT     tcp  --  cobalt.wikimedia.org  anywhere             tcp dpt:29418
ACCEPT     tcp  --  gerrit.wikimedia.org  anywhere             tcp dpt:29418
ACCEPT     tcp  --  labnodepool1001.eqiad.wmnet  anywhere             tcp dpt:4730
ACCEPT     tcp  --  scandium.eqiad.wmnet  anywhere             tcp dpt:4730
ACCEPT     tcp  --  localhost            anywhere             tcp dpt:http-alt
ACCEPT     tcp  --  labnodepool1001.eqiad.wmnet  anywhere             tcp dpt:https
ACCEPT     tcp  --  labnodepool1001.eqiad.wmnet  anywhere             tcp dpt:8888
ACCEPT     all  --  tegmen.wikimedia.org  anywhere            
ACCEPT     all  --  einsteinium.wikimedia.org  anywhere            
ACCEPT     all  --  uranium.wikimedia.org  anywhere            
ACCEPT     tcp  --  prometheus2001.codfw.wmnet  anywhere             tcp dpt:9100
ACCEPT     tcp  --  prometheus2002.codfw.wmnet  anywhere             tcp dpt:9100
ACCEPT     tcp  --  localhost            anywhere             tcp dpt:8001

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
  • running Apache
[contint2001:~] $ sudo apachectl status
..
   Server uptime: 5 minutes 16 seconds
  • running jenkins
@contint2001:~# ps aux | grep java
jenkins   89355  0.0  0.0  18604   196 ?        S    00:42   0:00 /usr/bin/daemon --name=jenkins
  • no puppet errors
Notice: Finished catalog run in 20.24 seconds
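
To sanity-check whether a given client would pass the `dpt:http` ACCEPT rules listed above, the source networks can be tested with Python's standard `ipaddress` module. This is just a sketch: `HTTP_SOURCES` holds three networks copied from the listing rather than the full set, and `http_allowed` is a hypothetical helper name.

```python
import ipaddress

# Three of the source networks from the ACCEPT ... dpt:http rules (not the full list).
HTTP_SOURCES = [
    ipaddress.ip_network(net)
    for net in ("10.64.0.0/22", "10.192.0.0/22", "208.80.154.0/26")
]


def http_allowed(source_ip):
    """Return True if source_ip falls inside one of the sampled networks."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in HTTP_SOURCES)


for ip in ("10.64.1.5", "208.80.154.10", "192.0.2.10"):
    print(ip, http_allowed(ip))
```

With the INPUT policy set to DROP, an address that matches none of the ACCEPT rules never reaches Apache.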

Change 327649 merged by Dzahn:
zuul: manage service status from hiera

https://gerrit.wikimedia.org/r/327649

Change 327650 merged by Dzahn:
contint: add a disabled zuul server on contint2001

https://gerrit.wikimedia.org/r/327650

Change 327685 had a related patch set uploaded (by Dzahn):
contint: ensure => 'stopped' for zuul instead of 'mask'

https://gerrit.wikimedia.org/r/327685

Change 327685 merged by Dzahn:
contint: ensure => 'stopped' for zuul instead of 'mask'

https://gerrit.wikimedia.org/r/327685

Dzahn added a comment. Dec 16 2016, 1:58 AM

rebased/amended/merged/follow-up fix done

contint1001/2001 are now including identical roles in site.pp and are only different via hiera overrides that stop the zuul service on the non-active server.

We could not use the "mask" attribute of the service provider because it only exists in Puppet 4.x version. So that is changed to "stopped" and if we think we need mask, we'll have to use an "exec" to do it. (https://tickets.puppetlabs.com/browse/PUP-1253)

remaining questions:

  • jenkins_zmq_publisher is reported in Icinga as "s 127.0.0.1 and port 8888: Connection refused" - Should it be running or not on the standby server?
  • Icinga alerts for zuul (zuul_service_running, zuul_gearman_service) are alerting because zuul isn't running (as intended!) so we need to skip or disable these on the standby server (acked for now)
  • since both nodes are the same in site.pp now they can be merged with a regex, so they are always identical

Change 327691 had a related patch set uploaded (by Dzahn):
contint: combine contint1001/2001 in a single node regex

https://gerrit.wikimedia.org/r/327691

Change 327693 had a related patch set uploaded (by Dzahn):
contint: simplify includes in site.pp, move things to master role

https://gerrit.wikimedia.org/r/327693

Change 327695 had a related patch set uploaded (by Dzahn):
contint/zuul: skip Icinga monitoring if server not master

https://gerrit.wikimedia.org/r/327695

rebased/amended/merged/follow-up fix done

What a surprise to have everything handled and enhanced when I got my breakfast this morning! :-}

contint1001/2001 are now including identical roles in site.pp and are only different via hiera overrides that stop the zuul service on the non-active server.

Yeah, that is excellent. From reviewing the follow-up changes, I really like the idea of having a single hiera variable to switch between hosts (contint::master_host). I will amend some of your follow-up patches along those lines.
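
The idea of a single switch can be modelled in a few lines. The sketch below is not the Puppet code from the patches; it only mirrors the logic discussed here, under the assumption that `contint::master_host` holds the FQDN of the active server. The function name `zuul_state` and the dictionary keys are hypothetical.

```python
def zuul_state(fqdn, master_host):
    """Return the desired Zuul service state and monitoring flag for a host.

    master_host stands in for the contint::master_host hiera value; it is
    assumed here to be the fully qualified name of the active server.
    """
    is_master = fqdn == master_host
    return {
        # The standby gets 'stopped' rather than 'mask', as in change 327685.
        "service_ensure": "running" if is_master else "stopped",
        # Icinga checks are skipped when the server is not the master.
        "monitoring_enabled": is_master,
    }


master = "contint1001.wikimedia.org"
for host in ("contint1001.wikimedia.org", "contint2001.wikimedia.org"):
    print(host, zuul_state(host, master))
```

Switching over would then amount to changing the one value and letting both hosts pick up their new state on the next Puppet run.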

We could not use the "mask" attribute of the service provider because it only exists in Puppet 4.x version. So that is changed to "stopped" and if we think we need mask, we'll have to use an "exec" to do it. (https://tickets.puppetlabs.com/browse/PUP-1253)

Yeah, that one is entirely my fault: enable => 'mask' is apparently from a later version of Puppet. I was just guessing about it yesterday. A better long-term solution will be to move Zuul to systemd and our service::base_unit. I am pretty sure I have a patch to add systemd support but could not find it.

remaining questions:

  • jenkins_zmq_publisher is reported in Icinga as "s 127.0.0.1 and port 8888: Connection refused" - Should it be running or not on the standby server?

The Jenkins server has a plugin to act as a ZeroMQ publisher. If Jenkins is disabled or not running, the publisher is not running either. So essentially, when jenkins::service_* is not enabled, that monitoring should be disabled, the same way your patch https://gerrit.wikimedia.org/r/327695 will do for Zuul.

  • Icinga alerts for zuul (zuul_service_running, zuul_gearman_service) are alerting because zuul isn't running (as intended!) so we need to skip or disable these on the standby server (acked for now)

Indeed. Can follow-up on your patch https://gerrit.wikimedia.org/r/327695

  • since both nodes are the same in site.pp now they can be merged with a regex, so they are always identical

\O/

I have updated the labs projects security rules to allow contint2001 to ssh to the labs instances

status: got distracted with other things today. Have to follow up on Daniel's follow-up patches and refactor a few things.

Change 327691 merged by Dzahn:
contint: combine contint1001/2001 in a single node regex

https://gerrit.wikimedia.org/r/327691

Change 327693 merged by Dzahn:
contint: fix/move 'backup'-includes, move from node to role

https://gerrit.wikimedia.org/r/327693

Change 327695 merged by Dzahn:
contint/zuul: skip Icinga monitoring if server not master

https://gerrit.wikimedia.org/r/327695

Zppix added a subscriber: Zppix. Apr 12 2017, 8:23 PM

Change 348171 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] contint/icinga: make jenkins service monitoring configurable

https://gerrit.wikimedia.org/r/348171

Change 348171 merged by Dzahn:
[operations/puppet@production] contint/icinga: make jenkins service monitoring configurable

https://gerrit.wikimedia.org/r/348171

Change 348191 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] contint/icinga: skip zmq_publisher monitor if no jenkins

https://gerrit.wikimedia.org/r/348191

Change 348191 merged by Dzahn:
[operations/puppet@production] contint/icinga: skip zmq_publisher monitor if no jenkins

https://gerrit.wikimedia.org/r/348191

once jenkins is running on both servers, don't forget to remove https://gerrit.wikimedia.org/r/#/c/348171/2/hieradata/hosts/contint2001.yaml to activate icinga checks

Dzahn added a comment. May 17 2017, 8:18 PM

update: the current status is:

# CI master / CI standby (switch in Hiera)
node /^(contint1001|contint2001)\.wikimedia\.org$/ {
    role(ci::master,
        ci::slave,
        ci::website,
        zuul::merger,
        zuul::server)

So both servers share the same node regex and the same roles; we did the Puppet work to make that happen.
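
For anyone who wants to double-check which hosts the shared node regex picks up, the same pattern can be tried in Python. This is only a sketch: Puppet evaluates the pattern with Ruby's regex engine, but for this simple expression the two behave the same on single-line host names. The helper name `is_ci_node` is hypothetical.

```python
import re

# Same pattern as the node definition in site.pp.
NODE_RE = re.compile(r"^(contint1001|contint2001)\.wikimedia\.org$")


def is_ci_node(fqdn):
    """Return True if fqdn would be matched by the shared node regex."""
    return NODE_RE.match(fqdn) is not None


for host in (
    "contint1001.wikimedia.org",
    "contint2001.wikimedia.org",
    "contint1001.eqiad.wmnet",
):
    print(host, is_ci_node(host))
```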

@hashar Is this ticket resolved? If not, what is missing?

hashar changed the task status from Open to Stalled. Oct 12 2017, 8:46 AM

I would ultimately like to have Jenkins in active/active. I lack the time to complete that, though :(

hashar closed this task as Resolved. Dec 21 2018, 10:15 PM
hashar claimed this task.

The original intent was to have two masters, but we cannot commit to it.

At least we now have a hot spare, which would let us quickly restore the Jenkins CI service if the primary dies somehow.