Jenkins lives on a single production machine in eqiad (contint1001.eqiad.wmnet), which is a single point of failure and does not let us switch to codfw if there is an issue in the eqiad datacenter. We have no good way to test a Jenkins upgrade, and rolling back an upgrade would be challenging.
+ hotspare or active/active
+ production
+ one less SPOF
+ lets us test Jenkins upgrades (e.g. T144106)
Setting up a secondary production machine hosting Jenkins in codfw would address these points. We can have it as a hot spare, use it for testing upgrades, and later on head toward an active/active setup.
Rough notes from @hashar and @thcipriani meeting:
New production machine in Dallas
===========
Get a new production machine in Dallas (cont2001.codfw.wmnet?). We will want to request hardware allocation. Note: we need a public IP address.
[ ] File a procurement ticket against [[ https://phabricator.wikimedia.org/project/profile/1014/ | hardware-requests ]] (see the docs there). --> T150865
Puppet work that has to be done:
[ ] review classes and find out whether they are fully configurable via hiera
[ ] identify potentially hardcoded IPs / hostnames
[ ] Review firewall rules
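
For the hardcoded IP / hostname check, a first pass could be a small script run over the puppet manifests and hiera files. This is only a rough sketch: the function name `find_hardcoded` and both patterns are assumptions made for illustration (IPv4 literals, and any hostname ending in `.wmnet` such as contint1001.eqiad.wmnet), so it will miss public hostnames and IPv6 addresses, and every hit still needs a manual review.

```python
import re

# Assumed patterns, not an exhaustive list:
# - IPv4 literals such as 10.0.0.1
# - hostnames ending in .wmnet such as contint1001.eqiad.wmnet
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HOST_RE = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.wmnet\b")


def find_hardcoded(text):
    """Return (line_number, literal) pairs for each IP or hostname found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in (IPV4_RE, HOST_RE):
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(0)))
    return hits
```

Feeding it the content of each manifest in turn gives a list of places to check, which can then be moved to hiera where that makes sense.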
Culprits
=====
[ ] Jenkins job updates would have to be run against both instances to keep the jobs in sync
[ ] Timed jobs (browsertests, beta cluster jobs, doc publishing) would run on both instances of Jenkins! A solution has to be figured out for that use case.
[ ] Not sure how Zuul will craft a report URL pointing to the proper Jenkins master
Future
=====
Both Nodepool and Zuul can be connected to multiple Jenkins masters if we aim at an active/active setup.
Later on we might want to add a spare Zuul to the new machine.