contint1002 service implementation tracking
Closed, Resolved (Public)

Description

This task is to track the service implementation of serviceops host(s) listed in the task description.

Once the linked racking task has been resolved, this task can be implemented.

This sub-task creation/update is per the request of serviceops; this task is assigned at creation to the 'Sub-team Technical Contact' provided in the initial ordering task.

contint1002 should have:

  • integration/docroot deployment
  • basics of Jenkins but service disabled (confirmed it is inactive/masked)
  • git-daemon
  • zuul-merger - The instance on contint1001 MUST be stopped before bringing zuul-merger up on contint1002 due to a per user ssh connection limit in Gerrit
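The ordering constraint on zuul-merger can be sketched as a small script. This is only an illustration under stated assumptions, not the procedure that was actually run: the systemd unit name `zuul-merger`, ssh access with sudo on both hosts, and the helper names `ssh_runner` and `move_zuul_merger` are all hypothetical. The idea is to refuse to start the new instance unless the old one is confirmed stopped.

```python
import subprocess
from typing import Callable, List

# A runner takes (host, argv) and returns the command's exit code.
Runner = Callable[[str, List[str]], int]

UNIT = "zuul-merger"  # assumption: the systemd unit is named after the service


def ssh_runner(host: str, argv: List[str]) -> int:
    """Run argv on host over ssh with sudo (assumes such access exists)."""
    return subprocess.run(["ssh", host, "sudo", *argv]).returncode


def move_zuul_merger(old_host: str, new_host: str, run: Runner = ssh_runner) -> None:
    """Stop the merger on old_host, verify it, then start it on new_host."""
    # 1. Stop and mask the old instance so nothing restarts it.
    for action in ("stop", "mask"):
        if run(old_host, ["systemctl", action, UNIT]) != 0:
            raise RuntimeError(f"systemctl {action} {UNIT} failed on {old_host}")

    # 2. `systemctl is-active --quiet` exits 0 only while the unit is active.
    if run(old_host, ["systemctl", "is-active", "--quiet", UNIT]) == 0:
        raise RuntimeError(f"{UNIT} is still active on {old_host}; not starting a second one")

    # 3. Only now bring the new instance up.
    if run(new_host, ["systemctl", "start", UNIT]) != 0:
        raise RuntimeError(f"systemctl start {UNIT} failed on {new_host}")
```

Under these assumptions, `move_zuul_merger("contint1001", "contint1002")` would perform the switch. Passing the runner as a parameter keeps the ordering logic testable without touching real hosts.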

Event Timeline

LSobanski changed the task status from Open to Stalled. (Sep 20 2022, 3:26 PM)
LSobanski moved this task from Blocked to Incoming on the collaboration-services board.

@LSobanski comment about the incident with contint1001 is at T294276#8357385

I think this is currently blocked on T313830.

@jnuche we will have to set up a spare Jenkins and a Zuul merger on this new host, contint1002 :-)

@hashar sounds like a good opportunity to pair!

LSobanski changed the task status from Stalled to Open. (Nov 23 2022, 7:49 PM)
LSobanski removed LSobanski as the assignee of this task.

Switching to contint1002 would also be a good opportunity to migrate to Bullseye (which per https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_policy should happen within nine months anyway):

contint1001 keeps crashing due to a faulty memory stick. It happened on October 31st (T294276#8357385), again on Dec 5th, and again today, Dec 7th.

The machine does not host the Jenkins controller or the Zuul scheduler, but it does provide a zuul-merger and a Jenkins agent. We thus can't afford to shut it down, and it should be replaced.

Change 865672 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] contint: give access to contint1002

https://gerrit.wikimedia.org/r/865672

Change 865680 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] contint: add ci::master to contint1002

https://gerrit.wikimedia.org/r/865680

Change 865681 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] contint: add contint1002 as a scap target

https://gerrit.wikimedia.org/r/865681

Change 865672 merged by Dzahn:

[operations/puppet@production] contint: give RelEng access to contint1002

https://gerrit.wikimedia.org/r/865672

I merged your change https://gerrit.wikimedia.org/r/c/operations/puppet/+/865672/4 so now releng members have shell access on contint1002.

[contint1002:~] $ grep roots /etc/group
contint-roots:x:720:hashar,thcipriani,brennen,dancy,jhuneidi,demon,dduvall,jnuche,jforrester

Change 865680 merged by Dzahn:

[operations/puppet@production] contint: add ci::master to contint1002

https://gerrit.wikimedia.org/r/865680

Mentioned in SAL (#wikimedia-operations) [2022-12-07T19:53:55Z] <mutante> registry* (docker registry HA) - adding contint1002 to allowed hosts gerrit:865680 T313832

Mentioned in SAL (#wikimedia-operations) [2022-12-07T20:00:13Z] <mutante> contint* - deploying firewall changes to add contint1002 - T313832

Change 865734 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: add docker::settings for contint1002

https://gerrit.wikimedia.org/r/865734

Change 865734 merged by Dzahn:

[operations/puppet@production] contint: add docker::settings for contint1002

https://gerrit.wikimedia.org/r/865734

Change 865735 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: move docker::settings to common, avoid host names

https://gerrit.wikimedia.org/r/865735

Change 865739 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci::master: hack to bootstrap new server contint1002

https://gerrit.wikimedia.org/r/865739

Change 865739 merged by Dzahn:

[operations/puppet@production] ci::master: hack to bootstrap new server contint1002

https://gerrit.wikimedia.org/r/865739

Mentioned in SAL (#wikimedia-releng) [2022-12-07T21:17:43Z] <hashar> Add contint1002 as an agent to the CI Jenkins, albeit in offline mode cause it is being provisioned | https://integration.wikimedia.org/ci/computer/contint1002/ | T313832

Mentioned in SAL (#wikimedia-releng) [2022-12-08T08:20:08Z] <hashar> Attached contint1002 as an agent of the CI Jenkins # T313832

Change 865681 merged by Clément Goubert:

[operations/puppet@production] contint: add contint1002 as a scap target

https://gerrit.wikimedia.org/r/865681

Mentioned in SAL (#wikimedia-operations) [2022-12-08T09:17:35Z] <hashar@deploy1002> Started deploy [integration/docroot@2e0d44b]: Warm up contint1002 and test php-fpm restart # T313832

Mentioned in SAL (#wikimedia-operations) [2022-12-08T09:17:39Z] <hashar@deploy1002> Finished deploy [integration/docroot@2e0d44b]: Warm up contint1002 and test php-fpm restart # T313832 (duration: 00m 03s)

Mentioned in SAL (#wikimedia-operations) [2022-12-08T09:24:53Z] <hashar@deploy1002> Started deploy [zuul/deploy@4c6859c]: Install Zuul virtualenv on contint1002 # T313832

Mentioned in SAL (#wikimedia-operations) [2022-12-08T09:25:00Z] <hashar@deploy1002> Finished deploy [zuul/deploy@4c6859c]: Install Zuul virtualenv on contint1002 # T313832 (duration: 00m 07s)

Mentioned in SAL (#wikimedia-operations) [2022-12-08T09:38:07Z] <hashar> contint1001: manually stopped and masked zuul-merger. It is under maintenance mode in Icinga # T313832

Mentioned in SAL (#wikimedia-operations) [2022-12-08T09:43:23Z] <hashar> contint1002: stopped puppet and manually started zuul-merger. I am monitoring it cause last time we have bring up a new one it had some issues here and there # T313832

Change 866277 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] contint: move zuul-merger from contint1001 to contint1002

https://gerrit.wikimedia.org/r/866277

Change 866277 merged by Clément Goubert:

[operations/puppet@production] contint: move zuul-merger from contint1001 to contint1002

https://gerrit.wikimedia.org/r/866277

Mentioned in SAL (#wikimedia-operations) [2022-12-08T10:18:52Z] <hashar> contint1002: activated Icinga monitoring , all services are up and running # T313832

hashar claimed this task.
hashar added a subscriber: Clement_Goubert.

contint1002 is now attached as a Jenkins agent and running the zuul-merger service.

I have removed contint1001 from Jenkins and it no longer has zuul-merger running.

Huge thanks to @Dzahn for planning the hardware replacement ahead of time and for pairing on the Puppet role deployment yesterday. Thanks to @Clement_Goubert for stepping up this morning, doing the Puppet paperwork and providing assistance with Icinga monitoring.

Change 865735 merged by Dzahn:

[operations/puppet@production] ci: move docker::settings to common, avoid host names

https://gerrit.wikimedia.org/r/865735

Change 867675 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] cloud: allow VMs to connect to contint1002 and contint2002

https://gerrit.wikimedia.org/r/867675

Change 867675 merged by Dzahn:

[operations/puppet@production] cloud: allow VMs to connect to contint1002 and contint2002

https://gerrit.wikimedia.org/r/867675