
Fix gerrit-restart cookbook
Closed, Resolved (Public)

Description

Replication has been warmed up; @Dzahn will handle the last step:

sudo cookbook sre.gerrit.restart-gerrit -t T417247 --host gerrit2003

16:56:19 <mutante> raise AlertmanagerError(f"Unable to {method.upper()} to any Alertmanager: {self._alertmanager_urls}", response)

This task is to track the debugging of that error.
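
For context, the message above has the shape of a "try every instance, fail only if none answers" client. A minimal sketch of that pattern follows. It is not the actual spicerack code: `request_any`, the `send` hook, and the local `AlertmanagerError` class are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence


class AlertmanagerError(Exception):
    """Illustrative stand-in: raised when no Alertmanager accepts the request."""


def request_any(
    urls: Sequence[str],
    method: str,
    path: str,
    send: Callable[[str, str], int],
) -> str:
    """Try each base URL in order and return the first that answers with 2xx.

    `send(method, full_url)` is a hypothetical transport hook. It returns an
    HTTP status code, or raises OSError if the instance cannot be reached.
    """
    for base in urls:
        try:
            status = send(method, base + path)
        except OSError:
            continue  # instance unreachable, try the next one
        if 200 <= status < 300:
            return base
    # Every instance failed or returned a non-2xx status.
    raise AlertmanagerError(
        f"Unable to {method.upper()} to any Alertmanager: {list(urls)}"
    )
```

Under this pattern the error only says that every configured URL failed, so debugging has to look at the per-instance response (status code and body) to see why the request was rejected.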

Event Timeline

ABran-WMF triaged this task as Medium priority. Feb 24 2026, 3:59 PM
ABran-WMF moved this task from Incoming to Backlog on the collaboration-services board.

Change #1239003 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: alerting downtime update

https://gerrit.wikimedia.org/r/1239003

Change #1239003 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: alerting downtime update

https://gerrit.wikimedia.org/r/1239003

Things are going as expected:

DRY-RUN: Executing cookbook sre.gerrit.restart-gerrit with args: ['--host', 'gerrit2002']
DRY-RUN: Found [('conf1009.eqiad.wmnet', 4001), ('conf1007.eqiad.wmnet', 4001), ('conf1008.eqiad.wmnet', 4001)]
DRY-RUN: New etcd client created for https://conf1009.eqiad.wmnet:4001
DRY-RUN: Retrieved list of machines: ['https://conf1007.eqiad.wmnet:4001', 'https://conf1008.eqiad.wmnet:4001', 'https://conf1009.eqiad.wmnet:4001']
DRY-RUN: Machines cache initialised to ['https://conf1007.eqiad.wmnet:4001', 'https://conf1008.eqiad.wmnet:4001']
DRY-RUN: Acquiring lock for key sre.gerrit.restart-gerrit: {'concurrency': 1, 'created': '2026-03-04 13:24:17.719159', 'owner': 'arnaudb@cumin1003 [4141862]', 'ttl': 900}
DRY-RUN: Reduce tries from 27 to 1 in DRY-RUN mode
DRY-RUN: Issuing read for key /spicerack/locks/cookbooks/sre.gerrit.restart-gerrit with args {'timeout': 60}
DRY-RUN: Skipping lock acquire/release in DRY-RUN mode
DRY-RUN: Acquired lock for key /spicerack/locks/cookbooks/sre.gerrit.restart-gerrit: {'concurrency': 1, 'created': '2026-03-04 13:24:17.723665', 'owner': 'arnaudb@cumin1003 [4141862]', 'ttl': 900}
DRY-RUN: START - Cookbook sre.gerrit.restart-gerrit Restarting Gerrit on gerrit2002
DRY-RUN: Setting downtime for gerrit2002
DRY-RUN: Resolved CNAME record for icinga.wikimedia.org: icinga.wikimedia.org. 249 IN CNAME alert1002.wikimedia.org.
DRY-RUN: Executing commands ["grep -P '\\s*command_file\\s*=.+' /etc/icinga/icinga.cfg"] on 1 hosts: alert1002.wikimedia.org
DRY-RUN: Executing commands [cumin.transports.Command('/usr/local/bin/icinga-status -j "gerrit2002"', ok_codes=[])] on 1 hosts: alert1002.wikimedia.org
DRY-RUN: Scheduling downtime on Icinga server alert1002.wikimedia.org for hosts: gerrit2002
DRY-RUN: Executing commands ['bash -c \'echo -n "[1772630659] SCHEDULE_HOST_DOWNTIME;gerrit2002;1772630659;1772645059;1;0;14400;arnaudb@cumin1003;Restarting Gerrit on gerrit2002" > /var/lib/icinga/rw/icinga.cmd \''] on 1 hosts: alert1002.wikimedia.org
DRY-RUN: Executing commands ['bash -c \'echo -n "[1772630659] SCHEDULE_HOST_SVC_DOWNTIME;gerrit2002;1772630659;1772645059;1;0;14400;arnaudb@cumin1003;Restarting Gerrit on gerrit2002" > /var/lib/icinga/rw/icinga.cmd \''] on 1 hosts: alert1002.wikimedia.org
DRY-RUN: Reduce tries from 12 to 1 in DRY-RUN mode
DRY-RUN: Executing commands [cumin.transports.Command('/usr/local/bin/icinga-status -j "gerrit2002"', ok_codes=[])] on 1 hosts: alert1002.wikimedia.org
DRY-RUN: Some hosts are not yet downtimed: ['gerrit2002']
DRY-RUN: Would have called POST http://alertmanager-eqiad.wikimedia.org/api/v2/silences
DRY-RUN: Would have called POST http://alertmanager-eqiad.wikimedia.org/api/v2/silences
==> About to restart Gerrit on gerrit2002. Full downtime active (Host + Alertmanager). Proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
DRY-RUN: User input is: "go"
DRY-RUN: Restarting gerrit service on gerrit2002
DRY-RUN: Executing commands ['systemctl restart gerrit'] on 1 hosts: gerrit2002.wikimedia.org
DRY-RUN: Skipping monitoring wait because of dry run.
DRY-RUN: Would have called DELETE http://alertmanager-eqiad.wikimedia.org/api/v2/silence/
DRY-RUN: Deleted silence ID 
DRY-RUN: Executing commands ['bash -c \'echo -n "[1772630664] DEL_DOWNTIME_BY_HOST_NAME;gerrit2002" > /var/lib/icinga/rw/icinga.cmd \''] on 1 hosts: alert1002.wikimedia.org
DRY-RUN: Would have called DELETE http://alertmanager-eqiad.wikimedia.org/api/v2/silence/
DRY-RUN: Deleted silence ID 
DRY-RUN: Gerrit restart completed successfully. Downtimes removed.
DRY-RUN: Releasing lock for key sre.gerrit.restart-gerrit with ID 046af7f1-cbfb-4249-8c32-8b3d1785c95a
DRY-RUN: Issuing read for key /spicerack/locks/cookbooks/sre.gerrit.restart-gerrit with args {'timeout': 60}
DRY-RUN: Lock for key /spicerack/locks/cookbooks/sre.gerrit.restart-gerrit and ID 046af7f1-cbfb-4249-8c32-8b3d1785c95a not found. Unable to release it. Was expired?
DRY-RUN: __COOKBOOK_STATS__:name=sre.gerrit.restart-gerrit,exit_code=0,duration=6.651
DRY-RUN: END (PASS) - Cookbook sre.gerrit.restart-gerrit (exit_code=0) Restarting Gerrit on gerrit2002
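
For readers decoding the Icinga lines in the dry run: the strings written to icinga.cmd follow the standard external command layout `[time] COMMAND;host;start;end;fixed;trigger_id;duration;author;comment`. The sketch below rebuilds the string seen in the log from its parts. The helper name `schedule_host_downtime` is hypothetical and this is not how the cookbook itself builds it; the values are taken from the log above.

```python
def schedule_host_downtime(
    host: str,
    start: int,
    duration: int,
    author: str,
    comment: str,
    services: bool = False,
) -> str:
    """Build an Icinga external command string for a fixed downtime.

    `start` is a Unix timestamp and `duration` is in seconds. With
    services=True the command also covers all services on the host.
    """
    name = "SCHEDULE_HOST_SVC_DOWNTIME" if services else "SCHEDULE_HOST_DOWNTIME"
    end = start + duration
    # fixed=1 (downtime starts and ends exactly at start/end), trigger_id=0 (not triggered)
    fields = [name, host, start, end, 1, 0, duration, author, comment]
    return f"[{start}] " + ";".join(str(field) for field in fields)


print(
    schedule_host_downtime(
        "gerrit2002",
        1772630659,
        14400,
        "arnaudb@cumin1003",
        "Restarting Gerrit on gerrit2002",
    )
)
```

The end timestamp in the log (1772645059) is the start plus 14400 seconds, so the scheduled downtime window is four hours.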