
Reboot pc1013
Closed, Resolved (Public)

Description

Steps:

  • Stop replication on pc1014.
  • Merge CR to change pc3 primary to be pc1014: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/787497
  • Deploy: scap sync-file wmf-config/ProductionServices.php "Set pc1014 as pc3 primary T307101"
  • Downtime pc3: sudo cookbook sre.hosts.downtime --hours 1 -r "Rebooting pc1013 T307101" 'A:db-section-pc3'
  • Reboot pc1013: SKIP_DBCTL=1 SKIP_START_SLAVE=1 ~kormat/bin/reboot-host T303174 pc1013.eqiad.wmnet
  • Revert previous CR
  • Deploy: scap sync-file wmf-config/ProductionServices.php "Set pc1013 as pc3 primary T307101"
  • Start replication on pc1014.

Note that pc2013 will keep trying to replicate from pc1013 during this time.
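
The ordering of the steps above can be sketched as a dry-run script. This is only an illustration, not the tooling actually used: the `step` and `runbook` helpers and the `TASK` variable are hypothetical names introduced here, and nothing is executed. Commands are copied verbatim from the list (including the task ID passed to `reboot-host`). Steps for which the task gives no command are printed as `MANUAL`.

```shell
#!/bin/sh
# Dry-run sketch of the runbook: every step is only printed, in order.
# Nothing here touches a database host or deploys anything.
TASK="T307101"

# step: print one numbered step (hypothetical helper for this sketch).
n=0
step() {
    n=$((n + 1))
    printf '%d. %s\n' "$n" "$1"
}

# runbook: list the steps in the order given in the task description.
runbook() {
    n=0
    step "MANUAL: stop replication on pc1014"
    step "MANUAL: merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/787497"
    step "scap sync-file wmf-config/ProductionServices.php \"Set pc1014 as pc3 primary $TASK\""
    step "sudo cookbook sre.hosts.downtime --hours 1 -r \"Rebooting pc1013 $TASK\" 'A:db-section-pc3'"
    step "SKIP_DBCTL=1 SKIP_START_SLAVE=1 ~kormat/bin/reboot-host T303174 pc1013.eqiad.wmnet"
    step "MANUAL: revert the previous CR"
    step "scap sync-file wmf-config/ProductionServices.php \"Set pc1013 as pc3 primary $TASK\""
    step "MANUAL: start replication on pc1014"
}

runbook
```

Printing rather than executing keeps the sketch safe to run anywhere. It also makes the sequence visible: pc1014 is promoted before pc1013 is rebooted, and pc1013 is restored as primary before replication on pc1014 is started again.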

Afterwards:

  • Move pc1014 back to pc1
  • Truncate all tables on pc1014

Related Objects

Status    Subtype  Assigned  Task
Resolved           Kormat

Event Timeline

Change 787496 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] pc1014: Move to pc3.

https://gerrit.wikimedia.org/r/787496

Change 787496 merged by Kormat:

[operations/puppet@production] pc1014: Move to pc3.

https://gerrit.wikimedia.org/r/787496

Kormat triaged this task as Medium priority. (Apr 28 2022, 1:09 PM)
Kormat updated the task description.

Change 787497 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] ProductionServices: Promote pc1014 to primary of pc3

https://gerrit.wikimedia.org/r/787497

Marostegui updated the task description.
Marostegui updated the task description.

Made some fixes - it looks good

Change 787497 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: Promote pc1014 to primary of pc3

https://gerrit.wikimedia.org/r/787497

Mentioned in SAL (#wikimedia-operations) [2022-04-28T13:42:06Z] <kormat@deploy1002> Synchronized wmf-config/ProductionServices.php: Set pc1014 as pc3 primary T307101 (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2022-04-28T13:42:49Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Rebooting pc1013 T307101

Mentioned in SAL (#wikimedia-operations) [2022-04-28T13:42:54Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Rebooting pc1013 T307101

Mentioned in SAL (#wikimedia-operations) [2022-04-28T14:08:48Z] <kormat@deploy1002> Synchronized wmf-config/ProductionServices.php: Set pc1013 as pc3 primary T307101 (duration: 00m 54s)

Change 787511 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] pc1014: Move back to pc1.

https://gerrit.wikimedia.org/r/787511

Change 787511 merged by Kormat:

[operations/puppet@production] pc1014: Move back to pc1.

https://gerrit.wikimedia.org/r/787511

Kormat updated the task description.