
(OoW) db2045 failed battery
Closed, Declined · Public

Description

WARNING: Slot 0: OK: 1I:1:1, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9 - Controller: OK - Cache: Permanently Disabled - Battery/Capacitor: Failed (Replace Batteries)

Event Timeline

Restricted Application added a subscriber: Aklapper.

I will try to force a relearn process/reboot, in case that works.

Mentioned in SAL (#wikimedia-operations) [2019-07-12T10:24:22Z] <jynus> switchover x1 codfw master from db2045 to db2069 T227862

Everything went well except:

Updating tendril...
[WARNING] Old master not found on tendril server list
Updating zarcillo...
[WARNING] Old master not found on zarcillo master list
jcrespo triaged this task as Medium priority. Jul 12 2019, 10:27 AM

Change 522403 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Promote db2069 to be the new x1 codfw master

https://gerrit.wikimedia.org/r/522403

Change 522403 merged by Jcrespo:
[operations/puppet@production] mariadb: Promote db2069 to be the new x1 codfw master

https://gerrit.wikimedia.org/r/522403

Change 522409 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Promote db2069 to be the new x1 codfw master

https://gerrit.wikimedia.org/r/522409

Change 522409 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Promote db2069 to be the new x1 codfw master

https://gerrit.wikimedia.org/r/522409

Everything went well except:

Updating tendril...
[WARNING] Old master not found on tendril server list
Updating zarcillo...
[WARNING] Old master not found on zarcillo master list

And here is probably the issue:

| x1      | codfw | db2045     |
| x1      | eqiad | db2045     |
root@db1115.eqiad.wmnet[zarcillo]> update masters set instance='db2069' where section='x1' and dc='codfw';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

root@db1115.eqiad.wmnet[zarcillo]> update masters set instance='db1120' where section='x1' and dc='eqiad';      
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Tendril is expected to fail: it has no concept of a "primary master" versus a datacenter master, so the update won't work for --replicating-master runs.
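
The zarcillo rows above appear to show the same host (db2045) recorded as the x1 master in both codfw and eqiad, which is what the two UPDATE statements correct. A minimal sketch of a sanity check for that condition follows. It is an illustration only: `find_shared_masters` is a hypothetical helper, and the row layout simply mirrors the `section`, `dc` and `instance` columns used in the statements above; the real zarcillo schema may contain more columns.

```python
from collections import defaultdict


def find_shared_masters(rows):
    """Return hosts listed as master of one section in several datacenters.

    `rows` is an iterable of (section, dc, instance) tuples, an assumed
    layout based on the columns seen in the `masters` table above.
    The result maps (section, instance) to the sorted list of datacenters.
    """
    dcs_by_master = defaultdict(set)
    for section, dc, instance in rows:
        dcs_by_master[(section, instance)].add(dc)
    # Keep only hosts that show up under more than one datacenter.
    return {
        key: sorted(dcs)
        for key, dcs in dcs_by_master.items()
        if len(dcs) > 1
    }


# Rows as pasted in this ticket, before the manual fix.
before = [("x1", "codfw", "db2045"), ("x1", "eqiad", "db2045")]
# Rows after the two UPDATE statements.
after = [("x1", "codfw", "db2069"), ("x1", "eqiad", "db1120")]

print(find_shared_masters(before))  # {('x1', 'db2045'): ['codfw', 'eqiad']}
print(find_shared_masters(after))   # {}
```

A check along these lines could run before a switchover, so that a stale entry is spotted before the script reports "Old master not found".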

wiki_willy renamed this task from db2045 failed battery to (OoW) db2045 failed battery. Jul 15 2019, 7:36 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-15T20:02:10Z] <jynus> reducing consistency of db2045 to avoid lag at T227862

Change 523880 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Clarify db2045 status

https://gerrit.wikimedia.org/r/523880

Change 523880 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Clarify db2045 status

https://gerrit.wikimedia.org/r/523880

Mentioned in SAL (#wikimedia-operations) [2019-07-17T09:21:15Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool and clarify db2045 status T227862 (duration: 00m 55s)

There is no point in spending time on this old host; I will start its decommissioning process.

I am going to close this ticket, as I have created the decommissioning one: T228281: decommission db2045.codfw.wmnet