Event Timeline
Change #1071623 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] mariadb: wipe pc1017 pc2017
Change #1071623 merged by Arnaudb:
[operations/puppet@production] mariadb: wipe pc1017 pc2017
Change #1071750 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] mariadb: pc1017 pc2017 back to normal
Mentioned in SAL (#wikimedia-operations) [2024-09-10T13:23:44Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: T374355
Mentioned in SAL (#wikimedia-operations) [2024-09-10T13:23:58Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: T374355
Change #1071750 merged by Arnaudb:
[operations/puppet@production] mariadb: pc1017 pc2017 back to normal
@Ladsgroup both hosts are downtimed for the coming 7 days and the patch has been merged, they are all yours now! Maybe we could debrief after to put what you rediscovered in writing?
So for now, I can't connect to them via cumin and they are not showing up in orchestrator. I will check.
ladsgroup@pc1017:~$ sudo mysql
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [ERROR] Aborting
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.db' doesn't exist
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [Note] Server socket created on IP: '::'.
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [Note] Server socket created on IP: '0.0.0.0'.
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [Note] Server socket created on IP: '::'.
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [Note] Server socket created on IP: '0.0.0.0'.
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 1 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directo>
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [ERROR] Could not open mysql.plugin table: "Table 'mysql.plugin' doesn't exist". Some plugins may be not loaded
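Every error in the log above names a table in the mysql schema that does not exist, which is consistent with a data directory whose system tables were never installed. As a minimal sketch (the function name and the trimmed sample lines are illustrative, not part of any existing tooling), the missing table names can be pulled out of such output like this:

```python
import re

# Sample lines copied from the journal output above (syslog prefixes trimmed).
LOG_LINES = [
    "[ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.db' doesn't exist",
    "[Note] Server socket created on IP: '::'.",
    "[ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist",
    "[ERROR] Could not open mysql.plugin table: \"Table 'mysql.plugin' doesn't exist\". Some plugins may be not loaded",
]

# Matches the "Table 'schema.table' doesn't exist" fragment of a mysqld error.
MISSING_TABLE = re.compile(r"Table '([\w.]+)' doesn't exist")


def missing_tables(lines):
    """Return the distinct table names reported as missing, in first-seen order."""
    seen = []
    for line in lines:
        for name in MISSING_TABLE.findall(line):
            if name not in seen:
                seen.append(name)
    return seen


if __name__ == "__main__":
    print(missing_tables(LOG_LINES))
```

If every name returned lives in the mysql schema, re-running the system table installation is a more likely fix than repairing individual tables.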
Change #1071886 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/puppet@production] conftool-data: Remove pc5 for now
Change #1071886 merged by Ladsgroup:
[operations/puppet@production] conftool-data: Remove pc5 for now
These steps were missing: https://wikitech.wikimedia.org/wiki/MariaDB#Setting_up_a_fresh_server
but now it fails because another process is holding the lock:
root@pc1017:/srv/sqldata-cache# /opt/wmf-mariadb106/scripts/mysql_install_db --basedir /opt/wmf-mariadb106/
Installing MariaDB/MySQL system tables in '/srv/sqldata-cache' ...
2024-09-10 18:17:34 0 [ERROR] mariadbd: Can't lock aria control file '/srv/sqldata-cache/aria_log_control' for exclusive use, error: 11. Will retry for 30 seconds
I can't stop the mariadb service (it's not up either).
pc2017 was more successful; only the grants and the pt-heartbeat setup were missing. Done now.
Force-killed mariadb on pc1017. Things work now. Now setting up circular replication.
Set up circular replication. Now I need to add the orchestrator grants and add the hosts there. The Zarcillo entries are also still left.
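For the write-up, circular replication between the two hosts boils down to each one replicating from the other. The sketch below only generates the statements and runs nothing; it is not the exact procedure used here. The replication user name, the password placeholder, and the choice of `MASTER_USE_GTID=slave_pos` are assumptions, and the helper names are hypothetical.

```python
def change_master_sql(master_host, port=3306, user="repl_user"):
    """Build a MariaDB CHANGE MASTER TO statement pointing at master_host.

    The user name and the password placeholder are hypothetical; the real
    credentials are not part of this task.
    """
    return (
        f"CHANGE MASTER TO MASTER_HOST='{master_host}', MASTER_PORT={port}, "
        f"MASTER_USER='{user}', MASTER_PASSWORD='<redacted>', "
        "MASTER_USE_GTID=slave_pos;"
    )


def circular_replication_plan(host_a, host_b):
    """Return, per host, the statements that make it replicate from the other."""
    return {
        host_a: [change_master_sql(host_b), "START SLAVE;"],
        host_b: [change_master_sql(host_a), "START SLAVE;"],
    }


if __name__ == "__main__":
    plan = circular_replication_plan("pc1017.eqiad.wmnet", "pc2017.codfw.wmnet")
    for host, statements in plan.items():
        print(f"-- run on {host}")
        for statement in statements:
            print(statement)
```

Generating both directions from one function keeps the two statements symmetric, so a host cannot accidentally be pointed at itself.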
Zarcillo step was missed as well: https://wikitech.wikimedia.org/wiki/MariaDB/Provisioning_a_host#Add_it_to_zarcillo_DB
Done this:
INSERT INTO instances (name, server, port, `group`) VALUES ('pc1017','pc1017.eqiad.wmnet',3306, 'parsercache');
INSERT INTO instances (name, server, port, `group`) VALUES ('pc2017','pc2017.codfw.wmnet',3306, 'parsercache');
INSERT INTO section_instances (instance, section) VALUES ('pc1017','pc5');
INSERT INTO section_instances (instance, section) VALUES ('pc2017','pc5');
INSERT INTO servers (fqdn, hostname, dc, rack) VALUES ('pc1017.eqiad.wmnet', 'pc1017', 'eqiad', 'd1');
INSERT INTO servers (fqdn, hostname, dc, rack) VALUES ('pc2017.eqiad.wmnet', 'pc2017', 'codfw', 'b7');
Change #1071955 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/puppet@production] pc5: Enable notification
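Hand-written Zarcillo rows like the `servers` inserts earlier in this task are easy to get slightly wrong, so a quick consistency check before running them may help. This is a minimal sketch that assumes the `hostname.dc.wmnet` naming pattern seen in the hostnames of this task; the helper names are hypothetical, and the rows are copied from the INSERT statements as posted.

```python
# (fqdn, hostname, dc, rack) tuples copied from the servers INSERTs as posted.
SERVERS = [
    ("pc1017.eqiad.wmnet", "pc1017", "eqiad", "d1"),
    ("pc2017.eqiad.wmnet", "pc2017", "codfw", "b7"),
]


def fqdn_matches_dc(fqdn, dc):
    """True when the second DNS label of the FQDN equals the dc value."""
    labels = fqdn.split(".")
    return len(labels) >= 2 and labels[1] == dc


def inconsistent_rows(rows):
    """Return rows whose FQDN disagrees with the hostname or dc column."""
    bad = []
    for fqdn, hostname, dc, rack in rows:
        if not fqdn.startswith(hostname + ".") or not fqdn_matches_dc(fqdn, dc):
            bad.append((fqdn, hostname, dc, rack))
    return bad


if __name__ == "__main__":
    for row in inconsistent_rows(SERVERS):
        print("check this row:", row)
```

With the rows as posted, the check flags the pc2017 entry, whose FQDN says eqiad while its dc column says codfw.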
Change #1071955 merged by Ladsgroup:
[operations/puppet@production] pc5: Enable notification
Change #1075052 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/puppet@production] pc2017: Set it to master
Change #1075052 merged by Ladsgroup:
[operations/puppet@production] pc2017: Set it to master
