Productionize pc(1|2)017
Closed, Resolved (Public)

Description

Along the lines of T368874 and T373579:

  • pc1017
  • pc2017

Event Timeline

ABran-WMF changed the task status from Open to In Progress. Sep 9 2024, 12:46 PM
ABran-WMF triaged this task as High priority.
ABran-WMF moved this task from Triage to Ready on the DBA board.

Change #1071623 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: wipe pc1017 pc2017

https://gerrit.wikimedia.org/r/1071623

Change #1071623 merged by Arnaudb:

[operations/puppet@production] mariadb: wipe pc1017 pc2017

https://gerrit.wikimedia.org/r/1071623

Change #1071750 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: pc1017 pc2017 back to normal

https://gerrit.wikimedia.org/r/1071750

Mentioned in SAL (#wikimedia-operations) [2024-09-10T13:23:44Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: T374355

Mentioned in SAL (#wikimedia-operations) [2024-09-10T13:23:58Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: T374355

Change #1071750 merged by Arnaudb:

[operations/puppet@production] mariadb: pc1017 pc2017 back to normal

https://gerrit.wikimedia.org/r/1071750

ABran-WMF moved this task from Ready to Done on the DBA board.

@Ladsgroup both hosts are downtimed for the next 7 days and the patch has been merged, so they are all yours now! Maybe we could debrief afterwards to put what you rediscovered in writing?

For now, I can't connect to them from cumin and they are not showing up in orchestrator. I will check.

ladsgroup@pc1017:~$ sudo mysql
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [ERROR] Aborting
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.db' doesn't exist
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [Note] Server socket created on IP: '::'.
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [Note] Server socket created on IP: '0.0.0.0'.
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [Note] Server socket created on IP: '::'.
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [Note] Server socket created on IP: '0.0.0.0'.
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 1 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directo>
Sep 10 14:18:31 pc1017 mysqld[96501]: 2024-09-10 14:18:31 0 [ERROR] Could not open mysql.plugin table: "Table 'mysql.plugin' doesn't exist". Some plugins may be not loaded
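Several `mysql.*` system tables are reported missing in the log above, which usually points to a data directory that was never initialized. As a rough sketch (the helper name `missing_system_tables` is made up for illustration, and the sample lines are copied from the paste above), the affected tables can be pulled out of such a log like this:

```python
import re

# Matches MariaDB error text of the form: Table 'mysql.db' doesn't exist
MISSING_TABLE = re.compile(r"Table '(mysql\.[A-Za-z0-9_]+)' doesn't exist")


def missing_system_tables(log_lines):
    """Return the sorted, de-duplicated mysql.* tables reported as missing."""
    found = set()
    for line in log_lines:
        found.update(MISSING_TABLE.findall(line))
    return sorted(found)


sample = [
    "2024-09-10 14:18:31 0 [ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.db' doesn't exist",
    "2024-09-10 14:18:31 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist",
    "2024-09-10 14:18:31 0 [ERROR] Could not open mysql.plugin table: \"Table 'mysql.plugin' doesn't exist\". Some plugins may be not loaded",
]

print(missing_system_tables(sample))
# ['mysql.db', 'mysql.plugin', 'mysql.servers']
```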

Change #1071886 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] conftool-data: Remove pc5 for now

https://gerrit.wikimedia.org/r/1071886

Change #1071886 merged by Ladsgroup:

[operations/puppet@production] conftool-data: Remove pc5 for now

https://gerrit.wikimedia.org/r/1071886

These steps were missing: https://wikitech.wikimedia.org/wiki/MariaDB#Setting_up_a_fresh_server

Now, however, it fails because another process is holding the lock:

root@pc1017:/srv/sqldata-cache# /opt/wmf-mariadb106/scripts/mysql_install_db --basedir /opt/wmf-mariadb106/
Installing MariaDB/MySQL system tables in '/srv/sqldata-cache' ...
2024-09-10 18:17:34 0 [ERROR] mariadbd: Can't lock aria control file '/srv/sqldata-cache/aria_log_control' for exclusive use, error: 11. Will retry for 30 seconds

I can't stop the mariadb service (it's not up either).
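On Linux, error 11 is EAGAIN, so the lock error above means some other process still holds `aria_log_control`, even though the service is reported as down. A minimal sketch for finding such a process, assuming a Linux `/proc` filesystem and sufficient privileges (the function name `pids_holding` is hypothetical, and a process that merely has the file open is only the likely lock holder, not a proven one):

```python
import os


def pids_holding(path):
    """Return PIDs that have `path` open, by scanning /proc/<pid>/fd (Linux only)."""
    target = os.path.realpath(path)
    holders = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = f"/proc/{entry}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            # The process exited, or we lack permission to inspect it.
            continue
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(entry))
                    break
            except OSError:
                continue  # descriptor was closed while we were scanning
    return sorted(holders)


# Example (run as root on the affected host):
# pids_holding("/srv/sqldata-cache/aria_log_control")
```

Without root, only the caller's own processes can be inspected, so an empty result does not prove the file is free.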

pc2017 went better: only the grants and the pt-heartbeat setup were missing. Both are done now.

Force-killed mariadb on pc1017. Things work now. Next, setting up circular replication.

Circular replication is set up. Now I need to add the orchestrator grants and add the hosts there. The Zarcillo entries are also still to do.

Done, using these statements:

INSERT INTO instances (name, server, port, `group`) VALUES ('pc1017','pc1017.eqiad.wmnet',3306, 'parsercache');

INSERT INTO instances (name, server, port, `group`) VALUES ('pc2017','pc2017.codfw.wmnet',3306, 'parsercache');

INSERT INTO section_instances (instance, section) VALUES ('pc1017','pc5');
INSERT INTO section_instances (instance, section) VALUES ('pc2017','pc5');

INSERT INTO servers (fqdn, hostname, dc, rack) VALUES ('pc1017.eqiad.wmnet', 'pc1017', 'eqiad', 'd1');
INSERT INTO servers (fqdn, hostname, dc, rack) VALUES ('pc2017.codfw.wmnet', 'pc2017', 'codfw', 'b7');
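Hand-written inserts like the ones above make it easy for a hostname, FQDN, and datacenter to drift apart. The sketch below generates the same three rows per host and derives the FQDN from the hostname and datacenter. The `Host` class and `zarcillo_rows` helper are hypothetical, the `%s` placeholders assume a DB-API driver with the `format` paramstyle (such as PyMySQL), and the table and column names are taken only from the statements above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    hostname: str
    dc: str
    rack: str
    section: str
    group: str
    port: int = 3306

    @property
    def fqdn(self):
        # Derived from hostname and datacenter so the two cannot disagree.
        return f"{self.hostname}.{self.dc}.wmnet"


def zarcillo_rows(host):
    """Return (sql, params) pairs mirroring the three INSERTs above."""
    return [
        ("INSERT INTO instances (name, server, port, `group`) VALUES (%s, %s, %s, %s)",
         (host.hostname, host.fqdn, host.port, host.group)),
        ("INSERT INTO section_instances (instance, section) VALUES (%s, %s)",
         (host.hostname, host.section)),
        ("INSERT INTO servers (fqdn, hostname, dc, rack) VALUES (%s, %s, %s, %s)",
         (host.fqdn, host.hostname, host.dc, host.rack)),
    ]


hosts = [
    Host("pc1017", "eqiad", "d1", "pc5", "parsercache"),
    Host("pc2017", "codfw", "b7", "pc5", "parsercache"),
]

for h in hosts:
    for sql, params in zarcillo_rows(h):
        # With a real connection this would be cursor.execute(sql, params).
        print(sql, params)
```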

Change #1071955 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] pc5: Enable notification

https://gerrit.wikimedia.org/r/1071955

Change #1071955 merged by Ladsgroup:

[operations/puppet@production] pc5: Enable notification

https://gerrit.wikimedia.org/r/1071955

Change #1075052 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] pc2017: Set it to master

https://gerrit.wikimedia.org/r/1075052

Change #1075052 merged by Ladsgroup:

[operations/puppet@production] pc2017: Set it to master

https://gerrit.wikimedia.org/r/1075052