Move db1176 to m1
Closed, Resolved (Public)

Description

m1 needs a switchover.
We need to reclone db1176 so that it becomes an m1 replica, and reinstall MariaDB 10.4 on it (it is currently running MariaDB 11).

Related Objects

Status  Subtype  Assigned  Task
Open Marostegui
Resolved Marostegui

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

Change 883133 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Install MariaDB 11 on db1106

https://gerrit.wikimedia.org/r/883133

Change 883133 merged by Marostegui:

[operations/puppet@production] mariadb: Install MariaDB 11 on db1106

https://gerrit.wikimedia.org/r/883133

Change 883136 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db1176 to m1

https://gerrit.wikimedia.org/r/883136

Change 883136 merged by Marostegui:

[operations/puppet@production] mariadb: Move db1176 to m1

https://gerrit.wikimedia.org/r/883136

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1176.eqiad.wmnet with OS bullseye

*************************** 1. row ***************************
Slave_IO_State:
Master_Host: db1195.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1195-bin.001769
Read_Master_Log_Pos: 159907434
Relay_Log_File: db1117-relay-bin.000002
Relay_Log_Pos: 43866087
Relay_Master_Log_File: db1195-bin.001769
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 159907434
Relay_Log_Space: 43866397
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 172001292
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-171966484-2731336216,171966484-171966484-7582228474,171974733-171974733-2008457625,171966562-171966562-962004828,171970746-171970746-808478946,171966512-171966512-1959889139,171974884-171974884-9104192396,171966556-171966556-1824632116,171978763-171978763-83528410,172001292-172001292-1589344544
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 63429
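
For anyone reading the status dump above, here is a minimal Python sketch of how the key fields could be checked before a reclone. It is not part of any WMF tooling. The function names (`parse_slave_status`, `parse_gtid_pos`, `summarize`) and the shortened `SAMPLE` excerpt are hypothetical and only for illustration. It assumes the usual `Field: value` layout of `SHOW SLAVE STATUS\G` and the MariaDB GTID format `domain_id-server_id-sequence`.

```python
def parse_slave_status(text):
    """Parse `SHOW SLAVE STATUS\\G`-style output into a dict of field -> value."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "*** 1. row ***" separator.
        if not line or line.startswith("*"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def parse_gtid_pos(gtid_pos):
    """Split a MariaDB GTID position into (domain_id, server_id, sequence) tuples."""
    return [
        tuple(int(part) for part in gtid.split("-"))
        for gtid in gtid_pos.split(",")
        if gtid
    ]


def summarize(status):
    """Report whether both threads run and whether all fetched events were applied."""
    read_pos = status.get("Read_Master_Log_Pos")
    exec_pos = status.get("Exec_Master_Log_Pos")
    return {
        "threads_running": status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes",
        "fetched_events_applied": read_pos is not None and read_pos == exec_pos,
        "gtid_domains": len(parse_gtid_pos(status.get("Gtid_IO_Pos", ""))),
    }


# Shortened excerpt of the output above (hypothetical sample, two GTID domains only).
SAMPLE = """
*************************** 1. row ***************************
Master_Host: db1195.eqiad.wmnet
Read_Master_Log_Pos: 159907434
Slave_IO_Running: No
Slave_SQL_Running: No
Exec_Master_Log_Pos: 159907434
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-171966484-2731336216,172001292-172001292-1589344544
"""

if __name__ == "__main__":
    print(summarize(parse_slave_status(SAMPLE)))
```

In the dump above, both replication threads are stopped and `Exec_Master_Log_Pos` equals `Read_Master_Log_Pos`, so everything the replica had fetched from db1195 had also been applied when replication was stopped.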

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1176.eqiad.wmnet with OS bullseye completed:

  • db1176 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301241054_marostegui_57917_db1176.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB