We lack a testing environment for heavy changes. We need to be able to quickly test T371351 in conditions close to reality, so for that first part we'll leverage T368919 and T368920, which are still insetup. This task is also to track the next phases to come to have a proper test environment.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| mariadb: testing https://w.wiki/Ayvd | operations/cookbooks | master | +1 K -0 |
| mariadb: temporary testing environment | operations/puppet | production | +12 -2 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | jcrespo | T376916 Upgrade backup hosts to Debian Bookworm 12.X |
| Resolved | jcrespo | T383902 Upgrade backup source or mediabackup database host os to Debian bookworm or decommission them |
| | | Unknown Object (Task) |
| Resolved | Marostegui | T373579 Productionize db22[21-40] |
| Resolved | ABran-WMF | T372893 Testing environment for mysql cookbooks |
Event Timeline
Change #1064033 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] mariadb: temporary testing environment
Change #1064033 abandoned by Arnaudb:
[operations/puppet@production] mariadb: temporary testing environment
Reason:
https://phabricator.wikimedia.org/T372893#10080907
Change #1064778 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/cookbooks@master] mariadb: testing https://w.wiki/Ayvd
Given the priority, I've worked around the bumps on the way to a crude testing platform: I've temporarily repurposed 2 insetup machines for this.
cookbooks.sre.switchdc.databases for the (test) switch from test-s1 to test-s1 started by arnaudb@cumin1002
cookbooks.sre.switchdc.databases for the (test) switch from test-s1 to test-s1 started by arnaudb@cumin1002 executed with errors:
cookbooks.sre.switchdc.databases for the (test) switch from test-s1 to test-s1 started by testing
cookbooks.sre.switchdc.databases for the (test) switch from test-s1 to test-s1 started by testing executed with errors:
I've been able to assert the sre.switchdc.databases.prepare logic through and through; it looks sound.
On my testing platform, I've had trouble getting the SQL commands to run on the hosts selected by get_dbs() instead of get_core_dbs(). I've tried to work around the implementation with no success. PS22 is my latest attempt.
I'll move on to sre.switchdc.databases.finalize for now and get back to debugging my issue after that. If I don't manage to debug on my own I'll ping @Volans when he gets back.
cookbooks.sre.switchdc.databases for the switch from test-s1 to test-s1 started by arnaudb@cumin1002 completed:
- test-s1 (PASS)
- MASTER_TO pc2017.codfw.wmnet has no replication set, skipping.
cookbooks.sre.switchdc.databases for the switch from test-s1 to test-s1 started by arnaudb@cumin1002 completed:
- test-s1 (PASS)
cookbooks.sre.switchdc.databases for the switch from test-s4 to test-s4 started by arnaudb@cumin1002 completed:
- test-s4 (PASS)
thanks @Marostegui for the help with the platform!
I've been able to test prepare.py up to the proper topology:
I've spotted a slight issue in class MasterUseGTID(Enum): [...] NO = "no"
I've skipped that part:
MASTER_TO db2230.codfw.wmnet MASTER_USE_GTID=no.
MASTER_TO db2230.codfw.wmnet START SLAVE.
[%s] %s checking SLAVE STATUS %s=%s test-s4 MASTER_TO db2230.codfw.wmnet Slave_IO_Running Yes
[%s] %s checking SLAVE STATUS %s=%s test-s4 MASTER_TO db2230.codfw.wmnet Slave_SQL_Running Yes
[%s] %s checking SLAVE STATUS %s=%s test-s4 MASTER_TO db2230.codfw.wmnet Last_IO_Errno 0
[%s] %s checking SLAVE STATUS %s=%s test-s4 MASTER_TO db2230.codfw.wmnet Last_SQL_Errno 0
[%s] %s checking SLAVE STATUS %s=%s test-s4 MASTER_TO db2230.codfw.wmnet Using_Gtid No
**MASTER_TO db2230.codfw.wmnet wrong SLAVE STATUS Using_Gtid=No, expected no instead**
**Failed to verify disabled GTID on db2230.codfw.wmnet**
Failed to run cookbooks.sre.switchdc.databases.prepare.PrepareSection.disable_gtid: Failed to verify disabled GTID on db2230.codfw.wmnet
==> What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.
> skip
User input is: "skip"
MASTER_TO db2230.codfw.wmnet STOP SLAVE.
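The failed check above reads like a plain case mismatch: SLAVE STATUS reports `Using_Gtid=No` while the expected value is `no`. A minimal sketch of a case-insensitive comparison follows. This is not the cookbook's actual code: the `SLAVE_POS` member is inferred from the `MASTER_USE_GTID=slave_pos` log lines in this task, and `gtid_matches` is a hypothetical helper name.

```python
from enum import Enum


class MasterUseGTID(Enum):
    """Values passed to MASTER_USE_GTID (subset; SLAVE_POS is inferred from the logs)."""

    SLAVE_POS = "slave_pos"
    NO = "no"


def gtid_matches(reported: str, expected: MasterUseGTID) -> bool:
    """Hypothetical helper: compare the Using_Gtid value from SLAVE STATUS
    with the expected enum member, ignoring case and surrounding whitespace."""
    return reported.strip().lower() == expected.value.lower()
```

With this kind of comparison, `No` would match `MasterUseGTID.NO` and `Slave_Pos` would match `MasterUseGTID.SLAVE_POS`, so the verification step would not need to be skipped for a casing difference alone.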
I've stumbled upon that:
[%s] %s checking SLAVE STATUS %s=%s test-s4 MASTER_FROM db1125.eqiad.wmnet Slave_IO_Running Preparing
**MASTER_FROM db1125.eqiad.wmnet wrong SLAVE STATUS Slave_IO_Running=Preparing, expected Yes instead**
Failed to run cookbooks.sre.switchdc.databases.prepare.PrepareSection.enable_cross_replication: MASTER_FROM db1125.eqiad.wmnet wrong SLAVE STATUS Slave_IO_Running=Preparing, expected Yes instead
==> What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.
> skip
User input is: "skip"
with no trouble on the resulting topology. I'll head on to finalize to see the resulting topology and I'll schedule a meeting with Manuel so we can assert the detailed conditions of each step together. [Edit] Current PS is 22
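The `Slave_IO_Running=Preparing` failure above may simply be the check running too soon after START SLAVE, assuming `Preparing` is a transient state (skipping it left the topology fine). One way to tolerate that is to poll the status a few times before failing. The sketch below is illustrative only: `wait_for_slave_status` and its parameters are hypothetical names, and `get_status` stands in for whatever returns the SLAVE STATUS fields as a mapping.

```python
import time
from typing import Callable, Mapping


def wait_for_slave_status(
    get_status: Callable[[], Mapping[str, str]],
    key: str,
    expected: str,
    attempts: int = 5,
    delay: float = 1.0,
) -> bool:
    """Poll until status[key] equals expected (case-insensitive).

    Returns True as soon as the value matches, False once all attempts are used.
    """
    for attempt in range(attempts):
        value = get_status().get(key, "")
        if value.lower() == expected.lower():
            return True
        # Sleep only between attempts, not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Usage with a fake status source that needs two polls to settle.
states = iter(["Preparing", "Preparing", "Yes"])
settled = wait_for_slave_status(
    lambda: {"Slave_IO_Running": next(states)},
    "Slave_IO_Running",
    "Yes",
    delay=0,
)
```

A bounded number of attempts keeps the interactive retry/skip/abort prompt as the fallback when replication genuinely fails to start.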
cookbooks.sre.switchdc.databases for the switch from test-s4 to test-s4 started by arnaudb@cumin1002 completed:
- test-s4 (FAIL)
- MASTER_FROM db1125.eqiad.wmnet should be read only
- MASTER_TO db2230.codfw.wmnet STOP SLAVE.
- MASTER_TO db2230.codfw.wmnet RESET SLAVE ALL.
- MASTER_TO db2230.codfw.wmnet has no replication set.
- MASTER_TO db1125.eqiad.wmnet heartbeat server IDs to delete are: []
- MASTER_FROM db1125.eqiad.wmnet STOP SLAVE.
- MASTER_FROM db1125.eqiad.wmnet MASTER_USE_GTID=slave_pos.
- MASTER_FROM db1125.eqiad.wmnet START SLAVE.
- Failed to enable GTID on db1125.eqiad.wmnet, current value: Slave_Pos
This is a good summary of the resulting output of the last run (at the same PS):
The resulting topology looks good from Orchestrator's point of view, but I noticed that pt-heartbeat was askew after the switch. I'll try to fix it before trying the opposite maneuver.
Running the cleanup tasks like T373173 fixes the GTID situation. Reverting the topology in the same way with PS43.
cookbooks.sre.switchdc.databases for the switch from test-s4 to test-s4 started by arnaudb@cumin1002 completed:
- test-s4 (PASS)
cookbooks.sre.switchdc.databases for the switch from test-s4 to test-s4 started by arnaudb@cumin1002 completed:
- test-s4 (FAIL)
- Validated replication topology for section test-s4 between MASTER_TO db1125.eqiad.wmnet and MASTER_FROM db2230.codfw.wmnet
- MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
- MASTER_TO db1125.eqiad.wmnet RESET SLAVE ALL.
- MASTER_TO db1125.eqiad.wmnet has no replication set.
- MASTER_TO db2230.codfw.wmnet heartbeat server IDs to delete are: []
- MASTER_FROM db2230.codfw.wmnet STOP SLAVE.
- MASTER_FROM db2230.codfw.wmnet MASTER_USE_GTID=slave_pos.
- MASTER_FROM db2230.codfw.wmnet START SLAVE.
- Failed to enable GTID on db2230.codfw.wmnet, current value: Slave_Pos
This should be kept open until we have wiped db2230, otherwise it will be forgotten.
Also, this task was for a more generic testing environment; the one I temporarily set up was just for the DC switch task and it was done in a rush.
The task's original description also says: "This task is also to track the next phases to come to have a proper test environment."
Good call, I've taken note of the wipe on T373579 so we don't forget it when T371351 is done.
The task's original description also says: "This task is also to track the next phases to come to have a proper test environment."
The original description was a bit mistaken as:
I'd prefer we focus on T356053
Change #1064778 abandoned by Arnaudb:
[operations/cookbooks@master] mariadb: testing https://w.wiki/Ayvd
Reason:
obsolete



