Restore db2118
Closed, ResolvedPublic

Description

Running a data-consistency check on db2118 after upgrading to buster failed, badly: T288244#7311265.

Attempting to restore from a dump also ended badly, as the dump procedure assumes that you are not trying to restore the node that was primary at the time the dump was taken.

Procedure for attempt 2:

  • Drop all wiki databases from db2118 (and centralauth)
  • Re-restore the same dump as before: time recover-dump --host db2118.codfw.wmnet --user root --password REDACTED --port 3306 /srv/backups/dumps/ongoing/kormat-dump.s7.2021-08-24--00-00-02
  • Stop mariadb on db2118, and make a copy of /srv/sqldata, then restart it again. This means we won't need to re-do the dump-restore step if a later step goes wrong.
  • Run reset slave all on db2118
  • Configure db2118 to replicate from db2121, using db2121-bin.002436 and 130479946
  • Start replication, and wait at least 30mins.
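The replication-setup steps above can be sketched as follows. This is a minimal illustrative sketch, not the actual tooling used; the helper function is hypothetical, and only the host, binlog file, and offset are taken from the task:

```python
# Hypothetical sketch of the SQL statements behind steps 4-6 above.
# The helper is illustrative; host/file/position come from the task.

def replication_setup_sql(master_host, log_file, log_pos):
    """Return statements to point a replica at a given binlog position."""
    return [
        "RESET SLAVE ALL;",
        (
            "CHANGE MASTER TO "
            f"MASTER_HOST='{master_host}', "
            f"MASTER_LOG_FILE='{log_file}', "
            f"MASTER_LOG_POS={log_pos};"
        ),
        "START SLAVE;",
    ]

for stmt in replication_setup_sql("db2121.codfw.wmnet",
                                  "db2121-bin.002436", 130479946):
    print(stmt)
```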

Event Timeline

Kormat triaged this task as High priority.Aug 27 2021, 1:03 PM
Kormat updated the task description.
Kormat moved this task from Triage to In progress on the DBA board.

Dump recovery started from dbprov2001.

Stop mariadb

Probably already known, but make sure not to auto-start mariadb replication on server start, so nothing is written accidentally: systemctl set-environment MYSQLD_OPTS="--skip-slave-start"

And one last thing once everything goes well: check for potentially missing schema changes from the last 6 days.

Failed after ~27 minutes of replication with:

Error 'Duplicate entry '951915-edit' for key 'pr_pagetype'' on query. Default database: 'eswiki'. Query: 'INSERT /* WikiPage::doUpdateRestrictions  */ INTO `page_restrictions` (pr_page,pr_type,pr_level,pr_cascade,pr_expiry) VALUES (951915,'edit','autoconfirmed',0,'20210826151244')'
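The failure is a unique-key violation: judging from the error message, page_restrictions has a unique index (pr_pagetype) over (pr_page, pr_type), so the replicated insert for (951915, 'edit') is rejected while the expired row is still present. A toy model of that constraint (the class and its layout are illustrative assumptions, not the real schema):

```python
# Toy model of the unique index pr_pagetype over (pr_page, pr_type):
# a second insert with the same key fails, mirroring the replication
# error seen on db2118. Purely illustrative.

class DuplicateEntry(Exception):
    pass

class PageRestrictions:
    def __init__(self):
        self.rows = {}  # (pr_page, pr_type) -> (pr_level, pr_cascade, pr_expiry)

    def insert(self, pr_page, pr_type, pr_level, pr_cascade, pr_expiry):
        key = (pr_page, pr_type)
        if key in self.rows:
            raise DuplicateEntry(
                f"Duplicate entry '{pr_page}-{pr_type}' for key 'pr_pagetype'")
        self.rows[key] = (pr_level, pr_cascade, pr_expiry)

t = PageRestrictions()
t.insert(951915, "edit", "autoconfirmed", 0, "20210824125807")  # row from the dump
try:
    t.insert(951915, "edit", "autoconfirmed", 0, "20210826151244")  # replicated insert
except DuplicateEntry as e:
    print(e)
```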

And one last thing once everything goes well: check for potentially missing schema changes from the last 6 days.

Good point - none from my side. I believe all the pending ones were completed a few weeks ago.

So from what I can see, the page was set to 'autoconfirmed' correctly and then it was updated to none later (once it expired).
This happened at (still before the switchover): db2121-bin.002434.876795369

INSERT /* WikiPage::doUpdateRestrictions  */ INTO `page_restrictions` (pr_page,pr_type,pr_level,pr_cascade,pr_expiry) VALUES (951915,'edit','autoconfirmed',0,'20210824125807')
/*!*/;
# at 876795337
#210823 12:58:07 server id 180367395  end_log_pos 876795369 CRC32 0x0fada0f9    Intvar

Once that expired, then there's another insert to protect that same page that happens on: db2121-bin.002440.371692119 (-rw-rw---- 1 mysql mysql 1001M Aug 25 20:41 db2121-bin.002440) which is already AFTER the switchover and the one that fails:

INSERT /* WikiPage::doUpdateRestrictions  */ INTO `page_restrictions` (pr_page,pr_type,pr_level,pr_cascade,pr_expiry) VALUES (951915,'edit','autoconfirmed',0,'20210826151244')
/*!*/;
# at 371692087
#210825 15:12:44 server id 180355206  end_log_pos 371692119 CRC32 0x002b59e2    Intvar

That server_id is already the new master:

mysql:root@localhost [(none)]> show global variables like 'server_id';
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 180355206 |
+---------------+-----------+
1 row in set (0.001 sec)

I haven't been able to find anything being written in between those events in db2121 binlogs (neither writes nor deletions), so I am not sure why the row was present there but not on other hosts.
db2118 starts replicating from db2121:

Aug 30 09:15:07 db2118 mysqld[897]: 2021-08-30  9:15:07 68 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db2121.codfw.wmnet:3306' in log 'db2121-bin.002436' at position 130479946

So that means that whatever was going on with that row was already coming through the backup source, as the writes happen on:
db2121-bin.002434 (already past that binlog)

And then the next one touching pr_page,pr_type (951915,edit) is db2121-bin.002440, which is 4 binlogs after replication was started, and is what made it break.

I am not sure how this could've failed on this host but not on all of them, if there are no more events in the binlogs.

This is what is on the dump, in case it helps (taken consistently at 2021-08-24 00:00:02):

(951915,"edit","autoconfirmed",0,NULL,"20210824125807",143971),
(951915,"move","sysop",0,NULL,"infinity",143972),

Yeah, that matches the insert at db2121-bin.002434.876795369, which is what lives in the table at the moment (and the reason why replication broke). But how can that row be on the backup source without the backup source breaking?

root@db2118.codfw.wmnet[eswiki]> select * from page_restrictions where pr_page=951915;
+---------+---------+---------------+------------+---------+----------------+--------+
| pr_page | pr_type | pr_level      | pr_cascade | pr_user | pr_expiry      | pr_id  |
+---------+---------+---------------+------------+---------+----------------+--------+
|  951915 | edit    | autoconfirmed |          0 |    NULL | 20210824125807 | 143971 |
|  951915 | move    | sysop         |          0 |    NULL | infinity       | 143972 |
+---------+---------+---------------+------------+---------+----------------+--------+
2 rows in set (0.034 sec)

I am still digging into binlogs, but so far I haven't seen any other transaction touching that pr_page,pr_type in between that could explain that.

It is quite difficult to scan the STATEMENT-based binlog for changes (updates, deletes, etc.). Would it be easier to check the changes on a ROW-based replica?

Yep, doing that as we speak. I first want to scan only the involved hosts, to rule out binlog corruption.

Procedure we followed.

Look at the dump metadata file

Started dump at: 2021-08-24 00:00:02
SHOW SLAVE STATUS:
        Connection name:
        Host: db2118.codfw.wmnet
        Log: db2118-bin.002581
        Pos: 110134450
        GTID:0-180359185-3380903947,171970590-171970590-196269578,171970664-171970664-2362613365,171978767-171978767-4484858466,171978861-171978861-433436917,180355111-180355111-139571514,180359185-180359185-72003839,180363275-180363275-7694462,180363371-180363371-19393161,180367395-180367395-554641873

Finished dump at: 2021-08-24 03:00:56

Important pieces of data:

  • Dump started at 2021-08-24 00:00:02
  • The GTID for db2118 at the time (server_id 180367395): 180367395-180367395-554641873
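MariaDB GTIDs have the form domain-server_id-seq_no, and the dump metadata stores a comma-separated state with one entry per replication domain. Picking out the entry for db2118's server_id can be sketched as follows (the GTID string is copied from the metadata above; the helper itself is hypothetical):

```python
# Parse the dump metadata's GTID state (one domain-server_id-seq_no
# entry per domain) and pick out the position for a given domain id.
# Illustrative helper, not part of the actual procedure.

gtid_state = (
    "0-180359185-3380903947,171970590-171970590-196269578,"
    "171970664-171970664-2362613365,171978767-171978767-4484858466,"
    "171978861-171978861-433436917,180355111-180355111-139571514,"
    "180359185-180359185-72003839,180363275-180363275-7694462,"
    "180363371-180363371-19393161,180367395-180367395-554641873"
)

def gtid_for_domain(state, domain):
    """Return the (domain, server_id, seq_no) entry for one domain id."""
    for entry in state.split(","):
        d, server_id, seq_no = (int(x) for x in entry.split("-"))
        if d == domain:
            return d, server_id, seq_no
    return None

print(gtid_for_domain(gtid_state, 180367395))  # → (180367395, 180367395, 554641873)
```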

Find the transaction with that GTID
Look for the current binlog on db2121 at the point that the dump started:

-rw-rw---- 1 mysql mysql 1048576570 Aug 23 22:45 db2121-bin.002435
-rw-rw---- 1 mysql mysql 1048577311 Aug 24 08:27 db2121-bin.002436 # <====

Search that for the GTID:

$ sudo mysqlbinlog -vv --base64-output=DECODE-ROWS /srv/sqldata/db2121-bin.002436 > ~/db2121-bin.002436.txt
$ grep -B10 -A10 180367395-180367395-554641873 ~/db2121-bin.002436.txt
# at 130479070
#210824  0:00:02 server id 180367395  end_log_pos 130479260 CRC32 0xb20610c1    Query   thread_id=448143616     exec_time=0     error_code=0
use `arwiki`/*!*/;
SET TIMESTAMP=1629763202/*!*/;
UPDATE /* LinksUpdate::updateLinksTimestamp  */  `page` SET page_links_updated = '20210823235959' WHERE page_id = 2781316
/*!*/;
# at 130479260
#210824  0:00:02 server id 180367395  end_log_pos 130479291 CRC32 0x38311a84    Xid = 5068317920
COMMIT/*!*/;
# at 130479291
#210824  0:00:02 server id 180367395  end_log_pos 130479333 CRC32 0xf5ed6c21    GTID 180367395-180367395-554641873 trans
/*!100001 SET @@session.gtid_seq_no=554641873*//*!*/;
START TRANSACTION
/*!*/;
# at 130479333
#210824  0:00:02 server id 180367395  end_log_pos 130479573 CRC32 0x070d2e8f    Query   thread_id=448143616     exec_time=0     error_code=0
SET TIMESTAMP=1629763202/*!*/;
REPLACE /* FlaggableWikiPage::updateStableVersion  */ INTO `flaggedpages` (fp_page_id,fp_stable,fp_reviewed,fp_quality,fp_pending_since) VALUES (2781316,54528012,1,0,NULL)
/*!*/;
# at 130479573
#210824  0:00:02 server id 180367395  end_log_pos 130479747 CRC32 0xb3849f90    Query   thread_id=448143616     exec_time=0     error_code=0

(Side note: I'm really not sure I follow the logic here. Shouldn't we just take 130479291, as it's the offset of the next transaction?)
Then look for the start of the next transaction :

# at 130479946
#210824  0:00:02 server id 180367395  end_log_pos 130479988 CRC32 0x5e13c573    GTID 180367395-180367395-554641874 trans
/*!100001 SET @@session.gtid_seq_no=554641874*//*!*/;
START TRANSACTION

Ok, 130479946 is our binlog offset.

Restore the dump, and set replication to continue from db2121, db2121-bin.002436 offset 130479946

From a freshly restored dump state, I can confirm that the GTID referenced in the metadata (180367395-180367395-554641873) _has_ been applied, and that the immediately following transaction (180367395-180367395-554641874) has _not_ been applied.

This confirms that we have been picking the correct binlog position.
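The check above amounts to: the restart offset is correct iff it points at the GTID event of the first transaction that has not yet been applied. A sketch over a simplified event list (offsets and sequence numbers taken from the binlog excerpts above; the helper is hypothetical):

```python
# Sketch of the position check: given (offset, gtid_seq_no) pairs for
# GTID events in the binlog, the correct restart offset is the offset
# of the first transaction whose seq_no has NOT been applied yet.

events = [
    (130479291, 554641873),  # applied by the dump
    (130479946, 554641874),  # first transaction after the dump
]

def restart_offset(events, last_applied_seq):
    for offset, seq in events:
        if seq > last_applied_seq:
            return offset
    return None  # replica is fully caught up with these events

print(restart_offset(events, 554641873))  # → 130479946
```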

Can you check the status for the page_restrictions table for the above row? pr_page=951915

MariaDB [eswiki]> select * from page_restrictions where pr_page=951915;
+---------+---------+---------------+------------+---------+----------------+--------+
| pr_page | pr_type | pr_level      | pr_cascade | pr_user | pr_expiry      | pr_id  |
+---------+---------+---------------+------------+---------+----------------+--------+
|  951915 | edit    | autoconfirmed |          0 |    NULL | 20210824125807 | 143971 |
|  951915 | move    | sysop         |          0 |    NULL | infinity       | 143972 |
+---------+---------+---------------+------------+---------+----------------+--------+
2 rows in set (0.002 sec)

Right, as expected then. So that means that replication will break again once we reach binlog .002440, unless the server_id handling does something about it. But from the binlogs I have not seen any transaction that would clean up that row before the new inserts...

@Marostegui: just spotted that not all transactions that touch page_restrictions reference pr_page. For example:

#210824 14:22:45 server id 180367395  end_log_pos 623397872     GTID 180367395-180367395-557339435 trans
/*!100001 SET @@session.gtid_seq_no=557339435*//*!*/;
START TRANSACTION
/*!*/;
# at 623397872
#210824 14:22:45 server id 180367395  end_log_pos 623398087     Query   thread_id=453175600     exec_time=0     error_code=0
SET TIMESTAMP=1629814965/*!*/;
DELETE /* Title::purgeExpiredRestrictions  */ FROM `page_restrictions` WHERE pr_id IN ('136621','136622','143971','143973','143974','143975','143976')
/*!*/;
# at 623398087
#210824 14:22:45 server id 180367395  end_log_pos 623398114     Xid = 9258563449
COMMIT/*!*/;

Note that pr_id 143971 is included, which matches the pr_page=951915 edit entry.
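This is why the earlier scans missed the purge: a search keyed only on pr_page=951915 never sees the DELETE, which references the row by pr_id alone. A sketch of a scan matching either identifier (statements abbreviated from the binlog excerpts; the helper is hypothetical):

```python
import re

# Sketch: when scanning binlog statements for changes to one
# page_restrictions row, match on pr_id as well as pr_page; purge
# DELETEs reference rows only by pr_id and would otherwise be missed.

statements = [
    "INSERT INTO `page_restrictions` (pr_page,pr_type,pr_level,pr_cascade,pr_expiry) "
    "VALUES (951915,'edit','autoconfirmed',0,'20210824125807')",
    "DELETE FROM `page_restrictions` WHERE pr_id IN ('136621','143971','143975')",
]

def touches_row(stmt, pr_page, pr_ids):
    """True if the statement references the row by pr_page or any pr_id."""
    if re.search(rf"\b{pr_page}\b", stmt):
        return True
    return any(re.search(rf"'{pid}'", stmt) for pid in pr_ids)

hits = [s for s in statements if touches_row(s, 951915, ["143971", "143972"])]
print(len(hits))  # → 2 (both the INSERT and the purge DELETE match)
```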

Ok, at this point we have good explanations for the issues we've run into. We currently have db2118 at a freshly-restored-from-dump state. To proceed:

  • Disable puppet
  • Stop mariadb
  • Edit mariadb config to disable log_slave_updates, and run sudo systemctl set-environment MYSQLD_OPTS="--skip-slave-start --replicate-same-server-id"
  • Start mariadb. Configure it to replicate from db2121 at db2121-bin.002436 and 130479946
  • Start replication. Hope a lot.
  • If replication catches up, stop it again. Stop mariadb, revert config changes. Start it.
  • Run mysqlcheck --all-databases
  • Start replication again, let it catch up
  • Run db-compare for software/dbtools/tables_to_check.txt
  • reset master

Possible future optimisation: use START SLAVE UNTIL <binlog position at which db2121 becomes primary>

Confirming mariadb is started correctly:

$ systemctl status mariadb
● mariadb.service - mariadb database server
   Loaded: loaded (/lib/systemd/system/mariadb.service; disabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-08-31 08:49:00 UTC; 1min 23s ago
...
           └─9725 /opt/wmf-mariadb104/bin/mysqld --skip-slave-start --replicate-same-server-id

$ sudo mysql
...
MariaDB [(none)]> show slave status\G
Empty set (0.000 sec)

MariaDB [(none)]> select @@GLOBAL.log_slave_updates;
+----------------------------+
| @@GLOBAL.log_slave_updates |
+----------------------------+
|                          0 |
+----------------------------+
1 row in set (0.000 sec)

Looking at the relay logs, I can confirm that db2118 is applying transactions that have its own server_id.

Current status: db2118 successfully caught up on replication. Before continuing with the planned next steps, I'm going to take a copy of the current /srv/sqldata, revert to the original state, and check if we can also successfully replicate using GTID. Once that's done, I'll switch back to the latest /srv/sqldata and continue with double-checking data consistency etc.

This didn't work. It failed with the same (misleading) error as before:

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'

Using the GTID corresponding to the binlog+position on db2121 that we calculated earlier _does_ work. Sadly the details of the mariadb GTID implementation appear to be almost completely undocumented :/

Looking at the mariadb GTID docs, I did however spot this:

If the backup was taken of an existing replica server, then the new replica should already have the correct GTID position stored in the mysql.gtid_slave_pos table. This is assuming that this table was backed up and that it was backed up in a consistent manner with changes to other tables. In this case, there is no need to explicitly look up the GTID position on the old server and set it on the new replica - it will be already correctly loaded from the mysql.gtid_slave_pos table.

We currently do not back up this table, from what I can see. Doing so would simplify restoring replication in general (though it's not clear whether it would help in this particular case, as the implementation details aren't documented).

mysqlcheck --all-databases completed successfully overnight. db2118 is now catching up on replication, and then I'll start running db-compare.

Full set of db-compare runs completed successfully \o/