
labsdb1004: s51541_sulwatcher.logging is out of sync
Closed, ResolvedPublic

Description

While investigating and fixing T191001, I discovered that the s51541_sulwatcher.logging table on labsdb1004 is not in sync with its master (labsdb1005).

# labsdb1004
MariaDB [s51541_sulwatcher]> select count(*) from logging;
+----------+
| count(*) |
+----------+
|   245675 |
+----------+
1 row in set (0.10 sec)

# labsdb1005
MariaDB [s51541_sulwatcher]> select count(*) from logging;
+----------+
| count(*) |
+----------+
|    12188 |
+----------+
1 row in set (0.17 sec)

Feel free to close this task if you have a more general one for checksumming tool labsdb.
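
For reference, a generic consistency check of this kind is usually done with Percona's pt-table-checksum run against the master; the invocation below is only an illustrative sketch (host and options are assumptions, nothing of the sort was run for this task):

pt-table-checksum --databases s51541_sulwatcher --tables logging h=labsdb1005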

Event Timeline

Volans created this task. · Mar 29 2018, 1:08 PM
Restricted Application added a subscriber: Aklapper. · Mar 29 2018, 1:08 PM
Marostegui edited projects, added Tools, Tool-stewardbots; removed DBA. · Mar 29 2018, 1:21 PM

Thanks for fixing this @Volans!
I'm adding the tags suggested by https://tools.wmflabs.org/contact/ for that project, to see if we can find an owner for this tool and work out what the next steps should be.

Hi. I am a maintainer of the tool. As logged on our SAL yesterday (https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL#2018-03-28), we deleted old records from our db. Let me know if I can assist with anything.

@MarcoAurelio thanks for checking in and for the additional info. At this point I think this is due to an older discrepancy between the two hosts for this table: your query deleted only 14678 rows on the master, while the slave had many more rows that met the same condition:

MariaDB [s51541_sulwatcher]> select count(*) from logging WHERE l_timestamp < 20180201000000;
+----------+
| count(*) |
+----------+
|   233487 |
+----------+

This is because we moved to row-based replication (as opposed to statement-based), so only the 14678 rows deleted on the master issued deletes on the slave; the actual statement was not replicated. My educated guess is that some discrepancy on this table already existed from before.
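
For reference, the replication format in effect can be checked with a standard MariaDB variable; the statement below is only an illustration of the point above, not output captured from these hosts:

SHOW GLOBAL VARIABLES LIKE 'binlog_format';

With ROW, the binary log carries one delete event per row actually removed on the master, so extra rows that exist only on the slave are never touched; with STATEMENT, the slave would have re-executed the whole DELETE and removed its own matching rows as well.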

I think a quick solution is for me to run the same query from your SAL (DELETE FROM logging WHERE l_timestamp < 20180201000000;) on the slave host (labsdb1004), to ensure that the same garbage collection is applied there too. I can then run some quick ballpark checks on the content of the table to see whether any major discrepancy is left. @Marostegui thoughts?
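
For illustration, the kind of ballpark checks meant here could look like the following, run on both labsdb1004 and labsdb1005 and compared by hand (a sketch, not the exact queries that were used):

SELECT COUNT(*), MIN(l_timestamp), MAX(l_timestamp) FROM s51541_sulwatcher.logging;
CHECKSUM TABLE s51541_sulwatcher.logging;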

MarcoAurelio added a comment. · Edited · Mar 29 2018, 9:59 PM

Hi @Volans. As you can see on our SAL, in December I ran a similar deletion after several years without maintenance (https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Tools.stewardbots/SAL#2017-12-04). 200k rows were deleted back then.

I must admit I am not familiar with how our db works. I thought that our s51541_sulwatcher db was the only one with data. But it looks like we have one thousand and five?

Regards.

We can either do what @Volans proposes or just reimport the table from the master - it is pretty small (83M on disk) on the master.
I would prefer to go for the master reimport, as we can fix any other issues that table might have.
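
For illustration only, a reimport of this kind could be done along these lines (a sketch assuming mysqldump is run from a maintenance host; the actual procedure, replication handling and options used by the DBAs may differ):

# Dump the table from the master and load it on the slave
mysqldump -hlabsdb1005 --single-transaction s51541_sulwatcher logging > logging.sql
mysql -hlabsdb1004 s51541_sulwatcher < logging.sql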

MarcoAurelio added a comment. · Edited · Apr 3 2018, 1:04 PM

Note that I'll be regularly purging the table to avoid storing old data there that is no longer useful for us. Let me know if I should warn you in advance, etc. for the next time.

Actually, having a cron job that automatically deletes all data from that table older than 30 days would be a plus. No idea whether that can be done through cron or a shell script.
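
A minimal sketch of what such a periodic purge could look like, assuming l_timestamp holds MediaWiki-style YYYYMMDDHHMMSS values (as the queries above suggest); the script name, path and credentials file are placeholders, not part of the tool:

#!/bin/bash
# purge_logging.sh - delete sulwatcher logging rows older than 30 days
mysql --defaults-file="$HOME/replica.my.cnf" s51541_sulwatcher -e \
  "DELETE FROM logging WHERE l_timestamp < DATE_FORMAT(NOW() - INTERVAL 30 DAY, '%Y%m%d%H%i%s')"

A crontab entry can then run the script daily, which also avoids cron's special handling of % in inline commands:

0 3 * * * /data/project/stewardbots/purge_logging.sh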

> Note that I'll be regularly purging the table to avoid storing old data there that is no longer useful for us. Let me know if I should warn you in advance, etc. for the next time.
> Actually, having a cron job that automatically deletes all data from that table older than 30 days would be a plus. No idea whether that can be done through cron or a shell script.

Then it might break replication as soon as it hits a row that isn't present on the slave.
I think I am going to reimport the table on the slave and get this over with :-)

@MarcoAurelio we just need to stop writing to that table for some minutes during the reimport, so the backup server is synced. Can that happen at some point?

> We can either do what @Volans proposes or just reimport the table from the master - it is pretty small (83M on disk) on the master.
> I would prefer to go for the master reimport, as we can fix any other issues that table might have.

+1

@Marostegui and @jcrespo Sure, I can take down the bots so they don't write to those tables. I'll ping you on IRC.

> @Marostegui and @jcrespo Sure, I can take down the bots so they don't write to those tables. I'll ping you on IRC.

Great - ping me and I can start the reimport.

Mentioned in SAL (#wikimedia-cloud) [2018-04-03T13:15:00Z] <Hauskatze> sulwatcher: bots taken down to fix a db master/replica issue - T191020

Mentioned in SAL (#wikimedia-operations) [2018-04-03T13:21:03Z] <marostegui> Reimport  s51541_sulwatcher.logging from master to slave - T191020

This is now done:

root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -hlabsdb1004 -e "select count(*) from s51541_sulwatcher.logging"
+----------+
| count(*) |
+----------+
|    12998 |
+----------+
root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl -hlabsdb1005 -e "select count(*) from s51541_sulwatcher.logging"
+----------+
| count(*) |
+----------+
|    12998 |
+----------+
Marostegui closed this task as Resolved. · Apr 3 2018, 1:31 PM
Marostegui claimed this task.