run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038)
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• Marostegui
	Jan 3 2017, 3:14 PM

Description

Updated title task on 24th March 2017: I have changed the title to reflect that it contains the data for s2 as it was the initial shard that was checksummed for testing, so we used this ticket for it. The rest of the shards have their own ticket.

In s3 the following servers will be decommissioned (T134476)

db1038 - old master
db1015 - rc
db1035
db1044 - temporary master for db1095 (sanitarium2)

Before shutting them down for good, we need to make sure we are not losing any data, so we'd need to run pt-table-checksum on them.
Some modifications might be needed in order to run it safely.

Details

Subject	Repo	Branch	Lines +/-
db-eqiad.php: Repool db1036	operations/mediawiki-config	master	+7 -7
realm.pp: Add __wmf_checksums table to be ignored	operations/puppet	production	+1 -0
generate_dsns_table.sh: Add new script for dsns.	operations/software	master	+35 -0

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	None	T134476 Decommission old coredb machines (<=db1050)
Resolved	None	T162699 Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036)
Resolved	jcrespo	T154485 run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Mentioned in SAL (#wikimedia-operations) [2017-02-22T15:11:57Z] <marostegui> Restart MySQL on db1069 to apply new replication filters - https://phabricator.wikimedia.org/T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-22T15:21:46Z] <marostegui> Restart MySQL on db1095 to apply new replication filters - https://phabricator.wikimedia.org/T154485

I have restarted db1069 and db1095 and they have the new filter applied:

%.__wmf_checksums

Mentioned in SAL (#wikimedia-operations) [2017-02-23T07:59:14Z] <marostegui> Run pt-table-checksum on s2 (nlwiki) on logging table - T154485

I have started to run pt-table-checksum only for the logging table (17GB) and the expected time reported by pt-table-checksum is around 40 minutes. I am closely monitoring lag on the shard.
After that if all goes fine I will go for revision (23G)

It finished successfully in 44 minutes:

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-23T08:45:43      0      0 26747597  267478       0 2668.835 nlwiki.logging

I am going to go for revision now.

Mentioned in SAL (#wikimedia-operations) [2017-02-23T08:51:04Z] <marostegui> Run pt-table-checksum on s2 (nlwiki) on revision table - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-23T08:54:38Z] <marostegui> Stop pt-table-checksum on nlwiki.revision - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-23T09:09:27Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1036 - T154485 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-23T09:23:36Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1036 - T154485 (duration: 00m 40s)

Some seconds after starting the pt-table-checksum for the revision table I had to kill it as both recentchanges slaves in eqiad and codfw (with the revision table partitioned of course) started to lag behind, to the point where I have had to decommission db1036 as it is lagging behind (1000 seconds now).
logging table is also partitioned and there was no lag there in any of those two hosts., so maybe for revision tables we have to go for a lower chunk (it is 100 now). The ETA for revision with chunk-size was 1:22h.
The crtl+c showed:

Checksumming nlwiki.revision:   0% 01:22:15 remain
^C# Caught SIGINT.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-23T08:54:18      0      0   504100    5041       0  56.792 nlwiki.revision
TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-23T08:54:18      0      0   504100    5041

Which indeed shows chunks of 100 rows.

Once the server have caught up we can try to decrease the chunks to 10, keep db1036 depooled (db1063 is serving recentchanges now) and see what happens.

db1036 and db2035 were still lagging behind more than 30k:

˜/jynus 19:26> it is now doing a full table scan for every chunk

So we decided to add a replication filter on those two hosts to ignore that table and let the servers catch up:

root@PRODUCTION s2[(none)]> stop slave;
Query OK, 0 rows affected (26.81 sec)

root@PRODUCTION s2[(none)]> set global Replicate_Wild_Ignore_Table = "nlwiki.__wmf_checksums";
Query OK, 0 rows affected (0.00 sec)

root@PRODUCTION s2[(none)]> start slave;
Query OK, 0 rows affected (0.72 sec)

In T154485#3050716, @Marostegui wrote:
db1036 and db2035 were still lagging behind more than 30k:
˜/jynus 19:26> it is now doing a full table scan for every chunk
So we decided to add a replication filter on those two hosts to ignore that table and let the servers catch up:
root@PRODUCTION s2[(none)]> stop slave;
Query OK, 0 rows affected (26.81 sec)

root@PRODUCTION s2[(none)]> set global Replicate_Wild_Ignore_Table = "nlwiki.__wmf_checksums";
Query OK, 0 rows affected (0.00 sec)

root@PRODUCTION s2[(none)]> start slave;
Query OK, 0 rows affected (0.72 sec)

I am tempted to leave those replication filters there and try to run the pt-table-checksum again on Monday to see, if at least, it is a viable option to get the whole schema tested without any lag on the revision table (obviously those two slaves wouldn't get checked for consistency).

I wouldn't be against it, but I would ask to start puppetizing that- Something like a replication filter, and add those through hiera with a new production.my option. If the servers restart, we will create issues (I think replication options are lost).

In T154485#3052496, @jcrespo wrote:

I wouldn't be against it, but I would ask to start puppetizing that- Something like a replication filter, and add those through hiera with a new production.my option. If the servers restart, we will create issues (I think replication options are lost).

I am still not completely sure I want this for good, as it would mean making those servers (or all the recentchanges ones) even more special. We can start thinking about it though, so far let's see if the test goes fine on Monday morning and we can discuss how to proceed further.
Ideally we would be able to do
if role == recentchanges then replication filter for __wmf_checksum
Ideal world! :-)

Change 339609 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1036

https://gerrit.wikimedia.org/r/339609

Change 339609 merged by jenkins-bot:
db-eqiad.php: Repool db1036

https://gerrit.wikimedia.org/r/339609

Mentioned in SAL (#wikimedia-operations) [2017-02-24T09:48:14Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1036 - T154485 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-27T07:59:41Z] <marostegui> Run pt-table-checksum on s2 (nlwiki) on revision table - T154485

I am running pt-table-checksum again on s2 (keeping the replication filters for the rc slaves - db1036 and db2035.) I have added --no-check-replication-filters in order to bypass the rc slaves.
ETA for revision table to be checked 1:20h

The table has been checksummed already and there was no lag generated, it took around 1:30h in the end. Now it is checking for the differences on the slaves, and that is going to take a huge amount of time which is probably just too much for a 23G table:

Waiting to check replicas for differences:   0% 2+02:53:47 remain

It was increasing all the time.
I have ctrl+C and will probably run with PTDEBUG=1 to see what is actually doing and see if it is getting stuck somewhere, because it seems too much, and if it is true, there is no way we can run this on enwiki.

@Marostegui Are you checking the dbs with filters? It may be stuck trying to read the rows from those hosts, which will never arrive: http://kedar.nitty-witty.com/blog/pt-table-checksum-waiting-to-check-replicas-for-differences-0-0000-remain

In T154485#3057279, @jcrespo wrote:

@Marostegui Are you checking the dbs with filters? It may be stuck trying to read the rows from those hosts, which will never arrive: http://kedar.nitty-witty.com/blog/pt-table-checksum-waiting-to-check-replicas-for-differences-0-0000-remain

Yes, I was just checking the dsns table on tendril because I thought I had remove those two slaves when we added the filters on Thursday, but apparently I didn't so it was definitely checking them.

You can keep them on the replication list, so they cannot lag, but run it with --no-replicate-check, we can do the checks later, independently.

Once removed those two slaves from the dsns_s2 table, the check was completed successfully:

02-27T11:32:17      0      0 45398597  453988       0 5878.306 nlwiki.revision

Checking for differences reveals that all slaves have the same data in the revision table (obviously the rc slaves are excluded for this check, but given that they didn't have any differences for the rest of the tables, it is likely that for this table they are probably consistent as well).

I've got a list of tables per wiki from s2 that do not have a PK (unfortunately they are not the same across all the wikis, although they are pretty similar). And I have built a pt-table-checksum command per database and with the tables to ignore per database as well. So I would like to run it (one by one) for all s2 shard and see what we get.

Will start everyday in the morning to avoid going in the evening. This run will no longer split between tables, and will include revision and logging tables in the same batch. We haven't seen any delays, so it should be okay to run in, again, one database at the time.

Mentioned in SAL (#wikimedia-operations) [2017-02-28T07:12:09Z] <marostegui> run pt-table-checksum on bgwiki (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T08:24:52Z] <marostegui> run pt-table-checksum on bgwiktionary (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T08:59:33Z] <marostegui> run pt-table-checksum on cswiki (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T10:23:29Z] <marostegui> run pt-table-checksum on enwikiquote (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T10:53:32Z] <marostegui> run pt-table-checksum on enwiktionary (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T14:32:29Z] <marostegui> run pt-table-checksum on eowiki (s2) - T154485

The following wikis were checksummed in s2 during the day with no issues:

bgwiki - no differences
bgwiktionary - no differences
cswiki - no differences
enwikiquote - no differences
enwiktionary - db1047 has different data in the archive table
eowiki - stopped because it was getting to late and I didn't want it to run if I was not present (will --resume tomorrow, it was already on the revision table, which ETA was 40 minutes)

I have resumed the run on eowiki:

Resuming from eowiki.revision chunk 932, timestamp 2017-02-28 17:13:33
Checksumming eowiki.revision:   3% 13:40 remain

Mentioned in SAL (#wikimedia-operations) [2017-03-01T13:16:33Z] <marostegui> run pt-table-checksum on idwiki - T154485

Mentioned in SAL (#wikimedia-operations) [2017-03-01T15:45:52Z] <marostegui> Resume pt-table-checksum on idwiki (s2) - T154485

Wikis tested today:
eowiki - no differences
fiwiki - no differences
idwiki - stopped on the page table as it is pretty much the end of the day and I didn't want to leave it running, although at this point I am fairly confident that it wouldn't cause any issue, it has been running in the background and unattended the whole day.

Mentioned in SAL (#wikimedia-operations) [2017-03-02T07:09:51Z] <marostegui> Resume pt-table-checksum on idwiki (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-03-02T08:27:11Z] <marostegui> Start pt-table-checksum on itwiki (s2) - T154485

Today's checksums
idwiki - after resuming it today, db1047 has differences on one table:
idwiki.geo_tags 1 0 1 PRIMARY 988797 1916637

itwiki - dbstore1002 has differences on:

itwiki.wbc_entity_usage 26352 1 1 PRIMARY 4982283 4982418
itwiki.wbc_entity_usage 41520 -1 1 PRIMARY 6987621 6987766

Mentioned in SAL (#wikimedia-operations) [2017-03-03T08:22:25Z] <marostegui> Run pt-table-checksum on s2 (nowiki) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-03-03T10:53:38Z] <marostegui> Start pt-table-checksum on plwiki (s2) - T154485

Today checksums:
nowiki - dbstore1002 and db1047 have differences:

Differences on db1047
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nowiki.archive 5811 0 1 PRIMARY 631936 632068

Differences on dbstore1002
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nowiki.archive 5811 0 1 PRIMARY 631936 632068
nowiki.archive 6170 0 1 PRIMARY 681436 681558
nowiki.archive 6171 0 1 PRIMARY 681559 681716
nowiki.page_props 9346 -1 1 PRIMARY 1275171,1275171,wikibase_item 1275273,1275273,displaytitle
nowiki.page_props 9347 -1 1 PRIMARY 1275273,1275273,wikibase_item 1275466,1275466,defaultsort
nowiki.page_props 9507 -7 1 PRIMARY 1296037,1296037,wikibase_item 1296130,1296130,wikibase_item
nowiki.page_props 9509 -1 1 PRIMARY 1296213,1296213,defaultsort 1296312,1296312,wikibase_item
nowiki.page_props 9510 -5 1 PRIMARY 1296313,1296313,page_image_free 1296422,1296422,defaultsort
nowiki.page_props 9736 -1 1 PRIMARY 1319889,1319889,displaytitle 1320027,1320027,wikibase_item
nowiki.page_props 9737 -2 1 PRIMARY 1320028,1320028,page_image_free 1320129,1320129,wikibase_item
nowiki.page_props 9738 -14 1 PRIMARY 1320130,1320130,wikibase_item 1320231,1320231,wikibase_item
nowiki.page_props 9739 -2 1 PRIMARY 1320232,1320232,wikibase_item 1320303,1320303,defaultsort
nowiki.page_props 9740 -1 1 PRIMARY 1320303,1320303,page_image_free 1320400,1320400,wikibase_item
nowiki.page_props 9743 -3 1 PRIMARY 1320554,1320554,page_top_level_section_count 1320634,1320634,wikibase_item
nowiki.page_props 9746 -2 1 PRIMARY 1320728,1320728,defaultsort 1320820,1320820,wikibase_item
nowiki.page_props 9750 -1 1 PRIMARY 1321070,1321070,wikibase_item 1321125,1321125,page_image_free
nowiki.page_props 9802 -3 1 PRIMARY 1328742,1328742,wikibase_item 1328896,1328896,wikibase_item
nowiki.page_props 9803 -1 1 PRIMARY 1328897,1328897,defaultsort 1329046,1329046,wikibase_item
nowiki.page_props 9804 -8 1 PRIMARY 1329047,1329047,wikibase_item 1329190,1329190,wikibase_item
nowiki.page_props 9806 -6 1 PRIMARY 1329300,1329300,wikibase_item 1329429,1329429,displaytitle
nowiki.page_props 9808 -1 1 PRIMARY 1329599,1329599,wikibase_item 1329748,1329748,wikibase_item
nowiki.page_props 9811 -1 1 PRIMARY 1330082,1330082,wikibase_item 1330344,1330344,wikibase_item
nowiki.page_props 9812 -5 1 PRIMARY 1330345,1330345,wikibase_item 1330463,1330463,wikibase_item
nowiki.page_props 9818 -8 1 PRIMARY 1331053,1331053,wikibase_item 1331152,1331152,wikibase_item
nowiki.page_props 9819 -1 1 PRIMARY 1331153,1331153,wikibase_item 1331263,1331263,wikibase_item
nowiki.page_props 9820 -1 1 PRIMARY 1331265,1331265,page_top_level_section_count 1331392,1331392,wikibase_item
nowiki.page_props 9821 -3 1 PRIMARY 1331393,1331393,wikibase_item 1331510,1331510,wikibase_item
nowiki.page_props 9824 -1 1 PRIMARY 1331798,1331798,wikibase_item 1331907,1331907,wikibase_item
nowiki.page_props 9833 -1 1 PRIMARY 1332907,1332907,wikibase_item 1333008,1333008,page_image_free

plwiki has been stopped and will be resumed on Monday. Starting on Monday I will do all the wikis non-stop, as during this week there has been 0 problems while running them.

Mentioned in SAL (#wikimedia-operations) [2017-03-06T07:22:04Z] <marostegui> Resume pt-table-checksum on plwiki (s2) - T154485

I have started the checksum on plwiki again after all the crashes with db1060

Mentioned in SAL (#wikimedia-operations) [2017-03-08T07:27:12Z] <marostegui> Start pt-table-checksum on plwiki (s2) - T154485

It took a long run to finish plwiki, but here are the results.
differences in db1047 and quite lots of them in dbstore1002 too.

• Marostegui moved this task from Backlog to In progress on the DBA board.Mar 13 2017, 9:52 AM

ptwiki:
differences on db1047 and dbstore1002.

The checksum on svwiki stopped due to db1054 going down for maintenance. I will resume once it is back up again.

I am sorry :-(.

No worries whatsoever! This is now a background task that doesn't require much babysitting, so no problem at all!

svwiki has no differences
thwiki - differences on db1047 and dbstore1002
trwiki - in progress now

• Marostegui mentioned this in T160509: run pt-tablechecksum on s6.Mar 15 2017, 10:33 AM

db1047:

db      tbl     total_rows      chunks
enwiktionary    archive 100     1
idwiki  geo_tags        100     1
nlwiki  archive 200     2
nowiki  archive 100     1
plwiki  archive 500     5
ptwiki  geo_tags        100     1
thwiki  geo_tags        100     1
zhwiki  archive 100     1

Good and bad news:
{P5061}
There are no differences in data, but the primary key values differ, probably because those were inserted with the unsafe INSERT...SELECT and different locking patterns happened. We should make sure that is no longer happening.

I intend to run:

(echo "SET SESSION sql_log_bin=0; " && mysqldump --single-transaction -h db1018.eqiad.wmnet enwiktionary archive --where "ar_id >= 1068351 and ar_id <= 1068357" --no-create-info --skip-add-locks --replace | grep REPLACE ) | mysql -h db1047.eqiad.wmnet enwiktionary

to fix db1047:enwiktionary

In T154485#3101548, @jcrespo wrote:

I intend to run:

(echo "SET SESSION sql_log_bin=0; " && mysqldump --single-transaction -h db1018.eqiad.wmnet enwiktionary archive --where "ar_id >= 1068351 and ar_id <= 1068357" --no-create-info --skip-add-locks --replace | grep REPLACE ) | mysql -h db1047.eqiad.wmnet enwiktionary

to fix db1047:enwiktionary

Looks good to me - thanks!! :)

I have done the above, there is no longer a diff for db1047 on enwiktionary.archive. I will repeat the steps for all detected changes. Some important tables are skipped (text)- I may have to check those manually stopping a slave and exporting its contents.

The last database for s2 was checksummed and these are the results:

Differences on db1047
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
zhwiki.archive 19830 0 1 PRIMARY 2339129 2339333

There are also differences on dbstore1002.

So we can say that s2 has been checksummed entirely (for all the tables with PKs)

I have actually fixed all db1047 found differences. I am working now on dbstore1002.

dbstore1002:

$ mysql -h $slave nlwiki -e "SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM __wmf_checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl"
+--------------+------------------+------------+--------+
| db           | tbl              | total_rows | chunks |
+--------------+------------------+------------+--------+
| bgwiki       | page_props       |       8741 |     91 |
| cswiki       | archive          |        100 |      1 |
| cswiki       | page_props       |       3791 |     39 |
| enwiktionary | page_props       |      26740 |    338 |
| eowiki       | flaggedimages    |        757 |     10 |
| eowiki       | flaggedrevs      |       2665 |     27 |
| eowiki       | flaggedtemplates |       1578 |     22 |
| eowiki       | page_props       |       6171 |     63 |
| fiwiki       | flaggedimages    |        573 |      7 |
| fiwiki       | flaggedrevs      |      21657 |    219 |
| fiwiki       | flaggedtemplates |       1975 |     23 |
| fiwiki       | page_props       |       3385 |     35 |
| idwiki       | flaggedimages    |        193 |      2 |
| idwiki       | flaggedrevs      |      11470 |    116 |
| idwiki       | flaggedtemplates |        230 |      3 |
| idwiki       | geo_tags         |        100 |      1 |
| itwiki       | page_props       |       1978 |     20 |
| itwiki       | wbc_entity_usage |        200 |      2 |
| nlwiki       | archive          |        100 |      1 |
| nlwiki       | page_props       |        297 |      3 |
| nowiki       | archive          |        300 |      3 |
| nowiki       | page_props       |       2519 |     26 |
| plwiki       | flaggedimages    |        390 |      4 |
| plwiki       | flaggedrevs      |      15332 |    155 |
| plwiki       | flaggedtemplates |        702 |      8 |
| plwiki       | logging          |         99 |      1 |
| plwiki       | page_props       |       5873 |     61 |
| plwiki       | spoofuser        |         99 |      1 |
| plwiki       | user             |         99 |      1 |
| ptwiki       | archive          |        100 |      1 |
| ptwiki       | geo_tags         |        100 |      1 |
| ptwiki       | page_props       |        688 |      7 |
| thwiki       | geo_tags         |        100 |      1 |
| thwiki       | page_props       |        692 |      7 |
| trwiki       | flaggedimages    |       1930 |     23 |
| trwiki       | flaggedrevs      |      11182 |    113 |
| trwiki       | flaggedtemplates |       2335 |     33 |
| trwiki       | page_props       |        983 |     10 |
| zhwiki       | page_props       |        875 |      9 |
+--------------+------------------+------------+--------+

dbstore1001 - it is 24 hours behind, so it has to be checked tomorrow again:

root@dbstore1001[nlwiki]> SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM __wmf_checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl;
+--------+----------+------------+--------+
| db     | tbl      | total_rows | chunks |
+--------+----------+------------+--------+
| cswiki | geo_tags |          2 |      1 |
| idwiki | geo_tags |        100 |      1 |
| nlwiki | page     |        100 |      1 |
| nlwiki | revision |         99 |      1 |
| nlwiki | user     |        100 |      1 |
| ptwiki | geo_tags |        100 |      1 |
| thwiki | geo_tags |        100 |      1 |
+--------+----------+------------+--------+
7 rows in set (6 min 9.99 sec)

db1021 and db1024- no changes.

db1036.eqiad.wmnet:

$ mysql -h $slave nlwiki -e "SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM __wmf_checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl"
+--------+---------+------------+--------+
| db     | tbl     | total_rows | chunks |
+--------+---------+------------+--------+
| nlwiki | archive |        800 |      8 |
+--------+---------+------------+--------+

I will focus on these so they can be decommissioned ASAP.

I finished dbstore1002 too (again, only for the tables with PKs) for s2, I will fix dbstore1001 now.

jcrespo claimed this task.Mar 16 2017, 2:26 PM

dbstore1001 is fixed for s2, again with the exception of tables without PKs and zhwiki, which is finishing at this point. Going with db1036.

db1036 is now ok. dbstore2001.codfw.wmnet was all ok, but it is missing several rows from bv20**_edits tables (they are empty there).

I have checked all s2 slaves, including codfw and those not part of the specific lists. All either didn't have differences or have been corrected. The following things are to be taken into account: db1036 not fully checked because it is a "special slave". Labs ones (including db1069) are in a bad state, even if having into account filter differences, but they are going to be decommissioned soon. codfw slaves are all ok, but MIXED replication on the codfw master could kick in (master is good, however). New labs uses row, so they cannot be checked reliably.

I will now try to check some extra important tables on s2, at least:

oldimage
text
user_props

Most of the rest are cache/summary tables/derived data from parsing, etc.

• Marostegui renamed this task from run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 to run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038).Mar 24 2017, 9:03 AM

• Marostegui updated the task description. (Show Details)

• Marostegui added a parent task: T134476: Decommission old coredb machines (<=db1050).

• Marostegui mentioned this in T161294: run pt-tablechecksum on s5/s8.

Checking text now, at least on the master, dbstore1002 and the hosts to be decommissioned:

$ while read db; do for host in db1018 db1024 db1021 db1036; do echo $host:$db; python3 compare.py dbstore1002 $host $db text old_id --step=10000; done; done < s2.dblist
db1018:bgwiki
No differences found.
db1024:bgwiki
No differences found.
db1021:bgwiki
No differences found.
db1036:bgwiki
No differences found.
db1018:bgwiktionary
...

I have to do a second pass for : idwiki itwiki nlwiki nowiki plwiki ptwiki svwiki trwiki zhwiki

All others had no differences- MySQL complained on those of comparing too many rows at a time, so I had to reduce the batch to 1000 rows (which will be slower).

Only diff:

db1036:svwiki
Rows are different WHERE old_id BETWEEN 27558001 AND 27559000

$ python3 compare.py dbstore1002 db1036 svwiki text old_id --from-value=27558001 --to-value=27559000 --step=1
Rows are different WHERE old_id BETWEEN 27558799 AND 27558799

They are not toys! They are useful:

$ ./sql.py -h dbstore1002.eqiad.wmnet svwiki -e "SELECT * FROM text WHERE old_id BETWEEN 27558799 AND 27558799" --no-dry-run

Results for dbstore1002.eqiad.wmnet:3306/svwiki:
+----------+-----------------+-------------+------------------------+---------------+------------+-----------------+-----------------+------------------+---------------------+---------------------+
|   old_id |   old_namespace | old_title   | old_text               | old_comment   |   old_user | old_user_text   | old_timestamp   |   old_minor_edit | old_flags           | inverse_timestamp   |
|----------+-----------------+-------------+------------------------+---------------+------------+-----------------+-----------------+------------------+---------------------+---------------------|
| 27558799 |               0 |             | DB://cluster25/XXXXXXX |               |          0 |                 |                 |                0 | utf-8,gzip,external |                     |
+----------+-----------------+-------------+------------------------+---------------+------------+-----------------+-----------------+------------------+---------------------+---------------------+
1 row in set (0.00 sec)

$ ./sql.py -h db1036.eqiad.wmnet svwiki -e "SELECT * FROM text WHERE old_id BETWEEN 27558799 AND 27558799" --no-dry-run

Results for db1036.eqiad.wmnet:3306/svwiki:
0 rows in set (0.00 sec)

text is now ok after reimport, oldimage, too (with some false positives due to some deleted revisions out of order).

I am going over user_props now, and I have detected something, but lost the output despite running on a screen- it could be a false positive or an actual problem. I will restart the check soon again.

There are a lot of differences on user_properties between the master and dbstore1002 on:

itwiki
plwiki
svwiki

One (maybe) on:

zhwiki

This needs further research.

Differences on dbstore1002:

itwiki: --where "up_user BETWEEN 0 AND 1009999"
nowiki: --where "up_user BETWEEN 70000 AND 79999"
        --where "up_user BETWEEN 100000 AND 109999"
plwiki: --where "up_user BETWEEN 0 AND 59999"
svwiki: --where "up_user BETWEEN 0 AND 9999"
trwiki: --where "up_user BETWEEN 950000 AND 957402"

differences on db1047:

itwiki: --where "up_user BETWEEN 0 AND 1009999"
nowiki: --where "up_user BETWEEN 100000 AND 109999"
plwiki: --where "up_user BETWEEN 0 AND 59999"
svwiki: --where "up_user BETWEEN 0 AND 9999"

Mentioned in SAL (#wikimedia-operations) [2017-04-07T07:21:45Z] <jynus> reimporting several damaged db tables on s2 T154485

Everything seems ok or has been corrected (there were some false positives above). I am going to give a quick look at watchlist and then close this ticket.

In T154485#3162872, @jcrespo wrote:

Everything seems ok or has been corrected (there were some false positives above). I am going to give a quick look at watchlist and then close this ticket.

Thanks a lot for all the archeology work you've done. It is amazing.

db1018 and db1021 have the exact same watchlist table, we can now use the slave to compare it to the other servers (next week).

That is great, one step further to be able to decommission those hosts finally :-)

The modified task is for me complete- run pt-table-checksum on s2. I have not checked every table and every server- but I am mostly confident with the results we got and the corrections made to decommission the older servers.

Aside from pt-table-checksums on those servers where it is possible (not labs or non-row), I have checked non derivative data on all old servers, the master, db1054, db1047, dbstore1002 and codfw master.

Yaaaaaaaay!!! :-)
Thanks for the hard archeology work!

jcrespo added a parent task: T162699: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036).Apr 11 2017, 3:12 PM

jcrespo removed a parent task: T148078: Decommission db1015, db1035, db1044 and db1038.

• Marostegui mentioned this in T162699: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036).May 5 2017, 8:37 AM

• Marostegui mentioned this in T183735: Check data consistency across production shards.Dec 28 2017, 7:54 AM

Found at least 2 differences on cswiki:db2035:archive detected as part of T184090:

./compare.py cswiki archive ar_id db2017 db2035 db2041 db2064 --from-value=784601 --to-value=784700 --step=1
Starting comparison between id 784601 and 784700
DIFFERENCE on db2035.codfw.wmnet:3306: WHERE ar_id BETWEEN 784626 AND 784626
DIFFERENCE on db2035.codfw.wmnet:3306: WHERE ar_id BETWEEN 784627 AND 784627

There would have been no data loss, but the rows have switched primary key.

s2 master and decommed server where checked- tables:

archive ar_id
logging log_id
page page_id
revision rev_id
text old_id
user user_id

• Marostegui mentioned this in T200061: Recent duplicate entries on change_tag on sanitarium hosts.Jul 20 2018, 9:54 AM

run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038)Closed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038)
Closed, ResolvedPublic
Actions

Related Objects
Search...