Checking archive tables across the databases
Closed, Resolved, Public

Description

Recently we had multiple replication breakages on the sanitariums: T208954, T208672, T208695. All of the errors were caused by duplicate or missing entries in the archive tables.
We should check the archive tables across all the wikis to avoid these errors.
(Maybe we should use the methodology that will be introduced in T207253.)

Progress:

  • s1
  • s2
  • s3
  • s4
  • s5
  • s6
  • s7
  • s8

Event Timeline

The hard part is that on the sanitariums we have filters on the archive tables, so compare.py will return an error.

Marostegui triaged this task as Medium priority. Nov 12 2018, 2:04 PM
Marostegui updated the task description.

As a first approach, and in order to make some progress, we can probably compare the sanitarium masters with the master (or other slaves) and leave the sanitariums themselves aside for now. At least we will be sure that the masters and the sanitarium masters are the same.
The sanitariums were originally cloned from the sanitarium masters.

I started the comparison on s5 between db1070 and db1082 with:

for db in $(cat mediawiki-config/dblists/s5.dblist); do ./wmfmariadbpy/wmfmariadbpy/compare.py ${db} archive ar_id db1070 db1082; done | tee s5_check.out
Marostegui moved this task from Pending comment to In progress on the DBA board.

Assigning it to you to reflect the current status

for db in $(cat mediawiki-config/dblists/s6.dblist); do echo "Checking database ${db}"; ./wmfmariadbpy/wmfmariadbpy/compare.py ${db} archive ar_id db1061 db1085; done | tee s6_check.out

The screen session is named sanitarium_check and runs on cumin1001.

s6 ok, continuing with s7

for db in $(cat mediawiki-config/dblists/s7.dblist); do echo "Checking database ${db}"; ./wmfmariadbpy/wmfmariadbpy/compare.py ${db} archive ar_id db1062 db1079; done | tee s7_check.out

s7 completed, s8:

for db in $(cat mediawiki-config/dblists/s8.dblist); do echo "Checking database ${db}"; ./wmfmariadbpy/wmfmariadbpy/compare.py ${db} archive ar_id db1071 db1087; done | tee s8_check.out

s8 completed, s2:

for db in $(cat mediawiki-config/dblists/s2.dblist); do echo "Checking database ${db}"; ./wmfmariadbpy/wmfmariadbpy/compare.py ${db} archive ar_id db1066 db1074; done | tee s2_check.out

s2 done, s4:

for db in $(cat mediawiki-config/dblists/s4.dblist); do echo "Checking database ${db}"; ./wmfmariadbpy/wmfmariadbpy/compare.py ${db} archive ar_id db1068 db1121; done | tee s4_check.out

s4 done, s3:

for db in $(cat mediawiki-config/dblists/s3.dblist); do echo "Checking database ${db}"; ./wmfmariadbpy/wmfmariadbpy/compare.py ${db} archive ar_id db1075 db1077; done | tee s3_check.out

s3 finished, one difference found:

Checking database tcywiki 
Starting comparison between id 1 and 226
DIFFERENCE on db1077.eqiad.wmnet:3306: WHERE ar_id BETWEEN 1 AND 226
Execution ended, a total of 1 chunk(s) are different.

I re-ran the check several times, and it showed up as an error each time.

fixed

In the meantime, the check finished on s1 too and found no differences.

Banyek updated the task description.

tl;dr:
All the sections were checked; only one difference was found, in s3 (tcywiki), and @Marostegui fixed it too.

You only checked 2 servers in each comparison; you should check all of them, since it takes approximately the same amount of time. Speed it up with --step=100000, and automate the check with --quiet and || break 2, as the script returns 0 when executed successfully and a non-zero value on error.
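A minimal sketch of what that suggestion could look like as a reusable loop. It assumes that compare.py accepts more than two hosts after the column name, which the comment above implies but which is not shown in this task. The function name check_sections, the "section:host1,host2" argument format, and the COMPARE and DBLIST_DIR variables are made up for illustration.

```shell
# Paths taken from the commands earlier in this task; override as needed.
COMPARE="${COMPARE:-./wmfmariadbpy/wmfmariadbpy/compare.py}"
DBLIST_DIR="${DBLIST_DIR:-mediawiki-config/dblists}"

# check_sections SPEC...
# Each SPEC is "section:host1,host2,..." (hypothetical format for this sketch).
# Stops at the first database for which compare.py exits non-zero and
# returns that exit status; returns 0 if every comparison succeeded.
check_sections() {
    status=0
    for spec in "$@"; do
        section=${spec%%:*}
        hosts=$(echo "${spec#*:}" | tr ',' ' ')
        for db in $(cat "${DBLIST_DIR}/${section}.dblist"); do
            echo "Checking database ${db} (${section})"
            # $hosts is intentionally unquoted so each host becomes its own argument.
            "$COMPARE" --step=100000 --quiet "$db" archive ar_id $hosts \
                || { status=$?; break 2; }
        done
    done
    return "$status"
}

# Example call (only the hosts already named in this task; add the
# remaining hosts of each section to the lists):
#   check_sections "s5:db1070,db1082" "s6:db1061,db1085"
```

The `break 2` leaves both the per-database loop and the per-section loop, so the run stops at the first database that needs attention instead of burying the failure in a long output.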

  • With tcywiki, I checked all the hosts after I found that diff.
  • I searched the output created by the tool with grep to see if there were any other errors.
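The grep pass over the saved output could be sketched like this. It relies only on the two line formats visible in the s3 output above ("Checking database ..." and "DIFFERENCE on ..."); the function name report_differences is hypothetical, and awk is used here instead of plain grep so that each difference is printed together with the database it belongs to.

```shell
# report_differences FILE...
# Prints one line per DIFFERENCE found, prefixed with the output file and the
# database named by the most recent "Checking database" line.
# Returns 1 if any difference was found, 0 otherwise.
report_differences() {
    awk '
        /^Checking database / { db = $3 }
        /^DIFFERENCE on / {
            print FILENAME ": " (db == "" ? "(unknown db)" : db) ": " $0
            found = 1
        }
        END { exit (found ? 1 : 0) }
    ' "$@"
}

# Example call over the files written by tee in the runs above:
#   report_differences s1_check.out s2_check.out s3_check.out
```

The s5 run was started without the echo "Checking database ..." line, so any difference in s5_check.out would be reported with "(unknown db)".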