compare.py
ActivePublic
Actions

Authored by jcrespo on Jan 16 2018, 11:57 AM.

Tags

None

Referenced Files

	F12591815: compare.py
	Jan 16 2018, 11:57 AM

Subscribers

• Marostegui

	root@neodymium:~$ ./compare.py db1111 db1112 commonswiki page page_id --from-value 1 --to-value=3000
	Rows are different WHERE page_id BETWEEN 1 AND 1000
	1 chunk(s) different.
	root@neodymium:~$ ./compare.py db1111 db1112 commonswiki page page_id --from-value 1 --to-value=1000 --step=1
	Rows are different WHERE page_id BETWEEN 4 AND 4
	1 chunk(s) different.
	root@neodymium:~$ echo $?

	./compare.py --help
	usage: compare.py [-h] [--step STEP]
	[--group_concat_max_len GROUP_CONCAT_MAX_LEN]
	[--from-value FROM_VALUE] [--to-value TO_VALUE]
	[--order-by ORDER_BY] [--verbose]
	host1 host2 database table column

	Compares the table contents between 2 WMF MySQL/MariaDB servers.

	positional arguments:
	host1 First instance, with format host1[:port]
	host2 Second instance, with format host2[:port]
	database Database to connect on both instances
	table Table to compare on both instances
	column Numeric id to loop on, normally an autoincrement field

	optional arguments:
	-h, --help show this help message and exit
	--step STEP How many rows to compare each time. The larger number
	of rows, the faster the full comparison will be, but
	also more aggressive. Also, there is an upper limit
	due to group_concat_max_len, depending on the size of
	the rows. Default: 1000 rows
	--group_concat_max_len GROUP_CONCAT_MAX_LEN
	Sets group_concat_max_len. Normally needed when
	increasing the "step" or the rows are very large.
	Default: 10MB
	--from-value FROM_VALUE
	Start from given column value. Useful if only part of
	the table has to be scanned. Defaults to min column
	value for the table on the first given instance, at
	start.
	--to-value TO_VALUE Start from given column value. Useful if only part of
	the table has to be scanned. Defaults to max column
	value for the table on the first given instance, at
	start.
	--order-by ORDER_BY Ordering when doing the comparions. Useful when the
	column is not an unique filed, and results could be
	returned out of order. By default, we order by column,
	which could give false positives if the column is not
	unique.
	--verbose Enables verbose logging, where all SQL commands sent
	to the server are sent to the standard output.
	Defaults to disabled.

Event Timeline

jcrespo created this paste.Jan 16 2018, 11:57 AM

$ ./compare.py db1111 db1112 commonswiki page page_id --step=10000
Starting comparison between id 1 and 64894894
DIFFERENCE: WHERE page_id BETWEEN 1 AND 10000
100 iterations, id 990001/64894894, 1 chunk(s) found different
200 iterations, id 1990001/64894894, 1 chunk(s) found different
300 iterations, id 2990001/64894894, 1 chunk(s) found different
400 iterations, id 3990001/64894894, 1 chunk(s) found different
500 iterations, id 4990001/64894894, 1 chunk(s) found different
600 iterations, id 5990001/64894894, 1 chunk(s) found different
700 iterations, id 6990001/64894894, 1 chunk(s) found different
800 iterations, id 7990001/64894894, 1 chunk(s) found different

$ ./compare.py db1111 db1112 commonswiki page page_id --step=10000
Starting comparison between id 1 and 64894894
DIFFERENCE: WHERE page_id BETWEEN 1 AND 10000
2018-01-16T15:05:41.084014: row id 990001/64894894, 1 chunk(s) found different
2018-01-16T15:05:51.302744: row id 1990001/64894894, 1 chunk(s) found different
2018-01-16T15:06:01.520885: row id 2990001/64894894, 1 chunk(s) found different
2018-01-16T15:06:11.738242: row id 3990001/64894894, 1 chunk(s) found different

Now you can compare more than 1 server at a time (as fast as doing only 2), and an eta for finishing is printed:

$ ./compare.py commonswiki page page_id db2073 db1111 db1112
Starting comparison between id 1 and 64894894
DIFFERENCE on db1111.eqiad.wmnet:3306: WHERE page_id BETWEEN 1 AND 10000
DIFFERENCE on db1112.eqiad.wmnet:3306: WHERE page_id BETWEEN 1 AND 10000
DIFFERENCE on db1111.eqiad.wmnet:3306: WHERE page_id BETWEEN 10001 AND 20000
DIFFERENCE on db1112.eqiad.wmnet:3306: WHERE page_id BETWEEN 10001 AND 20000
...
2018-01-16T19:05:19.702697: row id 990001/64894894, ETA: 20m41s, 99 chunk(s) found different

• Marostegui subscribed.Jan 16 2018, 8:36 PM

I would find useful an option to ONLY print differences.
That is to say, not to print things like:

2018-01-17T07:19:03.238318: row id 33990001/40566930, ETA: 01m07s, 0 chunk(s) found different
2018-01-17T07:19:13.460769: row id 34990001/40566930, ETA: 00m57s, 0 chunk(s) found different
2018-01-17T07:19:23.677512: row id 35990001/40566930, ETA: 00m46s, 0 chunk(s) found different
2018-01-17T07:19:33.899286: row id 36990001/40566930, ETA: 00m36s, 0 chunk(s) found different
2018-01-17T07:19:44.120978: row id 37990001/40566930, ETA: 00m26s, 0 chunk(s) found different
2018-01-17T07:19:54.344143: row id 38990001/40566930, ETA: 00m16s, 0 chunk(s) found different
2018-01-17T07:20:04.566056: row id 39990001/40566930, ETA: 00m05s, 0 chunk(s) found different

And only print any possible differences like:

DIFFERENCE on db1089.eqiad.wmnet:3306: WHERE ct_id BETWEEN 40560001 AND 40566930

Now, with --print-every=0, only differences are printed (unless there is an error). Return value still must be checked to test if they are equal or there are differences:

./compare.py commonswiki page page_id db1111 db1112 --print-every=0
DIFFERENCE on db1112.eqiad.wmnet:3306: WHERE page_id BETWEEN 1 AND 10000

host:port not working?

root@neodymium:/home/marostegui# /home/jynus/compare.py enwiki change_tag ct_id db1089 db1099:3311
Traceback (most recent call last):
  File "/home/jynus/compare.py", line 184, in <module>
    main()
  File "/home/jynus/compare.py", line 74, in main
    connections = connect_in_parallel(options.hosts, options.database, options.threads)
  File "/home/jynus/compare.py", line 46, in connect_in_parallel
    mysql = async_result[host].get()
KeyError: 'db1099:3311'

Let me check it, maybe python doesn't like dictionaries with an ':'.

This should be fixed, I just changed internally references to e.g. db1099:3311 by db1099_3311 and assume we will never use _ in a hostname/dns:

./compare.py enwiki change_tag ct_id db1089 db1099:3311
Starting comparison between id 1 and 40627513
2018-01-18T10:45:59.628063: row id 990001/40627513, ETA: 06m44s, 0 chunk(s) found different
2018-01-18T10:46:09.936190: row id 1990001/40627513, ETA: 06m36s, 0 chunk(s) found different
2018-01-18T10:46:20.155734: row id 2990001/40627513, ETA: 06m25s, 0 chunk(s) found different
2018-01-18T10:46:30.377224: row id 3990001/40627513, ETA: 06m15s, 0 chunk(s) found different
2018-01-18T10:46:40.694896: row id 4990001/40627513, ETA: 06m05s, 0 chunk(s) found different
2018-01-18T10:46:50.915050: row id 5990001/40627513, ETA: 05m55s, 0 chunk(s) found different

Thanks!

compare.pyActivePublicActions

Event Timeline

compare.py
ActivePublic
Actions