
Make checksum parallel to the data transfer in transferpy package
Closed, ResolvedPublic

Description

transferpy is a package used for database backup and recovery. Its calculate_checksum function calculates the checksum of each file as part of the sanity_checks on the sender side (before the actual transfer). After the transfer completes, the receiver side has to calculate the checksums again as part of the after_transfer_checks. This ticket is to discuss the possibility of running the checksum in parallel with the actual data transfer on both sides, so that the total time taken to complete the process is lower than it is today.

Event Timeline

Privacybatm moved this task from Triage to GSOC2020 on the DBA board.

I would like to calculate the checksum of the actual tarred data. We can do this in parallel with the transfer, like this:
At sender: tar cf - <directory> | tee >(echo $(md5sum) > /tmp/transfer_send) | remaining-commands
At receiver: commands | tee >(echo $(md5sum) > /tmp/transfer_recv) | tar xf - <directory>
Then we can compare those two checksum temp files at the end of the transfer. It should reduce the overall time considerably.
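
For illustration, here is a minimal end-to-end sketch of this idea, assuming netcat as the transport (the hostname, port, and paths are hypothetical, netcat flag syntax varies by implementation, and this is not transferpy's actual invocation):

# Sender: checksum a tee'd copy of the tar stream while netcat ships it.
tar cf - /srv/data | tee >(md5sum > /tmp/transfer_send) | nc target.example 4400

# Receiver: checksum the incoming stream while untarring it.
nc -l -p 4400 | tee >(md5sum > /tmp/transfer_recv) | tar xf -

# Once both sides have finished (including the >(...) subshells; see the
# process-substitution caveat discussed later in this thread), compare:
diff /tmp/transfer_send /tmp/transfer_recv && echo "stream checksums match"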
What do you think?

I think it is a good starting point. I suggest you do some benchmarking (it doesn't need to be implemented in code yet) of how expensive this strategy would be compared to the current method and compared to no checksum, to understand the impact/improvement.

You should test it with both a small number of files and a large number (we run this with up to 200K files and around 2TB of data), over WAN and LAN. If you need a pair of virtual machines for better testing, let me know and I will try to set something up.

The only worry I would have is that this would detect netcat errors, but not tar/untar ones, although I'm not sure how likely those would be.
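
One possible shape for such a benchmark, offered as a sketch (the entry-point name, hosts, and invocation shape are assumptions; the option names are the ones discussed later in this ticket):

# Time each checksum mode against the same dataset (hypothetical invocation).
for opt in --no-checksum --checksum --parallel-checksum; do
    time transferpy $opt source.example:/srv/dataset target.example:/srv/
done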

I think it is a good starting point. I suggest you do some benchmarking (it doesn't need to be implemented in code yet) of how expensive this strategy would be compared to the current method and compared to no checksum, to understand the impact/improvement.

I will do that.

You should test it with both a small number of files and a large number (we run this with up to 200K files and around 2TB of data), over WAN and LAN. If you need a pair of virtual machines for better testing, let me know and I will try to set something up.

It would be great if I could get a pair of virtual machines.

The only worry I would have is that this would detect netcat errors, but not tar/untar ones, although I'm not sure how likely those would be.

If netcat results in an error, the transfer should also fail anyway, right? (I don't know if I correctly understood your point :D)

It would be great if I could get a pair of virtual machines.

I will ask what I can get you.

If netcat results in an error, the transfer should also fail anyway, right? (I don't know if I correctly understood your point :D)

By errors here I mean "wrong data was transferred". I wonder how likely that is to happen, versus "corruption happened while writing to disk/decompressing", but in both cases without a literal error being produced (silent changes). We could do some testing around that.

The issue with only checksumming the netcat stream is that tar/untar, gzip/gunzip, disk, or software bugs could go undetected. The current method is slow, but it would likely catch some of those.

Oh okay, how about giving the user a choice?

  • Checksum in parallel with the transfer (document the issues we find in testing)
  • Checksum after the transfer (document the delay issues)

I tried incorporating the parallel md5sum into the code, but it is not working as expected!

Test 1:
I get the correct checksum when I run the send and receive commands in the terminal, but running them through cumin gives a different checksum. Please see the images below to understand my situation:

Terminal output (after running the command in the terminal):

correct_md5.png (741×1 px, 119 KB)

When using cumin, transferpy shows a success message and the file is transferred correctly:

pycharmOutput.png (741×1 px, 162 KB)

but the recorded md5sum is wrong (terminal output, after running via cumin):
wrong_md5_cumin.png (291×390 px, 28 KB)

Test 2:
If cumin writes the md5sum to a file and finishes there, the md5sum is correct:
PyCharm transferpy run output:

pycharmWithoutTransfer.png (741×1 px, 155 KB)

Correct md5sum (Terminal output):
terminalCorrect2.png (223×488 px, 27 KB)

Question:
Any idea about this issue?
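
One possible explanation worth checking, offered only as an assumption based on general bash behavior (not confirmed in this thread): a >(…) process substitution runs asynchronously, so the invoking shell can return before md5sum has finished writing its output file, and reading that file immediately yields stale or partial content. A sketch of the pitfall and an explicit-FIFO workaround (paths are placeholders):

# Pitfall: the >(...) subshell is asynchronous; the parent pipeline can
# finish, and the caller read /tmp/sum, before md5sum has written it.
tar cf - /srv/data | tee >(md5sum > /tmp/sum) > /dev/null

# Workaround: feed md5sum through a named pipe and wait for it explicitly.
mkfifo /tmp/ck
md5sum < /tmp/ck > /tmp/sum &
tar cf - /srv/data | tee /tmp/ck > /dev/null
wait    # md5sum is now guaranteed to have finished writing /tmp/sum
rm /tmp/ck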

Change 605851 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/transferpy@master] transferpy: Generate checksum parallel to the data transfer

https://gerrit.wikimedia.org/r/605851

(Machine spec: 2nd-gen i5 with a SATA HDD and 6GB DDR3 RAM)

I have run my patch (in the following order) on data of size 6928532418 bytes (6.5GB):

Without either checksum
real 6m18.774s
user 0m0.805s
sys 0m0.219s

With checksum
real 8m58.570s
user 0m1.292s
sys 0m0.324s

With parallel-checksum
real 5m12.291s
user 0m0.883s
sys 0m0.254s

I think the parallel checksum took less time due to disk caching or something like that, so I ran it again with another file, of size 7487166128 bytes (7GB), in the reverse order.
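
As an aside, one standard way to take the page cache out of the equation between runs (a general Linux technique, not something done in this thread) is to flush it explicitly before each measurement:

# Flush dirty pages to disk, then drop the page cache, dentries and
# inodes so the next run starts cold (requires root).
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches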

With parallel-checksum
real 5m50.813s
user 0m1.218s
sys 0m0.271s

With checksum
real 9m21.275s
user 0m1.365s
sys 0m0.287s

Without either checksum
real 5m23.072s
user 0m1.098s
sys 0m0.222s

This clearly shows that the parallel checksum is better than the normal checksum. Also, the small difference between no checksum at all and the parallel checksum may encourage users to use the parallel-checksum option!

What do you think?

As you correctly assume, those numbers may be misleading due to the filesystem cache plus parallelism behavior in memory.

We should test with larger filesets to increase the number of IOPS. Hopefully by today's meeting I will be able to provide you with a way.

I have run benchmarks on the new cloud test machines.
bigfile: 1.4TB
manySmallFiles300: 293GB (150,000 files)

I ran the benchmarks in the following order, running --parallel-checksum first so that no caching would affect its time.

bigfile: parallel-checksum
real 954m11.743s
user 0m7.238s
sys 0m1.087s

manySmallFiles300: parallel-checksum
real 196m1.839s
user 0m3.150s
sys 0m0.823s

bigfile: checksum
real 1440m41.371s
user 0m9.363s
sys 0m1.637s

manySmallFiles300: checksum
real 336m47.861s
user 5m36.720s
sys 0m50.243s

bigfile: nochecksum
real 962m27.799s
user 0m7.929s
sys 0m1.346s

manySmallFiles300: nochecksum
real 197m40.573s
user 0m3.044s
sys 0m0.581s

A preliminary result from this suggests that the parallel checksum should be possible to disable, but enabled by default (unless CPU usage increases a lot).

Now it would be nice to try to make the existing checksum method faster with some kind of parallelism, since, as we already discussed, the parallel checksum does not replace all the checks that the non-parallel checksum can perform.
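
For reference, one simple way to parallelize a per-file checksum pass is to fan md5sum out with xargs; this is a sketch only (path and job count are illustrative), not the approach taken in the patches on this ticket:

# Hash files in parallel: up to 4 md5sum processes, each handed batches
# of up to 64 NUL-delimited file names.
find /srv/data -type f -print0 | xargs -0 -P 4 -n 64 md5sum > /tmp/source_sums

# Parallel workers finish in nondeterministic order, so sort by path
# before comparing against the list produced on the other host.
sort -k2 /tmp/source_sums -o /tmp/source_sums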

A preliminary result from this suggests that the parallel checksum should be possible to disable, but enabled by default (unless CPU usage increases a lot).

Enabling it automatically could cause issues like: what if load hits the servers after we start a transferpy parallel checksum? I feel like letting the user enable it would be the best idea, right?

Now it would be nice to try to make the existing checksum method faster with some kind of parallelism, since, as we already discussed, the parallel checksum does not replace all the checks that the non-parallel checksum can perform.

Yeah, I will try the source parallel checksum, as we discussed in the last meeting.

I feel like letting the user enable it would be the best idea, right?

Fair enough.

I will try the source parallel checksum, as we discussed in the last meeting.

Thanks.

Sorry, I forgot to give the sysbench outputs.

sysbench --test=fileio --file-total-size=150G prepare
161061273600 bytes written in 373.16 seconds (411.62 MiB/sec).

sysbench --test=fileio --file-total-size=150G --file-test-mode=rndrw --max-time=1200 --max-requests=0 run

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Extra file open flags: (none)
128 files, 1.1719GiB each
150GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!

File operations:

reads/s:                      178.97
writes/s:                     119.32
fsyncs/s:                     381.82

Throughput:

read, MiB/s:                  2.80
written, MiB/s:               1.86

General statistics:

total time:                          1200.1682s
total number of events:              816118

Latency (ms):

min:                                    0.01
avg:                                    1.47
max:                                  648.84
95th percentile:                        4.91
sum:                              1197039.08

Threads fairness:

events (avg/stddev):           816118.0000/0.00
execution time (avg/stddev):   1197.0391/0.00

sysbench --test=fileio --file-total-size=150G --file-test-mode=seqrewr --max-time=1200 --max-requests=0 run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Extra file open flags: (none)
128 files, 1.1719GiB each
150GiB total file size
Block size 16KiB
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential rewrite test
Initializing worker threads...

Threads started!

File operations:

reads/s:                      0.00
writes/s:                     3021.49
fsyncs/s:                     3867.55

Throughput:

read, MiB/s:                  0.00
written, MiB/s:               47.21

General statistics:

total time:                          1200.0295s
total number of events:              8266972

Latency (ms):

min:                                    0.01
avg:                                    0.14
max:                                  274.26
95th percentile:                        0.10
sum:                              1185406.15

Threads fairness:

events (avg/stddev):           8266972.0000/0.00
execution time (avg/stddev):   1185.4061/0.00

sysbench --test=fileio --file-total-size=150G --file-test-mode=seqwr --max-time=1200 --max-requests=0 run

sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Extra file open flags: (none)
128 files, 1.1719GiB each
150GiB total file size
Block size 16KiB
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Initializing worker threads...

Threads started!

File operations:

reads/s:                      0.00
writes/s:                     2790.97
fsyncs/s:                     3572.47

Throughput:

read, MiB/s:                  0.00
written, MiB/s:               43.61

General statistics:

total time:                          1200.0431s
total number of events:              7636320

Latency (ms):

min:                                    0.01
avg:                                    0.16
max:                                  504.83
95th percentile:                        0.11
sum:                              1185290.39

Threads fairness:

events (avg/stddev):           7636320.0000/0.00
execution time (avg/stddev):   1185.2904/0.00

sysbench --test=fileio --file-total-size=150G --file-test-mode=seqrd --max-time=1200 --max-requests=0 run

sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Extra file open flags: (none)
128 files, 1.1719GiB each
150GiB total file size
Block size 16KiB
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential read test
Initializing worker threads...

Threads started!

File operations:

reads/s:                      5391.02
writes/s:                     0.00
fsyncs/s:                     0.00

Throughput:

read, MiB/s:                  84.23
written, MiB/s:               0.00

General statistics:

total time:                          1200.0326s
total number of events:              6469448

Latency (ms):

min:                                    0.00
avg:                                    0.18
max:                                   52.13
95th percentile:                        1.58
sum:                              1188082.93

Threads fairness:

events (avg/stddev):           6469448.0000/0.00
execution time (avg/stddev):   1188.0829/0.00

sysbench --test=fileio --file-total-size=150G --file-test-mode=seqrewr --max-time=1200 --max-requests=0 run

sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Extra file open flags: (none)
128 files, 1.1719GiB each
150GiB total file size
Block size 16KiB
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential rewrite test
Initializing worker threads...

Threads started!

File operations:

reads/s:                      0.00
writes/s:                     2929.40
fsyncs/s:                     3749.70

Throughput:

read, MiB/s:                  0.00
written, MiB/s:               45.77

General statistics:

total time:                          1200.0471s
total number of events:              8015150

Latency (ms):

min:                                    0.01
avg:                                    0.15
max:                                  248.01
95th percentile:                        0.11
sum:                              1185537.02

Threads fairness:

events (avg/stddev):           8015150.0000/0.00
execution time (avg/stddev):   1185.5370/0.00

Change 608640 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/transferpy@master] Transferer.py: Calculate source checksum parallel to the data transfer

https://gerrit.wikimedia.org/r/c/operations/software/transferpy/+/608640


@jcrespo Can you please tell me a way to corrupt the source socket in xtrabackup? By corruption, I mean introducing some changes to the data. For example, in the case of a file transfer I would do something like this (a command list as executed by transferpy's remote executor):

["/bin/bash", "-c", r'"echo corruption >> {}"'.format(self.transferer.source_path)]

ERROR: file was not found on the target path /home/privacybatm/testing/xtrabackup_info after transfer

That is how xtrabackup backups are checked: they expect an xtrabackup_info file in the final dir; otherwise the backup hasn't completed correctly.

a way to corrupt the source socket in xtrabackup

It depends on how you want to do it. The files are not backed up directly; a process generates them, rather than the files just being moved away. In the current installation, data is copied from /srv/sqldata. You can write there, but whether or not that corrupts anything will depend on how it is written. You could also end up with a corrupt mysql instance and have to regenerate it.

If you shut down mysql (sudo systemctl stop mariadb) and copy /srv/sqldata away before corrupting it, you will be able to recover the damaged mariadb data directory easily.
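
A sketch of that procedure (the target file and byte offset are arbitrary illustrations of a "silent" change that raises no I/O error; adjust to taste):

# Stop mariadb and keep a pristine copy of the data directory.
sudo systemctl stop mariadb
sudo cp -a /srv/sqldata /srv/sqldata.orig

# Flip one byte mid-file to simulate silent corruption.
printf '\xff' | sudo dd of=/srv/sqldata/ibdata1 bs=1 seek=8192 count=1 conv=notrunc

# To recover afterwards, restore the pristine copy:
#   sudo rm -rf /srv/sqldata && sudo mv /srv/sqldata.orig /srv/sqldata
#   sudo systemctl start mariadb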

I have run the source multiprocess checksum and the results are given below:

bigfile: checksum
real 1440m41.371s
user 0m9.363s
sys 0m1.637s

bigfile: multiprocess checksum
real 1201m29.280s
user 0m9.798s
sys 0m1.552s

manySmallFiles300: checksum
real 336m47.861s
user 5m36.720s
sys 0m50.243s

manySmallFiles300: multiprocess checksum
real 268m40.175s
user 4m14.050s
sys 0m33.618s

I have changed the logic a bit to resolve the stuck issue I reported in the patch: https://gerrit.wikimedia.org/r/c/608640

I have written code for parallel data transfer using multiprocessing. I have benchmarked it on our test machines and the results are given below:

parallel jobs = 1, data=500GB(One file)
real 321m25.202s
user 0m3.825s
sys 0m0.769s

parallel jobs = 2, data=500GB(One file)
real 642m35.490s
user 0m9.898s
sys 0m2.193s

parallel jobs = 1, data=300GB(150K files)
real 197m40.573s
user 0m3.044s
sys 0m0.581s

parallel jobs = 2, data=300GB(150K files)
real 395m25.443s
user 0m6.942s
sys 0m1.391s

Clearly it is not giving any performance improvement in a normal scenario, so I would prefer to work on other issues that give a real improvement there. One task I have in mind:

  • Polish the packaging so that we can have two directories reserved for transferpy (one for configs, the other for temporary files like checksums, port-reservation files, etc.)

What do you think?

Currently I have paused the work related to multiprocessing and am looking to add more tests to improve transferpy's test coverage.

I don't think this setup is adequate for testing parallelism, given we only have one host to transfer to (in parallel). I believe this could be much more interesting when using a 10Gb host with multiple 1Gb targets, plus it would help a lot with target checksum parallelism (which is the use case I mentioned to you in our meeting). Did you create a prototype for this or did you run a command manually? If you wrote some code (even if it is not good enough), I would like to see it so I can test it on my own.

If you don't want to focus on this and would rather work on the package instead, that is more than OK with me, but I would like to explore it further on my own with our production setup. Technically this ticket is only about checksum parallelism, so that is in any case out of scope here.

I don't think this setup is adequate for testing parallelism, given we only have one host to transfer to (in parallel). I believe this could be much more interesting when using a 10Gb host with multiple 1Gb targets, plus it would help a lot with target checksum parallelism (which is the use case I mentioned to you in our meeting). Did you create a prototype for this or did you run a command manually? If you wrote some code (even if it is not good enough), I would like to see it so I can test it on my own.

I wrote a couple of lines of code to make it parallel and have just pushed it here (it is really not good code :D): https://gerrit.wikimedia.org/r/c/operations/software/transferpy/+/610750/

I agree with you; we may need better infrastructure to understand the effect of this multiprocessing.

If you don't want to focus on this and would rather work on the package instead, that is more than OK with me, but I would like to explore it further on my own with our production setup. Technically this ticket is only about checksum parallelism, so that is in any case out of scope here.

Actually, the tests with --checksum do interest me, but for the coming week I would like to focus on automatic folder creation (/var/lib/transferpy) and moving the temp files there, plus a folder in /etc and an empty config file for future use. After that and the tests (which I am listing now in my personal notes), I can write some actual multiprocessing code, so that we can check it on some good infrastructure. What do you think?

But for the coming week I would like to focus on automatic folder creation (/var/lib/transferpy) and moving the temp files there

Sounds good to me, create tickets for that.

Change 608640 merged by jenkins-bot:
[operations/software/transferpy@master] Transferer.py: Calculate source checksum parallel to the data transfer

https://gerrit.wikimedia.org/r/608640

Change 605851 merged by jenkins-bot:
[operations/software/transferpy@master] transferpy: Generate checksum parallel to the data transfer

https://gerrit.wikimedia.org/r/605851

So, the largest issue is how the options work, which makes them very confusing:

If I do --no-checksum, I expect not to get any checksum; however, I get a parallel checksum.
If I do --parallel-checksum, I expect to get a parallel checksum; however, I get a normal checksum.

So, the largest issue is how the options work, which makes them very confusing:

If I do --no-checksum, I expect not to get any checksum; however, I get a parallel checksum.
If I do --parallel-checksum, I expect to get a parallel checksum; however, I get a normal checksum.

This is indeed a problem! It is resolved by this patch: https://gerrit.wikimedia.org/r/c/operations/software/transferpy/+/613128
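
For context, a hypothetical illustration of what such a configuration file could look like (the path, section, and key names here are assumptions for illustration, not necessarily what the patch implements):

# Write an illustrative config; the real keys are defined by the patch above.
sudo tee /etc/transferpy/transferpy.conf > /dev/null <<'EOF'
[DEFAULT]
checksum = True
parallel-checksum = False
EOF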

Change 613128 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/transferpy@master] Make transferpy configurable using a configuration file

https://gerrit.wikimedia.org/r/613128

I have just updated the commit message so that it is visible here!

Change 613128 merged by jenkins-bot:
[operations/software/transferpy@master] Make transferpy configurable using a configuration file

https://gerrit.wikimedia.org/r/613128

I think we can close this! What do you think?