
transfer.py fails when copying data between es hosts
Open, MediumPublic

Description

There have been 3 attempts to copy data from es2014 to es2026, and all three of them failed:
the first one after around 1TB, the second one after 8.1TB and the third one after around 70GB.

The error shown was:

WARNING: Firewall's temporary rule could not be deleted
ERROR: Copy from es2014.codfw.wmnet:/srv/sqldata to es2026.codfw.wmnet:/srv/ failed

However, the nc processes were still open and data was still being transferred: even though the error was shown and transfer.py had stopped, netcat processes remained alive on both hosts, source and target.
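A quick way to spot and clean up those leftover listeners by hand (the port and nc options are the ones transfer.py uses on these hosts; double-check the match before killing anything):

# is anything still listening on the transfer port?
ss -tlnp | grep ':4400'
# show the leftover pipeline processes on source and target
ps aux | grep '[n]c -l -w 300 -p 4400'
# kill them by matching the full command line
pkill -f '/bin/nc -l -w 300 -p 4400'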

At the moment the transfer is ongoing on es2026, to a different directory than /srv/sqldata, as there is a suspicion that the failure might be related to puppet runs, specifically to:

  • Ferm being reloaded
  • chmod/chown changing /srv/sqldata

The amount of data that needs to be transferred is about 9TB so it takes several hours.
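A cheap way to check the puppet suspicion is to compare the failure timestamp against puppet's changes to /srv/sqldata and against ferm reloads on the target (this assumes puppet-agent and ferm log to syslog/journald on these hosts):

# puppet changes to the destination dir around the failure time
grep -E 'puppet-agent.*(sqldata|Ferm)' /var/log/syslog | tail -n 20
# ferm restarts/reloads
journalctl -u ferm --since today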

There are many upcoming transfers planned to populate the new es hosts for eqiad and codfw.

Event Timeline

Marostegui triaged this task as Medium priority. Sep 9 2020, 9:40 AM
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

I may have missed it in the conversation history, but did we try a smaller transfer between these particular hosts to see if it completes? Also, when was the last time a transfer between them worked, and what size was it (if that sort of information is retained anywhere)?

The error seems to be es2026-only (so far), as similar transfers (e.g. all snapshot backups) happened without errors. However, there are several things that could affect only this transfer:

  • This is a new host (new config? new hw?)
  • This is buster (however other transfers to buster seem to work)
  • This is a very large transfer (this is the one thing unique to it)

However, the fact that the transfer continued after erroring out points to some concurrency issue: while it is possible for the backup processes to keep listening (cumin/RemoteExecution cannot easily kill child processes), no more bytes should keep being received.

Other options include failures due to puppet runs interfering with the transfer. However, long-running transfers on other hosts have so far not reproduced the failure (even if there should be some fixes on how iptables is handled).
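As a side note, while a transfer is running one can verify on the target whether the temporary ACCEPT rule is (still) in place; the exact rule transfer.py inserts is truncated in the logs below, so the match here is a guess based on the port:

# -C returns 0 if an identical rule exists in the chain
iptables -C INPUT -p tcp --dport 4400 -j ACCEPT && echo present || echo missing
# or simply list anything mentioning the port
iptables -S | grep 4400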

Currently I have 2 transfers towards buster hosts with similar options:

  • One finished correctly after 1h (during which puppet ran and reloaded the firewall)
  • The other is a large transfer (6TB) which has been running for 1h and is still ongoing

More food for thought:

One of the errors was pretty close in time to a puppet run and sort of matched it:

Sep  9 07:58:20 es2026 puppet-agent[21980]: (/Stage[main]/Mariadb::Config/File[/srv/sqldata]/owner) owner changed 'root' to 'mysql'
Sep  9 07:58:20 es2026 puppet-agent[21980]: (/Stage[main]/Mariadb::Config/File[/srv/sqldata]/group) group changed 'root' to 'mysql'
Sep  9 07:58:20 es2026 puppet-agent[21980]: (/Stage[main]/Mariadb::Config/File[/srv/sqldata]/mode) mode changed '0700' to '0755'
Sep  9 07:58:24 es2026 puppet-agent[21980]: (/Stage[main]/Ferm/Service[ferm]/ensure) ensure changed 'stopped' to 'running'
Sep  9 07:58:24 es2026 puppet-agent[21980]: (/Stage[main]/Ferm/Service[ferm]) Unscheduling refresh on Service[ferm]

packet_write_wait: Connection to 10.192.0.141 port 22: Broken pipe

About to transfer /srv/sqldata from es2014.codfw.wmnet to ['es2026.codfw.wmnet']:['/srv/'] (9366738014294 bytes)
----- OUTPUT of 'netstat -altn | ...[0-9]$ || echo 0' -----
0
================
PASS |███████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 2.48hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'netstat -altn | ...[0-9]$ || echo 0'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/sbin/iptables -...t 4400 -j ACCEPT' -----
================
PASS |███████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 2.56hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/sbin/iptables -...t 4400 -j ACCEPT'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/bash -c "cd...odfw.wmnet 4400"' -----
================
PASS | | 0% (0/1) [25:01<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████| 100% (1/1) [25:01<00:00, 1501.78s/hosts]
100.0% (1/1) of nodes failed to execute command '/bin/bash -c "cd...odfw.wmnet 4400"': es2014.codfw.wmnet
100.0% (1/1) of nodes failed to execute command '/bin/bash -c "cd...odfw.wmnet 4400"': es2014.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/bash -c "cd...odfw.wmnet 4400"'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/bash -c "cd...odfw.wmnet 4400"'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
----- OUTPUT of '/sbin/iptables -...t 4400 -j ACCEPT' -----
iptables: Bad rule (does a matching rule exist in that chain?).
================
PASS | | 0% (0/1) [00:00<?, ?hosts/s]
FAIL |███████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 2.09hosts/s]
100.0% (1/1) of nodes failed to execute command '/sbin/iptables -...t 4400 -j ACCEPT': es2026.codfw.wmnet
100.0% (1/1) of nodes failed to execute command '/sbin/iptables -...t 4400 -j ACCEPT': es2026.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/sbin/iptables -...t 4400 -j ACCEPT'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/sbin/iptables -...t 4400 -j ACCEPT'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
WARNING: Firewall's temporary rule could not be deleted
ERROR: Copy from es2014.codfw.wmnet:/srv/sqldata to es2026.codfw.wmnet:/srv/ failed

I made a 6TB copy to rule out the large size as the cause (large transfers are not tested a lot for practical reasons) and it worked ok:

root@cumin1001:~$ transfer.py backup1002.eqiad.wmnet:/srv/backups/dumps/latest/dump.es3.2020-05-20--03-46-51 dbprov1003.eqiad.wmnet:/srv/backups/tmp
2020-09-09 08:47:03  INFO: About to transfer /srv/backups/dumps/latest/dump.es3.2020-05-20--03-46-51 from backup1002.eqiad.wmnet to ['dbprov1003.eqiad.wmnet']:['/srv/backups/tmp'] (6301227429901 bytes)
2020-09-09 14:35:46  WARNING: Original size is 6301227429901 but transferred size is 6301227380749 for copy to dbprov1003.eqiad.wmnet
2020-09-09 14:35:47  INFO: Parallel checksum of source on backup1002.eqiad.wmnet and the transmitted ones on dbprov1003.eqiad.wmnet match.
2020-09-09 14:35:48  INFO: 6301227380749 bytes correctly transferred from backup1002.eqiad.wmnet to dbprov1003.eqiad.wmnet
2020-09-09 14:35:49  WARNING: Firewall's temporary rule could not be deleted
2020-09-09 14:35:49  INFO: Cleaning up....

It took only 5 hours because both hosts were on 10G, and the parallel checksum rules out any network corruption or data loss.
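For context, the "parallel" checksum is computed on the byte stream itself while it is being sent and received, so a truncated or corrupted stream shows up as a mismatch. A rough sketch of the idea (run under bash; the target-side pipeline is the one transfer.py sets up, the source-side counterpart here is an assumption):

# source: checksum the compressed stream as it leaves the host
cd /srv && tar cf - sqldata | pigz -c | tee >(md5sum > /tmp/source.md5sum) | nc es2026.codfw.wmnet 4400
# target: checksum the stream as it arrives, before decompressing
cd /srv && /bin/nc -l -w 300 -p 4400 | tee >(md5sum > /tmp/target.md5sum) | /usr/bin/pigz -c -d | /bin/tar xf -
# afterwards both files should contain the same hash, even when the size estimate and
# the transferred byte count differ slightly as in the run above
diff /tmp/source.md5sum /tmp/target.md5sum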

After having tested the firewall extensively in other tests, and having seen a transfer to an es host with puppet enabled succeed when copying to a different dir, I believe this is indeed related to puppet runs: mid-transfer they alter both the mode and the owner of /srv/sqldata, which likely makes writes fail either for some small amount of time or under some conditions (writes to the top-level dir, not to a subdir). This may only show up on es copies because:

  • Backups don't copy a file/dir directly handled by puppet, so no permission change happens
  • Recoveries copy the contents, but the sqldata dir itself is not part of the transfer (so its permissions are not altered)
  • Clones would only show the issue under certain circumstances, namely a puppet run while top-level files or dirs are being written, which is more likely on a long-running transfer (12h for es vs <1h for other transfers)

There are two things to do here:

  • Confirm this is the issue by reproducing it elsewhere (a minimal attempt is sketched after this list)
  • If it is the issue, try to avoid it by changing either how puppet handles the directory or how transfer.py works (if technically possible)
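A minimal local attempt at the first point (all paths are hypothetical; the owner/mode values mirror what the puppet run did in the log above):

# terminal 1: flip the destination dir's owner and mode in a loop, like the puppet run does
mkdir -p /srv/test_sqldata
while true; do
  chown root:root /srv/test_sqldata; chmod 0700 /srv/test_sqldata; sleep 5
  chown mysql:mysql /srv/test_sqldata; chmod 0755 /srv/test_sqldata; sleep 5
done

# terminal 2: generate sustained writes into that dir (/srv/big.tar.gz is any large archive)
tar xzf /srv/big.tar.gz -C /srv/test_sqldata

# if the race is real, the extraction should error out at some point; if it never does,
# the simplest version of the hypothesis is probably wrong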

For now, to avoid the race condition, the workaround is to copy into a directory whose permissions puppet does not manage and then move the data into place atomically (a directory rename), so that puppet and the transfer do not step on each other.
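A sketch of that workaround, using a staging dir like /srv/transfer as an example (assumes /srv/sqldata does not exist yet on the target, or has been removed first; owner/mode are the values puppet enforces):

# 1. copy into a staging dir whose permissions puppet does not manage
transfer.py es2014.codfw.wmnet:/srv/sqldata es2026.codfw.wmnet:/srv/transfer
# 2. on the target, move the result into place; a rename within the same filesystem is atomic
mv /srv/transfer/sqldata /srv/sqldata
# 3. set the ownership/mode puppet expects anyway
chown -R mysql:mysql /srv/sqldata
chmod 0755 /srv/sqldata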

For what it's worth, there was an error today in the es2017 to es2027 transfer, after 3.5TB:

root@cumin2001:~# transfer.py es2017.codfw.wmnet:/srv/sqldata es2027.codfw.wmnet:/srv/transfer
2020-09-14 08:08:23 INFO: About to transfer /srv/sqldata from es2017.codfw.wmnet to ['es2027.codfw.wmnet']:['/srv/transfer'] (9405928213557 bytes)
2020-09-14 13:24:57 ERROR: Copy from es2017.codfw.wmnet:/srv/sqldata to es2027.codfw.wmnet:/srv/transfer failed
WARNING: /tmp/trnsfr_es2027.codfw.wmnet_4400.lock temporary directory could not be deleted
2020-09-14 13:24:57 WARNING: Firewall's temporary rule could not be deleted
2020-09-14 13:24:57 INFO: Cleaning up....

The only puppet runs around those times at the destination host (es2027) are:

  • Sep 14 12:59:41
  • Sep 14 13:30:00

The transfer failed again, after 8TB:

root@cumin2001:~# transfer.py es2017.codfw.wmnet:/srv/sqldata es2027.codfw.wmnet:/srv/transfer
2020-09-14 14:33:26  INFO: About to transfer /srv/sqldata from es2017.codfw.wmnet to ['es2027.codfw.wmnet']:['/srv/transfer'] (9405928213557 bytes)
2020-09-15 04:51:59  ERROR: Copy from es2017.codfw.wmnet:/srv/sqldata to es2027.codfw.wmnet:/srv/transfer failed
WARNING: /tmp/trnsfr_es2027.codfw.wmnet_4400.lock temporary directory could not be deleted
2020-09-15 04:52:00  WARNING: Firewall's temporary rule could not be deleted
2020-09-15 04:52:00  INFO: Cleaning up....

netcat processes are still up on both hosts.
Puppet runs:

on target:

  • Sep 15 04:29:36
  • Sep 15 04:59:03

On source:

  • Sep 15 04:28:38
  • Sep 15 04:58:04

I am going to start the transfer again between those two hosts but just using a screen and netcat directly, and see what happens.
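For reference, a sketch of what that manual run can look like (the receiving pipeline mirrors the one transfer.py itself sets up; the sending side is the assumed symmetric counterpart):

# on es2027, inside a screen session: listen and unpack
cd /srv/transfer && /bin/nc -l -w 300 -p 4400 | /usr/bin/pigz -c -d | /bin/tar xf -
# on es2017, inside a screen session: pack, compress and push
cd /srv && tar cf - sqldata | pigz -c | nc es2027.codfw.wmnet 4400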

This just happened:

# transfer.py --no-checksum --no-encrypt db1106.eqiad.wmnet:/srv/sqldata clouddb1017.eqiad.wmnet:/srv
2020-12-01 13:42:02  INFO: About to transfer /srv/sqldata from db1106.eqiad.wmnet to ['clouddb1017.eqiad.wmnet']:['/srv'] (1214419045152 bytes)
2020-12-01 14:53:05  ERROR: Copy from db1106.eqiad.wmnet:/srv/sqldata to clouddb1017.eqiad.wmnet:/srv failed
WARNING: /tmp/trnsfr_clouddb1017.eqiad.wmnet_4400.lock temporary directory could not be deleted
2020-12-01 14:53:06  WARNING: Firewall's temporary rule could not be deleted
2020-12-01 14:53:06  INFO: Cleaning up....

And the same on the other transfer:

# transfer.py --no-checksum --no-encrypt db1106.eqiad.wmnet:/srv/sqldata clouddb1013.eqiad.wmnet:/srv
2020-12-01 13:41:21  INFO: About to transfer /srv/sqldata from db1106.eqiad.wmnet to ['clouddb1013.eqiad.wmnet']:['/srv'] (1214419045152 bytes)
2020-12-01 14:52:49  ERROR: Copy from db1106.eqiad.wmnet:/srv/sqldata to clouddb1013.eqiad.wmnet:/srv failed
WARNING: /tmp/trnsfr_clouddb1013.eqiad.wmnet_4400.lock temporary directory could not be deleted
2020-12-01 14:52:50  WARNING: Firewall's temporary rule could not be deleted
2020-12-01 14:52:50  INFO: Cleaning up....

So it might be either something with cumin1001 or with the source host, as those are the two things both transfers had in common.

However, the processes are still there on cumin1001:

root     12196  0.0  0.0  15820  6376 pts/16   S    13:41   0:00 ssh -F /etc/cumin/ssh_config -oForwardAgent=no -oForwardX11=no -oConnectTimeout=10 -oBatchMode=yes clouddb1013.eqiad.wmnet /bin/bash -c "cd /srv && /bin/nc -l -w 300 -p 4400  | tee >(md5sum > /tmp/trnsfr_clouddb1013.eqiad.wmnet_4400.lock/transferrer_target.md5sum) | /usr/bin/pigz -c -d | /bin/tar xf -"
root     14050  0.0  0.0  15820  6372 pts/20   S    13:42   0:00 ssh -F /etc/cumin/ssh_config -oForwardAgent=no -oForwardX11=no -oConnectTimeout=10 -oBatchMode=yes clouddb1017.eqiad.wmnet /bin/bash -c "cd /srv && /bin/nc -l -w 300 -p 4400  | tee >(md5sum > /tmp/trnsfr_clouddb1017.eqiad.wmnet_4400.lock/transferrer_target.md5sum) | /usr/bin/pigz -c -d | /bin/tar xf -"

And on the hosts themselves:

root@clouddb1013:/srv# ps aux | grep nc
systemd+   848  0.0  0.0  93084  7072 ?        Ssl  Oct30   0:04 /lib/systemd/systemd-timesyncd
root      9622  0.0  0.0   6644  3152 ?        Ss   13:41   0:00 /bin/bash -c cd /srv && /bin/nc -l -w 300 -p 4400  | tee >(md5sum > /tmp/trnsfr_clouddb1013.eqiad.wmnet_4400.lock/transferrer_target.md5sum) | /usr/bin/pigz -c -d | /bin/tar xf -
root      9623 10.8  0.0   2372  1792 ?        S    13:41   8:11 /bin/nc -l -w 300 -p 4400
root      9627  0.0  0.0   6644  2116 ?        S    13:41   0:00 /bin/bash -c cd /srv && /bin/nc -l -w 300 -p 4400  | tee >(md5sum > /tmp/trnsfr_clouddb1013.eqiad.wmnet_4400.lock/transferrer_target.md5sum) | /usr/bin/pigz -c -d | /bin/tar xf -
root     30671  0.0  0.0   6076   892 pts/0    S+   14:57   0:00 grep nc

If some process is killing cumin1001's connections, the transfer will fail. The most likely cause is a temporary loss of connectivity between cumin and one of the hosts, which closes the ssh connection.
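If that is the suspicion, a couple of things worth checking from cumin1001 while a transfer is running (standard OpenSSH/iproute2 commands; the keepalive options are a suggestion for manual runs, not something transfer.py is known to set):

# are the ssh sessions carrying the transfers still established? (compare against the targets' IPs)
ss -tnp | grep ssh | grep ':22'
# for a manual run, keepalives make a silently dropped connection fail fast instead of hanging
ssh -F /etc/cumin/ssh_config -o ServerAliveInterval=60 -o ServerAliveCountMax=5 clouddb1013.eqiad.wmnet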

I just read the last comment; then that wouldn't explain it.

I started a transfer.py run through remote-backup-mariadb at 14:58:36, would that match?

BTW, you don't need "--no-checksum --no-encrypt", those options are by default on '/etc/transferpy/transferpy.conf'.
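To double-check what the local defaults actually are (just a grep; the key names are whatever the installed config file uses):

grep -iE 'checksum|encrypt' /etc/transferpy/transferpy.conf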

What should we do with this task then? Is it worth having it open, or should we just assume it can happen and we need to retry?

> I just read the last comment; then that wouldn't explain it.
>
> I started a transfer.py run through remote-backup-mariadb at 14:58:36, would that match?

It failed before that (T262388#6659947):

2020-12-01 14:52:49  ERROR: Copy from db1106.eqiad.wmnet:/srv/sqldata to clouddb1013.eqiad.wmnet:/srv failed
WARNING: /tmp/trnsfr_clouddb1013.eqiad.wmnet_4400.lock temporary directory could not be deleted

> What should we do with this task then? Is it worth having it open, or should we just assume it can happen and we need to retry?

I don't know why it is failing, I would need time to debug. @LSobanski

As discussed last week, this task needs to be estimated and scheduled. It's not currently considered to be a workflow blocker.

This could explain random network errors between some hosts: T274234

Another potential reason for errors is the service that cleans up temporary files (systemd-tmpfiles-clean.timer). transfer.py uses /tmp for a couple of reasons (locking, checksumming, xtrabackup temporary dir) and maybe there is some bad interaction there that is only observed on long-running transfers (e.g. overnight).
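Two quick checks for that hypothesis, plus a possible mitigation (the tmpfiles.d 'x' line excludes matching paths from cleanup; the drop-in file name is just a suggestion):

# when does the cleanup timer fire, and what age limit applies to /tmp?
systemctl list-timers systemd-tmpfiles-clean.timer
grep -rs '/tmp' /usr/lib/tmpfiles.d/tmp.conf /etc/tmpfiles.d/
# possible mitigation: exclude transfer.py's lock/checksum dirs from cleaning
echo 'x /tmp/trnsfr_*' > /etc/tmpfiles.d/transferpy.conf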