
Make WCQS/WDQS data transfer cookbook more reliable
Closed, Resolved · Public · 3 Estimated Story Points

Description

While completing T316236, we noticed that the data-transfer cookbook finished multiple times with no errors, yet the transfer was incomplete and the wcqs-streaming-updater service could not start.

Our first attempt involved adding a disk size sanity check, but the bigger problem is that the data-transfer cookbook does not reliably complete the transfer. There are probably network gremlins and other things outside our control, but we need to change what we do control (namely, the cookbook) to address these issues as much as possible.

AC:

  • Try different transfer methods (rsync?). The current approach is focused on speed; reliability would be better.
  • Add more logging/debug output so we can troubleshoot more quickly in the future.
  • Depooling from the cookbook might not respect the thresholds pybal normally heeds when pooling/depooling hosts. We should investigate this.
    • Further context: we had a situation where wcqs1002 was already depooled, and we then ran a transfer from wcqs1003 -> wcqs1001. This resulted in all three wcqs hosts being depooled, so no eqiad wcqs hosts were pooled during the window.

Event Timeline

Change 849145 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] query_service: sanity-check file size on data-transfer.py

https://gerrit.wikimedia.org/r/849145

Change 849145 merged by jenkins-bot:

[operations/cookbooks@master] query_service: sanity-check file size on data-transfer.py

https://gerrit.wikimedia.org/r/849145

We had a few transfers fail very quickly today. From #wikimedia-operations:

[16:46:44]  <+logmsgbot> !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[16:50:14]  <+logmsgbot> !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)

My first thought is that puppet is running and triggering a firewall rules reset that breaks the transfer. The code itself does appear to stop puppet, maybe?

But puppet was enabled on wcqs2001 when I checked just now, and I confirmed the cookbook didn't re-pool the hosts when it failed. So I don't think there's any magic that reverts alerting/pooling/puppet when a cookbook fails. Will investigate further tomorrow.

Regarding the code:
  • The disabled() puppet context manager will always re-enable puppet, even on error, because the normal state of puppet should be enabled. From puppetboard you can see that there was no puppet run around the time of the cookbook run.
  • The downtimed() alerting-hosts context manager doesn't remove the downtime on error by default, but can be told to do so with the remove_on_error parameter.
  • To fine-tune what happens on failure in a cookbook, you can also define the rollback() method (see also).
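For context, a minimal sketch of how those pieces fit together in a Spicerack class-API cookbook. This is not the actual sre.wdqs.data-transfer code: the class names, the query argument, and the elided method bodies are placeholders for illustration only.

from spicerack.cookbook import CookbookBase, CookbookRunnerBase


class DataTransfer(CookbookBase):
    """Hypothetical cookbook wrapper (argument_parser() omitted for brevity)."""

    def get_runner(self, args):
        return DataTransferRunner(args, self.spicerack)


class DataTransferRunner(CookbookRunnerBase):
    def __init__(self, args, spicerack):
        self.spicerack = spicerack
        self.hosts = spicerack.remote().query(args.dest)  # hypothetical argument name
        self.reason = spicerack.admin_reason('WDQS/WCQS data transfer')

    def run(self):
        alerting = self.spicerack.alerting_hosts(self.hosts.hosts)
        # remove_on_error=True lifts the downtime if run() raises,
        # instead of leaving the host silenced (the default behaviour).
        with alerting.downtimed(self.reason, remove_on_error=True):
            # disabled() always re-enables puppet on exit, error or not.
            with self.spicerack.puppet(self.hosts).disabled(self.reason):
                pass  # actual transfer logic elided

    def rollback(self):
        # Called by Spicerack only when run() raises: a natural place to
        # repool hosts or undo anything the context managers don't cover.
        pass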
As for the actual error:

From the logs the receiver returned exit code 3:

2022-10-25 21:50:10,978 bking 3128308 [DEBUG clustershell.py:783 in ev_hup] node=wcqs2002.codfw.wmnet, rc=3, command='nc -l -p 9876 | 
openssl enc -d -aes-256-cbc -k {...SNIP_OMITTED...} | pigz -c -d > /srv/query_service/wcqs.jnl'

And the code just logs the error and keeps going, so it seems a deliberate choice not to fail there. If it did fail, I guess the recently added size check would be superfluous.
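A minimal sketch of the fail-fast alternative, assuming the Spicerack remote API; the real cookbook runs sender and receiver concurrently, so this only illustrates the idea of aborting on a non-zero receiver exit code instead of logging and continuing:

from spicerack.remote import RemoteExecutionError

# Receiver pipeline as it appears in the cookbook logs (key omitted).
RECV_CMD = ("nc -l -p 9876 | openssl enc -d -aes-256-cbc -k <SNIP> "
            "| pigz -c -d > /srv/query_service/wcqs.jnl")


def receive(dest_hosts):
    """Run the receiver and abort the cookbook if it exits non-zero."""
    try:
        dest_hosts.run_sync(RECV_CMD)
    except RemoteExecutionError as exc:
        # Fail here instead of logging and carrying on, so a broken transfer
        # is reported immediately rather than by the later file-size check.
        raise RuntimeError(f'Receiver command failed on {dest_hosts}: {exc}') from exc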

As for ferm, this was restarted only by the cookbook AFAICT:

$ grep 'Stopping ferm firewall' /var/log/syslog.1
...SNIP...
Oct 25 21:41:52 wcqs2002 systemd[1]: Stopping ferm firewall configuration...
Oct 25 21:49:55 wcqs2002 systemd[1]: Stopping ferm firewall configuration...   # Cookbook entering the with open_port() context manager
Oct 25 21:50:11 wcqs2002 systemd[1]: Stopping ferm firewall configuration...   # Cookbook exiting the with open_port() context manager, after the error
$
P.S. LOG SPAM:

When the data transfer is in progress, the prometheus-blazegraph-exporter unit is quite spammy on syslog, logging ~1300 lines of Python stack traces every 30 seconds:

$ grep  'prometheus-blazegraph-exporter' /var/log/syslog.1 | grep '20:35:44' -c
1298
$ grep -c 'prometheus-blazegraph-exporter' /var/log/syslog.1
100185

The transfer cookbook continues to fail with the same errors noted above.

At this point, we've talked over the following options:

  • More logging for netcat
  • Using transfer.py or adapting it for our purposes
  • Use rsync/scp, which is much slower but more reliable
  • Copy to an object store first via swiftly or some other client
  • Copy the entire file to a tmpdir on the destination before moving it into place; this prevents the destination host from losing all its query service data (a sketch of this approach follows the list)
    • To do this, we'd need to switch to JBOD, which would require a decent amount of work.
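A minimal sketch of that tmpdir-then-move idea (a hypothetical helper, not existing cookbook code). Note it needs enough free space for a second copy of the journal on the same filesystem, which is why the JBOD change is mentioned:

import os
import tempfile


def receive_into_place(final_path, receive_func):
    """Stream the incoming journal into a temp file on the same filesystem,
    then atomically move it into place, so a failed transfer never clobbers
    the journal already on the destination host."""
    dest_dir = os.path.dirname(final_path)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix='.partial')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            receive_func(tmp)  # whatever writes the decrypted/decompressed stream
        os.replace(tmp_path, final_path)  # atomic within the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise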
bking renamed this task from Sanity-check disk size in query services data transfer cookbook to Make WCQS/WDQS data transfer cookbook more reliable. Oct 31 2022, 3:37 PM
bking updated the task description. (Show Details)

Notes from today's SRE meeting where I asked if anyone had any thoughts:

  • (dwisehaupt) ryan: you may want to test throwing a -q XXX on the netcat command if you haven't tried that. we were seeing some timeouts with data transfers and giving it a 100 second setting there helped. (update from future: we tried a one-off test with adding -q XXX to the receiver command, and still hit the same failure, so we probably want to just try the rsync approach)
  • (cdanis) Ryan, there's also some puppetization of the rsync daemon you can try out
    • (dzahn) it's called "rsync::quickdatacopy"

Further context for the rsync approach:

<mutante>
9:57 AM the summary is just.. there is a class called "rsync::quickdatacopy"
9:57 AM that's basically the standard in puppet for this kind of thing
9:58 AM where you want to transfer data between 2 or more servers
9:58 AM it has parameters like "auto_ferm"
9:58 AM which means it also does the needed firewall rules for you
9:58 AM and "stunnel" parameters to encrypt it if you need that
9:58 AM because this is real rsync:// protocol, not rsync over ssh
9:59 AM I would recommend letting this class do all the things, hosts_allow, ferm, the rsyncd daemon config etc
9:59 AM instead of trying to do all that manually
9:59 AM because it usually means you end up with disabled firewall and a mess to clean up..if you do it manually
9:59 AM but with the class it's pretty "clean"
10:00 AM and you can't "just rsync over ssh" because we cant forward agents, firewall is closed etc
10:00 AM you just have to know source host name, dest host name and path 
10:01 AM you can even enable "autosync" to add a timer that does it all the time in the background
10:01 AM or without auto_sync it will do all the things needed to make it possible to manually rsync.. without actually running a timer
10:02 AM 
<ryankemper> ack, I'll bring that context to our pairing meeting. I'll follow up with you afterwards if we can't figure it out during our pairing meeting, but I suspect we've got what we need :)
10:03 AM I think for our purposes the without-autosync option makes sense, since we only want to perform a transfer when a given host needs the new journal

Update:

  • Adding a timeout to the nc command did not help.

We will try the rsync strategy outlined above next.

We looked at the rsync class in Puppet, and it didn't seem like a great fit for our use case. Instead, we simplified the cookbook (removing openssl and pigz) and were able to complete the data transfer.

We'll keep this ticket open until these changes can be committed.

IMHO the encryption part can't just be removed. The cookbook is also used for cross-DC transfers, where encryption is AFAIK a requirement, and it's strongly encouraged for same-DC transfers as well.

AFAICT from SAL, the cookbook has been run successfully in the past, so what has changed in the meantime to cause this issue?
At the same time, I think that transfer.py is used daily without issues, and it uses pretty much the same tools.

Cross-DC encryption isn't optional and shouldn't be removed.

Regarding OpenSSL cipher suites: aes-256-cbc shouldn't be used anywhere in production nowadays, as CBC is vulnerable to padding oracle attacks. https://wikitech.wikimedia.org/wiki/Transfer.py currently uses chacha20 (and not chacha20-poly1305). chacha20 alone is faster than aes-256-gcm but not faster than aes-128-gcm; however, it only ensures confidentiality, while aes-(128|256)-gcm provides both confidentiality and authenticity. If you consider chacha20-poly1305 (where both confidentiality and authenticity are ensured), then aes-256-gcm should be chosen instead, as it's faster (thanks to the hardware implementation provided by AES-NI).

OpenSSL single-threaded benchmarks for aes-128-gcm, aes-256-gcm, chacha20 and chacha20-poly1305:
vgutierrez@cp5020:~$ openssl speed -evp aes-128-gcm
Doing aes-128-gcm for 3s on 16 size blocks: 98984079 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 63266501 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 34909080 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 1024 size blocks: 12490399 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 1871993 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 16384 size blocks: 940859 aes-128-gcm's in 3.00s
OpenSSL 1.1.1n  15 Mar 2022
built on: Fri Jun 24 20:07:00 2022 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-k6U0OK/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-gcm     527915.09k  1349685.35k  2978908.16k  4263389.53k  5111788.89k  5138344.62k
vgutierrez@cp5020:~$ openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: 88506987 aes-256-gcm's in 2.99s
Doing aes-256-gcm for 3s on 64 size blocks: 53360573 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 256 size blocks: 30021007 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 1024 size blocks: 10522239 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 8192 size blocks: 1574806 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 16384 size blocks: 805803 aes-256-gcm's in 3.00s
OpenSSL 1.1.1n  15 Mar 2022
built on: Fri Jun 24 20:07:00 2022 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-k6U0OK/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm     473615.98k  1138358.89k  2561792.60k  3591590.91k  4300270.25k  4400758.78k
vgutierrez@cp5020:~$ openssl speed -evp chacha20
Doing chacha20 for 3s on 16 size blocks: 66353577 chacha20's in 3.00s
Doing chacha20 for 3s on 64 size blocks: 28290196 chacha20's in 3.00s
Doing chacha20 for 3s on 256 size blocks: 26906428 chacha20's in 3.00s
Doing chacha20 for 3s on 1024 size blocks: 12356852 chacha20's in 3.00s
Doing chacha20 for 3s on 8192 size blocks: 1663576 chacha20's in 3.00s
Doing chacha20 for 3s on 16384 size blocks: 840369 chacha20's in 3.00s
OpenSSL 1.1.1n  15 Mar 2022
built on: Fri Jun 24 20:07:00 2022 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-k6U0OK/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
chacha20        353885.74k   603524.18k  2296015.19k  4217805.48k  4542671.53k  4589535.23k
vgutierrez@cp5020:~$ openssl speed -evp chacha20-poly1305
Doing chacha20-poly1305 for 3s on 16 size blocks: 43895625 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 64 size blocks: 23138776 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 256 size blocks: 18327745 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 1024 size blocks: 8294664 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 8192 size blocks: 1135730 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 16384 size blocks: 572043 chacha20-poly1305's in 3.00s
OpenSSL 1.1.1n  15 Mar 2022
built on: Fri Jun 24 20:07:00 2022 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-k6U0OK/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
chacha20-poly1305   234110.00k   493627.22k  1563967.57k  2831245.31k  3101300.05k  3124117.50k
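Taking the 16384-byte figures above, single-threaded throughput comes out to roughly 5.14 GB/s for aes-128-gcm, 4.59 GB/s for chacha20, 4.40 GB/s for aes-256-gcm and 3.12 GB/s for chacha20-poly1305, consistent with the recommendation above.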

Change 859007 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/transferpy@master] Transferer: Update encryption to use aes-128-gcm instead of chacha20

https://gerrit.wikimedia.org/r/859007

After a few tests with openssl enc, it looks like it doesn't support AEAD cipher suites, so chacha20 should be used instead of aes-256-cbc, taking into account that authenticity isn't guaranteed by openssl during the transfer process.

Change 859007 abandoned by Jcrespo:

[operations/software/transferpy@master] Transferer: Update encryption to use aes-128-gcm instead of chacha20

Reason:

not supported

https://gerrit.wikimedia.org/r/859007

IMHO the encryption part can't just be removed. The cookbook is also used for cross-DC transfers, where encryption is AFAIK a requirement, and it's strongly encouraged for same-DC transfers as well.

Agreed, this is not a best practice and we shouldn't make that a permanent change in the cookbook (I don't know why I said that earlier, must have run out of coffee ;) ). We did look at the rsync Puppet class and saw there was an option to disable encryption, so we figured it would be OK as a one-off (also considering we were transferring publicly-available data).

AFAICT from SAL, the cookbook has been run successfully in the past, so what has changed in the meantime to cause this issue?
At the same time, I think that transfer.py is used daily without issues, and it uses pretty much the same tools.

Yeah, this is the big question for us as well. It's possible that disk errors on wcqs1003 contributed to this failure, but the cookbook failed several times even when wcqs1003 was not involved. As the ticket had already dragged out for months, we opted to complete the immediate task. We will definitely investigate if this recurs.

Sorta off-topic, but I wonder if there are decent Python libraries that can do some of the encryption and compression work. It might be easier to troubleshoot that way.

Mentioned in SAL (#wikimedia-operations) [2022-11-25T13:11:42Z] <gehel> re-enabling puppet on wcqs1001 - data transfer completed - T321605

Gehel triaged this task as High priority. Mar 13 2023, 4:51 PM
Gehel moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.

We're having problems running this cookbook again. It's failed at least 12 times since I started trying to run it on Thursday (Apr 20).

I've seen the following errors:

bad decrypt 140326619927872:error:06065064:digital envelope routines:EVP_DecryptFinal_ex:bad decrypt:../crypto/evp/
evp_enc.c:610:  pigz: skipping: <stdin>: corrupted -- incomplete deflate data
  • Likely the same encryption errors we've seen already. The cookbook always fails with this error on first run, but it used to complete after that.
File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-transfer.py", line 127, in _copy_file  raise RuntimeError(f'Dest filesize of {dest_file_size} does not match source filesize of {source_fi
le_size}')
  • I think this error message masks a different error (broken connections, maybe?).
  • Cookbook hangs forever while the receiving file stays at zero bytes.

Need to rule out a few things such as IPv6, encryption etc.

Per today's IRC conversation in #wikimedia-sre, I believe it's time to try integrating transfer.py. It seems the code has matured a lot since Search Platform wrote the original data-transfer.py. We should be able to import it into data-transfer.py as a library and work from there.
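Something along these lines, as a rough sketch only: the host names, option keys and the shape of the Transferer constructor below are my reading of the transferpy docs and may not match exactly; the merged change 914018 is the authoritative version.

from transferpy.Transferer import Transferer

# All concrete values below are assumptions for illustration, not the cookbook's.
options = {
    'port': 4400,       # receiving port (assumed free on the destination)
    'compress': True,   # let transferpy drive the compression layer
    'encrypt': True,    # and the openssl encryption layer
    'checksum': True,   # verify the copy instead of only comparing file sizes
    'verbose': True,    # extra output to help future troubleshooting (per the AC)
}

transferer = Transferer(
    'wcqs1003.eqiad.wmnet',           # source host (example)
    '/srv/query_service/wcqs.jnl',    # source file
    ['wcqs1001.eqiad.wmnet'],         # destination host(s)
    ['/srv/query_service/'],          # destination path(s)
    options,
)
return_codes = transferer.run()  # assumed to return per-target exit codes
if any(return_codes):
    raise RuntimeError(f'transferpy reported failures: {return_codes}')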

Change 914018 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] wdqs: use transferpy lib for data-transfer.py

https://gerrit.wikimedia.org/r/914018

Change 918597 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] [WIP]wdqs: Activate wdqs2021

https://gerrit.wikimedia.org/r/918597

Change 918597 merged by Ryan Kemper:

[operations/puppet@production] wdqs: Activate wdqs2021

https://gerrit.wikimedia.org/r/918597

I've been working on this a bit more lately. The Transfer.py documentation mentions "remote_execution" but does not mention that it's a required argument for most (if not all) methods.

"cumin" is the only supported remote execution engine, so I believe you can instantiate a class like this:
holepuncher = Firewall(source, remote_execution="cumin")

Still need to test this further.

Mentioned in SAL (#wikimedia-operations) [2023-06-07T17:46:43Z] <inflatador> bking@wdqs depool wdqs2012 T321605

Change 914018 merged by jenkins-bot:

[operations/cookbooks@master] wdqs: use transferpy lib for data-transfer.py

https://gerrit.wikimedia.org/r/914018

bking closed this task as Resolved (edited). Jun 13 2023, 9:18 PM
bking claimed this task.
bking moved this task from monitoring to in progress on the Wikidata board.

I believe this is complete; moving to 'needs reporting' column on workboard.

bking reopened this task as In Progress. Jun 22 2023, 7:11 PM

Change 932324 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] sre.wdqs.data-transfer: fix broken logic

https://gerrit.wikimedia.org/r/932324

Change 932324 merged by Ryan Kemper:

[operations/cookbooks@master] sre.wdqs.data-transfer: fix broken logic

https://gerrit.wikimedia.org/r/932324