While completing T316236, we noticed that the data-transfer cookbook finished multiple times with no errors, yet the transfer was incomplete and the wcqs-streaming-updater service could not start.
Our first attempt was to add a disk-size sanity check, but the bigger problem is that the transfer itself is not completing even though the data-transfer cookbook exits cleanly. There are probably network gremlins and other things out of our control, but we need to change what we do control (namely, the cookbook) to address these issues as much as possible.
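For reference, a disk-size sanity check of the kind mentioned above could look roughly like the sketch below. This is not the cookbook's actual code: the helper names `dir_size_bytes` and `transfer_looks_complete` and the 1% tolerance are assumptions made for illustration, and the source and destination sizes would have to be collected on their respective hosts.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def transfer_looks_complete(source_bytes, dest_bytes, tolerance=0.01):
    """Return True if the destination size is within `tolerance` of the source.

    Hypothetical check: the tolerance is an assumed value, not one taken
    from the existing cookbook.
    """
    if source_bytes <= 0:
        # An empty or unreadable source should never count as a good transfer.
        return False
    return abs(source_bytes - dest_bytes) <= source_bytes * tolerance
```

A size comparison only catches gross truncation; a per-file checksum would be a stronger (but slower) check.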
AC:
- Try different transfer methods (rsync?). The current approach is focused on speed; reliability would be better.
- More logging/debug output, so we can troubleshoot more quickly in the future.
- Depooling from the cookbook might not respect the thresholds pybal normally heeds when pooling/depooling hosts. We should investigate this.
- Further context: wcqs1002 was already depooled when we ran a transfer from wcqs1003->wcqs1001. This left all 3 wcqs hosts depooled, so no eqiad wcqs hosts were actually pooled during the window.
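
As a starting point for the depooling investigation, a pre-depool guard in the cookbook might look something like the sketch below. It is only an illustration: `can_depool` is a hypothetical helper, the 0.5 threshold is an assumed value, and the real threshold and pool state would need to come from the pybal/conftool configuration rather than being hard-coded.

```python
import math


def can_depool(all_hosts, pooled_hosts, to_depool, depool_threshold=0.5):
    """Return True if depooling `to_depool` keeps enough hosts pooled.

    `depool_threshold` is the minimum fraction of the cluster that must
    stay pooled (assumed value; not read from any real configuration).
    """
    total = len(set(all_hosts))
    remaining = set(pooled_hosts) - set(to_depool)
    # Always require at least one pooled host, whatever the threshold says.
    required = max(math.ceil(total * depool_threshold), 1)
    return len(remaining) >= required


# The scenario described above: wcqs1002 is already depooled, and the
# transfer wants to depool both wcqs1003 (source) and wcqs1001 (destination).
eqiad = ["wcqs1001", "wcqs1002", "wcqs1003"]
pooled = ["wcqs1001", "wcqs1003"]
print(can_depool(eqiad, pooled, ["wcqs1001", "wcqs1003"]))
```

With a guard like this, the cookbook could abort (or ask for confirmation) before starting the transfer instead of leaving the datacenter with no pooled hosts.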