We have deployed a possible rclone-based replacement for swiftrepl (see T299125). The first live run, though, errored out having not done any deletions:
2023/01/13 20:14:37 ERROR : Swift root: not deleting files as there were IO errors
2023/01/13 20:14:37 ERROR : Swift root: not deleting directories as there were IO errors
2023/01/13 20:14:37 ERROR : Attempt 3/3 failed with 27143 errors and: failed to open source object: Object Not Found
2023/01/13 20:14:37 Failed to sync with 27143 errors: last error was: failed to open source object: Object Not Found
(exit code 1). The problem of those 27143 missing objects is tracked in T327253; that's likely to be a medium-term project, though, so we must decide what to do about rclone and/or swiftrepl in the short term.
There are (at least) 3 problems currently:
- rclone retries 3 times, and emits a log line for every missing object, resulting in 81k log lines, doubling syslog size
- the retries don't achieve anything, but waste time and system resources
- rclone refuses to sync deletions while there are I/O errors, so any deletions made in eqiad but not yet in codfw will be undone the next time we switch DC
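As a rough check on the log volume, here is the arithmetic behind the 81k figure. This is only a sketch: it assumes exactly one log line per missing object per attempt, and the variable names are illustrative.

```python
# Figures taken from the failed run above.
missing_objects = 27143
attempts = 3  # matches "Attempt 3/3" in the log

# Assumption: one ERROR line per missing object per attempt.
lines_with_retries = missing_objects * attempts
lines_single_attempt = missing_objects * 1

print(lines_with_retries)    # 81429
print(lines_single_attempt)  # 27143
```

Under that assumption, a single attempt would cut the per-run log output by about two thirds, though it would still be 27k lines per run.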
In the general case, rclone's conservative approach is sensible; but it's not working for us at the moment. I think our options are:
- Tell rclone to only try once, and delete even if there are I/O errors
- Tell rclone to only try once, and hope we fix T327253 before the next DC switchover
- Revert to swiftrepl (which doesn't care) until T327253 is fixed
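For reference, the first two options differ only in one flag. The sketch below builds the command line for each, assuming rclone's `--retries` flag (default 3) for the attempt count and `--ignore-errors` for deleting despite I/O errors. The remote and container names are hypothetical placeholders, not our actual configuration, and the helper function is purely illustrative.

```python
def build_rclone_sync_args(source, dest, delete_despite_errors):
    """Build an rclone sync command line that makes a single attempt.

    source and dest are rclone remote:path strings (hypothetical here).
    """
    # --retries 1: make one attempt instead of the default three.
    args = ["rclone", "sync", "--retries", "1"]
    if delete_despite_errors:
        # --ignore-errors: carry out deletions even if there were I/O errors.
        args.append("--ignore-errors")
    args += [source, dest]
    return args


# First option: single attempt, delete even with I/O errors.
option_1 = build_rclone_sync_args("eqiad:example-container", "codfw:example-container", True)
# Second option: single attempt, deletions still skipped on I/O errors.
option_2 = build_rclone_sync_args("eqiad:example-container", "codfw:example-container", False)

print(" ".join(option_1))
print(" ".join(option_2))
```

Note that `--ignore-errors` applies to any I/O error, not just the known missing objects, which is where the risk in the first option lies.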
I think the first option is the most pragmatic solution (though it makes my paranoid side twitch a bit), but I'm open to arguments in favour of the alternatives.