Rclone is fussy about missing objects
Closed, Resolved, Public

Description

We have deployed a possible rclone-based replacement for swiftrepl (see T299125). The first live run, though, errored out having not done any deletions:

2023/01/13 20:14:37 ERROR : Swift root: not deleting files as there were IO errors
2023/01/13 20:14:37 ERROR : Swift root: not deleting directories as there were IO errors
2023/01/13 20:14:37 ERROR : Attempt 3/3 failed with 27143 errors and: failed to open source object: Object Not Found
2023/01/13 20:14:37 Failed to sync with 27143 errors: last error was: failed to open source object: Object Not Found

(exit code 1). The problem of those 27143 missing objects is tracked in T327253; that's likely to be a medium-term project, though, so we must decide what to do about rclone and/or swiftrepl in the short term.

There are (at least) 3 problems currently:

  • rclone makes 3 attempts and emits a log line for every missing object on each one, resulting in about 81k log lines (3 × 27143) and doubling syslog size
  • the retries don't achieve anything, but waste time and system resources
  • rclone doesn't attempt to sync deletions once it has hit IO errors, so any deletions in eqiad but not in codfw will be undone next time we switch DC
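The 81k figure follows from the summary line in the log above: 27143 errors on each of 3 attempts. A minimal sketch of that arithmetic, assuming every attempt trips over the same set of missing objects; the regex and the function name are illustrative and based only on the one sample line, not on any rclone API:

```python
import re

# Summary line copied from the log above.
LINE = (
    "2023/01/13 20:14:37 ERROR : Attempt 3/3 failed with 27143 errors and: "
    "failed to open source object: Object Not Found"
)

# Illustrative pattern: an assumption about the line's shape, from this sample only.
ATTEMPT_RE = re.compile(r"Attempt (\d+)/(\d+) failed with (\d+) errors")


def estimated_error_lines(line):
    """Estimate per-object error lines, assuming each attempt hits the same errors."""
    match = ATTEMPT_RE.search(line)
    if match is None:
        raise ValueError("not an attempt summary line")
    _attempt, total_attempts, errors = map(int, match.groups())
    return total_attempts * errors


print(estimated_error_lines(LINE))  # 3 * 27143 = 81429
```

With a single attempt the same estimate drops to 27143 lines, which is the log-volume saving that options 1 and 2 both get.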

In the general case, rclone's conservative approach is sensible; but it's not working for us at the moment. I think our options are:

  1. Tell rclone to only try once, and delete even if there are IO errors
  2. Tell rclone to only try once, and hope we fix T327253 before the next DC switchover
  3. Revert to swiftrepl (which doesn't care) until T327253 is fixed

I think 1. is the most pragmatic solution (though it makes my paranoid side twitch a bit), but I'm open to arguments in favour of the alternatives.
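For reference, options 1 and 2 map onto two documented rclone flags: `--retries` (how many times the whole operation is attempted, default 3) and `--ignore-errors` (delete even if there are I/O errors). A rough sketch of the difference follows. The helper name and the remote names `eqiad:` and `codfw:` are hypothetical, and this task does not state the exact arguments used in production:

```python
import shlex


def rclone_sync_command(source, dest, retries=1, ignore_errors=True):
    """Build an argv list for `rclone sync` (hypothetical helper, for illustration)."""
    cmd = ["rclone", "sync", "--retries", str(retries)]
    if ignore_errors:
        # Option 1: carry on with deletions even though some copies failed.
        cmd.append("--ignore-errors")
    cmd += [source, dest]
    return cmd


# Hypothetical remote names; the real ones are not given in this task.
option_1 = rclone_sync_command("eqiad:", "codfw:")
option_2 = rclone_sync_command("eqiad:", "codfw:", ignore_errors=False)

print(shlex.join(option_1))
print(shlex.join(option_2))
```

Option 2 only cuts the wasted retries and log volume. Option 1 also lets deletions propagate, at the cost of also deleting when the IO errors have some other, more worrying cause.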

Event Timeline

I note in passing that we didn't pick this up in testing because in dry-run mode rclone (not unreasonably) tells you which objects it would try to copy but doesn't actually try to read the source objects.

Revert to swiftrepl (which doesn't care) until T327253 is fixed

Apologies if this should be obvious, but what are swiftrepl's semantics here? I assume it's meant to propagate deletions from one DC to the other as well; what does it do under these circumstances? Does it equate to any of the three options you listed?

Yeah, swiftrepl propagates deletions, and seems not to care about missing objects; reverting to using it instead of rclone temporarily is my option 3.

Yeah, swiftrepl propagates deletions, and seems not to care about missing objects; reverting to using it instead of rclone temporarily is my option 3.

Sorry, that was poorly framed. Does that make it equivalent to option 1?

OIC. Yes, I think so [with a caveat that I don't know how swiftrepl deals with other failures]

Change 881662 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: make rclone less fussy

https://gerrit.wikimedia.org/r/881662

Change 881662 merged by MVernon:

[operations/puppet@production] swift: make rclone less fussy

https://gerrit.wikimedia.org/r/881662

Resolved by implementing option 1. We might want to revisit this once T327253 is done.