
Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand)
Closed, Resolved (Public)

Description

Currently /srv/software/swiftrepl/swiftrepl.conf is deployed by hand to the Swift front-end running swiftrepl (one in each DC).

It should instead be deployed by puppet (which needs handling with care as it has secrets in it).

More importantly, this code is Python 2-only and depends on libraries that are unavailable in bullseye.

Event Timeline

MatthewVernon renamed this task from "`swiftrepl.conf` should be puppet-managed" to "Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand)". Apr 12 2022, 2:00 PM
MatthewVernon raised the priority of this task from Medium to High.
MatthewVernon updated the task description.

This is blocking the migration of the two remaining Swift frontends away from Stretch. It seems an alternative replication mechanism based on rclone will replace swiftrepl; what is the status of that?

The plan is indeed to replace swiftrepl with rclone. There are two infelicities with rclone for our use case:

  1. it holds entire container listings in memory (rather than paging through them) to avoid relying on consistently-ordered containers. On some of our containers with ~7M objects, this uses up the rather modest amount of memory available on the frontends.
  2. it checks every object to see if it is a large object (either a dynamic or a static large object), because for large objects the checksum in the listing is the manifest checksum, not the checksum of the downloaded object. This effectively means a HEAD call for every object in every container, which is painfully slow.
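To illustrate the trade-off in point 1: a comparison that holds only one entry per side in memory is possible if both listings arrive sorted in the same order, which is exactly the assumption rclone avoids making. A minimal Python sketch, where the listings are plain iterables of (name, hash) pairs standing in for paged container listings; the function name and data shapes are hypothetical and are not rclone's or Swift's API:

```python
from typing import Iterable, Iterator, Tuple

Entry = Tuple[str, str]  # (object name, listing hash); hypothetical shape


def diff_sorted_listings(
    src: Iterable[Entry], dst: Iterable[Entry]
) -> Iterator[Tuple[str, str]]:
    """Yield (action, name) pairs by merging two name-sorted listings.

    Only one entry from each side is held at a time, but the result is
    only correct if both sides are sorted by name in the same way.
    """
    src_it, dst_it = iter(src), iter(dst)
    s = next(src_it, None)
    d = next(dst_it, None)
    while s is not None or d is not None:
        if d is None or (s is not None and s[0] < d[0]):
            yield ("copy", s[0])      # present only at the source
            s = next(src_it, None)
        elif s is None or d[0] < s[0]:
            yield ("delete", d[0])    # present only at the destination
            d = next(dst_it, None)
        else:
            if s[1] != d[1]:
                yield ("copy", s[0])  # same name, different hash
            s = next(src_it, None)
            d = next(dst_it, None)
```

If the two sides disagree on ordering, a merge like this silently reports spurious copies and deletes, which is presumably why holding the full listing in memory is the safer default.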

We can address the first point by no longer replicating thumbnail directories (since those are meant to be transient, and we don't care if we have to regenerate them) and moving to running rclone on one of the modern backend nodes (which have more RAM).

Regarding the large object issue, after discussion with upstream, I've built a patched rclone (it needs a relatively new golang to build) which comments out the large object check entirely. Testing of this in dry-run mode looks positive.

Next steps are to turn that hacky patch into a backend-specific config option (probably with warnings if someone tries to then copy such a large object), and to push this upstream. Upstream are IME pretty responsive and helpful. I think @Eevans has been aiming to look at this this quarter, but I don't know how much time he's had.
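As a rough illustration of what such an option might look like, here is a Python sketch (not rclone's Go code). The option name `skip_large_object_check` and the `head` callable are hypothetical; the two header names are the ones Swift uses to mark dynamic and static large objects.

```python
import warnings
from typing import Callable, Mapping, Optional

# Response headers Swift uses to mark dynamic and static large objects.
DLO_HEADER = "x-object-manifest"
SLO_HEADER = "x-static-large-object"


def is_large_object(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers describe a DLO or SLO."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return DLO_HEADER in lowered or lowered.get(SLO_HEADER, "").lower() == "true"


def trusted_hash(
    name: str,
    listing_hash: str,
    head: Callable[[str], Mapping[str, str]],
    skip_large_object_check: bool = False,  # hypothetical option name
) -> Optional[str]:
    """Return the listing hash if it can be compared directly, else None.

    `head` is a hypothetical callable that performs one HEAD request.
    With the option set, no HEAD request is made at all.
    """
    if skip_large_object_check:
        return listing_hash
    if is_large_object(head(name)):
        return None  # listing hash is the manifest's, not the content's
    return listing_hash


def warn_if_large(
    name: str, download_headers: Mapping[str, str], skip_large_object_check: bool
) -> None:
    """Warn when a large object is copied while the check is disabled."""
    if skip_large_object_check and is_large_object(download_headers):
        warnings.warn(f"{name} is a large object; its listing hash was not verified")
```

The warning is raised from the headers of the download response itself, so it costs no extra request; the saving comes from dropping the per-object HEAD during comparison.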

Then we need to build a .deb of the patched rclone (may be annoying because of the need of newer golang as a build-dep), and then puppetise deployment of it and sufficient configuration.

One thing we could do is depool the stretch hosts, if you are worried about them in the meantime.

> Then we need to build a .deb of the patched rclone (may be annoying because of the need of newer golang as a build-dep)

We've been pragmatic with Go deps, since they easily explode complexity-wise. Given that they only produce a static ELF binary, we've had multiple cases where we e.g. built a deb on the latest stable (and in some cases even on sid) and then copied the resulting deb to an older suite.

I've a package of rclone 1.60.1 that builds cleanly against unstable now; I'll be uploading it soon (tomorrow unless anyone on the go team objects).

Change 870555 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: add swift::rclone

https://gerrit.wikimedia.org/r/870555

Change 879520 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: disable swifrepl timer job

https://gerrit.wikimedia.org/r/879520

Change 879520 merged by MVernon:

[operations/puppet@production] swift: disable swifrepl timer job

https://gerrit.wikimedia.org/r/879520

Change 870555 merged by MVernon:

[operations/puppet@production] swift: add swift::rclone

https://gerrit.wikimedia.org/r/870555

Change 879769 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: make storage servers also conftool clients

https://gerrit.wikimedia.org/r/879769

Change 879769 merged by MVernon:

[operations/puppet@production] swift: make storage servers also conftool clients

https://gerrit.wikimedia.org/r/879769

Change 879783 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: fix typo in rclone.conf template

https://gerrit.wikimedia.org/r/879783

Change 879783 merged by MVernon:

[operations/puppet@production] swift: fix typo in rclone.conf template

https://gerrit.wikimedia.org/r/879783

MatthewVernon claimed this task.

I think we're now at the point where we can commit to our rclone-based replacement.