Create a container image containing the necessary sync utilities for publishing dump files
Closed, Resolved, Public

Description

The dumps running on the snapshot servers use rsync to publish the data files from the intermediate storage (dumpsdata) servers to the distribution (clouddumps) servers.
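
For illustration, the rsync-based publishing step described above might be sketched as follows. This is a minimal sketch and not the production job: the function name `publish_dumps`, the `RSYNC` override variable, the source path, and the destination host and module are all hypothetical placeholders, and the flags shown are a conservative choice rather than the ones the snapshot servers actually use.

```shell
#!/bin/bash
# Sketch only: paths, host name, and rsync module below are hypothetical.
# RSYNC can be overridden to try a different tool or a stub.
RSYNC="${RSYNC:-rsync}"

publish_dumps() {
    local src="$1" dest="$2"
    # -a preserves permissions, timestamps, and symlinks.
    # --dry-run lists what would be transferred without copying anything;
    # remove it only once the source and destination are confirmed.
    # The trailing slash on the source copies its contents, not the directory itself.
    "$RSYNC" -a --dry-run "${src%/}/" "$dest"
}

# Example (hypothetical source path and destination):
# publish_dumps /data/dumps/public clouddumps.example.org::dumps/
```

Keeping the tool name in a variable makes it easier to compare rsync against alternatives later without rewriting the wrapper.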

Once we have the dumps running on Kubernetes, we will also need to have a means to publish them in the same way.

This might be rsync, but we might also want to experiment with some other tools, such as lsyncd or parsyncfp2 (which is only available in trixie).

To begin with, we will need an image that includes these sync-utils, so that we can experiment.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Add changelog to trigger build and publish pipeline | repos/data-engineering/sync-utils!2 | btullis | bump_build | main
Add data-engineering/sync-utils to the trusted runners | repos/releng/gitlab-trusted-runner!111 | btullis | add_sync_utils | main
Add the initial build pipeline and sync utilities | repos/data-engineering/sync-utils!1 | btullis | initial_pipeline | main

Event Timeline

BTullis triaged this task as High priority.

I have created a basic image to start with.
It is currently awaiting a review from RelEng so that the repository can be added to the trusted runners.

I have decided to leave parsyncfp2 and hdfs-rsync out of the image for the moment, since rsync is really what we need to begin testing.

This is now complete.

btullis@barracuda:~$ docker run -it docker-registry.wikimedia.org/repos/data-engineering/sync-utils:2025-04-07-135737-7291841795ffa86db311e48e39684e0faf9e36b1@sha256:097a5bba821bce585815d9837eeeb20a113b1cd54745bc823adda69ae3792b20
Unable to find image 'docker-registry.wikimedia.org/repos/data-engineering/sync-utils:2025-04-07-135737-7291841795ffa86db311e48e39684e0faf9e36b1@sha256:097a5bba821bce585815d9837eeeb20a113b1cd54745bc823adda69ae3792b20' locally
docker-registry.wikimedia.org/repos/data-engineering/sync-utils@sha256:097a5bba821bce585815d9837eeeb20a113b1cd54745bc823adda69ae3792b20: Pulling from repos/data-engineering/sync-utils
cfba3dca3abc: Pull complete 
7173becacac9: Pull complete 
1c5f71d8ae47: Pull complete 
dc98657bbee5: Pull complete 
Digest: sha256:097a5bba821bce585815d9837eeeb20a113b1cd54745bc823adda69ae3792b20
Status: Downloaded newer image for docker-registry.wikimedia.org/repos/data-engineering/sync-utils@sha256:097a5bba821bce585815d9837eeeb20a113b1cd54745bc823adda69ae3792b20
runuser@2d0272461ebe:/home/sync-utils$ rsync --version
rsync  version 3.2.7  protocol version 32
Copyright (C) 1996-2022 by Andrew Tridgell, Wayne Davison, and others.
Web site: https://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, symlinks, symtimes, hardlinks, hardlink-specials,
    hardlink-symlinks, IPv6, atimes, batchfiles, inplace, append, ACLs,
    xattrs, optional secluded-args, iconv, prealloc, stop-at, no crtimes
Optimizations:
    SIMD-roll, no asm-roll, openssl-crypto, no asm-MD5
Checksum list:
    xxh128 xxh3 xxh64 (xxhash) md5 md4 sha1 none
Compress list:
    zstd lz4 zlibx zlib none
Daemon auth list:
    sha512 sha256 sha1 md5 md4

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details.
runuser@2d0272461ebe:/home/sync-utils$
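
The manual check above could also be scripted, for example as a smoke test after each image build. The following is a rough sketch under stated assumptions: the helper names `rsync_version` and `version_at_least` are hypothetical, the 3.2.0 threshold in the usage comment is an arbitrary example, and `sort -V` is assumed to be available (it is part of GNU coreutils).

```shell
#!/bin/bash
# Sketch only: helper names and the minimum version are illustrative assumptions.

# Read `rsync --version` output on stdin and print the version number
# found on its first line, for example 3.2.7.
rsync_version() {
    sed -n '1s/^rsync  *version \([0-9][0-9.]*\).*/\1/p'
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on version-aware sorting (sort -V, GNU coreutils).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example, run inside the container:
# v="$(rsync --version | rsync_version)"
# version_at_least "$v" 3.2.0 && echo "rsync $v is new enough"
```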