bring swiftrepl back to life
Open, Needs Triage | Public

Description

Although the decision doesn't appear to have been documented anywhere when it was made, swiftrepl seems to have been turned off as part of T204245: Run MediaWiki media originals active/active.

We should bring it back to life, and add some logging for the case where a file wasn't already replicated (unlike in its previous life, that's now unexpected).
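As a rough illustration of the logging idea, here is a minimal sketch in Python. It is not taken from swiftrepl itself: the function names, the logger name, and the assumption that each container listing is available as a plain iterable of object names are all hypothetical.

```python
import logging

# Hypothetical logger name; swiftrepl's real logging setup may differ.
logger = logging.getLogger("swiftrepl.sketch")


def find_unreplicated(source_objects, dest_objects):
    """Return the names present in the source listing but absent from the destination."""
    return sorted(set(source_objects) - set(dest_objects))


def log_unreplicated(container, source_objects, dest_objects):
    """Log a warning for every object that still needs to be replicated.

    In an active/active setup both clusters should already hold the object,
    so each hit is worth a warning rather than a debug line.
    """
    missing = find_unreplicated(source_objects, dest_objects)
    for name in missing:
        logger.warning("unexpected unreplicated object: %s/%s", container, name)
    return missing
```

The function returns the missing names so a caller can still replicate them after the warning has been logged.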

Fixing T162123: Running swiftrepl is not puppetized is outside the immediate scope of this task.

Event Timeline

Change 531964 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software@master] Re-enable logging of files needing to be synced, since this is now unexpected in a post-active/active world.

https://gerrit.wikimedia.org/r/531964

Fano removed a subscriber: Fano. Sat, Aug 24, 3:15 AM

Change 532793 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software@master] swiftrepl: bring close to as-is in production

https://gerrit.wikimedia.org/r/532793

aaron added a comment. Wed, Aug 28, 8:02 AM

I do worry about the risk of data loss if swiftrepl is also deleting files based on container list differences.

Between some FileBackendMultiWrite cleanup, LocalFileBatch* refactoring, and possibly making ?action=purge trigger FileBackendMultiWrite (no patch for that last one yet), maybe that would be enough.

Change 532793 merged by jenkins-bot:
[operations/software@master] swiftrepl: bring close to as-is in production

https://gerrit.wikimedia.org/r/532793

Change 531964 merged by jenkins-bot:
[operations/software@master] swiftrepl: log on replications

https://gerrit.wikimedia.org/r/531964

Mentioned in SAL (#wikimedia-operations) [2019-09-05T14:11:02Z] <cdanis> restarted swiftrepl on ms-fe1005 T231110

Leaving some notes here before I'm gone for two weeks.

  • The script as it exists on ms-fe1005:/srv/swiftrepl matches git master, and works. Kinda.
  • I modified repl.sh (which is not in git) to have a slightly different naming format for its log files.
  • I modified repl_all.sh (ditto) to not run cross-cluster deletes, out of paranoia.
  • There's a logic error in handling some of the containers: P9064. I did not have time to get to the bottom of this. Instead I just quit that swiftrepl instance when it reached that state, and the wrapper script continued on to other containers.
  • There were a _lot_ of Etag mismatches for the thumbnail containers. In those cases I also quit the script and moved on. I'm assuming the cause is simply that thumbor isn't 100% deterministic, and that no harm is actually being done by replicating the other versions, but I do not actually know either of those things for sure. (Also: maybe, in the grand scheme of things, replicating thumbnails is just unnecessary?)

I think next steps are to debug/bandaid the issue where swiftrepl.py gets stuck, and do some Puppetization to do a git clone, and also to create a systemd service+timer, running in both eqiad and codfw.
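To make the concerns above concrete (ETag mismatches, and deletes driven by container list differences), here is a minimal sketch of how two container listings could be classified before anything is acted on. It is an illustration only, not how swiftrepl.py is actually written: `ListingDiff`, `diff_listings`, `plan_actions`, and the `allow_deletes` flag are hypothetical names, and each listing is assumed to be a dict mapping object name to ETag.

```python
from dataclasses import dataclass, field


@dataclass
class ListingDiff:
    missing: list = field(default_factory=list)        # in source, not in destination
    etag_mismatch: list = field(default_factory=list)  # in both, ETags differ
    extra: list = field(default_factory=list)          # in destination only


def diff_listings(source, dest):
    """Classify differences between two {object_name: etag} listings."""
    diff = ListingDiff()
    for name in sorted(source):
        if name not in dest:
            diff.missing.append(name)
        elif source[name] != dest[name]:
            diff.etag_mismatch.append(name)
    diff.extra = sorted(set(dest) - set(source))
    return diff


def plan_actions(diff, allow_deletes=False):
    """Turn a diff into (action, name) pairs.

    ETag mismatches are only reported, never overwritten, and cross-cluster
    deletes are skipped unless explicitly enabled.
    """
    actions = [("copy", name) for name in diff.missing]
    actions += [("report", name) for name in diff.etag_mismatch]
    if allow_deletes:
        actions += [("delete", name) for name in diff.extra]
    return actions
```

Keeping deletes behind an explicit opt-in mirrors the change made to repl_all.sh, and reporting mismatches separately would make it easier to see whether they are confined to the thumbnail containers.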

Change 536586 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP swift: add swiftrepl

https://gerrit.wikimedia.org/r/536586