
bring swiftrepl back to life
Closed, Resolved (Public)

Description

While the decision seemingly wasn't documented anywhere when it was made, it seems swiftrepl was turned off as part of T204245: Run MediaWiki media originals active/active.

We should bring it back to life, adding some logging for the case where a file wasn't already replicated (unlike in its previous life, that's now unexpected).

This task does not have fixing T162123: Refactor swift credentials to be global rather than per-site in its immediate scope.

Event Timeline

Change 531964 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software@master] Re-enable logging of files needing to be synced, since this is now unexpected in a post-active/active world.

https://gerrit.wikimedia.org/r/531964

Change 532793 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software@master] swiftrepl: bring close to as-is in production

https://gerrit.wikimedia.org/r/532793

I do worry about the risk of data loss if swiftrepl is also deleting files based on container list differences.

Between some FileBackendMultiWrite cleanup, LocalFileBatch* refactoring, and possibly making ?action=purge trigger FileBackendMultiWrite (no patch for that last one yet), maybe that would be enough.

Change 532793 merged by jenkins-bot:
[operations/software@master] swiftrepl: bring close to as-is in production

https://gerrit.wikimedia.org/r/532793

Change 531964 merged by jenkins-bot:
[operations/software@master] swiftrepl: log on replications

https://gerrit.wikimedia.org/r/531964

Mentioned in SAL (#wikimedia-operations) [2019-09-05T14:11:02Z] <cdanis> restarted swiftrepl on ms-fe1005 T231110

Leaving some notes here before I'm gone for two weeks.

  • The script as it exists on ms-fe1005:/srv/swiftrepl matches git master, and works. Kinda.
  • I modified repl.sh (which is not in git) to have a slightly different naming format for its log files.
  • I modified repl_all.sh (ditto) to not run cross-cluster deletes, out of paranoia.
  • There's a logic error in handling some of the containers: P9064. I did not have time to get to the bottom of this. Instead I just quit that swiftrepl instance when it reached that state, and the wrapper script continued on to other containers.
  • There were a _lot_ of Etag mismatches for the thumbnail containers. In those cases I also quit the script and moved on. I'm assuming the cause is simply that thumbor isn't 100% deterministic, and that no harm is actually being done by replicating the other versions, but I do not actually know either of those things for sure. (Also: maybe, in the grand scheme of things, replicating thumbnails is just unnecessary?)

I think next steps are to debug/bandaid the issue where swiftrepl.py gets stuck, and do some Puppetization to do a git clone, and also to create a systemd service+timer, running in both eqiad and codfw.

Change 536586 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP swift: add swiftrepl

https://gerrit.wikimedia.org/r/536586

I'm looking into the failure at P9064, and here's my theory, with the relevant code from sync_container:

dstobjects = None
while True:
    srcobjects = get_container_objects(srccontainer, limit=NOBJECT, marker=last, connpool=srcconnpool)
    limit = NOBJECT
    # The IndexError is raised on the next line: [-1] fails on an empty listing.
    while dstobjects is None or (len(dstobjects) >= limit and dstobjects[-1].name < srcobjects[-1].name):
        dstobjects = get_container_objects(dstcontainer, limit=limit, marker=last, connpool=dstconnpool)
        if len(dstobjects) == limit:
            # Destination page is full: retry with a larger page, up to 10000.
            limit *= 2
            if limit > 10000:
                dstobjects = None
                break

Since we're getting an IndexError on the while line, either dstobjects[-1] or srcobjects[-1] must be indexing into an empty list. AFAICT get_container_objects returns an empty list when marker is the last object in the container, which happens when the number of objects in the container is an (integer) multiple of limit. The end result is that srcobjects is empty when we reach the end of the container.
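To make the theory concrete, here is a minimal, self-contained sketch of marker-based listing. None of this is swiftrepl code: list_objects and page_sizes are hypothetical stand-ins, and the assumption that a scan only stops once it sees a short page is mine, not something confirmed from swiftrepl.py.

```python
def list_objects(names, limit, marker=None):
    """Hypothetical stand-in for a marker-paginated container listing.

    Returns up to `limit` names that sort strictly after `marker`.
    """
    ordered = sorted(names)
    if marker is not None:
        ordered = [n for n in ordered if n > marker]
    return ordered[:limit]


def page_sizes(names, limit):
    """Page through `names`, stopping on the first short page (assumed rule)."""
    marker = None
    sizes = []
    while True:
        page = list_objects(names, limit, marker)
        sizes.append(len(page))
        if len(page) < limit:
            break
        # The last name of a full page becomes the marker for the next request.
        marker = page[-1]
    return sizes


five = ["obj%03d" % i for i in range(5)]
four = ["obj%03d" % i for i in range(4)]

print(page_sizes(five, 2))  # last page is short but not empty
print(page_sizes(four, 2))  # count is a multiple of limit: last page is empty
```

With a count that is an exact multiple of the page size, the final request uses the very last object as its marker and comes back empty, so any [-1] lookup on that page would raise IndexError.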

Looking at the logs, all containers with errors indeed have object counts that are multiples of NOBJECT (1000), for example:

STATS: global-data-math-render.0d processed: 1000/38000 (2%), hit rate: 100%
...
STATS: global-data-math-render.0d processed: 38000/38000 (100%), hit rate: 100%
Traceback (most recent call last):
  File "./swiftrepl.py", line 502, in replicator_thread
    sync_container(container, kwargs['srcconnpool'], kwargs['dstconnpool'])
  File "./swiftrepl.py", line 322, in sync_container
    while dstobjects is None or (len(dstobjects) >= limit and dstobjects[-1].name < srcobjects[-1].name):
  File "/usr/lib/python2.7/dist-packages/cloudfiles/storage_object.py", line 733, in __getitem__
    return Object(self.container, object_record=self._objects[key])
IndexError: list index out of range

Abandoning container global-data-math-render.0d for now

I'll try adding a guard on len(srcobjects) > 0.
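A rough sketch of what such a guard could look like, pulled out into a helper so it can be exercised on its own. This is not the patch itself: needs_more_dst is a hypothetical name, and plain strings stand in for the cloudfiles objects (so names are compared directly instead of via .name).

```python
def needs_more_dst(dstobjects, srcobjects, limit):
    """Decide whether to fetch another (larger) destination listing.

    Mirrors the inner `while` condition, plus a guard for an empty source page.
    """
    if dstobjects is None:
        # Nothing fetched from the destination yet.
        return True
    if len(srcobjects) == 0:
        # Guard: the source page is empty (end of container), so there is
        # no last source name to compare against and no reason to refetch.
        return False
    # Refetch only if the destination page was full and still ends
    # before the last source name.
    return len(dstobjects) >= limit and dstobjects[-1] < srcobjects[-1]


print(needs_more_dst(["a", "b"], [], 2))     # empty source page, no IndexError
print(needs_more_dst(["a", "b"], ["c"], 2))  # full dst page ending before "c"
```

Checking the source length before indexing keeps the short-circuit order of the original condition intact for the non-empty case.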

Change 537610 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/software@master] swiftrepl: handle empty srcobjects when scanning for dstobjects

https://gerrit.wikimedia.org/r/537610

Looks like it has worked, on a container with exactly 12k objects:

# grep wikipedia-commons-local-deleted.0n 2019-09-18-repl-commons.log 
STATS: wikipedia-commons-local-deleted.0n processed: 1000/12000 (8%), hit rate: 100%
STATS: wikipedia-commons-local-deleted.0n processed: 2000/12000 (16%), hit rate: 100%
STATS: wikipedia-commons-local-deleted.0n processed: 3000/12000 (25%), hit rate: 100%
STATS: wikipedia-commons-local-deleted.0n processed: 4000/12000 (33%), hit rate: 100%
STATS: wikipedia-commons-local-deleted.0n processed: 5000/12000 (41%), hit rate: 100%
STATS: wikipedia-commons-local-deleted.0n processed: 6000/12000 (50%), hit rate: 100%
STATS: wikipedia-commons-local-deleted.0n processed: 7000/12000 (58%), hit rate: 99%
STATS: wikipedia-commons-local-deleted.0n processed: 8000/12000 (66%), hit rate: 99%
STATS: wikipedia-commons-local-deleted.0n processed: 9000/12000 (75%), hit rate: 99%
STATS: wikipedia-commons-local-deleted.0n processed: 10000/12000 (83%), hit rate: 99%
STATS: wikipedia-commons-local-deleted.0n processed: 11000/12000 (91%), hit rate: 99%
STATS: wikipedia-commons-local-deleted.0n processed: 12000/12000 (100%), hit rate: 99%
STATS: wikipedia-commons-local-deleted.0n processed: 12000/12000 (100%), hit rate: 99%
FINISHED: wikipedia-commons-local-deleted.0n
# grep -ir 'index out of range' 2019-09-18-repl-commons.log
#

Change 537610 merged by Filippo Giunchedi:
[operations/software@master] swiftrepl: handle empty srcobjects when scanning for dstobjects

https://gerrit.wikimedia.org/r/537610

Change 536586 merged by Filippo Giunchedi:
[operations/puppet@production] swift: add swiftrepl

https://gerrit.wikimedia.org/r/536586

This is effectively done (i.e. swiftrepl is back); following up in T162123: Refactor swift credentials to be global rather than per-site.

Change 548739 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swiftrepl: add ensure

https://gerrit.wikimedia.org/r/548739

Change 548740 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swiftrepl: disable in codfw

https://gerrit.wikimedia.org/r/548740

Change 548739 merged by Filippo Giunchedi:
[operations/puppet@production] swiftrepl: add ensure

https://gerrit.wikimedia.org/r/548739

Change 548740 merged by Filippo Giunchedi:
[operations/puppet@production] swiftrepl: disable in codfw

https://gerrit.wikimedia.org/r/548740