
Swiftrepl was stuck in an infinite loop for days
Closed, ResolvedPublic

Description

Swiftrepl was running through repl_all.sh on ms-fe1005 and had been stuck in an infinite loop for days, also creating a large log file (>1GB).

The cause seems to be two files with mismatched ETags in two different containers: wikipedia-commons-local-thumb.56 and wikipedia-commons-local-thumb.03. See the related logs below. This caused the script to loop endlessly over those two containers, always stopping at the same point and preventing it from moving on to the other containers.

STATS: wikipedia-commons-local-thumb.56 processed: 61000/3413461 (1%), gets: 0, hit rate: 100%
wikipedia-commons-local-thumb.56        5/56/19670630_35_LI_211_New_York_(11974786273).jpg/180px-19670630_35_LI_211_New_York_(11974786273).jpg  E-Tag mismatch: c7411afe7ef90b2aae5fe1cab5b69328/519012036565348ab6c07cf7756362fd, syncing
transferred 6360 out of 9900 for 5/56/19670630_35_LI_211_New_York_(11974786273).jpg/180px-19670630_35_LI_211_New_York_(11974786273).jpg
transferred 6360 out of 9900 for 5/56/19670630_35_LI_211_New_York_(11974786273).jpg/180px-19670630_35_LI_211_New_York_(11974786273).jpg
Repeated error in replicate_object
 Traceback (most recent call last):
  File "./swiftrepl.py", line 473, in replicator_thread
    sync_container(container, kwargs['srcconnpool'], kwargs['dstconnpool'])
  File "./swiftrepl.py", line 332, in sync_container
    replicate_object(srcobj, dstobj, srcconnpool, dstconnpool)
  File "./swiftrepl.py", line 185, in replicate_object
    send_object(dstobj, object_stream(response, chunksize=65536), headers)
  File "./swiftrepl.py", line 137, in send_object
    raise cloudfiles.errors.IncompleteSend()
IncompleteSend

Abandoning container wikipedia-commons-local-thumb.56 for now
STATS: wikipedia-commons-local-thumb.03 processed: 104000/3431616 (3%), gets: 0, hit rate: 99%
wikipedia-commons-local-thumb.03        0/03/2014.06.18_maz-54329.JPG/1920px-2014.06.18_maz-54329.JPG   E-Tag mismatch: 70c5b8f3b38f277f80f50a4aadc7d66b/0080a2f0746c8da110a7504a49004b2a, syncing
transferred 259747 out of 259758 for 0/03/2014.06.18_maz-54329.JPG/1920px-2014.06.18_maz-54329.JPG
transferred 259747 out of 259758 for 0/03/2014.06.18_maz-54329.JPG/1920px-2014.06.18_maz-54329.JPG
Repeated error in replicate_object
 Traceback (most recent call last):
  File "./swiftrepl.py", line 473, in replicator_thread
    sync_container(container, kwargs['srcconnpool'], kwargs['dstconnpool'])
  File "./swiftrepl.py", line 332, in sync_container
    replicate_object(srcobj, dstobj, srcconnpool, dstconnpool)
  File "./swiftrepl.py", line 185, in replicate_object
    send_object(dstobj, object_stream(response, chunksize=65536), headers)
  File "./swiftrepl.py", line 137, in send_object
    raise cloudfiles.errors.IncompleteSend()
IncompleteSend

Abandoning container wikipedia-commons-local-thumb.03 for now
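
The "transferred N out of M" lines followed by IncompleteSend look consistent with a sender that counts the bytes it streams and raises when the total falls short of the declared length. A minimal sketch of that kind of check, assuming nothing about the real send_object in swiftrepl.py (copy_exact and IncompleteTransfer are hypothetical names used only for illustration):

```python
class IncompleteTransfer(Exception):
    """Hypothetical stand-in for an 'incomplete send' error."""


def copy_exact(chunks, expected_length, write):
    """Pass every chunk to `write` and verify the total byte count.

    `chunks` is any iterable of byte strings, `expected_length` is the
    size the source advertised, and `write` is a callable that consumes
    one chunk. All three are assumptions of this sketch.
    """
    transferred = 0
    for chunk in chunks:
        write(chunk)
        transferred += len(chunk)
    if transferred != expected_length:
        # Mirrors the wording of the log lines above.
        raise IncompleteTransfer(
            "transferred %d out of %d" % (transferred, expected_length))
    return transferred
```

With a check like this, a source object whose stored data is shorter than its advertised size fails in the same way on every attempt, which would match the repeated identical log lines.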

To deploy the discovery URLs to swift-proxy I had to stop swiftrepl in order to restart swift-proxy on this host. I have since restarted it and will monitor it over the next hours/days to see whether the behaviour is the same or whether it is able to get past those two files.

Event Timeline

You can kill the two thumbs if you want to move past it, as killing thumbs is almost always a safe operation. That said, there is probably an underlying bug that resulted in those two files having a different ETag and we need to figure out why. Plus, swiftrepl getting stuck in an endless loop without any warning is not the right thing to do either, obviously :)
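
One way to avoid the endless loop would be to bound the number of attempts per object, skip the object once the limit is reached, and report the skipped objects at the end of the run. The following is only a sketch under stated assumptions, not how swiftrepl is written: sync_objects, the replicate callable and MAX_ATTEMPTS are all hypothetical.

```python
import logging

log = logging.getLogger("swiftrepl-sketch")  # hypothetical logger name

MAX_ATTEMPTS = 3  # assumed limit, chosen only for illustration


def sync_objects(names, replicate, max_attempts=MAX_ATTEMPTS):
    """Replicate each object, giving up on one after max_attempts failures.

    `replicate` is any callable taking an object name and raising on
    failure. Returns the names that never succeeded so the caller can
    alert on them instead of retrying the whole container forever.
    """
    failed = []
    for name in names:
        for attempt in range(1, max_attempts + 1):
            try:
                replicate(name)
                break  # success, move on to the next object
            except Exception as exc:
                log.warning("attempt %d/%d failed for %s: %s",
                            attempt, max_attempts, name, exc)
        else:
            # The inner loop never hit `break`: every attempt failed.
            failed.append(name)
    return failed
```

A non-empty return value could then be turned into a non-zero exit status or an alert, so that a persistently broken object is visible without blocking the remaining containers.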

Mentioned in SAL (#wikimedia-operations) [2017-04-05T09:04:41Z] <volans> deleted the 2 swift thumbs that were making swiftrepl stuck in a loop: T162122

Mentioned in SAL (#wikimedia-operations) [2017-04-05T09:48:23Z] <volans> deleted a third swift thumb that was making swiftrepl stuck in a loop: T162122

The third one was:

wikipedia-commons-local-thumb.3b        3/3b/Hendrick_de_Keyser_-_gulden_cabinet.png/85px-Hendrick_de_Keyser_-_gulden_cabinet.png       E-Tag mismatch: bc68f6efc732fda68647dcd65867cef9/cd3b1b810889387c0ff7bed187e87125, syncing

The first run of swiftrepl has finally completed! It is now in the 2-hour sleep between runs; I'll check that the next one completes without manual intervention.

A second pass was completed successfully without any manual intervention.

fgiunchedi changed the task status from Open to Stalled. Nov 5 2019, 11:15 AM

Stalling since I don't think we've seen a recurrence yet. swiftrepl now runs as a timer+service once a week; overlapping runs should result in an alert, I believe.

Currently there is this alert in Icinga:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ms-fe1005&service=Check+systemd+state

The following units failed: swiftrepl-mw.service on ms-fe1005. It seemed like this task might be relevant; I just searched for the host name in Phabricator.

I think that failure is unrelated to this issue - looking at /var/log/swiftrepl/2022-01-10-repl-commons.log:

2022-01-10T08:00:01.713061 Traceback (most recent call last):
2022-01-10T08:00:01.713243   File "./swiftrepl.py", line 528, in <module>
2022-01-10T08:00:01.713266     srcconn = srcconnpool.get()
2022-01-10T08:00:01.713280   File "/usr/lib/python2.7/dist-packages/cloudfiles/connection.py", line 479, in get
2022-01-10T08:00:01.720405     connobj = Connection(**self.connargs)
2022-01-10T08:00:01.720464   File "/usr/lib/python2.7/dist-packages/cloudfiles/connection.py", line 85, in __init__
2022-01-10T08:00:01.720491     self._authenticate()
2022-01-10T08:00:01.720506   File "./swiftrepl.py", line 34, in https_authenticate
2022-01-10T08:00:01.720519     (url, self.cdn_url, self.token) = self.auth.authenticate()
2022-01-10T08:00:01.720535   File "/usr/lib/python2.7/dist-packages/cloudfiles/authentication.py", line 74, in authenticate
2022-01-10T08:00:01.720701     raise AuthenticationFailed()
2022-01-10T08:00:01.720738 cloudfiles.errors.AuthenticationFailed
2022-01-10T08:00:01.725680 Command exited with non-zero status 1
2022-01-10T08:00:01.725755 0.04user 0.01system 0:00.11elapsed 46%CPU (0avgtext+0avgdata 14340maxresident)k
2022-01-10T08:00:01.725773 312inputs+0outputs (0major+2289minor)pagefaults 0swaps

I tried starting swiftrepl-mw by hand and got the same error. So something is wrong with auth, and it's not this infinite loop issue.

Split off into separate task, since this is something else going awry.

MatthewVernon claimed this task.

We don't use swiftrepl any more, so closing this.