
Swiftrepl was stuck in an infinite loop for days
Open, Stalled, Medium, Public

Description

Swiftrepl was running through repl_all.sh on ms-fe1005 and had been stuck in an infinite loop for days, also creating a large log file (>1GB).

The cause seems to be 2 files with mismatched ETags in 2 different containers: wikipedia-commons-local-thumb.56 and wikipedia-commons-local-thumb.03. See the related logs below. This caused the script to loop over those 2 containers indefinitely, always stopping at the same point, which prevented it from moving on to the other containers.

STATS: wikipedia-commons-local-thumb.56 processed: 61000/3413461 (1%), gets: 0, hit rate: 100%
wikipedia-commons-local-thumb.56        5/56/19670630_35_LI_211_New_York_(11974786273).jpg/180px-19670630_35_LI_211_New_York_(11974786273).jpg  E-Tag mismatch: c7411afe7ef90b2aae5fe1cab5b69328/519012036565348ab6c07cf7756362fd, syncing
transferred 6360 out of 9900 for 5/56/19670630_35_LI_211_New_York_(11974786273).jpg/180px-19670630_35_LI_211_New_York_(11974786273).jpg
transferred 6360 out of 9900 for 5/56/19670630_35_LI_211_New_York_(11974786273).jpg/180px-19670630_35_LI_211_New_York_(11974786273).jpg
Repeated error in replicate_object
 Traceback (most recent call last):
  File "./swiftrepl.py", line 473, in replicator_thread
    sync_container(container, kwargs['srcconnpool'], kwargs['dstconnpool'])
  File "./swiftrepl.py", line 332, in sync_container
    replicate_object(srcobj, dstobj, srcconnpool, dstconnpool)
  File "./swiftrepl.py", line 185, in replicate_object
    send_object(dstobj, object_stream(response, chunksize=65536), headers)
  File "./swiftrepl.py", line 137, in send_object
    raise cloudfiles.errors.IncompleteSend()
IncompleteSend

Abandoning container wikipedia-commons-local-thumb.56 for now
STATS: wikipedia-commons-local-thumb.03 processed: 104000/3431616 (3%), gets: 0, hit rate: 99%
wikipedia-commons-local-thumb.03        0/03/2014.06.18_maz-54329.JPG/1920px-2014.06.18_maz-54329.JPG   E-Tag mismatch: 70c5b8f3b38f277f80f50a4aadc7d66b/0080a2f0746c8da110a7504a49004b2a, syncing
transferred 259747 out of 259758 for 0/03/2014.06.18_maz-54329.JPG/1920px-2014.06.18_maz-54329.JPG
transferred 259747 out of 259758 for 0/03/2014.06.18_maz-54329.JPG/1920px-2014.06.18_maz-54329.JPG
Repeated error in replicate_object
 Traceback (most recent call last):
  File "./swiftrepl.py", line 473, in replicator_thread
    sync_container(container, kwargs['srcconnpool'], kwargs['dstconnpool'])
  File "./swiftrepl.py", line 332, in sync_container
    replicate_object(srcobj, dstobj, srcconnpool, dstconnpool)
  File "./swiftrepl.py", line 185, in replicate_object
    send_object(dstobj, object_stream(response, chunksize=65536), headers)
  File "./swiftrepl.py", line 137, in send_object
    raise cloudfiles.errors.IncompleteSend()
IncompleteSend

Abandoning container wikipedia-commons-local-thumb.03 for now
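
With a log this large it can help to extract the objects that keep reappearing instead of reading it by eye. The following is a minimal sketch, not part of swiftrepl: it assumes each mismatch entry occupies a single line in the log file, in the "container, object name, E-Tag mismatch: x/y, syncing" shape shown above. The function names count_mismatches and repeat_offenders are made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
#   <container> <object name> E-Tag mismatch: <md5>/<md5>, syncing
# The field layout is inferred from the excerpt in this task (assumption).
MISMATCH_RE = re.compile(
    r"^(?P<container>\S+)\s+(?P<name>.+?)\s+E-Tag mismatch: "
    r"[0-9a-f]{32}/[0-9a-f]{32}, syncing"
)


def count_mismatches(lines):
    """Count how often each (container, object) pair logs an E-Tag mismatch."""
    counts = Counter()
    for line in lines:
        match = MISMATCH_RE.match(line)
        if match:
            counts[(match.group("container"), match.group("name"))] += 1
    return counts


def repeat_offenders(lines, threshold=2):
    """Return the pairs seen at least `threshold` times, most frequent first."""
    return [
        key
        for key, seen in count_mismatches(lines).most_common()
        if seen >= threshold
    ]
```

Passing an open file object as lines keeps memory use flat, since the file is read one line at a time rather than loaded whole.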

To deploy the discovery URLs to swift-proxy I had to stop swiftrepl in order to restart swift-proxy on this host. I have since restarted swiftrepl and will monitor it over the next hours/days to see whether the behaviour repeats or whether it is able to get past those two files.

Event Timeline

Volans created this task. Apr 4 2017, 9:27 AM
Restricted Application added a subscriber: Aklapper. Apr 4 2017, 9:27 AM
faidon added a subscriber: faidon. Apr 4 2017, 1:24 PM

You can kill the two thumbs if you want to move past it, as killing thumbs is almost always a safe operation. That said, there is probably an underlying bug that resulted in those two files having a different ETag and we need to figure out why. Plus, swiftrepl getting stuck in an endless loop without any warning is not the right thing to do either, obviously :)
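
One way to avoid the silent endless loop is to cap how many times a single object may fail before it is skipped with a warning, so the rest of the container still gets processed. The sketch below only illustrates that idea and is not swiftrepl's actual code: sync_objects, the failures counter, max_failures and the local IncompleteSend class (a stand-in for cloudfiles.errors.IncompleteSend) are all assumptions made for the example.

```python
import logging
from collections import Counter

log = logging.getLogger("swiftrepl-sketch")


class IncompleteSend(Exception):
    """Local stand-in for cloudfiles.errors.IncompleteSend (assumption)."""


def sync_objects(names, replicate, failures, max_failures=3):
    """Replicate every object in `names`, skipping persistent failures.

    `replicate` is any callable taking an object name. `failures` is a
    Counter kept by the caller across passes, so an object that has
    already failed `max_failures` times is skipped instead of retried.
    Returns the list of names skipped during this pass.
    """
    skipped = []
    for name in names:
        if failures[name] >= max_failures:
            log.warning("skipping %s after %d failed attempts",
                        name, failures[name])
            skipped.append(name)
            continue
        try:
            replicate(name)
        except IncompleteSend:
            # Record the failure and carry on with the next object
            # rather than abandoning the whole container.
            failures[name] += 1
            log.error("incomplete send for %s (attempt %d)",
                      name, failures[name])
    return skipped
```

The returned list gives the caller something concrete to alert on, which addresses the "without any warning" part as well.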

Mentioned in SAL (#wikimedia-operations) [2017-04-05T09:04:41Z] <volans> deleted the 2 swift thumbs that were making swiftrepl stuck in a loop: T162122

Mentioned in SAL (#wikimedia-operations) [2017-04-05T09:48:23Z] <volans> deleted a third swift thumb that was making swiftrepl stuck in a loop: T162122

Volans added a comment. Apr 5 2017, 9:48 AM

The third one was:

wikipedia-commons-local-thumb.3b        3/3b/Hendrick_de_Keyser_-_gulden_cabinet.png/85px-Hendrick_de_Keyser_-_gulden_cabinet.png       E-Tag mismatch: bc68f6efc732fda68647dcd65867cef9/cd3b1b810889387c0ff7bed187e87125, syncing

The first run of swiftrepl has finally completed! It is now in the 2-hour sleep between runs; I'll check that the next one completes without manual intervention.

Volans added a comment. Apr 6 2017, 8:04 AM

A second pass was completed successfully without any manual intervention.

fgiunchedi changed the task status from Open to Stalled. Nov 5 2019, 11:15 AM

Stalling since I don't think we've seen a recurrence yet. swiftrepl now runs as a timer+service once a week; overlapping runs should result in an alert, I believe.
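
For illustration only, one common way for a periodic job to notice an overlapping run is a non-blocking exclusive lock on a well-known file. This is a sketch of that general technique, not a description of how the timer+service or its alert is actually set up; the function name try_acquire_run_lock and the lock path are hypothetical, and fcntl is only available on Unix-like systems.

```python
import fcntl
import os


def try_acquire_run_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns an open file descriptor holding the lock, or None if another
    run already holds it. The lock is released when the descriptor is
    closed or the process exits.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another open file description already holds the lock.
        os.close(fd)
        return None
    return fd
```

A run that gets None back can exit with a non-zero status, which gives monitoring something to alert on.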