
>=27k objects listed in swift containers but not extant
Closed, Resolved · Public

Description

The first run of an rclone-based replacement for swiftrepl (cf. T299125) failed because of i/o errors. Specifically, it found 27,143 objects that appear in eqiad swift container listings (but not in codfw container listings) and yet do not in fact exist - attempting to HEAD or GET them results in HTTP 404 (Not Found).

It seems likely that this is a subset of the problematic objects - any such objects only in codfw, or in both codfw and eqiad, won't have been discovered.

It would be good to know:

  • Can we delete the corresponding container entries? [so e.g. rclone will no longer error out, and so our container listings are accurate]
  • Do (any of) these correspond to objects that MW still thinks exist? (A very limited sample of searching the wikis for filenames suggests not.)
    • if so, can we restore them from backup?
  • Can we find what MW thinks happened to these objects?
  • Are we making more such bad objects? If so, how/why? T289996 is probably relevant here

A small set of example object names and types is in P43158 (NDA). To produce a full list, I recorded rclone output to /home/mvernon/logoutput on ms-be1069. I verified that there are no object names with newlines in them (with grep -c and wc), and then produced a list of the bad objects:

sed -ne 's/^.* ERROR : \(.*\): Failed to copy: failed to open source object: Object Not Found/\1/p' <logoutput >sadobjects

As expected, that has 27,143 lines in it. Since it might be useful, I've processed a list of top-level containers (de-sharded) to show roughly how the objects are distributed:

mvernon@ms-be1069:~$ cut -f 1 -d '/' sadobjects | sed -e 's/\...$//' | sort | uniq -c | sort -bgr
  16297 wikipedia-commons-local-public
   3265 wikipedia-en-local-public
   1165 wikipedia-it-local-public
   1075 wikipedia-ja-local-public
    989 wikipedia-az-local-public
    905 wikipedia-commons-local-transcoded
    605 wikipedia-ru-local-public
    502 wikipedia-bn-local-public
    312 wikipedia-uk-local-public
    312 wikipedia-de-local-public
    293 wikipedia-id-local-public
    251 wikipedia-fr-local-public
    134 wikipedia-commons-local-deleted
    125 wikipedia-ko-local-public
    120 wikipedia-zh-local-public
    113 wikipedia-sr-local-public
     77 wikipedia-pnb-local-public
     69 wikipedia-hu-local-public
     68 wikipedia-th-local-public
     63 wikipedia-tr-local-public
     57 wikipedia-lv-local-public
     52 wikipedia-fi-local-public
     45 wikipedia-ca-local-public
     39 wikipedia-he-local-public
     22 wikipedia-en-local-transcoded
     21 wikipedia-ro-local-public
     16 wikipedia-ru-local-transcoded
     13 wikipedia-sh-local-public
     12 wikiquote-hu-local-public
     12 wikipedia-it-local-transcoded
     10 wikivoyage-zh-local-public
     10 wikipedia-test-local-public
      8 wikipedia-ta-local-transcoded
      8 wikipedia-pt-local-deleted
      8 wikipedia-ka-local-public
      7 wikipedia-hy-local-public
      5 wikipedia-hr-local-public
      4 wikipedia-bcl-local-public
      4 wikimedia-id-internal-local-public
      4 wikibooks-si-local-public
      3 wikisource-fr-local-public
      3 wikiquote-ja-local-public
      3 wikipedia-th-local-transcoded
      3 wikipedia-jv-local-public
      3 wikipedia-de-local-deleted
      3 wikipedia-commons-gwtoolset-metadata
      3 wikipedia-ar-local-public
      2 wikiversity-en-local-public
      2 wikipedia-wa-local-public
      2 wikipedia-id-local-deleted
      2 wikipedia-hi-local-public
      2 wikipedia-eo-local-public
      2 wikipedia-en-local-deleted
      2 wikipedia-az-local-deleted
      1 wikisource-jv-local-public
      1 wikisource-it-local-public
      1 wikisource-es-local-public
      1 wikisource-bn-local-public
      1 wikiquote-it-local-public
      1 wikipedia-wuu-local-public
      1 wikipedia-uk-local-transcoded
      1 wikipedia-mr-local-public
      1 wikipedia-lb-local-public
      1 wikipedia-kk-local-public
      1 wikipedia-ca-local-deleted

Event Timeline

From the small sample, it would seem that the files were uploaded and, when they were later deleted in MediaWiki, were copied into the archive and removed from swift storage - but in some cases the contents disappeared from swift while the "filename" (the container listing entry) did not.

Change 881662 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: make rclone less fussy

https://gerrit.wikimedia.org/r/881662

Change 881662 merged by MVernon:

[operations/puppet@production] swift: make rclone less fussy

https://gerrit.wikimedia.org/r/881662

The timer job ran this morning, with our less picky settings, and ended thus:

[...]
Jan 23 10:24:35 ms-be1069 swift-rclone-sync[1539164]: ERROR : wikipedia-de-local-public.16/1/16/Symbol_Limes.png: Couldn't delete: Object Not Found
Jan 23 10:24:35 ms-be1069 swift-rclone-sync[1539164]: ERROR : wikipedia-commons-local-public.6e/6/6e/DebrecenDSCN0208.JPG: Couldn't delete: Object Not Found
Jan 23 10:24:35 ms-be1069 swift-rclone-sync[1539164]: ERROR : Attempt 1/1 failed with 56355 errors and: failed to delete 29156 files
Jan 23 10:24:35 ms-be1069 swift-rclone-sync[1539164]: Failed to sync with 56355 errors: last error was: failed to delete 29156 files

From which I draw three key observations:

  1. There is a similar number of listed-but-not-extant objects in codfw (29,156), which is perhaps not surprising.
  2. It seems that the listed-but-not-extant objects cannot be deleted, at least by rclone, which is concerning.
  3. There are 56 more listed-but-not-extant objects in eqiad than there were last week, which is also concerning.

Point 3 suggests there is an ongoing issue here (though if the current small-number-sample is correct and this is something going awry in the deletion workflow, that's not UBN); Point 2 is perhaps more concerning. It's probably worth trying the swift CLI to see if that can delete one of these objects successfully.
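
For reference, a rough sketch (not a tested procedure) of the kind of single-object check meant here, using python-swiftclient rather than the CLI; the container/object names are just the example from the log excerpt above, and reading credentials from ST_AUTH/ST_USER/ST_KEY is an assumption:

# Hypothetical sketch: HEAD then DELETE a suspected ghost object, report the
# outcomes, then check whether the entry is still in the container listing.
import os
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

container = "wikipedia-de-local-public.16"   # placeholder example from the log above
obj = "1/16/Symbol_Limes.png"

conn = Connection(authurl=os.environ["ST_AUTH"],
                  user=os.environ["ST_USER"],
                  key=os.environ["ST_KEY"])

for action, call in (("HEAD", conn.head_object), ("DELETE", conn.delete_object)):
    try:
        call(container, obj)
        print(f"{action} succeeded")
    except ClientException as e:
        print(f"{action} failed with HTTP {e.http_status}")

_, listing = conn.get_container(container, prefix=obj, full_listing=True)
print("still listed" if any(o["name"] == obj for o in listing) else "no longer listed")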

Picking one of those to go log-diving (via the hacky sudo cumin O:swift::proxy 'grep Symbol_Limes.png /var/log/swift/proxy-access.log || true') gets 3 hits, one of which is a red herring:

Jan 23 02:12:25 ms-fe2011 proxy-server: 114.24.85.80 10.192.32.36 23/Jan/2023/02/12/25 GET /v1/AUTH_mw/wikipedia-commons-local-public.16/1/16/Symbol_Limes.png HTTP/1.0 200 https://commons.wikimedia.org/ Mozilla/5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%3B%20rv:105.0%29%20Gecko/20100101%20Firefox/105.0 - - 2197 - txaee15d6bdcbd457fa481c-0063cded09 - 0.0382 - - 1674439945.671721697 1674439945.709963799 0
Jan 23 10:23:07 ms-fe2012 proxy-server: 10.64.131.2 10.64.131.2 23/Jan/2023/10/23/07 DELETE /v1/AUTH_mw/wikipedia-de-local-public.16/archive/1/16/20131007201316%2521Symbol_Limes.png HTTP/1.0 404 - rclone/ AUTH_tk4291c253a... - 70 - tx7af7dd2d60ea44f399876-0063ce600b - 0.1056 - - 1674469387.690390825 1674469387.795986652 0
Jan 23 10:24:35 ms-fe2012 proxy-server: 10.64.131.2 10.64.131.2 23/Jan/2023/10/24/35 DELETE /v1/AUTH_mw/wikipedia-de-local-public.16/1/16/Symbol_Limes.png HTTP/1.0 404 - rclone/ AUTH_tk4291c253a... - 70 - tx8abe0371b44b4313a81c4-0063ce6063 - 0.0543 - - 1674469475.023864508 1674469475.078210115 0

The first entry is a false positive from the similarly-named image on Commons; the latter two show rclone trying and failing to delete it (getting 404), which is not the same as the swift CLI failing, but does suggest we have a problem here (an item in a container listing cannot be deleted via the usual approach, because the DELETE gets 404)...

I did a second rclone run on 24 Jan, hoping that entries in that list that weren't in the 23 Jan list would be enlightening. Extracting the copy list as before, then:

join -v2 <( sort sadobjects_20230123 ) <( sort sadobjects_20230124)

Gives us a list of objects in the 24th Jan output but not the 23rd. Picking one at random, we can then do

sudo cumin -x O:swift::proxy 'zgrep wikipedia-commons-local-public.80/8/80/Cook-SouthAustralia.jpg /var/log/swift/proxy-access.lo*'

Which ought to give us a history of this object. Alas, it's unhelpful:
{P43346}
You can see it's already returning 404 before the start of our logs - and many of the requests are coming from thumbor hosts (?!?).

I'll try some others to see if we get any more useful outcomes...

I've done a bunch of investigating, and I don't think I'm much nearer a useful answer.

First, though, it's clear that while DELETE on these "ghost" objects returns 404, it does in fact succeed - no object from 23rd Jan's run recurs in subsequent deletion runs (a smattering do occur as subsequent failed copies, but that's another question). Checked with:

mvernon@ms-be1069:~$ grep -Fcf <(sed -ne "s/^.* ERROR : \(.*\): Couldn't delete: Object Not Found/\1/p" logoutput_20230123) logoutput_2023012{4,5,6,7}
logoutput_20230124:58
logoutput_20230125:67
logoutput_20230126:64
logoutput_20230127:69
mvernon@ms-be1069:~$ grep -Ff <(sed -ne "s/^.* ERROR : \(.*\): Couldn't delete: Object Not Found/\1/p" logoutput_20230123) logoutput_2023012{4,5,6,7} | grep -c "Failed to copy"
258

I think this is a bug in Swift - these DELETEs are successful so should not report a failure code - and have reported it upstream. Arguably rclone is also buggy here.

So, what about the objects that are failing to copy? Can we find any of them being successfully interacted with in our logs? Using some ugly python (P43433) to extract those objects that were only failing to be copied on the latest rclone run, we can interrogate logs across the frontends, sorting by path then date then time:

sudo cumin -x --force --no-progress --no-color -o txt O:swift::proxy 'zgrep -hf <(curl -s https://people.wikimedia.org/~mvernon/20230127_only_sadobjects) /var/log/swift/proxy-access.lo*' 2>/dev/null | sort -b -k 11,11 -k 3n,3 -k 4,4 >27_only_output_sorted

...alas every entry in that is a 404.

How about looking at everything? Find the union of all the failed-to-copy objects, and do a similar analysis (using grep -F for speed, which gives us a ~100-times speedup on a set of patterns this large):

sudo cumin -x --force --no-progress --no-color -o txt O:swift::proxy 'zgrep -hFf <(curl -s https://people.wikimedia.org/~mvernon/all_sadobjects) /var/log/swift/proxy-access.lo*' 2>/dev/null | sort -b -k 11,11 -k 3n,3 -k 4,4 >allsad_output

That gives us 213,348 lines of output, nearly all of which are 404 or 499:

mvernon@cumin2002:~$ grep -Evc 'HTTP/1.0 (404|499)' allsad_output 
75

Those logs relate to two objects.
The first is wikipedia-commons-local-public.53/5/53/Flying_Seagull.jpg, which is a file that actually exists on Commons; note that both log entries are successes, and neither relates to rclone. Noting the date of the failed copy, this is before our log extracts begin, so probably not worth spending more time on - Commons thinks it was successfully uploaded on 2023-01-20.

logoutput:2023/01/17 10:54:21 ERROR : wikipedia-commons-local-public.53/5/53/Flying_Seagull.jpg: Failed to copy: failed to open source object: Object Not Found

{P43434}

The second is wikipedia-en-local-public.bb/b/bb/Ernest_Woodruff.jpeg, which is extant on enwiki. Logs here show initial 404s, then PUTs (one in each DC), and thereafter successful retrievals. Here again, the failed copy log entry comes before the object was successfully uploaded (on 23rd January).

logoutput:2023/01/17 11:48:03 ERROR : wikipedia-en-local-public.bb/b/bb/Ernest_Woodruff.jpeg: Failed to copy: failed to open source object: Object Not Found

Which does, perhaps, invite the question of why this object was in a listing before we have a record of a successful upload.

More concerning, though, is that we are left with no clear answer as to why the set of things rclone is trying but failing to copy from eqiad to codfw isn't constant - as best we can tell from the logs these are not objects that are being newly uploaded. There's a suggestion that leaving a container disk unmounted for 'long enough' can cause ghost objects to be produced (by failing to propagate a delete). Is it possible that such ghost listings are only in some shards of the replicated container database and so the list operation might not always return the same answer? That is an unpalatable thought, but would explain the behaviour we're seeing.
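
One way to probe that hypothesis would be to take several listings of the same container and compare them as sets rather than just counting lines; a rough sketch of that comparison (the container name is just an example, and the ST_* environment-variable credentials are an assumption):

# Hypothetical sketch: list a container several times and report names that do
# not appear in every listing (i.e. listing instability between replicas).
import os
from swiftclient.client import Connection

container = "wikipedia-commons-local-public.ad"   # example container

conn = Connection(authurl=os.environ["ST_AUTH"],
                  user=os.environ["ST_USER"],
                  key=os.environ["ST_KEY"])

listings = []
for _ in range(5):
    _, objects = conn.get_container(container, full_listing=True)
    listings.append({o["name"] for o in objects})

anywhere = set.union(*listings)
everywhere = set.intersection(*listings)
unstable = sorted(anywhere - everywhere)
print(f"{len(anywhere)} names seen in total, {len(unstable)} not present in every listing")
for name in unstable:
    print(name)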

https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AFlying+Seagull.jpg I think explains the observations - there was a previous object by the same name deleted in June 2022; if that deletion part-failed and/or left a ghost, that would explain our findings re this object.

I've mentioned to Emperor some things that help explain *some* of the comments. E.g. https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AFlying+Seagull.jpg shows that a file with the same name was uploaded last year, and probably left a non-fully-deleted file, which later got uploaded again.

MediaWiki's behaviour is very dynamic, which makes debugging these issues harder. In particular, titles/upload paths are not stable, and objects can be renamed or re-uploaded again after soft or hard deletion. This will be important when coordinating a potential mass DELETE, as we could be affecting a different file than the one that initially caused the issue (some extra safety checks will be needed if reasonable).

More updates. @Eevans pointed out that we do now have some clients setting the expiry headers, so it was worth checking the state of the expiry queue.
I did so with a modified script found online (which should probably be pushed upstream), and that showed:

  Total entries: 276369
Pending entries: 8
  Stale entries: 20

And entries in the queue were from 2023-01-27 to 2026-01-31; so I think we can largely discount expired-but-not-deleted objects for now (actually running the object expirer is a KR for this quarter, cf T229584).
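
For the record, a rough sketch of the sort of counting that script does - assuming the queue entries (which in the hidden .expiring_objects account are named <delete-at-timestamp>-<account>/<container>/<object>) have already been dumped one per line to a file, and with guessed pending/stale thresholds rather than the script's actual definitions:

# Hypothetical sketch: classify expirer-queue entries dumped one per line to a file.
# "Pending" = delete-at has passed but is within a grace window; "stale" = it
# passed longer ago than that. Both thresholds are assumptions.
import sys
import time

GRACE = 24 * 3600
now = time.time()
total = pending = stale = 0

with open(sys.argv[1]) as fh:
    for line in fh:
        total += 1
        delete_at = int(line.split("-", 1)[0])
        if delete_at <= now - GRACE:
            stale += 1
        elif delete_at <= now:
            pending += 1

print(f"  Total entries: {total}")
print(f"Pending entries: {pending}")
print(f"  Stale entries: {stale}")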

Picking the container with the highest variation in the number of sad objects, wikipedia-commons-local-public.ad, I thought it worthwhile to see if its various database files agree. You can find the files on-disk thus:

sudo swift-get-nodes /etc/swift/container.ring.gz AUTH_mw wikipedia-commons-local-public.ad
Account         AUTH_mw
Container       wikipedia-commons-local-public.ad
[...]
Use your own device location of servers:
such as "export DEVICE=/srv/node"
ssh 10.192.0.149 "ls -lah ${DEVICE:-/srv/node*}/sdb3/containers/2147/050/08633edb9748fd5d9f3b5a4e94672050"
ssh 10.192.16.72 "ls -lah ${DEVICE:-/srv/node*}/sdb3/containers/2147/050/08633edb9748fd5d9f3b5a4e94672050"
ssh 10.192.48.8 "ls -lah ${DEVICE:-/srv/node*}/sdb3/containers/2147/050/08633edb9748fd5d9f3b5a4e94672050"
[...]

We can then on those target nodes (ms-be20[4-6]1) inspect the database file thus:

sudo swift-container-info /srv/swift-storage/sdb3/containers/2147/050/08633edb9748fd5d9f3b5a4e94672050/08633edb9748fd5d9f3b5a4e94672050.db

In codfw, all three replicas agree on object count and size:

Object Count: 385571
Bytes Used: 1731841915610
Reported Object Count: 385569
Reported Bytes Used: 1731828949763

In eqiad, however, there is not agreement; 2/3 of the replicas (ms-be106[2,4]) have:

Object Count: 385571
Bytes Used: 1731818999985
Reported Object Count: 385566
Reported Bytes Used: 1731804776543

...but the third (ms-be1061) has instead:

Object Count: 385961
Bytes Used: 1733329643793
Reported Object Count: 385958
Reported Bytes Used: 1733316309119

So it's going to be interesting, I suspect, to try and get a list of the different objects from these variations and see how those compare to the changing list of sad objects.

I've extracted the sqlite database files for this container and had a look. To find the schema one can look at swift/container/backend.py in the source, or inspect the .db file directly; anyhow, it's:

CREATE TABLE object (
                ROWID INTEGER PRIMARY KEY AUTOINCREMENT,
                name TEXT,
                created_at TEXT,
                size INTEGER,
                content_type TEXT,
                etag TEXT,
                deleted INTEGER DEFAULT 0
            , storage_policy_index INTEGER DEFAULT 0)

There are 52 deleted objects in eqiad, 53 in codfw (i.e. where deleted == 1); the 52 are the same, and the extra deleted record in codfw is:

QUERY='SELECT name FROM object WHERE deleted ==1 '; join -v2 <( sqlite3 ms-be1061.db "$QUERY" ) <( sqlite3 ms-be2061.db "$QUERY" )
a/ad/Petertje_van_den_Hengel_-_Verzetskruis_-_na_de_oorlog_uitgereikt_door_Prins_Bernhart.JPG

Interestingly, this row is present in the ms-be1061 (i.e. the divergent eqiad) database, but marked as not deleted. It doesn't appear in any logs, so it's been deleted for a while. There is one object in codfw but not in eqiad (despite eqiad having more objects), which is

a/ad/Tone_Sekelius,_Melodifestivalen_2023,_Göteborg,_repetition_fredag_01.jpg

(I don't think this is relevant to our enquiries here, but I note it just in case).

So, returning to the objects present in ms-be1061 (the divergent eqiad node) but not on the other two eqiad nodes, and also not in codfw (i.e. what we think are ghost objects in eqiad that rclone would try (and fail) to replicate to codfw):

ATTACH DATABASE "ms-be1061.db" as bigeqiad;
ATTACH DATABASE "ms-be1062.db" as smalleqiad;
ATTACH DATABASE "ms-be2061.db" as codfw;

WITH eqiaduniq AS (
    WITH sep AS (SELECT * FROM smalleqiad.object WHERE deleted == 0),
    bep AS (SELECT * FROM bigeqiad.object WHERE deleted == 0)
        SELECT * FROM bep LEFT JOIN sep USING(name) WHERE sep.name IS NULL
), 
    cp AS (SELECT * FROM codfw.object WHERE deleted == 0)
        SELECT COUNT(*) FROM eqiaduniq LEFT JOIN cp USING(name)
            WHERE cp.name IS NULL;

Gives us 386 (of 390 entries in eqiaduniq).

Inspecting our logs from rclone runs, the number of failures from this container varies (range 99-130); 103 objects appear only once, and 364 appear in total. So it appears that listing our container can return a range of values? Let us confirm this:

#Run this five times
sudo rclone --config /etc/swift/rclone.conf ls eqiad:wikipedia-commons-local-public.ad | wc -l
386073
386053
386047
386051
386057

The counts vary, and not monotonically. Let's confirm this isn't an rclone bug by using the swift cli:

. /etc/swift/account_AUTH_mw.env
swift list wikipedia-commons-local-public.ad | wc -l
386054
386002
386096

...and since ms-fe1009 is still stretch, let's repeat on ms-fe1012, a bullseye host (load credentials with read -s):

386060
386086
386023
386032

Since e.g. list_objects_iter can use ROWID, it seems worth checking that it's consistent:

sqlite> SELECT COUNT(*) from bigeqiad.object INNER JOIN smalleqiad.object ON ( bigeqiad.object.ROWID == smalleqiad.object.ROWID AND bigeqiad.object.name == smalleqiad.object.name );
352193
sqlite> SELECT COUNT(*) from bigeqiad.object INNER JOIN smalleqiad.object ON ( bigeqiad.object.ROWID == smalleqiad.object.ROWID AND bigeqiad.object.name <> smalleqiad.object.name );
28743
sqlite> SELECT COUNT(*) from bigeqiad.object INNER JOIN smalleqiad.object ON ( bigeqiad.object.ROWID == smalleqiad.object.ROWID );
380936

We see similar inconsistency in codfw, though the actual listing there is consistent (always 385972), so I think we can set this aside as an incidental finding.

So, in the container we've examined:

  1. one of the three container databases contains extra "ghost" objects
  2. these "ghost" objects are partially represented in any given listing of the container
  3. these "ghost" objects are the objects that are causing the rclone failures for this container

The next step is to see whether there are any tombstone records for the ghosts.

The answer is (small sample size) that there are no on-disk records for ghosts.

Example with a deleted (non-ghost) object listed in all 3 databases:

mvernon@ms-be1069:~$ sudo swift-get-nodes /etc/swift/object.ring.gz AUTH_mw wikipedia-commons-local-public.ad a/ad/1852cdfunny.svg
[...]
ssh 10.64.32.117 "ls -lah ${DEVICE:-/srv/node*}/sdy1/objects/16346/36f/3fda77b8f3515cd3635027742c6ef36f"
[...]
mvernon@ms-be1066:~$ sudo ls /srv/swift-storage/sdy1/objects/16346/36f/3fda77b8f3515cd3635027742c6ef36f
1674882596.43642.ts
mvernon@ms-be1066:~$ sudo file /srv/swift-storage/sdy1/objects/16346/36f/3fda77b8f3515cd3635027742c6ef36f/1674882596.43642.ts
/srv/swift-storage/sdy1/objects/16346/36f/3fda77b8f3515cd3635027742c6ef36f/1674882596.43642.ts: empty

i.e. there is an expected empty tombstone file for this normally-deleted object.

Picking a ghost object (i.e. marked undeleted in the divergent eqiad database, not present in the other two eqiad databases, not present in codfw), e.g. a/ad/0_73aa1_2d9bafe_orig.jpg, we perform the same steps, but none of the referred-to directories exist at all.

Can we delete this object? First check it's still present in the database with this python snippet (as the sqlite3 CLI tool isn't installed on ms backends):

import sqlite3
# Open the divergent replica's container DB directly on-disk
con = sqlite3.connect("/srv/swift-storage/sda3/containers/23440/c3a/5b9097139dc3768e2323b8b0427a2c3a/5b9097139dc3768e2323b8b0427a2c3a.db")
cur = con.cursor()
# Look for the suspected ghost object's row
cur.execute("SELECT * FROM object WHERE name ='a/ad/0_73aa1_2d9bafe_orig.jpg'")
ans = cur.fetchall()

And it is still present:

>>> ans[0]
(112427, 'a/ad/0_73aa1_2d9bafe_orig.jpg', '1412083395.08540', 866785, 'image/jpeg', 'b496c3c5dbf76226c211c35e2c854649', 0, 0)

Then, on a frontend:

root@ms-fe1009:/home/mvernon# . /etc/swift/account_AUTH_mw.env 
root@ms-fe1009:/home/mvernon# swift stat wikipedia-commons-local-public.ad a/ad/0_73aa1_2d9bafe_orig.jpg
Object HEAD failed: http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.ad/a/ad/0_73aa1_2d9bafe_orig.jpg 404 Not Found
Failed Transaction ID: tx3715ad8e803e40f4bc68d-0063e65f3d
root@ms-fe1009:/home/mvernon# swift delete wikipedia-commons-local-public.ad a/ad/0_73aa1_2d9bafe_orig.jpg
Error Deleting: wikipedia-commons-local-public.ad/a/ad/0_73aa1_2d9bafe_orig.jpg: Object DELETE failed: http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.ad/a/ad/0_73aa1_2d9bafe_orig.jpg 404 Not Found  [first 60 chars of response] b'<html><h1>Not Found</h1><p>The resource could not be found.<'

And re-running the query shows the row is now marked as deleted in the divergent database:

>>> ans[0]
(481512, 'a/ad/0_73aa1_2d9bafe_orig.jpg', '1676042079.97693', 0, 'application/deleted', 'noetag', 1, 0)

...and the other databases also now have a corresponding deleted row:

#ms-be1064
>>> ans[0]
(483680, 'a/ad/0_73aa1_2d9bafe_orig.jpg', '1676042079.97693', 0, 'application/deleted', 'noetag', 1, 0)
#ms-be1062
>>> ans[0]
(483679, 'a/ad/0_73aa1_2d9bafe_orig.jpg', '1676042079.97693', 0, 'application/deleted', 'noetag', 1, 0)

Thus, pace the 404 codes, issuing DELETE for these objects does move the container towards a more-consistent and correct state.

...and it also produces a tombstone file:

mvernon@ms-be1061:~$ sudo swift-get-nodes /etc/swift/object.ring.gz AUTH_mw wikipedia-commons-local-public.ad a/ad/0_73aa1_2d9bafe_orig.jpg
[...]
ssh 10.64.32.64 "ls -lah ${DEVICE:-/srv/node*}/sdv1/objects/19149/81a/4acd7e326ce4b517f4dacbb96c41e81a"
[...]
mvernon@ms-be1062:~$ sudo ls /srv/swift-storage/sdv1/objects/19149/81a/4acd7e326ce4b517f4dacbb96c41e81a
1676042079.97693.ts

I've found this task via a different pathway, trying to help editors in T328875. Debugging that one I ended up dealing with a swift ghost from 2017. While this is old enough that it can't really be chased down, I thought I'd add it anyway. The summary is that I ran

swift list --prefix 5/5d/EPA_De wikipedia-commons-local-thumb.5d

on ms-fe1009 on Feb 8, and in the listing I couldn't see the 200px thumbnail. I could see plenty of other thumbnails though, the same as when running it right now[1].

Interestingly, and on a whim, I ran swift stat and got back output somewhat like the following (I am reconstructing from memory).

swift stat wikipedia-commons-local-thumb.5d 5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
            Account: AUTH_mw
          Container: wikipedia-commons-local-thumb.5d
             Object: 5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
       Content Type: image/jpeg
     Content Length: 11263
      Last Modified: Some time in late 2017
               ETag: some ETAG
Content-Disposition: inline;filename*=UTF-8''EPA_Deputy_Admin_Bob_Perciasepe.jpg
      Accept-Ranges: bytes

I don't remember if I saw X-trans-id and X-timestamp headers, sorry.

Given that a) this was a thumbnail, so easy to regenerate, and b) the thumbnail was already 5+ years old, so difficult to chase down the cause given there are no logs left after all this time as well as changes in the infrastructure, I proceeded with a swift delete wikipedia-commons-local-thumb.5d 5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg command. Afterwards, just accessing the URL once more successfully put a new thumbnail in swift and fixed the issue.

[1]

swift list --prefix 5/5d/EPA_De wikipedia-commons-local-thumb.5d 
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/100px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/1024px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/110px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/120px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/130px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/140px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/144px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/150px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/160px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/165px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/180px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/1920px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/192px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg.gif
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg.png
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/201px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/202px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/260px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/270px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/400px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/480px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/50px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/60px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/614px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/75px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/800px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/80px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/90px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/960px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/96px-EPA_Deputy_Admin_Bob_Perciasepe.jpg

Whatever you've found is not the same issue as with ghost objects - a ghost object as defined here is one which appears in swift list (or asking swift for the contents of a container) but then does not exist - so any further operation returns 404. swift stat does a HEAD so will say 404 for a ghost object.
If you have an object where swift stat knows of the object (i.e. HEAD works) but the object is not GET-able, that's something else going wrong :(

But yes, thumbnails are transient, so it should always be OK to delete them.

We're due another full backup of swift contents in the next few days, but I think we need a cookbook or similar to script handling these. In outline, assuming we specify eqiad as the copy to work on and eqiad is primary DC:

For specified $container (or iterate through list thereof):
  in both codfw and eqiad:
    ssh to a storage node, run swift-get-nodes /etc/swift/container.ring.gz AUTH_mw $container
    parse output to find the 3 copies of the container DB
    transfer each container.db to the cumin node (and name them hostname.db)
  check 2/3 eqiad containers agree on number of undeleted objects [sqlite SELECT COUNT(*) FROM object WHERE deleted == 0]
  extract ghost list (this is an sqlite LEFT JOIN with "right" side null) in eqiad
  if codfw containers all agree:
    create list of ghosts with no entry in codfw (likewise, a LEFT JOIN with "right" side null)
  else:
    create list of ghosts with no non-ghost entry in codfw
  in eqiad:
    ssh to node with MW credential (currently swiftrepl node)
    for $entry in list of ghosts:
      run swift stat container $entry ; verify 404 (and ignore non-zero exit)
      run swift delete container $entry ; verify 404 (and ignore non-zero exit); LOG on a 2xx code, as we lost the race

We need another mode for when eqiad is specified as the copy to work on and codfw is the primary DC - in that case we delete ghosts which do appear as non-ghosts in codfw (since the next rclone run will then repopulate those entries from the real copy in codfw).
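
For illustration, a rough sketch of the ghost-extraction step from the outline above, run against container DB copies once they've been pulled back and renamed per host; the filenames are the wikipedia-commons-local-public.ad copies used earlier, standing in for whatever the cookbook transfers:

import sqlite3

# Hypothetical sketch: names marked undeleted in the divergent eqiad replica
# but with no undeleted entry in an agreeing eqiad replica nor in codfw.
con = sqlite3.connect("ms-be1061.db")              # divergent eqiad replica
con.execute('ATTACH DATABASE "ms-be1062.db" AS goodeqiad')
con.execute('ATTACH DATABASE "ms-be2061.db" AS codfw')

query = """
SELECT b.name
  FROM main.object AS b
  LEFT JOIN goodeqiad.object AS g ON g.name = b.name AND g.deleted = 0
  LEFT JOIN codfw.object     AS c ON c.name = b.name AND c.deleted = 0
 WHERE b.deleted = 0 AND g.name IS NULL AND c.name IS NULL
"""

for (name,) in con.execute(query):
    print(name)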

> Whatever you've found is not the same issue as with ghost objects - a ghost object as defined here is one which appears in swift list (or asking swift for the contents of a container) but then does not exist - so any further operation returns 404. swift stat does a HEAD so will say 404 for a ghost object.
> If you have an object where swift stat knows of the object (i.e. HEAD works) but the object is not GET-able, that's something else going wrong :(

Indeed a different issue, in fact of the inverse nature; I should have pointed that out. Some extra information in T328875, regarding thumbnails other than the originally reported one, was added later on, and it definitely has a different root cause from what the original task was about. My current reading is that eqiad and codfw were not in sync about that thumbnail, which on my first reading was also nicely identified in this task (the discrepancy exists, but let's keep this task scoped to those listed-but-not-extant objects). I'll file a new task to track it, because there is some obvious discrepancy between eqiad and codfw that is causing editors frustration and we should capture that.

Change 905595 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/cookbooks@master] sre.swift.remove-ghost-objects: new cookbook

https://gerrit.wikimedia.org/r/905595

Change 905657 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] spicerack: add the transferpy package

https://gerrit.wikimedia.org/r/905657

Change 905657 merged by MVernon:

[operations/puppet@production] spicerack: add the transferpy package

https://gerrit.wikimedia.org/r/905657

While working on this, I found a container where the codfw container DB replicas are consistent, but the listing is wrong. The objects that actually exist are all the same as in eqiad:

root@ms-fe2009:/home/mvernon# ( for i in $(swift list wikipedia-ja-local-public.21 ); do if  swift stat wikipedia-ja-local-public.21 "$i" >/dev/null 2>&1 ;  then echo "$i" ; fi ; done ) | sort | md5sum
126013b227f0b918ba658714d1b5e643  -
root@ms-fe1009:/home/mvernon# ( for i in $(swift list wikipedia-ja-local-public.21 ); do if  swift stat wikipedia-ja-local-public.21 "$i" >/dev/null 2>&1 ;  then echo "$i" ; fi ; done ) | sort | md5sum
126013b227f0b918ba658714d1b5e643  -

Those are both a list of 45 objects. In codfw there are 22 more objects that are ghosts (but consistent across all three container DB replicas):

root@ms-fe2009:/home/mvernon# for i in $(swift list wikipedia-ja-local-public.21 ); do if ! swift stat wikipedia-ja-local-public.21 "$i" >/dev/null 2>&1 ; then echo "$i" ; fi ; done
2/21/Chounichiji1.jpg
2/21/Combino_hensei.png
2/21/Hirai_River.jpg
2/21/Hitachi-Motoyamagekizyou.jpg
2/21/Ichimokuzin.png
2/21/Maid_character.png
2/21/Marumori_Station_02.JPG
2/21/Ministry_of_Industry.jpg
2/21/Odakyu3000-1-01.JPG
2/21/SAT721.jpg
2/21/Tetris_TSPIN-Triple.png
2/21/Yagiriowatashi.jpg
2/21/上信上州新屋.jpg
2/21/仙台市営地下鉄旭ヶ丘駅.jpg
2/21/千鳥町駅.JPG
2/21/治良門橋駅ホーム全景.jpg
2/21/琴ヶ浜060730-1.jpg
2/21/街区表示板.jpg
archive/2/21/20151206110536!琴ヶ浜060730-1.jpg
archive/2/21/20151206110616!琴ヶ浜060730-1.jpg
archive/2/21/20151206111119!琴ヶ浜060730-1.jpg
archive/2/21/20151206141149!琴ヶ浜060730-1.jpg

I think the correct approach is to delete these 22 objects - @jcrespo is it easy for you to confirm that the backups don't contain any of these objects?

These look to be relatively old objects - I looked one up in a container DB and it was created in 2014.
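
(For reference, created_at in the container DB is a Unix epoch timestamp stored as a string - e.g. the value seen earlier for a/ad/0_73aa1_2d9bafe_orig.jpg - so dating an entry is just:)

# created_at in the container DB is a Unix epoch timestamp stored as a string
from datetime import datetime, timezone
print(datetime.fromtimestamp(float("1412083395.08540"), tz=timezone.utc))
# 2014-09-30 13:23:15.085400+00:00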

I went looking at the failures from yesterday's rclone run. As well as the above wikipedia-ja-local-public.21 I have two further candidates for deletion:

wikipedia-en-local-public.80/8/80/Anotheryear.jpg
wikipedia-commons-local-public.43/4/43/The_Collected_Works_of_Mahatma_Gandhi,_vol._92.pdf

In both cases:

  • eqiad and codfw container dbs are consistent
  • all container dbs (codfw & eqiad) list the object as not deleted
  • the object does not in fact exist
  • no corresponding filesystem entity exists (not even a tombstone record)

@jcrespo would you mind checking that the backups don't contain either of these objects as well as the 22 from in wikipedia-ja-local-public.21, please? I'd like to delete them to try and increase consistency (and maybe have rclone run to completion OK...)

I have at least some of those files, although I am unsure whether each is the same one as referred to here or a previous version of the same name - it will need more research.

0)
wiki                 | enwiki
title                | Anotheryear.jpg
production_container | wikipedia-en-local-public.80
production_path      | 8/80/Anotheryear.jpg
sha1                 | 28de46d307b6891f03e8f71685137c10cef7f914
sha256               | 87ebe966c2862ed1afc2b2db40b3edb0c2b8c55c28cae44b17643312baca1fd0
size                 | 48984
type                 | BITMAP
production_status    | public
production_url       | https://upload.wikimedia.org/wikipedia/en/8/80/Anotheryear.jpg
upload_date          | 2020-12-01 06:42:22
archive_date         | None
delete_date          | None
backup_status        | backedup
backup_date          | 2021-08-18 22:18:03
backup_location      | https://backup1006.eqiad.wmnet:9000
backup_container     | mediabackups
backup_path          | enwiki/87e/87ebe966c2862ed1afc2b2db40b3edb0c2b8c55c28cae44b17643312baca1fd0

Let me get you a full report first.

Thanks. Interestingly, codfw and eqiad have different creation dates and sizes:

root@ms-fe2009:/home/mvernon# swift list -l wikipedia-en-local-public.80 | grep 8/80/Anotheryear.jpg
       48984 2020-12-01 06:42:22               image/jpeg 8/80/Anotheryear.jpg
root@ms-fe1009:/home/mvernon# swift list -l wikipedia-en-local-public.80 | grep 8/80/Anotheryear.jpg
      130332 2022-05-29 00:00:52               image/jpeg 8/80/Anotheryear.jpg

[which is why rclone was trying to copy this object]
So it looks like the backup copy is of the older version of this object.

The other object listed in both DCs has similar:

root@ms-fe2009:/home/mvernon# swift list -l wikipedia-commons-local-public.43 | grep 4/43/The_Collected_Works_of_Mahatma_Gandhi,_vol._92.pdf
   151246791 2022-04-18 21:00:56          application/pdf 4/43/The_Collected_Works_of_Mahatma_Gandhi,_vol._92.pdf
root@ms-fe1009:/home/mvernon# swift list -l wikipedia-commons-local-public.43 | grep 4/43/The_Collected_Works_of_Mahatma_Gandhi,_vol._92.pdf
   114275335 2022-05-31 19:15:45          application/pdf 4/43/The_Collected_Works_of_Mahatma_Gandhi,_vol._92.pdf

This latter looks to have been deleted (though obviously not entirely successfully).

eqiad backups:

This is the list of 2 files found with the given criteria:


0)
wiki                 | commonswiki
title                | The_Collected_Works_of_Mahatma_Gandhi,_vol._92.pdf
production_container | wikipedia-commons-local-deleted.61
production_path      | 6/1/5/615t96kde3387itnyr9e4l8iz57lohz.pdf
sha1                 | 33a422ab07678ce98f6bc17aecadb482f243ba47
sha256               | 6711c4a66bf60ab298e552dd54494df81d6f689c3f648c870b69015fcf48b92b
size                 | 151246791
type                 | OFFICE
production_status    | deleted
production_url       | None
upload_date          | 2022-04-18 21:00:58
archive_date         | 2022-05-31 19:15:42
delete_date          | 2022-07-08 20:03:38
backup_status        | backedup
backup_date          | 2023-03-22 13:26:50
backup_location      | https://backup1005.eqiad.wmnet:9000
backup_container     | mediabackups
backup_path          | commonswiki/671/6711c4a66bf60ab298e552dd54494df81d6f689c3f648c870b69015fcf48b92b

1)
wiki                 | commonswiki
title                | The_Collected_Works_of_Mahatma_Gandhi,_vol._92.pdf
production_container | wikipedia-commons-local-deleted.8d
production_path      | 8/d/8/8d8nwuuq04vpy4kawlr4wy6wa4qk5n3.pdf
sha1                 | 47a29fdaa2c53f1e6c09004e99847b13229fe38f
sha256               | 9fe9d834ce2abe12fc53a6330b56aabb0d6352fdd8b0a47b64449c8956edb0e6
size                 | 114275335
type                 | OFFICE
production_status    | deleted
production_url       | None
upload_date          | 2022-05-31 19:15:50
archive_date         | None
delete_date          | 2022-07-08 20:03:38
backup_status        | backedup
backup_date          | 2023-03-22 13:26:50
backup_location      | https://backup1006.eqiad.wmnet:9000
backup_container     | mediabackups
backup_path          | commonswiki/9fe/9fe9d834ce2abe12fc53a6330b56aabb0d6352fdd8b0a47b64449c8956edb0e6

codfw:

0)
wiki                 | commonswiki
title                | The_Collected_Works_of_Mahatma_Gandhi,_vol._92.pdf
production_container | wikipedia-commons-local-deleted.61
production_path      | 6/1/5/615t96kde3387itnyr9e4l8iz57lohz.pdf
sha1                 | 33a422ab07678ce98f6bc17aecadb482f243ba47
sha256               | 6711c4a66bf60ab298e552dd54494df81d6f689c3f648c870b69015fcf48b92b
size                 | 151246791
type                 | OFFICE
production_status    | deleted
production_url       | None
upload_date          | 2022-04-18 21:00:58
archive_date         | 2022-05-31 19:15:42
delete_date          | 2022-07-08 20:03:38
backup_status        | backedup
backup_date          | 2023-04-01 02:11:56
backup_location      | https://backup2005.codfw.wmnet:9000
backup_container     | mediabackups
backup_path          | commonswiki/671/6711c4a66bf60ab298e552dd54494df81d6f689c3f648c870b69015fcf48b92b


1)
wiki                 | commonswiki
title                | The_Collected_Works_of_Mahatma_Gandhi,_vol._92.pdf
production_container | wikipedia-commons-local-deleted.8d
production_path      | 8/d/8/8d8nwuuq04vpy4kawlr4wy6wa4qk5n3.pdf
sha1                 | 47a29fdaa2c53f1e6c09004e99847b13229fe38f
sha256               | 9fe9d834ce2abe12fc53a6330b56aabb0d6352fdd8b0a47b64449c8956edb0e6
size                 | 114275335
type                 | OFFICE
production_status    | deleted
production_url       | None
upload_date          | 2022-05-31 19:15:50
archive_date         | None
delete_date          | 2022-07-08 20:03:38
backup_status        | backedup
backup_date          | 2023-04-01 02:11:56
backup_location      | https://backup2006.codfw.wmnet:9000
backup_container     | mediabackups
backup_path          | commonswiki/9fe/9fe9d834ce2abe12fc53a6330b56aabb0d6352fdd8b0a47b64449c8956edb0e6

These look to me like leftovers - check the paths of production_container + production_path; that is the only place MW thinks they should be. They must have failed to be deleted from the public containers (this is one of the many problems of MW file handling: it is very prone to leftovers and data loss if anything errors out, leading to drift between metadata and physical files).

Yes, that fits with the "this file has been deleted" page, so I think that object is good to clear up in both clusters. Thank you!

I'll be interested to hear about the other objects when you've some time :)

> I'll be interested to hear about the other objects when you've some time :)

The 22 at wikipedia-ja-local-public.21 from above, right?

>> I'll be interested to hear about the other objects when you've some time :)

> The 22 at wikipedia-ja-local-public.21 from above, right?

Yes, and if you have any further thoughts on 8/80/Anotheryear.jpg

> Yes, and if you have any further thoughts on 8/80/Anotheryear.jpg

So backup records are not (and are not intended to be) a complete record of all states of that file - I only record "snapshots" - but we can combine them with production metadata:

That file's state should now be "deleted", according to metadata. This is what current production metadata shows:

{P46461}

The list of things that happened is:

  • file uploaded on 2020-12-01 06:42:22 ( https://web.archive.org/web/20220516211036/https://en.wikipedia.org/w/index.php?title=File:Anotheryear.jpg&action=info )
  • new version uploaded on 2022-05-28 16:57:20
  • new version uploaded on 2022-05-29 00:00:51
  • The 3 files got deleted on: 2022-06-27 00:00:03
  • Current storage should show 3 files in the deleted containers: 4ruxnrws39l0jwzcw7pobwq06j28nx0.jpg, paomxgtq32b2zodei1tv4w35fz0xa6n.jpg and 8muxwcixictu47tjvws11298x7jx8oa.jpg (the containers should be wikipedia-en-local-deleted.4r, etc., and the path 4/r/u/4ruxnrws39l0jwzcw7pobwq06j28nx0.jpg)

Anything left in public containers, as of this writing, shouldn't exist (another leftover).

You should have better tooling to check production. I don't know if there is something already, but I can repurpose some backup tools for that, with some time & effort.

Regarding jawiki, there are no latest or archived (public) files with those names:

root@db1140:~$ cat images.txt | while read image; do echo "SELECT * FROM image WHERE img_name='$image'"; mysql -S /run/mysqld/mysqld.s6.sock jawiki -e "SELECT * FROM image WHERE img_name='$image'"; done             
SELECT * FROM image WHERE img_name='Chounichiji1.jpg'
SELECT * FROM image WHERE img_name='Combino_hensei.png'
SELECT * FROM image WHERE img_name='Hirai_River.jpg'
SELECT * FROM image WHERE img_name='Hitachi-Motoyamagekizyou.jpg'
SELECT * FROM image WHERE img_name='Ichimokuzin.png'
SELECT * FROM image WHERE img_name='Maid_character.png'
SELECT * FROM image WHERE img_name='Marumori_Station_02.JPG'
SELECT * FROM image WHERE img_name='Ministry_of_Industry.jpg'
SELECT * FROM image WHERE img_name='Odakyu3000-1-01.JPG'
SELECT * FROM image WHERE img_name='SAT721.jpg'
SELECT * FROM image WHERE img_name='Tetris_TSPIN-Triple.png'
SELECT * FROM image WHERE img_name='Yagiriowatashi.jpg'
SELECT * FROM image WHERE img_name='上信上州新屋.jpg'
SELECT * FROM image WHERE img_name='仙台市営地下鉄旭ヶ丘駅.jpg'
SELECT * FROM image WHERE img_name='千鳥町駅.JPG'
SELECT * FROM image WHERE img_name='治良門橋駅ホーム全景.jpg'
SELECT * FROM image WHERE img_name='琴ヶ浜060730-1.jpg'
SELECT * FROM image WHERE img_name='街区表示板.jpg'
SELECT * FROM image WHERE img_name='琴ヶ浜060730-1.jpg'

✔️
cat images.txt | while read image; do echo "SELECT * FROM oldimage WHERE oi_name='$image'"; mysql -S /run/mysqld/mysqld.s6.sock jawiki -e "SELECT * FROM oldimage WHERE oi_name='$image'"; done
SELECT * FROM oldimage WHERE oi_name='Chounichiji1.jpg'
SELECT * FROM oldimage WHERE oi_name='Combino_hensei.png'
SELECT * FROM oldimage WHERE oi_name='Hirai_River.jpg'
SELECT * FROM oldimage WHERE oi_name='Hitachi-Motoyamagekizyou.jpg'
SELECT * FROM oldimage WHERE oi_name='Ichimokuzin.png'
SELECT * FROM oldimage WHERE oi_name='Maid_character.png'
SELECT * FROM oldimage WHERE oi_name='Marumori_Station_02.JPG'
SELECT * FROM oldimage WHERE oi_name='Ministry_of_Industry.jpg'
SELECT * FROM oldimage WHERE oi_name='Odakyu3000-1-01.JPG'
SELECT * FROM oldimage WHERE oi_name='SAT721.jpg'
SELECT * FROM oldimage WHERE oi_name='Tetris_TSPIN-Triple.png'
SELECT * FROM oldimage WHERE oi_name='Yagiriowatashi.jpg'
SELECT * FROM oldimage WHERE oi_name='上信上州新屋.jpg'
SELECT * FROM oldimage WHERE oi_name='仙台市営地下鉄旭ヶ丘駅.jpg'
SELECT * FROM oldimage WHERE oi_name='千鳥町駅.JPG'
SELECT * FROM oldimage WHERE oi_name='治良門橋駅ホーム全景.jpg'
SELECT * FROM oldimage WHERE oi_name='琴ヶ浜060730-1.jpg'
SELECT * FROM oldimage WHERE oi_name='街区表示板.jpg'
SELECT * FROM oldimage WHERE oi_name='琴ヶ浜060730-1.jpg'
✔️

All files were soft-deleted; you can (or should) find their objects at:

cat images.txt | while read image; do echo "SELECT fa_storage_key FROM filearchive WHERE img_name='$image'"; mysql -BN -S /run/mysqld/mysqld.s6.sock jawiki -e "SELECT fa_storage_key FROM filearchive WHERE fa_name='$image'"; done
SELECT fa_storage_key FROM filearchive WHERE img_name='Chounichiji1.jpg'
5x5xuxcdaebu8k64jc719t8cv6ko5k3.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='Combino_hensei.png'
cqbdwj5xbxzqhb8163ife7af21vjpo9.png
SELECT fa_storage_key FROM filearchive WHERE img_name='Hirai_River.jpg'
hkxe1jji1mw604z2rcxwc1t92u62lu7.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='Hitachi-Motoyamagekizyou.jpg'
4q4yur8t0t3tm6e6yfwg9mueo1yvxjl.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='Ichimokuzin.png'
8tq54gsqhuuyez31gqwtjke22cnvm8z.png
SELECT fa_storage_key FROM filearchive WHERE img_name='Maid_character.png'
h6tjvj1d4q7z6c6a1u01r80wl9lln0h.png
SELECT fa_storage_key FROM filearchive WHERE img_name='Marumori_Station_02.JPG'
3qvu4ir7ndoa7xmgppp9lux1ztled91.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='Ministry_of_Industry.jpg'
e1l6po8wm9bsnt3r2tbwx5co30bh81o.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='Odakyu3000-1-01.JPG'
dcbqwsjedzjaf0gw98jnyaazbzm4j65.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='SAT721.jpg'
0nmvnmi39jlmjj6o1ky6v3i2i2bifzbt.jpg
rsc11ee2hb1xmbyf1t3mm5xf5k9x4hp.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='Tetris_TSPIN-Triple.png'
sl5n5fp85ate9kq17r1xi5xoynenhpm.png
SELECT fa_storage_key FROM filearchive WHERE img_name='Yagiriowatashi.jpg'
68dp39wppjvgbdpv8epooxge3h6mlsg.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='上信上州新屋.jpg'
613t6l9t7k2za8uqkn7m46rfaxny8v2.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='仙台市営地下鉄旭ヶ丘駅.jpg'
alvzn8d1n2rq6sfzli1tw55uu49vxih.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='千鳥町駅.JPG'
1jtb19nmgix8g1ynzbvnye8cbfxb9o7.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='治良門橋駅ホーム全景.jpg'
13r80skqbdvsr42kzkyqz9lhm5w4hcu.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='琴ヶ浜060730-1.jpg'
cdan1d3tjdaa0f33zxmf53j9p1kzm4e.jpg
cxjly4c31gu2pmyocx1q0m67mjg88g5.jpg
cdan1d3tjdaa0f33zxmf53j9p1kzm4e.jpg
cxjly4c31gu2pmyocx1q0m67mjg88g5.jpg
cdan1d3tjdaa0f33zxmf53j9p1kzm4e.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='街区表示板.jpg'
s6wxs18mqo5sbe1td1hp8vdgba2b7hp.jpg
SELECT fa_storage_key FROM filearchive WHERE img_name='琴ヶ浜060730-1.jpg'
cdan1d3tjdaa0f33zxmf53j9p1kzm4e.jpg
cxjly4c31gu2pmyocx1q0m67mjg88g5.jpg
cdan1d3tjdaa0f33zxmf53j9p1kzm4e.jpg
cxjly4c31gu2pmyocx1q0m67mjg88g5.jpg
cdan1d3tjdaa0f33zxmf53j9p1kzm4e.jpg
✔️

This could be confusing because, for example, for Chounichiji1.jpg, if you go to:

https://ja.wikipedia.org/wiki/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB:Chounichiji1.jpg

You can see the file exists (!), but looking carefully it is not a local file, it is the one transferred to Commons: https://commons.wikimedia.org/wiki/File:Chounichiji1.jpg

In the log you can see the history: https://ja.wikipedia.org/wiki/%E7%89%B9%E5%88%A5:%E3%83%AD%E3%82%B0?type=&user=&page=%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%3AChounichiji1.jpg&wpdate=&tagfilter=&wpfilters%5B%5D=newusers&wpFormIdentifier=logeventslist

It got uploaded in 2015, deleted in 2022 when transferred to Commons. Any leftover on a public container shouldn't be there.

Similar story for https://ja.wikipedia.org/wiki/%E7%89%B9%E5%88%A5:%E3%83%AD%E3%82%B0?type=&user=&page=%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%3A%E7%90%B4%E3%83%B6%E6%B5%9C060730-1.jpg&wpdate=&tagfilter=&wpfilters%5B%5D=newusers&wpFormIdentifier=logeventslist

More revisions, but all deleted.

Tip: If you see any of the above pages in Japanese and you are not fluent, you can add ?uselang=en to get the interface in English or your preferred language. I don't need that because I configured my global preferences to use English by default.

Thanks, that is super helpful!
I agree that some tooling that's able to look up objects in backups and production would be really useful (if nothing else so I don't end up just bugging you whenever rclone trips over more detritus like this). Maybe a KR for next quarter?

Mentioned in SAL (#wikimedia-operations) [2023-04-12T10:18:46Z] <Emperor> clearing out 24 ghost objects from Swift T327253

Change 905595 merged by MVernon:

[operations/cookbooks@master] sre.swift.remove-ghost-objects: new cookbook

https://gerrit.wikimedia.org/r/905595

MatthewVernon claimed this task.

I think we can resolve this now; the remove-ghost-objects cookbook has helped, and recent rclone runs have successfully completed.