
Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption
Open, Stalled, Needs Triage, Public

Description

We had reports (T383034) of thumbnail failures for some users. Further digging showed that they were getting HTTP 401 (unauthorised) from codfw swift. This was because, whilst the underlying objects still existed (and could be inspected with swift stat, which goes via the rings), the container DB for the container wikipedia-commons-local-thumb.f8 was missing (examples in P71802); indeed, attempting swift stat wikipedia-commons-local-thumb.f8 resulted in Container 'wikipedia-commons-local-thumb.f8' not found.

This is unlikely to be the result of a correctly-issued deletion request, because (per our docs) deleting a container first deletes the contents, and those were still extant where inspected. Also, the account still "thought" it had a container named wikipedia-commons-local-thumb.f8 (per swift list).

The container DBs, however, were all missing: I checked all six locations in the output of sudo swift-get-nodes /etc/swift/container.ring.gz AUTH_mw wikipedia-commons-local-thumb.f8, and in no case was even the containing directory extant, never mind the db file.
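For the record, that check amounts to something like this (a rough sketch: it relies on swift-get-nodes printing ready-made ssh/ls suggestions, and assumes you can ssh to the storage IPs and read those paths; the DEVICE override is our mount point, not the upstream default):

export DEVICE=/srv/swift-storage   # WMF mount point; upstream defaults to /srv/node*
sudo swift-get-nodes /etc/swift/container.ring.gz AUTH_mw wikipedia-commons-local-thumb.f8 \
    | grep '^ssh ' \
    | while read -r cmd; do
        # each suggested line is: ssh <ip> "ls -lah <container db directory>"
        echo "== $cmd"
        eval "$cmd" 2>/dev/null || echo '   -> directory absent on this node'
      done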

We were able to restore service by effectively re-creating the container: swift post wikipedia-commons-local-thumb.f8 --read-acl 'mw:thumbor,mw:media,.r:*' --write-acl 'mw:thumbor,mw:media', and thumbor largely coped with the extra load.

ms-fe2009 first returned 401 for a request for something in wikipedia-commons-local-thumb.f8 at 07:20:50 on 2025-01-05, giving us an approximate timestamp for the deletion.

Inspecting swift logs for the day (sudo cumin -x --force --no-progress --no-color -o txt "A:codfw and P{O:swift::proxy}" "zgrep -F 'DELETE' /var/log/swift/proxy-access.log.1.gz | grep 'wikipedia-commons-local-thumb.f8'" >~/junk/T383023) produces 5208 DELETE requests for items in that container, but they all contain the string px-, meaning they were for objects within the container, not the container itself.

It might be instructive to narrow that window down further by inspecting the other codfw frontends, and then checking backend logs for the relevant time window. But this is a deeply concerning mystery at the moment. This was the only affected thumbnail container in codfw; a check of all containers (there are 43k of them) is ongoing and will take a few hours.

Event Timeline

Narrow the time window down thus:

sudo cumin "A:codfw and P{O:swift::proxy}" "zgrep -F 'wikipedia-commons-local-thumb.f8' /var/log/swift/proxy-access.log.1.gz | grep -B 1 -m 1 'HTTP/1.0 401'"

This confirms that we had successful responses at 07:20:49 and the first 401 at 07:20:50. So whatever happened likely happened then.

I found nothing on the proxy servers, but on ms-be2058 (the first node in the ring for this container) I find the following (#012 in the log line converted to newlines):

Jan  5 07:20:28 ms-be2058 container-server: ERROR __call__ error with GET /sdb3/16503/AUTH_mw/wikipedia-commons-local-thumb.f8 : 
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 475, in get
    yield conn
  File "/usr/lib/python3/dist-packages/swift/container/backend.py", line 1173, in list_objects_iter
    return [transform_func(r) for r in curs]
  File "/usr/lib/python3/dist-packages/swift/container/backend.py", line 1173, in <listcomp>
    return [transform_func(r) for r in curs]
sqlite3.DatabaseError: database disk image is malformed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swift/container/server.py", line 867, in __call__
    res = getattr(self, req.method)(req)
  File "/usr/lib/python3/dist-packages/swift/common/utils.py", line 2007, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swift/container/server.py", line 752, in GET
    container_list = src_broker.list_objects_iter(
  File "/usr/lib/python3/dist-packages/swift/container/backend.py", line 1223, in list_objects_iter
    return results
  File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 483, in get
    self.possibly_quarantine(*sys.exc_info())
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 436, in possibly_quarantine
    self.quarantine(exc_hint)
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 414, in quarantine
    raise sqlite3.DatabaseError(detail)
sqlite3.DatabaseError: Quarantined /srv/swift-storage/sdb3/containers/16503/280/4077d9164732d6587761ef101bcbc280 to /srv/swift-storage/sdb3/quarantined/containers/4077d9164732d6587761ef101bcbc280 due to malformed database (txn: tx4d7ef4ae3a434f458e950-00677a32bc)

Similar errors, with similar timestamps, appear on the other two storage nodes, ms-be2073 and ms-be2074.

MatthewVernon renamed this task from Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw to Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption. Jan 6 2025, 4:28 PM

All three database files have different checksums but fail the integrity check in the same way:

mvernon@ms-be2073:~$ sqlite3 4077d9164732d6587761ef101bcbc280.db "PRAGMA integrity_check"
row 423322 missing from index ix_object_deleted_name
row 2701219 missing from index ix_object_deleted_name

There is no row in the object table with rowid 423322 (in any copy); the other complained-of row is extant:

2701219|f/f8/Kurokawa_River_and_Sasaogawa_River_from_Sasaogawabashi_Bridge.jpg/640px-Kurokawa_River_and_Sasaogawa_River_from_Sasaogawabashi_Bridge.jpg|1467306734.77461|82019|image/jpeg|ce336915c60b21ca7efdf80407e29819|0|0

The created_at field is Thu 30 Jun 17:12:14 2016, which matches when the original was created.
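For the record, both checks are one-liners against a copy of the quarantined db:

sqlite3 4077d9164732d6587761ef101bcbc280.db \
    "SELECT COUNT(*) FROM object WHERE ROWID = 423322;"   # prints 0 in every copy
date -ud @1467306734   # Thu Jun 30 17:12:14 UTC 2016, matching the created_at above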

Attempting the recovery operation on each copy (i.e. sqlite3 4077d9164732d6587761ef101bcbc280.db .recover >recovered.sql) gives us 3 files with the same number of rows but different checksums, still no sign of the missing rowid 423322, and the same 7 entries in lost_and_found:

CREATE TABLE "lost_and_found"(rootpgno INTEGER, pgno INTEGER, nfield INTEGER, id INTEGER, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17);
INSERT INTO "lost_and_found" VALUES(1048577, 1048577, 3, NULL, 0, 'thumbor/f/f8/Bulgaria_Bulgaria-0802_-_Stadium_of_Philippopolis_(7432851016).jpg/1200px-Bulgaria_Bulgaria-0802_-_Stadium_of_Philippopolis_(7432851016).jpg', 3168803, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
INSERT INTO "lost_and_found" VALUES(1048577, 1048577, 3, NULL, 0, 'thumbor/f/f8/Bulgaria_Bulgaria-0802_-_Stadium_of_Philippopolis_(7432851016).jpg/1205px-Bulgaria_Bulgaria-0802_-_Stadium_of_Philippopolis_(7432851016).jpg', 2952149, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
INSERT INTO "lost_and_found" VALUES(1048577, 1048577, 3, NULL, 0, 'thumbor/f/f8/Bulgaria_Bulgaria-0802_-_Stadium_of_Philippopolis_(7432851016).jpg/150px-Bulgaria_Bulgaria-0802_-_Stadium_of_Philippopolis_(7432851016).jpg', 3175557, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
INSERT INTO "lost_and_found" VALUES(1048577, 1048577, 3, NULL, 0, 'thumbor/f/f8/Bulgaria_Bulgaria-0802_-_Stadium_of_Philippopolis_(7432851016).jpg/301px-Bulgaria_Bulgaria-0802_-_Stadium_of_Philippopolis_(7432851016).jpg', 3606467, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
INSERT INTO "lost_and_found" VALUES(1048577, 1048577, 3, NULL, 0, 'thumbor/f/f8/Bulgaria_France_Locator.png/1200px-Bulgaria_France_Locator.png', 3411999, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
INSERT INTO "lost_and_found" VALUES(1048577, 1048577, 3, NULL, 0, 'thumbor/f/f8/Bulgaria_France_Locator.png/125px-Bulgaria_France_Locator.png', 3194863, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
INSERT INTO "lost_and_found" VALUES(1048577, 1048577, 3, NULL, 0, 'thumbor/f/f8/Bulgaria_France_Locator.png/153px-Bulgaria_France_Locator.png', 3035667, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);

Those correspond to two originals - https://commons.wikimedia.org/wiki/File:Bulgaria_Bulgaria-0802_-_Stadium_of_Philippopolis_(7432851016).jpg (uploaded in 2015) and https://commons.wikimedia.org/wiki/File:Bulgaria_France_Locator.png (updated in 2015), which makes me slightly suspicious that something went awry back in 2015. Probably not relevant to this issue, though: if that were going to cause problems, I'd rather expect it to have done so before now.

So we have 3 database files with at least similar contents (the same number of rows; inspecting differences by hand, it seems the same thumbs simply have different rowids in the three copies), all of which got corrupted, presumably at about the same time.

The last update in each copy (SELECT * from object WHERE rowid=(SELECT MAX(rowid) from object);) is the same thing:

19933856|f/f8/Gascones,_molino_(1988)_02.jpg/300px-Gascones,_molino_(1988)_02.jpg|1736061590.04401|0|application/deleted|noetag|1|0

(they differ in rowid and in the subsecond part of created_at), which is a thumb for an image uploaded in 2022. The created_at time is Sun 05 Jan 2025 07:19:50 UTC, shortly before things started to go wrong.
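For the record, the per-copy comparison and the timestamp conversion look like this (the local filenames for the three copies are hypothetical):

for db in ms-be2058.db ms-be2073.db ms-be2074.db; do   # hypothetical local copies
    sqlite3 "$db" "SELECT ROWID, created_at FROM object ORDER BY ROWID DESC LIMIT 1;"
done
date -ud @1736061590   # Sun Jan  5 07:19:50 UTC 2025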

As perhaps expected, the final transaction before the incident is a DELETE of the various thumbnails of Gascones,_molino_(1988)_02.jpg (one DELETE per size), all at 07:19:50. That looks normal.

The final PUT to each container server is at 07:19:14, from ms-be2062, for wikipedia-commons-local-thumb.f8/f/f8/Clara_Bow%2C_grayscale.jpg/256px-Clara_Bow%2C_grayscale.jpg, and it succeeds; the next, at 07:20:33, does not (returning 404), as expected.

I've copied the proxy-access and server logs from the frontends, and the server logs from the backends, onto cumin1002 to give myself a little more time to look at them before they get cycled out by logrotate. This will have to be temporary, as they take up 25G between them!

To summarise:

  • 07:19:14 - final successful PUT
  • 07:19:50 - final successful DELETE (recorded in databases OK)
  • 07:20:28 - all three databases corrupt (so they all get quarantined, and errors start being reported to users)

I assume that the databases were not corrupt at 07:19:50, but do we know that? It might be worth seeing when the last successful listing operation was.
Nothing like a bad transaction stands out in the logs so far, but it's probably worth a wider look (including at the frontend logs) at the 38s of interest, just to double-check. At this point, though, it is looking most likely to me that this is a swift bug that has corrupted the sqlite DBs. It is possibly also worth checking what the server was doing when it caught the exception.

I fished out the final successful listing request from each backend, and then looked up the relevant transaction in the frontend logs. In time order, they are:

Jan  5 07:00:45 ms-fe2014 proxy-server: 10.194.134.49 10.192.16.194 05/Jan/2025/07/00/45 GET /v1/AUTH_mw/wikipedia-commons-local-thumb.f8%3Flimit%3D9000%26prefix%3Df%252Ff8%252FAspatha_gularis_-_Blue-throated_Motmot_XC485521.mp3%252F%26format%3Djson%26states%3Dlisting HTTP/1.0 204 - wikimedia/multi-http-client%20v1.1 AUTH_tk22395377a... - - - txd040128c541c49888459b-00677a2e1d - 0.0103 - - 1736060445.159404278 1736060445.169747114 0
Jan  5 07:19:39 ms-fe2013 proxy-server: 10.194.144.14 10.192.0.87 05/Jan/2025/07/19/39 GET /v1/AUTH_mw/wikipedia-commons-local-thumb.f8%3Flimit%3D9000%26prefix%3Df%252Ff8%252FFWPhil_Musician_Protest_Sweetwater.jpg%252F%26format%3Djson%26states%3Dlisting HTTP/1.0 200 - wikimedia/multi-http-client%20v1.1 AUTH_tk22395377a... - 1339 - tx6eb2c8e735fa423dbc4b6-00677a328b - 0.0189 - - 1736061579.696813583 1736061579.715704679 0
Jan  5 07:19:50 ms-fe2010 proxy-server: 10.194.179.98 10.192.16.76 05/Jan/2025/07/19/50 GET /v1/AUTH_mw/wikipedia-commons-local-thumb.f8%3Flimit%3D9000%26prefix%3Df%252Ff8%252FGascones%252C_molino_%25281988%2529_02.jpg%252F%26format%3Djson%26states%3Dlisting HTTP/1.0 200 - wikimedia/multi-http-client%20v1.1 AUTH_tk22395377a... - 511 - txc6028b8aef0d4705aef82-00677a3296 - 0.0301 - - 1736061590.006474018 1736061590.036608696 0

Is it a coincidence that the final successful listing uses prefix=f/f8/Gascones,_molino_(1988)_02.jpg at the same time as some of those entries were being deleted?

I've spent some more time with these logs, and I think I may have reached the point of diminishing returns. I extracted the logs for the time period of interest (zgrep -E '^Jan  5 07:(19:5[0-9]|20:[0-2][0-9]) .*wikipedia-commons-local-thumb\.f8' ms-fe*.proxylog.gz >maybeinteresting), which gives me 276 lines, divided up thus:
Return codes:

 86 200
  7 204
  4 206
165 304
 12 404
  1 500
  1 503

Methods:

 14 DELETE
262 GET
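(Those breakdowns are simple field counts over the extracted lines; treating the status as field 12 and the method as field 9 matches the proxy-access format shown earlier, though that's worth double-checking against the logs themselves:)

awk '{print $12}' maybeinteresting | sort | uniq -c | sort -rn   # return codes
awk '{print $9}'  maybeinteresting | sort | uniq -c | sort -rn   # methods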

It's a little unfortunate that we don't log the difference between a GET of an extant thumbnail (which wouldn't cause any database activity) and a GET of an absent one that went via thumbor (which would update the database).

MatthewVernon changed the task status from Open to Stalled. Jan 17 2025, 4:54 PM

Reported as Debian #1093304, mostly so we have a record (and in case anyone else has seen this and/or has ideas); I've not managed to find a smoking gun.

I think that's where I'm going to have to leave this task.

We have similar behavior being reported again at https://en.wikipedia.org/wiki/Wikipedia:SVG_help#Rendering_issue – is it the same bug?

I’m going to say probably. I have the same problem with those two images that I had when I originally wrote the first ticket. I get the same error when attempting to view the PNG and it does not display unless viewing the source file.

It's not the same issue - those two files have thumbs in different containers (wikipedia-commons-local-thumb.c6 and wikipedia-commons-local-thumb.47 respectively), and both containers still exist in both clusters.

Mentioned in SAL (#wikimedia-operations) [2025-06-24T12:05:00Z] <Emperor> check thumbnail db integrity T383053

Using test-cookbook to run the currently-in-review check-dbs cookbook against all the thumbnail container DBs, we find the following problems (a sketch of the underlying check follows the summary below):

  • wikipedia-commons-local-thumb.6b
/srv/swift-storage/accounts1/containers/8721/17d/2211db68672f28e639decfc1f640917d/2211db68672f28e639decfc1f640917d.db on ms-be1083.eqiad.wmnet has errors:
row 1745416 missing from index ix_object_deleted_name
row 1745417 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/sda3/containers/8721/17d/2211db68672f28e639decfc1f640917d/2211db68672f28e639decfc1f640917d.db on ms-be1063.eqiad.wmnet has errors:
row 1745416 missing from index ix_object_deleted_name
row 1745417 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/8721/17d/2211db68672f28e639decfc1f640917d/2211db68672f28e639decfc1f640917d.db on ms-be1074.eqiad.wmnet has errors:
row 1745416 missing from index ix_object_deleted_name
row 1745417 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.6e
/srv/swift-storage/sdb3/containers/26973/1f3/695dccc2f2758a580e42b976670301f3/695dccc2f2758a580e42b976670301f3.db on ms-be2059.codfw.wmnet has errors:
row 1632783 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/sda3/containers/26973/1f3/695dccc2f2758a580e42b976670301f3/695dccc2f2758a580e42b976670301f3.db on ms-be2058.codfw.wmnet has errors:
row 1632783 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/26973/1f3/695dccc2f2758a580e42b976670301f3/695dccc2f2758a580e42b976670301f3.db on ms-be2076.codfw.wmnet has errors:
row 1632783 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.79
/srv/swift-storage/sda3/containers/30531/b33/774332fe929445555e757c805d65eb33/774332fe929445555e757c805d65eb33.db on ms-be1067.eqiad.wmnet has errors:
row 4537890 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/sdb3/containers/30531/b33/774332fe929445555e757c805d65eb33/774332fe929445555e757c805d65eb33.db on ms-be1070.eqiad.wmnet has errors:
row 4537890 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/30531/b33/774332fe929445555e757c805d65eb33/774332fe929445555e757c805d65eb33.db on ms-be1089.eqiad.wmnet has errors:
row 4537890 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.99
/srv/swift-storage/sdb3/containers/53069/95e/cf4db1352e66ef14017ee2d8a280d95e/cf4db1352e66ef14017ee2d8a280d95e.db on ms-be2064.codfw.wmnet has errors:
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.b7
/srv/swift-storage/sdb3/containers/43276/ec7/a90c2201082e532c429f273788ee6ec7/a90c2201082e532c429f273788ee6ec7.db on ms-be1066.eqiad.wmnet has errors:
row 1612987 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts1/containers/43276/ec7/a90c2201082e532c429f273788ee6ec7/a90c2201082e532c429f273788ee6ec7.db on ms-be1090.eqiad.wmnet has errors:
row 1612987 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/43276/ec7/a90c2201082e532c429f273788ee6ec7/a90c2201082e532c429f273788ee6ec7.db on ms-be1087.eqiad.wmnet has errors:
row 1612987 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.bb
/srv/swift-storage/accounts0/containers/25743/3d0/648f2c55d993f0a991ca1cffd66423d0/648f2c55d993f0a991ca1cffd66423d0.db on ms-be2076.codfw.wmnet has errors:
row 5310349 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.d3
/srv/swift-storage/accounts0/containers/35440/a6a/8a7070d66ad3b2c6df2c635eb41e7a6a/8a7070d66ad3b2c6df2c635eb41e7a6a.db on ms-be1079.eqiad.wmnet has errors:
row 4349795 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts1/containers/35440/a6a/8a7070d66ad3b2c6df2c635eb41e7a6a/8a7070d66ad3b2c6df2c635eb41e7a6a.db on ms-be1085.eqiad.wmnet has errors:
row 4234895 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts1/containers/35440/a6a/8a7070d66ad3b2c6df2c635eb41e7a6a/8a7070d66ad3b2c6df2c635eb41e7a6a.db on ms-be1078.eqiad.wmnet has errors:
row 4349795 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.ea
/srv/swift-storage/accounts0/containers/18674/f11/48f22e340a0b1d0f141c98e1cc4faf11/48f22e340a0b1d0f141c98e1cc4faf11.db on ms-be1078.eqiad.wmnet has errors:
row 6128861 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/18674/f11/48f22e340a0b1d0f141c98e1cc4faf11/48f22e340a0b1d0f141c98e1cc4faf11.db on ms-be1080.eqiad.wmnet has errors:
row 6128861 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name

So we have 98.8% of thumbnail container DBs in a consistent state. It's notable, though, that whilst a few of the bad containers have at least one good DB (and thus we would expect a suitable listing to cause the bad DB(s) to be quarantined), we also have 5 containers with no good DB at all. Perhaps also worth noting that the problems all appear to be around deletions.
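The heart of that check is just PRAGMA integrity_check per replica; conceptually, something like this on each backend (a sketch, not the actual cookbook, and note that a db being written to by the container-server can produce transient noise):

sudo find /srv/swift-storage/*/containers -name '*.db' | while read -r db; do
    res=$(sudo sqlite3 "$db" "PRAGMA integrity_check" 2>&1)
    [ "$res" = ok ] || printf '%s has errors:\n%s\n' "$db" "$res"
done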

Mentioned in SAL (#wikimedia-operations) [2025-06-30T10:05:20Z] <Emperor> depool codfw ms-swift for container DB repairs T383053

Mentioned in SAL (#wikimedia-operations) [2025-06-30T10:06:58Z] <Emperor> repair wikipedia-commons-local-thumb.6e on ms-be2059 ms-be2058 ms-be2076 T383053

Mentioned in SAL (#wikimedia-operations) [2025-06-30T10:17:10Z] <Emperor> repair wikipedia-commons-local-thumb.99 on ms-be2064 T383053

Icinga downtime and Alertmanager silence (ID=86ad3659-4dfe-4a19-8925-7580975c3341) set by mvernon@cumin2002 for 1:00:00 on 1 host(s) and their services with reason: container db repair

ms-be2076.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-06-30T10:22:28Z] <Emperor> repair wikipedia-commons-local-thumb.bb on ms-be2076 T383053

Mentioned in SAL (#wikimedia-operations) [2025-06-30T10:25:56Z] <Emperor> repool codfw ms-swift after container DB repairs T383053

Mentioned in SAL (#wikimedia-operations) [2025-06-30T11:21:58Z] <Emperor> depool eqiad ms-swift for container DB repairs T383053

Icinga downtime and Alertmanager silence (ID=b9fb1e08-c62e-4b34-b173-ffc58ee22ef8) set by mvernon@cumin2002 for 1:00:00 on 3 host(s) and their services with reason: container db repair

ms-be[1063,1074,1083].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-06-30T11:39:18Z] <Emperor> repair wikipedia-commons-local-thumb.6b on ms-be10[63,74,83] T383053

Icinga downtime and Alertmanager silence (ID=05b6b261-690d-46bf-a42d-e69d15adcfc8) set by mvernon@cumin2002 for 1:00:00 on 3 host(s) and their services with reason: container db repair

ms-be[1067,1070,1089].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-06-30T11:45:57Z] <Emperor> repair wikipedia-commons-local-thumb.79 on ms-be10[70,67,89] T383053

Icinga downtime and Alertmanager silence (ID=0077a8da-dd9e-44d3-b59a-d42061bdb69b) set by mvernon@cumin2002 for 1:00:00 on 3 host(s) and their services with reason: container db repair

ms-be[1066,1087,1090].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-06-30T11:52:08Z] <Emperor> repair wikipedia-commons-local-thumb.b7 ms-be10[66,87,90] T383053

Icinga downtime and Alertmanager silence (ID=469f036d-84d0-4d5b-8246-e8056b4949ca) set by mvernon@cumin2002 for 1:00:00 on 3 host(s) and their services with reason: container db repair

ms-be[1078-1079,1085].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-06-30T11:59:20Z] <Emperor> repair wikipedia-commons-local-thumb.d3 on ms-be10[78,79,85] T383053

Mentioned in SAL (#wikimedia-operations) [2025-06-30T12:05:27Z] <Emperor> repair wikipedia-commons-local-thumb.ea on ms-be10[78,80] T383053

Icinga downtime and Alertmanager silence (ID=a1ed9509-0228-4592-b2ad-7dd36a1c170f) set by mvernon@cumin2002 for 1:00:00 on 2 host(s) and their services with reason: container db repair

ms-be[1078,1080].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-06-30T12:11:12Z] <Emperor> repool eqiad ms-swift after container DB repairs T383053

These corrupt DBs have all been repaired now.
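For posterity, one plausible shape for such a repair (a sketch under assumptions; the actual cookbook steps aren't recorded here):

# On a node holding a corrupt copy (path from the check output above):
sudo mv /srv/swift-storage/sdb3/containers/26973/1f3/695dccc2f2758a580e42b976670301f3 \
        /root/saved-corrupt-dbs/   # hypothetical stash; keeps it out of the ring's path
# On a node with an intact copy, nudge replication rather than waiting:
sudo swift-init container-replicator once
# With no intact replica anywhere, a best-effort rebuild from the dump:
sqlite3 corrupt.db .recover | sqlite3 rebuilt.db

(The depool and downtime entries above bracket exactly this sort of operation.)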

TheDJ claimed this task.
TheDJ subscribed.

@MatthewVernon can this be closed?

TheDJ removed TheDJ as the assignee of this task.
MatthewVernon changed the task status from Open to Stalled. Feb 9 2026, 11:35 AM

Is there a problem with it being left stalled? The Debian bug I opened hasn't seen any activity, and we've not yet got any idea why this happened.


Noting that this was upstreamed from Debian to https://bugs.launchpad.net/swift/+bug/2141924