Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption
Open, Needs Triage · Public

Assigned To

None

Authored By

	MatthewVernon
	Mon, Jan 6, 1:58 PM

Description

We had reports (T383023, T383034) of thumbnail failures for some users. Further digging showed that they were getting HTTP 401 (unauthorised) from codfw swift. This was because, whilst the underlying object still existed (and could be inspected with swift stat, which goes via the rings), the container DB for the container wikipedia-commons-local-thumb.f8 was missing (examples in P71802); indeed, attempting to swift stat wikipedia-commons-local-thumb.f8 resulted in Container 'wikipedia-commons-local-thumb.f8' not found.

This is unlikely to be the result of a correctly-issued deletion request, because (per our docs) deleting a container first deletes the contents, and those were still extant where inspected. Also, the account still "thought" it had a container named wikipedia-commons-local-thumb.f8 (per swift list).

The container DBs, however, were all missing - I checked all six locations in the output of sudo swift-get-nodes /etc/swift/container.ring.gz AUTH_mw wikipedia-commons-local-thumb.f8 and in no case was the containing directory extant, never mind the db file.

We were able to restore service by effectively re-creating the container: swift post wikipedia-commons-local-thumb.f8 --read-acl 'mw:thumbor,mw:media,.r:*' --write-acl 'mw:thumbor,mw:media', and thumbor largely coped with the extra load.

ms-fe2009 first returned 401 to a request for something in wikipedia-commons-local-thumb.f8 at 07:20:50 on 2025-01-05, giving us an approximate timestamp for the deletion.

Inspecting swift logs for the day (sudo cumin -x --force --no-progress --no-color -o txt "A:codfw and P{O:swift::proxy}" "zgrep -F 'DELETE' /var/log/swift/proxy-access.log.1.gz | grep 'wikipedia-commons-local-thumb.f8'" >~/junk/T383023) produces 5208 DELETE requests for items in that container, but they all contain the string px-, meaning they were for objects within the container not the container itself.
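
The distinction drawn above (DELETEs for objects inside the container versus a DELETE of the container itself) can also be checked structurally rather than by looking for px-. The following is a minimal sketch, assuming the request path has already been extracted from each log line and follows the usual Swift layout /v1/<account>/<container>/<object>; the function name and the example object path are hypothetical.

```python
from urllib.parse import urlsplit


def classify_swift_path(path):
    """Classify a Swift API request path as 'account', 'container' or 'object'.

    Assumes the layout /<version>/<account>[/<container>[/<object>]].
    Object names may themselves contain slashes, so split at most 3 times.
    """
    parts = [p for p in urlsplit(path).path.lstrip("/").split("/", 3) if p]
    if len(parts) < 2:
        raise ValueError(f"not a Swift account/container/object path: {path!r}")
    return {2: "account", 3: "container", 4: "object"}[len(parts)]


# Hypothetical paths, for illustration only.
print(classify_swift_path("/v1/AUTH_mw/wikipedia-commons-local-thumb.f8"))
print(classify_swift_path(
    "/v1/AUTH_mw/wikipedia-commons-local-thumb.f8/f/f8/Example.svg/120px-Example.svg.png"
))
```

A DELETE whose path classifies as 'container' would be the request of interest; per the log search above, none was found.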

It might be instructive to narrow that window down further by inspecting the other codfw frontends, and then checking frontend logs for the relevant time window. But this is a deeply concerning mystery at the moment. This was the only affected thumbnail container in codfw; a check of all containers (there are 43k of them) is ongoing and will take a few hours.

Related Objects

Status	Subtype	Assigned	Task
Open		None	T383053 Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption
Resolved	BUG REPORT	MatthewVernon	T383023 PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view
Resolved	BUG REPORT	MatthewVernon	T383034 Preview images from Wikimedia Commons cannot be displayed properly

Event Timeline

MatthewVernon created this task. Mon, Jan 6, 1:58 PM

Restricted Application added a subscriber: Aklapper. Mon, Jan 6, 1:58 PM

MatthewVernon added subtasks: T383023: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view, T383034: Preview images from Wikimedia Commons cannot be displayed properly. Mon, Jan 6, 1:58 PM

MatthewVernon closed subtask T383023: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view as Resolved.

MatthewVernon closed subtask T383034: Preview images from Wikimedia Commons cannot be displayed properly as Resolved.

MatthewVernon updated the task description. Mon, Jan 6, 2:08 PM

Yiming subscribed. Mon, Jan 6, 2:13 PM

Narrow the time window down thus:

sudo cumin "A:codfw and P{O:swift::proxy}" "zgrep -F 'wikipedia-commons-local-thumb.f8' /var/log/swift/proxy-access.log.1.gz | grep -B 1 -m 1 'HTTP/1.0 401'"

This confirms that we had successful responses at 07:20:49 and none from 07:20:50 onwards. So whatever happened most likely happened then.

A_smart_kitten subscribed. Mon, Jan 6, 2:27 PM

I found nothing on the proxy-servers, but on ms-be2058 (the first node in the ring for this container), I find (#012 in log line converted to newline):

Jan  5 07:20:28 ms-be2058 container-server: ERROR __call__ error with GET /sdb3/16503/AUTH_mw/wikipedia-commons-local-thumb.f8 : 
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 475, in get
    yield conn
  File "/usr/lib/python3/dist-packages/swift/container/backend.py", line 1173, in list_objects_iter
    return [transform_func(r) for r in curs]
  File "/usr/lib/python3/dist-packages/swift/container/backend.py", line 1173, in <listcomp>
    return [transform_func(r) for r in curs]
sqlite3.DatabaseError: database disk image is malformed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swift/container/server.py", line 867, in __call__
    res = getattr(self, req.method)(req)
  File "/usr/lib/python3/dist-packages/swift/common/utils.py", line 2007, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swift/container/server.py", line 752, in GET
    container_list = src_broker.list_objects_iter(
  File "/usr/lib/python3/dist-packages/swift/container/backend.py", line 1223, in list_objects_iter
    return results
  File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 483, in get
    self.possibly_quarantine(*sys.exc_info())
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 436, in possibly_quarantine
    self.quarantine(exc_hint)
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 414, in quarantine
    raise sqlite3.DatabaseError(detail)
sqlite3.DatabaseError: Quarantined /srv/swift-storage/sdb3/containers/16503/280/4077d9164732d6587761ef101bcbc280 to /srv/swift-storage/sdb3/quarantined/containers/4077d9164732d6587761ef101bcbc280 due to malformed database (txn: tx4d7ef4ae3a434f458e950-00677a32bc)

Similar errors, similarly timestamped, appear on the other two storage nodes, ms-be2073 and ms-be2074.
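
The quarantine message above also shows how the on-disk locations are laid out, which explains why the directories listed by swift-get-nodes were empty. As a rough sketch inferred only from the two paths in that log line (the helper name is hypothetical, and the <hash>.db filename is taken from the file inspected later in this task), the live and quarantined locations can be rebuilt like this:

```python
from pathlib import PurePosixPath


def container_db_paths(root, device, partition, name_hash):
    """Return (live_db, quarantined_db) paths for a container DB.

    Layout inferred from the log line above, not from Swift's source:
      live:        <root>/<device>/containers/<partition>/<last 3 hex of hash>/<hash>/
      quarantined: <root>/<device>/quarantined/containers/<hash>/
    """
    base = PurePosixPath(root) / device
    live_dir = base / "containers" / str(partition) / name_hash[-3:] / name_hash
    quarantine_dir = base / "quarantined" / "containers" / name_hash
    db_name = f"{name_hash}.db"
    return live_dir / db_name, quarantine_dir / db_name


live, quarantined = container_db_paths(
    "/srv/swift-storage", "sdb3", 16503, "4077d9164732d6587761ef101bcbc280"
)
print(live.parent)         # directory that was moved away
print(quarantined.parent)  # directory it was moved to
```

So the databases were not deleted: each node moved its copy into its own quarantined/containers directory after hitting the malformed-database error.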

MatthewVernon renamed this task from Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw to Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption. Mon, Jan 6, 4:28 PM

All three database files have different checksums, but all fail the integrity check in the same way:

mvernon@ms-be2073:~$ sqlite3 4077d9164732d6587761ef101bcbc280.db "PRAGMA integrity_check"
row 423322 missing from index ix_object_deleted_name
row 2701219 missing from index ix_object_deleted_name
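
The same check can be scripted when several copies need comparing. This is a minimal sketch using Python's standard sqlite3 module, assuming it is run against a copy of each quarantined file; the function name is hypothetical. It opens the database read-only so the check cannot modify the evidence.

```python
import sqlite3


def integrity_errors(db_path):
    """Run PRAGMA integrity_check and return the reported problems.

    Returns an empty list when SQLite reports a single 'ok' row.
    The file is opened read-only via a file: URI.
    """
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
    return [] if rows == ["ok"] else rows
```

For example, integrity_errors("4077d9164732d6587761ef101bcbc280.db") on one of the files above would be expected to return the two "missing from index" lines, which could then be compared across the three nodes.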

Pppery subscribed. Mon, Jan 6, 5:48 PM

Cyberdog958 subscribed. Mon, Jan 6, 7:26 PM

Peachey88 subscribed. Mon, Jan 6, 9:01 PM
