
Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption
Open, Needs Triage, Public

Description

We had reports (T383034, T383034) of thumbnail failures for some users. Further digging showed that they were getting HTTP 401 (unauthorised) from codfw swift. The underlying objects still existed (and could be inspected with swift stat, which goes via the rings), but the container DB for the container wikipedia-commons-local-thumb.f8 was missing (examples in P71802): attempting to swift stat wikipedia-commons-local-thumb.f8 resulted in Container 'wikipedia-commons-local-thumb.f8' not found.

This is unlikely to be the result of a correctly-issued deletion request, because (per our docs) deleting a container first deletes the contents, and those were still extant where inspected. Also, the account still "thought" it had a container named wikipedia-commons-local-thumb.f8 (per swift list).

The container DBs, however, were all missing - I checked all six locations in the output of sudo swift-get-nodes /etc/swift/container.ring.gz AUTH_mw wikipedia-commons-local-thumb.f8 and in no case was the containing directory extant, never mind the db file.

We were able to restore service by effectively re-creating the container: swift post wikipedia-commons-local-thumb.f8 --read-acl 'mw:thumbor,mw:media,.r:*' --write-acl 'mw:thumbor,mw:media', and thumbor largely coped with the extra load.

ms-fe2009 first returned a 401 for a request for something in wikipedia-commons-local-thumb.f8 at 07:20:50 on 2025-01-05, giving us an approximate timestamp for the deletion.

Inspecting swift logs for the day (sudo cumin -x --force --no-progress --no-color -o txt "A:codfw and P{O:swift::proxy}" "zgrep -F 'DELETE' /var/log/swift/proxy-access.log.1.gz | grep 'wikipedia-commons-local-thumb.f8'" >~/junk/T383023) produces 5208 DELETE requests for items in that container, but they all contain the string px-, meaning they were for objects within the container not the container itself.
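The px- check is a proxy for "this DELETE named an object, not the container". A more direct test is to count path segments. The following is a minimal sketch, assuming the logged request paths follow the standard Swift API shape /v1/<account>/<container>/<object>; the function name and the example object path are hypothetical, not taken from the actual logs.

```python
from urllib.parse import urlsplit


def delete_target(path: str) -> str:
    """Classify a Swift API path as 'account', 'container' or 'object'.

    Assumes the path looks like /v1/<account>[/<container>[/<object>]].
    """
    # Drop any query string, then split into at most four pieces so that
    # slashes inside the object name stay together in the last piece.
    pieces = urlsplit(path).path.lstrip("/").split("/", 3)
    # pieces[0] is the API version; ignore empty pieces from trailing slashes.
    segments = [p for p in pieces[1:] if p]
    if not segments:
        raise ValueError(f"no account in path: {path!r}")
    if len(segments) == 1:
        return "account"
    if len(segments) == 2:
        return "container"
    return "object"


if __name__ == "__main__":
    container = "/v1/AUTH_mw/wikipedia-commons-local-thumb.f8"
    # Hypothetical object name, for illustration only.
    obj = container + "/f/f8/Example.jpg/120px-Example.jpg"
    print(delete_target(container))  # a DELETE here would remove the container
    print(delete_target(obj))        # a DELETE here removes a single thumbnail
```

Any DELETE whose path classifies as "container" would be the request of interest; the 5208 requests found here would all be expected to classify as "object".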

It might be instructive to narrow that window down further by inspecting the other codfw frontends, and then checking frontend logs for the relevant time window. But this is a deeply concerning mystery at the moment. This was the only affected thumbnail container in codfw; a check of all containers (there are 43k of them) is ongoing and will take a few hours.

Event Timeline

Narrow the time window down thus:

sudo cumin "A:codfw and P{O:swift::proxy}" "zgrep -F 'wikipedia-commons-local-thumb.f8' /var/log/swift/proxy-access.log.1.gz | grep -B 1 -m 1 'HTTP/1.0 401'"

This confirms that we had successful responses at 07:20:49 and none from 07:20:50 onwards. So whatever happened likely happened then.
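The grep -B 1 -m 1 pipeline reports the first 401 and the line before it for each frontend separately. To merge the frontends into a single window, something like the sketch below could be used. It assumes the timestamps and status codes have already been extracted from the proxy logs into (timestamp, status) pairs; the function name and the sample data are illustrative, not real log output.

```python
from typing import Iterable, Optional, Tuple


def failure_window(
    events: Iterable[Tuple[str, int]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (last success before the first 401, time of the first 401).

    Timestamps must sort chronologically as strings, e.g. "07:20:49"
    within a single day, or full ISO 8601 timestamps.
    """
    last_ok = None
    for timestamp, status in sorted(events):
        if status == 401:
            return last_ok, timestamp
        if 200 <= status < 400:
            last_ok = timestamp
    return last_ok, None


if __name__ == "__main__":
    # Illustrative data: events from several frontends, in arbitrary order.
    merged = [
        ("07:20:50", 401),
        ("07:20:49", 200),
        ("07:20:48", 304),
        ("07:20:51", 401),
    ]
    print(failure_window(merged))
```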

I found nothing on the proxy-servers, but on ms-be2058 (the first node in the ring for this container), I found the following (#012 in log line converted to newline):

Jan  5 07:20:28 ms-be2058 container-server: ERROR __call__ error with GET /sdb3/16503/AUTH_mw/wikipedia-commons-local-thumb.f8 : 
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 475, in get
    yield conn
  File "/usr/lib/python3/dist-packages/swift/container/backend.py", line 1173, in list_objects_iter
    return [transform_func(r) for r in curs]
  File "/usr/lib/python3/dist-packages/swift/container/backend.py", line 1173, in <listcomp>
    return [transform_func(r) for r in curs]
sqlite3.DatabaseError: database disk image is malformed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swift/container/server.py", line 867, in __call__
    res = getattr(self, req.method)(req)
  File "/usr/lib/python3/dist-packages/swift/common/utils.py", line 2007, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swift/container/server.py", line 752, in GET
    container_list = src_broker.list_objects_iter(
  File "/usr/lib/python3/dist-packages/swift/container/backend.py", line 1223, in list_objects_iter
    return results
  File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 483, in get
    self.possibly_quarantine(*sys.exc_info())
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 436, in possibly_quarantine
    self.quarantine(exc_hint)
  File "/usr/lib/python3/dist-packages/swift/common/db.py", line 414, in quarantine
    raise sqlite3.DatabaseError(detail)
sqlite3.DatabaseError: Quarantined /srv/swift-storage/sdb3/containers/16503/280/4077d9164732d6587761ef101bcbc280 to /srv/swift-storage/sdb3/quarantined/containers/4077d9164732d6587761ef101bcbc280 due to malformed database (txn: tx4d7ef4ae3a434f458e950-00677a32bc)

Similar errors, similarly timestamped, appear on the other two storage nodes, ms-be2073 and ms-be2074.
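To gather the quarantine locations from all three storage nodes, the final line of each traceback can be parsed. This is a minimal sketch: the regular expression is written against the one message quoted above and is an assumption about its general shape, not a format documented by Swift.

```python
import re
from typing import Dict, Optional

# Matches e.g. "Quarantined <src> to <dst> due to <reason> (txn: <id>)".
# The trailing "(txn: ...)" part is treated as optional.
QUARANTINE_RE = re.compile(
    r"Quarantined (?P<src>\S+) to (?P<dst>\S+) due to (?P<reason>.+?)"
    r"(?: \(txn: (?P<txn>[^)]+)\))?$"
)


def parse_quarantine(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Extract source, destination, reason and txn id, or None if no match."""
    match = QUARANTINE_RE.search(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = (
        "sqlite3.DatabaseError: Quarantined "
        "/srv/swift-storage/sdb3/containers/16503/280/"
        "4077d9164732d6587761ef101bcbc280 to "
        "/srv/swift-storage/sdb3/quarantined/containers/"
        "4077d9164732d6587761ef101bcbc280 due to malformed database "
        "(txn: tx4d7ef4ae3a434f458e950-00677a32bc)"
    )
    print(parse_quarantine(line))
```

The dst field is the directory to look in for the quarantined .db file on each node.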

MatthewVernon renamed this task from Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw to Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption. (Mon, Jan 6, 4:28 PM)

All three database files have different checksums, but fail the integrity check in the same way:

mvernon@ms-be2073:~$ sqlite3 4077d9164732d6587761ef101bcbc280.db "PRAGMA integrity_check"
row 423322 missing from index ix_object_deleted_name
row 2701219 missing from index ix_object_deleted_name
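To compare the integrity-check output across the three copies without reading it by eye, the same PRAGMA can be run from Python. This is a sketch, assuming the quarantined .db files have been copied somewhere readable; the function names are illustrative. The database is opened read-only so the check cannot alter the evidence.

```python
import sqlite3
from collections import Counter
from pathlib import Path
from typing import List


def integrity_problems(db_path: str) -> List[str]:
    """Run PRAGMA integrity_check; return [] if SQLite reports 'ok'."""
    # mode=ro opens the file read-only via an SQLite URI.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
    return [] if rows == ["ok"] else rows


def missing_rows_by_index(problems: List[str]) -> Counter:
    """Count 'row N missing from index X' messages per index name."""
    return Counter(
        message.rsplit(" ", 1)[-1]
        for message in problems
        if "missing from index" in message
    )


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        print(path, dict(missing_rows_by_index(integrity_problems(path))))
```

Note that PRAGMA integrity_check stops after 100 reported problems by default, so the counts are a lower bound on a badly damaged file.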