
imagecatalog_record.service fails due to read-only sqlite database
Open, Low, Public

Description

Currently alerting

cgoubert@deploy1002:~$ sudo systemctl status imagecatalog_record.service
● imagecatalog_record.service - update the image catalog with all images running in prod
   Loaded: loaded (/lib/systemd/system/imagecatalog_record.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2024-03-21 15:13:14 UTC; 31min ago
     Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
  Process: 21262 ExecStart=/usr/bin/imagecatalog --database=/srv/deployment/imagecatalog/catalog.sqlite --clusters=staging-eqiad:/etc/kubernetes/imagecatalog-staging-eqiad.conf
 Main PID: 21262 (code=exited, status=1/FAILURE)

Mar 21 15:13:13 deploy1002 imagecatalog[21262]:   File "/usr/lib/python3/dist-packages/imagecatalog/cli.py", line 51, in main
Mar 21 15:13:13 deploy1002 imagecatalog[21262]:     return record(args.database, args.clusters)
Mar 21 15:13:13 deploy1002 imagecatalog[21262]:   File "/usr/lib/python3/dist-packages/imagecatalog/cli.py", line 24, in record
Mar 21 15:13:13 deploy1002 imagecatalog[21262]:     image_catalog.record_active_images()
Mar 21 15:13:13 deploy1002 imagecatalog[21262]:   File "/usr/lib/python3/dist-packages/imagecatalog/catalog.py", line 143, in record_active_images
Mar 21 15:13:13 deploy1002 imagecatalog[21262]:     image.namespace, image.pod_name, image.container_name))
Mar 21 15:13:13 deploy1002 imagecatalog[21262]: sqlite3.OperationalError: attempt to write a readonly database
Mar 21 15:13:14 deploy1002 systemd[1]: imagecatalog_record.service: Main process exited, code=exited, status=1/FAILURE
Mar 21 15:13:14 deploy1002 systemd[1]: imagecatalog_record.service: Failed with result 'exit-code'.
Mar 21 15:13:14 deploy1002 systemd[1]: Failed to start update the image catalog with all images running in prod.
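
For reference, the same sqlite3 error can be reproduced outside imagecatalog. The sketch below is only an illustration: it forces read-only access with SQLite's mode=ro URI parameter instead of relying on file ownership, and the images table and both helper names are hypothetical, not taken from the imagecatalog code.

```python
import os
import sqlite3
import tempfile


def make_demo_db(directory):
    """Create a small throwaway database (hypothetical schema) and return its path."""
    path = os.path.join(directory, "catalog.sqlite")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE images (name TEXT)")
    conn.commit()
    conn.close()
    return path


def try_write_readonly(path):
    """Open an existing SQLite file read-only and attempt an INSERT.

    Returns the sqlite3 error message, or None if the write succeeded.
    """
    # mode=ro stands in for "the process cannot write to the file".
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        conn.execute("INSERT INTO images (name) VALUES (?)", ("example",))
        conn.commit()
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(try_write_readonly(make_demo_db(tmp)))
```

The read itself succeeds in this situation; it is the first write that raises, which matches the traceback ending in record_active_images.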

Event Timeline

Clement_Goubert created this task.
cgoubert@deploy1002:~$ sudo chown imagecatalog:imagecatalog /srv/deployment/imagecatalog/catalog.sqlite
cgoubert@deploy1002:~$ sudo systemctl restart imagecatalog_record.service

We may need to fix the puppetization to ensure the file has the right owner.
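
As an illustration of what "the right owner" means in practice, an ownership check can be sketched in a few lines of Python. This is a sketch under assumptions, not part of imagecatalog or the Puppet code: check_owner and demo are hypothetical names, and the check compares numeric uids so it does not depend on any particular account existing.

```python
import os
import tempfile


def check_owner(path, expected_uid):
    """Return None if path is owned by expected_uid, else a short diagnostic."""
    actual_uid = os.stat(path).st_uid
    if actual_uid == expected_uid:
        return None
    return f"{path} is owned by uid {actual_uid}, expected uid {expected_uid}"


def demo():
    """Check a freshly created temp file against the current effective uid."""
    with tempfile.NamedTemporaryFile() as handle:
        return check_owner(handle.name, os.geteuid())


if __name__ == "__main__":
    print(demo())
```

On the deployment host the expected uid would be that of the imagecatalog user, looked up however is convenient (for example with pwd.getpwnam).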

Clement_Goubert lowered the priority of this task from High to Low. Mar 21 2024, 3:58 PM

As the action taken in production fixed the immediate problem, lowering priority.

Curious: As @Clement_Goubert and I discussed, both the directory (via puppet file) and the database file (via puppet exec of imagecatalog init) have the right user, imagecatalog. There's nothing in Puppet (like a recurse) to ensure ownership on the database file, but it still ought to come out correct, as far as I can tell. Claime reports the file was owned by mwbuilder, which runs the release tools, but I think imagecatalog init should still have run first as the imagecatalog user.

The expedient thing is probably to just puppetize the ownership correctly, but I'd love to know how it got this way.

(Separately: We never really established a coherent active/passive story for the image catalog, including data syncs, so I'm not that surprised it choked when the deployment server was switched over. We can correct the ownership issue without fully addressing that yet. I also see in T287130#7651203 that my past self was tripping on the same rake.)

Checking on deploy2002 (which we moved away from with this switchover), the catalog.sqlite file stays in place after a switchover, and is now owned by the helm user there, as you mentioned in T287130#7651203.

imagecatalog init didn't run because the file already exists and the exec contains a creates stanza. Even if it had run, that exec runs as the imagecatalog user, so it would have failed to re-init the database because of permissions.
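
The skip described above can be modelled with a toy guard. This is a Python sketch of the behaviour only, not Puppet code: run_unless_exists and fake_init are hypothetical names, and the guard looks solely at whether the path exists, never at who owns it.

```python
import os
import tempfile


def run_unless_exists(path, command):
    """Toy model of a creates-style guard: run command only if path is absent.

    Returns True if the command ran, False if it was skipped.
    """
    if os.path.exists(path):
        return False
    command(path)
    return True


def fake_init(path):
    """Stand-in for an init step: create an empty file at path."""
    open(path, "w").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "catalog.sqlite")
        print(run_unless_exists(target, fake_init))  # first run creates the file
        print(run_unless_exists(target, fake_init))  # second run is skipped
```

Because the guard never inspects ownership, a file left behind with the wrong owner satisfies it just as well as a correct one, which is consistent with the init step being skipped here.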