
Debmonitor GC not running
Closed, Resolved, Public

Description

While looking at something completely different, I noticed that the Debmonitor garbage collection (GC) has not been running for a while. According to the logs, it fails with:

django.db.models.deletion.ProtectedError: ("Cannot delete some instances of model 'PackageVersion' because they are referenced through a protected foreign key: 'ImagePackage.upgradable_imageversion'"

This happens because image support was added to the GC only for the packages referenced by images, and not for the upgradable packages in the images.
Unfortunately the fix is not trivial: adding another level of join multiplies the number of rows to traverse to a size that is no longer feasible. The GC needs to be refactored to run smaller queries and to delete more selectively.
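
As an illustration of the "smaller queries" idea only, and not the code from the actual patch, here is a minimal sketch using Python's sqlite3 module. The table and column names are hypothetical and do not mirror Debmonitor's real models. It collects the referenced IDs with one small query per referencing column instead of one large join, then deletes the unreferenced rows in batches.

```python
import sqlite3

# Hypothetical schema: these table and column names are illustrative only
# and are not Debmonitor's real models.
REFERENCING_COLUMNS = [
    ("host_package", "package_version_id"),
    ("host_package", "upgradable_version_id"),
    ("image_package", "package_version_id"),
    ("image_package", "upgradable_version_id"),
]


def referenced_ids(conn):
    """Collect referenced package_version IDs, one small query per column."""
    ids = set()
    for table, column in REFERENCING_COLUMNS:
        rows = conn.execute(
            f"SELECT DISTINCT {column} FROM {table} WHERE {column} IS NOT NULL"
        )
        ids.update(row[0] for row in rows)
    return ids


def delete_orphans(conn, batch_size=500):
    """Delete package_version rows that nothing references, in batches.

    Returns the number of rows selected for deletion.
    """
    all_ids = {row[0] for row in conn.execute("SELECT id FROM package_version")}
    orphans = sorted(all_ids - referenced_ids(conn))
    for start in range(0, len(orphans), batch_size):
        batch = orphans[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        conn.execute(
            f"DELETE FROM package_version WHERE id IN ({placeholders})", batch
        )
    conn.commit()
    return len(orphans)
```

Including both the "installed" and the "upgradable" columns in the referenced set is what avoids the protected foreign key error here: a row is deleted only when no referencing column points at it.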

Why didn't we notice?

Because the GC runs from a crontab and uses systemd-cat to log to syslog, it produces no output on error and therefore no cron-spam email is sent. This should be fixed too, so that we are notified when the GC fails to run.
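
To illustrate the failure mode, here is a rough sketch of a wrapper, assuming cron is configured to mail any output a job produces. It is not the puppet change itself; the function names and the message wording are made up for this example. All command output still goes to the log, but a non-zero exit status produces one line on stderr for cron to pick up.

```python
import subprocess
import sys


def run_and_report(command, log_line, tag="debmonitor-maintenance"):
    """Run command, pass each output line to log_line, and return a short
    failure message if the exit status is non-zero (None on success)."""
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in result.stdout.splitlines():
        log_line(line)
    if result.returncode != 0:
        return (
            f"{command[0]} exited with status {result.returncode}; "
            f"check syslog for {tag}"
        )
    return None


def main(argv):
    """Hypothetical entry point: log to syslog, complain on stderr on failure."""
    import syslog  # Unix only, imported here so the helper above stays portable

    syslog.openlog("debmonitor-maintenance")
    message = run_and_report(argv, syslog.syslog)
    if message:
        # Any output from a cron job is what triggers the notification email.
        print(message, file=sys.stderr)
        return 1
    return 0
```

A crontab entry would then call a script that runs `sys.exit(main(sys.argv[1:]))` with the GC command as its arguments, so a successful run stays silent and a failed one produces a single line of output.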

Event Timeline

Volans triaged this task as High priority. Jun 9 2020, 12:15 PM

Change 603992 had a related patch set uploaded (by Volans; owner: Volans):
[operations/software/debmonitor@master] GC: fix garbage collection and refactor its query

https://gerrit.wikimedia.org/r/603992

Change 603993 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] debmonitor GC: generate cron email on failure

https://gerrit.wikimedia.org/r/603993

Change 603993 merged by Volans:
[operations/puppet@production] debmonitor GC: generate cron email on failure

https://gerrit.wikimedia.org/r/603993

Change 603992 merged by jenkins-bot:
[operations/software/debmonitor@master] GC: fix garbage collection and refactor its query

https://gerrit.wikimedia.org/r/603992

Mentioned in SAL (#wikimedia-operations) [2020-06-09T15:06:39Z] <volans> forcing a debmonitor GC to verify the fix of T254865

It ran successfully!

Jun  9 15:07:15 debmonitor1001 debmonitor-maintenance[9730]: Deleted 3878 PackageVersion objects not referenced by any HostPackage or ImagePackage
Jun  9 15:07:15 debmonitor1001 debmonitor-maintenance[9730]: Deleted 1007 Package objects not referenced by any PackageVersion
Jun  9 15:07:15 debmonitor1001 debmonitor-maintenance[9730]: Deleted 1947 SrcPackageVersion objects not referenced by any PackageVersion
Jun  9 15:07:15 debmonitor1001 debmonitor-maintenance[9730]: Deleted 523 SrcPackage objects not referenced by any SrcPackageVersion
Jun  9 15:07:15 debmonitor1001 debmonitor-maintenance[9730]: Deleted 3 KernelVersion objects not referenced by any Host

The fix in puppet has been tested: on failure it generates a cron-spam email that says to check syslog for debmonitor-maintenance.
The GC is now fixed and can resume its normal operations.
Debmonitor is now clear of orphaned objects.