
Netbox DB is growing out of control
Open, High, Public

Description

The Netbox database, since the migration to version 3.2 (see T296452), is growing out of control at a rate of ~300MB/day. Its compressed backups are growing at ~5MB/day.
Yesterday's expanded DB is 1.4GB in size, compared to 31 MB before the migration.

See the disk usage graph (link to live data in [1]):

Screenshot 2022-06-21 at 11.42.42.png (182 KB)

Looking at the largest tables:

netbox=# SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 3;
          relation          | total_size
----------------------------+------------
 public.extras_jobresult    | 148 MB
 public.extras_objectchange | 33 MB
 public.dcim_interface      | 5248 kB
(3 rows)

It's clear that extras_jobresult is the largest one, and that's due to the getstats.GetDeviceStats job:

netbox=# select name, count(*) as count from extras_jobresult group by name order by count desc limit 3;
          name           | count
-------------------------+-------
 getstats.GetDeviceStats | 67193
 puppetdb.PhysicalHosts  |   288
 accounting.Accounting   |   288
(3 rows)

I guess that is run by Prometheus at an unnecessarily high frequency; we could run it once a day.
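To confirm the frequency hypothesis, a quick check could look something like this (a sketch, not something run on the real DB; it assumes extras_jobresult has a created timestamp column, as in stock Netbox 3.x):

```sql
-- Sketch: count job results per hour for the suspect job.
-- Assumes a "created" timestamp column on extras_jobresult (Netbox 3.x default).
SELECT date_trunc('hour', created) AS hour, count(*) AS runs
  FROM extras_jobresult
 WHERE name = 'getstats.GetDeviceStats'
 GROUP BY hour
 ORDER BY hour DESC
 LIMIT 5;
```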

Looking at the relations also points to the same culprit:

netbox=# SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(C.oid) DESC
  LIMIT 3;
                          relation                          |  size
------------------------------------------------------------+---------
 public.extras_jobresult                                    | 134 MB
 public.extras_objectchange                                 | 29 MB
 pg_toast.pg_toast_19743                                    | 7024 kB
(3 rows)

[1] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=netboxdb1002&var-datasource=thanos&var-cluster=misc&from=now-30d&to=now&viewPanel=28&refresh=5m

Event Timeline

Volans triaged this task as Unbreak Now! priority. · Tue, Jun 21, 10:05 AM
Volans created this task.
Restricted Application added a subscriber: Aklapper.

Change 807091 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus: temporarily disable the Netbox job

https://gerrit.wikimedia.org/r/807091

somewhat related, looks like we should maybe run the following housekeeping job from a timer

Ack. I think we are already calling some manage.py commands from a timer, so we should already have the abstraction to run housekeeping too.

Change 807091 merged by Ayounsi:

[operations/puppet@production] Prometheus: temporarily disable the Netbox job

https://gerrit.wikimedia.org/r/807091

Mentioned in SAL (#wikimedia-operations) [2022-06-21T10:57:44Z] <volans> deleting netbox getstats.GetDeviceStats job results - T311048

Change 807095 had a related patch set uploaded (by Jbond; author: jbond):

[operations/software/netbox-extras@master] getstats: Delete old versions of this report before running

https://gerrit.wikimedia.org/r/807095

I've opted to run:

>>> jobs = JobResult.objects.filter(name='getstats.GetDeviceStats')
>>> for job in jobs:
...     job.delete()
...

Instead of the more concise JobResult.objects.filter(name='getstats.GetDeviceStats').delete(), because according to the Django documentation the latter goes mostly directly to SQL, while the former ensures that any customized delete() method is run if present.
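As a minimal illustration of the difference (plain Python, not Django; the class and hook names are invented for the example): a bulk SQL DELETE bypasses any per-object hook, while the per-object loop runs it every time:

```python
class JobResult:
    """Toy model: a custom delete() hook that a bulk SQL DELETE would bypass."""
    cleanup_log = []  # records which objects ran their cleanup hook

    def __init__(self, name):
        self.name = name

    def delete(self):
        # Per-object cleanup (e.g. removing files or related rows).
        # A queryset-level .delete() would issue one SQL DELETE and never call this.
        JobResult.cleanup_log.append(self.name)

jobs = [JobResult(f"getstats.GetDeviceStats #{i}") for i in range(3)]
for job in jobs:  # per-object loop: every hook runs
    job.delete()

print(len(JobResult.cleanup_log))  # → 3
```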

We're now down from ~70k job results to ~2700.

I've also run the housekeeping job:

# python manage.py housekeeping
[*] Clearing expired authentication sessions
	Sessions cleared.
[*] Checking for expired changelog records
	Deleting 54103 expired records... Done.
[*] Checking for expired jobresult records
	No expired records found.
[*] Checking for latest release
	Skipping: RELEASE_CHECK_URL not set
Finished.
Volans lowered the priority of this task from Unbreak Now! to High. · Tue, Jun 21, 11:28 AM
[*] Checking for expired changelog records
	Deleting 54103 expired records... Done.

This should not have happened: the CHANGELOG_RETENTION setting was removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/790681 and was not set in the Netbox UI. That change was also not strictly needed, as those settings can still be managed from the configuration file, which also seems to be a safer way to manage them (see the top of https://docs.netbox.dev/en/stable/configuration/dynamic-settings/ ).

I'll restore the deleted records from the backups in the afternoon.

As the immediate issue has been addressed by stopping the job that caused the growth, I'm reducing the priority to High.

Change 807099 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:netbox: add dynamic config back to config file

https://gerrit.wikimedia.org/r/807099

I've restored yesterday's DB backup to netbox-dev2002 (netbox-next) and deleted from it all the changelog entries still existing in the current Netbox production DB:

netbox=# delete from extras_objectchange where id > 81665;

Got a count of them:

netbox=# select count(*) from extras_objectchange;
 count
-------
 54103

Exported them to an SQL file with INSERT statements:

pg_dump --table=export_table --data-only --column-inserts -t extras_objectchange netbox > netbox.extras_objectchange.pre-T311048.sql

Edited the file to make sure it didn't have anything weird, in particular the setval of the sequence at the end of the dump.
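That kind of check can be sketched like so (hypothetical file path and contents; pg_dump appends a pg_catalog.setval(...) call to --data-only dumps to reset the sequence):

```shell
# Build a tiny stand-in for the dump (hypothetical contents).
cat > /tmp/dump_sample.sql <<'EOF'
INSERT INTO extras_objectchange (id) VALUES (1);
SELECT pg_catalog.setval('extras_objectchange_id_seq', 81665, true);
EOF

# Locate the sequence reset, then drop it if it should not be replayed.
grep -n 'setval' /tmp/dump_sample.sql
sed -i '/pg_catalog\.setval/d' /tmp/dump_sample.sql
grep -c 'setval' /tmp/dump_sample.sql || true   # now 0 matches
```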
Moved the file to netboxdb1002:

SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp -3 root@netbox-dev2002.codfw.wmnet:/var/lib/postgresql/netbox.extras_objectchange.pre-T311048.sql root@netboxdb1002.eqiad.wmnet:/srv/postgres-backup/

Took a safety backup:

postgres@netboxdb1002:/srv/postgres-backup$ /usr/bin/pg_dumpall | /bin/gzip > /srv/postgres-backup/root-preT311048recovery.psql-all-dbs-$(date +'%Y%m%d').sql.gz

Imported the data into netboxdb1002:

postgres@netboxdb1002:/srv/postgres-backup$ cat netbox.extras_objectchange.pre-T311048.sql | /usr/bin/psql -d netbox

Netbox is back to having 60608 changelog entries right now, and they seem to have kept the same IDs and order.

We're back to a DB dump of ~78MB (~8.5MB compressed).

Change 807099 merged by Jbond:

[operations/puppet@production] P:netbox: add dynamic config back to config file

https://gerrit.wikimedia.org/r/807099

This should not have happened, the CHANGELOG_RETENTION setting was removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/790681 and not set in the Netbox UI.

This has now been reverted


Thanks @jbond, I've manually re-run the housekeeping job and it went much better:

(venv) root@netbox1002:/srv/deployment/netbox/deploy/src/netbox# python manage.py housekeeping
[*] Clearing expired authentication sessions
	Sessions cleared.
[*] Checking for expired changelog records
	Deleting 442 expired records... Done.
[*] Checking for expired jobresult records
	No expired records found.
[*] Checking for latest release
	Skipping: RELEASE_CHECK_URL not set
Finished.

Next step is to set it as a timer to run daily or so on the primary netbox host.
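A sketch of what such a timer could look like as raw systemd units (unit names and paths are assumptions loosely based on the prompt shown above, not actual puppet output; in practice this would presumably go through the same puppet abstraction that already drives the other manage.py timers):

```ini
# netbox-housekeeping.service (hypothetical unit name and paths)
[Unit]
Description=Netbox housekeeping (expired sessions, changelog and job results)

[Service]
Type=oneshot
User=netbox
WorkingDirectory=/srv/deployment/netbox/deploy/src/netbox
ExecStart=/srv/deployment/netbox/venv/bin/python manage.py housekeeping

# netbox-housekeeping.timer (hypothetical)
[Unit]
Description=Run Netbox housekeeping daily

[Timer]
OnCalendar=daily
RandomizedDelaySec=1h

[Install]
WantedBy=timers.target
```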

Change 807095 merged by jenkins-bot:

[operations/software/netbox-extras@master] getstats: Delete old versions of this report before running

https://gerrit.wikimedia.org/r/807095