
Netbox rq.timeouts.JobTimeoutException
Closed, Resolved · Public

Description

In https://netbox.wikimedia.org/extras/reports/network.Network/

An exception occurred: JobTimeoutException: Task exceeded maximum timeout value (300 seconds)

Traceback (most recent call last):
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 187, in get
    rel_obj = self.field.get_cached_value(instance)
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/fields/mixins.py", line 15, in get_cached_value
    return instance._state.fields_cache[cache_name]
KeyError: 'device'


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/deployment/netbox/current/src/netbox/extras/reports.py", line 229, in run
    test_method()
  File "/srv/deployment/netbox-extras//reports/network.py", line 241, in test_mgmt_dns_hostname
    ipaddress.assigned_object.device.tenant
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 205, in get
    rel_obj = self.get_object(instance)
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 168, in get_object
    return qs.get(self.field.get_reverse_related_filter(instance))
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/query.py", line 492, in get
    num = len(clone)
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/query.py", line 302, in __len__
    self._fetch_all()
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/query.py", line 1507, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/query.py", line 57, in __iter__
    results = compiler.execute_sql(
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1361, in execute_sql
    cursor.execute(sql, params)
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django_prometheus/db/common.py", line 71, in execute
    return super().execute(*args, **kwargs)
  File "/usr/lib/python3.9/encodings/utf_8.py", line 15, in decode
    def decode(input, errors='strict'):
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/rq/timeouts.py", line 63, in handle_death_penalty
    raise self._exception('Task exceeded maximum timeout value ' '({0} seconds)'.format(self._timeout))
rq.timeouts.JobTimeoutException: Task exceeded maximum timeout value (300 seconds)
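For context, rq enforces this limit with a SIGALRM-based "death penalty": before running the job it arms an alarm for the configured timeout, and the signal handler raises `JobTimeoutException` inside the job if it fires first. A minimal self-contained sketch of that mechanism (class and method names modeled on rq's `timeouts` module, not imported from it):

```python
import signal


class JobTimeoutException(Exception):
    """Raised when a job exceeds its allotted wall-clock time."""


class UnixSignalDeathPenalty:
    """Sketch of rq's death penalty: arm a SIGALRM for `timeout`
    seconds on entry; the handler raises inside the running job."""

    def __init__(self, timeout):
        self.timeout = timeout

    def handle_death_penalty(self, signum, frame):
        raise JobTimeoutException(
            'Task exceeded maximum timeout value ({0} seconds)'.format(self.timeout))

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_death_penalty)
        signal.alarm(self.timeout)  # fire after `timeout` seconds

    def __exit__(self, exc_type, exc, tb):
        signal.alarm(0)  # cancel the pending alarm on the way out
```

Wrapping a slow call in `with UnixSignalDeathPenalty(300): ...` reproduces the exception in the traceback above once the alarm fires, regardless of what the job is doing at that moment (which is why the traceback can end in an unrelated frame such as `utf_8.py`).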

The issue still happens in Netbox 4.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I wonder if it has anything to do with T321704 and whether the (unmerged) patch there could help. T339133 is on a similar topic, although I think it's unrelated.

Not sure if it's a Netbox or Redis issue :(

If the former, the upgrade to 3.6 with T336275 could help.

For the latter I tried to dig a bit into the metrics. The current dashboard is limited (e.g. filtering on the instance doesn't work), so I "imported" and fixed this dashboard:
https://grafana.com/grafana/dashboards/763-redis-dashboard-for-prometheus-redis-exporter-1-x/
To: https://grafana.wikimedia.org/d/-Ay4Dd6Vz/redis-dashboard-for-prometheus-redis-exporter-1-x?orgId=1&var-namespace=&var-instance=rdb1009:16378&from=now-3h&to=now
There we can see that there are constantly 2 blocked clients; I don't know if that's a red herring or not. @jijiki, can you see any smoking gun around this Redis instance?

> Not sure if it's a Netbox or Redis issue :(

> There we can see that there are constantly 2 blocked clients; I don't know if that's a red herring or not. @jijiki, can you see any smoking gun around this Redis instance?

I had a quick look around the netbox redis instance. Regarding the blocked clients you are seeing:

127.0.0.1:6378> CLIENT LIST
id=341245 addr=10.64.0.186:38060 fd=13 name= age=5538345 idle=46 flags=b db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20536 events=r cmd=blpop user=default

id=343667 addr=10.192.0.96:50458 fd=16 name= age=5526580 idle=80 flags=b db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20536 events=r cmd=blpop user=default

<snip>

According to the docs, the BLPOP command blocks a connection when there are no elements left to pop. The CLIENT LIST output above shows 2 clients with that flag (flags=b), which lines up with what we see on the graphs.

For the time being, nothing seems to be out of the ordinary from that end.
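For illustration, the blocking behaviour described above is exactly what idle rq workers look like: each one parks on BLPOP until a job is pushed onto the queue. A rough Python analogy, using `queue.Queue` as a stand-in for the Redis list (names here are illustrative, not from rq):

```python
import queue
import threading
import time

# Stand-in for the Redis list backing the rq queue; BLPOP ~ Queue.get().
jobs = queue.Queue()


def worker(results):
    # Blocks, like an rq worker's BLPOP, until a job arrives or the
    # timeout expires (BLPOP takes a timeout argument the same way).
    results.append(jobs.get(timeout=5))


results = []
t = threading.Thread(target=worker, args=(results,))
t.start()
time.sleep(0.1)          # worker is now "blocked" (flags=b in CLIENT LIST terms)
jobs.put("run_report")   # an LPUSH-equivalent unblocks it immediately
t.join()
```

In other words, two permanently blocked clients are consistent with two healthy idle workers, not a fault.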

Thanks for having a look! I boldly made it a dependency of the Netbox 3.6 upgrade as we're running an ancient Netbox version. We can revisit if the upgrade doesn't solve the issue.

Volans changed the task status from Open to Stalled.Jan 8 2024, 4:18 PM
Volans triaged this task as Medium priority.
ayounsi renamed this task from Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException to Netbox rq.timeouts.JobTimeoutException.Aug 6 2024, 12:58 PM
ayounsi changed the task status from Stalled to Open.
ayounsi raised the priority of this task from Medium to High.
ayounsi updated the task description. (Show Details)
ayounsi added subscribers: MoritzMuehlenhoff, cmooney.

Change #1063168 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox: set RQ_DEFAULT_TIMEOUT back to default of 300

https://gerrit.wikimedia.org/r/1063168
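For reference, this knob lives in NetBox's `configuration.py`. A sketch of what the change amounts to (the prior override of 900 is inferred from the "900 seconds" timeout in the worker logs below; treat it as an assumption):

```python
# netbox/configuration.py (sketch)
# Maximum wall-clock seconds an RQ background job may run before the
# worker raises JobTimeoutException.
# RQ_DEFAULT_TIMEOUT = 900   # previous local override (assumed)
RQ_DEFAULT_TIMEOUT = 300     # back to the NetBox default
```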

Job dispatched to netbox1003 - takes less than 2min
Aug 16 09:24:25 netbox1003 python[1079619]: 09:24:25 default: extras.scripts.run_script(commit=True, data={}, job=<Job: 61e3e2e3-59b1-4982-81a7-5589221d7ed9>, request=<utilities.request.NetBoxFakeRequest ob>
Aug 16 09:25:47 netbox1003 python[1622900]: 09:25:47 default: Job OK (61e3e2e3-59b1-4982-81a7-5589221d7ed9)
Job dispatched to netbox2003 - fails after a 15min timeout
2024-08-16T09:27:15.055394+00:00 netbox2003 python[860666]: 09:27:15 default: extras.scripts.run_script(commit=True, data={}, job=<Job: 359179e0-cac4-451c-aa5a-f2cdcac6283b>, request=<utilities.request.NetBoxFakeRequest object at 0x7ff262fc63d0>) (359179e0-cac4-451c-aa5a-f2cdcac6283b)
[...]
2024-08-16T09:42:15.748983+00:00 netbox2003 python[1292363]: Exception raised during script execution: Task exceeded maximum timeout value (900 seconds)

I then stopped rq-netbox on netbox2003 to force all jobs to run on netbox1003, then ran the Capirca script multiple times without any issue.

So I think the issue is related to netbox2003's extra latency to talk to the Redis server in eqiad.

From there I see 3 paths forward:

  • Have Puppet only enable rq-netbox on the "primary" Netbox server (probably the easiest option, but goes in the opposite direction of having an active/active Netbox someday)
  • Figure out why the extra latency is causing a full timeout (probably the most complex option)
  • Have an active-active Redis between eqiad and codfw (not sure how realistic this option is)
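To see why extra latency can balloon a job past its timeout (the second bullet), note that rq and Django issue many small sequential round trips to Redis and the database, so total runtime grows linearly with per-command RTT. A back-of-envelope sketch (the RTT and round-trip-count figures are illustrative, not measured):

```python
# Illustrative per-command round-trip times in seconds (assumptions,
# not measurements of the eqiad/codfw links).
rtt_local = 0.0003     # worker in the same DC as Redis (eqiad)
rtt_cross_dc = 0.030   # worker in codfw talking to Redis in eqiad

n_round_trips = 20_000  # sequential commands a chatty job might issue


def runtime(rtt, n=n_round_trips, compute=60.0):
    """Total wall clock: fixed compute time plus one RTT per command."""
    return compute + n * rtt


print(runtime(rtt_local))     # ~66 s: comfortably under a 300 s timeout
print(runtime(rtt_cross_dc))  # ~660 s: the same job blows past 300 s
```

The same job that finishes in minutes locally can overshoot the timeout purely from accumulated round trips, which matches the netbox1003 (under 2 min) vs. netbox2003 (timeout) observation above.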

Change #1063179 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox: disable rq-netbox on secondary node

https://gerrit.wikimedia.org/r/1063179

Change #1063168 merged by Ayounsi:

[operations/puppet@production] Netbox: set RQ_DEFAULT_TIMEOUT back to default of 300

https://gerrit.wikimedia.org/r/1063168

Change #1063179 merged by Ayounsi:

[operations/puppet@production] Netbox: disable rq-netbox on secondary node

https://gerrit.wikimedia.org/r/1063179

ayounsi claimed this task.

After discussion we decided to go down the first path on the list. I haven't been able to replicate the issue since deploying the workaround/fix. Please reopen if you see the timeout again or would like to fix the root cause.