Netbox API Occasionally 500s and Netbox2001 dumpcsv fails
Closed, ResolvedPublic

Description

After the migration, the Netbox API (and thus the autonomic services that depend on it) occasionally fails. There's no immediate indication as to why, and the services seem to recover fine.

I suspect it has to do with there being only 2 uwsgi workers (the number of workers equals the number of CPUs). I think we should increase the number of workers per instance and see whether that addresses it.

Event Timeline

Aklapper renamed this task from Netbox API Occtaionally 500s to Netbox API Occasionally 500s.Sep 12 2019, 11:51 PM

@crusnov I have downtimed 'Check systemd state' checks until Sept 24 because it was polluting the alerts

crusnov renamed this task from Netbox API Occasionally 500s to Netbox API Occasionally 500s and Netbox2001 dumpcsv fails.Sep 18 2019, 4:28 AM

After increasing the CPU count to 4 on both front-ends, the number of 500 errors is much lower.

I spent some time debugging the problem with csv dumps from netbox2001. The basic gist is that when dumping a larger table, its pagination routine fails on the second page: the API returns an http URL instead of an https URL as the "next page" URL, and when netbox2001 tries to access an http URL, it eventually times out because :80 is blocked. I suspect a bug in Netbox itself. I traced the execution in pdb, and it showed that this value is coming from the remote end.

To be clear, I traced the execution of the csv dumper in pdb. It is almost certain that Netbox itself is returning the URL, because when I made the request manually, it returned such a URL to me.
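
For illustration only, here is a minimal sketch of a client-side workaround for the pagination problem described above. It is not the actual csv dumper code. It assumes the API reply is a JSON object with "results" and "next" keys, and fetch_json is a hypothetical callable that takes a URL and returns the decoded reply. The idea is simply to rewrite the scheme of each "next" URL to https before following it.

```python
from urllib.parse import urlsplit


def force_https(url):
    """Return url with its scheme replaced by https; everything else is kept."""
    return urlsplit(url)._replace(scheme="https").geturl()


def iter_pages(fetch_json, first_url):
    """Yield the "results" list of each page, following "next" links.

    fetch_json is a hypothetical callable (URL -> dict). The "next" URL
    returned by the server is never trusted to carry the right scheme.
    """
    url = first_url
    while url:
        page = fetch_json(url)
        yield page["results"]
        next_url = page.get("next")
        # Rewrite http:// to https:// so the request does not hit blocked :80.
        url = force_https(next_url) if next_url else None
```

This only papers over the symptom on the client; if the server is the one generating the wrong scheme, the real fix belongs there.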

From your description it seems that Netbox doesn't preserve the URL scheme on pagination.

Which is the API that returns the wrong data? Could you add an example here please?
Is there any issue already open upstream?

While debugging alerts on Netbox, I noticed that, unrelated to the CSV dumper, the ganeti sync sometimes returns a 500 error. It is caused by this error:

Traceback (most recent call last):
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/views.py", line 483, in dispatch
    self.initial(request, *args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/views.py", line 400, in initial
    self.perform_authentication(request)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/views.py", line 326, in perform_authentication
    request.user
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/request.py", line 223, in user
    self._authenticate()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/request.py", line 376, in _authenticate
    user_auth_tuple = authenticator.authenticate(self)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/authentication.py", line 192, in authenticate
    return self.authenticate_credentials(token)
  File "./netbox/api.py", line 40, in authenticate_credentials
    token = model.objects.prefetch_related('user').get(key=key)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/cacheops/query.py", line 356, in get
    return qs._no_monkey.get(qs, *args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/models/query.py", line 402, in get
    num = len(clone)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/models/query.py", line 256, in __len__
    self._fetch_all()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/cacheops/query.py", line 292, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1098, in execute_sql
    cursor = self.connection.cursor()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/base/base.py", line 256, in cursor
    return self._cursor()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/base/base.py", line 233, in _cursor
    self.ensure_connection()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/base/base.py", line 195, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django_prometheus/db/common.py", line 41, in get_new_connection
    *args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL:  password authentication failed for user "netbox"
FATAL:  password authentication failed for user "netbox"

Which pairs with the following message on the Postgres server:

2019-09-24 03:20:47 GMT FATAL:  password authentication failed for user "netbox"
2019-09-24 03:20:47 GMT DETAIL:  Password does not match for user "netbox".
        Connection matched pg_hba.conf line 110: "host  netbox  netbox  2620::861:1:208:80:154:12/128   md5"

Which is odd. I am still debugging it. My immediate guess is that two separate things in puppet are altering the user's password in Postgres, but I haven't confirmed that yet, nor do I have any evidence of it.

This is flapping again, and it has been for weeks.

11:44 <+icinga-wm> PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers

I suggest changing the conditions under which this alert becomes CRITICAL so that it stops flapping, as the flapping alerts are distracting.

jijiki triaged this task as Medium priority.Oct 7 2019, 8:58 AM

I am afraid this is alerting multiple times per day. If it is not something we are able to fix immediately, it makes sense to figure out how to prevent this alert from being distracting. I take it this is related to T233728 and T233624.

Thanks for bugging me about this. I have silenced the particular alerts for the time being to reduce spam. I should have time to debug it more this week, and we'll try to get it to shut up.

Rounding up some changes that have occurred:

  • Thanks to Arzhel we have deployed a fix to the https->http issue encountered in the API (it was a missing header that told django that the connection was secure)
  • We deployed a max-requests limit to the uwsgi configuration. This seems to have hugely reduced, but not eliminated, the 503 issue with certain API requests.
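
As a rough illustration of the first item: Django has a SECURE_PROXY_SSL_HEADER setting that names a request header a TLS-terminating proxy can set to mark the connection as secure. The sketch below is an assumption about how such a fix can look, not the change that was actually deployed; the header name X-Forwarded-Proto is a common choice and may differ from what our proxy sends. effective_scheme is a hypothetical, simplified model of how the scheme ends up being chosen, included only to show why a missing header leads to http URLs.

```python
# settings.py fragment (sketch): trust this header, set by the proxy, as proof of TLS.
SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")


def effective_scheme(meta, setting=SECURE_PROXY_SSL_HEADER):
    """Simplified, hypothetical model of the scheme decision.

    meta stands in for the request's META dict. If the configured header is
    present, it decides the scheme; otherwise the scheme of the local
    (proxy-to-app) connection is used, which is plain http behind a proxy.
    """
    header, secure_value = setting
    value = meta.get(header)
    if value is not None:
        return "https" if value == secure_value else "http"
    return meta.get("wsgi.url_scheme", "http")
```

With the header missing, the application only sees the plain http hop from the proxy, which would explain http "next page" URLs being generated for requests that arrived over https.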

Just an update to this. The 503 errors have been fully addressed with retries in the scripts in question (the errors appeared as systemd degradation alerts).
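
For reference, a minimal sketch of the kind of retry wrapper meant here. This is not the code in the scripts themselves; the function name, the number of attempts and the delays are all assumptions chosen for illustration.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, backoff=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func() and return its result, retrying on failure.

    Waits `delay` seconds after the first failure and multiplies the wait
    by `backoff` after each further failure. The exception from the last
    attempt is re-raised so a persistent outage still fails the unit.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

In a sync script one would pass a small function that performs the API request and raises on a 5xx response, with retry_on narrowed to that exception type so that genuine bugs are not retried.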

Report alerts we're working on.