
designate-sink lock ups (was: cloudcontrol1005/nova instance creation test is CRITICAL)
Closed, Invalid · Public

Description

For the last few hours, creation of DNS records for new VMs has been failing. This finally resulted in the alert above.

Event Timeline

I restarted designate-sink and things seem to be working. There's no obvious death-knell in the logs.

The logs are also showing this now, which may or may not be related:

2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher [req-cfc09dd5-4c51-466a-9983-cfd06d6df1e5 - - - - -] Callback raised an exception.: git.exc.GitCommandError: Cmd('git') failed due to: exit code(1)
  cmdline: git pull --rebase -v origin
  stderr: 'fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.'
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher Traceback (most recent call last):
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/oslo_messaging/notify/dispatcher.py", line 78, in _exec_callback
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     return callback(message["ctxt"],
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/designate/sink/service.py", line 124, in info
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     handler.process_notification(context, event_type, payload)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/wmf_sink/wmfsink.py", line 88, in process_notification
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     self._delete(payload,
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/wmf_sink/base.py", line 112, in _delete
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     self._delete_puppet_git_config(legacy_data['tenant_id'], legacy_fqdn)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/wmf_sink/base.py", line 197, in _delete_puppet_git_config
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     repo.remotes.origin.pull(rebase=True)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/git/remote.py", line 815, in pull
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     res = self._get_fetch_info_from_stderr(proc, progress)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/git/remote.py", line 674, in _get_fetch_info_from_stderr
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     proc.wait(stderr=stderr_text)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/git/cmd.py", line 408, in wait
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     raise GitCommandError(self.args, status, errstr)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher git.exc.GitCommandError: Cmd('git') failed due to: exit code(1)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   cmdline: git pull --rebase -v origin
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   stderr: 'fatal: Could not read from remote repository.
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher 
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher Please make sure you have the correct access rights
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher and the repository exists.'
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher

Updated to add: Sink seems to recover and continue after this, so it's probably not the cause of the outage.
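For reference, the failing call in the traceback is GitPython's repo.remotes.origin.pull(rebase=True) inside wmf_sink's _delete_puppet_git_config. A minimal sketch of how a handler could tolerate a transient git failure without letting the exception escape the notification callback; the helper name and logging here are hypothetical, not the actual wmf_sink code:

import logging

import git  # GitPython, the library shown in the traceback

log = logging.getLogger(__name__)

def pull_puppet_git_config(repo_path):
    """Rebase-pull the puppet config repo, tolerating transient remote errors.

    Hypothetical helper, not the real wmf_sink implementation.
    """
    repo = git.Repo(repo_path)
    try:
        repo.remotes.origin.pull(rebase=True)
    except git.exc.GitCommandError as exc:
        # e.g. 'fatal: Could not read from remote repository.' -- log and
        # carry on so one failed pull does not break the whole notification.
        log.warning("git pull failed for %s: %s", repo_path, exc)
        return False
    return True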

When I restarted the service, syslog looked like this:

Aug 30 00:52:06 cloudservices1005 systemd[1]: Stopping OpenStack Designate DNSaaS Sink...
Aug 30 00:52:07 cloudservices1005 pdns_server[13752]: [webserver] 9f7d2626-a2f7-4697-9d33-c1578fb46739 10.64.37.6:36058 "GET /metrics HTTP/1.1" 200 13190
Aug 30 00:52:19 cloudservices1005 pdns_recursor[70855]: [webserver] 359f6bb8-ee96-47eb-9974-f133ba52abc5 10.64.4.6:57034 "GET /metrics HTTP/1.1" 200 21911
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: State 'stop-sigterm' timed out. Killing.
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Killing process 16138 (designate-sink) with signal SIGKILL.
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Killing process 267406 (designate-sink) with signal SIGKILL.
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Main process exited, code=killed, status=9/KILL
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Failed with result 'timeout'.
Aug 30 00:52:21 cloudservices1005 systemd[1]: Stopped OpenStack Designate DNSaaS Sink.
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Consumed 1h 33min 15.197s CPU time.
Aug 30 00:52:21 cloudservices1005 systemd[1]: Started OpenStack Designate DNSaaS Sink.

... so it's locking up :(
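If whatever is hanging turns out to be a blocking external call in the sink's callback path (the git pull above is one candidate, though per the earlier note it seems to recover), the usual mitigation is to bound the call with a timeout so the worker stays responsive to SIGTERM. A hypothetical sketch using GitPython's kill_after_timeout option; the path and timeout value are made up, and this is not the actual wmf_sink code:

import git  # GitPython

# Illustrative only: kill the underlying git process if it has not finished
# after 60 seconds, instead of blocking the sink worker indefinitely.
repo = git.Repo("/path/to/puppet-config-repo")  # placeholder path
try:
    repo.git.pull("--rebase", "-v", "origin", kill_after_timeout=60)
except git.exc.GitCommandError as exc:
    # Raised for ordinary failures and also when the timeout kills the pull.
    print("pull failed or timed out:", exc)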

Andrew renamed this task from cloudcontrol1005/nova instance creation test is CRITICAL to designate-sink lock ups (was: cloudcontrol1005/nova instance creation test is CRITICAL). Aug 30 2022, 12:35 PM

The designate-sink processes in codfw1dev are also locked up.

Possible clue, seen while upgrading cloudnets:

/usr/lib/python3/dist-packages/tooz/drivers/pgsql.py:74: SyntaxWarning: 'str' object is not callable; perhaps you missed a comma?
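For what it's worth, CPython (3.8+) emits that SyntaxWarning at byte-compile time when a string literal is immediately followed by a call, which almost always means a missing comma in the source; it points at tooz's pgsql driver rather than designate itself. A small, unrelated sketch that reproduces the same warning:

import warnings

# Not the tooz code -- just a snippet shaped like the construct that triggers
# the warning: a string literal "called" as if it were a function.
snippet = 'errors = ["connection lost" ("retry in %ds" % 5)]'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    compile(snippet, "<example>", "exec")  # compile only; nothing is executed

for w in caught:
    print(w.category.__name__, "-", w.message)
# SyntaxWarning - 'str' object is not callable; perhaps you missed a comma?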

Mentioned in SAL (#wikimedia-cloud) [2022-10-10T01:22:37Z] <andrewbogott> restarting designate-sink on cloudservices100[45], possible example of T316614

I haven't seen this happen for ages. Maybe it was a one-off?