For the last few hours, creation of new DNS records for new VMs has been failing. This finally resulted in the alert above.
Event Timeline
I restarted designate-sink and things seem to be working. There's no obvious death-knell in the logs.
The logs are also saying this now, which might be related or unrelated:
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher [req-cfc09dd5-4c51-466a-9983-cfd06d6df1e5 - - - - -] Callback raised an exception.: git.exc.GitCommandError: Cmd('git') failed due to: exit code(1) cmdline: git pull --rebase -v origin stderr: 'fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists.'
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher Traceback (most recent call last):
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/oslo_messaging/notify/dispatcher.py", line 78, in _exec_callback
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     return callback(message["ctxt"],
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/designate/sink/service.py", line 124, in info
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     handler.process_notification(context, event_type, payload)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/wmf_sink/wmfsink.py", line 88, in process_notification
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     self._delete(payload,
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/wmf_sink/base.py", line 112, in _delete
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     self._delete_puppet_git_config(legacy_data['tenant_id'], legacy_fqdn)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/wmf_sink/base.py", line 197, in _delete_puppet_git_config
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     repo.remotes.origin.pull(rebase=True)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/git/remote.py", line 815, in pull
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     res = self._get_fetch_info_from_stderr(proc, progress)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/git/remote.py", line 674, in _get_fetch_info_from_stderr
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     proc.wait(stderr=stderr_text)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/git/cmd.py", line 408, in wait
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher     raise GitCommandError(self.args, status, errstr)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher git.exc.GitCommandError: Cmd('git') failed due to: exit code(1)
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   cmdline: git pull --rebase -v origin
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher   stderr: 'fatal: Could not read from remote repository.
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher Please make sure you have the correct access rights
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher and the repository exists.'
2022-08-30 00:59:16.466 1457502 ERROR oslo_messaging.notify.dispatcher
Updated to add: Sink seems to recover and continue after this, so it's probably not the cause of the outage.
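For context, the traceback shows the wmf_sink handler running `git pull --rebase` through GitPython inside the notification callback, so any hiccup reaching the remote surfaces as a GitCommandError and gets logged by oslo.messaging. A minimal sketch of that call pattern (a hypothetical helper, not the actual wmf_sink code), with the pull wrapped so a transient remote failure is logged rather than propagating:

# Illustrative sketch only -- not the actual wmf_sink code.
# Shows the GitPython call pattern from the traceback, with the pull
# wrapped so a transient remote failure does not escape the callback.
import logging

import git  # GitPython, as seen at /usr/lib/python3/dist-packages/git/

LOG = logging.getLogger(__name__)


def pull_puppet_repo(repo_path):
    """Rebase-pull the checkout; log and skip on transient git errors."""
    repo = git.Repo(repo_path)
    try:
        # Same call that raised in _delete_puppet_git_config()
        repo.remotes.origin.pull(rebase=True)
    except git.exc.GitCommandError:
        # e.g. "fatal: Could not read from remote repository."
        LOG.exception("git pull --rebase failed; will retry on the next event")
        return False
    return True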
When I restarted the service, syslog looked like this:
Aug 30 00:52:06 cloudservices1005 systemd[1]: Stopping OpenStack Designate DNSaaS Sink...
Aug 30 00:52:07 cloudservices1005 pdns_server[13752]: [webserver] 9f7d2626-a2f7-4697-9d33-c1578fb46739 10.64.37.6:36058 "GET /metrics HTTP/1.1" 200 13190
Aug 30 00:52:19 cloudservices1005 pdns_recursor[70855]: [webserver] 359f6bb8-ee96-47eb-9974-f133ba52abc5 10.64.4.6:57034 "GET /metrics HTTP/1.1" 200 21911
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: State 'stop-sigterm' timed out. Killing.
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Killing process 16138 (designate-sink) with signal SIGKILL.
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Killing process 267406 (designate-sink) with signal SIGKILL.
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Main process exited, code=killed, status=9/KILL
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Failed with result 'timeout'.
Aug 30 00:52:21 cloudservices1005 systemd[1]: Stopped OpenStack Designate DNSaaS Sink.
Aug 30 00:52:21 cloudservices1005 systemd[1]: designate-sink.service: Consumed 1h 33min 15.197s CPU time.
Aug 30 00:52:21 cloudservices1005 systemd[1]: Started OpenStack Designate DNSaaS Sink.
... so it's locking up :(
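The "State 'stop-sigterm' timed out. Killing." lines mean systemd waited out its stop timeout and then SIGKILLed the workers, i.e. the process wasn't responding to SIGTERM while wedged. A rough sketch of the general pattern for keeping a worker loop responsive to SIGTERM (illustrative only, not designate-sink's actual shutdown path):

# Illustrative sketch only -- not designate-sink's shutdown handling.
# A loop that re-checks a shutdown flag between units of work exits
# promptly on SIGTERM; a hang in the work itself is what turns into the
# systemd "stop-sigterm timed out" -> SIGKILL sequence seen above.
import signal
import time

shutting_down = False


def _handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True


signal.signal(signal.SIGTERM, _handle_sigterm)

while not shutting_down:
    # Process one notification / unit of work, then re-check the flag.
    time.sleep(1)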
Possible clue, while upgrading cloudnets:
/usr/lib/python3/dist-packages/tooz/drivers/pgsql.py:74: SyntaxWarning: 'str' object is not callable; perhaps you missed a comma?
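That SyntaxWarning is CPython flagging a string literal immediately followed by a call, which is almost always a missing comma between two items. It's emitted when the module is compiled/imported, so it can show up during an upgrade even if the affected code path never runs. A minimal reproduction of the warning (not the actual tooz/drivers/pgsql.py code):

# Minimal reproduction of the warning -- not the actual tooz code.
# A string literal followed directly by "(...)" parses as *calling* the
# string, so CPython warns at compile time:
#   SyntaxWarning: 'str' object is not callable; perhaps you missed a comma?


def buggy():
    return [
        "connection failed"
        ("retrying %s" % "later"),  # missing comma before this element
    ]


def fixed():
    return [
        "connection failed",
        "retrying %s" % "later",
    ]


print(fixed())  # buggy() would raise TypeError at runtime if called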
Mentioned in SAL (#wikimedia-cloud) [2022-10-10T01:22:37Z] <andrewbogott> restarting designate-sink on cloudservices100[45], possible example of T316614