openstack: dns_floating_ip_updater mechanism improvements to better handle transient errors
Closed, Resolved (Public)

Description

The dns_floating_ip_updater mechanism currently runs from a systemd timer, as a job that is run every 10 minutes.

For the last year (or more) we have usually been paged whenever the script fails to complete for any reason (API maintenance, for example).
The timer/script approach is not fault tolerant in any way. Perhaps we could improve the code by making it a daemon with proper error handling (or error ignoring), especially for transient errors.

The page is generated because the systemd unit designate_floating_ip_ptr_records_updater.service briefly enters the failed state (because the related timer fails).

Event Timeline

aborrero created this task. Nov 20 2019, 5:43 PM
Restricted Application added a subscriber: Aklapper. Nov 20 2019, 5:43 PM
aborrero renamed this task from openstack: dns_floating_ip_updater to openstack: dns_floating_ip_updater mechanism improvements to better handle transient errors. Nov 20 2019, 5:43 PM
aborrero triaged this task as Low priority. Nov 22 2019, 10:14 AM
aborrero raised the priority of this task from Low to Medium. Dec 11 2019, 5:47 PM

This is annoying. Raising priority.

Andrew claimed this task. Mon, Jan 13, 12:24 PM

I haven't written any code yet but I'm thinking about this. Changing it to a daemon wouldn't be hard but it's not entirely obvious to me how we'd monitor it in that case.

What I really want is to leave it as a timer and tell the monitoring "only tell us if this stays in a failed state for 60 minutes." That's pretty easy to do for the job itself, but then we'd still get alerted due to the general 'systemd is unhappy' check.
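As a rough illustration of the "only tell us if this stays failed for 60 minutes" idea, the decision could look something like the sketch below. The function name and the way the first-failure timestamp is obtained are assumptions made for illustration, not part of any existing check.

```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold taken from the comment above: alert only after 60 minutes of failure.
ALERT_AFTER = timedelta(minutes=60)


def should_alert(
    failed_since: Optional[datetime],
    now: datetime,
    threshold: timedelta = ALERT_AFTER,
) -> bool:
    """Return True only if the job has been failing continuously for `threshold`.

    `failed_since` is the time of the first failure in the current streak,
    or None if the most recent run succeeded (hypothetical bookkeeping).
    """
    if failed_since is None:
        return False
    return now - failed_since >= threshold
```

With a 10-minute timer, this would stay quiet for roughly the first six failed runs and only alert once the streak reaches an hour.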

Another option is to not monitor the return state of the script at all, and instead monitor the effect of the script via a canary VM. Maybe a parallel job that deletes the entry and then a check to see if the entry gets recreated?

It feels like I'm making this more complicated than it needs to be.

> I haven't written any code yet but I'm thinking about this. Changing it to a daemon wouldn't be hard but it's not entirely obvious to me how we'd monitor it in that case.

You could simply send an email if you detect a transient issue. How do we know it is transient? Count them. If count == 3, then exit() the daemon.

Systemd will then restart the service (if Restart=always) but if the daemon is malfunctioning for good, this will enter a loop of fail -> restart -> fail that systemd will detect and leave the service in failed state for icinga to page us.

I think the email would cover the transient errors that currently page us.
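A minimal sketch of the daemon idea described above, under some assumptions: the names `run_daemon`, `update_once` and `notify` are hypothetical, and resetting the counter after a successful run (so that only consecutive failures count) is my reading of "count them", not something stated in the comment.

```python
import time
from typing import Callable, Optional

MAX_ERRORS = 3  # "If count == 3, then exit() the daemon"


def run_daemon(
    update_once: Callable[[], None],
    notify: Callable[[str], None],
    interval_seconds: float = 600,
    max_errors: int = MAX_ERRORS,
    sleep: Callable[[float], None] = time.sleep,
    max_iterations: Optional[int] = None,
) -> int:
    """Run `update_once` forever; return 1 after `max_errors` failures in a row.

    `notify` stands in for "send an email". `max_iterations` exists only so
    the loop can be exercised in tests; a real daemon would leave it as None.
    """
    errors = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            update_once()
            errors = 0  # assumption: a success ends the failure streak
        except Exception as exc:
            errors += 1
            notify(f"updater failed ({errors}/{max_errors}): {exc}")
            if errors >= max_errors:
                return 1  # caller exits non-zero so systemd sees a failure
        sleep(interval_seconds)
    return 0
```

The caller would pass the return value to `sys.exit()`, so that a persistent failure leaves the restart and alerting behaviour to systemd and icinga as described above.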

Change 565043 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: modest refactor

https://gerrit.wikimedia.org/r/565043

Change 565044 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: retry if we encounter an exception

https://gerrit.wikimedia.org/r/565044

Change 565284 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: Partial refactor

https://gerrit.wikimedia.org/r/565284

Change 565285 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: further refactor

https://gerrit.wikimedia.org/r/565285

Change 565287 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: catch all exceptions

https://gerrit.wikimedia.org/r/565287

Change 565043 abandoned by Andrew Bogott:
wmcs-dns-floating-ip-updater.py: modest refactor

Reason:
I reworked this into more, smaller patches

https://gerrit.wikimedia.org/r/565043

Change 565286 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: add a main() function

https://gerrit.wikimedia.org/r/565286

Change 565284 merged by Andrew Bogott:
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: Partial refactor

https://gerrit.wikimedia.org/r/565284

Change 565285 merged by Andrew Bogott:
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: further refactor

https://gerrit.wikimedia.org/r/565285

Change 565286 merged by Andrew Bogott:
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: add a main() function

https://gerrit.wikimedia.org/r/565286

Change 565287 merged by Andrew Bogott:
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: catch all exceptions

https://gerrit.wikimedia.org/r/565287

Change 565044 merged by Andrew Bogott:
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: retry if we encounter an exception

https://gerrit.wikimedia.org/r/565044

Andrew closed this task as Resolved. Fri, Jan 17, 4:24 AM

With the merged patches, this is still activated by a systemd timer, but it now has a retry loop. I think that gets us what we want: we'll get the same alert as before, but only if the script fails three times in a row over several minutes.
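For readers who want a picture of what such a retry loop might look like, here is a minimal sketch. It is not the merged code: the function name, the two-minute delay, and the injectable `sleep` are assumptions chosen for illustration. Only the "three attempts, then fail as before" behaviour comes from the comment above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 120,  # assumed; the comment only says "several minutes"
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `task`, retrying on any exception, up to `attempts` times in total.

    The last failure is re-raised, so the script still exits non-zero and the
    systemd unit still enters the failed state, just as it did before.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: let the failure reach systemd
            sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")
```

A transient API error on the first or second attempt is absorbed by the loop, while a persistent outage still produces the same alert once the third attempt fails.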

Change 565458 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: move to python3

https://gerrit.wikimedia.org/r/565458

Change 565458 merged by Andrew Bogott:
[operations/puppet@production] wmcs-dns-floating-ip-updater.py: move to python3

https://gerrit.wikimedia.org/r/565458