
Maintain-dbusers should handle failures due to replicas being in maintenance
Closed, Resolved, Public

Description

When one of the labsdb servers is in maintenance mode, maintain-dbusers errors out. The script should handle this gracefully.

See also T188508#4011548 and T188508#4013420

Event Timeline

Quiddity renamed this task from Maintain-dbusers should handle failures due to replicas being in maintanence to Maintain-dbusers should handle failures due to replicas being in maintenance. Mar 1 2018, 11:35 PM

I have improved the script's handling of bad connections, but in my opinion it cannot actually detect maintenance unless it can be made to read the puppet configuration of another server. The only true indicator is depooling, which is a setting on the dbproxy servers (which is not where this script runs).

edit - actually, I have an idea. I'll start committing things while I consider it.

Change 436328 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: refactor some python and systemd stuff for maintain-dbusers

https://gerrit.wikimedia.org/r/436328

Change 436353 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: maintain-dbusers to skip offline labsdb servers

https://gerrit.wikimedia.org/r/436353

Change 436328 merged by Bstorm:
[operations/puppet@production] wiki replicas: refactor some python and systemd stuff for maintain-dbusers

https://gerrit.wikimedia.org/r/436328

Change 436353 merged by Bstorm:
[operations/puppet@production] wiki replicas: maintain-dbusers to skip offline labsdb servers

https://gerrit.wikimedia.org/r/436353

At this point, I think I've done everything that can practically be done for managing maintenance. Maintenance is indicated by changes on a database proxy, so there is really no way to inform maintain-dbusers of it short of setting up a service that watches the haproxy configuration files and either commits changes to puppet or sends some kind of feedback to maintain-dbusers. What we should do instead is simply pay attention to maintenance and remove the server from the config when a labsdb server is going to be locked up or offline. The script can now at least handle an offline server, and it will eventually fail instead of looping oddly (a totally dead script is easier to watch for than a looping one).
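For illustration only, the "skip offline servers, and fail eventually rather than loop" behaviour could look roughly like the sketch below. This is not the code from the merged patches: the function names, the port, the timeout, and the retry count are all assumptions made for the example. It also only detects hosts that are unreachable. A server that is depooled on the proxy but still accepting connections would pass the probe, which is the limitation described above.

```python
import socket
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def is_reachable(host: str, port: int = 3306, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical probe: port 3306 and the 2 second timeout are assumptions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def partition_hosts(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = is_reachable,
) -> Tuple[List[str], List[str]]:
    """Split hosts into (online, skipped) according to the probe."""
    online: List[str] = []
    skipped: List[str] = []
    for host in hosts:
        (online if probe(host) else skipped).append(host)
    return online, skipped


def run_with_bounded_retries(action: Callable[[], T], attempts: int = 3) -> T:
    """Call action, retrying on OSError, and give up after `attempts` tries.

    Raising here lets the process exit, so a supervisor or monitoring sees
    a dead script instead of one that loops forever.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except OSError as err:
            last_error = err
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error
```

A caller would run `partition_hosts` over the configured labsdb hosts at the start of each pass, log the skipped ones, and do account maintenance only on the online list. Passing a custom `probe` keeps the logic testable without a network.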