This happens in the create_account method: the tool account test-abogott5 fails to get created, and any account listed after it is not created either.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T272395 Cloud: reduce NAT exceptions from cloud to production |
| Resolved | | Andrew | T291405 [NFS] Reduce or eliminate bare-metal NFS servers |
| Resolved | | Andrew | T301280 Move project-specific NFS mounts onto project-local NFS servers |
| Resolved | | dcaro | T303663 Split maintain-dbusers.py into two parts, one to run on cloudcontrol nodes and one to run on an NFS server VM |
| Resolved | BUG REPORT | dcaro | T332762 New tool not allowed to connect to toolsdb |
| Resolved | BUG REPORT | dcaro | T332798 [maintain-dbusers] When creating accounts, the script bails out processing other accounts if one of them fails in an unexpected way |
| In Progress | | dcaro | T332955 [maintain-dbusers] Generate prometheus metrics |
Event Timeline
Fail fast has basically always been the behavior of maintain-dbusers. The historic fix would be either to delete the broken account (test-abogott5) or to correct whatever is causing that account's failure.
The reason for failing fast is that the errors become a blocker that must be fixed, rather than a growing number of individual failures that nobody monitors or fixes. If we had better monitoring to notice when one thing fails on every exec loop without blocking the loop, it would be completely reasonable to change things so that a single failure does not halt the entire queue.
The problem is that it does not fail the script or the service; it just skips the rest of the accounts for the toolsdb database. Maybe failing the run would be better there (or, as you say, more granular monitoring might be best).
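As a rough illustration of the "don't halt the queue, but make failures visible" idea discussed above, here is a minimal sketch. The names (create_pending_accounts, create_account, FAILED_ACCOUNT_CREATIONS) are hypothetical and not the actual maintain-dbusers API; the real script would plug in its own account source and Prometheus setup.

```python
# Hypothetical sketch: keep processing every pending account even if one
# fails, and surface failures through a Prometheus counter instead of
# aborting the rest of the queue. All names here are illustrative.
import logging

from prometheus_client import Counter

FAILED_ACCOUNT_CREATIONS = Counter(
    "maintain_dbusers_failed_account_creations_total",
    "Number of account creations that raised an unexpected error",
    ["database"],
)


def create_pending_accounts(accounts, database, create_account):
    """Try to create every pending account; log and count failures."""
    failures = []
    for account in accounts:
        try:
            create_account(account, database)
        except Exception:
            # Record the failure and keep going, so one broken account
            # (e.g. test-abogott5) does not block everyone listed after it.
            logging.exception(
                "Failed to create account %s on %s", account, database
            )
            FAILED_ACCOUNT_CREATIONS.labels(database=database).inc()
            failures.append(account)
    return failures
```

Whether the run should still exit non-zero when the returned failure list is non-empty (so the errors remain a blocker that gets fixed) is the open question from the comments above.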