During the failover of alerting host services from alert1001 to alert2001, vopsbot disconnected from IRC but did not reconnect automatically from alert2001.
The vopsbot systemd unit remained in a running state and logged several of these errors:
Apr 03 14:13:48 alert2001 vopsbot[21526]: t=2023-04-03T14:13:48+0000 lvl=eror msg="could not find the topic for this channel stored. Is the bot in the channel?" id=a296c831392ce1c2 host=irc.libera.chat:6697 nick=sirenbot error="sql: database is closed"
The failover process went like this:
- Disable puppet on both alert hosts
- Merge failover patch
- Run puppet on alert2001 (make active host)
- Repoint DNS
- Run puppet on alert1001 (make standby host)
During step 3 (the puppet run on alert2001), the vopsbot service was deployed to alert2001 and started while vopsbot was still running on alert1001, so I think the failure was caused by a collision between two vopsbot instances running at the same time.
I'll be sure to manually restart this service during future failovers. However, let's also look into how best to update the service so that it recovers automatically from situations like this, and/or how to monitor and alert for them.
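As a starting point for the monitoring idea, here is a minimal sketch of a check that flags the stuck state from journal output. It assumes the log lines keep the key=value layout shown above and that repeated `sql: database is closed` errors are a reliable signal. The function names and the threshold of 3 are hypothetical and not part of vopsbot.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# or a run of non-whitespace characters (the layout seen in the log above).
_FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(line):
    """Return the key=value pairs found in one journal line as a dict."""
    fields = {}
    for key, value in _FIELD_RE.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # strip the surrounding quotes
        fields[key] = value
    return fields


def is_stuck_line(line):
    """True if the line is the 'database is closed' error from the report."""
    fields = parse_fields(line)
    return (
        fields.get("lvl") == "eror"
        and fields.get("error") == "sql: database is closed"
    )


def should_restart(lines, threshold=3):
    """True if at least `threshold` lines show the stuck-state error.

    The threshold is an assumed value, chosen only for illustration.
    """
    return sum(1 for line in lines if is_stuck_line(line)) >= threshold
```

A wrapper could feed recent journal lines for the vopsbot unit into `should_restart` and either raise an alert or restart the unit. That wiring is deliberately left out here, since the right mechanism (alerting check versus automatic restart) is still an open question.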