
vopsbot needed manual restart after alerting hosts failover
Open, Needs Triage, Public


During the failover of alerting host services from alert1001 to alert2001, vopsbot disconnected from IRC and did not reconnect automatically from alert2001.

The vopsbot systemd unit remained in a running state and logged several of these errors:

Apr 03 14:13:48 alert2001 vopsbot[21526]: t=2023-04-03T14:13:48+0000 lvl=eror msg="could not find the topic for this channel stored. Is the bot in the channel?" id=a296c831392ce1c2 nick=sirenbot error="sql: database is closed"
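One possible way to alert on this condition is to match the error text in the journal output. The following is a minimal sketch, not vopsbot code: the function name `isClosedDBError` is hypothetical, and the plain substring match is an assumption chosen for simplicity (it does not parse the individual log fields).

```go
package main

import (
	"fmt"
	"strings"
)

// closedDBMarker is the error field seen in the log line quoted in this task.
const closedDBMarker = `error="sql: database is closed"`

// isClosedDBError reports whether a log line carries the
// "database is closed" error. Hypothetical helper: it uses a plain
// substring match and does not parse the key=value fields.
func isClosedDBError(line string) bool {
	return strings.Contains(line, closedDBMarker)
}

func main() {
	line := `lvl=eror msg="could not find the topic for this channel stored. Is the bot in the channel?" nick=sirenbot error="sql: database is closed"`
	if isClosedDBError(line) {
		fmt.Println("vopsbot lost its database handle; a restart is likely needed")
	}
}
```

A check like this could back a log-based alert, but it only detects the symptom after the fact; it does not make the bot recover.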

The failover process went like this:

  1. Disable puppet on both alert hosts
  2. Merge failover patch
  3. Run puppet on alert2001 (make active host)
  4. Repoint DNS
  5. Run puppet on alert1001 (make standby host)

During step 3 the vopsbot service was deployed to alert2001 and started while vopsbot was still running on alert1001, so I think the cause was a collision between two vopsbot instances running at the same time.

I'll be sure to manually restart this service during future failovers. However, let's also look into how best to make vopsbot recover automatically from situations like this, and/or how to monitor and alert on them.

Event Timeline

herron renamed this task from "vopsbot needs manual restart after alerting hosts failover" to "vopsbot needed manual restart after alerting hosts failover". Apr 3 2023, 4:27 PM
herron created this task.

In addition, the version of vopsbot on alert2001 was behind the one on alert1001. This outdated version was causing vopsbot to change the topic when no topic changes were necessary.

Noticed on IRC that sirenbot keeps updating the topic a lot: it sets it every couple of minutes, but without changing its value.