Bot does not detect when ssh connection to Gerrit is interrupted
Closed, Resolved | Public | BUG REPORT

Description

[08:41]  <    taavi> !log tools.wikibugs restart irc and gerrit jobs
...
[11:04]  <   wm-bot> !log taavi@tools-sgebastion-11 tools.wikibugs toolforge jobs restart gerrit
...
[15:55]  <    bd808> taavi: any notes on what I should look into regarding those irc and gerrit restarts for wikibugs? I haven't looked at the logs yet, but hope to soon.
[15:56]  <    taavi> bd808: the gerrit ssh connection seems to be the most unreliable part currently.
[15:57]  <    bd808> interesting. I've been wondering about trying Paramiko instead of shelling out. In theory that could give the code more visibility into protocol errors and network interruptions

The bot's own logs just show a lack of activity leading up to the restarts:

2024-03-04T08:09:40Z wikibugs2.gerrit INFO: stream-events: ...
2024-03-04T08:41:22Z wikibugs2.cli DEBUG: Invoking gerrit
2024-03-04T08:41:22Z wikibugs2.gerrit DEBUG: Writing ssh key to /tmp/tmpy7ntj06l
2024-03-04T08:41:22Z wikibugs2.gerrit INFO: Opened SSH connection for suchabot
2024-03-04T09:42:53Z wikibugs2.gerrit INFO: stream-events: ...
2024-03-04T11:04:19Z wikibugs2.cli DEBUG: Invoking gerrit
2024-03-04T11:04:19Z wikibugs2.gerrit DEBUG: Writing ssh key to /tmp/tmputijaqf0
2024-03-04T11:04:19Z wikibugs2.gerrit INFO: Opened SSH connection for suchabot

Details

Title: gerrit: Use asyncssh
Reference: toolforge-repos/wikibugs2!14
Author: bd808
Source Branch: work/bd808/T359096-ssh
Dest Branch: main

Event Timeline

When T335592: [jobs-api,jobs-cli] Support job health checks is ready, we could have a check for activity in general. Inside the container the bot could touch a file or write a timestamp when it processes an event. The health check could trigger a restart if no events had been seen in X minutes. This could cause a small loss of data in an otherwise healthy bot if gerrit was just quiet and something happened during the short restart window. In practice that may be more acceptable than things just not working at all until an operator intervenes.

Mentioned in SAL (#wikimedia-cloud) [2024-03-06T18:57:01Z] <wmbot~bd808@tools-sgebastion-11> Restarting gerrit job; last event logged at 2024-03-06T17:56:59Z (T359096)

bd808 changed the task status from Open to In Progress. (Mar 8 2024, 2:37 PM)
bd808 claimed this task.
bd808 moved this task from Ready to Go to Doing on the Wikibugs board.

I'm working on an asyncssh implementation of the polling loop.
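
For illustration only, here is one way an async read loop can notice a stalled stream: wrap each read in a timeout and treat end-of-file as an error. This is a sketch using only the standard library, not the code from toolforge-repos/wikibugs2!14. `read_events`, `StreamStalled`, and `max_silence` are hypothetical names, and whether this maps directly onto asyncssh's stream objects is an assumption.

```python
import asyncio


class StreamStalled(Exception):
    """Raised when no line arrives within the allowed silence window."""


async def read_events(reader, max_silence: float):
    """Yield lines from reader; raise if the stream goes quiet or closes.

    reader is any object with an awaitable readline() that returns an
    empty value at end-of-file, such as asyncio.StreamReader.
    """
    while True:
        try:
            line = await asyncio.wait_for(reader.readline(), timeout=max_silence)
        except asyncio.TimeoutError:
            # Nothing arrived in time: the connection may be dead, or the
            # remote side may simply be quiet. The caller decides what to do.
            raise StreamStalled(f"no data for {max_silence} seconds") from None
        if not line:
            # Empty read means the remote end closed the stream.
            raise ConnectionError("stream closed by remote end")
        yield line
```

A caller would iterate with `async for line in read_events(reader, 3600):` and reconnect when `StreamStalled` or `ConnectionError` is raised, rather than waiting for an operator to restart the job.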

Mentioned in SAL (#wikimedia-cloud) [2024-03-10T15:57:14Z] <wmbot~bd808@tools-sgebastion-11> Updated to 7855ab7 and restarted gerrit & irc jobs (T359096)