
gitlab-account-approval bot stalled on 2024-01-09
Closed, Resolved · Public · Bug Report

Description

tools.gitlab-account-approval@tools-sgebastion-10:~$ ./watch_logs.sh 
[pod/approve-28412649-ddks9/job] Searching for pending accounts...
[pod/approve-28412649-ddks9/job] 2024-01-09T00:09:06Z glaab.utils INFO: Checking monx94
^C
tools.gitlab-account-approval@tools-sgebastion-10:~$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
approve-28412649-ddks9   1/1     Running   0          20d

https://wikitech.wikimedia.org/w/index.php?title=Tool:Gitlab-account-approval/Log&oldid=2138114 has not seen an update since the 7th.

This is basically the same stuck state recorded in T306391#9436882

Details

Other Assignee: Legoktm

Merge requests:

  • Add network timeouts and other config to try and prevent deadlocks — toolforge-repos/gitlab-account-approval!12 (author bd808; work/bd808/deadlocks → main)
  • Run `approve` under `timeout 3m ..` — toolforge-repos/gitlab-account-approval!11 (author legoktm; legoktm-main-patch-94759 → main)

Event Timeline

The "glaab.utils INFO: Checking monx94" log line comes from glaab.utils.is_trusted. The next things that happen are:

  • an LDAP lookup based on cn
  • a Gerrit lookup of all trusted members
  • a Phabricator lookup on Developer name
  • (possibly) a Phabricator lookup on SUL name

All of these steps are network actions that could in principle hang. We need to check the various clients being used to see whether we are properly setting network connection and read timeouts. It would be far better for the bot to crash with a networking failure and run again a few minutes later than to hang indefinitely without anyone noticing it is stuck.

T306391: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime would be nice as a fallback too.
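The `timeout 3m ..` wrapper from !11 provides exactly that crash-and-retry behavior at the process level. A quick demonstration of the mechanism, using `sleep` as a stand-in since the bot's actual entry point is not shown here:

```shell
# Stand-in for a hung job: `sleep 5` plays the role of a stuck `approve` run.
# GNU timeout sends SIGTERM once the limit expires and exits with status 124,
# which the scheduler can treat as an ordinary failure and retry next cycle.
timeout 0.5 sleep 5
echo "exit status: $?"   # prints: exit status: 124
```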

bd808 triaged this task as Medium priority.
bd808 changed the subtype of this task from "Task" to "Bug Report".

Wanted to process T356586 but it's stuck again :(

tools.gitlab-account-approval@tools-sgebastion-10:~$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
approve-28448319-kgrlk   1/1     Running   0          2d6h

last log entry was:

[pod/approve-28448319-kgrlk/job] 2024-02-02T18:39:17Z glaab.utils INFO: Checking gryffin

I'm leaving it stuck in case that helps with debugging.

Mentioned in SAL (#wikimedia-cloud) [2024-02-05T17:14:11Z] <wmbot~bd808@tools-sgebastion-11> Killed stuck approve-28448319-kgrlk job (T356097)

This time it was not stuck on the first user, but one relatively far into the list of possibles to check. This is still the same fundamental place as described in T356097#9504524.

Mentioned in SAL (#wikimedia-cloud) [2024-02-05T17:44:54Z] <wmbot~bd808@tools-sgebastion-11> Rebuilt image to pick up timeout hack for T356097

Looking into places where timeouts may be missing:

The "glaab.utils INFO: Checking monx94" log line comes from glaab.utils.is_trusted. The next things that happen are:

  • an LDAP lookup based on cn

The setup in glaab.ldap.Client.__init__ seems correct.

The only thing I'm not completely sure of here is if lazy=True for the connection somehow prevents the ServerPool from tracking availability. In this app I think we could drop that lazy specifier without any serious impacts.

  • a Gerrit lookup of all trusted members
  • a Phabricator lookup on Developer name
  • (possibly) a Phabricator lookup on SUL name

All of these use the Python requests library, and none of them currently set a timeout for either connect or read (https://docs.python-requests.org/en/latest/user/advanced/#timeouts). The same applies to the app's interactions with gitlab.wikimedia.org. This is the most likely cause of the deadlocks: without an explicit `timeout` argument, requests will wait forever on a silent peer.
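One common pattern for guaranteeing that every call gets a timeout is a Session subclass that injects a default. This is a sketch, not the code from !12; the class name and default values here are invented:

```python
import requests


class TimeoutSession(requests.Session):
    """A Session that injects a default (connect, read) timeout when the
    caller does not pass one, so no request can block forever."""

    def __init__(self, timeout=(3.05, 30)):
        super().__init__()
        self.timeout = timeout

    def request(self, method, url, **kwargs):
        # Only fill in the default; an explicit timeout=... still wins.
        kwargs.setdefault("timeout", self.timeout)
        return super().request(method, url, **kwargs)
```

With this in place, a dead server raises `requests.exceptions.ConnectTimeout` or `ReadTimeout` instead of hanging, and the next scheduled run simply retries.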

In local testing, the LDAP connection is not raising errors when no server is available. Instead things just seem to hang indefinitely which is obviously not ideal and also the general behavior we have seen in the live deployment.

Enabling network-level library logging with ldap3.utils.log.set_library_log_detail_level(ldap3.utils.log.NETWORK) reveals what is going wrong:

2024-02-05T15:08:59Z ldap3 DEBUG: NETWORK:entering loop number <True> for finding active server in pool <servers:
ldap://127.0.0.1:3389 - cleartext
Pool strategy: ROUND_ROBIN
 - Last used server: ldap://127.0.0.1:3389 - cleartext>
2024-02-05T15:08:59Z ldap3 DEBUG: NETWORK:checking server <ldap://127.0.0.1:3389 - cleartext> for availability
2024-02-05T15:08:59Z ldap3 DEBUG: BASIC:address for <ldap://127.0.0.1:3389 - cleartext> resolved as <[<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 3389)]>
2024-02-05T15:08:59Z ldap3 DEBUG: BASIC:obtained candidate address for <ldap://127.0.0.1:3389 - cleartext>: <[<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 3389)]> with mode IP_V4_ONLY
2024-02-05T15:08:59Z ldap3 DEBUG: ERROR:server <ldap://127.0.0.1:3389 - cleartext> not available at <[<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 3389), False, datetime.datetime(2024, 2, 5, 15, 8, 59, 61483)]>
2024-02-05T15:08:59Z ldap3 DEBUG: NETWORK:waiting for 10 seconds before retrying pool servers cycle

The use of active=True in the ldap3.ServerPool instance lets the library loop forever trying to get a working connection from the pool.

From the ldap3 documentation: "The active and exhaust parameter accept either a boolean or a number: if you set active=True while defining the ServerPool the strategy will check for server availability, you can also set this attribute to the maximum number of cycles to try before giving up with an LDAPServerPoolExhaustedError exception."

For this app, once through the pool seems appropriate.
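A sketch of what that pool configuration might look like. The hostnames and timeout values are placeholders, not glaab's actual config:

```python
import ldap3

# connect_timeout bounds how long each TCP connection attempt may take.
servers = [
    ldap3.Server("ldap1.example.org", connect_timeout=5),
    ldap3.Server("ldap2.example.org", connect_timeout=5),
]

# active=1: make a single availability pass over the pool, then raise
# LDAPServerPoolExhaustedError instead of looping forever (the bug above).
# exhaust=True: drop servers found unavailable rather than retrying them.
pool = ldap3.ServerPool(
    servers,
    pool_strategy=ldap3.ROUND_ROBIN,
    active=1,
    exhaust=True,
)

# receive_timeout bounds how long a read may block waiting for a response;
# auto_bind=True connects immediately instead of deferring via lazy=True.
conn = ldap3.Connection(pool, auto_bind=True, receive_timeout=10)
```

With `active=1`, a dead pool makes the bot crash quickly with a clear exception, which is exactly the fail-fast behavior wanted here.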

Mentioned in SAL (#wikimedia-cloud) [2024-02-05T22:53:29Z] <wmbot~bd808@tools-sgebastion-11> Rebuilt image to pick up deadlock fixes for T356097

bd808 updated Other Assignee, added: Legoktm.
bd808 moved this task from To Do to Done on the User-bd808 board.

I think that between the `timeout 3m ...` protection, the fix for the unintuitive LDAP server pool infinite search loop (T356097#9515497), and the connect and read timeouts added when talking to GitLab, Gerrit, and Phabricator, we should no longer see the bot become indefinitely stuck.