
gitlab-account-approval bot stalled on 2024-01-09
Closed, Resolved · Public · Bug Report

Description

tools.gitlab-account-approval@tools-sgebastion-10:~$ ./watch_logs.sh 
[pod/approve-28412649-ddks9/job] Searching for pending accounts...
[pod/approve-28412649-ddks9/job] 2024-01-09T00:09:06Z glaab.utils INFO: Checking monx94
^C
tools.gitlab-account-approval@tools-sgebastion-10:~$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
approve-28412649-ddks9   1/1     Running   0          20d

https://wikitech.wikimedia.org/w/index.php?title=Tool:Gitlab-account-approval/Log&oldid=2138114 has not seen an update since the 7th.

This is basically the same stuck state recorded in T306391#9436882

Details

Other Assignee: Legoktm

Merge requests:

  • Add network timeouts and other config to try and prevent deadlocks — toolforge-repos/gitlab-account-approval!12 (author bd808; work/bd808/deadlocks → main)
  • Run `approve` under `timeout 3m ..` — toolforge-repos/gitlab-account-approval!11 (author legoktm; legoktm-main-patch-94759 → main)

Event Timeline

The "glaab.utils INFO: Checking monx94" log line comes from glaab.utils.is_trusted. The next things that happen are:

  • an LDAP lookup based on cn
  • a Gerrit lookup of all trusted members
  • a Phabricator lookup on Developer name
  • (possibly) a Phabricator lookup on SUL name

All of these steps are network actions that could in principle hang. We need to check the various clients being used to see whether we are properly setting network connection and read timeouts. It would be far better for the bot to crash with a networking failure and run again a few minutes later than to hang indefinitely without anyone noticing it is stuck.

T306391: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime would be nice as a fallback too.
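The `timeout 3m ..` wrapper from !11 provides exactly that crash-and-retry behavior at the process level. A quick demonstration of the mechanism, using `sleep` as a stand-in since the bot's actual entry point is not shown here:

```shell
# Stand-in for a hung job: `sleep 5` plays the role of a stuck `approve` run.
# GNU timeout sends SIGTERM once the limit expires and exits with status 124,
# which the scheduler can treat as an ordinary failure and retry next cycle.
timeout 0.5 sleep 5
echo "exit status: $?"   # prints: exit status: 124
```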

bd808 triaged this task as Medium priority.
bd808 changed the subtype of this task from "Task" to "Bug Report".

Wanted to process T356586 but it's stuck again :(

tools.gitlab-account-approval@tools-sgebastion-10:~$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
approve-28448319-kgrlk   1/1     Running   0          2d6h

last log entry was:

[pod/approve-28448319-kgrlk/job] 2024-02-02T18:39:17Z glaab.utils INFO: Checking gryffin

I'm leaving it stuck in case that helps with debugging.

Mentioned in SAL (#wikimedia-cloud) [2024-02-05T17:14:11Z] <wmbot~bd808@tools-sgebastion-11> Killed stuck approve-28448319-kgrlk job (T356097)

This time it was not stuck on the first user, but one relatively far into the list of possibles to check. This is still the same fundamental place as described in T356097#9504524.

Mentioned in SAL (#wikimedia-cloud) [2024-02-05T17:44:54Z] <wmbot~bd808@tools-sgebastion-11> Rebuilt image to pick up timeout hack for T356097

Looking into places where timeouts may be missing:

The "glaab.utils INFO: Checking monx94" log line comes from glaab.utils.is_trusted. The next things that happen are:

  • an LDAP lookup based on cn

The setup in glaab.ldap.Client.__init__ seems correct.

The only thing I'm not completely sure of here is if lazy=True for the connection somehow prevents the ServerPool from tracking availability. In this app I think we could drop that lazy specifier without any serious impacts.

  • a Gerrit lookup of all trusted members
  • a Phabricator lookup on Developer name
  • (possibly) a Phabricator lookup on SUL name

All of these use the Python requests library, and none of them currently set a timeout for either connect or read (https://docs.python-requests.org/en/latest/user/advanced/#timeouts). The same applies to the app's interactions with gitlab.wikimedia.org. This is the most likely cause of the deadlocks: without an explicit `timeout` argument, requests will wait forever on a silent peer.
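One common pattern for guaranteeing that every call gets a timeout is a Session subclass that injects a default. This is a sketch, not the code from !12; the class name and default values here are invented:

```python
import requests


class TimeoutSession(requests.Session):
    """A Session that injects a default (connect, read) timeout when the
    caller does not pass one, so no request can block forever."""

    def __init__(self, timeout=(3.05, 30)):
        super().__init__()
        self.timeout = timeout

    def request(self, method, url, **kwargs):
        # Only fill in the default; an explicit timeout=... still wins.
        kwargs.setdefault("timeout", self.timeout)
        return super().request(method, url, **kwargs)
```

With this in place, a dead server raises `requests.exceptions.ConnectTimeout` or `ReadTimeout` instead of hanging, and the next scheduled run simply retries.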

In local testing, the LDAP connection is not raising errors when no server is available. Instead things just seem to hang indefinitely which is obviously not ideal and also the general behavior we have seen in the live deployment.

Enabling network-level library logging with ldap3.utils.log.set_library_log_detail_level(ldap3.utils.log.NETWORK) reveals what is going wrong:

2024-02-05T15:08:59Z ldap3 DEBUG: NETWORK:entering loop number <True> for finding active server in pool <servers:
ldap://127.0.0.1:3389 - cleartext
Pool strategy: ROUND_ROBIN
 - Last used server: ldap://127.0.0.1:3389 - cleartext>
2024-02-05T15:08:59Z ldap3 DEBUG: NETWORK:checking server <ldap://127.0.0.1:3389 - cleartext> for availability
2024-02-05T15:08:59Z ldap3 DEBUG: BASIC:address for <ldap://127.0.0.1:3389 - cleartext> resolved as <[<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 3389)]>
2024-02-05T15:08:59Z ldap3 DEBUG: BASIC:obtained candidate address for <ldap://127.0.0.1:3389 - cleartext>: <[<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 3389)]> with mode IP_V4_ONLY
2024-02-05T15:08:59Z ldap3 DEBUG: ERROR:server <ldap://127.0.0.1:3389 - cleartext> not available at <[<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 3389), False, datetime.datetime(2024, 2, 5, 15, 8, 59, 61483)]>
2024-02-05T15:08:59Z ldap3 DEBUG: NETWORK:waiting for 10 seconds before retrying pool servers cycle

The use of active=True in the ldap3.ServerPool instance lets the library loop forever trying to get a working connection from the pool.

From the ldap3 documentation: "The active and exhaust parameter accept either a boolean or a number: if you set active=True while defining the ServerPool the strategy will check for server availability, you can also set this attribute to the maximum number of cycles to try before giving up with an LDAPServerPoolExhaustedError exception."

For this app, once through the pool seems appropriate.
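A sketch of what that pool configuration might look like. The hostnames and timeout values are placeholders, not glaab's actual config:

```python
import ldap3

# connect_timeout bounds how long each TCP connection attempt may take.
servers = [
    ldap3.Server("ldap1.example.org", connect_timeout=5),
    ldap3.Server("ldap2.example.org", connect_timeout=5),
]

# active=1: make a single availability pass over the pool, then raise
# LDAPServerPoolExhaustedError instead of looping forever (the bug above).
# exhaust=True: drop servers found unavailable rather than retrying them.
pool = ldap3.ServerPool(
    servers,
    pool_strategy=ldap3.ROUND_ROBIN,
    active=1,
    exhaust=True,
)

# receive_timeout bounds how long a read may block waiting for a response;
# auto_bind=True connects immediately instead of deferring via lazy=True.
conn = ldap3.Connection(pool, auto_bind=True, receive_timeout=10)
```

With `active=1`, a dead pool makes the bot crash quickly with a clear exception, which is exactly the fail-fast behavior wanted here.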

Mentioned in SAL (#wikimedia-cloud) [2024-02-05T22:53:29Z] <wmbot~bd808@tools-sgebastion-11> Rebuilt image to pick up deadlock fixes for T356097

bd808 updated Other Assignee, added: Legoktm.
bd808 moved this task from To Do to Done on the User-bd808 board.

I think that between the `timeout 3m ...` protection, the fix for the unintuitive LDAP server pool infinite search loop (T356097#9515497), and the connect and read timeouts added when talking to GitLab, Gerrit, and Phabricator, we should no longer see the bot become indefinitely stuck.