Page MenuHomePhabricator

Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21
Closed, ResolvedPublic

Description

After successfully building and pushing the container image to the registry, the post-build email code seems to run and result in this:
From https://releases-jenkins.wikimedia.org/job/build-mw-container-image/3213/console

00:30:52 FATAL: org.springframework.security.authentication.DisabledException: The user "TheresNoTime" is administratively disabled.
00:30:52 org.springframework.security.authentication.DisabledException: The user "TheresNoTime" is administratively disabled.
00:30:52 	at hudson.security.UserAttributesHelper.checkIfUserEnabled(UserAttributesHelper.java:104)
00:30:52 	at hudson.security.LDAPSecurityRealm$LDAPUserDetailsService.loadUserByUsername(LDAPSecurityRealm.java:1315)
00:30:52 	at hudson.security.LDAPSecurityRealm$DelegateLDAPUserDetailsService.loadUserByUsername(LDAPSecurityRealm.java:1228)
00:30:52 	at hudson.security.LDAPSecurityRealm.loadUserByUsername2(LDAPSecurityRealm.java:763)
00:30:52 	at jenkins.security.UserDetailsCache$Retriever.call(UserDetailsCache.java:170)
00:30:52 	at jenkins.security.UserDetailsCache$Retriever.call(UserDetailsCache.java:159)
00:30:52 	at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4868)
00:30:52 	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3533)
00:30:52 	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2282)
00:30:52 	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2159)
00:30:52 	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2049)
00:30:52 Caused: com.google.common.util.concurrent.UncheckedExecutionException
00:30:52 	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2055)
00:30:52 	at com.google.common.cache.LocalCache.get(LocalCache.java:3966)
00:30:52 	at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4863)
00:30:52 	at jenkins.security.UserDetailsCache.loadUserByUsername(UserDetailsCache.java:127)
00:30:52 	at hudson.model.User$UserIDCanonicalIdResolver.resolveCanonicalId(User.java:1254)
00:30:52 	at hudson.model.User$CanonicalIdResolver.resolve(User.java:1195)
00:30:52 	at hudson.model.User.get(User.java:524)
00:30:52 	at hudson.plugins.git.GitChangeSet.findOrCreateUser(GitChangeSet.java:450)
00:30:52 	at hudson.plugins.git.GitChangeSet.getAuthor(GitChangeSet.java:546)
00:30:52 	at jenkins.scm.RunWithSCM.calculateCulprits(RunWithSCM.java:141)
00:30:52 	at hudson.model.AbstractBuild.calculateCulprits(AbstractBuild.java:351)
00:30:52 	at jenkins.scm.RunWithSCM.getCulprits(RunWithSCM.java:96)
00:30:52 	at hudson.model.AbstractBuild.getCulprits(AbstractBuild.java:340)
00:30:52 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:713)
00:30:52 	at hudson.model.Run.execute(Run.java:1922)
00:30:52 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
00:30:52 	at hudson.model.ResourceController.execute(ResourceController.java:101)
00:30:52 	at hudson.model.Executor.run(Executor.java:442)
00:30:52 Email was triggered for: Failure - Any
00:30:52 Sending email for trigger: Failure - Any
00:30:52 Sending email to: releng@lists.wikimedia.org
00:30:52 Finished: FAILURE

Upstream issue: https://issues.jenkins.io/browse/JENKINS-67981

Reproducibility

One can attempt to retrieve the disabled user via the Jenkins script console at https://releases-jenkins.wikimedia.org/script:

println(User.get("theresnotime", false, Collections.emptyMap()));

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
dancy renamed this task from Container image builds started failing on https://releases-jenkins.wikimedia.org/ to Container image builds started failing on https://releases-jenkins.wikimedia.org/ 2022-08-21.Aug 22 2022, 3:29 PM

(adding @bd808 and @hashar — y'all might be interested in this & were involved in T302109... if needs be, please feel free to just re-enable that ldap account)

dancy renamed this task from Container image builds started failing on https://releases-jenkins.wikimedia.org/ 2022-08-21 to Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21.Aug 23 2022, 2:50 PM
dancy triaged this task as High priority.

The build got triggered by https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/823724/1//COMMIT_MSG which has author name set to TheresNoTime. From the stack-trace, Jenkins looks for an internal user having the same name which causes a query to LDAP. I have hard time understanding how it is considered administratively prohibited.

There was https://issues.jenkins.io/browse/JENKINS-67491 Build failure due to disabled LDAP user. Fix released with Git plugin 4.10.2. We run 4.11.4 on releases Jenkins.

https://issues.jenkins.io/browse/JENKINS-67981 Build failure due to ldap disabled account. Which looks like a duplicate.

The git plugin got updated on July 28th.

The build got triggered by https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/823724/1//COMMIT_MSG which has author name set to TheresNoTime. From the stack-trace, Jenkins looks for an internal user having the same name which causes a query to LDAP. I have hard time understanding how it is considered administratively prohibited.

I might be misunderstanding here, but is this not because that Wikitech account is blocked? Just going by this comment:

We don't generally delete accounts from LDAP, but we can disable the account by blocking it on wikitech which in turn triggers blocking of any linked accounts on Phabricator and Gerrit as well as marking the LDAP record as locked.

@TheresNoTime : sorry to rephrase, Jenkins find the account disabled, we apparently have disabled it in LDAP and I take all that as granted. What I meant is that I didn't want to go in the rabbit hole of understanding how Jenkins/Spring etc figure out how we disabled the account in LDAP or how do we disable an account. Sorry for the misleading comment.

The bulk of it is that JENKINS-67491 got a fix at https://github.com/jenkinsci/git-plugin/pull/1202/files which addresses looking up a user based on the first part of their email. Aka previously sending a commit with author theresnotime@example.org would have caused the same issue. That fixes got released and we have it.

However, they have missed a similar code a few lines above which does lookup the user based on the author name of the commit. It is not wrapped with a try/catch statement and that is the issue we are encountering.

00:30:52 at hudson.plugins.git.GitChangeSet.findOrCreateUser(GitChangeSet.java:450)

Here is the source code with some comments by me:

450             user = User.get(csAuthor, false, Collections.emptyMap());   // hashar: this lookup from the author name. Nothing caught, cause the bug.
451 
452             if (user == null) {
453                 if (csAuthorEmail == null || csAuthorEmail.isEmpty()) {
454                     return User.getUnknown();
455                 }
456                 // Ensure that malformed email addresses (in this case, just '@')
457                 // don't mess us up.
458                 String[] emailParts = csAuthorEmail.split("@"); // hashar: get the first part of the email
459                 if (emailParts.length > 0) {
460                     try {
461                         user = User.get(emailParts[0], true, Collections.emptyMap());  // hashar: lookup a user "theresnotime" extracted from <theresnotime@example.org>
462                     } catch (org.springframework.security.core.AuthenticationException authException) {  // hashar: caught it is disabled since 4.10.2 / JENKINS-67491
463                         // JENKINS-67491 - do not fail due to an authentication exception
464                         return User.getUnknown();
465                     }
466                 } else {
467                     return User.getUnknown();
468                 }
469             }

TLDR: I have commented on https://issues.jenkins.io/browse/JENKINS-67981?focusedCommentId=429520 . They need a try/catch statement.

hashar updated the task description. (Show Details)
hashar updated the task description. (Show Details)
hashar updated the task description. (Show Details)

Ah, thank you for the explanation 😌

@TheresNoTime Apologies on behalf of Jenkins for targeting you. :-)

Rereading the Upstream issue JENKINS-67981, the very first comment by the (one of) the maintainer(s) is:

The git plugin change in 4.10.2 to catch the exception thrown by User.get() for a disabled user was not applied to that line in the plugin. I've submitted git plugin pull request https://github.com/jenkinsci/git-plugin/pull/1233 to guard other calls in the same way.

Please test the build from https://github.com/jenkinsci/git-plugin/pull/1233 by downloading the plugin artifact from https://ci.jenkins.io/job/Plugins/job/git-plugin/job/PR-1233/

Which was my conclusion as well. User.get() missing guards. The PR https://github.com/jenkinsci/git-plugin/pull/1233 is still open. To build the plugin with the cherry pick one can:

git clone https://github.com/jenkinsci/git-plugin
cd git-plugin
git checkout git-4.11.5
git fetch origin refs/pull/1233/head
# Blame Github
git cherry-pick 6bfd837e94bf47cf8d23616d0068610326dcebb8
git cherry-pick f33342fe16c644409a7e0e4be2fa6dea270f2712  # Most probably NOT needed

Change the version in pom.xml to something such as 4.11.5-WMF~T315897

Then build:

mvn -Denforcer.skip clean package

The resulting plugin is ./target/git.hpi.

A Jenkins with the plugin loaded can be started using mvn hpi:run. Then one has to create a job, figure out how to create a disabled user that would cause the condition etc. It is probably easier to deploy the resulting artifact to the releases Jenkins.

If the outcome is positive we can report back to upstream. Hopefully that will get it merged and eventually released, then we can update the plugin to catch up with upstream release.

FATAL: org.springframework.security.authentication.DisabledException: The user "TheresNoTime" is administratively disabled.
org.springframework.security.authentication.DisabledException: The user "TheresNoTime" is administratively disabled.
	at hudson.security.UserAttributesHelper.checkIfUserEnabled(UserAttributesHelper.java:104)
	at hudson.security.LDAPSecurityRealm$LDAPUserDetailsService.loadUserByUsername(LDAPSecurityRealm.java:1315)
	at hudson.security.LDAPSecurityRealm$DelegateLDAPUserDetailsService.loadUserByUsername(LDAPSecurityRealm.java:1228)
	at hudson.security.LDAPSecurityRealm.loadUserByUsername2(LDAPSecurityRealm.java:763)
	at jenkins.security.UserDetailsCache$Retriever.call(UserDetailsCache.java:170)
	at jenkins.security.UserDetailsCache$Retriever.call(UserDetailsCache.java:159)
	at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4868)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3533)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2282)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2159)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2049)
Caused: com.google.common.util.concurrent.UncheckedExecutionException
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2055)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3966)
	at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4863)
	at jenkins.security.UserDetailsCache.loadUserByUsername(UserDetailsCache.java:127)
	at hudson.model.User$UserIDCanonicalIdResolver.resolveCanonicalId(User.java:1254)
	at hudson.model.User$CanonicalIdResolver.resolve(User.java:1195)
	at hudson.model.User.get(User.java:524)
	at hudson.plugins.git.GitChangeSet.findOrCreateUser(GitChangeSet.java:461)
	at hudson.plugins.git.GitChangeSet.getAuthor(GitChangeSet.java:561)
...

So somehow the PR code does not catch org.springframework.security.authentication.DisabledException which leads to the com.google.common.util.concurrent.UncheckedExecutionException. That is talked about on the pull request as an unknown issue with the author wondering why catching AuthenticationException does not catch the DisabledException.

Maybe that is jenkins.security.UserDetailsCache which is not handling the DisabledException and thus raise the UncheckedExecutionException. That would be a bug in Jenkins core itself maybe?

IIRC one iteration of the PR did caught the UncheckedExecutionException and that might "fix" it.

I have commented on PR 1233:


A bit more investigation. The plugin throws at:

hudson/plugins/git/GitChangeSet.java
            try {
                user = User.get(csAuthor, false, Collections.emptyMap());
            } catch (AuthenticationException authException) {
                // JENKINS-67491 - do not fail due to an authentication exception
                return User.getUnknown();
            }

Which only catches AuthenticationException.

The second trace at jenkins.security.UserDetailsCache.loadUserByUsername(UserDetailsCache.java:127) comes from Jenkins core:

core/src/main/java/jenkins/security/UserDetailsCache.java
/**
 * Locates the user based on the username, by first looking in the cache and then delegate to
 * {@link hudson.security.SecurityRealm#loadUserByUsername2(String)}.
 *
 * @param idOrFullName the username
 * @return the details
 *
 * @throws UsernameNotFoundException (normally a {@link hudson.security.UserMayOrMayNotExistException2})
 *              if the user could not be found or the user has no GrantedAuthority
 * @throws ExecutionException if anything else went wrong in the cache lookup/retrieval
 */
@NonNull
public UserDetails loadUserByUsername(String idOrFullName) throws UsernameNotFoundException, ExecutionException {
    Boolean exists = existenceCache.getIfPresent(idOrFullName);
    if (exists != null && !exists) {
        throw new UsernameNotFoundException(String.format("\"%s\" does not exist", idOrFullName));
    } else {
        try {
            return detailsCache.get(idOrFullName, new Retriever(idOrFullName));
        } catch (ExecutionException | UncheckedExecutionException e) {
            if (e.getCause() instanceof UsernameNotFoundException) {
                throw (UsernameNotFoundException) e.getCause();
            } else {
                throw e;
            }
        }
    }
}

Which retrow apparently either an ExecutionException or an UncheckedExecutionException rather than the DisabledUserException. Or to say it otherwise, the Jenkins user cache wraps in ExecutionException the exceptions about the user not being found or disabled. If the wrapped exception happens to be UsernameNotFoundException it throws it, else it throws the wrapper.

The PR catch AuthenticationException, it should instead catch ExecutionException as mentioned in the javadoc for loadUserByUsername:

* @throws UsernameNotFoundException (normally a {@link hudson.security.UserMayOrMayNotExistException2})
*              if the user could not be found or the user has no GrantedAuthority
* @throws ExecutionException if anything else went wrong in the cache lookup/retrieval

I have amended the task description with a repro case which can be run via https://releases-jenkins.wikimedia.org/script

The issue is UserDetailsCache.loadUserByUsername() retrieves from the cache which would throw the AuthenticationException wrapped in an ExecutionException. The Jenkins core method thus inspect the cause and only thrown an AuthenticationException when it is an instance of UsernameNotFoundException. The LDAP AuthenticationException DisabledException is thus not caught and the wrapper is returned.

I thus went to add a wrapper which inspects the cause and throw the cause whenever it is an AuthenticationException. That should surely be made to Jenkins core instead.

The fix I have tried live:

import org.springframework.security.core.AuthenticationException;
import com.google.common.util.concurrent.UncheckedExecutionException;

try {
  try {
    println("Attempting to retrieve disabled user 'theresnotime'");
    println(User.get("theresnotime", false, Collections.emptyMap()));
    } catch (AuthenticationException e) {
      println("Throwing the AuthenticationException directly");
      throw e;
   } catch (UncheckedExecutionException e) {
      println("Got an UncheckedExecutionException");
      if (e.getCause() instanceof AuthenticationException) {
          println("Found cause to be an AuthenticationException, throwing");
          throw (AuthenticationException) e.getCause();
      } else {
          println("Unhandled exception");
          throw e;
      }
  }
} catch (AuthenticationException e) {
	println(User.getUnknown());
}
console output
Attempting to retrieve disabled user 'theresnotime'
Got an UncheckedExecutionException
Found cause to be an AuthenticationException, throwing
unknown

I will send a new pull request.

Upstream pull request https://github.com/jenkinsci/git-plugin/pull/1322

I have compiled it against the tip of the master branch (13a61a06a4293e4307c18e0687f3be16fe516b3f) which describes as git-4.11.5-95-g13a61a06. I have set the version to 4.11.5-96-g13a61a06-PR1322 in the pom and the plugin will thus show as 4.11.5-96-g13a61a06-PR1322-SNAPSHOT.

The plugin can be retrieved from https://people.wikimedia.org/~hashar/T315897/git-4.11.5-96-g13a61a06-PR1322.hpi

The pull request is pending confirmation by others the issue is indeed fixed ( https://github.com/jenkinsci/git-plugin/pull/1322#issuecomment-1234543028 ). I have replied confirming it solved it for us.

Upstream has merged the change, that will be in the next release of the Git plugin. The current ones are 4.9.4 and 4.11.5 ( https://github.com/jenkinsci/git-plugin/releases ).

thcipriani changed the task status from Open to Stalled.Sep 14 2022, 4:45 PM
thcipriani subscribed.

Awaiting upstream realease

Mentioned in SAL (#wikimedia-releng) [2022-09-20T08:08:47Z] <hashar> Upgrading CI and releases Jenkins plugins notably to update the git client T315897

I have upgraded the git plugin on both our Jenkins.