Upgrade matrix-auth for Jenkins 2.440
Closed, Resolved (Public)

Description

When attempting to upgrade the releases Jenkins from 2.426.3 to 2.440.2, Configuration as Code failed to initialize and http://releases-jenkins.wikimedia.org became unavailable.

In the error log:

Mar 27 08:53:49 releases1003 jenkins[1678621]: [03/27/24 08:53:49] SSH Launch of releases1003.eqiad.wmnet on localhost completed in 4,597 ms
Mar 27 08:53:50 releases1003 jenkins[1678621]: SEVERE: [jenkins.InitReactorRunner$1 onTaskFailed] Failed ConfigurationAsCode.init
Mar 27 08:53:50 releases1003 jenkins[1678621]: SEVERE: [hudson.util.BootFailure publish] Failed to initialize Jenkins
Mar 27 08:53:54 releases1003 jenkins[1678621]: WARNING: [org.eclipse.jetty.server.handler.ContextHandler$Context log] Error while serving http://releases-jenkins.wikimedia.org/

And browsing the page I got:

HTTP ERROR 503 java.lang.IllegalStateException: Jenkins.instance is missing. Read the documentation of Jenkins.getInstanceOrNull to see what you are doing wrong.

URI: /
STATUS: 503
MESSAGE: java.lang.IllegalStateException: Jenkins.instance is missing. Read the documentation of Jenkins.getInstanceOrNull to see what you are doing wrong.
SERVLET: Stapler
CAUSED BY: java.lang.IllegalStateException: Jenkins.instance is missing. Read the documentation of Jenkins.getInstanceOrNull to see what you are doing wrong.
Caused by:

java.lang.IllegalStateException: Jenkins.instance is missing. Read the documentation of Jenkins.getInstanceOrNull to see what you are doing wrong.
	at jenkins.model.Jenkins.get(Jenkins.java:819)
	at org.jenkinsci.plugins.matrixauth.AuthorizationContainer.hasPermission(AuthorizationContainer.java:305)
	at hudson.security.GlobalMatrixAuthorizationStrategy$AclImpl.hasPermission(GlobalMatrixAuthorizationStrategy.java:120)
	at hudson.security.SidACL._hasPermission(SidACL.java:73)
	at hudson.security.SidACL.hasPermission2(SidACL.java:54)
	at hudson.security.ACL.checkPermission(ACL.java:76)
	at hudson.security.AccessControlled.checkPermission(AccessControlled.java:52)
	at jenkins.model.Jenkins.getTarget(Jenkins.java:5200)
	at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:725)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:900)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:698)
	at hudson.init.impl.InstallUncaughtExceptionHandler.handleException(InstallUncaughtExceptionHandler.java:59)
	at hudson.init.impl.InstallUncaughtExceptionHandler.lambda$init$0(InstallUncaughtExceptionHandler.java:33)
	at org.kohsuke.stapler.compression.CompressionFilter.reportException(CompressionFilter.java:72)
	at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:56)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
	at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:86)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
	at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
	at jenkins.security.SuspiciousRequestFilter.doFilter(SuspiciousRequestFilter.java:38)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:569)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1580)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1384)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1553)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1306)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
	at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:46)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
	at org.eclipse.jetty.server.Server.handle(Server.java:563)
	at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
	at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149)
	at java.base/java.lang.Thread.run(Thread.java:829)

We have pinned matrix-auth at 3.1.5. I guess it needs an update; note that 3.2.0 is a breaking change.

Release notes: https://plugins.jenkins.io/matrix-auth/releases/

Details

Title | Reference | Author | Source Branch | Dest Branch
Log hudson.pluginManager | repos/releng/jenkins-deploy!58 | hashar | pluginManager-logging | master
jenkins-rel: updated matrix-auth plugin to 3.2.2 and adapt config | repos/releng/jenkins-deploy!57 | jnuche | T361084 | master

Event Timeline

https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/55 added the stack trace to the logging, so we now get:

javaposse.jobdsl.dsl.DslScriptException: (script, line 233) No signature of method: permissions() is applicable for argument types: (java.util.ArrayList) values: [[USER:Job/Read:anonymous, USER:Job/Read:jenkinsrelapi, ...]]
Possible solutions: entries(), inheritanceStrategy()

That leads me to our conf/releasing/casc/jobs/docpub.groovy which has:

job('docpub') {
  properties {
    authorizationMatrix {
        inheritanceStrategy {
            nonInheriting()
        }
        permissions(['USER:Job/Read:anonymous', "USER:Job/Read:$API_USER", "USER:Job/Build:$API_USER"])
    }
  }
}

My guess is that we are calling permissions directly but should use entries instead. The exact syntax is still to be determined :/ I don't know where the change comes from :-\
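For illustration, here is a rough sketch of what the entries form might look like. It assumes the Job DSL syntax documented for matrix-auth 3.2+, where each entry is a user (or group) block carrying a name and a list of permissions. It is not necessarily what the eventual fix uses, and API_USER is assumed to be the same variable already available in docpub.groovy.

```groovy
// Sketch only: assumes the matrix-auth >= 3.2 Job DSL syntax (entries { user { ... } }).
// API_USER is assumed to be defined elsewhere in the script, as in the original.
job('docpub') {
    properties {
        authorizationMatrix {
            inheritanceStrategy {
                nonInheriting()
            }
            entries {
                // was: 'USER:Job/Read:anonymous'
                user {
                    name('anonymous')
                    permissions(['Job/Read'])
                }
                // was: "USER:Job/Read:$API_USER" and "USER:Job/Build:$API_USER"
                user {
                    name(API_USER)
                    permissions(['Job/Read', 'Job/Build'])
                }
            }
        }
    }
}
```

The main difference from the old form is that the USER: prefix and the trailing SID move into the user block, so permissions for the same account are grouped under one entry.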

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:12:24Z] <jnuche@deploy1002> Started deploy [releng/jenkins-deploy@1a343bf] (releasing): testing fix for T361084

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:12:44Z] <jnuche@deploy1002> Finished deploy [releng/jenkins-deploy@1a343bf] (releasing): testing fix for T361084 (duration: 00m 20s)

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:33:10Z] <jnuche@deploy1002> Started deploy [releng/jenkins-deploy@1a343bf] (releasing): deploying fix for T361084 to all targets

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:33:29Z] <jnuche@deploy1002> Finished deploy [releng/jenkins-deploy@1a343bf] (releasing): deploying fix for T361084 to all targets (duration: 00m 19s)

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:33:59Z] <jnuche@deploy1002> Started deploy [releng/jenkins-deploy@1a343bf] (releasing): deploying fix for T361084 to all targets

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:35:03Z] <jnuche@deploy1002> Finished deploy [releng/jenkins-deploy@1a343bf] (releasing): deploying fix for T361084 to all targets (duration: 01m 03s)

Notes from a debugging session on releases2003, which has the issue. First, in /var/lib/jenkins/plugins all plugins are from 13:17, except for the bundled ones:

-rw-rw-r--  1 jenkins jenkins    71201 Mar 27 13:20 jdk-tool.jpi
-rw-rw-r--  1 jenkins jenkins    41098 Mar 27 13:20 command-launcher.jpi
-rw-rw-r--  1 jenkins jenkins   632493 Mar 27 13:20 javax-mail-api.jpi
-rw-rw-r--  1 jenkins jenkins   134717 Mar 27 13:20 mailer.jpi
-rw-rw-r--  1 jenkins jenkins   177310 Mar 27 13:20 matrix-auth.jpi

From journalctl we can find which commands were run, since most of them use sudo (which is logged):

@timestamp": "2024-03-27T13:17:05Z", "message": "Executing check 'update_jenkins'"}
13:17:05 COMMAND=/usr/bin/systemctl daemon-reload
13:17:05 COMMAND=/usr/bin/apt-get install -y jenkins

2024-03-27T13:17:18Z", "message": "This is a secondary host. Stopping and disabling service 'jenkins'"
COMMAND=/usr/sbin/service jenkins stop
COMMAND=/bin/systemctl disable jenkins

After that @jnuche manually started Jenkins:

13:20:19 COMMAND=/usr/bin/systemctl start jenkins
// It bails out due to the wrong version

A fringe theory is that jenkins-plugin-manager has a --clean-download-directory option which erases the contents of the plugin directory. There could then be a race condition in which the Jenkins daemon starts with only the bundled plugins :-((( I'd need to be able to reproduce it locally.

jnuche claimed this task.

The fix from https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/57 has fixed the issue. Both releases Jenkins instances have now been updated to 2.440.2 and look healthy.

We still don't know what caused the matrix-auth plugin to upgrade to a >3.2 version and we can't reproduce the issue in other environments. I'm tentatively closing this.