
In-progress Jenkins logs sometimes unavailable (HTTP ERROR 404 Not Found)
Closed, ResolvedPublic

Description

Recently, I’ve noticed that the console of in-progress builds, as linked in Zuul, is sometimes not available:

[Screenshot: Screen Shot 2023-07-07 at 16.18.50.png, 42 KB]

This typically resolves itself once the build finishes, but it’s still annoying to no longer be able to see in-progress build output.¹ @Reedy also noticed that during that time, and even afterwards, the affected build is not listed in the Jenkins project’s build history – for example, at the moment, mediawiki-quibble-vendor-mysql-php74-docker still doesn’t list #27265.

¹ (The discussion in T331621 feels tangentially relevant here.)

Event Timeline

I was able to recreate this pretty fast.

I looked at https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-selenium-docker/52953/console (it was a 404 while I was investigating; it works now, as I'm about to hit "submit" on this comment).

While it was a 404, the console log exists on the disk:

thcipriani@contint2001:/srv/jenkins/builds/quibble-vendor-mysql-php74-selenium-docker/52953$ ls -lhA
total 204K
-rw-rw-r-- 1 jenkins jenkins    6 Jul  7 15:39 changelog.xml
-rw-rw-r-- 1 jenkins jenkins 189K Jul  7 15:48 log
drwxrwsr-x 2 jenkins jenkins 4.0K Jul  7 15:39 timestamper

I see 404s in the apache log for that request, so it's hitting the right host.

And I see 404s in the Jenkins log:

X.X.X.X - - [07/Jul/2023:15:55:45 +0000] "GET /ci/job/quibble-vendor-mysql-php74-selenium-docker/52953/console HTTP/1.1" 404 0 "-" "[my-user-agent]"
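To see how many builds are affected, the 404s on console URLs can be pulled out of such a log. The following is a minimal sketch in Python, assuming every entry has the same shape as the single line quoted above; the regular expressions and the function name `console_404s` are illustrative, not part of any Jenkins tooling.

```python
import re

# Assumption: request and status appear as in the one log line quoted above.
ACCESS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) '
)
# Console URL of a numbered build, e.g. /ci/job/<job>/<build>/console
CONSOLE_RE = re.compile(r"^/ci/job/(?P<job>[^/]+)/(?P<build>\d+)/console$")


def console_404s(lines):
    """Yield (job, build_number) for each console request answered with 404."""
    for line in lines:
        request = ACCESS_RE.search(line)
        if request is None or request.group("status") != "404":
            continue
        console = CONSOLE_RE.match(request.group("path"))
        if console:
            yield console.group("job"), int(console.group("build"))


sample = (
    'X.X.X.X - - [07/Jul/2023:15:55:45 +0000] '
    '"GET /ci/job/quibble-vendor-mysql-php74-selenium-docker/52953/console HTTP/1.1" '
    '404 0 "-" "[my-user-agent]"'
)
print(list(console_404s([sample])))
```

Feeding the whole log through `console_404s` and counting the results per job would show whether the problem is limited to a few jobs or spread across all of them.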

So something is strange inside Jenkins, but I'm unsure whether we've updated anything recently.

I note that, now that I can see the job, it has added two entries to the build directory (archive and build.xml):

thcipriani@contint2001:~# ls -lh /srv/jenkins/builds/quibble-vendor-mysql-php74-selenium-docker/52953
total 268K
drwxrwsr-x 3 jenkins jenkins 4.0K Jul  7 15:55 archive
-rw-rw-r-- 1 jenkins jenkins 7.8K Jul  7 15:55 build.xml
-rw-rw-r-- 1 jenkins jenkins    6 Jul  7 15:39 changelog.xml
-rw-rw-r-- 1 jenkins jenkins 241K Jul  7 15:55 log
drwxrwsr-x 2 jenkins jenkins 4.0K Jul  7 15:55 timestamper

hashar subscribed.

From sudo journalctl -u jenkins|grep 52953 I got:

Jul 07 15:39:59 contint2001 jenkins[32446]: WARNING: [jenkins.model.lazy.LazyBuildMixIn newBuild] JENKINS-23152: /srv/jenkins/builds/quibble-vendor-mysql-php74-selenium-docker/52952 already existed; will not overwrite with quibble-vendor-mysql-php74-selenium-docker #52952 but will create a fresh build #52953
Jul 07 15:40:04 contint2001 jenkins[32446]: WARNING: [jenkins.model.lazy.LazyBuildMixIn newBuild] JENKINS-23152: /srv/jenkins/builds/quibble-vendor-mysql-php74-selenium-docker/52953 already existed; will not overwrite with quibble-vendor-mysql-php74-selenium-docker #52953 but will create a fresh build #52954
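To get an overview of which jobs hit this warning and how often, the journal output can be parsed. Below is a rough Python sketch, assuming all warnings follow the wording of the two lines above; the pattern and the helper name `parse_collisions` are illustrative only.

```python
import re
from collections import Counter

# Assumption: the message wording matches the two journal lines quoted above.
WARN_RE = re.compile(
    r"JENKINS-23152: (?P<path>\S+)/(?P<existing>\d+) already existed; "
    r"will not overwrite with (?P<job>\S+) #\d+ "
    r"but will create a fresh build #(?P<fresh>\d+)"
)


def parse_collisions(lines):
    """Return (job, existing_build, fresh_build) for each JENKINS-23152 warning."""
    collisions = []
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            collisions.append(
                (match.group("job"), int(match.group("existing")), int(match.group("fresh")))
            )
    return collisions


journal = [
    "Jul 07 15:39:59 contint2001 jenkins[32446]: WARNING: [jenkins.model.lazy.LazyBuildMixIn newBuild] "
    "JENKINS-23152: /srv/jenkins/builds/quibble-vendor-mysql-php74-selenium-docker/52952 already existed; "
    "will not overwrite with quibble-vendor-mysql-php74-selenium-docker #52952 "
    "but will create a fresh build #52953",
    "Jul 07 15:40:04 contint2001 jenkins[32446]: WARNING: [jenkins.model.lazy.LazyBuildMixIn newBuild] "
    "JENKINS-23152: /srv/jenkins/builds/quibble-vendor-mysql-php74-selenium-docker/52953 already existed; "
    "will not overwrite with quibble-vendor-mysql-php74-selenium-docker #52953 "
    "but will create a fresh build #52954",
]
collisions = parse_collisions(journal)
per_job = Counter(job for job, _, _ in collisions)
print(collisions)
print(per_job)
```

In practice the input would be the unfiltered output of journalctl -u jenkins rather than a hard-coded list.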

There was one occurrence on Jul 04 15:44:00, and there have been a lot more since Jul 05 15:41. The issue referred to by the log message is https://issues.jenkins.io/browse/JENKINS-23152. We don't have the GerritTrigger plugin, but the issue does offer some explanation:

Sometimes, the nextBuildNumber is set to a lower number and duplicate builds are created; two date-directories are created in builds/ but only the latest build has a build_number symlink.

Jenkins has been up and running since Wed 2023-06-14 15:46:55 UTC (3 weeks 2 days ago). On 2023-07-04 I had it reload its configuration from disk, after I manually created XML configuration files to add new agents as part of a fleet rebuild:

SAL for 2023-07-04
15:42 	<hashar> 	integration: Reloading Jenkins CI configuration from disk after manually adding a dozen of agents

Thus I guess the configuration reload caused Jenkins to lose track of the next build number, or triggered some race condition.

I recommend restarting it.
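The suspected state (a next build number that is not higher than builds already on disk) could be checked with something like the following Python sketch. It assumes the stock Jenkins layout in which each job's configuration directory holds a `nextBuildNumber` file, and that build directories are named by build number as in the listings above. The function name and the demo paths are hypothetical, and the check only sees the on-disk value, which may differ from what a running Jenkins holds in memory.

```python
import tempfile
from pathlib import Path


def stale_next_build_number(job_dir, builds_dir):
    """Return (is_stale, next_number, highest_existing_build).

    is_stale is True when the number Jenkins would hand out next is
    not higher than a build directory that already exists.
    """
    next_number = int((Path(job_dir) / "nextBuildNumber").read_text().strip())
    existing = [
        int(entry.name)
        for entry in Path(builds_dir).iterdir()
        if entry.is_dir() and entry.name.isdigit()
    ]
    highest = max(existing, default=0)
    return next_number <= highest, next_number, highest


def demo():
    """Build a throwaway directory tree that mimics the collision in the logs."""
    with tempfile.TemporaryDirectory() as tmp:
        job_dir = Path(tmp) / "jobs" / "example-job"      # hypothetical path
        builds_dir = Path(tmp) / "builds" / "example-job"  # hypothetical path
        job_dir.mkdir(parents=True)
        for number in (52951, 52952, 52953):
            (builds_dir / str(number)).mkdir(parents=True)
        (job_dir / "nextBuildNumber").write_text("52952\n")
        return stale_next_build_number(job_dir, builds_dir)


print(demo())
```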

Mentioned in SAL (#wikimedia-operations) [2023-07-07T16:20:17Z] <hashar> Restarting CI Jenkins due to a confusion in the next build number leading to intermittent 404 when browsing console links | T341348

hashar claimed this task.

Fixed by restarting Jenkins.

I guess that deserves a report on the upstream bug, and ideally a way to prevent us from reloading the configuration from disk, since that seems to be broken :-]