For over 4 years now, the Jenkins jobs that run via a jenkins-slave agent on the Beta Cluster bastion (aka deployment-tin) have been failing on a recurring basis, sometimes several times a week, sometimes several times a month.
As far as I'm aware, the root cause remains unknown to this day. What we do know is that when this happens:
- A build for the Jenkins job is triggered by Zuul and queued in Jenkins.
- Jenkins keeps it queued with the message "Waiting for available executor: deployment-tin", even though the deployment-tin Jenkins slave is reported as online with 6 idle executor slots.
- Other Jenkins jobs continue to work fine (which suggests the issue is not in Gerrit, Zuul, or Gearman).
- There are no errors logged anywhere as far as we know.
- Manually running a script from Jenkins via the "Script Console" (in the Jenkins admin panel for managing the deployment-tin node) works fine, which suggests that Jenkins' ability to connect over SSH and run a command is not impaired.
- Once it starts, the problem persists and does not resolve itself. Whenever it has gone unaddressed, beta went without updates for 5 hours, 24 hours, or even several days.
- Usually, restarting the connection from Jenkins Admin to Gearman Daemon fixes it. We do not know why.
- Usually, the builds already in the queue need to be manually cancelled. We do not know why Jenkins only starts new builds and not the ones already queued, nor why new builds only start after the old ones have been cancelled.
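
Since nothing is logged when this happens, one way to at least notice it sooner would be to poll the Jenkins build queue and flag items that have been waiting for an executor for too long. The following is a minimal sketch of that idea, not something that exists today. It assumes the queue is read from the Jenkins JSON API at `queue/api/json` and relies on the `why`, `inQueueSince` (milliseconds since epoch), and `task.name` fields of each queue item. The function names, the base URL parameter, and the one-hour threshold are made up for illustration, and the default reason string is simply the message quoted above.

```python
import json
import urllib.request

# Message text as quoted in this task; adjust if Jenkins words it differently.
STUCK_REASON = "Waiting for available executor"


def fetch_queue(base_url):
    """Fetch the build queue as a dict. base_url is a hypothetical Jenkins root URL."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/queue/api/json") as resp:
        return json.load(resp)


def find_stuck_items(queue, now_ms, max_wait_s=3600, reason=STUCK_REASON):
    """Return (job name, seconds waited) for items waiting on an executor too long."""
    stuck = []
    for item in queue.get("items", []):
        why = item.get("why") or ""
        # inQueueSince is in milliseconds; treat a missing value as "just queued".
        waited_s = (now_ms - item.get("inQueueSince", now_ms)) / 1000
        if reason in why and waited_s > max_wait_s:
            name = item.get("task", {}).get("name", "?")
            stuck.append((name, waited_s))
    return stuck
```

A monitoring check could call `find_stuck_items(fetch_queue(url), int(time.time() * 1000))` periodically and alert when the result is non-empty. This would not explain the root cause, only shorten the time during which beta goes without updates.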
I mistakenly thought this problem was solved several years ago. Maybe it was. But it's back.
@demon suggests that we simply convert these to a crontab entry, provisioned via Puppet. This seems quite feasible given that most of the logic already lives in a shell script provisioned by Puppet, rather than in the Jenkins jobs themselves.
Jobs related to this:
- beta-mediawiki-config-update-eqiad
- beta-code-update-eqiad
- beta-update-databases-eqiad
- beta-scap-eqiad
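
If the jobs above were moved to cron as suggested, a cron entry would probably need some protection against overlapping runs, for example when one update takes longer than the cron interval. The sketch below shows one possible shape for such a wrapper, under stated assumptions: the function name `run_steps`, the lock file path, and the commands passed in are all hypothetical, and neither the real update script nor the correct order of the steps is taken from this task. It takes an exclusive, non-blocking `flock` on a lock file, runs the steps in sequence, and stops at the first failure.

```python
import fcntl
import subprocess


def run_steps(steps, lock_path):
    """Run (name, argv) steps in order while holding an exclusive lock.

    Returns ("locked", None) if another run holds the lock,
    ("failed", name) for the first step that exits non-zero,
    or ("ok", None) if every step succeeded.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking: give up immediately if a previous run is still going.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return ("locked", None)
        for name, argv in steps:
            if subprocess.run(argv).returncode != 0:
                return ("failed", name)
        return ("ok", None)
    # The lock is released when lock_file is closed on leaving the with block.
```

A cron-invoked script could call this with one entry per job name listed above and exit non-zero on `"failed"`, so that cron's usual mail or logging mechanism reports which step broke.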