Use cron instead of Jenkins for beta deployments
Closed, Declined (Public)

Description

For over 4 years now, the Jenkins jobs that run via a jenkins-slave agent on the Beta Cluster bastion (aka deployment-tin) have been failing several times a week/month.

As far as I'm aware, the root cause remains unknown to this day. What we do know is that when this happens:

  • A build for the Jenkins job is triggered by Zuul and queued in Jenkins.
  • Jenkins keeps it queued with the message "Waiting for available executor: deployment-tin", even though the deployment-tin Jenkins slave is reported as online with 6 idle executor slots waiting to be used.
  • Other Jenkins jobs continue to work fine (which suggests the issue is not in Gerrit, Zuul, or Gearman).
  • There are no errors logged anywhere as far as we know.
  • Manually running a script from Jenkins via the "Script Console" (in the Jenkins admin panel for managing the deployment-tin node) works fine, which suggests that Jenkins' ability to connect over SSH and run a command is also not impaired.
  • The problem is not intermittent and does not resolve itself. Whenever it has gone unaddressed, beta went without updates for 5 hours, 24 hours, or even several days.
  • Usually, restarting the connection from Jenkins Admin to Gearman Daemon fixes it. We do not know why.
  • Usually, the builds already in the queue need to be manually cancelled. We do not know why Jenkins only starts the new builds and not the old builds, and we do not know why the new builds only start after cancelling the old ones.

I mistakenly thought this problem was solved several years ago. Maybe it was. But it's back.

@demon suggests that we simply convert these to a crontab entry, provisioned via Puppet. This seems quite feasible given that the majority of the logic is already not in the Jenkins jobs, but actually in a shell script that is already provisioned by Puppet.

Jobs related to this:

  • beta-mediawiki-config-update-eqiad
  • beta-code-update-eqiad
  • beta-update-databases-eqiad
  • beta-scap-eqiad
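
As a rough illustration of what a cron-driven replacement for these jobs could look like, here is a minimal sketch of a wrapper that cron would invoke. It assumes the four jobs are meant to run in the order listed, which this task does not state, and the commands in STEPS are placeholders, since the real Puppet-provisioned shell scripts are not named here. The function name run_steps is likewise hypothetical.

```python
import subprocess
import sys

# Placeholder commands only: the actual update scripts are not named in this
# task, so each step just echoes its own name.
STEPS = [
    ("beta-mediawiki-config-update-eqiad", ["echo", "config update"]),
    ("beta-code-update-eqiad", ["echo", "code update"]),
    ("beta-update-databases-eqiad", ["echo", "database update"]),
    ("beta-scap-eqiad", ["echo", "scap"]),
]


def run_steps(steps):
    """Run each (name, argv) step in order and stop at the first failure.

    Returns a list of (name, returncode) pairs for the steps that actually ran.
    """
    results = []
    for name, argv in steps:
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((name, proc.returncode))
        if proc.returncode != 0:
            # cron typically mails whatever the job prints, so report the
            # failure on stderr together with the step's own error output.
            print(f"{name} failed with exit code {proc.returncode}", file=sys.stderr)
            print(proc.stderr, file=sys.stderr)
            break
    return results
```

A crontab entry would then call a script that runs `run_steps(STEPS)` and exits non-zero if the last result is a failure. Stopping at the first failed step is a design choice made for this sketch, not something the existing jobs are known to do.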

Event Timeline

I wonder how this relates to T73305, which is about migrating the puppet repo away from cron to Jenkins. Although these repositories are independent, it seems to me that the upsides and downsides on that other task still apply here. As I understand it, the gist of these two tasks is that:

  • crons work well, unless they break, in which case they suck at alerting and making it easy to find an offending commit
  • Jenkins is good at alerting and helps to find the offending commit, unless it breaks for some reason we can't figure out (as described in this task description)
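
To make the alerting gap in the cron approach concrete, here is a minimal sketch of one way a cron job could leave a trail for a separate check to read. It is an illustration only: the status file, its JSON fields, and the function names record_success and is_stale are all hypothetical and not part of any existing setup described in these tasks.

```python
import json
import time
from pathlib import Path


def record_success(status_file, commit, now=None):
    """Write the deployed commit and finish time after a successful update."""
    now = time.time() if now is None else now
    Path(status_file).write_text(
        json.dumps({"commit": commit, "finished_at": now})
    )


def is_stale(status_file, max_age_seconds, now=None):
    """Return True if no successful update was recorded within max_age_seconds.

    A missing or unreadable status file also counts as stale.
    """
    now = time.time() if now is None else now
    path = Path(status_file)
    if not path.exists():
        return True
    try:
        status = json.loads(path.read_text())
        finished_at = float(status["finished_at"])
    except (ValueError, KeyError, TypeError):
        return True
    return now - finished_at > max_age_seconds
```

A monitoring check could call `is_stale` with a threshold of a few hours and raise an alert when it returns True, and the recorded commit gives a starting point for finding the offending change. This does not replace what Jenkins offers; it only sketches how part of that gap might be narrowed.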

If I understood that correctly, we should weigh these against each other rather than migrating MediaWiki from Jenkins to cron and Puppet from cron to Jenkins. I assume the trade-off we'll have to make, and thus the decision about which method is better suited, will be the same for both repos.

I think we should decline T73305 then?

+1 :)

hashar subscribed.

Declining the cron job idea in favor of dedicated infrastructure (T256168).