
Non concurrent mwcore-codehealth-master-non-voting cause Gearman executors to be locked
Closed, Resolved, Public

Description

Split from T72597#5518854

A related issue is the mwext-codehealth jobs, which are configured not to run concurrently. Sometimes several of them are pending in the Jenkins build queue while some of the Jenkins agents sit idle although they should be running jobs.

The builds are queued by Jenkins, but the Gearman plugin already assigned a node for those builds. The node assignment can be seen in /var/lib/jenkins/queue.xml:

<hudson.model.Queue_-State>
  <items>
    <hudson.model.Queue_-BlockedItem>
      <actions>
        <hudson.plugins.gearman.NodeAssignmentAction plugin="gearman-plugin@0.2.0.3.e27817f">
          <labelAtom>integration-agent-docker-1009</labelAtom>
        </hudson.plugins.gearman.NodeAssignmentAction>
...
    <hudson.model.Queue_-BlockedItem>
      <actions>
        <hudson.plugins.gearman.NodeAssignmentAction plugin="gearman-plugin@0.2.0.3.e27817f">
          <labelAtom>integration-agent-docker-1005</labelAtom>
        </hudson.plugins.gearman.NodeAssignmentAction>

I don't have the details, but Gearman is thus unable to use any of the executors on those two nodes until the build queued by Jenkins starts executing.
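To spot which nodes are pinned by blocked builds, the node assignments in the queue file can be tallied with a short script. This is a minimal sketch, not an existing tool: the function name blocked_node_assignments and the SAMPLE document are hypothetical, and SAMPLE is a completed, well-formed version of the excerpt above (closing tags added, elided parts dropped). A real queue.xml contains many more elements and may need extra handling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the excerpt from the task, closed so that it is well-formed.
SAMPLE = """<hudson.model.Queue_-State>
  <items>
    <hudson.model.Queue_-BlockedItem>
      <actions>
        <hudson.plugins.gearman.NodeAssignmentAction plugin="gearman-plugin@0.2.0.3.e27817f">
          <labelAtom>integration-agent-docker-1009</labelAtom>
        </hudson.plugins.gearman.NodeAssignmentAction>
      </actions>
    </hudson.model.Queue_-BlockedItem>
    <hudson.model.Queue_-BlockedItem>
      <actions>
        <hudson.plugins.gearman.NodeAssignmentAction plugin="gearman-plugin@0.2.0.3.e27817f">
          <labelAtom>integration-agent-docker-1005</labelAtom>
        </hudson.plugins.gearman.NodeAssignmentAction>
      </actions>
    </hudson.model.Queue_-BlockedItem>
  </items>
</hudson.model.Queue_-State>"""


def blocked_node_assignments(xml_text):
    """Count blocked queue items per node assigned by the Gearman plugin."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # Only look at items Jenkins reports as blocked.
    for item in root.iter("hudson.model.Queue_-BlockedItem"):
        for action in item.iter("hudson.plugins.gearman.NodeAssignmentAction"):
            label = action.findtext("labelAtom")
            if label:
                counts[label.strip()] += 1
    return counts


if __name__ == "__main__":
    for node, count in sorted(blocked_node_assignments(SAMPLE).items()):
        print(node, count)
```

Any node listed here has a blocked build pinned to it, which matches the nodes whose executors Gearman cannot use.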

It might be related to the lock we occasionally have for deployment-prep.

Event Timeline

The non-concurrent jobs such as mwcore-codehealth-master-non-voting cause all Gearman workers for a given node to be reserved even though Jenkins is holding the build. The bug lies somewhere between Gearman and the Jenkins queue processor.

We might be able to work around this by pointing the job to a different label, but most probably we would want to use dedicated nodes.

Change 539944 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Move mwext-codehealth jobs to a dedicated label

https://gerrit.wikimedia.org/r/539944

INFO:jenkins_jobs.builder:Reconfiguring jenkins job mwcore-codehealth-master-non-voting
INFO:jenkins_jobs.builder:Reconfiguring jenkins job mwcore-codehealth-patch
INFO:jenkins_jobs.builder:Reconfiguring jenkins job mwext-codehealth-master-non-voting
INFO:jenkins_jobs.builder:Reconfiguring jenkins job mwext-codehealth-patch

I have made one of the agents carry only the codehealth label: https://integration.wikimedia.org/ci/computer/integration-agent-docker-1014/ I will look at having an instance with a dedicated hostname.

Change 539944 merged by jenkins-bot:
[integration/config@master] Move mwext-codehealth jobs to a dedicated label

https://gerrit.wikimedia.org/r/539944

Tried temporarily, but either the result was inconclusive or I did not manage to track down what was happening.

hashar renamed this task from Move mwcore-codehealth-master-non-voting to a dedicated Jenkins label / agent to Non concurrent mwcore-codehealth-master-non-voting cause Gearman executors to be locked. Oct 4 2019, 7:43 AM
hashar triaged this task as Medium priority.

For the Jenkins jobs triggered by Zuul, I guess we can instead handle the non-concurrency at the Zuul level using a mutex, and make the job in Jenkins concurrent again. This way Zuul will no longer trigger X Gearman functions, each of which ends up enqueuing a build in Jenkins, which in turn locks executors.
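The idea of holding builds back at the scheduler can be illustrated with a toy model. This is only a sketch under assumptions: MutexScheduler, its methods and the build identifiers are hypothetical and are not Zuul's code or configuration. It shows that when a named mutex is held, later builds wait in the scheduler and are never submitted, so nothing reaches the Jenkins queue to reserve executors.

```python
class MutexScheduler:
    """Toy scheduler: at most one submitted build per named mutex."""

    def __init__(self):
        self.holder = {}     # mutex name -> build currently holding it
        self.mutex_of = {}   # build -> mutex name it holds
        self.waiting = []    # FIFO list of (build, mutex name) held back
        self.submitted = []  # builds handed over for execution, in order

    def enqueue(self, build, mutex):
        """Submit the build if the mutex is free, otherwise hold it back."""
        if mutex in self.holder:
            self.waiting.append((build, mutex))
            return False
        self.holder[mutex] = build
        self.mutex_of[build] = mutex
        self.submitted.append(build)
        return True

    def complete(self, build):
        """Release the mutex and submit the oldest build waiting on it."""
        mutex = self.mutex_of.pop(build)
        del self.holder[mutex]
        for index, (waiting_build, waiting_mutex) in enumerate(self.waiting):
            if waiting_mutex == mutex:
                del self.waiting[index]
                self.enqueue(waiting_build, waiting_mutex)
                break


if __name__ == "__main__":
    scheduler = MutexScheduler()
    scheduler.enqueue("codehealth-build-1", "codehealth")
    scheduler.enqueue("codehealth-build-2", "codehealth")
    print(scheduler.submitted)  # only the first build has been submitted
    scheduler.complete("codehealth-build-1")
    print(scheduler.submitted)  # the second build follows once the mutex is free
```

With this arrangement the executing side only ever sees one build per mutex, so the job there can be allowed to run concurrently without two builds overlapping.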

Change 540805 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Handle codehealth non concurrency at Zuul level

https://gerrit.wikimedia.org/r/540805

Change 540805 merged by jenkins-bot:
[integration/config@master] Handle codehealth non concurrency at Zuul level

https://gerrit.wikimedia.org/r/540805

hashar claimed this task.

Should be fixed

Change 546987 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Handle fresnel non concurrency in Zuul

https://gerrit.wikimedia.org/r/546987

Change 546987 merged by jenkins-bot:
[integration/config@master] Handle fresnel non concurrency in Zuul

https://gerrit.wikimedia.org/r/546987

Mentioned in SAL (#wikimedia-releng) [2022-01-26T19:55:11Z] <hashar> deleting integration-agent-docker-1014 which only has the codehealth label. A short live experiment no more used since October 2nd 2019 - https://gerrit.wikimedia.org/r/c/integration/config/+/540362 - T234259