Page MenuHomePhabricator

Switch CI config job to Nodepool instances
Closed, ResolvedPublic

Description

Now that Nodepool is reasonably working, it is time to push the red button. We should migrate a single job to use Nodepool instances, the first target is integration/config.git

@hashar is electing the job integration-jjb-config-diff. It is currently tied to Jenkins nodes matching the label selector contintLabsSlave && UbuntuPrecise.

Should be tied to label ci-jessie-wikimedia and have NODE_OFFLINE_WHEN_COMPLETE=1 injected by Zuul by applying the parameter function offline_when_complete().

We also need the job to emit build status over ZeroMQ for Nodepool to notice an instance is being used and have it delete it as soon as the job is complete.

Event Timeline

hashar claimed this task.
hashar raised the priority of this task from to Medium.
hashar updated the task description. (Show Details)
hashar added subscribers: hashar, zeljkofilipin, dduvall.
hashar set Security to None.

Change 238752 had a related patch set uploaded (by Hashar):
Switch integration-jjb-config-diff to Nodepool

https://gerrit.wikimedia.org/r/238752

Change 238752 merged by jenkins-bot:
Switch integration-jjb-config-diff to Nodepool

https://gerrit.wikimedia.org/r/238752

Change 238762 had a related patch set uploaded (by Hashar):
Test integration-jjb-config-diff on Nodepool

https://gerrit.wikimedia.org/r/238762

Change 238762 abandoned by Hashar:
Test integration-jjb-config-diff on Nodepool

Reason:
Surprisingly, it works!

https://gerrit.wikimedia.org/r/238762

Interesting parts from /var/log/nodepool/debug.log

ZeroMQ notification sent by Jenkins to Nodepool:

Start of job

2015-09-16 14:58:56,911 DEBUG nodepool.NodeUpdateListener: Received: onStarted:
{
    "build": {
        "full_url": "https://integration.wikimedia.org/ci/job/integration-jjb-config-diff/3717/",
        "host_name": "gallium.wikimedia.org",
        "node_name": "ci-jessie-wikimedia-46",   # The node that is running the job
        "number": 3717,
        "parameters": {                          # Parameters sent to the job
            "BASE_LOG_PATH": "62/238762/1",
            "LOG_PATH": "62/238762/1/test/integration-jjb-config-diff/5874ca1",
            "OFFLINE_NODE_WHEN_COMPLETE": "1",   # Injected by Zuul scheduler
            "ZUUL_BRANCH": "master",
            "ZUUL_CHANGE": "238762",
            "ZUUL_CHANGES": "integration/config:master:refs/changes/62/238762/1",
            "ZUUL_CHANGE_IDS": "238762,1",
            "ZUUL_COMMIT": "c49d0bbd8f39aa0f1b55da4424ba027e9e82db54",
            "ZUUL_PATCHSET": "1",
            "ZUUL_PIPELINE": "test",
            "ZUUL_PROJECT": "integration/config",
            "ZUUL_REF": "refs/zuul/master/Zec0d4169bcb64ea5955137cf36169862",
            "ZUUL_URL": "git://zuul.eqiad.wmnet",
            "ZUUL_UUID": "5874ca1b7fdc400794eead964230c5e0"
        },
        "phase": "STARTED",                      # Job is started
        "url": "job/integration-jjb-config-diff/3717/"
    },
    "name": "integration-jjb-config-diff",
    "url": "job/integration-jjb-config-diff/"
}

Nodepool internally marks instance ci-jessie-wikimedia-46 (id 46) as being used:

14:58:56,915 INFO nodepool.NodeUpdateListener: Setting node id: 46 to USED

Every 10 seconds Nodepool examine the pool of instances to determine whether instances need to be spawned to refill the pool.

14:59:01,865 DEBUG nodepool.NodePool: Beginning node launch calculation
14:59:02,806 DEBUG nodepool.NodePool:   Deficit: ci-jessie-wikimedia: 1 (start: 0 min-ready: 1 ready: 0 capacity: 4)

ci-jessie-wikimedia label has 0 ready instances and needs (min-ready) at least 1.

Hence Nodepool decides to allocate 1 instance for the ci-jessie-wikimedia label on wmflabs-eqiad cloud provider:

14:59:02,810 DEBUG nodepool.NodePool:   Allocation requests:
14:59:02,810 DEBUG nodepool.NodePool:     <AllocationRequest for 1.0 of ci-jessie-wikimedia>
14:59:02,810 DEBUG nodepool.NodePool:       <AllocationSubRequest for 1.0 (out of 1.0) of ci-jessie-wikimedia from wmflabs-eqiad>

And it is to be added on the Jenkins master on gallium.wikimedia.org:

14:59:02,810 DEBUG nodepool.NodePool:   Grants:
14:59:02,811 DEBUG nodepool.NodePool:     <AllocationGrant of 1 of ci-jessie-wikimedia from wmflabs-eqiad>
14:59:02,811 DEBUG nodepool.NodePool:       <AllocationGrantTarget for 1 of ci-jessie-wikimedia to gallium.wikimedia.org>
14:59:02,811 DEBUG nodepool.NodePool: Finished node launch calculation 
14:59:02,811 INFO nodepool.NodePool: Need to launch 1 ci-jessie-wikimedia nodes for gallium.wikimedia.org on wmflabs-eqiad

Instance with id 47 creation log shows it took 31 seconds for labs to boot the instance and 30 seconds for it to be ssh reachable:

14:59:02,818 DEBUG nodepool.NodeLauncher: Launching node id: 47
14:59:02,823 INFO nodepool.NodeLauncher: Creating server with hostname ci-jessie-wikimedia-47.contintcloud.eqiad.wmflabs \
                                         in wmflabs-eqiad from image ci-jessie-wikimedia for node id: 47
14:59:33,801 DEBUG nodepool.ProviderManager: Status of server in wmflabs-eqiad 541abb35-062e-4637-b46e-2fd08f4961b7: ACTIVE
14:59:33,801 DEBUG nodepool.NodeLauncher: Node id: 47 testing ssh at ip: 10.68.19.205
15:00:04,271 INFO nodepool.NodeLauncher: Node id: 47 is ready

The instance is then added to Jenkins and the pool is replenished for the label (ready: 1):

15:00:04,272 DEBUG nodepool.NodeLauncher: Adding node id: 47 to jenkins
15:00:05,653 INFO nodepool.NodeLauncher: Node id: 47 added to jenkins
...
Deficit: ci-jessie-wikimedia: 0 (start: 0 min-ready: 1 ready: 1 capacity: 3)

Job completed

Jenkins send over ZeroMQ the events onCompleted (ignored by Nodepool) and onFinalized. Nodepool deletes the instance from its internal state then from Jenkins and finally the instance:

15:00:25,157 DEBUG nodepool.NodeUpdateListener: Received: onCompleted  ...
15:01:25,190 DEBUG nodepool.NodePool: Deleting node id: 46 which has been in used state for 0.0414417525132 hours
15:01:27,165 INFO nodepool.NodePool: Deleted jenkins node id: 46
15:01:27,168 DEBUG nodepool.NodePool: Deleting server dbf066a2-05af-4aa2-82b9-bd9a845e7257 for node id: 46

Note it happened 1 minute after the onCompleted event and the deletion node mention the duration of the used state. So that seems to be the cleanup recurring task dealing with it. That is because there is a hardcoded 1 minute delay in nodepool:

DELETE_DELAY = 1 * MINS      # Delay before deleting a node that has completed
                             # its job.

Has been introduced by https://review.openstack.org/#/c/42531/ Delay 1 min before deleting node. This should allow the background Jenkins console scp to complete..

The task is essentially complete. Have to fill a few tasks following the log investigation above. Namely:

  • look at speeding up the time to get instance ready
  • investigate the DELETE_DELAY, might be because Jenkins publishers are run after onFinalized (which doesn't make any sense).

The task is essentially complete. Have to fill a few tasks following the log investigation above. Namely:

  • look at speeding up the time to get instance ready
  • investigate the DELETE_DELAY, might be because Jenkins publishers are run after onFinalized (which doesn't make any sense).

I have filled:
T113342: Speed up the time to get a Nodepool instances to achieve READY state
T113359: Nodepool delay instance deletions by one minute

Nothing left for this task since one job is consuming instances.

greg renamed this task from Switch a job to Nodepool instances to Switch CI config job to Nodepool instances.Sep 22 2015, 4:49 PM