
Bring back jobs to Nodepool
Closed, Resolved · Public

Description

Due to the CI incident on August 10th, we had to move most of the Nodepool jobs back to permanent slaves as an emergency measure.

There was apparently some KVM slowness at some point and, most importantly, the quota was incorrect. That caused Nodepool to spam OpenStack every second.

The quota has been fixed and should now be kept in sync properly (T143016).

The rate has been raised from 1 second to 10 seconds to stop hammering OpenStack. It should be lowered again, since the higher rate really slows down the whole Nodepool processing: each Nodepool interaction with OpenStack is a task added to a queue, and one and only one task is processed every rate seconds. Examples of such tasks:

  • ListServersTask (cached for 5 seconds)
  • ListFloatingIPsTask (cached for 5 seconds)
  • DeleteServerTask
  • CreateServerTask
  • ...

With a rate of 10 seconds, Nodepool can spawn at most 6 instances per minute; with the other tasks enqueued, the realistic figure is lower than that (see the sketch below). The rate change was done with 7bcff1d06a00ac0311ec0eb1b625b0fb08bfb315 / T113359.
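A minimal sketch (in Python, and not Nodepool's actual implementation) of how a single rate-limited queue caps the throughput; the class and the no-op tasks are made up for the illustration:

import queue
import time

class RateLimitedTaskQueue:
    """Toy model: every OpenStack interaction is a task, and one and only
    one task is executed every `rate` seconds."""

    def __init__(self, rate):
        self.rate = rate            # seconds between task executions
        self.tasks = queue.Queue()

    def submit(self, name, fn):
        self.tasks.put((name, fn))

    def run(self):
        while not self.tasks.empty():
            name, fn = self.tasks.get()
            fn()                    # e.g. the CreateServerTask API call
            time.sleep(self.rate)   # enforce the rate limit

# With rate=10, at most 60 / 10 = 6 tasks run per minute. CreateServerTask
# shares the queue with the List*/Delete* tasks, so fewer than 6 instances
# can actually be spawned per minute.
q = RateLimitedTaskQueue(rate=10)
q.submit('ListServersTask', lambda: None)
q.submit('CreateServerTask', lambda: None)
q.run()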

Revert patches:

Status | Gerrit change | Summary
Done | https://gerrit.wikimedia.org/r/313061 | Bring back npm-node-4 to Nodepool
Done | https://gerrit.wikimedia.org/r/#/c/306723/ | Revert "Move rake jobs off of nodepool"
Done | https://gerrit.wikimedia.org/r/#/c/306724/ | Revert "rake: Fix bundle install path"
Done | https://gerrit.wikimedia.org/r/#/c/306725/ | Revert "Move tox-jessie & co. off of nodepool"
Done | https://gerrit.wikimedia.org/r/#/c/306726/ | Revert "Move mediawiki-core-phpcs off of nodepool"
Done | https://gerrit.wikimedia.org/r/#/c/306727/ | Revert "Temporarily move composer-hhvm/php5 jobs off of nodepool"

Event Timeline

:) (note this is not a spam comment just that I like this)

Looking at debug messages over four days:

# Count which OpenStack task types Nodepool ran over the four days of logs
$ grep 'wmflabs.*running task' /var/log/nodepool/debug.log*|cut -d\  -f9|cut -d\. -f3|sort|uniq -c|sort -rn
   6266 ListServersTask
   3578 DeleteServerTask
   3578 CreateServerTask
   2267 ListFloatingIPsTask
     13 ListFlavorsTask
     13 ListExtensionsTask
      8 AddKeypairTask
      5 ListKeypairsTask
      5 GetServerTask
      5 DeleteKeypairTask
      2 FindImageTask

I looked at the patches this Friday morning, but since I am out this evening and over the weekend, there was no chance for me to babysit the revert.

After discussion with @chasemp, we want to revert one by one, making sure Chase is around to tweak the quota/rate and monitor the labs infra as needed. His Monday is quite busy, so we will do a first change on Tuesday and then, I guess, one per day.

The aim is to be done by the end of next week?

@chasemp suggested listing how many builds a given set of jobs does per timeframe.

Looking at Zuul metrics, we get the number of jobs triggered for each pipeline. I crafted a board that lets us select multiple jobs and graph the aggregate:

https://grafana-admin.wikimedia.org/dashboard/db/zuul-job
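For ad-hoc checks outside Grafana, the same counters can be pulled from Graphite's render API. A small sketch, assuming Zuul publishes its statsd counters under zuul.pipeline.<pipeline>.job.<job> (both the target expression and the metric paths are assumptions, not verified against the live setup):

import requests

GRAPHITE = 'https://graphite.wikimedia.org/render'
TARGET = 'sumSeries(zuul.pipeline.*.job.{composer-hhvm,composer-php55,composer-php70}.*)'

# Fetch a day's worth of datapoints for the aggregated job counters.
resp = requests.get(GRAPHITE, params={'target': TARGET, 'from': '-24h', 'format': 'json'})
resp.raise_for_status()

for series in resp.json():
    # datapoints are [value, timestamp] pairs; value is None for empty buckets
    total = sum(v for v, _ts in series['datapoints'] if v)
    print(series['target'], total)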

hashar triaged this task as High priority.
hashar moved this task from Untriaged to Next on the Continuous-Integration-Infrastructure board.
hashar moved this task from Next to In-progress on the Continuous-Integration-Infrastructure board.

Mentioned in SAL [2016-08-30T14:14:02Z] <hashar> Moved mediawiki-core-phpcs job back to Nodepool T143938

Change 307526 had a related patch set uploaded (by Rush):
nodepool: bump up ready states, max, and rate

https://gerrit.wikimedia.org/r/307526

Change 307526 merged by Rush:
nodepool: bump up ready states, max, and rate

https://gerrit.wikimedia.org/r/307526

Change 306725 had a related patch set uploaded (by Hashar):
Revert "Move tox-jessie & co. off of nodepool"

https://gerrit.wikimedia.org/r/306725

Change 308183 had a related patch set uploaded (by Hashar):
[labs/striker] port job to Nodepool Jessie instance

https://gerrit.wikimedia.org/r/308183

Change 308183 abandoned by Hashar:
[labs/striker] port job to Nodepool Jessie instance

Reason:
squashed in https://gerrit.wikimedia.org/r/#/c/306725/

https://gerrit.wikimedia.org/r/308183

Change 306725 merged by jenkins-bot:
Revert "Move tox-jessie & co. off of nodepool"

https://gerrit.wikimedia.org/r/306725

To bring the tox job back to permanent slaves:

  • Revert https://gerrit.wikimedia.org/r/#/c/306725/
  • Rephrase the one-line summary of the commit message, since "Revert "Revert ..."" tends to be confusing.
  • Code-Review +2
  • Wait for the merge
  • Run fab deploy_zuul from the root of integration/config (a sketch of such a Fabric task follows below)

I haven't deleted the Jenkins job, so the revert is essentially rolling back zuul/layout.yaml.
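For reference, a hypothetical Fabric 1.x sketch of what a deploy task along the lines of fab deploy_zuul can look like; the host, paths, and commands here are assumptions, not the actual task from integration/config:

from fabric.api import env, run, sudo, task

env.hosts = ['zuul-server.example.org']  # assumption: the host running the Zuul scheduler

@task
def deploy_zuul():
    # Pull the merged configuration (including zuul/layout.yaml) onto the
    # server, then reload Zuul so the scheduler picks up the new layout.
    run('cd /etc/zuul/wikimedia && git pull')  # assumed deployment path
    sudo('service zuul reload')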

The tox jobs are all happy.


The rake jobs have been made to trigger only when Ruby-related files are modified (T144325). That brings them down to only a few builds per day, so I will proceed tomorrow:

https://gerrit.wikimedia.org/r/#/c/306723/ Revert "Move rake jobs off of nodepool"
https://gerrit.wikimedia.org/r/#/c/306724/ Revert "rake: Fix bundle install path"

Then prepare https://gerrit.wikimedia.org/r/#/c/306722/ Revert "Move npm-node-4 off of nodepool", which has more jobs to add to Nodepool.

I have moved the rake and oojs-ui-rake jobs to Nodepool and validated them by hitting recheck on a couple of dummy changes.

Change 306722 had a related patch set uploaded (by Hashar):
Revert "Move npm-node-4 off of nodepool"

https://gerrit.wikimedia.org/r/306722

Change 306722 abandoned by Hashar:
Revert "Move npm-node-4 off of nodepool"

Reason:
That patch is nonsense. The commit it reverts only switched the 'npm-node-4' job, leaving all the others on Nodepool :D

https://gerrit.wikimedia.org/r/306722

Change 313061 had a related patch set uploaded (by Hashar):
Bring back npm-node-4 to Nodepool

https://gerrit.wikimedia.org/r/313061

Change 313061 merged by jenkins-bot:
Bring back npm-node-4 to Nodepool

https://gerrit.wikimedia.org/r/313061

I have moved the npm-node-4 job back. That is roughly a dozen builds per hour, hardly a dent.

From a discussion with Chase: what is left to do is to migrate back the PHP-based jobs:

https://gerrit.wikimedia.org/r/#/c/306727/
Revert "Temporarily move composer-hhvm/php5 jobs off of nodepool"

That patch would probably move too many builds at once, so it is better to split it up into manageable chunks. This week is not ideal to accomplish the move, so it is better done starting next Monday.

Change 306727 had a related patch set uploaded (by Hashar):
Revert "Temporarily move composer-hhvm/php5 jobs off of nodepool"

https://gerrit.wikimedia.org/r/306727

I have rebased the patch https://gerrit.wikimedia.org/r/#/c/306727/ (Revert "Temporarily move composer-hhvm/php5 jobs off of nodepool").

That replaces three jobs that run on permanent slaves: composer-hhvm, composer-php55, and composer-php70.

The expected build load shifted to Nodepool instances is roughly 350 builds per day, on top of the 800-1000 builds per day we are already doing on Nodepool. That brings us back to the situation we had from June to early August 2016.
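As a quick sanity check on those figures (all numbers come from the paragraph above):

# Rough load check: composer builds moving to Nodepool on top of current load
added = 350                            # composer-hhvm/php55/php70 builds per day
current_low, current_high = 800, 1000  # existing builds/day on Nodepool
print(current_low + added, current_high + added)  # 1150-1350 builds/day expected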



View over a month: https://grafana.wikimedia.org/dashboard/db/zuul-job?var-pipeline=All&var-job=%7Bcomposer-hhvm,composer-php55,composer-php70%7D&var-status=All&panelId=5&fullscreen

composer-jobs-30days.png (345×711 px, 36 KB)

There are currently 800-1000 builds per day on Nodepool instances: https://grafana-admin.wikimedia.org/dashboard/db/nodepool?panelId=23&fullscreen

builds-on-nodepool-30days.png (351×815 px, 32 KB)

Change 314278 had a related patch set uploaded (by Hashar):
Recreate jobs for composer-hhvm/php on Nodepool

https://gerrit.wikimedia.org/r/314278

Change 314278 merged by jenkins-bot:
Recreate jobs for composer-hhvm/php on Nodepool

https://gerrit.wikimedia.org/r/314278

Change 306727 merged by jenkins-bot:
Move composer-hhvm/php5 jobs back to Nodepool

https://gerrit.wikimedia.org/r/306727

Mentioned in SAL (#wikimedia-releng) [2016-10-13T20:12:26Z] <hashar> Switching composer-hhvm / composer-php55 to Nodepool https://gerrit.wikimedia.org/r/#/c/306727/ T143938

hashar updated the task description. (Show Details)

Finally, the rollback is complete. I will babysit/monitor it some more, and tomorrow morning I will delete some permanent slaves to free up resources on the wmflabs infra.