
Nodepool deadlocks when querying unresponsive OpenStack API (was: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins)
Closed, Resolved · Public

Description

No rake-jessie jobs have run since 2016-01-01T19:09:40. The Jenkins web UI shows no ci-jessie-* slaves attached. Many instances are still listed on https://wikitech.wikimedia.org/wiki/Nova_Resource:Contintcloud.

This same problem happened on 2015-12-30 and was resolved after @Andrew restarted both nova-compute and nodepool.

Details

Related Gerrit Patches:
operations/puppet (production): nodepool: set Nova API timeout to 60 seconds


Event Timeline

bd808 created this task. Jan 1 2016, 10:41 PM
bd808 raised the priority of this task to Needs Triage.
bd808 updated the task description.
bd808 added subscribers: bd808, Andrew.
Restricted Application added a project: Cloud-Services. Jan 1 2016, 10:41 PM
Restricted Application added subscribers: StudiesWorld, Aklapper.

Restored ~2016-01-01T23:58

@ori and @Legoktm both attempted to restart the nodepool service on labnodepool1001.eqiad.wmnet. Further investigation by @ori found that the process was stuck waiting on a lock and did not respond even to SIGKILL. However, he was unable to get a useful stack trace with gdb.

Nodepool relies on the wmflabs OpenStack API. Whenever OpenStack is stuck, no more instances get booted or deleted, and the remaining instances attached to Jenkins are deleted as jobs consume them. A previous example was T120586.

For the Jan 1st issue I looked at the debug log at labnodepool1001.eqiad.wmnet:/var/log/nodepool/debug.log.2016-01-01_14

Every 10 seconds, Nodepool looks at the current pool situation and decides whether new instances should be booted (a sketch of this check follows the field table below). We have a base pool of 10 instances, and the debug log when idling looks like:

15:41:33 Deficit: ci-jessie-wikimedia: 0 (start: 0 min-ready: 10 ready: 10 capacity: 10)

Where:

Field                     Description
ci-jessie-wikimedia: 0    Jenkins label and the number of instances that should be added
start                     Instances that need to be started
min-ready                 Minimum number of instances that should be available (always 10)
ready                     Instances that are being built or are ready
capacity                  ????
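
For reference, here is a minimal sketch of how the deficit appears to be derived from these fields; the function and parameter names are illustrative, not Nodepool's actual API:

def node_deficit(min_ready, ready, building, capacity):
    # How many more nodes the label needs to reach min-ready...
    wanted = max(min_ready - ready - building, 0)
    # ...capped by what the provider can still supply.
    return min(wanted, capacity)

# Idle pool: 10 ready nodes, nothing to launch.
print(node_deficit(min_ready=10, ready=10, building=0, capacity=10))  # 0
# Three nodes consumed by jobs: Nodepool wants 3 replacements.
print(node_deficit(min_ready=10, ready=7, building=0, capacity=10))   # 3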

Three jobs were triggered on three instances, so the ready count dropped and Nodepool decided to spawn three instances at 15:41:44:

Deficit: ci-jessie-wikimedia: 3 (start: 0 min-ready: 10 ready: 7 capacity: 10)
Need to launch 3 ci-jessie-wikimedia nodes for gallium.wikimedia.org on wmflabs-eqiad
NodeLauncher: Launching node id: 17478
NodeLauncher: Creating server with hostname ci-jessie-wikimedia-17478 in wmflabs-eqiad from image ci-jessie-wikimedia for node id: 17478
ProviderManager: Manager wmflabs-eqiad running task <nodepool.provider_manager.CreateServerTask object at 0x7fa4184f2250> (queue: 0)
ProviderManager: Manager wmflabs-eqiad ran task <nodepool.provider_manager.CreateServerTask object at 0x7fa4184f2250> in 0.466858148575s
NodeLauncher: Waiting for server 10b67068-b034-488b-ba26-471d51813e30 for node id: 17478

At this point Nodepool is deadlocked. Creation of the other instances is queued and capacity is down to 7:

ProviderManager: Manager wmflabs-eqiad running task <nodepool.provider_manager.CreateServerTask object at 0x7fa4185e8a10> (queue: 2)
NodePool:   Deficit: ci-jessie-wikimedia: 0 (start: 0 min-ready: 10 ready: 10 capacity: 7)

The "Waiting for server" message is logged just before Nodepool asks the OpenStack API for the list of servers. Nodepool takes a lock, queries the API over HTTP, caches the result, and releases the lock. The HTTP timeout appears to be None, i.e. infinite, so a stalled API call holds the lock forever.
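
The problematic pattern roughly looks like the sketch below (illustrative only, not Nodepool's actual code; the cache lock, session object, and endpoint path are assumptions). Because the GET carries timeout=None, a hung Nova API blocks here while the lock is still held, and every subsequent task queues up behind it:

import threading
import requests

_cache_lock = threading.Lock()
_server_cache = None

def list_servers(session, nova_url):
    """Refresh the cached server list under a lock (hypothetical helper)."""
    global _server_cache
    with _cache_lock:  # lock is taken before the API call
        # timeout=None means "wait forever"; a stalled API never returns,
        # so the lock is never released.
        resp = session.get(nova_url + "/servers/detail", timeout=None)
        resp.raise_for_status()
        _server_cache = resp.json()["servers"]
    return _server_cache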

Meanwhile, as the remaining slaves are consumed, we end up in a situation where nothing is left and the jobs are held in the queue by Zuul.

Change 262343 had a related patch set uploaded (by Hashar):
nodepool: set Nova API timeout to 60 seconds

https://gerrit.wikimedia.org/r/262343

https://gerrit.wikimedia.org/r/#/c/262343/ sets a 60-second timeout when querying the Nova API. A hung call should then raise an exception and cancel the task, thus unlocking Nodepool.
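
With a finite timeout the same call raises instead of hanging, which lets the task fail and the lock be released. A minimal sketch of the intended effect (the 60-second value mirrors the patch; everything else is illustrative):

import requests

def list_servers_with_timeout(session, nova_url, timeout=60):
    """Query the server list, failing fast if the API does not answer."""
    try:
        resp = session.get(nova_url + "/servers/detail", timeout=timeout)
        resp.raise_for_status()
        return resp.json()["servers"]
    except requests.exceptions.Timeout:
        # Surface the failure so the task is cancelled and retried later
        # instead of deadlocking behind a held lock.
        raise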

hashar renamed this task from rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins to Nodepool deadlocks when querying unresponsive OpenStack API (was: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins). Jan 4 2016, 1:45 PM
hashar triaged this task as High priority.
hashar set Security to None.

Change 262343 merged by Andrew Bogott:
nodepool: set Nova API timeout to 60 seconds

https://gerrit.wikimedia.org/r/262343

hashar closed this task as Resolved. Jan 4 2016, 7:56 PM

By sending the signal SIGUSR2 to nodepoold, it emits a stack trace for each of its threads at DEBUG level on the nodepool.stack_dump channel, which is logged to /var/log/nodepool/debug.log. Edited example (a sketch of such a handler follows the trace):

2016-01-04 19:51:16,056 DEBUG nodepool.stack_dump:
Thread: wmflabs-eqiad (140297227978496)
Thread: DiskImageBuilder queue (140297245812480)
Thread: MainThread (140297871116032)
  File "/usr/lib/python2.7/dist-packages/nodepool/cmd/nodepoold.py", line 52, in stack_dump_handler
    log_str += "".join(traceback.format_stack(stack_frame))
Thread: gallium.wikimedia.org (140297641219840)
Thread: NodePool (140297732646656)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 1814, in run
    time.sleep(self.watermark_sleep)
Thread: APScheduler (140297254205184)
  File "/usr/lib/python2.7/threading.py", line 359, in wait
    _sleep(delay)
Thread: Gearman client connect (140297598207744)
Thread: Gearman client poll (140297606600448)
  File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 788, in _pollLoop
    ret = poll.poll()
Thread: NodeUpdateListener (140297614993152)
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 172, in run
    m = self.socket.recv().decode('utf-8')
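
A minimal sketch of how such a SIGUSR2 handler can be wired up (simplified; in the spirit of nodepoold's stack_dump_handler but not the exact upstream code):

import logging
import signal
import sys
import threading
import traceback

log = logging.getLogger("nodepool.stack_dump")

def stack_dump_handler(signum, frame):
    # Map thread ids to names so the dump is readable.
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, stack_frame in sys._current_frames().items():
        lines.append("Thread: %s (%s)" % (names.get(ident, "unknown"), ident))
        lines.append("".join(traceback.format_stack(stack_frame)))
    log.debug("\n".join(lines))

signal.signal(signal.SIGUSR2, stack_dump_handler)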

I have added this to the wiki documentation:
https://wikitech.wikimedia.org/w/index.php?title=Nodepool&diff=247277&oldid=197636

The Nodepool configuration now sets a timeout for OpenStack API calls, so this very specific bug is fixed.