figure out if nodepool is overwhelming rabbitmq and/or nova
Closed, Resolved · Public

Description

Today we're seeing a lot of random timeouts on instance creation and deletion. We always see a few of these, but today it is really excessive.

Without much evidence, I associate bada00439cf533e9080b2a96f7a449720ef6f712 (which changed the nodepool spawn rate from 6 seconds to 5) with an increase in these timeouts.

Event Timeline

Change 364901 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Nodepool: Slow down VM spawns by a lot.

https://gerrit.wikimedia.org/r/364901

For a dramatic change, I'm going to merge a patch that doubles the spawn time from 5 seconds to 10 seconds. If that gets us some peace and quiet then we'll fine-tune... probably we can land back on 6 or 7. If there's no improvement at all then we can just revert :)

Change 364901 merged by Andrew Bogott:
[operations/puppet@production] Nodepool: Slow down VM spawns by a lot.

https://gerrit.wikimedia.org/r/364901

So far things seem quieter and more stable with the nodepool rate change. If we get another good day I'll move the rate back to 6 and we'll see what happens.

Since https://gerrit.wikimedia.org/r/364901 there haven't been any obvious timeouts in contintcloud, but we have leaked five 'fullstack' test instances -- those five instances are up, running, and just fine, but (presumably) came up too slowly for the testing framework to mark them as successful.

I'm going to drop the rate back down to '6', clear out the fullstack list, and see if the failure rate changes.
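For reference, a rough sketch of how the leaked fullstack instances can be listed and cleaned up with the openstack CLI (assuming an admin openrc is sourced, and that the leaked instances follow the usual fullstackd-* naming in the admin-monitoring project):

# Sketch: find and delete leaked fullstack test instances.
openstack server list --project admin-monitoring -f value -c ID -c Name |
  awk '/fullstackd-/ {print $1}' |
  while read -r id; do
      echo "Deleting leaked instance ${id}"
      openstack server delete "${id}"
  done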

Change 365608 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Nodepool: move 'rate' to 6 seconds.

https://gerrit.wikimedia.org/r/365608

Change 365608 merged by Andrew Bogott:
[operations/puppet@production] nodepool: move 'rate' to 6 seconds

https://gerrit.wikimedia.org/r/365608

For the record, changes to nodepool.yaml are picked up almost immediately by Nodepool; there is no need to restart it.
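As a quick sanity check, the active rate and its effect can be confirmed without touching the service (paths are the usual package defaults and may differ on our setup):

# Show the provider rate currently in the on-disk config; Nodepool re-reads
# this file on the fly, so no restart is needed after puppet updates it.
grep -B2 -A2 'rate:' /etc/nodepool/nodepool.yaml

# Watch the debug log to confirm the API calls are now spaced accordingly.
tail -f /var/log/nodepool/nodepool.log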


For RabbitMQ we have a diamond collector installed by the puppet class openstack::queue_server, apparently reporting from labcontrol1001.

Source code is modules/openstack/files/rabbitmq/rabbitmq.py

There is apparently only one metric being updated though: servers.labcontrol1001.rabbitmq.statistics_db_event_queue

10-day view:

labcontrol1001_rabbitmq.png (600×800 px, 41 KB)

There are a bunch of leftover metrics, but they are apparently empty. Maybe they can be collected again and we can build a Grafana board for all of them? (I am volunteering.)

There are too many metrics under servers.labcontrol1001.rabbitmq, so Graphite was only showing a few. In Grafana I found the rest and created a basic dashboard for RabbitMQ:

https://grafana.wikimedia.org/dashboard/db/labs-queue-server-rabbitmq

There are detailed per-queue metrics under servers.labcontrol1001.rabbitmq.queues, but they are meaningless to me.

The first graph (labcontrol queues) uses queues_totals, which looks like an aggregate of all the queues. That is probably good enough.

And on the labcontrol1001 server board, swap apparently filled up starting July 9th at 12:00.

Note, in the graph below: SwapCached (red) overlaps SwapFree (green).

labcontrol1001_swap_full.png (501×829 px, 34 KB)

Change 365998 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: add labvirt1014 to the scheduling pool

https://gerrit.wikimedia.org/r/365998

Change 365998 merged by Andrew Bogott:
[operations/puppet@production] nova: add labvirt1014 to the scheduling pool

https://gerrit.wikimedia.org/r/365998

Change 368102 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Nodepool: raise rate to 8 seconds

https://gerrit.wikimedia.org/r/368102

Change 368102 merged by Andrew Bogott:
[operations/puppet@production] Nodepool: raise rate to 8 seconds

https://gerrit.wikimedia.org/r/368102

@Andrew wrote:

This was at 6, which has historically been mostly safe, but I'm seeing terrible congestion right now so slowing things down again :(

Any pointer as to what is congesting things? With the rate that high, Nodepool is slow to send its deletion requests, which causes the pool to take a while to refill.

I did a small bit of poking today. It seems we are using swap on labcontrol1001, and I'm guessing that is either a common cause or a symptom, at least for the rabbitmq portion of this.

This means we should avoid swap if we can -- and swappiness is already set as low as it goes:

cat /proc/sys/vm/swappiness
0

From what I read, it seems that if we do end up paging to swap, some processes won't move back even after the contention has resolved. So potentially we are periodically overwhelmed on labcontrol1001 (salt, nova-scheduler, nova-conductor, keystone, rabbitmq, puppetmasters, etc.) and rabbitmq really doesn't like being pushed off onto swap.

A script that details what is using swap (cribbed from Stack Overflow):

#! /bin/bash
#
# swap.sh: Shows the swap usage of each process
# Author: Robert Love
swap_total=0
for i in /proc/[0-9]*; do
  pid=$(echo $i | sed -e 's/\/proc\///g')
  swap_pid=$(cat /proc/$pid/smaps |
    awk 'BEGIN{total=0}/^Swap:/{total+=$2}END{print total}')
  if [ "$swap_pid" -gt 0 ]; then
    name=$(cat /proc/$pid/status | grep ^Name: |
      awk '{print $2}')
    echo "${name} (${pid}) ${swap_pid} kB"
    let swap_total+=$swap_pid
  fi
done
echo
echo "Total: ${swap_total} kB"
root@labcontrol1001:/home/rush# cat /proc/swaps
Filename				Type		Size	Used	Priority
/dev/md1                                partition	976316	225852	-1
root@labcontrol1001:/home/rush# bash swap_stat.sh
inet_gethost (10199) 68 kB
inet_gethost (10200) 56 kB
screen (16619) 20 kB
epmd (1829) 4 kB
sshd (4980) 4 kB
sshd (5040) 48 kB
bash (5041) 40 kB
rabbitmq-server (9838) 100 kB
beam.smp (9849) 225356 kB

Total: 225696 kB

OK, so I'm pretty sure beam.smp is also rabbitmq -- it's the Erlang VM the broker runs in -- and obviously rabbitmq-server is as well.

So I'm going to push this off of swap now:

#!/bin/bash

# Make sure that all text is parsed in the same language
export LC_MESSAGES=en_US.UTF-8
export LC_COLLATE=en_US.UTF-8
export LANG=en_US.utf8
export LANGUAGE=en_US:en
export LC_CTYPE=en_US.UTF-8

# Calculate how much memory and swap is free
free_data="$(free)"
mem_data="$(echo "$free_data" | grep 'Mem:')"
free_mem="$(echo "$mem_data" | awk '{print $4}')"
buffers="$(echo "$mem_data" | awk '{print $6}')"
cache="$(echo "$mem_data" | awk '{print $7}')"
total_free=$((free_mem + buffers + cache))
used_swap="$(echo "$free_data" | grep 'Swap:' | awk '{print $3}')"

echo -e "Free memory:\t$total_free kB ($((total_free / 1024)) MB)\nUsed swap:\t$used_swap kB ($((used_swap / 1024)) MB)"

# Do the work
if [[ $used_swap -eq 0 ]]; then
    echo "Congratulations! No swap is in use."
elif [[ $used_swap -lt $total_free ]]; then
    echo "Freeing swap..."
    swapoff -a
    swapon -a
else
    echo "Not enough free memory. Exiting."
    exit 1
fi
root@labcontrol1001:/home/rush# bash freeswap.sh
Free memory:	34048484 kB (33250 MB)
Used swap:	225696 kB (220 MB)
Freeing swap...

ran that and now

root@labcontrol1001:/home/rush# bash swap_stat.sh
Total: 0 kB

curious to see what effect this has

/home/rush# sudo bash swap_stat.sh
inet_gethost (10199) 68 kB
inet_gethost (10200) 56 kB
screen (16619) 4 kB
cat: /proc/17560/smaps: No such file or directory
rabbitmq-server (9838) 88 kB
beam.smp (9849) 316992 kB

Total: 317208 kB

/home/rush# free -m
             total       used       free     shared    buffers     cached
Mem:         48289      46015       2274          1        165      31049
-/+ buffers/cache:      14799      33489
Swap:          953        309        643

/home/rush# grep SwapTotal /proc/meminfo
SwapTotal:        976316 kB

/home/rush# vmstat 3 100
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
13  0 316444 2204624 170248 31805512    0    0     3    49    0    1 21  2 76  1  0
 3  0 316444 2248564 170252 31805532    0    0     0    64 21456 13939 45  5 50  0  0
 4  0 316444 2185360 170272 31805576    0    0     1   232 12950 8929 31  3 66  0  0
 3  0 316444 2207740 170280 31806108    0    0     0   349 15198 9957 24  5 71  0  0
 5  0 316444 2216624 170296 31806608    0    0    44   309 13182 10186 17  4 79  0  0
 3  0 316444 2140112 170308 31806664    0    0     0   517 17122 11116 35  5 59  1  0
 4  0 316444 2216300 170308 31806892    0    0     0     0 14375 10282 22  7 70  0  0
 2  0 316444 2240880 170452 31807032    0    0    35   511 13089 6395 18  5 77  1  0
 1  1 316440 2241376 170460 31807044   11    0    11    72 4552 7986 20  2 78  0  0
 2  0 316352 2246136 170460 31807292  128    0   128   256 3063 7005 11  1 88  0  0
 2  0 316352 2247720 170480 31807300    0    0     0   162 3321 7192 19  2 79  0  0
 4  0 316344 2234432 170496 31807316   11    0    11   299 6252 13198 43  4 52  1  0
 3  0 316344 2248552 170512 31807340    0    0     1   349 4293 9452 21  2 77  0  0
^C

/home/rush# free -m
             total       used       free     shared    buffers     cached
Mem:         48289      46235       2054          1        166      31071
-/+ buffers/cache:      14997      33291
Swap:          953        308        644


/home/rush# bash freeswap.sh
Free memory:    34251980 kB (33449 MB)
Used swap:    315840 kB (308 MB)
Freeing swap...

/home/rush# free -m
             total       used       free     shared    buffers     cached
Mem:         48289      46474       1814          1        167      31123
-/+ buffers/cache:      15183      33105
Swap:          953          0        953

A few thoughts on this phenomenon. I'm not sure whether rabbit components swapping is really a total antipattern here, but it's worth talking about for a few reasons: 1) clearly rabbit falls over sick at various points, 2) rabbit has Java-esque memory behaviors, and 3) it's the most common thread I've seen that, at least to this point, may correlate. We haven't been watching it very closely, though.

labcontrol1001 atm

ps -ef | grep rabbit
rabbitmq  1829     1  0 Jun21 ?        00:00:43 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon
rabbitmq  9838     1  0 Jul26 ?        00:00:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq  9849  9838 17 Jul26 ?        1-05:24:35 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -W w -A 64 -P 1048576 -K true -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.4/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@labcontrol1001 -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/rabbit@labcontrol1001.log"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/rabbit@labcontrol1001-sasl.log"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.5.4/sbin/../plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@labcontrol1001-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@labcontrol1001" -kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672
rabbitmq 10199  9849  0 Jul26 ?        00:00:10 inet_gethost 4
rabbitmq 10200 10199  0 Jul26 ?        00:00:09 inet_gethost 4
pidstat -r -p 1829
Linux 3.13.0-121-generic (labcontrol1001) 	08/02/2017 	_x86_64_	(12 CPU)

06:00:36 PM   UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
06:00:36 PM   114      1829      0.11      0.00   11472   2488   0.01  epmd
root@labcontrol1001:/home/rush# pidstat -r -p 9838
Linux 3.13.0-121-generic (labcontrol1001) 	08/02/2017 	_x86_64_	(12 CPU)

06:00:41 PM   UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
06:00:41 PM   114      9838      0.00      0.00    6500    656   0.00  rabbitmq-server
root@labcontrol1001:/home/rush# pidstat -r -p 9849
Linux 3.13.0-121-generic (labcontrol1001) 	08/02/2017 	_x86_64_	(12 CPU)

06:00:48 PM   UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
06:00:48 PM   114      9849    304.39      0.00 7615640 2544484   5.15  beam.smp

swap_stat.sh

/home/rush# bash swap_stat.sh
beam.smp (9849) 50572 kB

So I looked this morning and saw rabbit back to using swap, which was curious, so I evacuated it again; within a short timeframe it was back. Looking at overall system health, I don't see holistic usage that makes that understandable:

https://grafana.wikimedia.org/dashboard/db/host-overview?refresh=5m&orgId=1&from=now-1h&to=now&var-server=labcontrol1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=labs

So I dug around in the docs and have two thoughts.

In looking at https://www.rabbitmq.com/memory.html I see

Erlang's garbage collector can, in the worst case, cause double the amount of memory to be used (by default, 80% of RAM). It is strongly recommended that OS swap or page files are enabled.

[{rabbit, [{vm_memory_high_watermark, 0.4}]}].

and

Configuring the Paging Threshold

By default this starts to happen when the broker is 50% of the way to the high watermark (i.e. with a default high watermark of 0.4, this is when 20% of memory is used).

rabbit will preemptively page to swap when it believes it is using 20% of overall system memory, based on the default thresholds, it seems. Since I'm seeing it hit swap on labcontrol1001 under load and not on, say, labtestcontrol2001, I think this lines up with the behavior we're seeing.
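As a back-of-the-envelope check of that reading, a small sketch that computes where the documented defaults (high watermark 0.4, paging ratio 0.5) land on a given host:

# Where do the default vm_memory_high_watermark (0.4) and
# vm_memory_high_watermark_paging_ratio (0.5) put the limits on this host?
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
awk -v t="$total_kb" 'BEGIN {
    printf "total RAM:      %.1f GB\n", t / 1024 / 1024
    printf "high watermark: %.1f GB (0.4 * total)\n", 0.4 * t / 1024 / 1024
    printf "paging begins:  %.1f GB (0.5 * watermark)\n", 0.2 * t / 1024 / 1024
}'

On a ~48 GB host that puts the paging threshold right around the 9.67 GB figure worked out below.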

service rabbitmq-server status | grep -A 1 'memory,'

{memory,
     [{total,2787639656},

2787639656 bytes is about 2.7 GB ... OK, I think this must be physical memory only and not include what is swapped.

pidstat -r -p 9849
06:08:43 PM UID PID minflt/s majflt/s VSZ RSS %MEM Command
06:08:43 PM 114 9849 304.57 0.00 7603660 2546456 5.15 beam.smp


7603660 kB is about 7.6 GB
20% of available RAM is 9.67 GB

Postulating that, based on virtual memory usage, we are hitting some internal rabbit thresholds that trigger preemptive safety measures and force paging to swap at the 50%-of-the-40%-of-available-RAM threshold.

I'm wondering if we could bump these up to (obviously with some dangerous potential):

[{rabbit, [{vm_memory_high_watermark, 0.5}]}].
[{rabbit, [{vm_memory_high_watermark_paging_ratio, 0.75}]}].

A thing I did not realize about rabbit is that if it hits that memory high watermark it will /block/ publishers entirely, based on this arbitrary line. Though it's not so arbitrary, I guess, since the Erlang garbage collector is noted in a few places as consuming (at worst) twice the in-use memory.
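If we want to experiment before landing a config change, the watermark can also be read and temporarily adjusted at runtime (this does not persist across a broker restart, so it is only useful for testing; commands per rabbitmqctl 3.5.x):

# Current limit and memory breakdown as rabbit sees them.
sudo rabbitmqctl status | grep -A1 'vm_memory_high_watermark\|{memory,'

# Temporarily raise the watermark to 50% of RAM, for testing only.
sudo rabbitmqctl set_vm_memory_high_watermark 0.5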


Another minor thought (this gets more handwavy as the internal behavior is more opaque): the two designated internal finite resources noted are file descriptors -- which we have previously raised to 8192 -- and async threads. Roughly, there should be enough async threads for the number of simultaneous I/O operations your storage layer can plausibly execute... the default value is +A 64 and we are currently at the default. I don't know how to quantify this other than to say: if we bump up the memory watermarks and have already bumped up file handles, raising the number of async I/O threads available for peak concurrency could be sane. We keep adding virts, etc.
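For the record, here is how to see where we currently stand on both of those limits; the rabbitmq-env.conf bit is the standard knob for +A and is an assumption about where we would change it here:

# File descriptors: rabbit reports its own limit and current usage.
sudo rabbitmqctl status | grep -A4 file_descriptors

# Async thread pool: the +A value the running Erlang VM was started with
# (it shows up as "-A 64" on the beam.smp command line, as in the ps output above).
ps -o args= -C beam.smp | grep -o -- '-A [0-9]*'

# Raising it would normally go through RABBITMQ_SERVER_ERL_ARGS in
# /etc/rabbitmq/rabbitmq-env.conf, e.g. RABBITMQ_SERVER_ERL_ARGS="+A 128"
# (which I believe replaces, rather than appends to, the packaged defaults).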

References:
https://www.rabbitmq.com/configure.html#configuration-file
https://www.rabbitmq.com/production-checklist.html#resource-limits-ram
https://www.rabbitmq.com/persistence-conf.html
https://www.rabbitmq.com/memory-use.html

Also, as we start forcing rabbit back into physical RAM, it's not a terrible idea to consider giving it its own server.

Eventually I looked at the RabbitMQ statistics reported on https://grafana.wikimedia.org/dashboard/db/labs-queue-server-rabbitmq?orgId=1 -- it seems it at least has ample memory and file descriptors.

I am not quite sure what all those graphs represent exactly; they come from diamond via, I think, ./modules/openstack/files/rabbitmq/rabbitmq.py. Apparently the queues are not backed up / exploding, or maybe that shows up in another metric.

Maybe RabbitMQ has some logging system that could help? From there one could potentially raise the rate, or loop creation/deletion of instances, and try to reproduce. Though doing that against production would cause havoc :-\

I've just removed apache and puppetmaster from the labcontrols. Seems like this should give rabbitmq a bit more room to maneuver.

We have been having rabbitmq and/or timeout issues with operations this afternoon. Prior to the first rabbitmq restart, here is what I saw:

bash swap_stat.sh
rabbitmq-server (15743) 104 kB
beam.smp (15752) 975860 kB
inet_gethost (16071) 72 kB
inet_gethost (16072) 56 kB
screen (16619) 76 kB
epmd (1829) 36 kB

Total: 976204 kB

I have noticed:

# Security groups for big projects (e.g. tools) are too hard to
#  organize and cause a timeout between conductor and compute.
# This is /maybe/ fixed in mitaka, so we can investigate reverting
#  this to 60 (the proper default) in future versions.
rpc_response_timeout=180

That was done to accommodate some huge security groups (T142165). Could it be that the pool ends up being exhausted when too many long-running requests are in flight? That can potentially be tuned via

rpc_conn_pool_size = 30  # default
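To check what each nova service is actually running with, a quick grep on the relevant hosts is enough (an absent line means the oslo.messaging default applies, i.e. a 60 second timeout and a pool of 30):

grep -E '^(rpc_response_timeout|rpc_conn_pool_size)' /etc/nova/nova.conf \
    || echo 'using oslo.messaging defaults'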

Also, can we lower the Nodepool rate again? The current high rate of 8 seconds causes the spawning/deletion of instances to be quite delayed. Just to delete an instance takes 4 queries (ListServersTask, ListFlavorsTask, ListExtensionsTask, DeleteServerTask), or 24 seconds (T172229).

Change 375822 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] rabbitmq: add a giant default config

https://gerrit.wikimedia.org/r/375822

Change 375823 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] rabbitmq: increase vm_memory_high_watermark_paging_ratio

https://gerrit.wikimedia.org/r/375823

It happened again this morning after I entered a faulty command that attempted to delete all Nodepool instances at once. OpenStack timed out waiting for messages, which I found via /var/log/nova/nova-network.log on labnet1001.eqiad.wmnet.

Eg:
2017-09-05 09:19:15.010 6855 ERROR oslo_messaging.rpc.dispatcher [req-963f455c-efc2-4a01-a715-21c357b6b280 nodepoolmanager contintcloud - - -] Exception during message handling: Timed out waiting for a reply to message ID 23b7f18fac1b429186ca4f7c85535a49

Instances were erroring out for some reason and could not spawn, but I was able to delete them either via Horizon or via the openstack CLI.

I then:

  • stopped nodepool
  • manually deleted all instances one by one via the openstack cli
  • watched the nova-network.log for messages such as:
    • oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 726fa2df4fd942c183212fb01e311e1a

Eventually they stopped appearing. I then managed to spawn a dummy instance via Horizon and started Nodepool, which spawned new instances properly.
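Roughly, the recovery boiled down to the following (a sketch rather than the exact commands; the nodepool service name and the sourced credentials are assumptions, and the log watching happens on labnet1001 while the deletions run from wherever the openstack CLI is set up):

# 1. Stop nodepool so it stops fighting the cleanup.
sudo service nodepool stop

# 2. Delete the stuck contintcloud instances one by one.
openstack server list --project contintcloud -f value -c ID |
  while read -r id; do
      openstack server delete "${id}"
      sleep 10   # give nova-network / rabbit a chance to drain between deletes
  done

# 3. On labnet1001, wait for the "No calling threads waiting for msg_id"
#    messages to stop showing up.
sudo tail -f /var/log/nova/nova-network.log | grep 'No calling threads'

# 4. Spawn a dummy instance (Horizon or CLI) to confirm, then restart nodepool.
sudo service nodepool start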


Taking the last No calling threads waiting for msg_id message, which had the identifier 726fa2df4fd942c183212fb01e311e1a and came from request req-47a57a1b-b40d-461a-b5dc-4ecb5a561613:

10:07:29.006 nova.osapi_compute.wsgi.server "POST /v2/contintcloud/servers HTTP/1.1"

After four minutes the RPC reply has not been received and the request times out:

10:11:28.953 WARNING nova.network.manager Error cleaning up fixed ip allocation. Manual cleanup may be required.
10:11:29.160 ERROR oslo_messaging.rpc.dispatcher Timed out waiting for a reply to message ID 9865f0a3a5414ed89f060d585125b77a
10:14:21.437 ERROR oslo_messaging.rpc.dispatcher Timed out waiting for a reply to message ID 4a736726f8e64dc58051fd78f6a7c186
10:17:30.353 ERROR oslo_messaging.rpc.dispatcher Timed out waiting for a reply to message ID 726fa2df4fd942c183212fb01e311e1a

Though the messages were apparently still properly in RabbitMQ and eventually got received later on:

10:19:20.512 6855 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 9865f0a3a5414ed89f060d585125b77a
10:19:20.515 6855 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 9865f0a3a5414ed89f060d585125b77a
10:19:27.468 6855 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 4a736726f8e64dc58051fd78f6a7c186
10:19:27.474 6855 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 4a736726f8e64dc58051fd78f6a7c186
10:19:31.889 6855 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 726fa2df4fd942c183212fb01e311e1a
10:19:31.892 6855 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 726fa2df4fd942c183212fb01e311e1a

Who knows what is overflowing here: the oslo.messaging RPC pool or the RabbitMQ one.
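One way to at least see which side is backing up is to look at the broker's own view of its connections and queues (run on labcontrol1001; column names per rabbitmqctl 3.5.x, whose output is tab-separated):

# Connections holding the most channels, and the deepest queues.
sudo rabbitmqctl list_connections name state channels | sort -t$'\t' -k3 -n | tail
sudo rabbitmqctl list_queues name messages messages_unacknowledged consumers | sort -t$'\t' -k2 -n | tail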

Another thing I have noticed on the RabbitMQ Grafana board is that the notifications.error queue had messages piling up which are apparently not being consumed:

rabbitmq-notifications-queues.png (301×656 px, 31 KB)

The messages held there might potentially have a bit more information. That queue seems to be a bus that all OpenStack services emit to; there is probably a project to consume and display them somewhere. That could lead to a better view of what OpenStack is doing.

I guess Ceilometer knows how to consume those.
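To peek at what is actually piling up in there (the rabbitmqadmin part assumes the management plugin and its CLI are available on labcontrol1001, and the requeue flag is per the older rabbitmqadmin; rabbitmqctl alone at least shows the counts):

# Message counts for the notification queues.
sudo rabbitmqctl list_queues name messages consumers | grep notifications

# Pull a few messages without consuming them, to see what they contain.
rabbitmqadmin get queue=notifications.error requeue=true count=5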

Also from the logs there are ValueError: Circular reference detected in nova-network:

2017-09-05 09:19:14.816 6855 WARNING nova.network.manager
[req-963f455c-efc2-4a01-a715-21c357b6b280 nodepoolmanager contintcloud - - -] Error cleaning up fixed ip allocation. Manual cleanup may be required.
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 978, in allocate_fixed_ip
    f()
  File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 197, in wrapper
    ctxt, self, fn.__name__, args, kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 242, in object_action
    objmethod=objmethod, args=args, kwargs=kwargs)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
    retry=self.retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 462, in send
    retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 430, in _send
    msg = rpc_common.serialize_msg(msg)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/common.py", line 286, in serialize_msg
    _MESSAGE_KEY: jsonutils.dumps(raw_msg)}
  File "/usr/lib/python2.7/dist-packages/oslo_serialization/jsonutils.py", line 185, in dumps
    return json.dumps(obj, default=default, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 250, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
ValueError: Circular reference detected

2017-09-05 09:19:14.906 6855 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 23b7f18fac1b429186ca4f7c85535a49

2017-09-05 09:19:15.010 6855 ERROR oslo_messaging.rpc.dispatcher [req-963f455c-efc2-4a01-a715-21c357b6b280 nodepoolmanager contintcloud - - -] Exception during message handling: Timed out waiting for a reply to message ID 23b7f18fac1b429186ca4f7c85535a49

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
    executor_callback))
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
    executor_callback)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/dist-packages/nova/network/floating_ips.py", line 113, in allocate_for_instance
    **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 504, in allocate_for_instance
    requested_networks=requested_networks)
  File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 195, in _allocate_fixed_ips
    vpn=vpn, address=address)
  File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 982, in allocate_fixed_ip
    'be required.'), exc_info=True)
  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 958, in allocate_fixed_ip
    self._setup_network_on_host(context, network)
  File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 1896, in _setup_network_on_host
    self.driver.update_dhcp(elevated, dev, network)
  File "/usr/lib/python2.7/dist-packages/nova/network/linux_net.py", line 1052, in update_dhcp
    host=host)
  File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 169, in wrapper
    args, kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 229, in object_class_action
    args, kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 237, in object_class_action_versions
    args=args, kwargs=kwargs)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
    retry=self.retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 462, in send
    retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 451, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 348, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 253, in get
    'to message ID %s' % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID 23b7f18fac1b429186ca4f7c85535a49

I ran sudo rabbitmqctl purge_queue for the ceilometer_notifications.* queues. Before purge they had:

$ sudo rabbitmqctl list_queues |grep ceilometer
ceilometer_notifications.error  76
ceilometer_notifications.info   4149

Change 377039 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] rabbitmq: Add drain_queue utility script

https://gerrit.wikimedia.org/r/377039

Change 377039 merged by Rush:
[operations/puppet@production] rabbitmq: Add drain_queue utility script

https://gerrit.wikimedia.org/r/377039

From this weekend's logs, nova.network.manager had the same chain of errors:

2017-09-09 21:33:07.090 2408 WARNING nova.network.manager
[req-2106f08d-04ea-4cfc-a344-d0f7abf1072c nodepoolmanager contintcloud - - -]
Error cleaning up fixed ip allocation. Manual cleanup may be required.
ValueError: Circular reference detected
2017-09-09 21:33:07.269 2408 ERROR oslo_messaging.rpc.dispatcher
[req-2106f08d-04ea-4cfc-a344-d0f7abf1072c nodepoolmanager contintcloud - - -]
Exception during message handling: Timed out waiting for a reply to message ID 6e59f41d21354eccbc77bfe0fcc1c5c0

Though the message apparently did get yielded some 7 seconds later:

2017-09-09 21:33:14.487 2408 INFO oslo_messaging._drivers.amqpdriver [-]
No calling threads waiting for msg_id : 6e59f41d21354eccbc77bfe0fcc1c5c0

The replies thus took longer and longer to come back. At least for nova.network, all the errors seem to have a reference to that Circular reference detected. I am not sure how it is related; maybe something goes into a loop and overloads RabbitMQ with a ton of messages. But that does not show up in Grafana.

There is a new message from oslo.messaging:

2017-09-09 21:33:13.947 2408 WARNING oslo_messaging._drivers.amqpdriver [-]
Number of call queues is greater than warning threshold: 20.
There could be a leak. Increasing threshold to: 40

Based on the source, that is the per-process amqpdriver having too many functions waiting for replies. The base threshold is 10, and it is doubled every time it is reached (e.g. 20, 40, 80).

Which is probably due to RabbitMQ not honoring requests as fast as it should.

Enabling debug logging for oslo.messaging might give more details:

openstack: explicitly define default_log_levels [puppet] - https://gerrit.wikimedia.org/r/377321
openstack: debug oslo.messaging nova.network.manager [puppet] - https://gerrit.wikimedia.org/r/377322

Then again, if the problem is on the RabbitMQ side, that is not going to help.
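Judging by their titles, those two patches tweak default_log_levels in nova.conf; a rough equivalent by hand would be something like the following (crudini is used purely for illustration and may not be installed, and the level list is trimmed for readability; the real change goes through puppet):

# Bump oslo.messaging and nova.network.manager to DEBUG; the actual patch
# would enumerate the full default_log_levels list rather than this short one.
sudo crudini --set /etc/nova/nova.conf DEFAULT default_log_levels \
    'amqp=WARN,amqplib=WARN,oslo.messaging=DEBUG,nova.network.manager=DEBUG'
sudo service nova-network restart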

Change 377338 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] nodepool: make puppet match current settings

https://gerrit.wikimedia.org/r/377338

Change 377338 merged by Rush:
[operations/puppet@production] nodepool: make puppet match current settings

https://gerrit.wikimedia.org/r/377338

Change 381165 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] wmcs: update rabbitmq drain_queue script

https://gerrit.wikimedia.org/r/381165

Change 381165 merged by Andrew Bogott:
[operations/puppet@production] wmcs: update rabbitmq drain_queue script

https://gerrit.wikimedia.org/r/381165

rabbit is now much quieter, so this is /maybe/ better. Closing for now, optimistically.

Did you have a chance to look at the notifications.error messages? They might offer clues as to what is happening :]

Thank you for all the investigations and patches!

Did you have a chance to look at the notifications.error messages? They might offer clues as to what is happening :]

Most of them are pretty useless unfortunately. They all contain an "exception" member that is supposed to tell you what went wrong, but for most of them this member is empty.

$ wc -l /var/log/rabbitmq/notifications_error.log
890 /var/log/rabbitmq/notifications_error.log
$ cat /var/log/rabbitmq/notifications_error.log | jq .payload.exception | grep -v '{}'
{
  "kwargs": {
    "instance_id": "i-000b3969",
    "code": 404
  },
  "message": "Instance i-000b3969 could not be found."
}
{
  "kwargs": {
    "instance_id": "i-000b3969",
    "code": 404
  },
  "message": "Instance i-000b3969 could not be found."
}
{
  "kwargs": {
    "instance_id": "i-000b3969",
    "code": 404
  },
  "message": "Instance i-000b3969 could not be found."
}
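To get a rough tally of which event types and publishers make up that log, the same jq-on-the-dump approach works (same log file as above):

# Count error notifications by event type and by publisher.
jq -r '.event_type' /var/log/rabbitmq/notifications_error.log | sort | uniq -c | sort -rn
jq -r '.publisher_id' /var/log/rabbitmq/notifications_error.log | sort | uniq -c | sort -rn | head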

This is a typical event:

$ head -n 1 /var/log/rabbitmq/notifications_error.log | jq .
{
  "_unique_id": "bc0e184d9ea34cf6b52bab325b7d0a96",
  "_context_user_identity": "- - - - -",
  "_context_user_name": null,
  "_context_user": null,
  "_context_timestamp": "2017-10-01T05:41:04.282967",
  "publisher_id": "compute.labvirt1005",
  "_context_project_domain": null,
  "_context_is_admin": true,
  "_context_tenant": null,
  "timestamp": "2017-10-01 05:41:23.708168",
  "_context_auth_token": null,
  "message_id": "8c3754cb-7fae-4ffa-8402-ac3d3bbe4b21",
  "_context_show_deleted": false,
  "_context_read_only": false,
  "_context_user_id": null,
  "_context_request_id": "req-b7d3cbe9-6af0-4cd8-87e6-3155a64adec7",
  "_context_user_domain": null,
  "_context_instance_lock_checked": false,
  "_context_project_name": null,
  "priority": "ERROR",
  "_context_resource_uuid": null,
  "_context_roles": [],
  "_context_read_deleted": "yes",
  "payload": {
    "exception": {},
    "args": {
      "instance": {
        "user_data": null,
        "node": "labvirt1005.eqiad.wmnet",
        "project_id": "admin-monitoring",
        "ephemeral_key_uuid": null,
        "terminated_at": null,
        "updated_at": "2017-10-01T05:41:03.000000",
        "instance_type_id": 5,
        "created_at": "2017-10-01T05:37:09.000000",
        "host": "labvirt1005",
        "vm_mode": null,
        "display_name": "fullstackd-1506836226",
        "image_ref": "4a2aaa92-acf2-4837-ab41-74a21117b021",
        "vcpus": 1,
        "task_state": "deleting",
        "deleted": false,
        "root_gb": 20,
        "cleaned": true,
        "progress": 0,
        "locked": false,
        "deleted_at": null,
        "key_name": null,
        "name": "i-000b2c53",
        "architecture": null,
        "config_drive": "",
        "hostname": "fullstackd-1506836226",
        "os_type": null,
        "shutdown_terminate": false,
        "vm_state": "active",
        "auto_disk_config": false,
        "availability_zone": "nova",
        "launch_index": 0,
        "cell_name": null,
        "reservation_id": "r-hgrk8jz0",
        "kernel_id": "",
        "ramdisk_id": "",
        "default_swap_device": null,
        "memory_mb": 2048,
        "uuid": "b757d84f-bf4a-4486-8c11-56f8782ba4c9",
        "user_id": "osstackcanary",
        "launched_at": "2017-10-01T05:37:38.000000",
        "ephemeral_gb": 0,
        "display_description": "fullstackd-1506836226",
        "key_data": null,
        "access_ip_v4": null,
        "locked_by": null,
        "access_ip_v6": null,
        "power_state": 1,
        "disable_terminate": false,
        "launched_on": "labvirt1005",
        "root_device_name": "/dev/vda",
        "default_ephemeral_device": null,
        "id": 732243
      },
      "self": null,
      "context": {
        "project_id": null,
        "service_catalog": [],
        "resource_uuid": null,
        "quota_class": null,
        "remote_address": null,
        "read_only": false,
        "instance_lock_checked": false,
        "domain": null,
        "user": null,
        "tenant": null,
        "user_name": null,
        "auth_token": null,
        "_read_deleted": "yes",
        "request_id": "req-b7d3cbe9-6af0-4cd8-87e6-3155a64adec7",
        "project_domain": null,
        "user_domain": null,
        "roles": [],
        "show_deleted": false,
        "is_admin": true,
        "project_name": null,
        "user_id": null,
        "user_auth_plugin": null,
        "timestamp": "2017-10-01T05:41:04.282967"
      }
    }
  },
  "_context_remote_address": null,
  "_context_service_catalog": [],
  "_context_quota_class": null,
  "_context_domain": null,
  "event_type": "refresh_instance_security_rules",
  "_context_project_id": null
}

Change 375823 abandoned by Andrew Bogott:
rabbitmq: increase vm_memory_high_watermark_paging_ratio

Reason:
Other changes seem to have alleviated the problem this was meant to fix

https://gerrit.wikimedia.org/r/375823

Change 375822 merged by Andrew Bogott:
[operations/puppet@production] rabbitmq: add a giant default config

https://gerrit.wikimedia.org/r/375822

The issue ( T170492#3581822 ) still happens from time to time:

Error cleaning up fixed ip allocation. Manual cleanup may be required.
File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 978, in allocate_fixed_ip
...
ValueError: Circular reference detected

That happened today at 21:08:59 and the last error was at 21:36. Eventually it recovered on its own.

Mentioned in SAL (#wikimedia-operations) [2017-11-29T22:23:14Z] <hashar> Nodepool had some troubles spawning new instances from 21:09 to 21:36, and took a while to recover. Issue similar to T170492#3581822

Mentioned in SAL (#wikimedia-operations) [2017-11-29T22:23:14Z] <hashar> Nodepool had some troubles spawning new instances from 21:09 to 21:36, and took a while to recover. Issue similar to T170492#3581822

This was me testing a patch to dnsmasq that went poorly. Sorry for the noise!