
labnet1001/labstore1004 combined alert on 2018-11-14
Closed, ResolvedPublic

Description

On 2018-11-14 we got a page for labnet1001:

PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack

At the same time, we had these alerts for labstore1004:

PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
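
The failing unit can be inspected directly on labstore1004; a minimal sketch (the unit name is taken from the alert text above):

sudo systemctl list-units --state=failed
sudo systemctl status maintain-dbusers.service
sudo journalctl -u maintain-dbusers.service --since "1 hour ago"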

I downtimed both hosts in Icinga for 1h to avoid additional SMS pages and began investigating.
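
In practice the downtime is set through the Icinga web UI; for the record, the equivalent external command looks roughly like this (the command-file path is an assumption and varies per installation):

now=$(date +%s); end=$((now + 3600))
printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;labnet1001;%s;%s;1;0;3600;aborrero;fullstack alert investigation\n' \
    "$now" "$now" "$end" | sudo tee -a /var/lib/icinga/rw/icinga.cmd
# (and the same for labstore1004)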

This was present in /var/log/syslog on labstore1004:

Nov 14 12:06:14 labstore1004 maintain-dbusers[18782]: pymysql.err.OperationalError: (1040, 'Too many connections')
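
maintain-dbusers is a systemd unit, so once m5-master accepts connections again it should be enough to clear the failed state and restart it; a minimal sketch:

sudo systemctl reset-failed maintain-dbusers.service
sudo systemctl restart maintain-dbusers.service
sudo systemctl status maintain-dbusers.service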

And this was present in /var/log/syslog on labnet1001:

aborrero@labnet1001:~ $ sudo grep nova-fullstack /var/log/syslog
Nov 14 12:06:17 labnet1001 kernel: [15801036.224638] init: nova-fullstack main process (14830) terminated with status 1
Nov 14 12:06:17 labnet1001 kernel: [15801036.224658] init: nova-fullstack main process ended, respawning
Nov 14 12:06:18 labnet1001 kernel: [15801037.417144] init: nova-fullstack main process (17255) terminated with status 1
Nov 14 12:06:18 labnet1001 kernel: [15801037.417157] init: nova-fullstack main process ended, respawning
Nov 14 12:06:19 labnet1001 kernel: [15801038.471882] init: nova-fullstack main process (17299) terminated with status 1
Nov 14 12:06:19 labnet1001 kernel: [15801038.471896] init: nova-fullstack main process ended, respawning
Nov 14 12:06:21 labnet1001 kernel: [15801040.428913] init: nova-fullstack main process (17319) terminated with status 1
Nov 14 12:06:21 labnet1001 kernel: [15801040.428925] init: nova-fullstack main process ended, respawning
Nov 14 12:06:23 labnet1001 kernel: [15801042.603344] init: nova-fullstack main process (17329) terminated with status 1
Nov 14 12:06:23 labnet1001 kernel: [15801042.603360] init: nova-fullstack main process ended, respawning
Nov 14 12:06:24 labnet1001 kernel: [15801043.710914] init: nova-fullstack main process (17343) terminated with status 1
Nov 14 12:06:24 labnet1001 kernel: [15801043.710934] init: nova-fullstack main process ended, respawning
Nov 14 12:06:26 labnet1001 kernel: [15801044.917311] init: nova-fullstack main process (17353) terminated with status 1
Nov 14 12:06:26 labnet1001 kernel: [15801044.917331] init: nova-fullstack main process ended, respawning
Nov 14 12:06:27 labnet1001 kernel: [15801046.065372] init: nova-fullstack main process (17364) terminated with status 1
Nov 14 12:06:27 labnet1001 kernel: [15801046.065386] init: nova-fullstack main process ended, respawning
Nov 14 12:06:28 labnet1001 kernel: [15801047.164004] init: nova-fullstack main process (17374) terminated with status 1
Nov 14 12:06:28 labnet1001 kernel: [15801047.164022] init: nova-fullstack main process ended, respawning
Nov 14 12:06:29 labnet1001 kernel: [15801048.121265] init: nova-fullstack main process (17427) terminated with status 1
Nov 14 12:06:29 labnet1001 kernel: [15801048.121283] init: nova-fullstack main process ended, respawning
Nov 14 12:06:30 labnet1001 kernel: [15801049.074531] init: nova-fullstack main process (17434) terminated with status 1
Nov 14 12:06:30 labnet1001 kernel: [15801049.074555] init: nova-fullstack respawning too fast, stopped
Nov 14 12:10:36 labnet1001 puppet-agent[18946]: (/Stage[main]/Openstack::Nova::Fullstack::Service/Service[nova-fullstack]/ensure) ensure changed 'stopped' to 'running'
Nov 14 12:10:36 labnet1001 puppet-agent[18946]: (/Stage[main]/Openstack::Nova::Fullstack::Service/Service[nova-fullstack]) Unscheduling refresh on Service[nova-fullstack]
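
The "init:" lines indicate this is an Upstart job, so a quick way to confirm it stayed up after Puppet restarted it (a sketch, assuming Upstart tooling on this host):

sudo status nova-fullstack
sudo grep nova-fullstack /var/log/syslog | tail -n 5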

This could be another occurrence of T188589: m5-master overloaded by idle connections to the nova database
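
Idle connections show up in the processlist with Command = 'Sleep'; an illustrative way to count long-idle ones per source host on m5-master (the query is a sketch, not taken from T188589):

sudo mysql -e "SELECT LEFT(HOST, LOCATE(':', HOST) - 1) AS h, COUNT(*) AS idle
    FROM information_schema.processlist
    WHERE COMMAND = 'Sleep' AND TIME > 300
    GROUP BY h ORDER BY idle DESC;"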

Event Timeline

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

This is the current list of connections per source host on m5-master, as pointed out by @Volans:

MariaDB [(none)]> select LEFT(HOST, LOCATE(':',HOST) - 1) as h, count(*) as count from information_schema.processlist group by h order by count desc;
+----------------+-------+
| h              | count |
+----------------+-------+
| 208.80.154.23  |   320 |
| 10.64.20.13    |    66 |
| 208.80.154.92  |    36 |
| 10.64.0.122    |    21 |
| 208.80.155.117 |    10 |
| 208.80.154.135 |     8 |
|                |     4 |
| 10.64.20.18    |     4 |
| 10.64.16.151   |     2 |
| 208.80.154.95  |     1 |
| 208.80.154.132 |     1 |
| 10.192.32.8    |     1 |
| 10.64.0.15     |     1 |
+----------------+-------+
13 rows in set (0.00 sec)

Top 3 are: cloudcontrol1003, labnet1001 and labcontrol1001.
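
The IP-to-host mapping can be double-checked with reverse DNS:

for ip in 208.80.154.23 10.64.20.13 208.80.154.92; do
    echo -n "$ip -> "; dig +short -x "$ip"
done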

Also, we just rebooted labnet1002 (per T207377: Reboot WMCS servers for L1TF)

We have several log messages like the following in /var/log/nova/nova-api.log on labnet1001:

2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions [req-8c13ec1b-44aa-45dd-bbc4-12125f065517 osstackcanary admin-monitoring - - -] Unexpected exception in API method
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/nova/api/openstack/extensions.py", line 478, in wrapped
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     return f(*args, **kwargs)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/nova/api/openstack/compute/images.py", line 128, in index
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     **page_params)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/nova/image/api.py", line 68, in get_all
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     return session.detail(context, **kwargs)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/nova/image/glance.py", line 303, in detail
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     _reraise_translated_exception()
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/nova/image/glance.py", line 691, in _reraise_translated_exception
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     six.reraise(new_exc, None, exc_trace)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/nova/image/glance.py", line 301, in detail
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     images = self._client.call(context, 1, 'list', **params)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/nova/image/glance.py", line 254, in call
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     return list(result)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/glanceclient/v1/images.py", line 254, in list
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     for image in paginate(params, return_request_id):
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/glanceclient/v1/images.py", line 238, in paginate
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     images, resp = self._list(url, "images")
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/glanceclient/v1/images.py", line 63, in _list
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     resp, body = self.client.get(url)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/glanceclient/common/http.py", line 275, in get
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     return self._request('GET', url, **kwargs)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/glanceclient/common/http.py", line 267, in _request
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     resp, body_iter = self._handle_response(resp)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions   File "/usr/lib/python2.7/dist-packages/glanceclient/common/http.py", line 83, in _handle_response
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     raise exc.from_response(resp, resp.content)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions HTTPInternalServerError: 500 Internal Server Error
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions The server has either erred or is incapable of performing the requested operation.
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions     (HTTP 500)
2018-11-14 12:06:19.549 29112 ERROR nova.api.openstack.extensions
2018-11-14 12:06:19.552 29112 INFO nova.api.openstack.wsgi [req-8c13ec1b-44aa-45dd-bbc4-12125f065517 osstackcanary admin-monitoring - - -] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'glanceclient.exc.HTTPInternalServerError'>
2018-11-14 12:06:19.553 29112 INFO nova.osapi_compute.wsgi.server [req-8c13ec1b-44aa-45dd-bbc4-12125f065517 osstackcanary admin-monitoring - - -] 10.64.20.13 "GET /v2.1/images HTTP/1.1" status: 500 len: 516 time: 0.2158608

This can be found in /var/log/nova/nova-api.log on labnet1002:

2018-11-14 12:07:30.685 16639 ERROR oslo_service.service DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'm5-master.eqiad.wmnet' ((1040, u'Too many connections'))")

And also on labnet1001:

aborrero@labnet1001:~ 30s $ sudo grep "Too many" /var/log/nova/nova-api.log
12:11:14.469 29110 ERROR nova.api.openstack.extensions DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'm5-master.eqiad.wmnet' ((1040, u'Too many connections'))")

This could mean that all these servers are victims of cloudcontrol1003 overloading m5-master again.
We may need to bump T202889: cloudvps: dedicated openstack database again.

The number of connections to m5-master from cloudcontrol1003 is very high:

aborrero@cloudcontrol1003:~ $ sudo netstat -putan | grep 10.64.16.79 | wc -l
335

(considering that the current limit on m5-master is 500, IIRC)
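
The actual limit and high-water mark can be read on m5-master itself, e.g.:

sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';
    SHOW GLOBAL STATUS LIKE 'Threads_connected';
    SHOW GLOBAL STATUS LIKE 'Max_used_connections';"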

@Banyek or @Marostegui could you please confirm these numbers?

cloudcontrol1003# lsof -i -a -n | grep 10.64.16.79 | awk '{ print $1 }' | sort | uniq -c |sort -rn
    120 neutron-s
     87 keystone-
     61 nova-api
     27 glance-re
     20 nova-cond
      5 glance-ap
      2 nova-sche
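
Most of these are SQLAlchemy pool connections held open by the OpenStack services; whether explicit pool limits are configured (oslo.db's max_pool_size / max_overflow / idle_timeout) can be checked with something like this (the config paths are assumptions for a typical install):

sudo grep -E 'max_pool_size|max_overflow|idle_timeout' \
    /etc/nova/nova.conf /etc/neutron/neutron.conf \
    /etc/keystone/keystone.conf /etc/glance/glance-api.conf 2>/dev/null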

I don't see any reason not to increase the limit; 500 connections is not that many, and most of the threads are actually idle (in Sleep state):

| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 4                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 4                                      |
| Threads_running                                              | 4                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |
| Threads_running                                              | 3                                      |

Do not increase connections: the more we allow, the more these users will use, and at some point that can overload and crash the whole server, which is shared with wikitech. We need to keep limiting the users so that wikitech is not affected, until we have decided what to do with T202889: cloudvps: dedicated openstack database.

There is a cap on the user connections (the user that eats connections being OpenStack aka Cloud VPS). It just has burst capabilities and can briefly go over what we have set. I suspect that with the limits we have in place, it cannot go all that much higher. A slightly higher limit would help us get through Neutron migrations (which is almost certainly what is causing the bursts in connections).
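
If that cap is implemented as MariaDB's per-account max_user_connections (an assumption), the current value can be checked with something like the following (the account-name pattern is also an assumption; adjust to the actual OpenStack DB users):

sudo mysql -e "SELECT User, Host, max_user_connections
    FROM mysql.user
    WHERE User LIKE 'nova%' OR User LIKE 'neutron%' OR User LIKE 'glance%';"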

Of course, I really think OpenStack needs a new DB home, but I feel confident that connection usage will settle back to where it is now if it has a little more room to burst.

aborrero claimed this task.

We can move the conversation to the other tasks.