
designate: database having deadlock problems
Open, HighPublic

Description

We detected this:

2020-12-23 12:03:06.867 21821 ERROR oslo_messaging.rpc.server oslo_db.exception.DBDeadlock: (pymysql.err.InternalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: 'UPDATE service_statuses SET updated_at=%(updated_at)s, service_name=%(service_name)s, hostname=%(hostname)s, heartbeated_at=%(heartbeated_at)s, status=%(status)s, stats=%(stats)s, capabilities=%(capabilities)s WHERE service_statuses.id = %(id_1)s'] [parameters: {'updated_at': datetime.datetime(2020, 12, 23, 12, 2, 15, 70273), 'service_name': 'producer', 'hostname': 'cloudservices1004', 'heartbeated_at': datetime.datetime(2020, 12, 23, 12, 2, 11), 'status': 'UP', 'stats': '{}', 'capabilities': '{}', 'id_1': 'b6e4cb00f645434fa14124f33d26cff9'}] (Background on this error at: http://sqlalche.me/e/2j85)

The error can be seen in /var/log/syslog on cloudservices1004.
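When a 1205 fires, it can help to see which transaction is holding the lock. A hedged diagnostic sketch, assuming a MariaDB 10.x server where the InnoDB lock tables are still exposed via information_schema (names and availability vary by version):

```sql
-- Who is waiting on whom? (assumes information_schema.INNODB_LOCK_WAITS /
-- INNODB_TRX exist on this MariaDB version)
SELECT r.trx_id    AS waiting_trx,
       r.trx_query AS waiting_query,
       b.trx_id    AS blocking_trx,
       b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;

-- The 1205 error fires once a waiter exceeds this timeout:
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
```

`SHOW ENGINE INNODB STATUS` also carries a "LATEST DETECTED DEADLOCK" section if a true deadlock (as opposed to a lock wait timeout) occurred.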

Event Timeline

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.
root@cloudcontrol1004:~# journalctl -u mariadb.service -f
-- Logs begin at Wed 2020-12-23 09:25:59 UTC. --
Dec 23 12:32:49 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:32:49 0 [Warning] WSREP: Failed to report last committed 128344444, -110 (Connection timed out)
Dec 23 12:33:10 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:33:10 0 [Warning] WSREP: Failed to report last committed 128344586, -110 (Connection timed out)
Dec 23 12:33:52 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:33:52 0 [Warning] WSREP: Failed to report last committed 128344819, -110 (Connection timed out)
Dec 23 12:34:32 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:34:32 0 [Warning] WSREP: Failed to report last committed 128345068, -110 (Connection timed out)
Dec 23 12:35:14 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:35:14 0 [Warning] WSREP: Failed to report last committed 128345332, -110 (Connection timed out)
Dec 23 12:35:36 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:35:36 0 [Warning] WSREP: Failed to report last committed 128345454, -110 (Connection timed out)
Dec 23 12:35:57 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:35:57 0 [Warning] WSREP: Failed to report last committed 128345579, -110 (Connection timed out)
Dec 23 12:36:38 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:36:38 0 [Warning] WSREP: Failed to report last committed 128345841, -110 (Connection timed out)
Dec 23 12:36:59 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:36:59 0 [Warning] WSREP: Failed to report last committed 128345998, -110 (Connection timed out)
Dec 23 12:37:40 cloudcontrol1004 mysqld[15564]: 2020-12-23 12:37:40 0 [Warning] WSREP: Failed to report last committed 128346271, -110 (Connection timed out)
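The warnings above arrive roughly every 20-40 seconds. A small sketch for pulling the timestamps and Galera sequence numbers out of journalctl output, to check whether the warning rate correlates with the deadlock bursts (the regex is tailored to the log format shown above and is an assumption, not part of any tooling we run):

```python
import re
from datetime import datetime

# Matches the mysqld-side timestamp and seqno in lines like:
# "... 2020-12-23 12:32:49 0 [Warning] WSREP: Failed to report last committed 128344444, -110 ..."
WSREP_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}) \d+ \[Warning\] WSREP: "
    r"Failed to report last committed (?P<seqno>\d+), -110"
)

def parse_wsrep_warnings(lines):
    """Return a list of (timestamp, seqno) pairs for each WSREP warning line."""
    events = []
    for line in lines:
        m = WSREP_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, int(m.group("seqno"))))
    return events
```

Feeding it `journalctl -u mariadb.service --no-pager` output would give the inter-warning gaps and how fast the commit sequence number is advancing.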

However:

mysql:root@localhost [(none)]> SHOW STATUS LIKE "wsrep_local_state_comment"; SHOW STATUS LIKE "wsrep_ready";
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.001 sec)

+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | ON    |
+---------------+-------+

This may be an after-effect of T270552 somehow.

This is happening again today (not sure whether it was ever actually fixed):

 	Exception during message handling: oslo_db.exception.DBDeadlock: (pymysql.err.InternalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: 'UPDATE service_statuses SET updated_at=%(updated_at)s, service_name=%(service_name)s, hostname=%(hostname)s, heartbeated_at=%(heartbeated_at)s, status=%(status)s, stats=%(stats)s, capabilities=%(capabilities)s WHERE service_statuses.id = %(id_1)s'] [parameters: {'updated_at': datetime.datetime(2021, 1, 5, 13, 28, 46, 746178), 'service_name': 'api', 'hostname': 'cloudservices1003', 'heartbeated_at': datetime.datetime(2021, 1, 5, 13, 28, 46), 'status': 'UP', 'stats': '{}', 'capabilities': '{}', 'id_1': 'fbea29e35ded46ac89d02727647ee91f'}] (Background on this error at: http://sqlalche.me/e/2j85)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1236, in _execute_context
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     cursor, statement, parameters, context
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 536, in do_execute
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     cursor.execute(statement, parameters)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     result = self._query(query)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     conn.query(q)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 517, in query
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 732, in _read_query_result
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     result.read()
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1075, in read
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     first_packet = self.connection._read_packet()
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 684, in _read_packet
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     packet.check_error()
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/protocol.py", line 220, in check_error
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     err.raise_mysql_exception(self._data)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/err.py", line 109, in raise_mysql_exception
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     raise errorclass(errno, errval)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server pymysql.err.InternalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server 
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server The above exception was the direct cause of the following exception:
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server 
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/designate/rpc.py", line 168, in dispatch
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     return super(RPCDispatcher, self).dispatch(*args, **kwds)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/designate/central/service.py", line 2899, in update_service_status
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     return self.storage.update_service_status(context, db_status)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/designate/storage/impl_sqlalchemy/__init__.py", line 1788, in update_service_status
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     exceptions.ServiceStatusNotFound)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/designate/sqlalchemy/base.py", line 548, in _update
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     resultproxy = self.session.execute(query)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 1263, in execute
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     clause, params or {}
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 980, in execute
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     return meth(self, multiparams, params)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/sql/elements.py", line 273, in _execute_on_connection
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     return connection._execute_clauseelement(self, multiparams, params)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1099, in _execute_clauseelement
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     distilled_params,
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1240, in _execute_context
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     e, statement, parameters, cursor, context
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1456, in _handle_dbapi_exception
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     util.raise_from_cause(newraise, exc_info)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 296, in raise_from_cause
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     reraise(type(exception), exception, tb=exc_tb, cause=cause)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 276, in reraise
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     raise value.with_traceback(tb)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1236, in _execute_context
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     cursor, statement, parameters, context
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 536, in do_execute
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     cursor.execute(statement, parameters)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     result = self._query(query)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     conn.query(q)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 517, in query
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 732, in _read_query_result
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     result.read()
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1075, in read
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     first_packet = self.connection._read_packet()
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 684, in _read_packet
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     packet.check_error()
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/protocol.py", line 220, in check_error
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     err.raise_mysql_exception(self._data)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/err.py", line 109, in raise_mysql_exception
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server     raise errorclass(errno, errval)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server oslo_db.exception.DBDeadlock: (pymysql.err.InternalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: 'UPDATE service_statuses SET updated_at=%(updated_at)s, service_name=%(service_name)s, hostname=%(hostname)s, heartbeated_at=%(heartbeated_at)s, status=%(status)s, stats=%(stats)s, capabilities=%(capabilities)s WHERE service_statuses.id = %(id_1)s'] [parameters: {'updated_at': datetime.datetime(2021, 1, 5, 13, 28, 46, 746178), 'service_name': 'api', 'hostname': 'cloudservices1003', 'heartbeated_at': datetime.datetime(2021, 1, 5, 13, 28, 46), 'status': 'UP', 'stats': '{}', 'capabilities': '{}', 'id_1': 'fbea29e35ded46ac89d02727647ee91f'}] (Background on this error at: http://sqlalche.me/e/2j85)
2021-01-05 13:29:37.168 7591 ERROR oslo_messaging.rpc.server

Random fun in the logs (this first one matches https://jira.mariadb.org/browse/MDEV-18562)

Jan 05 17:19:05 cloudcontrol1005 mysqld[16847]: 2021-01-05 17:19:05 8607732 [ERROR] InnoDB: WSREP: referenced FK check fail: Lock wait index `PRIMARY` table `neutron`.`provisioningblocks`

Folks at the MariaDB project don't seem terribly worried about it; naturally, users are.

Also this happened today:

Jan 6 00:09:14 cloudcontrol1005 mysqld[16847]: 2021-01-06 0:09:14 0 [Warning] WSREP: Failed to report last committed 137027138, -110 (Connection timed out)

So that's fun, and it sounds a lot like something was overloaded at the time.

This is all on cloudcontrol1005, which is currently the only active database server: at the haproxy layer we are set up to send traffic to only one of the three cloudcontrols, with the other two in standby mode.
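For illustration, the active/standby arrangement described above roughly corresponds to an haproxy backend like the following (a hypothetical sketch; the actual listener name, port, and check options in our config may differ):

```
# Hypothetical sketch of the single-active-writer setup: all traffic goes to
# cloudcontrol1005; the "backup" servers only take over if it fails its check.
listen mariadb
    bind *:3306
    mode tcp
    option tcpka
    server cloudcontrol1005 cloudcontrol1005:3306 check
    server cloudcontrol1004 cloudcontrol1004:3306 check backup
    server cloudcontrol1003 cloudcontrol1003:3306 check backup
```

Funneling all writes to one Galera node avoids cross-node certification conflicts, but it also means any local lock contention on that one node is felt by every client.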

I did confirm that, because deadlocks have historically been common in OpenStack coding patterns, designate normally restarts transactions that deadlock. Whether the oslo middleware does the same, I'm not sure. I didn't see anything strange in any configs relating to this that would cause it to start writing to the wrong DB host. If the control plane hosts suffer occasional disk overloads, that could encourage deadlocks.
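The restart-on-deadlock pattern in oslo_db (exposed via `oslo_db.api.wrap_db_retry` with `retry_on_deadlock=True`) roughly amounts to the sketch below. This is a simplified stand-in, not designate's actual code: the exception class here is a local dummy mimicking `oslo_db.exception.DBDeadlock`, since a real database isn't assumed.

```python
import functools
import time

class DBDeadlock(Exception):
    """Local stand-in for oslo_db.exception.DBDeadlock (MySQL error 1205)."""

def retry_on_deadlock(max_retries=3, delay=0.0):
    """Re-run a DB operation when it hits a deadlock / lock wait timeout.

    Roughly what oslo_db's wrap_db_retry(retry_on_deadlock=True) does:
    retry the whole transaction with exponential backoff, then give up.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DBDeadlock:
                    if attempt == max_retries:
                        raise  # exhausted retries, surface the error
                    time.sleep(delay * (2 ** attempt))
        return wrapper
    return decorator
```

The key point for this task: a retried deadlock is invisible in the logs, so the tracebacks above mean the retries themselves kept timing out, i.e. the lock was held well past `innodb_lock_wait_timeout`, not just briefly contended.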

I predict this will improve now that cloudcontrol1003/1004 are less busy.