Lost connection to MariaDB server during query
Closed, Duplicate, Public

Description

I'm getting database errors in some tools that were running without errors until a few days ago. The error returned is "oursql.OperationalError: (2013, 'Lost connection to MySQL server during query', None)". It seems to happen when a query takes more than ~4 sec; faster queries run without errors.

Event Timeline

Danilo raised the priority of this task to Needs Triage.
Danilo updated the task description.
Danilo added a project: Toolforge.
Danilo changed Security from none to None.
Danilo subscribed.
```
MariaDB [commonswiki_p]> select rc_user AS test from recentchanges  limit 1;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    1006686
Current database: commonswiki_p

+---------+
| test    |
+---------+
| 3367542 |
+---------+
1 row in set (0.12 sec)
```
Steinsplitter triaged this task as Medium priority.
Steinsplitter added a subscriber: coren.

(dupe comment from T76699)

No smoking gun yet.

As discussed a few times on labs-l, in order to fight abuse and replag (replication lag), we explicitly kill things when:

  1. A query runs for more than 28800 seconds (8 hours)
  2. One or more queries are about to collectively cause an OOM (rare)
  3. A client holds a connection open and idle for more than 60 seconds
  4. The CATSCAN stuff runs slow writes for more than 300 seconds
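To make the four rules concrete, here is a minimal Python sketch that classifies a connection against them. The names (`ConnectionState`, `kill_reasons`) are hypothetical and this is not the actual reaper's code. Rule 2 is reduced to a caller-supplied `near_oom` flag, since detecting a collective OOM needs server-wide memory state, and "idle" is assumed to mean that no query is currently running on the connection.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds from the list above, in seconds.
MAX_QUERY_SECONDS = 28800          # rule 1 (8 hours)
MAX_IDLE_SECONDS = 60              # rule 3
MAX_CATSCAN_WRITE_SECONDS = 300    # rule 4


@dataclass
class ConnectionState:
    """Hypothetical snapshot of one client connection, for illustration only."""
    running_query_seconds: Optional[float]  # None when no query is running
    idle_seconds: float                     # time since the last statement finished
    is_catscan_write: bool = False


def kill_reasons(conn: ConnectionState, near_oom: bool = False) -> List[str]:
    """Return the rules (if any) under which this connection would be killed."""
    reasons = []
    q = conn.running_query_seconds
    if q is not None and q > MAX_QUERY_SECONDS:
        reasons.append("long-running query")
    if q is not None and near_oom:
        # Rule 2 modeled as an external flag (assumption, see lead-in).
        reasons.append("collective OOM risk")
    if q is None and conn.idle_seconds > MAX_IDLE_SECONDS:
        reasons.append("idle connection")
    if q is not None and conn.is_catscan_write and q > MAX_CATSCAN_WRITE_SECONDS:
        reasons.append("slow CATSCAN write")
    return reasons
```

Under this model a ~4 s query on its own matches none of the rules (`kill_reasons(ConnectionState(4, 0))` returns `[]`), which is why the idle rule is the more likely suspect.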

I wonder if #3 is the culprit here.

60s was chosen because some users leave transactions open for long periods and/or hold many concurrent connections for no good reason. If this is the problem, we can either bump up the time limit or encourage apps to auto-reconnect, use keep-alives, or simply close their connections when idle.
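For the auto-reconnect option, a minimal client-side sketch in Python: it opens a connection per query and, if the driver reports error 2006 or 2013 (the two codes seen in this task), retries once on a fresh connection. `query_with_reconnect` and `is_lost_connection` are hypothetical helpers, and `connect` stands for any zero-argument callable returning a DB-API connection (for example a `functools.partial` around `oursql.connect`). The sketch assumes, as in the traceback in the description, that the error code is the first element of the exception's `args`.

```python
import logging
from typing import Any, Callable, List, Sequence

# MySQL/MariaDB client error codes seen in this task:
#   2006 = "MySQL server has gone away"
#   2013 = "Lost connection to MySQL server during query"
RECONNECTABLE_ERRNOS = {2006, 2013}


def is_lost_connection(exc: BaseException) -> bool:
    """True if the exception's first arg is error code 2006 or 2013."""
    return bool(exc.args) and exc.args[0] in RECONNECTABLE_ERRNOS


def query_with_reconnect(connect: Callable[[], Any], sql: str,
                         params: Sequence[Any] = (), retries: int = 1) -> List[tuple]:
    """Run sql on a fresh connection, reconnecting if the server dropped it."""
    attempt = 0
    while True:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        except Exception as exc:
            # Re-raise anything that is not a lost connection, or once retries run out.
            if attempt >= retries or not is_lost_connection(exc):
                raise
            attempt += 1
            logging.warning("lost connection (error %s), reconnecting", exc.args[0])
        finally:
            # Always release the connection instead of leaving it idle.
            try:
                conn.close()
            except Exception:
                pass  # the server may already have dropped it
```

Only retry read-only queries this way: blindly re-running a write after a lost connection can apply it twice.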

As for the ~4 sec estimate: could the errors occur when a query runs on a connection that has just sat idle for ~60 sec? The two might overlap.
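If that overlap is the cause, a client can also sidestep it proactively by never reusing a connection that has sat idle near the limit. A minimal sketch, assuming a hypothetical `IdleAwareConnection` wrapper and an arbitrary 50 s threshold chosen to stay under the 60 s idle kill; the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Any, Callable, List, Sequence


class IdleAwareConnection:
    """Reopen the DB connection if it has been idle longer than max_idle seconds.

    connect is any zero-argument callable returning a DB-API connection.
    max_idle defaults to 50 s, an arbitrary margin under the 60 s idle kill.
    """

    def __init__(self, connect: Callable[[], Any], max_idle: float = 50.0,
                 clock: Callable[[], float] = time.monotonic):
        self._connect = connect
        self._max_idle = max_idle
        self._clock = clock
        self._conn = connect()
        self._last_used = clock()

    def query(self, sql: str, params: Sequence[Any] = ()) -> List[tuple]:
        if self._clock() - self._last_used > self._max_idle:
            # The server has probably killed this connection already; start fresh.
            try:
                self._conn.close()
            except Exception:
                pass
            self._conn = self._connect()
        cur = self._conn.cursor()
        cur.execute(sql, params)
        rows = cur.fetchall()
        # Idle time is measured from the end of the last query.
        self._last_used = self._clock()
        return rows
```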