
tools.db.svc.eqiad.wmflabs hitting its limit?
Closed, Resolved · Public

Description

Starting in mid-January, XTools has increasingly been getting errors when writing to the user database. These include "1205 Lock wait timeout exceeded; try restarting transaction" and, more rarely, "2006 MySQL server has gone away". All the queries being run are very fast (~0.00 sec).

Then I noticed the issue with Event Metrics, which relies heavily on the user database. We started getting connectivity errors on February 5th, 12th and 13th.

The issue is usually short-lived, but today, since I first wrote this task, it has worsened. I've gotten more and more errors, and now a new one: "1040 Too many connections". Currently https://tools.wmflabs.org/copypatrol and https://eventmetrics.wmflabs.org/login are unusable.

XTools receives frequent requests, and Event Metrics has a cron that runs a query every 5 minutes. We get emailed when fatal errors happen, and I've never gotten this connectivity error prior to February 5. This makes me think it's something new. It would seem that it's getting worse by the day, and perhaps about to hit the breaking point?

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 13 2019, 4:26 AM
MusikAnimal added a comment. · Edited · Feb 13 2019, 4:31 AM

Possibly related: starting in mid-January, XTools has increasingly been getting errors when writing to the user database. These include "1205 Lock wait timeout exceeded; try restarting transaction" and, more rarely, "2006 MySQL server has gone away". All the queries being run are very fast. Maybe there's a broader issue with tools.db.svc.eqiad.wmflabs, and not just connectivity?

MusikAnimal updated the task description. · Feb 13 2019, 4:34 AM
bd808 added a subscriber: bd808. · Feb 13 2019, 4:56 AM

The pooling/de-pooling is a red herring that I set loose when talking with @MusikAnimal on IRC. I thought we were discussing Wiki Replica issues rather than ToolsDB issues. ToolsDB does not have multiple backends, so this is easily eliminated as a cause for service interruptions.

MusikAnimal triaged this task as High priority. · Feb 13 2019, 7:57 AM

I am now also getting 1040 Too many connections. This happens frequently... at the moment https://eventmetrics.wmflabs.org is unusable.

MusikAnimal renamed this task from Connectivity issues with tools.db.svc.eqiad.wmflabs to tools.db.svc.eqiad.wmflabs hitting its limit?. · Feb 13 2019, 8:06 AM
MusikAnimal updated the task description.
MusikAnimal raised the priority of this task from High to Unbreak Now!. · Feb 13 2019, 8:10 AM

Hope I'm not out of line upping to UBN. All of the applications I can think of that rely on toolsdb are currently nonfunctional. Perhaps there's a rogue tool hogging up all the connections? Or did we really just hit the breaking point on hardware limitations?
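
One way to check the "rogue tool" theory is to count open connections per account. This is only a sketch: it assumes an account with the PROCESS privilege (without it, only your own threads are listed) and a server that still accepts at least one more connection.

```sql
-- Count open connections per account, busiest first.
SELECT user, COUNT(*) AS connections
FROM information_schema.PROCESSLIST
GROUP BY user
ORDER BY connections DESC;

-- Compare the current total against the server-wide ceiling
-- that triggers error 1040 (Too many connections).
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL VARIABLES LIKE 'max_connections';
```

If one account holds most of the connections, that points at a single tool; if they are spread evenly, the server itself is the more likely bottleneck.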

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · Feb 13 2019, 8:10 AM
MusikAnimal lowered the priority of this task from Unbreak Now! to High. · Feb 13 2019, 4:32 PM

Toolsdb is accessible again after someone restarted it.

bd808 moved this task from Backlog to ToolsDB on the Data-Services board. · Feb 13 2019, 5:03 PM

From what I can see, labsdb1005 does not have any connection limits; maybe we need to establish a per-user connection limit similar to what we have on the replicas. Better to "break" a tool than the whole server.
We can probably also take a look at those specific tools that might need more than X connections (X being the limit we decide to set).
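
For illustration, a per-account limit of this kind could look roughly like the following. The account name and the value 10 are hypothetical, and this is not necessarily how the limit was eventually applied on ToolsDB. The GRANT form is accepted by MariaDB and MySQL 5.x; newer MySQL versions express the same thing with ALTER USER ... WITH MAX_USER_CONNECTIONS.

```sql
-- Hypothetical account and limit, for illustration only.
-- The account is assumed to exist already; USAGE grants no new privileges,
-- it only attaches the resource limit.
GRANT USAGE ON *.* TO 'example_tool'@'%' WITH MAX_USER_CONNECTIONS 10;

-- Confirm the limit is now attached to the account.
SHOW GRANTS FOR 'example_tool'@'%';
```

Once an account reaches its limit, further connection attempts from that account are refused, while other accounts (including administrative ones) can still connect.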

bd808 added a comment. · Feb 15 2019, 1:43 AM

From what I can see, labsdb1005 does not have any connection limits; maybe we need to establish a per-user connection limit similar to what we have on the replicas. Better to "break" a tool than the whole server.
We can probably also take a look at those specific tools that might need more than X connections (X being the limit we decide to set).

Good idea :) We created T216170: toolsdb - Per-user connection limits today while working through the second overload outage.

Bstorm added a subscriber: Bstorm. · Feb 15 2019, 4:54 PM

That is done. However, the connection problem is a symptom of queries (including very simple ones) never returning and getting stuck in "opening tables". Since that state has returned, we are proceeding with an effort to migrate toolsdb to new servers. Thanks to the limits, however, we'll be able to log into the MySQL server for maintenance activities for far longer, which should help a lot in this effort!
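
As a rough sketch of how the stuck state can be spotted from a maintenance session (assuming the PROCESS privilege, and that the thread state string starts with "Opening table", which may vary slightly between server versions):

```sql
-- List threads currently stuck opening tables, longest-waiting first.
SELECT id, user, db, time, state, LEFT(info, 80) AS query_start
FROM information_schema.PROCESSLIST
WHERE state LIKE 'Opening table%'
ORDER BY time DESC;
```

A long list here, with steadily growing time values even for trivial queries, would match the symptom described above.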

Bstorm closed this task as Resolved. · Feb 21 2019, 6:37 PM
Bstorm claimed this task.

This ticket makes me wonder how long the toolsdb server was slowly dying. I suspect that this issue is resolved for now, though, following the move to new hardware along with the overall service outage. I'll close for now. Please re-open if that turns out to be incorrect.

MusikAnimal added a comment. · Edited · Feb 28 2019, 8:49 PM

@Bstorm I just got another "1205 Lock wait timeout exceeded" error :( The query is supposed to be super duper fast (~0.00 sec):

UPDATE s51187__xtools_prod.usage_timeline SET count = count + 1 WHERE tool = 'Pages' AND date = '2019-02-28'

This is the first such error since the switchover to new hardware. Before then I was getting it many times a day.
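
A possible starting point for tracking down a 1205 like this one, assuming the table uses InnoDB and the session has enough privileges to see other accounts' transactions: check the configured timeout, then look for long-open transactions that may be holding the row lock.

```sql
-- How long a statement waits for a row lock before failing with 1205.
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';

-- Open InnoDB transactions, oldest first. A transaction that has been open
-- for a long time is a likely holder of the lock the UPDATE is waiting on.
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started;
```

The trx_mysql_thread_id value corresponds to the id column in the process list, which makes it possible to see which account and connection owns the blocking transaction.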