Page MenuHomePhabricator

Troubleshoot Wikimetrics "magic button"
Closed, ResolvedPublic13 Estimated Story Points

Description

We got a report on-list that the global program metrics (a.k.a. magic button) is failing.
https://metrics.wmflabs.org/reports/program-global-metrics

Event Timeline

mforns triaged this task as High priority.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.

@Marostegui

Hi! I'm looking into this problem and it looks very much like the one Dan was fixing a couple days ago in:
https://phabricator.wikimedia.org/T173062
Wikimetrics raises the same error when trying to connect to MW databases:
User 's52261' has exceeded the 'max_user_connections' resource (current value: 10)

I think I found the snippet of the code that is opening those many connections, but after fixing it
and testing its effects, nothing changes. Am I missing something? Should we close by hand the connections
that are already open? Did you do something like this while helping Dan?

Thanks!

@Marostegui

Hi! I'm looking into this problem and it looks very much like the one Dan was fixing a couple days ago in:
https://phabricator.wikimedia.org/T173062
Wikimetrics raises the same error when trying to connect to MW databases:
User 's52261' has exceeded the 'max_user_connections' resource (current value: 10)

I think I found the snippet of the code that is opening those many connections, but after fixing it
and testing its effects, nothing changes. Am I missing something? Should we close by hand the connections
that are already open? Did you do something like this while helping Dan?

Thanks!

What we saw with Dan was that basically connection would never get closed after doing their queries. They would just get there sleeping and obviously using a slot.
From the DBA side we didn't really do anything other than telling him that that was what we were seeing, connections not getting closed.
So yes, ideally you should close the connection as soon as the result is retrieved and open a new one if you need to query again, instead of reusing that one. Unless you can reuse connections but not creating new ones :-)

Change 373078 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/wikimetrics@master] [wIP] Close all sessions to avoid database connection errors

https://gerrit.wikimedia.org/r/373078

We found out what the cause of the issue is:
Wikimetrics is connecting to the same labsdb host for all databases. And it creates a connection for each database it needs to query.
So whenever there is a centralauth-expanded cohort that combined spans through more than 10 wikis, this problem will happen, because the limit of connections per user is 10.

Actually I tested this, and the problem is not only showing up with the magic button report, but also with any centralauth-expanded report that fulfills this condition. On the other hand, magic button reports that do not fulfill the condition are working fine.

Wikimetrics uses sqlalchemy to query the database, and (as far as my research goes) it's not possible to query 2 different databases using the same engine (same session). So, with sqlalchemy a different session is needed for each database, which implies a different connection to the host.

I will continue trying things...

@Marostegui, wikimetrics has 2 users, one for staging and one for prod (both operate in labs on labsdb). The staging user currently does not have a connection limit, which makes it harder to troubleshoot without deploying to the prod instance.

Can we update the connection limit for the labsdb MySQL user 's52262'@'%' to 10? I tried running an ALTER USER ... WITH MAX_USER_CONNECTIONS 10; on labsdb1001, but got You have an error in your SQL syntax.

(FYI the prod MySQL user is 's52261')

@Ottomata I have updated the connections limit on labsdb1001:

mysql:root@localhost [mysql]> show grants for 's52262'@'%'\G
*************************** 1. row ***************************
Grants for s52262@%: GRANT SHOW VIEW ON *.* TO 's52262'@'%' IDENTIFIED BY PASSWORD 'xx' WITH MAX_USER_CONNECTIONS 10

Change 373921 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/wikimetrics@mforns-dev] [WIP] Close database sessions to avoid max connection issue

https://gerrit.wikimedia.org/r/373921

Change 373921 merged by Mforns:
[analytics/wikimetrics@mforns-dev] [WIP] Close database sessions to avoid max connection issue

https://gerrit.wikimedia.org/r/373921

Change 373078 abandoned by Mforns:
[wIP] Close all sessions to avoid database connection errors

Reason:
I have tested this in staging and does not bring any positive change. Also found another one that will fix the issue. Abandon.

https://gerrit.wikimedia.org/r/373078

Change 373967 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/wikimetrics@master] Clear connections between report executions

https://gerrit.wikimedia.org/r/373967

Ok, I think I found a fix!
The last patch is the one that counts, the previous ones, are commits to the mforns-dev branch, to be able to deploy to staging and test.
I won't deploy now, because it's late friday. But next Monday this will be deployed.

Change 373967 merged by Mforns:
[analytics/wikimetrics@master] Clear connections between report executions

https://gerrit.wikimedia.org/r/373967

It's deployed to production.
We should not see the same problem.
Please, reopen the task otherwise!

mforns set the point value for this task to 13.Aug 31 2017, 4:15 PM