
s52490 created 338 simultaneous connections to the same server executing the same query
Closed, ResolvedPublic

Description

| 53011504 | s52490          | 10.68.16.151:46092 | s52490__hashtags_p                    | Execute |      40 | Sending data                         
            FROM recentchanges
            WHERE rc_id = ?
            AND htrc_lang  |    0.000 |

Event Timeline

All queries killed and user throttled to 1 connection per user until this is resolved.

Can you expand upon the issue at hand? Do you know when it started or how similar the in-flight queries are/were?

Also what are the headings to these columns in the issue description?

I cannot say when it started, but I suspect throughout the weekend. Most of the queries followed this pattern (approximately 1 per second):

SELECT htrc_id
FROM recentchanges
WHERE rc_id = ?
AND htrc_lang = ?
LIMIT 1
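
For context, here is a minimal sketch of how such a lookup might be issued from Python. The pymysql driver, the placeholder host, and the lookup_hashtag_rc() wrapper are illustrative assumptions, not the tool's actual code (pymysql also uses %s placeholders rather than ?):

import pymysql  # assumed driver; the real tool may use something else

def lookup_hashtag_rc(conn, rc_id, lang):
    """Return the htrc_id for one recent change in one language, if any."""
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT htrc_id FROM recentchanges"
            " WHERE rc_id = %s AND htrc_lang = %s LIMIT 1",
            (rc_id, lang),
        )
        row = cursor.fetchone()
        return row[0] if row else None

# usage: one shared connection, reused for every lookup
conn = pymysql.connect(host="replica-host.example",          # placeholder host
                       database="s52490__hashtags_p",
                       read_default_file="~/replica.my.cnf")  # assumed credentials file
print(lookup_hashtag_rc(conn, 12345, "en"))
conn.close()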

I only noticed after realizing there was slave delay on c1.

Also what are the headings to these columns in the issue description?

Sorry, I assumed everybody would be familiar with that; it is the standard SHOW PROCESSLIST output:

| Id       | User            | Host               | db                            | Command | Time    | State                                                                       | Info                                                                                                 | Progress |

Ah, if it's just that pattern, it's probably the same issue just magnified across all the languages. Stephen and I will look into it, thanks!

OK, Stephen and I both have work tomorrow, and nothing has jumped out at us, bug-wise.

I'll just put it out there that there were no code changes and that this might just be due to a high number of hashtag-laden recent changes in the last 24 hours. In fact, with the latest bot hashtag additions, we may have to reassess our scalability expectations.

However, if you're looking for bugged code or runaway processes this probably isn't it. Is it possible someone else is doing a heavy join or maybe even has a leapday bug? (that would be exciting)

Your code allows 338 simultaneous connections to the replica databases. That is a bug.

Haha, I'm sure some would argue 338 is small potatoes, but I'm not one to argue.

So, what's the connection limit, or recommendation? We'll see to it things stay within bounds.

Also, is there an idle timeout or recommended connection keep alive time?

Haha

I do not consider this a laughing matter. The labs replicas are a fundamental piece of infrastructure for Wikimedia wikis, and many communities rely on tools that use them working properly. Every day I get complaints when they do not work as expected, and I have to act accordingly.

There is no predetermined fixed limit. Thirty 1-second queries can be OK, while 5 long-running connections can be very taxing if they consume a lot of memory. If your queries start affecting other users' queries (as was the case here, creating lag for other users), your connections will be throttled. If nothing is done to fix this, connections will continue to be limited, or, in extreme, uncooperative cases, the account's permission to access the replicas will be revoked. In your case, your over 300 connections were taking between 40 and 300 seconds to execute.

Please understand that you are sharing finite resources with 5000 other users, so 338 concurrent queries from a single user is a lot. This is aggravated by the fact that we currently have one less server than usual. As a pure suggestion, a hard limit of 15 simultaneous connections may be wiser, and it should be enforced in code (see the sketch below).
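
To make that concrete, here is a minimal sketch of one way such a cap could be enforced in code, assuming the tool can route its queries through SQLAlchemy's connection pool; the URL and the 10+5 split are illustrative, not a statement about the tool's actual stack:

from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://s52490:***@replica-host/s52490__hashtags_p",  # placeholder URL
    pool_size=10,      # connections kept open in the pool
    max_overflow=5,    # burst headroom; total cap = pool_size + max_overflow = 15
    pool_timeout=30,   # seconds to wait for a free connection before raising
    pool_recycle=250,  # replace connections older than 250 s, under the 300 s idle kill
)

with engine.connect() as conn:  # checkout from the bounded pool
    row = conn.execute(
        text("SELECT htrc_id FROM recentchanges"
             " WHERE rc_id = :rc_id AND htrc_lang = :lang LIMIT 1"),
        {"rc_id": 12345, "lang": "en"},
    ).fetchone()
# the connection is returned to the pool here instead of being left open

With a pool like this, a burst of work queues up behind 15 connections rather than opening hundreds of new ones.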

Also, is there an idle timeout or recommended connection keep alive time?

Connections that are idle for over 300 seconds are killed to avoid holding unused resources.

Cool, I'm also up at 2:20am still looking at this, so I think you can assume I take this seriously, despite my attempts at a friendly and community-driven tone. Our tools also have hundreds of downstream users.

Also, I think I have spotted an issue that would keep connections open in some cases, likely accumulating many of them over time. There were 338 simultaneous connections, but they were not opened simultaneously; many were simply not being closed, which is why so many of them appeared to reach that maximum 300-second lifespan before being killed.
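
For the record, a minimal sketch of the kind of fix this implies, assuming the connections were being opened per lookup and never closed; contextlib.closing() (or an explicit try/finally) guarantees the connection is released even on errors, so it cannot linger until the 300-second idle kill. The host and helper name are placeholders:

import contextlib
import pymysql

def fetch_htrc_id(rc_id, lang):
    # open, use, and always close a connection for one lookup
    with contextlib.closing(
        pymysql.connect(host="replica-host.example",         # placeholder host
                        database="s52490__hashtags_p",
                        read_default_file="~/replica.my.cnf")
    ) as conn, conn.cursor() as cursor:
        cursor.execute(
            "SELECT htrc_id FROM recentchanges"
            " WHERE rc_id = %s AND htrc_lang = %s LIMIT 1",
            (rc_id, lang),
        )
        row = cursor.fetchone()
        return row[0] if row else None

Either closing per-request connections like this or reusing a small shared pool would keep the count well under the suggested limit.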

We'll roll out a new version of the code and keep this thread posted. Thanks for your patience.

Hey again @jcrespo, we made some code changes and would request that you or another DBA raise the connection throttle to 20 or so max so we can confirm the fix.

Thanks,

Mahmoud

chasemp claimed this task.
chasemp subscribed.

Seems to have been OK for a while, so I'm going to resolve this for now.