| 53011504 | s52490 | 10.68.16.151:46092 | s52490__hashtags_p | Execute | 40 | Sending data FROM recentchanges WHERE rc_id = ? AND htrc_lang | 0.000 |
I cannot say exactly when it started, but I suspect it has been going on throughout the weekend. Most of the queries followed this pattern (approximately 1 per second):
SELECT htrc_id FROM recentchanges WHERE rc_id = ? AND htrc_lang = ? LIMIT 1
I only noticed after realizing there was slave delay on c1.
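For what it's worth, one way to cut down on per-change round trips would be to batch many rc_id lookups into a single query. This is only a sketch: the function name is hypothetical, and it assumes the same table and column names as the per-row query above, with `%s`-style placeholders as used by common MySQL drivers.

```python
def build_batched_lookup(rc_ids, lang):
    """Build one parameterized query replacing N one-row lookups.

    Hypothetical sketch: reuses the table/columns from the per-row
    query (SELECT htrc_id FROM recentchanges WHERE rc_id = ? AND
    htrc_lang = ? LIMIT 1), but fetches all requested rows at once.
    """
    if not rc_ids:
        raise ValueError("rc_ids must be non-empty")
    # One %s placeholder per rc_id, so values are still bound safely
    placeholders = ", ".join(["%s"] * len(rc_ids))
    sql = (
        "SELECT rc_id, htrc_id FROM recentchanges "
        f"WHERE htrc_lang = %s AND rc_id IN ({placeholders})"
    )
    return sql, [lang] + list(rc_ids)
```

A caller would then issue one query per batch of recent changes instead of one per change, which also means one connection instead of many short-lived ones.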
OK, Stephen and I both have work tomorrow and nothing has jumped out at us, bug-wise.
I'll just put it out there that there were no code changes and that this might just be due to a high number of hashtag-laden recent changes in the last 24 hours. In fact, with the latest bot hashtag additions, we may have to reassess our scalability expectations.
However, if you're looking for bugged code or runaway processes, this probably isn't it. Is it possible someone else is doing a heavy join, or maybe even has a leap-day bug? (That would be exciting.)
I do not consider this a laughing matter. The Labs replicas are a fundamental piece of infrastructure for Wikimedia wikis, and many communities rely on tools that depend on them working properly. Every day I get complaints when they do not work as expected, and I have to act on them.
There is no predetermined fixed limit: thirty 1-second queries can be fine, while 5 long-running connections can be very taxing if they consume a lot of memory. If your queries start affecting other users' queries (as was the case here, creating lag for other users), your connections will be throttled. If nothing is done to fix this, connections will continue to be limited; in extreme, uncooperative cases, the account's permission to access the replicas will be revoked. In your case, your 300-plus connections were each taking between 40 and 300 seconds to execute.
Please understand that you are sharing finite resources with 5,000 other users, so 338 concurrent queries from a single user is a lot. This is aggravated by the fact that we currently have one less server than usual. As a suggestion, a hard limit of 15 simultaneous connections may be wiser, and it should be enforced in code.
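Enforcing such a cap in code can be as simple as gating every replica query behind a bounded semaphore. The sketch below is a generic illustration, not the tool's actual code: the class name and `query_fn` parameter are hypothetical, and the limit of 15 comes from the suggestion above.

```python
import threading


class ConnectionLimiter:
    """Hard cap on simultaneous replica queries (sketch; 15 per the
    suggestion above). Callers block until a slot frees up, so the
    number of in-flight connections can never exceed the limit."""

    def __init__(self, max_connections=15):
        # BoundedSemaphore also raises if released too many times,
        # which catches accounting bugs early
        self._slots = threading.BoundedSemaphore(max_connections)

    def run(self, query_fn, *args):
        # The with-block releases the slot even if query_fn raises
        with self._slots:
            return query_fn(*args)
```

Each worker thread would call `limiter.run(execute_query, sql, params)` instead of opening a connection directly; the 16th concurrent caller simply waits rather than piling another connection onto the server.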
Cool, I'm also up at 2:20am still looking at this, so I think you can assume I take this seriously, despite my attempts at a friendly and community-driven tone. Our tools also have hundreds of downstream users.
Also, I think I have spotted an issue that keeps connections open in some cases, likely accumulating over time. There were 338 simultaneous connections, but they were not opened simultaneously: many were never being closed, which is why so many appeared to run for the maximum lifespan of 300 seconds before being killed.
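The usual fix for that failure mode is to tie each connection's lifetime to a `with` block, so `close()` runs even when the query raises. This is a generic sketch, not the tool's actual code; `connect` is a stand-in for whatever factory opens a replica connection.

```python
from contextlib import contextmanager


@contextmanager
def replica_connection(connect):
    """Yield a connection and guarantee close() runs even if the
    query raises -- the leak described above, where connections pile
    up until the server kills them at the 300 s cap."""
    conn = connect()
    try:
        yield conn
    finally:
        conn.close()
```

Usage would look like `with replica_connection(pool.connect) as conn: conn.execute(...)`; an exception inside the block propagates, but the connection is closed first instead of lingering.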
We'll roll out a new version of the code and keep this thread posted. Thanks for your patience.