
Tools still shaky; DB replicas to blame?
Closed, Resolved · Public

Description

T127066 is closed, but the issue persists. In particular, catscan2 seems to go down every few hours. To reiterate: it worked fine for months and ran into trouble around the time of the last DB server crash, though I do not know whether that was a coincidence.

Whatever changed on Labs, PLEASE FIX IT. We can't keep restarting our tools manually all the time.

Checking the heartbeat table, I find an "unknown" server with very high lag, plus some lag on s3. Might one of those be the cause?

MariaDB [wikidatawiki_p]> SELECT * FROM heartbeat_p.heartbeat;
+---------+----------------------------+---------+
| shard   | last_updated               | lag     |
+---------+----------------------------+---------+
| s6      | 2016-02-24T11:02:39.501070 |       0 |
| unknown | 2016-02-11T11:14:01.219140 | 1122517 |
| s7      | 2016-02-24T11:02:39.501110 |       0 |
| s3      | 2016-02-24T10:50:39.501080 |     719 |
| s4      | 2016-02-24T11:02:15.501190 |      23 |
| s1      | 2016-02-24T11:02:36.000590 |       2 |
| s5      | 2016-02-24T11:02:38.500810 |       0 |
| s2      | 2016-02-24T11:02:39.000970 |       0 |
+---------+----------------------------+---------+
8 rows in set (0.01 sec)
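
For reference, a minimal sketch of how a tool could read this lag programmatically before deciding to run heavy queries (assuming PHP with mysqli and an existing connection to any of the *_p replica databases; the function name is hypothetical):

<?php
// Read the replication lag of one shard from the heartbeat_p view shown above.
// $mysqli is an already-open connection to a replica database (e.g. wikidatawiki_p).
function get_replica_lag( mysqli $mysqli, $shard ) {
    $stmt = $mysqli->prepare( 'SELECT lag FROM heartbeat_p.heartbeat WHERE shard = ?' );
    $stmt->bind_param( 's', $shard );
    $stmt->execute();
    $stmt->bind_result( $lag );
    $found = $stmt->fetch();
    $stmt->close();
    return $found ? (int)$lag : null;   // null if the shard is not listed
}

// Hypothetical usage: postpone heavy work while a shard is far behind.
// if ( get_replica_lag( $mysqli, 's3' ) > 300 ) { /* warn the user or back off */ }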

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper.
Magnus triaged this task as High priority. · Feb 24 2016, 11:11 AM

"catscan2 seems to go down every few hours" is too vage. Can you provide details about what fails? Do user request fail? If they do, what queries where they doing? I do not have the logs of your application.

Whatever changed on Labs, PLEASE FIX IT.

We currently have one server fewer on Labs due to a hardware failure; I cannot fix that instantly. I am checking whether the server has to be repaired or replaced with brand-new hardware; in both cases, that has to be approved by management afterwards, and the providers can take days or weeks to supply replacements. We are providing the service with one server less right now.

I find an "unknown" server with a high lag

That unknown server is not a master (it is the old s2 master, and it shows up on that list because it stopped being a master 1122517 seconds ago). It is unrelated. I can delete it from that table if it annoys you, but please understand first how pt-heartbeat works.

There was recently a set of Tools users who took over all the resources on one of the servers (T127228); that could be related.

Can you provide some error logs of your application to debug it?

Well, I don't have many more details, other than what is already on T127066.

@valhallasw checked the PHP child processes in the other task (I can't do that, apparently) and didn't see too many PHP processes running, so it could be related to T104799.

If the issues are due to the broken server, I could understand all tools running a little slower. But catscan2 seems to fail quite often (as I write this, it has been up for less than 2 h and has already accumulated 70 "active requests", which means it will likely fail again soon). glamtools and autolist seem to be affected as well, but not as heavily. Why would one server less cause a tool to completely stall?
Did the server break around Feb 14?

I notice that the file system seems to be a little sluggish for catscan2 (after 'become catscan2'), more so than for other tools; might just be my imagination, though.

I only listed the heartbeat thing because it's the only anomaly I could find that might cause this.

UPDATE: While I wrote the above, catscan2 has become unresponsive again.

error.log shows an unspecified DB query error. Logging the command now. Restarting the webservice; maybe the new log will give me new insights.

So the DB error is our all-time favourite:

There was an error running the query [MySQL server has gone away]

So I'm now adding a new DB connection before EVERY SINGLE QUERY, as connections apparently drop within a few seconds. I thought that was fixed some time ago?

The labsdb1002 server went down at midnight UTC, in the night from Sun, Feb 14 to Mon, Feb 15.

If you have some time, chatting on IRC would be more interactive, but it is ok if you cannot.

Queries that take more than a certain limit are killed. I see a lot of heavy queries (from several users) on labsdb1003; that can cause some non-trivial queries to pile up (be slower than usual) and, correspondingly, end up being killed too.

I will now kill long-running queries that will not finish anyway, then send some of the traffic from labsdb1003 to labsdb1001, which is slightly less loaded, and temporarily limit the number of concurrent connections for certain heavy users until new hardware arrives.

Change 272965 had a related patch set uploaded (by Jcrespo):
labsdb1003 is a bit overloaded right now, move commonswiki to 1

https://gerrit.wikimedia.org/r/272965

Change 272965 merged by Jcrespo:
labsdb1003 is a bit overloaded right now, move commonswiki to 1

https://gerrit.wikimedia.org/r/272965

So, regarding other questions:

  • Idle connections are closed after 5 minutes on the replicas. This avoids reserving resources (which are allocated at connection time) while they are not used. Many applications use the "permanent connection / connection pool" pattern, but that is not workable for the replicas: if every user maintained permanent connections, we could not serve them even with 500 servers.

The right way to do it is not necessarily to create a new connection for every query, but to abstract a "query()" call: try the query, and if it fails because the connection dropped (which can happen for many reasons, not only this timeout), reconnect and retry. A sketch follows this list.

  • I do not know much about the application/grid/storage hosting; you should ask some of my workmates.
  • You are not expected to reboot your application every time, but you are expected to refresh the domain configuration from time to time. For example, if s2, which used to serve commonswiki, fails, the application should check whether the commonswiki host now points to a different server and connect to the new one; that way, a failover is fully transparent to the application.
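
A minimal sketch of that pattern (assuming PHP with mysqli, since catscan2 is a PHP tool; the class name, the "commonswiki.labsdb" service alias and the credential variables are illustrative assumptions, not the actual tool code):

<?php
// Reconnect-on-drop wrapper: connect lazily by service name, retry a query once
// if the connection was dropped, and never keep a permanent connection alive.
class ReplicaDB {
    private $host, $user, $password, $dbname;
    private $mysqli = null;

    public function __construct( $host, $user, $password, $dbname ) {
        $this->host     = $host;
        $this->user     = $user;
        $this->password = $password;
        $this->dbname   = $dbname;
    }

    private function connect() {
        // Always connect via the service name, never a cached IP, so a failover
        // is handled simply by reconnecting.
        $this->mysqli = new mysqli( $this->host, $this->user, $this->password, $this->dbname );
        if ( $this->mysqli->connect_errno ) {
            throw new Exception( 'Connect failed: ' . $this->mysqli->connect_error );
        }
    }

    // Try the query; on "MySQL server has gone away" (2006) or "lost connection"
    // (2013), reconnect once and retry.
    public function query( $sql ) {
        if ( $this->mysqli === null ) {
            $this->connect();
        }
        $result = $this->mysqli->query( $sql );
        if ( $result === false && in_array( $this->mysqli->errno, array( 2006, 2013 ) ) ) {
            $this->connect();
            $result = $this->mysqli->query( $sql );
        }
        return $result;
    }
}

// Hypothetical usage, with credentials taken from the tool's replica.my.cnf:
// $db  = new ReplicaDB( 'commonswiki.labsdb', $user, $password, 'commonswiki_p' );
// $res = $db->query( 'SELECT * FROM heartbeat_p.heartbeat' );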

I am still working to see what is making heavy usage of the third server.

catscan2 seems to hold its own for the moment; maybe the proper log/fail and DB reconnect does help. The commons switchover may have helped as well; thanks for that.

Just to be clear, it's not that my queries time out; the tool connects to the database, does something else for a few seconds, then runs a new query over the connection, but the server has gone away. And yes, I tried ping() earlier, but that fails as well, so I now reconnect immediately before running a query. This causes a lot of unnecessary reconnections, but I seem to have no other choice.

No more unknown entry on heartbeat; it was a non-issue left over from the s2 production master failover.

Aside from the lag sometimes created by user processes blocking replication updates, lag may be slightly higher than usual on s3 due to a format change on s3 from InnoDB to TokuDB. Because of the high number of wikis on s3, that could make some operations slightly slower, but it always catches up with production. If it keeps falling 5 minutes behind, I will partially revert the most problematic tables (probably the link* tables) to InnoDB.

c1:

MariaDB LABS localhost heartbeat_p > SELECT * FROM heartbeat;
+-------+----------------------------+------+
| shard | last_updated               | lag  |
+-------+----------------------------+------+
| s6    | 2016-02-24T13:52:40.501020 |    0 |
| s7    | 2016-02-24T13:52:40.501040 |    0 |
| s3    | 2016-02-24T13:52:28.000920 |   11 |
| s4    | 2016-02-24T13:52:40.501080 |    0 |
| s1    | 2016-02-24T13:52:40.500780 |    0 |
| s5    | 2016-02-24T13:52:40.500820 |    0 |
| s2    | 2016-02-24T13:52:40.500790 |    0 |
+-------+----------------------------+------+
7 rows in set (0.00 sec)

c3:

MariaDB LABS labsdb1001 heartbeat_p > SELECT * FROM heartbeat;
+-------+----------------------------+------+
| shard | last_updated               | lag  |
+-------+----------------------------+------+
| s6    | 2016-02-24T13:49:33.500990 |    0 |
| s7    | 2016-02-24T13:49:33.501170 |    0 |
| s3    | 2016-02-24T13:49:33.500970 |    0 |
| s4    | 2016-02-24T13:49:33.501150 |    0 |
| s1    | 2016-02-24T13:49:33.500800 |    0 |
| s5    | 2016-02-24T13:49:33.500870 |    0 |
| s2    | 2016-02-24T13:49:33.500750 |    0 |
+-------+----------------------------+------+
7 rows in set (0.01 sec)

Magnus: I know the labs databases are not in the best shape right now; I am trying to get resources where there are very few to spare.

I am working on getting new hardware, but until I have something definitive, approved and shipped, I cannot promise anything yet. I will make it my top priority now, as it is causing increasing issues.

I would, however, love to have your input on establishing some rules for shared DB usage, as I commented here: https://phabricator.wikimedia.org/T127266#2059315

There is also the possibility (again, something I cannot promise on my own) of putting some of your high-bandwidth tools on separate instances with semi-exclusive resources on separate labs machines, so that an increase in traffic on other tools does not affect yours. This is not preferential treatment; it is something I am suggesting to all tools that are very popular and load-sensitive (always depending on approval from the Labs folks and operations). And I think I am not the first one to suggest that.

First, thanks for trying to keep the engine running despite tight resources! :-)

catscan2 has now been up for as long as it was at the time of my initial post here, but the load is much lower. Let's hope it stays that way.

I previously converted all my "heavy" tools to a system that tries to avoid long-running queries by shoving data into flat temporary files as soon as possible and then running intersections etc. in the file system (a rough sketch of the idea follows below).
Is there a "longest-running queries by tool" overview somewhere? I do enjoy efficient code, and that includes queries, but some hint on where the issues are would be helpful.
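
For illustration, a rough sketch of that flat-file approach (hypothetical function names; it assumes single-column results such as page IDs and is not the actual catscan2 code):

<?php
// Keep each query short: dump its result to a temporary file immediately, so the
// database connection is released quickly, then do the set operations on disk.
function dump_query_to_file( $db, $sql ) {
    $path   = tempnam( sys_get_temp_dir(), 'qry' );
    $out    = fopen( $path, 'w' );
    $result = $db->query( $sql );            // e.g. the query wrapper sketched earlier
    while ( $row = $result->fetch_row() ) {
        fwrite( $out, $row[0] . "\n" );      // one page ID per line
    }
    $result->free();
    fclose( $out );
    return $path;
}

// Intersect two dumped result sets outside the database.
function intersect_files( $path_a, $path_b ) {
    $seen = array_flip( file( $path_a, FILE_IGNORE_NEW_LINES ) );
    $ids  = array();
    foreach ( file( $path_b, FILE_IGNORE_NEW_LINES ) as $id ) {
        if ( isset( $seen[$id] ) ) {
            $ids[] = $id;                    // present in both result sets
        }
    }
    return $ids;
}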

As for "productizing" my tools: as I said, I'm perfectly happy with that, provided that
(a) it is done one-by-one (so, you tell me tool X should be beefed up, run on its own VM, etc.), and not "let's productize ALL the tools!!1!";
(b) there is a clear upside to doing so. This could be isolating a tool from others so as not to degrade general performance, increasing available RAM, or giving it access to a separate replica; but we should check where this is actually necessary. catscan2, for example, should use little RAM in most situations (often <2 MB, which seems to be a lower bound for PHP), so more RAM might not do any good;
(c) there is some help with infrastructure: VM setup, puppetizing, etc. I don't have the bandwidth to get into Labs internals there.

This might be a case for the "Community Tech team"?

This might be a case for the "Community Tech team"?

Maybe. It sounds similar or related to what they told me they wanted to do when I asked them for Tools help (supporting tools users), but they said they needed some time for some people to change responsibilities. Again, please do not take the words of this humble DBA for granted; I will ask again to see whether that is a possibility (or you can ask yourself).

The idea is that I am not complaining about your tools; on the contrary, I have said multiple times that they deserve better support. But in a fully shared environment (even if we eventually get better hardware) I cannot guarantee that another tool will not go crazy and start taking over too many resources. We are slowly creating separate environments with virtual machines and containers to avoid that, and, more importantly, to provide full high availability.

Is there a "longest-running queries by tool" overview somewhere? I do enjoy efficient code, and that includes queries, but some hint on where the issues are would be helpful.

There is something, but it is shared with production and contains all users' queries, so it is not open to the public (it requires signing an NDA). I can, however, provide manually compiled summaries to Tools users on a one-to-one basis. I can do that for you if you give me a database user and a timespan (less than 1 week), provided of course that you are the owner of those tools. :-)

I am working on improving the database monitoring right now (we already have better replication lag monitoring), not only for Tools but for all MySQL servers. Expect more news soon.

jcrespo mentioned this in Unknown Object (Task). · Feb 25 2016, 9:51 AM
Magnus claimed this task.

In the year (!) since I filed this, I have rewritten CatScan2 as PetScan. The "unknown" server seems to have vanished. So, no.