
labsdb1001 crashing regularly in the last 2 days due to OOM
Closed, Resolved · Public

Description

For the last few days, the plagiabot API — for example, http://tools.wmflabs.org/eranbot/plagiabot/api.py?action=suspected_diffs — has frequently been returning an error message.

It goes out for a while, then somehow fixes itself, then goes out again. It's been down about half the time.

See also T162932

Event Timeline

Restricted Application added a subscriber: Aklapper.

@Anomie - Can I ask you to stop the BrokenRedirectDeleter queries? They correlate with the swapping and the eventual server crash: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=labsdb1001&var-network=eth0&from=1492109306997&to=1492195706997. They may not be the real cause, only the tipping point, but the high memory spikes correlate with long-running queries by this user. If the crashes don't stop, I will discard that theory and continue searching for the real cause.

If that doesn't help (this is the main user of the server), the other heavy hitters are the merl tool, concurrent xtools imports, and some random Quarry tools running multiple revision queries (but those don't seem to have as much impact on memory use).

jcrespo renamed this task from "Plagiabot API frequently returns an error" to "labsdb1001 crashing regularly in the last 2 days due to OOM". Apr 14 2017, 7:12 PM
jcrespo added a subscriber: Matthewrbowker.

I have imposed a long-running query limit of 1500 seconds. I know that is not ideal, but the choice is between a few queries being affected or the server continuing to crash, which affects all users.
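(Purely as a hedged illustration: one way a per-statement time limit can be enforced on MariaDB 10.1+ is the max_statement_time variable, shown below; the actual mechanism used on labsdb1001 is not stated in this task and may well be an external query killer instead.)

/* Hypothetical sketch only: abort any statement running longer than 1500 seconds.
   max_statement_time is measured in seconds; 0 means no limit. */
SET GLOBAL max_statement_time = 1500;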

It's been down about half the time

BTW, the downtime is unacceptable, but based on the actual stats the server only takes minutes to restart, and you can see on the query stats that other people continue querying the db as soon as it comes back: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=40&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1001&from=now-2d&to=now However, if unsafe engines like MyISAM or Aria are used, table unavailability is extended because of their poor recovery times and how prone they are to corruption; the solution to that will still require conversion to InnoDB.
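(For reference, a minimal sketch of such a conversion; the schema and table names below are purely illustrative, not ones mentioned in this task.)

/* Illustration only: convert an existing MyISAM/Aria table to InnoDB so it
   recovers cleanly after a restart instead of being marked as crashed. */
ALTER TABLE example_schema.example_table ENGINE=InnoDB;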

Another suggestion: I do not know the exact requirements of the tool, but at a guess, services like IRC/streaming recent changes plus a database on the toolsdb MySQL server may be more reliable.

I am not excusing the issues; I am still working on those. I am just giving extra tips that can help minimize problems. Also, the current replica service will soon increase its capacity with newer and hopefully more reliable servers.

Ok, I paused it for the moment. I hope I don't have to keep it paused for too long. It might restart on its own if Tool Labs decides to restart the job or something.

My logs show my own bot started having issues at 2017-04-13 06:22:34 UTC. At that timestamp it failed ("Lost connection to MySQL server during query") during a select-and-insert into the temporary table, but after that one failure it switched to erroring on a simple select from the table. When I tried to examine the bot's table it complained "Table './s51055__AnomieBOT/BrokenRedirectDeleter' is marked as crashed and should be repaired". I wonder whether that was what was making the bot's queries problematic. I just dropped it, since the bot can easily recreate it.
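(For anyone hitting the same "marked as crashed" error, a minimal sketch of the usual options; the table name is taken from the error message above, and dropping the table, as done here, is simply the easiest fix when the data can be regenerated.)

/* Inspect and attempt recovery of a MyISAM table reported as crashed. */
CHECK TABLE s51055__AnomieBOT.BrokenRedirectDeleter;
REPAIR TABLE s51055__AnomieBOT.BrokenRedirectDeleter;
/* Or, since the bot can rebuild the table from scratch, just drop it. */
DROP TABLE s51055__AnomieBOT.BrokenRedirectDeleter;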

Some of it could be the limiter I imposed (maybe it can be done in batches that take less time?). Please force the use of InnoDB when recreating it; InnoDB tables don't "crash".

The spiky situation stopped at 21:23; I am going to assume that is when you killed your query? https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=14&fullscreen&orgId=1&var-server=labsdb1001&var-network=eth0&from=1492191705657&to=1492213305657 Can you change it so that it doesn't build up to 40GB of memory 0:-)? Changing it to InnoDB may actually help (I am not sure of all the details; my view is limited)... Sorry, but I need to ask you to save memory here (and I do not expect you to have known the impact in advance).

I thought InnoDB was the default. I'll have it explicitly specify that.

I'd be happy to change it to use less memory, but I have no idea what about it might be using that much.

Until March 15, it was doing what had been a fairly straightforward query in its main loop:

SELECT rd_from, p1.page_namespace, p1.page_title, rd_interwiki, rd_namespace, rd_title, rd_fragment
FROM redirect
    JOIN page AS p1 ON(rd_from = p1.page_id)
    LEFT JOIN page AS p2 ON(rd_namespace=p2.page_namespace AND rd_title=p2.page_title)
WHERE rd_namespace >= 0
    AND (p2.page_id IS NULL OR rd_interwiki != '' AND rd_interwiki IS NOT NULL)
    /* AND rd_from > X */
ORDER BY rd_from
LIMIT 10

There was some potential for slowness, since the join would probably be filtering out a lot of rows to get the 10, but it wasn't bad when I wrote it three years ago.
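(For anyone reproducing this, a hedged illustration of how the regression shows up; this is not output from the actual server, just the standard way to inspect the plan.)

/* EXPLAIN shows whether the optimizer can return rows in rd_from order directly
   or has to materialize and sort the result ("Using filesort"), which is the
   slow behaviour described below. */
EXPLAIN SELECT rd_from, p1.page_namespace, p1.page_title, rd_interwiki, rd_namespace, rd_title, rd_fragment
FROM redirect
    JOIN page AS p1 ON(rd_from = p1.page_id)
    LEFT JOIN page AS p2 ON(rd_namespace=p2.page_namespace AND rd_title=p2.page_title)
WHERE rd_namespace >= 0
    AND (p2.page_id IS NULL OR rd_interwiki != '' AND rd_interwiki IS NOT NULL)
ORDER BY rd_from
LIMIT 10;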

But while debugging a different issue recently, I noticed that that query was now filesorting and taking a really long time to return any results. So I changed it to use a temporary table populated with batched queries that each completed in a second or two, which the main loop would then consume. With the latest change to the CREATE, the series of statements will now be:

CREATE TABLE IF NOT EXISTS $schema.BrokenRedirectDeleter (
    rd_from INT UNSIGNED NOT NULL PRIMARY KEY,
    rd_interwiki VARBINARY(32) NOT NULL DEFAULT '',
    rd_namespace INT NOT NULL,
    rd_title VARBINARY(255) NOT NULL,
    rd_fragment VARBINARY(255) NOT NULL DEFAULT ''
) ENGINE InnoDB DEFAULT CHARSET=binary;

SELECT COUNT(*) FROM $schema.BrokenRedirectDeleter;

/* If that count returns 0 or it has been 6 hours since the last time the table was updated, { */
 TRUNCATE TABLE $schema.BrokenRedirectDeleter;

 SELECT MAX(rd_from) FROM redirect;

 /* For X from 0 to MAX (which is currently almost 54000000) in steps of 1000000 { */

  INSERT INTO $schema.BrokenRedirectDeleter SELECT rd_from, rd_interwiki, rd_namespace, rd_title, rd_fragment
  FROM redirect
      JOIN page AS p1 ON(rd_from=p1.page_id)
      LEFT JOIN page AS p2 ON(rd_namespace=p2.page_namespace AND rd_title=p2.page_title)
  WHERE rd_namespace >= 0
      AND (p2.page_id IS NULL OR rd_interwiki != '' AND rd_interwiki IS NOT NULL)
      AND rd_from >= $X AND rd_from < $X+1000000;

 /* } end for loop */

/* } end if block */

/* Then the main loop, processing the table { */

SELECT rd_from, p1.page_namespace, p1.page_title, rd_interwiki, rd_namespace, rd_title, rd_fragment
FROM $schema.BrokenRedirectDeleter
    JOIN page AS p1 ON(rd_from = p1.page_id)
ORDER BY rd_from
LIMIT 10;

DELETE FROM $schema.BrokenRedirectDeleter WHERE rd_from IN ( /* ... */ );

/* } end main loop */

/* Wait 6 hours, then start over */

And to reiterate, according to my logs my bot was failing at the SELECT COUNT(*) FROM $schema.BrokenRedirectDeleter; line for almost all of the time you're seeing the spikiness. Personally, I suspect that "crashed" state was somehow causing the database to eat that 40GB while trying to read from the table.

So the server hasn't crashed again in almost 3 days, which could be due either to the limit Jaime set (it is killing other slow queries) or to @Anomie stopping the query that was actually making the server crash.

mysql:root@localhost [(none)]> show global status like 'uptime';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| Uptime        | 241036 |
+---------------+--------+
1 row in set (0.01 sec)

I thought InnoDB was the default. I'll have it explicitly specify that.

It is for the new labsdb servers, not for the old ones.
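(A hedged aside: two standard ways to check this, shown purely as an illustration; the schema and table names are the ones mentioned earlier in this task.)

/* Server-wide default used when CREATE TABLE has no ENGINE clause. */
SHOW VARIABLES LIKE 'default_storage_engine';
/* Engine actually used by an existing table. */
SHOW TABLE STATUS FROM s51055__AnomieBOT LIKE 'BrokenRedirectDeleter';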

So I changed it to use a temporary table populated with batched queries that each completed in a second or two, which the main loop would then consume.

That should definitely help. I would suggest we re-enable it after the DC switchover and see how the server behaves?

I've already gotten an inquiry from someone who noticed the task wasn't running.

That should definitely help. I would suggest we re-enable it after the DC switchover and see how the server behaves?

Note that, other than the CREATE TABLE now specifying ENGINE=InnoDB and CHARSET=binary, there are no changes to the SQL from what the bot has already been doing for the past month (i.e. since well before this issue, as far as I can tell).

according to my logs my bot was failing at the SELECT COUNT(*) FROM $schema.BrokenRedirectDeleter; line for almost all of the time

I made it fail myself because of how badly it was affecting the server. And the fact that it failed doesn't mean it wasn't taking resources; queries do that from the moment you connect.

My point is that there's no reason I can think of besides the DB corruption that would cause that SELECT query to consume 40GB of memory. Do you have a reason in mind?

Do you have a reason in mind?

MyISAM's caching is far simpler and worse than InnoDB's, which is why we suggested converting it. It could have been taking 40GB all this time and only got detected when the server started crashing. Honestly, I do not ask questions; I just look at the heavy hitters and create a ticket after something goes too badly wrong. After all, labs is the place to experiment and run expensive queries. We have disabled detailed statistics on labs, as profiling there takes too many resources. I could give it a deeper look if you really want to know, but normally not using MyISAM is the solution to most problems.

You made two requests of me:

  1. Switch to InnoDB.
  2. Make sure it doesn't take 40GB of memory.

I've already done #1, and said so four days ago. I have no idea why #2 might have ever happened besides my guess about the corruption; it sounds like you're just guessing too. And considering the lack of 40GB spikes prior to April 13 in the graph you linked, despite my bot making the same queries since March 15, I doubt your guess that it might have been happening all along.

So is there anything else you'd have me change besides switching to InnoDB? Or can we consider #2 done as well as it's able to be done and let me restart the task now?

So is there anything else you'd have me change besides switching to InnoDB?

Yes: #2 is to monitor that graph and, if the spikes come back, avoid that query.

Ok. For the initial run after restarting, I see no deviation outside the norm in the graph. The SELECT that had been causing the spikiness happened sometime between 14:19:19 and 14:19:20.

I also ran the SELECT manually a few extra times and it completed in under 0.01 seconds every time as expected. From the timings in my logs, when it was crashing that SELECT was running for around 50 minutes (while building up that 40GB memory usage).

The next scheduled run of the task is at 2017-04-18 20:34:23 UTC.

and it completed in under 0.01 seconds

Yeah, that looks much better :-) Sorry I cannot give labs applications as much time as I would love to, but the support there is minimal to nonexistent, not by choice, but because we cannot help thousands of users as well as we would like. All I can do is quickly create a ticket when something goes wrong and ask for things to be changed, or for users to help each other.

I may have misunderstood your instruction to "make sure it doesn't take 40GB" as meaning "find the cause and fix it before you can restart" rather than "watch the graph to make sure it's not still happening when you restart". Since you didn't seem to agree that the corrupted table was the probable cause, I was feeling blocked by not having any other workable theory as to the cause.

Second run, no deviation outside the norm in the graph. The next scheduled run is at 02:41:42 UTC.

Third run, no deviation outside the norm in the graph. The next scheduled run is at 08:51:21. I'm not going to be awake for that one.

When I checked the 24-hour graph this morning I saw there was an approximately 10GB increase in memory usage on 2017-04-20 from about 13:44 to 14:15 UTC (graph). This does not correspond to a run of AnomieBOT's BrokenRedirectDeleter; the closest runs to that time were 09:39-09:52 and 15:52-16:02.

jcrespo assigned this task to Anomie.

I think the actions taken here and on the subticket will be enough to ensure the reliability of labsdb1001, at least until we switch to the larger servers.