
Query from stat1003 brought down db1047
Closed, Resolved · Public

Description

tbayer: your interactive query brought down the db1047 server. It needs optimization:

Hits	Tmax (s)	Tavg (s)	Tsum (s)	Hosts	Users	Schemas
320	661,655	660,060	211,219,208	db1047	research	
SELECT dupesperday.yearmonthday, ROUND(100*dupes/totalevents, 2) AS dupes_percentage FROM (SELECT yearmonthday, SUM(copies-1) AS dupes FROM (SELECT LEFT(timestamp, 8) AS yearmonthday, timestamp, userAgent, clientIp, webHost, event_action, COUNT(*) AS copies FROM log.MediaViewer_10867062 GROUP BY yearmonthday, timestamp, userAgent, clientIp, webHost, event_action HAVING copies>1) AS dupeslist GROUP BY yearmonthday) AS dupesperday JOIN (SELECT LEFT(timestamp, 8) AS yearmonthday, COUNT(*) AS totalevents FROM log.MediaViewer_10867062 WHERE timestamp LIKE '201508%' GROUP BY yearmonthday) AS totaleventsperday ON dupesperday.yearmonthday = totaleventsperday.yearmonthday ORDER BY yearmonthday /* 1985750a9eb384d23630677d64dad87f db1047 658465s */
320	651,050	649,455	207,825,608	db1047	research	
SELECT dupesperday.yearmonthday, ROUND(100*dupes/totalevents, 2) AS dupes_percentage FROM (SELECT yearmonthday, SUM(copies-1) AS dupes FROM (SELECT LEFT(timestamp, 8) AS yearmonthday, timestamp, userAgent, clientIp, webHost, event_action, COUNT(*) AS copies FROM log.MediaViewer_10867062 GROUP BY yearmonthday, timestamp, userAgent, clientIp, webHost, event_action HAVING copies>1) AS dupeslist GROUP BY yearmonthday) AS dupesperday JOIN (SELECT LEFT(timestamp, 8) AS yearmonthday, COUNT(*) AS totalevents FROM log.MediaViewer_10867062 GROUP BY yearmonthday) AS totaleventsperday ON dupesperday.yearmonthday = totaleventsperday.yearmonthday ORDER BY yearmonthday /* 2526042c364d5984f1ef815f7bc09e3b db1047 647860s */

Event Timeline

The two queries above had been running for 7 days.

Mentioned in SAL [2016-05-25T16:13:00Z] <volans> kill very long query on db1047 (ID 89274525, client disconnected) T136214

I've also stopped this 18-hour INSERT on dbstore1002; it needs batching:

dbstore1002	311131181	research	stat1003		18h
INSERT INTO staging.tbayer_readnavsessions_20160107 SELECT SUM(readseconds) AS readtime, SUM(navseconds) AS navtime, isTestA, sessionId, PageId, wiki FROM (SELECT @timest:=ordered_data.timest, IF( (@timest > @prev_timest +2) AND (sessionId = @prev_sessionId) AND (pageId = @prev_PageId) AND (wiki = @prev_wiki), @timest - @prev_timest, 0) AS readseconds, IF( (@timest <= @prev_timest +2) AND (sessionId = @prev_sessionId) AND (pageId = @prev_PageId) AND (wiki = @prev_wiki), @timest - @prev_timest, 0) AS navseconds, isTestA, @prev_timest:= @timest, @prev_sessionId:=sessionId, @prev_PageId:=PageId, @prev_wiki:=wiki, sessionId, eventName, PageId, wiki FROM (SELECT event_isTestA AS isTestA, event_sessionId AS sessionId, 86400*MID(timestamp,7,2)+3600*MID(timestamp,9,2)+60*MID(timestamp,11,2)+MID(timestamp,13,2) AS timest, event_eventName AS eventName, event_PageId AS PageId, wiki FROM log.MobileWebSectionUsage_15038458 WHERE timestamp LIKE '20160107%' AND event_isTablet = 0 AND event_namespace = 0 AND event_pageId <> 0 ORDER BY isTestA, event_sessionId, pageId, wiki, timest) AS ordered_data ORDER BY isTestA, sessionId, pageId, wiki, timest) AS events_with_readnavtime GROUP BY isTestA, sessionId, pageId, wiki ORDER BY isTestA, sessionId, pageId, wiki

OK, the first query took way longer than intended or expected (from having run analogous queries on other tables); I actually checked on it yesterday and thought about killing it but decided to give it just another day. No objection there.

I'm a bit puzzled about why the second query was killed, though. I have run quite a few queries on this table (MobileWebSectionUsage_15038458) that took 24h or more and completed without issues, apart from being slow. And I need the data from this one pretty soon. Also, what precisely do you mean by batching here? This query was already run in batch mode (https://dev.mysql.com/doc/refman/5.7/en/batch-mode.html) using screen.

@Tbayer A SELECT can be long-running (maybe not 7 days, but as long as the server allows). DMLs (INSERT, UPDATE, DELETE) should be shorter because in the event of a rollback, they could block further queries for a long time. If you could divide the query into smaller ones, even if all together they take the same amount of time, and commit every few rows, that would avoid issues (see the sketch at the end of this comment). MySQL, unlike other tools, is not really well suited for analytics, so we have to work around its limitations in some cases.

Please note that there are many users sharing the same resources, so if we could do the same work without affecting other users, everybody would be happy :-)

While the query may need some review to avoid the many nested subqueries that make it inefficient, maybe smaller timestamp periods could be used in the meantime?
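
As a deliberately simplified sketch of the batching idea (the table and column names here are placeholders, not the real schema): instead of one INSERT ... SELECT covering the whole day, the same statement is run once per hour, so that with autocommit each statement commits on its own and a rollback never has to undo more than one hour's worth of rows. Whether an hourly split is acceptable for this particular analysis is discussed further below.

-- Hypothetical simplified example: staging.dst and log.src stand in for the real tables.
INSERT INTO staging.dst (ts, sessionId, wiki)
SELECT timestamp, event_sessionId, wiki
FROM log.src
WHERE timestamp LIKE '2016010700%';

INSERT INTO staging.dst (ts, sessionId, wiki)
SELECT timestamp, event_sessionId, wiki
FROM log.src
WHERE timestamp LIKE '2016010701%';

-- ... and so on, one statement per hour, through '2016010723%'.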

DMLs (INSERT, UPDATE, DELETE) should be shorter because in the event of a rollback, they could block further queries for a long time.

I see. But to clarify just in case, the table into which that query wrote is an ad-hoc one that is only used by myself for now, so there would be no problem if queries to it were blocked. Do you mean that queries to all tables in the entire staging database would be blocked?

In fact, I already ran pretty much the same query earlier without issue, inserting the same amount of rows into such an ad-hoc table (tbayer_test3, 1.5 million rows, orders of magnitude smaller than many other tables in staging).

While the query may need some review to avoid the many nested subqueries that make it inefficient, [...]

I'd love advice on how to make it more efficient. (For those reading along: we are *not* talking about the query that brought down the server, but about the one that was killed with the rationale that it had been doing inserts for 18 hours. An almost identical version of it had completed on Saturday, April 30, within 26 hours.)

I have reproduced it below in easier-to-read formatting. To give some background: this query, part of T128536, processes a day's worth of data by sorting events (single user actions, in particular tapping on a section heading or scrolling a heading into view) into sessions, and then calculating two metrics for each session: time spent reading vs. time spent navigating. (Those depend on the ordering of events within a session, i.e. on the results of the inner query.) It may or may not be possible to reduce the three nested queries to two; I haven't given this a lot of thought yet. I had already removed one nesting level by leaving the calculation of the end results to a further query that will work on the table resulting from this one, instead of adding a fourth nested query around it (as I had done when testing these calculations on shorter timespans like 10 minutes).

[...] maybe smaller timestamp periods could be used in the meantime?

Because we already found a strong variation by time of day in other aspects of this data set, I don't feel comfortable basing conclusions on data that covers less than 24h. This query already represented a compromise performance-wise, because a week's worth of data would be preferable considering there is also well-known weekly seasonality of reader behavior on the mobile web version of Wikipedia.

CREATE TABLE staging.tbayer_readnavsessions_20160107 (readtime int, navtime int, isTestA tinyint(1), sessionId varbinary(191), pageId bigint(20), wiki varbinary(191));
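-- The user variables below carry the previous row's values, so that each event can be compared
-- with the one before it and the gap between them classified as reading vs. navigation time.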
SET @timest = 0;
SET @prev_timest = 0;
SET @prev_sessionId = '';
SET @prev_PageId = 0;
SET @prev_wiki = '' COLLATE utf8_unicode_ci;
INSERT INTO staging.tbayer_readnavsessions_20160107 
SELECT SUM(readseconds) AS readtime, SUM(navseconds) AS navtime, isTestA, sessionId, PageId, wiki FROM
	(SELECT
		@timest:=ordered_data.timest,
		IF( (@timest > @prev_timest +2) AND (sessionId = @prev_sessionId) AND (pageId = @prev_PageId) AND (wiki = @prev_wiki),
		   @timest - @prev_timest, 0) AS readseconds,
		IF( (@timest <= @prev_timest +2) AND (sessionId = @prev_sessionId) AND (pageId = @prev_PageId) AND (wiki = @prev_wiki),
		   @timest - @prev_timest, 0) AS navseconds,
		isTestA,
		@prev_timest:= @timest, @prev_sessionId:=sessionId, @prev_PageId:=PageId, @prev_wiki:=wiki,
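		-- The @prev_* variables are updated only after the IF() comparisons above have read them.
		-- (This relies on MySQL evaluating the select list left to right, which is not guaranteed.)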
		sessionId, eventName,
		PageId,
		wiki FROM
			(SELECT event_isTestA AS isTestA, event_sessionId AS sessionId,
			86400*MID(timestamp,7,2)+3600*MID(timestamp,9,2)+60*MID(timestamp,11,2)+MID(timestamp,13,2) AS timest,
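			-- timest: event time in seconds (day*86400 + hour*3600 + minute*60 + second);
			-- only the difference between consecutive rows is used in the outer query above.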
			event_eventName AS eventName, event_PageId AS PageId, wiki
				FROM log.MobileWebSectionUsage_15038458
				WHERE timestamp LIKE '20160107%' AND event_isTablet = 0 AND event_namespace = 0 AND event_pageId <> 0
				ORDER BY isTestA, event_sessionId, pageId, wiki, timest) AS ordered_data
	ORDER BY isTestA, sessionId, pageId, wiki, timest) AS events_with_readnavtime
GROUP BY isTestA, sessionId, pageId, wiki
ORDER BY isTestA, sessionId, pageId, wiki

[...]

Please note that there are many users sharing the same resources, so if we could do the same work without affecting other users, everybody would be happy :-)

I absolutely want to be considerate to other users. But I am not aware of any tool to find out the database server's current load and how much of it is being caused by my own queries. Is there any documentation about this, and if not, can we please add some? (For Hadoop, there is at least some related material here.)

we are *not* talking about the query that brought down the server

We are not talking about that, but dbstore1002 had a memory exhaustion problem yesterday (T136333), so everything that makes queries faster and more efficient helps.

I do not expect you to check the load of the server; that is my job. In general, you are free to run anything you want, but if an admin mentions an issue to you, you are expected to see whether things can be done to speed up queries and lower resource usage. I am not blaming you for anything, nor are you the only person doing heavy queries; I just wanted to point out these two particular ones.

You can ask for help with that. I have also written several guides about MySQL query optimization: http://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57
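
For reference, a minimal client-side way to see which of one's own queries are currently running and for how long (without the PROCESS privilege an account only sees its own threads anyway):

-- Lists this account's currently running statements, longest-running first.
-- TIME is in seconds; INFO is truncated only to keep the output readable.
SELECT id, db, time AS seconds_running, LEFT(info, 120) AS query_excerpt
FROM information_schema.PROCESSLIST
WHERE info IS NOT NULL
ORDER BY time DESC;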

If you could divide the query into smaller ones, even if all together they take the same amount of time, and commit every few rows, that would avoid issues.

One could split that day's worth of data into hours, say, but that would introduce significant truncation errors: sessions that span an hour boundary would be split in two and thus distort the results, since the split sessions would of course be artificially shorter. (See my explanation of the query's structure above.) One could mitigate that by having the timespans overlap and adding code that removes duplicate sessions, but that would probably be much more computationally intensive, defeating the purpose of this batching exercise.

If, however, your batching suggestion was based on other ideas about how to separate the inserted rows into batches, let me know.

MySQL, unlike other tools, is not really well suited for analytics, so we have to work around its limitations in some cases.

What other tools would you recommend instead for such analytics tasks? (Or was that a general comment about our analytics infrastructure setup?)

Just came across this ticket and, given how old it is: is this still an issue, or can it be closed?