Quarry should refuse to save results that are way too large
Open, HighPublic

Description

Ref: T172086 T188538
When people run queries like:

USE plwiki_p;
SELECT rev_page, page_title, rev_timestamp FROM revision JOIN page ON rev_page = page_id ORDER by rev_timestamp;
SELECT *;

(https://quarry.wmflabs.org/query/25152)
Things like this happen:

graphite-labs.wikimedia.org.png (250×800 px, 34 KB)

Event Timeline

https://www.mediawiki.org/wiki/Topic:U1le4hrq6eunlafz says 500k works. No, that's way too large; you should be using dumps if you want that much unfiltered data. Going to limit it to 2^16 = 65536 rows per resultset.
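For illustration, such a guard could look roughly like this in the worker (a sketch with hypothetical names, not the merged patch):

MAX_ROWS_PER_RESULTSET = 2 ** 16  # 65536

class ResultTooLargeError(Exception):
    """Raised when a resultset exceeds the row limit."""

def check_resultset_size(rows):
    """Refuse to save resultsets with more rows than the limit."""
    if len(rows) > MAX_ROWS_PER_RESULTSET:
        raise ResultTooLargeError(
            'Resultset has %d rows; refusing to save more than %d'
            % (len(rows), MAX_ROWS_PER_RESULTSET))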

Change 415489 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[analytics/quarry/web@master] worker: refuse to save if rowcount is > 65536

https://gerrit.wikimedia.org/r/415489

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-cloud) [2018-03-01T00:37:23Z] <zhuyifei1999_> UPDATE query join query_revision on query.latest_rev_id = query_revision.id join query_run on latest_run_id = query_run.id SET status=1 where (status = 2 or status = 1) and query_run.timestamp <= DATE_ADD(NOW(), INTERVAL -1 HOUR); 251 rows affected (1.81 sec) T139162 T172086 T188564

Change 415489 merged by jenkins-bot:
[analytics/quarry/web@master] worker: refuse to save if rowcount is > 65536

https://gerrit.wikimedia.org/r/415489

I regularly run queries for ORES that return results in the range of 200k rows. These results are usually just a single numerical column and thus (probably) do not take up much space. These queries generally finish in less than 10 seconds.

E.g. https://quarry.wmflabs.org/query/25502

(Alternative: time limit the save to 60 seconds, kill it with a signal handler for SIGALRM)
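Roughly, the SIGALRM-based approach looks like this (a sketch under the 60-second assumption above; function names are illustrative, not the actual patch):

import signal

class SaveTimeoutError(Exception):
    """Raised when saving the resultset takes too long."""

def _on_alarm(signum, frame):
    raise SaveTimeoutError('saving results exceeded the time limit')

def save_with_time_limit(save_func, limit_seconds=60):
    """Run save_func(), aborting via SIGALRM if it exceeds the limit."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(limit_seconds)
    try:
        save_func()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)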

Change 419317 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[analytics/quarry/web@master] worker: change to SIGALRM-based limit instead of row

https://gerrit.wikimedia.org/r/419317

Change 419317 merged by jenkins-bot:
[analytics/quarry/web@master] worker: change to SIGALRM-based limit instead of row

https://gerrit.wikimedia.org/r/419317

Mentioned in SAL (#wikimedia-cloud) [2018-03-15T15:56:45Z] <zhuyifei1999_> deploying d653400 to quarry-runner-0{1,2} T188564

60 seconds is definitely not enough for the query. Switched to the unlimited version, then ran: https://quarry.wmflabs.org/query/25564
Browser reports a timestamp of Thu, 15 Mar 2018 16:31:50 GMT in date header for the POST.

$ while curl -s https://quarry.wmflabs.org/run/244898/status | grep -i running; do date; sleep 5; done
{"status": "running", "extra": {"connection_id": 12687695}}
Thu Mar 15 11:32:14 CDT 2018
{"status": "running", "extra": {"connection_id": 12687695}}
Thu Mar 15 11:32:20 CDT 2018
{"status": "running", "extra": {"connection_id": 12687695}}
Thu Mar 15 11:32:25 CDT 2018
{"status": "running", "extra": {"connection_id": 12687695}}
Thu Mar 15 11:32:31 CDT 2018
{"status": "running", "extra": {"connection_id": 12687695}}
Thu Mar 15 11:32:36 CDT 2018
[...]
{"status": "running", "extra": {"connection_id": 12687695}}
Thu Mar 15 11:36:15 CDT 2018
{"status": "running", "extra": {"connection_id": 12687695}}
Thu Mar 15 11:36:20 CDT 2018
{"status": "running", "extra": {"connection_id": 12687695}}
Thu Mar 15 11:36:26 CDT 2018

Took 5 mins. I'll double it to 10 mins.

Change 419786 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[analytics/quarry/web@master] worker: raise the save time limit to 10 mins

https://gerrit.wikimedia.org/r/419786

For the record, the above is crazy. It's writing less than a megabyte and it takes 5 minutes! :| This works for now, but getting off of NFS (T178520) seems to be a better long-term solution.

Change 419786 merged by jenkins-bot:
[analytics/quarry/web@master] worker: raise the save time limit to 10 mins

https://gerrit.wikimedia.org/r/419786

Mentioned in SAL (#wikimedia-cloud) [2018-03-16T00:29:16Z] <zhuyifei1999_> deploying fc109c2 to both runners T188564

Mentioned in SAL (#wikimedia-cloud) [2018-03-24T03:21:27Z] <zhuyifei1999_> revert back to d9cc1c8 again on quarry-runner0{1,2} T188564 T190608

Change 427036 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[analytics/quarry/web@master] Revert recent changes regarding maximum time / size of query results

https://gerrit.wikimedia.org/r/427036

Change 427036 merged by jenkins-bot:
[analytics/quarry/web@master] Revert recent changes regarding maximum time / size of query results

https://gerrit.wikimedia.org/r/427036

@Framawiki We really should fix this.... https://quarry.wmflabs.org/query/30742 3 million rows... Executed in 9499.75 seconds. Any idea how?

zhuyifei1999 raised the priority of this task from High to Unbreak Now!. Nov 27 2018, 6:46 PM

Mentioned in SAL (#wikimedia-cloud) [2018-11-27T18:54:20Z] <zhuyifei1999_> triggered OOM killer on quarry-worker-02 9 times T188564

quarry-worker-02 was practically dead.

Yikes! I wonder if we could use the query optimization output to decide to not even start some queries. Does Quarry use celery timeouts to kill queries? I've found those to be a bit unreliable in ORES, so we use a signal-based timeout instead. We could try something like that.

It currently only kills queries that MariaDB knows have been executing
on the database for longer than 30 mins; how long it takes to store the
query results is completely disregarded, until the OOM killer kicks in.
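For reference, that 30-minute kill amounts to something like the following watchdog (a sketch, not Quarry's actual code; it assumes a periodic run and a DB-API connection to the replica, e.g. via pymysql):

QUERY_TIME_LIMIT = 30 * 60  # seconds

def kill_long_running_queries(conn):
    """Kill user queries that have been executing for over 30 minutes.
    `conn` is any DB-API connection to the replica (e.g. pymysql)."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT ID FROM information_schema.PROCESSLIST "
            "WHERE COMMAND = 'Query' AND TIME > %s", (QUERY_TIME_LIMIT,))
        for (thread_id,) in cur.fetchall():
            cur.execute('KILL QUERY %d' % thread_id)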

The problem is, it is completely fine for the Quarry infrastructure to run
a query on the database for, say, 15 mins, but it is not fine to spend 15 mins
storing the results, which would cause a huge NFS IO lag for that duration and
slow down any other queries that are in the process of being saved.

Yes, I guess using the optimization plan is indeed a good idea. If the join
type is ALL, without ‘Using Where’, and it takes longer than 10 secs to
execute, it is probably bad... except, how long does ‘SELECT page_id FROM
page LIMIT 1000000’ take to execute?
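That plan check could be sketched like this (an illustration of the idea, not an implemented feature; assumes a DB-API cursor on the replica and MariaDB's tabular EXPLAIN output):

def looks_like_full_scan(cursor, sql):
    """Heuristic: does any row of the plan have join type ALL
    without 'Using where' in the Extra column?"""
    cursor.execute('EXPLAIN ' + sql)
    columns = [desc[0] for desc in cursor.description]
    for row in cursor.fetchall():
        plan = dict(zip(columns, row))
        if (plan.get('type') == 'ALL'
                and 'Using where' not in (plan.get('Extra') or '')):
            return True
    return False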

zhuyifei1999 lowered the priority of this task from Unbreak Now! to High. Nov 27 2018, 9:53 PM

(Lowered because the offending processes have been killed)

Unless we can find a system similar to NFS but lighter, we should change the workers' architecture (highly related to T178520: Find somewhere else (not NFS) to store Quarry's resultsets). Here are a few ideas, probably stupid or already discarded, but that I want to raise:

If they were not stored as sqlite files, would the problem be partially resolved? I have trouble seeing the benefit of sqlite if it's not for save/access time, since data extraction in the app is done on whole tables without any selection.

Is the save time for sqlite longer than simply writing a csv file? Would it make a difference if we compressed this csv data in memory before saving it, so there is less data to write to NFS?

The problem is that the save at the end of the workers' task is too long / too resource-hungry. What if we sent the raw results to a redis server, let those results be immediately available to the web server, and had a dedicated host take care of the permanent save "when it can + as fast as it can"?

One could almost imagine, in the same way, that via websockets or another similar technology (or even an HTTP resultset page available only once) the workers would send the results directly to the end client as soon as they have them, and then store them quietly, or send them to another server dedicated to saving them, via redis.

If they were not stored as sqlite files, would the problem be partially resolved? I have trouble seeing the benefit of sqlite if it's not for save/access time, since data extraction in the app is done on whole tables without any selection.

This could work. We might need to reinvent the wheel a bit: a way of storing the data types, compression, metadata, column names, etc. A benefit would be that it could easily fix T170464.
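For instance (a purely hypothetical format, just to illustrate what such wheel-reinvention might involve): a gzipped flat file whose first line is a JSON header with column names and types, followed by one JSON-encoded row per line.

import gzip
import json

def write_resultset(path, columns, types, rows):
    """Write a resultset as a gzipped file: JSON header line, then rows."""
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        f.write(json.dumps({'columns': columns, 'types': types}) + '\n')
        for row in rows:
            f.write(json.dumps(row) + '\n')

def read_resultset(path):
    """Read back the header and all rows."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        header = json.loads(f.readline())
        rows = [json.loads(line) for line in f]
    return header, rows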

Is the save time for sqlite longer than simply writing a csv file?

Dunno, haven't done the statistics.

Would it make a difference if we compressed this csv data in memory before saving it, so there is less data to write to NFS?

You mean, save to a non-NFS mount (instance local storage or tmpfs) and then copy the file over to the NFS mount? That could work. I remember that doing a git clone to non-NFS storage and then copying was faster than a direct git clone onto NFS, on Toolforge a while back.
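A sketch of that write-locally-then-copy approach (paths and names are illustrative):

import os
import shutil
import tempfile

def write_then_copy_to_nfs(write_func, nfs_path):
    """Write the resultset to instance-local storage first, then move
    the finished file to NFS in a single sequential copy."""
    fd, local_path = tempfile.mkstemp(suffix='.sqlite', dir='/tmp')
    os.close(fd)
    try:
        write_func(local_path)                 # many small writes hit local disk
        shutil.copyfile(local_path, nfs_path)  # one big copy hits NFS
    finally:
        os.remove(local_path)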

The problem is that the save at the end of the workers' task is too long / too resource-hungry. What if we sent the raw results to a redis server, let those results be immediately available to the web server, and had a dedicated host take care of the permanent save "when it can + as fast as it can"?

You mean store the query results in redis? Redis is an in-memory store, and given that the results are kept forever, this won't scale.

One could almost imagine, in the same way, that via websockets or another similar technology (or even an HTTP resultset page available only once) the workers would send the results directly to the end client as soon as they have them, and then store them quietly, or send them to another server dedicated to saving them, via redis.

Sounds like something. If the result set is large enough (above a certain threshold) the result will be streamed once to a web frontend. If the client is unavailable the result is immediately discarded. However:

  • What is this 'certain threshold'? ORES wants to keep the results forever (T188564#4047735), but some people just do SELECT * FROM page for 'experimentation'.
  • How to allow downloading-data-in-a-different-format if the results are not saved? Send the data back from client to server? Move all the file generation code to JS and run on client? Or don't even show the data, just tell the client that the data is ready and give the download options?
  • This also needs a lot of coding, but I can work on that in the winter holidays.

The problem is that the save at the end of the workers' task is too long / too resource-hungry. What if we sent the raw results to a redis server, let those results be immediately available to the web server, and had a dedicated host take care of the permanent save "when it can + as fast as it can"?

You mean store the query results in redis? Redis is an in-memory store, and given that the results are kept forever, this won't scale.

What I had in mind:

  1. send the work request from the web server to a worker.
  2. when the worker has results, send them to a redis server.
  3. serve the results to the final client from the redis server via the web server.
  4. when the storage server (yet another special server) is free, it collects the results back onto disk (NFS) from the redis server and deletes them from there.

The idea would be to have the results sent to two places asynchronously: first to the end client, and then, when possible, to permanent storage.

Sounds like something. If the result set is large enough (above a certain threshold) the result will be streamed once to a web frontend. If the client is unavailable the result is immediately discarded. However:
What is this 'certain threshold'? ORES wants to keep the results forever (T188564#4047735), but some people just do SELECT * FROM page for 'experimentation'.

The same procedure for all requests, with no differences, so no "certain threshold"?

How to allow downloading-data-in-a-different-format if the results are not saved? Send the data back from client to server? Move all the file generation code to JS and run on client? Or don't even show the data, just tell the client that the data is ready and give the download options?

Good catch. I see these possible implementations:

  • as you said move all the file generation code to JS and run on client (is it possible to handle big exports?)
  • ask the user the requested format before starting the querying task
  • only offer a preview first (json), and ask the user to wait a few more minutes to be able to export it in other formats

What I had in mind:

  1. send the work request from the web server to a worker.
  2. when the worker has results, send them to a redis server.
  3. serve the results to the final client from the redis server via the web server.
  4. when the storage server (yet another special server) is free, it collects the results back onto disk (NFS) from the redis server and deletes them from there.

LGTM. We just need to make sure redis does not run out of memory. So, certain threshold needed.
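Under that flow, and with the memory concern in mind, the worker/storage handoff could look roughly like this (key names, the TTL, and the client usage are assumptions, not existing Quarry code):

import json
import redis

# Sketch only; host, key layout and TTL are assumptions.
r = redis.StrictRedis(host='redis.example.internal', port=6379)

def publish_results(run_id, rows):
    """Worker side: park the finished resultset in redis so the web
    server can stream it and the storage host can persist it later."""
    r.set('quarry:results:%d' % run_id, json.dumps(rows), ex=3600)
    r.rpush('quarry:pending-saves', run_id)

def persist_pending(save_to_nfs):
    """Storage-server side: drain the queue when there is spare IO."""
    while True:
        run_id = r.lpop('quarry:pending-saves')
        if run_id is None:
            break
        key = 'quarry:results:%d' % int(run_id)
        data = r.get(key)
        if data is not None:
            save_to_nfs(int(run_id), json.loads(data))
        r.delete(key)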

The same procedure for all requests, with no differences, so no "certain threshold"?

We don't want to keep SELECT * FROM page forever, do we?

  • as you said move all the file generation code to JS and run on client (is it possible to handle big exports?)

Possible, but for big exports it can crash a browser. You can't make a 'generator' and write the data from the 'generator' into a file; you must construct all the data into a blob and then let the browser download the blob.

  • ask the user the requested format before starting the querying task

With the current model, we don't know how large the results will be until the query is executed, and I don't imagine people would like an extra step of specifying the format before executing it if they don't need to download the results, only display them.

Though, I guess we could separate execute into two buttons, something like 'run and display' and 'run and download', but idk how to make the download-after-display menu make sense.

  • only offer a preview first (json), and ask the user to wait a few more minutes to be able to export it in other formats

Hmm, so the results are saved anyhow?