Page MenuHomePhabricator

Do some checks of how many Quarry queries will break in a multiinstance environment
Closed, DeclinedPublic

Description

This is mostly data gathering and an effort to try to find our where users are so we can help ease the transition to multi-instance replicas.

Event Timeline

Bstorm renamed this task from Do some checks of how many queries will break in a multiinstance environment to Do some checks of how many Quarry queries will break in a multiinstance environment.Nov 17 2020, 12:11 AM
Bstorm triaged this task as Medium priority.
Bstorm added a subscriber: Jhernandez.

Adding @dcaro in case he has time or interest to help dig in that database. It's in the quarry Cloud VPS project. Local root can access the database (and do anything).

If you mess with it, just be mindful that I'm not aware of any backup and the main database is part of serving https://quarry.wmflabs.org/ (in case of unintentional cartesian joins and all that). Query text is in the query_revision table, IIRC. Source of that app is at https://gerrit.wikimedia.org/g/analytics/quarry/web for reference. It is mostly volunteer supported these days, though I'm trying to make a big patch for it at the moment.

I did a preliminary check on all quarry queries, and found 42 that seem to use cross DB (I have to check a bit more the code to see if I'm missing any).

The code I'm using for it is hosted here: https://toolsadmin.wikimedia.org/tools/id/sqlchecker
(the webservice is right not not running, as toolforge only allows python 3.5 on the bastion and I have yet not given any time to debug how to work around/fix it).

From those, only 8 of them were run this year though, the rest are older.

I'll review the queries too to anonymize if needed and paste the results somewhere.

Will try to add an extra method of detection (wikimedia specific though, by getting the list of dbs and just checking if 2 or more are addressed inside of the query).

I was thinking on using this code also to check 'live' queries on PAWS too, eventually, though that's just a thought on the back of my mind right now.

As an aside, if you use the webservice shell command, you can get it working (https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Python#Virtual_Environments_and_Packages). That'll launch you a container where you can generate the venv for 3.7. The 3.5 thing is only a concern if you try to launch on the gridengine (which I don't recommend).

Something doesn't seem right about the results because the bot we are working on in T267992 would be in scope, and I think I ran the quarry query referenced there last week? Maybe I didn't :) I'll be very happy if we are talking such a low number, but I would be expecting something in the hundreds to thousands unless they've all been removed recently. query_run records go back to 2014.

The table that holds query text is huge. I suppose it's still entirely possible that this isn't done as often as I'd have thought from email conversations over the years. It is usually convos with the same people.

MariaDB [quarry]> select count(*) from query_revision;
+----------+
| count(*) |
+----------+
|   525711 |
+----------+

As a test, one huge example that should definitely be flagged is mentioned and linked to here T195515#4234202. Did that get picked up in your check?

I'm crossing query.latest_query_rev with query_revision, that shrinks the number quite considerably:

MariaDB [quarry]> select count(*) from query join query_revision on query.latest_rev_id=query_revision.id;
+----------+
| count(*) |
+----------+
|    41230 |
+----------+
1 row in set (2.921 sec)

But yes, I agree that it seems a very low number.
For the first one it's not catching it (it currently looks for two or more <something>.<something> in a for like statement, and that one has only one, but uses the bare 'images', so I'll fix that, thanks for the example).

For the other one huge example, that one is caught correctly ;)

And thanks for the shell tip! (found it a bit earlier, and was able to git it working \o/):

edit: this ui is a bit confusing... reverting
'''
dcaro@vulcanus$ curl --silent https://raw.githubusercontent.com/greencardamom/Shadows/master/shadows.sql | curl -X POST -d @- https://sqlchecker.toolforge.org/api/v1/is_cross_database -q
{"result":false}
'''

I'll review the queries too to anonymize if needed and paste the results somewhere.

quarry queries are public. I don't think anonymizing is necessary.

I introduced the detection of implicit db usage (as in not explicitly stating the db name), and this is the current list of found cross-db queries:

https://phabricator.wikimedia.org/P13385

Still reviewing the results though to see if I missed anything, now the shadow sql is properly detected :)

$ curl --silent https://raw.githubusercontent.com/greencardamom/Shadows/master/shadows.sql \
| jq -R -s '{"sql": "\(.)"}' \
| curl -X POST -d @- https://sqlchecker.toolforge.org/api/v1/is_cross_database -H "Content-Type: application/json"

{"result":true}

Did we ever get a tally of the queries that would break here? I'm not seeing it in the comments, but I thought you had numbers @dcaro

If we can get the tally numbers on this ticket it would be good, but at this point, I think this is a moot issue. Quarry already uses multi-instance now.