Page MenuHomePhabricator

it would be useful to run the same Quarry query conveniently in several database
Open, HighPublic

Description

It would be useful to run the same Quarry query conveniently in several databases.

For example, I'd love to run the same queries about ContentTranslation metrics in multiple languages (over 20 languages, and growing). Currently the only thing I can do is to write something like "use tawiki_p" and then change it and run it repeatedly.

Event Timeline

Amire80 created this task.Apr 9 2015, 5:22 PM
Amire80 raised the priority of this task from to Needs Triage.
Amire80 updated the task description. (Show Details)
Amire80 added a project: Quarry.
Amire80 added a subscriber: Amire80.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 9 2015, 5:22 PM
jeremyb added a subscriber: jeremyb.Apr 9 2015, 5:30 PM
agray added a subscriber: agray.Apr 9 2015, 5:30 PM
ggellerman set Security to None.
Milimetric removed Milimetric as the assignee of this task.May 19 2015, 3:12 PM
Capt_Swing triaged this task as High priority.Jul 17 2015, 6:57 PM
Ricordisamoa added a subscriber: Ricordisamoa.

Can't you already write db queries of this form using UNION and foreign table references? (Of course, that's not very user friendly)

Can't you already write db queries of this form using UNION and foreign table references? (Of course, that's not very user friendly)

Nor scalable.

Can't you already write db queries of this form using UNION and foreign table references? (Of course, that's not very user friendly)

Nor scalable.

That depends on the type of information you need to retrieve and how optimized the query is. The example I gave is very un-optimized.

-jem- added a subscriber: -jem-.Oct 5 2015, 10:25 AM
Halfak added a comment.Oct 9 2015, 6:58 PM

I do this often. I've been using a python script. See https://github.com/halfak/multiquery

It would be nice to do this in quarry.

Milimetric moved this task from Incoming to Event Platform on the Analytics board.Jan 12 2016, 7:41 PM

I'm going to untag Analytics, quarry is a different approach, we're about to allow multi-database data access in a different way.

This feature would be incredibly helpful, IIUC.
I have two tasks that require checking things across all our projects, and I don't know how else to do it, other than running ~900 separate Quarry queries...
(or begging for the time & help of an individual with database access and understanding (I'm barely beyond the copy&paste level)).

Please let me know if I can help any further, in explaining how valuable this would be to have in Quarry.

@Quiddity having some form of official resources dedicated to it might be helpful. I unfortunately don't think I'll have any bandwidth to be able to look at this any time soon :'(

@Quiddity, we're very close to allowing this kind of query in Hadoop, which you've worked with before, right? Grab me and I'll show you what we're working on and when to expect it to have all the data you need. We'll of course make public announcements when it's all ready (at the latest by the end of next quarter).

If we were to try the same thing in Quarry it would take a while longer. But I'll stay open-minded for when we chat.

Hadoop isn't public though, so not the same thing :) You can do similar things via tool labs now too.

You can fairly easily turn a query for a single wiki into a query over all wikis. It's a bit of boring work though. The pattern goes like this:

select 'aawiki' as wiki, <query> union all
select 'amwiki' as wiki, <query> union all
...
select 'zuwiki' as wiki, <query>;

… where <query> is your original query, with all table names qualified with the database name, and the select is repeated for every database you want to query.

For example, given the following query to list all BMP images on a single wiki (they are no longer allowed, but used to be and some still exist; I was once curious to see them):

select img_name, img_timestamp, img_major_mime, img_minor_mime from image where img_minor_mime in ('x-bmp', 'x-ms-bmp');

You can make a query to list all BMP images on all wikis [not sure if the list I used is up-to-date, I did this a couple months ago]:

select 'aawiki' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from aawiki.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'aawikibooks' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from aawikibooks.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'aawiktionary' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from aawiktionary.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'abwiki' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from abwiki.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
-- [omitted for sanity]
select 'zuwikibooks' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from zuwikibooks.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'zuwiktionary' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from zuwiktionary.image where img_minor_mime in ('x-bmp', 'x-ms-bmp');

I hope this helps someone, at least until Quarry or something implements this internally ;)

The union all approach hits some limits as mysql doesn't let you do that beyond I think like 50 or 100 times or something like that.

@yuvipanda, yes, not public. But the sanitarium stuff that makes mediawiki data public for labs will be refactored into our pipeline. And then we will release the resulting data publicly into probably the postgresql instance on labs. The schema is a single flat table which can be easily vertically partitioned and indexed to blaze through stuff that otherwise crushes quarry. I expect no joke 1000x improvement in some queries. That's too good for us to derail at this point. Of course others are free to try and solve this in the meantime. And those with hadoop access can get at it very soon.

@Quiddity, we're very close to allowing this kind of query in Hadoop, which you've worked with before, right? Grab me and I'll show you what we're working on and when to expect it to have all the data you need. We'll of course make public announcements when it's all ready (at the latest by the end of next quarter).
If we were to try the same thing in Quarry it would take a while longer. But I'll stay open-minded for when we chat.

Sorry, I have no experience in Hadoop. My experience level is "copy example command [from Quarry or the mw.o example page] into Quarry, and replace the obvious looking variables with the ones I think I want. If it doesn't work, stare at other examples, try a few variations, and then ask for help."
I need things like this rarely enough that it hasn't yet been worthwhile for me to invest the time for learning all about database wrangling and tangential skills. Quarry is just about perfect for my experience level and my output needs, and is an easily sharable tool with other slightly-technical editors.
MZ kindly wrote a python script to resolve T146310 for me.
The various use-cases described in T129698 are all non-urgent, just "it would regularly be helpful if I could easily check things like this", so I'll wait patiently for Quarry, and possibly bump "learn databases" up my todo list. Thanks though!

The union all approach hits some limits as mysql doesn't let you do that beyond I think like 50 or 100 times or something like that.

If you look at the query I posted above, it let me do it 892 times :), and the results don't seem to be cut off or anything.