
It would be useful to run the same Quarry query conveniently in several databases
Open, High, Public

Assigned To
None
Authored By
Amire80
Apr 9 2015, 5:22 PM
Referenced Files
F4527581: bmps.sql
Sep 26 2016, 9:32 PM
F4527582: bmps_all.tsv
Sep 26 2016, 9:32 PM
F4527583: bmps_enwiki.tsv
Sep 26 2016, 9:32 PM
Tokens
"Love" token, awarded by Quiddity."Like" token, awarded by Ricordisamoa."Like" token, awarded by eranroz.

Description

It would be useful to run the same Quarry query conveniently in several databases.

For example, I'd love to run the same queries about ContentTranslation metrics in multiple languages (over 20 languages, and growing). Currently, the only thing I can do is write something like "use tawiki_p", then change it and run the query again for each wiki.

Event Timeline

Amire80 raised the priority of this task from to Needs Triage.
Amire80 updated the task description. (Show Details)
Amire80 added a project: Quarry.
Amire80 subscribed.

Can't you already write db queries of this form using UNION and foreign table references? (Of course, that's not very user friendly)

Can't you already write db queries of this form using UNION and foreign table references? (Of course, that's not very user friendly)

Nor scalable.

Can't you already write db queries of this form using UNION and foreign table references? (Of course, that's not very user friendly)

Nor scalable.

That depends on the type of information you need to retrieve and how optimized the query is. The example I gave is very un-optimized.

I do this often. I've been using a python script. See https://github.com/halfak/multiquery

It would be nice to do this in Quarry.
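In the meantime, here is a rough sketch of the same idea in Python (this is not the multiquery script itself; the host pattern, credentials file, wiki list, and query are placeholder assumptions): loop over the per-wiki replica databases and run the same query against each one.

import os
import pymysql  # a common MySQL client library; any equivalent works

# Placeholder values for illustration only; adjust to your environment.
WIKIS = ["tawiki", "hewiki", "cawiki"]
QUERY = "select count(*) as n from recentchanges"

results = {}
for wiki in WIKIS:
    conn = pymysql.connect(
        host="%s.analytics.db.svc.wikimedia.cloud" % wiki,  # hypothetical host pattern
        database=wiki + "_p",
        read_default_file=os.path.expanduser("~/.my.cnf"),  # replica credentials
    )
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            results[wiki] = cur.fetchall()
    finally:
        conn.close()

for wiki, rows in results.items():
    print(wiki, rows)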

I'm going to untag Analytics; Quarry is a different approach, and we're about to allow multi-database data access in a different way.

This feature would be incredibly helpful, IIUC.
I have two tasks that require checking things across all our projects, and I don't know how else to do it, other than running ~900 separate Quarry queries...
(or begging for the time & help of an individual with database access and understanding (I'm barely beyond the copy&paste level)).

Please let me know if I can help any further, in explaining how valuable this would be to have in Quarry.

@Quiddity having some form of official resources dedicated to it might be helpful. I unfortunately don't think I'll have any bandwidth to be able to look at this any time soon :'(

@Quiddity, we're very close to allowing this kind of query in Hadoop, which you've worked with before, right? Grab me and I'll show you what we're working on and when to expect it to have all the data you need. We'll of course make public announcements when it's all ready (at the latest by the end of next quarter).

If we were to try the same thing in Quarry it would take a while longer. But I'll stay open-minded for when we chat.

Hadoop isn't public though, so not the same thing :) You can do similar things via tool labs now too.

You can fairly easily turn a query for a single wiki into a query over all wikis. It's a bit of boring work though. The pattern goes like this:

select 'aawiki' as wiki, <query> union all
select 'amwiki' as wiki, <query> union all
...
select 'zuwiki' as wiki, <query>;

… where <query> is your original query, with all table names qualified with the database name, and the select is repeated for every database you want to query.

For example, given the following query to list all BMP images on a single wiki (they are no longer allowed, but used to be and some still exist; I was once curious to see them):

select img_name, img_timestamp, img_major_mime, img_minor_mime from image where img_minor_mime in ('x-bmp', 'x-ms-bmp');

You can make a query to list all BMP images on all wikis [not sure if the list I used is up to date; I did this a couple of months ago]:

select 'aawiki' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from aawiki.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'aawikibooks' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from aawikibooks.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'aawiktionary' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from aawiktionary.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'abwiki' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from abwiki.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
-- [omitted for sanity]
select 'zuwikibooks' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from zuwikibooks.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'zuwiktionary' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from zuwiktionary.image where img_minor_mime in ('x-bmp', 'x-ms-bmp');

I hope this helps someone, at least until Quarry or something implements this internally ;)
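If typing out ~900 SELECTs by hand sounds unappealing, the UNION ALL text can also be generated from a list of database names with a few lines of Python. A rough sketch (the short wiki list here is a placeholder; on the Wiki Replicas the full list can be pulled from the meta_p.wiki table):

# Placeholder list; substitute the full set of database names you care about.
wikis = ["aawiki", "aawikibooks", "aawiktionary", "abwiki", "zuwikibooks", "zuwiktionary"]

# The single-wiki query from above, with the table qualified by database name.
per_wiki = (
    "select '{db}' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime "
    "from {db}.image where img_minor_mime in ('x-bmp', 'x-ms-bmp')"
)

# Join the per-wiki SELECTs with UNION ALL and terminate the statement.
print(" union all\n".join(per_wiki.format(db=db) for db in wikis) + ";")

The printed statement can then be pasted in as a single query.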

The UNION ALL approach hits some limits: MySQL doesn't let you do that beyond, I think, 50 or 100 unions or so.

@yuvipanda, yes, not public. But the Sanitarium setup that makes MediaWiki data public for Labs will be refactored into our pipeline, and then we will release the resulting data publicly, probably into the PostgreSQL instance on Labs. The schema is a single flat table which can easily be vertically partitioned and indexed to blaze through queries that otherwise crush Quarry. I expect, no joke, a 1000x improvement in some queries. That's too good for us to derail at this point. Of course, others are free to try to solve this in the meantime, and those with Hadoop access can get at it very soon.

@Quiddity, we're very close to allowing this kind of query in Hadoop, which you've worked with before, right? Grab me and I'll show you what we're working on and when to expect it to have all the data you need. We'll of course make public announcements when it's all ready (at the latest by the end of next quarter).

If we were to try the same thing in Quarry it would take a while longer. But I'll stay open-minded for when we chat.

Sorry, I have no experience in Hadoop. My experience level is "copy example command [from Quarry or the mw.o example page] into Quarry, and replace the obvious looking variables with the ones I think I want. If it doesn't work, stare at other examples, try a few variations, and then ask for help."
I need things like this rarely enough that it hasn't yet been worthwhile for me to invest the time for learning all about database wrangling and tangential skills. Quarry is just about perfect for my experience level and my output needs, and is an easily sharable tool with other slightly-technical editors.
MZ kindly wrote a python script to resolve T146310 for me.
The various use-cases described in T129698 are all non-urgent, just "it would regularly be helpful if I could easily check things like this", so I'll wait patiently for Quarry, and possibly bump "learn databases" up my todo list. Thanks though!

The UNION ALL approach hits some limits: MySQL doesn't let you do that beyond, I think, 50 or 100 unions or so.

If you look at the query I posted above, it let me do it 892 times :), and the results don't seem to be cut off or anything.

The UNION ALL trick no longer works, since you can only query one database at a time now (see https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign / T264254) :(

Would a proposed change be different from forking the desired query, and entering a new db for it to run against? If so, how?

Would a proposed change be different from forking the desired query, and entering a new db for it to run against? If so, how?

It wouldn't require looking at 900 different pages.

At this point, we are making significant progress on near-real-time dumps generation. If that project continues to work out, we could have an alternate view available for querying, one that would include data from all wikis, identified by perhaps a wiki_db column or similar. Just putting it out there as a possible solution to this very old problem (I haven't forgotten about this, progress here is just... slow)

At this point, we are making significant progress on near-real-time dumps generation. If that project continues to work out, we could have an alternate view available for querying, one that would include data from all wikis, identified by perhaps a wiki_db column or similar. Just putting it out there as a possible solution to this very old problem (I haven't forgotten about this, progress here is just... slow)

You are going to end up with a new real-time replicated, redacted RDBMS as a side effect of dumps generation? That sounds awesome, but also not fully believable.

At this point, we are making significant progress on near-real-time dumps generation. If that project continues to work out, we could have an alternate view available for querying, one that would include data from all wikis, identified by perhaps a wiki_db column or similar. Just putting it out there as a possible solution to this very old problem (I haven't forgotten about this, progress here is just... slow)

You are going to end up with a new real-time replicated, redacted RDBMS as a side effect of dumps generation? That sounds awesome, but also not fully believable.

I know, you're right; it's a big goal of mine though, and I'm very stubborn. The technology was not quite ready until recently, and we still have to upgrade Spark and jump some tricky hurdles, but it actually looks possible. This is the very rough and hopeful schedule I have in mind:

March: we are generating a proof-of-concept XML dump, similar to the current one, via a Kafka -> Spark -> Iceberg -> XML pipeline. This depends on Event Platform's content-enriched Kafka topic (page-content-change). It also depends on me getting better at all this; there are some big moving pieces and lots to learn, but I'm getting there. If the Kafka -> Iceberg jump is too challenging, the nice thing here is that we can fall back on hourly batches, and that's still OK for lots of use cases, right @bd808?

August: with the prototype running for a few months, we can figure out exactly how much we drift from the replicas. We have reconciliation strategies, and the logging table is getting better and better about including most events. If the drift is large and impossible to reconcile, then this will depend on changes to core to log more actions in the logging table.

End of 2023: we build on all that to generate a big-data-world proper replica that we can serve to the public.

The big time chunks here are filled with collaborating across lots of teams, getting security review, and all that. Once we figure out this first piece (how to get data from MW into Iceberg), the rest of the actual data transformation is fairly simple. We talked about factoring out what the Sanitarium / clouddb views do, and we've been looking at that code a bunch over the years; I don't see too many more unknowns there.
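For the curious, a very rough sketch of what the Kafka -> Spark -> Iceberg leg could look like as a Spark Structured Streaming job. This is not the actual pipeline: the broker address, topic name, event fields, catalog configuration, and target table are all placeholder assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("page-content-change-to-iceberg")
    # Hypothetical Iceberg catalog configuration.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hive")
    .getOrCreate()
)

# Hypothetical subset of the page-content-change event fields.
schema = StructType([
    StructField("wiki_id", StringType()),
    StructField("page_id", LongType()),
    StructField("rev_id", LongType()),
    StructField("rev_dt", StringType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker.example:9092")  # placeholder broker
    .option("subscribe", "mediawiki.page_content_change")      # placeholder topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# Append into an Iceberg table; the hourly trigger is the batch fallback mentioned above.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/page_content_change")  # placeholder path
    .trigger(processingTime="1 hour")
    .toTable("demo.wmf.page_content_change")
)
query.awaitTermination()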

March: we are generating a proof-of-concept XML dump, similar to the current one, via a Kafka -> Spark -> Iceberg -> XML pipeline. This depends on Event Platform's content-enriched Kafka topic (page-content-change). It also depends on me getting better at all this; there are some big moving pieces and lots to learn, but I'm getting there. If the Kafka -> Iceberg jump is too challenging, the nice thing here is that we can fall back on hourly batches, and that's still OK for lots of use cases, right @bd808?

I am sure that there are folks who would be OK with things stopping at a big pile of data inside the restricted-access production Hadoop cluster. That does not actually move anything forward for Quarry and the general public, however.

End of 2023: we build on all that to generate a big-data-world proper replica that we can serve to the public.

This would in theory be the point where Quarry might become involved and able to expose the new dataset to the world.

The big time chunks here are filled with collaborating across lots of teams, getting security review, and all that. Once we figure out this first piece (how to get data from MW into Iceberg), the rest of the actual data transformation is fairly simple. We talked about factoring out what the Sanitarium / clouddb views do, and we've been looking at that code a bunch over the years; I don't see too many more unknowns there.

This all sounds great, and I hope that it can be made to work. Being able to restore cross-wiki query capability would help with a number of workflows that had to be modified or abandoned when the Wiki Replicas outgrew our ability to colocate all slices on the same instance. It also sounds like a step toward being able to design and implement a more performant OLAP schema for answering the sort of questions that Quarry and Toolforge tools typically attempt to brute force out of the MediaWiki OLTP schema that we currently expose. Thank you for working on this!

March: we are generating a proof-of-concept XML dump, similar to the current one, via a Kafka -> Spark -> Iceberg -> XML pipeline. This depends on Event Platform's content-enriched Kafka topic (page-content-change). It also depends on me getting better at all this; there are some big moving pieces and lots to learn, but I'm getting there. If the Kafka -> Iceberg jump is too challenging, the nice thing here is that we can fall back on hourly batches, and that's still OK for lots of use cases, right @bd808?

I am sure that there are folks who would be OK with things stopping at a big pile of data inside the restricted-access production Hadoop cluster. That does not actually move anything forward for Quarry and the general public, however.

Sorry, misunderstanding. I meant hourly updates instead of as fast as Event Bus + Kafka. Of course, either way we go, we would make this public. Actually serving the public with a viable alternative needs more work, which I'm slating for the next step, but at no point is this meant for internal-only consumption. I mean, the dumps are public, and the major problem we're trying to solve here is "what slice of our content is public?" in as close to real time as possible.

This all sounds great, and I hope that it can be made to work. Being able to restore cross-wiki query capability would help with a number of workflows that had to be modified or abandoned when the Wiki Replicas outgrew our ability to colocate all slices on the same instance. It also sounds like a step toward being able to design and implement a more performant OLAP schema for answering the sort of questions that Quarry and Toolforge tools typically attempt to brute force out of the MediaWiki OLTP schema that we currently expose. Thank you for working on this!

Thanks go out to everyone for being so patient while this was brewing, but yeah, one way or another, your paragraph here is what we're aiming for.