
It would be useful to run the same Quarry query conveniently in several databases
Open, High, Public

Assigned To
None
Authored By
Amire80
Apr 9 2015, 5:22 PM
Referenced Files
F4527581: bmps.sql
Sep 26 2016, 9:32 PM
F4527582: bmps_all.tsv
Sep 26 2016, 9:32 PM
F4527583: bmps_enwiki.tsv
Sep 26 2016, 9:32 PM
Tokens
"Love" token, awarded by Quiddity."Like" token, awarded by Ricordisamoa."Like" token, awarded by eranroz.

Description

It would be useful to run the same Quarry query conveniently in several databases.

For example, I'd love to run the same queries about ContentTranslation metrics in multiple languages (over 20 languages, and growing). Currently, the only thing I can do is write something like "use tawiki_p", then change it and run the query again for each wiki.

Event Timeline

Amire80 raised the priority of this task from to Needs Triage.
Amire80 updated the task description. (Show Details)
Amire80 added a project: Quarry.
Amire80 subscribed.

Can't you already write db queries of this form using UNION and foreign table references? (Of course, that's not very user friendly)

Can't you already write db queries of this form using UNION and foreign table references? (Of course, that's not very user friendly)

Nor scalable.

Can't you already write db queries of this form using UNION and foreign table references? (Of course, that's not very user friendly)

Nor scalable.

That depends on the type of information you need to retrieve and how optimized the query is. The example I gave is very un-optimized.

I do this often. I've been using a python script. See https://github.com/halfak/multiquery

It would be nice to do this in Quarry.
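In the meantime, here is a rough sketch of the same idea in Python (this is not the multiquery script itself; the host pattern, credentials file, wiki list, and query are placeholder assumptions): loop over the per-wiki replica databases and run the same query against each one.

import os
import pymysql  # a common MySQL client library; any equivalent works

# Placeholder values for illustration only; adjust to your environment.
WIKIS = ["tawiki", "hewiki", "cawiki"]
QUERY = "select count(*) as n from recentchanges"

results = {}
for wiki in WIKIS:
    conn = pymysql.connect(
        host="%s.analytics.db.svc.wikimedia.cloud" % wiki,  # hypothetical host pattern
        database=wiki + "_p",
        read_default_file=os.path.expanduser("~/.my.cnf"),  # replica credentials
    )
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            results[wiki] = cur.fetchall()
    finally:
        conn.close()

for wiki, rows in results.items():
    print(wiki, rows)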

I'm going to untag Analytics; Quarry is a different approach, and we're about to allow multi-database data access in a different way.

This feature would be incredibly helpful, IIUC.
I have two tasks that require checking things across all our projects, and I don't know how else to do it, other than running ~900 separate Quarry queries...
(or begging for the time & help of an individual with database access and understanding (I'm barely beyond the copy&paste level)).

Please let me know if I can help any further, in explaining how valuable this would be to have in Quarry.

@Quiddity having some form of official resources dedicated to it might be helpful. I unfortunately don't think I'll have any bandwidth to be able to look at this any time soon :'(

@Quiddity, we're very close to allowing this kind of query in Hadoop, which you've worked with before, right? Grab me and I'll show you what we're working on and when to expect it to have all the data you need. We'll of course make public announcements when it's all ready (at the latest by the end of next quarter).

If we were to try the same thing in Quarry it would take a while longer. But I'll stay open-minded for when we chat.

Hadoop isn't public though, so not the same thing :) You can do similar things via tool labs now too.

You can fairly easily turn a query for a single wiki into a query over all wikis. It's a bit of boring work though. The pattern goes like this:

select 'aawiki' as wiki, <query> union all
select 'amwiki' as wiki, <query> union all
...
select 'zuwiki' as wiki, <query>;

… where <query> is your original query, with all table names qualified with the database name, and the select is repeated for every database you want to query.

For example, given the following query to list all BMP images on a single wiki (they are no longer allowed, but used to be and some still exist; I was once curious to see them):

select img_name, img_timestamp, img_major_mime, img_minor_mime from image where img_minor_mime in ('x-bmp', 'x-ms-bmp');

You can make a query to list all BMP images on all wikis [not sure if the list I used is up to date; I did this a couple of months ago]:

select 'aawiki' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from aawiki.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'aawikibooks' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from aawikibooks.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'aawiktionary' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from aawiktionary.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'abwiki' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from abwiki.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
-- [omitted for sanity]
select 'zuwikibooks' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from zuwikibooks.image where img_minor_mime in ('x-bmp', 'x-ms-bmp') union all
select 'zuwiktionary' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime from zuwiktionary.image where img_minor_mime in ('x-bmp', 'x-ms-bmp');

I hope this helps someone, at least until Quarry or something implements this internally ;)
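If typing out ~900 SELECTs by hand sounds unappealing, the UNION ALL text can also be generated from a list of database names with a few lines of Python. A rough sketch (the short wiki list here is a placeholder; on the Wiki Replicas the full list can be pulled from the meta_p.wiki table):

# Placeholder list; substitute the full set of database names you care about.
wikis = ["aawiki", "aawikibooks", "aawiktionary", "abwiki", "zuwikibooks", "zuwiktionary"]

# The single-wiki query from above, with the table qualified by database name.
per_wiki = (
    "select '{db}' as wiki, img_name, img_timestamp, img_major_mime, img_minor_mime "
    "from {db}.image where img_minor_mime in ('x-bmp', 'x-ms-bmp')"
)

# Join the per-wiki SELECTs with UNION ALL and terminate the statement.
print(" union all\n".join(per_wiki.format(db=db) for db in wikis) + ";")

The printed statement can then be pasted in as a single query.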

The UNION ALL approach hits some limits: MySQL doesn't let you do that beyond, I think, 50 or 100 unions or so.

@yuvipanda, yes, not public. But the Sanitarium setup that makes MediaWiki data public for Labs will be refactored into our pipeline, and then we will release the resulting data publicly, probably into the PostgreSQL instance on Labs. The schema is a single flat table which can easily be vertically partitioned and indexed to blaze through queries that otherwise crush Quarry. I expect, no joke, a 1000x improvement in some queries. That's too good for us to derail at this point. Of course, others are free to try to solve this in the meantime, and those with Hadoop access can get at it very soon.

@Quiddity, we're very close to allowing this kind of query in Hadoop, which you've worked with before, right? Grab me and I'll show you what we're working on and when to expect it to have all the data you need. We'll of course make public announcements when it's all ready (at the latest by the end of next quarter).

If we were to try the same thing in Quarry it would take a while longer. But I'll stay open-minded for when we chat.

Sorry, I have no experience in Hadoop. My experience level is "copy example command [from Quarry or the mw.o example page] into Quarry, and replace the obvious looking variables with the ones I think I want. If it doesn't work, stare at other examples, try a few variations, and then ask for help."
I need things like this rarely enough that it hasn't yet been worthwhile for me to invest the time for learning all about database wrangling and tangential skills. Quarry is just about perfect for my experience level and my output needs, and is an easily sharable tool with other slightly-technical editors.
MZ kindly wrote a python script to resolve T146310 for me.
The various use-cases described in T129698 are all non-urgent, just "it would regularly be helpful if I could easily check things like this", so I'll wait patiently for Quarry, and possibly bump "learn databases" up my todo list. Thanks though!

The UNION ALL approach hits some limits: MySQL doesn't let you do that beyond, I think, 50 or 100 unions or so.

If you look at the query I posted above, it let me do it 892 times :), and the results don't seem to be cut off or anything.

The UNION ALL trick no longer works, since you can only query one database at a time now (see https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign / T264254) :(

Would a proposed change be different from forking the desired query, and entering a new db for it to run against? If so, how?

Would a proposed change be different from forking the desired query, and entering a new db for it to run against? If so, how?

It wouldn't require looking at 900 different pages.

At this point, we are making significant progress on near-real-time dumps generation. If that project continues to work out, we could have an alternate view available for querying, one that would include data from all wikis, identified by perhaps a wiki_db column or similar. Just putting it out there as a possible solution to this very old problem (I haven't forgotten about this, progress here is just... slow)

At this point, we are making significant progress on near-real-time dumps generation. If that project continues to work out, we could have an alternate view available for querying, one that would include data from all wikis, identified by perhaps a wiki_db column or similar. Just putting it out there as a possible solution to this very old problem (I haven't forgotten about this, progress here is just... slow)

You are going to end up with a new real-time replicated, redacted RDBMS as a side effect of dumps generation? That sounds awesome, but also not fully believable.

At this point, we are making significant progress on near-real-time dumps generation. If that project continues to work out, we could have an alternate view available for querying, one that would include data from all wikis, identified by perhaps a wiki_db column or similar. Just putting it out there as a possible solution to this very old problem (I haven't forgotten about this, progress here is just... slow)

You are going to end up with a new real-time replicated, redacted RDBMS as a side effect of dumps generation? That sounds awesome, but also not fully believable.

I know, you're right; it's a big goal of mine though, and I'm very stubborn. The technology was not quite ready until recently, and we still have to upgrade Spark and jump some tricky hurdles, but it actually looks possible. This is the very rough and hopeful schedule I have in mind:

March: we are generating a proof-of-concept XML dump, similar to the current one, via a Kafka -> Spark -> Iceberg -> XML pipeline. This depends on Event Platform's content-enriched Kafka topic (page-content-change). It also depends on me getting better at all this; there are some big moving pieces and lots to learn, but I'm getting there. If the Kafka -> Iceberg jump is too challenging, the nice thing here is that we can fall back on hourly batches, and that's still OK for lots of use cases, right @bd808?

August: with the prototype running for a few months, we can figure out exactly how much we drift from the replicas. We have reconciliation strategies, and the logging table is getting better and better about including most events. If the drift is large and impossible to reconcile, then this will depend on changes to core to log more actions in the logging table.

End of 2023: we build on all that to generate a big-data-world proper replica that we can serve to the public.

The big time chunks here are filled with collaborating across lots of teams, getting security review, and all that. Once we figure out this first piece (how to get data from MW into Iceberg), the rest of the actual data transformation is fairly simple. We talked about factoring out what the Sanitarium / clouddb views do, and we've been looking at that code a bunch over the years; I don't see too many more unknowns there.
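For the curious, a very rough sketch of what the Kafka -> Spark -> Iceberg leg could look like as a Spark Structured Streaming job. This is not the actual pipeline: the broker address, topic name, event fields, catalog configuration, and target table are all placeholder assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("page-content-change-to-iceberg")
    # Hypothetical Iceberg catalog configuration.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hive")
    .getOrCreate()
)

# Hypothetical subset of the page-content-change event fields.
schema = StructType([
    StructField("wiki_id", StringType()),
    StructField("page_id", LongType()),
    StructField("rev_id", LongType()),
    StructField("rev_dt", StringType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker.example:9092")  # placeholder broker
    .option("subscribe", "mediawiki.page_content_change")      # placeholder topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# Append into an Iceberg table; the hourly trigger is the batch fallback mentioned above.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/page_content_change")  # placeholder path
    .trigger(processingTime="1 hour")
    .toTable("demo.wmf.page_content_change")
)
query.awaitTermination()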

March: we are generating a proof-of-concept XML dump, similar to the current one, via a Kafka -> Spark -> Iceberg -> XML pipeline. This depends on Event Platform's content-enriched Kafka topic (page-content-change). It also depends on me getting better at all this; there are some big moving pieces and lots to learn, but I'm getting there. If the Kafka -> Iceberg jump is too challenging, the nice thing here is that we can fall back on hourly batches, and that's still OK for lots of use cases, right @bd808?

I am sure that there are folks who would be OK with things stopping at a big pile of data inside the restricted-access production Hadoop cluster. That does not actually move anything forward for Quarry and the general public, however.

End of 2023: we build on all that to generate a big-data-world proper replica that we can serve to the public.

This would in theory be the point where Quarry might become involved and able to expose the new dataset to the world.

The big time chunks here are filled with collaborating across lots of teams, getting security review, and all that. Once we figure out this first piece (how to get data from MW into Iceberg), the rest of the actual data transformation is fairly simple. We talked about factoring out what the Sanitarium / clouddb views do, and we've been looking at that code a bunch over the years; I don't see too many more unknowns there.

This all sounds great, and I hope that it can be made to work. Being able to restore cross-wiki query capability would help with a number of workflows that had to be modified or abandoned when the Wiki Replicas outgrew our ability to colocate all slices on the same instance. It also sounds like a step toward being able to design and implement a more performant OLAP schema for answering the sort of questions that Quarry and Toolforge tools typically attempt to brute force out of the MediaWiki OLTP schema that we currently expose. Thank you for working on this!

March: we are generating a proof-of-concept XML dump, similar to the current one, via a Kafka -> Spark -> Iceberg -> XML pipeline. This depends on Event Platform's content-enriched Kafka topic (page-content-change). It also depends on me getting better at all this; there are some big moving pieces and lots to learn, but I'm getting there. If the Kafka -> Iceberg jump is too challenging, the nice thing here is that we can fall back on hourly batches, and that's still OK for lots of use cases, right @bd808?

I am sure that there are folks who would be OK with things stopping at a big pile of data inside the restricted-access production Hadoop cluster. That does not actually move anything forward for Quarry and the general public, however.

Sorry, misunderstanding. I meant hourly updates instead of as fast as Event Bus + Kafka. Of course, either way we go, we would make this public. Actually serving the public with a viable alternative needs more work, which I'm slating for the next step, but at no point is this meant for internal-only consumption. I mean, the dumps are public, and the major problem we're trying to solve here is "what slice of our content is public?" in as close to real time as possible.

This all sounds great, and I hope that it can be made to work. Being able to restore cross-wiki query capability would help with a number of workflows that had to be modified or abandoned when the Wiki Replicas outgrew our ability to colocate all slices on the same instance. It also sounds like a step toward being able to design and implement a more performant OLAP schema for answering the sort of questions that Quarry and Toolforge tools typically attempt to brute force out of the MediaWiki OLTP schema that we currently expose. Thank you for working on this!

Thanks go out to everyone for being so patient while this was brewing, but yeah, one way or another, your paragraph here is what we're aiming for.