Page MenuHomePhabricator

Replace Quarry with an installation of Superset
Open, Needs TriagePublic

Assigned To
None
Authored By
yuvipanda
Jul 1 2017, 11:27 PM
Referenced Files
F37109494: Screenshot 2023-06-19 at 15.06.34.png
Jun 19 2023, 2:09 PM
F37109493: Screenshot 2023-06-19 at 15.06.26.png
Jun 19 2023, 2:09 PM
F37082314: quarry-queries-per-day.csv
May 29 2023, 11:03 AM
F37031586: image.png
May 25 2023, 12:08 AM
Tokens
"Like" token, awarded by Halfak."Like" token, awarded by web3."Baby Tequila" token, awarded by awight.

Description

Quarry doesn't get a lot of support. Meanwhile something like superset gets a huge amount of attention. As a result Quarry is falling behind, and someday will likely fail in a way that is difficult or impractical to repair. As such this ticket is to seek out an alternative that preserves the core search function of Quarry, while being better supported, and perhaps bringing additional features as well.

Previously considered, were Redash and Metabase. Redash was a good option, though it no longer has an active community. Metabase explicitly won't support oauth. So they are not longer being considered for project offering.

NOTE: The reason that we are moving in this direction is because quarry is not being supported. Commentary on a desire to keep quarry is fine. But it needs to be focused on how to get quarry support. Discussion on features that quarry has is not material, as without support quarry will someday fail and have no features.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@Stuartyeates no, neither of these are accessible via either Quarry or Superset. In terms of Quarry there is some discussion here T151158

I've attached a csv with the number of queries run per day between Quarry and Superset. Though Superset is only counting queries that were run from the sql_lab section, it is not counting queries that were run as part of a chart or other places. As such the data is garbage. I'm not very sure if better usage data is collected in superset. If there is I would be happy to retrieve it.

Well, as a regular and very pleased user of Quarry, I tried Superset. Not helpful, it doesn't have most of databases. So, Quarry doesn't work any more, and I can't use Superset, so there is no way for me to run an urgent query (72767).

Well, as a regular and very pleased user of Quarry, I tried Superset. Not helpful, it doesn't have most of databases. So, Quarry doesn't work any more, and I can't use Superset, so there is no way for me to run an urgent query (72767).

If by "quarry doesn't work" you refer to the replication lag issue, it affects all tools, including both quarry and superset and others

Maybe, don't know. I can't run queries any more.

Well, as a regular and very pleased user of Quarry, I tried Superset. Not helpful, it doesn't have most of databases.

Could you list the databases that superset does not have?

I tried to find hewiki_p. Didn't find, but while looking, I'm not even sure I met any <lang>wiki_p, besides enwiki and frwiki. Maybe I'm wrong. And maybe they are all in s7, which doesn't open.

@IKhitron It would appear that hewiki is on s7 https://noc.wikimedia.org/db.php and that the replication lag on s7 is particularly bad https://replag.toolforge.org/

I see. Well, I succeeded to open s7 now and run a query. Please don't turn Quarry off until at least there will be a way to download the results in all the formats. Now there is only CSV format, and only for ASCII texts. So, non-English wikis can't use it. And I even didn't check what happens with API calls for queries.

@IKhitron It would appear that hewiki is on s7 https://noc.wikimedia.org/db.php and that the replication lag on s7 is particularly bad https://replag.toolforge.org/

Currently I don't see any wikis on any db when I do show tables. A link to that NOC page would be useful on pages where a database can be selected.

Currently I don't see any wikis on any db when I do show tables. A link to that NOC page would be useful on pages where a database can be selected.

You may not have selected a schema. MySQL has the confusing tendency to call both databases and schemas "databases". If you haven't selected a schema show tables won't work. Though show databases should.

Currently I don't see any wikis on any db when I do show tables. A link to that NOC page would be useful on pages where a database can be selected.

You may not have selected a schema. MySQL has the confusing tendency to call both databases and schemas "databases". If you haven't selected a schema show tables won't work. Though show databases should.

Indeed show databases works without a schema and show tables works with a schema selected, thanks.

A few years ago on Quarry we could run queries across multiple wikis. It seems that with Superset this is possible once again, at least within the same database. I have an old query for some s3 wikis but on Superset this gives: mysql error: SELECT command denied to user 's52788'@'10.64.37.27' for table 'page'. Alas it doesn't say which table 'page' is the problem.

Is this the right place to complain about features from Quarry that I'm missing from Superset? Because if so, I'm missing the option to export the results of a query to a Wikitable.

Seconded this, since I frequently use the "wikitable export" in Quarry.

Is this the right place to complain about features from Quarry that I'm missing from Superset? Because if so, I'm missing the option to export the results of a query to a Wikitable.

Seconded this, since I frequently use the "wikitable export" in Quarry.

Third. But, but but... we were told that Quarry "lacks support" and Superset has support overflowing with more support... who is providing this support? They don't even support Wikitable export? You mean Superset isn't really a superset of Quarry?

Of course these are minor issues if the replication lag is running multiple days or weeks.

Third. But, but but... we were told that Quarry "lacks support" and Superset has support overflowing with more support... who is providing this support? They don't even support Wikitable export? You mean Superset isn't really a superset of Quarry?

Quarry is an in-house piece of software. All of the maintenance burden falls on the wikimedia community. Every time it breaks or needs tweaking, it competes with other wikimedia projects and priorities for resources.

superset is an apache project, with a very large group of users and existing support base. A large part of the world's SQL community are users of superset, so there are more people committing resources to the project, both technical and non-technical. Things like: this book https://www.oreilly.com/library/view/apache-superset-quick/9781788992244/918599d4-14b8-44a1-ad48-410ad73a4ad3.xhtml these youtube videos: https://www.youtube.com/results?search_query=Apache+Superset and lots of tutorials.

I think it’s definitely within the realm of possibility to contribute functionality for different export options upstream (wikitext, but also markdown etc).

Conclusion: Until now if there was a feature that only wikimedia users needed, for example wikitable download, multilingual download, or editing history viewing, we could just ask, and now we wil not have them anymore. I don't believe that all the wikis will convert their pages names to English just for queries sake, so the queries will be useless.

Conclusion: Until now if there was a feature that only wikimedia users needed, for example wikitable download, multilingual download, or editing history viewing, we could just ask

Yes you could ask, though looking at https://phabricator.wikimedia.org/project/view/800/ it is clear that most requests are not answered due to limited support. Superset does have a lot of support, just not wiki specific.

I don't believe that all the wikis will convert their pages names to English just for queries sake, so the queries will be useless.

If you're suggesting that superset only supports ascii, that is false T337790 shows that it supports utf-8. The output to csv simply is given with what the replica db gives, namely binary. No titles or fields would need to be changed to support this, an example using utf-8 in a query can be found here:
https://superset.wmcloud.org/superset/sqllab?queryId=859

Conclusion: Until now if there was a feature that only wikimedia users needed, for example wikitable download, multilingual download, or editing history viewing, we could just ask

Yes you could ask, though looking at https://phabricator.wikimedia.org/project/view/800/ it is clear that most requests are not answered due to limited support. Superset does have a lot of support, just not wiki specific.

Exactly. These ones are already there, by request or spoken need, and now we are going to lose them.

I don't believe that all the wikis will convert their pages names to English just for queries sake, so the queries will be useless.

If you're suggesting that superset only supports ascii, that is false T337790 shows that it supports utf-8. The output to csv simply is given with what the replica db gives, namely binary. No titles or fields would need to be changed to support this, an example using utf-8 in a query can be found here:
https://superset.wmcloud.org/superset/sqllab?queryId=859

And Quarry uses the same replica and knows to answer in UTF8.

Actually you suggest to replace a very good tool with bad(?) support to a very bad tool with good support. Maybe you should ask all the features that already exist in Quarry, wait until they all will be operational on Superset, and only then move there.

I would hope that there is the ability to create an extension in the 'Share' menu that may be used to create a wikitable.

I completely disagree that Superset is a “very bad tool”. IMO, the design is so much better, and there’s more features.

“now we are going to lose them” the requests will still be archived on Phabricator; I don’t see the point of your argument…

Exactly. These ones are already there, by request or spoken need, and now we are going to lose them.

As explained above, no one is proposing to turn of quarry, it will continue too function for the foreseeable future.

And Quarry uses the same replica and knows to answer in UTF8.

Both Quarry and superset speak UTF-8 just fine.

Actually you suggest to replace a very good tool with bad(?) support to a very bad tool with good support. Maybe you should ask all the features that already exist in Quarry, wait until they all will be operational on Superset, and only then move there.

That is a pretty good summery of what this is discussion is: a discussion of how to reproduce the needed functionality of quarry in superset.

I just went and looked for how to implement one of the requested features and immediately ran into @rook asking very similar questions (but asking them better than I). This is unlikely to be the finished version of superset you're going to be lumped with.

In T169452#8897787, EpicPupper wrote:

I completely disagree that Superset is a “very bad tool”. IMO, the design is so much better, and there’s more features.

Well, there are no futures we are used to and we need.

In T169452#8897788, EpicPupper wrote:

“now we are going to lose them” the requests will still be archived on Phabricator; I don’t see the point of your argument…

I'm not talking about requests, but about existing features in Quarry, that already took years to implement.

In T169452#8898159, Stuartyeates wrote:

Exactly. These ones are already there, by request or spoken need, and now we are going to lose them.

As explained above, no one is proposing to turn of quarry, it will continue too function for the foreseeable future.

Great, thanks. Looks like I missed this part. But does this mean that urgent bugs, as one a couple of weeks ago, wilk still be fixed?

And Quarry uses the same replica and knows to answer in UTF8.

Both Quarry and superset speak UTF-8 just fine.

Did you see what happens on results download?

It’s bold to speak for all Quarry users with “we”. You still haven’t mentioned any actual features that Superset does not have and Quarry does.

It’s bold to speak for all Quarry users with “we”.

Never said all of the Users.

You still haven’t mentioned any actual features that Superset does not have and Quarry does.

Actually, I did. Many times. But here you are again and some more.

  1. Download results in TSV.
  2. Download results in JSON.
  3. Download results in JSON lines.
  4. Download results in CSV for non-English wikis.
  5. Download results in Wikitable.
  6. Download results in HTML.
  7. Download results in Excel XSLX.
  8. Mark as draft.
  9. Fork.
  10. History.
  11. Toggle highlighting.

If the last four exist and I just couldn't locate them, it's still too many. I personally used 9 of these. And I'm sure the new tool has a lot of other great features, but we are not speaking about trade off here.

I'm trying out Supetset as per the banner on Quarry. I started by trying out this query from my Quarry profile: https://quarry.wmcloud.org/query/18758.

Running it on Superset, I get:

DB engine Error. Only SELECT statements are allowed against this database.

Superset gets confused by the use of parenthesis.

Fails:

(
  SELECT
	CAST(MIN(page_random) AS CHAR) _rand
	FROM page
    LIMIT 1
)
UNION
(
  SELECT
	CAST(MAX(page_random) AS CHAR) _rand
  FROM page
  LIMIT 1
)

A very similar query, but with the first pair of parenthesis removed, does work, although I have to remove LIMIT 1 in that case because this is only syntatically valid in combination with UNION when placed inside parenthesis. Outside of it, it would produce a syntax error:

Passes:

SELECT
	CAST(MIN(page_random) AS CHAR) _rand
	FROM page

UNION
(
  SELECT
	CAST(MAX(page_random) AS CHAR) _rand
  FROM page
  LIMIT 1
)

I'm trying out Supetset as per the banner on Quarry. I started by trying out this query from my Quarry profile: https://quarry.wmcloud.org/query/18758.

Why would the limit statements be needed in this case? Won't MIN and MAX only return 1 value regardless?

FWIW, i like quarry better than superset. I think streamlined is more important than featureful here. If i wanted featureful i would just use toolforge directly.

Superset is interesting, but I agree with Bawolff that simple is a good goal. Superset requires that I find the correct database server to query against, which is unnecessary and confusing.

When I use Quarry, I'm usually trying to answer a simple question quickly. It doesn't seem like Superset will have features I would actually use, like various output formats (including wikitext, which is a feature Quarry has) or clickable links. The graphing might be useful sometimes, but the overhead doesn't really justify a switch for me.

For anyone who does not want quarry to be replaced, I've edited the ticket to be more explicit. The thing that quarry lacks is support. It has a sizable, and old codebase. It will fail at some unknown point and not be recoverable. Superset does not have this problem, it is simple to deploy, and there are a lot of people using it and contributing to it. If you want quarry to remain, focusing on the differences between quarry and superset is not constructive. What is needed is a description of how quarry will have the support it needs to be reliable, and action on the same.

For anyone who does not want quarry to be replaced, I've edited the ticket to be more explicit. The thing that quarry lacks is support. It has a sizable, and old codebase. It will fail at some unknown point and not be recoverable. Superset does not have this problem, it is simple to deploy, and there are a lot of people using it and contributing to it. If you want quarry to remain, focusing on the differences between quarry and superset is not constructive. What is needed is a description of how quarry will have the support it needs to be reliable, and action on the same.

Quarry is linking to this task in a banner asking people to give feedback on what they think about switching to superset. I think it shouldn't be surprising that people are talking about which one they like better.

As far as quarry having support, maybe so, but at the same time, its pretty unclear what the ask is. Are there important bugs that have been sitting there unfixed? Does it just need a new sysadmin to volunteer to be responsible for sysadmining it? Something else?

Quarry is linking to this task in a banner asking people to give feedback on what they think about switching to superset. I think it shouldn't be surprising that people are talking about which one they like better.

The assumption was that people would read the ticket itself, which describes the rational of the intended work. And respond to that. Perhaps that assumption was overly ambitious.

As far as quarry having support, maybe so, but at the same time, its pretty unclear what the ask is. Are there important bugs that have been sitting there unfixed? Does it just need a new sysadmin to volunteer to be responsible for sysadmining it? Something else?

Quarry needs people working on it. For close to two years now that has been mostly me. Lacking greater community support, it is clear to me that it is a project that will eventually fail. Further it is clear that quarry is a project that is in use. As such letting it simply crash some day seems inappropriate. In particular when there are alternatives that take much less time to maintain.

If superset were to completely fall over, project was removed or some such, https://github.com/toolforge/superset-deploy/ has the instructions to bring it back, from nothing being deployed. It doesn't take long, about 20-30 minutes most of which is waiting for a k8s cluster to deploy, I've done it a few times now, and no one seemed to notice when I removed the entire back end of superset and replaced it with a different one.

I'm unaware of a similar set of instructions for quarry, if it really fell over I'm not convinced I could put it back. That leaves quarry in an uncertain position. This ticket is about mitigating that uncertainty.

Quarry doesn't get a lot of support

Then provide it with a lot of support, duh.

It is unhelpful to once again burden unpaid volunteers who often rely on Quarry for their work while having very limited or none SQL knowledge (having had queries written for them by others) to have to adapt to a new platform which judging from the comments has some serious issues such as not allowing people to share the query results without exporting them. It is much better to have a paid intern spend 5 hours bimonthly doing some maintenance than outsourcing the problem to volunteers who would spend tens of hours a week more on having to deal with a less familiar and less convenient tool. Some will just leave the idea of running queries completely, which will result in accumulation of maintenance tasks on wiki, some of those tasks if being ignored are a direct liability to the WMF, such as potential copyright violations and such.

OK this thread has bugged and annoyed me to sufficiently motivate me to attempt to use Superset. After looking at the user interface Dashboards... Charts... Datasets... for a couple of minutes ..... where do I enter my SQL query?? Oh duh, click on the "SQL" tab... that's one step I don't need to make in Quarry ... I see a more familiar window where I drop my SQL

select left(page_links_updated,8) as date,count(*),avg(page_len),sum(page_len)
from page
where page_random between 0.001 and 0.002 and page_namespace = 0
group by date;

then I click run and...

DB engine Error
Only SELECT statements are allowed against this database.

This may be triggered by:
Issue 1022 - Database does not allow data manipulation.

What the... I did use a SELECT statement, did I not?

Is this where I get support for Superset?

"Please contact an administrator if you need to run DML (data manipulation language) on this database." Is an administrator monitoring this thread?

Off-topic aside: Interesting error code # -- back in the day when I was a professional programmer, the company I worked for used System 1022 -- which doesn't get much support anymore either LOL

Then provide it with a lot of support, duh.

I’m not sure that you fully understand the burden of maintaining a tool.

Quarry needs people working on it. For close to two years now that has been mostly me. Lacking greater community support, it is clear to me that it is a project that will eventually fail. Further it is clear that quarry is a project that is in use. As such letting it simply crash some day seems inappropriate. In particular when there are alternatives that take much less time to maintain.

If superset were to completely fall over, project was removed or some such, https://github.com/toolforge/superset-deploy/ has the instructions to bring it back, from nothing being deployed. It doesn't take long, about 20-30 minutes most of which is waiting for a k8s cluster to deploy, I've done it a few times now, and no one seemed to notice when I removed the entire back end of superset and replaced it with a different one.

I'm unaware of a similar set of instructions for quarry, if it really fell over I'm not convinced I could put it back. That leaves quarry in an uncertain position. This ticket is about mitigating that uncertainty.

I think I was one of the loudest against this move, so I can just repeat my words above: I do not oppose moving to Superset. I suggest you to wait with this move some more time, hoping Quarry will not crush in this limited amount of time. Meanwhile, ask from the great Superset support all the features that users need and Quarry already has, wait them to be functional, and only then move to Superset. It even should not be all the features, I think. Maybe noone needs JSON lines output, I do not know, for example.

Am I correct in my assumption that Quarry is 100% "supported" by volunteers? Who wrote it?

Ah, looking at the Quarry home page, I see written in Python by YuviPanda. Where is YuviPanda? Have they abandoned us?

The source code site lists 36 contributors. Wow, that's way more contributors than there are to the stuff I support.

In contrast the Superset home page does not tell me who wrote it nor have any obvious link to its source code. So I can't tell how many contributors it has who are supporting it.

EDIT Oh, I see yuvipanda authored this Phabricator task, but now they have unsubscribed themselves from their own task. ??

And I see that the task was authored July 2017... nearly SIX years ago, and Quarry hasn't fallen down and crashed into an unrecoverable heap yet!!

If superset were to completely fall over, project was removed or some such, https://github.com/toolforge/superset-deploy/ has the instructions to bring it back, from nothing being deployed. It doesn't take long, about 20-30 minutes most of which is waiting for a k8s cluster to deploy, I've done it a few times now, and no one seemed to notice when I removed the entire back end of superset and replaced it with a different one.

I'm unaware of a similar set of instructions for quarry, if it really fell over I'm not convinced I could put it back. That leaves quarry in an uncertain position. This ticket is about mitigating that uncertainty.

Oh, there's the source. Written 100% in "shell"??

README: Setting up a local dev environment looks like instructions for bringing Quarry back.

Where is YuviPanda? Have they abandoned us?

Yuvi is a former WMF employee. He has not worked for WMF for quite some time.

Regardless of whether someone is a staff or volunteer, just because you worked on something once upon a time does not mean you are bound to it for the rest of your life...

Where is YuviPanda? Have they abandoned us?

Yuvi is a former WMF employee. He has not worked for WMF for quite some time.

Regardless of whether someone is a staff or volunteer, just because you worked on something once upon a time does not mean you are bound to it for the rest of your life...

So Quarry was written by a Wikimedia Foundation employee as a task that the Foundation assigned to him. The Foundation does not feel bound to support a product which it commissioned and so has turned over support to volunteers, assuming that "if you build it, they will come".

Please do correct me if I've misstated the situation.

So Quarry was written by a Wikimedia Foundation employee as a task that the Foundation assigned to him. The Foundation does not feel bound to support a product which it commissioned and so has turned over support to volunteers, assuming that "if you build it, they will come".

Please do correct me if I've misstated the situation.

By this logic anything that is ever created by the foundation must be sustained in perpetuity by the foundation. The reasoning being that the foundation is not a human and therefore can be bound as such? As noted in the quarry license the software is provided without warranty. https://github.com/toolforge/quarry/blob/main/LICENSE I'm paraphrasing but I believe the wikimedia mission is to make all knowledge accessible to all people, this does not, to me, mean that the foundation must keep every project it creates running.

In contrast the Superset home page does not tell me who wrote it nor have any obvious link to its source code. So I can't tell how many contributors it has who are supporting it.

It isn't uncommon for a project to not list who wrote it or where its source code is on its homepage. For instance I don't believe https://www.wikipedia.org/ has such information. For me at least "superset source code" in a web search brings up the source and that it has 945 different contributes. An order of magnitude difference feels significant. Though the absolute number of contributors is less important than that there are regular and recent updates. Superset has a lot more attention in this regard:
https://github.com/apache/superset/graphs/contributors
https://github.com/toolforge/quarry/graphs/contributors

I see yuvipanda authored this Phabricator task

One might want to consider when the original author of a project has suggested that other groups are doing much the same as quarry, and that we should investigate one of those alternatives, perhaps there is something to be said about how reasonable it would be to continue maintaining quarry.

And I see that the task was authored July 2017... nearly SIX years ago, and Quarry hasn't fallen down and crashed into an unrecoverable heap yet!!

! In T169452#8933340, @Base wrote:

Then provide it with a lot of support

These ignore the underlying reason of moving away from quarry. Quarry did not continue to function over the last six years unattended. Effort was put into it. Though lacking a sizable community maintaining it, quarry has fallen behind, and now takes more effort than superset does. Of course, more effort could be put into quarry, but if it takes several times the effort to get quarry to work, than it does to get superset to work, we are being inefficient, and unfortunately, have limited resources.

README: Setting up a local dev environment looks like instructions for bringing Quarry back.

I appreciate the effort that you're taking to illuminate support for quarry. Though these instructions describe how to setup an environment on one's laptop, the way quarry is deployed to production is different, and to my knowledge, not well documented. Indeed this difference, and how to make it more the same is described in T301469

... I drop my SQL

select left(page_links_updated,8) as date,count(*),avg(page_len),sum(page_len)
from page
where page_random between 0.001 and 0.002 and page_namespace = 0
group by date;

then I click run and...

DB engine Error
Only SELECT statements are allowed against this database.

I'm not sure what happened here. When I ran the same in quarry and superset, both work and give the same results on enwiki_p

... I drop my SQL

select left(page_links_updated,8) as date,count(*),avg(page_len),sum(page_len)
from page
where page_random between 0.001 and 0.002 and page_namespace = 0
group by date;

then I click run and...

DB engine Error
Only SELECT statements are allowed against this database.

I'm not sure what happened here. When I ran the same in quarry and superset, both work and give the same results on enwiki_p

Oh, Duh!

`SELECT ...select left(page_links_updated,8) as date,count(*),avg(page_len),sum(page_len)
from page
where page_random between 0.001 and 0.002 and page_namespace = 0
group by date;`

The thing tried to "help" me by putting "SELECT ..." in front of the query I copy-pasted in. After I removed that redundancy, it worked.

I'm making more sense of this, figuring out things that should have been explained to us from the start of this phab.

"Superset" is Apache Superset (see the Wikipedia article). It was created circa 2017, apparently not long before yuvipanda created this task. Wikimedia Toolforge Quarry dates back to at least 2014, so is around three years older than Apache Superset. So yuvipanda couldn't have recommended using Apache Superset in 2014 because it did not yet exist.

I see that both are written in Python, so by "Move Quarry to be an installation of Superset" do you mean you want to move yuvipanda's Python code into the Apache Superset code-set? On the results on my first Apache Superset run I see a button "DOWNLOAD TO CSV"... can we move yuvipanda's Python code for the Quarry "Download data" button (which includes six other options besides CSV) so that yuvipanda's button replaces the "DOWNLOAD TO CSV" button?

I'm aware that our superset install doesn't currently do caching of resultsets, but there is documentation in the superset docs related to caching at https://superset.apache.org/docs/installation/cache/ Is this caching something other than resultsets?

I've been working on queries like https://quarry.wmcloud.org/query/74483 which take so long that caching is v. important.

I'm making more sense of this, figuring out things that should have been explained to us from the start of this phab.

"Superset" is Apache Superset (see the Wikipedia article). It was created circa 2017, apparently not long before yuvipanda created this task. Wikimedia Toolforge Quarry dates back to at least 2014, so is around three years older than Apache Superset. So yuvipanda couldn't have recommended using Apache Superset in 2014 because it did not yet exist.

I see that both are written in Python, so by "Move Quarry to be an installation of Superset" do you mean you want to move yuvipanda's Python code into the Apache Superset code-set? On the results on my first Apache Superset run I see a button "DOWNLOAD TO CSV"... can we move yuvipanda's Python code for the Quarry "Download data" button (which includes six other options besides CSV) so that yuvipanda's button replaces the "DOWNLOAD TO CSV" button?

My understanding is that the current https://superset.wmcloud.org/ will be retained (with some improvements / tweaks / etc.) and the current https://quarry.wmcloud.org/ will be retained (without significant maintenance).

Over time https://superset.wmcloud.org/ will accrue improvements (both locally and from upstream). Over time https://quarry.wmcloud.org/ will not, falling behind in terms of maintenance and standards until it fails completely.

I expect the "Download data" / "wikitable" will almost certainly have to be completely written from one to the other.

rook renamed this task from Move Quarry to be an installation of Superset to Replace Quarry with an installation of Superset.Jun 16 2023, 11:43 AM

I'm aware that our superset install doesn't currently do caching of resultsets, but there is documentation in the superset docs related to caching at https://superset.apache.org/docs/installation/cache/ Is this caching something other than resultsets?

I previously got this deployed in the docker environment. Though so far as I could tell it was only for caching chart query results. Rather than sql lab results. I have not observed caching working for sql lab queries.

My understanding is that the current https://superset.wmcloud.org/ will be retained (with some improvements / tweaks / etc.) and the current https://quarry.wmcloud.org/ will be retained (without significant maintenance).

Over time https://superset.wmcloud.org/ will accrue improvements (both locally and from upstream). Over time https://quarry.wmcloud.org/ will not, falling behind in terms of maintenance and standards until it fails completely.

This is mostly the intention. However there is not the intention to let quarry run until it fails. Rather the intention would be (there is currently no timeline for this), when superset is mostly where we are going to get it, quarry would be set to read only, thus encouraging anyone who hasn't started using superset to copy their queries over and start. The concern is if we let quarry run until it fails, many queries could be lost.

I expect the "Download data" / "wikitable" will almost certainly have to be completely written from one to the other.

This is correct. Indeed anyone so inclined should submit a patch to upstream https://github.com/apache/superset

I see that both are written in Python, so by "Move Quarry to be an installation of Superset" do you mean you want to move yuvipanda's Python code into the Apache Superset code-set? On the results on my first Apache Superset run I see a button "DOWNLOAD TO CSV"... can we move yuvipanda's Python code for the Quarry "Download data" button (which includes six other options besides CSV) so that yuvipanda's button replaces the "DOWNLOAD TO CSV" button?

I'll try to clarify on this. The title was not clear, having read it again, I've updated it to be more to the point. The intention is not to replicate quarry inside of superset. The intention is to get replica searching capabilities via an upstream project to reduce maintenance burden. Thus increasing sustainability and stability of offering an easy way to do replica searches.

I'm aware that our superset install doesn't currently do caching of resultsets, but there is documentation in the superset docs related to caching at https://superset.apache.org/docs/installation/cache/ Is this caching something other than resultsets?

I previously got this deployed in the docker environment. Though so far as I could tell it was only for caching chart query results. Rather than sql lab results. I have not observed caching working for sql lab queries.

My understanding is that the current https://superset.wmcloud.org/ will be retained (with some improvements / tweaks / etc.) and the current https://quarry.wmcloud.org/ will be retained (without significant maintenance).

Over time https://superset.wmcloud.org/ will accrue improvements (both locally and from upstream). Over time https://quarry.wmcloud.org/ will not, falling behind in terms of maintenance and standards until it fails completely.

This is mostly the intention. However there is not the intention to let quarry run until it fails. Rather the intention would be (there is currently no timeline for this), when superset is mostly where we are going to get it, quarry would be set to read only, thus encouraging anyone who hasn't started using superset to copy their queries over and start. The concern is if we let quarry run until it fails, many queries could be lost.

I expect the "Download data" / "wikitable" will almost certainly have to be completely written from one to the other.

This is correct. Indeed anyone so inclined should submit a patch to upstream https://github.com/apache/superset

I've been looking at the code. I've just submitted a related pull request at https://github.com/apache/superset/pull/24440 based on an obvious issue.

I believe I now understand how to add the wiki-code export stuff, but just to clarify:

(a) Existing import/export filters in superset are bi-directional, functioning as both import and exports filters. Our requirements for wikitable (and potentially other formats) are only for export not for import.

(b) We're not worried by relative straight-forward terminological changes, for example 'download' becoming 'export' etc.

(c) Quarry has code for stripping out invalid UTF-8 characters. As far as I'm aware these are not an issue in the modern era and I propose ignoring them

(d) The WMF security team has resources to check wiki encoding stuff, once it's done. I'm aware that (re)coding user submitted data is a serious security issue and '|' is magic in more ways than one.

One piece of the puzzle I'm missing right now is a wikimedia dataset that I can access over the open internet for testing purposes. Doesn't need to be fast, large, or have write access, but does need a decent range of UTF-8 characters.

Quarry has code for stripping out invalid UTF-8 characters. As far as I'm aware these are not an issue in the modern era and I propose ignoring them

There are two things this probably relates to:

  • binary fields like cl_sortkey. Still exists just as much as ever.
  • things being truncated in the middle of a multibyte character. I think most of these are fixed but i would be surprised if it is 100%

The WMF security team has resources to check wiki encoding stuff, once it's done. I'm aware that (re)coding user submitted data is a serious security issue and '|' is magic in more ways than one.

If you mean exporting to wikitext then not really. Worst case scenario is you make broken wikitext which sucks but is not a security issue.

(a) Existing import/export filters in superset are bi-directional, functioning as both import and exports filters. Our requirements for wikitable (and potentially other formats) are only for export not for import.

Yes, I believe these would only be for exports.

(b) We're not worried by relative straight-forward terminological changes, for example 'download' becoming 'export' etc.

Correct, differences in UI naming conventions, locations of things, and similar we shouldn't worry about.

(c) Quarry has code for stripping out invalid UTF-8 characters. As far as I'm aware these are not an issue in the modern era and I propose ignoring them
(d) The WMF security team has resources to check wiki encoding stuff, once it's done. I'm aware that (re)coding user submitted data is a serious security issue and '|' is magic in more ways than one.

I have no knowledge on these two fronts.

One piece of the puzzle I'm missing right now is a wikimedia dataset that I can access over the open internet for testing purposes. Doesn't need to be fast, large, or have write access, but does need a decent range of UTF-8 characters.

Would the "mywiki" database that is loaded into the quarry docker dev environment work for this purpose?

(c) Quarry has code for stripping out invalid UTF-8 characters. As far as I'm aware these are not an issue in the modern era and I propose ignoring them

Quarry does not currently handle binary fields such as Categorylinks.cl_sortkey correctly, see https://quarry.wmcloud.org/query/74585 So what superset does it can't be a blocker. Anyone who knows of any uses of binary or non-UTF8 in Quarry, I'll need some examples for how and why it should work.

One piece of the puzzle I'm missing right now is a wikimedia dataset that I can access over the open internet for testing purposes. Doesn't need to be fast, large, or have write access, but does need a decent range of UTF-8 characters.

Would the "mywiki" database that is loaded into the quarry docker dev environment work for this purpose?

I believe that I have a suitable testcase now.

[…] I started by trying out this query from my Quarry profile: https://quarry.wmcloud.org/query/18758.

Why would the limit statements be needed in this case? Won't MIN and MAX only return 1 value regardless?

For this one query, there's another way to get the answer indeed. For example, I could use the the sql CLI from tools-login shell instead.

I believe the issue however shows a larger problem which is that Superset gets confused by parenthesis which are required for many kinds of SELECT-UNION queries.

In the simplest case of a select query that needs no LIMIT, indeed the parenthesis can be ommitted from the first segment (something I did not realize was also valid albeit less-intuitive SQL). However, I don't think we want to have to document and explain that as part of a "Superset FAQ", it's a bug in the way that Superset determines the query verb since it does support parenthesis and UNION. The validator in Superset prevents it from running in this case.

I noticed something rather creepy in Superset. It seems to track by default every link and query you open, even if you merely clicked a link, it saves this (forever?) in a way that is attributed to the logged-in user and publicly visible on your profile. E.g.:

Screenshot 2023-06-19 at 15.06.26.png (954×1 px, 146 KB)
Screenshot 2023-06-19 at 15.06.34.png (752×1 px, 87 KB)

This feels like a violation of privacy and very much an unconsented intrusion into personal space, to publicly record when and what I looked at, even if merely clicking a link to a query I didn't write or run myself.

I first observed this in the private instance at https://superset.wikimedia.org/, but it affects the Cloud one as well where it is arguably worse given that it's public.

Is there a way to turn this tracking off by default on our installs?

Looked at Recent activity on my account and saw nothing. That is despite me having around 30 saved queries. I have alpha and OAuth rights.

In the simplest case of a select query that needs no LIMIT, indeed the parenthesis can be ommitted from the first segment (something I did not realize was also valid albeit less-intuitive SQL). However, I don't think we want to have to document and explain that as part of a "Superset FAQ", it's a bug in the way that Superset determines the query verb since it does support parenthesis and UNION. The validator in Superset prevents it from running in this case.

Very good, please open a ticket with superset https://github.com/apache/superset/issues

I noticed something rather creepy in Superset. It seems to track by default every link and query you open, even if you merely clicked a link, it saves this (forever?) in a way that is attributed to the logged-in user and publicly visible on your profile. E.g.:

This feels like a violation of privacy and very much an unconsented intrusion into personal space, to publicly record when and what I looked at, even if merely clicking a link to a query I didn't write or run myself.

I first observed this in the private instance at https://superset.wikimedia.org/, but it affects the Cloud one as well where it is arguably worse given that it's public.

Is there a way to turn this tracking off by default on our installs?

I think this is restricted to interactions with dashboards, rather than charts and queries. We might be able to modify this. Could you open a phabricator ticket with the superset.wmcloud.org tag to investigate and discuss?

I've opened a superset discussion at https://github.com/apache/superset/discussions/24455 related to getting superset to export wikitables.

I'm an irregular user of Quarry but find it very useful when I do need it. I found some major accessibility problems with Superset with both JAWS (my primary Windows screen reader) and NVDA, when I tried it with a query I'd just done on Quarry (I have no accessibility issues at all with my limited use of the latter program):

  • I can't figure out how to get from the homepage to the SQL Labs section with either screen reader. When I press enter on the SQL button/menu item/whatever it is, nothing seems to happen. I only managed to get there from the link at Wikitech. The settings button at least seems to work though.
  • When the query results come up, the table seems to be some auto-scrolly thing that barely works in NVDA and almost doesn't work at all in JAWS.

If someone could perhaps report these issues upstream in a more useful way than what I've just done, I'd appreciate it. Or at least help me to formulate more useful issue reports ...

I'm an irregular user of Quarry but find it very useful when I do need it. I found some major accessibility problems with Superset with both JAWS (my primary Windows screen reader) and NVDA, when I tried it with a query I'd just done on Quarry (I have no accessibility issues at all with my limited use of the latter program):
If someone could perhaps report these issues upstream in a more useful way than what I've just done, I'd appreciate it. Or at least help me to formulate more useful issue reports ...

These are important issues to report. Unfortunately I have no experience with JAWS or NVDA, you would be the much better person to report to superset, being able to describe your experience directly, and offer insight to any questions that may arise. Please do open an issue at https://github.com/apache/superset/issues

I'm an irregular user of Quarry but find it very useful when I do need it. I found some major accessibility problems with Superset with both JAWS (my primary Windows screen reader) and NVDA, when I tried it with a query I'd just done on Quarry (I have no accessibility issues at all with my limited use of the latter program):
If someone could perhaps report these issues upstream in a more useful way than what I've just done, I'd appreciate it. Or at least help me to formulate more useful issue reports ...

These are important issues to report. Unfortunately I have no experience with JAWS or NVDA, you would be the much better person to report to superset, being able to describe your experience directly, and offer insight to any questions that may arise. Please do open an issue at https://github.com/apache/superset/issues

Fair enough, done ... after a bit more testing the settings menu item is a bit weird with screen readers too. Anyway here they are: https://github.com/apache/superset/issues/24687 and https://github.com/apache/superset/issues/24688

Is there any tutorial on how to use the Superset? I tried to run a simple query on plwiki but initially failed to find a database. It would be helpful to have a welcome page that explains how to find databases or add the same links Quarry has in the menu.

By the way, naming things s1-s9 is just bad, sorry. I understand that it might be some convention known by the devops team, but displaying it in the GUI without any explanation is not user-friendly. Even when the user is a developer. So please either explain this or just drop the current database filter if possible, and either show all schemas in the 'schemas' field or, even better, rename it to 'databases'. That would be more user-friendly.

I also don't see any saved quires that would help me (and others) learn by example. You should probably import queries from quarry to start things off.

I must say for now I mostly agree with IKhitron. For now Quarry seems like a tool with much better integration into Wikimedia servers the Superset has. Export formats are important. Though having the ability to generate charts in Superset is nice.

Two more things missing (worse then Quarry):

  • There seem to be no way to save the results (not the query). It takes some time to load charts upon most page reloads. There seem to be some cache but it is more volatile. Quarry saves results and display them immediately.
  • I noticed I can export chart as JSON. This is nice. Though again this seems slow. Like it is re-calculating results again.
  • There seem to be no way to share results with a large group. Quarry only requires devs to log-in. Superset requires a login even when I use a permalink for a chart.

All this seems like something that would kill WMF servers if Superset would be used on a large scale. Especially if it would be more open.

See subtasks, documentation is being worked on.

I've opened a superset discussion at https://github.com/apache/superset/discussions/24455 related to getting superset to export wikitables.

I've opened an issue based on the discussion.

So is it correct that we're looking for a new maintainer, but only in the capacity of migrating all usage of Quarry to Superset? That is, no new features are planned for or expected of Quarry and we expect to turn it down once Superset has feature and use case parity?

So is it correct that we're looking for a new maintainer, but only in the capacity of migrating all usage of Quarry to Superset?

See https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/FXALIAMBBDHVBCOOWIAQ25THBVXQTUCI/ . (I'd also say that there is no mystical power above maintainers that could dictate that "no new features are planned".)