Page MenuHomePhabricator

Consider moving Quarry to be an installation of Redash
Open, Needs TriagePublic

Description

https://github.com/getredash/redash seems pretty nice, and has a much larger community of people maintaining it. The license is acceptable, and it looks like we can make it work for the Quarry model (need to log in with Wikimedia acct to make queries, but reading queries is public (via hacking their remote user authenticator support).

This probably will be good for the long term health of Quarry.

Event Timeline

You'd have to do some amount of proxy magic to get it to work with mediawiki auth. We should have an authenticator running as a separate app. The proxy should check for a cookie, and try to validate it (with a HMAC, shouldn't be too hard). If it fails, or there's no cookie, we'll redirect to our authenticator, which will do the MW OAuth flow and set a cookie. On second attempt, we'll see the valid cookie in the proxy, and set a trusted header for Redash to consume.

For making this work with 'anonymous' users (ie everyone can read), only specific URLs should require login. This should be probably managed at the proxy level. Everything else, when the cookie validation fails, will just use an 'anonymous' user that's pre-created with a very specific set of rights.

My main concern with this kind of move would be preserving the basic functionality of Quarry in redash. E.g. permalinks to results, recent queries, user queries, public only, published queries, etc.

If we can get all of that for free, then let's do it. :)

@Halfak is it a deal-breaker if we couldn't migrate the history of Quarry to Redash? I'm wondering if you care as much about the history as the features themselves. If it's the history, I'd say that would take too much work and will most likely never happen.

Don't know whether it should be considered a dealbreaker, but FWIW: Apart from permalinks ensuring reproducibility of individual anayses, the history is also an accidental but very valuable source of knowledge about how to use the data available in Quarry. E.g. I have benefited several times from doing a Google site search for a particular table or field name, to find out how smarter people than I have accessed them before. Perhaps we can preserve the history as a static HTML copy under the same URLs?

history is also an accidental but very valuable source of knowledge

To clarify, the decision to include a query history was not accidental at all. That was very purposeful design choice.

If we can't port the history, maybe we can keep a static Quarry instance online that allows use and searching of the old queries. We certainly should *not* brake old URLs.

Yup, we can keep a static version running forever.

re: history, I also record query history + output history, never wrote UI to expose it :(

Ahh yes, but the URLs to old result sets still work.

As a note, @zhuyifei1999 has graciously offered to look at some of the outstanding Quarry issues. I've given them merge rights + labs admin.

@zhuyifei1999 thank you for taking on Quarry stuff!

For those interested in connecting redash to the quarry dev env the following seemed to get there.

The setup.sh script from:
https://github.com/getredash/setup
seemed to work fine on ubuntu 20.04 getting everything setup and visible on localhost:5000

Quarry bits:
redash is using 5000 so change the 5000 to something else (5002) in:

Dockerfile
docker-compose.yml
quarry/web/app.py

also...
we can access the test db for quarry by adding the following to the docker-compose.yml in quarry, and, presumably, starting it after redash:

networks:
  default:
    external:
      name: redash_default

get the root password for the test quarry from the output on startup:
mywiki_1 | 2022-11-19 16:07:05+00:00 [Note] [Entrypoint]: GENERATED ROOT PASSWORD: <password>

get the ip for the mywiki container with:

sudo docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' quarry_mywiki_1

Superset (https://github.com/apache/superset) is likely worth consideration. At this point it has greater activity than Redash. In a counter point, at a glance Superset is more focused on "Right now" so we might need to discuss what quarry is really trying to solve to decide if that is ok or not.

In Quarry a history of both the query and its result is maintained. Is there notable value in maintaining the result of the query? Or is the history really about the query itself.

I understand the value of preserving the queries, Superset does this. I don't understand the importance of the results of the query needing to be preserved. Perhaps this is not valued?

In my point of view, there's a couple high value reasons to store result sets for queries.

Caching: If we don't store result sets in quarry, then every time someone wants to see the results of the query, it needs to re-run. Many queries are too expensive for that.

Stability: I can run a query and expect that the result URL is stable. Thus any data pipeline I keep that result in is also stable. If I'm using the results of someone else's query, I know those results won't change even if they re-run the query. That's handy for a lot of tasks. E.g. having stable and predictable machine learning pipelines.

Thank you for suggesting these as reasons. I would agree, this view creates greater value to having the result as well. I'll keep that in mind as I continue searching and tinkering with possible software.

I agree this is something worth exploring. I'll add Metabase to the options, I had good experiences in the past and the features overlap with those of Redash and Superset.

Superset seems to have caching https://superset.apache.org/docs/installation/cache/

Though I'm having trouble finding documentation on it in official looking places.
https://subscription.packtpub.com/book/all-products/9781788992244/2/ch02lvl1sec23/creating-the-metadata-database

Still some tinkering with the local docker install has yet to result in me figuring out how to get the results of a query saved past a reload.

I'll see if Metabase does this more by default.