Research allowing read-only access to the superset api from requestctl's web UI
Open, Needs Triage, Public

Description

We want to have read-only access to the superset api to fetch data about dashboards, ideally without user interaction.

Given that superset's authn/authz is configured using python code, it might be possible to grant requestctl special-case, read-only access to a narrow part of the api, using a shared secret for authentication.

If we can make what we allow narrow enough, it shouldn't have big security implications.

Event Timeline

I think that we can do this, but we should try to avoid changing the production Superset instance if we can.

Instead, we would deploy a new private Superset instance using the same database backend, but with a modified authentication configuration, specifically to support internal API use cases.

We have already been thinking about how to provide access to the Superset API in T309622 where we would like to read metadata about dashboards into DataHub, on a schedule.
However, we haven't updated that ticket since before we migrated Superset to Kubernetes. It should be much easier to implement this custom instance, now that we have finished that migration project.

It seems to me that these two use cases are pretty similar, in that they only need read-only access to the API and only need access from within eqiad.
So I think that we can work on this in a low-risk kind of way, fairly quickly.

We could choose what sort of authentication mechanism to use. When I did the initial scrape of Superset metadata into DataHub (T306903#7959985) I used AUTH_TYPE=None but this is obviously not great for any production use case.

I can see two ways forward, both of which we explored recently with Airflow in T375716: Ensure the Airflow API can be reached out to from within Kubernetes and is authenticated

  1. We could use AUTH_TYPE=DB and create a specific user in the database to allocate the necessary rights. Create a specific role for this user, if required.
  2. We could use AUTH_TYPE=KERBEROS and just allocate the role to the Kerberos user. I don't think that we would need the user record in the database for this. https://www.restack.io/docs/superset-knowledge-apache-superset-kerberos-integration
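
For option 1, the configuration change on the private instance could be very small. This is only a sketch, assuming the instance reads a standard superset_config.py and that Flask-AppBuilder is importable there (it ships with Superset). The comment about the metadata database is an assumption about how the deployment would be wired, not something that exists yet.

```python
# superset_config.py for the hypothetical internal, API-only instance.
# Assumption: the metadata database settings are the same as production,
# so only the authentication settings are shown here.
from flask_appbuilder.security.manager import AUTH_DB

# Authenticate against users stored in the Superset metadata database.
AUTH_TYPE = AUTH_DB

# Do not let anyone self-register; accounts are created by an admin.
AUTH_USER_REGISTRATION = False
```

Because the users table lives in the shared metadata database, a service account created for this instance would also exist from the point of view of the production instance, which is worth checking before going further.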

I've seen some examples of how people have managed this kind of secure API access (https://stackoverflow.com/a/76374651), so I think we would be OK.
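
To make the DB-auth flow concrete, here is a rough client-side sketch using only the Python standard library. It assumes the stock Superset REST endpoints `/api/v1/security/login` and `/api/v1/dashboard/`. The base URL, the account name and the function names are placeholders invented for the example, not anything that exists in requestctl today.

```python
import json
import urllib.request


def build_login_request(base_url, username, password):
    """Build the POST that exchanges DB credentials for a JWT access token."""
    payload = {
        "username": username,
        "password": password,
        "provider": "db",  # matches AUTH_TYPE=DB on the server side
        "refresh": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/security/login",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_dashboard_list_request(base_url, access_token):
    """Build the read-only GET that lists dashboards."""
    return urllib.request.Request(
        f"{base_url}/api/v1/dashboard/",
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


def fetch_dashboards(base_url, username, password, opener=urllib.request.urlopen):
    """Log in, then return the 'result' list of the dashboard listing.

    `opener` is injectable so the flow can be exercised without a network.
    """
    with opener(build_login_request(base_url, username, password)) as resp:
        token = json.load(resp)["access_token"]
    with opener(build_dashboard_list_request(base_url, token)) as resp:
        return json.load(resp)["result"]
```

Only GET requests are issued after login, so a role limited to read permissions should be enough for this client.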

I would frankly go with option 1 so we are more flexible - I want to authenticate on this second superset instance from a non-kerberized host.

So basically we would need:

  • A second deployment of superset, with similar setup but with AUTH_TYPE=DB sharing the same database for data sources, slices and dashboards
  • A new user with the Gamma or sql_lab role (although neither is properly read-only). How do we create it: manually, or is there a way to do so programmatically?
  • A role to allow this user to access the dashboards created from the webrequest_live data source
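
On the "programmatically" question, one possible approach is to go through the Flask-AppBuilder security manager that Superset uses, for example from a one-off script run inside the app context. The sketch below is an assumption-heavy illustration: the role name `requestctl_readonly`, the account details and the list of permissions are all hypothetical, and access to the webrequest_live data source itself would likely need an additional grant that is not shown here.

```python
def ensure_readonly_api_user(
    sm,
    username,
    password,
    email="requestctl@example.invalid",  # placeholder address
    role_name="requestctl_readonly",     # hypothetical role name
    read_perms=(("can_read", "Dashboard"), ("can_read", "Chart")),
):
    """Idempotently create a read-only role and a service user holding it.

    `sm` is expected to behave like Flask-AppBuilder's security manager
    (inside Superset: `appbuilder.sm`).
    """
    # Reuse the role if it already exists, otherwise create it.
    role = sm.find_role(role_name) or sm.add_role(role_name)

    # Attach only the read permissions that this instance actually defines.
    for perm_name, view_name in read_perms:
        perm_view = sm.find_permission_view_menu(perm_name, view_name)
        if perm_view is not None:
            sm.add_permission_role(role, perm_view)

    # Create the service account on first run only.
    user = sm.find_user(username=username)
    if user is None:
        user = sm.add_user(
            username, "requestctl", "service", email, role, password=password
        )
    return user
```

Passing the security manager in as an argument keeps the function easy to dry-run against a stub before pointing it at the real metadata database.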