Investigate easier methods for WMF staff to access Superset
Open, Medium, Public

Description

The process for WMF staff to get access to Superset is quite complex, despite significant and valuable work by Data Engineering and SRE to make it easier. If this were a service needed only by some engineers, this would probably be acceptable; however, Superset is a dashboarding tool which the large majority of WMF staff will eventually need to access.

Event Timeline

@MoritzMuehlenhoff We are concerned that making Superset accessible only with a Yubikey might be too big a barrier for many of the users we need to support. Could we explore the possibility of making Superset accessible via VPN-based authentication? (cc @elukey)

@Nuria we decided not to pursue the VPN road in other tasks. What kind of barrier would a Yubikey represent? In theory, it will definitely be more problematic to explain to people how to use a VPN than a Yubikey. Are there any specific concerns?

For reference: https://phabricator.wikimedia.org/T242998 (discussion about VPN with the security team)

Nuria renamed this task from Investigate accessing superset via internal VPN to Investigate accessing superset via internal VPN or google oauth. Jul 28 2020, 4:39 PM

The Analytics team had a quick chat about the current authentication issues for users of Analytics UIs. The high-level summary is the following (IIUC):

  • we have a variety of users with different technical backgrounds, who may or may not be familiar with all the procedures to get a Wikimedia Developer Account, Phabricator account, etc.
  • to be able to see a dashboard created by Kate's team, the new potential user needs to have:
    • A Wikimedia developer account (that may already be confusing for a non-technical user)
    • A Phabricator account to get access to LDAP groups wmf/nda if not already added by OIT when joining Wikimedia.
    • A Yubikey for 2FA when/if we enable it.
    • A POSIX user in puppet (even without ssh access) to be able to use the Superset integration with Presto, which through various auth-proxies/delegation reaches the Hadoop files that in turn need a valid user to be read. This last bit is very new and is not required to use Superset, but if the usage of Presto expands in the future it might be.
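
As a rough illustration of the checklist above, here is a minimal Python sketch that reports what a new user is still missing. The labels, the `Person` class, and the function are all hypothetical; nothing here corresponds to existing tooling.

```python
from dataclasses import dataclass

# Hypothetical labels for the four prerequisites listed above.
PREREQUISITES = (
    "developer_account",  # Wikimedia developer account
    "ldap_group",         # membership in the wmf/nda LDAP groups
    "yubikey",            # 2FA token, only when/if 2FA is enabled
    "posix_user",         # POSIX user in puppet, only needed for Presto
)


@dataclass
class Person:
    name: str
    has: frozenset  # labels of the prerequisites already in place


def missing_prerequisites(person, needs_presto=False, yubikey_required=False):
    """Return the prerequisites the person still lacks, in checklist order."""
    missing = []
    for item in PREREQUISITES:
        if item == "posix_user" and not needs_presto:
            continue  # not required for plain Superset usage
        if item == "yubikey" and not yubikey_required:
            continue  # 2FA is not enabled yet
        if item not in person.has:
            missing.append(item)
    return missing


# Assumed example: a new hire who only has a developer account so far.
new_hire = Person("example-user", frozenset({"developer_account"}))
print(missing_prerequisites(new_hire))
print(missing_prerequisites(new_hire, needs_presto=True, yubikey_required=True))
```

The point is only that the list grows with each optional feature (2FA, Presto), which is where non-technical users are most likely to get stuck.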

Two solutions are highlighted:

  • VPN (as discussed in T242998) - any user should have a way to authenticate to production via a client (I know about client SSL certs and OpenVPN, but there are other solutions) that should be strong enough to identify the user and allow them to join the production network (together with access to all the UIs like Turnilo and Superset with PII data). There is also the concern about allowing anybody who hasn't followed the process for access to production (signing the document about how to behave, etc.) to join their laptop/computer to a slice of the production network (we don't strictly control what runs on our laptops). There could be only one authentication method, but in my opinion it may not be easier than what we have now. This solution would also require a lot of work for multiple teams (Analytics, Security, SRE).
  • Google Auth - UIs like Superset can work with multiple SSO auth systems, so in theory we could have both CAS and GAuth. Security-wise, we'd mix access to tools like email with access to production, so SRE/Security will need to sign off. It would be very handy for users since they wouldn't need extra accounts for Wikitech etc., but at the same time it would mean granting access to production with an auth system that is not designed for that (unlike CAS-SSO/idp.wikimedia.org). Moreover, in Superset we need some LDAP info about the user to be able to map a Wikimedia Dev account to a POSIX user in Hadoop (and hence be able to use Presto etc.).
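
To make the last point about Google Auth concrete, here is a small Python sketch of the mapping problem. It assumes that a CAS/LDAP login carries LDAP attributes including a `uid` that matches the POSIX user, while a Google login carries only an email address. The dictionary shapes and the function name are hypothetical and are not Superset's actual API.

```python
def posix_user_for(identity):
    """Return the POSIX user name for an authenticated identity, or None.

    Assumption: only identities backed by LDAP carry a `uid` attribute
    that can be used as the Hadoop/Presto user.
    """
    ldap_attributes = identity.get("ldap")
    if not ldap_attributes:
        return None  # nothing to map from, e.g. a Google-only login
    return ldap_attributes.get("uid")


# Assumed example identities, for illustration only.
cas_login = {"provider": "cas", "ldap": {"cn": "Example User", "uid": "example-user"}}
google_login = {"provider": "google", "email": "example-user@example.org"}

print(posix_user_for(cas_login))
print(posix_user_for(google_login))
```

With a Google-only login there is nothing to derive the POSIX user from, so some extra lookup (or a maintained mapping) would be needed before Presto could be used.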

Both solutions seem to require a long time to discuss and possibly implement. Meanwhile, we really need to find a good short-term solution to protect our data in a better way. What we could work on is:

  • More automation, in collaboration with OIT, for the creation of accounts, LDAP groups, etc. If a new user joining could already get the accounts needed, it would probably be less of a problem upon first access to tools like Superset.
  • A Yubikey shipped to every new user together with the laptop etc. (contractors could get one as well). The cost of a Yubikey has dropped a lot (it is now ~20 dollars), so I don't really think that it should be a financial problem to have it.

It would also be very nice to use the Yubikey with other accounts like the Google mail one, so it would not be something related only to Analytics UIs.

More discussion is needed of course :)

fdans triaged this task as Medium priority. Aug 3 2020, 4:35 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
nshahquinn-wmf renamed this task from Investigate accessing superset via internal VPN or google oauth to Investigate easier methods for WMF staff to access Superset. Jan 24 2022, 6:36 PM
nshahquinn-wmf updated the task description.
nshahquinn-wmf removed subscribers: chasemp, Tbayer.

The difficulty of accessing Superset is still an issue (despite significant and valuable work by Data Engineering and SRE to make it easier). Investing in easier access methods for WMF staff would have significant benefits.

I think to really get this fixed, someone high up in Tech and Product needs to convince SRE that this is something worth spending resources on. SRE manages how people access the production networks, and they aren't in the habit of granting non-technical people access, so I don't think they see this as a pressing problem.

Strong +1, any other avenue would create more problems than it solves in my opinion.

the large majority of WMF staff

Is it really true that almost everyone needs access to private data to do their job? I used to think we kept that access minimal to protect data and follow the principle of least privilege. And looking at access requests, it seemed like some are created by default just "in case" someone might need it. Switching to a model where everyone gets it would be a new paradigm. But if we have changed so much towards "data driven" and self-service that now everybody needs it, instead of a few people in analytics, then times have just changed I guess.

a new user joining could already get the accounts needed

This! This would strike me as totally standard procedure in most other organisations.

Regardless of how many people get certain groups or how easy the process is, the actual problem seems to me that we _never start it before the person starts working here_.

We should get to a world where we know weeks ahead when a new person starts and they should have all their access just ready on day 1. They should not have to ask for it themselves after they already started working here. Just my 2 cents.

And if they don't have to worry about it and other (tech) people create the access, then does it matter how easy we make it? Couldn't the SREs within Data-Engineering create those accounts for the users after ITS or HR tells them the start dates and everyone knows what roles we are hiring for? Then nobody besides SRE would care how exactly it is done.