https://github.com/jupyterlab/jupyterlab-git adds Git integration to the JupyterLab interface. Investigate whether this can be added to PAWS.
I'm assuming that this would be something we'd consider for PAWS and not for the JupyterHub instance that has access to private data? Maybe my assumption is off, but for that instance it seems like it would make it too easy for someone to push something that accidentally contains PII.
It occurs to me that this will encourage people to put ssh private keys on PAWS, won't it? This is something that we generally discourage, as all of the home directories are public on public-paws.wmcloud.org
@AndrewTavis_WMDE could you provide your thoughts on that?
Hi @rook! Thanks for your thoughts on this :) Not having a full understanding of the infrastructure I guess I have a few more questions/thoughts on my end:
- What you're referring to here is that someone would be putting the ssh keys they use for GitHub/Git into their home directory on PAWS, where they would then be visible on public-paws.wmcloud.org?
- Would we be required to save the keys on PAWS itself, or could we instead run terminal scripts to make the connection each time we wanted to use git?
The big thing that we're discussing in WMDE Analytics at this point is a threefold problem of working effectively with notebooks. We want them to be:
- ... version controlled and public where appropriate (GitHub public and private repos)
- ... easily run on a server in a way that our credentials are not compromised
  - PAWS for public-facing data
  - The private instance of JupyterHub
  - Potentially notebooks themselves or Google Colab, with any needed credentials accessed via code (.gitignored where needed; see the sketch after this list)
- ... transferable between the run environment and the version-controlled file
  - Hence this question of linking GitHub and PAWS via the extension
  - I'm unsure how we would effectively version control code that gets run on the private JupyterHub at this time; suggestions are welcome!
  - I mention Colab only for public data, as obviously creating a shareable link with private data queries would not be acceptable
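For illustration, a minimal sketch of the .gitignore pattern mentioned above: credentials live in a file that version control never sees, so the notebook itself can be committed and shared. The file name and keys here are hypothetical.

```python
# A minimal sketch: credentials live in a file that .gitignore excludes,
# so the notebook itself can be committed. "credentials.json" and its
# keys are hypothetical names, not an established convention.
import json
from pathlib import Path

# .gitignore contains the line: credentials.json
creds = json.loads(Path("credentials.json").read_text())

db_user = creds["db_user"]          # hypothetical key
db_password = creds["db_password"]  # hypothetical key
```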
If you have an idea of how WMF is using and sharing notebooks between analysts/scientists, or could tag someone who might, then maybe we can close this task via a simple knowledge transfer :) I've used JupyterHub with this extension in another position and found it very useful, but I understand that there are dramatically different things to consider within an open server.
This conversation is bringing to light that we are encouraging ill-advised behavior by not having git be a central element of PAWS. Rather, it is assumed that everything will always be available on PAWS forever and that git isn't really needed. I wonder if we should rethink that...
I apologize, I'm dense and didn't understand much of what you wrote. My understanding is that you are looking for a way to run notebooks in one venue, perhaps for testing (?), then transfer the notebook to another, private, venue to run it. I further apologize as I know nothing beyond "it exists" when it comes to the private JupyterHub.
The way that people share when it comes to PAWS is, so far as I know, largely through the public interface. For instance here are all the things in my PAWS home dir:
https://public-paws.wmcloud.org/User:VRook_(WMF)/
Since it is quite public, we discourage storing any secrets: even things that are not shared are only one typo away from being shared. At any rate, if I wanted to share anything on PAWS, I would send a link to the desired object.
As for pulling things into the private instance of JupyterHub from PAWS, could you not pull them in from the public link? I'll straight up agree that git is the better solution, but I have to accept that it isn't feasible in the PAWS of today.
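A minimal sketch of what pulling from the public link could look like; the user path and notebook name here are hypothetical.

```python
# A sketch, assuming a notebook published at a hypothetical public PAWS
# path. Downloading it produces a hard fork with no link back to the
# original.
import requests

url = "https://public-paws.wmcloud.org/User:Example/analysis.ipynb"  # hypothetical
resp = requests.get(url, timeout=30)
resp.raise_for_status()

# Save it into the current server's home directory.
with open("analysis.ipynb", "wb") as f:
    f.write(resp.content)
```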
You're by no means dense - let's get that out of the way :) :)
We're trying to make it easy for our team to work on the same notebook together via version control and then, if possible, run it in the environments set up by Wikimedia (this makes our venv infrastructure simpler). We can of course share the notebooks with one another through the public interface, but then if a change needs to be made, as far as I can tell it requires copying and pasting into our personal versions of the file 🤔 Further questions:
> could you not pull them in from the public link?
- Could you explain this a bit more? Is it possible to populate a notebook from the public link? To me the links seem to be more for sharing results and not something that you can directly import into your own instance and then share a common copy of. Maybe there's a workaround, but to me the "forks" we're talking about are hard forks not meant for collaboration. If there's a way to update a PAWS notebook via the public link, this could work :)
- Would it be possible to link PAWS to local files and read in SSH credentials via paths rather than saving them in PAWS itself? So by this I mean local access and then say that the SSH keys are located at .ssh/id_rsa and .ssh/id_rsa.pub?
I'll preface by agreeing that git is, by far, the better option for what you're looking to do than anything I'm about to suggest.
Yes, I think that's it. PAWS allows people to pull down whatever someone has made from the public interface, though it is hard forking and terrible for collaboration. PAWS does allow for some collaboration through the "Share" feature, which can generate a link and token that a user can give to anyone, who will then be able to access their whole server. At that point they could edit wikis or the like as whoever gave them the link, so there is a security detail to be aware of. Though it does, in concept, allow for real-time collaboration. (Still less effective than git.)
> Would it be possible to link PAWS to local files and read in SSH credentials via paths rather than saving them in PAWS itself? So by this I mean local access and then say that the SSH keys are located at .ssh/id_rsa and .ssh/id_rsa.pub?
I don't quite follow what you mean by local access. Is local your laptop? Or is local the PAWS server container? If you're referring to the ssh key being on your laptop, I'm not sure how PAWS would be able to get at it in our current setup (everything is through the browser). If you mean local to PAWS, then we run into the same problem: namely, that the key is a typo away from being public.
Now I say a typo away because it should be a typo away. In PAWS, anything with a . in front of it shouldn't be shared; nginx should forbid it. But one change in nginx, or an accidental cp or the like, and suddenly a secret thing is very much not secret. There is a security risk there as well. Though how can we provide meaningful git collaboration without keys...
So you specifically want jupyter notebooks, yes? I ask because if the project were for anything else I would have pointed you to toolforge or cloud VPS for this.
> I don't quite follow what you mean by local access. Is local your laptop?
Was just something I was thinking out loud about. I get that it's fully in the browser and was just wondering about the process of potentially loading stuff in a temporary manner.
> Now I say a typo away because it should be a typo away. In PAWS, anything with a . in front of it shouldn't be shared; nginx should forbid it. But one change in nginx, or an accidental cp or the like, and suddenly a secret thing is very much not secret.
This was a great rundown! Thank you 🙏 Really was helpful :)
> So you specifically want jupyter notebooks, yes? I ask because if the project were for anything else I would have pointed you to toolforge or cloud VPS for this.
Notebooks are just what I'm most familiar with, but if there are better solutions within toolforge or cloud VPS then I'm all ears. There appear to be some open questions on the WMDE end about which tools we should be using moving forward. I'm a really big fan of wmfdata-python already, so anything that allows me to leverage it and similar tech, as well as environments that we set up so that data can be protected and software versions are consistent, would be more than welcome 😊
ooo! This opens up all kinds of wonderful possibilities!
Could you describe in some detail what kind of work you're looking to do? My current understanding is mostly that things are to be made in a collaborative manner, but I don't know what those things are, or where they would run.
Appreciate your willingness to discuss all this in such detail! :) :)
A specific task that originally brought me to the notebooks was the idea of doing quarterly reporting, which might be something you could give me specific suggestions on based on how WMF handles it. At the end of the pipeline is a Google Slides deck that needs information and visuals copy-pasted into it. The person entering the information may be on our data team, but we also want to take this initial moment to work towards a bit more data autonomy if possible, as we're working on so much baseline infrastructure/practices. This brought me to notebooks as something that could, via Python:
- Connect to the databases via a venv
- Do any needed data manipulations (these could also be done via preprocessing to a table in the data lake)
- Generate and display visuals that would be saved in a folder for that specific report (see the sketch after this list)
  - Notebooks would further be useful in this regard as markdown could be used to direct non-analytics stakeholders to where they can find the .png files
- The general thought was that the final step would be a cron job that would run the notebook and potentially even send an email notifying stakeholders that the data for the quarterly reports is ready
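A minimal sketch of what the query-and-visuals steps could look like, assuming wmfdata-python's hive.run() (swap in spark.run() or mariadb.run() depending on where the data lives); the table name and report folder are hypothetical.

```python
# A sketch, not a working pipeline: the table and report folder are
# hypothetical placeholders.
from pathlib import Path

import matplotlib.pyplot as plt
from wmfdata import hive

# Hypothetical query against the data lake.
monthly = hive.run("""
    SELECT month, edits
    FROM example_db.monthly_edits  -- hypothetical table
    ORDER BY month
""")

# Save the figure into a per-report folder that markdown cells can point
# non-analytics stakeholders to.
report_dir = Path("reports/2023-Q1")
report_dir.mkdir(parents=True, exist_ok=True)

fig, ax = plt.subplots()
ax.plot(monthly["month"], monthly["edits"])
ax.set_title("Monthly edits")
fig.savefig(report_dir / "monthly_edits.png", dpi=150)
```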
We'd also like this to be version controlled, with GitHub being the initial thought. We'd of course need to go in after the fact and do some explanatory analysis, or, if there are aberrations in the results, do a deeper dive, with the results of these further being added via git to document the process.
Let me know what your thoughts are on this and whether there's a tool in the forge that might fit! Learning a bit about how WMF does reporting would also be quite useful :)
Hello hello o/ I'm the manager of Product Analytics and I've been doing data science at WMF for almost 8 years. There is a lot to respond to here and a lot of thoughts & additional info to share from my experience, so I will try my best to keep it all organized :)
> Let me know what your thoughts are on this and whether there's a tool in the forge that might fit!
Just want to mention here that PAWS + the redacted/sanitized replicas are a completely separate ecosystem from the internal Jupyter service, the data lake, and the unredacted replicas. If your reporting relies on data in the data lake (including preprocessing via ETL jobs), Toolforge and PAWS will not work for you.
Re: Git and PAWS
I have also run into the problem of wanting to work on a remote server and easily push commits from that clone to the repo on GitHub, rather than having to download the notebook/scripts/queries to my local machine and do the version control there. I was discouraged by the SRE team from having SSH keys even on the analytics clients, and even if I locked down permissions so the keys are only accessible to me (and SREs with root access, obviously). To that end, we have an alternative using HTTPS-cloned (rather than SSH-cloned) repos and personal access tokens: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Jupyter/Tips#GitHub
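A minimal sketch of that HTTPS + personal access token approach; the username, repo, and environment variable here are hypothetical, and the wikitech page above has the actual instructions.

```python
# A sketch of an HTTPS clone with a personal access token, so no SSH key
# is ever written to the server. Username, repo, and env var are
# hypothetical.
import os
import subprocess

user = "example-user"                                 # hypothetical
repo = "github.com/example-org/example-analysis.git"  # hypothetical
token = os.environ["GITHUB_TOKEN"]  # exported in the terminal session only

# Clone with the token in the URL, then immediately scrub it from the
# stored remote so it does not persist in .git/config.
subprocess.run(
    ["git", "clone", f"https://{user}:{token}@{repo}"], check=True
)
subprocess.run(
    ["git", "-C", "example-analysis", "remote", "set-url",
     "origin", f"https://{repo}"],
    check=True,
)
```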
Instead of adding this JupyterLab integration and inadvertently encouraging bad security practices, could we not simply replicate those instructions in the PAWS user guide and point users to them if they wish to use git?
Re: Data workflows and reporting
Okay, now on to the broader questions/topics raised in this thread around working with internal/private data (not the redacted public replicas available to PAWS users) and reporting metrics & insights.
I just want to quickly note that if you have access to Wikimedia GitLab, that is the preferred & recommended space for version control – not GitHub (and not Gerrit, due to its inability to render Jupyter notebooks and its eventual deprecation once the migration to GitLab is completed). My team has been using the wikimedia-research GitHub organization since my first day at WMF, but now that Wikimedia's GitLab instance is stable and used by our colleagues in Research for their projects and by our colleagues in Data Engineering for their pipelines, we have started to use it as well and upload our analyses there. We are also currently in the process of gaining the ability to have private repositories (T305082) for uploading analyses/reports that have not been cleared for public release.
When we upload our analyses/reports to GitLab/GitHub (public AND private), we ensure that none of the outputs contain personal information. If the report has per-country data that includes countries on the country protection list, those are excluded from any visible outputs as well. A major motivation for the private repos on GitLab is to enable version control and easier internal sharing of analyses/reports with restricted geographic breakdowns (as detailed in T305082#8739105). If you are doing your analysis on PAWS and using the redacted replicas + data available through AQS, then you do not need to deal with the privacy/sensitivity issues. These restrictions only apply if you're working with private data in the data lake using the internal Jupyter service and internal tools like wmfdata-python.
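As a minimal sketch, that exclusion step can be a filter applied before any output cell renders; the country codes below are placeholders, not the actual protection list.

```python
# A sketch of excluding protected countries before any visible output.
# The codes in PROTECTED are placeholders, not the real protection list.
import pandas as pd

PROTECTED = {"XX", "YY"}  # placeholder country codes

per_country = pd.DataFrame(
    {"country": ["XX", "DE", "YY", "FR"], "views": [10, 200, 5, 150]}
)

# Filter before display/export so protected rows never appear in outputs.
visible = per_country[~per_country["country"].isin(PROTECTED)]
```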
Regarding PAWS-like way to share notebooks on the internal Jupyter service, see T156934 (not a priority).
Regarding data collection, analysis, and reporting practices & processes: data pipelines have been difficult for us. A lot of our current pipelines (including ones where the end result is a chart in a Google Slides deck) are automated through cron & systemd timers (example: the Puppet config which schedules these ETLs), with (occasionally) manual steps at the end. If the end result is a report that is meant to be updated on some cadence and made publicly available, the cron job (run under an individual user account) copies it to /srv/published (see https://wikitech.wikimedia.org/wiki/Analytics/Web_publication & https://www.mediawiki.org/wiki/Product_Analytics/Reporting_Guidelines#Publishing_reports for more information). If the end result is a dataset not in the data lake, or a visualization, it is copied manually (e.g. CSV to Google Sheets, PNG to Google Slides).
We are currently working with Data Engineering on unifying, centralizing, and modernizing our team's data pipeline capabilities (T316049). Once that is complete, I'd be happy to share what we've learned and our system so that WMDE's analysts may also benefit from the work we're doing now. Eventually it would be great if we had a data warehouse on the Data Science & Engineering (DSE) cluster (T327267) and a UI for easily productionizing data pipelines (e.g. with dbt, Elyra), so that we're not requiring data analysts to write Airflow DAGs. (But in my view something like that is at least 2 years away.)
@AndrewTavis_WMDE: I'd be happy to share more knowledge and help you figure out a solution for your needs :) I see you've been added to the #working-with-data & #data-engineering channels in Slack, and they're the perfect place to ask these questions & have these discussions. By the way, I am currently working with WMF's Legal and Security teams on a framework & process for reviewing our work before we make it publicly available in places like GitHub/GitLab, Phabricator, and on-wiki. Unfortunately, WMF's Legal and Security would not be able to perform those reviews of WMDE analysts' work. I understand WMDE has its own Legal team who might carry out a similar type of review? Perhaps we can share these guidelines with you and WMDE's Legal team when we are done, so that you can refer to them and develop a similar framework & process? (To be clear, I'm mentioning it here as an invitation for you to follow up with me separately, not on this ticket.)
@mpopov, thanks so much for the detailed explanations here and for your willingness to discuss these topics further! I'll first tag @Manuel for visibility. Let me also summarize for him and to close out this discussion:
- I think the decision not to add Git integration to PAWS is clear, so this ticket can be marked as Resolved or Declined
- I don't have GitLab access, and will check with @Manuel on what we might want to use GitLab for
  - My EM can't give me access to the Wikimedia GitLab, so if anyone on the WMF side could grant it, that would be great :)
- We do have our own legal department for these sorts of reviews of publicly available work
- I've checked with our first contact for data protection on whether your guidelines are something she'd like shared with her. I'm sure on our end we'd like to take a look. Thank you!
  - Edit: she and WMDE Analytics would like updates, but I can relay the info to her.
- Thank you further for the information on the country protection list! I've checked with software engineering/product here on whether we're following similar standards, and it makes sense to me.
- I hadn't seen Jupyter/Tips in the docs yet and will look into getting Git set up on the internal Jupyter instance via HTTPS cloning rather than SSH
  - I'll also check this for PAWS as well :)