
[Analytics] First test report to explore potential solutions for notebooks
Closed, Resolved · Public

Description

Resources:

Acceptance criteria:

  • Investigate where to get and how to manipulate the data.
  • Explore different tools to display this data in a single location.
    • T336282 investigated Git access to notebooks via an extension
    • Instructions on setting up internal JupyterHub and PAWS for HTTPS Git: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Jupyter/Tips
    • Internal JupyterHub
      • Data access: private data via internal environment
      • Git version control: via HTTPS works for public and private repos
    • PAWS
      • Data access: public data via PAWS environment, but no private data
      • Git version control: via HTTPS works for public and private repos
    • Notebooks on GitHub/GitLab
      • Data access: more cumbersome as it's not within a WMF environment
      • Git version control: SSH possible
  • Create a proof of concept (initial test report using any of the data from milestone 2).
  • Comment/iterate on the selection criteria and possible solutions for notebooks.
    • Note that running this as a cron job would require one of the following:
      • Recreating the environment within GitHub Actions (hard, and carries the risk of a breach)
      • Running the cron job within the internal JupyterHub (no domain knowledge in the team, but WMF does this)
    • The above might suggest that a dashboard showing these values is the best option, but then we wouldn't have prior reports saved as records

Event Timeline

Note that T336282 is related to this as we were investigating adding Git integrations for notebooks :)

@Manuel, quick update that PAWS also allows Git version control via HTTPS. I expected that it would, and I successfully tested it just now. I've updated the task with the findings so far :)

AndrewTavis_WMDE updated the task description.
  • Cron jobs might be possible in either case
    • This is something that Product Analytics at WMF does
    • One issue is that it'd run within our individual instances and thus not be fully productized
      • It'd thus be best to run this job in a centralized place and then fork to individual JupyterHub instances
    • GitLab issue about JupyterLab operators
  • We could also use Airflow
  • Generally, the suggestion from WMF was to look into integrating Superset a bit more
    • Graphite is something that people would like to sunset
  • Note further that Lift Wing is replacing ORES (discussion)

Notes from meeting with @Manuel:

  • Maybe migrate the working idea for reporting from notebooks to Python scripts that write their outputs to a directory for each report
    • Images, CSV files, etc.
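
To make the script-based idea above a bit more concrete, here is a minimal sketch of what such a script could look like. Everything in it is an assumption for illustration: the `write_report` helper, the `reports/<report_name>/<date>/` layout, the `metrics.csv` file name and the placeholder metric are all hypothetical, and the real data source (PAWS or the internal JupyterHub) is left out entirely. Images could be saved into the same directory, but that would need a plotting library and is not shown.

```python
import csv
from datetime import date
from pathlib import Path


def write_report(report_name, rows, base_dir="reports", run_date=None):
    """Write one run of a report into its own dated directory; return the CSV path."""
    run_date = run_date or date.today()
    # Hypothetical layout: <base_dir>/<report_name>/<YYYY-MM-DD>/
    # A dated directory per run keeps earlier reports around as records.
    out_dir = Path(base_dir) / report_name / run_date.isoformat()
    out_dir.mkdir(parents=True, exist_ok=True)

    csv_path = out_dir / "metrics.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["metric", "value"])
        writer.writeheader()
        writer.writerows(rows)
    return csv_path


if __name__ == "__main__":
    # Placeholder values for illustration only, not real metrics.
    example_rows = [{"metric": "example_metric", "value": 1}]
    print(write_report("example_report", example_rows))
```

A script like this could be run by hand, from cron or from an Airflow task, and each run would leave its own directory behind, which would address the concern about not having prior reports saved as records.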

Another thing to consider @Manuel is that apparently Quarry is being migrated to a wmcloud Superset instance, so this could be a place where we put reporting metrics that we want to be open to the public :)

Moved this to Needs review as I think we have a fairly good idea of what we'll be using notebooks for, @Manuel. For reporting we can check with WMF about the tools they use (Superset, public Superset or Grafana), so we wouldn't be using notebooks for this purpose. I'd say that setting notebooks up for reporting would be prohibitive, as the documentation they offer could similarly be provided by readme files for Python scripts, which would be easier to maintain with Airflow.

From there we can use PAWS for publicly available data or the internal JupyterHub instance for internal data :)

Manuel renamed this task from Explore potential solutions for analytics notebooks to [Analytics] First test report to explore potential solutions for notebooks. Jul 6 2023, 12:53 PM