
Public dashboard process
Open, Needs Triage, Public

Description

Problem statement

As a Wikimedia Deutschland Data Analytics team member, I would like a standardized process for conveniently creating publicly available dashboards from data that currently resides in HDFS, so that insights can be presented to people outside WMDE and WMF.

Context

WMDE would like to make our Wikidata REST API metrics available to the public, but the process for doing this has not been standardized. These metrics are generated by an Airflow DAG that runs jobs defined on GitLab.

Ideas brought up in the original Slack discussion were:

  • Leveraging Wikistats (high effort)
    • This would require creating a service and API via AQS 2
  • Pushing data to Prometheus such that it can be used in Grafana (strongly discouraged)
    • Prometheus only supports counters and timings
    • For counters, it assumes additivity – that is, a weekly count is day 1 count + ... + day 7 count
    • It's impossible to control the time of your data point (it's the current time at which you push the metric)
  • Adding the published datasets directories as a target of the DAG jobs where TSVs would be saved and then ingested via an open Turnilo instance (best solution to date)
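To make the TSV option more concrete, here is a minimal sketch of what a DAG job's export step could look like. It is an illustration only: the function name `write_metrics_tsv`, the default file name, and the `date`/`metric`/`value` columns are hypothetical, and the output directory stands in for whichever published datasets path would actually be used.

```python
import csv
from pathlib import Path


def write_metrics_tsv(rows, output_dir, filename="rest_api_metrics.tsv"):
    """Write metric rows to a TSV file and return its path.

    All names here are hypothetical. Each row carries its own date,
    so the time of a data point is set by the job rather than by the
    moment the file is written.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / filename

    # Assumed column layout; a real job would use its own schema.
    fieldnames = ["date", "metric", "value"]
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Because the date is an explicit column, weekly or monthly aggregates can be written as their own rows instead of being derived by summing daily values, which sidesteps the additivity and timestamp concerns raised for Prometheus.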

General ideas

  • It would be great if the public dashboards were an instance of WMF long-term supported data visualization software
  • Ideally the public dashboards could be directly integrated into current data pipeline/Airflow based workflows
  • Including data stakeholder/admin oversight of what is added to this system would be ideal to protect against the inclusion of PII, regions on the Country and Territory Protection List, etc.
    • Maybe a specific admin-only database within HDFS could be the source that the public dashboards read from?
      • Admins would be the only ones who could create tables within this database
      • This would prevent the public dashboards from presenting information that has not been actively checked for vulnerabilities
    • Oversight of the Protection List and updating the public dashboards would be necessary
      • Maybe jobs that generate the data could be source controlled within a single repo with strict merge rights?
      • This would ensure that rows where is_protected = True in canonical_data.countries are always filtered out
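The filtering idea above could be sketched roughly as follows. This is a plain-Python illustration rather than the actual job code: only `is_protected` and `canonical_data.countries` come from this task, while the function name `keep_publishable` and the `iso_code` and `country_code` keys are assumptions made for the example.

```python
def keep_publishable(metric_rows, countries):
    """Return only the metric rows that are safe to publish.

    `countries` stands in for rows of canonical_data.countries.
    The `iso_code` and `country_code` keys are assumed names.
    """
    # Build an allow-list instead of a deny-list: a country code that
    # is missing from the reference table is dropped rather than kept.
    allowed = {c["iso_code"] for c in countries if not c["is_protected"]}
    return [row for row in metric_rows if row["country_code"] in allowed]
```

Using an allow-list means the filter fails closed: if the reference table is incomplete or a code is malformed, the row is withheld rather than published.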

Event Timeline

AndrewTavis_WMDE updated the task description.

Thank you @AndrewTavis_WMDE for submitting this feature request. This will help us think through your use case when we begin strategy work for public data visualization enhancements.