
Data Platform - Public dashboard support
Open, Needs Triage, Public

Description

Problem statement

As a Wikimedia Deutschland Data Analytics team member, I would like to be able to leverage a standardized process to conveniently create publicly available dashboards from data that currently resides in HDFS so that insights can be presented to non-WMDE/WMF staff.

Context

WMDE would like to make our Wikidata REST API metrics available for the public, but the process to do this isn't something that has been standardized. These metrics are generated by an Airflow DAG that leverages jobs defined on GitLab.

Ideas brought up in the original Slack discussion were:

  • Leveraging Wikistats (high effort)
    • This would require creating a service and API via AQS 2
  • Pushing data to Prometheus such that it can be used in Grafana (strongly discouraged)
    • Prometheus only supports counters and timings
    • For counters, it assumes additivity – that is, a weekly count is day 1 count + ... + day 7 count
    • It's impossible to control the time of your data point (it's the current time at which you push the metric)
  • Adding the published datasets directories as a target of the DAG jobs where TSVs would be saved and then ingested via an open Turnilo instance (best solution to date)
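
A quick illustration of the Prometheus additivity caveat above: summing daily counters only gives a correct weekly figure for truly additive metrics, and breaks for distinct counts such as unique API clients. A minimal sketch with invented data:

```python
# Sets of client IDs seen on each day by a hypothetical API (invented data).
daily_clients = [
    {"a", "b"},        # day 1
    {"b", "c"},        # day 2
    {"a", "c", "d"},   # day 3
]

# Summing the daily counts double-counts clients active on multiple days...
sum_of_daily_counts = sum(len(day) for day in daily_clients)  # 7

# ...whereas the true weekly distinct count deduplicates across days.
weekly_distinct = len(set().union(*daily_clients))  # 4
```

This is why a counter-based Prometheus pipeline would silently over-report any non-additive weekly metric.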

General ideas

  • It would be great if the public dashboards were an instance of WMF long-term supported data visualization software
  • Ideally the public dashboards could be directly integrated into current data pipeline/Airflow based workflows
  • Data stakeholder/admin oversight of what is added to this system would be ideal, to protect against the inclusion of PII, regions on the Country and Territory Protection List, etc.
    • Maybe a specific admin-only database within HDFS could be the source that the public dashboards have access to?
      • Admins would be the only ones who could create tables within this database
      • This would prevent the public dashboards from presenting information that has not been actively checked for vulnerabilities
    • Oversight of the Protection List and updating the public dashboards would be necessary
      • Maybe jobs that generate the data could be source controlled within a single repo with strict merge rights?
      • This would ensure that rows with is_protected = True in canonical_data.countries are always filtered out
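
The filtering described in the last bullet could be made mechanical with a small guard applied to every published row. A minimal pure-Python sketch (the is_protected column name follows canonical_data.countries; the function name and sample data are hypothetical):

```python
def filter_protected(rows):
    """Drop every row whose country is flagged as protected.

    Mirrors the SQL predicate WHERE NOT is_protected joined against
    canonical_data.countries. Rows missing the flag are dropped too,
    so the filter fails closed rather than leaking unchecked data.
    """
    return [row for row in rows if row.get("is_protected", True) is False]

# Invented sample rows for illustration.
rows = [
    {"country": "AA", "is_protected": False, "edits": 120},  # kept
    {"country": "BB", "is_protected": True, "edits": 45},    # protected: dropped
    {"country": "CC", "edits": 10},                          # no flag: dropped
]
public_rows = filter_protected(rows)
```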

Event Timeline

AndrewTavis_WMDE updated the task description.

Thank you @AndrewTavis_WMDE for submitting this feature request. This will help us think through your use case when we begin strategy work for public data visualization enhancements.

Bringing a point from @Ottomata from the Data Engineering Collaboration Hangtime here:

In response to my question of what the software for this public dashboarding solution could be, his suggestion was perhaps Grafana with a data source other than Prometheus, which is designed for time series and isn't suitable for the aggregation needs here. Superset would be too powerful and not particularly accessible for the public, and Turnilo is similarly time-series focused.

Maybe grafana with cassandra or druid querying the same backends that AQS uses?

BUT! This is just a random idea and it could be a bad one! :)

Note from the Data Usage at WMF Interview that I just participated in: @Milimetric suggested that Dashiki could be a stopgap solution where HTML dashboards for metrics could be generated and stored in the published datasets folders. An example for this is all-sites-by-os. This could be something that we could use in the meantime, with the HTML potentially being created and exported within the Airflow process. We could then also host these HTML files.

Ottomata renamed this task from Public dashboard process to Data Platform - Public dashboard support. Dec 2 2024, 5:56 PM
Ottomata added a subscriber: MusikAnimal.


I have an idea that I would like to put forward regarding this public dashboarding capability.

Outline
  • We could support having a publish task in any DAG. This is where oversight and data governance would be enforced: only data that meets the non-PII and other filtering criteria would be permitted to be published.
  • Each of these tasks would take source data from HDFS (or other sources) and write it, using Spark and Iceberg, to tables that are served by the S3 interface of our Ceph cluster.
  • We would use a new metastore for this service, so it would not be the same as the current Hive metastore service and would not need to be protected by Kerberos.
  • We would also use the Alluxio file system as a memory-first cache for the S3 data files.
  • We would set up a Trino plugin in Grafana, with a catalog that uses this Alluxio-enabled Iceberg data store, backed only by approved public data files.
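
The publish task in the outline could be sketched as a step that emits Spark SQL creating an Iceberg table in the public catalog. To keep the example self-contained (no Spark cluster required), it only builds the statement; the catalog, database, and table names are all hypothetical:

```python
# Admin-controlled allow-list of databases approved for publication (hypothetical).
APPROVED_DATABASES = {"wmde_metrics"}

def build_publish_sql(source_table, target_db, target_table):
    """Return Spark SQL for a publish task, refusing non-approved targets."""
    if target_db not in APPROVED_DATABASES:
        raise ValueError(f"{target_db} is not an approved public database")
    target = f"public_iceberg.{target_db}.{target_table}"
    # CTAS into the Iceberg catalog, filtering protected rows at publish time.
    return (
        f"CREATE TABLE IF NOT EXISTS {target} USING iceberg AS "
        f"SELECT * FROM {source_table} WHERE NOT is_protected"
    )

sql = build_publish_sql("wmde.rest_api_metrics", "wmde_metrics", "rest_api_daily")
```

A real task would hand this statement to spark.sql(...) inside the DAG; the point is that the allow-list check runs before anything is written.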

I believe that this would meet the requirements of enabling a publicly available system for creating and sharing dashboards, backed by approved data sources.

Crucially, this system wouldn't need to use Kerberos authentication, and it wouldn't need to connect to the HDFS file system where we keep our PII data.
It also wouldn't need to depend on POSIX file system permissions for different levels of data access, since all accessible data would have been approved for public access.

Components Required
  1. The Trino Grafana Data Source Plugin to grafana.wikimedia.org to provide the dashboard environment for authors and viewers.
  2. A new Trino deployment to the dse-k8s cluster, as per the docs.
  3. A new metastore service:
  4. A Trino catalog using the Iceberg connector and the Alluxio file system.
  5. An Alluxio deployment on the dse-k8s cluster as per these guidelines.
    • There are plenty of alternative options here, including 1) using the existing Presto nodes and their local storage for the Alluxio file system and 2) migrating fully from Presto to Trino and re-using the Presto workers as dse-k8s-workers.
  6. Some publish tasks in one or more DAGs that use Spark to write to this file system.
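
For component 4, the Trino catalog definition might look roughly like the following properties fragment. This is a sketch only: the file name, hostnames, and the exact S3/Alluxio wiring are assumptions, not a tested deployment.

```properties
# etc/catalog/public_iceberg.properties (hypothetical catalog name)
connector.name=iceberg
iceberg.catalog.type=hive_metastore
# The new, non-Kerberized metastore service from component 3
hive.metastore.uri=thrift://public-metastore.example.svc:9083
# S3-compatible endpoint fronting the approved public data files
hive.s3.endpoint=https://ceph-s3.example.svc
hive.s3.path-style-access=true
```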

We could also investigate Flink with Alluxio, if we would like to deploy streaming jobs to publish data.

Future Options

This approach might also be useful for other datasets and publishing requirements, such as T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users.


I'd be keen to hear any feedback on this idea. Note that we previously investigated the Alluxio file system extensively in T266641: [Data Platform] Test Alluxio as cache layer for Presto.
However, we were unable to proceed because we couldn't connect it to a Kerberos enabled hive metastore, since this is an enterprise feature, not available in the community edition.

The design outlined here bypasses this requirement, so I believe that it would be fully workable with the community edition.

Amazing, Ben! Details to discuss, but one requirement we might need to add is being accessible only to logged-in Cloud Services accounts and/or accessible only from Cloud Services networks.

I'd also love to explore the publishing mechanism. Spark -> Iceberg is probably the way to go, but there are details and also possible advantages we could gain by holistically taking a look at how we transfer data between systems, and seeing if we can unify some things. Perhaps not, but we should explore. :)

Thank you!!!

In some ways, this proposal sounds similar to T377362: EPIC: Trino/MinIO/Hive-Standalone-Metaserver/Dagster/Metabase/Superset Implementation. Too bad we can't all just use the same technologies cough cough @Jgreen ;)

Details to discuss, but one requirement we might need to add is being accessible only to logged-in Cloud Services accounts and/or accessible only from Cloud Services networks.

Right, yes. I've thought about this. I think that what I'm talking about is compatible with this requirement, but we will want to be clear on which bits we want restricted.
The ticket description above says:

...I would like to be able to leverage a standardized process to conveniently create publicly available dashboards from data that currently resides in HDFS so that insights can be presented to non-WMDE/WMF staff.

I have taken that to mean that we would like to have dashboards available without authentication, but that the queries that can be run against the data source are pre-configured by dashboard authors.
This would seem to fit in with Grafana's current model of unauthenticated read-only access, with write access and https://grafana-rw.wikimedia.org/explore limited to members of wmf or nda groups.

The Trino plugin for Grafana would proxy requests to the Trino coordinator(s), which would be in the production realm. The Alluxio file system endpoints and the S3 endpoints would also be in the production realm, so there would be no unauthenticated access to these. Everything would have to go through the Grafana plugin and dashboards, using the pre-configured queries.

If we wanted to make the Trino coordinator service and the other services available to certain projects in Horizon, I think that would be possible, too. That kind of production <-> cloud access to the Trino catalog was what I was implying by linking to T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users.

First, I love that this use case is getting attention, this is great!

I want to caution early on, however, on the idea of using Iceberg tables as the backend here. We have had this discussion elsewhere (tried to find ticket but could not, it was for the experimentation platform and their dashboarding use case...), but to recap: Iceberg is meant for analytics, that is, for queries that would typically take >= 1 min to get results. It is not meant for dashboarding, where we typically want results instantly. I know the proposal here includes accelerating Iceberg with Alluxio caching + an always-on engine like Trino/Presto, but I would argue that we are trying to 'fix' Iceberg to fit the use case.

Druid or Cassandra or MariaDB would be a better fit. I do understand that elsewhere we want to get rid of Druid, and that it is stated on this ticket that Cassandra tables seem high effort. But both Druid and Cassandra force your tables to be designed for performance. That is a good thing. If we were to go with MariaDB, we lose that constraint, but we would be able to leverage existing "User databases" from Toolforge and we get a proper sub-second database with indexes. Using Toolforge will also come with its own challenges though, I must admit.

But what I want to point out is that the use case here is a platform for dashboarding, and hopefully the dashboards are very interactive and fast. It us much easier to shoot yourself in the foot with an Iceberg/Trino backend for such a use case than it would be with Druid/Cassandra/MariaDB, which are tools designed for very low query latency, giving us a better end user experience.