Page MenuHomePhabricator

Materialize analytics queries to improve superset dashboard latency
Closed, ResolvedPublic

Description

The Wikidata Platform Superset dashboard is currently putting a lot of strain on Presto and consistently times out.

To reduce pressure and improve reads performance we need to create an airflow data pipeline and materialize all queries in batch (deriving from the raw log table), store them in physical tables (optimizing storage format), and update our charting (it becomes a SELECT * FROM <TABLE>) .

This was discussed internally here

As part of this task, the engineers in WDP need to be added to the analytics-wikidata-users group, so that they can create and work with the materialized tables in HDFS.

Needs clarification:

  • Should we use iceberg for exposing this data?
  • Should we consider dbt for managing queries?

AC

  • A new namespace for wikidata platform is created in hive / metastore
  • An airflow dags is provided that materializes queries in parquet format.
  • Datasets are pruned according to Wikimedia's data rention policies.

access request: Add lerickson and trueg to analytics-wikidata-users

  • - User has signed the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document.
  • - User has a valid NDA on file with WMF legal. (All WMF Staff/Contractor hiring are covered by NDA. Other users can be validated via the NDA tracking sheet)
  • - User has provided the following: wikitech username, email address, and full reasoning for access (including what commands and/or tasks they expect to perform)
  • - User has provided a public SSH key. This ssh key pair should only be used for WMF cluster access, and not shared with any other service (this includes not sharing with WMCS access, no shared keys.)
  • - access request (or expansion) has sign off of WMF sponsor/manager (sponsor for volunteers, manager for wmf staff)
  • - access request (or expansion) has sign off of group approver indicated by the approval field in data.yaml

Event Timeline

gmodena renamed this task from Materialize analytics query to improve superset dashboard perf to Materialize analytics queries to improve superset dashboard latency.Mar 2 2026, 12:13 PM

Change #1249380 had a related patch set uploaded (by Lerickson; author: Lerickson):

[operations/puppet@production] Add lerickson and trueg to analytics-wikidata-users

https://gerrit.wikimedia.org/r/1249380

andrea.denisse changed the task status from Open to In Progress.Mar 10 2026, 8:25 PM
andrea.denisse moved this task from Untriaged to Patch in Review on the SRE-Access-Requests board.
Dzahn renamed this task from Materialize analytics queries to improve superset dashboard latency to Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users.Mar 12 2026, 12:06 AM
Dzahn updated the task description. (Show Details)
Dzahn updated the task description. (Show Details)

Approved! (Also approved the Patch, let me know if I am missing something, thanks @Dzahn

Change #1249380 merged by Btullis:

[operations/puppet@production] Add lerickson and trueg to analytics-wikidata-users

https://gerrit.wikimedia.org/r/1249380

@lerickson and @BTullis - It looks like this access request has been fulfilled. If so, can it be resolved?

Technically this should have been a separate ticket for an access request.

I just wanted to help out and be pragmatic by adding the missing template and tags in the existing ticket.

I don't know about the status of the other check boxes that were already here before that.

@gmodena

Hi! This was all a mixup on my part; I didn't realize that I needed to submit an access ticket to add myself to my team's analytics user. This ticket is still open and tracks a bigger project, which I needed the analytics access for. Sorry about the mixup. We still want this task open to track my additional work.

The easiest thing is if we just remove the tags for access requests again - now that the access part is done.

Done! This lets you keep this open as long as you want but people on clinic duty won't keep asking every week what the status is about the access request.

Hope this works for everyone. Let me know if you think it is important to keep the tags for history of requests or something.

That sounds like the perfect solution. Thank you!

lerickson renamed this task from Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users to Materialize analytics queries to improve superset dashboard latency.Fri, Mar 20, 6:15 PM

Project update:

  • We have a "wikidata" database
  • The user (and owner of the relevant HDFS directories) "analytics-wikidata" is set up and the WDP engineers are all members of it
  • Much Airflow dev tooling was improved
  • I examined the queries in our dashboard and decided what data to materialize. It will be: query date, user-agent, backend host, query time (and bucketed time), http status, and the graph name. We can of course add more columns if we need
  • I asked the data engineering experts about deleting data after 90 days, got some advice (might still be room for discussion)
  • I have an MR out for a monthly dag to populate this data, but I'm updating it to be a daily DAG to make retention/deletion easier to manage

Update:

  • The table exists: wikidata.wdqs_external_queries_by_user_agent_daily aggregates the queries by latency time class for each user agent per day, for user-agents with >1000 queries a day. We also store total query counts and the number of successful queries.
  • The DAG to populate this table has been deployed.
  • Data cleanup after 90 days at the row level happens via the data-populating DAG, too.
  • Another DAG to do Iceberg snapshot cleanup runs separately; this MR is under code review.

After the maintenance DAG has been deployed, I will enable both DAGs and update the Superset queries.

Change #1262212 had a related patch set uploaded (by Lerickson; author: Lerickson):

[operations/puppet@production] Add analytics-wikidata to the yarn capacity scheduler.

https://gerrit.wikimedia.org/r/1262212

Change #1262212 merged by Ryan Kemper:

[operations/puppet@production] Add analytics-wikidata to yarn capacity scheduler

https://gerrit.wikimedia.org/r/1262212

Merged patch and ran sudo -u yarn yarn rmadmin -refreshQueues on the active master an-master1003

The DAG to materialize the top user-agents (>1000 qpd) is running. It cleans up data >90 days as well, and I also set up a monthly maintenance DAG to delete old iceberg snapshots. This is in accordance with the standard practices around this type of data, according to a consultation I had with the data engineering advisors.

The superset dashboard with some views of this data is live: https://superset.wikimedia.org/superset/dashboard/wikidata-wdqs-top-user-agents

We can iterate on what exactly is in this dashboard. The important thing is that we now have data materialized in a way that makes querying it through superset fast and reliable.

This task met all AC and the dashboard looks great. Signing off on this task and resolving.