The Global Editor Metrics tables and Data Gateway endpoints need to be provisioned.
Description
Details
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| Global Editor Metrics schema & endpoints | repos/sre/data-gateway!10 | eevans | global_editor_metrics | main |
| Status | Assigned | Task |
|---|---|---|
| Resolved | HNordeenWMF | T382442 [Epic] iOS Wikipedia Year in Review 2025 (V3) |
| Resolved | HNordeenWMF | T388060 [Sub-Epic] [WE 3.2.9] On-device statistics Improvements for Year in Review 2025 |
| Open | None | T341649 Provide an easy way for MediaWiki to fetch aggregate data from the data lake |
| Open | None | T388455 [Spike] Full-year editing stats for Year in Review |
| Resolved | Ottomata | T403660 WE3.3.7 Year in Review and Activity Tab Services - Global Editor Metrics |
| Resolved | Ottomata | T401260 Global Editor Metrics - Data Persistence Design Review |
| Resolved | Eevans | T410962 Provision Global Editor Metrics tables & endpoints |
Event Timeline
eevans opened https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/10
Draft: Global Editor Metrics schema & endpoints
@mforns, @amastilovic I opened https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/10, which combines the work started by @Ottomata (data-gateway/mr-8), with the schema and endpoint for pageviews_top_pages_per_editor.
A couple of things to note: The section on pageviews_top_pages_per_editor in Global_Editor_Metrics_2025 indicates that wiki_id is meant to be an integer. I assume that is wrong, and that we're using strings there (à la enwiki, eswiki, frwiki, etc.); let me know if that's not the case.
Also, I followed the example created by @Ottomata of transforming the database attribute of dt to timestamp in the Gateway. Mapping one name means mapping them all (see: line 525). Since we decide what attribute to store, and the DG audience is internal, this seems...unnecessary. Not a huge deal, but perhaps not something we want to make a habit of unless there is a reason. Do you know if there is a reason?
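To make the renaming concern concrete, here is a minimal sketch of an explicit attribute map. It is not the Gateway's actual code: `FIELD_MAP`, `to_response`, and every column except the `dt` to `timestamp` rename are hypothetical. It only illustrates that once a single attribute is renamed through such a map, every other attribute has to be listed too, even when its name does not change.

```python
# Hypothetical mapping from database attribute names to response field names.
# Only the dt -> timestamp rename comes from the discussion; the other
# columns are assumptions made for illustration.
FIELD_MAP = {
    "dt": "timestamp",     # the one real rename
    "wiki_id": "wiki_id",  # identity mappings, still required
    "page_id": "page_id",
    "views": "views",
}


def to_response(row: dict) -> dict:
    """Rename database attributes to response fields; unmapped keys are dropped."""
    return {api: row[db] for db, api in FIELD_MAP.items() if db in row}


row = {"dt": "2025-01-01T00:00:00Z", "wiki_id": "enwiki", "page_id": 42, "views": 7}
response = to_response(row)
print(response)
```

If the stored attribute were simply named `timestamp`, the map (and its upkeep) would not be needed at all.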
@Eevans thank you for that MR! You are correct, wiki_id should be TEXT - we've already implemented it in the Hive counterpart for that table: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1206879
As far as the transformation of dt to timestamp goes, I'm not aware of why that was done - perhaps @mforns would know.
The more I look at this top k endpoint, the more I think I may have misunderstood what was intended. I reckon the endpoint should accept a start and end timestamp, and return all of the aggregations included (like the other). I'll update it to reflect that.
> The more I look at this top k endpoint, the more I think I may have misunderstood what was intended. I reckon the endpoint should accept a start and end timestamp, and return all of the aggregations included (like the other). I'll update it to reflect that.
Backend-wise, we will only aggregate monthly for the foreseeable future. Product-wise, it might not even make sense to be more flexible.
That said, for endpoint modeling, consistently using start and end parameters makes sense; we should however document that not all parameters are acknowledged.
Yeah, it's probably easiest to think of the DG as a database interface: an HTTP bridge to the DB. The data model here would allow you to store different aggregations (à la granularity), and then query for as many of a given aggregation as you want, so the Gateway should follow suit. It works the same even if we only ever store monthly aggregations, and only ever need a month at a time (same as it would if we were querying the DB directly).
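As a rough illustration of that database-interface view, the sketch below queries an in-memory list standing in for the table. The row layout, the `edit_count` column, the `query_range` helper, and the half-open `[start, end)` interval are all assumptions, not the actual schema or endpoint semantics.

```python
from datetime import datetime

# Hypothetical stored aggregations; column names are assumptions.
ROWS = [
    {"user_id": 1, "granularity": "monthly", "dt": datetime(2025, 1, 1), "edit_count": 12},
    {"user_id": 1, "granularity": "monthly", "dt": datetime(2025, 2, 1), "edit_count": 7},
    {"user_id": 1, "granularity": "monthly", "dt": datetime(2025, 3, 1), "edit_count": 3},
    {"user_id": 2, "granularity": "monthly", "dt": datetime(2025, 1, 1), "edit_count": 5},
]


def query_range(rows, user_id, granularity, start, end):
    """Return every stored aggregation for user_id with start <= dt < end."""
    return [
        r for r in rows
        if r["user_id"] == user_id
        and r["granularity"] == granularity
        and start <= r["dt"] < end
    ]


# The same call serves a single month or a longer range.
one_month = query_range(ROWS, 1, "monthly", datetime(2025, 1, 1), datetime(2025, 2, 1))
quarter = query_range(ROWS, 1, "monthly", datetime(2025, 1, 1), datetime(2025, 4, 1))
print(len(one_month), len(quarter))
```

Nothing in the query changes if only monthly rows are ever stored; the caller just gets back however many aggregations fall inside the requested range.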
> ...we should however document that not all parameters are acknowledged.
Which parameters do you mean? wiki_id & page_id?
eevans merged https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/10
Global Editor Metrics schema & endpoints
Change #1211733 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] cassandra: GRANTs for new analytics keyspace
Change #1211733 merged by Eevans:
[operations/puppet@production] cassandra: GRANTs for new analytics keyspace
Change #1211751 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/deployment-charts@master] data_gateway: upgrade to v1.0.14
Change #1211751 merged by jenkins-bot:
[operations/deployment-charts@master] data_gateway: upgrade to v1.0.14
Ok, schema has been created, grants made, and DG v1.0.14 has been deployed to staging. Let me know if you encounter any problems.
Once we're certain that everything checks out OK, we can proceed with deploying to production.
Change #1213571 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):
[operations/puppet@production] Add GRANT MODIFYs to aqsloader for two new pageviews tables
Change #1213571 abandoned by Aleksandar Mastilovic:
[operations/puppet@production] Add GRANT MODIFYs to aqsloader for two new pageviews tables
Reason:
Not needed.
Change #1217267 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/deployment-charts@master] data-gateway: move v1.0.14 to production
Change #1217267 merged by jenkins-bot:
[operations/deployment-charts@master] data-gateway: move v1.0.14 to production
> wiki_id is meant to be an integer. I assume that is wrong
Def wrong! Fixed in design review doc.
> transforming the database attribute of dt to timestamp
IIRC (and I might not recall correctly), I think this was tech debt / a limitation of AQS. We wanted to not have to change names, but there was some reason we couldn't alter the existing aqsassist repo without lots and lots of other changes. But I can't totally recall without more effort. See also T342018: compile list of known issues for triage post AQS 2.0 launch.
> I reckon the endpoint should accept a start and end timestamp, and return all of the aggregations included (like the other). I'll update it to reflect that.
Hm, yes perhaps! I can't totally remember this one either, but I do remember that this particular metric is a bit different from the others in that it is not additive. You can't add the monthly values together to get a meaningful sum like you can with the others.
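One plausible reading of "not additive", sketched with made-up numbers: if only each month's top-k list is stored, pages that fall just outside a month's list are lost, so summing the stored lists can yield a different ranking than the true multi-month top-k. The page names, counts, and `top_k` helper below are hypothetical, and this may not be the exact reason meant above.

```python
from collections import Counter

# Hypothetical full per-page view counts for two months.
january = {"A": 100, "B": 90, "C": 80}
february = {"C": 85, "D": 95, "A": 10}

K = 2


def top_k(counts, k):
    """Return the k highest (page, views) pairs, highest first."""
    return Counter(counts).most_common(k)


# What would be stored: only each month's top-k list.
jan_top = dict(top_k(january, K))
feb_top = dict(top_k(february, K))

# Naive approach: add the stored monthly lists together.
naive_top = top_k(Counter(jan_top) + Counter(feb_top), K)

# Ground truth: add the full counts first, then take the top k.
true_top = top_k(Counter(january) + Counter(february), K)

print("naive:", naive_top)
print("true: ", true_top)
```

Here page C never leads in either month's stored list, yet it has the most views over the two months combined, so the naive sum misses it.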