Provision Global Editor Metrics tables & endpoints
Closed, Resolved (Public)

Description

The Global Editor Metrics tables and Data Gateway endpoints need to be provisioned.

Event Timeline

Eevans triaged this task as Medium priority. Nov 24 2025, 10:36 PM

@mforns, @amastilovic I opened https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/10, which combines the work started by @Ottomata (data-gateway/mr-8), with the schema and endpoint for pageviews_top_pages_per_editor.

A couple of things to note: The section on pageviews_top_pages_per_editor in Global_Editor_Metrics_2025 indicates that wiki_id is meant to be an integer. I assume that is wrong, and that we're using strings there (à la enwiki, eswiki, frwiki, etc.). Let me know if that's not the case.

Also, I followed the example created by @Ottomata of transforming the database attribute of dt to timestamp in the Gateway. Mapping one name means mapping them all (see: line 525). Since we decide what attribute to store, and the DG audience is internal, this seems...unnecessary. Not a huge deal, but perhaps not something we want to make a habit of unless there is a reason. Do you know if there is a reason?
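To illustrate the point about name mapping, here is a minimal Python sketch. It is not the Data Gateway's actual code, which may look quite different. The `FIELD_MAP` table, the `to_response` helper, and the `views` attribute are hypothetical; only `dt`, `timestamp`, `wiki_id`, and `page_id` come from this discussion. The sketch assumes that once one attribute is renamed, every attribute has to be listed explicitly:

```python
# Hypothetical mapping from stored attribute names to response field names.
# Only dt -> timestamp is an actual rename; the other entries exist only
# because an explicit mapping has to cover every attribute.
FIELD_MAP = {
    "wiki_id": "wiki_id",
    "page_id": "page_id",
    "dt": "timestamp",
    "views": "views",  # assumed attribute, for illustration only
}


def to_response(row: dict) -> dict:
    """Rename a database row's attributes to their response field names."""
    missing = set(row) - set(FIELD_MAP)
    if missing:
        # Without this check an unmapped attribute would fail less clearly.
        raise KeyError(f"unmapped attributes: {sorted(missing)}")
    return {FIELD_MAP[name]: value for name, value in row.items()}


row = {"wiki_id": "enwiki", "page_id": 42, "dt": "2025-10-01T00:00:00Z", "views": 1234}
print(to_response(row))
```

If the stored attribute were simply named the same as the response field, the table and the helper would not be needed at all.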

@Eevans thank you for that MR! You are correct, wiki_id should be TEXT - we've already implemented it in the Hive counterpart for that table: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1206879

As far as the transformation of dt to timestamp goes, I'm not aware of why that was done - perhaps @mforns would know.

The more I look at this top k endpoint, the more I think I may have misunderstood what was intended. I reckon the endpoint should accept a start and end timestamp, and return all of the aggregations included (like the other). I'll update it to reflect that.

The more I look at this top k endpoint, the more I think I may have misunderstood what was intended. I reckon the endpoint should accept a start and end timestamp, and return all of the aggregations included (like the other). I'll update it to reflect that.

Backend-wise, we will only aggregate monthly for the foreseeable future. Product-wise, it might not even make sense to aggregate more flexibly.
However, for endpoint modeling, consistently using start and end parameters makes sense. We should, however, document that not all parameters are acknowledged.

The more I look at this top k endpoint, the more I think I may have misunderstood what was intended. I reckon the endpoint should accept a start and end timestamp, and return all of the aggregations included (like the other). I'll update it to reflect that.

Backend-wise, we will only aggregate monthly for the foreseeable future. Product-wise, it might not even make sense to aggregate more flexibly.
However for endpoint modeling consistently using start and end parameters makes sense...

Yeah, it's probably easiest to think of the DG as a database interface; an HTTP-bridge to the DB. The data model here would allow you to store different aggregations (ala granularity), and then query for as many of a given aggregation as you want, so the Gateway should follow suit. It works the same even if we only ever store monthly aggregations, and only ever need a month at a time (same as it would if we were querying the db directly).
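A rough Python sketch of that idea, using an in-memory list in place of the database. The `Aggregation` record, its `granularity` and `value` fields, the `query` helper, and the half-open `start <= dt < end` range are all assumptions made for illustration, not the actual schema or endpoint:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Aggregation:
    granularity: str  # e.g. "monthly"; assumed field name
    dt: datetime      # start of the aggregation period
    value: int        # placeholder for the stored metric


def query(rows, granularity, start, end):
    """Return every aggregation of one granularity with start <= dt < end."""
    return sorted(
        (r for r in rows if r.granularity == granularity and start <= r.dt < end),
        key=lambda r: r.dt,
    )


def utc(year, month, day=1):
    """Build a timezone-aware UTC datetime at midnight."""
    return datetime(year, month, day, tzinfo=timezone.utc)


ROWS = [
    Aggregation("monthly", utc(2025, 9), 10),
    Aggregation("monthly", utc(2025, 10), 20),
    Aggregation("monthly", utc(2025, 11), 30),
]

# Asking for one month is just a range that happens to cover one row.
one_month = query(ROWS, "monthly", utc(2025, 10), utc(2025, 11))
# A wider range returns every aggregation it includes.
three_months = query(ROWS, "monthly", utc(2025, 9), utc(2025, 12))
print(len(one_month), len(three_months))
```

The same function serves both calls, which is the sense in which the interface works the same whether or not anything other than monthly data is ever stored.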

...we should however document that not all parameters are acknowledged.

Which parameters do you mean? wiki_id & page_id?

Change #1211733 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra: GRANTs for new analytics keyspace

https://gerrit.wikimedia.org/r/1211733

Change #1211733 merged by Eevans:

[operations/puppet@production] cassandra: GRANTs for new analytics keyspace

https://gerrit.wikimedia.org/r/1211733

Change #1211751 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/deployment-charts@master] data_gateway: upgrade to v1.0.14

https://gerrit.wikimedia.org/r/1211751

Change #1211751 merged by jenkins-bot:

[operations/deployment-charts@master] data_gateway: upgrade to v1.0.14

https://gerrit.wikimedia.org/r/1211751

Ok, schema has been created, grants made, and DG v1.0.14 has been deployed to staging. Let me know if you encounter any problems.

Once we're certain that everything checks out OK, we can proceed with deploying to production.

Change #1213571 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[operations/puppet@production] Add GRANT MODIFYs to aqsloader for two new pageviews tables

https://gerrit.wikimedia.org/r/1213571

Change #1213571 abandoned by Aleksandar Mastilovic:

[operations/puppet@production] Add GRANT MODIFYs to aqsloader for two new pageviews tables

Reason:

Not needed.

https://gerrit.wikimedia.org/r/1213571

Change #1217267 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/deployment-charts@master] data-gateway: move v1.0.14 to production

https://gerrit.wikimedia.org/r/1217267

Change #1217267 merged by jenkins-bot:

[operations/deployment-charts@master] data-gateway: move v1.0.14 to production

https://gerrit.wikimedia.org/r/1217267

wiki_id is meant to be an integer. I assume that is wrong

Def wrong! Fixed in design review doc.

transforming the database attribute of dt to timestamp

IIRC (and I might not recall correctly), I think this was tech debt / a limitation of AQS. We wanted to not have to change names, but there was some reason we couldn't alter the existing aqsassist repo without lots and lots of other changes. But I can't totally recall without more effort. See also T342018: compile list of known issues for triage post AQS 2.0 launch.

I reckon the endpoint should accept a start and end timestamp, and return all of the aggregations included (like the other). I'll update it to reflect that.

Hm, yes perhaps! I can't totally remember this one either, but I do remember that this particular metric is a bit different from the others in that it is not additive. You can't add the monthly values together to get a meaningful sum like you can with the others.
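One possible reading of "not additive", sketched in Python: if each monthly list is truncated to its top k entries, a page that narrowly misses the cutoff every month is absent from the stored data even when it has the most views over the whole period. The page names, the counts, and k = 2 below are invented for illustration and are not taken from the real tables:

```python
from collections import Counter

K = 2  # assumed cutoff, for illustration only

# Invented full per-month view counts for one editor's pages.
october = Counter({"A": 100, "B": 90, "C": 85})
november = Counter({"D": 100, "E": 90, "C": 85})


def top_k(counts: Counter, k: int) -> Counter:
    """Keep only the k highest-count entries, as a stored top-k list would."""
    return Counter(dict(counts.most_common(k)))


# What can be reconstructed by adding the stored (truncated) monthly lists.
summed_from_stored = top_k(october, K) + top_k(november, K)
# The real top-k over both months, computed from the full counts.
true_top = top_k(october + november, K)

print(sorted(summed_from_stored))
print(true_top.most_common(1))
```

Page "C" leads the combined period in this made-up data, yet it never appears in either stored monthly list, so no amount of adding the monthly results can recover it.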

@Eevans this is all done, right?

It is indeed!