
Global Editor Metrics - Data Persistence Design Review
Closed, Resolved · Public

Description

Data-Engineering have a project to provide Global Editor Metrics for Year in Review, Growth Impact Module, and App Activity Tab. This ticket will serve as a tracking task for the design review of the project's storage architecture & design with Data-Persistence.

Status Updates

Decision Brief

On keyspace naming

TL;DR We have decided on the name analytics for Analytics API keyspaces.

Other AQS/cassandra hosted data either uses vestigial 'restbase' keyspaces, or is for Commons Impact Metrics. Going forward, in order to standardize the keyspace location for Analytics API usages, we will put AQS tables in analytics keyspaces.

Details

Other Assignee
mforns

Event Timeline

Eevans triaged this task as Medium priority. Aug 5 2025, 10:01 PM
Eevans updated the task description.
Ottomata updated Other Assignee, added: mforns; removed: Ottomata.
Ottomata renamed this task from Data Persistence Design Review: Year in Review (YiR) to Data Persistence Design Review: Global Editor Metrics (YiR, Impact Module, App Activity Tab). Sep 10 2025, 7:16 PM

Alright! first draft of design review document is here:

https://wikitech.wikimedia.org/wiki/User:Ottomata/Global_Editor_Metrics_2025_Design_Draft

@Eevans please check it out. There are some TBD and TODO and Questions in the doc we need to work through. I'm sure you'll have suggestions for us too.

@mforns @Seddon @Dbrant please review too!

Ottomata renamed this task from Data Persistence Design Review: Global Editor Metrics (YiR, Impact Module, App Activity Tab) to Global Editor Metrics - Data Persistence Design Review: .Sep 18 2025, 8:43 PM
Ottomata renamed this task from Global Editor Metrics - Data Persistence Design Review: to Global Editor Metrics - Data Persistence Design Review.

Yesterday, @mforns and I took a closer look at the App Activity Tab product requirements, and realized we hadn't really investigated one of them.

List of most viewed edited articles from the last 30 days, 60 days, last year

Our most recent two-table data model does not store pageview counts per page, so we would not be able to compute this from it. We could add another table that somehow supports this use case, but @mforns and I went back to the drawing board to try once again to store daily pageview counts per user per (user-edited) page. Our original data model had a single table with this structure, but we were worried that it would be way too big to store in Cassandra. We stopped estimating because we thought no one was asking for per-page counts.

So, how big would it be to store daily per-user, per-page edit and pageview counts?

This table would look something like:

Table name: editor_metrics_per_user_per_page_daily (or something like that).

| Field | Type | Description |
|---|---|---|
| user_central_id | int | MediaWiki user central ID. At Wikimedia, this is the CentralAuth globaluser gu_id field. |
| wiki_id | text | Wiki ID, sometimes also called just 'wiki' or 'database name'. |
| page_id | bigint | MediaWiki page ID. |
| edit_count | bigint | Number of edits made by this user to this page on this day. |
| pageview_count | bigint | Number of pageviews to this page by this user on this day. |
| dt | timestamp | The day for which we aggregate the data. It should point to the beginning of the day (YYYY-MM-DDT00:00:00.000Z). |

It is difficult to estimate the actual size of this table, but here is a back-of-the-napkin, worst-case-ish estimate.
This table will have a record for every user for every day for every page where that user either edited that page, or that page received pageviews on that day. I.e. at least one of edit_count or pageview_count must be positive. We think that pageviews to edited pages will blow this table up much more than edits, so we estimate based on pages receiving pageviews.

Our variables:

  • R: record size. R = 4+10+8+8+8+8 = 46 bytes.
  • U: count of distinct users that have ever edited. U = 28,000,000.
  • P: percent of pages that receive pageviews each day. P = 6.6%.
  • E: number of pages ever edited per user. We could use average, median, p99 etc. to adjust our estimate. E = 101 (average).

For P, we are using the % of pages that got pageviews where agent_type = 'user'. This was 6.6% on a single day in September 2025.

For E, we've gone with the average number of pages ever edited per user, because we want to allow the outliers to drag this number higher, for a not-quite-worst-case but probably worse-than-actual-case estimate. We could also use the median (p50) or higher. For reference, percentiles are included in the queries below too.

Calculations:

New records per day
U * E * P
28000000 * 101 * 0.066 == 186648000 new records per day

~200,000,000 new records per day

Data size added
new records per day * R
200000000 * 46 bytes = ~9GB

9GB added per day == ~3.2 TB added per year.

That is large, but not insanely large.
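The napkin math above can be sketched in a few lines of Python (values copied from the variables list, so this is a sanity check, not new data):

```python
# Worst-case-ish estimate for the per-user-per-page daily table.
R = 4 + 10 + 8 + 8 + 8 + 8   # record size in bytes: int + text + 3x bigint + timestamp
U = 28_000_000               # distinct users that have ever edited
P = 0.066                    # fraction of pages receiving user pageviews each day
E = 101                      # average number of pages ever edited per user

records_per_day = U * E * P
bytes_per_day = records_per_day * R
print(f"{records_per_day:,.0f} new records/day")   # ~187 million
print(f"{bytes_per_day / 1e9:.1f} GB/day, {bytes_per_day * 365 / 1e12:.1f} TB/year")
```

This gives ~8.6 GB/day and ~3.1 TB/year before rounding the daily record count up to 200 million.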


(Marcel's) Queries:

Distinct count of users that have ever edited (in all namespaces)

28,000,000.

SELECT COUNT(DISTINCT event_user_text)
FROM wmf.mediawiki_history
WHERE
  snapshot = "2025-08" AND
  event_entity = "revision" AND
  event_type = "create" AND
  size(event_user_is_bot_by) = 0 AND
  event_user_is_permanent
;
Pages that have pageviews in 1 day

131,642,166

WITH pages_with_pageviews_in_a_day AS (
  SELECT DISTINCT
    project,
    page_id
  FROM wmf.pageview_hourly
  WHERE
    year = 2025 AND
    month = 9 AND
    day = 20
)
SELECT COUNT(*)
FROM pages_with_pageviews_in_a_day
;

count of pages total: ~600,000,000

So % of pages that received pageviews on 2025-09-20: 22%

And filtering for agent_type = 'user':
39,703,168


WITH pages_with_pageviews_by_users_in_a_day AS (
  SELECT DISTINCT
    project,
    page_id
  FROM wmf.pageview_hourly
  WHERE
    year = 2025 AND
    month = 9 AND
    day = 20 AND
    agent_type = "user"
)
SELECT COUNT(*)
FROM pages_with_pageviews_by_users_in_a_day
;

So % of pages that received pageviews by users on 2025-09-20: 6.6%
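Deriving the two percentages from the counts above (all numbers copied from this comment, so this is just the arithmetic):

```python
pages_total = 600_000_000           # approximate total page count
pages_viewed = 131_642_166          # pages with any pageviews on 2025-09-20
pages_viewed_by_users = 39_703_168  # same day, agent_type = 'user' only

print(f"all agents: {pages_viewed / pages_total:.0%}")           # 22%
print(f"users only: {pages_viewed_by_users / pages_total:.1%}")  # 6.6%
```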

Average # of pages edited per user (in all namespaces)

101

WITH edited_pages_per_user AS (
  SELECT DISTINCT
    event_user_id,
    wiki_db,
    page_id
  FROM wmf.mediawiki_history
  WHERE
    snapshot = "2025-08" AND
    event_entity = "revision" AND
    event_type = "create" AND
    size(event_user_is_bot_by) = 0 AND
    event_user_is_permanent
),


user_edited_page_count AS (
  SELECT
    event_user_id,
    count(*) AS page_count
  FROM edited_pages_per_user
  GROUP BY
    event_user_id
)


SELECT avg(page_count)
FROM user_edited_page_count
;

Percentiles:

SELECT approx_percentile(page_count, array(0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999)) AS percentiles
FROM user_edited_page_count;

percentiles
[1,2,5,17,42,442,11061]
  • the median (p50) number of pages ever edited per user was 2 pages
  • the 75th percentile was 5 pages; the 90th was 17
  • the 99th percentile was 442 pages
  • the 99.9th percentile was 11061 pages
Max query result size returned

For single table data model:

If we don’t have CQL aggregation support in Data Gateway, then we need to think about the max number of records that might be returned in one query, and how large that result is.

If we don't have query aggregation capabilities, then we'll be returning all records over the queried time range and aggregating them in memory in the client (AQS). Year in Review would query for all days in a year. For editors that have edited many pages where those pages get regular pageviews, this could be quite large. Let's stick with the assumption that 6.6% of (edited) pages get user pageviews each day. Let's check this for E at the average, the 99th, and the 99.9th percentiles.

  • Average pages ever edited is 101.
  • 99 percentile pages ever edited by a user is 442
  • 99.9 percentile pages ever edited by user is 11061.
E * P * 365 days * 46 bytes
Average:
101 * 0.066 * 365 * 46 = ~0.11 MB

p99:
442 * 0.066 * 365 * 46 = ~0.47 MB

p99.9:
11061 * 0.066 * 365 * 46 = ~12 MB

p99.9 count of records in result to aggregate over:
11061 * 0.066 * 365 = ~266,500

So, for users who have edited a lot of different pages, we could potentially have to return a ~250,000 row and 12MB result from Cassandra to the client. It does seem that Cassandra supports result pagination, but we'd have to add this to Data Gateway and then use it in AQS. Big Q for @Eevans on that one. :/
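The per-percentile result-size arithmetic above, spelled out (same formula, E * P * 365 days * R):

```python
R = 46      # bytes per record
P = 0.066   # fraction of edited pages with user pageviews per day
DAYS = 365  # Year in Review queries a full year

for label, pages_edited in [("average", 101), ("p99", 442), ("p99.9", 11061)]:
    records = pages_edited * P * DAYS
    print(f"{label}: ~{records:,.0f} records, ~{records * R / 1e6:.2f} MB")
```

The p99.9 case works out to ~266,500 records and ~12 MB, matching the estimate above.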

Q for Data Persistence: Is 12MB too big for one result?

If we added simple SUM capabilities to Data Gateway, we could mostly avoid this by allowing Cassandra to do the aggregation for us.

However, Cassandra does not support "top k" aggregations. So we cannot use CQL to compute the Activity Tab "top k viewed pages" metric.


We then have 3 options:

A. Set a cutoff for pages for which we will count pageviews. E.g. if you have edited more than 500 pages, we do not support counting pageviews to your pages.

B. Don't use the single per-user-per-page table data model. As the previous comment shows, this data model might just be too big anyway.

C. Only store records for the top N (30?) pages by pageview_count (and where edit_count=0). We'd store a dummy record to represent the pageview count for "all other pages" (below the top N). When returning the top K viewed pages, where K is less than N, we would mostly be accurate. We'd have to get all records over the time period, but the number of records stored per user per day is capped at about N+1 (31). Top k pages over large time ranges might be slightly inaccurate, but in most cases it is unlikely that a top 10 page will slip below the top 30. The Activity Tab top k metric isn't asking for actual pageviews per page, just the list of pages in the top k, so it would be hard for anyone to verify our counting anyway ;)

... ^ we need to think about this more. Does this logic make sense for world events like the pope dying? Most of the time the pope is probably not in the top 100, but for a few weeks it is # 1? Hm.

Anyway, let's apply our formula to Option C. and see how big our table (and worst case result size) is...

Option C:

Actually, storing e.g. top 30 pages with pageviews is not going to reduce our estimated storage size. E * P is our guess at the number of records per user per day. E=101 (average) and P=6.6% is about 7 new records per user per day. We were exploring Option C to solve the bad/worst case (e.g. E=500) query result sizes.

Option C basically caps the number of records per user per day at N+1. So, the worst case query result size would be:

p99.9 (cap binding every day):
(30+1) * 365 * 46 bytes = ~0.5 MB

p99.9 count of records in result to aggregate over:
(30+1) * 365 = ~11,300

So, Option C does help solve the query-result-size-too-large / top k / aggregation CQL problem. But not necessarily total storage size.
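A quick check of the Option C worst case, under our reading that the N+1 cap binds every day for a prolific editor (so P no longer applies):

```python
R = 46     # bytes per record
DAYS = 365
N = 30     # keep the top-N viewed pages, plus one "all other pages" dummy record

records = (N + 1) * DAYS  # worst case: the editor hits the cap every day
print(f"~{records:,} records, ~{records * R / 1e6:.2f} MB")
```

Still a big improvement over the uncapped p99.9 case (~266,500 records, ~12 MB).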

Total storage size is really quite hard to estimate. We are going to generate this data for a few days in the Data Lake to have a better estimate.

Work for this estimate will be tracked in T405039: Global Editor Metrics - Data Pipeline. We'll pause our estimation efforts until then.

Ah, @mforns, I reran your Pages that have pageviews in 1 day query but filtered out bots (where agent_type = 'user'). I'm asking the product team if this is what they want, but assuming it is, it will change the numbers.

WITH pages_with_pageviews_in_a_day AS (
  SELECT DISTINCT
    project,
    page_id
  FROM wmf.pageview_hourly
  WHERE
    year = 2025 AND
    month = 9 AND
    day = 20 AND
    agent_type = "user"
)
SELECT COUNT(*)
FROM pages_with_pageviews_in_a_day
;

39703168

I'll update my comment above with a re-estimate using this number.

I wrote:

If we added simple SUM capabilities to Data Gateway, we could mostly avoid this by allowing Cassandra to do the aggregation for us.

However, Cassandra does not support "top k" aggregations. So we cannot use CQL to compute the Activity Tab "top k viewed pages" metric.

On second thought, if we add SUM capability to Data Gateway, can we avoid the top k query result size problem?

IIUC, if we have a key of ((user_central_id), dt, wiki_id, page_id), I think we can sum pageview_count grouped by wiki_id, page_id. E.g.

SELECT
  wiki_id,
  page_id,
  sum(pageview_count)
FROM
  editor_metrics_per_user_per_page_daily
WHERE
  user_central_id = {uid} AND
  dt >= {T1} AND dt < {T2}
GROUP BY
  wiki_id, page_id
;

We can't use Cassandra to e.g. ORDER BY cnt DESC ... LIMIT k, but we can do that in the client on the results returned.

That reduces our worst case count of records returned to simply the number of distinct pages edited per user. p99.9 was 11061, so around 10k results worst case.
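A sketch of what the client-side top-k step in AQS could look like, over the per-(wiki_id, page_id) sums returned by Cassandra (the row shape and values here are made up for illustration):

```python
# Hypothetical summed rows, one per (wiki_id, page_id) over the queried range.
rows = [
    {"wiki_id": "enwiki", "page_id": 42, "pageview_count": 900},
    {"wiki_id": "dewiki", "page_id": 7,  "pageview_count": 1500},
    {"wiki_id": "enwiki", "page_id": 99, "pageview_count": 300},
]

def top_k(rows, k):
    # Cassandra can't ORDER BY an aggregate, so sort client-side and truncate.
    return sorted(rows, key=lambda r: r["pageview_count"], reverse=True)[:k]

for row in top_k(rows, 2):
    print(row["wiki_id"], row["page_id"], row["pageview_count"])
```

For the p99.9 worst case this is a sort over ~11,000 rows, which is cheap.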

This might be good enough motivation to add SUM and COUNT aggregation support to Data Gateway? Let's discuss and file a ticket if so.

Alright. I met with @JAllemandou earlier today, and just now with @mforns and @amastilovic. We've been looking again at options and user requirements, and we think we want to change directions again.

In the discussion above, we were trying to use a single cassandra table to serve all the product use cases. The top k edited pages viewed metric was what motivated us to store per user per page pageview counts. This explodes the data size quite a lot (by how much we still aren't sure).

  1. For edit metrics... there is already the monthly generated mediawiki_history_reduced dataset in public Druid. This dataset is already used to compute AQS per-project editor metrics, and it already has per-user data in it. We experimented to see if we could use this existing Druid dataset to serve the global editing related metrics. We can. We found that even for the worst case of editors who have edited many pages, querying more than a year of data in Druid takes less than 3 seconds. For more regular users, the queries return in well under a second.
  2. Along the way, we also decided we shouldn't use the top k edited pages viewed use case as a justification for storing huge amounts of (duplicate) pageview data. (If a client really wanted to request and compute this metric, they could potentially do so by issuing several requests: first to get the 'list of pages edited' from the endpoint we will create for this task, and then to the pageviews API to get the pageviews to the pages they are interested in.)
  • So, for the 2 pageviews metrics, we would like to:
    • total edited pages pageviews : store this as a daily (not per page) aggregation.
    • top k pageviews to edited pages: precompute this monthly and store as a per user list.
      • NOTE: This metric won't be additive anymore, but we think this will be okay for the apps team. TBD, we will ask them.

We went back and forth as to whether this new direction is the best (and most efficient) path forward, and we think so. We can re-use and improve an existing dataset, and not store and maintain huge amounts of (duplicate) data (and queries) in Cassandra for pageviews.

I'll double check this direction with Eric and Joseph, and then update the Design Doc draft accordingly.

We've got our first actual daily pageviews per editor per page data lake table record! @amastilovic backfilled and ran the pageviews daily query for 2025-10-25. On that day, we stored 26637692 records.

This result changes the estimates we made at T401260#11206915.

We don't need edit counts anymore, since we serve those from Druid. So let's see what it would be like to store per user per page daily pageview counts in cassandra (allowing us to use one table to serve richer use cases).

Table name: pageviews_per_editor_per_page_daily (or something like that).

| Field | Type | Description |
|---|---|---|
| user_central_id | int | MediaWiki user central ID. At Wikimedia, this is the CentralAuth globaluser gu_id field. |
| wiki_id | text | Wiki ID, sometimes also called just 'wiki' or 'database name'. |
| page_id | bigint | MediaWiki page ID. |
| pageview_count | bigint | Number of pageviews to this page by this user on this day. |
| dt | timestamp | The day for which we aggregate the data. It should point to the beginning of the day (YYYY-MM-DDT00:00:00.000Z). |

Now that we have (one typical) day computed in the data lake, our calculations are simpler. Let's round the daily count up to 30 million.

R: record size: 4+10+8+8+8 = 38 bytes
C: daily row count: 30000000

Data size added
R * C
30000000 * 38 bytes = ~1.1 GB

~1.1 GB added per day == ~420 GB added per year.
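The revised sizing arithmetic, as a quick sketch:

```python
R = 4 + 10 + 8 + 8 + 8  # bytes: int + text + bigint + bigint + timestamp
C = 30_000_000          # measured ~26.6M records/day, rounded up

bytes_per_day = R * C
print(f"{bytes_per_day / 1e9:.2f} GB/day, {bytes_per_day * 365 / 1e9:.0f} GB/year")
```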

Max query result size returned

If we don't have query aggregation capabilities, then we'll be returning all records over the queried time range and aggregating them in memory in the client (AQS). Year in Review would query for all days in a year. For editors that have edited many pages where those pages get regular pageviews, this could be quite large.

Aleks calculated some more stats for this day. On 2025-10-25:

Daily pageview per page per user result sizing:

| percentile | daily record count | estimated yearly range result count | estimated yearly range result size* |
|---|---|---|---|
| p75 | 3 | ~1000 | ~40 KB |
| p90 | 14 | ~5000 | ~200 KB |
| p95 | 45 | ~16000 | ~600 KB |
| p99 | 470 | ~180000 | ~6 MB |
| max | 818062 | too large | too large |
| avg | 37 | ~14000 | ~500 KB |

*In reality actual Data Gateway result will be much larger: it is returned as JSON

Those prolific editors really are dragging the average way up!

This is smaller than the previous napkin estimate. But the yearly range result counts at p95 and above are still quite large.
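Sanity-checking the yearly extrapolations from Aleks's daily percentile counts (38-byte records, 365 days; the 'max' row omitted):

```python
R, DAYS = 38, 365
daily_record_counts = {"p75": 3, "p90": 14, "p95": 45, "p99": 470, "avg": 37}

for label, n in daily_record_counts.items():
    yearly = n * DAYS  # assumes the daily count holds all year
    print(f"{label}: ~{yearly:,} records, ~{yearly * R / 1e3:,.0f} KB")
```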

What to do now? See next comment.


At T401260#11230961, we decided to not store per editor per page pageviews metrics in cassandra just to support the top K pageviews use case. This wasn't our favorite decision, because it means we have to maintain 2 different cassandra tables and data pipelines, and the top k pageviews metric is no longer an additive timeseries metric. Product teams can't do 'top k in last 30 days', they can only do e.g. 'top k in October'.

I think the new calculations above mean that data size in Cassandra is not a concern. 500GB / year growth should be manageable, especially since we can delete data after a few years via TTLs.

However, I think the query result size returned for large time ranges is still too large. At T401260#11227713, I posited that adding aggregation support to Data Gateway would mitigate this issue.

Alternatively, we could again consider Option A as described in T401260#11206978: don't store (or serve?) cases where a user has too many edited pages (with pageviews).

If all of the above is true, and either Data Gateway can do simple aggregation for us or we can use a cutoff, then perhaps we should reconsider the decision to not store per page pageview counts.

Let's discuss.

This would certainly extend the timeline of this project, as we'd have to do a little more design review with Eric, maybe do implementation work in Data Gateway, modify the already-implemented pageviews/per_editor AQS endpoint, etc. The final product would be better, but it will take longer.

Perhaps Product Owners can decide for us ;)