WE3.3.7 Year in Review and Activity Tab Services - Global Editor Metrics
Open, Needs Triage, Public

Description

WE3.3.7 Year in Review and Activity Tab Services

If we leverage the data platform’s processing capabilities to aggregate tailored editor metrics and impact data and serve the aggregated data through suitable services with defined SLOs, we can enhance future iterations of Year in Review WE3.3.1 and Activity Tab WE3.3.2.

Asana WE3.3.7 hypothesis


Year in Review (YiR), Mobile Apps Activity Tab, and the Growth Impact Module are 3 projects that require global editor statistics.

The metrics required for each of these projects are very similar. The differences are mostly about time spans and granularities.

This will be the parent task for:

Working document: FY25-26 Year in Review and Impact Module - Notes & Product Requirements

Metric requirements summary

Canonical requirements are in FY25-26 Year in Review and Impact Module - Notes & Product Requirements

  • Total global edit count per user
  • Total number of days edited per user
  • Longest daily edit streak per user
  • List of edited articles per user
  • Total number of pageviews on all articles edited by a user
  • Top K views to articles edited by a user per month

Year in Review needs these metrics rolled up for an entire calendar year. Impact Module and App Activity Tab would like daily roll ups, and ideally at a daily computation frequency.

Privacy review

Most metrics are clearly public. However, List of edited articles per user, if not historically updated, has the potential to expose deleted (and privacy sensitive?) page ids / page titles.

LCS3 review for this has been completed. Summary:

exposing only the MediaWiki internal page_id via the public API and not the page_title satisfactorily mitigates privacy risks.

Details

Other Assignee
mforns

Event Timeline

@Seddon, @HNordeenWMF, @Tsevener (and Dmitry? Which is Dmitry's phab handle?):

I needed a main ticket to track design and implementation work. The parent Spike ticket was a bit busy so I created this one.

I really prefer Phabricator for async collab (see https://wikitech.wikimedia.org/wiki/Data_Platform_Engineering/Share_our_work), so I'm going to focus our discussion and documentation here. I hope that is okay!

@Seddon @HNordeenWMF, a product requirement question for you about the total pageviews per user on edited articles metric:

First, here are some notes from when we've asked about this before:

Per user Views on (wikipedia?) articles edited in the last 12 months

Total number of views since the first edit in a calendar year. (This is preferable from a technical perspective)

Total number of views since any edit whether before or during current calendar year.

This metric is a bit more complicated than the other ones because there are 2 different time ranges to consider:

  1. time range for which to count pageviews
  2. time range for which to count a user's edit.

We'd like to make all metrics additive. I.e. they should be meaningfully summable over a time range. We'd also like to make the metrics immutable so we don't have to manage updates to the past.

If we could relax 2., then the "total pageviews on articles edited by a user" could be simpler and more useful for other use cases than YiR.

Could we instead:

  • count total pageviews in a time range (e.g. a year) per user on articles since a user's first edit to that article?

This would allow the same metric to be used in the Activity Tab use case:

Daily views on articles in the last. (however ideally we could filter the data by set time periods, like last 30 days, 60 days, last year)

If we have to align an editor's date to a specific timeframe (e.g. a year), this will not be possible.

We also think it would make the metric more consistent and meaningful:

What if a user created or edited an article on Dec 31 2024 that then went on to receive millions of pageviews? That user would have made no edits in 2025, but still their impact in 2025 would be quite large.

Thanks! I think @mforns might have more coherent reasoning around this request, in case you have any more questions

(Edit: removed an option after discussion with Marcel)

We think we can generate these metrics daily from mediawiki.page_change.v1 event data, rather than relying on monthly sqooped snapshots and mediawiki_history. Here's how.

event_sanitized.mediawiki_page_change_v1 goes back to 2024-12-31.

This dataset has edit data (no content) per user forever. It is ingested into the Data Lake hourly.

We can use this dataset to entirely generate daily metrics for

  • Total global edit count per user
  • Total number of days edited per user
  • Longest daily edit streak per user
  • List of edited articles per user

We'll need a global user id. Luckily, CentralAuth gives us one, and associates it with each wiki's local user id.

The global user id should be added to mediawiki.page_change.v1 events. T403664: EventBus - Add central user id to MediaWiki events. With this, we can calculate ongoing editor metrics per global user id.

We'll need historical backfill since we don't have this data now. T389666: NEW/CHANGE FEATURE REQUEST: make available the centralauth.globaluser table in Data Lake will help us get the full mapping so we can backfill this where needed (joining with mediawiki_history).

Since we will have a full 2025 year's worth of these events, we can fully backfill a year's worth of daily metrics. Ongoing, an Airflow job will compute these daily.
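To make the three per-user aggregates concrete, here is a minimal sketch of how edit count, days edited, and longest daily streak could be derived from edit events. It assumes each event has already been reduced to a `(global_user_id, edit_day)` pair; the function name and the input shape are hypothetical and are not the actual Airflow job.

```python
from collections import defaultdict
from datetime import date, timedelta


def editor_metrics(edits):
    """Aggregate per-user metrics from (global_user_id, edit_day) pairs.

    One pair per edit event; this input shape is an assumption for illustration.
    """
    counts = defaultdict(int)
    days = defaultdict(set)
    for user_id, day in edits:
        counts[user_id] += 1
        days[user_id].add(day)

    metrics = {}
    for user_id, day_set in days.items():
        longest = current = 0
        previous = None
        for day in sorted(day_set):
            # Extend the streak only if this day directly follows the previous one.
            if previous is not None and day - previous == timedelta(days=1):
                current += 1
            else:
                current = 1
            longest = max(longest, current)
            previous = day
        metrics[user_id] = {
            "edit_count": counts[user_id],
            "days_edited": len(day_set),
            "longest_streak": longest,
        }
    return metrics


sample = [
    (1, date(2025, 1, 1)), (1, date(2025, 1, 1)), (1, date(2025, 1, 2)),
    (1, date(2025, 1, 4)), (2, date(2025, 3, 9)),
]
result = editor_metrics(sample)
```

Note that edit count and days edited can be summed from daily rows, while the longest streak needs the ordered set of edited days for the requested range.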

Total number of pageviews on all articles edited by a user is a little more complicated.

Assuming our ask for changing the product requirement above is okay:

We'll need to maintain a dataset of articles edited by a user, and the first edit date, e.g.

global_user_id,
wiki_id,
page_id,
first_edit_timestamp,

We will backfill this table from mediawiki_history joined with sqooped globaluser and localuser tables from CentralAuth. Ongoing, we insert records into this table from page_change events whenever a page is edited by a user for the first time.

We can then use this list to generate pageviews in a time range to articles edited by a user.
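As a rough illustration of the ongoing insert step, the sketch below keeps one row per `(global_user_id, wiki_id, page_id)` and records the earliest edit timestamp seen. The event dictionaries and the in-memory dict are stand-ins chosen for this example; the real table and event schema may differ.

```python
def record_first_edits(first_edits, events):
    """Update a first-edit table from a batch of page_change-like events.

    first_edits maps (global_user_id, wiki_id, page_id) -> first_edit_timestamp.
    Timestamps are assumed to be ISO-8601 UTC strings, which sort chronologically.
    """
    for event in events:
        key = (event["global_user_id"], event["wiki_id"], event["page_id"])
        ts = event["timestamp"]
        # Insert on first sight; keep the earliest value if events arrive out of order.
        if key not in first_edits or ts < first_edits[key]:
            first_edits[key] = ts
    return first_edits


table = record_first_edits({}, [
    {"global_user_id": 7, "wiki_id": "enwiki", "page_id": 42, "timestamp": "2025-03-02T10:00:00Z"},
    {"global_user_id": 7, "wiki_id": "enwiki", "page_id": 42, "timestamp": "2025-03-05T09:00:00Z"},
    {"global_user_id": 7, "wiki_id": "dewiki", "page_id": 42, "timestamp": "2025-04-01T00:00:00Z"},
])
```

Later edits to an already-recorded page leave the row unchanged, so the table only grows.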

Another product requirement question for @KStoller-WMF @Seddon @Dbrant @HNordeenWMF:

What are the accuracy requirements for this? How much do we care if

  • counts are very slightly inaccurate because of potential for missed/inconsistent event data?
    • AO: I assume we don't care that much, as these are grand aggregate metrics.
    • AO: We might care slightly for the 'list of edited articles', because if an editor makes one edit to an article in a day and we miss that event, that article would not show up in their list. This would still be very rare, just more visible.
  • counts are very inaccurate (or missing?) in the very rare case of an outage?
    • AO: we may be able to backfill from other sources if this happens, but it would be manual and expensive in engineer time

If we are okay with inaccuracies like this, then we will proceed with the mediawiki.page_change.v1 event datasource plan as described in T403660#11145650.

If accuracy is a requirement, then we could instead use mediawiki_content_history_v1 as the datasource. This datasource is eventually consistent, and in most cases should have all events within a day or so. Metric computation would lag a little bit more than it would from mediawiki.page_change.v1 events, but not by too much. However, to use mediawiki_content_history_v1, we'd have to do more engineering work (1-2 weeks FTE?) to include the global user id in the mediawiki_content_history_v1 pipeline and dataset.


I'm a little worried about project timeline if accuracy is a strong requirement. We would get more accuracy from mediawiki_history denormalized, but this is only available monthly, so we could not support the non-YiR use cases with it.

My preference would be:

  • Use mediawiki.page_change.v1 events with accuracy caveats now,
  • If we really care about accuracy in the future, do the work and update the pipeline to change datasources.

You can defer to @Seddon, but from my perspective, we want contributor impact metrics to be reasonably accurate, but certainly some margin of error is acceptable. Growth features (and Year in Review) are intended to be fun and engaging, not "sources of truth". For example, we can not determine whether a page view actually included a specific person’s edit. Editors should understand that these metrics are estimates and not exact.

Longest daily edit streak per user

That said, certain metrics might lead to more frustration if we get them wrong, for example, I would hate to see us report that a user's edit streak was broken due to inaccurate / missing data.

FWIW, we can compute these very accurately if we compute monthly rather than daily, but then we can't support the other non-YiR use cases with the same data and endpoints.

Re comment T403660#11145572 about editor pageviews metric:

@mforns and I think we found a way to do it. But, we'll have to have some kind of limit to the number of edited pages we can provide this metric for.

I just posted the first draft of the Data Persistence Design Review for T401260. Our proposed data model will store a record per user per day per page edited. We can use this table to directly compute all of the desired metrics except the pageviews one.

But, we already have the daily per page pageviews per article api. So:

  • Given an edit time range, we can look up the list of distinct pages edited.
  • If the list of distinct edited pages (per day? or total? not sure) is over some limit, return an error.
    • We need to find the right limit here so as not to overwhelm the database/service. Perhaps a few thousand? Marcel found an account that edited 130K distinct pages in a one-year period. Hopefully we can find a limit that supports real humans, but excludes unidentified bots.
  • We can then ask the pageviews per article api (or table) to give us the count of pageviews for those pages and sum them.
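The lookup-and-sum flow in the bullets above could be sketched as follows. The limit of 5000 is only a placeholder for "a few thousand", and the dict of per-page counts stands in for whatever the pageviews API or table would return; both are assumptions for illustration.

```python
class TooManyPagesError(Exception):
    """Raised when an editor's distinct edited pages exceed the serving limit."""


def pageviews_for_edited_pages(edited_pages, pageviews_by_page, max_pages=5000):
    """Sum pageviews over the distinct pages an editor touched in a time range.

    edited_pages: page ids edited in the range (may contain repeats).
    pageviews_by_page: page id -> pageview count for the requested range.
    max_pages: hypothetical limit; the thread has not settled on a value.
    """
    distinct = set(edited_pages)
    if len(distinct) > max_pages:
        raise TooManyPagesError(f"{len(distinct)} distinct pages exceeds limit {max_pages}")
    return sum(pageviews_by_page.get(page_id, 0) for page_id in distinct)


total = pageviews_for_edited_pages([1, 2, 2, 3], {1: 100, 2: 50, 3: 5, 4: 999})
```

Pages with no recorded pageviews contribute zero, and pages the editor did not touch (page 4 here) are ignored.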

Dear product owners (@Seddon, @KStoller-WMF, @HNordeenWMF, @JTannerWMF), :)

We'd like to start setting expiration dates for data products and pipelines. This expiration date can be updated at any time with no questions asked. But, if the expiration date passes, and there are no longer official data product owners at WMF, the data pipeline / platform maintainers may justify decommissioning and removing the data product.

So, what's a good expiration date for this? Perhaps something like 2030-07-01?

So, that date would serve as the point when we review and assess whether this data pipeline is still needed?
It seems like a (very) reasonable approach to audit and retire non-owned or low-impact data pipelines, so that date is fine from my perspective.

Could we instead:
count total pageviews in a time range (e.g. a year) per user on articles since a user's first edit to that article?

FWIW, In Slack, Kirsten wrote:

The count of total pageviews in a time range (e.g. a year) per user on articles since a user’s first edit to that article?
From the Growth perspective, thinking about the Impact Module, I think that makes sense.
I know Haley is out, so I’m not sure if Seddon or @jaz want to chime in with the Mobile App perspective.

@Ottomata Nearly 5 years feels like a long time; I could come and go from a team on those timescales. Could I suggest trimming that down to 2029-07-01, or even 2029-04-01? I suggest that as an alternative so that the question arises during an annual planning cycle.

Ottomata updated the task description.

Submitted an LCS3 privacy review for the list of page titles metric.

In a meeting on 2025-09-16, we discussed T401260: Global Editor Metrics - Data Persistence Design Review and https://wikitech.wikimedia.org/wiki/User:Ottomata/Global_Editor_Metrics_2025_Design_Draft. The meeting was mostly focused on 3 issues:

  1. Are immutable lists of daily edited page ids and/or page titles privacy sensitive? As noted above, we submitted an LCS3 review to find out. If they are, we will either have to drop this metric requirement, or expand the engineering scope to support historical updates.
  2. Do we need to store and/or key by page_titles in the storage and API? If we do, things are more complicated (we have to deal with page renames and page name drift over time). If we don't, client product features may add latency when looking up page_titles. General understanding was that page_ids will probably be fine, as long as we do the same thing that is done for article topics. See T392833#11188134.
  3. Can we satisfy the "Total number of pageviews in a time range on articles edited by a user in a time range" for Year in Review and other product features with the same dataset?

    The answer was "no". Cassandra won't work this way. Perhaps if we were using a different datastore (RDBMS?) for this it would be okay. This may be input for "capability gap analysis" for derived data storage for serving in T401394: ☂️ [FY2025-26][Hypothesis] WE6.2.3 Data Storage Design Review.

Marcel and I then discussed how we will support the compromise pageviews metric we discussed above at T403660#11145572. I'll describe that in the next comment.

Pageviews per editor's edited pages metric explanation.

I want to try to be really clear as to why it is difficult to do, but also why as described, it is probably not a good metric. For more context, please see (and read speaker notes) of @mforns' absolutely excellent and beautiful Metrics From Below presentation. (There has to be a recording somewhere...)


In T403660#11169873, we thought we had a way to satisfy this metric by

  • getting list of articles edited in a time range
  • getting pageviews of those articles in a time range from the Pageviews API.

However, in our design review meeting yesterday, we learned that this won't be possible/performant. The pageviews API / Cassandra table is not designed for range queries, meaning we can't look up pageviews for multiple pages at once, nor can we sum them in CQL. Additionally, the Pageviews daily API is keyed by page_title, which we are now trying to avoid, so we won't be able to look up pageviews by page_title anyway.

So, we need to precompute a different pageviews by editor dataset and then store that for serving.

Precomputation is fine, but in order to satisfy immutability and additivity, we can't really precompute for editor time range windows. E.g.

  • daily pageviews to articles edited by a user in a year, e.g. Jan - Dec 2025.
  • daily pageviews to articles edited by a user in a month, e.g. January, February, March, etc.

However, as noted in the comment, we don't think this kind of metric is particularly meaningful. When a page is viewed is not correlated with a window of when an article was edited. This kind of metric is not additive either: You can't sum over a time range to compute pageviews associated with an edit. If a user edits in January and March, and you want to know the pageviews that editor 'contributed to' in the first quarter of the year, any pageviews from February would not be counted.


So what immutable and additive 'editor's edited article pageviews' can we meaningfully provide? I attempted to suggest something above in T403660#11145572, but reading that now I'm not sure if what I wrote is accurate to what we are proposing. I'll try again to be super clear (for myself too, as I keep confusing myself).

We can provide a metric that is:

Daily pageview counts to all articles ever edited by a user.

Every day for every user, we will store a pageview count, something like

user_central_id, 
pageview_count_to_all_pages_edited_by_this_user,
dt (day)

That will store the number of pageviews on that day, to all pages edited by user_central_id on that day or before.

The endpoint that queries this data will then look something like:

/pageviews_per_editor_pages/{user_central_id}/{start_day}/{end_day}

This will return the total number of pageviews on pages edited by a user each day. Note that the list of pages that an editor ever edited will only ever increase.

We will backfill this data starting from January 2025, allowing YiR to sum pageviews for the entire year.
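A small sketch may help show why this daily definition is additive. It assumes a per-user map of page to first edit day and a map of `(page, day)` to pageview count; both structures and the function names are hypothetical, and the end day is treated as inclusive, which the endpoint may or may not do.

```python
from datetime import date, timedelta


def daily_editor_pageviews(first_edit_day_by_page, pageviews, day):
    """Pageviews on `day` to all pages the user edited on that day or before."""
    return sum(
        pageviews.get((page, day), 0)
        for page, first_day in first_edit_day_by_page.items()
        if first_day <= day
    )


def pageviews_in_range(first_edit_day_by_page, pageviews, start_day, end_day):
    """Sum the daily metric over [start_day, end_day], both inclusive (an assumption)."""
    total = 0
    day = start_day
    while day <= end_day:
        total += daily_editor_pageviews(first_edit_day_by_page, pageviews, day)
        day += timedelta(days=1)
    return total


# Page A first edited on Jan 1, page B first edited on Jan 3.
first_edits = {"A": date(2025, 1, 1), "B": date(2025, 1, 3)}
views = {
    (page, date(2025, 1, d)): n
    for page, n in (("A", 10), ("B", 5))
    for d in (1, 2, 3)
}
jan_1_to_3 = pageviews_in_range(first_edits, views, date(2025, 1, 1), date(2025, 1, 3))
```

Here the three daily values are 10, 10 and 15. Page B's views on Jan 1 and Jan 2 are not counted because the user had not yet edited it, and splitting the range into sub-ranges gives the same total.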

Here's another reason why I think the yearly metric is not great:
"Pageviews generated during 2025 to articles edited during 2025 by a given editor"

With this metric definition, we give much more importance to edits made early in the year.
An edit done in January will accumulate pageviews for 12 months,
while an edit made in December will only accumulate pageviews for 1 month.

Let's say editor A likes to write about a January event and does 10 edits in January every year.
And editor B likes to write about Christmas and does 10 edits in December every year.

Imagine both editors' contributions generate the same pageviews per day on a regular basis, so same impact, no?
Nevertheless, with this metric, the YiR impact for editor A is going to be ~12 times higher than the one for editor B.

And I believe that we are undercounting the overall yearly impact of the editor by 50% (impact of edits made later in the year).
Because of this, I think that this metric is not super informative of the real impact of the user edits.
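The editor A versus editor B comparison can be checked with a bit of arithmetic. The sketch below assumes, purely for illustration, that A's edits land on January 1, B's on December 1, and that both sets of edits attract the same hypothetical 100 pageviews per day.

```python
from datetime import date


def counted_days(edit_day, year=2025):
    """Days from the edit through Dec 31 of the same year, inclusive."""
    return (date(year, 12, 31) - edit_day).days + 1


views_per_day = 100  # hypothetical, identical for both editors

editor_a_total = counted_days(date(2025, 1, 1)) * views_per_day
editor_b_total = counted_days(date(2025, 12, 1)) * views_per_day
ratio = editor_a_total / editor_b_total
```

Under these assumptions A is credited for 365 days of views and B for 31, so the ratio is a little under 12 even though the daily impact is identical.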

Ottomata renamed this task from Global Editor Metrics for YiR, Apps Activity Tab, and Growth Impact Module to WE3.3.7 Year in Review and Activity Tab Services - Global Editor Metrics. Sep 18 2025, 6:10 PM
Ottomata updated the task description.
KStoller-WMF raised the priority of this task from High to Needs Triage. Sep 21 2025, 9:57 PM
KStoller-WMF moved this task from Inbox to Tracking on the Growth-Team board.

@JTannerWMF Product question for you!

List of most viewed edited articles from the last 30 days, 60 days, last year

This metric is a bit technically complicated, partly because of the 'additivity' property we prefer metrics to have. In order to compute this in an additive and flexible way (e.g. List of K most viewed edited articles in a time range), we'd have to store a LOT of data: per user per page per day pageview counts.

We are not sure we can justify storing and serving such a large amount of data for this one metric, so we would like to make it less flexible (and not additive). Instead, we'd like to do:

List of K most viewed edited articles each month

We'd precompute and store the list of K most viewed editor articles each month. You wouldn't be able to ask for 'last 30 days', but you could ask for each month's list, e.g. 2025-08, 2025-09, etc.

Would this be acceptable?

Also, what should K be? 10? 100?

More details in T401260#11230961
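For one editor, the monthly precomputation proposed above might look roughly like this. The row shape `(month, page_id, views)` and the function name are assumptions for this sketch; K is left as a parameter since its value is still an open question.

```python
import heapq
from collections import defaultdict


def top_k_pages_by_month(daily_rows, k=10):
    """Return {month: [(page_id, views), ...]} with the k most viewed pages per month.

    daily_rows: iterable of (month as 'YYYY-MM', page_id, views) for one editor's
    edited pages. This input shape is hypothetical.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for month, page_id, views in daily_rows:
        totals[month][page_id] += views
    return {
        month: heapq.nlargest(k, pages.items(), key=lambda item: item[1])
        for month, pages in totals.items()
    }


rows = [
    ("2025-08", 1, 30), ("2025-08", 1, 20), ("2025-08", 2, 40), ("2025-08", 3, 5),
    ("2025-09", 2, 10), ("2025-09", 3, 70),
]
top2 = top_k_pages_by_month(rows, k=2)
```

Only the K winners per month are stored, which is what keeps the dataset small and also why arbitrary windows such as "last 30 days" cannot be answered from it.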


Edit: We discussed in a meeting, and this is acceptable.

Next product question.

The 2 pageview metrics both mention 'articles' and not pages. Should we take that to mean that pageviews to non content pages (e.g. talk pages, draft namespaces, user pages, etc.) should not be counted?

I'd like to suggest that we simply count all pageviews to any mediawiki pages.

  1. it is simpler to code, store and reason about.
  2. For sum # of pageviews, non content pages are unlikely to contribute to the metric number that much. The actual pages viewed are not included in this metric, so it probably doesn't really matter?
  3. For top k pages viewed: if a user's most viewed page is really in a non content namespace (a draft page? their user page?) perhaps they'd like to see that? The only downside I could see is if a user edits a Main_Page, this page would likely always be in their top k. Perhaps we could just never count Main_Page? (I think we have encountered this problem before and filtering out Main_Page may be harder than it sounds).

@JTannerWMF @Dbrant whatcha think?

FWIW, we have the ability to differentiate between namespaces and 'content' vs 'non content' more easily for the edit related metrics, so this is not an issue there: the client will be able to choose.

I'd like to suggest that we simply count all pageviews to any mediawiki pages.

I would agree and endorse this, for similar reasons: Pageviews of 'content' pages will almost always dwarf non-content pages, and if a user only ever edits non-content pages, I'm sure they'd want to see those pageviews anyway.

Hi all, after doing some implementing and discussing with Joseph, we realized (T406069#11261850) that the only per-page edit metric being requested is "List of edited articles per user last year" for YiR. If we don't need per page edit metrics, then the implementation will be much simpler, and the queries should perform much better too.

@Dbrant This means that the API endpoint will change such that it will only give you per user edit counts over time ranges, not per page. I'll update T405041: Global Editor Metrics - HTTP API endpoints accordingly. I hope that is okay!

Mentioned in SAL (#wikimedia-analytics) [2025-10-28T16:51:42Z] <ottomata> deploying AQS edit-analytics service to pick up edits/per_editor endpoint - T403660

@Dbrant! Great news! edits/v3/per_editor is live!

Try it out! Get Ottomata's monthly edit counts in 2025:

curl 'https://wikimedia.org/api/rest_v1/metrics/edits/v3/per_editor/11878393/all_page_types/monthly/20250101/20260101' | jq .
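For clients that build this URL in code, here is a small sketch that reproduces the curl example above. The parameter names (`user_central_id`, `page_type`, `granularity`, `start`, `end`) are guesses based on the path in that example and are not taken from the endpoint documentation; the sketch only constructs the URL and does not assume anything about the response shape.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/edits/v3/per_editor"


def per_editor_url(user_central_id, page_type, granularity, start, end):
    """Build a per_editor URL; segment names are assumptions, not documented names."""
    parts = [str(user_central_id), page_type, granularity, start, end]
    return BASE + "/" + "/".join(quote(part, safe="") for part in parts)


url = per_editor_url(11878393, "all_page_types", "monthly", "20250101", "20260101")
```

The resulting `url` can be passed to any HTTP client, for example `urllib.request.urlopen`.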

I'm working on docs update now...

Here is a product question about the "top k pages viewed" metric to discuss in today's sync meeting: T401260#11341613

Now that we have some real data to play with, we think we can (we still have to triple check with Data Persistence) serve the original "top k edited pages viewed" requirement.

We previously decided (T403660#11233604) that we would simplify the requirement to List of K most viewed edited articles each month. But, with new actual size estimates, we think we can store pageviews_per_editor_per_page_daily, allowing us to serve List of most viewed edited articles from the last 30 days, 60 days, last year.

To make this clearer, here are the API differences between this choice:

Good enough product option:
GET /metrics/pageviews/v3/top_pages_per_editor_monthly/{user_central_id}/{year}/{month} => Top 10(?) pages viewed in specified month. This is not additive over multiple months, so would not work for e.g. Year in Review (unless YiR wanted to display top page viewed each month in a year?).

Better product option:
If we go with the pageviews_per_editor_per_page_daily idea, the API might look something like:

GET /metrics/pageviews/v3/top_pages_per_editor/{user_central_id}/{granularity}/{start}/{end} => Timeseries of Top 10(?) pages viewed between {start} and {end} each {granularity} (daily, monthly, yearly).
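To illustrate why the monthly top-K lists are not additive over multiple months, here is a toy example with made-up page names and counts. It compares the true top page over two months with what you would get by merging only the stored monthly winners.

```python
from collections import Counter

# Hypothetical per-page monthly pageviews for one editor's edited pages.
monthly_views = {
    "2025-08": {"A": 100, "B": 90},
    "2025-09": {"C": 100, "B": 90},
}
k = 1

# True top-k over both months, computed from the full per-page data.
full = Counter()
for views in monthly_views.values():
    full.update(views)
true_top = full.most_common(k)

# Top-k obtained by merging only the stored monthly top-k lists.
merged = Counter()
for views in monthly_views.values():
    merged.update(dict(Counter(views).most_common(k)))
merged_top = merged.most_common(k)
```

Page B is never a monthly winner, so it is missing from the merged result even though it has the most views over the two months combined. Storing per-page daily counts avoids this at the cost of much more data.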

Better product option has some unknowns and requires more changes as described in T401260#11341613, so it will take more time (maybe until end of quarter? Hard to say for sure.)

Good enough product option has already been planned out, but is not implemented yet. We still have work to do, but it should all be straightforward to do in a shortish time frame (2 or 3 more weeks?)

In the meeting today, we decided that "Good enough product" was sufficient for now. If this is not the case, Product will try to let us know as soon as possible.

If Year in Review wants top k pages viewed last year, we can precompute a calendar-year version of this metric.