
Global Editor Metrics - Druid mediawiki_history_reduced changes
Closed, ResolvedPublic

Description

As discussed in T401260#11230961, we'd like to use the existing Druid mediawiki_history_reduced dataset to serve global editor editing metrics. The metrics to serve are:

  • Edit Metrics
    • Total edit count within a date range.
    • Total number of days edited within a date range.
    • Longest daily consecutive edit streak within a date range.
    • List of edited pages within a date range.
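Of the metrics above, the streak is the only one that needs more than a grouped count; given the set of days a user edited, it can be computed client-side. A minimal sketch (the function name and approach are illustrative, not the actual implementation):

```python
from datetime import date, timedelta

def longest_streak(edit_days: set[date]) -> int:
    """Longest run of consecutive days with at least one edit."""
    best = 0
    for d in edit_days:
        # Only start counting at the beginning of a run.
        if d - timedelta(days=1) not in edit_days:
            length = 1
            while d + timedelta(days=length) in edit_days:
                length += 1
            best = max(best, length)
    return best
```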

In order to compute these daily, and for global user metrics, we'll have to make a few changes to mediawiki_history_reduced in Druid.

mediawiki_history_reduced (and its upstream mediawiki_history) are event-based monthly snapshot tables. This means:

  • event-based - each row represents a single event happening, e.g. an edit.
  • monthly snapshot - all records are fully regenerated monthly from sqooped MediaWiki MariaDB tables.

To compute global editor metrics, we need to add user_central_id and page_id to mediawiki_history_reduced.

To compute these daily, we'll need to append new events to mediawiki_history_reduced at least daily.

While it would be very nice to have these changes applied to the upstream Hive tables, the minimum requirements are that these are satisfied for the Druid datasource used for serving. We may choose to e.g. add user_central_id to mediawiki_history etc (T365648: Add user_central_id to mediawiki_history and mediawiki_history_reduced Hive tables) and source this info in Druid from there, but doing so is not required.


Incremental updates to upstream Hive tables are surely out of scope, so we will just apply daily updates to the Druid dataset. To do this, we have a few options:

1. Incrementally update the latest snapshot
  • Determine the 'latest snapshot dataset', e.g. mediawiki_history_reduced_2025_08 in Druid.
  • Compute daily mediawiki history 'event' data from event.mediawiki_page_change_v1, starting from latest time in latest snapshot
  • Load daily mediawiki history events into latest snapshot dataset.
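The 'starting from latest time' step could be done by first asking Druid for the max event time already in the snapshot. A hypothetical sketch of building that request for Druid's SQL HTTP API (/druid/v2/sql); the datasource name is an example:

```python
import json

def latest_event_time_payload(datasource: str) -> str:
    """POST body asking Druid for the most recent __time already loaded."""
    sql = f'SELECT MAX(__time) AS latest_time FROM "{datasource}"'
    return json.dumps({"query": sql, "resultFormat": "object"})

# POST this to the Druid router at /druid/v2/sql
payload = latest_event_time_payload("mediawiki_history_reduced_2025_08")
```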

Pros:

  • No new Druid datasources to maintain
  • Incremental lambda arch: each month the new snapshot will supersede the previous, giving us eventual consistency.

Cons:

  • The daily loading Airflow DAG needs to align its timing with the latest time in each new monthly snapshot. This may require some complicated sensor work.
    • We can avoid this con by having the daily loading DAG write to both current and future month. This is still a bit awkward but means we don't have to do any complicated alignments.
  • Not clear how/if we can roll back to old snapshots if we create bugs.
  • mediawiki_history_reduced contains 'digest' event types. These are pre-aggregations stored as rows in the data itself. If we don't update the digest aggregations, the digests will no longer match aggregations of events as new event rows are added. This is a little awkward, and Joseph recommends possibly splitting the digest rows out into their own dataset anyway.
2. Incrementally update a new mediawiki_history_reduced_latest dataset
  • When loading a new mediawiki_history_reduced snapshot into Druid, also load it into a new (digest-less) mediawiki_history_reduced_latest dataset.
  • Compute daily mediawiki history 'event' data from event.mediawiki_page_change_v1, starting from latest time in latest snapshot
  • Load daily mediawiki history events into mediawiki_history_reduced_latest dataset.

Pros:

  • Incremental lambda arch: each month the new snapshot will supersede the previous, giving us eventual consistency.
  • No digest awkwardness in mediawiki_history_reduced_latest
  • Airflow sensor less complicated (we can just keep loading daily into mediawiki_history_reduced_latest)
  • No dataset switching needed.

Cons:

  • New Druid datasource to maintain
  • Dataset storage and segment cache duplication - existing AQS and Global Editor Metrics will not utilize the same segment cache.
  • Difficult to roll back mediawiki_history_reduced_latest.
3. Single lambda style mediawiki_history_reduced

This is the same as Option 2, but without the cons of extra dataset storage and segment cache duplication. Existing AQS usages would be migrated to the same dataset.

We'd have to migrate the digest (pre-aggregations) rows to a new separate monthly snapshot dataset.

We could implement Option 2 and then migrate to Option 3 later.

Pros:

  • Single incremental eventually consistent mediawiki_history_reduced to maintain.
  • Airflow sensor less complicated (we can just keep loading daily into the single dataset)
  • No dataset switching needed.

Cons:

  • Rollbacks will be manual backfills / dataset replacements
  • We need to create a new monthly digest/aggregation dataset in Druid.

We could also consider computing and streaming mediawiki_history_reduced events into Druid realtime, but that would require running new streaming enrichment jobs and Druid ingestion jobs, which is probably out of scope for this task.

Done is

  • page_id and user_central_id fields are added to Druid mediawiki_history_reduced.
  • Druid mediawiki_history_reduced is updated daily

Related Objects

Event Timeline

Ottomata renamed this task from Global Editor Metrics - Druid mediawiki_history_reduced updates to Global Editor Metrics - Druid mediawiki_history_reduced changes.

@JAllemandou I tried to capture what we discussed this morning here. Please let me know what you think!

This is great, thank you for the writing Andrew :)
I have two comments:

  • I think that for either solution 1 or 2 we'll wish to split the digest/events datasets, for easier understanding.
  • For solution 2 the downside I see is more complication in case of dataset rollback: we will maintain snapshot datasets per date plus latest (latest being, at load time, the same as the latest-date dataset). Then we'll have to either update both latest and the latest-date snapshot daily, or accept that in case of rollback the previous month of data will be missing (it was loaded only into latest).

From the second point above, it feels that solution 1 described above would be simpler/easier to maintain. But on the other hand, it has been a long time (years) since we last used the dataset rollback mechanism... Maybe it'll be more useful when we start touching the data :)

Thanks @JAllemandou, I think one of the biggest advantages of Option 2 is that we don't have to coordinate the timing of the monthly loading and the daily loading DAGs, right? These are all (mostly) immutable event rows, so if the monthly load runs late, or the daily one does, it shouldn't really matter, as long as both run?

Option 1 would require that after every new monthly snapshot, we start a new DAG, or reset the daily job's time to the start of the month, right?

case of dataset rollback

I think Option 2 would allow us to roll back with snapshots for APIs that only need monthly data and want better stability, but for APIs that want more frequent data, rollback would be more difficult. This would also help us separate this new capability from existing use cases, as existing snapshot datasets would remain the same (well, we could add new fields, but we wouldn't load daily into them).

I vote Option 2. If we need to rollback a snapshot, wouldn't we simply:

  • roll back mediawiki_history_reduced to last_good_snapshot
  • roll back mediawiki_history_reduced_latest to last_good_snapshot
  • replay daily events on top of mediawiki_history_reduced_latest to get it back to current?

Testing that this went well seems *easier* when these are split, because the checks are different. If they were together, I'd have more stuff to load into my brain if a test was failing - like, is this because I messed something up in the query and I'm getting a daily update instead of the base?

2 things to check before we proceed:

  1. Will Druid handle these queries under load? Marcel and I did some spot testing and latencies seemed acceptable, but what if there are ~100 queries per second?
  2. Option 2 will add one more mediawiki_history_reduced sized dataset into Druid public. Is there enough space?

On it...

Druid query load testing

I wanted to just use ab, but IIRC Druid has a pretty good query cache, so we need to vary the queries we issue while load testing.

After much prompting, ChatGPT wrote me some Python scripts that:

  • Generates Druid SQL HTTP API POST body payloads given a SQL query and templated parameters. The generated payloads are the cross product of all the possible parameters.
  • Runs locust and uses slices of the possible payloads for each parallel 'user' simulation. E.g. I run 100 concurrent 'users', and each will iterate through a distinct list of editor user name payloads.
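The first bullet, cross-product payload generation, can be sketched roughly like this (a simplification of what the actual scripts do; names are illustrative):

```python
import itertools
import json

def generate_payloads(sql_template: str, params: dict[str, list]) -> list[str]:
    """One Druid SQL POST body per combination of template parameter values."""
    keys = list(params)
    payloads = []
    for combo in itertools.product(*(params[k] for k in keys)):
        sql = sql_template.format(**dict(zip(keys, combo)))
        payloads.append(json.dumps({"query": sql}))
    return payloads

payloads = generate_payloads(
    "SELECT count(*) FROM t WHERE user_text = '{user_text}' AND __time >= '{start}'",
    {"user_text": ["Alice", "Bob"], "start": ["2024-01-01", "2025-01-01"]},
)
# 2 users x 2 start dates -> 4 distinct payloads
```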

I then grabbed a list of about 700 very prolific editors on enwiki (I filtered out ones whose names were annoying to quote). I combined that with a list of about 10000 random editors.

The queries I used queried for daily results over an entire year.

I then used the locust script to run concurrent requests (with a small random delay between each request), iterating through these payloads and causing distinct queries to go to Druid simultaneously. The runtime of each load test was 60 seconds.

I did this for both the edit count query, as well as the distinct pages edited query. I'll post results and queries I used in the next comment.

The code for all of this is at

https://gitlab.wikimedia.org/otto/druid_editor_metrics_load_test

Druid query load testing results

| query | concurrency | # reqs | # fails | Avg ms | Min ms | Max ms | Med ms | req/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| edit count | 100 | 4450 | 0 | 116 | 25 | 2614 | 54 | 76.80 |
| edit count | 1000 | 10686 | 0 | 4089 | 123 | 22787 | 3100 | 175.50 |
| distinct pages edited | 100 | 4224 | 11 (0.26%) | 180 | 25 | 5606 | 55 | 74.90 |
| distinct pages edited | 1000 | 9922 | 99 (1.00%) | 4535 | 68 | 26261 | 4600 | 193.20 |

What I get from this is:

  • edit count at ~75 req/s is mostly fine, but some prolific editors might have slower queries lasting a few seconds.
  • distinct pages edited at about 75 req/s is also mostly fine, but for some prolific editors this query might fail. (I should probably locate one or two of these editors and see what the error is...I'm guessing a timeout?)
  • neither query was great with concurrency=1000. Each seemed to reach a req/s plateau of < 200 req/s. I was only running this from a stat box, so it is possible the slowdowns were in the client, but somehow I doubt it.
    • The good news is that the request success rate stayed about the same. The higher load caused worse overall latency, but not necessarily failed queries.

I think these results are okay! The worst case throughput estimation for YiR was:

100-999k Bursts expected spread over 24 hour period

That's around ~10 req/s on average. If for some reason YiR is really popular and gets 1000 req/s for an extended period of time, users will see a slowdown in query responses.

We'll have to ask product about the distinct edited pages for prolific editors. I'll dig in a tiny bit more there first though.

This is not a scientific test, and I probably did something stupid, so if anyone sees anything please help me!


Edit Count Query:

SELECT
  __time,
  user_text,
  project,
  page_namespace,
  page_type, 
  count(*)
FROM mediawiki_history_reduced_2025_08 
WHERE 
  event_entity = 'revision' AND
  event_type = 'create' AND
  __time >= '{start}' AND
  __time < '{end}' AND
  user_text = '{user_text}'
GROUP BY
  __time,
  user_text,
  project,
  page_type,
  page_namespace
ORDER BY
  __time

Distinct Pages Edited query:

SELECT
  __time,
  user_text,
  project,
  page_namespace,
  page_type, 
  page_title
FROM mediawiki_history_reduced_2025_08 
WHERE 
  event_entity = 'revision' AND
  event_type = 'create' AND
  __time >= '{start}' AND
  __time < '{end}' AND
  user_text = '{user_text}'
GROUP BY
  __time,
  user_text,
  project,
  page_type,
  page_namespace,
  page_title
ORDER BY
  __time

I wanted to see if we could use one query to serve all the edit metrics. If we returned per-page counts, we could. I manually tested this query on a few prolific editors, and got errors.

Here are the load test results of this query.

| query | concurrency | # reqs | # fails | Avg ms | Min ms | Max ms | Med ms | req/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| per page edit count | 100 | 4085 | 77 (1.88%) | 231 | 20 | 5253 | 47 | 75.30 |

There are more errors (expected) and the worst case latencies are worse (also expected), but the best and median cases are about the same. This makes sense, in that most users only ever edit 1 or 2 pages, so daily per-page edit counts for them are not that different from daily edit count sums.

Seeing that the distinct pages edited query also failed (albeit for fewer editors: 11 vs 77), we might have another question for product. If they are okay with accepting some feature degradation for prolific editors anyway, they might prefer the simpler (and richer -- edit counts per page) endpoint?

Although, if we have 2 endpoints, the only metric they'd need to accept feature degradation for is distinct pages edited. With 1 endpoint they'd have to accept it for all of them. Hm.


Per page edit count query:

SELECT
  __time,
  user_text,
  project,
  page_type,   
  page_namespace,
  page_title,
  count(*)
FROM mediawiki_history_reduced_2025_08 
WHERE 
  event_entity = 'revision' AND
  event_type = 'create' AND
  __time >= '{start}' AND
  __time < '{end}' AND
  user_text = '{user_text}'
GROUP BY
  __time,
  user_text,
  project,
  page_type,  
  page_namespace,  
  page_title
ORDER BY
  __time

Oh but, what if I reduce the time range we are looking at?

Restricting the time range on the per page edit count query to 3 months changed the results:

| query | concurrency | # reqs | # fails | Avg ms | Min ms | Max ms | Med ms | req/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| daily per page edit count 3 months | 100 | 4711 | 0 (0.00%) | 40 | 15 | 2467 | 27 | 78.20 |

No failures! Okay!

And what if I use coarser time granularities? Monthly? Yearly? YiR doesn't need daily results...

| query | concurrency | # reqs | # fails | Avg ms | Min ms | Max ms | Med ms | req/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| monthly per page edit count 1 year | 100 | 4114 | 68 (1.65%) | 218 | 22 | 5465 | 51 | 75.80 |
| yearly per page edit count 1 year | 100 | 4553 | 0 (0.00%) | 82 | 22 | 4888 | 49 | 74.80 |

Interesting! Monthly granularity still has failures, but yearly is totally okay!


Monthly per page edit count query:

SELECT
  DATE_TRUNC('month', __time),
  user_text,
  project,
  page_type,
  page_namespace,
  page_title,
  count(*)
FROM mediawiki_history_reduced_2025_08
WHERE
  event_entity = 'revision' AND
  event_type = 'create' AND
  __time >= '{start}' AND
  __time < '{end}' AND
  user_text = '{user_text}'
GROUP BY
  DATE_TRUNC('month', __time),
  user_text,
  project,
  page_type,
  page_namespace,
  page_title
ORDER BY
  DATE_TRUNC('month', __time)

Yearly per page edit count query:

SELECT
  DATE_TRUNC('year', __time),
  user_text,
  project,
  page_type,
  page_namespace,
  page_title,
  count(*)
FROM mediawiki_history_reduced_2025_08
WHERE
  event_entity = 'revision' AND
  event_type = 'create' AND
  __time >= '{start}' AND
  __time < '{end}' AND
  user_text = '{user_text}'
GROUP BY
  DATE_TRUNC('year', __time),
  user_text,
  project,
  page_type,
  page_namespace,
  page_title
ORDER BY
  DATE_TRUNC('year', __time)

Okay, ^ T406069#11235581 is good news. To me this means:

  • We can use a single per page edit count endpoint for all desired edit metrics.
  • If you try to get daily or monthly edit counts over large time ranges, your query will have higher latency and/or may fail for prolific editors.
  • The product use cases we have don't need fine-grained metrics over large time ranges. They need either fine granularity over a small time range (e.g. daily in the last 30 or 60 days), or coarse granularity over a large time range (e.g. yearly for a year).

Current conclusion: Druid should suffice for our use cases, even with the single per page edit count metric endpoint.

(Again though! This was not a fully scientific test, and I very well may have done something stupid!)

I have been thinking about the 1 versus 2 solutions above, and I have more arguments for going with version 1:

  • Multiplying the number of (used) datasources in Druid will reduce caching optimization
  • The loading is very similar IMO
    • The latest datasource to use is stored as a row in a Cassandra table; we can use this to decide which datasource to load for solution 1
    • The complexity of loading is at the beginning of the month: when the new monthly data lands, you wish to load from the latest date of that monthly dataset. But this issue is the same whether we use solution 1 or 2!

I continue to think that keeping the solution with a single datasource instead of a duplicated one is better.

Discussed this a bit more. I didn't like Option 1 because I don't want to deal with aligning time schedules between two different DAGs. However, I discussed with Joseph and we had some ideas.

The monthly snapshot load actually runs a few days into the next month. If we don't align the schedule of the daily job, the first couple of days of the next month will be written into the previous snapshot, and the new month's snapshot will miss data for those first few days.

To avoid this, the daily job could load some number of days into the current month as well as preloading the next month, even before the next month's snapshot dataset is created. Before the monthly snapshot loads, the next month's dataset will have only daily loads, not the entire history. When the next month's snapshot loads, the first few days of the next month will already exist, and the daily load job can just keep trucking.

TBD how many days of overlap in both snapshots we need. We could potentially just always load daily into both the current and next snapshot datasets? Or perhaps just for the first few days of each month.
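A rough sketch of this double-loading idea, assuming we settle on a fixed number of overlap days (the helper and naming convention are illustrative, not the actual DAG code):

```python
from datetime import date

def snapshots_to_load(day: date, overlap_days: int = 5) -> list[str]:
    """Snapshot datasources a daily load for `day` should write to.

    Always the current month's snapshot; during the first `overlap_days`
    of a month, also the previous month's, so events land in the old
    snapshot too until the new monthly load has caught up.
    """
    targets = [f"mediawiki_history_reduced_{day.year}_{day.month:02d}"]
    if day.day <= overlap_days:
        prev_year, prev_month = (day.year - 1, 12) if day.month == 1 else (day.year, day.month - 1)
        targets.append(f"mediawiki_history_reduced_{prev_year}_{prev_month:02d}")
    return targets
```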

Pros:

  • Rollbacks are easy!
  • no extra duplicated _latest dataset.

I'm into it.

@JAllemandou @Milimetric should we consider Option 3?

3. Lambda-ize only one dataset

Instead of maintaining monthly snapshots in Druid, we could just maintain one mediawiki_history_reduced, and lambda-arch both daily and monthly loads into it. We could still keep the 'active dataset' mechanism and version this table, e.g. mediawiki_history_reduced_000 or something. If/when we need to generate new ones, or make one for a rollback reason, we could just load a new version and change the 'active dataset' switch in Cassandra.

We could transition from the snapshot status quo to the single dataset incrementally. E.g. use it for Global Editor Metrics now, and then switch all of the other AQS usages to it after we are confident it works.

Pros:

  • simple and easier to maintain

Cons:

  • No easy rollback. Rollbacks require manual regeneration of the dataset.

But, I assume that the Option 3 rollback con is bad enough to rule this out? Whatchya think?

hm, I'm not sure I see why this is easier than Option 2

But it does make me think of adding daily updates to an Iceberg version of this table, using that to load Druid, and using Iceberg snapshots to manage fine-grained rollback if we need.

hm, I'm not sure I see why this is easier than Option 2

Option 3 is the same as Option 2 except without maintaining monthly snapshot tables at all. It avoids Joseph's main objection about Option 2 which is duplication of dataset for both size and segment cache.

I'd be ok with Option 3, as it has been a long time since we needed to roll back. But indeed the rollback mechanism becomes more complicated.

I think I prefer Option 3. Rollback would be just as (maybe more?) complicated if we were to do this in Cassandra. Right now we have monthly snapshot rollback-ability for edit based metrics in Druid, but no auto rollback options for Cassandra, right?

Might as well have feature parity ? ;P

The easy path to Option 3 is to do Option 2 now, then migrate usages of monthly snapshot tables to Option 3's single table later.

This also decouples the active implementation of Global Editor Metrics from existing AQS usages, as we won't touch the datasets they use until it is time to migrate them to Option 3.

And, if Option 2 -> 3 turns out to be a bad idea, we can still change our minds and go back to Option 1.

@JAllemandou what do you think?

The only concern I have is data duplication (possibly suboptimal performance due to caching) and never making it to Option 3 once we have Option 2.

Hm, yeah. I think going full Option 3 is not that hard once we have Option 2, right? We'd:

  • Set AQS' active dataset to be e.g. mediawiki_history_reduced_1_0_0
  • Remove monthly loads to snapshot datasets?

That's it?

If we changed our mind and wanted Option 1 after all, indeed that would be more work (implementing the Option 1 double daily loading stuff).

@JAllemandou if we can get Andreas to commit us to going full Option 3 by, let's say, the end of this FY (July 2026), would that be okay? That would let us do work now for Global Editor Metrics, and then fit the full migration to Option 3 into Q3 or Q4?

Ah but @JAllemandou what do we do about the digests? I guess for Option 3, we have to split them to a new dataset?

Ah but @JAllemandou what do we do about the digests? I guess for Option 3, we have to split them to a new dataset?

That would be my preferred solution, but we could leave them in the main dataset, it should still work.

In a meeting (https://docs.google.com/document/d/19ErSFYSUCeRl8hcMZDfzxoRYwUxgIjL-QuOksAH_zHE/edit?tab=t.0#heading=h.lxmqn5wud7ny), we decided the following:

Druid Incremental Daily Decision

  • We will do Option 2 now for Global Editor Metrics.
  • Assuming the new Druid daily incremental mediawiki_history_reduced 'latest' dataset works as we expect, in FY2025-2026 Q3 we commit to making all existing AQS use cases point at this dataset. This will avoid the query segment cache problem pointed out by Joseph, allowing all of these queries to use the same segment cache.
    • TODO: make a task for this.
  • In the future, we would like to do something about the digests in Druid mediawiki_history_reduced.
    • Compute the daily digests also incrementally
    • and/or split the digests out into their own dataset.
    • TODO: make a task for this.

Data source decisions

Get user_central_id into mediawiki_history and mediawiki_content_history_v1 ASAP!

Decision documented at T405039#11247302

Update!

So, all the load testing I did was with Druid SQL GROUP BY. This generates a native GroupBy query; however, the granularity is always 'all', because Druid is not smart enough to translate the SQL time group-by into a native granularity. TBH I don't totally know what this means, but it doesn't sound great.

The load testing I did used SQL, so it was already executing this way. So it should be fine, but it would be better if I could execute a native Druid GroupBy and set granularity to daily or monthly as needed by the request.

However, to do this in AQS, I will have to implement a DruidGroupBy query wrapper, as it does not exist atm. We only have wrappers for Timeseries and TopN druid queries.

Adding a Druid SQL wrapper was much more flexible and straightforward, so that was going to be my approach.

However, after discussing with Joseph, he thinks the queries and user experience will be much better if we don't use a GroupBy for this at all. I looked again at the product requirements, and realized that the only need for per-page grouping was YiR "List of edited articles per user last year". YiR is no longer on our product timeline, so if I don't worry about that metric, I can serve all the other use cases with a Timeseries query.

Given I don't have to implement new AQS library code to accomplish this, I'm going to proceed with Timeseries. Slack thread here.
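For reference, a native Timeseries query makes the granularity explicit (which the SQL path wasn't doing). A sketch of what such a query body might look like; the dimension names are taken from the SQL queries above, and the datasource name and user are examples:

```python
import json

# Native Druid Timeseries query: granularity is explicit, unlike the
# SQL GROUP BY path, which always ran with granularity 'all'.
timeseries_query = {
    "queryType": "timeseries",
    "dataSource": "mediawiki_history_reduced_2025_08",
    "granularity": "day",
    "intervals": ["2025-01-01/2026-01-01"],
    "filter": {
        "type": "and",
        "fields": [
            {"type": "selector", "dimension": "event_entity", "value": "revision"},
            {"type": "selector", "dimension": "event_type", "value": "create"},
            {"type": "selector", "dimension": "user_text", "value": "ExampleUser"},
        ],
    },
    "aggregations": [{"type": "count", "name": "edit_count"}],
}

# POST this body to the Druid router at /druid/v2
body = json.dumps(timeseries_query)
```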

I'll comment on the parent task about the product decision, but given YiR no longer needs this this year, I think this will be fine.

@Ottomata so it sounds like we are ready to accept the mediawiki_history_reduced dataset as it is right now, but with user_central_id and page_id columns added? If so, I'll start backfilling September 2025.

user_central_id is now in Druid mediawiki_history_reduced! Thanks @amastilovic !

Here is the proposed solution we came with in today's meeting:

  • We are going to create a new dataset in Druid containing revision-events only, for the purpose of the Editors-metrics.
  • This dataset will be updated daily from a newly created Iceberg table.
  • This Iceberg table will be a replica of the mediawiki_content_history_v1 table, minus the content.
  • This table will be updated daily from the same sources and using the same process (copy/paste) as the mediawiki_content_history_v1 table:
    • pagechange events
    • reconciliation events
  • No reconciliation against MariaDB is needed, as this is already done by the mediawiki_content_history_v1 process handling the same data (minus text!)

After talking with @mforns this morning:

  • The 4 metrics defined in the task will always be requested for a single user_central_id.
  • The 4th metric, List of edited pages within a date range, will not be used this year (see this page) - no need to embed page information (page_title or page_id) in the new dataset for now.
  • It's ok to NOT have rows without a defined user_central_id (not NULL and strictly positive) in the new Druid dataset - the queries always filter for a single user_central_id. (Only a very small number of rows are filtered out, see https://phabricator.wikimedia.org/T410688#11404257)
  • We wish to have an exact copy of the mediawiki_history_reduced dataset in terms of dataset definition. Endpoints for the first three metrics are already serving data using mediawiki_history_reduced, and we now wish to have this data updated daily using a new datasource, while minimizing change.

For my own (out of the loop) understanding, here are the changes to previously made decisions:

Druid Incremental Daily Decision

T406069#11393933

Instead of doing Option 2 (a new mediawiki_history_reduced_latest dataset) as decided in T406069#11247231:

We are going to create a new dataset in Druid containing revision-events only, for the purpose of the Editors-metrics.

This new dataset will have the same field names as the current Druid mediawiki_history_reduced, except that it will be purpose-built to serve incremental global editor metrics. It will not have fields like user_type, and it will not have digests. The AQS edits-per-editor endpoint will need to be changed to query the new dataset, but no other changes should be needed.

A new intermediate Iceberg table will be created for the source data. This new Iceberg table will be backfilled from mediawiki_content_history. It will be continuously updated from mediawiki.page_change_v1 and mediawiki.revision_visibility_change, plus mediawiki_content_history reconciliation events.

NOTE: This is contrary to the decision made in T405039#11290461. An updated decision will be posted there soon.