
Backfill newly productionized edit types dataset
Open, Needs Triage, Public

Description

Context: As part of T293465: Edit Types Research -- more specifically T351225: Productionized Edit Types -- I am requesting that the new productionized edit types dataset be backfilled to March 2025, using all of the library updates, so that we can explore 2025 data as part of the Hypothesis WE 1.5.4 work.

Suggested steps: run edit types like Fabian usually does, and as a last step transform it to the rendering_feature_change data format, then write it to the table.

Done is:

  • The dataset will be backfilled going back to March 2025
  • The backfilled data will include the inline notes edit type

Note: I am aware that the backfilled data will only have namespace 0, and will only have Wikipedias. I am OK with these limitations.

Details

Other Assignee
AKhatun_WMF

Event Timeline

It would be nice to backfill the final production table, rather than our current development tables.

Perhaps for now, we should create a separate table to hold the backfill so Caroline can explore. When we have our .v1 table, we can then copy data from it. When we do, we will need to add the .v1 table to the sanitization allow list.

OTOH, perhaps all we need for now is a one-off? If so, this would be quick and easy enough to do.

Update:

  • Created akhatun.edit_type according to the edit-type stream's schema
    • Created as a hive table to maintain consistency with the (eventually) prod table for edit-types
    • Only difference: there is a field called delta.revision.rendering.features.error that captures any error encountered during edit-type computation. Streams have error_sink for this.
  • Wrote up some code in a notebook to port over whatever data we have available in research.mediawiki_content_html and compute edit types.
  • Saved a day's worth of data (30 March 2025) into the table.
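
For readers unfamiliar with the error-field idea above, here is a minimal sketch of how a batch job could record a failure in the row instead of dropping the row or failing the whole job. The function and key names (compute_with_error_field, "features", "error") are hypothetical and are not taken from the notebook; the real field sits at delta.revision.rendering.features.error in the table schema.

```python
from typing import Callable, Optional


def compute_with_error_field(
    compute_fn: Callable[[str, str], dict],
    parent_html: str,
    current_html: str,
) -> dict:
    """Run compute_fn and return its result plus an error slot.

    compute_fn stands in for the real edit-type computation (hypothetical).
    On failure the row is kept, features is None and error holds the reason.
    """
    features: Optional[dict] = None
    error: Optional[str] = None
    try:
        features = compute_fn(parent_html, current_html)
    except Exception as exc:  # record the failure instead of aborting the batch
        error = f"{type(exc).__name__}: {exc}"
    return {"features": features, "error": error}


# Stand-in computations used only to exercise the wrapper.
def fake_ok(parent_html: str, current_html: str) -> dict:
    return {"changed": parent_html != current_html}


def fake_broken(parent_html: str, current_html: str) -> dict:
    raise ValueError("unparseable html")


ok_row = compute_with_error_field(fake_ok, "<p>a</p>", "<p>b</p>")
bad_row = compute_with_error_field(fake_broken, "<p>a</p>", "<p")
```

Keeping failed rows with a populated error value makes it possible to count and inspect failures later with a plain query on the table.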

Todo/Questions:

  • @CMyrick-WMF to give a green light on edit-type data currently in mediawiki_page_edit_type_simple_dev1. This has the edit-types from the new version of edit type library (v3.1.0). If this looks good, we can one-time process data from March, 2025 for exploration purposes.
  • Should I host the code anywhere, get it reviewed? Hosting probably not required since it is a one-time spark job.
  • What should we consider for the spark job when performing a year's worth of backfill? Memory? Time? Need some wisdom.
    • Currently yarn-large is not enough for 12 hours of data. Requires more tuning.
org.apache.spark.SparkException: Job aborted due to stage failure: ... Reason: Container killed by YARN for exceeding physical memory limits. 8.8 GB of 8.8 GB physical memory used. Consider boosting spark.executor.memoryOverhead.
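
As a rough aid for reading that error: Spark documents the default spark.executor.memoryOverhead as 10% of executor memory with a 384 MiB minimum, and YARN kills the container when executor memory plus overhead is exceeded. The sketch below only does that arithmetic. The 8 GiB executor memory is an assumption inferred from the 8.8 GB figure in the log, not something stated in this task, and YARN's rounding to allocation increments is ignored.

```python
MIN_OVERHEAD_MIB = 384          # Spark's documented minimum overhead
DEFAULT_OVERHEAD_FACTOR = 0.10  # Spark's documented default overhead factor


def yarn_container_mib(executor_memory_mib, overhead_mib=None,
                       factor=DEFAULT_OVERHEAD_FACTOR):
    """Approximate container limit: executor memory plus memory overhead."""
    if overhead_mib is None:
        overhead_mib = max(MIN_OVERHEAD_MIB, int(executor_memory_mib * factor))
    return executor_memory_mib + overhead_mib


# Assumed 8 GiB executors (inferred from the log line, not confirmed).
default_limit_mib = yarn_container_mib(8 * 1024)
# Same executors with an explicit 2 GiB overhead, as the error message suggests trying.
boosted_limit_mib = yarn_container_mib(8 * 1024, overhead_mib=2048)

print(f"default limit: {default_limit_mib / 1024:.1f} GiB")
print(f"boosted limit: {boosted_limit_mib / 1024:.1f} GiB")
```

If the assumption holds, the default limit works out to the 8.8 GB in the error, so raising the overhead (or shrinking the per-task data with smaller batches) is the lever to try first.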

> Should I host the code anywhere, get it reviewed? Hosting probably not required since it is a one-time spark job.

Nah, but it would be good to post it somewhere. Here in phab is fine, or in a linked GitLab snippet, or whatever you prefer.

> What should we consider for the spark job when performing a year's worth of backfill? Memory? Time? Need some wisdom.

Probably @fkaelin can help, but perhaps just smaller batch periods? Loop and do a day or week or month at a time?
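
A minimal sketch of the "loop over smaller batch periods" idea, using only the standard library. process_batch is a hypothetical callback standing in for whatever submits the spark job for one window; the dates are examples only.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List, Tuple

Window = Tuple[date, date]


def date_batches(start: date, end: date, days_per_batch: int) -> Iterator[Window]:
    """Yield half-open [batch_start, batch_end) windows covering [start, end)."""
    if days_per_batch < 1:
        raise ValueError("days_per_batch must be >= 1")
    cursor = start
    while cursor < end:
        batch_end = min(cursor + timedelta(days=days_per_batch), end)
        yield cursor, batch_end
        cursor = batch_end


def run_backfill(start: date, end: date, days_per_batch: int,
                 process_batch: Callable[[date, date], None]) -> List[Window]:
    """Process the windows one after another and return the completed ones."""
    done: List[Window] = []
    for batch_start, batch_end in date_batches(start, end, days_per_batch):
        process_batch(batch_start, batch_end)  # hypothetical: run one spark job
        done.append((batch_start, batch_end))
    return done


# Example: three weeks of March 2025 in weekly batches, with a no-op job.
weekly = run_backfill(date(2025, 3, 1), date(2025, 3, 22), 7, lambda s, e: None)
```

Half-open windows avoid processing a boundary day twice, and the returned list makes it easy to resume after a failed batch.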

There is a notebooks folder in research-datasets, you could put a notebook there too.

Running this at scale in a notebook is unpleasant. You can run monthly jobs, but these need a beefy spark config and can be finicky with timeouts; you also need to batch the jobs. At times I use a tmux pyspark shell with for loops (e.g. you can sudo to write as system user).

But in this case the code is straightforward and there is no experimentation involved (which is where interactive work is useful), so you can also use airflow to do this. The edit types dag for html already exists:

  • your code needs to be packaged; in a notebook you already need to package the edit types library for the spark workers, when using airflow you also have to package the spark driver code as it runs on yarn itself
  • create a branch in the research-datasets repo, bump the edit types version and modify the run_html command. Do the operation on the dataframe/schema and write to your table like in your notebook. Push the branch, run the publish_conda_env job (test stage) via the gitlab UI, and find the conda env package registry link (wmfing.gitlab.latest_package_file("repos/research/research-datasets",version="your_branch")).
  • deploy an airflow dev instance for the research dags, you can modify the edit types dag if needed (maybe to create your iceberg table) but the start_date and spark config and schedule are tuned. Then in the airflow UI you change the edit_types variable and point conda_env to the gitlab package registry link for your branch.
  • then you activate the dag and wait. The benefit of this approach is that the jobs use smaller spark configs and run more stably, and airflow runs multiple jobs in parallel (the default is 3 I think, so this config uses 3*66, i.e. ~200 executors with 1 core each, when backfilling). When using a notebook you have to submit jobs sequentially, so running smaller jobs takes much longer.

Update:

  • Following @fkaelin's recommendation, running the backfill in an airflow devenv.
  • research dag branch: akhatun/edit-type-dag . research-datasets branch: akhatun/edit-type-backfill
  • @fkaelin / @Ottomata feel free to take a quick look at the code as a sanity check.
  • Job has been running for ~12 hours: 2 months of data has been backfilled into akhatun.edit_type.
  • @CMyrick-WMF still needs access to the table to ensure things look fine. She has been looking at the ingested table from streaming data (event.mediawiki_page_edit_type_simple_dev1) as well and making sure things look as expected.

Note that a lot of metadata fields that would otherwise have come from event data are null in the backfill. All edit-type data is present, along with whatever was available in the html table. Will confirm with Caroline that this is sufficient.

Update:

  • Caroline shared the query she is using to get the metrics from edit-types. The Query is here. The data I was backfilling did not contain performers.* fields (among others) as they are not present in research html table. These are required to compute the metrics.
  • Re-wrote the airflow job to now use event_sanitized.page_change_v1 table to gather all related meta-data, join with research.mediawiki_content_html, and then compute edit-types.

Hourly partitions (akhatun.edit_type_v2):

  • Changed the frequency to hourly jobs to sync with what the final prod table would look like.
  • With hourly job + joining on page_change table, caveat is that this made the overall backfill a bit slow.
    • With daily computation a year's worth of data would have been backfilled in ~3 days, but with no metadata.
    • Now it will require ~12 days, with meta-data. Noted to Caroline, this works for them.
  • Currently saving 1 file per hourly partition. The files are very small (~5MB). We could actually do daily partitions easily (1 file ~120MB, or a few files).
  • In hindsight, daily was more than good enough, since Moderator Metrics are monthly anyway.

Daily partitions (akhatun.edit_type_v3):

  • daily jobs need ~30 minutes. So 1 year's data will take ~3 days to finish. Continuing with this.
  • Each file is ~130MB. 1 file per daily partition.
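
A back-of-the-envelope check of the ~3 day estimate, as a sketch. It assumes 365 daily jobs of ~30 minutes each and 3 jobs running concurrently; the concurrency of 3 comes from the "default config is 3 I think" comment earlier in this task and has not been verified against the actual dag settings.

```python
import math


def backfill_days(n_jobs: int, minutes_per_job: float, parallel_jobs: int) -> float:
    """Wall-clock days if jobs run in waves of `parallel_jobs` at a time."""
    waves = math.ceil(n_jobs / parallel_jobs)
    return waves * minutes_per_job / (60 * 24)


# Assumed inputs: one job per day for a year, ~30 minutes each.
with_parallelism = backfill_days(n_jobs=365, minutes_per_job=30, parallel_jobs=3)
sequential = backfill_days(n_jobs=365, minutes_per_job=30, parallel_jobs=1)

print(f"3 concurrent jobs: ~{with_parallelism:.1f} days")
print(f"sequential:        ~{sequential:.1f} days")
```

Under these assumptions the parallel case lands a little under 3 days, which is in line with the estimate above, while strictly sequential submission would take more than a week.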

Backfill is now complete. akhatun.edit_type_v3 contains edit-type data from ns0 and just Wikipedias. Uses mwedittypes v3.1.0 and mwparserfromhtml v2.1.1.

Edit type data is view-able by Caroline in the Draft dashboard: https://superset.wikimedia.org/superset/dashboard/p/l39rmgDyv05/

[Image: number of daily users who inserted messagebox or inline cleanup notes]

Note:
We inner-joined research.mediawiki_content_html with event_sanitized.page_change_v1. page_change_v1 contains the metadata of the event as of the time of the event, whereas research.mediawiki_content_html was populated by backfill. As a result, some edits made to Draft pages (ns 118) that later became main-namespace pages (ns 0) are listed in mediawiki_content_html, even though the edits date from when the page was a Draft. In akhatun.edit_type_v3 the namespace column comes from page_change_v1, so the table contains some non-ns0 edits. For the baseline metrics, we should filter by ns0; that keeps the edits performed on a page that was in ns0 at the time of the edit. Coverage of other namespaces is not comprehensive.
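
To make the effect concrete, here is a toy sketch of that inner join in plain Python (not the actual spark job). The column names rev_id and namespace_at_edit and all row values are made up for illustration.

```python
# Event-side rows: namespace as recorded at the time of the edit (made-up data).
page_change = [
    {"rev_id": 1, "namespace_at_edit": 118},  # edit made while the page was a Draft
    {"rev_id": 2, "namespace_at_edit": 0},
    {"rev_id": 3, "namespace_at_edit": 0},    # no html row: dropped by the join
]

# Backfilled html rows: selected because the page is in ns0 today (made-up data).
content_html = [
    {"rev_id": 1, "html": "<p>draft text</p>"},
    {"rev_id": 2, "html": "<p>article text</p>"},
    {"rev_id": 4, "html": "<p>no matching event</p>"},  # dropped by the join
]


def inner_join(events, html_rows):
    """Keep only revisions present on both sides, merging their columns."""
    html_by_rev = {row["rev_id"]: row for row in html_rows}
    return [
        {**event, **html_by_rev[event["rev_id"]]}
        for event in events
        if event["rev_id"] in html_by_rev
    ]


joined = inner_join(page_change, content_html)
# Revision 1 survives the join with namespace 118, so filter explicitly for ns0.
ns0_only = [row for row in joined if row["namespace_at_edit"] == 0]
```

The explicit filter on the event-time namespace is what yields "edits on a page that was ns0 when edited", independent of how the html table was populated.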

@fkaelin Let me know if this sounds about right. Also, wondering if the html dataset would be missing some edits done on ns0 pages that later became some other kind of non ns0 page? Not sure how frequently this happens though.

Namespace distribution in akhatun.edit_type_v3
https://en.wikipedia.org/wiki/Wikipedia:Namespace

ns_id	cnt
0	131291365
2	388272
118	264604
102	2448
4	2169
3	1187
1	783
104	142
106	108
12	68
10	47
14	26
100	20
5	18
119	10
1728	7
710	4
15	3
13	2
107	1
103	1
1704	1
7	1
126	1
828	1
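
For a sense of scale, the counts above can be summarised with a few lines of Python. The numbers are copied from the table; nothing else is assumed.

```python
# ns_id -> cnt, copied from the distribution above.
namespace_counts = {
    0: 131291365, 2: 388272, 118: 264604, 102: 2448, 4: 2169, 3: 1187,
    1: 783, 104: 142, 106: 108, 12: 68, 10: 47, 14: 26, 100: 20, 5: 18,
    119: 10, 1728: 7, 710: 4, 15: 3, 13: 2, 107: 1, 103: 1, 1704: 1,
    7: 1, 126: 1, 828: 1,
}

total_rows = sum(namespace_counts.values())
ns0_share = namespace_counts[0] / total_rows
non_ns0_rows = total_rows - namespace_counts[0]

print(f"total rows:   {total_rows}")
print(f"ns0 share:    {ns0_share:.4%}")
print(f"non-ns0 rows: {non_ns0_rows}")
```

ns0 accounts for roughly 99.5% of rows, so filtering to ns0 for the baseline metrics discards only a small fraction of the table.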