Personal Accounts:
- Phab: tanny411
- Meta: Aisha Khatun
Check out my website/blog: http://tanny411.github.io/
Oops, de-drafted. Thanks!
Thanks @Eevans! Had a couple of discussions. Changed the field names a bit and added another field. This is what we want to move forward with:
{
"wiki_id": "enwiki",
"page_id": 12345,
"editor_total_count": 127,
"editor_bot_count": 4,
"editor_logged_out_count": 21,
"editor_permanent_count": 93,
"editor_temporary_count": 9,
"page_is_deleted": false,
"updated_at": "2026-05-26 00:00:00.000Z"
}
Thank you!
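In the example row above, the four per-type counts (4 + 21 + 93 + 9) add up to editor_total_count (127). A minimal sketch of a sanity check, assuming the per-type counts are meant to partition the total; that is my assumption from this one example, not something the schema states. The helper name `counts_are_consistent` is hypothetical.

```python
import json

# The example row from the comment above, copied as-is.
EXAMPLE_ROW = """
{
  "wiki_id": "enwiki",
  "page_id": 12345,
  "editor_total_count": 127,
  "editor_bot_count": 4,
  "editor_logged_out_count": 21,
  "editor_permanent_count": 93,
  "editor_temporary_count": 9,
  "page_is_deleted": false,
  "updated_at": "2026-05-26 00:00:00.000Z"
}
"""

# ASSUMPTION: these four fields are mutually exclusive and together
# cover every editor counted in editor_total_count.
PART_FIELDS = (
    "editor_bot_count",
    "editor_logged_out_count",
    "editor_permanent_count",
    "editor_temporary_count",
)


def counts_are_consistent(row: dict) -> bool:
    """Return True when the per-type counts sum to the total (hypothetical check)."""
    return sum(row[field] for field in PART_FIELDS) == row["editor_total_count"]


row = json.loads(EXAMPLE_ROW)
print(counts_are_consistent(row))
```

If the categories turn out to overlap (for instance, if bots are also counted as permanent accounts), this check would not apply as written.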
Details/Caveats of this dataset:
Some analysis on user status (anonymous, temporary, permanent, cross-wiki)
Update:
Full monthly load stats for all wikis, 2026-03 snapshot.
Ah, that makes sense! We don't need to get data into sanitized right now. Just wanted to inform. But looks like we are good. Thanks!
Wanted to note here:
Following up on my comment T425573#11913740 and adding to @nshahquinn-wmf: It would help to add the MWH fields in snapshot rows even if they don't exist in event rows. For contributor counts as well, it is agreeable to have some fields populated monthly and reconcile with current data on our side as seen fit, but at least having the fields in the same table is helpful for that.
@xcollazo Yes, we still have very frequent warnings.
https://airflow.wikimedia.org/dags/refine_webrequest_hourly_text/grid?search=refine_webrequest_hourly_text
Almost all the warning emails were sent here.
We should also get rid of the Hive tables for the dev and rc0 versions. Can we just drop the tables? Do we also need to clean up the HDFS files?
Want to add
eqiad.mw_page_edit_type_enrich.error
We have another use-case that wants to use MWH (hopefully). Contributors Count
The number of unique editors that have contributed to a given article within a Wikimedia project. Ideally, this data point would then be able to be split based on the type of editor; for example, a community bot, a logged in user, or an anonymous user.
Upgraded to eventutilities-spark 1.4.6 with dependencies. Re-running did not work: since the data is already reconciled, no new events were being emitted to Kafka, and the code path was not being run.
After some debugging with @JAllemandou we found that with ivy, all dependencies of eventutilities-spark were being downloaded automatically. But when we set artifacts explicitly with artifact("eventutilities-spark-1.4.1.jar"), the dependencies are not auto-resolved. We need to add a -with-dependencies version of the jar. 1.4.1 does not have a -with-dependencies jar. Will try eventutilities-spark-1.4.6-shaded-with-dependencies.jar locally and create an MR if that works.
From current Ops Week:
Same problem today for maintenance dags in main airflow
Exception in thread "main" java.io.FileNotFoundException: /tmp/table_maintenance_iceberg_monthly/ivy_spark3/cache/resolved-org.apache.spark-spark-submit-parent-7071631f-d152-48b4-bb0f-788ee707e4d1-1.0.xml (Permission denied)
@Ottomata Sorry, I can't fully remember what happened during the DC switchover. Wanted to confirm: when a DC switchover does happen, page_change_v1 will have events in codfw. html_content_change will consume from codfw.page_change.v1 but output to eqiad.page_html_content_change.v1, correct?
Then page_html_feature_counts_change will always ingest from eqiad and output to eqiad.
The hypothesis is now complete. Final update can be found here: https://app.asana.com/0/0/1214459375535326
@CMyrick-WMF: You should (if you haven't already) deduplicate on wiki_id, rev_id. As we've already noticed, other events (deletes/moves) can contain the same old rev_id and hence produce a duplicate of the edit types. Plus, sometimes the same event is duplicated too, for instance due to reprocessing.
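A minimal sketch of that deduplication in plain Python, assuming rows are dicts carrying wiki_id and rev_id. Keeping the first row seen per key is an arbitrary choice for illustration; the function name `dedupe_by_revision` and the `source` field in the sample rows are hypothetical.

```python
from typing import Iterable, List


def dedupe_by_revision(rows: Iterable[dict]) -> List[dict]:
    """Keep the first row seen for each (wiki_id, rev_id) pair.

    ASSUMPTION: which duplicate to keep does not matter here; a real job
    might prefer the latest row by some timestamp instead.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (row["wiki_id"], row["rev_id"])
        if key in seen:
            continue  # duplicate from a delete/move event or from reprocessing
        seen.add(key)
        unique.append(row)
    return unique


# Hypothetical sample rows: the same rev_id arrives via an edit and a later move,
# and the same rev_id number also exists on a different wiki.
sample = [
    {"wiki_id": "enwiki", "rev_id": 1, "source": "edit"},
    {"wiki_id": "enwiki", "rev_id": 1, "source": "move"},
    {"wiki_id": "dewiki", "rev_id": 1, "source": "edit"},
]
deduped = dedupe_by_revision(sample)
print(len(deduped))
```

The key includes wiki_id because rev_id values are only unique within a wiki.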
What do we need to do to have these datasets in event_sanitized?
So html pipeline
Yes, the html pipeline is just not adding the diff, the current html is present.
@Ottomata, @JMonton-WMF
There are some events in the error sink. Not a new error: there are events in the old edit-type error sink with the same error messages. I've checked, and all of them are because
"$schema": "/error/2.1.0", "dt": "2026-04-23T22:02:40Z", "emitter_id": "mw-page-edit-type-enrich-next", "error_type": "ValueError", "errored_schema_uri": "/development/rendering_content_change/1.0.0", "errored_stream_name": "mediawiki.page_html_content_change.dev5", "message": "ValueError(\"Required field(s) missing or empty: 'delta.revision.rendering.content.content_body' (unified diff). Cannot proceed with enrichment for event (meta_id=67a12db2-44db-4942-9438-6fdae30a0537; rev_id=982629652; domain=en.wikipedia.org).\")",
(took an enwiki example from the previous stream for convenience)
the incoming event's delta is null. All of them are undelete events. I have spot-checked, and these rev_ids don't have a parent rev_id (example revision). So it makes sense that the delta is null, and we should be OK to ignore these. Wondering whether we should handle these events, or whether letting them go to the error sink is fine?
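If we did decide to handle these rather than let them go to the error sink, one option is a guard that checks for the diff before enrichment. This is only a sketch: the field path comes from the error message above, while the function names `get_content_diff` and `should_skip_enrichment` are hypothetical and not the enrichment job's actual code.

```python
from typing import Optional

# Field path taken from the error message:
# 'delta.revision.rendering.content.content_body'
DIFF_PATH = ("delta", "revision", "rendering", "content", "content_body")


def get_content_diff(event: dict) -> Optional[str]:
    """Walk the nested path and return the unified diff, or None if missing or empty."""
    node = event
    for key in DIFF_PATH:
        if not isinstance(node, dict):
            return None  # e.g. "delta" is null, as on the undelete events
        node = node.get(key)
    return node if isinstance(node, str) and node else None


def should_skip_enrichment(event: dict) -> bool:
    """Hypothetical guard: True when there is no diff to compute edit types from."""
    return get_content_diff(event) is None


# Hypothetical events for illustration only.
undelete_like = {"delta": None}
edit_like = {
    "delta": {"revision": {"rendering": {"content": {"content_body": "@@ -1 +1 @@"}}}}
}
print(should_skip_enrichment(undelete_like), should_skip_enrichment(edit_like))
```

A guard like this would keep the error sink quieter, at the cost of no longer surfacing these events there.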
With the new stream mediawiki.page_html_feature_counts_change.rc0 declared https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1276397, requesting SRE help to
Backfill is now complete. akhatun.edit_type_v3 contains edit-type data from ns0 and just Wikipedias. Uses mwedittypes v3.1.0 and mwparserfromhtml v2.1.1.