For July 2022, querying MediaWiki History gave total_edits = 50M. The total from editors_daily was 44.8M:
total_edits_ED = spark.run("""
    SELECT sum(edit_count)
    FROM wmf.editors_daily
    WHERE month = '2022-07'
      AND NOT user_is_anonymous
      AND action_type = 0  -- using action_type IN (0, 1, 2) gives similar results
""")
We calculated global_north_edits = 24M and global_south_edits = 3.7M. Querying for economic_region == "unknown" gave global_unknown_editors = 2,999 and global_unknown_edits = 17M for July.
This leaves 12% of edits with no region label at all: neither north, south, nor unknown. The gap may be due to edits whose information has been deleted, plus edits that have no assigned log_action and therefore never reach editors_daily.
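The rounded figures quoted above can be cross-checked with simple arithmetic. The sketch below is only a sanity check, not the original calculation: it assumes the gap is the MediaWiki History total minus the sum of the three region buckets, which the note does not state explicitly. The function name and constants are illustrative. On these rounded inputs the gap is about 10.6% of the MediaWiki History total, or about 11.8% of the editors_daily total, so the quoted 12% is closest to the second base.

```python
# Rounded July 2022 figures as quoted in the text (number of edits).
MEDIAWIKI_HISTORY_TOTAL = 50_000_000
EDITORS_DAILY_TOTAL = 44_800_000
REGION_EDITS = {
    "north": 24_000_000,
    "south": 3_700_000,
    "unknown": 17_000_000,
}


def reconcile(history_total, editors_daily_total, region_edits):
    """Compare the region-labeled edits against both reported totals."""
    labeled = sum(region_edits.values())
    gap = history_total - labeled
    return {
        "labeled": labeled,
        "gap": gap,
        # Gap as a percentage of each candidate base (assumption: the note
        # does not say which denominator was used for its 12% figure).
        "gap_pct_of_history": 100 * gap / history_total,
        "gap_pct_of_editors_daily": 100 * gap / editors_daily_total,
    }


summary = reconcile(MEDIAWIKI_HISTORY_TOTAL, EDITORS_DAILY_TOTAL, REGION_EDITS)
for key, value in summary.items():
    print(key, value)
```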
MWW shared this query to better understand the edits that are logged in the revision table but do not make it into the editors_daily table:
spark.run('''
    WITH edits AS (
        SELECT rev_id
        FROM wmf_raw.mediawiki_revision
        WHERE snapshot = "2022-07"
          AND wiki_db = "nnwiki"
          AND rev_timestamp >= "20220701000000"
          AND rev_timestamp < "20220801000000"
        UNION ALL
        SELECT ar_rev_id
        FROM wmf_raw.mediawiki_archive
        WHERE snapshot = "2022-07"
          AND wiki_db = "nnwiki"
          AND ar_timestamp >= "20220701000000"
          AND ar_timestamp < "20220801000000"
    ),
    cu_changes AS (
        SELECT cuc_id, cuc_this_oldid
        FROM wmf_raw.mediawiki_private_cu_changes
        WHERE month = "2022-07"
          AND wiki_db = "nnwiki"
          AND cuc_timestamp >= "20220701000000"
          AND cuc_timestamp < "20220801000000"
          AND cuc_type IN (0, 1)
    )
    SELECT edits.*, cuc_id
    FROM edits
    LEFT JOIN cu_changes ON rev_id = cuc_this_oldid
    WHERE cuc_id IS NULL
''')
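The query relies on the LEFT JOIN ... WHERE ... IS NULL pattern (an anti-join): it keeps only the revisions that have no matching cu_changes row. A minimal sketch of that pattern on toy data follows, using an in-memory SQLite database in place of Spark. The rev_id and cuc_id values and the helper name unmatched_revisions are hypothetical and exist only to show the mechanics.

```python
import sqlite3


def unmatched_revisions(edit_rev_ids, cu_rows):
    """Return rev_ids that have no matching cu_changes row (anti-join)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edits (rev_id INTEGER)")
    conn.execute(
        "CREATE TABLE cu_changes (cuc_id INTEGER, cuc_this_oldid INTEGER)"
    )
    conn.executemany(
        "INSERT INTO edits VALUES (?)", [(rev_id,) for rev_id in edit_rev_ids]
    )
    conn.executemany("INSERT INTO cu_changes VALUES (?, ?)", cu_rows)
    rows = conn.execute(
        """
        SELECT edits.rev_id
        FROM edits
        LEFT JOIN cu_changes ON edits.rev_id = cu_changes.cuc_this_oldid
        WHERE cu_changes.cuc_id IS NULL   -- keep only edits with no match
        ORDER BY edits.rev_id
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# Toy data (hypothetical IDs): revisions 102 and 104 have no cu_changes row.
toy_edits = [101, 102, 103, 104]
toy_cu_changes = [(1, 101), (2, 103)]
missing = unmatched_revisions(toy_edits, toy_cu_changes)
print(missing)
```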
Further research is needed to understand why total_edits from MediaWiki History differs from total_edits in editors_daily (which was used to calculate the geo breakouts), and which wikis are most affected. For now, judging from the rc_type field in the cu_changes table on nnwiki, some of the edits missing from editors_daily appear to be moves, redirect updates, and other non_content_create edits.
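One possible way to approach the "which wikis are most affected" question is to compare per-wiki totals from the two sources and rank wikis by the share of edits that go missing. The sketch below assumes per-wiki counts have already been pulled into plain dictionaries. The function name, wiki names, and counts are all made up for illustration and are not results.

```python
def rank_wikis_by_gap(history_edits, editors_daily_edits):
    """Rank wikis by the share of MediaWiki History edits absent from editors_daily."""
    rows = []
    for wiki, total in history_edits.items():
        counted = editors_daily_edits.get(wiki, 0)  # wiki may be absent entirely
        gap = total - counted
        share = gap / total if total else 0.0
        rows.append((wiki, gap, share))
    # Largest missing share first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical per-wiki counts, for illustration only.
toy_history = {"wiki_a": 1000, "wiki_b": 200, "wiki_c": 50}
toy_editors_daily = {"wiki_a": 900, "wiki_b": 150}

ranking = rank_wikis_by_gap(toy_history, toy_editors_daily)
for wiki, gap, share in ranking:
    print(f"{wiki}: {gap} edits missing ({share:.0%})")
```

Ranking by share rather than absolute gap keeps small wikis with a large relative loss visible. Sorting on the gap column instead would surface the wikis that contribute most to the overall difference.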