Page MenuHomePhabricator

Review why total_edits on Mediawiki_History differs from the total_edits on Editors_Daily
Open, MediumPublic5 Estimated Story Points

Description

In July, we calculated total_edits=50M, by querying Mediawiki History. Total edits from editors_daily = 44.8M.

total_edits_ED = spark.run("""
SELECT sum(edit_count)
FROM wmf.editors_daily
WHERE month = '2022-07'
AND not user_is_anonymous
AND action_type =0 -- # (0,1,2) #gives similar results
""")

We calculated global_north_edits=24M and global_south_edits=3.7M. Global_unknown_editors in July = 2,999 and global_unkonwn_edits in July = 17M when we query for economic_region == "unknown".
This leaves 12% of edits that had no region label, neither north nor south nor unknown. This may be attributed to edits that have their information deleted + edits that do not have an assigned log_action and therefore do not move to Editors_Daily.

MWW shared this query to better understand those edits that are logged in the revision table but are not making it into the editor_daily table:

spark.run('''
WITH edits AS (
    SELECT
        rev_id
    FROM wmf_raw.mediawiki_revision
    WHERE snapshot = "2022-07"
    AND wiki_db = "nnwiki"
    AND rev_timestamp >= "20220701000000"
    AND rev_timestamp < "20220801000000"
    UNION ALL
    SELECT
        ar_rev_id
    FROM wmf_raw.mediawiki_archive
    WHERE snapshot = "2022-07"
    AND wiki_db = "nnwiki"
    AND ar_timestamp >= "20220701000000"
    AND ar_timestamp < "20220801000000"
),
cu_changes AS (
    SELECT
        cuc_id,
        cuc_this_oldid
    FROM wmf_raw.mediawiki_private_cu_changes
    WHERE month = "2022-07"
    AND wiki_db = "nnwiki"
    AND cuc_timestamp >= "20220701000000"
    AND cuc_timestamp < "20220801000000"
    AND cuc_type IN (0, 1)
)
SELECT
    edits.*,
    cuc_id
FROM edits
LEFT JOIN cu_changes
ON rev_id = cuc_this_oldid
WHERE cuc_id IS NULL
''')

Further research is required to understand the reason for the gap and why total_edits on Mediawiki_History differs from the total_edits on Editors_Daily (which has been used to calculate the geo breakouts) and to understand which wikis are most impacted. At present from looking at the rc_type field in the cu_changes table on nnwiki, it looks like some of the edits that are not getting logged in the editors_daily table are moves and redirect updates and other non_content_create edits.

Event Timeline

Understanding this will be important to T307883 so that we don't replicate any issues

Mayakp.wiki added a subscriber: Milimetric.

Added this to data engineering's radar.

I ended up being more curious about this after our chat and so I looked up the entries in cu_changes on nnwiki around the time of one of the revisions that we identified did not have a corresponding match on cuc_this_oldid.

Said edit is a page move that leaves a redirect, which means that it results in two revisions: one for the move of the original page, and one for the creation of the redirect page. The first of these is rev_id = 3395496 and has a match in cu_changes on cuc_this_oldid with cuc_type = 3 (RC_LOG according to the manual for rc_type). The other is rev_id = 3395497 and has no corresponding match in cu_changes.

Note that cuc_type = 3 also encompasses a lot of other things besides these moves such as account creations, account renames, triggered abuse filters, logins, etc… Hence, we can't just add that to a query and get the right number of edits. Instead, we might need to identify non-matching revisions using wmf_raw.mediawiki_revision and wmf_raw.mediawiki_archive and count them that way.

@EChetty were concerned about the gap. Can you and team fix this issue?

This issue impacts regional and editor reporting using the Editors_Daily table.

As far as Key Metrics, this impacts one of the charts in the Superset Editors Dashboard, Active Editors by Market. Note, we are considering retiring global market editors metrics, see T316580.

This is also important to keep in mind for task T307883.

Found this other task from 2018 that's still open and mentions the same, but from a different point of view: T189044

Indeed, I had come to the same conclusion around the time we launched mediawiki_history. I think the root problem here is that we don't have solid metric definitions that get into all these details. As Joseph points out on that task, T189044#5016011, there are many ways we can label these edits and many ways we can include/exclude them from different metrics. But without a clear definition of what we're counting and why, there's no "problem" in the technical sense. Just ambiguity.

EChetty triaged this task as Medium priority.Oct 5 2022, 6:22 PM
EChetty moved this task from Backlog to To be discussed on the Data Pipelines board.
EChetty set the point value for this task to 5.Nov 28 2022, 8:21 PM
EChetty moved this task from To be prioritised to Sprint 05-06 on the Data Pipelines board.
EChetty edited projects, added Data Pipelines (Sprint 05-06); removed Data Pipelines.