Page MenuHomePhabricator

Mediawiki History: moves counted twice in Revision
Open, NormalPublic

Description

While attempting to vet cu_changes data against mediawiki history, I hit a problem. Page moves, triggered either by the renaming of a page, move-over-redirect, user renames, and so on, are logged in the revision table as two edits. One on the old page and one on the new page. These then count towards active editors, and sometimes if the pages are in namespace 0, they count towards "content" active editors.

I'm not sure how the latter is handled in Wikistats 1.0, but the counting of these makes it very hard to reconcile data automatically against what I'm seeing in the cu_changes table. Because all moves are logged there with cuc_type = 3, but so are other things such as abuse filter triggers, etc. So there's no way to selectively include data from cu_changes to match data from revision.

Also, it's kind of a problem that these actions count twice even though they're just one action. Maybe this helps explain some of the data discrepancies that @JAllemandou was having? Either way, a closer look at this would benefit both datasets. For now, I will only include cuc_type in (0, 1) and so the editor numbers for geowiki will be considerably lower than those from mediawiki_history and Wikistats 2.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 6 2018, 7:27 PM

Super interesting finding !!!
Let's discuss what to do with that.

Nuria added a subscriber: Nuria.Mar 6 2018, 8:09 PM

and so the editor numbers for geowiki will be considerably lower than those from mediawiki_history and Wikistats 2.

The number of edits might change but not the number of editors. Right?

Milimetric added a comment.EditedMar 6 2018, 8:35 PM

and so the editor numbers for geowiki will be considerably lower than those from mediawiki_history and Wikistats 2.

The number of edits might change but not the number of editors. Right?

Both would change, for example editors whose only edits are these kinds of moves. But I should've been more clear, geowiki counts editors who do X edits, where X is 1, 5, 100, etc. So the 1 case might not change that much, but the 5 case and 100 case have significant differences from the queries I did, that's why I looked at it more carefully. For future reference, here are some example queries that look at the different datasets:

-- NOTE: inserting into geowiki_daily filters out bots
 select count(*)
   from (select user_id_or_ip
           from geowiki_daily
          where wiki_db = 'rowiki'
            and month='2018-02'
          group by user_id_or_ip
         having sum(edit_count) >= 5
        ) active_editors
;

-- NOTE: counts revisions on deleted pages which is fine because cu_changes is not updated
 select if(event_user_is_anonymous, event_user_text, event_user_id) user_id_or_ip

   from wmf.mediawiki_history

  where event_entity = 'revision'
    and event_type = 'create'
    and event_timestamp between '2018-02-01 00:00:00' and '2018-03-01 00:00:00'
    and wiki_db = 'rowiki'
    and snapshot = '2018-02'
    and not array_contains(event_user_groups, 'bot')
    and not event_user_is_bot_by_name

  group by if(event_user_is_anonymous, event_user_text, event_user_id)
 having count(*) >= 5
  order by user_id_or_ip
  limit 10000
;

-- NOTE: this is where I first saw a discrepancy of "move-related" revisions being logged with cuc_type = 3, more editors show up with this query therefore:
 select user_id_or_ip
   from (select if(rev_user = 0, rev_user_text, rev_user) as user_id_or_ip,
                count(*) edits
           from revision
          where rev_timestamp between '20180201000000' and '20180301000000'
          group by if(rev_user = 0, rev_user_text, rev_user)

          union all

         select if(ar_user = 0, ar_user_text, ar_user) as user_id_or_ip,
                count(*) edits
           from archive
          where ar_timestamp between '20180201000000' and '20180301000000'
          group by if(ar_user = 0, ar_user_text, ar_user)
        ) editors
            left join
        user_groups         on ug_user = user_id_or_ip
                            and ug_group = 'bot'

  where ug_user is null
  group by user_id_or_ip
 having sum(edits) >= 5
;
*/
mforns triaged this task as Normal priority.Apr 19 2018, 4:53 PM
mforns added a subscriber: mforns.Mar 11 2019, 3:59 PM

@JAllemandou @Milimetric Was something done in this regard, during data quality work?

IMO this issue is not data-quality as in a problem in the dataset generation, but rather a problem of data-semantics and how we interpret our data. By this I mean the effort on data quality we made this quarter is not related to this issue.

@JAllemandou @Milimetric
(grosking) Should we then add some documentation in regards to that?

mforns raised the priority of this task from Normal to Needs Triage.Mar 25 2019, 5:32 PM
mforns triaged this task as Normal priority.