Page MenuHomePhabricator

History: mismatched historical and latest values
Open, MediumPublic


Comparing latest state for each user to make sure user_groups, user_blocks, is_bot_by, user_text, and their _historical counterparts have matching values, I found a small number of mismatches, but a lot of which are in important data, like sysop and bureaucrat groups.

Here is what I did to find the problems.

  1. Checked that each user has at most one record with end_timestamp is null, making sure this returned no results:
select wiki_db, user_id
  from wmf.mediawiki_user_history
 where snapshot='2020-06'
   and end_timestamp is null
 group by wiki_db, user_id
 having count(*) > 1
  1. Counting mismatches:
   with latest_state_with_checks as (
 select wiki_db, user_id,
        concat_ws('', sort_array(user_groups)) = concat_ws('', sort_array(user_groups_historical))  as groups_match,
        concat_ws('', sort_array(user_blocks)) = concat_ws('', sort_array(user_blocks_historical))  as blocks_match,
        concat_ws('', sort_array(is_bot_by)) = concat_ws('', sort_array(is_bot_by_historical))       as bot_by_matches,
        user_text = user_text_historical                                                            as user_text_matches

   from wmf.mediawiki_user_history
  where snapshot='2020-06'
    and end_timestamp is null

 select count(*) total,
        sum(if(not groups_match, 1, 0)) groups_mismatched,
        sum(if(not blocks_match, 1, 0)) blocks_mismatched,
        sum(if(not bot_by_matches, 1, 0)) bot_by_mismatches,
        sum(if(not user_text_matches, 1, 0)) user_text_mismatches

   from latest_state_with_checks
  1. Analyze results

This is a tiny tiny difference:

.004%  mismatched bot_by fields
.0006% mismatched user_blocks fields
.02% mismatched user_groups fields

But in the case of sysop groups it makes a difference:

   with latest_state_bureaucrats_or_sysops as (
 select wiki_db, user_id,

        array_contains(user_groups, 'sysop') or array_contains(user_groups_historical, 'sysop') as sysop_somehow,
        array_contains(user_groups, 'sysop') and not array_contains(user_groups_historical, 'sysop') as sysop_only_latest,
        not array_contains(user_groups, 'sysop') and array_contains(user_groups_historical, 'sysop') as sysop_only_historical,

        array_contains(user_groups, 'bureaucrat') or array_contains(user_groups_historical, 'bureaucrat') as bureaucrat_somehow,
        array_contains(user_groups, 'bureaucrat') and not array_contains(user_groups_historical, 'bureaucrat') as bureaucrat_only_latest,
        not array_contains(user_groups, 'bureaucrat') and array_contains(user_groups_historical, 'bureaucrat') as bureaucrat_only_historical

   from wmf.mediawiki_user_history
  where snapshot='2020-06'
    and end_timestamp is null
    and (
           array_contains(user_groups, 'sysop') or array_contains(user_groups_historical, 'sysop')
        or array_contains(user_groups, 'bureaucrat') or array_contains(user_groups_historical, 'bureaucrat')

 select sum(if(sysop_somehow, 1, 0)) sysops,
        sum(if(sysop_only_latest, 1, 0)) sysop_only_latest,
        sum(if(sysop_only_historical, 1, 0)) sysop_only_historical,
        sum(if(bureaucrat_somehow, 1, 0)) bureaucrats,
        sum(if(bureaucrat_only_latest, 1, 0)) bureaucrat_only_latest,
        sum(if(bureaucrat_only_historical, 1, 0)) bureaucrat_only_historical

   from latest_state_bureaucrats_or_sysops

Which shows a significant mismatch, 3997 + 1563 / 11079 is 50% and 100 + 767 / 1797 is 48%. This is a little confusing, it's the percent records that don't align out of all records, so not quite as bad as it sounds. It means that the latest states in the mediawiki_user_history table are not propagating current data properly.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 27 2020, 5:53 PM
fdans triaged this task as Medium priority.Aug 3 2020, 4:36 PM
fdans moved this task from Incoming to Data Quality on the Analytics board.
fdans awarded a token.
fdans added a project: Product-Analytics.
LGoto moved this task from Triage to Tracking on the Product-Analytics board.Aug 4 2020, 5:16 PM