Page MenuHomePhabricator

page_id is null where it shouldn't be in mediawiki history
Open, LowPublic

Description

 select event_entity,
        event_type,
        sum(if(page_id is null, 1, 0)) as null_page_id,
        count(1) as total
   from wmf.mediawiki_history
  where snapshot='2020-06'
    and event_entity in ('page', 'revision')
  group by event_entity,
        event_type
;
event_entityevent_typenull_page_idtotal
pagemerge028018
pagedelete1123124560155501
pagerestore734261571489
pagecreate-page047032453
pagemove29927926674648
revisioncreate272010684818723080
pagecreate11231244538243985

I also found other types of events with duplicates. page_id = 0, and other duplicates:

 select event_entity,
        event_type,
        event_user_id,
        event_user_text_historical,
        event_timestamp,
        page_id,
        revision_id,
        user_id
   from wmf.mediawiki_history
  where snapshot='2020-06'
  group by
        event_entity,
        event_type,
        event_user_id,
        event_user_text_historical,
        event_timestamp,
        page_id,
        revision_id,
        user_id
 having count(1) > 1
;

Here's an example of an alterblocks problem. That last record has the same start and end timestamp, causing two events in the denormalized table to have the same timestamp, and making it necessary to look at other fields for uniqueness:

 select *
   from wmf.mediawiki_history
  where snapshot = '2020-06'
    and wiki_db = 'enwiki'
    and event_entity = 'user'
    and event_type = 'alterblocks'
    and event_user_id = 1272505
    and event_user_text_historical = 'ST47'
    and event_timestamp = '2009-02-04 03:59:18.0'
    and wiki_db = 'enwiki'
    and user_id = 6144686
  limit 1000
;
wiki_dbuser_iduser_text_historicaluser_textuser_groups_historicaluser_groupsuser_blocks_historicaluser_blocksis_bot_by_historicalis_bot_byuser_registration_timestampuser_creation_timestampuser_first_edit_timestampcreated_by_selfcreated_by_systemcreated_by_peeranonymousstart_timestampend_timestampcaused_by_event_typecaused_by_user_idcaused_by_user_textcaused_by_anonymous_usercaused_by_block_expirationinferred_fromsource_log_idsource_log_commentsource_log_paramssnapshot
enwiki6144686Jack MerridoodooJack Merridoodoo[][][][][][]2008-01-08 08:34:38.02008-01-08 08:34:38.02008-01-08 08:52:42.0truefalsefalsefalse2008-01-08 08:34:38.02008-01-08 09:10:09.0create6144686Jack MerridoodoofalseNULLNULL12958868{"unparsed":"6144686"}2020-06
enwiki6144686Jack MerridoodooJack Merridoodoo[][]["nocreate"][][][]2008-01-08 08:34:38.02008-01-08 08:34:38.02008-01-08 08:52:42.0truefalsefalsefalse2008-01-08 09:10:09.02008-07-28 20:26:26.0alterblocksrEMFR125785563531Gogo DodofalseindefiniteNULL12959074[[Wikipedia:Vandalism/gVandalism]]-only account{"unparsed":"indefinite nocreate"}2020-06
enwiki6144686Jack MerridoodooJack Merridoodoo[][][][][][]2008-01-08 08:34:38.02008-01-08 08:34:38.02008-01-08 08:52:42.0truefalsefalsefalse2008-07-28 20:26:26.02008-07-28 20:34:41.0alterblocks1807070East718falseNULLNULL16452718unblocking to reblock{"unparsed":""}2020-06
enwiki6144686Jack MerridoodooJack Merridoodoo[][]["nocreate","noemail"][][][]2008-01-08 08:34:38.02008-01-08 08:34:38.02008-01-08 08:52:42.0truefalsefalsefalse2008-07-28 20:34:41.02008-11-15 03:32:02.0alterblocks1807070East718falseindefiniteNULL16453396grawp{"unparsed":"infinite nocreate,noemail"}2020-06
enwiki6144686Jack MerridoodooJack Merridoodoo[][]["nocreate","noemail"][][][]2008-01-08 08:34:38.02008-01-08 08:34:38.02008-01-08 08:52:42.0truefalsefalsefalse2008-11-15 03:32:02.02009-02-04 03:59:18.0alterblocks3153598Fran RogersfalseindefiniteNULL18256521[[User:JarlaxleArtemis]]{"unparsed":"indefinite nocreate,noemail"}2020-06
enwiki6144686Jack MerridoodooJack Merridoodoo[][]["nocreate","noemail","nousertalk"][][][]2008-01-08 08:34:38.02008-01-08 08:34:38.02008-01-08 08:52:42.0truefalsefalsefalse2009-02-04 03:59:18.0NULLalterblocks1272505ST47falseindefiniteNULL20134780[[User:JarlaxleArtemis]]{"unparsed":"indefinite nocreate,noemail,nousertalk"}2020-06
enwiki6144686Jack MerridoodooJack Merridoodoo[][][][][][]2008-01-08 08:34:38.02008-01-08 08:34:38.02008-01-08 08:52:42.0truefalsefalsefalse2009-02-04 03:59:18.02009-02-04 03:59:18.0alterblocks1272505ST47falseNULLNULL20134779{"unparsed":""}2020-06

Event Timeline

mforns triaged this task as Medium priority.Aug 10 2020, 3:47 PM
mforns moved this task from Incoming to Data Quality on the Analytics board.
mforns subscribed.

Maybe a task for Lex.

I have investigated quickly and we should take page_artificial_id into consideration: most events having page-id null or 0 have a defined page-artificial-id, meaning they have a kind of identifier. I have also found some duplicates without page-artificial ids, but a very small number - will investigate later.

I thought we excluded these from denormalized, and left them just in page history. I can see why we'd want to include them. Hm... it is very confusing. Maybe we should give them a negative page id?

Hi @mforns @JAllemandou @Milimetric ! If this seems like a good first issue for a newbie to understand Wiki Datasets , could I look into this ? Will be grateful for any guidance or pointers :)

@Meghajain171192 do ping me if you want to talk over this, I'm @Milimetric on IRC

odimitrijevic lowered the priority of this task from Medium to Low.Dec 3 2021, 9:21 PM
odimitrijevic edited projects, added Data-Engineering; removed Analytics-Data-Quality.
odimitrijevic moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board.