Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history
Closed, Resolved (Public)

Description

Some registered users have null values for one of the user name fields. This seems to fit one of two patterns:

  • A revision event has a null event_user_text, even though event_user_text_historical and event_user_id match and are correct and there is no indication the user has ever been renamed.
  • A page or user event has event_user_text, event_user_text_historical, and event_user_is_anonymous as null, even though event_user_id is correct

Examples:

  • S7w4j9 (user ID 64) on yuewiktionary
  • Rovack (user ID 467551) on enwiki
  • SheriffIsInTown (user ID 2987925) on dewiki

For a query showing the number of affected rows, see P8210.
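
As an illustration only, the two patterns described above can be written as predicates over a simplified row. The `HistoryEvent` case class and the `NamePatterns` object are hypothetical names introduced for this sketch; they are not part of the refinery code, and nulls are modelled as `None`.

```scala
// Hypothetical, simplified view of a mediawiki_history row: only the fields
// named in the description, with SQL null modelled as None.
final case class HistoryEvent(
  eventEntity: String,
  eventUserId: Option[Long],
  eventUserText: Option[String],
  eventUserTextHistorical: Option[String],
  eventUserIsAnonymous: Option[Boolean]
)

object NamePatterns {
  // Pattern 1: a revision event whose current name is null although the
  // user ID and the historical name are both present.
  def isUnlinkedRevision(e: HistoryEvent): Boolean =
    e.eventEntity == "revision" &&
      e.eventUserId.isDefined &&
      e.eventUserText.isEmpty &&
      e.eventUserTextHistorical.isDefined

  // Pattern 2: a page or user event where both names and the anonymity
  // flag are null although the user ID is present.
  def isFullyUnlinked(e: HistoryEvent): Boolean =
    Set("page", "user").contains(e.eventEntity) &&
      e.eventUserId.isDefined &&
      e.eventUserText.isEmpty &&
      e.eventUserTextHistorical.isEmpty &&
      e.eventUserIsAnonymous.isEmpty
}
```

For example, `NamePatterns.isUnlinkedRevision(HistoryEvent("revision", Some(64L), None, Some("S7w4j9"), Some(false)))` is true for a row shaped like the first pattern.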

Event Timeline

Neil_P._Quinn_WMF moved this task from Triage to Tracking on the Product-Analytics board.
Milimetric moved this task from Incoming to Data Quality on the Analytics board. Mar 18 2019, 3:18 PM
Milimetric triaged this task as High priority.

Change 497604 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Correct mw user-history create event timestamp

https://gerrit.wikimedia.org/r/497604

Thanks Neil for having raised this.
I have found 3 issues:

  • One is that there was an inconsistency between the user create event's start_timestamp and the user_registration_timestamp, so users were not correctly linked to other events (this is the case for S7w4j9 and SheriffIsInTown in the examples above). See this patch: https://gerrit.wikimedia.org/r/#/c/497604/
  • The second is that by approximating user registration with the user's first edit when registration is undefined, we miss the opportunity to link the user to their real create event, which can happen long before the actual first edit (the Rovack example above). We should make an explicit distinction between registration (either defined in user-page or through create-event) and the user's first-edit timestamp (such a distinction is already coming to pages, so it makes sense to add it for users as well).
  • Finally, the user first-edit date was computed using revisions and not archive, so some archive rows dated before the first revision were not correctly attached to the user (corrected but not yet deployed: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/491494)
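
A minimal sketch of the distinction proposed above, assuming the three dates are kept as separate optional values and that the create event starts at the earliest one that is known (the confirmation comment later in this task describes start_timestamp as the minimum of the potential creation dates). The `UserCreateEvent` object and its parameter names are hypothetical and do not come from the patch.

```scala
import java.time.LocalDateTime

object UserCreateEvent {
  // Assumption for illustration: the create event starts at the earliest
  // candidate date that is defined; None if no candidate is known.
  def startTimestamp(
    registration: Option[LocalDateTime],
    creation: Option[LocalDateTime],
    firstEdit: Option[LocalDateTime]
  ): Option[LocalDateTime] =
    Seq(registration, creation, firstEdit).flatten
      .reduceOption((a, b) => if (a.isBefore(b)) a else b)
}
```

With an undefined registration, a creation date in 2005 and a first edit in 2018, this returns the 2005 creation date, so events between the two dates can still be linked to the user.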

Comments welcome (ping @Milimetric and @Neil_P._Quinn_WMF :)

@JAllemandou thank you for diagnosing and addressing this so quickly! 👏

I don't understand the whole history reconstruction process deeply enough to give any useful comments—so I'll just say thank you again 😁

fdans assigned this task to mforns. Apr 11 2019, 4:40 PM

Change 497604 merged by jenkins-bot:
[analytics/refinery/source@master] Update mw user-history timestamps

https://gerrit.wikimedia.org/r/497604

Confirmation of problem resolution in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

// Current datasource - normally the problem is present in here
val odf = spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
val oudf = spark.read.parquet("/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")

// New datasource - normally the problem is solved in here
val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
val udf = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")
  • user S7w4j9 (user ID 64) on yuewiktionary
// No more unlinked events for S7w4j9 on yuewiktionary
odf.where("event_user_text_historical = 'S7w4j9' and event_user_text is null and wiki_db = 'yuewiktionary'").count()
res0: Long = 885           
                                                     
df.where("event_user_text_historical = 'S7w4j9' and event_user_text is null and wiki_db = 'yuewiktionary'").count()
res1: Long = 0         


// This is because the create-event start_timestamp now reflects the minimum of the potential creation dates, whereas before it was based only on registration
oudf.where("user_text_historical = 'S7w4j9' and wiki_db = 'yuewiktionary' and caused_by_event_type = 'create'").select("user_registration_timestamp", "start_timestamp").show(10, false)
+---------------------------+---------------------+                             
|user_registration_timestamp|start_timestamp      |
+---------------------------+---------------------+
|2017-05-26 16:26:05.0      |2018-11-18 23:20:23.0|
+---------------------------+---------------------+

udf.where("user_text_historical = 'S7w4j9' and wiki_db = 'yuewiktionary' and caused_by_event_type = 'create'").select("user_registration_timestamp", "user_creation_timestamp", "user_first_edit_timestamp", "start_timestamp").show(10, false)
+---------------------------+-----------------------+-------------------------+---------------------+
|user_registration_timestamp|user_creation_timestamp|user_first_edit_timestamp|start_timestamp      |
+---------------------------+-----------------------+-------------------------+---------------------+
|2018-11-18 23:20:23.0      |2018-11-18 23:20:23.0  |2017-05-26 16:26:05.0    |2017-05-26 16:26:05.0|
+---------------------------+-----------------------+-------------------------+---------------------+
  • User SheriffIsInTown (user ID 2987925) on dewiki (same check as before):
odf.where("event_user_text_historical = 'SheriffIsInTown' and event_user_text is null and wiki_db = 'dewiki'").count()
res15: Long = 153    
df.where("event_user_text_historical = 'SheriffIsInTown' and event_user_text is null and wiki_db = 'dewiki'").count()
res16: Long = 0

oudf.where("user_text_historical = 'SheriffIsInTown' and wiki_db = 'dewiki' and caused_by_event_type = 'create'").select("user_registration_timestamp", "start_timestamp").show(10, false)
+---------------------------+---------------------+                             
|user_registration_timestamp|start_timestamp      |
+---------------------------+---------------------+
|2015-09-14 20:37:54.0      |2018-08-12 21:19:58.0|
+---------------------------+---------------------+


udf.where("user_text_historical = 'SheriffIsInTown' and wiki_db = 'dewiki' and caused_by_event_type = 'create'").select("user_registration_timestamp", "user_creation_timestamp", "user_first_edit_timestamp", "start_timestamp").show(10, false)
+---------------------------+-----------------------+-------------------------+---------------------+
|user_registration_timestamp|user_creation_timestamp|user_first_edit_timestamp|start_timestamp      |
+---------------------------+-----------------------+-------------------------+---------------------+
|2018-08-12 21:19:58.0      |2018-08-12 21:19:58.0  |2015-09-14 20:37:54.0    |2015-09-14 20:37:54.0|
+---------------------------+-----------------------+-------------------------+---------------------+
  • User Rovack (user ID 467551) on enwiki
df.where("event_user_text_historical = 'Rovack' and event_user_text is null and wiki_db = 'enwiki'").count()
res21: Long = 0                                                                 

odf.where("event_user_text_historical = 'Rovack' and event_user_text is null and wiki_db = 'enwiki'").count()
res22: Long = 1                                                                 

// start_timestamp in the old datasource does not take archived revisions into account
oudf.where("user_text_historical = 'Rovack' and wiki_db = 'enwiki' and caused_by_event_type = 'create'").select("user_registration_timestamp", "start_timestamp").show(10, false)
+---------------------------+---------------------+                             
|user_registration_timestamp|start_timestamp      |
+---------------------------+---------------------+
|2018-02-18 00:25:04.0      |2018-02-18 00:25:04.0|
+---------------------------+---------------------+

udf.where("user_text_historical = 'Rovack' and wiki_db = 'enwiki' and caused_by_event_type = 'create'").select("user_registration_timestamp", "user_creation_timestamp", "user_first_edit_timestamp", "start_timestamp").show(10, false)
+---------------------------+-----------------------+-------------------------+---------------------+
|user_registration_timestamp|user_creation_timestamp|user_first_edit_timestamp|start_timestamp      |
+---------------------------+-----------------------+-------------------------+---------------------+
|null                       |2005-09-30 20:59:06.0  |2018-02-18 00:19:05.0    |2005-09-30 20:59:06.0|
+---------------------------+-----------------------+-------------------------+---------------------+

odf.where("event_user_text_historical = 'Rovack' and event_user_text is null and wiki_db = 'enwiki'").select("event_entity", "event_timestamp", "revision_is_deleted").show(10, false)
+------------+---------------------+-------------------+                        
|event_entity|event_timestamp      |revision_is_deleted|
+------------+---------------------+-------------------+
|revision    |2018-02-18 00:19:05.0|true               |
+------------+---------------------+-------------------+

Now checking global numbers using P8210 as base:

// Old
spark.sql("""
select
event_entity,
year,
name_status,
count(*) as rows
from (
    select
        trunc(event_timestamp, 'YEAR') as year,
        case
            when (event_user_text is null and event_user_text_historical is null) then 'both_names_null'
            when event_user_text is null then 'current_name_null'
            when event_user_text_historical is null then 'historical_name_null'
            else 'neither_name_null'
        end as name_status,
        event_entity
    from wmf.mediawiki_history
    where
        snapshot = '2019-03' and
        (event_user_is_anonymous = false or event_user_is_anonymous is null)
) name_status
group by event_entity, year, name_status
order by event_entity, year, name_status
limit 1000
""").show(1000, false)
+------------+----------+-----------------+---------+                           
|event_entity|year      |name_status      |rows     |
+------------+----------+-----------------+---------+
|page        |null      |both_names_null  |24       |
|page        |1970-01-01|both_names_null  |574      |
|page        |1999-01-01|both_names_null  |3        |
|page        |2001-01-01|both_names_null  |55697    |
|page        |2001-01-01|neither_name_null|14040    |
|page        |2002-01-01|both_names_null  |49009    |
|page        |2002-01-01|neither_name_null|116830   |
|page        |2003-01-01|both_names_null  |112960   |
|page        |2003-01-01|neither_name_null|358257   |
|page        |2004-01-01|both_names_null  |392858   |
|page        |2004-01-01|neither_name_null|1519548  |
|page        |2005-01-01|both_names_null  |878923   |
|page        |2005-01-01|neither_name_null|5325096  |
|page        |2006-01-01|both_names_null  |975271   |
|page        |2006-01-01|neither_name_null|11399247 |
|page        |2007-01-01|both_names_null  |852619   |
|page        |2007-01-01|neither_name_null|16055619 |
|page        |2008-01-01|both_names_null  |762835   |
|page        |2008-01-01|neither_name_null|17290817 |
|page        |2009-01-01|both_names_null  |690553   |
|page        |2009-01-01|neither_name_null|18047836 |
|page        |2010-01-01|both_names_null  |626656   |
|page        |2010-01-01|neither_name_null|20684163 |
|page        |2011-01-01|both_names_null  |774778   |
|page        |2011-01-01|neither_name_null|22515646 |
|page        |2012-01-01|both_names_null  |614306   |
|page        |2012-01-01|neither_name_null|27893472 |
|page        |2013-01-01|both_names_null  |501882   |
|page        |2013-01-01|neither_name_null|44295719 |
|page        |2014-01-01|both_names_null  |477353   |
|page        |2014-01-01|neither_name_null|38134826 |
|page        |2015-01-01|both_names_null  |474235   |
|page        |2015-01-01|neither_name_null|41111714 |
|page        |2016-01-01|both_names_null  |450693   |
|page        |2016-01-01|neither_name_null|37223113 |
|page        |2017-01-01|both_names_null  |501832   |
|page        |2017-01-01|neither_name_null|56906629 |
|page        |2018-01-01|both_names_null  |463062   |
|page        |2018-01-01|neither_name_null|45764454 |
|page        |2019-01-01|both_names_null  |113350   |
|page        |2019-01-01|neither_name_null|9304296  |
|revision    |null      |current_name_null|2        |
|revision    |1970-01-01|current_name_null|578      |
|revision    |1999-01-01|current_name_null|3        |
|revision    |2001-01-01|current_name_null|10479    |
|revision    |2001-01-01|neither_name_null|49539    |
|revision    |2002-01-01|current_name_null|23311    |
|revision    |2002-01-01|neither_name_null|589095   |
|revision    |2003-01-01|current_name_null|71981    |
|revision    |2003-01-01|neither_name_null|2734172  |
|revision    |2004-01-01|current_name_null|391430   |
|revision    |2004-01-01|neither_name_null|15047843 |
|revision    |2005-01-01|current_name_null|1117317  |
|revision    |2005-01-01|neither_name_null|44435035 |
|revision    |2006-01-01|current_name_null|1632745  |
|revision    |2006-01-01|neither_name_null|116525758|
|revision    |2007-01-01|current_name_null|1891484  |
|revision    |2007-01-01|neither_name_null|164136590|
|revision    |2008-01-01|current_name_null|1739332  |
|revision    |2008-01-01|neither_name_null|176088922|
|revision    |2009-01-01|current_name_null|1356329  |
|revision    |2009-01-01|neither_name_null|191416058|
|revision    |2010-01-01|current_name_null|1103308  |
|revision    |2010-01-01|neither_name_null|200654337|
|revision    |2011-01-01|current_name_null|1239194  |
|revision    |2011-01-01|neither_name_null|207242472|
|revision    |2012-01-01|current_name_null|773537   |
|revision    |2012-01-01|neither_name_null|227048388|
|revision    |2013-01-01|current_name_null|489165   |
|revision    |2013-01-01|neither_name_null|316319640|
|revision    |2014-01-01|current_name_null|425565   |
|revision    |2014-01-01|neither_name_null|291181706|
|revision    |2015-01-01|current_name_null|163912   |
|revision    |2015-01-01|neither_name_null|329874121|
|revision    |2016-01-01|current_name_null|79769    |
|revision    |2016-01-01|neither_name_null|377725448|
|revision    |2017-01-01|current_name_null|53391    |
|revision    |2017-01-01|neither_name_null|437584429|
|revision    |2018-01-01|current_name_null|28219    |
|revision    |2018-01-01|neither_name_null|454034579|
|revision    |2019-01-01|current_name_null|811      |
|revision    |2019-01-01|neither_name_null|133574927|
|revision    |2025-01-01|neither_name_null|5        |
|user        |0005-01-13|both_names_null  |1        |
|user        |0006-01-13|both_names_null  |1        |
|user        |1943-01-01|both_names_null  |1        |
|user        |1945-01-01|both_names_null  |4        |
|user        |1947-01-01|both_names_null  |1        |
|user        |1948-01-01|both_names_null  |2        |
|user        |1949-01-01|both_names_null  |1        |
|user        |1951-01-01|both_names_null  |1        |
|user        |1952-01-01|both_names_null  |2        |
|user        |1953-01-01|both_names_null  |1        |
|user        |1957-01-01|both_names_null  |1        |
|user        |1961-01-01|both_names_null  |1        |
|user        |1962-01-01|both_names_null  |2        |
|user        |1963-01-01|both_names_null  |1        |
|user        |1966-01-01|both_names_null  |1        |
|user        |1970-01-01|both_names_null  |2        |
|user        |1971-01-01|both_names_null  |1        |
|user        |1972-01-01|both_names_null  |5        |
|user        |1973-01-01|both_names_null  |6        |
|user        |1974-01-01|both_names_null  |4        |
|user        |1975-01-01|both_names_null  |4        |
|user        |1976-01-01|both_names_null  |1        |
|user        |1977-01-01|both_names_null  |5        |
|user        |1979-01-01|both_names_null  |3        |
|user        |1980-01-01|both_names_null  |5        |
|user        |1981-01-01|both_names_null  |2        |
|user        |1982-01-01|both_names_null  |2        |
|user        |1983-01-01|both_names_null  |5        |
|user        |1984-01-01|both_names_null  |1        |
|user        |1986-01-01|both_names_null  |2        |
|user        |1988-01-01|both_names_null  |1        |
|user        |1990-01-01|both_names_null  |1        |
|user        |1997-01-01|both_names_null  |1        |
|user        |1999-01-01|both_names_null  |1        |
|user        |2001-01-01|both_names_null  |532      |
|user        |2002-01-01|both_names_null  |2856     |
|user        |2003-01-01|both_names_null  |18225    |
|user        |2004-01-01|both_names_null  |103251   |
|user        |2004-01-01|neither_name_null|214      |
|user        |2005-01-01|both_names_null  |330836   |
|user        |2005-01-01|neither_name_null|69326    |
|user        |2006-01-01|both_names_null  |74999    |
|user        |2006-01-01|neither_name_null|4248968  |
|user        |2007-01-01|both_names_null  |75908    |
|user        |2007-01-01|neither_name_null|5441025  |
|user        |2008-01-01|both_names_null  |135921   |
|user        |2008-01-01|neither_name_null|6195355  |
|user        |2009-01-01|both_names_null  |245568   |
|user        |2009-01-01|neither_name_null|8240898  |
|user        |2010-01-01|both_names_null  |223479   |
|user        |2010-01-01|neither_name_null|7473681  |
|user        |2011-01-01|both_names_null  |1059443  |
|user        |2011-01-01|neither_name_null|6722427  |
|user        |2012-01-01|both_names_null  |183690   |
|user        |2012-01-01|neither_name_null|11054129 |
|user        |2013-01-01|both_names_null  |102245   |
|user        |2013-01-01|neither_name_null|16310970 |
|user        |2014-01-01|both_names_null  |55874    |
|user        |2014-01-01|neither_name_null|24808379 |
|user        |2015-01-01|both_names_null  |62716    |
|user        |2015-01-01|neither_name_null|31787006 |
|user        |2016-01-01|both_names_null  |57500    |
|user        |2016-01-01|neither_name_null|20777202 |
|user        |2017-01-01|both_names_null  |36933    |
|user        |2017-01-01|neither_name_null|20534572 |
|user        |2018-01-01|both_names_null  |36588    |
|user        |2018-01-01|neither_name_null|21540787 |
|user        |2019-01-01|both_names_null  |8144     |
|user        |2019-01-01|neither_name_null|5395897  |
+------------+----------+-----------------+---------+


//New
df.createOrReplaceTempView("mwh")

spark.sql("""
select
event_entity,
year,
name_status,
count(*) as rows
from (
    select
        trunc(event_timestamp, 'YEAR') as year,
        case
            when (event_user_text is null and event_user_text_historical is null) then 'both_names_null'
            when event_user_text is null then 'current_name_null'
            when event_user_text_historical is null then 'historical_name_null'
            else 'neither_name_null'
        end as name_status,
        event_entity
    from mwh
    where
        (event_user_is_anonymous = false or event_user_is_anonymous is null)
) name_status
group by event_entity, year, name_status
order by event_entity, year, name_status
limit 1000
""").show(1000, false)

+------------+----------+-----------------+---------+                           
|event_entity|year      |name_status      |rows     |
+------------+----------+-----------------+---------+
|page        |1999-01-01|neither_name_null|3        |
|page        |2001-01-01|current_name_null|12       |
|page        |2001-01-01|neither_name_null|18243    |
|page        |2002-01-01|current_name_null|106      |
|page        |2002-01-01|neither_name_null|128725   |
|page        |2003-01-01|current_name_null|100      |
|page        |2003-01-01|neither_name_null|372628   |
|page        |2004-01-01|current_name_null|1603     |
|page        |2004-01-01|neither_name_null|1599965  |
|page        |2005-01-01|current_name_null|8536     |
|page        |2005-01-01|neither_name_null|6710774  |
|page        |2006-01-01|current_name_null|21878    |
|page        |2006-01-01|neither_name_null|14930303 |
|page        |2007-01-01|current_name_null|62864    |
|page        |2007-01-01|neither_name_null|22487114 |
|page        |2008-01-01|current_name_null|38353    |
|page        |2008-01-01|neither_name_null|22171407 |
|page        |2009-01-01|current_name_null|22761    |
|page        |2009-01-01|neither_name_null|22779898 |
|page        |2010-01-01|current_name_null|22902    |
|page        |2010-01-01|neither_name_null|24860401 |
|page        |2011-01-01|current_name_null|26026    |
|page        |2011-01-01|neither_name_null|27783391 |
|page        |2012-01-01|current_name_null|28022    |
|page        |2012-01-01|neither_name_null|31529324 |
|page        |2013-01-01|current_name_null|18816    |
|page        |2013-01-01|neither_name_null|47973025 |
|page        |2014-01-01|current_name_null|40068    |
|page        |2014-01-01|neither_name_null|41974442 |
|page        |2015-01-01|current_name_null|11666    |
|page        |2015-01-01|neither_name_null|45344168 |
|page        |2016-01-01|current_name_null|7016     |
|page        |2016-01-01|neither_name_null|41318169 |
|page        |2017-01-01|current_name_null|2262     |
|page        |2017-01-01|neither_name_null|60582382 |
|page        |2018-01-01|current_name_null|205      |
|page        |2018-01-01|neither_name_null|49461983 |
|page        |2019-01-01|current_name_null|7        |
|page        |2019-01-01|neither_name_null|10148712 |
|revision    |1999-01-01|neither_name_null|3        |
|revision    |2001-01-01|current_name_null|34       |
|revision    |2001-01-01|neither_name_null|59983    |
|revision    |2002-01-01|current_name_null|230      |
|revision    |2002-01-01|neither_name_null|645246   |
|revision    |2003-01-01|current_name_null|1713     |
|revision    |2003-01-01|neither_name_null|2833023  |
|revision    |2004-01-01|current_name_null|55953    |
|revision    |2004-01-01|neither_name_null|16351562 |
|revision    |2005-01-01|current_name_null|181798   |
|revision    |2005-01-01|neither_name_null|46387498 |
|revision    |2006-01-01|current_name_null|242888   |
|revision    |2006-01-01|neither_name_null|118393715|
|revision    |2007-01-01|current_name_null|305365   |
|revision    |2007-01-01|neither_name_null|165720630|
|revision    |2008-01-01|current_name_null|381797   |
|revision    |2008-01-01|neither_name_null|177444705|
|revision    |2009-01-01|current_name_null|86053    |
|revision    |2009-01-01|neither_name_null|192684039|
|revision    |2010-01-01|current_name_null|105998   |
|revision    |2010-01-01|neither_name_null|201649885|
|revision    |2011-01-01|current_name_null|103873   |
|revision    |2011-01-01|neither_name_null|208375497|
|revision    |2012-01-01|current_name_null|109977   |
|revision    |2012-01-01|neither_name_null|227709139|
|revision    |2013-01-01|current_name_null|71323    |
|revision    |2013-01-01|neither_name_null|316736676|
|revision    |2014-01-01|current_name_null|68510    |
|revision    |2014-01-01|neither_name_null|291556668|
|revision    |2015-01-01|current_name_null|18930    |
|revision    |2015-01-01|neither_name_null|330021631|
|revision    |2016-01-01|current_name_null|19155    |
|revision    |2016-01-01|neither_name_null|377786064|
|revision    |2017-01-01|current_name_null|3420     |
|revision    |2017-01-01|neither_name_null|437634405|
|revision    |2018-01-01|current_name_null|182      |
|revision    |2018-01-01|neither_name_null|454062617|
|revision    |2019-01-01|current_name_null|118      |
|revision    |2019-01-01|neither_name_null|133575620|
|revision    |2025-01-01|neither_name_null|5        |
|user        |1999-01-01|neither_name_null|1        |
|user        |2001-01-01|current_name_null|388      |
|user        |2001-01-01|neither_name_null|538      |
|user        |2002-01-01|current_name_null|55       |
|user        |2002-01-01|neither_name_null|1282     |
|user        |2003-01-01|current_name_null|38       |
|user        |2003-01-01|neither_name_null|3800     |
|user        |2004-01-01|current_name_null|116      |
|user        |2004-01-01|neither_name_null|15097    |
|user        |2005-01-01|current_name_null|534      |
|user        |2005-01-01|neither_name_null|625340   |
|user        |2006-01-01|current_name_null|2216     |
|user        |2006-01-01|neither_name_null|4291766  |
|user        |2007-01-01|current_name_null|1907     |
|user        |2007-01-01|neither_name_null|5473669  |
|user        |2008-01-01|current_name_null|2626     |
|user        |2008-01-01|neither_name_null|6203914  |
|user        |2009-01-01|current_name_null|1081     |
|user        |2009-01-01|neither_name_null|8248577  |
|user        |2010-01-01|current_name_null|1072     |
|user        |2010-01-01|neither_name_null|7479157  |
|user        |2011-01-01|current_name_null|633      |
|user        |2011-01-01|neither_name_null|6727900  |
|user        |2012-01-01|current_name_null|504      |
|user        |2012-01-01|neither_name_null|11058913 |
|user        |2013-01-01|current_name_null|386      |
|user        |2013-01-01|neither_name_null|16304563 |
|user        |2014-01-01|current_name_null|641      |
|user        |2014-01-01|neither_name_null|24807473 |
|user        |2015-01-01|current_name_null|268      |
|user        |2015-01-01|neither_name_null|31833972 |
|user        |2016-01-01|current_name_null|259      |
|user        |2016-01-01|neither_name_null|20768346 |
|user        |2017-01-01|current_name_null|67       |
|user        |2017-01-01|neither_name_null|20464236 |
|user        |2018-01-01|current_name_null|9        |
|user        |2018-01-01|neither_name_null|21523581 |
|user        |2019-01-01|current_name_null|1        |
|user        |2019-01-01|neither_name_null|5392549  |
+------------+----------+-----------------+---------+

Findings:

  • Quite a few more page events (deleted pages are now included in the dataset).
  • The old datasource had both_names_null for page and user; the new one has current_name_null.
  • While there are still some rows with current_name_null for page, revision and user, the proportion of events correctly linked to their user has improved considerably (looking at year 2000 onward, and treating the small number of rows in earlier years as problematic, mostly for user):
    • page event-entity, average ratio of neither_name_null is 99.922% on new datasource, was 84.744% on old one
    • user event-entity, average ratio of neither_name_null is 99.987% on new datasource, was 94.411% on old one
    • revision event-entity, average ratio of neither_name_null is 99.913% on new datasource, was 93.256% on old one
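
One possible way to derive such a ratio from the grouped counts is sketched below. It assumes the figure is the mean of the per-year share of neither_name_null rows; the comment does not say whether the averages were computed that way or from overall totals, so this is not claimed to reproduce the percentages above. `YearCount` and `LinkedRatio` are hypothetical names.

```scala
// One row of the grouped output for a single event entity.
final case class YearCount(year: Int, nameStatus: String, rows: Long)

object LinkedRatio {
  // Share of rows per year whose name_status is neither_name_null.
  def ratioByYear(counts: Seq[YearCount]): Map[Int, Double] =
    counts.groupBy(_.year).map { case (year, cs) =>
      val total = cs.map(_.rows).sum
      val linked = cs.filter(_.nameStatus == "neither_name_null").map(_.rows).sum
      year -> linked.toDouble / total
    }

  // Unweighted mean of the per-year shares (NaN when there are no rows).
  def averageRatio(counts: Seq[YearCount]): Double = {
    val ratios = ratioByYear(counts).values
    ratios.sum / ratios.size
  }
}

// Sample input: the 2018 and 2019 page rows from the new datasource above.
val pageSample = Seq(
  YearCount(2018, "current_name_null", 205L),
  YearCount(2018, "neither_name_null", 49461983L),
  YearCount(2019, "current_name_null", 7L),
  YearCount(2019, "neither_name_null", 10148712L)
)
val pageAverage = LinkedRatio.averageRatio(pageSample)
```

An unweighted mean gives every year the same influence regardless of its row count; weighting by rows would instead be dominated by recent, high-volume years.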

Hi @Neil_P._Quinn_WMF, sorry for the big comment above - Do you mind having a look and confirming this looks ok for you? Many thanks :)

JAllemandou claimed this task.
JAllemandou moved this task from Next Up to Ready to Deploy on the Analytics-Kanban board.
JAllemandou added a subscriber: mforns.

Change 497604 merged by Fdans:
[analytics/refinery/source@master] Update mw user-history timestamps

https://gerrit.wikimedia.org/r/497604

Hey @JAllemandou! This looks like a big improvement—thanks so much for all your work! I look forward to the fixes in the April snapshot :)

Nuria closed this task as Resolved. Tue, May 14, 8:35 PM