mediawiki_history missing page events
Closed, Resolved · Public · 8 Estimated Story Points

Description

For example, according to the on-wiki logs, the page "Jeff Caldwell (soccer)" on enwiki was deleted three times, restored once, and then moved.

But mediawiki_history only records the last two, and the move is actually marked as a creation. It also doesn't include any of the initial creations.

select
    event_type,
    event_timestamp,
    event_user_text,
    page_title,
    page_title_historical,
    page_id
from wmf.mediawiki_history
where
    event_entity = "page" and
    wiki_db = "enwiki" and
    (page_title_historical = "Jeff_Caldwell_(soccer)" or page_title = "Jeff_Caldwell_(soccer)") and
    snapshot = "2018-08"

  event_type        event_timestamp event_user_text                             page_title   page_title_historical   page_id
0     create  2018-07-19 13:00:57.0  Freefalling660  Freefalling660/Jeff_Caldwell_(soccer)  Jeff_Caldwell_(soccer)  57939448
1    restore  2018-07-31 17:33:57.0         Hut 8.5  Freefalling660/Jeff_Caldwell_(soccer)  Jeff_Caldwell_(soccer)  57939448

mediawiki_page_history records many more events, but it contains several duplicates and I find its schema much more confusing (I'm only including the query because the result is too long to print).

select
    page_id,
    page_id_artificial,
    page_title,
    page_title_historical,
    start_timestamp,
    end_timestamp,
    caused_by_event_type,
    caused_by_user_id
from wmf.mediawiki_page_history
where
    wiki_db = "enwiki" and
    (page_title_historical = "Jeff_Caldwell_(soccer)" or page_title = "Jeff_Caldwell_(soccer)") and
    snapshot = "2018-08"
order by start_timestamp asc
limit 1000

As another example, the page "Accidente ferroviario de Cerrillos de 1956" on eswiki has had quite a few events, but has no page events at all in mediawiki_history (same with mediawiki_page_history).

select
    event_type,
    event_timestamp,
    event_user_text,
    page_id
from wmf.mediawiki_history
where
    event_entity = "page" and
    wiki_db = "eswiki" and
    (page_title_historical = "Accidente ferroviario de Cerrillos de 1956" or page_title = "Accidente ferroviario de Cerrillos de 1956") and
    snapshot = "2018-08"

Is the data supposed to be this unreliable? Shouldn't mediawiki_history and mediawiki_page_history both be consistent?

On the wiki page, I see a note from almost a year ago saying that "History of pages with complex delete/restore patterns is on purpose not yet correctly worked. Will happen after Wikistats-2 release", but I feel like these issues are bigger than that implies.

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". Sep 27 2018, 12:44 AM
Ottomata raised the priority of this task from Medium to High. Oct 4 2018, 5:14 PM
Ottomata moved this task from Incoming to Data Quality on the Analytics board.

ping @Milimetric @joal @fdans so we have this on our radar for data quality

@Neil_P._Quinn_WMF this is something we wanted to work on this quarter, but it was derailed by the actor/comment refactor. I should've filled you in earlier, but here's what's going on.

The mediawiki_page_history table is built first. Where we have conflicting or invalid data, we sometimes generate inferred events like the "create" you see. And where we can't find a page_id, we sometimes assign a page_id_artificial instead. There are some bugs here because we sometimes assign both a page_id and an artificial one, so that's what we are working on now and next quarter. Clearly mediawiki has some way of figuring this out because it displays the logs, so I'm going to take a closer look and see if we can copy the logic.

The reason mediawiki_history disagrees with mediawiki_page_history is that we generate it by joining the page history to revision by page_id, so the join fails where we don't have page_ids. We'll get to the root of it, thanks for helping us.
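
The join behavior described above can be sketched in plain Scala (no Spark needed). This is only an illustration under stated assumptions, not the actual reconstruction job: the case classes, the `joinByPageId` helper, the artificial id string and the revision id are all hypothetical, and only `page_id` and `page_id_artificial` correspond to real field names.

```scala
// Hypothetical, simplified records; the real schemas have many more fields.
case class PageState(pageId: Option[Long], pageIdArtificial: Option[String], title: String)
case class Revision(revId: Long, pageId: Long)

object JoinSketch {
  // Inner join on page_id: a state without a real page_id can never match.
  def joinByPageId(states: Seq[PageState], revisions: Seq[Revision]): Seq[(PageState, Revision)] =
    for {
      s   <- states
      pid <- s.pageId.toSeq            // states with no page_id contribute nothing
      r   <- revisions if r.pageId == pid
    } yield (s, r)

  val states = Seq(
    PageState(Some(57939448L), None, "Jeff_Caldwell_(soccer)"),
    // Made-up state that only carries an artificial id.
    PageState(None, Some("artificial-1"), "Jeff_Caldwell_(soccer)")
  )

  // Made-up revision attached to the real page_id.
  val revisions = Seq(Revision(1L, 57939448L))

  def main(args: Array[String]): Unit = {
    val joined = joinByPageId(states, revisions)
    println(s"${states.size} states in, ${joined.size} joined rows out")
  }
}
```

In this sketch the state that only has an artificial id silently disappears from the joined result, which is the kind of disagreement between the two tables reported in the description.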

Interesting findings so far. @fdans and I found https://www.mediawiki.org/wiki/Manual:Log_search_table which means logging and revision are joined somewhere in mediawiki, accounting for all the problems in log_params. This potentially has many implications for reconstruction efforts.

The relevant places to look for logic are:

https://github.com/wikimedia/mediawiki/blob/master/includes/specials/SpecialLog.php
https://github.com/wikimedia/mediawiki/blob/master/includes/logging/LogPager.php

Checking improvements in new datasource.

// Current datasource - normally the problem is present in here
spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh_old")

// New datasource - normally the problem is solved in here
spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh")
  • page "Jeff Caldwell (soccer)" on enwiki
// Events in old datasource
spark.sql("""
     | select
     |     event_type,
     |     event_timestamp,
     |     event_user_text,
     |     page_title,
     |     page_title_historical,
     |     page_namespace,
     |     page_namespace_historical,
     |     page_id
     | from mwh_old
     | where
     |     event_entity = "page" and
     |     wiki_db = "enwiki" and
     |     (page_title_historical = "Jeff_Caldwell_(soccer)" or page_title = "Jeff_Caldwell_(soccer)")
     | order by page_id, event_timestamp
     | """).show(100, false)
+----------+---------------------+---------------+-------------------------------------+----------------------+--------------+-------------------------+--------+
|event_type|event_timestamp      |event_user_text|page_title                           |page_title_historical |page_namespace|page_namespace_historical|page_id |
+----------+---------------------+---------------+-------------------------------------+----------------------+--------------+-------------------------+--------+
|create    |2018-07-19 13:00:57.0|Freefalling660 |Freefalling660/Jeff_Caldwell_(soccer)|Jeff_Caldwell_(soccer)|2             |0                        |57939448|
|restore   |2018-07-31 17:33:57.0|Hut 8.5        |Freefalling660/Jeff_Caldwell_(soccer)|Jeff_Caldwell_(soccer)|2             |0                        |57939448|
|create    |2018-07-31 17:34:29.0|UncleTupelo1   |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|0             |0                        |60185989|
|create    |2019-03-10 01:19:40.0|UncleTupelo1   |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|1             |1                        |60185990|
+----------+---------------------+---------------+-------------------------------------+----------------------+--------------+-------------------------+--------+

// Events in new datasource
spark.sql("""
     | select
     |     event_type,
     |     event_timestamp,
     |     event_user_text,
     |     page_title,
     |     page_title_historical,
     |     page_namespace,
     |     page_namespace_historical,
     |     page_id
     | from mwh
     | where
     |     event_entity = "page" and
     |     wiki_db = "enwiki" and
     |     (page_title_historical = "Jeff_Caldwell_(soccer)" or page_title = "Jeff_Caldwell_(soccer)")
     | order by page_id, event_timestamp
     | """).show(100, false)
+----------+---------------------+-----------------+-------------------------------------+----------------------+--------------+-------------------------+--------+
|event_type|event_timestamp      |event_user_text  |page_title                           |page_title_historical |page_namespace|page_namespace_historical|page_id |
+----------+---------------------+-----------------+-------------------------------------+----------------------+--------------+-------------------------+--------+
|create    |null                 |null             |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|0             |0                        |50256784|
|delete    |2016-05-03 11:57:22.0|Sarahj2107       |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|0             |0                        |50256784|
|create    |null                 |null             |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|1             |1                        |50257013|
|delete    |2016-05-03 11:57:22.0|Sarahj2107       |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|1             |1                        |50257013|
|create    |2016-05-03 11:57:22.0|null             |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|0             |0                        |57427746|
|delete    |2018-05-16 12:13:51.0|Anthony Appleyard|Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|0             |0                        |57427746|
|create    |2016-05-03 11:57:22.0|null             |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|1             |1                        |57427914|
|delete    |2018-05-16 12:13:55.0|Anthony Appleyard|Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|1             |1                        |57427914|
|move      |2018-07-20 20:08:30.0|Freefalling660   |Freefalling660/Jeff_Caldwell_(soccer)|Jeff_Caldwell_(soccer)|2             |0                        |57939448|
|delete    |2018-07-30 20:41:25.0|Hut 8.5          |Freefalling660/Jeff_Caldwell_(soccer)|Jeff_Caldwell_(soccer)|2             |0                        |57939448|
|restore   |2018-07-31 17:33:57.0|Hut 8.5          |Freefalling660/Jeff_Caldwell_(soccer)|Jeff_Caldwell_(soccer)|2             |0                        |57939448|
|create    |2018-05-16 12:13:55.0|null             |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|1             |1                        |57978189|
|delete    |2018-07-30 20:41:30.0|Hut 8.5          |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|1             |1                        |57978189|
|create    |2018-07-31 17:34:29.0|UncleTupelo1     |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|0             |0                        |60185989|
|create    |2019-03-10 01:19:40.0|UncleTupelo1     |Jeff_Caldwell_(soccer)               |Jeff_Caldwell_(soccer)|1             |1                        |60185990|
+----------+---------------------+-----------------+-------------------------------------+----------------------+--------------+-------------------------+--------+
  • page "Accidente ferroviario de Cerrillos de 1956" on eswiki
// Events in old datasource
spark.sql("""
     | select
     |     event_type,
     |     event_timestamp,
     |     event_user_text,
     |     page_title,
     |     page_title_historical,
     |     page_namespace,
     |     page_namespace_historical,
     |     page_id
     | from mwh_old
     | where
     |     event_entity = "page" and
     |     wiki_db = "eswiki" and
     |     (page_title_historical = "Accidente_ferroviario_de_Cerrillos_de_1956" or page_title = "Accidente_ferroviario_de_Cerrillos_de_1956")
     | order by page_id, event_timestamp
     | """).show(100, false)
+----------+---------------------+---------------+------------------------------------------+------------------------------------------+--------------+-------------------------+-------+
|event_type|event_timestamp      |event_user_text|page_title                                |page_title_historical                     |page_namespace|page_namespace_historical|page_id|
+----------+---------------------+---------------+------------------------------------------+------------------------------------------+--------------+-------------------------+-------+
|create    |2018-07-25 12:34:48.0|LuisCG11       |Accidente_ferroviario_de_Cerrillos_de_1956|Accidente_ferroviario_de_Cerrillos_de_1956|0             |0                        |8589445|
+----------+---------------------+---------------+------------------------------------------+------------------------------------------+--------------+-------------------------+-------+


// Events in new datasource
spark.sql("""
     | select
     |     event_type,
     |     event_timestamp,
     |     event_user_text,
     |     page_title,
     |     page_title_historical,
     |     page_namespace,
     |     page_namespace_historical,
     |     page_id
     | from mwh
     | where
     |     event_entity = "page" and
     |     wiki_db = "eswiki" and
     |     (page_title_historical = "Accidente_ferroviario_de_Cerrillos_de_1956" or page_title = "Accidente_ferroviario_de_Cerrillos_de_1956")
     | order by page_id, event_timestamp
     | """).show(100, false)
+----------+---------------------+----------------+----------------------------------------------------------+------------------------------------------+--------------+-------------------------+-------+
|event_type|event_timestamp      |event_user_text |page_title                                                |page_title_historical                     |page_namespace|page_namespace_historical|page_id|
+----------+---------------------+----------------+----------------------------------------------------------+------------------------------------------+--------------+-------------------------+-------+
|create    |null                 |null            |LuisCG11/Taller/Accidente_ferroviario_de_Cerrillos_de_1956|Accidente_ferroviario_de_Cerrillos_de_1956|2             |0                        |4950497|
|delete    |2014-03-01 20:10:07.0|Lourdes Cardenal|LuisCG11/Taller/Accidente_ferroviario_de_Cerrillos_de_1956|Accidente_ferroviario_de_Cerrillos_de_1956|2             |0                        |4950497|
|restore   |2018-07-25 12:33:59.0|Marcelo         |LuisCG11/Taller/Accidente_ferroviario_de_Cerrillos_de_1956|Accidente_ferroviario_de_Cerrillos_de_1956|2             |0                        |4950497|
|create    |2018-07-25 12:34:48.0|LuisCG11        |Accidente_ferroviario_de_Cerrillos_de_1956                |Accidente_ferroviario_de_Cerrillos_de_1956|0             |0                        |8589445|
+----------+---------------------+----------------+----------------------------------------------------------+------------------------------------------+--------------+-------------------------+-------+

Keeping this task open as another refactor on page-history is on the way.

Results confirmed after page-history algorithm refactor. Marking as done :)

Let's mark it as done when the snapshot that has the fixes is live; I think that should be June, correct?

Part of the confusion here is that Special:Log is matching the page exactly, while looking for page_title or page_title_historical matches will match both namespace 0 and namespace 1 events, making it look like some data is weirdly duplicated. Just dropping this simple thought here as I was very confused for a while.
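
To make the namespace point concrete, here is a small plain-Scala sketch (no Spark) that contrasts a title-only match with a title-plus-namespace match. The `PageEvent` class and the two helper functions are hypothetical; the four rows are copied from the new-datasource output for "Jeff_Caldwell_(soccer)" earlier in this task, where namespace 0 is the article and namespace 1 its talk page.

```scala
// Hypothetical, simplified event record; only the fields needed here.
case class PageEvent(eventType: String, titleHistorical: String, namespaceHistorical: Int, pageId: Long)

object NamespaceFilterSketch {
  // Four delete events taken from the new-datasource output in this task.
  val events = Seq(
    PageEvent("delete", "Jeff_Caldwell_(soccer)", 0, 50256784L),
    PageEvent("delete", "Jeff_Caldwell_(soccer)", 1, 50257013L),
    PageEvent("delete", "Jeff_Caldwell_(soccer)", 0, 57427746L),
    PageEvent("delete", "Jeff_Caldwell_(soccer)", 1, 57427914L)
  )

  // Title-only match, as in the queries in this task: mixes article and talk page.
  def byTitle(title: String): Seq[PageEvent] =
    events.filter(_.titleHistorical == title)

  // Title plus namespace: closer to what Special:Log shows for one exact page.
  def byTitleAndNamespace(title: String, namespace: Int): Seq[PageEvent] =
    events.filter(e => e.titleHistorical == title && e.namespaceHistorical == namespace)

  def main(args: Array[String]): Unit = {
    println(byTitle("Jeff_Caldwell_(soccer)").size)                // article + talk events
    println(byTitleAndNamespace("Jeff_Caldwell_(soccer)", 0).size) // article events only
  }
}
```

The equivalent change in the queries would presumably be an extra condition on page_namespace_historical (or page_namespace), depending on which state of the page is of interest.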

I do see a glitch for this page's history though, which is that it shows up as moved in the log on 2018-07-31 17:34:29, but we interpret that as a create because of how our algorithm works. So I haven't figured out how consequential this is or how often it happens, but it'll be something to look out for.

Nuria set the point value for this task to 8.