Page MenuHomePhabricator

Issues with page deleted dates on data lake
Closed, ResolvedPublic

Description

From irc on conversation with @Nettrom

Using page creations from July 2015, where page IDs should work as authoritative identifiers, I find no deletion timestamps in the data lake match the logging table. Even found one where the “revision_deleted_timestamp” is the patrol timestamp from the logging table. I’ll work on documenting this and open a ticket on Phab.

Event Timeline

Nuria created this task.Mar 22 2018, 6:02 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 22 2018, 6:02 PM
Nuria added a comment.Mar 22 2018, 6:02 PM

Ping @Nettrom did you ended up filing a ticket for the issue described? I could not find it. (cc @joal) If not : could you describe issue a bit more?

As mentioned on IRC earlier today, I never filed a ticket because I didn't have the time to sit down and make sure I had data that allowed me to understand exactly what the problem is. Picked it up again today because I now have some time to dig in.

The underlying question is: should two columns in two different tables in the Data Lake be equal to each other? In the mediawiki_page_history table there is the end_timestamp, while in the mediawiki_history table there is the revision_deleted_timestamp. In the case of a page being deleted, I would expect both of them to match the timestamp of the deletion event in the given wiki's logging table. The documentation (mediawiki_history schema and mediawiki_page_history schema) are not particularly clear on this point, but seems to suggest what I'm describing.

I've made some progress on this and plan to keep chipping at it over the next few days. My idea is to gather a couple of sample datasets (from enwiki) and then compare the timestamps in those against the revision and logging tables to see what they match with.

Sounds awesome, do let us know of your findings.

fdans triaged this task as Normal priority.Mar 26 2018, 4:05 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
fdans moved this task from Smart Tools for Better Data to Data Quality on the Analytics board.

And yes, the end_timestamp and revision_deleted_timestamp should match and match the point in time the page was deleted, as recorded in the logging table.

I've spent a bit of time looking at this, and as far as I can find, the revision_deleted_timestamp is consistently incorrect. Using a sample dataset of creations from four different months, I've found that 15% of the time the deletion timestamp is missing. For pages that have it set, the vast majority of entries (almost 90%) do not match against the logging table. Lastly, of those that match against the logging table, it's almost always not a page deletion event.

Read on for more details, and there's a quick writeup in this work log.

I used four separate months from recent years because my experience that those allow page_id to be an authoritative identifier of page deletion events. The four months I sampled were July 2014, February 2015, October 2016, and May 2017. Page creation and deletion timestamps were grabbed from the mediawiki_page_history and mediawiki_history tables in the Data Lake, using the 2018-02 snapshot. The query used to get creations from the mediawiki_history table is a modification of the query I used for data gathering during ACTRIAL, which is an adaptation of this query from T149021. The corresponding queries used for the data gathering are in the work log, and the resulting tables (creations_from_page and creations_from_revision) are both found in the nettrom_articlecreations database on Hive.

Comparing the two sources, I found 128,764 entries exist in both datasets. 19,451 entries (15.1%) have revision_deleted_timestamp set to null while end_timestamp at the same time is not null. There are 8,122 entries with a deletion timestamp in both tables, of which 8,115 (99.9%) have disagreements between the two tables.

I exported the timestamps of those 8,122 entries into two TSVs, and then wrote a Python script to process them and compare timestamps against revisions and log entries. The resulting tables are in the staging database on analytics-store (nettrom_creations_from_page, nettrom_creations_from_page_sources, nettrom_creations_from_revision, and nettrom_creations_from_revision_sources).

Comparing data from the mediawiki_page_history table with revisions and log events indicate that end_timestamp consistently refers to page deletions, the few exceptions are moves and restores, which one would expect. That's a positive result!

Comparing the mediawiki_history table with revision and log events indicate that revision_deleted_timestamp rarely (about 10% of the time) refers to a log event, and if it does, it's most likely not a page deletion.

Hopefully my methodology for investigation this is fairly reasonable. Let me know if something's amiss, and I'll look into how to fix it!

(just in case we've been so absent minded as to not say this via some other channel:) @Nettrom this is awesome work, thank you so much for finding it and surfacing the problem so nicely.

discussed: approach: try to find the correct Id, if we don't, use artificial id, and test until delete dates make sense and join well with revisions of deleted pages.

Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.
Milimetric moved this task from Paused to In Progress on the Analytics-Kanban board.

Change 493390 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Refactor mediawiki-page-history computation

https://gerrit.wikimedia.org/r/493390

Change 509772 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Refactor mediawiki-page-history computation

https://gerrit.wikimedia.org/r/509772

Change 509772 abandoned by Joal:
Refactor mediawiki-page-history computation

Reason:
Using https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/ /493390/

https://gerrit.wikimedia.org/r/509772

JAllemandou raised the priority of this task from Normal to High.Jun 27 2019, 7:43 AM

Change 493390 merged by jenkins-bot:
[analytics/refinery/source@master] Refactor mediawiki-page-history computation

https://gerrit.wikimedia.org/r/493390

Improved greatly by the last page-history reconstruction refactor:

spark.sql("""
WITH
rev_deleted AS (
  SELECT
    wiki_db as deleted_rev_wiki,
    page_id as deleted_rev_page_id,
    revision_deleted_by_page_deletion_timestamp as deleted_rev_ts,
    COUNT(1) as nb_deleted_revs
  FROM wmf.mediawiki_history
  WHERE snapshot = '2019-05'
    AND event_entity = 'revision'
    AND page_id is not NULL
    AND revision_is_deleted_by_page_deletion
  GROUP BY
    wiki_db,
    page_id,
    revision_deleted_by_page_deletion_timestamp
),

page_deleted_ts AS (
  SELECT DISTINCT
    wiki_db as deleted_page_wiki_ts,
    page_id as deleted_page_id_ts,
    event_timestamp as page_delete_ts
  FROM wmf.mediawiki_history
  WHERE snapshot = '2019-05'
    AND event_entity = 'page'
    AND event_type = 'delete'
    AND page_id is not NULL
),

page_deleted AS (
  SELECT DISTINCT
    wiki_db as deleted_page_wiki,
    page_id as deleted_page_id
  FROM wmf.mediawiki_history
  WHERE snapshot = '2019-05'
    AND event_entity = 'page'
    AND event_type = 'delete'
    AND page_id is not NULL
)

  SELECT
    deleted_page_wiki_ts IS NOT NULL AS deleted_page_timestamp_match,
    deleted_page_wiki IS NOT NULL deleted_page_match,
    sum(COALESCE(nb_deleted_revs, 0)) as revisions_count
  FROM rev_deleted
    LEFT OUTER JOIN page_deleted_ts
      ON deleted_rev_wiki = deleted_page_wiki_ts
        AND deleted_rev_page_id = deleted_page_id_ts
        AND deleted_rev_ts = page_delete_ts
    LEFT OUTER JOIN page_deleted
      ON deleted_rev_wiki = deleted_page_wiki
        AND deleted_rev_page_id = deleted_page_id
  GROUP BY
    deleted_page_wiki_ts IS NOT NULL,
    deleted_page_wiki IS NOT NULL
  ORDER BY
    deleted_page_timestamp_match,
    deleted_page_match

""").show(100, false)

+----------------------------+------------------+---------------+               8192]]]
|deleted_page_timestamp_match|deleted_page_match|revisions_count|
+----------------------------+------------------+---------------+
|false                       |false             |21645769       |
|false                       |true              |4870           |
|true                        |true              |157978909      |
+----------------------------+------------------+---------------+

The result of this table means that, among deleted revisions having a defined page_id:

  • 157978909 (87.95%) have their deleted_timestamp matching one of their page delete event
  • 21645769 (12.05%) don't have an explicit page (we do our best to rebuild deleted-page from existing info, but there are cases we we still don't manage).
  • 4870 (0.002%) have a weird status where they do have a page with existing delete event, but the timestamps don't match.

I think we can call this one done.

The results for the above query with the new snapshot, 2019-06, are very similar:

+----------------------------+------------------+---------------+
|deleted_page_timestamp_match|deleted_page_match|revisions_count|
+----------------------------+------------------+---------------+
|false                       |false             |21788880       |
|false                       |true              |4951           |
|true                        |true              |158959907      |
+----------------------------+------------------+---------------+
Nuria closed this task as Resolved.Jul 9 2019, 3:39 PM