Page MenuHomePhabricator

[Dumps 2] Spike: Figure root causes of missing rows when doing reconciliation
Closed, ResolvedPublic

Description

On T367570: Spike: Figure feasability to emit (wiki_db, revision_id) pairs, we found that we have some surprising data quality issues:

Missing or bad revisions.png (449×570 px, 32 KB)

Bad revisions.png (449×561 px, 44 KB)

In this spike we should figure out the root causes of some of these issues.

Code that generated these figures at : https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/notebooks/Can_we_emit_wiki_db__revision_id_pairs_for_reconciliation.ipynb?ref_type=heads

Temporary table with all these missing and or bad rows at xcollazo.missing_or_innaccurate_rows

Event Timeline

Ottomata renamed this task from Spike: Figure root cause of missing rows when doing reconciliation to [Dumps 2] Spike: Figure root causes of missing rows when doing reconciliation.Jun 26 2024, 1:29 PM
Milimetric triaged this task as Medium priority.

My first hunch, that the revisions were coming from only specific pages, is wrong:

spark-sql (default)> 
                   >  with pages_with_missing as (
                   >  select rev_page,
                   >         revision_count - not_missing_count as missing_count
                   >    from milimetric.enwiki_page_with_missing
                   >   where revision_count <> not_missing_count
                   >  )
                   > 
                   >  select avg(missing_count) as avg_missing,
                   >         approx_percentile(missing_count, array(0.25, 0.5, 0.75, 0.99)) as p25p50p75p99_missing
                   >    from pages_with_missing
                   > ;
avg_missing	p25p50p75p99_missing
1.6982443251671715	[1,1,1,11]

Similarly, the distribution across namespaces does not appear meaningful:

                   >  select page_namespace,
                   >         sum(missing_count) as missing
                   >    from wmf_raw.mediawiki_page
                   >             inner join
                   >         pages_with_missing      on rev_page = page_id
                   >   where wiki_db = 'enwiki'
                   >     and snapshot = '2024-05'
                   >   group by page_namespace
                   > ;
page_namespace	missing
101	9
12	26
1	32321
13	38
6	6000
3	24441
5	1942
15	4403
9	19
829	21
4	21247
8	39
100	244
7	5448
10	6603
710	1
828	154
11	746
14	5738
119	830
2	26056
118	9006
0	307746

@Milimetric I'm especially concerned about the sha1 mismatches. Maybe check into those first? Is the content actually different?

I am still trying to find an elegant way to change the queries and show all this, but I just wanted to share results so far:

  • missing_from_target: probably mostly page restores that we missed
  • missing_from_source: probably mostly page deletes that we missed (we seem to just be missing all page deletes, not just the most recent ones)
  • mismatch_sha1: all I found here were revisions with revision_content_is_visible set to false on our side, so the sha1 was blanked out and therefore not matching (intuition for this is there were no size mismatches)
  • mismatched visibility: this is a small number that I suspect we just missed

we seem to just be missing all page deletes, not just the most recent ones

? really? in page_content_change there are no deletes?

mismatch_sha1

phewf! That is a relief!

missing_from_target: probably mostly page restores that we missed

Page restores? Hm, indeed, we would only get the current revision in page_change during a restore, right?

Very interesting.

we seem to just be missing all page deletes, not just the most recent ones

? really? in page_content_change there are no deletes?

On hindsight, this makes sense. We do consume DELETE events from page_content_change. However, page_content_change only cares about the latest revision, not the historical revisions.

So we need another mechanism to delete full revision histories given a page_id.

page_content_change only cares about the latest revision, not the historical revisions.

Right, but old revisions themselves can't actually be 'deleted'. RevisionDelete == revision-visibility-change. page delete is a page level change.

Ok, I think I got this query to make sense... the results:

presto:milimetric> select reasons, count(*) from mismatched_revisions group by reasons;
                          reasons                           | _col1 
------------------------------------------------------------+-------
 [missing_from_source, page_was_deleted, page_was_restored] |  1007 
 [mismatch_page, page_was_deleted]                          |    20 
 [mismatch_user_visibility]                                 |     1 
 [mismatch_user_visibility, mismatch_comment_visibility]    |    16 
 [mismatch_page, page_was_deleted, page_was_restored]       |     4 
 [missing_from_source]                                      |   312 
 [mismatch_page]                                            |     1 
 [missing_from_target, page_was_restored]                   |     1 
 [missing_from_source, page_was_deleted]                    | 19347 
(9 rows)

pyspark code:

That one mismatch_page that has no other reason listed is apparently part of a merge, so if we're not following up on delete/restore properly then this makes perfect sense because merges are more complicated still. Here are the two pages involved and the logging table records for them:

+---------------------------+----------------+----------+------------------+
| page_title                | page_namespace | page_len | page_is_redirect |
+---------------------------+----------------+----------+------------------+
| CONCACAF_Caribbean_Shield |              0 |       29 |                1 |
| CFU_Club_Shield           |              0 |    11501 |                0 |
+---------------------------+----------------+----------+------------------+

+-----------+---------------------+------------------+----------------+---------------------------+----------+
| log_id    | log_type            | log_action       | log_timestamp  | log_title                 | log_page |
+-----------+---------------------+------------------+----------------+---------------------------+----------+
| 146240577 | move                | move_redir       | 20230430025329 | Caribbean_Club_Shield     | 56610493 |
| 162555752 | merge               | merge            | 20240608153644 | CONCACAF_Caribbean_Shield | 56610493 |
| 162511963 | create              | create           | 20240606011540 | CFU_Club_Shield           | 77093044 |
| 162555758 | patrol              | patrol           | 20240608153713 | CFU_Club_Shield           | 77093044 |
| 162555759 | pagetriage-curation | reviewed-article | 20240608153713 | CFU_Club_Shield           | 77093044 |
+-----------+---------------------+------------------+----------------+---------------------------+----------+

page_content_change only cares about the latest revision, not the historical revisions.

Right, but old revisions themselves can't actually be 'deleted'. RevisionDelete == revision-visibility-change. page delete is a page level change.

Right, but what I mean is that to reflect a page delete on wmf_dumps.wikitext_raw we will need to have separate mechanism.

Two ideas:

  • Consume page_content_change twice, once to do the MERGE INTO as it is today, and the second just to grab any changelog_kind=delete events, and to grab the associated page_id and delete any rows that match. This would be a very expensive query considering that we have to do a full table scan. :(
  • Have a new stream that emits delete events at a revision level.

changelog_kind=delete events, and to grab the associated page_id and delete any rows that match

Ah, yes. That makes sense.

Have a new stream that emits delete events at a revision level.

indeed, this is worth considering.

OK, so it seems most problems do indeed track back to not applying delete and restore events. It feels like we can mark this task complete. We can find a way to apply delete/restore/merge, and then run these queries again and see what we need to reconcile. The period I looked at above was 10 days of enwiki revisions. If anyone disagrees, do move this task back.

For higher visibility, the reconciling query that's in my notebook above is this (all the other details are in the code uploaded):

SELECT '{wiki_db}'                                          AS wiki_db,
       COALESCE(t.page_id, s.page_id)                       AS page_id,
       revision_id,
       COALESCE(t.revision_timestamp, s.revision_timestamp) AS revision_timestamp,
       FILTER(
           ARRAY(
               IF(s.user_is_visible != t.user_is_visible, 'mismatch_user_visibility', NULL),
               IF(s.revision_content_is_visible != t.revision_content_is_visible, 'mismatch_content_visibility', NULL),
               IF(s.revision_comment_is_visible != t.revision_comment_is_visible, 'mismatch_comment_visibility', NULL),
               IF(t.revision_content_is_visible AND s.revision_sha1 != t.revision_sha1, 'mismatch_sha1', NULL),
               IF(s.revision_size != t.revision_size, 'mismatch_size', NULL),
               IF(s.page_id != t.page_id, 'mismatch_page', NULL),
               IF(s.page_id IS NULL, 'missing_from_source', NULL),
               IF(t.page_id IS NULL, 'missing_from_target', NULL),
               IF(d.page_id IS NOT NULL, 'page_was_deleted', NULL),
               IF(r.page_id IS NOT NULL, 'page_was_restored', NULL)
           ),
           reason -> reason IS NOT NULL
       ) AS reasons

FROM source_revisions_filtered s
FULL OUTER JOIN target_revisions t USING (revision_id)
LEFT JOIN pages_deleted d ON (COALESCE(t.page_id, s.page_id) = d.page_id)
LEFT JOIN pages_restored r ON (COALESCE(t.page_id, s.page_id) = r.page_id)

WHERE s.user_is_visible != t.user_is_visible
   OR s.revision_content_is_visible != t.revision_content_is_visible
   OR s.revision_comment_is_visible != t.revision_comment_is_visible
   OR (t.revision_content_is_visible AND s.revision_sha1 != t.revision_sha1)
   OR s.revision_size != t.revision_size
   OR s.page_id != t.page_id
   OR s.page_id IS NULL
   OR t.page_id IS NULL