
Implement a mechanism for figuring out suppressed data on backfills
Closed, Resolved · Public

Description

In the current backfill code, mediawiki_wikitext_history does not provide an unambiguous way to tell whether the incoming revision details are suppressed, so for now we blindly mark them as not suppressed:

...
WHEN MATCHED AND to_timestamp('{snapshot}') > t.row_last_update THEN
    UPDATE SET
      t.page_id = s_page_id,
...
      t.user_is_visible = TRUE,                                -- set to TRUE for now, need to figure source for this
      t.revision_id = s_revision_id,
      t.revision_parent_id = s_revision_parent_id,
...
      t.revision_comment = s_revision_comment,
      t.revision_comment_is_visible = TRUE,                    -- set to TRUE for now, need to figure source for this
      t.revision_sha1 = s_revision_sha1,                       -- from backfill, revision_sha1 == main slot sha1
...
      t.revision_content_is_visible = TRUE,                    -- set to TRUE for now, need to figure source for this

@Milimetric confirms that the current schema of mediawiki_wikitext_history indeed does not contain this information, and suggests a possible solution:

For this backfill specifically, sourcing from mediawiki_wikitext_history, the deleted flag is written out in the XML. For example:

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/export/XmlDumpWriter.php#371

I found some examples in mysql and then looked them up in the '2023-07' snapshot:

mysql:research@dbstore1007.eqiad.wmnet [etwiki]> select * from revision where rev_deleted {> 0, > 1, > 3} and rev_timestamp > '2023-05' limit 1;

I found that a deleted user meant user_id = -1, deleted content meant revision_text = '', and a deleted comment meant revision_comment = ''. This is useful for the user_id, but not for the others, which can look like that normally (e.g. genuinely empty comments). Without joining there's no way to get this data, and a general join would, I think, be too expensive.

However, collecting only the revisions where rev_deleted <> 0 and broadcasting that set for the join might work; there might simply not be that many of them.

Another possibility is to modify mediawiki_wikitext_history so that this data is included. Source code: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/refs/heads/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/mediawikidumps/MediawikiXMLParser.scala#43. The XML dumps seem to have everything we need: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/export/XmlDumpWriter.php#371
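
For reference, the rev_deleted bitfield discussed above maps onto the three visibility flags the backfill needs as follows. This is a minimal sketch (the helper is illustrative), using the DELETED_* constants defined in MediaWiki core's RevisionRecord:

# Constants as defined in MediaWiki core (includes/Revision/RevisionRecord.php).
DELETED_TEXT = 1        # revision content hidden
DELETED_COMMENT = 2     # edit summary hidden
DELETED_USER = 4        # contributor hidden
DELETED_RESTRICTED = 8  # suppressed: hidden fields visible only to oversighters

def visibility_flags(rev_deleted: int) -> dict:
    """Illustrative helper: map a rev_deleted value to the backfill's visibility booleans."""
    return {
        "revision_content_is_visible": (rev_deleted & DELETED_TEXT) == 0,
        "revision_comment_is_visible": (rev_deleted & DELETED_COMMENT) == 0,
        "user_is_visible": (rev_deleted & DELETED_USER) == 0,
    }

# Example: rev_deleted = 5 (TEXT + USER) hides the content and the contributor but not the comment.
assert visibility_flags(5) == {
    "revision_content_is_visible": False,
    "revision_comment_is_visible": True,
    "user_is_visible": False,
}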

In this task we should:

  • Figure out if the suggestion can be built into the current backfill
  • If not, figure out another source for this data, perhaps by modifying mediawiki_wikitext_history.
  • Additionally, take care of some cosmetic issues discussed in the same review thread.

Details

Title: Add visibility on backfills via broadcast join of wmf_raw.mediawiki_revision
Reference: repos/data-engineering/dumps/mediawiki-content-dump!12
Author: xcollazo
Source Branch: use-rev-deleted
Dest Branch: main

Event Timeline

xcollazo renamed this task from Figure out a mechanism for figuring out suppresed data on backfills to Implement a mechanism for figuring out suppressed data on backfills. Aug 29 2023, 4:38 PM
xcollazo updated the task description.

Parking some links that will be useful to this work:

  1. dumps are imported with this script
  2. This is parsed from XML to a Parquet table partitioned by wiki

For the second option Xabriel mentions, we'd have to change the parser to detect the deleted attribute, and add some columns to the output table to save the result. Then we could use that to backfill.
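
If we went that route, the modified parser would populate new columns from the deleted attributes in the XML. A hypothetical sketch of the schema change (table and column names are illustrative):

# Hypothetical sketch: if MediawikiXMLParser captured the deleted="deleted" XML attributes,
# the output table could expose them as booleans for the backfill to read directly.
spark.sql("""
ALTER TABLE mediawiki_wikitext_history ADD COLUMNS (
    user_is_visible             BOOLEAN COMMENT 'false when the contributor element carries deleted="deleted"',
    revision_comment_is_visible BOOLEAN COMMENT 'false when the comment element carries deleted="deleted"',
    revision_content_is_visible BOOLEAN COMMENT 'false when the text element carries deleted="deleted"'
)
""")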

A third option is to use MW History instead of the dumps.

xcollazo changed the task status from Open to In Progress. Sep 7 2023, 8:30 PM

The total number of revisions in wmf_raw.mediawiki_revision with rev_deleted > 0 is ~5M.

Tried caching it to see how big it would be:

visibility_df = spark.sql("""
SELECT
    rev_id                AS visibility_revision_id,
    wiki_db               AS visibility_wiki_db,
    -- bit semantics per MediaWiki's RevisionRecord: DELETED_TEXT = 1, DELETED_COMMENT = 2, DELETED_USER = 4
    rev_deleted & 1 = 0   AS visibility_is_text_visible,
    rev_deleted & 2 = 0   AS visibility_is_comment_visible,
    rev_deleted & 4 = 0   AS visibility_is_user_visible
FROM
    wmf_raw.mediawiki_revision
WHERE
    rev_deleted > 0
    AND snapshot = '2023-06'
"""
).cache()

visibility_df.count()

5042571

Looking at the Spark UI Storage tab, we are lucky: the estimated size is:

Storage Level: Disk Memory Deserialized 1x Replicated
Cached Partitions: 3115
Total Partitions: 3115
Memory Size: 27.6 MiB
Disk Size: 0.0 B

So we can definitely broadcast this! Note further that the broadcasted table will be even smaller in practice: since we backfill in groups of wikis and in year ranges, we can push down wiki_db and revision_timestamp predicates as well.
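
A rough sketch of how this could plug into the backfill (the push-down values and source_df, the mediawiki_wikitext_history slice being merged, are illustrative assumptions):

from pyspark.sql import functions as F

# Sketch only: build the small visibility set, push down the backfill group's wikis and
# year range, broadcast it onto the wikitext slice, and default missing rows to visible.
visibility_df = spark.sql("""
    SELECT
        wiki_db               AS visibility_wiki_db,
        rev_id                AS visibility_revision_id,
        rev_deleted & 1 = 0   AS visibility_is_text_visible,
        rev_deleted & 2 = 0   AS visibility_is_comment_visible,
        rev_deleted & 4 = 0   AS visibility_is_user_visible
    FROM wmf_raw.mediawiki_revision
    WHERE snapshot = '2023-06'
      AND rev_deleted > 0
      AND wiki_db IN ('etwiki')              -- illustrative: the backfill group's wikis
      AND rev_timestamp >= '20230101000000'  -- illustrative: the backfill group's year range
""")

source_with_visibility = (
    source_df  # assumed: the mediawiki_wikitext_history slice being merged
    .join(
        F.broadcast(visibility_df),
        (source_df.wiki_db == visibility_df.visibility_wiki_db)
        & (source_df.revision_id == visibility_df.visibility_revision_id),
        "left",
    )
    # Revisions absent from the broadcast set have rev_deleted = 0, i.e. nothing hidden.
    .withColumn("revision_content_is_visible", F.coalesce(F.col("visibility_is_text_visible"), F.lit(True)))
    .withColumn("revision_comment_is_visible", F.coalesce(F.col("visibility_is_comment_visible"), F.lit(True)))
    .withColumn("user_is_visible", F.coalesce(F.col("visibility_is_user_visible"), F.lit(True)))
)

The MERGE from the description can then set t.user_is_visible, t.revision_comment_is_visible, and t.revision_content_is_visible from these joined columns instead of hardcoding TRUE, which appears to be the approach taken in the merge request linked under Details.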