
Implement a mechanism for figuring out suppressed data on backfills
Closed, Resolved · Public

Description

In the current backfill code, mediawiki_wikitext_history does not provide an unambiguous way to tell whether the incoming revision details are suppressed, so for now we blindly mark them as not suppressed:

...
WHEN MATCHED AND to_timestamp('{snapshot}') > t.row_last_update THEN
    UPDATE SET
      t.page_id = s_page_id,
...
      t.user_is_visible = TRUE,                                -- set to TRUE for now, need to figure source for this
      t.revision_id = s_revision_id,
      t.revision_parent_id = s_revision_parent_id,
...
      t.revision_comment = s_revision_comment,
      t.revision_comment_is_visible = TRUE,                    -- set to TRUE for now, need to figure source for this
      t.revision_sha1 = s_revision_sha1,                       -- from backfill, revision_sha1 == main slot sha1
...
      t.revision_content_is_visible = TRUE,                    -- set to TRUE for now, need to figure source for this

@Milimetric confirms that the current schema of mediawiki_wikitext_history indeed does not contain this information, and suggests a possible solution:

For this backfill specifically, sourcing from mediawiki_wikitext_history, the deleted flag is written out in the XML. For example:

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/export/XmlDumpWriter.php#371

I found some examples in mysql and then looked them up in the '2023-07' snapshot:

mysql:research@dbstore1007.eqiad.wmnet [etwiki]> select * from revision where rev_deleted {> 0, > 1, > 3} and rev_timestamp > '2023-05' limit 1;

I found that a deleted user meant user_id = -1, deleted content meant revision_text = '', and a deleted comment meant revision_comment = ''. This is useful for the user_id, but not for the others, which can look like that normally (e.g. genuinely empty comments). Without joining there's no way to get this data, and a general join would, I think, be too expensive.

However, collecting only the revisions where rev_deleted <> 0 and broadcasting that set for the join might work; there might simply not be that many of them.

Another possibility is to modify mediawiki_wikitext_history so that this data is included. Source code: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/refs/heads/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/mediawikidumps/MediawikiXMLParser.scala#43. The XML dumps seem to have everything we need: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/export/XmlDumpWriter.php#371
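
For reference, the rev_deleted bitfield discussed above maps onto the three visibility flags the backfill needs as follows. This is a minimal sketch (the helper is illustrative), using the DELETED_* constants defined in MediaWiki core's RevisionRecord:

# Constants as defined in MediaWiki core (includes/Revision/RevisionRecord.php).
DELETED_TEXT = 1        # revision content hidden
DELETED_COMMENT = 2     # edit summary hidden
DELETED_USER = 4        # contributor hidden
DELETED_RESTRICTED = 8  # suppressed: hidden fields visible only to oversighters

def visibility_flags(rev_deleted: int) -> dict:
    """Illustrative helper: map a rev_deleted value to the backfill's visibility booleans."""
    return {
        "revision_content_is_visible": (rev_deleted & DELETED_TEXT) == 0,
        "revision_comment_is_visible": (rev_deleted & DELETED_COMMENT) == 0,
        "user_is_visible": (rev_deleted & DELETED_USER) == 0,
    }

# Example: rev_deleted = 5 (TEXT + USER) hides the content and the contributor but not the comment.
assert visibility_flags(5) == {
    "revision_content_is_visible": False,
    "revision_comment_is_visible": True,
    "user_is_visible": False,
}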

In this task we should:

  • Figure out if the suggestion can be built into the current backfill
  • If not, figure out another source for this data, perhaps by modifying mediawiki_wikitext_history.
  • Additionally, take care of some cosmetic issues discussed in the same review thread.

Details

Title: Add visibility on backfills via broadcast join of wmf_raw.mediawiki_revision
Reference: repos/data-engineering/dumps/mediawiki-content-dump!12
Author: xcollazo
Source Branch: use-rev-deleted
Dest Branch: main

Event Timeline

xcollazo renamed this task from Figure out a mechanism for figuring out suppresed data on backfills to Implement a mechanism for figuring out suppressed data on backfills. Aug 29 2023, 4:38 PM
xcollazo updated the task description.

Parking some links that will be useful to this work:

  1. dumps are imported with this script
  2. This is parsed from XML to a Parquet table partitioned by wiki

For the second option Xabriel mentions, we'd have to change the parser to detect the deleted attribute, and add some columns to the output table to save the result. Then we could use that to backfill.
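
If we went that route, the modified parser would populate new columns from the deleted attributes in the XML. A hypothetical sketch of the schema change (table and column names are illustrative):

# Hypothetical sketch: if MediawikiXMLParser captured the deleted="deleted" XML attributes,
# the output table could expose them as booleans for the backfill to read directly.
spark.sql("""
ALTER TABLE mediawiki_wikitext_history ADD COLUMNS (
    user_is_visible             BOOLEAN COMMENT 'false when the contributor element carries deleted="deleted"',
    revision_comment_is_visible BOOLEAN COMMENT 'false when the comment element carries deleted="deleted"',
    revision_content_is_visible BOOLEAN COMMENT 'false when the text element carries deleted="deleted"'
)
""")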

A third option is to use MW History instead of the dumps.

xcollazo changed the task status from Open to In Progress. Sep 7 2023, 8:30 PM

The total number of revisions in wmf_raw.mediawiki_revision with rev_deleted > 0 is ~5M.

Tried caching it to see how big it would be:

visibility_df = spark.sql("""
SELECT
    rev_id                AS visibility_revision_id,
    wiki_db               AS visibility_wiki_db,
    -- bit semantics per MediaWiki's RevisionRecord: DELETED_TEXT = 1, DELETED_COMMENT = 2, DELETED_USER = 4
    rev_deleted & 1 = 0   AS visibility_is_text_visible,
    rev_deleted & 2 = 0   AS visibility_is_comment_visible,
    rev_deleted & 4 = 0   AS visibility_is_user_visible
FROM
    wmf_raw.mediawiki_revision
WHERE
    rev_deleted > 0
    AND snapshot = '2023-06'
"""
).cache()

visibility_df.count()

5042571

Looking at the Spark UI Storage tab, we are lucky: the estimated size is:

Storage Level: Disk Memory Deserialized 1x Replicated
Cached Partitions: 3115
Total Partitions: 3115
Memory Size: 27.6 MiB
Disk Size: 0.0 B

So we can definitely broadcast this! Note further that the broadcasted table will be even smaller in practice: since we backfill in groups of wikis and in year ranges, we can push down wiki_db and revision_timestamp predicates as well.
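
A rough sketch of how this could plug into the backfill (the push-down values and source_df, the mediawiki_wikitext_history slice being merged, are illustrative assumptions):

from pyspark.sql import functions as F

# Sketch only: build the small visibility set, push down the backfill group's wikis and
# year range, broadcast it onto the wikitext slice, and default missing rows to visible.
visibility_df = spark.sql("""
    SELECT
        wiki_db               AS visibility_wiki_db,
        rev_id                AS visibility_revision_id,
        rev_deleted & 1 = 0   AS visibility_is_text_visible,
        rev_deleted & 2 = 0   AS visibility_is_comment_visible,
        rev_deleted & 4 = 0   AS visibility_is_user_visible
    FROM wmf_raw.mediawiki_revision
    WHERE snapshot = '2023-06'
      AND rev_deleted > 0
      AND wiki_db IN ('etwiki')              -- illustrative: the backfill group's wikis
      AND rev_timestamp >= '20230101000000'  -- illustrative: the backfill group's year range
""")

source_with_visibility = (
    source_df  # assumed: the mediawiki_wikitext_history slice being merged
    .join(
        F.broadcast(visibility_df),
        (source_df.wiki_db == visibility_df.visibility_wiki_db)
        & (source_df.revision_id == visibility_df.visibility_revision_id),
        "left",
    )
    # Revisions absent from the broadcast set have rev_deleted = 0, i.e. nothing hidden.
    .withColumn("revision_content_is_visible", F.coalesce(F.col("visibility_is_text_visible"), F.lit(True)))
    .withColumn("revision_comment_is_visible", F.coalesce(F.col("visibility_is_comment_visible"), F.lit(True)))
    .withColumn("user_is_visible", F.coalesce(F.col("visibility_is_user_visible"), F.lit(True)))
)

The MERGE from the description can then set t.user_is_visible, t.revision_comment_is_visible, and t.revision_content_is_visible from these joined columns instead of hardcoding TRUE, which appears to be the approach taken in the merge request linked under Details.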