In the current backfill code, mediawiki_wikitext_history does not give us an unambiguous way to tell whether the incoming revision's details are suppressed, so we just blindly mark them as not suppressed:
...
WHEN MATCHED AND to_timestamp('{snapshot}') > t.row_last_update THEN
    UPDATE SET
      t.page_id = s_page_id,
      ...
      t.user_is_visible = TRUE,               -- set to TRUE for now, need to figure source for this
      t.revision_id = s_revision_id,
      t.revision_parent_id = s_revision_parent_id,
      ...
      t.revision_comment = s_revision_comment,
      t.revision_comment_is_visible = TRUE,   -- set to TRUE for now, need to figure source for this
      t.revision_sha1 = s_revision_sha1,      -- from backfill, revision_sha1 == main slot sha1
      ...
      t.revision_content_is_visible = TRUE,   -- set to TRUE for now, need to figure source for this
@Milimetric confirms that the current schema of mediawiki_wikitext_history does not contain this information, and suggests a possible solution:
for this backfill, specifically from mediawiki_wikitext_history, deleted is written out in the XML, for example:
I found some examples in mysql and then looked them up in the '2023-07' snapshot:
mysql:research@dbstore1007.eqiad.wmnet [etwiki]> select * from revision where rev_deleted {> 0, > 1, > 3} and rev_timestamp > '2023-05' limit 1;
I found that deleted user meant user_id = -1, deleted content meant revision_text = '', and deleted comment meant revision_comment = ''. This is useful for the user_id but not for the others, which can legitimately look like that (for example, empty comments). Without joining, there's no way to get this data, and joining in general would be too expensive, I would think.
However, collecting only the revisions where rev_deleted <> 0 and broadcasting that set for the join might work; there might just not be that many of them.
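The broadcast idea above can be sketched in plain Python. This is only an illustration under stated assumptions, not the backfill's actual code: the bit values are MediaWiki's rev_deleted flags (1 = text, 2 = comment, 4 = user), which match the `> 0, > 1, > 3` thresholds in the query above; the names `Visibility`, `decode_rev_deleted`, `build_suppression_lookup` and `visibility_for` are hypothetical; and a dict stands in for the small broadcast side of what would be a Spark join.

```python
from dataclasses import dataclass

# Bit values of MediaWiki's rev_deleted field.
DELETED_TEXT = 1
DELETED_COMMENT = 2
DELETED_USER = 4


@dataclass(frozen=True)
class Visibility:
    """Hypothetical container for the three *_is_visible flags."""
    content: bool
    comment: bool
    user: bool


ALL_VISIBLE = Visibility(content=True, comment=True, user=True)


def decode_rev_deleted(rev_deleted: int) -> Visibility:
    """Turn the rev_deleted bitfield into per-field visibility flags."""
    return Visibility(
        content=not (rev_deleted & DELETED_TEXT),
        comment=not (rev_deleted & DELETED_COMMENT),
        user=not (rev_deleted & DELETED_USER),
    )


def build_suppression_lookup(rows):
    """Build the small side of the join.

    `rows` is an iterable of (rev_id, rev_deleted) pairs. Only revisions
    with rev_deleted != 0 are kept, which is what should keep this side
    small enough to broadcast.
    """
    return {
        rev_id: decode_rev_deleted(rev_deleted)
        for rev_id, rev_deleted in rows
        if rev_deleted != 0
    }


def visibility_for(revision_id: int, lookup) -> Visibility:
    """Look up one incoming revision; anything not listed is fully visible."""
    return lookup.get(revision_id, ALL_VISIBLE)


# Made-up sample rows: 101 has hidden text, 102 has hidden comment and user.
lookup = build_suppression_lookup([(101, 1), (102, 6), (103, 0)])
print(visibility_for(101, lookup))
print(visibility_for(102, lookup))
print(visibility_for(999, lookup))
```

Defaulting to fully visible for revisions missing from the lookup mirrors the current behaviour of the MERGE, so only the suppressed revisions would change.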
Another possibility is to modify mediawiki_wikitext_history so that this data is included. Source code: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/refs/heads/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/mediawikidumps/MediawikiXMLParser.scala#43. The XML dumps seem to have everything we need? https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/export/XmlDumpWriter.php#371
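As a rough illustration of what the parser change would need to extract, here is a minimal Python sketch (the real parser is Scala, so this only shows the logic). It assumes suppressed elements are written with a deleted="deleted" attribute on the contributor, comment and text elements; the sample fragment is made up and omits the XML namespace that real dumps declare, and `revision_visibility` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up <revision> fragment: contributor and text suppressed, comment visible.
# Real dumps declare an XML namespace on the root element; omitted here.
SAMPLE_REVISION = """
<revision>
  <id>123</id>
  <timestamp>2023-06-01T00:00:00Z</timestamp>
  <contributor deleted="deleted" />
  <comment>fix typo</comment>
  <text bytes="42" deleted="deleted" />
</revision>
""".strip()


def revision_visibility(revision_xml: str) -> dict:
    """Derive the three *_is_visible flags from one <revision> element."""
    revision = ET.fromstring(revision_xml)

    def is_visible(tag: str) -> bool:
        element = revision.find(tag)
        # A missing element (e.g. no <comment> at all) is treated as visible.
        return element is None or element.get("deleted") != "deleted"

    return {
        "user_is_visible": is_visible("contributor"),
        "revision_comment_is_visible": is_visible("comment"),
        "revision_content_is_visible": is_visible("text"),
    }


flags = revision_visibility(SAMPLE_REVISION)
print(flags)
```

Reading the attribute directly distinguishes a suppressed comment from a legitimately empty one, which the empty-string heuristic cannot do.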
In this task we should:
- Figure out if the suggestion can be built into the current backfill
- If not, figure out another source for this data, perhaps by modifying mediawiki_wikitext_history.
- Additionally, take care of some cosmetic issues discussed in the same review thread.