
Implement mechanism to fetch content when visibility is unsuppressed
Closed, Resolved, Public

Description

In T340880, we are incorporating event.mediawiki_revision_visibility_change into the target table `wmf_dumps.content_raw_rc0`, which was created as part of T335860.

We noticed, however, that mediawiki_revision_visibility_change's schema does not include the content. Having the content would be useful when visibility is unsuppressed, that is, when a visibility flag changes from FALSE to TRUE.

This happens only sporadically, but it does happen:

presto>  select count(1) as count
     ->  from analytics_hive.event.mediawiki_revision_visibility_change
     ->  where
     ->    ( prior_state.visibility.comment = FALSE AND visibility.comment = TRUE )
     ->    OR
     ->    ( prior_state.visibility.text = FALSE AND visibility.text = TRUE )
     ->    OR
     ->    ( prior_state.visibility.user = FALSE AND visibility.user = TRUE );
 count 
-------
  1298 
(1 row)
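
The same FALSE-to-TRUE check can be applied to a single event in a consumer. The following is a minimal sketch, assuming the event arrives as a nested dict whose keys mirror the field names used in the query above (`prior_state.visibility.*` and `visibility.*`); the function name and the sample event are hypothetical and only illustrate the predicate.

```python
from typing import Any, List, Mapping

# Visibility flags referenced in the Presto query above.
VISIBILITY_FIELDS = ("comment", "text", "user")


def unsuppressed_fields(event: Mapping[str, Any]) -> List[str]:
    """Return the fields whose visibility flipped from False to True."""
    prior = (event.get("prior_state") or {}).get("visibility") or {}
    current = event.get("visibility") or {}
    return [
        field
        for field in VISIBILITY_FIELDS
        if prior.get(field) is False and current.get(field) is True
    ]


# Hypothetical event: only the text was previously suppressed.
example_event = {
    "prior_state": {"visibility": {"comment": True, "text": False, "user": True}},
    "visibility": {"comment": True, "text": True, "user": True},
}

print(unsuppressed_fields(example_event))  # ['text']
```

An event with no `prior_state`, or one where a flag goes from TRUE to FALSE, yields an empty list, so only genuine unsuppressions would trigger a content fetch.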

We have discussed a few options on how to deal with this:

  1. On wmf_dumps.content_raw, have boolean columns for visibility. Pro: consuming mediawiki_revision_visibility_change is as simple as toggling those columns. Con: We'd like to eventually expose wmf_dumps.content_raw to the public, and we do not want to publicly expose fields that have been suppressed.
  2. A variation on (1) is to put a VIEW on top of the table from (1) and expose that VIEW publicly instead of the table. Pros: Simple approach, does not copy data unnecessarily. Cons: Fields that need to be suppressed may change in the future, and we could be at risk of exposure if we forget to update the VIEW. Also, VIEW support on Iceberg is sketchy right now (see T337562#8895823).
  3. Another approach would be to have an enrichment job that consumes mediawiki_revision_visibility_change and produces, say, mediawiki_revision_visibility_change_with_content. Pros: We solve the issue upstream instead of downstream, and other consumers could leverage it. Con: Seems a bit overkill given we only need the content for 1298 rows so far.
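
To make the first two options concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the real Iceberg table. The table name `content_raw`, the view name `content_public`, and all column names are hypothetical and are not the actual wmf_dumps schema. It shows the content staying in the table while boolean flags are toggled, and a view that masks suppressed fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical stand-in for the content table: content is always kept,
    -- and visibility is tracked in boolean (0/1) columns.
    CREATE TABLE content_raw (
        revision_id      INTEGER PRIMARY KEY,
        revision_text    TEXT,
        revision_comment TEXT,
        text_visible     INTEGER NOT NULL DEFAULT 1,
        comment_visible  INTEGER NOT NULL DEFAULT 1
    );

    -- Hypothetical public view: suppressed fields are returned as NULL.
    CREATE VIEW content_public AS
    SELECT revision_id,
           CASE WHEN text_visible = 1 THEN revision_text END AS revision_text,
           CASE WHEN comment_visible = 1 THEN revision_comment END AS revision_comment
    FROM content_raw;
    """
)

# A revision whose text is currently suppressed.
conn.execute("INSERT INTO content_raw VALUES (1, 'hello', 'first edit', 0, 1)")


def apply_visibility_change(conn, revision_id, text_visible, comment_visible):
    """Apply a visibility change by toggling the flag columns only."""
    conn.execute(
        "UPDATE content_raw SET text_visible = ?, comment_visible = ? "
        "WHERE revision_id = ?",
        (int(text_visible), int(comment_visible), revision_id),
    )


query = "SELECT revision_text FROM content_public WHERE revision_id = 1"
before = conn.execute(query).fetchone()
apply_visibility_change(conn, 1, text_visible=True, comment_visible=True)
after = conn.execute(query).fetchone()

print(before, after)  # (None,) ('hello',)
```

Because the content never leaves the table, unsuppressing needs no content fetch. The trade-off is the one noted in the list: the underlying table holds suppressed data, and the view is only safe as long as it covers every field that can be suppressed.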

We discussed these approaches recently and we were leaning toward (3).

In this task, we should:

  • Further consider the options above (and any other we missed?). Choose one.
  • Implement the solution.

Event Timeline

xcollazo claimed this task.

We went with option (1) above. It is the simplest solution, and requires no code. This makes wmf_dumps.content_raw a private table, and that is ok.

Whenever we are ready to develop further use cases for wmf_dumps.content_raw we can get back to this question.