Page MenuHomePhabricator

Update wikimedia-history revision data with deleted field (and find it a new name?)
Closed, ResolvedPublic3 Story Points

Description

In the current version of the mediawiki-history, what we consider to be deleted-revisions are the ones being in the archive table.
The rev_deleted flag in the revision table is about hiding portions of revisions to users (user, content, comment). It is imported by sqoop (as a boolean now, to be corrected), but not kept through denormalization. We should keep the rev_deleted value in denormalized history and possibly rename it to be clearer against delted-revisions (archived, in mediawiki world).
Notice: Since we use labs databases, the process of sanitizing the data from the rev_deleted value is already done (meaning a rev_deleted value meaning that revision comment must not be shown implies that the comment column in the DB is set to null).

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 19 2017, 4:24 PM
fdans edited projects, added Analytics; removed Analytics-Kanban.Mar 22 2018, 5:03 PM
JAllemandou updated the task description. (Show Details)Mar 22 2018, 5:10 PM
fdans triaged this task as High priority.Mar 26 2018, 4:10 PM
fdans moved this task from Incoming to Data Quality on the Analytics board.
JAllemandou renamed this task from Fix wikimedia-history revision-deleted data to Update wikimedia-history revision data with deleted field (and find it a new name?).Mar 26 2018, 4:12 PM
JAllemandou raised the priority of this task from High to Needs Triage.
JAllemandou updated the task description. (Show Details)
fdans triaged this task as High priority.Mar 26 2018, 4:16 PM
JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou removed JAllemandou as the assignee of this task.
JAllemandou edited projects, added Analytics; removed Analytics-Kanban.
Milimetric added a subscriber: Milimetric.EditedOct 2 2018, 3:54 PM

Thoughts: when processing wmf_raw.mediawiki_revision, process rev_deleted like this:

  • make a new field revision_parts_masked, which is an array
  • for each part that was deleted, as indicated by the bitfield rev_deleted, add its string name to the array

reference: https://www.mediawiki.org/wiki/Manual:Revision_table#rev_deleted

https://www.mediawiki.org/wiki/Manual:Log_actions suggests we can gleam historical information about revision delete events. We can store these in revision_parts_deleted as a map from timestamp to the string array of what was deleted at that time.

Change 492304 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Add revision_hidden_parts to mediawiki-history

https://gerrit.wikimedia.org/r/492304

JAllemandou added a project: Analytics-Kanban.
JAllemandou set the point value for this task to 3.

Thanks for working on this! I have some thoughts on terminology :)

  • We should use page_is_deleted rather than rev_deleted to mark revisions which have been moved to the archive table—the action applies to whole pages and "deleted" (rather than "archived") is the normal term.
  • Rather than revision_hidden_parts, consider using revision_deleted_parts. "Hidden" makes sense as a way as a way to distinguish this from page deletion, but "revision deletion" is the name of the feature so we might as well stick to that. With the names page_is_deleted and revision_deleted_parts, there's not much chance of confusing the features. (The difference will still be confusing to people unfamiliar with it, but that will probably happen no matter what).
  • It looks like you've included "stewards only" as a possible entry in revision_hidden_parts. Maybe revision_hiding_status (revision_deletion_status if you accept my other suggestion) would a little more accurate, since "stewards only" isn't really a part?
JAllemandou added a comment.EditedFeb 25 2019, 9:17 AM

thanks for your comment @Neil_P._Quinn_WMF :) Some more bikeshedding.

  • We should use page_is_deleted rather than rev_deleted to mark revisions which have been moved to the archive table—the action applies to whole pages and "deleted" (rather than "archived") is the normal term.

Interesting! the page_is_deleted info is present on page data now, but the reason for which we can't rely only on this is because there are revisions we don't manage to link back to pages, and we still want to know they are coming from archive. In relation to naming-consistency as you mentioned after, I'd rather used revision_is_deleted and page_is_deleted, instead of instance revision_is_archived.

  • Rather than revision_hidden_parts, consider using revision_deleted_parts. "Hidden" makes sense as a way as a way to distinguish this from page deletion, but "revision deletion" is the name of the feature so we might as well stick to that. With the names page_is_deleted and revision_deleted_parts, there's not much chance of confusing the features. (The difference will still be confusing to people unfamiliar with it, but that will probably happen no matter what).
  • It looks like you've included "stewards only" as a possible entry in revision_hidden_parts. Maybe revision_hiding_status (revision_deletion_status if you accept my other suggestion) would a little more accurate, since "stewards only" isn't really a part?

We chose to use hidden instead of deleted because we thought deleted was really not correct, even if used in the db. I'm not too much opinionated here, so moving it to revision_deleted_parts works for me if not for the naming consistency you also noticed. As for status instead of parts, I understand the reason, but the meaning is IMO less clear when you have for instance: revision_hidding_status -> text.
Very happy to continue the discussion - Looping @Milimetric in :)

Neil_P._Quinn_WMF added a comment.EditedFeb 27 2019, 2:40 PM

the page_is_deleted info is present on page data now, but the reason for which we can't rely only on this is because there are revisions we don't manage to link back to pages, and we still want to know they are coming from archive.

In this case, the revision being in the archive table tells us that it belonged to a page which was deleted, even if we don't know which page it was. So it would still be accurate to call the field page_is_deleted.

We chose to use hidden instead of deleted because we thought deleted was really not correct, even if used in the db. I'm not too much opinionated here, so moving it to revision_deleted_parts works for me if not for the naming consistency you also noticed.

Well, that's thinking from the perspective of a backend engineer. From the perspective of a site user, product manager, or a data analyst who doesn't use the MediaWiki tables, the revision is literally deleted 😁

More importantly, this feature is consistently called deletion: on MediaWiki.org, at the English Wikipedia, in the MediaWiki code, and in the revision table.

As for status instead of parts, I understand the reason, but the meaning is IMO less clear when you have for instance: revision_hidding_status -> text.

This is less important, from my point of view. But maybe we could break out the suppressed flag into a separate column, since it's not comparable to the other values of the map? Something like revision_deleted_parts_suppressed?

Ok, here's what we think Neil, two options:

  • page_is_deleted for page-level, every revision with this page_id is in the Archive table.
  • revision_is_deleted for revisions in the archive table, acknowledging that some pages can have some revisions in the archive table while the page is still active
  • revision_suppressed_parts a list corresponding to the rev_deleted flag, unless rev_deleted & 8 in which case it's a list of everything
  • revision_suppressed_from_admins === rev_deleted & 8

or (with the same meanings as described above)

  • page_is_deleted
  • revision_is_archived
  • revision_deleted_parts
  • revision_deleted_from_admins

What do you think? We present them in two sets because we're trying to not overload the word "deleted", even if it is overloaded in the interface.

Also, btw, we think page_is_archived can be too confusing both to people knowing the UI and to people really familiar with how mediawiki works.

Thanks for getting back to me! 🙂

Ok, here's what we think Neil, two options:

  • page_is_deleted for page-level, every revision with this page_id is in the Archive table.

Yes, I like this.

  • revision_is_deleted for revisions in the archive table, acknowledging that some pages can have some revisions in the archive table while the page is still active

I still don't understand why this can't be covered by page_is_deleted too? Unless I'm missing something, it's simply a reality that you can have multiple pages with the same title, one existing and the rest deleted, so we shouldn't be shy about letting that appear in the data. Even if we can't unambiguously link a revision to its page ID, we know that the only reason it's in the archive table is because at some point, the page it belonged to was deleted.

  • revision_suppressed_parts a list corresponding to the rev_deleted flag, unless rev_deleted & 8 in which case it's a list of everything
  • revision_suppressed_from_admins === rev_deleted & 8

"Suppression" is used specifically to refer to the hiding of deleted revision parts for admins, so I think this would just add confusion.

  • revision_deleted_parts

I like this.

  • revision_deleted_from_admins

I would use the term "suppressed" or "oversighted" (as in revision_deleted_parts_suppressed) because, again, those are the standard terms for this feature.

We present them in two sets because we're trying to not overload the word "deleted", even if it is overloaded in the interface.

Well, we may think this is more logical, but there are two features which are both consistently called deletion. We should not complicate things by trying to "fix" the standard terminology in this one dataset.

Also, btw, we think page_is_archived can be too confusing both to people knowing the UI and to people really familiar with how mediawiki works.

Yes, agreed. "Archived" is rarely used.

So overall, I really think we should go with:

  • page_is_deleted: true for any row sourced from the archive
  • revision_deleted_parts: a list with a combination of user, text, and comment, as appropriate
  • revision_deleted_parts_suppressed: true if the bitfield indications suppression (I think this is what you mean by rev_deleted & 8)

Let me know what you think!

So overall, I really think we should go with:

  • page_is_deleted: true for any row sourced from the archive

Ok for the page but unfortunately we still need something for revisions that are in the archive table while the page they belong to is live. This happens, for example, when someone partial-restores a deleted page and leaves some revisions behind. The page_is_deleted flag wouldn't be granular enough to convey that. So based on your other comment, I guess we better stick with revision_is_deleted. I still cringe a little at the overload of the "deleted" concept, but the more I look at "deleted_parts" the more I realize they're not likely to get confused. If Joseph's also ok, let's go with this.

  • revision_deleted_parts: a list with a combination of user, text, and comment, as appropriate
  • revision_deleted_parts_suppressed: true if the bitfield indications suppression (I think this is what you mean by rev_deleted & 8)

That sounds good to me. The only thing is, say you have:

revision_deleted_parts: user, text
revision_deleted_parts_suppressed: true

That seems to imply only the user and text are suppressed, and not the comment. But when rev_deleted & 8, isn't the whole revision invisible to anyone but stewards?

So overall, I really think we should go with:

  • page_is_deleted: true for any row sourced from the archive

Ok for the page but unfortunately we still need something for revisions that are in the archive table while the page they belong to is live. This happens, for example, when someone partial-restores a deleted page and leaves some revisions behind. The page_is_deleted flag wouldn't be granular enough to convey that. So based on your other comment, I guess we better stick with revision_is_deleted. I still cringe a little at the overload of the "deleted" concept, but the more I look at "deleted_parts" the more I realize they're not likely to get confused. If Joseph's also ok, let's go with this.

Well, thanks to your explanations here and in our hangtime, I now understand why page_is_deleted can be inaccurate in some situations!

The standard way to distinguish between the two features confusingly called deletion is to talk about revision deletion and page deletion. So to follow that, how about revision_is_page_deleted? That's not an ideal title, but I honestly think it's the clearest possible.

So perhaps we could set revision_is_page_deleted for all rows sourced from the archive table, and potentially add page_is_deleted for pages where all the revisions are in the archive table.

The only thing is, say you have:

revision_deleted_parts: user, text
revision_deleted_parts_suppressed: true

That seems to imply only the user and text are suppressed, and not the comment. But when rev_deleted & 8, isn't the whole revision invisible to anyone but stewards?

Actually, no 😊 When the highest bit is filled, it just means that the parts that have been revision deleted are visible only to oversighters. For example, see this revision (enwiki rev ID 15901241). Its rev_deleted is 9, which indicates DELETED TEXT and DELETED_RESTRICTED (suppressed). As you can see, the user name (or IP address in this case) and the comment are still publicly visible.

So to follow that, how about revision_is_page_deleted? That's not an ideal title, but I honestly think it's the clearest possible.
So perhaps we could set revision_is_page_deleted for all rows sourced from the archive table, and potentially add page_is_deleted for pages where all the revisions are in the archive table.

Hm, ok, then I think we should just live with revision_is_deleted and page_is_deleted. Because revision_is_page_deleted is too confusing when the page is not deleted.

Actually, no 😊 When the highest bit is filled, it just means that the parts that have been revision deleted are visible only to oversighters. For example, see this revision (enwiki rev ID 15901241). Its rev_deleted is 9, which indicates DELETED TEXT and DELETED_RESTRICTED (suppressed). As you can see, the user name (or IP address in this case) and the comment are still publicly visible.

This is very useful, thanks for explaining. The docs do not make this clear at all in my opinion. So then the names you suggested here are great.

To summarize:

  • page_is_deleted when all revisions are in the archive table
  • revision_is_deleted for individual revisions in the archive table
  • revision_deleted_parts
  • revision_deleted_parts_suppressed

Let me know if you see anything wrong and otherwise we'll ship it!

So to follow that, how about revision_is_page_deleted? That's not an ideal title, but I honestly think it's the clearest possible.
So perhaps we could set revision_is_page_deleted for all rows sourced from the archive table, and potentially add page_is_deleted for pages where all the revisions are in the archive table.

Hm, ok, then I think we should just live with revision_is_deleted and page_is_deleted. Because revision_is_page_deleted is too confusing when the page is not deleted.

I disagree! I think that the confusing similarity between revision deletion and page deletion is part of reality, so we should not try to hide it in this dataset or introduce our own unique terminology which confuses things further. revision_is_deleted naturally seems like the equivalent of rev_deleted in the source data set, referring to revision deletion. How about revision_is_page-deleted? That clarifies that we're talking about the feature called "page deletion", not about the page having the status "deleted".

I disagree! I think that the confusing similarity between revision deletion and page deletion is part of reality, so we should not try to hide it in this dataset or introduce our own unique terminology which confuses things further. revision_is_deleted naturally seems like the equivalent of rev_deleted in the source data set, referring to revision deletion. How about revision_is_page-deleted? That clarifies that we're talking about the feature called "page deletion", not about the page having the status "deleted".

The fact that could have a revision having revision_is_page_deleted belonging to a page that is actually not deleted is really something that i find confusing. I would not go for this name.

revision_deleted_by_page_deleted? It's so long and ugly but maybe it's clear enough and doesn't have the potential overlap with revision_deleted_parts?

And after saying it, and saying it again, I'm actually ok to go with revision_is_page_deleted, even I think it is confusing (sorry for the hard-no above...).
It looks like every solution we've encountered is confusing to some extent, so one confusion or another...
My increasing confusion order is:

  1. page_is_deleted, revision_is_deleted, revision_hidden_parts and revision_hidden_parts_suppressed
  2. page_is_deleted, revision_is_archived, revision_deleted_parts and revision_deleted_parts_suppressed
  3. page_is_deleted, revision_deleted_by_page_deletion, revision_deleted_parts and revision_deleted_parts_suppressed
  4. page_is_deleted, revision_is_page_deleted, revision_deleted_parts and revision_deleted_parts_suppressed

Anyone of those would actually work for me. Thanks again @Neil_P._Quinn_WMF and @Milimetric for the bikeshedding :)

revision_deleted_by_page_deleted? It's so long and ugly but maybe it's clear enough and doesn't have the potential overlap with revision_deleted_parts?

I like this (although I would slightly tweak it to revision_is_deleted_by_page_deletion). It is long but definitely clearer than revision_is_page_deleted.

And after saying it, and saying it again, I'm actually ok to go with revision_is_page_deleted, even I think it is confusing (sorry for the hard-no above...).
It looks like every solution we've encountered is confusing to some extent, so one confusion or another...

No problem, thank you for rethinking it! 😁

One final bikeshed (sorry, I promise it's really the last!): should we make it revision_deleted_parts_are_suppressed? I think it's fine either way but so far the schemas has been consistent in using is for boolean fields so it seems sensible to keep that up.

Looks like we have a winner:

  • page_is_deleted
  • revision_is_deleted_by_page_deletion
  • revision_deleted_parts
  • revision_deleted_parts_are_suppressed

@Neil_P._Quinn_WMF and @Milimetric - Confirmation?

nice! I like it. I think this bikeshedding will pay off dividends when we have to explain this dataset.

Looks like we have a winner:

  • page_is_deleted
  • revision_is_deleted_by_page_deletion
  • revision_deleted_parts
  • revision_deleted_parts_are_suppressed @Neil_P._Quinn_WMF and @Milimetric - Confirmation?

Actually, I was thinking, rather than revision_deleted_parts, we should—

Just kidding. I'm happy.

Macro seal-of-approval:

Change 492304 merged by jenkins-bot:
[analytics/refinery/source@master] Add revision_deleted_parts to mediawiki-history

https://gerrit.wikimedia.org/r/492304

Data is available in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:

spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh")

 spark.sql("select revision_deleted_parts, revision_deleted_parts_are_suppressed,  count(1) from mwh where event_entity = 'revision' group by revision_deleted_parts, revision_deleted_parts_are_suppressed").show(100, false)

+----------------------+-------------------------------------+----------+       
|revision_deleted_parts|revision_deleted_parts_are_suppressed|count(1)  |
+----------------------+-------------------------------------+----------+
|[text, comment]       |true                                 |36489     |
|[user]                |true                                 |24592     |
|[text]                |true                                 |135353    |
|[text, user]          |true                                 |6612      |
|[comment, user]       |true                                 |735       |
|[comment, user]       |false                                |2051      |
|[text, comment, user] |false                                |63781     |
|[user]                |false                                |10177     |
|[text, comment, user] |true                                 |100707    |
|[text, user]          |false                                |11236     |
|[comment]             |true                                 |6061      |
|[text]                |false                                |2524450   |
|[]                    |false                                |4122706476|
|[text, comment]       |false                                |217881    |
|[comment]             |false                                |75752     |
+----------------------+-------------------------------------+----------+

Change 492304 merged by Fdans:
[analytics/refinery/source@master] Add revision_deleted_parts to mediawiki-history

https://gerrit.wikimedia.org/r/492304

Nuria closed this task as Resolved.Tue, May 14, 9:08 PM