Page MenuHomePhabricator

Allow aspects of edits or logs to be revdeled while other aspects are suppressed
Open, MediumPublic

Description

I searched and didn't quite find anything like this. The problem is that when you have an abusive edit that is removed entirely from the database by an oversight/office action, it can leave behind an abusive edit summary and a standard admin with revdel capabilities cannot remove this. See the attachment.

Possibly relevant:

Visibility.png (501×1 px, 94 KB)

Event Timeline

Copying from the merged-in duplicate:

Having taken a quick look at this, seems to be due to the $item->canView() checks failing (eg at SpecialRevisionDelete::showForm, line 420). That function is in RevDelRevisionItem, line 81:

return RevisionRecord::userCanBitfield(
	$this->getRevisionRecord()->getVisibility(),
	RevisionRecord::DELETED_RESTRICTED,
	$this->list->getUser()
);

Would probably have to do a couple of bit checks, but seems do-able - the various different fields suppressed result in a different bit value. Perhaps in a duplicate function, since canView can't take a parameter.

https://en.wikipedia.org/w/index.php?date-range-to=2019-04-04&tagfilter=&title=Khaled_Juffali&action=history

Similarly I want to zap the content of those revisions with the oversighted username for copyvio, but can't.

Finally had a chance to circle back to this. Unlike what @Proc had hoped, this will take a substantial change in code. The reason becomes clear when you see mw:Bitfields_for_rev_deleted#Current_values: we use the same bits to store the deletion and the suppression flag for each of the revision's three components (text, summary, user). With our current approach, it is not possible to configure a revision in such a way that its text is suppressed and its summary is deleted (but not suppressed).

To support concurrent use of rev-del and supression, we will need to change the revision deletion constants. One strategy is to use 1, 2 and 4 like we do now, but instead of adding 8 to them to indicate suppression, we could multiply the 1/2/4 bits by 100. The table below shows how the revision visibility bit will be calculated under the existing and proposed method. This proposal has a few advantages:

  • Decouples the rev-del status and the suppression status (main purpose)
  • Code will remain backward compliant; that is, the only values reused will be those which retain the same meaning
  • There is no need for a maintenance script to update all revisions in the database. Every time a change is made the revision, the new value will be calculated using the new representation
  • The math necessary to interpret these is not substantially more complex (or computationally more expensive); our current approach uses binary math; the proposed approach will be exactly identical for revisions that are not suppressed. For those that are suppressed, it will first separate the left-most digit from the rightmost digit (a simple division and modulo operation) and then uses the same old logic on each of those.

It has one key disadvantage:

  • The rev_deleted field in the revision table is an 8-bit TINYINT field, which means it only supports values between 0 and 255. The proposed approach requires storage space for numbers as big as 700. We would need to ALTER the revision table, which would be an expensive Schema-change

We could use a more sophisticated algorithm that restricts the values to be between 0 and 255, but it would not be as clean. For instance, we can use a small multiplier of 16. This will also guarantee backward compliance, but the math to interpret it will be more involved.

I am happy to write the patch for it, but rather wait for some feedback first.

textsummaryusercurrent representationproposed representation
deleted11
deleted22
deleteddeleted33
deleted44
deleteddeleted55
deleteddeleted66
deleteddeleteddeleted77
suppressed9100
suppressed10200
suppressedsuppressed11300
suppressed12400
suppressedsuppressed13500
suppressedsuppressed14600
suppressedsuppressedsuppressed15700
suppresseddeletedN/A102
suppresseddeletedN/A104
suppresseddeleteddeletedN/A106
deletedsuppressedN/A201
suppresseddeletedN/A204
deletedsuppresseddeletedN/A205
suppressedsuppresseddeletedN/A304
deletedsuppressedN/A401
deletedsuppressedN/A402
deleteddeletedsuppressedN/A403
suppresseddeletedsuppressedN/A502
deletedsuppressedsuppressedN/A601

I am adding three people here, whom I would like to ask to opine on my last comment right above. Daniel is added because it seems like he was the one who created RevisionRecord and would know a lot about it. Max because he was among the last people to touch the code where RevisionRecord deletion constants are defined. Sam because he has always been extremely helpful to me and is generally a thoughtful developer.

TL;DR: should we revised our deletion constants to allow for coexistence of rev-del and suppression, and if yes, what strategy do you like better?

Might I suggest instead

textsummaryusercurrent representationproposed representation
deleted1 (00000001)1 (00000001)
deleted2 (00000010)2 (00000010)
deleteddeleted3 (00000011)3 (00000011)
deleted4 (00000100)4 (00000100)
deleteddeleted5 (00000101)5 (00000101)
deleteddeleted6 (00000110)6 (00000110)
deleteddeleteddeleted7 (00000111)7 (00000111)
suppressed9 (00001001)17 (00010001)
suppressed10 (00001010)34 (00100010)
suppressedsuppressed11 (00001011)51 (00110011)
suppressed12 (00001100 )68 (01000100)
suppressedsuppressed13 (00001101)85 (01010101)
suppressedsuppressed14 (00001110)102 (01100110)
suppressedsuppressedsuppressed15 (00001111)119 (01110111)
suppresseddeletedN/A19 (00010011)
suppresseddeletedN/A21 (00010101)
suppresseddeleteddeletedN/A23 (00010111)
deletedsuppressedN/A35 (00100011)
suppresseddeletedN/A38 (00100100)
deletedsuppresseddeletedN/A39 (00100101)
suppressedsuppresseddeletedN/A55 (00110111)
deletedsuppressedN/A69 (01000101)
deletedsuppressedN/A70 (01000110)
deleteddeletedsuppressedN/A71 (01000111)
suppresseddeletedsuppressedN/A87 (01010111)
deletedsuppressedsuppressedN/A103 (01100111)

Bitwise this results in the 8 bits:

currently
<unused><unused><unused><unused>	<suppress rather than delete><user><comment><text>
proposed
<unused><suppress user><suppress comment><suppress text>	<mark legacy format><delete user><delete comment><delete text>

where the <suppress user/comment/text> bits only do anything if the <delete *> bit is activated. If the <delete user> bit is 1, and the <suppress user> bit is 0, user is visible to admins, but if the <delete user> bit is 1, and the <suppress user> bit is 1, it is visible only to oversighters.
This does not require a schema change, and should be completely backwards compatible since if the <mark legacy format> bit is true in the new definition, the old rules are used, and if it is false it doesn't matter because it is not a part of the new definition

I love that! Let me see if I can make a patch out of this one.

Huji triaged this task as Medium priority.
Huji added a project: User-Huji.

This will probably also need a maintenance script, and it might be worth tagging DBA to get feedback on the change before proceeding

My plan was to bring it to DBA attention once we have a patch. The proposal will not result in a schema change.

I agree regarding the maintenance script. Revisions that are deleted/suppressed are rarely touched again, in terms of their rev-del/suppress status. So while it is true that the code could update old entries to their new value if they were ever updated, because updates are rare, it is best to have a maintenance script.

However, the decision as to whether and when to run the script for WMF deserves a separate task. One could argue that because code will remain backward compliant, we might as well never update the revision table. This won't be the first time we have old and new data using different data models in the same DB field. In logging we have changed how some log properties are recorded over time. In AbuseFilter, we have similar examples too.

Huji removed Huji as the assignee of this task.Oct 24 2020, 1:43 AM
Further Analysis

As it turns out a lot of core and non-core code is directly interpreting the value of rev_deleted. This means we need to patch everyone of those.

In use cases that are part of MW or an extension, a bit-wise AND is created using Database::bitAnd() and with help from one of the revision deletion constants from RevisionRecord class, similar to this: $db->bitAnd( 'rev_deleted', RevisionRecord::DELETED_TEXT ) . ' = 0'. For these, we need a logic that checks for two possible values (old values and new values, e.g. 9 and 17). I don't think having repetitive code in so many files is a great idea. A better idea might be to refactor all of these such that they would use a new static method in Revision to create the desired condition. Something like RevisionRecord::getRevDelCond( RevisionRecord::SUPPRESSED_TEXT ) which would return the correct SQL statement (in this case, something like ( rev_deleted = 9 OR rev_deleted = 17)).

For use cases that are not part of MW, such as the puppets used by Cloud-Services, the comparisons are hard-coded (e.g. where: ipc_rev_id = rev_id AND (rev_deleted & 4) = 0. We would need to create patches for them in the same hard-coded way; that is, I cannot think of a way to refactor them into a modular way, like we can with MW code as explained above.

There are also a few cases that do not need to be changed, e.g. Extension:GlobalContribs uses the logic $condition[] = "rev_deleted != '0'"; which will stay the same after this task.

Next steps

The following subtasks need to be created and fulfilled, more or less in the order listed below:

  • Identify all use cases of rev_deleted in MediaWiki, MW Extensions, and non-MW code
  • Define new revision deletion constants called SUPPRESSED_TEXT, SUPPRESSED_COMMENT
  • Rename revision deletion constant SUPPRESSED_TEXT to SUPPRESSED_TEXT_OLD and update all code that uses it too
  • Rename revision deletion constant SUPPRESSED_COMMENT to SUPPRESSED_COMMENT_OLD and update all code that uses it too
  • Rename revision deletion constant SUPPRESSED_USER to SUPPRESSED_USER_OLD and update all code that uses it too
  • Define a new SUPRESSED_ACTOR constant whose value is different from the old SUPPRESSED_USER one (68 instead of 12). The word ACTOR is used to avoid confusion with the old constant, and to align with the current column names in the revision table
  • Create a new DELETED_ACTOR constant with the same value as the DELETED_USER; indicate that DELETED_USER is only kept for historical reasons
  • Create a static method in RevisionRecord class that can return the SQL statement corresponding to each rev-del status check
  • Update all MW/Extension code that uses rev_deleted to use the new method for its SQL queries
  • Update all non-MW code such that they can interpret the new rev_deleted values correctly
  • Modify RevisionRecord::userCanBitfield so it can understand the new rev_deleted values correctly
  • Update all other references to the revision deletion constants (e.g. in API code, HistoryPager, LogPager, etc.) that do not explicitly produce a SQL query, to use the new constants

Most or all of the above need to also be repeated for logs (in includes/logging/LogPage.php a similar set of constants are defined).

By the end of all of these, we still won't have any of the new rev_deleted values in the database. We just have made sure that once we turn on the switch, all code that depends on them will continue to function as before.

Then, and only then, we can finally updated SpecialRevisionDelete to start writing the new values into the database.

At this point, I don't think this is a one-man job anymore. Certainly above my bandwidth and availability. I am unassigning this task from myself.

Huji raised the priority of this task from Medium to Needs Triage.Oct 24 2020, 1:43 AM

Fields using the same bit logic (as far as I can tell):
rev_deleted
ar_deleted
oi_deleted
fa_deleted
rc_deleted
log_deleted

I suggest adding a service for the migration, like for ActorMigration, but that handles setting and reading the values for the specific fields. Tagging CPT as a feature request to review, and my task workboard since I might be able to work on this (assuming this is approved to go forward; double checking first since this is a larger scope that previously expected)

Huji renamed this task from Allow revdel for the edit summaries of oversight-style deleted diffs to Allow aspects of edits or logs to be revdeled while other aspects are supressed.Oct 24 2020, 12:52 PM

Thanks Danny. I will be happy to provide support through code review, at least.
I retitled the task to reflect its new scope

@Ladsgroup would this be something you could work on in your new role?

I can consult on this on how to move forward here but I don't think I can take the lead on the work.

I can say from DBA side, making schema changes on revision table will be quite hard, alter tables on each one of replicas can take more than a day (on s8 and s1 replicas) and require a master switchover of each section. Most important point is to keep backward compatibility as the schema change will be gradual. I tried to do a draft of how data migration can be done a while ago: https://www.mediawiki.org/wiki/User:ASarabadani_(WMF)/Database_for_devs_toolkit/How-to/Data_migration

Anyway, the schema change part aside (as it might not be needed anyway). Generally it looks good but I really prefer doing T20493: RFC: Unify the various deletion systems first to ease the work for this (and much more benefits). That ticket is on my long-term plans of improving databases (T297633: <Tech Initiative> Improving Databases in MediaWiki) but won't get to it this year or even more unless other teams in WMF do it.

One last point is to make sure sanitation rules to the cloud (maintain-views) don't break specially because this means it might leak deleted or worse suppressed information to public.

jrbs renamed this task from Allow aspects of edits or logs to be revdeled while other aspects are supressed to Allow aspects of edits or logs to be revdeled while other aspects are suppressed.May 10 2022, 11:59 PM

T346293: ipblocks schema redesign for multiblocks propose to migrate the suppression state to actor table. If similar also happens in comment field, rev_deleted will only concern the text.