Add ar_del_time (timestamp of page deletion)
Open, Needs Triage, Public

Description

https://www.mediawiki.org/wiki/Manual:Archive_table says:

There is also not presently a record of when deletion occurred, making it hard to separate multiple delete cycles or distinguish old deletions from recent deletions of pages that haven't been edited in a long time.

This patch will store in a new ar_del_time field the timestamp of when the revision rows were moved to the archive table.

Event Timeline

Change 572134 had a related patch set uploaded (by Superbass; owner: Superbass):
[mediawiki/core@master] Add ar_del_time

https://gerrit.wikimedia.org/r/572134

Hi @Superacidic, thanks for taking the time to report this, to look at the code, and welcome to Wikimedia Phabricator!

Also removing subscriber as that was 17 years ago. Also see https://www.mediawiki.org/wiki/Developers/Maintainers

Thank you, Aklapper. Good to be here.

I'm thinking we can add the field in MW 1.35 and then actually do something with it in 1.36, after people have had time to upgrade. I suppose the way this could work is that in Special:Undelete/Foo, there could be an option to display the deletion log entries sorted by ar_del_time before ar_timestamp, to help people who want to restore revisions from particular deletions. If so, I suppose we need a new index for that, something like:

CREATE INDEX /*i*/name_title_deltime_timestamp ON /*_*/archive (ar_namespace,ar_title,ar_del_time,ar_timestamp);
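A minimal sketch of how the proposed index and sort order might behave, using Python's built-in sqlite3 module rather than MediaWiki's database layer. The table is a cut-down stand-in for archive with only the columns named above; the ar_rev_id values, the sample rows, and the archived_revisions helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE archive (
  ar_rev_id    INTEGER PRIMARY KEY,
  ar_namespace INTEGER NOT NULL,
  ar_title     TEXT NOT NULL,
  ar_timestamp TEXT NOT NULL,   -- when the revision was made
  ar_del_time  TEXT NOT NULL    -- proposed: when the row was archived
);
CREATE INDEX name_title_deltime_timestamp
  ON archive (ar_namespace, ar_title, ar_del_time, ar_timestamp);
""")

# Illustrative rows, inserted out of order on purpose.
rows = [
    (3, 0, "Foo", "20200110000000", "20200220090000"),
    (1, 0, "Foo", "20190101000000", "20200105120000"),
    (2, 0, "Foo", "20190201000000", "20200105120000"),
]
conn.executemany("INSERT INTO archive VALUES (?, ?, ?, ?, ?)", rows)

def archived_revisions(conn, namespace, title):
    """Return a page's archived revisions, by deletion time, then revision time."""
    cur = conn.execute(
        "SELECT ar_rev_id, ar_del_time, ar_timestamp FROM archive "
        "WHERE ar_namespace = ? AND ar_title = ? "
        "ORDER BY ar_del_time, ar_timestamp",
        (namespace, title),
    )
    return cur.fetchall()

listing = archived_revisions(conn, 0, "Foo")
```

Because the ORDER BY columns follow the equality columns in the index, a database can in principle read the rows already sorted; whether MediaWiki's supported backends actually do so for this query is not verified here.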

I have revised the patch to include the index.

Timestamps are often not the right choice when it is possible to have duplicates of them for different actions from different users. (Yes, it is possible to have two edits of one page in the same second, so it could be possible to have two deletions of different revisions in the same second. Not very likely, but possible.)
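To make the collision concrete: MediaWiki stores timestamps as 14-character YYYYMMDDHHMMSS strings, so anything finer than one second is lost. The sketch below uses a hypothetical helper, mw_timestamp, and made-up moments to show two distinct instants within the same second collapsing to one value.

```python
from datetime import datetime, timedelta, timezone

def mw_timestamp(moment):
    """Format a datetime as a 14-character YYYYMMDDHHMMSS string (UTC)."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")

# Two different instants, 0.7 seconds apart, inside the same second.
first = datetime(2020, 2, 13, 10, 30, 15, 100000, tzinfo=timezone.utc)
second = first + timedelta(milliseconds=700)

ts_first = mw_timestamp(first)
ts_second = mw_timestamp(second)

# The instants differ, but their stored timestamps do not.
collides = first != second and ts_first == ts_second
```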

The filearchive table has three fields:

-- Deletion information, if this file is deleted.
  fa_deleted_user int,
  fa_deleted_timestamp binary(14) default '',
  fa_deleted_reason_id bigint unsigned NOT NULL,

Maybe try to use similar column names across the two tables. (There is no need to add columns that would go unused, like the user and reason columns in filearchive.)
Also, MediaWiki always uses the long word "timestamp" in the schema, not short words like "time".

> Timestamps are often not the right choice when it is possible to have duplicates of them for different actions from different users. (Yes, it is possible to have two edits of one page in the same second, so it could be possible to have two deletions of different revisions in the same second. Not very likely, but possible.)

Ideally, the archive table would store the log_id, since that's the primary key, but is there a way to do that, while still waiting till after the revision rows are moved to the archive table to do the insertion to the logging table? Presumably we don't want to insert a log entry till last, in case a problem arises in moving the revision rows.
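One possible answer, sketched with Python's sqlite3 module: if both inserts run inside a single database transaction, the log row can be written first to obtain its id, and a failure while moving revisions rolls back the log row too. This is only an illustration. The ar_log_id column and the delete_page helper are hypothetical, the tables are cut-down stand-ins, and whether MediaWiki's deletion code can be arranged this way is not established here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logging (log_id INTEGER PRIMARY KEY, log_title TEXT NOT NULL);
CREATE TABLE archive (
  ar_rev_id INTEGER PRIMARY KEY,
  ar_title  TEXT NOT NULL,
  ar_log_id INTEGER NOT NULL   -- hypothetical column, not in the patch
);
""")

def delete_page(conn, title, rev_ids, fail=False):
    """Insert the log row and the archive rows atomically; return the log_id."""
    with conn:  # commits on success, rolls back if an exception escapes
        log_id = conn.execute(
            "INSERT INTO logging (log_title) VALUES (?)", (title,)
        ).lastrowid
        for rev_id in rev_ids:
            conn.execute(
                "INSERT INTO archive VALUES (?, ?, ?)", (rev_id, title, log_id)
            )
        if fail:
            # Stand-in for any problem that arises while moving revisions.
            raise RuntimeError("simulated failure while moving revisions")
    return log_id
```

With this arrangement a failed deletion leaves neither archive rows nor a log entry behind, which is the property the question is concerned about.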

> Timestamps are often not the right choice when it is possible to have duplicates of them for different actions from different users. (Yes, it is possible to have two edits of one page in the same second, so it could be possible to have two deletions of different revisions in the same second. Not very likely, but possible.)
>
> The filearchive table has three fields:
>
> -- Deletion information, if this file is deleted.
>   fa_deleted_user int,
>   fa_deleted_timestamp binary(14) default '',
>   fa_deleted_reason_id bigint unsigned NOT NULL,
>
> Maybe try to use similar column names across the two tables. (There is no need to add columns that would go unused, like the user and reason columns in filearchive.)
> Also, MediaWiki always uses the long word "timestamp" in the schema, not short words like "time".

Are you actually proposing we add these other two fields, or just noting an unfortunate deficiency of only adding the timestamp field, which perhaps we have to accept for efficiency's sake? Some might say that having these two extra fields is going to be adding too much bloat to the database, since every revision involved in a given deletion event is going to be repeating the same data, and since as you suggest, it's going to be unlikely that there will be a lot of deletions with the same timestamp and same ar_page_id, ar_title, ar_namespace, etc. (the other fields that could be used for disambiguating between deletions).

I'm open to the idea of calling it ar_deleted_timestamp or something like that, if that would be preferable.

I am not suggesting adding more fields than needed.

I am not writing about deletions with the same timestamp on different pages. I want to say that technically there could be deletions with the same timestamp on the same page (but of different revisions).

It is okay to ignore that, but I just want to have it documented.

I see the same problem as you with using the log_id, but I have no other idea.

> I am not writing about deletions with the same timestamp on different pages. I want to say that technically there could be deletions with the same timestamp on the same page (but of different revisions).

What's an example of a series of events that could create a situation like that? (This stuff gets confusing sometimes, when people are doing multiple deletions, undeletions, etc. so I like to clarify what kinds of scenarios we might be talking about.) Thanks.

It seems to me, though, that there probably aren't going to be two page deletion events happening on the same page in the same second, because archiving of revisions via page deletion is done wholesale, right? You either archive all the revisions or you don't. Page deletion is not like the Revision Deletion feature, where you might delete some revisions but not others, and where it's therefore conceivable that two users might be doing revision deletions on the same page at the same time.

If my analysis is correct, then it seems like just having that timestamp field might suffice for our purposes.

Let's consider some scenarios.

User A deletes page X.
User B re-creates page X.
User C deletes page X.

What are the odds that all this will happen within the same second, so that the ar_del_time timestamp for both sets of archived revisions will be the same?
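The scenario can be sketched in a few lines of Python. The delete_cycles helper, the revision ids, and the timestamps are all invented for illustration; the point is only that grouping archived rows by ar_del_time separates the two deletions unless both fall in the same second.

```python
from collections import defaultdict

def delete_cycles(archived_rows):
    """Group (rev_id, ar_del_time) pairs by deletion timestamp."""
    cycles = defaultdict(list)
    for rev_id, del_time in archived_rows:
        cycles[del_time].append(rev_id)
    return dict(sorted(cycles.items()))

# User A deletes page X (revisions 1 and 2), user B re-creates it
# (revision 3), and user C deletes it again one minute later.
normal = [
    (1, "20200213103015"),
    (2, "20200213103015"),
    (3, "20200213103115"),
]

# The same sequence squeezed into a single second.
same_second = [
    (1, "20200213103015"),
    (2, "20200213103015"),
    (3, "20200213103015"),
]

normal_cycles = delete_cycles(normal)        # two separate delete cycles
merged_cycles = delete_cycles(same_second)   # indistinguishable: one group
```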