Page MenuHomePhabricator

Use correct timestamp when indexing deletes
Closed, ResolvedPublic

Description

After discussion on T164975, it turns out that ar_timestamp column in archive table is not suitable for indexing deletes, as the timestamp is when revision was created, not when revision was deleted.

Looks like instead, we need to use something like:

SELECT log_timestamp,log_namespace,log_title,log_page  FROM `logging`    WHERE (log_timestamp >= '20170101000000') AND (log_timestamp <= '20170828201852') AND log_type = 'delete' AND log_action = 'delete' AND (EXISTS(select * from archive where ar_title = log_title and ar_namespace = log_namespace))  ORDER BY log_timestamp ASC

We've stopped using log table after the fix to T54612, but look like we still need timestamps from there.

Event Timeline

Restricted Application added a project: Discovery. · View Herald TranscriptJul 28 2017, 12:29 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Smalyshev triaged this task as Normal priority.Jul 28 2017, 12:29 AM
Smalyshev moved this task from Backlog to Next on the User-Smalyshev board.
Smalyshev moved this task from Next to Doing on the User-Smalyshev board.Aug 25 2017, 9:11 PM

So, this turns out trickier than I thought. logging has these indexes:

KEY `type_time` (`log_type`,`log_timestamp`),
KEY `user_time` (`log_user`,`log_timestamp`),
KEY `page_time` (`log_namespace`,`log_title`,`log_timestamp`),
KEY `times` (`log_timestamp`),
KEY `log_page_id_time` (`log_page`,`log_timestamp`),
KEY `log_user_text_time` (`log_user_text`(16),`log_timestamp`),
KEY `log_title_time` (`log_title`(16),`log_timestamp`),
KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`)

Since we need to select by type (delete) the natural index would be type_time. But then we need to link it to archive table (presumably to index only those who have archive entries) but archive table is:

KEY `name_title_timestamp` (`ar_namespace`,`ar_title`,`ar_timestamp`),
KEY `usertext_timestamp` (`ar_user_text`,`ar_timestamp`),
KEY `ar_revid` (`ar_rev_id`)

i.e. there is no index to link it to logging table for specific deletion. We could use namespace+title, but that produces bad query according to explain.

In general, it is not clear which entries we want to get, since half of the info seems to be in logging table (timestamp) and half in the archive table. Should we use only the logging table and ignore archive altogether, even with the risk that the page has been restored since and so should not be deleted? Note that if somebody deletes two revisions and then restores one (not necessarily the latest!) then the page would both have logging delete entry (actually, two of them) and archive entry, all under the same page ID. Not entirely clear what should happen in this case and how we should handle these deletes.

Change 374416 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/CirrusSearch@master] Use correct table to find deletes.

https://gerrit.wikimedia.org/r/374416

Smalyshev updated the task description. (Show Details)Aug 28 2017, 8:54 PM
Smalyshev moved this task from Doing to Waiting/Blocked on the User-Smalyshev board.
Smalyshev moved this task from Up Next to Current work on the Discovery-Search board.

Change 374416 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Use correct table to find deletes.

https://gerrit.wikimedia.org/r/374416

Smalyshev closed this task as Resolved.Sep 22 2017, 8:28 PM