Page MenuHomePhabricator

archive table needs index starting with timestamp
Closed, DeclinedPublic

Description

When reindexing deleted pages for CirrusSearch extension, we've discovered that archive table only has one index involving timestamp:

KEY `name_title_timestamp` (`ar_namespace`,`ar_title`,`ar_timestamp`),

This means if we want to reindex deletes from certain timestamp (such as --from option used in in-place reindexing) we can not use index, since there's no index starting with timestamp. It also makes table scan (when reindexing whole set of deleted pages) - which now follows the index - to go in somewhat unnatural order, scanning in title order and not in timestamp order, which makes it hard to track progress.

It would be great if we could add either index on just ar_timestamp or:

KEY `timestamp_name_title` (`ar_timestamp`, `ar_namespace`,`ar_title`),

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Adding platform because they know the risks of overindexing.

The danger here, as far as I can tell, would be planner confusion over whether to use name_title_timestamp or this new index. We see that on revision fairly frequently for page_timestamp versus rev_timestamp.

BTW, there actually is another index on archive (at least in tables.sql) involving ar_timestamp, ar_usertext_timestamp, but it's also not suitable for the use case here.

Adding platform because they know the risks of overindexing.

Ha, I'd have added you because you know more about it than I do.

I think the way we're using them now (either search by namespace/title prefix, or scan by timestamp) if we add timestamp index it should not be too confusing? We could check queries in PageArchive::listPagesByPrefix, PageArchive::listPagesBySearch and forceSearchIndex.php in CirrusSearch, if we had a test DB where both indexes exist.

Good idea, let's create one or several test hosts with the new index and
then see how it goes in both resources and query planner.

Smalyshev triaged this task as Medium priority.Jun 29 2017, 6:30 PM

@jcrespo just to be sure, creating test hosts is something you can do? Want to ensure this doesn't fall through and is being handled.

Hi @jcrespo - is this something you'll be able to do? Thanks!

Correct me if I am wrong, but "deletes from certain timestamp" will not be given by the ar_timestamp index? Maybe you would prefer scanning using the ar_id primary key, or use the logging for that?

Answering the question, yes, I can setup such a test, but the fact is that such an index does not exist is because the timestamp is the one of the original creation, not the deletion date -so revisions don't make sense to order by such a timestamp if they are not first ordered by namespace/title - maybe we can use ar_id or do a join for the query you want, like this?

SELECT archive.*, logging.* FROM logging LEFT JOIN archive ON log_namespace = ar_namespace AND log_title = ar_title WHERE log_type='delete' and log_timestamp BETWEEN '20170725' and '20170726' ORDER BY log_timestamp;

Hm, I was under the impression that it's deletion timestamp. If it's creation timestamp than using it for deletes is wrong. And it also means our delete indexing script is completely wrong, as we want from/to to index by deletion time - we use it to catch up with deletes that happened after the main reindex has started - not by creation time.

Since it looks like we're using the wrong timestamp, and don't need the index as described, I'm closing it. Thanks @jcrespo for getting to the root of the issue!

Sorry it took me more time than necessary to answer- sometimes one needs time and perspective to see "this is strange" (why does this index doesn't already exist?) and providing an alternative method that does the correct thing rather than exactly what it is asked.