Page MenuHomePhabricator

Logging needs an index to optimize searching by log_title
Open, LowPublic

Description

When trying to create a page that has been previously deleted, a message is shown to the user to remind that it was deleted before. That information is retrieved from the logging table, by matching the title.

Currently, there is no index for the log_title field in that table, hence the query is slow. We would like a new index on log_namespace, log_title to allow for such queries to be efficient.


  • Use case: When opening the edit page for a title to create the page, a log extract may be shown of delete log actions.
  • Source code:
    • EditPage::showIntro,
    • LogEventsList::showLogExtract( …, [ 'delete', 'move' ], [ 'lim' => 10, 'conds' => 'log_action != "revision"' ])
    • LogPager::getQueryInfo

Details

Reference
bz66961

Event Timeline

bzimport raised the priority of this task from to Normal.Nov 22 2014, 3:31 AM
bzimport set Reference to bz66961.
bzimport added a subscriber: Unknown Object (MLST).
Huji created this task.Jun 23 2014, 1:57 AM
hoo added a comment.Jun 23 2014, 2:56 AM

What's actually needed over here is (log_namespace, log_title, log_type), see query at the bottom of https://gerrit.wikimedia.org/r/#/c/139103/3/AbuseFilterVariableHolder.php

Not sure we really want that for only this feature, but Huji mentioned other code paths also running a similar query.

Huji added a comment.Jun 23 2014, 12:55 PM

To be politically correct:

In EditPage.php at the very end of showIntro() function you will find a call to LogEventsList::showLogExtract() which is how the message on top of the page is created). LogEventsList::showLogExtract() itself is defined in includes/logging/LogEventsList.php and uses the getBody() function of LogPager to get a list of 50 recent delete log entries for that page and in the end, it is the doQuery() function of Pager class which actually runs the query. If you follow this path you will notice that the query will be run on log_namespace, log_title and log_type fields.

The example queries I've seen so far always use both log_namespace and log_title which means the page_time index can be used partially:

KEY page_time (log_namespace, log_title, log_timestamp)

Using enwiki pages with 100000+ log entries the queries take ~1s on warm data. Not terrible, though the number of Handler% calls is directly proportional to the number of log entries which isn't great for long term scalability.

So tentative +1 to this bug.

I've been trialing the following index on enwiki slaves for a few months:

KEY log_title_type_time (log_title(16), log_type, log_timestamp)

It sees quite a bit of use in general, and is used in favour of page_time for counting page deletes in the larger namespaces like 10 and 828. We should investigate how it compares to (log_namespace, log_title, log_type); the latter is probably better but might service fewer queries overall as well as encroach on page_time.

Huji added a comment.Dec 8 2014, 4:13 PM

Springle, did you conduct any evaluations on this one? There are very few people who can do that, so gaining momentum on this bug is hard.

Info: showLogExtract used a bloom cache to cache possible empty results since https://gerrit.wikimedia.org/r/#/c/170441/ to avoid a slow query.

@Springle and @Umherirrender I want to resuscitate this task. Can you continue to support this case, or would you like me to get other people involved (and who?)

Huji renamed this task from Logging needs a log_title index to Logging needs an index to optimize searching by log_title.Feb 15 2017, 4:05 PM
Huji added projects: DBA, Wikimedia-Rdbms.
Huji added a project: Performance-Team.
Gilles moved this task from Inbox to Radar on the Performance-Team board.Feb 15 2017, 8:49 PM

I can not say anything about the task, my comment was just a info, that the missing index is not at high priority, because it has a workaround.

jcrespo lowered the priority of this task from Normal to Low.Feb 16 2017, 7:11 PM
jcrespo added a subscriber: jcrespo.

Based on the comments, I am going to lower the priority of this, without trying to abort any ongoing work on this.

@Umherirrender may I asked what workaround you are talking about? The BloomCache workaround is not possible (all BloomCache related code was later removed as it was only experimental); and the use of partial index page_time that @Springle mentioned above doesn't seem to be sufficient.

Huji added a comment.Mar 1 2017, 3:53 PM

@jcrespo can you point me out to the part of the code that skips showing the edit log when the user is anonymous and more then X time has passed since deletion? we might be able to re-use some of that logic in https://phabricator.wikimedia.org/T22892 by skipping the query altogether when a page has no logs.

@Huji Are you sure you are asking the right person? I am not a developer, and I have never done a single contribution to mediawiki code, nor I was involved with the feature you mention. Maybe you are confusing me with another person 0:-)

Huji added a comment.Mar 1 2017, 6:25 PM

@jcrespo I asked you because of your comment on https://gerrit.wikimedia.org/r/#/c/139103/ PS7 where you explained this to me :) I can ask around though.

@Huji Sadly, I knew what happened, but for 3rd party recount, I wasn't part of the people implemented that. The person that answered you with the details knew more, but he may be a bit busy right now...

@Umherirrender may I asked what workaround you are talking about? The BloomCache workaround is not possible (all BloomCache related code was later removed as it was only experimental); and the use of partial index page_time that @Springle mentioned above doesn't seem to be sufficient.

The workaround is linked in my comment from 28. Dec 2014 T68961#946248, my comment from 16. Feb. 2017 refer to this "hint", "info" or what you want call it. It is possible that code was changed since 2014, so it is possible that old comments are not good "hints"/"infos" now, that is always possible for old bugs, which are a lot in this phab.

As written in T68961#3034156, I can not say anything about the task

Another hint: BloomCache was removed with https://gerrit.wikimedia.org/r/#/c/201684/

@Umherirrender thanks for clarifying. BloomCache is not an option so we are back to square one.

Reedy moved this task from Unsorted to Add / Create on the Schema-change board.Apr 26 2017, 2:22 PM
Krinkle updated the task description. (Show Details)
Krinkle removed a subscriber: wikibugs-l-list.

I don't think this is going to work log_namespace, log_title should probably be used instead, even if we do not have a namespace -> that probably can be extrapolated (heurstilcly from the title) or erronously assume namespace 0 (or default namespace) and only show that for the default namespace; or on worse case scenario, IN (all namespaces) could be used instead.

Huji updated the task description. (Show Details)Aug 7 2017, 1:39 PM

... log_namespace, log_title should probably be used instead, even if we do not have a namespace ...

Fair point. I updated the task description accordingly.

Huji moved this task from Triage to Backlog on the DBA board.Aug 7 2017, 1:40 PM

@Huji, but we already have that index :-) :

https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/tables.sql;0a9ee371b531e3e657eaba378abf89480acc8efc$1411

A different story would be that such index doesn't exist by mistake on some specific servers (which would be a duplicate of T71127 -a WMF deploying bug, not a mediawiki bug), but it should exist according to the mediawiki code seen above. Can you check if that is the case?

Huji added a comment.Aug 7 2017, 1:50 PM

@Jcrespi: but that includes log_timestamp a field that in our current use cases (see related Tasks above) would not be available. In fact, ideally we need an index that instead of log_timestmap, has log_type.

Think of "number of times a page was deleted" use case. We don't put a constraint on time, but we specify the page (namespace + title) and the action (deleted).

jcrespo added a comment.EditedAug 7 2017, 2:04 PM

Left-most column index featur on MySQL/MariaDB allows to use an index on (A, B, C) for filters on {A}, {A, B}, and {A, B, C} as efficiently as the smaller indexes. In some cases, where the selectivity is high it can be used for middle-column scanning, too (less eficiently). You mentioned that you needed an index on (log_namespace, log_title), for that, the index (log_namespace, log_title, log_timestamp) is great.

If now you say you need to filter by action, too, that is a different index (and not the one you originally asked for on the description). Please update your requirements to add (action, log_namespace, log_title) or (log_namespace, log_title, action), although I wonder if it can be done in a different way. My real advice is to remove the request for a specific index, because it can lead to confusion, but indicate instead the problem with further details (specific query run now, and why it is slow, with examples such as urls, logs, queries, etc.) -and people will help finding the right solution. That will avoid discussing issues that may be distracting from the real issue we have here :-). For example, if what you really wanted was T22892, maybe the (smaller) archive table could be used instead? I do not know, it is just an example.

Note I am not saying the index is not reasonable, I am just suggesting to add more context information to clarify to us readers the needs more clearly.

For example, we could search the last 100 logs for a page, and then search for a delete:

SELECT ... FROM (SELECT ... FROM logging WHERE log_namespace = 0 and log_title = $title ORDER BY log_timestamp DESC LIMIT 100) AS last_logs_for_page WHERE action = 'delete';

And indicate on the text it is only on "recent logs", as a workaround. Again, I am not saying we should do that, but it could be a possibility.

Huji added a comment.Aug 9 2017, 1:14 AM

I actually like that idea. I have seen that code elsewhere (have to think to remember where), whereby a query is executed to get "recent" data and only if that fast query does not return results would the slower query be run. Let me investigate and incorporate your comment. You may have made my day here :)

Huji added a comment.Aug 12 2017, 1:03 PM

@aaron this requires your attention. If @jcrespo 's analysis is correct, then we do not need a new index for https://gerrit.wikimedia.org/r/#/c/139103/

Please note my suggestion was just that, a suggestion, we were still discussing it -we may or may not need an index, I just asked to look at such option.

Huji added a comment.Aug 12 2017, 1:17 PM

Please note my suggestion was just that, a suggestion, we were still discussing it -we may or may not need an index, I just asked to look at such option.

Hence my conditional if .. then statement :)

Marostegui moved this task from Backlog to Next on the DBA board.Mar 11 2019, 6:22 AM

Just for the record, we are almost done with unifying the logging table to what is on tables.sql across all the wikis. So any index placed on tables.sql will be in production soon. Any other index not appearing in tables.sql is being removed.
The only pending section to be done is s3.

Krinkle updated the task description. (Show Details)