Page MenuHomePhabricator

Add page_is_redirect/page_namespace/page_title index
Open, Needs TriagePublic

Description

Consider adding an index on (page_is_redirect, page_namespace, page_title) to the page table.

In T160916, the redirect filter on Special:AllPages and in ApiQueryAllPages was disabled in production because it is an unindexed query and, depending on parameters, can scan very many rows.

T173479 asked for the feature to be re-enabled and was closed invalid.

This 2025 community wish by @Klein also asks for such a feature. The wish asks for no limit on Special:ListRedirects, but if you look at the linked discussion, it would seem that re-enabling the ApiQueryAllPages feature would better suit this user's needs. They would like to run a bot which iterates over all redirects.

I am filing this task with the correct technical framing so that we can have a discussion about it. The question is whether the feature is sufficiently valuable to justify the index space. This question was not discussed in the historical tasks.

You can run unindexed queries on toolserver, and there are page table dumps, but it would be most convenient for users if the index were added, so that ApiQueryAllPages could be used as a generator with a redirect filter.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I suggest doing this backwards, the queries should happen on redirect table instead. Maybe even if we add rd_from_namespace to redirect table, we can even fully get rid of page_is_redirect from the page table and mere existence of a row in redirect table should indicate that the page is a redirect. This is a more recommended approach to data modelling when you have a superset and then a rather small subset (redirect pages), instead of trying hold all information there in one giant table via somethid_is_something fields (the smaller the portion of the subset, the more useful this modelling would be, in enwiki it's 23% but in many wikis it's quite smaller).

Currently in commonswiki, page is 27GB while redirect is 357MB (in enwiki it's more balanced, 8.2GB for page and 1.3GB for redirect).

Additionally, It's always better for maintenance to have two medium sized tables than one gigantic one.

I suggest doing this backwards, the queries should happen on redirect table instead.

The main problem with using the redirect table is that you can't sort it by title. Sorting by title is needed for Special:ListRedirects and for the disabled features on Special:AllPages and ApiQueryAllpages. It's probably not needed for @Klein's use case, although for that we would need to write a new query module that sorts by page_id.

A secondary problem with using the redirect table is that it doesn't allow you to exclude redirects. Special:AllPages on enwiki (and other Wikipedias) is quite cluttered with redirects, and it used to be possible to hide them. We could just re-enable that feature. Reviewing T160914, it seems that the actual problem was a query planner bug -- the table scanning issue was just theoretical and has apparently not been noticed on Special:PrefixIndex in the years since.

Currently in commonswiki, page is 27GB while redirect is 357MB (in enwiki it's more balanced, 8.2GB for page and 1.3GB for redirect).

I think the proposed index size would be roughly equal to the page_name_title index, which I measured using mysql.innodb_index_stats. I've also shown the number of page table rows, the percentage of redirects, and the page_name_title index size scaled down by the redirect percentage as a model for a sparse duplicate of the page table.

DBPagesRedirects %Index size per rowproposed index size (GB)redirect titles index size (GB)
enwiki65 M23.4%623.740.88
dewiki9 M16.2%520.440.07
jawiki5 M16.5%590.250.04
commonswiki156 M1.4%11015.980.22
wikidatawiki118 M2.6%465.020.13

It's almost an argument for having different indexes on different databases isn't it? I know, that way lies madness.

Also on the topic of heresies,

MariaDB [mw]> \s
--------------
mysql  Ver 15.1 Distrib 10.11.8-MariaDB, for debian-linux-gnu (x86_64) using  EditLine wrapper
...
MariaDB [mw]> ALTER TABLE page
    -> ADD page_namespace_if_redirect INT AS (IF(page_is_redirect, page_namespace, NULL)),
    -> ADD page_title_if_redirect VARBINARY(255) AS (IF(page_is_redirect, page_title, NULL)),
    -> ADD INDEX page_redir_name_title (page_namespace_if_redirect,page_title_if_redirect);
Query OK, 0 rows affected (0.023 sec)
Records: 0  Duplicates: 0  Warnings: 0
MariaDB [mw]> EXPLAIN SELECT page_id FROM page WHERE page_namespace_if_redirect=0 ORDER BY page_title_if_redirect;
+------+-------------+-------+------+-----------------------+-----------------------+---------+-------+------+--------------------------+
| id   | select_type | table | type | possible_keys         | key                   | key_len | ref   | rows | Extra                    |
+------+-------------+-------+------+-----------------------+-----------------------+---------+-------+------+--------------------------+
|    1 | SIMPLE      | page  | ref  | page_redir_name_title | page_redir_name_title | 5       | const | 36   | Using where; Using index |
+------+-------------+-------+------+-----------------------+-----------------------+---------+-------+------+--------------------------+
1 row in set (0.011 sec)

I mean, it would work, wouldn't it?

I suggest doing this backwards, the queries should happen on redirect table instead.

The main problem with using the redirect table is that you can't sort it by title. Sorting by title is needed for Special:ListRedirects and for the disabled features on Special:AllPages and ApiQueryAllpages. It's probably not needed for @Klein's use case, although for that we would need to write a new query module that sorts by page_id.

I don't see that being a massive problem, Special:Whatlinkshere and many other places order based on page_id.

A secondary problem with using the redirect table is that it doesn't allow you to exclude redirects. Special:AllPages on enwiki (and other Wikipedias) is quite cluttered with redirects, and it used to be possible to hide them. We could just re-enable that feature. Reviewing T160914, it seems that the actual problem was a query planner bug -- the table scanning issue was just theoretical and has apparently not been noticed on Special:PrefixIndex in the years since.

We can still do a left join and add a condition for redirect_something being NULL. It means the filtering will happen on server-side but as long as the ratio is not too large (which it isn't), it should be fine. The only problematic path would be when there are millions of consecutive redirect pages but that shouldn't be the case anyway.

Currently in commonswiki, page is 27GB while redirect is 357MB (in enwiki it's more balanced, 8.2GB for page and 1.3GB for redirect).

I think the proposed index size would be roughly equal to the page_name_title index, which I measured using mysql.innodb_index_stats. I've also shown the number of page table rows, the percentage of redirects, and the page_name_title index size scaled down by the redirect percentage as a model for a sparse duplicate of the page table.

Here is the problem. page_name_title serves a vast range of vital usecases so we can justify the index and the cost it puts on the infra. This usecase doesn't really justify adding tens of gigabytes into core tables, doubly so in s4 and s8 that are already quite under constraint due to their size (e.g. T343131: Commons database is growing way too fast) on a table that is quite hot and needs to be served from memory. If these tables were in, let's say in x1, I would have signed off on it immediately. That's why I think the redirect table is a better option since it arrives at the compromise of cost vs user experience feature. If we fully remove page_is_redirect in favor of existence of the row in redirect table, it'll save us money and makes those sections more stable.