Page MenuHomePhabricator

Providing index of backlinks table to labs replicas
Closed, ResolvedPublic

Description

For a fairly popular and important user tool, (check its use here) I need to query Wikipedia database about backlinks of a page but it seems unlike Wikipedia servers, labs' replicas doesn't have access to that index.

For example run this on ckbwiki_p on labs:

SELECT COUNT(*) FROM pagelinks WHERE pl_title="Main_page";

Then try to do the same on fawiki_p or enwiki_p (you will get stuck and timeout on large databases).. Can you please provide that index to labs users? Thanks

Related Objects

Event Timeline

Ebraminio created this task.Mar 8 2017, 8:17 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 8 2017, 8:17 PM
Ebraminio renamed this task from Making access to index of table backlinks to Providing index of backlinks table to labs replicas.Mar 8 2017, 8:18 PM
Ebraminio updated the task description. (Show Details)
Ebraminio updated the task description. (Show Details)Mar 8 2017, 8:21 PM

You have always to provide the namespace to make better database access (but I do not know, if the index is there)

Try this:

SELECT COUNT(*) FROM pagelinks WHERE pl_namespace = 0 AND pl_title="Main_page";

Ebraminio closed this task as Resolved.Mar 11 2017, 10:08 PM
Ebraminio claimed this task.

Excellent, it makes a huge difference for my use.

@Umherirrender: Thanks a lot for your help on this, but can I also know the reason behind this? I have a very specific query something like this:

SELECT pl_title, COUNT(*)
FROM pagelinks
WHERE pl_namespace = 0 AND pl_title IN ('A', 'B', 'C') GROUP BY pl_title;

which just after removing the 'C' will not get timeout (the exact query that is getting timeout is on error log of linkstranslator tool on labs) but in an attempt to make the tool more reliable I found adding "pl_from_namespace = 0 AND " to the query also makes a difference. Does this make sense? Do you have any other suggestion for making the situation better? Thanks

A page in mediawiki is always in a namespace, so seaching for a page named 'C' can return the page in the main namespace, but also a user page or a template, because in all these namespaces a page named 'C' can exists.

The pagelinks table also reflect this by having both columns and the index are created for this fact on pl_namespace, pl_title.

Therefore always provide a namespace when searching a title, because without namespace it could mean different pages and it gives a time out due to non existing index

Maybe read more about the database layout of mediawiki - https://www.mediawiki.org/wiki/Manual:Database_layout
or namespaces - https://www.mediawiki.org/wiki/Manual:Namespace

jcrespo added a subscriber: jcrespo.

Thanks for the help provided, Umherirrender.

Thank you for the kind explanation. I was able to reduce the troublemaking query an user issued to the tool to this which still have database timeout issue:

SELECT pl_title, COUNT(*) FROM pagelinks WHERE pl_namespace = 0 AND pl_title IN ('WWW', 'NNN', 'CCC', 'DDD', 'FFF', 'GGG', 'Girls_Gone_Wild_(franchise)', 'EEE', 'Epipedobates_espinosai', 'HHH', 'JJJ', 'BBB', 'TTT', 'AAA', 'Revenge_of_the_Petites') GROUP BY pl_title;

Perhaps the issue with this specific query is indicating some internal mysql bug or is subject of another phab task? Thanks.

(the weird point of the mentioned query on the previous comment is if you remove just one of the pages from the list or replace it with something else, it works just fine)

What you comment is a known limitation of mariadb, it may send bad plans when there is more than 10 items per 'IN ()' expression, as it uses heuristics to calculate the best index. https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_eq_range_index_dive_limit variable solves this, however, that has not been backported to MariaDB: Look how it was myself who reported that: https://jira.mariadb.org/browse/MDEV-10313

For now, reducing the number of items on the IN (for example, doing 2 queries with less than 10 items), may help do a better plan. See:

root@labsdb1001[enwiki_p]> EXPLAIN SELECT pl_title, COUNT(*) FROM pagelinks WHERE pl_namespace = 0 AND pl_title IN ('WWW', 'NNN', 'CCC', 'DDD', 'FFF', 'GGG', 'Girls_Gone_Wild_(franchise)', 'EEE', 'Epipedobates_espinosai', 'HHH', 'JJJ', 'BBB', 'TTT', 'AAA', 'Revenge_of_the_Petites') GROUP BY pl_title\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagelinks
         type: ref
possible_keys: pl_namespace,pl_backlinks_namespace
          key: pl_namespace
      key_len: 4
          ref: const
         rows: 1047
        Extra: Using where; Using index
1 row in set (0.04 sec)

root@labsdb1001[enwiki_p]> EXPLAIN SELECT pl_title, COUNT(*) FROM pagelinks WHERE pl_namespace = 0 AND pl_title IN ('WWW', 'NNN', 'CCC', 'DDD', 'FFF', 'GGG', 'Girls_Gone_Wild_(franchise)', 'EEE', 'Epipedobates_espinosai') GROUP BY pl_title\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagelinks
         type: range
possible_keys: pl_namespace,pl_backlinks_namespace
          key: pl_backlinks_namespace
      key_len: 261
          ref: NULL
         rows: 513
        Extra: Using where; Using index
1 row in set (0.01 sec)

Thanks for the great explanation and sorry for overloading the file.

Just as a side note, unfortunately labs users don't have access to EXPLAIN (T152341) and I tried to do this locally yesterday without fortune of course. I should mention however without your explanation I don't think I could find the root of the issue. Thanks!

unfortunately labs users don't have access to EXPLAIN

@Ebraminio actually, you do: See the workaround on: https://phabricator.wikimedia.org/T50875 There used to be a tool to do that easily at https://tools.wmflabs.org/tools-info/optimizer.py but it seems no longer works. It can be done manually with the steps above:

Open 2 SQL sessions

In session 1:

  • SELECT CONNECTION_ID() AS conid;
  • Note the number returned.
  • Run the query to be explained.

In session 2:

  • Use the number noted above for <conid>
  • SHOW EXPLAIN FOR <conid>;

Excellent! Thanks