Page MenuHomePhabricator

SpecialWhatLinksHere::showIndirectLinks on wikidatawiki needs optimizing
Closed, ResolvedPublic

Description

The following query takes hours to run and will never run as they get killed by the query killer:

Copying to tmp table on disk	SELECT /* SpecialWhatLinksHere::showIndirectLinks  */  page_id,page_namespace,page_title,rd_from,rd_fragment,page_is_redirect  FROM (SELECT  pl_from,rd_from,rd_fragment  FROM `pagelinks` LEFT JOIN `redirect` ON ((rd_from = pl_from) AND rd_title = 'P248' AND (rd_interwiki = '' OR rd_interwiki IS NULL) AND rd_namespace = 120) JOIN `page` ON ((pl_from = page_id))   WHERE pl_namespace = 120 AND pl_title = 'P248'  ORDER BY pl_from LIMIT 102  ) `temp_backlink_range` JOIN `page` ON ((pl_from = page_id))    ORDER BY page_id LIMIT 51
root@db1092.eqiad.wmnet[wikidatawiki]> explain SELECT /* SpecialWhatLinksHere::showIndirectLinks  */  page_id,page_namespace,page_title,rd_from,rd_fragment,page_is_redirect  FROM (SELECT  pl_from,rd_from,rd_fragment  FROM `pagelinks` LEFT JOIN `redirect` ON ((rd_from = pl_from) AND rd_title = 'P248' AND (rd_interwiki = '' OR rd_interwiki IS NULL) AND rd_namespace = 120) JOIN `page` ON ((pl_from = page_id))   WHERE pl_namespace = 120 AND pl_title = 'P248'  ORDER BY pl_from LIMIT 102  ) `temp_backlink_range` JOIN `page` ON ((pl_from = page_id))    ORDER BY page_id LIMIT 51;
+------+-------------+------------+--------+----------------------+----------+---------+---------------------------------------+----------+----------------------------------------------+
| id   | select_type | table      | type   | possible_keys        | key      | key_len | ref                                   | rows     | Extra                                        |
+------+-------------+------------+--------+----------------------+----------+---------+---------------------------------------+----------+----------------------------------------------+
|    1 | PRIMARY     | <derived2> | ALL    | NULL                 | NULL     | NULL    | NULL                                  |      102 | Using temporary; Using filesort              |
|    1 | PRIMARY     | page       | eq_ref | PRIMARY              | PRIMARY  | 4       | temp_backlink_range.pl_from           |        1 |                                              |
|    2 | DERIVED     | page       | index  | PRIMARY              | page_len | 4       | NULL                                  | 88669406 | Using index; Using temporary; Using filesort |
|    2 | DERIVED     | pagelinks  | eq_ref | PRIMARY,pl_namespace | PRIMARY  | 265     | wikidatawiki.page.page_id,const,const |        1 | Using where; Using index                     |
|    2 | DERIVED     | redirect   | eq_ref | PRIMARY,rd_ns_title  | PRIMARY  | 4       | wikidatawiki.page.page_id             |        1 | Using where                                  |
+------+-------------+------------+--------+----------------------+----------+---------+---------------------------------------+----------+----------------------------------------------+
5 rows in set (0.01 sec)

Event Timeline

I think even if it's due to size of Wikidata, there's nothing in the Wikibase code we can do to fix/address it. I assume CPT should take a look.

rows: 88669406

Beautiful

daniel added subscribers: Lydia_Pintscher, daniel.

Back to the inbox for triage. This isn't actionable as it is. We'd probably have to design a new schema for representing the relevant info in the DB to make this work. Could be a "future initiative".

@Lydia_Pintscher: how important is this for Wikidata?
@Marostegui: how critical is this for DBAs? As a stop gap, we could just disable the feature on wikidata.

Hard to tell without the context of when this is happening for users. Does anyone know?

Back to the inbox for triage. This isn't actionable as it is. We'd probably have to design a new schema for representing the relevant info in the DB to make this work. Could be a "future initiative".

@Lydia_Pintscher: how important is this for Wikidata?
@Marostegui: how critical is this for DBAs? As a stop gap, we could just disable the feature on wikidata.

So far I haven't seen anything being overloaded by this query, but it might in the future, who knows.
However, given the huge amount of rows this query has to scan, I doubt it will ever finish on time before being killed. So whatever uses it, might not get the expected data

Hard to tell without the context of when this is happening for users. Does anyone know?

The query in question is generated by https://www.wikidata.org/wiki/Special:WhatLinksHere/Property:P248. That seems to work fine though, and returns nearly instantly.

So far I haven't seen anything being overloaded by this query, but it might in the future, who knows.
However, given the huge amount of rows this query has to scan, I doubt it will ever finish on time before being killed. So whatever uses it, might not get the expected data

Since the link above works fine, it appears to me that the query is only "sometimes" problematic, due to a suboptimal query plan. For me locally, on a nearly empty database, the query plan looks like this:

+------+-------------+------------+--------+----------------------+--------------+---------+-----------------------------+------+--------------------------+
| id   | select_type | table      | type   | possible_keys        | key          | key_len | ref                         | rows | Extra                    |
+------+-------------+------------+--------+----------------------+--------------+---------+-----------------------------+------+--------------------------+
|    1 | PRIMARY     | <derived2> | ALL    | NULL                 | NULL         | NULL    | NULL                        | 2    | Using filesort           |
|    1 | PRIMARY     | page       | eq_ref | PRIMARY              | PRIMARY      | 4       | temp_backlink_range.pl_from | 1    |                          |
|    2 | DERIVED     | pagelinks  | ref    | PRIMARY,pl_namespace | pl_namespace | 261     | const,const                 | 1    | Using where; Using index |
|    2 | DERIVED     | redirect   | eq_ref | PRIMARY,rd_ns_title  | PRIMARY      | 4       | default.pagelinks.pl_from   | 1    | Using where              |
|    2 | DERIVED     | page       | eq_ref | PRIMARY              | PRIMARY      | 4       | default.pagelinks.pl_from   | 1    | Using index              |
+------+-------------+------------+--------+----------------------+--------------+---------+-----------------------------+------+--------------------------+

I see only one "using filesort" there, and no "using temporary". But that may just be because there are no redirects to this page. But then, there are no redirects to Property:P248 on wikidata either...

Any idea when and why the query plan becomes nasty, and what to do about it?

Unfortunately hard to tell, but the optimizer behaves differently when there is lots of data and sometimes it does silly things :-(. We have seen over the years, especially with massive tables.
The only workaround we've found is sadly, using FORCE, USE or IGNORE INDEX on the queries to bypass those issues. Sometimes issues get fixed by running an analyze table so the optimizer gets recent table stats, or just from one version to another.
I could try to issue an analyze table on the involved tables, to see if that makes some difference.

The only workaround we've found is sadly, using FORCE, USE or IGNORE INDEX on the queries to bypass those issues. Sometimes issues get fixed by running an analyze table so the optimizer gets recent table stats, or just from one version to another.

Yea, the question is, what to force. The index that seems off in the query plan you supplied is page_len. That makes no sense. But forcing it to use PRIMARY there makes things worse:

> explain SELECT /* SpecialWhatLinksHere::showIndirectLinks  */  page_id,page_namespace,page_title,rd_from,rd_fragment,page_is_redirect  FROM (SELECT  pl_from,rd_from,rd_fragment  FROM `pagelinks` LEFT JOIN `redirect` ON ((rd_from = pl_from) AND rd_title = 'P248' AND (rd_interwiki = '' OR rd_interwiki IS NULL) AND rd_namespace = 120) JOIN `page` use index (primary) ON ((pl_from = page_id))   WHERE pl_namespace = 120 AND pl_title = 'P248'  ORDER BY pl_from LIMIT 102  ) `temp_backlink_range` JOIN `page` on ((pl_from = page_id))    ORDER BY page_id LIMIT 51;
+------+-------------+------------+--------+----------------------+---------+---------+---------------------------------------+----------+----------------------------------------------+
| id   | select_type | table      | type   | possible_keys        | key     | key_len | ref                                   | rows     | Extra                                        |
+------+-------------+------------+--------+----------------------+---------+---------+---------------------------------------+----------+----------------------------------------------+
|    1 | PRIMARY     | <derived2> | ALL    | NULL                 | NULL    | NULL    | NULL                                  |      102 | Using temporary; Using filesort              |
|    1 | PRIMARY     | page       | eq_ref | PRIMARY              | PRIMARY | 4       | temp_backlink_range.pl_from           |        1 |                                              |
|    2 | DERIVED     | page       | index  | PRIMARY              | PRIMARY | 4       | NULL                                  | 90571878 | Using index; Using temporary; Using filesort |
|    2 | DERIVED     | pagelinks  | eq_ref | PRIMARY,pl_namespace | PRIMARY | 265     | wikidatawiki.page.page_id,const,const |        1 | Using where; Using index                     |
|    2 | DERIVED     | redirect   | eq_ref | PRIMARY,rd_ns_title  | PRIMARY | 4       | wikidatawiki.page.page_id             |        1 | Using where                                  |
+------+-------------+------------+--------+----------------------+---------+---------+---------------------------------------+----------+----------------------------------------------+
5 rows in set (0.00 sec)

I find the subquery a bit confusing, but it seems if we join page on pl_from = page_id, we should be using the primary index... Any idea what's going wrong there? Or what index hints we can use to fix it?

Also, do you have an idea why i'm getting the bad query plan when running explain on the production DB, but the actual page that runs that query loads fine?

Hard to tell without the context of when this is happening for users. Does anyone know?

The query in question is generated by https://www.wikidata.org/wiki/Special:WhatLinksHere/Property:P248. That seems to work fine though, and returns nearly instantly.

Ok thanks. That special page is rather important for Wikidata and turning it off completely is not an option I fear.

Ok thanks. That special page is rather important for Wikidata and turning it off completely is not an option I fear.

My idea was to turn of the "indirect links" feature. Direct links would still be shown, but links via redirects would not.

Marostegui changed the task status from Open to Stalled.Aug 25 2020, 11:09 AM
Marostegui added a project: DBA.

From what I can see the optimizer is chosen either PRIMARY or page_len depending on the version and the query speeds entirely depends on the MariaDB version as well:

labsdb1012.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	83547612	Using index; Using temporary; Using filesort
labsdb1011.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	83548548	Using index; Using temporary; Using filesort
labsdb1010.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	89525495	Using index; Using temporary; Using filesort
labsdb1009.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	83599184	Using index; Using temporary; Using filesort
dbstore1005.eqiad.wmnet:3318
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	92731048	Using index
db2100.codfw.wmnet:3318
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	83517708	Using index; Using temporary; Using filesort
db2094.codfw.wmnet:3318
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	92928185	Using index; Using temporary; Using filesort
db2091.codfw.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	90663802	Using index
db2086.codfw.wmnet:3318
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	92956600	Using index
db2085.codfw.wmnet:3318
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	92758694	Using index
db2084.codfw.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	81169979	Using index
db2083.codfw.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	92900558	Using index
db2082.codfw.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	92354559	Using index; Using temporary; Using filesort
db2081.codfw.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	90607914	Using index
db2080.codfw.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	81169979	Using index; Using temporary; Using filesort
db2079.codfw.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	92964806	Using index; Using temporary; Using filesort
db1126.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	90459904	Using index; Using temporary; Using filesort
db1124.eqiad.wmnet:3318
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	90543157	Using index; Using temporary; Using filesort
db1116.eqiad.wmnet:3318
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	83541923	Using index; Using temporary; Using filesort
db1114.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	92669155	Using index
db1111.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	90597191	Using index
db1109.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
db1104.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	92845070	Using index; Using temporary; Using filesort
db1101.eqiad.wmnet:3318
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	90315067	Using index
db1099.eqiad.wmnet:3318
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	90576576	Using index
db1092.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	PRIMARY	4	NULL	90513622	Using index
db1087.eqiad.wmnet:3306
1	PRIMARY	page	eq_ref	PRIMARY	PRIMARY	4	temp_backlink_range.pl_from	1
2	DERIVED	page	index	PRIMARY	page_len	4	NULL	90601328	Using index; Using temporary; Using filesort

From what I can see the query runs instantly if it uses PRIMARY, example with the same host I originally reported (db1092) which now after the upgrade, for some reason the optimizer uses the PK and not page_len as it used to when I created the task.

51 rows in set (0.002 sec)

If I pick a host from that list that uses page_len the query the filesorts and starting taking time.
And the live explain does confirm it is filesorting:

root@db2082.codfw.wmnet[wikidatawiki]> show explain for 70466998;
+------+-------------+------------+--------+----------------------+----------+---------+---------------------------------------+----------+----------------------------------------------+
| id   | select_type | table      | type   | possible_keys        | key      | key_len | ref                                   | rows     | Extra                                        |
+------+-------------+------------+--------+----------------------+----------+---------+---------------------------------------+----------+----------------------------------------------+
|    1 | PRIMARY     | <derived2> | ALL    | NULL                 | NULL     | NULL    | NULL                                  |      102 | Using temporary; Using filesort              |
|    1 | PRIMARY     | page       | eq_ref | PRIMARY              | PRIMARY  | 4       | temp_backlink_range.pl_from           |        1 |                                              |
|    2 | DERIVED     | page       | index  | PRIMARY              | page_len | 4       | NULL                                  | 92354674 | Using index; Using temporary; Using filesort |
|    2 | DERIVED     | pagelinks  | eq_ref | PRIMARY,pl_namespace | PRIMARY  | 265     | wikidatawiki.page.page_id,const,const |        1 | Using where; Using index                     |
|    2 | DERIVED     | redirect   | eq_ref | PRIMARY,rd_ns_title  | PRIMARY  | 4       | wikidatawiki.page.page_id             |        1 | Using where                                  |
+------+-------------+------------+--------+----------------------+----------+---------+---------------------------------------+----------+----------------------------------------------+
5 rows in set, 1 warning (0.035 sec)

The weird thing is that if I force that same query to use PRIMARY, the query still filesorts:

root@db2082.codfw.wmnet[wikidatawiki]> show explain for 70466998;
+------+-------------+------------+--------+----------------------+---------+---------+---------------------------------------+----------+----------------------------------------------+
| id   | select_type | table      | type   | possible_keys        | key     | key_len | ref                                   | rows     | Extra                                        |
+------+-------------+------------+--------+----------------------+---------+---------+---------------------------------------+----------+----------------------------------------------+
|    1 | PRIMARY     | <derived2> | ALL    | NULL                 | NULL    | NULL    | NULL                                  |      102 | Using temporary; Using filesort              |
|    1 | PRIMARY     | page       | eq_ref | PRIMARY              | PRIMARY | 4       | temp_backlink_range.pl_from           |        1 |                                              |
|    2 | DERIVED     | page       | index  | PRIMARY              | PRIMARY | 4       | NULL                                  | 92354776 | Using index; Using temporary; Using filesort |
|    2 | DERIVED     | pagelinks  | eq_ref | PRIMARY,pl_namespace | PRIMARY | 265     | wikidatawiki.page.page_id,const,const |        1 | Using where; Using index                     |
|    2 | DERIVED     | redirect   | eq_ref | PRIMARY,rd_ns_title  | PRIMARY | 4       | wikidatawiki.page.page_id             |        1 | Using where                                  |
+------+-------------+------------+--------+----------------------+---------+---------+---------------------------------------+----------+----------------------------------------------+
5 rows in set, 1 warning (0.035 sec)

Interestingly, this only happens on 10.1
On 10.4 the query picks the PK and finishes in 0.03, so MariaDB 10.1 even if using PK, is still way slower.

I have run the query on all the 10.4 hosts, and it runs instantly.:

# time for i in db2081 db2083 db2084 db2085:3318 db2086:3318 db2091 db1092 db1099:3318 db1101:3318 db1111 db1114 dbstore1005:3318; do echo "====$i==="; cat query  | mysql.py -h$i wikidatawiki ; done

real	0m3.076s
user	0m1.653s
sys	0m0.309s

Doing the same on the 10.1 makes the query never finished and I had to kill it on every host:

# time for i in db2080 db2082 db2100:3318 db2079; do echo "====$i==="; cat query  | mysql.py -h$i wikidatawiki ; done
====db2080===
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100) at line 1: Query execution was interrupted
====db2082===

^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100) at line 1: Query execution was interrupted
====db2100:3318===
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100) at line 1: Query execution was interrupted
====db2079===
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100) at line 1: Query execution was interrupted

real	0m51.062s
user	0m0.731s
sys	0m0.097s

So forcing PRIMARY won't solve anything as this looks like a MariaDB 10.1 bug. I am not going to send a bug report as 10.1 will EOL in 4 months and it is unlikely they'll touch the optimizer anymore in the minor releases until then.
This should get fixed once we've entirely migrated to 10.4. Stalling this task then.

Thankfully on s8 (wikidatawiki) there are only 3 replicas out of 8 that receive reads.

eprodromou subscribed.

This seems like we're just waiting for the upgrades to happen, so there's not a real job for PET to do. Re-tag us if there's more than is obvious here.

Change 673826 had a related patch set uploaded (by BrandonXLF; owner: BrandonXLF):
[mediawiki/core@master] Remove page inner join from subquery in SpecialWhatLinksHere

https://gerrit.wikimedia.org/r/673826

Change 673800 had a related patch set uploaded (by Reedy; owner: BrandonXLF):
[mediawiki/core@REL1_35] Remove page inner join from subquery in SpecialWhatLinksHere

https://gerrit.wikimedia.org/r/673800

Change 673800 merged by jenkins-bot:
[mediawiki/core@REL1_35] Remove page inner join from subquery in SpecialWhatLinksHere

https://gerrit.wikimedia.org/r/673800

Change 673826 merged by jenkins-bot:
[mediawiki/core@master] Remove page inner join from subquery in SpecialWhatLinksHere

https://gerrit.wikimedia.org/r/673826

In order to close this, we just need to finish these two tasks: T290868 T290865

We have no 10.1 replicas serving traffic, only masters which are not supposed to receive these queries so I am going to close this as resolved.