
SELECT query on page table appears to also reference revision table
Closed, Resolved · Public

Description

When I ran this query on the Cloud replicas via Quarry:

use fawiki_p;
select count(*) total_pages, sum(page_is_redirect) redirects
from page;

The EXPLAIN results from Quarry were as follows:

id  select_type  table     type    possible_keys  key      key_len  ref                      rows     Extra
1   SIMPLE       page      ALL                                                               2775439
1   SIMPLE       revision  eq_ref  PRIMARY        PRIMARY  4        fawiki.page.page_latest  1        Using index

Why would it reference the revision table?

Event Timeline

Huji created this task. · May 10 2018, 1:47 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · May 10 2018, 1:47 AM
Huji updated the task description. (Show Details) · May 10 2018, 1:55 AM
bd808 closed this task as Resolved. · May 10 2018, 3:24 PM
bd808 claimed this task.
bd808 added a subscriber: bd808.

The page view recently became more complex to provide backwards compatibility for queries as the Multi Content Revision (MCR) project rolls out schema changes to the Wikimedia production databases. You can see more details at T184446: Configure Toolforge replica views and dumps for the new MCR tables.

$ sql enwiki
(u3518@enwiki.analytics.db.svc.eqiad.wmflabs) [enwiki_p]> show create view page\G
*************************** 1. row ***************************
                View: page
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`viewmaster`@`%` SQL SECURITY DEFINER VIEW `page` AS
  select `enwiki`.`page`.`page_id` AS `page_id`,
         `enwiki`.`page`.`page_namespace` AS `page_namespace`,
         `enwiki`.`page`.`page_title` AS `page_title`,
         `enwiki`.`page`.`page_restrictions` AS `page_restrictions`,
         `enwiki`.`page`.`page_is_redirect` AS `page_is_redirect`,
         `enwiki`.`page`.`page_is_new` AS `page_is_new`,
         `enwiki`.`page`.`page_random` AS `page_random`,
         `enwiki`.`page`.`page_touched` AS `page_touched`,
         `enwiki`.`page`.`page_links_updated` AS `page_links_updated`,
         `enwiki`.`page`.`page_latest` AS `page_latest`,
         `enwiki`.`page`.`page_len` AS `page_len`,
         if(`enwiki`.`content_models`.`model_name`,
            `enwiki`.`content_models`.`model_name`,
            `enwiki`.`page`.`page_content_model`) AS `page_content_model`,
         `enwiki`.`page`.`page_lang` AS `page_lang`
  from ((`enwiki`.`page`
         left join `enwiki`.`revision`
           on((`enwiki`.`revision`.`rev_id` = `enwiki`.`page`.`page_latest`)))
        left join (((`enwiki`.`slots`
                     join `enwiki`.`slot_roles`
                       on(((`enwiki`.`slots`.`slot_role_id` = `enwiki`.`slot_roles`.`role_id`)
                           and (`enwiki`.`slot_roles`.`role_name` = 'main'))))
                    join `enwiki`.`content`
                      on((`enwiki`.`slots`.`slot_content_id` = `enwiki`.`content`.`content_id`)))
                   join `enwiki`.`content_models`
                     on((`enwiki`.`content`.`content_model` = `enwiki`.`content_models`.`model_id`)))
          on((`enwiki`.`slots`.`slot_revision_id` = `enwiki`.`revision`.`rev_id`)))
character_set_client: utf8
collation_connection: utf8_general_ci
1 row in set (0.37 sec)
Huji added a comment. · May 10 2018, 4:25 PM

Good to know.

At least the link is happening through an index :)

jcrespo reopened this task as Open. · May 11 2018, 7:57 AM
jcrespo added a subscriber: jcrespo.

I don't think this is ok, and I suspect it is one of the reasons for T194343.

The view change was introduced in rOPUP0545ee8d0981: wiki replicas: provide backward compatibility for MCR changes which is actually linked to T174047: Hide deprecated/unused fields on toolforge replica [MCR] rather than T184446.

The data formerly available via page.page_content_model has moved/is moving to content_models.model_name, which is connected to the page data via the long chain of page.page_latest -> revision.rev_id -> slots.slot_revision_id, slots.slot_content_id -> content.content_id, and content.content_model -> content_models.model_id (with slots filtered to the 'main' role via slot_roles). On the MediaWiki internal side this is probably a nice improvement, but it adds a lot of index scanning overhead in our attempt to provide a backwards compatible view of the page table for Wiki Replica consumers. If this is causing significant performance problems, the really big question is how valuable page.page_content_model is to the typical page table user, and how we can inform those users that they will need to change their queries to get that data.
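For illustration, here is that chain spelled out as an explicit query (a sketch transcribed from the `show create view page` output above; not a recommended query to run against the replicas):

select p.page_id,
       if(cm.model_name, cm.model_name, p.page_content_model) as page_content_model
from page p
  -- page_latest points at the latest revision of the page
  left join revision r on r.rev_id = p.page_latest
  -- the whole MCR group is left-joined so pages without slot rows survive
  left join (
    slots s
      -- keep only the 'main' slot
      join slot_roles sr on s.slot_role_id = sr.role_id and sr.role_name = 'main'
      join content c on s.slot_content_id = c.content_id
      join content_models cm on c.content_model = cm.model_id
  ) on s.slot_revision_id = r.rev_id;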

bd808 removed bd808 as the assignee of this task. · May 11 2018, 5:54 PM
bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.

I confirm that the only thing all those joins are needed for is back compat for page.page_content_model. It's entirely up to you whether to continue providing this back compat or to make your users change their code.

There's somewhat similar back-compat going on for archive.ar_text_id, archive.ar_content_model, archive.ar_content_format, revision.rev_text_id, revision.rev_content_model, and revision.rev_content_format.

Huji added a comment. · May 12 2018, 12:25 AM

Are queries against the Labs DBs audited and logged, and if so, can we just grep said logs to see how often people query page_content_model? I think the decision to retain or forgo the backwards compatibility should take into account the number of users or use cases that need it.

My point is, if the backwards compatibility can be done using the same names, that is ok; but if it "overshadows" existing tables, we are punishing people who update their code while not encouraging others to do so. I don't mind having a "page_old" or "page_backwards_compatible", but for those that do full scans of page, we should not force them to also scan revision, which will break their queries anyway because they will never finish.

root@labsdb1011[enwiki]> flush status;
Query OK, 0 rows affected (0.00 sec)

root@labsdb1011[enwiki]> select count(*) total_pages, sum(page_is_redirect) redirects from page;
+-------------+-----------+
| total_pages | redirects |
+-------------+-----------+
|    44957538 |  10132574 |
+-------------+-----------+
1 row in set (5 min 47.50 sec)

root@labsdb1011[enwiki]> SHOW STATUS like 'Hand%';
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| Handler_commit             | 1        |
| Handler_delete             | 0        |
| Handler_discover           | 0        |
| Handler_external_lock      | 0        |
| Handler_icp_attempts       | 0        |
| Handler_icp_match          | 0        |
| Handler_mrr_init           | 0        |
| Handler_mrr_key_refills    | 0        |
| Handler_mrr_rowid_refills  | 0        |
| Handler_prepare            | 0        |
| Handler_read_first         | 1        |
| Handler_read_key           | 0        |
| Handler_read_last          | 0        |
| Handler_read_next          | 44957538 |
| Handler_read_prev          | 0        |
| Handler_read_retry         | 0        |
| Handler_read_rnd           | 0        |
| Handler_read_rnd_deleted   | 0        |
| Handler_read_rnd_next      | 0        |
| Handler_rollback           | 0        |
| Handler_savepoint          | 0        |
| Handler_savepoint_rollback | 0        |
| Handler_tmp_update         | 0        |
| Handler_tmp_write          | 0        |
| Handler_update             | 0        |
| Handler_write              | 0        |
+----------------------------+----------+
26 rows in set (0.26 sec)

Meanwhile, the same query on enwiki_p (i.e. through the view) had to be killed after 40 minutes; the handler counters below show it had only gotten through about 7.2 million index lookups by that point:

root@labsdb1011[enwiki]> use enwiki_p
Database changed
root@labsdb1011[enwiki_p]> flush status;
Query OK, 0 rows affected (0.00 sec)

root@labsdb1011[enwiki_p]> select count(*) total_pages, sum(page_is_redirect) redirects from page;
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100): Query execution was interrupted
root@labsdb1011[enwiki_p]> SHOW STATUS like 'Hand%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 0       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 0       |
| Handler_icp_attempts       | 0       |
| Handler_icp_match          | 0       |
| Handler_mrr_init           | 0       |
| Handler_mrr_key_refills    | 0       |
| Handler_mrr_rowid_refills  | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 0       |
| Handler_read_key           | 7182424 |
| Handler_read_last          | 0       |
| Handler_read_next          | 0       |
| Handler_read_prev          | 0       |
| Handler_read_retry         | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_deleted   | 0       |
| Handler_read_rnd_next      | 7182424 |
| Handler_rollback           | 1       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_tmp_update         | 0       |
| Handler_tmp_write          | 0       |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+
26 rows in set (0.24 sec)
bd808 added a comment. · May 14 2018, 5:20 PM

The hashtags tool currently has 145 copies of its enHashtagUpdate job running on the job grid. I'm 90% certain that this is related to the page view change.

bd808 added a comment. · May 14 2018, 5:31 PM

> My point is, if the backwards compatibility can be done using the same names, that is ok; but if it "overshadows" existing tables, we are punishing people who update their code while not encouraging others to do so. I don't mind having a "page_old" or "page_backwards_compatible", but for those that do full scans of page, we should not force them to also scan revision, which will break their queries anyway because they will never finish.

Making a "page_backwards_compatible" view would require tool maintainers to update their code to use it, but it would be a much simpler update for them to make than fully implementing the 6-way join that the new MCR table structure requires to get to the content model.

> Are queries against the Labs DBs audited and logged, and if so, can we just grep said logs to see how often people query page_content_model? I think the decision to retain or forgo the backwards compatibility should take into account the number of users or use cases that need it.

To my knowledge the queries made to the Wiki Replica servers are not logged, so I do not think we have an easy way to grep for historical usage. My naive assumption is that usage of the page_content_model field is low. I can't think of many reasons that the content model would matter for tools that are crawling the page table. I'm leaning towards recommending that we remove the new join logic and instead return whatever is in page.page_content_model for now and then change the view in the future to provide an empty string/null response for that field after it has been dropped from the production schema. Tools that do need the page_content_model data can be updated to do the needed joins themselves rather than relying on a view layer shim.
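A view along those lines would be roughly as follows (a sketch only; the actual change went in via the Gerrit patches below, and the deployed definition may differ):

-- Sketch of the simplified back-compat view: no MCR joins, just the
-- columns of the base page table passed straight through.
create or replace algorithm=undefined definer=`viewmaster`@`%`
  sql security definer view `page` as
select page_id, page_namespace, page_title, page_restrictions,
       page_is_redirect, page_is_new, page_random, page_touched,
       page_links_updated, page_latest, page_len,
       page_content_model,  -- later: NULL once dropped from production
       page_lang
from `enwiki`.`page`;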

> To my knowledge the queries made to the Wiki Replica servers are not logged

We have performance_schema (https://dev.mysql.com/doc/refman/5.7/en/performance-schema-statement-digests.html) enabled on all MariaDB servers, which doesn't log each individual query, but gives an idea of what kinds of queries are the most frequent (a distribution). At any time we can enable finer logging, at the cost of performance. This is not publicly available because of privacy issues (it can reveal usage patterns or leak private data, at least on production).
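For instance, a query like the following against the statement digest table can show how often a column name appears in recent traffic (a sketch; it requires privileges that ordinary replica users do not have):

-- Sketch: count normalized statements mentioning page_content_model.
select digest_text, count_star
from performance_schema.events_statements_summary_by_digest
where digest_text like '%page_content_model%'
order by count_star desc
limit 20;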

If we keep the backward-compatible field in the views, there is one join that can be removed (revision), but most of the chain will still remain. I don't have any real insight into how often or where that field is used by tools, but replacing it with NULL might break fewer things: code whose data structures depend on the field existing would keep working, even though the value itself becomes unusable. If that is a big problem for someone, that seems like acceptable damage control?

Overall, I think it is better to break a couple tools than to break all of them by overburdening the DBs.

page_latest could be used in place of revision.rev_id, I think, since that's what the join is on, which would cut out that one very large table. I'm not sure whether that would resolve this entirely, though.

Bstorm added a comment. (Edited) · May 14 2018, 5:51 PM

Then it would be

page.page_latest -> slots.slot_revision_id, slots.slot_content_id -> content.content_id, content.content_model -> content_models.model_id

That could be problematic when slots increases a lot (which it presumably will).
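Spelled out, the shortened chain would look something like this (hypothetical; it assumes slots.slot_revision_id can be matched directly against page_latest, which holds the latest rev_id):

select p.page_id,
       if(cm.model_name, cm.model_name, p.page_content_model) as page_content_model
from page p
  -- skip the revision table entirely: page_latest already holds the rev_id
  left join (
    slots s
      join slot_roles sr on s.slot_role_id = sr.role_id and sr.role_name = 'main'
      join content c on s.slot_content_id = c.content_id
      join content_models cm on c.content_model = cm.model_id
  ) on s.slot_revision_id = p.page_latest;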

Change 433085 had a related patch set uploaded (by Bstorm; owner: Brooke Storm):
[operations/puppet@production] wiki replicas: return page to a full view

https://gerrit.wikimedia.org/r/433085

Change 433085 merged by Bstorm:
[operations/puppet@production] wiki replicas: return page to a full view

https://gerrit.wikimedia.org/r/433085

Change 433206 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: depool labsdb1010

https://gerrit.wikimedia.org/r/433206

Change 433206 merged by Marostegui:
[operations/puppet@production] wiki replicas: depool labsdb1010

https://gerrit.wikimedia.org/r/433206

Huji added a comment. · May 17 2018, 3:18 PM

How much longer until this actually takes effect? I just reran the query and the EXPLAIN still lists the revision table.

@Huji There are 3 servers with high load, so it takes a bit to deploy changes there :-) Thanks @Bstorm and @Marostegui for the ongoing work on this!

Change 433591 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: depool labsdb1011

https://gerrit.wikimedia.org/r/433591

Change 433591 merged by Marostegui:
[operations/puppet@production] wiki replicas: depool labsdb1011

https://gerrit.wikimedia.org/r/433591

Mentioned in SAL (#wikimedia-operations) [2018-05-17T16:11:28Z] <marostegui> Reload haproxy on dbproxy1010 to depool labsdb1011 T174047 T194341

Finally rolled out the full view of the page table, without the joins, to all three replicas.

Huji closed this task as Resolved. · May 17 2018, 10:23 PM
Huji claimed this task.
Huji triaged this task as Normal priority.

Confirming that the issue seems to be fixed.

Huji reassigned this task from Huji to Marostegui. · May 17 2018, 10:23 PM
Huji removed a project: Patch-For-Review.
Marostegui reassigned this task from Marostegui to Bstorm. · May 18 2018, 5:04 AM
Vvjjkkii renamed this task from SELECT query on page table appears to also reference revision table to y8caaaaaaa. · Jul 1 2018, 1:11 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Bstorm as the assignee of this task.
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Huji, Aklapper.
Marostegui renamed this task from y8caaaaaaa to SELECT query on page table appears to also reference revision table. · Jul 1 2018, 6:34 PM
Marostegui closed this task as Resolved.
Marostegui assigned this task to Bstorm.
Marostegui lowered the priority of this task from High to Normal.
Marostegui updated the task description. (Show Details)