- avoid filesort / temporary tables as much as possible
- avoid offet-based paging
Description
Details
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | Tgr | T181107 Deploy Reading Lists Service to production | |||
Resolved | Tgr | T180402 Remove custom ordering from ReadingLists extension | |||
Resolved | Tgr | T182053 Improve query performance for reading lists |
Event Timeline
- https://gerrit.wikimedia.org/r/#/c/393925/ gets rid of all OFFSETs and adds a bunch of more targeted indexes, but still a number of filesorts/temp tables reamin
- MediaWiki support for convenient performance checking: https://gerrit.wikimedia.org/r/#/c/393503/
- script for generating test data (to avoid the query optimizer choosing otherwise costly plans when it sees there is very little data): https://gerrit.wikimedia.org/r/#/c/394728/
- production uses MariaDB 10.0; vagrant uses MySQL 5.6; the stretch-migration branch for vagrant uses MariaDB 10.1.
- debugged/fixed some issues with the migration branch and tried to downgrade the MariaDB version. Unfortunately stretch does not support 10.0 anymore; I tried to forward-port the jessie package but it was well above my package management abilities.
- patch for using MariaDB 10.0 on vagrant master: https://gerrit.wikimedia.org/r/395170 (Turning it into a proper role would be difficult and T51652: Vagrant should use MariaDB is already part of the stretch migration so not worth the effort.)
The current query plans are in
Comments:
getAllLists: usually ends up with a good plan but sometimes results in a filesort - could not pin down exactly when. Since the same query structure can result in plans with or without filesort, I will assume that it's just the optimizer deciding that fully using the index is not worth it when there is not enough data (even though I used two million rows of test data) so it's not a problem.
getListsByDateUpdated: added two dedicated indexes for this (so now there are 13 indexes on the two main tables :/ ). The problem is that getAllLists needs to filter on rl_deleted and getListsByDateUpdated doesn't; I hoped rl_deleted IN (0,1) could be handled with an index merge but the ORDER BY doesn't include rl_deleted so that won't work. Alternatively this could be handled by not having rl_deleted in the index and making getAllLists go through a few extra rows, or by including rl_deleted in ORDER BY / continuation; something to come back to later.
getListEntries: filesorts when getting the entries of multiple lists. Probably the same issue as above: the list ID is included in the condition but not in the ORDER BY. The REST endpoints never request more than one list so meh. Maybe worth fixing later.
getListEntriesByDateUpdated: this ends up with a completely crazy plan:
(20:19) root@localhost:[wiki]> EXPLAIN SELECT /* ReadingListRepository::getListEntriesByDateUpdated */ -> rle_id,rle_rl_id,rlp_project,rle_title,rle_date_created,rle_date_updated,rle_deleted -> FROM `reading_list`,`reading_list_entry`,`reading_list_project` -> WHERE (rl_id = rle_rl_id) -> AND (rle_rlp_id = rlp_id) -> AND rle_user_id = '1000' -> AND rl_deleted = '0' -> AND (rle_date_updated > '20100101000000') -> ORDER BY rle_date_updated asc,rle_id asc -> LIMIT 1000; +------+-------------+----------------------+--------+--------------------------------------------------------------------------------------------------+------------------------+---------+----------------------------------------+------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+----------------------+--------+--------------------------------------------------------------------------------------------------+------------------------+---------+----------------------------------------+------+----------------------------------------------+ | 1 | SIMPLE | reading_list_project | index | PRIMARY | rlp_project | 257 | NULL | 176 | Using index; Using temporary; Using filesort | | 1 | SIMPLE | reading_list_entry | ref | rle_list_deleted_title_id,rle_list_deleted_updated_id,rle_user_updated_id,rle_user_project_title | rle_user_project_title | 8 | const,wiki.reading_list_project.rlp_id | 11 | Using where | | 1 | SIMPLE | reading_list | eq_ref | PRIMARY,rl_deleted_updated | PRIMARY | 4 | wiki.reading_list_entry.rle_rl_id | 1 | Using where | +------+-------------+----------------------+--------+--------------------------------------------------------------------------------------------------+------------------------+---------+----------------------------------------+------+----------------------------------------------+ 3 rows in set (0.00 sec)
When I force rle_user_updated_id I get a sane plan, with more rows:
(20:35) root@localhost:[wiki]> EXPLAIN SELECT /* ReadingListRepository::getListEntriesByDateUpdated */ -> rle_id,rle_rl_id,rlp_project,rle_title,rle_date_created,rle_date_updated,rle_deleted -> FROM `reading_list`,`reading_list_entry` use index (rle_user_updated_id),`reading_list_project` -> WHERE (rl_id = rle_rl_id) -> AND (rle_rlp_id = rlp_id) -> AND rle_user_id = '1000' -> AND rl_deleted = '0' -> AND (rle_date_updated > '20100101000000') -> ORDER BY rle_date_updated asc,rle_id asc -> LIMIT 1000; +------+-------------+----------------------+--------+----------------------------+---------------------+---------+------------------------------------+------+-----------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+----------------------+--------+----------------------------+---------------------+---------+------------------------------------+------+-----------------------+ | 1 | SIMPLE | reading_list_entry | range | rle_user_updated_id | rle_user_updated_id | 18 | NULL | 4158 | Using index condition | | 1 | SIMPLE | reading_list | eq_ref | PRIMARY,rl_deleted_updated | PRIMARY | 4 | wiki.reading_list_entry.rle_rl_id | 1 | Using where | | 1 | SIMPLE | reading_list_project | eq_ref | PRIMARY | PRIMARY | 4 | wiki.reading_list_entry.rle_rlp_id | 1 | | +------+-------------+----------------------+--------+----------------------------+---------------------+---------+------------------------------------+------+-----------------------+ 3 rows in set (0.00 sec)
so I'm going to assume that this is again the optimizer doing weird optimization on a small data set.
getListsByPage: uses temporary + filesort. This is unavoidable and should be OK performance-wise; the commit summary of ad1edc63 has the details. seems fixed after 1a737fe40cd4
Change 393925 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/ReadingLists@master] Sort lists and entries by name and last updated timestamp
Change 393925 merged by jenkins-bot:
[mediawiki/extensions/ReadingLists@master] Sort lists and entries by name and last updated timestamp
Added the remaining FIXMEs (which do not have significant production impact) to T171913: Fix technical debt in ReadingLists extension.
Change 395885 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/ReadingLists@master] Improve query plan for getListsByPage
Change 395885 merged by jenkins-bot:
[mediawiki/extensions/ReadingLists@master] Improve query plan for getListsByPage
Change 396108 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/ReadingLists@wmf/1.31.0-wmf.10] Sort lists and entries by name and last updated timestamp
Change 396113 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/ReadingLists@wmf/1.31.0-wmf.10] Improve query plan for getListsByPage
Change 396116 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/ReadingLists@wmf/1.31.0-wmf.11] Sort lists and entries by name and last updated timestamp
Change 396121 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/ReadingLists@wmf/1.31.0-wmf.11] Improve query plan for getListsByPage
Change 396108 merged by jenkins-bot:
[mediawiki/extensions/ReadingLists@wmf/1.31.0-wmf.10] Sort lists and entries by name and last updated timestamp
Change 396113 merged by jenkins-bot:
[mediawiki/extensions/ReadingLists@wmf/1.31.0-wmf.10] Improve query plan for getListsByPage
Change 396116 merged by jenkins-bot:
[mediawiki/extensions/ReadingLists@wmf/1.31.0-wmf.11] Sort lists and entries by name and last updated timestamp
Change 396121 merged by jenkins-bot:
[mediawiki/extensions/ReadingLists@wmf/1.31.0-wmf.11] Improve query plan for getListsByPage