AC
- Tables should have all the indexes recommended by Amir: T308738#8103421
- Those indexes should also be created (manually) on beta
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Add indexes to the schema | mediawiki/extensions/CampaignEvents | master | +62 -2 |
Discussed: this should come after we have created all our special pages (probably a simple task, but we don't have enough information). Essentially, all V0 features should be in place.
I think the indexes we already have are fine, I just would add more 3:
The thing about the country is that we're likely to change the schema anyway for V1, so I'm not sure if we need to account for a future use case now. Also, the second index would be redundant because the country is already the leftmost column of the first index.
- event_status
- Used when querying events by status.
Slightly unsure about this one, the column can only have 2 values right now (and even in future versions its cardinality is going to remain very low), so I'm not sure if an index would help much.
I'm creating a prod-like set of tables with thousands of events locally and will run some queries on it to see what indexes would make a difference to the optimizer. I'll work on it in the weekend and I expect it'll take some time to come back with the results.
I played with the dataset a little bit, and here are my conclusions. This is not an extensive analysis, but I felt like it wasn't worth investing more time into it. I also think we shouldn't consider this a blocker for the beta deployment for the following reasons:
However, we should still address this as soon as possible. @cmelo, I asked @Ladsgroup if he could give us his recommendations on the schema, and he kindly agreed to meet with us in the upcoming two weeks. Amir, below you can find information about the schema, the queries, and the questions I have. This comment is long, I know it and I apologize for it. I tried to divide it in multiple sections, so that you can skip whatever you believe is not relevant to you. Thanks in advance for your help and for reading this wall of text.
See the db_patches directory. See autogenerated schema for MySQL. The purpose of the tables is fairly self-explanatory:
We currently don't have many indexes, except for primary keys and some uniqueness constraints.
As I wrote in T308738#8081669, I wanted to create a large dataset locally with production-like data so that I could play with it and see which changes were worth implementing. If you want the code that I used to generate the data, please ask for it, but note that I'll have to rewrite it first because I made some mistakes that had to be later accounted for in the code. Here are the relevant specifications given the simple use cases we have so far:
Each table is managed by (almost) a single class:
I said "almost" because we also have EventsPager.php which queries all tables directly. The (current) purpose of EventsPager is to list the events organized by a given user. Also, EventStore needs to query ce_participants and ce_organizers as well.
This is a list of queries for the current version of the extension. We will definitely add more in the future, but those should not be optimized now.
SELECT * FROM campaign_events WHERE event_page_namespace=X AND event_page_title=Y AND event_page_wiki=Z
SELECT * FROM campaign_events JOIN ce_organizers ON (event_id=ceo_event_id) WHERE ceo_user_id=X
SELECT * FROM campaign_events JOIN ce_participants ON (event_id=cep_event_id) WHERE cep_user_id=X AND cep_unregistered_at IS NULL
INSERT INTO ce_participants (cep_event_id,cep_user_id,cep_registered_at,cep_unregistered_at) VALUES (X,Y,Z,NULL) ON DUPLICATE KEY UPDATE cep_unregistered_at = NULL,cep_registered_at = Z
-- Single participant UPDATE ce_participants SET cep_unregistered_at = X WHERE cep_event_id = Y AND cep_unregistered_at IS NULL AND cep_user_id = Z -- Multiple participants UPDATE ce_participants SET cep_unregistered_at = X WHERE cep_event_id = Y AND cep_unregistered_at IS NULL AND cep_user_id IN ( x_1, ..., x_n )
SELECT cep_id, cep_user_id, cep_registered_at FROM ce_participants WHERE cep_event_id = X AND cep_unregistered_at IS NULL [ AND (cep_id > Y) ] ORDER BY cep_id LIMIT Z
SELECT * FROM ce_participants WHERE cep_event_id = X AND cep_user_id = Y AND cep_unregistered_at IS NULL LIMIT 1
SELECT COUNT(*) FROM ce_participants WHERE cep_event_id = X AND cep_unregistered_at IS NULL LIMIT 1
SELECT ceo_user_id,ceo_role_id FROM ce_organizers WHERE ceo_event_id = X LIMIT Y
SELECT * FROM ce_organizers WHERE ceo_event_id = X AND ceo_user_id = Y LIMIT 1
SELECT COUNT(*) FROM ce_organizers WHERE ceo_event_id = X LIMIT 1
Then we have EventsPager, used on Special:MyEvents to display a list of event that you organize. The events are sortable by start date, name, and number of participants. You can also filter them by status. The basic query is:
SELECT event_id,event_name,event_page_namespace,event_page_title,event_page_prefixedtext,event_page_wiki,event_status,event_start,event_meeting_type,num_participants FROM (SELECT event_id,event_name,event_page_namespace,event_page_title,event_page_prefixedtext,event_page_wiki,event_status,event_start,event_meeting_type,COUNT(cep_id) AS `num_participants` FROM `campaign_events` LEFT JOIN `ce_participants` ON ((event_id=cep_event_id)) LEFT JOIN `ce_organizers` ON ((event_id=ceo_event_id)) WHERE event_deleted_at IS NULL AND cep_unregistered_at IS NULL AND ceo_user_id = X GROUP BY cep_event_id,event_id,event_name,event_page_namespace,event_page_title,event_page_prefixedtext,event_page_wiki,event_status,event_start,event_meeting_type ) `tmp` ORDER BY Y LIMIT Z
We use a temporary table because otherwise we could not sort by the number of participants, due to limitations in the core pager logic and for postgres support (see T308694). Also, note that the query above is valid under ONLY_FULL_GROUP_BY in MySQL < 5.7.5, where it still didn't detect functional dependency, which explains why all fields are appearing in the GROUP BY even if they're redundant.
The list can be filtered by name and status; in both cases we change the WHERE condition of the inner query;
The possible orderings are:
Below I'm listing the potential issues I noticed with my questions. @Ladsgroup I'd appreciate if you could answer those.
The first issue I noticed is when selecting events organized by a given user (either as a simple list, or the paginated version for Special:MyEvents). The current index is ceo_event_id, ceo_user_id, ceo_role_id, so it cannot be used for the JOIN (because ceo_user_id is needed first). As such, the whole campaign_events table is scanned, and then filtered later. This could be resolved by indexing the ceo_user_id column to result in a better query plan where the join condition uses the index and cuts down the number of rows. However, I think we will still need both indexes, because ceo_event_id can be used without ceo_user_id when retrieving the count/list of organizers of a specific event.
Getting all the events that a user participates in shows similar issues and would benefit from having an index on the user ID. Again though, we also need cep_event_id to be indexed, as it's used e.g when counting participants.
The "get the participants of an event" one seems good, but it filesorts because the PK index is not used (it uses the one on event+user instead). I'm not sure if it's worth adding the ID to a composite index, since for now only low LIMITs can be used.
Now for the pager one. First of all, this query would also benefit from the index on ceo_user_id proposed above. Then, the inner query has Using temporary; Using filesort for ce_organizers. I'm sure this is caused by the GROUP BY but I'm not sure how to fix that. The outer query (on the derived table) also filesorts, but I'm not sure if we can avoid that. Overall, this query is noticeably slow if an organizer has many events (which is maybe not very realistic, but still worth noting). I don't think adding indexes to campaign_events for the where condition would make any difference, because ce_organizer is scanned first and only the corresponding rows in campaign_events are returned. Also, I was wondering at T308738#8081669 if an index on event_status would help at all, since the cardinality is just 2. As for indexes on the ORDER BY columns: since it's the derived table which is ordered, I'm not sure if they would help. I think MySQL should be able to use indexes from inner tables in some situations, but this does not seem to be the case here.
As a final doubt/question: the current schema forces us to use joins + group by + temp tables in the pager query for the number of participants. I was wondering if it would make sense to denormalize the value, adding a campaign_events.event_num_participants field that is updated whenever the ce_participants table is updated. This would greatly simplify the query, but I'm not sure if it's really worth doing.
Adding indexes requires not just the schema and queries but also data and sometimes optimizer bugs and so on. At some point it basically turns into an art than actual engineering/architecture work but I doubt we would get there.
We also have the space/speed tradeoff. We are rather okay in space in x1 so we can err on side of more indexes.
Here is rule of thumb:
This should be the order of importance for index (and the ones that must get on the leftmost part of the index):
With these notes, let me take a look at your schema and queries. I saw you already have index on ceo_event_id which is great.
I'm looking at campaign_events and its queries now.
On joining to campaign_events table, you definitely need to add these indexes (I know it's opposite of the join column but join order is important, these will be the first tables in the join order so they should have the join on where condition):
Add index on event_id, event_deleted_at on campaign_events and I think you'd be good to go.
Amir, thanks for your detailed replies. Sorry if I didn't get back to you sooner, but with the release happening last week I had so many things on my mind.
Right now we already have a unique index on ceo_event_id, ceo_user_id, ceo_role_id (in the future, each user could have multiple roles for the same event).
- I'm generally not a big fan of upsert. It can easily start wasting auto increment values in updates. I suggest doing a query first and if it exists, just going with update. If not, then upsert (to handle rare race conditions).
Yup, we're already doing that, see here. Now that I think about it, this is also something I wanted to ask you: right now we're doing that in an atomic section + a FOR UPDATE lock. I think the lock should be enough and the atomic section wouldn't be necessary though, am I right?
Summing up, here are the final indexes to have (including those that already exist) if I understood your comments:
-- On campaign_events PRIMARY KEY(event_id), UNIQUE INDEX event_page ( event_page_wiki, event_page_namespace, event_page_title ), INDEX event_id_deleted (event_id, event_deleted_at) -- On ce_organizers PRIMARY KEY(ceo_id), UNIQUE INDEX ceo_event_user_role ( ceo_event_id, ceo_user_id, ceo_role_id ), -- Used when retrieving organizers for a single event INDEX ceo_user_event ( ceo_user_id, ceo_event_id ) -- Used for JOINs -- On ce_participants PRIMARY KEY(cep_id), UNIQUE INDEX cep_event_participant (cep_event_id, cep_user_id), -- Uniqueness constraint INDEX cep_event_unregistered ( cep_event_id, cep_unregistered_at ), -- For COUNT INDEX cep_user_unregistered_event ( cep_user_id, cep_unregistered_at, cep_event_id ) -- For JOIN
Is this correct? I'm also wondering if we could avoid one of those indexes on ce_participants.
Thank you again!
Sorry it took me a bit to respond, I was out sick (and half-time for a while due to sickness)
I doubt we would ever hit race conditions like that. Still you can do a read with a replica for getting the id (to avoid insert) and then do a read with FOR UPDATE if needed. The thing is that specially if you select ranges or if the id doesn't exist, it might lock way more rows (e.g. gap lock) than you intend to. So I'm not sure you'd FOR UPDATE first, try replica (which would catch most cases I assume), then master with FOR UPDATE
Summing up, here are the final indexes to have (including those that already exist) if I understood your comments:
-- On campaign_events PRIMARY KEY(event_id), UNIQUE INDEX event_page ( event_page_wiki, event_page_namespace, event_page_title ), INDEX event_id_deleted (event_id, event_deleted_at) -- On ce_organizers PRIMARY KEY(ceo_id), UNIQUE INDEX ceo_event_user_role ( ceo_event_id, ceo_user_id, ceo_role_id ), -- Used when retrieving organizers for a single event INDEX ceo_user_event ( ceo_user_id, ceo_event_id ) -- Used for JOINs -- On ce_participants PRIMARY KEY(cep_id), UNIQUE INDEX cep_event_participant (cep_event_id, cep_user_id), -- Uniqueness constraint INDEX cep_event_unregistered ( cep_event_id, cep_unregistered_at ), -- For COUNT INDEX cep_user_unregistered_event ( cep_user_id, cep_unregistered_at, cep_event_id ) -- For JOINIs this correct? I'm also wondering if we could avoid one of those indexes on ce_participants.
Yes, it looks good. If you are worried about storage (which I'm not much given that it's going to be on x1), you can avoid adding cep_event_unregistered and for counts, it would scan all rows of an event which should be just slightly more than the rows it would scan with the index.
HTH
Thanks Amir, this definitely helps! @cmelo Would you like to implement the suggestions above, or should I?
Change 826877 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):
[mediawiki/extensions/CampaignEvents@master] Add indexes to the schema
I've implemented the suggestion above, and can confirm that the queries are really fast on the test dataset.
Change 826877 merged by jenkins-bot:
[mediawiki/extensions/CampaignEvents@master] Add indexes to the schema
Commands that I will run on beta:
ALTER TABLE campaign_events ADD INDEX event_id_deleted (event_id, event_deleted_at); ALTER TABLE ce_participants ADD INDEX cep_event_unregistered ( cep_event_id, cep_unregistered_at ); ALTER TABLE ce_participants ADD INDEX cep_user_unregistered_event ( cep_user_id, cep_unregistered_at, cep_event_id ); ALTER TABLE ce_organizers ADD INDEX ceo_user_event (ceo_user_id, ceo_event_id);
Mentioned in SAL (#wikimedia-releng) [2022-08-31T16:39:06Z] <Daimona> Applying schema change to the wikishared DB on beta for the CampaignEvents extension # T308738
Index recommendations:
-- On campaign_events PRIMARY KEY(event_id), UNIQUE INDEX event_page ( event_page_wiki, event_page_namespace, event_page_title ), INDEX event_id_deleted (event_id, event_deleted_at)
SHOW INDEX FROM campaign_events;
MariaDB [wikishared]> SHOW INDEX FROM campaign_events; +-----------------+------------+-------------------+--------------+----------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | +-----------------+------------+-------------------+--------------+----------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | campaign_events | 0 | PRIMARY | 1 | event_id | A | 114 | NULL | NULL | | BTREE | | | | campaign_events | 0 | event_page | 1 | event_page_wiki | A | 57 | NULL | NULL | | BTREE | | | | campaign_events | 0 | event_page | 2 | event_page_namespace | A | 57 | NULL | NULL | | BTREE | | | | campaign_events | 0 | event_page | 3 | event_page_title | A | 114 | NULL | NULL | | BTREE | | | | campaign_events | 1 | event_id_deleted | 1 | event_id | A | 114 | NULL | NULL | | BTREE | | | | campaign_events | 1 | event_id_deleted | 2 | event_deleted_at | A | 114 | NULL | NULL | YES | BTREE | | | | campaign_events | 1 | event_timezone_id | 1 | event_timezone | A | 2 | NULL | NULL | | BTREE | | | | campaign_events | 1 | event_timezone_id | 2 | event_id | A | 114 | NULL | NULL | | BTREE | | | +-----------------+------------+-------------------+--------------+----------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Index recommendations:
-- On ce_organizers PRIMARY KEY(ceo_id), UNIQUE INDEX ceo_event_user_role ( ceo_event_id, ceo_user_id, ceo_role_id ), -- Used when retrieving organizers for a single event INDEX ceo_user_event ( ceo_user_id, ceo_event_id ) -- Used for JOINs
SHOW INDEX FROM ce_organizers;
MariaDB [wikishared]> SHOW INDEX from ce_organizers; +---------------+------------+---------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | +---------------+------------+---------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | ce_organizers | 0 | PRIMARY | 1 | ceo_id | A | 133 | NULL | NULL | | BTREE | | | | ce_organizers | 0 | ceo_event_user_role | 1 | ceo_event_id | A | 133 | NULL | NULL | | BTREE | | | | ce_organizers | 0 | ceo_event_user_role | 2 | ceo_user_id | A | 133 | NULL | NULL | | BTREE | | | | ce_organizers | 0 | ceo_event_user_role | 3 | ceo_role_id | A | 133 | NULL | NULL | | BTREE | | | | ce_organizers | 1 | ceo_user_event | 1 | ceo_user_id | A | 133 | NULL | NULL | | BTREE | | | | ce_organizers | 1 | ceo_user_event | 2 | ceo_event_id | A | 133 | NULL | NULL | | BTREE | | | +---------------+------------+---------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Index recommendations:
-- On ce_participants PRIMARY KEY(cep_id), UNIQUE INDEX cep_event_participant (cep_event_id, cep_user_id), -- Uniqueness constraint INDEX cep_event_unregistered ( cep_event_id, cep_unregistered_at ), -- For COUNT INDEX cep_user_unregistered_event ( cep_user_id, cep_unregistered_at, cep_event_id ) -- For JOIN
SHOW INDEX FROM ce_participants;
MariaDB [wikishared]> SHOW INDEX FROM ce_participants; +-----------------+------------+-----------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | +-----------------+------------+-----------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | ce_participants | 0 | PRIMARY | 1 | cep_id | A | 276 | NULL | NULL | | BTREE | | | | ce_participants | 0 | cep_event_participant | 1 | cep_event_id | A | 276 | NULL | NULL | | BTREE | | | | ce_participants | 0 | cep_event_participant | 2 | cep_user_id | A | 276 | NULL | NULL | | BTREE | | | | ce_participants | 1 | cep_event_unregistered | 1 | cep_event_id | A | 276 | NULL | NULL | | BTREE | | | | ce_participants | 1 | cep_event_unregistered | 2 | cep_unregistered_at | A | 276 | NULL | NULL | YES | BTREE | | | | ce_participants | 1 | cep_user_unregistered_event | 1 | cep_user_id | A | 276 | NULL | NULL | | BTREE | | | | ce_participants | 1 | cep_user_unregistered_event | 2 | cep_unregistered_at | A | 276 | NULL | NULL | YES | BTREE | | | | ce_participants | 1 | cep_user_unregistered_event | 3 | cep_event_id | A | 276 | NULL | NULL | | BTREE | | | +-----------------+------------+-----------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
AC passes as index recommendations match what are currently in the tables on betacluster. One note though, campaign_events table also has the event_timezone_id index with columns event_timezone and event_id which are not mentioned in the index recommendations AC. @Daimona is this acceptable and correct? Wanted to check on this before marking this ticket as done.