Page MenuHomePhabricator

Review & apply schema changes for T250748
Closed, ResolvedPublic

Description


Additional question (this is not a blocked for applying the schema change - this can be done later, but also great if it can be performed together with above schema change)
We'll want to update the existing data before we start to use this new column.

UPDATE machine_vision_image
INNER JOIN machine_vision_label ON mvi_id = mvl_mvi_id
INNER JOIN machine_vision_suggestion ON mvs_mvl_id = mvl_id
SET mvi_priority = 127
WHERE mvs_timestamp < 20191201000000;

Can we execute this large update directly, or should we write a script to do it in batches?


Additional review information WRT schema change:

This new query will be executed every time new images are requested: (to fetch a cutoff priority)

SELECT mvi_priority
FROM machine_vision_image
INNER JOIN machine_vision_label ON mvi_id = mvl_mvi_id
WHERE mvl_review = 0
GROUP BY mvi_priority HAVING COUNT( DISTINCT mvi_id ) >= 200
ORDER BY mvi_priority DESC
LIMIT 1

And then the existing subquery for actually fetching those images changes from this:

SELECT mvi_sha1
FROM machine_vision_image
INNER JOIN machine_vision_label ON mvi_id = mvl_mvi_id
INNER JOIN machine_vision_suggestion ON mvs_mvl_id = mvl_id
WHERE mvl_review = 0 AND mvs_timestamp < 20191201000000
ORDER BY mvi_rand DESC
LIMIT 10

to this:

SELECT mvi_sha1
FROM machine_vision_image
INNER JOIN machine_vision_label ON mvi_id = mvl_mvi_id
WHERE mvl_review = 0 AND mvi_priority >= $priority
ORDER BY mvi_rand DESC
LIMIT 10

Schema change progress:

  • commonswiki
    • eqiad
    • codfw

Event Timeline

Marostegui moved this task from Triage to Pending comment on the DBA board.
Marostegui added a subscriber: Marostegui.

That UPDATE seems too big to actually do it at once (2M rows or so), please do it via a script with small batches (maybe 1000 rows per batch) and wait for replication, once I have performed the schema change.
I will perform the schema change and then check the query plan.
The current query seems good enough, so I don't think the new one will change a lot, but let's double check after the schema change.

I will comment here once the UPDATE can be performed.

Mentioned in SAL (#wikimedia-operations) [2020-06-10T11:38:06Z] <marostegui> Deploy schema change on testcommonswiki T255003

This is how the tables looks like in testcommonswiki, please confirm if this is what you expected:

root@db1081.eqiad.wmnet[testcommonswiki]> show create table machine_vision_image\G
*************************** 1. row ***************************
       Table: machine_vision_image
Create Table: CREATE TABLE `machine_vision_image` (
  `mvi_id` int(11) NOT NULL AUTO_INCREMENT,
  `mvi_sha1` varbinary(32) NOT NULL,
  `mvi_rand` float NOT NULL,
  `mvi_priority` tinyint(3) DEFAULT '0',
  PRIMARY KEY (`mvi_id`),
  UNIQUE KEY `mvi_sha1` (`mvi_sha1`(10)),
  KEY `mvi_rand` (`mvi_rand`),
  KEY `mvi_priority` (`mvi_priority`)
) ENGINE=InnoDB AUTO_INCREMENT=38 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED
1 row in set (0.00 sec)

Mentioned in SAL (#wikimedia-operations) [2020-06-10T11:50:27Z] <marostegui> Deploy schema change on commonswiki codfw T255003

I have altered codfw already, so we can do some tests there.
This query looks good:

SELECT mvi_sha1
FROM machine_vision_image
INNER JOIN machine_vision_label ON mvi_id = mvl_mvi_id
WHERE mvl_review = 0 AND mvi_priority >= $priority
ORDER BY mvi_rand DESC
LIMIT 10

However, the new query looks problematic. It takes around 20 seconds, which is way too much, especially if it will be executed often:

SELECT mvi_priority
FROM machine_vision_image
INNER JOIN machine_vision_label ON mvi_id = mvl_mvi_id
WHERE mvl_review = 0
GROUP BY mvi_priority HAVING COUNT( DISTINCT mvi_id ) >= 200
ORDER BY mvi_priority DESC
LIMIT 1
root@PRODUCTION s4 slave[commonswiki]> show explain for 4830556;
+------+-------------+----------------------+-------+------------------+------------------+---------+-----------------------------------------+---------+-----------------------------+
| id   | select_type | table                | type  | possible_keys    | key              | key_len | ref                                     | rows    | Extra                       |
+------+-------------+----------------------+-------+------------------+------------------+---------+-----------------------------------------+---------+-----------------------------+
|    1 | SIMPLE      | machine_vision_image | index | PRIMARY          | mvi_priority     | 2       | NULL                                    | 2938183 | Using index; Using filesort |
|    1 | SIMPLE      | machine_vision_label | ref   | mvl_mvi_wikidata | mvl_mvi_wikidata | 4       | commonswiki.machine_vision_image.mvi_id |       2 | Using where                 |
+------+-------------+----------------------+-------+------------------+------------------+---------+-----------------------------------------+---------+-----------------------------+
2 rows in set, 1 warning (0.00 sec)

That's a lot of rows being scanned, can you possibly think of a way of making that query differently to make sure it runs faster?
We normally prefer more queries but fast, than 1 single slow query.

By looking at the query trace, most of the cost and rows scan comes from the GROUP BY:

"reconsidering_access_paths_for_index_ordering": {
  "clause": "GROUP BY",
  "fanout": 3,
  "read_time": 20089,
  "table": "machine_vision_image",
  "rows_estimation": 2554367,
  "possible_keys": [
    {
      "index": "PRIMARY",
      "can_resolve_order": false,
      "cause": "not usable index for the query"
    },
    {
      "index": "mvi_sha1",
      "can_resolve_order": false,
      "cause": "not usable index for the query"
    },
    {
      "index": "mvi_rand",
      "can_resolve_order": false,
      "cause": "not usable index for the query"
    },
    {
      "index": "mvi_priority",
      "can_resolve_order": true,
      "updated_limit": 851455,
      "index_scan_time": 13393,
      "records": 2554367,
      "chosen": true

Yeah the group by & the mvl_review = 0 clause were both significant issues for this join.

Here's an alternative for the problematic query that would work just as well for the application.
Is this one ok to use instead? It drastically limits the dataset, and seems to execute near-instantaneously with the new schema (0.04sec)

SELECT mvi_priority
FROM machine_vision_image
WHERE mvi_id IN (
    SELECT mvl_mvi_id
    FROM machine_vision_label
    WHERE mvl_review = 0
)
ORDER BY mvi_priority DESC
LIMIT 1 OFFSET 200;

If above query is fine, the schema migration can continue & I'll go update the other patch to execute this new query.

This query looks much better. 0.05 and it no longer scans the whole table

s4 eqiad progress

  • labsdb1012
  • labsdb1011
  • labsdb1010
  • labsdb1009
  • dbstore1004
  • db1149
  • db1148
  • db1147
  • db1146
  • db1144
  • db1143
  • db1142
  • db1141
  • db1138
  • db1125
  • db1121
  • db1102
  • db1084
  • db1081

Change 605814 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] filtered_tables.txt: Add mvi_priority column

https://gerrit.wikimedia.org/r/605814

Change 605814 merged by Marostegui:
[operations/puppet@production] filtered_tables.txt: Add mvi_priority column

https://gerrit.wikimedia.org/r/605814

Marostegui updated the task description. (Show Details)

All done!