Page MenuHomePhabricator

Slow random image queries in MachineVision
Closed, ResolvedPublic

Description

From logstash:

Expectation (readQueryTime <= 5) by MediaWiki::main not met (actual: 5.7761619091034):
query-m: SELECT DISTINCT mvi_sha1 AS `value` FROM `machine_vision_image` INNER JOIN `machine_vision_label` ON ((mvi_id = mvl_mvi_id)) WHERE mvl_review = N AND (mvi_rand > N.N) ORDER BY mvi_rand ASC LIMIT N  [TRX#f71cbb]
#0 /srv/mediawiki/php-1.35.0-wmf.16/includes/libs/rdbms/TransactionProfiler.php(252): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated('readQueryTime', Object(Wikimedia\Rdbms\GeneralizedSql), 5.7761619091034)
#1 /srv/mediawiki/php-1.35.0-wmf.16/includes/libs/rdbms/database/Database.php(1344): Wikimedia\Rdbms\TransactionProfiler->recordQueryCompletion(Object(Wikimedia\Rdbms\GeneralizedSql), 1580247583.2197, false, 10)
#2 /srv/mediawiki/php-1.35.0-wmf.16/includes/libs/rdbms/database/Database.php(1226): Wikimedia\Rdbms\Database->executeQueryAttempt('SELECT DISTINCT...', 'SELECT /* Media...', false, 'MediaWiki\\Exten...', 0)
#3 /srv/mediawiki/php-1.35.0-wmf.16/includes/libs/rdbms/database/Database.php(1162): Wikimedia\Rdbms\Database->executeQuery('SELECT DISTINCT...', 'MediaWiki\\Exten...', 0)
#4 /srv/mediawiki/php-1.35.0-wmf.16/includes/libs/rdbms/database/Database.php(1828): Wikimedia\Rdbms\Database->query('SELECT DISTINCT...', 'MediaWiki\\Exten...')
#5 /srv/mediawiki/php-1.35.0-wmf.16/includes/libs/rdbms/database/Database.php(1691): Wikimedia\Rdbms\Database->select(Array, Array, Array, 'MediaWiki\\Exten...', Array, Array)
#6 /srv/mediawiki/php-1.35.0-wmf.16/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->selectFieldValues(Array, 'mvi_sha1', Array, 'MediaWiki\\Exten...', Array, Array)
#7 /srv/mediawiki/php-1.35.0-wmf.16/includes/libs/rdbms/database/DBConnRef.php(311): Wikimedia\Rdbms\DBConnRef->__call('selectFieldValu...', Array)
#8 /srv/mediawiki/php-1.35.0-wmf.16/extensions/MachineVision/src/Repository.php(292): Wikimedia\Rdbms\DBConnRef->selectFieldValues(Array, 'mvi_sha1', Array, 'MediaWiki\\Exten...', Array, Array)
#9 /srv/mediawiki/php-1.35.0-wmf.16/extensions/MachineVision/src/Repository.php(296): MediaWiki\Extension\MachineVision\Repository->MediaWiki\Extension\MachineVision\{closure}(true, 10, Array)
#10 /srv/mediawiki/php-1.35.0-wmf.16/extensions/MachineVision/src/Special/SpecialSuggestedTags.php(70): MediaWiki\Extension\MachineVision\Repository->getTitlesWithUnreviewedLabels(10)
#11 /srv/mediawiki/php-1.35.0-wmf.16/extensions/MachineVision/src/Special/SpecialSuggestedTags.php(46): MediaWiki\Extension\MachineVision\Special\SpecialSuggestedTags->getInitialSuggestedTagsData()
#12 /srv/mediawiki/php-1.35.0-wmf.16/includes/specialpage/SpecialPage.php(575): MediaWiki\Extension\MachineVision\Special\SpecialSuggestedTags->execute(NULL)
#13 /srv/mediawiki/php-1.35.0-wmf.16/includes/specialpage/SpecialPageFactory.php(611): SpecialPage->run(NULL)
#14 /srv/mediawiki/php-1.35.0-wmf.16/includes/MediaWiki.php(298): MediaWiki\Special\SpecialPageFactory->executePath(Object(Title), Object(RequestContext))
#15 /srv/mediawiki/php-1.35.0-wmf.16/includes/MediaWiki.php(967): MediaWiki->performRequest()
#16 /srv/mediawiki/php-1.35.0-wmf.16/includes/MediaWiki.php(530): MediaWiki->main()
#17 /srv/mediawiki/php-1.35.0-wmf.16/index.php(46): MediaWiki->run()
#18 /srv/mediawiki/w/index.php(3): require('/srv/mediawiki/...')
#19 {main}

The problem seems to be the DISTINCT:

> EXPLAIN SELECT DISTINCT(mvi_sha1) AS `value` FROM `machine_vision_image` INNER JOIN `machine_vision_label` ON ((mvi_id = mvl_mvi_id)) WHERE mvl_review = 0 AND (mvi_rand > 0.5) ORDER BY mvi_rand ASC LIMIT 100;;
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => machine_vision_image
    [type] => range
    [possible_keys] => PRIMARY,mvi_rand
    [key] => mvi_rand
    [key_len] => 4
    [ref] =>
    [rows] => 134357
    [Extra] => Using index condition; Using temporary; Using filesort
)
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => machine_vision_label
    [type] => ref
    [possible_keys] => mvl_mvi_wikidata
    [key] => mvl_mvi_wikidata
    [key_len] => 4
    [ref] => commonswiki.machine_vision_image.mvi_id
    [rows] => 4
    [Extra] => Using where; Distinct
)
> EXPLAIN SELECT mvi_sha1 AS `value` FROM `machine_vision_image` INNER JOIN `machine_vision_label` ON ((mvi_id = mvl_mvi_id)) WHERE mvl_review = 0 AND (mvi_rand > 0.5) ORDER BY mvi_rand ASC LIMIT 10;
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => machine_vision_image
    [type] => range
    [possible_keys] => PRIMARY,mvi_rand
    [key] => mvi_rand
    [key_len] => 4
    [ref] =>
    [rows] => 134357
    [Extra] => Using index condition
)
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => machine_vision_label
    [type] => ref
    [possible_keys] => mvl_mvi_wikidata
    [key] => mvl_mvi_wikidata
    [key_len] => 4
    [ref] => commonswiki.machine_vision_image.mvi_id
    [rows] => 4
    [Extra] => Using where
)
> select count(*) from machine_vision_label where mvl_review = 0 limit 1000;;
stdClass Object
(
    [count(*)] => 2086405
)

> select count(*) from machine_vision_label where mvl_review = 1;
stdClass Object
(
    [count(*)] => 6625
)

How many duplicate entries are there (roughly) for each SHA-1? Maybe the LIMIT can be the number of desired results times a multiplier and the DISTINCT can be removed. Any duplicates can be filtered in PHP in that case. Any excess results would be discarded in PHP rather than mysql.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

This is doing a full table scan:

root@db1081.eqiad.wmnet[commonswiki]> FLUSH STATUS; SELECT /* MediaWiki\Extension\MachineVision\Repository::getTitlesWithUnreviewedLabels */ DISTINCT mvi_sha1 AS `value` FROM `machine_vision_image` INNER JOIN `machine_vision_label` ON ((mvi_id = mvl_mvi_id)) WHERE mvl_review = 0 AND (mvi_rand > 0.16201735574846) ORDER BY mvi_rand ASC LIMIT 10; nopager; show status like 'Hand%';
Query OK, 0 rows affected (0.00 sec)

<snip>
10 rows in set (8.77 sec)

PAGER set to stdout
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 0      |
| Handler_icp_attempts       | 464056 |
| Handler_icp_match          | 464056 |
| Handler_mrr_init           | 0      |
| Handler_mrr_key_refills    | 0      |
| Handler_mrr_rowid_refills  | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 0      |
| Handler_read_key           | 464057 |
| Handler_read_last          | 0      |
| Handler_read_next          | 610770 |
| Handler_read_prev          | 0      |
| Handler_read_retry         | 0      |
| Handler_read_rnd           | 10     |
| Handler_read_rnd_deleted   | 0      |
| Handler_read_rnd_next      | 443011 |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_tmp_update         | 0      |
| Handler_tmp_write          | 443010 |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+

I think we arrived at a pretty poor query pattern because of a conceptual mistake. We're storing reviewed state at the label level (mvl_reviewed), which is technically accurate given that we may have additional labels coming in from other sources in the future, but in our products we're really thinking in terms of images, and not labels, that need review. All labels associated with an image are reviewed at once.

Here are a couple of schema tweaks that I think could considerably improve things:

  1. Add a mvi_reviewed boolean column to machine_vision_image to reflect whether the image has labels needing review. (A "reviewed" image can be flipped back to unreviewed state in the event that additional unreviewed labels are added for it in the future.)
  1. Add a mvi_uploader column to machine_vision_image, and remove mvl_uploader (which contains an identical value for every label pertaining to the same image) from machine_vision_label.

Then this query would become much simpler, something like:

SELECT mvi_sha1 FROM machine_vision_image WHERE mvi_rand > <rand> AND mvi_reviewed = 0 [AND mvi_uploader = <user id>] ORDER BY mvi_rand [LIMIT <limit>];

Change 574898 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] DB schema updates, part 1: Add and populate machine_vision_image columns

https://gerrit.wikimedia.org/r/574898

Change 574899 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Update MachineVision Repository class to use the updated schema

https://gerrit.wikimedia.org/r/574899

Change 574900 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Drop mvl_uploader_id column from machine_vision_label

https://gerrit.wikimedia.org/r/574900

@Marostegui Assuming that the above approach is agreed upon, when do you expect that you would be able to perform the schema change to add the new columns?

I imagine that how this will go is that we'll add the new mvi_uploader_id and mvi_reviewed columns, populate them by maintenance script, and immediately update the application code to use them. Then we can remove the unused mvl_uploader_id column at a convenient later date.

How do we typically go about preventing the tables from being written to during the required schema updates and code changes? If necessary, we can temporarily disable the extension, but maybe there's a better way.

I can imagine an alternative, more staggered approach where we add the new columns, let MediaWiki start writing to them but not yet reading from them, backfill the older rows, then update MediaWiki again to consume the new columns. But I'm not sure that that would accomplish anything beyond making the changes a bit more cumbersome to roll back if needed.

Using DISTINCT is definitely the performance killer.

@Cparle pointed out a possible alternative workaround that avoids a schema change: We could remove the DISTINCT directive and run the query in a loop, using a different random value each time and taking only unique values, until we've reached the requested number of results. The downside is having to run the query multiple times to get a full result set, but without the DISTINCT keyword each individual query is fast.

It might be worth changing the schema in any case — that's ultimately up to the Structured Data team — but if the workaround sounds acceptable, we could do that to fix the performance issue in the meantime.

How many duplicate entries are there (roughly) for each SHA-1? Maybe the LIMIT can be the number of desired results times a multiplier and the DISTINCT can be removed. Any duplicates can be filtered in PHP in that case. Any excess results would be discarded in PHP rather than mysql.

This approach would work, too, now that I'm thinking about it clearly. There are approximately 8 instances of each SHA-1 value in machine_vision_label (one for each label suggestion). Maybe we could bump the multiplier to 10 for good measure and then query for LIMIT (10 * $limit).

Change 575345 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Fix slow random unreviewed image queries

https://gerrit.wikimedia.org/r/575345

OK, I think we should merge and deploy the patch above, and leave the schema changes for another time.

If you can give us specific queries, we can test then in the production environment. If they contain PII, (is: user_id) just create a NDA paste and I can read that and run them in production.
Also, if you believe you'd finally need a schema change, please follow: https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change

To your specific question about when we could run it, it depends on the workload. Also, to be able to finish it, it can take minutes or months, depending on the size of the tables. Given the current sizes of the machine_vision_* tables, I would expect it would take around 1 week since the moment we pick it up.

Change 574898 abandoned by Mholloway:
Schema updates, part 1: Add and populate machine_vision_image columns

Reason:
Let's wait on any potential schema changes for now.

https://gerrit.wikimedia.org/r/574898

Change 574899 abandoned by Mholloway:
Update MachineVision Repository class to use the updated schema

https://gerrit.wikimedia.org/r/574899

Change 574900 abandoned by Mholloway:
Schema updates, part 2: Drop mvl_uploader_id from machine_vision_label

https://gerrit.wikimedia.org/r/574900

Change 575345 merged by jenkins-bot:
[mediawiki/extensions/MachineVision@master] Fix slow random unreviewed image queries

https://gerrit.wikimedia.org/r/575345

The readQueryTime warnings for the random image query ceased as of the above patch going live on Commons yesterday. \o/

If you can give us specific queries, we can test then in the production environment. If they contain PII, (is: user_id) just create a NDA paste and I can read that and run them in production.

I have a follow-up patch here to combine two queries into a join, but it should probably be tested to verify that it results in an overall improvement.

The current state:

  1. A query is run like: SELECT mvi_sha1 FROM machine_vision_image INNER JOIN machine_vision_label ON mvi_id = mvl_mvi_id WHERE mvl_review = 0 AND mvi_rand > 0.5 ORDER BY mvi_rand ASC LIMIT 100;
  2. The results from the above are used in a second query: SELECT img_name FROM image WHERE img_sha1 IN (...);

Proposed patch:

  • A single query is run: SELECT img_name FROM image INNER JOIN (SELECT mvi_sha1 FROM machine_vision_image INNER JOIN machine_vision_label ON mvi_id = mvl_mvi_id WHERE mvl_review = 0 AND mvi_rand > 0.5 ORDER BY mvi_rand ASC LIMIT 100) AS machinevision ON img_sha1 = machinevision.mvi_sha1;

Thank for queries.

This query looks good to me. From what I can see the query runs fast (0.03) on a commons host.

SELECT img_name FROM image INNER JOIN (SELECT mvi_sha1 FROM machine_vision_image INNER JOIN machine_vision_label ON mvi_id = mvl_mvi_id WHERE mvl_review = 0 AND mvi_rand > 0.5 ORDER BY mvi_rand ASC LIMIT 100) AS machinevision ON img_sha1 = machinevision.mvi_sha1;

It is scanning around 400k but it is not filesorting or anything from what I can see:

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 20    |
| Handler_icp_match          | 20    |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 117   |
| Handler_read_last          | 0     |
| Handler_read_next          | 221   |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 101   |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 100   |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

I have also examined this query on 10.4, which gives us access to the internal optimize trace and it is choosing the best query plan based on cost: mvi_rand is chosen over the PK based on the possible rows it would scan.