Page MenuHomePhabricator

DBA review for the MachineVision extension
Open, NormalPublic


Extension:MachineVision needs a DBA review before it can be deployed to production. MachineVision is an extension for interacting with third-party machine vision providers, storing results, and serving them for use on-wiki. Its initial on-wiki use is to support the Machine-Aided Depicts project on Wikimedia Commons.

The tables used by the extension are described on Extension:MachineVision/Schema and its subpages. The extension is only planned for deployment to testcommonswiki and commonswiki.

Target deployment date: Wednesday, October 9, 2019

All PHP code interacting with the database (aside from maintenance scripts) is contained in the Repository class.

Query and table usage details:

  • machine_vision_provider is to be used with the NameTableStore construct in MediaWiki. It associates provider names with numeric IDs. It will probably only have one row for the foreseeable future.
  • machine_vision_freebase_mapping holds one-to-one mappings between Freebase IDs (many of which are still in use in the Google Knowledge Graph) and Wikidata IDs. It will be populated with a maintenance script and used to look up Wikidata IDs based on received Knowledge Graph/Freebase IDs as labeling responses are received from the Google Cloud Vision API. It will hold approximately 2.1 million rows. (Note: Use of Google Cloud Vision as a labeling provider is tentative.)
  • machine_vision_suggestion will be updated as labeling responses are received. The data in contains is mostly for planned future use, e.g., to compare the performance of suggestions from different providers. It will not be regularly queried.
  • machine_vision_label will be the hottest table. It will be populated as labeling responses are received, and will support queries to identify images with unreviewed labels, to retreive those labels, and to update their review state as reviews are submitted via the action API.
  • machine_vision_label and machine_vision_suggestion should be the same size as long as there is only one machine vision provider in use. Assuming 10 label suggestions are returned per labeling request, both tables will contain: ( ~260,000 featured/valued/quality images ) + ( ~2 million images used in mainspace pages on non-Commons wikis ) x 10 label suggestions per image (probably higher than reality, to be on the safe side) = ~22.6 million rows.
  • If one or more additional machine vision providers are added, and they are used to request labels for the same images, the machine_vision_suggestion table would grow linearly, but we would expect the machine_vision_label table to grow much more slowly, since we would expect many of the suggested labels to be duplicates of those already in the table.

Event Timeline

Mholloway created this task.Jul 5 2019, 5:37 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 5 2019, 5:37 PM
Mholloway renamed this task from Request DBA review for the machine vision middleware to DBA review for the MachineVision extension.Aug 20 2019, 3:11 PM
Mholloway updated the task description. (Show Details)
Mholloway updated the task description. (Show Details)
Mholloway moved this task from Backlog to Tracking on the Machine vision board.
Mholloway updated the task description. (Show Details)Sep 10 2019, 12:37 AM
Mholloway updated the task description. (Show Details)
jcrespo moved this task from Triage to Backlog on the DBA board.EditedSep 10 2019, 4:26 AM
jcrespo added a subscriber: jcrespo.

Hi, may I ask to mockup (although as realistically as possible), 1 or 2 rows per table to better understand the amount of space that will be needed. Thanks for the row estimations!

Also while doing that please indicate if any of those rows should be private (not replicated to public wikireplicas for cloud tools usage).

@jcrespo Sure. Here are a few samples from my local machine:

  • machine_vision_provider
(13:48) vagrant@localhost:[commonswiki]> select * from machine_vision_provider;
| mvp_id | mvp_name |
|      2 | random   |
|      3 | google   |
2 rows in set (0.00 sec)
  • machine_vision_freebase_mapping
(13:52) vagrant@localhost:[commonswiki]> select * from machine_vision_freebase_mapping limit 2;
| mvfm_freebase_id | mvfm_wikidata_id |
| /m/010005        | Q974334          |
| /m/01000j        | Q981582          |
2 rows in set (0.00 sec)
  • machine_vision_suggestion
(13:55) vagrant@localhost:[commonswiki]> select * from machine_vision_suggestion where mvs_confidence > 0 limit 2;
| mvs_mvl_id | mvs_provider_id | mvs_timestamp  | mvs_confidence |
|         59 |               3 | 20190829213134 |       0.986427 |
|         60 |               3 | 20190829213134 |       0.939012 |
2 rows in set (0.00 sec)
  • machine_vision_label
(13:55) vagrant@localhost:[commonswiki]> select * from machine_vision_label limit 2;
| mvl_id | mvl_image_sha1                  | mvl_wikidata_id | mvl_review | mvl_uploader_id | mvl_suggested_time |
|     52 | dgbr1rv2hfcnun8h6yzpzkggmamqgyk | Q2934           |          1 |               2 | 15677450935952     |
|     69 | 6bb3le70xhjl4ogh1a36m9xhg8a0nag | Q31528          |          1 |               2 | 15677451442816     |
2 rows in set (0.00 sec)

Probably the most unusual thing going on in this extension, which I should mention specifically here, is the getTitlesWithUnreviewedLabels method in the Repository class, which returns the least recently added or served image. That functionality was added recently in

The strategy for doing this is to select the row with the lowest mvl_suggested_time with:

SELECT DISTINCT mvl_image_sha1 FROM machine_vision_label WHERE mvl_review = 0 ORDER BY mvl_suggested_time LIMIT 1 FOR UPDATE;

and then to immediately update the mvl_suggested_time for the returned row with the current time.

EXPLAIN for that query:

(14:08) vagrant@localhost:[commonswiki]> EXPLAIN SELECT DISTINCT mvl_image_sha1 FROM machine_vision_label WHERE mvl_review = 0 ORDER BY mvl_suggested_time LIMIT 1 FOR UPDATE;
| id   | select_type | table                | type  | possible_keys                            | key                | key_len | ref  | rows | Extra                        |
|    1 | SIMPLE      | machine_vision_label | index | mvl_review_sha1,mvl_sha1_review_uploader | mvl_suggested_time | 14      | NULL |    1 | Using where; Using temporary |

Please see the second part of my request for information on the comment above.

@jcrespo Ah, sorry about that. I don't believe anything here needs to be private.

jcrespo triaged this task as Normal priority.Sep 11 2019, 12:07 PM
jcrespo moved this task from Backlog to Next on the DBA board.