Page MenuHomePhabricator

DBA review for the MachineVision extension
Closed, ResolvedPublic

Description

Extension:MachineVision needs a DBA review before it can be deployed to production. MachineVision is an extension for interacting with third-party machine vision providers, storing results, and serving them for use on-wiki. Its initial on-wiki use is to support the Machine-Aided Depicts project on Wikimedia Commons.

The tables used by the extension are described on Extension:MachineVision/Schema and its subpages. The extension is only planned for deployment to testcommonswiki and commonswiki.

Target deployment date: Wednesday, October 9, 2019


All PHP code interacting with the database (aside from maintenance scripts) is contained in the Repository class.

Query and table usage details:

  • machine_vision_provider is to be used with the NameTableStore construct in MediaWiki. It associates provider names with numeric IDs. It will probably only have one row for the foreseeable future.
  • machine_vision_freebase_mapping holds one-to-one mappings between Freebase IDs (many of which are still in use in the Google Knowledge Graph) and Wikidata IDs. It will be populated with a maintenance script and used to look up Wikidata IDs based on received Knowledge Graph/Freebase IDs as labeling responses are received from the Google Cloud Vision API. It will hold approximately 2.1 million rows. (Note: Use of Google Cloud Vision as a labeling provider is tentative.)
  • machine_vision_suggestion will be updated as labeling responses are received. The data in contains is mostly for planned future use, e.g., to compare the performance of suggestions from different providers. It will not be regularly queried.
  • machine_vision_label will be the hottest table. It will be populated as labeling responses are received, and will support queries to identify images with unreviewed labels, to retreive those labels, and to update their review state as reviews are submitted via the action API.
  • machine_vision_label and machine_vision_suggestion should be the same size as long as there is only one machine vision provider in use. Assuming 10 label suggestions are returned per labeling request, both tables will contain: ( ~260,000 featured/valued/quality images ) + ( ~2 million images used in mainspace pages on non-Commons wikis ) x 10 label suggestions per image (probably higher than reality, to be on the safe side) = ~22.6 million rows.
  • If one or more additional machine vision providers are added, and they are used to request labels for the same images, the machine_vision_suggestion table would grow linearly, but we would expect the machine_vision_label table to grow much more slowly, since we would expect many of the suggested labels to be duplicates of those already in the table.

Related Objects

StatusSubtypeAssignedTask
Resolved Ramsey-WMF
Resolved Mholloway
Resolved Mholloway
Resolved Mholloway
Resolvedsbassett
Resolved Mholloway
ResolvedMarostegui
Resolved Mholloway
ResolvedKrinkle
Resolved Mholloway
ResolvedNone
Resolved Ramsey-WMF
Resolved Mholloway
Resolved Mholloway
Resolved Mholloway
Resolved Mholloway
DeclinedReedy

Event Timeline

Mholloway renamed this task from Request DBA review for the machine vision middleware to DBA review for the MachineVision extension.Aug 20 2019, 3:11 PM
Mholloway updated the task description. (Show Details)
Mholloway updated the task description. (Show Details)
Mholloway moved this task from Backlog to Tracking on the MachineVision board.
jcrespo moved this task from Triage to Backlog on the DBA board.EditedSep 10 2019, 4:26 AM
jcrespo subscribed.

Hi, may I ask to mockup (although as realistically as possible), 1 or 2 rows per table to better understand the amount of space that will be needed. Thanks for the row estimations!

Also while doing that please indicate if any of those rows should be private (not replicated to public wikireplicas for cloud tools usage).

@jcrespo Sure. Here are a few samples from my local machine:

  • machine_vision_provider
(13:48) vagrant@localhost:[commonswiki]> select * from machine_vision_provider;
+--------+----------+
| mvp_id | mvp_name |
+--------+----------+
|      2 | random   |
|      3 | google   |
+--------+----------+
2 rows in set (0.00 sec)
  • machine_vision_freebase_mapping
(13:52) vagrant@localhost:[commonswiki]> select * from machine_vision_freebase_mapping limit 2;
+------------------+------------------+
| mvfm_freebase_id | mvfm_wikidata_id |
+------------------+------------------+
| /m/010005        | Q974334          |
| /m/01000j        | Q981582          |
+------------------+------------------+
2 rows in set (0.00 sec)
  • machine_vision_suggestion
(13:55) vagrant@localhost:[commonswiki]> select * from machine_vision_suggestion where mvs_confidence > 0 limit 2;
+------------+-----------------+----------------+----------------+
| mvs_mvl_id | mvs_provider_id | mvs_timestamp  | mvs_confidence |
+------------+-----------------+----------------+----------------+
|         59 |               3 | 20190829213134 |       0.986427 |
|         60 |               3 | 20190829213134 |       0.939012 |
+------------+-----------------+----------------+----------------+
2 rows in set (0.00 sec)
  • machine_vision_label
(13:55) vagrant@localhost:[commonswiki]> select * from machine_vision_label limit 2;
+--------+---------------------------------+-----------------+------------+-----------------+--------------------+
| mvl_id | mvl_image_sha1                  | mvl_wikidata_id | mvl_review | mvl_uploader_id | mvl_suggested_time |
+--------+---------------------------------+-----------------+------------+-----------------+--------------------+
|     52 | dgbr1rv2hfcnun8h6yzpzkggmamqgyk | Q2934           |          1 |               2 | 15677450935952     |
|     69 | 6bb3le70xhjl4ogh1a36m9xhg8a0nag | Q31528          |          1 |               2 | 15677451442816     |
+--------+---------------------------------+-----------------+------------+-----------------+--------------------+
2 rows in set (0.00 sec)

Probably the most unusual thing going on in this extension, which I should mention specifically here, is the getTitlesWithUnreviewedLabels method in the Repository class, which returns the least recently added or served image. That functionality was added recently in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MachineVision/+/534724/.

The strategy for doing this is to select the row with the lowest mvl_suggested_time with:

SELECT DISTINCT mvl_image_sha1 FROM machine_vision_label WHERE mvl_review = 0 ORDER BY mvl_suggested_time LIMIT 1 FOR UPDATE;

and then to immediately update the mvl_suggested_time for the returned row with the current time.

EXPLAIN for that query:

(14:08) vagrant@localhost:[commonswiki]> EXPLAIN SELECT DISTINCT mvl_image_sha1 FROM machine_vision_label WHERE mvl_review = 0 ORDER BY mvl_suggested_time LIMIT 1 FOR UPDATE;
+------+-------------+----------------------+-------+------------------------------------------+--------------------+---------+------+------+------------------------------+
| id   | select_type | table                | type  | possible_keys                            | key                | key_len | ref  | rows | Extra                        |
+------+-------------+----------------------+-------+------------------------------------------+--------------------+---------+------+------+------------------------------+
|    1 | SIMPLE      | machine_vision_label | index | mvl_review_sha1,mvl_sha1_review_uploader | mvl_suggested_time | 14      | NULL |    1 | Using where; Using temporary |
+------+-------------+----------------------+-------+------------------------------------------+--------------------+---------+------+------+------------------------------+

Please see the second part of my request for information on the comment above.

@jcrespo Ah, sorry about that. I don't believe anything here needs to be private.

jcrespo triaged this task as Medium priority.Sep 11 2019, 12:07 PM
jcrespo moved this task from Backlog to Pending comment on the DBA board.

Hi @jcrespo, just checking in on this. Our goal is to deploy to production the last week of this month. Have you had a chance to look this over?

There have been a couple of changes since 9/10:

  • A new machine_vision_safe_search table was added. Data is inserted at the same time label suggestions are first inserted into other tables. The data is used upon when inserting label suggestions and at this point is not queried further. A single row per image is expected. Sample data:
(21:08) vagrant@localhost:[commonswiki]> select * from machine_vision_safe_search;
+---------------------------------+------------+------------+--------------+---------------+-----------+
| mvss_image_sha1                 | mvss_adult | mvss_spoof | mvss_medical | mvss_violence | mvss_racy |
+---------------------------------+------------+------------+--------------+---------------+-----------+
| r1h7rwwnm81x0eofl8jwwlix3a5p8ia |          2 |          1 |            1 |             1 |         4 |
+---------------------------------+------------+------------+--------------+---------------+-----------+
  • Also, two new rows were added to machine_vision_label: mvl_reviewer_id and mvl_reviewed_time. mvl_reviewer_id stores the (local) user ID of the user who reviewed a suggested label, and mvl_reviewed_time stores the timestamp of the review (in the same format as is used for mvl_suggested_time). They default to null and are populated when a label is reviewed.

Sadly I am not in charge of databases anymore, @Marostegui will have to do the work.

I am going on holidays in a few days, and I will be gone till the 11th of Novemeber, so I am not sure I will be able to tackle this before I leave. I would actually prefer if @jcrespo could give it a final look as he actually started with this months ago. If that is not possible I will see what I can do before I leave.

OK. We (Product Infrastructure and Structured Data) would greatly appreciate it if one of you can find some time to finish this up soon. Please keep me posted.

So the tables that worries me the most are machine_vision_label machine_vision_suggestion just because of the size estimations (22M as per your numbers). And machine_vision_label because of its activity.
The other tables are tiny enough to not cause issues at this point I believe.

I have been doing some local tests with the rows examples you provided with machine_vision_label and I believe with 22M rows the size on disk might reach around 10GB which is not bad. However, I will probably compress those tables once they are created, so we get them compressed from the start.

CREATE TABLE `machine_vision_label` (
  `mvl_id` int(11) NOT NULL AUTO_INCREMENT,
  `mvl_image_sha1` varbinary(32) NOT NULL,
  `mvl_wikidata_id` varbinary(32) NOT NULL,
  `mvl_review` tinyint(4) NOT NULL DEFAULT '0',
  `mvl_uploader_id` int(10) unsigned NOT NULL DEFAULT '0',
  `mvl_suggested_time` binary(14) NOT NULL,
  `mvl_reviewer_id` int(10) unsigned DEFAULT NULL,
  `mvl_reviewed_time` binary(14) DEFAULT NULL,
  PRIMARY KEY (`mvl_id`),
  UNIQUE KEY `mvl_sha1_wikidata` (`mvl_image_sha1`,`mvl_wikidata_id`),
  KEY `mvl_review_sha1` (`mvl_review`,`mvl_image_sha1`),
  KEY `mvl_sha1_review_uploader` (`mvl_review`,`mvl_uploader_id`),
  KEY `mvl_suggested_time` (`mvl_suggested_time`)

Same for machine_vision_suggestion which is, though, a lot smaller if we estimate its size in 22M rows, which could be around 3-4GB.

CREATE TABLE `machine_vision_suggestion` (
  `mvs_mvl_id` int(11) NOT NULL,
  `mvs_provider_id` int(10) unsigned NOT NULL,
  `mvs_timestamp` binary(14) NOT NULL,
  `mvs_confidence` float unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`mvs_mvl_id`,`mvs_provider_id`)

Table structured has been grabbed from https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/MachineVision/ and I have also applied the patches.
I also assume it is easy and safe to revert/disable the extension if after the deployment we see Commons not behaving as expected and this needs further tackling?
As I said on a previous comment, I won't be available for the release as I will be on vacation, but @jcrespo will oversee how this deployments affect Commons, just in case this needs some adjustments or revert (we don't expect it, but just to be on the safe side).

Also, as per T227355#5479123, everything is fine to be replicated and will be replicated, but remember if you need the data to be accessible on the wiki replicas, you have to create a task for the cloud-services-team to create the views there.
Once this is deployed, at some point, we should add these tables and columns to https://github.com/wikimedia/puppet/blob/production/modules/role/files/mariadb/filtered_tables.txt (as everything is fine to be replicated, it is ok not to do it now).
If you'd have the time to send that patch, that'd be helpful too, we can merge it on our puppet repo for you.

The format is:
table,column,K

K= OK to replicate
F= filtered

In this case all the columns will be "K" as per your comments.

I also assume it is easy and safe to revert/disable the extension if after the deployment we see Commons not behaving as expected and this needs further tackling?

Yes, it should be fine to disable in an emergency.

Also, as per T227355#5479123, everything is fine to be replicated and will be replicated, but remember if you need the data to be accessible on the wiki replicas, you have to create a task for the cloud-services-team to create the views there.
Once this is deployed, at some point, we should add these tables and columns to https://github.com/wikimedia/puppet/blob/production/modules/role/files/mariadb/filtered_tables.txt (as everything is fine to be replicated, it is ok not to do it now).
If you'd have the time to send that patch, that'd be helpful too, we can merge it on our puppet repo for you.

Created T235887 for that follow-up work. In the meantime, I will resolve this task. Thank you very much for the review, @Marostegui!

Mholloway claimed this task.
Mholloway reassigned this task from Mholloway to Marostegui.