[[ https://www.mediawiki.org/wiki/Extension:MachineVision | Extension:MachineVision ]] needs a DBA review before it can be deployed to production. MachineVision is an extension for interacting with third-party machine vision providers, storing results, and serving them for use on-wiki. Its initial on-wiki use is to support the Machine-Aided Depicts project on Wikimedia Commons.
The tables used by the extension are described on [[ https://www.mediawiki.org/wiki/Extension:MachineVision/Schema | Extension:MachineVision/Schema ]] and its subpages.
Target deployment date: **Wednesday, October 9, 2019**
---
All PHP code interacting with the database (aside from maintenance scripts) is contained in the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/MachineVision/+/master/src/Repository.php | Repository ]] class.
Query and table usage details:
* machine_vision_provider is to be used with the NameTableStore construct in MediaWiki. It associates provider names with numeric IDs. It will probably only have one row for the foreseeable future.
* machine_vision_freebase_mapping holds one-to-one mappings between Freebase IDs (many of which are still in use in the Google Knowledge Graph) and Wikidata IDs. It will be populated with a [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/MachineVision/+/master/maintenance/populateFreebaseMapping.php | maintenance script ]] and used to look up Wikidata IDs based on received Knowledge Graph/Freebase IDs as labeling responses are received from the Google Cloud Vision API. It will hold approximately 2.1 million rows. (Note: Use of Google Cloud Vision as a labeling provider is tentative.)
* machine_vision_suggestion will be updated as labeling responses are received. The data it contains is mostly for planned future use, e.g., to compare the performance of suggestions from different providers. It will not be regularly queried.
* machine_vision_label will be the hottest table. It will be populated as labeling responses are received, and will support queries to identify images with unreviewed labels, to retrieve those labels, and to update their review state as reviews are submitted via the action API.
* machine_vision_label and machine_vision_suggestion should be the same size as long as there is only one machine vision provider in use. Assuming 10 label suggestions are returned per labeling request (probably higher than reality, to be on the safe side), both tables will contain: ( ~260,000 featured/valued/quality images + ~2 million images used in mainspace pages on non-Commons wikis ) x 10 label suggestions per image = ~22.6 million rows.
* If one or more additional machine vision providers are added, and they are used to request labels for the same images, the machine_vision_suggestion table would grow linearly with the number of providers. We would expect the machine_vision_label table to grow much more slowly, since many of the suggested labels would duplicate those already in the table.
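
The row estimates above can be reproduced with a short calculation. This is only a rough sketch: the image counts and the 10-labels-per-image figure come from the estimates above, while the function names and the `new_label_percent` parameter (the share of an additional provider's suggestions that are not already in machine_vision_label) are hypothetical and not measured values.

```python
def estimate_rows(featured_images=260_000,
                  mainspace_images=2_000_000,
                  labels_per_image=10):
    """Rows per table with a single provider: (images in scope) x (labels per image)."""
    return (featured_images + mainspace_images) * labels_per_image


def estimate_tables(providers=1, new_label_percent=20):
    """Return (suggestion_rows, label_rows) for a given number of providers.

    machine_vision_suggestion grows linearly with the number of providers.
    machine_vision_label only grows by the labels that are genuinely new;
    new_label_percent is an assumed, illustrative overlap figure.
    """
    base = estimate_rows()
    suggestion_rows = base * providers
    extra_labels = base * new_label_percent // 100 * (providers - 1)
    return suggestion_rows, base + extra_labels


if __name__ == "__main__":
    print(estimate_rows())
    print(estimate_tables(providers=2))
```

With one provider both tables come out at 22.6 million rows, matching the estimate above. The two-provider figures depend entirely on the assumed overlap and are illustrative only.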