
Populate MachineVision databases for images commonly returned by search
Open, Needs Triage · Public · 5 Estimated Story Points

Description

As a developer working on autocomplete, I would like MachineVision predictions for historical search results so I can use them in T250436.

Search Platform would like to use MachineVision predictions as part of a process that works with user-submitted queries and determines which are useful to propose to other users as query completions. In initial explorations we found it viable to bring this data together, but the data is not complete enough for our use case. While the MachineVision database for commonswiki contains results for 5.7M pages, the overlap between that set and the ~16M titles returned by search between Jan 1 and Feb 8 is only around 70k images.
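For context, the overlap number comes from intersecting the two page_id sets. A minimal sketch of that comparison in Python, assuming each set has been exported to a one-column text file (the file names here are hypothetical, not the actual exports used):

# Hypothetical sketch: intersect MachineVision page_ids with search-result page_ids.
def load_ids(path):
    with open(path) as f:
        return {int(line.strip()) for line in f if line.strip()}

mv_ids = load_ids("machine_vision_page_ids.txt")      # ~5.7M rows
search_ids = load_ids("search_result_page_ids.txt")   # ~16M rows
print(len(mv_ids & search_ids))                       # observed: only ~70k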

6.5M pages to import by page_id, unsorted: https://analytics.wikimedia.org/published/datasets/one-off/ebernhardson/common_fulltext_search_page_ids.commonswiki.20210101-20210210.csv.gz

AC: MachineVision databases for commonswiki contain predictions for common search results

Event Timeline


How many titles are reasonable to process?

@EBernhardson said in IRC that from Jan 1 to Feb 8, 16.3M unique titles were returned by full_text search to the API and web on commonswiki. 7.7M were seen by more than one identity, 2.7M by more than 5. "Seen" here is a bit generous; it only means the title was returned, not necessarily that anyone scrolled down to it.

We have more than enough credits to cover the 7.7M number. 16.3M would be a bit too much. I don't know how long 7.7M would take to process, but if it's reasonable that would be my recommendation.

CSV report, containing three columns: page_id, num_times_seen, num_ident_seen. MV likely only needs the first, but having the other two should make it trivial to re-filter if we want to. It contains the page_ids returned to two or more identities. I spent a little more time cleaning up the data this time around; this version contains 6.5M page_ids. It does not filter out images MV already has predictions for.

https://analytics.wikimedia.org/published/datasets/one-off/ebernhardson/common_fulltext_search_page_ids.commonswiki.20210101-20210210.csv.gz
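For anyone consuming the file, a minimal sketch of pulling out the page_id column with the Python standard library (column order as described above; whether the file has a header row is a guess, so the digit check below skips one if present):

import csv, gzip

# Sketch: read page_ids from the published gzipped CSV.
with gzip.open("common_fulltext_search_page_ids.commonswiki.20210101-20210210.csv.gz",
               mode="rt") as f:
    reader = csv.reader(f)
    page_ids = [int(row[0]) for row in reader if row and row[0].isdigit()]

print(len(page_ids))  # ~6.5M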

Thanks @EBernhardson. So that we can prioritize effectively on the SD team side - is this blocking the query suggestion work right now? When do you need this by to continue the work?

If everything is fine, I can probably just do this myself? It looks like I need to format the input to match the file lists that createFileListFromCategoriesAndTemplates.php generates, chunk it into reasonable sizes, and run fetchSuggestions.php over the chunks (rough sketch below).
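Roughly what I have in mind, as a Python sketch. The mwscript invocation and the --filename flag are assumptions; the real fetchSuggestions.php options may differ:

import subprocess

CHUNK = 10_000

# Sketch: split the file list into 10k-title chunks and run the
# MachineVision fetch script over each chunk sequentially.
with open("file_list.txt") as f:
    titles = [line.rstrip("\n") for line in f if line.strip()]

for i in range(0, len(titles), CHUNK):
    chunk_path = f"chunk_{i // CHUNK:04d}.txt"
    with open(chunk_path, "w") as out:
        out.write("\n".join(titles[i:i + CHUNK]) + "\n")
    # Invocation is illustrative; exact arguments are a guess.
    subprocess.run(
        ["mwscript", "extensions/MachineVision/maintenance/fetchSuggestions.php",
         "--wiki=commonswiki", f"--filename={chunk_path}"],
        check=True,
    )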


Sounds good to me! @matthiasmullie or @Cparle do you see any issues?

Gehel set the point value for this task to 5.

Mentioned in SAL (#wikimedia-operations) [2021-02-20T00:15:22Z] <ebernhardson> start batch processing images through MachineVision fetchSuggestions.php for T274220 on mwmaint1002

Been running over the weekend. Inputs are split to have 10k files per run. The mean time to run has been 176 minutes per 10k files, or around one file per second. At one file per second we are looking at just short of 70 days of constant running to go through 6M images.
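For reference, the arithmetic behind those numbers:

# Observed: 176 minutes per 10k-file batch.
seconds_per_file = 176 * 60 / 10_000          # ~1.06 s/file, call it 1 s
days_at_one_per_sec = 6_000_000 / 86_400      # 6M files at 1 file/s
print(f"{seconds_per_file:.2f} s/file -> ~{days_at_one_per_sec:.1f} days")  # ~69.4 days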

@matthiasmullie @Cparle Any guidance on parallelism? I can certainly make this run the fetch script multiple times in parallel for different inputs, but I'm not sure what is appropriate.

I don't think running in parallel with different inputs would be a problem.
Ping @Mholloway in case he has thoughts.

No, it shouldn't be a problem. I was being extremely conservative when I wrote that script, more because I was worried about blowing past our Google Cloud Vision budget than for any other reason. Either updating the script to be more efficient or running multiple instances in parallel should be fine.
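For example, a minimal way to keep several instances busy on disjoint inputs (reusing the hypothetical chunk files and invocation sketched above):

import glob, subprocess
from concurrent.futures import ThreadPoolExecutor

# Sketch: run up to 5 fetchSuggestions.php instances at once, each on
# its own chunk file, so no two processes share an input.
def run_chunk(path):
    subprocess.run(
        ["mwscript", "extensions/MachineVision/maintenance/fetchSuggestions.php",
         "--wiki=commonswiki", f"--filename={path}"],
        check=True,
    )

with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(run_chunk, sorted(glob.glob("chunk_*.txt"))))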

This has been running with 5 processes for most of a week now. Unfortunately, ~3 of the 5 keep dying due to database locking issues. This suggests the code is holding database locks longer than it should; it's not clear yet whether we need to dig into it. Because these run in batches of 10k images and a failed batch has to be retried from the start, I think some predictions are being requested twice.

For now I'm going to let it keep running and restart the imports when they fail (a sketch of automating those retries follows the stack trace below).

Related stack trace:

Wikimedia\Rdbms\DBQueryError from line 1703 of /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php: Error 1205: Lock wait timeout exceeded; try restarting transaction (10.64.48.124)
Function: MediaWiki\Extension\MachineVision\Repository::insertLabels
Query: INSERT IGNORE INTO `machine_vision_image` (mvi_sha1,mvi_priority,mvi_rand) VALUES ('mc9cmscpm8w47cvqtf9x644lorb39ro',0,'0.71898993510706')

#0 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(1687): Wikimedia\Rdbms\Database->getQueryException('Lock wait timeo...', 1205, 'INSERT IGNORE I...', 'MediaWiki\\Exten...')
#1 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(1662): Wikimedia\Rdbms\Database->getQueryExceptionAndLog('Lock wait timeo...', 1205, 'INSERT IGNORE I...', 'MediaWiki\\Exten...')
#2 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(1231): Wikimedia\Rdbms\Database->reportQueryError('Lock wait timeo...', 1205, 'INSERT IGNORE I...', 'MediaWiki\\Exten...', false)
#3 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(2367): Wikimedia\Rdbms\Database->query('INSERT IGNORE I...', 'MediaWiki\\Exten...', 128)
#4 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(2327): Wikimedia\Rdbms\Database->doInsertNonConflicting('machine_vision_...', Array, 'MediaWiki\\Exten...')
#5 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->insert('machine_vision_...', Array, 'MediaWiki\\Exten...', Array)
#6 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/DBConnRef.php(369): Wikimedia\Rdbms\DBConnRef->__call('insert', Array)
#7 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/src/Repository.php(97): Wikimedia\Rdbms\DBConnRef->insert('machine_vision_...', Array, 'MediaWiki\\Exten...', Array)
#8 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/src/Client/GoogleCloudVisionClient.php(209): MediaWiki\Extension\MachineVision\Repository->insertLabels('mc9cmscpm8w47cv...', 'google', 7232155, Array, 0, 0)
#9 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/maintenance/fetchSuggestions.php(144): MediaWiki\Extension\MachineVision\Client\GoogleCloudVisionClient->fetchAnnotations('google', Object(LocalFile), 0)
#10 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/maintenance/fetchSuggestions.php(97): MediaWiki\Extension\MachineVision\Maintenance\FetchSuggestions->fetchForFile(Object(LocalFile), 0)
#11 /srv/mediawiki/php-1.36.0-wmf.32/maintenance/doMaintenance.php(106): MediaWiki\Extension\MachineVision\Maintenance\FetchSuggestions->execute()
#12 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/maintenance/fetchSuggestions.php(178): require_once('/srv/mediawiki/...')
#13 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#14 {main}
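Given the above, a sketch of a retry wrapper that restarts a failed batch automatically, instead of restarting by hand (hypothetical; it reuses the invocation guessed at earlier, and note that each retry re-requests the whole batch's predictions, as mentioned above):

import subprocess, time

# Sketch: retry a batch when the maintenance script dies (e.g. on a
# lock wait timeout), with a short pause between attempts.
def run_with_retries(chunk_path, max_attempts=3, pause_s=60):
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(
            ["mwscript", "extensions/MachineVision/maintenance/fetchSuggestions.php",
             "--wiki=commonswiki", f"--filename={chunk_path}"])
        if result.returncode == 0:
            return
        print(f"{chunk_path}: attempt {attempt} failed "
              f"(exit {result.returncode}), retrying in {pause_s}s")
        time.sleep(pause_s)
    raise RuntimeError(f"{chunk_path} failed after {max_attempts} attempts")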