
Populate MachineVision databases for images commonly returned by search
Closed, Declined · Public · 5 Estimated Story Points


As a developer working on autocomplete, I would like MachineVision predictions for historical search results so I can use them in T250436.

Search Platform would like to use MachineVision predictions as part of a process that works with user-submitted queries and determines which are useful to propose to other users as query completions. In initial explorations we found it viable to bring this data together, but the data is not complete enough for our use case. While the MachineVision database for commonswiki contains results for 5.7M pages, the overlap between that set and the ~16M titles returned by search between Jan 1 and Feb 8 is only around 70k images.

6.5M pages to import by page_id, unsorted:

AC: MachineVision databases for commonswiki contain predictions for common search results

Event Timeline


How many titles are reasonable to process?

@EBernhardson said in IRC that from Jan 1 to Feb 8, 16.3M unique titles were returned by full_text search to the API and web on commonswiki. 7.7M were seen by more than one identity, and 2.7M by more than five. "Seen" here is a bit generous: it only means the title was returned, not necessarily that the user scrolled down to it.

We have more than enough credits to cover the 7.7M number. 16.3M would be a bit too much. I don't know how long 7.7M would take to process, but if it's reasonable that would be my recommendation.

CSV report containing three columns: page_id, num_times_seen, num_ident_seen. MV likely only needs the first, but if we want to re-filter this should make it trivial. It contains page_ids returned to two or more identities. I spent a little more time cleaning up the data this time around; this version contains 6.5M page_ids. It does not filter out images MV already has predictions for.
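Re-filtering that CSV by a different identity threshold is straightforward. A minimal sketch, assuming a headerless CSV with the three columns in the order described above (the filename and header layout are assumptions):

```python
import csv

def filter_page_ids(path, min_identities=2):
    """Return page_ids seen by at least `min_identities` distinct identities.

    Assumes a headerless CSV laid out as: page_id, num_times_seen, num_ident_seen.
    """
    fieldnames = ["page_id", "num_times_seen", "num_ident_seen"]
    page_ids = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f, fieldnames=fieldnames):
            if int(row["num_ident_seen"]) >= min_identities:
                page_ids.append(int(row["page_id"]))
    return page_ids
```

Raising `min_identities` to 5 would reproduce the stricter 2.7M cut mentioned earlier.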

Thanks @EBernhardson. So that we can prioritize effectively on the SD team side - is this blocking the query suggestion work right now? When do you need this by to continue the work?

If everything is fine, I can probably just do this myself? It looks like I need to format the input to match the way createFileListFromCategoriesAndTemplates.php generates a list of files to run, chunk it into reasonable sizes, and run fetchSuggestions.php over them?


Sounds good to me! @matthiasmullie or @Cparle do you see any issues?
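The plan above (chunk the file list, then run fetchSuggestions.php per chunk) could be sketched roughly as follows. The `mwscript` invocation and the `--filename` option are assumptions, not verified against the script; check the maintenance script's actual options before running.

```python
import subprocess
from pathlib import Path

CHUNK_SIZE = 10_000  # files per batch

def chunk_file(path, out_dir, chunk_size=CHUNK_SIZE):
    """Split a newline-delimited file list into chunk files; return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(path) as f:
        lines = f.readlines()
    chunks = []
    for i in range(0, len(lines), chunk_size):
        chunk_path = out_dir / f"chunk_{i // chunk_size:05d}.txt"
        chunk_path.write_text("".join(lines[i:i + chunk_size]))
        chunks.append(chunk_path)
    return chunks

def run_chunks(chunks):
    """Run the MachineVision fetch script once per chunk, sequentially.

    The exact flags of fetchSuggestions.php are assumed here, not confirmed.
    """
    for chunk in chunks:
        subprocess.run(
            ["mwscript", "extensions/MachineVision/maintenance/fetchSuggestions.php",
             "--wiki=commonswiki", f"--filename={chunk}"],
            check=True,
        )
```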

Gehel set the point value for this task to 5.

Mentioned in SAL (#wikimedia-operations) [2021-02-20T00:15:22Z] <ebernhardson> start batch processing images through MachineVision fetchSuggestions.php for T274220 on mwmaint1002

Been running over the weekend. Inputs are split to have 10k files per run. The mean time to run has been 176 minutes per 10k files, or around one file per second. At one file per second we are looking at just short of 70 days of constant running to go through 6M images.
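The runtime estimate above works out as follows:

```python
# Measured: 176 minutes per 10k-file batch.
seconds_per_file = 176 * 60 / 10_000  # ~1.06 s/file
remaining_files = 6_000_000

days_at_measured_rate = remaining_files * seconds_per_file / 86_400  # ~73 days
days_at_one_per_second = remaining_files * 1.0 / 86_400              # ~69.4 days
# Rounding the rate to "one file per second" gives the "just short of
# 70 days" figure for a single sequential process.
```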

@matthiasmullie @Cparle Any guidance on parallelism? I can certainly make this run the fetch script multiple times in parallel for different inputs, but I'm not sure what is appropriate.

I don't think running in parallel with different inputs would be a problem.
Ping @Mholloway in case he has thoughts.

No, it shouldn't be a problem. I was being extremely conservative when I wrote that script, more because I was worried about blowing past our Google Cloud Vision budget than for any other reason. Either updating the script to be more efficient or running multiple instances in parallel should be fine.

This has been running with 5 processes for most of a week now. Unfortunately ~3 of the 5 keep dying due to database locking issues. This suggests the code is holding database locks longer than it should; it's not yet clear whether we need to dig into it. Because these run in batches of 10k images and the failed batches have to be retried, I think some predictions are being double-requested.

For now I'm going to let it keep running, restarting the imports when they fail.
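The restart-on-failure babysitting can be automated with a small wrapper; the command being retried is a placeholder, and the attempt/backoff parameters are arbitrary. Because the script uses INSERT IGNORE, re-running a failed batch is safe, though it can double-request some predictions as noted above.

```python
import subprocess
import time

def run_with_retries(cmd, max_attempts=5, backoff_seconds=60):
    """Run a command, retrying on non-zero exit (e.g. lock-wait-timeout aborts).

    Returns the attempt number that succeeded; raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(backoff_seconds)
    raise RuntimeError(f"{cmd!r} failed after {max_attempts} attempts")
```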

Related stack trace:

Wikimedia\Rdbms\DBQueryError from line 1703 of /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php: Error 1205: Lock wait timeout exceeded; try restarting transaction (
Function: MediaWiki\Extension\MachineVision\Repository::insertLabels
Query: INSERT IGNORE INTO `machine_vision_image` (mvi_sha1,mvi_priority,mvi_rand) VALUES ('mc9cmscpm8w47cvqtf9x644lorb39ro',0,'0.71898993510706')
#0 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(1687): Wikimedia\Rdbms\Database->getQueryException('Lock wait timeo...', 1205, 'INSERT IGNORE I...', 'MediaWiki\\Exten...')
#1 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(1662): Wikimedia\Rdbms\Database->getQueryExceptionAndLog('Lock wait timeo...', 1205, 'INSERT IGNORE I...', 'MediaWiki\\Exten...')
#2 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(1231): Wikimedia\Rdbms\Database->reportQueryError('Lock wait timeo...', 1205, 'INSERT IGNORE I...', 'MediaWiki\\Exten...', false)
#3 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(2367): Wikimedia\Rdbms\Database->query('INSERT IGNORE I...', 'MediaWiki\\Exten...', 128)
#4 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(2327): Wikimedia\Rdbms\Database->doInsertNonConflicting('machine_vision_...', Array, 'Medi
#5 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->insert('machine_vision_...', Array, 'MediaWiki\\Exten...',
#6 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/DBConnRef.php(369): Wikimedia\Rdbms\DBConnRef->__call('insert', Array)
#7 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/src/Repository.php(97): Wikimedia\Rdbms\DBConnRef->insert('machine_vision_...', Array, 'MediaWiki\\Exten...', Array)
#8 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/src/Client/GoogleCloudVisionClient.php(209): MediaWiki\Extension\MachineVision\Repository->insertLabels('mc9cmscpm8w47cv...', 'google', 7232155, Array, 0, 0)
#9 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/maintenance/fetchSuggestions.php(144): MediaWiki\Extension\MachineVision\Client\GoogleCloudVisionClient->fetchAnnotations('google', Object(LocalFile), 0)
#10 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/maintenance/fetchSuggestions.php(97): MediaWiki\Extension\MachineVision\Maintenance\FetchSuggestions->fetchForFile(Object(LocalFile), 0)
#11 /srv/mediawiki/php-1.36.0-wmf.32/maintenance/doMaintenance.php(106): MediaWiki\Extension\MachineVision\Maintenance\FetchSuggestions->execute()
#12 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/maintenance/fetchSuggestions.php(178): require_once('/srv/mediawiki/...')
#13 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#14 {main}
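Error 1205 generally means one transaction held row/gap locks longer than another was willing to wait. A common mitigation is to commit in small batches so each transaction holds locks only briefly. This is illustrated below with sqlite3 (INSERT OR IGNORE) purely as a sketch of the technique, not the extension's actual code or the MariaDB schema:

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size=100):
    """Insert-or-ignore rows, committing per batch to keep transactions short."""
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        cur.executemany(
            "INSERT OR IGNORE INTO machine_vision_image "
            "(mvi_sha1, mvi_priority, mvi_rand) VALUES (?, ?, ?)",
            rows[i:i + batch_size],
        )
        conn.commit()  # release locks between batches

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machine_vision_image "
    "(mvi_sha1 TEXT PRIMARY KEY, mvi_priority INTEGER, mvi_rand REAL)"
)
insert_in_batches(conn, [(f"sha{i}", 0, 0.5) for i in range(250)])
```

The insert-or-ignore pattern also makes retried batches idempotent: duplicate sha1s are simply skipped.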

So far 352 files have been processed, with 115 remaining.

All imports have completed. Next step is to re-run the previous work joining the datasets and verify we now have an acceptable percentage of queries with predictions.

The database is imported to ebernhardson.machine_vision_safe_search/date=20210323; I haven't had a chance to dig into it yet.

Some preliminary stats:

  • The initial dataset was 6.51M page_ids.
  • When preparing the dataset for processing by MachineVision, the page_ids were looked up in an mw replica database; the resulting file list contained only 4.55M pages. It is unclear why 30% of the page_ids were not found.
  • Of the 4.55M pages requested, a prior dump from February had coverage of 14%; a dump taken March 23rd, after the import completed, has 4.42M matching pages (97% coverage). Overall we can say the requested images are now essentially all available under the requested image names.
  • Joining against the original dataset of 6.51M page_ids finds 4.44M matching pages (68% of page_ids, but the same 97% coverage of filenames).
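The percentages in the stats above can be checked directly from the counts given:

```python
initial_page_ids = 6.51e6      # original dataset
resolved_titles = 4.55e6       # page_ids found in the mw replica
matched_after_import = 4.42e6  # titles present in the March 23 dump
matched_from_original = 4.44e6 # join against the original 6.51M page_ids

title_coverage = matched_after_import / resolved_titles      # ~97%
page_id_coverage = matched_from_original / initial_page_ids  # ~68%
missing_fraction = 1 - resolved_titles / initial_page_ids    # ~30% not found
```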

From the above we can conclude the import of image classifications was successful. Further work is necessary to verify the predictions are fit for purpose, meaning that having coverage of images previously returned by search leads to reasonable coverage of images returned by search in the future. A preliminary look suggests the coverage declines relatively quickly, but more direct analysis needs to be done, comparing the page_ids previously dumped against new results, to verify we do return similar sets of images over time.

Unclear why 30% of page_id's were not found.

Poking through my bash history, the dataset I ran was created with the appended command line, where the input is the 6.5M titles and the output is 4.5M image titles. It is missing the case-insensitive flag (-i), but adding that would only have increased the output to 4.9M image titles. Checking a sample of 10k non-matching titles: 8k pdf, 700 djvu, and then ogg, ogv, tiff, wav, mp3, ... in descending counts. In summary, those 30% were presumed to be "not images" by a rough heuristic that looks to be correct in the majority of cases.

egrep '.(tif|gif|svg|png|jpg|jpeg)$' common_fulltext_search_titles.commonswiki.20210101-20210210.txt | pv -l > common_fulltext_search_titles_for_images.commonswiki.20210101-20210201.txt
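For checking samples, the same heuristic can be expressed in Python, here with the case-insensitive matching the original egrep lacked and with the dot escaped so it only matches a literal extension separator (in the egrep above, the bare `.` matches any character):

```python
import re

# Same extension list as the egrep command, case-insensitive, dot escaped.
IMAGE_RE = re.compile(r"\.(tif|gif|svg|png|jpg|jpeg)$", re.IGNORECASE)

def is_probable_image(title):
    """Rough heuristic: does the title end in a known raster/vector image extension?"""
    return bool(IMAGE_RE.search(title))
```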

Change 709550 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] [WIP] Notebooks for query completion data processing