
Populate MachineVision databases for images commonly returned by search
Closed, Declined · Public · 5 Estimated Story Points

Description

As a developer working on autocomplete I would like MachineVision predictions for historical search results so I can use them in T250436.

Search Platform would like to use MachineVision predictions as part of a process to work with user-submitted queries and determine which are useful to propose to other users as query completions. In initial explorations we found it viable to bring this data together, but the data is not complete enough for our use case. While the MachineVision database for commonswiki contains results for 5.7M pages, the overlap between that set and the ~16M titles returned by search between Jan 1 and Feb 8 is only around 70k images.

6.5M pages to import by page_id, unsorted: https://analytics.wikimedia.org/published/datasets/one-off/ebernhardson/common_fulltext_search_page_ids.commonswiki.20210101-20210210.csv.gz

AC: MachineVision databases for commonswiki contain predictions for common search results

Event Timeline


How many titles are reasonable to process?

@EBernhardson said in IRC that from Jan 1 - Feb 8, 16.3M unique titles were returned by full_text search to the API and web on commonswiki. 7.7M were seen by more than one identity, 2.7M by more than 5. "Seen" here is a bit generous; it only means returned, not necessarily that users scrolled far enough to see them.

We have more than enough credits to cover the 7.7M number. 16.3M would be a bit too much. I don't know how long 7.7M would take to process, but if it's reasonable that would be my recommendation.

CSV report, containing three columns: page_id, num_times_seen, num_ident_seen. Likely MV only needs the first, but if we want to re-filter this should make it trivial. It contains page_ids returned to two or more identities. I spent a little more time cleaning up the data this time around; this contains 6.5M page_ids. It does not filter out images MV already has predictions for.

https://analytics.wikimedia.org/published/datasets/one-off/ebernhardson/common_fulltext_search_page_ids.commonswiki.20210101-20210210.csv.gz
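Re-filtering that CSV could look roughly like the sketch below, assuming the three columns above in that order and no header row; the output file name and the threshold are placeholders.

```
import csv
import gzip

# Minimal sketch: keep only page_ids seen by at least N distinct identities.
# Assumes the gzipped CSV has the three columns described above, in order
# (page_id, num_times_seen, num_ident_seen) and no header row.
MIN_IDENTITIES = 5  # arbitrary threshold for illustration

with gzip.open(
    "common_fulltext_search_page_ids.commonswiki.20210101-20210210.csv.gz", "rt"
) as fin, open("filtered_page_ids.txt", "w") as fout:
    for page_id, num_times_seen, num_ident_seen in csv.reader(fin):
        if int(num_ident_seen) >= MIN_IDENTITIES:
            fout.write(page_id + "\n")
```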

Thanks @EBernhardson. So that we can prioritize effectively on the SD team side - is this blocking the query suggestion work right now? When do you need this by to continue the work?

If everything is fine, I can probably just do this myself? It looks like I need to format the input to match the way createFileListFromCategoriesAndTemplates.php generates a list of files to run, chunk it into reasonable sizes, and run fetchSuggestions.php over each chunk.

Sounds good to me! @matthiasmullie or @Cparle do you see any issues?

Gehel set the point value for this task to 5.

Mentioned in SAL (#wikimedia-operations) [2021-02-20T00:15:22Z] <ebernhardson> start batch processing images through MachineVision fetchSuggestions.php for T274220 on mwmaint1002

Been running over the weekend. Inputs are split to have 10k files per run. The mean time to run has been 176 minutes per 10k files, or around one file per second. At one file per second we are looking at just short of 70 days of constant running to go through 6M images.
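The splitting itself is straightforward; a minimal sketch (file names are placeholders, the 10k chunk size matches the batches described above):

```
# Minimal sketch: split a newline-delimited file list into fixed-size chunks,
# one output file per chunk, to be fed to fetchSuggestions.php separately.
# Input/output file names are placeholders; chunk size matches the 10k batches.
CHUNK_SIZE = 10_000

with open("common_fulltext_search_titles_for_images.commonswiki.txt") as fin:
    titles = [line.rstrip("\n") for line in fin if line.strip()]

for i in range(0, len(titles), CHUNK_SIZE):
    with open(f"mv_batch_{i // CHUNK_SIZE:04d}.txt", "w") as fout:
        fout.write("\n".join(titles[i:i + CHUNK_SIZE]) + "\n")
```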

@matthiasmullie @Cparle Any guidance on parallelism? I can certainly make this run the fetch script multiple times in parallel for different inputs, but I'm not sure what is appropriate.

I don't think running in parallel with different inputs would be a problem.
Ping @Mholloway in case he has thoughts.

No, it shouldn't be a problem. I was being extremely conservative when I wrote that script, more because I was worried about blowing past our Google Cloud Vision budget than for any other reason. Either updating the script to be more efficient or running multiple instances in parallel should be fine.

This has been running with 5 processes for most of a week now. Unfortunately ~3 of the 5 keep dying due to database locking issues. This suggests the code is holding database locks longer than it should; it's not clear yet whether we need to dig into it. Because these are batches of 10k images and the failed batches have to be retried, I think some predictions are being requested twice.

For now I'm going to let it keep running, restarting the imports when they fail.

Related stack trace:

Wikimedia\Rdbms\DBQueryError from line 1703 of /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php: Error 1205: Lock wait timeout exceeded; try restarting transaction (10.64.48.124)
Function: MediaWiki\Extension\MachineVision\Repository::insertLabels
Query: INSERT IGNORE INTO `machine_vision_image` (mvi_sha1,mvi_priority,mvi_rand) VALUES ('mc9cmscpm8w47cvqtf9x644lorb39ro',0,'0.71898993510706')

#0 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(1687): Wikimedia\Rdbms\Database->getQueryException('Lock wait timeo...', 1205, 'INSERT IGNORE I...', 'MediaWiki\\Exten...')
#1 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(1662): Wikimedia\Rdbms\Database->getQueryExceptionAndLog('Lock wait timeo...', 1205, 'INSERT IGNORE I...', 'MediaWiki\\Exten...')
#2 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(1231): Wikimedia\Rdbms\Database->reportQueryError('Lock wait timeo...', 1205, 'INSERT IGNORE I...', 'MediaWiki\\Exten...', false)
#3 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(2367): Wikimedia\Rdbms\Database->query('INSERT IGNORE I...', 'MediaWiki\\Exten...', 128)
#4 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/Database.php(2327): Wikimedia\Rdbms\Database->doInsertNonConflicting('machine_vision_...', Array, 'MediaWiki\\Exten...')
#5 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->insert('machine_vision_...', Array, 'MediaWiki\\Exten...', Array)
#6 /srv/mediawiki/php-1.36.0-wmf.32/includes/libs/rdbms/database/DBConnRef.php(369): Wikimedia\Rdbms\DBConnRef->__call('insert', Array)
#7 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/src/Repository.php(97): Wikimedia\Rdbms\DBConnRef->insert('machine_vision_...', Array, 'MediaWiki\\Exten...', Array)
#8 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/src/Client/GoogleCloudVisionClient.php(209): MediaWiki\Extension\MachineVision\Repository->insertLabels('mc9cmscpm8w47cv...', 'google', 7232155, Array, 0, 0)
#9 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/maintenance/fetchSuggestions.php(144): MediaWiki\Extension\MachineVision\Client\GoogleCloudVisionClient->fetchAnnotations('google', Object(LocalFile), 0)
#10 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/maintenance/fetchSuggestions.php(97): MediaWiki\Extension\MachineVision\Maintenance\FetchSuggestions->fetchForFile(Object(LocalFile), 0)
#11 /srv/mediawiki/php-1.36.0-wmf.32/maintenance/doMaintenance.php(106): MediaWiki\Extension\MachineVision\Maintenance\FetchSuggestions->execute()
#12 /srv/mediawiki/php-1.36.0-wmf.32/extensions/MachineVision/maintenance/fetchSuggestions.php(178): require_once('/srv/mediawiki/...')
#13 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#14 {main}
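Running several batches in parallel and restarting the ones that die could look roughly like the sketch below; the mwscript invocation is a placeholder rather than the literal command used, and the batch file names, worker count, and retry limit are assumptions.

```
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Minimal sketch: run fetchSuggestions.php over several batch files in
# parallel and retry batches that die (e.g. on lock wait timeouts).
# The command below is a placeholder, not the literal invocation used;
# adjust it to however the maintenance script is actually run on mwmaint.
BATCHES = [f"mv_batch_{i:04d}.txt" for i in range(5)]  # hypothetical names
MAX_ATTEMPTS = 3

def run_batch(batch_file: str) -> None:
    cmd = [
        "mwscript",
        "extensions/MachineVision/maintenance/fetchSuggestions.php",
        "--wiki=commonswiki",
        batch_file,  # placeholder: pass the batch however the script expects
    ]
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return
        print(f"{batch_file}: attempt {attempt} failed, retrying")
    print(f"{batch_file}: giving up after {MAX_ATTEMPTS} attempts")

# Five workers, matching the five parallel processes mentioned above.
with ThreadPoolExecutor(max_workers=5) as pool:
    pool.map(run_batch, BATCHES)
```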

So far 352 files have been processed, with 115 remaining.

All imports have completed. Next step is to re-run the previous work joining the datasets and verify we now have an acceptable percentage of queries with predictions.

The database is imported to ebernhardson.machine_vision_safe_search/date=20210323; I haven't had a chance to dig into it yet.

Some preliminary stats:

  • Initial dataset was 6.51M page_ids.
  • When preparing the dataset for processing by MachineVision, the page_ids were looked up in an mw replica database; the resulting file list contained only 4.55M pages. Unclear why 30% of page_ids were not found.
  • Of the 4.55M pages requested, a prior dump from February had 14% coverage; a dump taken March 23rd, after the import completed, has 4.42M matching pages (97% coverage). Overall we can say the requested images are essentially all now available under the requested image names.
  • Joining against the original dataset of 6.51M page_ids finds 4.44M matching pages (68% of page_ids, but the same 97% coverage of filenames). A sketch of this kind of coverage join is below.
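The coverage join referenced in the last bullet could be sketched roughly as follows; the table partition is the one named above, but the column names (page_id, date) are assumptions rather than the real schema:

```
from pyspark.sql import SparkSession, functions as F

# Minimal sketch of the coverage join: how many of the requested page_ids
# have a prediction in the imported dump. Column names are assumptions
# for illustration only, not the real schema.
spark = SparkSession.builder.getOrCreate()

requested = (
    spark.read
    .csv("common_fulltext_search_page_ids.commonswiki.20210101-20210210.csv.gz")
    .toDF("page_id", "num_times_seen", "num_ident_seen")
    .withColumn("page_id", F.col("page_id").cast("long"))
)

predictions = (
    spark.table("ebernhardson.machine_vision_safe_search")
    .where(F.col("date") == "20210323")
    .select("page_id")  # assumed column; the dump may key on title or sha1 instead
    .distinct()
)

covered = requested.join(predictions, on="page_id", how="inner").count()
total = requested.count()
print(f"coverage: {covered}/{total} = {covered / total:.1%}")
```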

From the above we can conclude the import of image classifications was successful. Further work is necessary to verify the predictions are fit for purpose, meaning that having coverage of images previously returned by search leads to reasonable coverage of images returned by search in the future. A preliminary look suggests the coverage declines relatively quickly, but more direct analysis needs to be done, comparing the page_ids previously dumped against new results, to verify we return similar sets of images over time.

Unclear why 30% of page_ids were not found.

Poking through my bash history, the dataset I ran was created with the command line appended below, where the input is the 6.5M titles and the output is 4.5M image titles. This is missing the case-insensitive flag (-i), but that would have only increased the output to 4.9M image titles. Checking a sample of 10k non-matching titles, we have 8k pdf, 700 djvu, and then ogg, ogv, tiff, wav, mp3, ... in descending order. In summary, those 30% were presumed to be "not images" by a rough heuristic that looks to be correct in the majority of cases.

egrep '.(tif|gif|svg|png|jpg|jpeg)$' common_fulltext_search_titles.commonswiki.20210101-20210210.txt | pv -l > common_fulltext_search_titles_for_images.commonswiki.20210101-20210201.txt
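A case-insensitive Python version of the same heuristic, for comparison (a sketch; the dot is escaped here and the output file name is a placeholder):

```
import re

# Minimal sketch: the same "looks like an image" extension heuristic as the
# egrep above, but case-insensitive (the -i flag mentioned earlier) and with
# the dot escaped.
IMAGE_RE = re.compile(r"\.(tif|gif|svg|png|jpg|jpeg)$", re.IGNORECASE)

with open("common_fulltext_search_titles.commonswiki.20210101-20210210.txt") as fin, \
        open("common_fulltext_search_titles_for_images.commonswiki.txt", "w") as fout:
    for line in fin:
        if IMAGE_RE.search(line.rstrip("\n")):
            fout.write(line)
```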

Change 709550 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] [WIP] Notebooks for query completion data processing

https://gerrit.wikimedia.org/r/709550