Add lead image data to the data file generated for populating weighted_tags in the commons search index
Closed, Invalid · Public

Assigned To

	mfossati

Authored By

	Cparle
	Jan 18 2022, 2:31 PM

Description

User story

As a user, I want image search to give me the best possible results, and I also want to be able to retrieve good image suggestions for unillustrated articles. At the moment, data on whether an image is used as a lead image in a Wikipedia article is not stored in the Commons search index, and we believe both search and image suggestions can be improved using this information. This ticket adds the information to the dataset that is subsequently used to add Wikidata data to the Commons search index.

See https://github.com/cormacparle/commons_wikidata_links for the current Jupyter notebook that gathers the Wikidata data used to populate weighted_tags in the Commons search index.

Add a field to the output data written to the parquet file, containing the Wikidata items of any wiki articles for which the image is the lead image.

e.g. if Image_X is the lead image on https://ga.wikipedia.org/Page_Y (i.e. Image_X's title is the value of either the 'page_image_free' page_prop or the 'page_image' page_prop for https://ga.wikipedia.org/Page_Y)
AND https://ga.wikipedia.org/Page_Y has a corresponding Wikidata ID Q12345,
then for Image_X we'll set the field image.linked.from.wikidata.lead_image/Q12345|<score>
<score> will be an integer between 0 and 1000, proportional to the importance of all pages with Wikidata ID Q12345 across all wikis (using incoming links from the pagelinks table as a measure of "importance").
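
As a minimal sketch of the field format described above, the following Python helper builds the tag string for one (QID, score) pair. The function name `lead_image_tag` and the range check are illustrative assumptions, not code from the notebook.

```python
def lead_image_tag(qid: str, score: int) -> str:
    """Build the weighted tag string for a lead image (hypothetical helper)."""
    # The ticket specifies an integer score between 0 and 1000.
    if not isinstance(score, int) or not 0 <= score <= 1000:
        raise ValueError(f"score must be an integer in [0, 1000], got {score!r}")
    return f"image.linked.from.wikidata.lead_image/{qid}|{score}"


# Image_X is the lead image of a page whose Wikidata ID is Q12345.
tag = lead_image_tag("Q12345", 250)
```

Here `tag` holds `image.linked.from.wikidata.lead_image/Q12345|250`; the score value 250 is made up for the example.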

The extra search data should not be added to any image that is excluded by the current Image Suggestions Algorithm (the existing notebook code already covers this; it is noted here for completeness).

Related Objects

Status	Assigned	Task
Resolved	CBogen	T299781 [EPIC] Image suggestions backend
Resolved	mfossati	T296814 [EPIC] Article-level image suggestions data pipeline
Invalid	mfossati	T299408 Add lead image data to the data file generated for populating weighted_tags in the commons search index

Event Timeline

Cparle created this task. Jan 18 2022, 2:31 PM

Restricted Application added a subscriber: Aklapper. Jan 18 2022, 2:31 PM

Cparle assigned this task to mfossati. Jan 18 2022, 2:31 PM

Cparle mentioned this in T296814: [EPIC] Article-level image suggestions data pipeline.

CBogen moved this task from Incoming to Ready for Development on the Structured-Data-Backlog (Current Work) board. Jan 18 2022, 2:35 PM

CBogen moved this task from Ready for Development to Doing on the Structured-Data-Backlog (Current Work) board.

Currently waiting for T299343.

Progress is tracked on my fork: https://github.com/marfox/commons_wikidata_links/tree/T299408

Cparle edited parent tasks, added: T299781: [EPIC] Image suggestions backend; removed: T296814: [EPIC] Article-level image suggestions data pipeline. Jan 21 2022, 5:24 PM

Cparle mentioned this in T299885: [L] Push unillustrated articles with their suggestions, suggestion reasons and confidence scores to Cassandra. Jan 24 2022, 10:14 AM

Cparle updated the task description. Jan 25 2022, 5:48 PM

Cparle updated the task description.

Cparle mentioned this in T299890: [M] Exclude previously rejected image suggestions when generating new suggestions. Jan 25 2022, 6:11 PM

@mfossati I just noticed that Research's notebook for the original version of this classifies an image as a "lead image" if its title is the value of the 'page_image_free' page_prop or the 'page_image' page_prop in a wiki article. I missed the 'page_image' part in the ticket description, so I'll add it and you can update your code to include those images.

Cparle updated the task description. Jan 28 2022, 3:23 PM

<score> will be an integer between 0 and 1000, proportional to the importance of all pages with wikidata id Q12345 across all wikis (using incoming links via the pagelinks table to give a measure of "importance")

Here's how the score is computed (then rounded to the nearest integer):

all Wiki pages : incoming links per QID = order of magnitude of all Wiki pages : score

Using 1000 instead of the order of magnitude is likely to yield very small scores, due to the low number of incoming links per QID. This was observed on a random sample of 360 rows, so the estimate may be biased.
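
The proportion above can be solved for the score as a short Python sketch. It assumes that "order of magnitude of all Wiki pages" means the largest power of ten not exceeding the page count; the function names and the sample numbers are hypothetical and not taken from the pull request.

```python
def order_of_magnitude(n: int) -> int:
    """Largest power of ten not exceeding n (assumed reading of the comment)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 10 ** (len(str(n)) - 1)


def qid_score(incoming_links: int, all_wiki_pages: int) -> int:
    """Solve all_wiki_pages : incoming_links = order_of_magnitude : score."""
    scale = order_of_magnitude(all_wiki_pages)
    # Rounded to the nearest integer, as stated in the comment.
    return round(incoming_links * scale / all_wiki_pages)


# Made-up numbers: 50 million pages in total, 500 incoming links for the QID.
example_score = qid_score(500, 50_000_000)
```

Nothing in the proportion itself bounds the result, so keeping the score within the 0 to 1000 range from the task description would need an extra cap, which this sketch leaves out.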

The current score exclusively captures the importance of Wikidata IDs: I'd like to propose a simple additional weight that would also take into account the usage of Commons images.

Given a (Commons page ID, QID, score) row in the input dataset, the formula would be:
all Wiki pages with images : # of Wiki pages with Commons page ID = 1 : weight

Then:
final score = weight * score
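
A minimal Python sketch of the proposed weighting, assuming "# of Wiki pages with Commons page ID" means the number of wiki pages that use the given Commons image. The names `usage_weight` and `final_score` and the sample numbers are illustrative only.

```python
def usage_weight(pages_using_image: int, pages_with_images: int) -> float:
    """Solve pages_with_images : pages_using_image = 1 : weight."""
    if pages_with_images <= 0:
        raise ValueError("pages_with_images must be positive")
    return pages_using_image / pages_with_images


def final_score(score: float, pages_using_image: int, pages_with_images: int) -> float:
    """final score = weight * score, as in the proposal."""
    return usage_weight(pages_using_image, pages_with_images) * score


# Made-up numbers: the image is used on 1 of 4 pages with images; QID score is 200.
weighted = final_score(200, 1, 4)
```

Since the weight is a fraction between 0 and 1, the final score can never exceed the QID score it is derived from.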

Pull request that closes this task here: https://github.com/cormacparle/commons_wikidata_links/pull/1

Ahem ... it turned out that I misread the original notebook code, and it was already doing this

/me looks embarrassed

Closing the ticket ...

@Cparle, no worries, I didn't spot the redundant work within that big SQL query of the original notebook either!
No score was computed before, so I still think this task complements T286562 and paves the way for T300045, too.

nit: you can ignore page_image if you are only doing this for the Commons index as Commons images are always free.

mfossati added a parent task: T296814: [EPIC] Article-level image suggestions data pipeline. Feb 18 2022, 5:52 PM
