
Add lead image data to the data file generated for populating weighted_tags in the commons search index
Closed, Invalid, Public

Description

User story

As a user, I want image search to give me the best possible results, and I also want to be able to retrieve good image suggestions for unillustrated articles. At the moment, data on whether an image is used as a lead image in a Wikipedia article is not stored in the commons search index, and we believe both search and image suggestions can be improved using this information. This ticket adds the information to the dataset that is subsequently used to add Wikidata data to the commons search index.


See https://github.com/cormacparle/commons_wikidata_links for the current Jupyter notebook that gathers Wikidata data, which we use to populate weighted_tags in the commons search index.

Add a field to the output data written to the parquet file, containing the Wikidata items of any wiki articles the image is the lead image for:

  • e.g. if Image_X is the lead image on https://ga.wikipedia.org/Page_Y - i.e. Image_X's title is the value of either the 'page_image_free' page_prop or the 'page_image' page_prop for https://ga.wikipedia.org/Page_Y
  • AND https://ga.wikipedia.org/Page_Y has a corresponding wikidata id Q12345
  • then for Image_X we'll set the field image.linked.from.wikidata.lead_image/Q12345|<score>
  • <score> will be an integer between 0 and 1000, proportional to the importance of all pages with wikidata id Q12345 across all wikis (using incoming links via the pagelinks table to give a measure of "importance")
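As a rough illustration of the field format described in the bullets above, here is a minimal sketch that builds one tag string. The function name `lead_image_tag` and the validation rules are hypothetical and are not taken from the notebook; only the `image.linked.from.wikidata.lead_image/<QID>|<score>` layout and the 0 to 1000 range come from this ticket.

```python
def lead_image_tag(qid: str, score: int) -> str:
    """Build one weighted tag in the layout given in the ticket.

    Hypothetical helper: the notebook may assemble this string differently.
    """
    # A Wikidata item id is "Q" followed by digits, e.g. Q12345.
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item id: {qid!r}")
    # The ticket states that <score> is an integer between 0 and 1000.
    if not 0 <= score <= 1000:
        raise ValueError(f"score out of range: {score}")
    return f"image.linked.from.wikidata.lead_image/{qid}|{score}"


if __name__ == "__main__":
    print(lead_image_tag("Q12345", 500))
```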

The extra search data should not be added to any image that is excluded by the current Image Suggestions Algorithm (the existing notebook code already covers this; I'm just noting it here for completeness).

Event Timeline

Cparle updated the task description.

@mfossati I just noticed that in Research's notebook for the original version of this, they classify an image as a "lead image" if its title is the value of the 'page_image_free' page_prop or the 'page_image' page_prop in a wiki article. I missed the 'page_image' part in the ticket description, so I'll add it and you can update your code to include those images.

<score> will be an integer between 0 and 1000, proportional to the importance of all pages with wikidata id Q12345 across all wikis (using incoming links via the pagelinks table to give a measure of "importance")

Here's how the score is computed (then rounded to the nearest integer):

all Wiki pages : incoming links per QID = order of magnitude of all Wiki pages : score

Using 1000 instead of the order of magnitude is likely to yield very small scores, because the number of incoming links per QID is low. This was observed on a random sample of 360 rows, so the estimate may be biased.
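Solving the proportion above for the score gives score = incoming links per QID × order of magnitude / all Wiki pages. The sketch below works through that arithmetic. It assumes that "order of magnitude" means the largest power of ten not exceeding the page count, and it clamps the result to 1000 to stay inside the range stated in the ticket; both are assumptions, and the function names and sample numbers are hypothetical.

```python
def order_of_magnitude(n: int) -> int:
    """Largest power of ten not exceeding n (assumed reading of the term)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Counting digits avoids floating-point error from math.log10.
    return 10 ** (len(str(n)) - 1)


def qid_score(incoming_links: int, all_pages: int) -> int:
    """Solve 'all_pages : incoming_links = magnitude : score' for score."""
    raw = incoming_links * order_of_magnitude(all_pages) / all_pages
    # Clamping to 1000 is an assumption; note that Python's round()
    # rounds exact halves to the nearest even integer.
    return min(1000, round(raw))


if __name__ == "__main__":
    # Hypothetical figures: 50 million pages, 1200 incoming links for one QID.
    print(qid_score(1200, 50_000_000))
    # The same proportion with a fixed 1000 in place of the order of magnitude:
    print(round(1200 * 1000 / 50_000_000))
```

With these made-up figures, the fixed-1000 variant rounds down to zero, which is the "very small scores" problem described above.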

The current score exclusively captures the importance of Wikidata IDs: I'd like to propose a simple additional weight that would also take into account the usage of Commons images.

Given a (Commons page ID, QID, score) row in the input dataset, the formula would be:
all Wiki pages with images : # of Wiki pages with Commons page ID = 1 : weight

Then:
final score = weight * score
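Solving the second proportion for the weight gives weight = (# of Wiki pages using the image) / (all Wiki pages with images). The following is a minimal sketch of the proposed weighting, with hypothetical function and parameter names; it only restates the two formulas above and does not reflect what the original notebook does.

```python
def usage_weight(pages_using_image: int, pages_with_images: int) -> float:
    """Solve 'pages_with_images : pages_using_image = 1 : weight' for weight."""
    if pages_with_images <= 0:
        raise ValueError("pages_with_images must be positive")
    return pages_using_image / pages_with_images


def final_score(score: int, pages_using_image: int, pages_with_images: int) -> float:
    """final score = weight * score, as proposed in the comment above."""
    return usage_weight(pages_using_image, pages_with_images) * score


if __name__ == "__main__":
    # Hypothetical row: QID score 240, image used on 5 of 1000 illustrated pages.
    print(final_score(240, 5, 1000))
```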

Ahem ... it turned out that I misread the original notebook code, and it was already doing this

/me looks embarrassed

Closing the ticket ...

@Cparle, no worries, I hadn't spotted the redundant work within that big SQL query of the original notebook either!
No score was computed before, so I still think this task complements T286562 and paves the way for T300045, too.

nit: you can ignore page_image if you are only doing this for the Commons index as Commons images are always free.