Metrics for SDoC: look at search hits based on which element the search is hitting
Closed, ResolvedPublic

Description

Search hits based on which element the search is hitting

  • file name vs. description vs. category
    • After talking with @EBernhardson , we decided this is not feasible since we don't record this information now.
  • "Unfindable" images metrics: lack of categorization, unhelpful file name, no description (or poor description)
    • e.g. images (and files in general) that you just stumble upon but there’s no way anyone would be able to search for it
  • investigate file annotations and whether any tracking (logging) of them is available

Event Timeline

debt created this task.Oct 3 2017, 11:36 PM
debt updated the task description. (Show Details)Oct 6 2017, 9:21 PM
debt updated the task description. (Show Details)Oct 11 2017, 9:43 PM
debt updated the task description. (Show Details)Oct 19 2017, 7:14 PM
chelsyx updated the task description. (Show Details)Oct 24 2017, 6:41 PM
chelsyx added a subscriber: EBernhardson.

There are 142,994 files with annotations (ImageNote); follow this link for the most current count.

The revision history of annotations is available alongside the rest of the page revision history, for example: https://commons.wikimedia.org/w/index.php?title=File:Henley_2009_women.jpg&action=history

@Ramsey-WMF Is this what you want?

chelsyx updated the task description. (Show Details)Oct 26 2017, 3:27 AM
chelsyx updated the task description. (Show Details)

For unhelpful file names, I want to extract the old and new file names from move log entries whose change reason indicates that the old name was meaningless or ambiguous, and then train a model to classify file names. As far as I know, short text classification like this is a bit tricky. @mpopov do you have any suggestions?

debt added a comment.Oct 26 2017, 11:00 PM

Oh, that looks like it will be quite interesting, @chelsyx, although it might involve a bit of manual work.

Hi @chelsyx. This is good, and was the first part of what we wanted to figure out. The second part was figuring out whether the ImageNote is searchable and obtaining a measurement of how often it pops up in search. Since it doesn't seem like we log which element provides the "hit" in search, it's a moot point now.

While we don't log it, we could certainly take a sample of, say, 20k queries, run them against our test cluster, and poke at the results to see which parts triggered the hit.
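As a rough illustration of what "poking at the results" could look like, here is a minimal Python sketch that attributes each hit to the fields containing all of the query terms. It does not talk to the test cluster: the field names (`file_name`, `description`, `category`), the naive token matching, and the sample documents are all assumptions made for illustration, not how the search backend actually decides a match.

```python
import re
from collections import Counter

# Hypothetical field names; the real index fields may be named differently.
FIELDS = ("file_name", "description", "category")

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matching_fields(query, doc):
    """Return the fields of `doc` that contain every token of `query`."""
    q = tokenize(query)
    return [f for f in FIELDS if q and q <= tokenize(doc.get(f, ""))]

def tally_hits(results):
    """Count, over (query, doc) pairs, how often each field produced a match."""
    counts = Counter()
    for query, doc in results:
        for field in matching_fields(query, doc):
            counts[field] += 1
    return counts

# Made-up sample results, only to show the shape of the input.
sample = [
    ("henley women", {"file_name": "Henley_2009_women.jpg",
                      "description": "Rowing crew on the river",
                      "category": "Henley Royal Regatta"}),
    ("rowing crew", {"file_name": "IMG_0042.jpg",
                     "description": "Rowing crew on the river",
                     "category": "Rowing"}),
]
hit_counts = tally_hits(sample)
print(hit_counts)
```

Exact token matching ignores stemming and ranking, so a tally like this would only be a first approximation of what the search engine matched.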

debt added a comment.Oct 27 2017, 9:27 PM

Great idea, @EBernhardson, let's do it! @chelsyx can you get that sampling from the data we already have?

@debt Yes, I can get those queries from the TestSearchSatisfaction2 table. We will need help from @EBernhardson to run them against the test cluster and check the results.

Regarding the manual work for the file name classification: getting data from the move log is easy, but it will take some time to train and adjust the model. @debt @Ramsey-WMF Let me know if you would rather I spend time on getting other metrics done.

@chelsyx - let's work on finishing up the other metrics before we take on any additional training and testing of the models (since the feature we're looking at isn't already being logged); moving this ticket to done.

debt changed the task status from Open to Stalled.Nov 7 2017, 9:31 PM
debt lowered the priority of this task from Medium to Low.

Moving to backlog until we have time to dig deeper (if that is still what is required when we're done with the other SDoC metrics baseline work).

On November 7, the number of files in a "needing categories" category was 4,268,386 (10%). The following table breaks down the counts by media type:

img_media_type  need_cat  n_files     proportion
bitmap          no        36,176,941  84.47%
bitmap          yes       4,207,232   9.82%
drawing         no        1,167,389   2.73%
drawing         yes       17,744      0.04%
audio           no        792,223     1.85%
audio           yes       2,625       0.01%
video           no        71,944      0.17%
video           yes       36,613      0.09%
multimedia      no        4           0%
office          no        351,035     0.82%
office          yes       4,172       0.01%
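The headline figure can be cross-checked against the table with a few lines of Python. The counts below are copied from the table; the overall total is derived by summing the rows and is not stated separately in this task.

```python
# (media type, needs category?, number of files), copied from the table above
rows = [
    ("bitmap", "no", 36176941), ("bitmap", "yes", 4207232),
    ("drawing", "no", 1167389), ("drawing", "yes", 17744),
    ("audio", "no", 792223), ("audio", "yes", 2625),
    ("video", "no", 71944), ("video", "yes", 36613),
    ("multimedia", "no", 4),
    ("office", "no", 351035), ("office", "yes", 4172),
]

total = sum(n for _, _, n in rows)                          # all files in the table
need_cat = sum(n for _, flag, n in rows if flag == "yes")   # files flagged as needing categories
share = need_cat / total

print(f"{need_cat:,} of {total:,} files ({share:.2%}) need categories")
# prints: 4,268,386 of 42,827,922 files (9.97%) need categories
```

The "yes" rows sum to the 4,268,386 quoted above, and the share of 9.97% is what the text rounds to 10%.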

Status of tasks of this ticket:

  • Search hits based on which element the search is hitting: file name vs. description vs. category
    • This is not feasible currently. A possible solution is described in T177353#3716344, and we will need help from the search backend team.
  • "Unfindable" images metrics: lack of categorization, unhelpful file name, no description (or poor description)
    • Categories: The number of files in a "needing categories" category and its breakdown are shown in T177353#3743257. We have a query to count the number of files by number of categories, category type (hidden vs. not) and media type, but we are having some problems running this query on the MySQL database. A possible solution is available, but it would take some time.
    • Description: We could use advanced search and/or parse the page content with Hive (using an experimental table set up by Analytics), but it would take some time.
    • File name: We could get this done by machine learning as described in T177353#3712897, but it would take some time to train and tune the model.
  • Investigate file annotations and whether any tracking (logging) of them is available

Given the difficulties described above, @debt and I decided to move this ticket to the backlog and work on other SDoC metrics first.

chelsyx changed the task status from Stalled to Open.EditedDec 12 2017, 7:10 PM
chelsyx raised the priority of this task from Low to Medium.

We parsed the wikitext of all files in the Commons XML data dumps of November 20, 2017. Out of the total 43,268,565 files, 41,796,560 (96.6%) have an infobox, and 41,309,028 (95.47%) have some content in their description fields (description, title, depicted people, depicted place, etc.).

Caveat:

  • There are a large number of infobox-like templates (e.g. Infobox_templates:_based_on_Information_template, Data_ingestion_layout_templates, templates only for one batch of uploads like this) with description fields of various names (e.g. some use commons_description instead of description). This makes counting very difficult because we cannot enumerate all of these infobox names and description field names.
  • Some users create their own templates on top of other infobox templates for upload convenience. This masks the file description, so it cannot be searched. For example, the wikitext of File:Cyclopaedia, Chambers - Volume 1 - 0133.jpg is:
{{Cyclopaedia, Chambers page
 | volume = 1
 | prev = 0132
 | page = 0133
 | next = 0134
}}

A lot of the information we see on the web page is actually hidden in its template, Template:Cyclopaedia,_Chambers_page. This makes it very hard to find this file through search, because search matches against the file's wikitext shown above. We should encourage our users to clean up these kinds of templates.
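To make the masking problem concrete, here is a minimal Python sketch of a description check on raw wikitext. It is not the parsing code used for the counts above. The list of field names is a hypothetical, incomplete sample, and the second wikitext snippet is made up for contrast.

```python
import re

# Hypothetical, incomplete list of field names treated as a description.
DESCRIPTION_FIELDS = ("description", "commons_description", "title")

def has_description(wikitext):
    """Return True if a known description field has non-empty content in the raw wikitext."""
    for name in DESCRIPTION_FIELDS:
        # Match "| name = value" and capture the value up to the next "|", "}" or newline.
        pattern = r"\|\s*" + re.escape(name) + r"\s*=([^|}\n]*)"
        match = re.search(pattern, wikitext, flags=re.IGNORECASE)
        if match and match.group(1).strip():
            return True
    return False

# Wikitext of the masked example quoted above.
masked = """{{Cyclopaedia, Chambers page
 | volume = 1
 | prev = 0132
 | page = 0133
 | next = 0134
}}"""

# Made-up wikitext with an explicit description field, for contrast.
plain = """{{Information
 |description = Page 133 of Chambers' Cyclopaedia, volume 1
 |source = scan
}}"""

print(has_description(masked))  # False: the description lives in the template, not the page
print(has_description(plain))   # True
```

A check like this only sees the wikitext of the file page itself, which is the same limitation described above for search: anything supplied by the wrapping template stays invisible.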

Categorization
Excluding hidden categories and 'needing_category' categories, as of December 12, 2017 there are 1,629,592 (3.73%) files that don't belong to any category and 22,492,880 (51.55%) files that belong to only one category.

Breakdown by media type and analysis codebase can be found here: https://github.com/wikimedia-research/SDoC-Initial-Metrics/tree/master/T177353

If the number here seems to conflict with T177353#3743257, that's because files in 'needing_category' categories may have other categories at the same time, possibly because users added categories to a file but forgot to remove 'needing_category', or because the 'needing_category' category was moved to hidden categories. The graph above shows a more accurate count.
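The difference between the two counts can be sketched in Python as follows. This is not the analysis code in the linked repository. The file names, category names, and the rule for spotting a "needing categories" category (a substring match) are all assumptions for illustration.

```python
from collections import Counter

def is_needing_category(name):
    """Assumed rule: maintenance categories contain the phrase 'needing categories'."""
    return "needing categories" in name.lower()

def content_categories(categories, hidden):
    """Drop hidden categories and 'needing categories' maintenance categories."""
    return [c for c in categories if c not in hidden and not is_needing_category(c)]

def count_by_category_number(files, hidden):
    """Map the number of remaining categories to the number of files that have that many."""
    return Counter(len(content_categories(cats, hidden)) for cats in files.values())

# Made-up files and categories.
files = {
    "A.jpg": ["Media needing categories", "Uploaded with UploadWizard"],
    "B.jpg": ["Media needing categories", "Rowing"],
    "C.jpg": ["Rowing", "Henley Royal Regatta"],
}
hidden = {"Uploaded with UploadWizard"}

distribution = count_by_category_number(files, hidden)
flagged = sum(any(is_needing_category(c) for c in cats) for cats in files.values())

print(distribution)  # one file each with 0, 1 and 2 content categories
print(flagged)       # 2 files carry a 'needing categories' flag, but only 1 is truly uncategorized
```

In this toy data B.jpg is flagged as needing categories even though it already has one, which is the kind of case that makes the flag-based count higher than the count of files with no category at all.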

All results and analysis codebase can be found here: https://github.com/wikimedia-research/SDoC-Initial-Metrics/tree/master/T177353

For unhelpful file names, I created a child ticket T182849 since it should be a separate project and we don't have the bandwidth to deal with it now.

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board.Dec 18 2017, 3:07 PM
chelsyx closed this task as Resolved.Feb 7 2018, 5:58 PM