
Identify common patterns of category search near-misses. Search for an API that would allow "fuzzy" search
Closed, Resolved (Public)

Description

Task

  1. Fill out samples on the GitHub wiki
  2. Test the samples with Method A and Method B, as suggested in the StackExchange thread

Method A: MediaWiki API for Commons with search query for pages of type "Category" (srnamespace=14), maximum 10 results (srlimit=10):

https://commons.wikimedia.org/w/api.php?action=query&list=search&srwhat=text&srenablerewrites=1&srnamespace=14&srlimit=10&srsearch=your_string

Method B: MediaWiki API for Commons with opensearch for maximum 10 results (limit=10):

https://commons.wikimedia.org/w/api.php?action=opensearch&limit=10&suggest=1&search=Category:your_string

(I added format=jsonfm to this, a minor output change that makes the results easier to read: https://commons.wikimedia.org/w/api.php?action=opensearch&format=jsonfm&limit=10&suggest=1&search=Category:apartment%20building )

  3. Look for other possible methods
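
For anyone repeating the tests, the two request URLs from Method A and Method B above can be built programmatically. This is only a sketch, not the code used for this task: the function names are hypothetical, format=json is an assumption added so the response is machine-readable, and the two parsing helpers assume the usual response shapes of list=search and action=opensearch. No network request is made here.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def method_a_url(search_string, limit=10):
    """Method A: full-text search restricted to the Category namespace (14)."""
    params = {
        "action": "query",
        "list": "search",
        "srwhat": "text",
        "srenablerewrites": 1,
        "srnamespace": 14,
        "srlimit": limit,
        "srsearch": search_string,
        "format": "json",  # assumption: not in the task's URL, added for parsing
    }
    return API_ENDPOINT + "?" + urlencode(params)


def method_b_url(search_string, limit=10):
    """Method B: opensearch with a 'Category:' prefix on the search string."""
    params = {
        "action": "opensearch",
        "limit": limit,
        "suggest": 1,
        "search": "Category:" + search_string,
        "format": "json",  # assumption: not in the task's URL, added for parsing
    }
    return API_ENDPOINT + "?" + urlencode(params)


def titles_from_method_a(response):
    """Extract page titles from a decoded list=search JSON response."""
    return [hit["title"] for hit in response.get("query", {}).get("search", [])]


def titles_from_method_b(response):
    """Extract page titles from a decoded opensearch JSON response.

    The response is assumed to be a list whose second element holds the titles.
    """
    return list(response[1]) if len(response) > 1 else []
```

Note that urlencode writes spaces as "+" rather than "%20"; both forms decode to the same query string.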

Results

I've completed the tests for Methods A and B. Method A seems to vastly outperform Method B and produces some decent category suggestions.

I've also searched for other possible APIs but have not found anything useful so far.

Conclusion

We have discussed this and have decided that we will go with Method A.

Event Timeline

josephine_l claimed this task.
josephine_l raised the priority of this task to Medium.
josephine_l updated the task description.

Yes, use the "Sample generation methodology" to generate 5 more samples.

I haven't tried the StackExchange answer, but that would indeed be a first method (Method A?) to try :-)

@Nicolas_Raoul Okay, will do, thanks. :) The stackexchange answer describes two separate methods I think - even though both use MediaWiki API, one uses action=query and the other uses action=opensearch. AFAIK I don't think you can run two actions in one API call.

Ah OK, we have methods A and B then :-)

@Nicolas_Raoul - I've filled in the GitHub wiki, does it look okay?

How do you think we should compare the results of the various methods? I figure I'll search for each 'guessed category' string using the APIs and report how many of the 'actual categories' it found?
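
A rough sketch of that comparison, assuming each sample has a list of 'actual categories' and each method returns a list of suggested titles for a 'guessed category' string. The function names and the example data are hypothetical, and the case-insensitive matching is a simplifying assumption (MediaWiki titles are case-sensitive after the first letter):

```python
def normalize(title):
    """Strip a leading 'Category:' prefix and normalise spacing and case."""
    title = title.strip()
    if title.lower().startswith("category:"):
        title = title[len("category:"):]
    # Assumption: underscores and spaces are treated as equivalent,
    # and case is ignored entirely to keep the comparison simple.
    return title.replace("_", " ").strip().lower()


def score_method(suggested_titles, actual_categories):
    """Return (found, total): how many actual categories appear in the suggestions."""
    suggested = {normalize(t) for t in suggested_titles}
    found = [c for c in actual_categories if normalize(c) in suggested]
    return len(found), len(actual_categories)


# Hypothetical example data, not taken from the wiki samples.
suggestions = ["Category:Apartment buildings", "Category:Apartment buildings in Paris"]
actual = ["Apartment buildings", "Residential towers"]
print(score_method(suggestions, actual))  # (1, 2)
```

Running this once per method and per sample gives a found/total pair that can be compared across methods.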

The procedure says "just looking at the image and image name" so for image 6 for instance "road of Chinawal-Savkheda" could be another guess.

@Nicolas_Raoul - Oh, right, thanks. :) I did look at the image name for most of them, but for some reason I missed it on that sample.

Yeah, the others are great!

So indeed the next step is to evaluate each method and score it.

@Nicolas_Raoul I've filled in the test results for Samples 1 and 4, is that what you had in mind for the evaluation?