
Identify the best strategy/APIs to find Commons categories that are within a certain radius of the specified GPS coordinates
Closed, ResolvedPublic

Description

Week: Dec 7 - Dec 13
Task: Identify the best strategy/APIs to find Commons categories that are within a certain radius of the specified GPS coordinates
Deliverable: Wiki memo: cURL requests that provide the right categories (30% false positives OK) for all possible use cases and edge cases

I will be using https://github.com/nicolas-raoul/apps-android-commons/wiki/Location-based-category-search to document the results of testing the categories obtained via different APIs/strategies against the benchmark of categories that have been manually entered (by the Commons community or by myself) for each picture.

Pictures are found by:

  1. Visiting https://commons.wikimedia.org/wiki/Special:Random/File
  2. Eliminating files that are not photos or could not possibly be obtained via a smartphone
  3. File must have location data available

For each picture, I aim to perform a comparison, for instance:

  • Manually: x0 good categories
  • WikiData API: x1 good categories, y1 false positives
  • Commons API: x2 good categories, y2 false positives
  • "Existing pics at that location" strategy: x3 good categories, y3 false positives

WikiData API: I am running queries via TABernacle, for instance:

claim[373] AND around[625,49.27066666666666,14.073769444444444,0.1]

Property 373 signifies the Commons category. I start with radius 0.1km and increase the number if no categories are found.
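For reproducibility, the query string for other coordinates can be generated mechanically. A minimal sketch, assuming the WDQ syntax shown above (P373 = Commons category, P625 = coordinate location, radius in km); `wdq_around` is a name I made up for illustration:

```python
def wdq_around(lat, lon, radius_km=0.1):
    """Build a WDQ query for items that have a Commons category (P373)
    within radius_km of (lat, lon), as used in TABernacle."""
    return f"claim[373] AND around[625,{lat},{lon},{radius_km}]"

# The radius can then be increased step by step if nothing is found.
```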

Method C: "Search for existing pics at that location" strategy

This is described in more detail at https://etherpad.wikimedia.org/p/commons-app-android-nearby-categories

Method D: Same as Method C, except we increase the radius until at least 5 unique categories are found. Results on the GitHub wiki.

Conclusion: We will go with Method D

Event Timeline

josephine_l claimed this task.
josephine_l raised the priority of this task from to Medium.
josephine_l updated the task description. (Show Details)
josephine_l set Security to None.
josephine_l updated the task description. (Show Details)

@Nicolas_Raoul How are we defining 'false positives' in this context? Is that categories that are found by the API which are wrong or not applicable to what the user is looking for?

Yes, for instance if you take a picture at 48.853333, 2.369167 and the algorithm returns Category:Popemobile just because the popemobile happened to pass by and someone uploaded a picture of it at this place.
The "take categories of pictures taken at that place" strategy is one of the most promising, but it will result in a certain proportion of such false positives.

@Nicolas_Raoul : Ah, okay, thanks. How will we evaluate the usage of WikiData API and Commons API for this purpose then, since I think they won't be returning false positives (although there might be a lot of missed categories)? Or should I test the 'take categories of pics taken at that place' strategy first and proceed with it as long as there are <30% false positives, and only test WikiData and Commons API if that strategy fails?

For each picture, you should perform such a comparison:

  • Algorithm A: 3 good categories, 2 false positives
  • Algorithm B: 1 good category, 0 false positive
  • Algorithm C: 0 good category, 10 false positives
  • Manually: 5 good categories

Thanks, gotcha. What is the 'manually' item for?

Also, is there any preference as to where I should get the test pictures from? Can I use any picture I have, or should they be pictures that are already on Commons, etc?

Manually is to check the usefulness of the algorithms. "Good categories" is defined by what the human user has manually found.

Let's say a human person looks for categories for picture A, and only finds 1 good category which is applicable, then 8 categories for picture B. Ideally, the algorithms should end up with the same ones.

The score of each algorithm on a particular picture is probably something like: (number of good categories found - number of false positives / 3) / number of good categories found by the human
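A sketch of that score, under my reading of the formula (a false positive costs one third of a good category, and the total is normalized by the number of good categories the human found); the function name is made up for illustration:

```python
def score(good_found, false_positives, good_manual):
    """Provisional per-picture score for an algorithm: good categories
    count +1 each, false positives count -1/3 each, normalized by the
    number of good categories found manually."""
    return (good_found - false_positives / 3) / good_manual

# e.g. an algorithm that finds all 3 manual categories with no noise
# scores 1.0; false positives pull the score down by a third each.
```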

The best is to use https://commons.wikimedia.org/wiki/Special:Random/File and only eliminate the files that are not pictures, or that one could not possibly take with a smartphone (e.g. satellite, microscope, old pictures).

Ah, okay. But if I am using that link (which is similar to the way we populated https://github.com/nicolas-raoul/apps-android-commons/wiki/Fuzzy-category-search with, right?) how will I do the manual search, as I can't manually search for categories on the app since I am not uploading that picture? Or does 'manual' mean the categories that that picture already has?

Just take the categories that have been manually entered by the community, after completing them if needed (some pictures in Commons are not well-categorized).

By the way, checking the categories of random pictures that already exist on Commons can be a good exercise to get familiar with the unwritten rules of Commons categories.

Thanks Nicolas! :) I'll start filling out the GitHub wiki as soon as I can, then. Is 10 a good number of pictures to test?

I have created a wiki page, and added 2 pics and their manual categories along with the format for the rest of the tests at https://github.com/nicolas-raoul/apps-android-commons/wiki/Location-based-category-search - do let me know if I'm on the right track?

I've been tinkering with running WikiData queries using an example query

claim[373] AND around[625,40.7576,-73.9857,0.1]

However it seems that the WikiData API takes LATITUDE and LONGITUDE in decimal degrees (e.g. "52.205,0.119"), while the camera location for photos on Commons uses degrees, minutes and seconds, e.g. 38° 06′ 49.93″ N, 13° 21′ 22.55″ E. So I tried using the https://www.fcc.gov/encyclopedia/degrees-minutes-seconds-tofrom-decimal-degrees tool to convert them, but what do I do with the N and E? Is N equivalent to a positive latitude in decimal degrees and S to a negative latitude, and likewise E (positive) and W (negative)?

I think we should use decimal everywhere, and convert any other notations we may receive.

You are right about N and E.
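The conversion described above (N/E positive, S/W negative) can be sketched as a small helper; `dms_to_decimal` is a name made up for illustration:

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to
    signed decimal degrees: N and E are positive, S and W negative."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value
```

For example, 38° 06′ 49.93″ N converts to roughly 38.113869, matching the decimal value used in the TABernacle query below.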

Thanks Nicolas! I'll convert the camera location for the test pics on our wiki to decimal then.

@Niedzielski I was thinking about that as well. We were talking about testing the 'Commons API' in addition to the Wikidata API and the 'pics with similar location' strategy... but when we say 'Commons API', I think what we really mean is Commons:API/MediaWiki[0], right? Which is the API that contains the geosearch functionality that you mentioned. The actual 'Commons:Commons API'[1], besides being experimental, does not seem to have the functionality that we need (please correct me if I'm wrong?).

So... should I be seeing if there is a way to get [0] to return categories instead of pages?

[0] https://commons.wikimedia.org/wiki/Commons:API/MediaWiki
[1] https://commons.wikimedia.org/wiki/Commons:Commons_API

I was not aware of the difference, but it is worth investigating both, I guess.
And yes, as you know we need categories, not pages.

So I ran a WikiData query using the TABernacle tool for the first sample in our wiki, and the results obtained are: https://tools.wmflabs.org/wikidata-todo/tabernacle.html?wdq=claim%5B373%5D%20AND%20around%5B625%2C38.11386944444445%2C13.356263888888888%2C0.1%5D&pagepile=885&props=373%2C625&items=&show=1

Comparing the results between manual categorization and the WikiData query categories:

Manual categorization: 2 good categories

  • Side views of the Cathedral of Palermo - Architectural details
  • Cathedral (Palermo) - Exterior

WikiData query:

  • Museo Diocesano (Palermo)
  • Cathedral (Palermo)
  • La Martorana (Palermo)

So, does that mean that WikiData found 1 good category and 2 false positives? But Museo Diocesano and La Martorana are not really false positives, right, since they really do exist at that location? Maybe 1 good category and 0 false positives then?

Also, I've been using 0.1 (km) as the radius parameter for the API, is that okay?

@josephine_l, hm, I think I'm missing something obvious. Forgive a stupid question but how did you arrive at the list of manual categories? For example, I see La Martorana[0] does exist on Commons. It does not seem to have geodata associated with it other than the town name but couldn't that be derived via reverse geocoding? Also, should more locations be considered?

[0] https://commons.wikimedia.org/wiki/Chiesa_della_Martorana_(Palermo)

@Niedzielski I am following the methodology suggested by @Nicolas_Raoul for manual categorization:

Just take the categories that have been manually entered by the community, after completing them if needed (some pictures in Commons are not well-categorized).

So for the sample above, I went to the picture's page at https://commons.wikimedia.org/wiki/File:Palermo_Cathedral2.JPG and scrolled down to the very end where it says that

Category (++): Side views of the Cathedral of Palermo (−) (±) (↓) (↑)(+)

Then I looked around for any other possible Commons categories that I thought the picture should reasonably be categorized under, and arrived at the other category "Cathedral (Palermo) - Exterior". (Edited: Whoops, just realized that La Martorana is indeed the name of that cathedral! Seems like WikiData is outperforming manual categorization in our case.... what do we do now?)

I'm not sure what you mean by more locations, do you mean for this sample or for others? There are 10 samples that I plan to test the various APIs/strategies against - they are listed, along with their details and the methods used to find them, on https://github.com/nicolas-raoul/apps-android-commons/wiki/Location-based-category-search . I think maybe I should have made this URL more explicit in my task description, sorry about that.

@Nicolas_Raoul @Niedzielski I'm beginning to fill out the GitHub wiki with results from the WikiData queries, but I still have 2 questions about their interpretation:

  1. What do I do if WikiData discovers a category that should be there but that manual categorization failed to find? Can we count it in?
  2. I'm still not very certain about which categories should be considered false positives. For instance if there is a cathedral and a museum at the same location. The picture is of the cathedral so the right category is 'cathedral', but the query found the museum and the cathedral (because they are at the same place). I assume the museum should be considered neither a good category for the picture, nor a false positive, because the museum actually does exist at the location?

Thanks!

After experimenting with Commons:Commons API[0][1], I have concluded that this API is meant for obtaining information on pictures that already exist on Commons and whose filename is already known, not for searching for categories.

I will move on to looking at Commons:API/MediaWiki[2] next.

[0]https://commons.wikimedia.org/wiki/Commons:Commons_API
[1]https://tools.wmflabs.org/magnus-toolserver/commonsapi.php
[2]https://commons.wikimedia.org/wiki/Commons:API/MediaWiki

if WikiData discovers a category that should be there but that manual categorization failed to find?

In that case, just add it to the manual categories. Manual categories must be perfect, for the tool to aim towards that perfection.

If the picture's coordinates are closer to the cathedral than to the museum, then we must consider "Museum" a false positive.
By the way, the museum and the cathedral, even if close, do not have exactly the same coordinates, right?

In the future (not for this Outreachy), we could dream that a next-generation algorithm might perform shape recognition to guess whether the picture shows a cathedral or not. It could even extract the camera orientation and focal length from EXIF and attempt to use them to evaluate the coordinates of the object (rather than the coordinates of the camera). Such an algorithm would probably not be usable any time soon, though, because camera orientation information is often off.

In that case, just add it to the manual categories. Manual categories must be perfect, for the tool to aim towards that perfection.

Will do, thanks.

If the picture's coordinates are closer to the cathedral than to the museum, then we must consider "Museum" is a false positive.
By the way, the museum and the cathedral, even if close, do not have exactly the same coordinates, right?

Ah okay. A lot of it depends on the value chosen for the 'radius' parameter. For instance, in the example given for Sample 1 (cathedral and museum), if I set the radius to a very specific 0.07km or 0.08km, it finds only the cathedral and not the museum. But if I set it to 0.06km, it finds nothing. And if I set it to 0.09km, it finds the cathedral and the museum. The 'ideal' radius differs depending on the picture.

So I guess my next question is: What radius should we set as the benchmark?

Hey! I was having some trouble mentally coalescing the growing list of Phab comments and IRC communications around this task so I've started an Etherpad here[0] for picking a strategy.

[0] https://etherpad.wikimedia.org/p/commons-app-android-nearby-categories

Hi @Nicolas_Raoul, yesterday you suggested that for selecting a radius for Method C I could try:

Or better: specify a large ggsradius and a large ggslimit, then sort the results by their distance to the point, and only then take the 5 closest

However I can't seem to find a way to sort by distance to point within the API sandbox (@Niedzielski, do you think there's a way?) so I think I'll run the queries with arbitrary ggsradius for now and maybe we can put it in programmatically if we choose this strategy for the project?

Also, an issue with Method C is that a lot of technical categories are included in it (license, camera, etc). Edit: Oh, found a way to not show those categories, by setting the !hidden parameter. Not an issue any longer.
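For reference, a Method C request with hidden (technical) categories excluded can be assembled like this. A sketch only: the radius and limit values are arbitrary choices for testing, not settled decisions, though the parameter names follow the MediaWiki geosearch and categories modules:

```python
from urllib.parse import urlencode

def geosearch_categories_url(lat, lon, radius_m=100, limit=50):
    """Build a Commons API request listing the non-hidden categories of
    files near a point: generator=geosearch feeding prop=categories."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "geosearch",
        "ggscoord": f"{lat}|{lon}",
        "ggsradius": radius_m,       # metres
        "ggslimit": limit,
        "ggsnamespace": 6,           # File: namespace
        "prop": "categories",
        "clshow": "!hidden",         # skip licence/camera categories
        "cllimit": "max",
    }
    return "https://commons.wikimedia.org/w/api.php?" + urlencode(params)
```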

The issue now is that it seems that the query is pulling up the exact same image (with, of course, the same categories) that we are testing the query on. For instance, sample 2 is the image File:Oberstaufen_Heiligen-Geist_church.jpg, and the main page retrieved by API sandbox is

"pageid": 38399507,
"ns": 6,
"title": "File:Oberstaufen Heiligen-Geist church.jpg",
"categories": [
    {
        "ns": 14,
        "title": "Category:Churches in Oberstaufen"
    },
    {
        "ns": 14,
        "title": "Category:Cultural heritage monuments in Oberstaufen"
    },
    {
        "ns": 14,
        "title": "Category:Holy Spirit churches in Bavaria"
    }
]
So the results are a bit misleading, as each image that we are testing on is pointing back to itself? I'll manually exclude the search results with the exact same filename for now I guess.
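Excluding the self-match could also be done client-side; a minimal sketch, where `exclude_self` is a hypothetical helper and pages are the dicts returned by the API:

```python
def exclude_self(pages, filename):
    """Drop the page whose title matches the file we are testing on,
    since a geosearch around a file's own coordinates returns the file
    itself among the results."""
    return [p for p in pages if p.get("title") != f"File:{filename}"]
```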

Thanks @Niedzielski! I'll use that query from now on. That doesn't sort in the sandbox itself (from what I see), but when the implementation time comes it should be fairly easy to run a sort based on the value of 'codistancefrompoint', yeah?

@josephine_l, the response.query.pages comes back as a JSON object, which is unordered by definition in JSON. It might be worth poking around some time to see if you can get it back as a sorted JSON array, but I'm not optimistic. That said, I actually don't think we'll need it. I think it's more likely we'd take the category frequency, distance, and maybe some other properties to do our own weighting. It would be a nice option for debugging though.
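If each page carries a distance value (for instance a `dist` field under its coordinates, as the API can report when a distance-from-point is requested), the client-side sort is trivial. A sketch under that assumption:

```python
def sort_pages_by_distance(pages_obj):
    """pages comes back as an unordered JSON object keyed by pageid;
    sort its values client-side by the per-page distance, assumed here
    to be stored under coordinates[0]['dist']."""
    return sorted(
        pages_obj.values(),
        key=lambda page: page["coordinates"][0]["dist"],
    )
```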

Using &formatversion=2 (https://www.mediawiki.org/wiki/API:JSON_version_2) provides various format improvements, one of which being getting a JSON array back instead of an object.

@Legoktm, nice! You wouldn't happen to know if there's an API sandbox that uses formatversion=2? I seem to recall a showcase that demoed some dramatic improvements (maybe by anomie?) to the sandbox but I can't seem to find the link.

Thanks @Legoktm! Will keep that in mind for later. :)

@Nicolas_Raoul and @Niedzielski - I'm still a bit unsure about which categories to put under 'false positives'. I've filled in the false positives for both WikiData and Method C for Sample 1-3, could either of you please take a look and let me know if I'm doing it right?

@josephine_l, I'm not sure what @Nicolas_Raoul has in mind. I think false positives for a given lat / lng would be finding categories that are irrelevant.

@Niedzielski - Yeah, I think so. I'm just a bit murky on which wrong categories count as 'irrelevant' and which don't. :) But I've filled up the wiki with the false positives as best I can, hopefully they're mostly correct. Today I'll try to put the numbers into the equation that @Nicolas_Raoul mentioned above, and see how each method fares.

josephine_l updated the task description. (Show Details)
josephine_l updated the task description. (Show Details)

@Niedzielski @Nicolas_Raoul I've pasted the results from scoring the algorithms above. It feels like maybe we are penalizing false positives a bit much with this equation, as one false positive essentially cancels out the effect of one good category? But that aside, it seems that WikiData "wins" according to these calculations.

I gave the score formula without thinking about it much, we should not penalize false positive that much indeed:

  • A false positive means the user needs to scroll a bit more
  • A good category not present means that the user has to imagine its name, type the beginning of its name, and skim through a LOT of "false positives" that begin with the same letters.

I also feel that we could accept a larger radius in order to have like 10 proposed categories from which to choose.

Minor: I think you should disregard the "good category" for sample 6, because all result pictures have obviously been uploaded by the same person at the same time.

There seem to be samples without overlap; for instance in sample 1 each strategy has found a good category that the other strategy has not found. I wonder if that justifies running two requests though. The week will soon be over and we have to reach a decision. In light of the test results, I would tend to favor method C.

Going further, we still need to think about how the algorithm should set the radius, maybe launching more requests until enough results have been gathered.
Could you please run the requests and test with this "method D":

  1. Run the same request as "method C" with a radius of 100 meters
  2. If the number of results is below 5, run "method C" again with a radius multiplied by 10 (so 1000 meters).
  3. If the number of results is still below 5, run "method C" again with a radius multiplied again by 10 (so 10000 meters).
  4. etc
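The loop above can be sketched as follows; `fetch_categories` is a hypothetical caller-supplied helper that performs one Method C request and returns the set of unique non-hidden categories, and the 10,000 m cap reflects the apparent ggsradius limit:

```python
def method_d(fetch_categories, lat, lon, start_radius_m=100,
             min_categories=5, max_radius_m=10000):
    """Run Method C repeatedly, multiplying the radius by 10 until at
    least min_categories unique categories are accumulated or the
    radius cap is reached. Returns the final radius and categories."""
    radius = start_radius_m
    categories = set()
    while True:
        categories |= fetch_categories(lat, lon, radius)
        if len(categories) >= min_categories or radius >= max_radius_m:
            return radius, categories
        radius *= 10
```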

Could you also please sort the results by proximity, and specify at what rank the good results appeared (for instance good results at 1st, 2nd, and 3rd position is much better than at 7th, 8th, and 9th). To sort results by proximity, a quick approximation of distance can be made with the Pythagorean theorem in a spreadsheet program (it won't be exact near the poles, but should be good enough for this test).
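The spreadsheet approximation can equally be expressed in code; a sketch of the Pythagorean (equirectangular) distance, which is fine for ranking nearby results but inaccurate near the poles:

```python
import math

def approx_distance_km(lat1, lon1, lat2, lon2):
    """Pythagoras on an equirectangular projection: scale the longitude
    difference by cos(mean latitude), then take the hypotenuse on a
    sphere of radius 6371 km."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6371 * math.hypot(dx, dy)
```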

Thanks a lot! :-)

@Nicolas_Raoul - Yeah, I agree that method C seems to be better, as I also think that good categories should strongly outweigh false positives, and method C finds more good categories than WikiData.

What do you mean exactly by 'number of results below 5'? Do you mean "number of unique categories found", or "number of pictures found"?

Sorry I meant "number of unique categories found".
The number of categories we propose to the user. I think showing like 5~15 propositions to the user is appropriate.

I've updated the wiki with results of Method D. The results appear to be quite good, as the good categories are usually in the higher ranks, and additional good categories are sometimes found when expanding the radius. With a higher radius it seems that Method D manages to find most of the good categories that WikiData finds. However a few samples don't ever manage to hit > 5 categories (as there appears to be a limit of 10,000m for the ggsradius, and anyway I think 100,000m is a bit too far even if it did work).

@NiharikaKohli @01tonythomas - How long do closed/resolved tasks stay in the system - will they be deleted? I'm not sure if I should mark a completed task as 'resolved' or if I should just put it in the 'Done' section of my project board.

@NiharikaKohli @01tonythomas - How long do closed/resolved tasks stay in the system - will they be deleted? I'm not sure if I should mark a completed task as 'resolved' or if I should just put it in the 'Done' section of my project board.

Resolved tasks stay in the system forever but they don't automatically show up on your board. You have to Filter by: All tasks (top right) to show them. If you want them to be visible on the board, feel free to keep them in Done, else resolve them.

Thanks for the clarification, @NiharikaKohli !