Filter Image Recommendations by Image Source
Closed, Resolved. Public. 1 Estimated Story Points.

Description

Narrative

  • As a mobile reader (familiar with editing on my device)
    • When I am reading an article with no images
    • I want to see any image(s) that could be used to illustrate the article,
    • so that I can both gain a better understanding of the topic and help others who read the article in the future.

Acceptance Criteria

  • As the StructuredData, I want the ability to request unillustrated pages that have image suggestions only from the image source MediaSearch, so that I am able to determine which
  • As the Android Product Manager, I want the ability to request unillustrated pages that have image suggestions only from the image sources of Wikidata, Commons, and other Wikipedias, so that I am able to determine the accuracy of the ImageMatchingAlgo results.

Example Spec 1: Generic

Request

GET https://api.wikimedia.org/image_suggestions/v1/en/wikipedia/pages?source=[ima|ms]

Note: ima = Image Matching Algorithm, ms = MediaSearch. We could easily allow aliases of fully-spelled-out terms, like "image-matching-algorithm" and "media-search", or whatever callers would find mnemonic. Suggestions welcome. The short versions are convenient for keeping logs readable.
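
As a rough illustration of the request above, a client might build the URL like this. It is only a sketch: the helper name `build_pages_url` and the long-form aliases are hypothetical (the aliases are merely floated in the note above), and only `ima` and `ms` come from the spec.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.wikimedia.org/image_suggestions/v1/en/wikipedia/pages"

# Short codes come from the spec; the long-form aliases are hypothetical
# and only illustrate the "mnemonic alias" idea from the note.
SOURCE_ALIASES = {
    "ima": "ima",
    "ms": "ms",
    "image-matching-algorithm": "ima",
    "media-search": "ms",
}


def build_pages_url(source=None):
    """Return the pages URL, optionally filtered by suggestion source."""
    if source is None:
        return BASE_URL  # unfiltered request
    key = source.strip().lower()
    if key not in SOURCE_ALIASES:
        raise ValueError(f"unknown source: {source!r}")
    # Always send the short code so logs stay readable.
    return f"{BASE_URL}?{urlencode({'source': SOURCE_ALIASES[key]})}"
```

Normalizing aliases to the short code on the client side keeps whatever ends up in request logs uniform, which is the motivation given for the short names.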

Response

[
  {
    "project": "enwiki",
    "page": "Cat",
    "suggestions": [
      {
        "filename": "File:Striped_Cat.jpg",
        "source": "Wikidata",
        "confidence_rating": "high"
      },
      {
        "filename": "File:Spotted_Cat.jpg",
        "source": "Commons Category",
        "confidence_rating": "medium"
      }
    ]
  },
  {
    "project": "enwiki",
    "page": "Frog",
    "suggestions": [
      {
        "filename": "File:Green_Frog.jpg",
        "source": "Wikidata",
        "confidence_rating": "high"
      },
      {
        "filename": "File:Yellow_Frog.jpg",
        "source": "Commons Category",
        "confidence_rating": "medium"
      }
    ]
  }
]

Note: this response is unchanged from the response without filtering. The source is echoed, even though it will initially be known, both for consistency and because we may in the future allow filtering for multiple sources (once we have more).

Open Questions

Event Timeline

sdkim raised the priority of this task from Low to High. Feb 17 2021, 6:33 PM
BPirkle set the point value for this task to 1.

The estimate I assigned (1) assumes that the base implementation of the endpoint without filtering is complete. Adding filtering should be straightforward.

As described, this filtering does not allow an exclude list, to return responses like "give me results for all sources except MediaSearch". This could be added if necessary, but it seems likely that our number of sources will forever be small enough that it is not an imposition on clients to enumerate all desired sources.
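
To make the "enumerate instead of exclude" point concrete, here is a minimal client-side sketch. It assumes the service only knows the two sources named in the spec (`ima` and `ms`); the helper name `sources_except` is hypothetical.

```python
# Assumed set of known sources; the real list is whatever the service supports.
KNOWN_SOURCES = ("ima", "ms")


def sources_except(excluded, known=KNOWN_SOURCES):
    """Turn 'everything except these' into an explicit include list."""
    excluded = set(excluded)
    unknown = excluded - set(known)
    if unknown:
        raise ValueError(f"unknown sources: {sorted(unknown)}")
    # Preserve the order of `known` so results are deterministic.
    return [s for s in known if s not in excluded]
```

With a short list of known sources, a client that wants "all but MediaSearch" can compute the include list itself, so the API does not need a separate exclude parameter.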

This could be added if necessary, but it seems likely that our number of sources will forever be small enough that it is not an imposition on clients to enumerate all desired sources.

I'd prefer that we enumerate all image sources rather than the suggestion source (ima, ms). Does this alter the estimate, @BPirkle?

Can you clarify what you mean by "enumerate all image sources rather than suggestion source"? I'm not sure I understand what distinction you're trying to make.

If you're just saying that you'd like clients to be able to specify multiple sources, that should be fine with no change to the estimate.

Yes, what I mean is to be able to pass in the following parameters:

  • source="Wikidata"
  • source="Commons"
  • source="MediaSearch"

as well as have the ability to concatenate them together.
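
One way the concatenated form could be parsed on the server side is sketched below. The comma delimiter, the case-insensitive matching, and the function name `parse_sources` are all assumptions; the comment above only names the three values and asks that they can be combined.

```python
# Canonical spellings taken from the list above, keyed by lowercase form.
ALLOWED_SOURCES = {
    "wikidata": "Wikidata",
    "commons": "Commons",
    "mediasearch": "MediaSearch",
}


def parse_sources(raw):
    """Parse e.g. 'Wikidata,Commons' into a de-duplicated list of sources.

    The comma delimiter is an assumption, not part of the proposal.
    """
    result = []
    for part in raw.split(","):
        key = part.strip().strip('"').lower()
        if not key:
            continue  # tolerate empty segments such as a trailing comma
        if key not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {part.strip()!r}")
        name = ALLOWED_SOURCES[key]
        if name not in result:
            result.append(name)
    return result
```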

I find it confusing to mix the two interpretations of "source" in one parameter.

MediaSearch searches Commons. So does the Image Matching Algorithm. Expecting callers to know that a source of "Commons" means "I want suggestions that the Image Matching Algorithm found in Commons but not suggestions that MediaSearch found in Commons" seems unrealistic.

We're all fresh on this right now because we're in the middle of the project, but I can imagine that a couple of years from now we'd all forget some of those distinctions. And what happens when we someday add a third or fourth data source that also ultimately draws from the same underlying data?

In retrospect, filtering by "source" parameter even with choices of just "Algorithm" or "MediaSearch" feels a bit like exposing implementation details. In theory, this seems like a job for the Confidence Rating. Using Confidence Rating instead of Source would allow clients to state expectations about the quality of results they receive without (1) having to know implementation details and (2) tightly coupling themselves to an underlying implementation that might change in the future. And if a client receives results of "High" quality, why does it care where the result came from?
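
As a sketch of what filtering on Confidence Rating could look like, the snippet below keeps only suggestions at or above a requested rating. The `confidence_rating` field and the values "high" and "medium" come from the example response in the description; the "low" level, the ordering, and the function name are assumptions.

```python
# Ordering of ratings; "low" is assumed, the spec only shows "medium" and "high".
CONFIDENCE_ORDER = {"low": 0, "medium": 1, "high": 2}


def filter_by_confidence(suggestions, minimum):
    """Keep suggestions whose confidence_rating is at least `minimum`."""
    threshold = CONFIDENCE_ORDER[minimum.lower()]
    return [
        s for s in suggestions
        if CONFIDENCE_ORDER[s["confidence_rating"].lower()] >= threshold
    ]


cat_suggestions = [
    {"filename": "File:Striped_Cat.jpg", "source": "Wikidata",
     "confidence_rating": "high"},
    {"filename": "File:Spotted_Cat.jpg", "source": "Commons Category",
     "confidence_rating": "medium"},
]
high_only = filter_by_confidence(cat_suggestions, "high")
```

Here the caller never mentions where a suggestion came from, which is the decoupling argued for above.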

However, I understand that at this stage in the project, clients want finer-grained control for research and tuning. So if we really want and need clients to be able to filter individual sources (Algorithm vs MediaSearch) at a fine-grained level (Wikidata vs Commons), I suggest we break that out into separate parameters specific to each source. So the "source" parameter would let clients choose between Algorithm and MediaSearch, while some other yet-to-be-named parameters let clients specify details about how the Algorithm or the MediaSearch results are filtered.

I'd also be happier if we came up with a different word rather than reusing "source" to mean two different things, even in our casual discussion. I dislike being critical without offering a suggestion, but I'm struggling to come up with good names. My best current suggestions are:

  • origin
  • provider
  • supplier

Anyone have better ideas?

In that nomenclature, the Image Matching Algorithm and MediaSearch might be "providers" (or whatever word we pick), and the Image Matching Algorithm might draw its data from "sources" like Wikidata and Commons.
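
To illustrate the two-level nomenclature, here is a small sketch that keeps "provider" and "source" as separate fields and separate filters. The word "provider" is only one of the candidate names above, and the class, the function, and the third filename are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Suggestion:
    filename: str
    provider: str  # who produced the suggestion: "ima" or "ms"
    source: str    # where the provider found the image, e.g. "Commons"


def select(suggestions: Iterable[Suggestion],
           provider: Optional[str] = None,
           sources: Optional[Set[str]] = None) -> List[Suggestion]:
    """Filter by provider and, independently, by underlying source."""
    picked = []
    for s in suggestions:
        if provider is not None and s.provider != provider:
            continue
        if sources is not None and s.source not in sources:
            continue
        picked.append(s)
    return picked


examples = [
    Suggestion("File:Striped_Cat.jpg", "ima", "Wikidata"),
    Suggestion("File:Spotted_Cat.jpg", "ima", "Commons"),
    Suggestion("File:Tabby_Cat.jpg", "ms", "Commons"),  # made-up filename
]
ima_from_commons = select(examples, provider="ima", sources={"Commons"})
```

Because both providers can draw on Commons, a single "Commons" filter is ambiguous, whereas the pair (provider, source) is not.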

Change 669282 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/services/image-suggestion-api@master] Add image source parameter

https://gerrit.wikimedia.org/r/669282

Change 669282 merged by jenkins-bot:
[mediawiki/services/image-suggestion-api@master] Add image source parameter

https://gerrit.wikimedia.org/r/669282