
Provide which wiki an image suggestion is found on
Open, Needs Triage, Public, 3 Estimated Story Points

Description

Context

Android wants the ability to provide the reason for an image suggestion. For example, see the mockup at the source below.

source: https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/Android/Add_an_image_MVP

As the lower part of the image suggestion describes:

Suggestion reason: Used in the same article on another language Wikipedia: German

The algo provides this in the raw data from the note column. There have been requests to change things like jawiki to Japanese Wikipedia.
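As a rough illustration of the jawiki to Japanese Wikipedia request, here is a minimal Python sketch. The `WIKI_LANGUAGE_NAMES` table and the `wiki_display_name` helper are hypothetical names invented for this example, and the table holds only a few entries; a real implementation would need a complete language-name source.

```python
# Hypothetical lookup table with a handful of entries, for illustration only.
WIKI_LANGUAGE_NAMES = {
    "ja": "Japanese",
    "de": "German",
    "cs": "Czech",
}


def wiki_display_name(wiki_id: str) -> str:
    """Turn a wiki ID such as 'jawiki' into 'Japanese Wikipedia'.

    Falls back to the raw ID when the language code is not in the table
    or the ID does not end in 'wiki'.
    """
    suffix = "wiki"
    if not wiki_id.endswith(suffix):
        return wiki_id
    code = wiki_id[: -len(suffix)]
    language = WIKI_LANGUAGE_NAMES.get(code)
    return f"{language} Wikipedia" if language else wiki_id


print(wiki_display_name("jawiki"))
```

Falling back to the raw ID keeps the output usable for wikis missing from the table.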

Acceptance Criteria
  • Given I have made a request to the Image Suggestion API, I expect to receive a found_on for each image suggestion.
Example Response
[
  {
    "page": "Cat",
    "suggestions": [
      {
        "filename": "Cheetah.jpg",
        "source": "Wikipedia",
        "found_on": [
          "cswiki",
          "nlwiki",
          "zhwiki",
          "azbwiki",
          "dewiki",
          "viwiki"
        ],
        "confidence_rating": "string"
      }
    ]
  }
]
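A consumer of the example response might read found_on roughly as follows. This is a sketch only: the `RESPONSE` string is a shortened, syntactically valid copy of the example, and `found_on_by_suggestion` is a hypothetical helper name, not part of the API.

```python
import json

# Shortened copy of the example response, with valid JSON syntax.
RESPONSE = """
[
  {
    "page": "Cat",
    "suggestions": [
      {
        "filename": "Cheetah.jpg",
        "source": "Wikipedia",
        "found_on": ["cswiki", "nlwiki"],
        "confidence_rating": "string"
      }
    ]
  }
]
"""


def found_on_by_suggestion(response_text):
    """Map (page, filename) to the list of wikis the image was found on.

    A missing found_on key is treated as an empty list.
    """
    result = {}
    for page in json.loads(response_text):
        for suggestion in page.get("suggestions", []):
            key = (page["page"], suggestion["filename"])
            result[key] = suggestion.get("found_on", [])
    return result


print(found_on_by_suggestion(RESPONSE))
```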
Open Questions
Out of scope
  • MediaSearch does not report which wikis a suggested image was found on

Event Timeline

@sdkim I don't think passing back the raw data from the note column in the API response is as useful as it might be. This will be consumed by Android and our bot writers, and will need to be parsed/interpreted. Rather than doing multiple implementations of the parsing/interpreting code it'd be better for the PET (or even research) to do this upstream, and to return something like "found_on": [ "enwiki", "frwiki", ... ]

Hi @Cparle, "found_on" is a good idea. Would it also be possible to add a "found_on_filter" to choose specific wikis?
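The semantics of "found_on_filter" are not specified in this thread. One possible reading, sketched in Python below, keeps a suggestion when it was found on at least one of the requested wikis. The function name and the "any match" rule are both assumptions.

```python
def filter_by_found_on(suggestions, wanted_wikis):
    """Keep suggestions found on at least one of the wanted wikis.

    Assumption: 'any match' semantics; suggestions without a found_on
    list never match.
    """
    wanted = set(wanted_wikis)
    return [s for s in suggestions if wanted & set(s.get("found_on", []))]


suggestions = [
    {"filename": "Cheetah.jpg", "found_on": ["cswiki", "dewiki"]},
    {"filename": "Lion.jpg", "found_on": ["viwiki"]},
    {"filename": "Tiger.jpg"},
]
print(filter_by_found_on(suggestions, ["dewiki", "nlwiki"]))
```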

sdkim updated the task description.

@Cparle Does MediaSearch provide this per suggestion? It is available from ImageMatchAlgo, but I want to clarify whether MediaSearch provides it or not.

One nit: the example shows source: Wikidata and found_on: [cswiki, ...].
Information re which wiki an image was found on will only be available for source: wikipedia.
Thus found_on is a property of Wikipedia sources only.

For Wikidata and Commons sources, the reason why an image was chosen is tautological. E.g., an image is chosen by ImageMatchAlgo because:

  • image was in the Wikidata item
  • image was selected at random from the Commons category linked in the Wikidata item
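To make the per-source reasons concrete, here is a small Python sketch that derives a display string from a suggestion. It assumes the "source" and "found_on" keys from the example response; `suggestion_reason` is a hypothetical helper, and the wording of the strings is taken from this task rather than from any shipped code.

```python
def suggestion_reason(suggestion):
    """Return a human-readable reason for a suggestion, based on its source.

    Only Wikipedia sources carry found_on; the other reasons are fixed.
    """
    source = suggestion.get("source", "").lower()
    if source == "wikipedia":
        wikis = ", ".join(suggestion.get("found_on", []))
        return f"Used in the same article on another language Wikipedia: {wikis}"
    if source == "wikidata":
        return "Image was in the Wikidata item"
    if source == "commons":
        return (
            "Image was selected at random from the Commons category "
            "linked in the Wikidata item"
        )
    return "Unknown source"


print(suggestion_reason({"source": "Wikipedia", "found_on": ["dewiki"]}))
```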

Does MediaSearch provide this per suggestion?

No. MediaSearch returns no data about whether an image is used on a wiki, and it's not on our roadmap to provide it. We could investigate if you need us to, but there will very likely be a performance cost.

@Dbrant are Android's needs met with the example response above? I vaguely remember you asking about a language code of some sort

@sdkim Yep, that looks good in the Wikipedia case. And if I understand correctly, if the image comes from the Wikidata entity, the source would be Wikidata? (with an empty or nonexistent found_on list?)

And if I understand correctly, if the image comes from the Wikidata entity, the source would be Wikidata? (with an empty or nonexistent found_on list?)

I think so. But right now we have source enums for "ima" or "ms", which might need to better reflect this.
We could possibly consider an image_source (Wikipedia, Wikidata, Commons) and an algorithm_source (ImageMatchAlgo, MediaSearch), but we need to discuss this with the team.

sdkim set the point value for this task to 3. (Thu, Mar 18, 4:14 PM)
sdkim renamed this task from Provide the reason for an image suggestion to Provide which wiki an image suggestion is found on. (Wed, Mar 24, 2:58 PM)

We continue to have some confusion surrounding the word "source". We are currently using it both to distinguish the Algorithm from MediaSearch and to specify how the Algorithm identified a suggestion.

I strongly feel we need to change our language to disambiguate between these two meanings. What about:

  • suggestion_source: Image Matching Algorithm vs MediaSearch
  • image_source: Wikipedia/Wikidata/Commons

I like "suggestion_source" better than the "algorithm_source" proposed above, because "algorithm" is already an overloaded word to mean both the common usage of the word (a procedure for solving a problem) and as a shorthand name for Image Matching Algorithm. Plus we're already in a "suggestion" block in the response data in the spot where this piece of data is specified. With that said, I'll be agreeable to any term that disambiguates the two meanings of "source".
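One way to keep the two meanings apart in code is to give each its own enumerated type. The Python sketch below assumes the values "ima"/"ms" and Wikipedia/Wikidata/Commons mentioned in this thread. The class and function names are hypothetical, and making image_source optional is an assumption, since the thread does not say what it would hold for MediaSearch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SuggestionSource(Enum):
    IMA = "ima"  # Image Matching Algorithm
    MS = "ms"    # MediaSearch


class ImageSource(Enum):
    WIKIPEDIA = "Wikipedia"
    WIKIDATA = "Wikidata"
    COMMONS = "Commons"


@dataclass(frozen=True)
class Sources:
    suggestion_source: SuggestionSource
    # Assumption: may be absent, e.g. for MediaSearch suggestions.
    image_source: Optional[ImageSource] = None


def parse_sources(suggestion_source, image_source=None):
    """Validate both fields; an unknown value raises ValueError."""
    return Sources(
        SuggestionSource(suggestion_source),
        ImageSource(image_source) if image_source is not None else None,
    )


print(parse_sources("ima", "Wikipedia"))
```

With separate types, passing "Wikipedia" where "ima" or "ms" is expected fails immediately instead of being silently accepted.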

The task description currently says:

[
  {
    "page": "Cat",
    "suggestions": [
      {
        "filename": "Cheetah.jpg",
        "source": "Wikipedia",
        "found_on": [
          "cswiki",
          "nlwiki",
          "zhwiki",
          "azbwiki",
          "dewiki",
          "viwiki"
        ],
        "confidence_rating": "string"
      }
    ]
  }
]

The "source": "Wikipedia" seems out of place. This should be either "ima" or "ms" (or if we prefer different terms for specifying Image Matching Algorithm vs MediaSearch I'm okay with renaming those). But this field should not be used to specify anything related to the internal details of either of those things. That should be in its own nested block. What about:

[
  {
    "page": "Cat",
    "suggestions": [
      {
        "filename": "Cheetah.jpg",
        "suggestion_source": "ima",
        "confidence_rating": "string",
        "details": {
          "image_source": "Wikipedia",
          "found_on": [
            "cswiki",
            "nlwiki",
            "zhwiki",
            "azbwiki",
            "dewiki",
            "viwiki"
          ]
        }
      }
    ]
  }
]

In the above example, the format of the "details" block would be dependent on the "suggestion_source" value. Different suggestion sources might have very different available details.
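A consumer could handle a source-dependent "details" block with a small dispatch table keyed on "suggestion_source". The Python sketch below is illustrative only: the parser names are hypothetical, and because the thread does not define any MediaSearch details, that parser just passes the block through.

```python
def parse_ima_details(details):
    """Details for Image Matching Algorithm suggestions."""
    return {
        "image_source": details.get("image_source"),
        "found_on": list(details.get("found_on", [])),
    }


def parse_ms_details(details):
    """MediaSearch details are unspecified here; pass them through as-is."""
    return dict(details)


# One parser per suggestion source.
DETAIL_PARSERS = {
    "ima": parse_ima_details,
    "ms": parse_ms_details,
}


def parse_details(suggestion):
    """Pick the parser that matches the suggestion's suggestion_source."""
    source = suggestion["suggestion_source"]
    parser = DETAIL_PARSERS.get(source)
    if parser is None:
        raise ValueError(f"unknown suggestion_source: {source!r}")
    return parser(suggestion.get("details", {}))


example = {
    "filename": "Cheetah.jpg",
    "suggestion_source": "ima",
    "details": {"image_source": "Wikipedia", "found_on": ["cswiki", "nlwiki"]},
}
print(parse_details(example))
```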

Alternatively, we could transform suggestion_source from a string field into a block with the suggestion source name and a nested variable-format details sub-object.

[
  {
    "page": "Cat",
    "suggestions": [
      {
        "filename": "Cheetah.jpg",
        "confidence_rating": "string",
        "suggestion_source": {
          "name": "ima",
          "details": {
            "image_source": "Wikipedia",
            "found_on": [
              "cswiki",
              "nlwiki",
              "zhwiki",
              "azbwiki",
              "dewiki",
              "viwiki"
            ]
          }
        }
      }
    ]
  }
]
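The two proposed shapes carry the same information, so one can be derived from the other. As a sketch, the hypothetical Python helper below converts a suggestion from the flat form (string "suggestion_source" plus sibling "details") into the nested form shown above.

```python
def nest_details(suggestion):
    """Convert the flat shape into the nested suggestion_source shape.

    All other keys (filename, confidence_rating, ...) are copied unchanged.
    """
    nested = {
        key: value
        for key, value in suggestion.items()
        if key not in ("suggestion_source", "details")
    }
    nested["suggestion_source"] = {
        "name": suggestion["suggestion_source"],
        "details": suggestion.get("details", {}),
    }
    return nested


flat = {
    "filename": "Cheetah.jpg",
    "suggestion_source": "ima",
    "confidence_rating": "string",
    "details": {"image_source": "Wikipedia", "found_on": ["cswiki"]},
}
print(nest_details(flat))
```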

Great points @BPirkle . I agree that we should be explicit about the suggestion source and image source.

[
  {
    "page": "Cat",
    "suggestions": [
      {
        "filename": "Cheetah.jpg",
        "suggestion_source": "ima",
        "confidence_rating": "string",
        "details": {
          "image_source": "Wikipedia",
          "found_on": [
            "cswiki",
            "nlwiki",
            "zhwiki",
            "azbwiki",
            "dewiki",
            "viwiki"
          ]
        }
      }
    ]
  }
]

I am personally a fan of this one, but would be interested to hear thoughts from @Cparle and @Dbrant.

Definitely agree about untangling "image source" from "suggestion source", and any of the proposed structures would work perfectly well for us, but I might actually lean towards @BPirkle's last suggested structure, which IMO is the most semantically accurate, i.e. putting source-specific details in the actual structure of the suggestion source.

I like the last one better as well. Considering the discussion in T277190: Return results in a randomized deterministic way, I'd actually prefer to wrap the entire response in a containing object, so we'd have somewhere to put the seed value, and any other fields we think of in the future. So something like:

{
  "seed": 12345,
  "pages": [
    {
      "page": "Cat",
      "suggestions": [
        {
          "filename": "Cheetah.jpg",
          "suggestion_source": "ima",
          "confidence_rating": "string",
          "details": {
            "image_source": "Wikipedia",
            "found_on": [
              "cswiki",
              "nlwiki",
              "zhwiki",
              "azbwiki",
              "dewiki",
              "viwiki"
            ]
          }
        },
        {
          <another suggestion>
        }
      ]
    },
    {
      <another page>
    }
  ]
}
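With the wrapping object, a client reads the seed from the top level and walks "pages" instead of a bare list. A minimal Python sketch, assuming the field names from the example above (`iter_suggestions` is a hypothetical helper, and the sample data is shortened):

```python
import json

# Shortened sample in the wrapped shape, with the placeholders removed.
WRAPPED = """
{
  "seed": 12345,
  "pages": [
    {
      "page": "Cat",
      "suggestions": [
        {
          "filename": "Cheetah.jpg",
          "suggestion_source": "ima",
          "confidence_rating": "string",
          "details": {"image_source": "Wikipedia", "found_on": ["cswiki"]}
        }
      ]
    }
  ]
}
"""


def iter_suggestions(response):
    """Yield (page title, suggestion) pairs from a wrapped response."""
    for page in response.get("pages", []):
        for suggestion in page.get("suggestions", []):
            yield page["page"], suggestion


response = json.loads(WRAPPED)
print(response["seed"])
for title, suggestion in iter_suggestions(response):
    print(title, suggestion["filename"], suggestion["details"]["found_on"])
```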

However, @Cparle said this in T277190:

I've been telling people that the format we have is fixed, and we'll be versioning changes, so I guess we should bump the version if we're changing the format

I had thought that v0 implied unstable, but I can only find discussion on that, not anywhere it was actually agreed to. And even that was related to the API Gateway, and we're not (yet) exposing the image suggestions service there. So maybe bumping the version is the right thing to do.

Who is actually hitting the service right now, and how much effect would a change have on them? Would we need to maintain (at least for a transition period) a v0 endpoint that produces the current format in addition to a v1 endpoint with the new format? A transition period is doable, but would be a bit more coding/testing than just switching to a new v1 endpoint.
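For a sense of what a transition period would involve, here is a rough Python sketch of serving both formats from the same data. It makes several assumptions: the version strings "v0" and "v1", the renderer names, and the simplification that v0 is just the bare list of pages while v1 is the wrapped object with a seed. The actual field-level differences between the formats are not covered.

```python
def render_v0(seed, pages):
    """Assumed v0 shape: a bare list of pages, with no place for the seed."""
    return pages


def render_v1(seed, pages):
    """Assumed v1 shape: a wrapping object carrying the seed."""
    return {"seed": seed, "pages": pages}


# One renderer per supported API version during the transition.
RENDERERS = {"v0": render_v0, "v1": render_v1}


def render(version, seed, pages):
    """Format the same data for the requested API version."""
    try:
        renderer = RENDERERS[version]
    except KeyError:
        raise ValueError(f"unsupported API version: {version!r}") from None
    return renderer(seed, pages)


pages = [{"page": "Cat", "suggestions": []}]
print(render("v0", 12345, pages))
print(render("v1", 12345, pages))
```

Dropping v0 after the transition would then amount to removing one entry from the table.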

Who is actually hitting the service right now, and how much effect would a change have on them? Would we need to maintain (at least for a transition period) a v0 endpoint that produces the current format in addition to a v1 endpoint with the new format?

Let me check with our bot writers ...

Who is actually hitting the service right now, and how much effect would a change have on them?

I talked to our test bot writers, and they're ok with us changing the format

Change 677067 had a related patch set uploaded (by BPirkle; author: BPirkle):

[mediawiki/services/image-suggestion-api@master] Deterministic randomized image suggestion results

https://gerrit.wikimedia.org/r/677067

Change 677313 had a related patch set uploaded (by BPirkle; author: BPirkle):

[mediawiki/services/image-suggestion-api@master] Adjust sqlite schema and code for found_on column in .tsv files

https://gerrit.wikimedia.org/r/677313

Change 677313 merged by jenkins-bot:

[mediawiki/services/image-suggestion-api@master] Adjust sqlite schema and code for found_on column in .tsv files

https://gerrit.wikimedia.org/r/677313

Change 677067 merged by jenkins-bot:

[mediawiki/services/image-suggestion-api@master] Deterministic randomized image suggestion results

https://gerrit.wikimedia.org/r/677067

Change 678970 had a related patch set uploaded (by BPirkle; author: BPirkle):

[mediawiki/services/image-suggestion-api@master] Return found_on data with image matching algorithm suggestions

https://gerrit.wikimedia.org/r/678970

Change 678970 merged by jenkins-bot:

[mediawiki/services/image-suggestion-api@master] Return found_on data with image matching algorithm suggestions

https://gerrit.wikimedia.org/r/678970