
For a given page, list the image suggestions
Closed, Resolved · Public · 3 Estimated Story Points

Description

User Story
As a mobile reader (familiar with editing on my device)
When I am reading an article with no images
I want to see any image(s) that could be used to illustrate the article,
so that I can both gain a better understanding of the topic and contribute to helping others who read the article in the future.

Acceptance Criteria

  • Given I have made a request with a single page ID (1234), I expect to see the image suggestions for that page
  • Given I have made a request with up to 250 page IDs (1234, 5678, ...), I expect to see the list of pages and their respective image suggestions
  • Include page_id as part of the response payload
  • Only required for ImageMatchAlgo, and not MediaSearch

Hygiene

  • API Documentation is published and viewable
  • Tests related to ACs are written and pass
Example Request/Response

Request
GET /image-suggestions/v0/wikipedia/en/pages?id=1234,5678

Response
HTTP 200

{
  "pages": [
    {
      "project": "enwiki",
      "page_id": "1234",
      "page": "Cat",
      "suggestions": [
        {
          "filename": "Mountain Lion.jpg",
          "confidence_rating": "low",
          "source": {
            "name": "ima",
            "details": {
            }
          }
        },
        {
          "filename": "Bengal Tiger.jpg",
          "confidence_rating": "low",
          "source": {
            "name": "ima",
            "details": {
            }
          }
        }
      ]
    },
    {
      "project": "enwiki",
      "page_id": "5678",
      "page": "Dog",
      "suggestions": [
        {
          "filename": "Poodle.jpg",
          "confidence_rating": "low",
          "source": {
            "name": "ima",
            "details": {
            }
          }
        },
        {
          "filename": "Shiba Inu.jpg",
          "confidence_rating": "low",
          "source": {
            "name": "ima",
            "details": {
            }
          }
        }
      ]
    }
  ]
}
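
For illustration, here is a minimal client sketch in Python (using the requests library) against the endpoint shape shown above. The wmcloud.org base URL is the proof-of-concept instance referenced later in this thread; the helper name and the "ima" source filter are illustrative assumptions, not an official client.

# Minimal client sketch; assumes the response shape from the example
# above. get_suggestions() is a hypothetical helper name.
import requests

BASE = "https://image-suggestion-api.wmcloud.org/image-suggestions/v0"

def get_suggestions(lang, page_ids):
    """Fetch image suggestions for a batch of page IDs on one wiki."""
    resp = requests.get(
        f"{BASE}/wikipedia/{lang}/pages",
        params={"id": ",".join(str(i) for i in page_ids)},
    )
    resp.raise_for_status()
    result = {}
    for page in resp.json().get("pages", []):
        # Keep only Image Match Algo suggestions (source name "ima"),
        # per the acceptance criteria above.
        result[page["page_id"]] = [
            s["filename"]
            for s in page.get("suggestions", [])
            if s.get("source", {}).get("name") == "ima"
        ]
    return result

print(get_suggestions("en", [1234, 5678]))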

Open Questions

  • Are we able to query suggestions by page_id or just page_title?
    • Answer: both are available.
    • If title were the only option, should we require the request to have a payload of pages, or keep them inline as a URI parameter?
  • Is image information like original_size or thumbnail currently available via the dataset?
    • Answer: no.
  • Which wikis will this first solution cover for growth?
    • Arabic
    • Czech
    • Vietnamese
    • Bengali
    • Spanish
    • Portuguese
    • Persian
    • Turkish

Subtasks

Event Timeline

sdkim triaged this task as Low priority.

This is the /suggestions/ endpoint from the original API spec, right?

sdkim renamed this task from "For a Given Page, List the Image Recommendations" to "For a given page, list the image suggestions". Jun 17 2021, 1:43 PM
sdkim raised the priority of this task from Low to Medium.
sdkim updated the task description. (Show Details)

> Are we able to query suggestions by page_id or just page_title?

The tsv files have both. MediaSearch responses aren't needed for this, right?

> Is image information like original_size or thumbnail currently available via the dataset?

No. All we currently have available is filename. There are several options for providing additional information, one of which is for the client to directly call one of the existing APIs for this information.
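
As a sketch of that last option, a client could resolve a suggested filename to a thumbnail URL and original size via the existing MediaWiki imageinfo API on Commons. This is only one of the "several options" mentioned above, and the 320px thumbnail width is an arbitrary illustrative choice.

# Sketch: look up thumbnail/original size for a suggested filename via
# the Commons imageinfo API (one possible client-side approach).
import requests

def image_info(filename, thumb_width=320):
    resp = requests.get(
        "https://commons.wikimedia.org/w/api.php",
        params={
            "action": "query",
            "titles": f"File:{filename}",
            "prop": "imageinfo",
            "iiprop": "url|size",
            "iiurlwidth": thumb_width,
            "format": "json",
        },
    )
    resp.raise_for_status()
    # Assumes the file exists on Commons; missing files have no
    # "imageinfo" key and would need handling in real code.
    page = next(iter(resp.json()["query"]["pages"].values()))
    info = page["imageinfo"][0]
    return {"thumbnail": info["thumburl"],
            "original_size": (info["width"], info["height"])}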

> The tsv files have both. MediaSearch responses aren't needed for this, right?

Correct, no MS responses required. Only Image Match Algo.

@Tgr
In terms of exposing the model_version, we only have a dataset_id (column) that allows us to re-trace a dataset to a given training run. We don’t currently expose model version.

Will this suffice?

> @Tgr
> In terms of exposing the model_version, we only have a dataset_id (column) that allows us to re-trace a dataset to a given training run. We don’t currently expose model version.
>
> Will this suffice?

Yes, I think for practical purposes that's the same thing as a model version.

(Also, the model_version field is not a strict requirement for the initial version, I just assumed that it will be easy to provide. If that's not the case, let's just omit it.)

The task description mentions requests for pages cat and dog, but the example request is for pages 1,2. I'm assuming that 1,2 represents page ids and not page names (https://en.wikipedia.org/wiki/1 and https://en.wikipedia.org/wiki/2 both exist), and that we'd prefer to request by page id rather than page title. While we can request by either, page id is more compact and we already have a database index on it.

The task description mentions the ability to ask for n pages, but does not place an upper bound on n. I'm assuming that actual anticipated usage is for at most dozens of pages and not hundreds or more.

Please let me know if I'm assuming incorrectly about either of these things.

sdkim updated the task description. (Show Details)
sdkim updated the task description. (Show Details)
sdkim set the point value for this task to 3.

Change 701222 had a related patch set uploaded (by BPirkle; author: BPirkle):

[mediawiki/services/image-suggestion-api@master] Allow requesting suggestions for a specific page set via an id param

https://gerrit.wikimedia.org/r/701222

> The task description mentions requests for pages cat and dog, but the example request is for pages 1,2. I'm assuming that 1,2 represents page ids and not page names (https://en.wikipedia.org/wiki/1 and https://en.wikipedia.org/wiki/2 both exist), and that we'd prefer to request by page id rather than page title. While we can request by either, page id is more compact and we already have a database index on it.

Correct - it's possible to use both on the client side as well, but I figured a more compact option would be nicer, especially if we allow requesting multiple pages at the same time. (I guess the counterargument would be that titles are more comfortable for manual testing/debugging.)

> The task description mentions the ability to ask for n pages, but does not place an upper bound on n. I'm assuming that actual anticipated usage is for at most dozens of pages and not hundreds or more.

GrowthExperiments requests tasks in batches of 250 so supporting batches of that magnitude would be the most convenient for us. We can of course use paging or some other workaround if that's problematic.

> GrowthExperiments requests tasks in batches of 250 so supporting batches of that magnitude would be the most convenient for us. We can of course use paging or some other workaround if that's problematic.

Thanks. 250 should be fine. I'll confirm that once the first-cut implementation is done. If we were using page titles, I'd be concerned that 250 long titles would cause problems with URL length, but with page IDs it should be okay.
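
(As a sketch, batching on the client side under that 250-ID limit could look like the following, reusing the hypothetical get_suggestions() helper from the earlier sketch.)

# Sketch: split a large ID set into batches of at most 250, the
# GrowthExperiments batch size discussed above.
def get_suggestions_batched(lang, page_ids, batch_size=250):
    result = {}
    for i in range(0, len(page_ids), batch_size):
        result.update(get_suggestions(lang, page_ids[i:i + batch_size]))
    return result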

@Tgr -- as Platform works on this new version of the API to support our first iteration of image suggestions on the web, they say it would be helpful for you to spend some time with the previous proof-of-concept API to provide any feedback on performance or the format of the data that comes back.

Could you please go through these things and reply with any notes about issues you foresee for the API being built in this task, since the proof-of-concept API and the upcoming one will share a lot of functionality?

From a couple of manual tests for enwiki, the API seems to take around 0.5s for 1 suggestion and 1.5s for 100 (it can be a bit slower when including MediaSearch results, but that seems to be using a slower query). We'd request 250 at a time, so assuming linear-ish complexity, it might take about 3s. That's a bit slower than the current API (I don't think we measure backend performance, but end to end the API call takes around 1s) and maybe too slow to make the user wait for it, so we'll probably have to make our data fetching logic smarter (which we were considering doing anyway - the relevant task is T260522: Optimize number of results requested from API).
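
(Spelling out that linear-ish estimate: the two data points give a marginal cost of roughly (1.5s − 0.5s) / (100 − 1) ≈ 0.01s per page on top of a ~0.5s fixed cost, so 250 pages ≈ 0.5s + 250 × 0.01s ≈ 3s.)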

I don't think we have any particular expectation about the data format - we can work with whatever.

The current iteration lives in Wikimedia Cloud, would that be the case for the next iteration as well?

> The current iteration lives in Wikimedia Cloud, would that be the case for the next iteration as well?

Yes (for some value of "next iteration").

We have had discussions about switching to production-level hosting. Also, we are considering the current sqlite storage a placeholder, and expect a change to a more performant alternative, likely Cassandra, in the future. My current expectation is that the time frame for both these things (hosting and storage changes) is measured in months or even quarters, not weeks.

Change 701222 merged by jenkins-bot:

[mediawiki/services/image-suggestion-api@master] Allow requesting suggestions for a specific page set via an id param

https://gerrit.wikimedia.org/r/701222

Thanks for setting up the API! As far as I can tell, it works fine for our purposes. Some minor comments/questions:

  • Limit seems to be interpreted strangely. E.g. https://image-suggestion-api.wmcloud.org/image-suggestions/v0/wikipedia/en/pages?seed=7947&limit=15 -> this returns three pages, two with 10 images each and one with 1 image, so neither the number of images nor the number of pages is 15. Is this because of the filtering of pages with no results? We won't use the "random mode" of the API and will always specify the page IDs, so this doesn't affect us; I'm just curious.
  • Not relevant for now, but if the API will live on some standard Wikimedia domain, X-Wikimedia-Debug should be added to Access-Control-Allow-Headers for testing changes during deployment (see e.g. T252826: Whitelist x-wikimedia-debug header field (currently not allowed by Access-Control-Allow-Headers in preflight response)).
  • Images for a single page seem to be capped at 10, is that correct?
  • When specifying 200-something IDs, a request seems to take around three seconds (a bit higher with load, e.g. with ab -lk -c5 it went up to six; single-ID requests took about 300ms with no load and 500-600ms with higher load). That should be OK for the first iteration as the big request is cached, and the single-ID requests will be sent during VisualEditor initialization, which dominates setup time anyway. Do you have any idea whether that performance is going to stay relatively uniform (apart from the potential effect of high load, although that's pretty unlikely during the pilot) or are there other things that could influence it negatively? (Maybe some kind of WMCS traffic that's not related to this API? Not sure how much the VM is isolated from others, performance-wise.)

Hey @Tgr!

  1. Yeah, you're right! It's because we filter out pages without results. Some of the pages in the algorithm results have no suggestions, and when we hit MediaSearch, if nothing still comes back, we just end up removing the page from the result set. More details on that here
  2. Thanks for the heads up!! We'll make a note of that to change if we move domains.
  3. Yes, images for a single page are capped at 10. Is there a need for more than 10?
  4. Performance optimizations were something we put on the back burner for the time being, since it was supposed to be (and still technically is?) a POC. There are a couple of things that are probably affecting the performance: one is that the db is still an in-memory sqlite store; we could also gauge how much memory/storage is on the VM this thing is running on, etc.

@sdkim or @BPirkle may have a better idea on how we're going to continue working on the API, namely making sure it's performant enough!

Thanks @nnikkhoui!

@RHo might be the best person to confirm for sure but I think a cap of 10 images recommended per page is fine.

The current performance is also fine for Iteration 1, I was just wondering if we can expect it to mostly stay like it is now or whether it might get significantly worse occasionally for reasons neither team has control over - again, I don't know much about how well Cloud VPS VMs are isolated from things happening with other VMs in the same cluster.

> Thanks @nnikkhoui!
>
> @RHo might be the best person to confirm for sure but I think a cap of 10 images recommended per page is fine.
>
> The current performance is also fine for Iteration 1, I was just wondering if we can expect it to mostly stay like it is now or whether it might get significantly worse occasionally for reasons neither team has control over - again, I don't know much about how well Cloud VPS VMs are isolated from things happening with other VMs in the same cluster.

Thanks for clarifications @nnikkhoui! Yes, I can confirm what @Tgr is saying, 10 images per page is more than enough, esp. for Iteration 1.

Also, @nnikkhoui, I can confirm that 10 images per page is the explicit business rule we set for the algorithm. So I think we're good. Thank you!

The performance seems to have significantly degraded recently:

$ ab -lk -c1 -n25 'https://image-suggestion-api.wmcloud.org/image-suggestions/v0/wikipedia/en/pages?source=ima&id=426563'
Concurrency Level:      1
Time taken for tests:   139.821 seconds
Requests per second:    0.18 [#/sec] (mean)
Time per request:       5592.843 [ms] (mean)

Percentage of the requests served within a certain time (ms)
  50%   5511
  66%   5745
  75%   5753
  80%   5780
  90%   5902
  95%   5957
  98%   6452
  99%   6452
 100%   6452 (longest request)

That's about 20x slower than the tests in T273236#7254476. Any idea what happened?
(By the way, are there any statsd or similar metrics available for the Image Suggestion API? We'd like to integrate some API performance metrics into our dashboard if possible.)

@Tgr Re: metrics, in my conversations with the cloud team I deduced that there was no way to publish metrics as of now because the API is hosted in a non-prod environment (cloud vps) and not under the deployment-prep project (which I believe is the only non-prod project able to publish metrics).

As for performance, yikes, you're right, that is really slow... It looks like adding source=ima slows down the query significantly. Your previous tests didn't have that param, if I'm seeing correctly. Adding that parameter executes a different SQL query altogether. There may be an optimization or an index we could throw in there to speed things up; I can take a look today!
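
(As a sketch of the kind of index that could help, via Python's sqlite3 module; the database file and table name below are assumptions for illustration, though the column matches the row_num_ima index patch that lands later in this thread.)

# Sketch only: "suggestions.db" and the "suggestions" table name are
# hypothetical. An index on the column the source=ima query filters or
# orders by lets SQLite avoid a full table scan.
import sqlite3

conn = sqlite3.connect("suggestions.db")
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_row_num_ima "
    "ON suggestions (row_num_ima)"
)
conn.commit()
conn.close()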

@Tgr Can you confirm it's more performant now? I think it should be! (for enwiki queries)

> @Tgr Re: metrics, in my conversations with the cloud team I deduced that there was no way to publish metrics as of now because the API is hosted in a non-prod environment (cloud vps) and not under the deployment-prep project (which I believe is the only non-prod project able to publish metrics).

Ack, thanks. We can track it easily on our side, I was just wondering if there's something readily available.

> As for performance, yikes, you're right, that is really slow... It looks like adding source=ima slows down the query significantly. Your previous tests didn't have that param, if I'm seeing correctly.

You are probably right. I don't remember if I used it back then, and I didn't think to record the exact queries.

We only request single titles or page IDs, so we could omit the source filter if that's easier. (It's convenient to filter out suggestions we can't use, but that can be done client-side if needed.) But the API seems lightning fast now, faster by a factor of 5 or so even than without the source filter. Awesome! Thanks a lot for the fix.

Would it be possible to apply those indexes to arwiki, bnwiki, cswiki and viwiki? We use enwiki for testing, but those four will be the actual pilot wikis.

@Tgr After I posted that comment I did end up indexing all the language tables, so those four should be just as fast as well; no need to omit the source filter!

Change 736471 had a related patch set uploaded (by Nikki Nikkhoui; author: Nikki Nikkhoui):

[mediawiki/services/image-suggestion-api@master] Add index to row_num_ima

https://gerrit.wikimedia.org/r/736471

Change 736471 merged by jenkins-bot:

[mediawiki/services/image-suggestion-api@master] Add index to row_num_ima

https://gerrit.wikimedia.org/r/736471

CBogen subscribed.

Closing since we've now moved to using the production API. Please reopen if I'm mistaken.