Public-facing API for image suggestions data
Open, MediumPublic

Description

Request Status: New Request
Request Type: project support request
Related OKRs: Product Platform O: Our platform and processes are ready and able to invite all of the world's population to join us. KR 1: Machines are able to recognize Wikimedia content and suggest relations to other Wikimedia content in at least 2 wikis, enabling experimentation with at least two new strategic features.

Request Title: Public-facing API for image suggestions data

  • Request Description: By the end of June, we will have a pipeline for image suggestions data as well as an internal-only API to access that data (see T304891). We have some users who are interested in building tools using this API. For example, WMPT expressed interest in building a tool that specifically targets campaigns of users with notifications in a particular category. However, without access to the API, users cannot build gadgets or tools or other services using this data. Additionally, a public-facing API will be helpful with QA and testing.
  • Indicate Priority Level:
  • Main Requestors: Structured Data team (@CBogen and @SWakiyama)
  • Ideal Delivery Date: September 1, 2022
  • Stakeholders: Structured Data team, Growth team, interested community members

Request Documentation

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | Phab ticket for the currently planned internal-facing API: T304891
Product One Pager | Yes | <add link here>
Product Requirements Document (PRD) | Yes | PRD for the image suggestions notifications
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | <add link here>
Product Brief | No | <add link here>
Other Links | No | <add links here>

Event Timeline

As an alternative to a public API, we could quite easily provide a flat file containing all image suggestions for any wiki (in .csv format or similar), once we have our data pipeline up and running.

We (Growth-Team) use the current instance (via $wgGEImageRecommendationServiceUrl) for:

  • local development
  • beta cluster

We could go back to mocking the service, but it would be really nice to be able to use the actual production service for debugging/development/QA.

@CBogen do you know if anyone is actively working on this, or when we might hope for this API to be created?

My understanding is that PET does not have the capacity to take this on in this FY, but @lbowmaker can better confirm.

We will focus on supporting Section Topics in the short term; if we get a chance to work on this, we will. I think it's worth doing, but I don't think it's trivial or quick (based on how long it took to get the internal one deployed to k8s).

From what I remember from link recommendation, setting up the external service went pretty quickly:

I don't want to step on anyone's toes, so let me know if I should just wait, but I'm happy to post some patches based on the above and ask Hugh and someone from SRE to comment on them. For starters, I think we don't need to expose the full API, just the GET routes.

Sure, sounds good. Also check with Eric Evans to get his take.

@akosiaris @Eevans @JMeybohm this is a follow-up task from T304891: New Service Request Generated Datasets: Image Suggestions Service.

AIUI, creating a public-facing API for the image-suggestion service would involve the following steps:

Does that sound like an accurate summary? If so, I don't mind putting together some patches, but maybe it is more complicated and I am missing some steps.

The Data Gateway service (currently just image suggestion, but meant to accommodate future use-cases as well) should not be exposed to end-users. It is nothing more than a thin middleware to publish database tables using HTTP semantics; it is only a means of decoupling clients from the underlying database. It falls well below the standards of our other external APIs. For example, it does zero parameter validation (by design), passing everything verbatim to the database. That means any response that does not return status 200 will return status 500 with error codes and verbatim messages from the database (not good).

If we want to publish the results, the current de facto route to doing so would be to create a separate service dedicated to this, one that invokes the Data Gateway service for its results. If this is something that we would do for every dataset published (or even a significant subset), then a reasonable argument could be made that the Gateway service should provide an API suitable for public consumption (to avoid everyone having to implement a separate service). This was discussed briefly (and we can raise the topic again), but when we entertained this, it quickly increased the scope, complexity, and effectiveness of the platform. You can't really go down this path without introducing bespoke (per-service) requirements, and in our model we delegate that work to the teams implementing a product.

TL;DR the way it stands right now, the Data Gateway/image suggestion service is your database. In the absence of it, you'd be making queries to the database directly. You wouldn't expose the database to clients, and the same is true here.
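
To make the "separate service" idea concrete, here is a minimal sketch of a front layer that validates parameters before anything reaches the gateway and never forwards upstream error text to the client. Everything named here is an assumption for illustration: `handle_request`, `PublicResponse`, the `call_gateway` callable, the wiki ID pattern, and the choice of 400/502 status codes are hypothetical and not part of the Data Gateway or any existing service.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical upstream call: (wiki, page_id) -> (HTTP status, response body).
GatewayCall = Callable[[str, int], Tuple[int, str]]

# Assumed shapes for the inputs; adjust to whatever the real service accepts.
WIKI_PATTERN = re.compile(r"[a-z][a-z0-9_]{1,31}")
PAGE_ID_PATTERN = re.compile(r"[1-9][0-9]{0,9}")


@dataclass
class PublicResponse:
    status: int
    body: str


def handle_request(wiki: str, page_id: str, call_gateway: GatewayCall) -> PublicResponse:
    """Validate input, call the gateway, and hide upstream error details."""
    # Reject malformed input before it can be passed verbatim to the database.
    if not WIKI_PATTERN.fullmatch(wiki):
        return PublicResponse(400, '{"error": "invalid wiki"}')
    if not PAGE_ID_PATTERN.fullmatch(page_id):
        return PublicResponse(400, '{"error": "invalid page id"}')

    try:
        status, body = call_gateway(wiki, int(page_id))
    except Exception:
        # Network failures and timeouts become a generic gateway error.
        return PublicResponse(502, '{"error": "upstream unavailable"}')

    if status == 200:
        return PublicResponse(200, body)
    # Any non-200 upstream response (e.g. a 500 carrying a database message)
    # is replaced with a generic body; the original text should be logged
    # server-side rather than returned.
    return PublicResponse(502, '{"error": "upstream error"}')
```

The point of the sketch is only the division of labour: the gateway stays a thin database facade, and the client-facing layer owns validation and error wrapping.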

Got it @Eevans, thanks for writing that out, that makes sense.

OK, then instead of a fully-developed application and Kubernetes service to act as the front-end for the Data Gateway, another thought I had was to consider creating a REST endpoint with the GrowthExperiments extension, and to proxy requests to the Data Gateway via that endpoint.

The endpoint would be something like https://test.wikipedia.org/w/rest.php/growthexperiments/v0/image-suggestion/{pageId}. GrowthExperiments extension could validate the page ID, implement caching, do rate limiting, and wrap errors. Would something like that be acceptable? I think that would satisfy Growth team's main use case in wanting to be able to get image suggestion data from the data gateway in our local development environments, as well as in beta cluster wikis, to facilitate development, testing and QA. It might also meet the needs of some other use cases mentioned in this task description.
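
As a rough illustration of the caching and rate-limiting part of this proposal, here is a language-agnostic sketch written in Python (the real endpoint would live in the GrowthExperiments extension, not in Python). The names `TokenBucket`, `CachingProxy`, and `fetch` are hypothetical, the TTL and rate values are placeholders, and the limiter here throttles calls to the upstream as a whole rather than per client, which is only one of several possible policies.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TokenBucket:
    """Simple token bucket: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CachingProxy:
    """Serve cached suggestions when fresh; otherwise call upstream if allowed."""

    def __init__(self, fetch: Callable[[int], str], ttl: float,
                 limiter: TokenBucket,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch          # hypothetical call to the Data Gateway
        self.ttl = ttl
        self.limiter = limiter
        self.clock = clock
        self.cache: Dict[int, Tuple[float, str]] = {}

    def get(self, page_id: int) -> Tuple[int, Optional[str]]:
        now = self.clock()
        hit = self.cache.get(page_id)
        if hit is not None and hit[0] > now:
            return 200, hit[1]      # fresh cache entry, no upstream call
        if not self.limiter.allow():
            return 429, None        # too many upstream calls right now
        body = self.fetch(page_id)
        self.cache[page_id] = (now + self.ttl, body)
        return 200, body
```

Cache hits do not consume tokens in this sketch, so repeated requests for the same page during development or QA would not count against the limit.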

We probably want to be circling in someone on the API stream now. Is that API Platform maybe, or @DAbad?

Cc'ing @WDoranWMF in case that's in Platform Engineering's purview. (@DMburugu do you know who we should talk to about this?)

DAbad triaged this task as Medium priority.
DAbad moved this task from Backlog to Investigate on the Foundational Technology Requests board.
DAbad added a project: API Platform.
DAbad added subscribers: BPirkle, Vpoundstone.

Working on prioritization for API Platform stream.

Hi @DAbad, just checking in on this, any idea of a rough timeline?

Would the proxying solution be permissible in the interim (T306349#8089288)? That would allow us to do some cleanup tasks like T314144: Remove MVP image recommendation provider code, and the old service on WMCS could also be decommissioned.

JTannerWMF added subscribers: Seddon, Dbrant, JTannerWMF.

Hello, the Android team has on its roadmap to bring Image Recommendations into the app. We wanted to start this work in Q2, which is next month, but realized that the API the Structured Data and Growth teams are using isn't public-facing. This blocks us from bringing the feature into the app as planned. We would greatly appreciate this API being made public-facing.