[SPIKE] Decide on best approach for API access to Cassandra
Closed, Resolved · Public · Spike

Description

Spike to explore access to Cassandra generated datasets

User Story
As an API producer, I need guidelines on how I can access Cassandra so that my product feature can query generated datasets and serve the data
Success Criteria
  • Best approach defined for accessing Cassandra image suggestions
  • Approach should be general enough for other generated datasets, not tied to image suggestions
Out of Scope
  • Filtering/business logic will not be part of the generic API. For example, if a query returns multiple records with the same/similar key then the consumer will need to filter that as part of their application logic

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". (Oct 27 2021, 6:18 PM)
Restricted Application added a subscriber: Aklapper.

I demo'd an approach a few months back for creating an HTTP data gateway (see PDF below) using a framework with a fluent API, and feedback seemed positive.

If we were to apply this concretely to T293808: Design Image Suggestion Schema, I would suggest that we implement an endpoint for each of the described tables (suggestions, feedback, title_cache, and instanceof_cache). However, since that ticket proposes that only suggestions be published widely for reuse, and the others treated as application state for the discussed use-cases, I propose we prefix each with a URI component that denotes this. For example: /public/image_suggestions/:wiki/:page_id vs /private/image_suggestions/feedback/:wiki/:page_id/:image (exact names TBD after proper bikeshedding). How (specifically) we treat those namespaces (public vs private) is something we can sort out later (e.g. visibility, access, etc.); using a namespace now will preserve our ability to do so (easily) when the time comes, and should at least signal our intent here.
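As a rough illustration of the namespacing idea, the sketch below builds the two example paths from this comment. The helper names are hypothetical, and the prefixes and route shapes are only the examples given here (names are still TBD), not a finalized API.

```python
from urllib.parse import quote

# Hypothetical visibility prefixes; the real names are still to be decided.
PUBLIC = "public"
PRIVATE = "private"


def suggestions_path(wiki: str, page_id: int) -> str:
    """Path for the suggestions table, which is meant to be published for reuse."""
    return f"/{PUBLIC}/image_suggestions/{quote(wiki, safe='')}/{page_id}"


def feedback_path(wiki: str, page_id: int, image: str) -> str:
    """Path for the feedback table, which is treated as application state."""
    return (
        f"/{PRIVATE}/image_suggestions/feedback/"
        f"{quote(wiki, safe='')}/{page_id}/{quote(image, safe='')}"
    )
```

Keeping the prefix as the first path component means visibility or access rules could later be applied per namespace without renaming individual endpoints.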

Thanks Eric. Sounds good.

Subscribed @Cparle @Tgr @kostajh for any feedback/questions.

Will there be another API with some business logic to complement the generic API?

The Data Platform team wouldn't implement business logic in an API.

Our goal is to make access as generic as we can so that multiple consumers can take the data as modeled and use it.

For example, if Growth doesn't want to show an image suggestion to a user that has 2 rejects, then it would be up to them to implement that logic; if that is then served via an API, that would be up to them as well.
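To make the split of responsibilities concrete, here is a minimal sketch of that kind of consumer-side filtering, applied to rows already fetched from the generic API. The field names (`page_id`, `image`, `is_rejected`) and the reading of "2 rejects" as two rejections of the same suggestion are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable

MAX_REJECTS = 2  # assumed threshold, taken from the "2 rejects" example


def visible_suggestions(
    suggestions: Iterable[dict], feedback: Iterable[dict]
) -> list[dict]:
    """Drop suggestions that have been rejected MAX_REJECTS times or more."""
    # Count rejections per (page, image) pair.
    rejects = Counter(
        (f["page_id"], f["image"]) for f in feedback if f["is_rejected"]
    )
    return [
        s for s in suggestions
        if rejects[(s["page_id"], s["image"])] < MAX_REJECTS
    ]
```

This logic would live in the consumer's own code (or in an API they own), not in the generic data gateway.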

ok grand

... so if there was to be an API to handle business logic that might be needed by more than one client, where would that live and who'd be responsible for it?

Good question, you found a nice gray area there. Possibly...the API Platform - not sure, but if we have some requirements or more details we could review with the different groups and make a decision.

Moving this approach to approved/signed off.

This ticket will track the work to implement:

https://phabricator.wikimedia.org/T303408

AIUI this plan sounds OK. To recap Growth team's existing use cases:

  1. We use the search index to find articles that have image recommendations with hasrecommendation:image; those were done in a one-off import a while back. It looks like T299884: Prepare has-recommendation data for import to wiki search indices covers the use-case for repeated updates to the search index, and the Structured-Data-Backlog team is doing it.
  2. When the user is looking at an article that we know has image recommendations, we do a query to the image suggestions API using the path /image-suggestions/v0/{wiki}/{lang}/pages?source=ima&id={pageID}. From T294468#7748996 it sounds like /public/image_suggestions/:wiki/:page_id is a drop-in replacement and that ima (image matching algorithm) is the default. Is that correct? @nnikkhoui or @BPirkle based on your work with mediawiki/services/image-suggestion-api does that sound right?
     a. We currently fetch the image recommendation metadata server-side (we do this before the user visits the relevant article and cache the result), but @lbowmaker could you please confirm whether the API will be available to client-side applications? I assume it will be, but thought I'd double-check.
  3. Currently, when a user accepts or rejects an image suggestion, we enqueue a job with CirrusSearch to reset the weighted tag for hasrecommendation:image for that article. With the new API, we would also send an HTTP request to the feedback endpoint proposed in T294468#7748996. I assume that the search index updating code in T299884 (cc @Cparle) would take into account where an article has feedback before updating the weighted tag hasrecommendation:image for an article, or perhaps a new field like hasfeedback:image.rejected would be useful to someone.

Some minor points/questions:

  1. @lbowmaker the existing API (swagger reference) uses /:wiki/:lang but the proposal in T294468#7748996 references wiki. AIUI, there is/was a preference to build APIs using project and language e.g. wikipedia/en rather than enwiki.
  2. @Eevans Would the proposed API gateway itself sit behind the https://api.wikimedia.org gateway (Platform Team Initiatives (API Gateway))?
  3. @nnikkhoui @BPirkle Based on T294468#7748996, it sounds like the existing mediawiki/services/image-suggestion-api project will be scrapped, is that correct? (i.e. an alternative would be to rewrite it to use the new Cassandra API instead of the SQLite database store, and implement some business logic that various consumers need.)

When the user is looking at an article that we know has image recommendations, we do a query to the image suggestions API using the path /image-suggestions/v0/{wiki}/{lang}/pages?source=ima&id={pageID}. From T294468#7748996 it sounds like /public/image_suggestions/:wiki/:page_id is a drop in replacement and that ima (image matching algorithm) is the default. Is that correct? @nnikkhoui or @BPirkle based on your work with mediawiki/services/image-suggestion-api does that sound right?

That sounds correct to me, with the disclaimer that I have very little knowledge about the new work being done. In particular, "drop in" could imply that the response data formats are identical. I suspect (but do not know for certain) that they are not identical. I would be surprised if they don't communicate essentially the same data, but I would also be surprised if they do it in exactly the same way.

@nnikkhoui @BPirkle Based on T294468#7748996, it sounds like the existing mediawiki/services/image-suggestion-api project will be scrapped, is that correct?

That is my understanding.

@nnikkhoui @BPirkle Based on T294468#7748996, it sounds like the existing mediawiki/services/image-suggestion-api project will be scrapped, is that correct?

That's correct; see T294362

... With the new API, we would also send an HTTP request to the feedback endpoint proposed in T294468#7748996 ...

If I understand what you're saying here, then no, actually. The feedback would land in an event queue, and something (TBD) consuming from that queue would write to Cassandra; there is no proposal for an HTTP endpoint that writes.
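The flow described here can be sketched as follows. This is only an in-memory illustration: `queue.Queue` stands in for the event queue, a plain dict stands in for the Cassandra table, and the event field names are assumptions. The real consumer is still TBD.

```python
import queue


def drain_feedback(events: "queue.Queue[dict]", table: dict) -> int:
    """Consume all pending feedback events and write them to the table.

    Returns the number of events written.
    """
    written = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return written
        # Hypothetical key layout: one row per (wiki, page, image).
        key = (event["wiki"], event["page_id"], event["image"])
        table[key] = event["status"]
        written += 1
```

The point of the design is that producers only append events; the single writer to storage is the consumer, so no write endpoint needs to be exposed over HTTP.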


  1. @lbowmaker the existing API (swagger reference) uses /:wiki/:lang but the proposal in T294468#7748996 references wiki. AIUI, there is/was a preference to build APIs using project and language e.g. wikipedia/en rather than enwiki.

Perhaps it wouldn't be an issue here, but if this were a general policy, how would this work for commons, or wikidata?

For what it's worth though, it doesn't have a significant impact on storage if we want to change this to include a wiki & language instead of just wiki.

  1. @Eevans Would the proposed API gateway itself sit behind the https://api.wikimedia.org gateway (Platform Team Initiatives (API Gateway))?

Not the data gateway API discussed above, no. That's only for internal use.

If I understand what you're saying here, then no, actually. The feedback would land in an event queue, and something (TBD) consuming from that queue would write to Cassandra; there is no proposal for an HTTP endpoint that writes.

OK, that is important for the Growth team's use case -- we need to make sure that feedback is recorded somewhere that is accessible from MediaWiki servers. Do I understand correctly that this would be on the Growth team to implement, or is there another team that would own this? @Cparle is that something your team is thinking about in the context of updating the search index in T299884: Prepare has-recommendation data for import to wiki search indices?


  1. @lbowmaker the existing API (swagger reference) uses /:wiki/:lang but the proposal in T294468#7748996 references wiki. AIUI, there is/was a preference to build APIs using project and language e.g. wikipedia/en rather than enwiki.

Perhaps it wouldn't be an issue here, but if this were a general policy, how would this work for commons, or wikidata?

For multilingual projects you're supposed to just omit the language key: https://api.wikimedia.org/wiki/Documentation/Getting_started/Wikimedia_projects#Multilingual_projects
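A minimal sketch of that convention, assuming a path segment of the form project/language with the language simply left out for multilingual projects. The function name is hypothetical.

```python
from typing import Optional


def project_path(project: str, language: Optional[str] = None) -> str:
    """Build the project part of a path, omitting the language when there is none."""
    return f"{project}/{language}" if language else project
```

For example, `project_path("wikipedia", "en")` gives `wikipedia/en`, while `project_path("commons")` gives just `commons`.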

  1. @Eevans Would the proposed API gateway itself sit behind the https://api.wikimedia.org gateway (Platform Team Initiatives (API Gateway))?

Not the data gateway API discussed above, no. That's only for internal use.

Currently, I believe some community-written bots are making use of the image suggestions API, so that would be a problem for them.

We've communicated with the community bot-writers that the API was a POC and will soon no longer be available, so this is not an issue. While it would be great if the API was publicly available via the gateway, it's not a requirement for our MVP.

  1. Currently, when a user accepts or rejects an image suggestion, we enqueue a job with CirrusSearch to reset the weighted tag for hasrecommendation:image for that article. With the new API, we would also send an HTTP request to the feedback endpoint proposed in T294468#7748996. I assume that the search index updating code in T299884 (cc @Cparle) would take into account where an article has feedback before updating the weighted tag hasrecommendation:image for an article, or perhaps a new field like hasfeedback:image.rejected would be useful to someone.

The most effective way for us to do this would be for rejections to get written into HDFS somewhere so we can exclude them when we're gathering the data.

I'm still pretty unclear on what's happening with rejections - @Eevans is expecting that a rejection will land in an event queue. How is the event going to be produced, and which team are we expecting to do this? Can Growth do this from MW, @kostajh? I presume that whatever consumes the event and writes to Cassandra will be written by @Eevans's team ... so maybe if our team wants the data in HDFS we'll need to write the code to consume that event too and do that ourselves? We have no expertise in this, so would appreciate any pointers.

Yes, Growth can emit an event with the feedback data.

The most effective way for us to do this would be for rejections to get written into hdfs somewhere so we exclude them when we're gathering the data

Does this mean that all users of the pipeline would not be able to suggest images rejected by the newcomer tool? For example, if a user in the newcomer tool rejects an image, can the SD team then not suggest it to more experienced users for review via a notification?

@SWakiyama @MMiller_WMF see Cormac's answer to your question from yesterday's meeting above - are we okay with moving forward with this?

Also ... I don't think the way the data is being stored allows for that anyway. We store the user who has rejected an image, not the tool they were using at the time; see P21420. Perhaps this is what the comment field is intended for? Not sure.

Copying notes from a meeting earlier this week so that everyone's on the same page:

  • User feedback will be queried from Hive
    • Feedback will be stored in Cassandra as a key-value - don’t want to have to call that many times, so will query it more easily in Hive
    • Long term will also store it in Cassandra
  • Once Data Platform creates a Kafka topic they can write the feedback to, it will get stored in Hive automatically, which will be ready in a few weeks
  • Writing to Cassandra will be ready mid-May - this is good to have but won’t actually be used so isn’t a blocker
  • Growth team will write feedback to Hive via EventGate by publishing an event message
    • EventGate is the only solution the Growth team needs - they won’t need to interact with the database directly
    • Growth team just writes feedback to EventGate - the pipeline takes care of the rest
  • API also needs to be deployed
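For the producer side of these notes, a minimal sketch of building a feedback event message might look like the following. The schema URI, stream name, and payload fields are all hypothetical placeholders, and the sketch stops at serialization; it does not show how the message is actually submitted to EventGate.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_feedback_event(
    wiki: str,
    page_id: int,
    image: str,
    is_rejected: bool,
    now: Optional[datetime] = None,
) -> dict:
    """Assemble a feedback event as a JSON-serializable dict."""
    now = now or datetime.now(timezone.utc)
    return {
        # Hypothetical schema and stream identifiers, for illustration only.
        "$schema": "/hypothetical/image_suggestions/feedback/1.0.0",
        "meta": {"stream": "hypothetical.image_suggestions_feedback"},
        "dt": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "wiki": wiki,
        "page_id": page_id,
        "image": image,
        "is_rejected": is_rejected,
    }


def serialize(event: dict) -> str:
    """Encode the event as the JSON body that would be published."""
    return json.dumps(event, sort_keys=True)
```

With this shape, the producing team only has to assemble and publish the message; storage in Hive (and later Cassandra) is handled downstream by the pipeline.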

Marking this as resolved, but @CBogen feel free to reopen if there's more to do here.