[SPIKE] Investigate and Decide on Solution for Image Suggestions Feedback
Closed, Resolved · Public · Spike

Description

User Story
As a platform engineer, I need to provide a way for the Growth team to send image suggestions feedback data to the appropriate Cassandra tables.
Success Criteria
  • Consensus on approach agreed between Platform and Growth
  • List of any new processes/jobs that need to be implemented to support this
  • Can EventGate be used to write to Kafka, then a job to read from Kafka to write to Cassandra? What exists already?

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Mar 2 2022, 9:33 PM
Restricted Application added a subscriber: Aklapper.

Where does the feedback data come from? A client side app? If so, then EventGate->Kafka makes sense.

Once in Kafka, perhaps we could use the Flink Cassandra connector to write to Cassandra: https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/cassandra/

Or, if this had an open source license, we might use it: https://docs.confluent.io/kafka-connect-cassandra/current/overview.html ?

Solution 1 (Preferred)

  • Feature team uses EventGate API to push data to Kafka
  • Platform team implements new Kafka topic and schema
  • Platform team implements event based solution to listen to Kafka topic and write data to Cassandra (dependent on implementation of something like Flink to build application)
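
The per-event work such a Kafka-to-Cassandra application would do can be sketched roughly as follows. This is only an illustration in plain Python, not a Flink job: the keyspace, table, and column layout are hypothetical (the real Cassandra schema is not given in this task), and the field names are taken from the canary event shown later in this thread.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple

# Hypothetical column order; the real Cassandra table is not described in this task.
COLUMNS = ("wiki", "page_id", "filename", "origin_wiki", "user_id",
           "is_accepted", "is_rejected", "rejection_reason", "dt")

# Hypothetical keyspace/table name, written as a parameterized statement.
INSERT_CQL = (
    "INSERT INTO image_suggestions.feedback (" + ", ".join(COLUMNS) + ") "
    "VALUES (" + ", ".join("?" for _ in COLUMNS) + ")"
)


def event_to_row(event: Dict[str, Any]) -> Tuple[Any, ...]:
    """Pick the column values out of one decoded event; absent fields become None."""
    return tuple(event.get(col) for col in COLUMNS)


def rows_from_messages(messages: Iterable[bytes]) -> Iterator[Tuple[Any, ...]]:
    """Decode raw Kafka message values (JSON bytes) into rows ready to bind to INSERT_CQL."""
    for raw in messages:
        yield event_to_row(json.loads(raw))
```

A real consumer would bind each row to the statement through whatever Cassandra client the chosen framework provides; that part is deliberately left out here.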

Solution 2 (Infrastructure Exists to Implement)

  • Feature team uses EventGate API to push data to Kafka
  • Platform team implements new Kafka topic and schema
  • Platform team implements Airflow job to read from Hive (takes 2-3 hours to feed data from Kafka to Hive) and then load to Cassandra on schedule (feedback data not updated in real time to Cassandra)
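
To make the delay in Solution 2 concrete, here is a minimal sketch of how a scheduled job might decide which data is safe to load. It assumes hourly partitions in Hive and treats the upper bound of the 2-3 hour delay as the ingestion lag; both the partitioning and the function name are assumptions for illustration, not something confirmed in this task.

```python
from datetime import datetime, timedelta, timezone


def latest_loadable_hour(now: datetime,
                         ingest_lag: timedelta = timedelta(hours=3)) -> datetime:
    """Return the start of the most recent hourly partition that should be complete.

    An hour is treated as complete once `ingest_lag` has passed since the end of that hour.
    """
    cutoff = now - ingest_lag
    # Truncate to the top of the hour, then step back one full hour so the
    # whole partition lies before the cutoff.
    return cutoff.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)


if __name__ == "__main__":
    now = datetime(2022, 5, 4, 12, 30, tzinfo=timezone.utc)
    print(latest_loadable_hour(now).isoformat())
```

Under these assumptions, feedback submitted at the start of an hour could take around four hours or more to reach Cassandra, which is the staleness the discussion below is concerned with.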

I'd also prefer solution 1. Our Hadoop is an Analytics Data Lake, so even though we can do it via Hive, if we were to make some SLOs about this data they'd be very 'best effort'.

Feature team uses EventGate API to push data to Kafka

BTW, this could be done from the client side, but we'd have to be careful about how we accept events. We don't currently have any production use cases that accept events from client side. Since we have no authentication around externally exposed EventGates, anyone could POST events into the stream (if they knew the correct event data format).

If this MW feature has an API endpoint that accepts the feedback, then MW can POST to EventGate via the EventBus MW extension. I don't believe we've used EventBus to produce anything but core MW state change events, so we'll have to talk about this a little more, but I believe it will be fine.

For GrowthExperiments, we process client-side feedback on image suggestions via an API endpoint in MediaWiki, so it would be straightforward for us to POST the data to EventGate via EventBus.
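
For illustration, the event body that the server-side endpoint would hand over for posting might look roughly like the sketch below. The actual producer would be PHP inside MediaWiki; this Python version only shows the shape of the payload. The field names, stream name, and schema version are copied from the canary event shown later in this thread, while the function name and its parameters are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def build_feedback_event(wiki: str, page_id: int, filename: str, user_id: int,
                         is_accepted: bool, is_rejected: bool,
                         rejection_reason: Optional[str] = None,
                         origin_wiki: str = "commons") -> Dict[str, Any]:
    """Assemble one feedback event; fields not set here are left to the producing side."""
    event: Dict[str, Any] = {
        "$schema": "/mediawiki/page/image-suggestions-feedback/1.0.0",
        "meta": {"stream": "mediawiki.image_suggestions_feedback"},
        "wiki": wiki,
        "page_id": page_id,
        "filename": filename,
        "origin_wiki": origin_wiki,
        "user_id": user_id,
        "is_accepted": is_accepted,
        "is_rejected": is_rejected,
    }
    # Only include a reason when one was given (assumed optional here).
    if rejection_reason is not None:
        event["rejection_reason"] = rejection_reason
    return event


if __name__ == "__main__":
    body = build_feedback_event("enwiki", 1234, "File:example.JPEG", 1234,
                                is_accepted=False, is_rejected=True,
                                rejection_reason="Incorrect Image")
    print(json.dumps(body, indent=2))
```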

If we go with solution 1, when might we expect to be able to use that in production? Or put differently, roughly how much time and effort is it compared to solution 2?

kostajh added a subscriber: Cparle.

A potential issue with the 2-3 hour delay is that image suggestions that have been rejected could be resurfaced to users. One way around that would be for consumers of the API to implement their own feedback cache until a real-time solution is sorted out; e.g. for GrowthExperiments, we could cache feedback for ~24 hours and check that instead of or in addition to the API. That has the downside of only covering feedback from users who submitted it via the GrowthExperiments features, but that's probably sufficient until solution 1 can be implemented, given that there aren't (yet) other consumers of the API that are submitting feedback. (cc @Cparle)
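
The consumer-side cache idea can be sketched as a small time-to-live map keyed by suggestion. This is an in-memory Python illustration only; in GrowthExperiments it would presumably sit on top of an existing MediaWiki cache, and the class name, key layout, and injectable clock are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple

Key = Tuple[str, int, str]  # (wiki, page_id, filename); hypothetical key layout


class FeedbackCache:
    """Remember rejected suggestions for a fixed time (about 24 hours by default)."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires: Dict[Key, float] = {}

    def record_rejection(self, wiki: str, page_id: int, filename: str) -> None:
        """Store the rejection together with the time at which it stops counting."""
        self._expires[(wiki, page_id, filename)] = self._clock() + self._ttl

    def was_rejected(self, wiki: str, page_id: int, filename: str) -> bool:
        """True if a rejection for this suggestion was recorded within the TTL."""
        key = (wiki, page_id, filename)
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # Entry is stale: drop it so the map does not grow without bound.
            del self._expires[key]
            return False
        return True
```

Before showing a suggestion, the consumer would call `was_rejected(...)` and skip the suggestion on a hit, then fall back to the API's own data.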

BTW, Stream Connectors is an unimplemented component of Event Platform that was meant to automate this kind of thing: T214430: Event Platform: Stream Connectors

A potential issue with the 2-3 hour delay will be that image suggestions that have been rejected could be potentially resurfaced to users.

IMO we can probably live with this initially, and we can measure how much of a problem it is in practice by monitoring how often we get more than one rejection on the same suggestion.
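
One way to put a number on this is to count how many rejected suggestions were rejected more than once. The sketch below assumes a suggestion is identified by (wiki, page_id, filename) and that events carry the fields seen in the canary event later in this thread; the function name and the choice of metric are illustrative assumptions.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def repeat_rejection_rate(events: Iterable[Dict[str, Any]]) -> float:
    """Share of rejected suggestions that received more than one rejection."""
    counts = Counter(
        (e["wiki"], e["page_id"], e["filename"])
        for e in events
        if e.get("is_rejected")
    )
    if not counts:
        return 0.0
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)
```

A rate that stays near zero would suggest the delay is tolerable; a rising rate would be an argument for prioritising the real-time path.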

Just had a naming bikeshed meeting with @Eevans and @lbowmaker about the image suggestion feedback event schema here: https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/779052.

I think we agree that 'generated-data' is not a good namespace for a dataset. Generated data is the name of a project/platform at WMF, and is not a name of a data model.

So, we need another namespace. We could just namespace toplevel as 'image-suggestions', but I think we agree that on its own this is too generic.

I'm going to make a case for putting this into the mediawiki namespace. Other suggestions are welcome!

By arguing for a mediawiki namespace for this event schema, I am implicitly arguing that other data models for the 'image suggestions' feature also conceptually belong in a mediawiki namespace, including the 'generated' image suggestions dataset.

  • This schema itself refers to MediaWiki entities: page_id, wiki, origin_wiki, user_id.
  • The submitter of this feedback event is a MediaWiki extension serving a feature to end users of MediaWiki.
  • The input of the original image suggestion dataset comes from MediaWiki.
  • For serving, this image suggestion dataset is stored in Cassandra, but it is a MediaWiki extension that is intended to use it to serve the suggestions to end users.

Another idea would be to namespace under 'structured_data', but IMO this is also not a data model name but another team/project name, and it is also very generic. (The fact that a dataset is 'structured' seems like it should be the default, not an aberration warranting its own name.)

Anyway, we want a namespace, and 'generated-data' and 'structured_data' are not it. If not 'mediawiki/image-suggestions', are there other ideas?

mediawiki/page/image-suggestion-feedback ?

Ah, additionally, structured data refers to structured data in MediaWiki: https://www.mediawiki.org/wiki/Readers/Structured_Data

@Tgr or @kostajh - would you mind taking a look at the latest schema design here and letting me know if you see any issues with it? Thanks!

https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/779052

The stream is now live: mediawiki.image_suggestions_feedback

@Tgr - is it possible to easily write a test event to this?

Change 789200 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] Bump schema repo versions for mediawiki.image_suggestions_feedback

https://gerrit.wikimedia.org/r/789200

Change 789200 merged by Ottomata:

[eventgate-wikimedia@master] Bump schema repo versions for mediawiki.image_suggestions_feedback

https://gerrit.wikimedia.org/r/789200

Change 789231 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] Bump eventgate-main image version for T302925

https://gerrit.wikimedia.org/r/789231

Change 789231 merged by Ottomata:

[operations/deployment-charts@master] Bump eventgate-main image version for T302925

https://gerrit.wikimedia.org/r/789231

Change 789235 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-main - pre cache image-suggestions-feedback schema

https://gerrit.wikimedia.org/r/789235

Change 789235 merged by Ottomata:

[operations/deployment-charts@master] eventgate-main - pre cache image-suggestions-feedback schema

https://gerrit.wikimedia.org/r/789235

Done. And confirmed that canary events are coming in:

[@stat1004:/home/otto] $ kafkacat -b kafka-main1003.eqiad.wmnet -C -t eqiad.mediawiki.image_suggestions_feedback -o beginning -u -c 1 | jq .
{
  "$schema": "/mediawiki/page/image-suggestions-feedback/1.0.0",
  "dt": "2022-04-13T14:12:16.372Z",
  "filename": "https://commons.wikimedia.org/wiki/File:example.JPEG",
  "is_accepted": false,
  "is_rejected": true,
  "meta": {
    "stream": "mediawiki.image_suggestions_feedback",
    "domain": "canary",
    "id": "477733c2-1695-48e9-8560-25fd49a842de",
    "dt": "2022-05-04T18:26:56.310Z",
    "request_id": "fae01250-2a17-4c97-80d9-df22078424cc"
  },
  "origin_wiki": "commons",
  "page_id": 1234,
  "rejection_reason": "Incorrect Image",
  "user_id": 1234,
  "wiki": "enwiki"
}
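
Any consumer of this stream will also see canary events like the one above, so it would likely want to drop them before writing anything to Cassandra. A minimal sketch, assuming that canary events can be recognised by `meta.domain` being `"canary"` as in the output above, and that input arrives as one JSON document per line (the function names are hypothetical):

```python
import json
from typing import Any, Dict, Iterable, Iterator


def is_canary(event: Dict[str, Any]) -> bool:
    """True for synthetic canary events, identified here by meta.domain == "canary"."""
    return event.get("meta", {}).get("domain") == "canary"


def real_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Parse newline-delimited JSON and yield only non-canary events."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if not is_canary(event):
            yield event
```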

lbowmaker updated the task description.

Change 809014 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/EventBus@master] EventFactory: Add helper for image suggestion feedback

https://gerrit.wikimedia.org/r/809014

Change 809150 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[schemas/event/secondary@master] image-suggestions-feedback: Make dt field non-required, adjust docs

https://gerrit.wikimedia.org/r/809150

Change 809014 abandoned by Kosta Harlan:

[mediawiki/extensions/EventBus@master] EventFactory: Add helper for image suggestion feedback

Reason:

https://gerrit.wikimedia.org/r/809014

Change 809150 merged by jenkins-bot:

[schemas/event/secondary@master] image-suggestions-feedback: Bump to version 2.0.0

https://gerrit.wikimedia.org/r/809150

Change 888642 had a related patch set uploaded (by Aqu; author: Aqu):

[schemas/event/secondary@master] Fix typo in image suggestion schema

https://gerrit.wikimedia.org/r/888642

Change 888642 merged by Milimetric:

[schemas/event/secondary@master] Fix typo in image suggestion schema

https://gerrit.wikimedia.org/r/888642