NOTE: blocked by T299408 and T299789
User story
---
As a product manager, I want a single source of truth for image suggestions.
As a developer, I need a place to store whether an image suggestion is accepted or rejected, so I can use that data in the future.
So we need to store image suggestions with confidence scores and acceptance/rejection data.
---
Cassandra will be the source of truth for image suggestions, so once we have all the data we want to store it there.
We want to store:
- wiki
- article id
- article title
- article wikidata id
- suggested image article title
- confidence score for suggestion
- metadata about the reason an image was suggested (e.g. if it's the P18 property of the article's wikidata id, or if it's a lead image on wiki X)
- whether an image suggestion has been accepted or rejected (or neither)
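As a rough illustration of the fields listed above, a single suggestion record could be sketched like this. Every name here (`ImageSuggestion`, `Status`, the field names, the shape of `reason`) is hypothetical and not an agreed schema, and the example values are placeholders; the real model still needs to be worked out with the data platform team.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    """Review state of a suggestion (hypothetical names)."""
    UNREVIEWED = "unreviewed"  # neither accepted nor rejected
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class ImageSuggestion:
    """One suggested image for one article (illustrative only)."""
    wiki: str                    # wiki the article lives on
    page_id: int                 # article id
    page_title: str              # article title
    wikidata_id: Optional[str]   # article wikidata id, if it has one
    image_title: str             # suggested image title
    confidence: float            # confidence score for the suggestion
    # Why the image was suggested, e.g. P18 on the wikidata item
    # or a lead image on another wiki. The dict layout is an assumption.
    reason: dict = field(default_factory=dict)
    status: Status = Status.UNREVIEWED


# Placeholder values, not real data.
example = ImageSuggestion(
    wiki="enwiki",
    page_id=123,
    page_title="Example article",
    wikidata_id="Q1",
    image_title="Example.jpg",
    confidence=0.9,
    reason={"source": "wikidata", "property": "P18"},
)
```

Modelling the accepted/rejected/neither state as a three-value enum rather than a nullable boolean is only one option; it keeps the "neither" case explicit.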
Note that the confidence score can only come from an Elasticsearch query, so we'll need to run an Elasticsearch query each time we want to write confidence score data.
We'll need to consult with the data platform team about the data modelling. For reference, here are some likely queries that will be run against the data:
- accept/reject a particular suggestion for a particular wiki
- get all suggestions for a particular wiki
- remove a suggested image (because it's no longer recommended for an article)
- remove an unillustrated article (because it has been illustrated)
- get all (`wikidata id`, `image_title`) pairs where `image_title` has been accepted/rejected as a suggestion for `wikidata id`
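To make these access patterns concrete, here is a minimal in-memory sketch of the queries above. It is not a Cassandra design: `SuggestionStore` and all of its method names are hypothetical, and keying articles by wikidata id alone is a simplifying assumption (the real model may key on article id per wiki instead).

```python
class SuggestionStore:
    """In-memory stand-in that only illustrates the likely queries."""

    def __init__(self):
        # (wiki, wikidata_id, image_title) -> "accepted", "rejected" or None
        self._rows = {}

    def add(self, wiki, wikidata_id, image_title):
        """Store a new, not yet reviewed suggestion."""
        self._rows.setdefault((wiki, wikidata_id, image_title), None)

    def set_status(self, wiki, wikidata_id, image_title, status):
        """Accept or reject a particular suggestion for a particular wiki."""
        if status not in ("accepted", "rejected"):
            raise ValueError(f"unknown status: {status!r}")
        key = (wiki, wikidata_id, image_title)
        if key not in self._rows:
            raise KeyError(key)
        self._rows[key] = status

    def for_wiki(self, wiki):
        """Get all suggestions for a particular wiki."""
        return [(q, img, s) for (w, q, img), s in self._rows.items() if w == wiki]

    def remove_image(self, wiki, wikidata_id, image_title):
        """Remove a suggested image that is no longer recommended."""
        self._rows.pop((wiki, wikidata_id, image_title), None)

    def remove_article(self, wiki, wikidata_id):
        """Remove an article (and its suggestions) once it is illustrated."""
        for key in [k for k in self._rows if k[:2] == (wiki, wikidata_id)]:
            del self._rows[key]

    def reviewed_pairs(self, status):
        """Get all (wikidata id, image_title) pairs with the given status."""
        return sorted({(q, img) for (_, q, img), s in self._rows.items() if s == status})


# Placeholder ids and titles, not real data.
store = SuggestionStore()
store.add("enwiki", "Q1", "A.jpg")
store.add("enwiki", "Q1", "B.jpg")
store.add("dewiki", "Q2", "C.jpg")
store.set_status("enwiki", "Q1", "A.jpg", "accepted")
store.remove_image("enwiki", "Q1", "B.jpg")
store.remove_article("dewiki", "Q2")
```

Note that the last query cuts across wikis while the others are scoped to one wiki, which is the kind of thing the data modelling discussion would need to account for.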
We'd expect an API to be written to provide this functionality, but the script may just write the data directly to Cassandra. Also, @DAbad would like to preserve the change history of the data, so this needs to be considered too (though maybe this can be handled with Hive tables rather than Cassandra?).