User story
As a user, I want to receive notifications of suggested images for unillustrated articles. To make this possible, we need to gather all unillustrated articles for the relevant wikis, together with suggested images for them, and store the results so they can be loaded into the user-accessible persistence layers (Cassandra and Elasticsearch).
Implementation
- gather all wikidata-ids stored in the parquet written by https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb (or its successor from T300045), plus the metadata associated with them
- gather all wikidata-ids from all commons depicts and is digital representation of statements
- merge the two sets into one collection of wikidata-ids on commons
- then, for each relevant wiki, find all unillustrated articles with their wikidata-ids, wiki and article title (see the Image Suggestions Algorithm code for how; note that certain types of pages are excluded, and we need to replicate this)
- get the intersection of the wikidata-ids on commons with the wikidata-ids of the unillustrated articles
- store the following in a file in HDFS:
  - wiki
  - article title
  - suggested image
  - reason the image was suggested
  - the names of the wikis the image is a lead image on (if any)
  - values of the P31 property of the wikidata item corresponding to the article (so we can filter by it)
  - revision id of the article the image is suggested for
  - a timeuuid
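
The steps above could be sketched roughly as follows, in plain Python rather than as a Spark job. This is only an illustration under assumptions: the function names, the `Suggestion` field names, the input shapes (dicts keyed by wikidata-id or image name) and the reason strings are all hypothetical and not taken from the existing notebook or the Image Suggestions Algorithm code. The timeuuid is generated with `uuid.uuid1()`, which produces a time-based (version 1) UUID.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class Suggestion:
    """One output row; field names are illustrative, not an agreed schema."""
    wiki: str
    article_title: str
    suggested_image: str
    reason: str
    lead_image_wikis: List[str]
    p31_values: List[str]
    revision_id: int
    # uuid1() is a time-based (version 1) UUID, i.e. a timeuuid.
    timeuuid: uuid.UUID = field(default_factory=uuid.uuid1)


def merge_sources(*sources):
    """Union several {wikidata_id: [(image, reason), ...]} mappings into one."""
    merged = {}
    for source in sources:
        for wikidata_id, candidates in source.items():
            merged.setdefault(wikidata_id, []).extend(candidates)
    return merged


def build_suggestions(commons_items, unillustrated, lead_image_wikis, p31_by_item):
    """Keep wikidata-ids present on both sides; emit one row per candidate image.

    commons_items:    {wikidata_id: [(image, reason), ...]} (assumed shape)
    unillustrated:    list of dicts with wiki, title, wikidata_id, revision_id
    lead_image_wikis: {image: [wiki, ...]} (assumed shape)
    p31_by_item:      {wikidata_id: [P31 value, ...]} (assumed shape)
    """
    rows = []
    for article in unillustrated:
        wikidata_id = article["wikidata_id"]
        # Articles whose wikidata-id has no image on commons yield no rows,
        # which is the intersection step.
        for image, reason in commons_items.get(wikidata_id, []):
            rows.append(Suggestion(
                wiki=article["wiki"],
                article_title=article["title"],
                suggested_image=image,
                reason=reason,
                lead_image_wikis=lead_image_wikis.get(image, []),
                p31_values=p31_by_item.get(wikidata_id, []),
                revision_id=article["revision_id"],
            ))
    return rows
```

In a real job the same shape would more likely be a join between two dataframes on the wikidata-id column, with the rows written out to HDFS; the sketch only shows which fields end up in each row and where the intersection happens.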