[XL] Store a list of unillustrated articles with suggested images in hdfs
Closed, Resolved (Public)

Assigned To

Authored By

	Cparle
	Jan 21 2022, 5:55 PM

Description

NOTE: blocked by T299059

User story

As a user, I want to receive notifications of suggested images for unillustrated articles. To make this possible, we need to gather all unillustrated articles for relevant wikis, together with suggested images for them, and store them so they can be persisted in user-accessible persistence layers (Cassandra and Elasticsearch).

Implementation

gather all wikidata-ids stored in the parquet written by https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb (or its successor from T300045), plus the metadata associated with them
gather all wikidata-ids from all commons "depicts" and "is digital representation of" statements
merge the two sets into one collection of wikidata-ids on commons
then, for each relevant wiki, find all unillustrated articles with their wikidata-ids, wiki and article title (see the Image Suggestions Algorithm code for how; note that certain types of pages are excluded, and we need to replicate this)
get the intersection of wikidata-ids
store the following in a file in hdfs:
- wiki
- article title
- suggested image
- reason the image was suggested
- the names of the wikis the image is a lead image on (if any)
- values of the P31 property of the wikidata item corresponding to the article (so we can filter by it)
- revision id of the article the image is suggested for
- a timeuuid
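
The steps above can be sketched in plain Python to make the merge and intersection concrete. This is only an illustration under assumptions, not the pipeline's actual code: the `Suggestion` record, the `build_suggestions` function, the reason labels and the shape of the input dictionaries are all hypothetical, and a real implementation would presumably run on Spark over the full data rather than on in-memory dictionaries.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """One output row; field names are illustrative, mirroring the list above."""
    wiki: str
    article_title: str
    suggested_image: str
    reason: str
    lead_image_on: tuple   # wikis where the image is a lead image (may be empty)
    p31_values: tuple      # P31 values of the article's wikidata item
    revision_id: int
    timeuuid: uuid.UUID


def build_suggestions(lead_image_items, commons_statement_items,
                      unillustrated, lead_image_wikis):
    """Assumed inputs (hypothetical shapes):
    lead_image_items:        wikidata-id -> set of images (from the parquet)
    commons_statement_items: wikidata-id -> set of images (from commons statements)
    unillustrated:           dicts with wiki, title, wikidata_id, revision_id, p31
    lead_image_wikis:        image -> set of wikis where it is a lead image
    """
    # Merge the two sets into one collection keyed by wikidata-id,
    # remembering which source produced each candidate image.
    candidates = {}
    sources = (("lead-image", lead_image_items),
               ("commons-statement", commons_statement_items))
    for reason, source in sources:
        for qid, images in source.items():
            for image in sorted(images):
                candidates.setdefault(qid, []).append((image, reason))

    rows = []
    for article in unillustrated:
        # Intersection: only articles whose wikidata-id has candidates yield rows.
        for image, reason in candidates.get(article["wikidata_id"], []):
            rows.append(Suggestion(
                wiki=article["wiki"],
                article_title=article["title"],
                suggested_image=image,
                reason=reason,
                lead_image_on=tuple(sorted(lead_image_wikis.get(image, ()))),
                p31_values=tuple(article["p31"]),
                revision_id=article["revision_id"],
                timeuuid=uuid.uuid1(),  # version 1 UUIDs are time-based
            ))
    return rows
```

An article whose wikidata-id appears in neither source simply produces no rows, which is the intersection step in practice.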

Related Objects

Status	Assigned	Task
Resolved	CBogen	T299781 [EPIC] Image suggestions backend
Resolved	mfossati	T296814 [EPIC] Article-level image suggestions data pipeline
Resolved	Cparle	T299789 [XL] Store a list of unillustrated articles with suggested images in hdfs
Resolved	Cparle	T301687 Calculate image suggestions confidence score without using elasticsearch

Event Timeline

Cparle created this task.Jan 21 2022, 5:55 PM

Cparle mentioned this in T296814: [EPIC] Article-level image suggestions data pipeline.Jan 21 2022, 6:06 PM

Cparle updated the task description. (Show Details)Jan 21 2022, 6:10 PM

Cparle mentioned this in T299884: Prepare has-recommendation data for import to wiki search indices.Jan 24 2022, 9:55 AM

Cparle mentioned this in T299885: [L] Push unillustrated articles with their suggestions, suggestion reasons and confidence scores to Cassandra.Jan 24 2022, 10:14 AM

Cparle removed a project: Epic.Jan 24 2022, 12:46 PM

CBogen edited projects, added Structured-Data-Backlog; removed Structured-Data-Backlog (Current Work).Jan 24 2022, 2:54 PM

Cparle updated the task description. (Show Details)Jan 24 2022, 5:38 PM

CBogen edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog.Jan 24 2022, 5:49 PM

CBogen moved this task from Incoming to Ready for Estimation on the Structured-Data-Backlog (Current Work) board.

Cparle mentioned this in T292147: [L] Send Image Suggestions notifications to experienced users.Jan 25 2022, 12:10 PM

Cparle updated the task description. (Show Details)Jan 25 2022, 5:52 PM

Cparle updated the task description. (Show Details)

CBogen renamed this task from Store a list of unillustrated articles with suggested images in hdfs to [XL] Store a list of unillustrated articles with suggested images in hdfs.Jan 26 2022, 5:35 PM

CBogen moved this task from Ready for Estimation to Blocked on the Structured-Data-Backlog (Current Work) board.Jan 26 2022, 6:05 PM

Cparle updated the task description. (Show Details)Jan 27 2022, 12:40 PM

Cparle updated the task description. (Show Details)

As per conversation on Slack, @JAllemandou will send some sample data to unblock work on this task. We still need T299059 in order to complete this task, but the sample data will get us started.

Here is some data:

hdfs dfs -du -s -h hdfs:///wmf/data/wmf/structured_data/commons/entity/snapshot=2022-01-24
32.8G

hive>select count(1) from structured_data.commons_entity where snapshot='2022-01-24';
72659834

Let me know if it doesn't work as expected :)

CBogen moved this task from Blocked to Ready for Development on the Structured-Data-Backlog (Current Work) board.Jan 31 2022, 5:21 PM

Cparle claimed this task.Feb 2 2022, 12:48 PM

Cparle moved this task from Ready for Development to Doing on the Structured-Data-Backlog (Current Work) board.

gather all wikidata-ids from all commons depicts and is digital representation of statements

Should that include P921 "main subject"? I think it's somewhat common to use it as a stronger version of "depicts".

Cparle updated the task description. (Show Details)Feb 14 2022, 4:44 PM

Cparle added a subtask: T301687: Calculate image suggestions confidence score without using elasticsearch.Feb 14 2022, 4:50 PM

mfossati added a parent task: T296814: [EPIC] Article-level image suggestions data pipeline.Feb 18 2022, 5:52 PM

Cparle closed subtask T301687: Calculate image suggestions confidence score without using elasticsearch as Resolved.Feb 23 2022, 6:45 PM

Cparle updated the task description. (Show Details)Mar 14 2022, 5:28 PM

Cparle mentioned this in T283865: [XL] Estimate coverage of image suggestions at different confidence levels.Mar 21 2022, 10:29 AM

mfossati mentioned this in T302434: [L] Orchestrate image suggestions tasks with Apache Airflow.Apr 6 2022, 9:13 AM

mfossati subscribed.Apr 7 2022, 7:53 AM

Ok this is pretty much done. We're writing to Hive instead of hdfs, to make it easier to export to Cassandra

Some observations/comparisons

Data for ptwiki running the old IMA on the 2022-02-21 snapshot:

21092 suggested images for 106699 articles (387282 unillustrated articles in total)

Data for ptwiki using the new data pipeline on the 2022-02-21 snapshot:

1518613 suggestions for 126259 articles
including suggestions for 81177 of the same articles as the old IMA

We're expecting more suggestions for the new pipeline as we're not limiting the number of suggestions for a single article, and we're expecting suggestions for more articles because we're also using depicts data ... so the above is within the boundaries of what we'd expect

Note that it's difficult to compare the output of the two directly, as we've discovered a flaw in some of the underlying library code the original algorithm uses, which causes a bunch of data to be dropped. Working around that at the moment; will do some more data comparisons before closing.

technical note: storing to hive is actually storing to HDFS :) Hive is a SQL engine on top of HDFS structured data.

mfossati mentioned this in T306598: [L] Look into preliminary blue links algorithm data for section topics.Apr 21 2022, 8:33 AM

Ok ran the old IMA again but with the data-loss part fixed (I think!), and now from it we're getting suggestions for 82399 articles, 81177 of which are also suggested by the new pipeline

We'd expect all suggestions from the old IMA to also be suggested by the new pipeline.
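
For reference, the size of the gap can be worked out from the two article counts quoted above. The snippet below is a small sketch: the `overlap` helper is hypothetical and shows how the same comparison could be made on actual sets of article identifiers, while the arithmetic underneath uses only the reported counts.

```python
def overlap(old_articles, new_articles):
    """Compare two sets of article identifiers (hypothetical helper)."""
    return {
        "shared": len(old_articles & new_articles),
        "old_only": len(old_articles - new_articles),
        "new_only": len(new_articles - old_articles),
    }


# Counts reported for ptwiki on the 2022-02-21 snapshot.
old_ima_articles = 82399   # articles with suggestions from the fixed old IMA
shared_articles = 81177    # of those, articles also covered by the new pipeline

old_only = old_ima_articles - shared_articles
shared_fraction = shared_articles / old_ima_articles

print(old_only)                   # 1222
print(f"{shared_fraction:.1%}")   # 98.5%
```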

All the suggestions (or at least the ones I've checked) that are not in the new pipeline that are suggested by the IMA can be explained as follows:

- an image that is linked from a great many articles on any wiki is considered an icon (with the threshold depending on the size of the wiki)
- an article that has only icons as images is considered unillustrated
- the IMA allowed icons to be used as image suggestions, which meant that someone could add a suggestion to an article, but the article would still be considered unillustrated (a bug)
- the new implementation removes icons from the suggestion list
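
The icon rules described above can be illustrated with a short Python sketch. It is a simplified model rather than the pipeline's code: the function names, the `(wiki, article, image)` triples and the per-wiki `thresholds` mapping are assumptions made for the example, and the real thresholds are not stated in this task.

```python
from collections import Counter


def find_icons(image_links, thresholds):
    """image_links: (wiki, article, image) triples.
    thresholds: wiki -> maximum link count before an image counts as an icon
    (hypothetical values; the real ones depend on wiki size)."""
    counts = Counter((wiki, image) for wiki, _article, image in set(image_links))
    # An image is an icon if it exceeds the threshold on any wiki.
    return {image for (wiki, image), n in counts.items() if n > thresholds[wiki]}


def unillustrated_articles(wiki, articles, image_links, icons):
    """An article is unillustrated if it has no images other than icons."""
    illustrated = {
        article for w, article, image in image_links
        if w == wiki and image not in icons
    }
    return [a for a in articles if a not in illustrated]


def drop_icon_suggestions(suggestions, icons):
    """Remove icons from (article, image) suggestions, as the new pipeline does."""
    return [(article, image) for article, image in suggestions if image not in icons]
```

With icons filtered out of the suggestions, accepting a suggestion always makes the article illustrated, which avoids the bug described for the old IMA.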

... so to me it seems the new pipeline is working as expected. Waiting for confirmation from @mfossati, and then we can close this.

... aaand merged. Closing the ticket

👍 See https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/merge_requests/44
