
[XL] Store a list of unillustrated articles with suggested images in hdfs
Closed, Resolved, Public

Description

NOTE: blocked by T299059

User story

As a user, I want to receive notifications of suggested images for unillustrated articles. To make this possible, we need to gather all unillustrated articles for the relevant wikis, together with suggested images for them, and store the result so it can be loaded into user-accessible persistence layers (Cassandra and Elasticsearch).


Implementation

  • gather all wikidata-ids stored in the parquet written by https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb (or its successor from T300045), plus the metadata associated with them
  • gather all wikidata-ids from all commons depicts and is digital representation of statements
  • merge the two sets into one collection of wikidata ids on commons
  • then, for each relevant wiki, find all unillustrated articles with their wikidata-ids, wiki and article title (see the Image Suggestions Algorithm code for how; note that certain types of pages are excluded, and we need to replicate this)
  • get the intersection of the wikidata-ids on commons and the wikidata-ids of the unillustrated articles
  • store the following in a file in hdfs
    • wiki
    • article title
    • suggested image
    • reason the image was suggested
    • the names of the wikis the image is a lead image on (if any)
    • values of the P31 property of the wikidata item corresponding to the article (so we can filter by it)
    • revision id of the article the image is suggested for
    • a timeuuid
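
The steps above could be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the pipeline that was eventually merged: the input shapes, the `Suggestion` field names and the `build_suggestions` helper are all hypothetical, and a real job would presumably run on a cluster against the full dataset. The timeuuid is generated with `uuid.uuid1()`, which produces a time-based (version 1) UUID.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Suggestion:
    """One output row; field names are illustrative, not the real schema."""
    wiki: str
    article_title: str
    suggested_image: str
    reason: str
    lead_image_on: List[str]   # wikis where the image is a lead image (may be empty)
    instance_of: List[str]     # P31 values of the article's wikidata item
    revision_id: int
    id: uuid.UUID = field(default_factory=uuid.uuid1)  # time-based (version 1) UUID


def build_suggestions(
    link_images: Dict[str, List[Tuple[str, str]]],
    statement_images: Dict[str, List[Tuple[str, str]]],
    unillustrated: List[dict],
    lead_image_wikis: Dict[str, List[str]],
) -> List[Suggestion]:
    # Merge the two sources into one mapping: wikidata id -> [(image, reason)].
    commons: Dict[str, List[Tuple[str, str]]] = {}
    for source in (link_images, statement_images):
        for qid, images in source.items():
            commons.setdefault(qid, []).extend(images)

    # Keep only articles whose wikidata id also appears on commons
    # (the intersection), emitting one row per candidate image.
    rows = []
    for article in unillustrated:
        for image, reason in commons.get(article["wikidata_id"], []):
            rows.append(Suggestion(
                wiki=article["wiki"],
                article_title=article["title"],
                suggested_image=image,
                reason=reason,
                lead_image_on=lead_image_wikis.get(image, []),
                instance_of=article["instance_of"],
                revision_id=article["revision_id"],
            ))
    return rows
```

Articles whose wikidata id has no images on commons simply produce no rows, which is how the intersection step shows up in this sketch.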

Event Timeline

Cparle updated the task description.
Cparle updated the task description.
CBogen renamed this task from "Store a list of unillustrated articles with suggested images in hdfs" to "[XL] Store a list of unillustrated articles with suggested images in hdfs". (Jan 26 2022, 5:35 PM)
Cparle updated the task description.

As per conversation on Slack, @JAllemandou will send some sample data to unblock work on this task. We still need T299059 in order to complete this task, but the sample data will get us started.

Here is some data:

hdfs dfs -du -s -h hdfs:///wmf/data/wmf/structured_data/commons/entity/snapshot=2022-01-24
32.8G

hive>select count(1) from structured_data.commons_entity where snapshot='2022-01-24';
72659834

Let me know if it doesn't work as expected :)

gather all wikidata-ids from all commons depicts and is digital representation of statements

Should that include P921 "main subject"? I think it's somewhat common to use it as a stronger version of "depicts".
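
One way to keep that question open is to make the set of properties a parameter. The sketch below assumes the statements have already been flattened into `(commons_file, property_id, wikidata_id)` tuples, which is a hypothetical shape and not the layout of the `commons_entity` table; the function name is likewise made up. P180 is "depicts", P6243 is "digital representation of" and P921 is "main subject".

```python
from typing import FrozenSet, Iterable, Set, Tuple

DEPICTS = "P180"
DIGITAL_REPRESENTATION_OF = "P6243"
MAIN_SUBJECT = "P921"

# Properties named in the task description; P921 is left out by default.
DEFAULT_PROPERTIES = frozenset({DEPICTS, DIGITAL_REPRESENTATION_OF})


def wikidata_ids_from_statements(
    statements: Iterable[Tuple[str, str, str]],
    properties: FrozenSet[str] = DEFAULT_PROPERTIES,
) -> Set[str]:
    """Collect the wikidata ids targeted by statements using the given properties."""
    return {qid for _file, prop, qid in statements if prop in properties}
```

Passing `DEFAULT_PROPERTIES | {MAIN_SUBJECT}` as `properties` would then include "main subject" statements without touching the rest of the code.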

Ok, this is pretty much done. We're writing to Hive instead of hdfs, to make it easier to export to Cassandra.

Some observations/comparisons

Data for ptwiki running the old IMA on the 2022-02-21 snapshot:

  • 21092 suggested images for 106699 articles (387282 unillustrated articles in total)

Data for ptwiki using the new data pipeline on the 2022-02-21 snapshot:

  • 1518613 suggestions for 126259 articles
  • including suggestions for 81177 of the same articles as the old IMA

We're expecting more suggestions from the new pipeline as we're not limiting the number of suggestions for a single article, and we're expecting suggestions for more articles because we're also using depicts data, so the above is within the boundaries of what we'd expect.

Note that it's difficult to compare the output of the two directly, as we've discovered a flaw in some of the underlying library code the original algorithm uses, which causes a bunch of data to be dropped. Working around that atm, will do some more data comparisons before closing.

Technical note: storing to Hive is actually storing to HDFS :) Hive is a SQL engine on top of structured data in HDFS.

Ok, I ran the old IMA again with the data-loss part fixed (I think!), and it now gives suggestions for 82399 articles, 81177 of which also get suggestions from the new pipeline.

We'd expect all suggestions from the old IMA to also be suggested by the new pipeline.
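
That expectation can be checked at the article level as a set difference. The helper below is a minimal sketch, assuming each pipeline's output has been reduced to a collection of article identifiers; `compare_coverage` is a hypothetical name, not part of either codebase. With the counts reported above, 82399 - 81177 = 1222 articles would end up in the `missing` list and need explaining.

```python
from typing import Iterable, List, Tuple


def compare_coverage(
    old_articles: Iterable[str],
    new_articles: Iterable[str],
) -> Tuple[int, List[str]]:
    """Return (number of shared articles, articles only the old run covers)."""
    old_set, new_set = set(old_articles), set(new_articles)
    shared = old_set & new_set
    missing = sorted(old_set - new_set)  # sorted for stable, reviewable output
    return len(shared), missing
```

An empty `missing` list would mean every article the old IMA covers is also covered by the new pipeline; it says nothing about whether the individual suggested images match.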

All the suggestions made by the IMA that are missing from the new pipeline (or at least the ones I've checked) can be explained as follows:

  • an image that is linked from a great many articles on any wiki is considered an icon (with the threshold depending on the size of the wiki)
  • an article that has only icons as images is considered unillustrated
  • the IMA allowed icons to be used as image suggestions, which meant that someone could add a suggestion to an article, but the article would still be considered unillustrated (a bug)
  • the new implementation removes icons from the suggestion list
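
The icon rules above might look roughly like this in code. The real thresholds are not given in this task, so the `divisor` and `minimum` values below are pure placeholders, and all function names and input shapes are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def icon_threshold(wiki_article_count: int, divisor: int = 1000, minimum: int = 10) -> int:
    """Placeholder rule: the threshold grows with the size of the wiki."""
    return max(minimum, wiki_article_count // divisor)


def find_icons(
    image_links: Dict[str, Dict[str, int]],  # image -> wiki -> number of linking articles
    wiki_sizes: Dict[str, int],              # wiki -> number of articles
) -> Set[str]:
    """An image is an icon if it exceeds the threshold on any wiki."""
    return {
        image
        for image, per_wiki in image_links.items()
        if any(count > icon_threshold(wiki_sizes[wiki]) for wiki, count in per_wiki.items())
    }


def is_unillustrated(article_images: Iterable[str], icons: Set[str]) -> bool:
    """True if the article has no images, or only icons."""
    return all(image in icons for image in article_images)


def drop_icon_suggestions(suggested_images: Iterable[str], icons: Set[str]) -> List[str]:
    """Mirror the new pipeline's behaviour: never suggest an icon."""
    return [image for image in suggested_images if image not in icons]
```

Filtering suggestions with the same `icons` set that defines "unillustrated" avoids the bug described above, where accepting a suggestion could leave the article still counted as unillustrated.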

... so to me it seems the new pipeline is working as expected. Waiting for confirmation from @mfossati and then we can close this.

... aaand merged. Closing the ticket