
Investigate placeholder image recommendation
Closed, ResolvedPublic

Description

In testing the Image Recommendation API, we noticed that this placeholder image (https://commons.wikimedia.org/wiki/File:Missing_Arachnida.png) was recommended for many articles in cebwiki, likely because it is used often in itwiki for articles with no images.

The algorithm attempts to filter those out, so this task is meant for us to check on why this image is not filtered out. We may be able to discover some improvement.

Event Timeline

@Miriam -- I created this task because I think it's possible that investigating this image might surface potential improvements to the filtering. Do you think this is something you or @AikoChou could look into?

@Aiko and I talked about this. We are going to work on 2 things:

  1. Generate a list of the images that the algorithm currently detects as placeholders, check which categories they are labeled with, and exclude images from those categories when querying for candidates. As of now, an image is treated as a placeholder when its name contains one of the following substrings:

['.noantimage','no_free_image','image_manquante','replace_this_image','disambig','default','defaut','falta_imagem_','imageNA','noimage','noenzyimage']

  2. Build a simple computer vision model that can automatically detect whether an image is a placeholder or not.
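To make the current substring heuristic concrete, here is a minimal Python sketch. It is not the code from the repository: the function name `looks_like_placeholder`, the case-insensitive comparison, and the first example file name are assumptions made for illustration. It shows why the image from the task description is not caught: its name contains none of the listed substrings.

```python
# Substrings quoted in the comment above.
PLACEHOLDER_SUBSTRINGS = [
    ".noantimage", "no_free_image", "image_manquante", "replace_this_image",
    "disambig", "default", "defaut", "falta_imagem_", "imageNA", "noimage",
    "noenzyimage",
]


def looks_like_placeholder(file_name: str) -> bool:
    """Return True if the file name contains any known placeholder substring.

    Assumption: matching is case-insensitive. The thread does not say how
    the real algorithm handles case.
    """
    name = file_name.lower()
    return any(sub.lower() in name for sub in PLACEHOLDER_SUBSTRINGS)


# Hypothetical file name that matches "replace_this_image".
print(looks_like_placeholder("Replace_this_image_example.svg"))
# The image from the task description matches none of the substrings.
print(looks_like_placeholder("Missing_Arachnida.png"))
```

Under these assumptions the first call reports a placeholder and the second does not, which is the gap this task is about.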

Hey @Miriam @AikoChou,

Re point 2. This topic came up in a couple of conversations. cc @fkaelin; we've been talking about these things at high level.
I'd like to start reasoning about what it would mean to embed CV in this pipeline, or at least decoupling heuristics.

We can move this discussion to a dedicated Phab task and start to sketch requirements; happy to bounce ideas around. I'm not suggesting (or promising :)) we'll pick this up next, but I'd like to understand what would be needed to provide this type of capability in our stack.

Some things I'd like to capture.

Where would a CV model fit the current training pipeline

The way I look at it, we could either score to pre-process records, or post process results. The current pipeline looks like:

  1. Train model
  2. Upload to HDFS
  3. Generate production data (clean & explode json columns)
  4. Export and publish PoC wikis data

A natural fit would be a model scoring step before step 1, or before/after step 3.

How do we score instances

My assumption is that the model will be pre-trained and programmatically available.
I take it scoring will be performed in batch. But how?

What does the training lifecycle look like

TBH this is less of a concern for us at run time, but I'd like to get a feel
for how we could orchestrate and integrate models.
Two key points for my side of the fence:

  • model versioning and rollback
  • sharing/publishing should be transparent to us and access should be programmatic (no manual reloads).

Is it deep learning or traditional CV

Ultimately, we care about inference and should be model agnostic. However,
the type of artefact could have impact on technology choices (especially if it relies on GPUs for inference).

Forward compatibility

Whatever we do, it should be forward compatible with Train/Lift Wing, and have a clear migration or decommissioning path.

Hi @gmodena @MMiller_WMF

I updated the code in the GitHub repo (in the branch) that improves filtering out placeholders. The workflow is as follows - first use PetScan to search all the subcategories from Category:Image_placeholders (https://petscan.wmflabs.org/?psid=18699732). Next, query for all images from those categories in Hive. Then, exclude these images when querying for candidates in both wikidata commons category (fewer cases) and other wikis (many cases).

Also, in case we can't get the category list from PetScan (server problem or some other reason), I saved it as a table "aikochou.placeholder_category" in Hive. If any problem happens, we use the table instead, though it may not be up to date.
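The fallback described above can be sketched as follows. This is only an illustration of the control flow, not the code in the branch: `load_placeholder_categories`, `broken_petscan` and `saved_table` are hypothetical names, and the two data sources are passed in as plain callables so the sketch runs without PetScan or Hive.

```python
from typing import Callable, Iterable, List


def load_placeholder_categories(
    fetch_live: Callable[[], Iterable[str]],
    read_fallback: Callable[[], Iterable[str]],
) -> List[str]:
    """Prefer the live category list; use the saved copy if that fails."""
    try:
        categories = list(fetch_live())
        if categories:
            return categories
    except Exception as err:  # e.g. the PetScan server is unreachable
        print(f"live lookup failed ({err}); using the saved table")
    # The saved copy may be stale, as noted above.
    return list(read_fallback())


def broken_petscan() -> Iterable[str]:
    # Hypothetical stand-in for a failing PetScan request.
    raise ConnectionError("PetScan unavailable")


def saved_table() -> Iterable[str]:
    # Hypothetical stand-in for reading aikochou.placeholder_category.
    return ["Image_placeholders"]


print(load_placeholder_categories(broken_petscan, saved_table))
```

Passing the sources in as callables keeps the choice of HTTP client and Hive access out of the sketch; the real notebook may wire this differently.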

Another point, the code in algorithm.ipynb hasn't been changed yet. I've studied the pipeline and if I can, I would be glad to help with the implementation in algorithm.ipynb as well.

Finally, I'm not sure if the improvement fits the current pipeline well. If there is something I missed or if you have any suggestion, please let me know :)

Hi, after discussions on slack I quickly calculated the percentage of image suggestions that contain an image in a placeholder category, please see below.

Overall Stats
Median: 0.6% (for half of the languages, the share is 0.6% or less)
Average: 1.33%
Max: 5.25%
Min: 0.15%

Per-language Stats

enwiki: 1.84%
arwiki: 0.28%
kowiki: 0.15%
cswiki: 0.58%
viwiki: 3.00%
frwiki: 2.71%
fawiki: 0.56%
ptwiki: 0.24%
ruwiki: 0.58%
trwiki: 0.15%
plwiki: 0.78%
hewiki: 0.31%
svwiki: 3.43%
ukwiki: 1.61%
huwiki: 0.65%
hywiki: 0.21%
srwiki: 1.71%
euwiki: 5.25%
arzwiki: 0.23%
cebwiki: 4.22%
dewiki: 0.65%
bnwiki: 0.16%
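For anyone who wants to re-derive the overall stats from the per-language numbers, a short Python check is below. The values are copied from the list above; the variable names are mine. The result agrees with the reported figures (median about 0.6%, average 1.33%, max 5.25% for euwiki, min 0.15%).

```python
from statistics import mean, median

# Percentage of image suggestions in a placeholder category, per wiki.
pct = {
    "enwiki": 1.84, "arwiki": 0.28, "kowiki": 0.15, "cswiki": 0.58,
    "viwiki": 3.00, "frwiki": 2.71, "fawiki": 0.56, "ptwiki": 0.24,
    "ruwiki": 0.58, "trwiki": 0.15, "plwiki": 0.78, "hewiki": 0.31,
    "svwiki": 3.43, "ukwiki": 1.61, "huwiki": 0.65, "hywiki": 0.21,
    "srwiki": 1.71, "euwiki": 5.25, "arzwiki": 0.23, "cebwiki": 4.22,
    "dewiki": 0.65, "bnwiki": 0.16,
}

values = list(pct.values())
summary = {
    "median": median(values),
    "average": round(mean(values), 2),
    "max": max(values),
    "min": min(values),
}
print(summary)
# Wiki with the highest share of placeholder suggestions.
print(max(pct, key=pct.get))
```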

Hey @Miriam @AikoChou,

Thanks for this! I tested the changes, and wanted to validate if my understanding is correct:

  1. The "placeholder category" filter (image_placeholders) will be applied on top of the current threshold-based heuristic allowed_images. Do we have an idea of how much these two sets overlap?
  2. We filter out pages, not images (the left anti joins are on page title/id). If a page has an image that belongs to the "placeholder category" it (the page and its candidate images) will be excluded. Currently we exclude pages before applying the python logic, but we could also prune the recommendation datasets a-posteriori.
  3. The current "placeholder category" consists of 2961 images across all languages. How long would you expect this data to stay valid, and how often would it need to be refreshed?

The changes LGTM. They would need some adjustments to include them in our current pipeline, but nothing major.

However, since we are very close to PoC deadline, I'd like to avoid changes to the notebook/algo that might
alter its resource consumption, runtime footprint and population statistics (that we need to account for validation).

My preferred way of moving forward, for PoC, would be to perform the filtering downstream in the post-processing
part of the data pipeline. Assuming my understanding is correct, we can grab a snapshot of image_placeholders
and apply a similar filtering logic to the one we use to discard disambiguation pages. This will allow us to:

  1. Re-create production datasets without needing to re-run the algo (which is handy for testing/troubleshooting).
  2. Embed this new filtering step in our metrics collection/analysis machinery.

Problem with this approach: we won't be able to "replace" a discarded image recommendation with a new one in the post-processing phase. Would this be an acceptable tradeoff for PoC?
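A small Python sketch of that tradeoff, under a simplifying assumption that each page ends up with a single top-ranked recommendation (the real pipeline may keep more). The function names and the page/image data are hypothetical. Filtering after the recommendation is picked leaves the page with nothing; filtering the candidates first lets the next candidate take its place.

```python
from typing import Dict, List, Optional, Set


def top_then_filter(
    ranked: Dict[str, List[str]], placeholders: Set[str]
) -> Dict[str, Optional[str]]:
    """Post-processing: pick the top image, then discard it if it is a placeholder."""
    out: Dict[str, Optional[str]] = {}
    for page, images in ranked.items():
        top = images[0] if images else None
        out[page] = None if top in placeholders else top
    return out


def filter_then_top(
    ranked: Dict[str, List[str]], placeholders: Set[str]
) -> Dict[str, Optional[str]]:
    """Upstream filtering: drop placeholder candidates, then pick the top image."""
    out: Dict[str, Optional[str]] = {}
    for page, images in ranked.items():
        kept = [img for img in images if img not in placeholders]
        out[page] = kept[0] if kept else None
    return out


# Hypothetical ranked candidates for one page.
ranked = {"Page A": ["Missing_Arachnida.png", "Spider.jpg"]}
placeholders = {"Missing_Arachnida.png"}

print(top_then_filter(ranked, placeholders))
print(filter_then_top(ranked, placeholders))
```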

@AikoChou we are looking at refactoring of the pipeline for v1 of the ImageMatching service. I'd be happy to work together to include your changes in the next iteration of the pipeline. The category extraction logic in get_placeholder_category is very useful, and we could look at making it a standalone, reproducible, component rather than embed it in the notebook global state.

Hey @gmodena,

For point 1. I calculated the number of overlapped images in allowed_images and image_placeholders as follows:

enwiki: allowed_images:  21915, overlapped:  31
arwiki: allowed_images:  4794, overlapped:  15
kowiki: allowed_images:  3302, overlapped:  13
cswiki: allowed_images:  1778, overlapped:  14
viwiki: allowed_images:  2678, overlapped:  16
frwiki: allowed_images:  9013, overlapped:  33
fawiki: allowed_images:  3656, overlapped:  11
ptwiki: allowed_images:  5702, overlapped:  26
ruwiki: allowed_images:  12057, overlapped:  16
trwiki: allowed_images:  2530, overlapped:  13
plwiki: allowed_images:  7232, overlapped:  5
hewiki: allowed_images:  2295, overlapped:  7
svwiki: allowed_images:  2706, overlapped:  11
ukwiki: allowed_images:  7242, overlapped:  22
huwiki: allowed_images:  3265, overlapped:  4
hywiki: allowed_images:  1120, overlapped:  9
srwiki: allowed_images:  1888, overlapped:  9
euwiki: allowed_images:  1788, overlapped:  7
arzwiki: allowed_images:  581, overlapped:  2
cebwiki: allowed_images:  502, overlapped:  1
dewiki: allowed_images:  5769, overlapped:  21
bnwiki: allowed_images:  1227, overlapped:  7

It shows that very few of them overlap. Oh! So should we also consider using image_placeholders as a filter when querying unillustrated articles? Currently we regard pages with placeholder images as illustrated articles, so they will not be in the list of articles for which we generate image recommendations.
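The overlap itself is just a set intersection. A minimal sketch with hypothetical file names (the real inputs are the allowed_images and image_placeholders tables), followed by the share of allowed_images that the enwiki counts above correspond to:

```python
from typing import Set


def overlap(allowed_images: Set[str], image_placeholders: Set[str]) -> Set[str]:
    """Images that pass the threshold heuristic but sit in a placeholder category."""
    return allowed_images & image_placeholders


# Hypothetical contents, for illustration only.
allowed = {"Cat.jpg", "Missing_Arachnida.png", "Bridge.jpg"}
placeholders = {"Missing_Arachnida.png", "No_image.svg"}

shared = overlap(allowed, placeholders)
print(sorted(shared))

# Share of allowed_images in a placeholder category, from the enwiki counts.
enwiki_share = 100 * 31 / 21915
print(f"{enwiki_share:.2f}%")
```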

For point 2. We use left anti joins to filter out images.

LEFT ANTI JOIN image_placeholders
ON image_placeholders.page_title = pp.pp_value

The page_title and pp_value above are actually images, since we query the "mediawiki_page" table with wiki_db = 'commonswiki'. So if an unillustrated page has an image in another language that belongs to the "placeholder category", the image will not be added as a candidate image, but the page will still be in the final table (qids_and_properties).
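In plain Python, the effect of the anti join can be sketched like this. It mirrors the behaviour described above rather than the actual Hive query; the function name and the example data are hypothetical. Placeholder images are dropped from the candidates, while every page key survives, even with an empty candidate list.

```python
from typing import Dict, List, Set


def anti_join_candidates(
    candidates: Dict[str, List[str]], image_placeholders: Set[str]
) -> Dict[str, List[str]]:
    """Drop placeholder images from each page's candidates; keep every page."""
    return {
        page: [img for img in images if img not in image_placeholders]
        for page, images in candidates.items()
    }


# Hypothetical candidates keyed by page.
candidates = {
    "Q1": ["Missing_Arachnida.png", "Spider.jpg"],
    "Q2": ["Missing_Arachnida.png"],
}
image_placeholders = {"Missing_Arachnida.png"}

filtered = anti_join_candidates(candidates, image_placeholders)
print(filtered)
```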

For point 3. The current image_placeholders is calculated based on the 2021-02 snapshot of the "mediawiki_categorylinks" and "mediawiki_page" tables, and the "placeholder category" list queried from PetScan. I calculated a new image_placeholders based on the 2021-03 snapshot, which consists of 2996 images. So maybe it should be refreshed once a month?

Hey @AikoChou,

Thanks for clarifying. Moving the filtering logic to the post-processing phase, as I initially suggested, could
lead to discrepancies and articles being erroneously removed.
After code review with @Miriam today, and consulting with @Clarakosi, we decided on the following plan for the approaching PoC deadline:

  1. We'll use a static list of image_placeholders as a "source of truth". This will be generated on the 2021-03 snapshot.
  2. We'll port the image selection query to algorithm.ipynb and use it to generate data for 2021-03.

We'll use the query @Miriam used at https://phabricator.wikimedia.org/T277828#6957015 to validate results.

Weekly updates:

  • Modification of the category-based placeholder detection integrated with the main algorithm

Weekly updates: some placeholder images escape the filters we put together based on categories. I manually went through the top-100 annotated images and found ~15 of those. We should add those to the list of images to filter out, but also think of more scalable solutions.

kostajh subscribed.

I assume the investigation part of this is long-since resolved, so I'm closing this, but please re-open if there is more to do.