Are we sure all unillustrated articles are available via the API?
Open, Needs Triage | Public

Description

https://image-suggestion-api.toolforge.org/image-suggestions/v0/wikipedia/ceb/pages?offset=113476 returns a result

https://image-suggestion-api.toolforge.org/image-suggestions/v0/wikipedia/ceb/pages?offset=113477 does not

Are we sure there are only 113476 unillustrated articles in cebwiki? That seems a little low.

(also - the format of the result for https://image-suggestion-api.toolforge.org/image-suggestions/v0/wikipedia/ceb/pages?offset=113476 looks wrong - there's no project or page in the response)
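For anyone who wants to find the last offset that responds without probing by hand, here is a minimal sketch of a binary search over offsets. It assumes results are contiguous (every offset below the last valid one also returns something). The `has_result` callable is hypothetical: in practice it would wrap an HTTP GET against the `pages?offset=` URL above, and what counts as "a result" is an assumption, since the response format is not shown in this task. The stand-in below only mimics the behaviour described here.

```python
from typing import Callable


def last_valid_offset(has_result: Callable[[int], bool], upper_bound: int) -> int:
    """Return the largest offset in [0, upper_bound] for which has_result is true.

    Returns -1 if no offset returns a result. Assumes that if an offset has
    no result, no larger offset has one either.
    """
    lo, hi = 0, upper_bound
    best = -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if has_result(mid):
            best = mid      # mid works, so look for something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid fails, so the answer is below it
    return best


# Hypothetical stand-in for the real service: offsets up to 113476 "return a result".
def fake_has_result(offset: int) -> bool:
    return offset <= 113476


print(last_valid_offset(fake_has_result, 10_000_000))  # prints 113476
```

This needs about 24 probes for an upper bound of ten million, rather than one request per offset.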

Event Timeline

(also - the format of the result for https://image-suggestion-api.toolforge.org/image-suggestions/v0/wikipedia/ceb/pages?offset=113476 looks wrong - there's no project or page in the response)

This part looks like a service bug to me. I'll look into it.

Change 674407 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/services/image-suggestion-api@master] Fixed corrupt page data at end of .json file

https://gerrit.wikimedia.org/r/674407

Change 674407 merged by jenkins-bot:
[mediawiki/services/image-suggestion-api@master] Fixed corrupt page data at end of .json file

https://gerrit.wikimedia.org/r/674407

"Bad last article" fix merged, service restarted, affected .json files regenerated.

Note that cebwiki will now contain only 113475 articles (and all other wikis will correspondingly contain one fewer row than before this fix).

I guess this proves the old saying: "there are only three hard problems in programming: cache invalidation and off-by-one errors"

@BPirkle thanks for that! 113475 unillustrated articles still seems rather low for cebwiki though. I got a list of unillustrated cebwiki articles from @Miriam in mid-Feb that had 1357406 items. I understand many of them may have been disambiguation pages etc. that are now filtered out, but it seems unlikely that >90% of them were.

Adding @Clarakosi and @gmodena to comment on the discrepancy mentioned by @Cparle .

Hey @BPirkle @Cparle

Ack: we delivered 113458 unique unillustrated articles for cebwiki in the 2021-01 dataset.

The definition of “unillustrated page” is based on a heuristic (e.g. what counts as an image vs. an icon), and different settings might lead to more or less data being included. In the data pipeline we do not tune or change the default thresholds (see the threshold above which we consider images as non-icons at https://github.com/mirrys/ImageMatching/blob/main/algorithm.ipynb).

One nit: disambiguations, etc. are still present in the current dataset. Volumes might be lower as of the next algo run & dataset release.

Here's a quick check to verify that no records were lost in the export (to the best of my knowledge). gmodena.imagerec_prod is a staging table we use to generate the datasets that are consumed by the API.

select count(distinct(page_id)) as unique_pages, count(*) as suggestions from gmodena.imagerec_prod where snapshot='2021-01' and wiki='cebwiki';
unique_pages	suggestions
113458	190845

And the number of unique pages matches the exported tsv (what is delivered to the API):

cut -f 1 imagerec_prod_2021-01-25/prod-cebwiki-2021-01-25-wd_image_candidates.tsv | sort | uniq | wc -l
113459 <--- this includes one header label!
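The same count can be taken without the header line being included. This is a minimal sketch, assuming a tab-separated file with a single header row and the page id in the first column; the column names and sample rows are made up for illustration and are not the real export schema.

```python
import csv
import io


def count_unique_first_column(lines, has_header=True):
    """Count distinct values in the first column of tab-separated input.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header label so it is not counted
    return len({row[0] for row in reader if row})


# Hypothetical sample: two distinct pages, one of them with two suggestions.
sample = io.StringIO("page_id\timage_id\n1\ta.jpg\n1\tb.jpg\n2\tc.jpg\n")
print(count_unique_first_column(sample))  # prints 2
```

To run it against a real file, pass an open file object, e.g. `open(path, newline="", encoding="utf-8")`.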

The number is consistent with the number of articles in the raw model output (not post-processed by the ImageMatching data pipeline):

select count(distinct(page_id)) as unique_pages, count(*) as suggestions from gmodena.imagerec where snapshot='2021-01' and wiki_db='cebwiki';
unique_pages	suggestions
117255	117257

And just to make sure we did not lose any records when uploading to HDFS, here's the unique row count of the notebook output:

cut -f 3 runs/dc4c9aea-4e85-475f-9626-ad0909b92fb6/Output/cebwiki_2021-01-25_wd_image_candidates.tsv | sort | uniq | wc -l
117255

Everything looks consistent.

FYI: the suggestions figure is lower in the raw output because there all matching images are stored in a single row, whereas in production data we explode them into one row per image.
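To illustrate why exploding changes the row count but not the number of unique pages, here is a small sketch in plain Python. The field names (`page_id`, `images`) and the sample rows are hypothetical and are not the real table schema.

```python
# Hypothetical raw output: one row per page, all matching images in a list.
raw_rows = [
    {"page_id": 1, "images": ["a.jpg", "b.jpg", "c.jpg"]},
    {"page_id": 2, "images": ["d.jpg"]},
]


def explode(rows):
    """Turn one row per page into one row per (page, image) pair.

    In this sketch a page with an empty image list produces no output rows.
    """
    return [
        {"page_id": row["page_id"], "image": image}
        for row in rows
        for image in row["images"]
    ]


exploded = explode(raw_rows)
print(len(raw_rows), len(exploded))              # prints 2 4
print(len({row["page_id"] for row in exploded}))  # prints 2
```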

Hi @gmodena! When I run the notebook for the February snapshot (similar to the Jan one) I get the following numbers for cebwiki:

cebwiki
number of unillustrated articles: **1,435,202**

So, as @Cparle mentioned, 10 times more than what I see for the same snapshot on the gmodena.imagerec_prod table.

I think this is because algorithm.ipynb saves the allimages[wiki] table as its tsv output: this table contains all articles with initial candidates, and then sets to null the recommendations whose candidates are not valid. To export all unillustrated articles, we should save all pageids from the full qids_and_properties[wiki], which is a superset of allimages[wiki] and contains all unillustrated articles. @gmodena this is my fault, as I didn't name the tables in a very user-friendly way!
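A toy version of the difference between the two exports, as a sketch only: the page ids and image names below are made up, and plain Python sets and dicts stand in for the notebook tables, which are not reproduced here.

```python
# Hypothetical stand-ins for the notebook tables (not the real schema).
# qids_and_properties: page ids of all unillustrated articles.
qids_and_properties = {101, 102, 103, 104, 105}
# allimages: only articles with initial candidates; None marks a candidate
# that was later judged not valid.
allimages = {101: "a.jpg", 103: None}


def pages_missing_from_export(all_unillustrated, with_candidates):
    """Return page ids that are unillustrated but absent from the export."""
    return set(all_unillustrated) - set(with_candidates)


# allimages is a subset of qids_and_properties, as described above.
assert set(allimages) <= qids_and_properties

missing = pages_missing_from_export(qids_and_properties, allimages)
print(sorted(missing))  # prints [102, 104, 105]
```

Exporting from the superset keeps pages 102, 104 and 105, which have no candidates at all and so never appear in the smaller table.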

Hey @Miriam thanks for the help with troubleshooting this.

I filed this bug: https://phabricator.wikimedia.org/T278571. I just applied the change we discussed in Slack, and I'll now re-run the algo on the PoC wikis. Let's touch base on Monday and revisit.

@Cparle this was a great catch! I'll let you know as soon as new data is available.