
Not all unillustrated articles are stored in the model output.
Closed, Resolved | Public | BUG REPORT

Description

Only a subset of unillustrated articles (with and without recommendations) is saved.

A filtering step in the post-processing of results is excluding a subset of unillustrated articles that
have no recommendation.
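To illustrate the kind of filtering described above, here is a minimal sketch in plain Python. It is not the code from the notebook: the page ids, file names, and both helper functions are hypothetical, and the example is simplified so that every article without a recommendation is dropped by the buggy variant.

```python
# Hypothetical data: page ids of unillustrated articles, and a mapping of
# page id -> recommended image. Only some articles have a recommendation.
unillustrated = [101, 102, 103, 104]
recommendations = {101: "Foo.jpg", 103: "Bar.png"}


def rows_dropping_unmatched(pages, recs):
    """Inner-join style: articles with no recommendation are lost."""
    return [(page, recs[page]) for page in pages if page in recs]


def rows_keeping_all(pages, recs):
    """Left-join style: every article is kept; a missing recommendation is None."""
    return [(page, recs.get(page)) for page in pages]


buggy = rows_dropping_unmatched(unillustrated, recommendations)
fixed = rows_keeping_all(unillustrated, recommendations)
print(len(buggy), len(fixed))
```

With the second variant the output has one row per unillustrated article, whether or not a recommendation exists, which is the behaviour the expected results below call for.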

This bug was initially reported at https://phabricator.wikimedia.org/T277875#6948836

Steps to Reproduce:

Run the algorithm.py notebook with cebwiki and snapshot 2021-02.

Actual Results:

The output file contains 117255 records.

Expected Results:

The output file should contain 1435202 records.

Event Timeline

PR at https://github.com/mirrys/ImageMatching/pull/new/T278571-bugfix-save-all-articles

No image source is excluded when building the allimages dataset.

Validation

The cebwiki TSV output contains 1435203 records (1435202 unillustrated articles + 1 header row).
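A check like this one can be reproduced by counting the rows of the TSV and subtracting the header. The sketch below assumes a tab-separated file with a single header row; the function name and the sample column names are hypothetical, and an in-memory sample stands in for the real output file.

```python
import csv
import io


def count_data_rows(tsv_file, has_header=True):
    """Count the records in a TSV file object, excluding the header row if present."""
    reader = csv.reader(tsv_file, delimiter="\t")
    total = sum(1 for _ in reader)
    return total - 1 if has_header and total > 0 else total


# Hypothetical sample: one header row plus two articles, one without a recommendation.
sample = io.StringIO("page_id\timage\n101\tFoo.jpg\n102\t\n")
print(count_data_rows(sample))
```

For a file on disk, the same function can be passed a handle opened with `open(path, newline="")`.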

Stats for all 24 PoC wikis stored in Hive follow:

snapshot  wiki_db  unillustrated_articles
2021-02   arwiki   589264
2021-02   arzwiki  780940
2021-02   bnwiki   35877
2021-02   cswiki   183875
2021-02   enwiki   2958441
2021-02   fawiki   301817
2021-02   hewiki   74449
2021-02   plwiki   563391
2021-02   srwiki   126288
2021-02   viwiki   865650
2021-02   euwiki   105776
2021-02   huwiki   171752
2021-02   itwiki   731740
2021-02   kowiki   275223
2021-02   svwiki   1659645
2021-02   trwiki   131637
2021-02   ukwiki   110592
2021-02   cebwiki  1435203
2021-02   dewiki   155648
2021-02   eswiki   668919
2021-02   frwiki   962827
2021-02   hywiki   96635
2021-02   ptwiki   49152
2021-02   ruwiki   586493