Together with @Swagoel, we are refining the heuristics that define:
- The list of images considered as icons
- The way in which we detect already illustrated articles
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | CBogen | T254768 [EPIC] Image recommendations proof-of-concept phase
Resolved | | Miriam | T256081 Image matching algorithm
Resolved | | Miriam | T268350 Improve algorithm for unillustrated article selection
Work done by @Swagoel for icon detection:
The following is a threshold function we found to be reasonably effective:
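```python
import math


def get_threshold(wiki_size):
    # Change th to optimize precision vs. recall; recommended value for accuracy = 5.
    # sze = base wiki size, th = base threshold, lim = limiting value.
    sze, th, lim = 50000, 10, 4
    if wiki_size >= sze:
        # If wiki_size >= base size, scale the threshold by (log of wiki_size / base size) + 1.
        return (math.log(wiki_size / sze) + 1) * th
    # Else scale th down by the ratio base size / wiki size,
    # with the minimum possible value of th being th / limiting value.
    # Note: for 0 < wiki_size < sze, sze / wiki_size > 1, so as written
    # this min() always evaluates to th / lim.
    return min((sze / wiki_size) * th, th / lim)
```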
To verify that the threshold “formula” was reasonably effective, we created pseudo-ground-truth approximations for the languages 'it', 'zh', 'cs', 'he', 'ta', 'bs', 'be', and 'ast' using a high-recall version of my classification tool, then evaluated the threshold's performance against that pseudo ground truth.
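As a rough illustration of that evaluation step, here is a minimal sketch that scores a set of predicted icons against a pseudo-ground-truth set using precision and recall. The function name `precision_recall` and the file names are hypothetical, and the choice of precision and recall as the metrics is an assumption; the actual evaluation may have been done differently.

```python
def precision_recall(predicted_icons, pseudo_truth_icons):
    """Return (precision, recall) of a predicted icon set against a reference set."""
    predicted = set(predicted_icons)
    reference = set(pseudo_truth_icons)
    true_positives = len(predicted & reference)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical file names, for illustration only.
predicted = {"Flag_icon.svg", "Commons-logo.svg", "Map_pin.svg", "Portrait.jpg"}
pseudo_truth = {"Flag_icon.svg", "Commons-logo.svg", "Edit-icon.svg"}

precision, recall = precision_recall(predicted, pseudo_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the reference set comes from a high-recall classifier rather than from manual labels, any precision figure computed this way is only approximate.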
So I re-ran the unillustrated article detection using:
I then ran the image matching algorithm on top of the new set of unillustrated articles. I eyeballed the results, and they looked more consistent than in the earlier version. Below are the quantitative coverage results:
Wiki | Unillustrated articles | With Wikidata image | With Wikidata Commons category | With language links
---|---|---|---|---
kowiki | 273305 | 15983 | 28324 | 83995
arwiki | 580284 | 7028 | 26526 | 121891
viwiki | 867565 | 49226 | 57548 | 117138
cswiki | 181867 | 8337 | 21120 | 69413
frwiki | 951319 | 10938 | 39457 | 236592
enwiki | 2922830 | 36412 | 92072 | 325534
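
To make these counts easier to compare across wikis, the sketch below expresses each one as a share of that wiki's unillustrated articles. It assumes each count is a subset of the unillustrated articles for that wiki; the dictionary keys and the `coverage` helper are names chosen for illustration. The three signals may overlap for a given article, so the shares should not be summed.

```python
# Counts copied from the results above; key names are illustrative.
results = {
    "kowiki": {"unillustrated": 273305, "wikidata_image": 15983,
               "commons_category": 28324, "language_links": 83995},
    "arwiki": {"unillustrated": 580284, "wikidata_image": 7028,
               "commons_category": 26526, "language_links": 121891},
    "viwiki": {"unillustrated": 867565, "wikidata_image": 49226,
               "commons_category": 57548, "language_links": 117138},
    "cswiki": {"unillustrated": 181867, "wikidata_image": 8337,
               "commons_category": 21120, "language_links": 69413},
    "frwiki": {"unillustrated": 951319, "wikidata_image": 10938,
               "commons_category": 39457, "language_links": 236592},
    "enwiki": {"unillustrated": 2922830, "wikidata_image": 36412,
               "commons_category": 92072, "language_links": 325534},
}


def coverage(stats):
    """Return each signal's share of the unillustrated articles, in percent."""
    total = stats["unillustrated"]
    return {
        name: 100.0 * count / total
        for name, count in stats.items()
        if name != "unillustrated"
    }


for wiki, stats in results.items():
    shares = coverage(stats)
    summary = ", ".join(f"{name}: {share:.1f}%" for name, share in shares.items())
    print(f"{wiki}: {summary}")
```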