
📊 Project “instance of” field in raw ImageMatching datasets
Closed, Resolved | Public | 3 Estimated Story Points

Description

In order to determine an article "type", we should extend ImageMatching raw output with
an "instance of" field.

Acceptance criteria

  • all ImageMatching recommendations have an "instance of" field
  • hive schema has been updated
  • new data is published to hadoop

Note
This task will require re-running the algo on the whole dataset.

Event Timeline

gmodena set the point value for this task to 3.
gmodena updated the task description.

PR at https://github.com/mirrys/ImageMatching/pull/10.

Re-running the algo will take a while. I'll kick off the job once the PR is accepted and merged.
I'm currently generating data for a subset of wikis with snapshots 2021-01 and 2021-02.

tl;dr: we are unable to determine the type of ~13% (on average) of all unillustrated articles (with and without recommendations), and ~15% of unillustrated articles with at least one suggestion. This means that we won't be able to filter out those articles, since we can't determine what they are.
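To make the consequence concrete, here is a minimal sketch of how a filter keyed on "instance of" ends up with a third bucket of articles it cannot decide on. The row layout, the `instance_of` key, and the excluded set are assumptions made for illustration; they are not taken from the ImageMatching code. Q4167410 is the Wikidata item for "Wikimedia disambiguation page".

```python
from typing import Dict, Iterable, List, Optional

# Illustrative exclusion list (an assumption, not the project's actual list).
EXCLUDED_INSTANCE_OF = {"Q4167410"}  # Wikimedia disambiguation page


def classify(instance_of: Optional[str]) -> str:
    """Return 'unknown' when the field is missing, else 'drop' or 'keep'."""
    if not instance_of:
        return "unknown"
    return "drop" if instance_of in EXCLUDED_INSTANCE_OF else "keep"


def split_recommendations(rows: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group rows by filtering decision; 'unknown' rows cannot be filtered."""
    buckets: Dict[str, List[dict]] = {"keep": [], "drop": [], "unknown": []}
    for row in rows:
        buckets[classify(row.get("instance_of"))].append(row)
    return buckets


# Hypothetical rows: one person (Q5), one disambiguation page, one without metadata.
rows = [
    {"page_id": 1, "instance_of": "Q5"},
    {"page_id": 2, "instance_of": "Q4167410"},
    {"page_id": 3, "instance_of": None},
]
buckets = split_recommendations(rows)
print({name: [r["page_id"] for r in group] for name, group in buckets.items()})
```

The "unknown" bucket corresponds to the articles counted as missing "instance_of" in the tables.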

There is going to be some variance per wiki. Below you'll find a breakdown of stats from two samples (January and February 2021). "% missing" denotes the percentage of articles we won't be able to filter, out of all unillustrated_articles for a given wiki_db. The missing "instance_of" column is the number of articles we won't be able to filter.

These numbers depict articles we *know* we'll skip. For articles with metadata, we don't yet have an estimate of the quality and completeness of the "instance of" info.
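As a small worked example, the "% missing" column is the ratio of the two count columns expressed as a percentage. The function name below is hypothetical; the counts are the arwiki 2021-01 row from the first table.

```python
def percent_missing(unillustrated_articles: int, missing_instance_of: int) -> float:
    """Share of unillustrated articles that lack an "instance of" value, in percent."""
    if unillustrated_articles <= 0:
        raise ValueError("unillustrated_articles must be positive")
    return 100.0 * missing_instance_of / unillustrated_articles


# arwiki, snapshot 2021-01: 13519 of 100414 unillustrated articles lack "instance of".
print(round(percent_missing(100414, 13519), 2))
```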

All unillustrated articles

| wiki_db | snapshot | unillustrated_articles | missing "instance_of" | % missing |
| --- | --- | --- | --- | --- |
| arwiki | 2021-01 | 100414 | 13519 | 13.463262094927003 |
| arwiki | 2021-02 | 100378 | 13399 | 13.34854250931479 |
| arzwiki | 2021-01 | 149542 | 43047 | 28.785892926401946 |
| arzwiki | 2021-02 | 149001 | 40593 | 27.24344131918578 |
| bnwiki | 2021-02 | 10743 | 1643 | 15.293679605324398 |
| cebwiki | 2021-01 | 117257 | 7292 | 6.218818492712589 |
| cebwiki | 2021-02 | 118133 | 7261 | 6.14646203854977 |
| cswiki | 2021-01 | 55183 | 8991 | 16.293061268869035 |
| cswiki | 2021-02 | 55521 | 8877 | 15.988544874912195 |
| dewiki | 2021-02 | 22018 | 2504 | 11.372513398128804 |
| enwiki | 2021-01 | 299996 | 19601 | 6.533753783383778 |
| enwiki | 2021-02 | 303098 | 19370 | 6.390672323802862 |
| eswiki | 2021-01 | 124446 | 14999 | 12.052617199427864 |
| euwiki | 2021-01 | 34708 | 3286 | 9.467557911720641 |
| euwiki | 2021-02 | 34786 | 3296 | 9.475076180072444 |
| fawiki | 2021-01 | 118847 | 7879 | 6.629532087473811 |
| fawiki | 2021-02 | 118620 | 7862 | 6.627887371438206 |
| frwiki | 2021-01 | 147378 | 17604 | 11.94479501689533 |
| frwiki | 2021-02 | 149127 | 17445 | 11.698082842141263 |
| hewiki | 2021-01 | 20984 | 4068 | 19.386199008768585 |
| hewiki | 2021-02 | 21160 | 4071 | 19.23913043478261 |
| huwiki | 2021-01 | 39092 | 4310 | 11.025273713291723 |
| huwiki | 2021-02 | 39162 | 4262 | 10.882998825391962 |
| hywiki | 2021-01 | 24354 | 7296 | 29.958117762995812 |
| hywiki | 2021-02 | 25631 | 6892 | 26.88931372166517 |
| itwiki | 2021-01 | 119017 | 14701 | 12.352016938756648 |
| kowiki | 2021-01 | 64390 | 9056 | 14.064295698089765 |
| kowiki | 2021-02 | 62515 | 8954 | 14.322962489002638 |
| plwiki | 2021-01 | 104601 | 13943 | 13.329700480874942 |
| plwiki | 2021-02 | 104695 | 13732 | 13.116194660681026 |
| ptwiki | 2021-01 | 13581 | 1547 | 11.39091377659966 |
| ptwiki | 2021-02 | 13598 | 1529 | 11.244300632445949 |
| ruwiki | 2021-01 | 94161 | 17579 | 18.66908805131636 |
| ruwiki | 2021-02 | 94582 | 17392 | 18.38827683914487 |
| srwiki | 2021-01 | 40222 | 4227 | 10.509174083834717 |
| srwiki | 2021-02 | 40250 | 4186 | 10.4 |
| svwiki | 2021-01 | 138552 | 9431 | 6.80683064842081 |
| svwiki | 2021-02 | 139387 | 9338 | 6.699333510298665 |
| trwiki | 2021-01 | 51316 | 6039 | 11.768259412269078 |
| trwiki | 2021-02 | 45268 | 5825 | 12.867809490147566 |
| ukwiki | 2021-01 | 25467 | 4909 | 19.275925707778693 |
| ukwiki | 2021-02 | 25463 | 4851 | 19.051172289203944 |
| viwiki | 2021-01 | 114832 | 4507 | 3.92486414936603 |
| viwiki | 2021-02 | 127548 | 5219 | 4.091792893655722 |

Unillustrated articles with at least one suggestion

| wiki_db | snapshot | unillustrated_articles | missing "instance_of" | % missing |
| --- | --- | --- | --- | --- |
| arwiki | 2021-01 | 60740 | 11294 | 18.59400724399078 |
| arwiki | 2021-02 | 60898 | 11241 | 18.45873427698775 |
| arzwiki | 2021-01 | 35236 | 791 | 2.2448632080826427 |
| arzwiki | 2021-02 | 36731 | 787 | 2.142604339658599 |
| bnwiki | 2021-02 | 7067 | 1494 | 21.14051223998868 |
| cebwiki | 2021-01 | 64391 | 251 | 0.3898060287928437 |
| cebwiki | 2021-02 | 65125 | 254 | 0.39001919385796546 |
| cswiki | 2021-01 | 37638 | 7157 | 19.015356820234867 |
| cswiki | 2021-02 | 37803 | 7081 | 18.731317620294686 |
| dewiki | 2021-02 | 14597 | 2018 | 13.824758512023019 |
| enwiki | 2021-01 | 156601 | 16268 | 10.388183983499468 |
| enwiki | 2021-02 | 157877 | 16093 | 10.193378389505755 |
| eswiki | 2021-01 | 76711 | 12173 | 15.868649867685209 |
| euwiki | 2021-01 | 21696 | 2417 | 11.140302359882005 |
| euwiki | 2021-02 | 21753 | 2417 | 11.11111111111111 |
| fawiki | 2021-01 | 55383 | 7055 | 12.738565985952368 |
| fawiki | 2021-02 | 55699 | 7042 | 12.64295588789745 |
| frwiki | 2021-01 | 95798 | 14719 | 15.364621390843233 |
| frwiki | 2021-02 | 96571 | 14578 | 15.095629122614449 |
| hewiki | 2021-01 | 14701 | 3652 | 24.841847493367798 |
| hewiki | 2021-02 | 14829 | 3644 | 24.573470901611707 |
| huwiki | 2021-01 | 26862 | 3825 | 14.239446057627875 |
| huwiki | 2021-02 | 26913 | 3780 | 14.04525693902575 |
| hywiki | 2021-01 | 13602 | 3055 | 22.45993236288781 |
| hywiki | 2021-02 | 13535 | 3043 | 22.48245289988918 |
| itwiki | 2021-01 | 75880 | 11884 | 15.6615709014233 |
| kowiki | 2021-01 | 48964 | 8174 | 16.693897557389104 |
| kowiki | 2021-02 | 47044 | 8068 | 17.14990221919905 |
| plwiki | 2021-01 | 71655 | 12249 | 17.094410718023862 |
| plwiki | 2021-02 | 71683 | 12106 | 16.888244074606252 |
| ptwiki | 2021-01 | 10145 | 1286 | 12.676195170034498 |
| ptwiki | 2021-02 | 10175 | 1274 | 12.52088452088452 |
| ruwiki | 2021-01 | 60828 | 13397 | 22.024396659433158 |
| ruwiki | 2021-02 | 61146 | 13232 | 21.640009158407743 |
| srwiki | 2021-01 | 27498 | 3448 | 12.539093752272892 |
| srwiki | 2021-02 | 27530 | 3431 | 12.462767889575009 |
| svwiki | 2021-01 | 92377 | 8135 | 8.806304599629778 |
| svwiki | 2021-02 | 93387 | 8049 | 8.618972662147836 |
| trwiki | 2021-01 | 36226 | 5301 | 14.633136421354829 |
| trwiki | 2021-02 | 34177 | 5144 | 15.051057728881997 |
| ukwiki | 2021-01 | 15617 | 3133 | 20.061471473394377 |
| ukwiki | 2021-02 | 15621 | 3098 | 19.83227706292811 |
| viwiki | 2021-01 | 84084 | 3873 | 4.606108177536749 |
| viwiki | 2021-02 | 84087 | 3837 | 4.563131042848479 |

We could answer questions about the type of page more comprehensively using the tables categorylinks and templatelinks rather than structured data.

For example, if I want to know whether page 12345 on enwiki is a disambiguation page, I can run the SQL query:

select cl_from from categorylinks where cl_from=12345 and cl_to='Disambiguation_pages';
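For readers who want to try the lookup locally, here is a minimal sketch that runs the same query against an in-memory SQLite table. The two-column table is a simplified stand-in for the real categorylinks table, and the inserted rows and the `is_in_category` helper are made up for illustration.

```python
import sqlite3


def is_in_category(conn: sqlite3.Connection, page_id: int, category_title: str) -> bool:
    """True if the page with this ID is linked to the given category title."""
    row = conn.execute(
        "SELECT 1 FROM categorylinks WHERE cl_from = ? AND cl_to = ? LIMIT 1",
        (page_id, category_title),
    ).fetchone()
    return row is not None


# Simplified stand-in: cl_from is the page ID, cl_to is the category title.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        (12345, "Disambiguation_pages"),
        (12345, "Some_other_category"),
        (67890, "Some_other_category"),
    ],
)

print(is_in_category(conn, 12345, "Disambiguation_pages"))
print(is_in_category(conn, 67890, "Disambiguation_pages"))
```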

Hey @Cparle,

That's a good point!
To clarify: by categorylinks do you mean wmf_raw.mediawiki_categorylinks on the hive warehouse? Do you maybe have an estimate of how comprehensive it is across languages and categories? I'd expect quite a lot, since it's mediawiki.

In future iterations I'd be happy to revisit the current filtering logic, and augment it/replace it with categorylinks if that would help improve UX.

To give you some context, filtering based on wikidata is the default approach in the upstream code. We investigated alternative approaches to improve filtering, including parsing wikitext and mining wmf_raw.mediawiki_categorylinks.

Before committing to a rewrite we want to get some metrics to understand how well wikidata can satisfy our use case, what the error tolerance is, and build some expertise with the mediawiki_categorylinks data model and access patterns (e.g. efficient ways to handle translations).

Happy to discuss further.

> To clarify: by categorylinks do you mean wmf_raw.mediawiki_categorylinks on the hive warehouse?

I don't know much about hive, but if that's a copy/snapshot of the data in the mediawiki categorylinks table then yes. The only complicating factor is that cl_to is a text field with the title of the category, so for each language you'll need to figure out what the title of the "Disambiguation pages" category is. I expect there is one for every language.
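One way to handle the per-language titles is a small lookup table that is joined against categorylinks. The sketch below uses in-memory SQLite. The `disambiguation_categories` table, the `wiki_db` column on categorylinks, the `xxwiki` wiki and its placeholder title are all assumptions for illustration. Only the enwiki title comes from the query earlier in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE categorylinks (wiki_db TEXT, cl_from INTEGER, cl_to TEXT);
    -- Hypothetical mapping: one disambiguation category title per wiki.
    CREATE TABLE disambiguation_categories (wiki_db TEXT PRIMARY KEY, cl_to TEXT);
    """
)
conn.executemany(
    "INSERT INTO disambiguation_categories VALUES (?, ?)",
    [
        ("enwiki", "Disambiguation_pages"),
        ("xxwiki", "Placeholder_disambiguation_title"),  # placeholder, not a real title
    ],
)
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?)",
    [
        ("enwiki", 12345, "Disambiguation_pages"),
        ("enwiki", 222, "Some_other_category"),
        ("xxwiki", 7, "Placeholder_disambiguation_title"),
        ("xxwiki", 8, "Disambiguation_pages"),  # the enwiki title does not count on xxwiki
    ],
)


def disambiguation_pages(conn: sqlite3.Connection) -> list:
    """Return (wiki_db, page_id) pairs linked to their own wiki's disambiguation category."""
    return conn.execute(
        """
        SELECT cl.wiki_db, cl.cl_from
        FROM categorylinks AS cl
        JOIN disambiguation_categories AS dc
          ON dc.wiki_db = cl.wiki_db AND dc.cl_to = cl.cl_to
        ORDER BY cl.wiki_db, cl.cl_from
        """
    ).fetchall()


print(disambiguation_pages(conn))
```

Keeping the titles in a table rather than in code means adding a wiki only requires a new row, and the join runs once across all wikis.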

Correcting stats after fixing https://phabricator.wikimedia.org/T278571

All unillustrated articles

We are unable to determine the type of ~25% (average) of all unillustrated articles (with and without recommendations).

| % missing "instance of" | value |
| --- | --- |
| mean | 24.718111 |
| std | 12.918784 |
| min | 5.486282 |
| 25% | 15.473043 |
| 50% | 22.265679 |
| 75% | 30.816904 |
| max | 49.617633 |

| wiki_db | snapshot | unillustrated_articles | missing "instance_of" | % missing "instance of" |
| --- | --- | --- | --- | --- |
| arwiki | 2021-02 | 589264 | 173415 | 29.429084417171254 |
| arzwiki | 2021-02 | 780940 | 85037 | 10.889056777729403 |
| bnwiki | 2021-02 | 35877 | 10344 | 28.831842127268164 |
| cebwiki | 2021-02 | 1435203 | 151933 | 10.586167949760416 |
| cswiki | 2021-02 | 183875 | 46839 | 25.473283480625426 |
| dewiki | 2021-02 | 155648 | 33409 | 21.464458264802634 |
| enwiki | 2021-02 | 2958441 | 284991 | 9.63314799923338 |
| eswiki | 2021-02 | 668919 | 112138 | 16.764062614457057 |
| euwiki | 2021-02 | 105776 | 12270 | 11.599984873695355 |
| fawiki | 2021-02 | 301817 | 78055 | 25.86169765122574 |
| frwiki | 2021-02 | 962827 | 181983 | 18.90090327753584 |
| hewiki | 2021-02 | 74449 | 28212 | 37.894397507018226 |
| huwiki | 2021-02 | 171752 | 36373 | 21.177628208114026 |
| hywiki | 2021-02 | 96635 | 47948 | 49.61763336265329 |
| itwiki | 2021-02 | 731740 | 141299 | 19.310000819963374 |
| kowiki | 2021-02 | 275223 | 125991 | 45.77778746689049 |
| plwiki | 2021-02 | 563391 | 135367 | 24.02718538279809 |
| ptwiki | 2021-02 | 49152 | 10997 | 22.373453776041664 |
| ruwiki | 2021-02 | 586493 | 283207 | 48.28821486360451 |
| srwiki | 2021-02 | 126288 | 44176 | 34.98036234638287 |
| svwiki | 2021-02 | 1659645 | 156780 | 9.446598519562919 |
| trwiki | 2021-02 | 131637 | 29168 | 22.157903932784855 |
| ukwiki | 2021-02 | 110592 | 47846 | 43.263527199074076 |
| viwiki | 2021-02 | 865650 | 47492 | 5.486281984635823 |

Unillustrated articles with at least one suggestion

The fix did not impact articles for which a suggestion is available.
We are unable to determine the type of ~15% (average) of all unillustrated articles with at least one suggestion.

| % missing "instance of" | value |
| --- | --- |
| mean | 14.364244 |
| std | 6.132483 |
| min | 0.390019 |
| 25% | 12.124854 |
| 50% | 15.073343 |
| 75% | 18.526880 |
| max | 24.573471 |

| wiki_db | snapshot | unillustrated_articles | missing "instance_of" | % missing "instance of" |
| --- | --- | --- | --- | --- |
| arwiki | 2021-02 | 60898 | 11241 | 18.45873427698775 |
| arzwiki | 2021-02 | 36731 | 787 | 2.142604339658599 |
| bnwiki | 2021-02 | 7067 | 1494 | 21.14051223998868 |
| cebwiki | 2021-02 | 65125 | 254 | 0.39001919385796546 |
| cswiki | 2021-02 | 37803 | 7081 | 18.731317620294686 |
| dewiki | 2021-02 | 14597 | 2018 | 13.824758512023019 |
| enwiki | 2021-02 | 157877 | 16093 | 10.193378389505755 |
| eswiki | 2021-02 | 76865 | 12081 | 15.717166460677811 |
| euwiki | 2021-02 | 21753 | 2417 | 11.11111111111111 |
| fawiki | 2021-02 | 55699 | 7042 | 12.64295588789745 |
| frwiki | 2021-02 | 96571 | 14578 | 15.095629122614449 |
| hewiki | 2021-02 | 14829 | 3644 | 24.573470901611707 |
| huwiki | 2021-02 | 26913 | 3780 | 14.04525693902575 |
| hywiki | 2021-02 | 13535 | 3043 | 22.48245289988918 |
| itwiki | 2021-02 | 76326 | 11804 | 15.465241202211566 |
| kowiki | 2021-02 | 47044 | 8068 | 17.14990221919905 |
| plwiki | 2021-02 | 71683 | 12106 | 16.888244074606252 |
| ptwiki | 2021-02 | 10175 | 1274 | 12.52088452088452 |
| ruwiki | 2021-02 | 61146 | 13232 | 21.640009158407743 |
| srwiki | 2021-02 | 27530 | 3431 | 12.462767889575009 |
| svwiki | 2021-02 | 93387 | 8049 | 8.618972662147836 |
| trwiki | 2021-02 | 34177 | 5144 | 15.051057728881997 |
| ukwiki | 2021-02 | 15621 | 3098 | 19.83227706292811 |
| viwiki | 2021-02 | 84087 | 3837 | 4.563131042848479 |
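The summary rows (mean, std, min, quartiles, max) look like the output of a pandas `describe()` call over the per-wiki percentages. That is an assumption, since the thread does not say how they were produced. As a rough cross-check, the sketch below recomputes them with the standard library from the counts for unillustrated articles with at least one suggestion. `statistics.quantiles(..., method="inclusive")` is used because it interpolates linearly between data points, and `statistics.stdev` is the sample standard deviation.

```python
import statistics

# (wiki_db, unillustrated_articles, missing "instance_of"), snapshot 2021-02,
# unillustrated articles with at least one suggestion.
ROWS = [
    ("arwiki", 60898, 11241), ("arzwiki", 36731, 787), ("bnwiki", 7067, 1494),
    ("cebwiki", 65125, 254), ("cswiki", 37803, 7081), ("dewiki", 14597, 2018),
    ("enwiki", 157877, 16093), ("eswiki", 76865, 12081), ("euwiki", 21753, 2417),
    ("fawiki", 55699, 7042), ("frwiki", 96571, 14578), ("hewiki", 14829, 3644),
    ("huwiki", 26913, 3780), ("hywiki", 13535, 3043), ("itwiki", 76326, 11804),
    ("kowiki", 47044, 8068), ("plwiki", 71683, 12106), ("ptwiki", 10175, 1274),
    ("ruwiki", 61146, 13232), ("srwiki", 27530, 3431), ("svwiki", 93387, 8049),
    ("trwiki", 34177, 5144), ("ukwiki", 15621, 3098), ("viwiki", 84087, 3837),
]


def summarize(rows):
    """Summary statistics of the per-wiki '% missing' values (unweighted by wiki size)."""
    pct = [100.0 * missing / total for _, total, missing in rows]
    q1, median, q3 = statistics.quantiles(pct, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(pct),
        "std": statistics.stdev(pct),  # sample standard deviation
        "min": min(pct),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(pct),
    }


for name, value in summarize(ROWS).items():
    print(f"{name}\t{value:.6f}")
```

Note that the mean is an unweighted average over wikis, so a small wiki counts as much as enwiki. A pooled rate (total missing divided by total unillustrated) would give a different figure.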