Wikidata Item Quality Model
Open, Needs Triage, Public

Description

Overview

Revisit what an item quality model would look like for Wikidata. Use cases / motivations:

  • Structured data gap metrics for Wikipedia articles

Open Questions

  • What is the scope? Just items with Wikipedia sitelinks? All items? Some other subset perhaps based on instance-of properties?
  • What are the drawbacks of the existing ORES model (see more below) that we can try to address?
  • How might we get labeled data to help evaluate any resulting models?
  • How can we estimate what properties are missing for an item (completeness)? Based on similar instance-ofs? Based on info in corresponding Wikipedia articles? Using Schemas? Something else?
  • What are the other use-cases associated with this model?
  • What features might we use beyond counts of properties/values/references? Embeddings?
  • Can we / should we generate weights for different properties -- i.e. are certain statements "more important" than others to the quality of an item?
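
One way to make the "based on similar instance-ofs" idea from the completeness question concrete is sketched below. It is only an illustration under stated assumptions, not a proposed model: each item is assumed to be a dict with `instance_of` and `properties` lists, the function names `expected_properties` and `completeness` are hypothetical, the item ids `A`, `B`, `C` are toy data, and the 50% threshold (`min_share`) is an arbitrary choice. A property counts as "expected" for a class if at least that share of the class's items have it, and an item's completeness is the fraction of expected properties it actually has.

```python
from collections import Counter, defaultdict


def expected_properties(items, min_share=0.5):
    """Map each instance-of class to the properties that at least
    `min_share` of its items have (an assumed, arbitrary threshold)."""
    class_sizes = Counter()
    prop_counts = defaultdict(Counter)
    for item in items:
        for cls in item["instance_of"]:
            class_sizes[cls] += 1
            # set() so a property with several statements is counted once per item
            prop_counts[cls].update(set(item["properties"]))
    return {
        cls: {p for p, n in counts.items() if n / class_sizes[cls] >= min_share}
        for cls, counts in prop_counts.items()
    }


def completeness(item, expected):
    """Fraction of the expected properties (over all of the item's classes)
    that the item has; None if nothing is expected for its classes."""
    wanted = set()
    for cls in item["instance_of"]:
        wanted |= expected.get(cls, set())
    if not wanted:
        return None
    return len(wanted & set(item["properties"])) / len(wanted)


# Toy data: three hypothetical humans (Q5) with P21 (sex or gender),
# P569 (date of birth), and P106 (occupation).
toy_items = [
    {"id": "A", "instance_of": ["Q5"], "properties": ["P21", "P569", "P106"]},
    {"id": "B", "instance_of": ["Q5"], "properties": ["P21", "P569"]},
    {"id": "C", "instance_of": ["Q5"], "properties": ["P21"]},
]

expected = expected_properties(toy_items, min_share=0.5)
scores = {item["id"]: completeness(item, expected) for item in toy_items}
```

With this toy data only P21 and P569 clear the threshold for Q5, so items A and B score 1.0 and item C scores 0.5. A weighted variant, as the last question suggests, could replace the plain set intersection with a sum of per-property weights.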

Resources

Event Timeline

  • Looked into Schemas, but there is a lot of variation in what they contain, and I don't see an easy way right now to map an item to the relevant schema without duplicating an extensive system of schema validation. Still, they are very tempting as a way to let the community define the expectations for a given type of item, so I will continue to explore.
  • Quick data on the top instance-of values for Wikidata items with at least one Wikipedia article sitelink is below. There is a clear long tail, so I will start exploring whether the sub-class-of ontology is useful for aggregating them to higher-level nodes. My concern is that while semantically a hill and a lake are quite similar and should be the same high-level class, in practice I assume their expected properties are quite different.
          value  num_items                                                               label
0            Q5    4221942                                                               human
1        Q16521    2201295                                                               taxon
2      Q4167410    1396595                                       Wikimedia disambiguation page
3       Q486972     506982                                                    human settlement
4         Q8502     506973                                                            mountain
5         Q4022     391203                                                               river
6     Q13406463     320695                                              Wikimedia list article
7        Q54050     312049                                                                hill
8       Q482994     267636                                                               album
9        Q11424     254947                                                                film
10       Q23397     234597                                                                lake
11         Q532     228822                                                             village
12         Q318     224199                                                              galaxy
13      Q355304     175557                                                         watercourse
14    Q27020041     157860                                                       sports season
15       Q23442     147314                                                              island
16     Q4830453     127013                                                            business
17       Q47521     123253                                                              stream
18        Q5084     111260                                                              hamlet
19     Q7725634     111209                                                       literary work
20       Q16970     110131                                                     church building
21        Q3863     106068                                                            asteroid
22    Q56436498      97433                                                    village in India
23       Q39816      95924                                                              valley
24      Q134556      93556                                                              single
25       Q55488      86071                                                     railway station
26    Q22808320      82069                            Wikimedia human name disambiguation page
27      Q101352      80401                                                         family name
28      Q215380      78312                                                       musical group
29       Q22698      77244                                                                park
30   Q105543609      71015                                            musical work/composition
31       Q12323      70796                                                                 dam
32    Q26211545      68317                                                                desa
33      Q124714      66546                                                              spring
34    Q55659167      66200                                                 natural watercourse
35       Q34442      60762                                                                road
36     Q5398426      58043                                                   television series
37     Q3558970      56061                                                   village of Poland
38      Q187971      51698                                                                wadi
39       Q41176      51414                                                            building
40       Q79007      50270                                                              street
41       Q43229      47929                                                        organization
42       Q15416      47491                                                  television program
43       Q12284      47298                                                               canal
44    Q18340514      44986                            events in a specific year or time period
45      Q820477      44082                                                                mine
46      Q484170      40021                                                   commune of France
47        Q7889      39250                                                          video game
48      Q253019      38784                                                            Ortsteil
49    Q17343829      38516                       unincorporated community in the United States
50      Q476028      37410                                           association football club
51    Q24529780      36535                                                               point
52         Q523      35430                                                                star
53    Q23038290      34830                                                        fossil taxon
54    Q26887310      34589                                    association football team season
55     Q3257686      33532                                                            locality
56       Q39594      32760                                                                 bay
57    Q23925393      32025                                                               douar
58    Q16510064      28952                                                      sporting event
59       Q11173      27974                                                   chemical compound
60       Q14350      27371                                                       radio station
61     Q1681353      27274                                                                 bog
62       Q24862      27041                                                          short film
63    Q21672098      26910                                                  village of Ukraine
64    Q46190676      26902                                                        tennis event
65        Q3947      26865                                                               house
66       Q34763      26660                                                           peninsula
67    Q67206691      26505                                                     infrared source
68    Q47461344      26276                                                        written work
69      Q123705      26085                                                        neighborhood
70    Q20202352      25526                                                  locality of Mexico
71    Q20541692      25309                                        settlement in Galicia, Spain
72       Q33506      25085                                                              museum
73        Q3914      24616                                                              school
74       Q41298      24587                                                            magazine
75        Q9826      24321                                                         high school
76     Q1248784      24101                                                             airport
77    Q26895936      23851                                       American football team season
78        Q4421      23726                                                              forest
79       Q30198      23530                                                               marsh
80     Q6881511      23041                                                          enterprise
81      Q839954      22971                                                 archaeological site
82      Q176799      22893                                                       military unit
83       Q11446      22273                                                                ship
84      Q811979      22179                                             architectural structure
85     Q1529096      21710                                                   village in Turkey
86      Q735428      21622                                                   township in China
87      Q207326      21547                                                              summit
88       Q46831      21403                                                      mountain range
89      Q847017      21166                                                         sports club
90    Q17051044      20896                                                             mahalle
91      Q133056      20669                                                       mountain pass
92      Q185113      20524                                                                cape
93     Q3305213      20143                                                            painting
94     Q2042028      19343                                                              ravine
95       Q11032      19190                                                           newspaper
96      Q473972      19088                                                      protected area
97      Q634099      19019                                          rural settlement of Russia
98     Q2514025      18783                                                            posyolok
99        Q7187      18713                                                                gene
100    Q1134686      18581                                                            frazione
101     Q131681      18488                                                           reservoir
102    Q1539532      18410                                      sports season of a sports club
103   Q15303838      18228                                                   municipality seat
104   Q21079327      18112                                                  intermittent river
105    Q3464665      18072                                            television series season
106     Q740445      17692                                                               ridge
107       Q7278      17598                                                     political party
108      Q23413      17596                                                              castle
109     Q207524      17321                                                               islet
110   Q15632617      17212                                                     fictional human
111     Q506240      16589                                                     television film
112      Q40231      16316                                                     public election
113     Q192287      16271                                               subdivision of Russia
114   Q17018380      16225                                                               bight
115     Q618779      15874                                                               award
116   Q15221623      15757                                                  bilateral relation
117    Q2231510      15597                                                                 col
118    Q1002697      15482                                                          periodical
119     Q842402      15265                                                        Hindu temple
120   Q12308941      15072                                                     male given name
121   Q21191270      15056                                           television series episode
122     Q166735      14643                                                               swamp
123     Q928830      14587                                                       metro station
124       Q3957      14582                                                                town
125      Q40080      14518                                                               beach
126   Q47345468      14430                                           tennis tournament edition
127     Q178561      14425                                                              battle
128   Q13406554      14133                                                  sports competition
129     Q102496      14017                                                              parish
130     Q210272      14006                                                   cultural heritage
131      Q16917      13225                                                            hospital
132   Q21198342      13179                                                        manga series
133   Q18663566      12973                                     dissolved municipality of Japan
134     Q728937      12921                                                        railway line
135   Q26894053      12807                                              basketball team season
136       Q3918      12786                                                          university
137     Q169930      12614                                                       extended play
138    Q2065736      12517                                                   cultural property
139   Q46135307      12334                                         nation at sport competition
140      Q39614      12327                                                            cemetery
141    Q4504495      12223                                                      award ceremony
142     Q187223      12154                                                              lagoon
143   Q65661087      12100                                                  village in Belarus
144    Q1115575      12086                                                        civil parish
145     Q459297      12076                                                               qanat
146     Q160091      11656                                                               plain
147    Q4989906      11609                                                            monument
148    Q2732840      11419                                                      Gram panchayat
149       Q9842      11268                                                      primary school
150     Q891723      10915                                                      public company
151      Q12280      10913                                                              bridge
152     Q783794      10844                                                             company
153     Q163740      10817                                              nonprofit organization
154     Q751876      10812                                                             château
155     Q498162      10714                                             census-designated place
156      Q34038      10697                                                           waterfall
157      Q28337      10655                                                               shoal
158    Q3231690      10614                                                    automobile model
159    Q5358913      10551                                          elementary school in Japan
160      Q62447      10507                                                           aerodrome
161     Q327333      10507                                                   government agency
162    Q4164871      10430                                                            position
163    Q5633421      10410                                                  scientific journal
164    Q4414033      10278                                            rural council of Ukraine
165      Q14659      10205                                                        coat of arms
166   Q11879590      10167                                                   female given name
167   Q15243209      10065                                                   historic district
168   Q13417114      10059                                                        noble family
169       Q7366      10036                                                                song
170     Q179049       9952                                                      nature reserve
171     Q262166       9929                                             municipality of Germany
172     Q184358       9883                                                                reef
173    Q2389082       9859                                            rural commune of Vietnam
174     Q965568       9786                                                           kelurahan
175      Q35127       9758                                                             website
176   Q17524420       9742                                                   aspect of history
177    Q1093829       9724                                           city of the United States
178        Q726       9609                                                               horse
179    Q1555508       9564                                                       radio program
180     Q479053       9563                                                   gromada of Poland
181     Q310890       9409                                                     monotypic taxon
182    Q1500350       9370                          township of the People's Republic of China
183     Q986065       9234                                                subdistrict of China
184      Q34770       9101                                                            language
185   Q61089180       9031                          part of municipality in the Czech Republic
186   Q74817647       8848                                       aspect in a geographic region
187        Q515       8811                                                                city
188      Q41710       8654                                                        ethnic group
189       Q7397       8642                                                            software
190     Q107679       8548                                                               cliff
191    Q2085381       8502                                                           publisher
192     Q174782       8423                                                              square
193   Q23012917       8375                                             village of Burkina Faso
194     Q860861       8289                                                           sculpture
195      Q18127       8281                                                        record label
196      Q32815       8259                                                              mosque
197     Q108325       8227                                                              chapel
198     Q585956       8036                                                               masia
199   Q58483083       8030                                              dramatico-musical work
200    Q5393308       8001                                                     Buddhist temple
201      Q44613       7961                                                           monastery
202     Q159334       7901                                                    secondary school
203      Q56061       7895                                   administrative territorial entity
204     Q747074       7863                                                     comune of Italy
205   Q67206785       7804                                                      near-IR source
206      Q24354       7768                                                             theater
207      Q35666       7649                                                             glacier
208      Q35509       7558                                                                cave
209   Q15773317       7546                                                television character
210    Q1616075       7493                                                  television station
211  Q104093746       7455                                                        lake or pond
212      Q27686       7384                                                               hotel
213     Q483110       7365                                                             stadium
214   Q19832486       7361                                                    locomotive class
215   Q98645843       7240                                        Wikimedia music-related list
216      Q16560       7191                                                              palace
217       Q8436       7171                                                              family
218    Q1076486       7085                                                        sports venue
219   Q22808404       7004                                          station located on surface
220    Q2334719       6989                                                          legal case
221   Q17315159       6972                            international association football match
222    Q2074737       6958                                               municipality of Spain
223    Q4671277       6946                                                academic institution
224    Q1288568       6926                                                     modern language
225    Q3240003       6875                                                           bus route
226   Q17362920       6836                                           Wikimedia duplicated page
227   Q30092776       6748                                                     lake water body
228    Q3700011       6713                                                           kecamatan
229   Q15911738       6627                                         hydroelectric power station
230        Q571       6576                                                                book
231   Q18536594       6536                                              Olympic sporting event
232    Q4285979       6490                                                             gampong
233   Q15127012       6481                                           town of the United States
234     Q736917       6450                                                           formation
235    Q1210950       6424                                                             channel
236     Q759421       6384                                                   Naturschutzgebiet
237      Q11315       6383                                                     shopping center
238   Q14757767       6341                     fourth-level administrative country subdivision
239   Q51031626       6319                            sport competition at a multi-sport event
240     Q786820       6301                                             automobile manufacturer
241     Q537127       6291                                                         road bridge
242     Q149566       6268                                                       middle school
243    Q5153359       6266                                  municipality of the Czech Republic
244    Q1497375       6224                                              architectural ensemble
245     Q188509       6206                                                              suburb
246    Q2023000       6105                                                              khutor
247   Q22808403       6101                                         station located underground
248     Q398141       6070                                                     school district
249     Q211748       6003                                                           oil field
250       Q3937       5937                                                           supernova
251       Q7075       5936                                                             library
252    Q2385804       5879                                             educational institution
253   Q12089225       5854                                                     mineral species
254      Q55678       5837                                                        railway stop
255     Q180684       5795                                                            conflict
256   Q38033430       5646                                                      class of award
257     Q192611       5643                                                  electoral district
258  Q112193867       5627                                                    class of disease
259     Q273057       5624                                                         discography
260      Q28640       5602                                                          profession
261    Q3184121       5569                                              municipality of Brazil
262   Q55521176       5537                                     lower secondary school in Japan
263      Q48204       5477                                               voluntary association
264      Q31855       5458                                                  research institute
265    Q1402592       5429                                                        island group
266     Q317557       5372                                                       parish church
267     Q811704       5339                                                 rolling stock class
268     Q152450       5333                                                  municipal election
269   Q67015883       5302                                           group or class of enzymes
270      Q87167       5297                                                          manuscript
271   Q19730508       5293                                                 former municipality
272    Q1114461       5277                                                    comics character
273    Q2001305       5269                                                  television channel
274   Q20871353       5202                                cadastral area in the Czech Republic
275    Q1131296       5130                                               freguesia of Portugal
276     Q618123       5029                                                geographical feature
277     Q868557       5023                                                      music festival
278   Q15630849       5017                                                 village of Bulgaria
279   Q22580836       5001                                                   village in Latvia
280        Q341       4998                                                       free software
281    Q2247863       4987                                             high proper-motion star
282    Q1081138       4974                                                       historic site
283    Q3658341       4972                                                  literary character
284   Q21286738       4963                                  Wikimedia permanent duplicate item
285      Q25391       4951                                                                dune
286     Q207694       4940                                                          art museum
287      Q46970       4937                                                             airline
288   Q14839548       4924                                            minor locality in Sweden
289    Q2983893       4903                                                             quarter
290    Q3812392       4878                                         union council of Bangladesh
291   Q56351315       4824                                                Japanese high school
292    Q1194951       4769                                                national sports team
293   Q94993988       4759                                        commercial traffic aerodrome
294   Q26896697       4737                                                baseball team season
295      Q39715       4727                                                          lighthouse
296   Q22674925       4680                                                   former settlement
297   Q13393265       4678                                                     basketball team
298     Q845945       4584                                                       Shinto shrine
299   Q26213387       4574                                                  Olympic delegation
300     Q478251       4568                                                                 bar
301    Q2175765       4538                                                           tram stop
302    Q1142192       4510                                                       peculiar star
303     Q188451       4508                                                         music genre
304    Q2088357       4479                                                    musical ensemble
305   Q51049922       4479                                                  village in Estonia
306     Q350895       4439                                                   abandoned village
307    Q2831984       4419                                                    comic book album
308   Q16887036       4400                                                                 gap
309    Q3240227       4395                                                          raised bog
310      Q34627       4334                                                           synagogue
311    Q5327369       4323                                                              chōchō
312      Q41253       4288                                                       movie theater
313   Q19692233       4287                                           Wikimedia list of persons
314       Q3950       4259                                                               villa
315    Q7551933       4221                                                            mahallah
316     Q820655       4214                                                             statute
317      Q11303       4208                                                          skyscraper
318     Q737498       4168                                                    academic journal
319      Q15284       4168                                                        municipality
320       Q6243       4125                                                       variable star
321    Q1190554       4096                                                          occurrence
322   Q18608583       4096                                            recurring sporting event
323    Q2485448       4035                                               sports governing body
324        Q577       4032                                                                year
325   Q19953632       4024                            former administrative territorial entity
326      Q95074       4014                                                 fictional character
327    Q4663385       4005                                              former railway station
328       Q1004       3971                                                              comics
329    Q1931185       3934                                           astronomical radio source
330     Q623109       3917                                                       sports league
331    Q4498974       3900                                                     ice hockey team
332    Q2031836       3899                                             Eastern Orthodox church
333      Q52371       3892                                                            regiment
334    Q3917681       3891                                                             embassy
335    Q2376564       3885                                                         interchange
336    Q3055118       3884                                         single entity of population
337    Q1656682       3871                                                               event
338      Q21199       3839                                                      natural number
339     Q721747       3803                                                commemorative plaque
340     Q879050       3798                                                         manor house
341   Q12397176       3792                                                   parish of Galicia
342    Q1040689       3767                                                             synonym
343   Q22988604       3758                                        mythological Greek character
344   Q17376095       3751                                   cadastral municipality of Austria
345    Q1080794       3742                                                        state school
346    Q1400565       3726                                                                spur
347     Q813672       3705                                                               basin
348    Q3177968       3690                                                           waterhole
349     Q277759       3646                                                         book series
350   Q13027888       3642                                                       baseball team
351   Q21850100       3631                                               municipal flag design
352     Q179700       3629                                                              statue
353    Q1484611       3600                                                          buurtschap
354   Q15275719       3586                                                     recurring event
355     Q559026       3575                                                          ship class
356     Q846659       3561                                                     Jewish cemetery
357    Q1573906       3549                                                        concert tour
358   Q18691601       3527                                                    ward of Tanzania
359     Q180958       3522                                                             faculty
360   Q14406742       3518                                                   comic book series
361    Q5185279       3508                                                                poem
362     Q157031       3495                                                          foundation
363      Q13890       3477                                                         double star
364       Q2488       3475                                                       spiral galaxy
365    Q1530705       3450                              village development committee of Nepal
366   Q28912853       3449                                                            fraction
367    Q1244442       3438                                                     school building
368     Q615980       3432                                      parish of the Church of Sweden
369   Q18706073       3415  Public institution of intermunicipal cooperation with own taxation
370      Q70208       3413                                         municipality of Switzerland
371   Q15071808       3402                                               sieĺsaviet of Belarus
372      Q44559       3401                                                   extrasolar planet
373      Q28111       3401                                       township in the United States
374    Q1549591       3399                                                            big city
375     Q170321       3389                                                             wetland
376    Q1088552       3376                                            Catholic church building
377   Q17517379       3371                                                 animated short film
378      Q82794       3363                                                   geographic region
379   Q18611609       3351                                               unconfirmed exoplanet
380     Q189004       3346                                                             college
381   Q10929058       3320                                                       product model
382   Q14795564       3308                   point in time with respect to recurrent timeframe
383   Q47018478       3290                                      calendar month of a given year
384      Q50707       3283                                                    composite number
385   Q21170330       3278                                          tennis qualification event
386      Q12518       3276                                                               tower
387     Q355567       3276                                                         noble title
388     Q378427       3275                                                      literary award
389     Q427287       3259                                                                 wat
390     Q184188       3257                                       canton of France (until 2015)
391      Q22746       3241                                                          urban park
392      Q12136       3240                                                             disease
393     Q132241       3237                                                            festival
394    Q1154710       3222                                          association football venue
395    Q1079023       3217                                                        championship
396   Q15911314       3215                                                         association
397    Q1057954       3186                                                         by-election
398    Q4224624       3183                                                kmetstvo of Bulgaria
399     Q399811       3172                                           Japanese television drama
400     Q202444       3154                                                          given name
401      Q44539       3148                                                              temple
402      Q11436       3142                                                            aircraft
403   Q69388744       3132                                                       mineral index
404     Q751708       3130                                        village in the United States
405     Q631305       3128                                                      rock formation
406     Q570116       3119                                                  tourist attraction
407    Q3024240       3112                                                  historical country
408      Q74047       3053                                                          ghost town
409     Q431289       3051                                                               brand
410   Q55463514       3050                                                         tribal area
411   Q62391930       3042                                              beauty pageant edition
412    Q3001412       3032                                                          horse race
413      Q83620       3031                                                        thoroughfare
414   Q12973014       3029                                                         sports team
415      Q11707       2998                                                          restaurant
416   Q15261477       2989                                              gubernatorial election
417     Q178790       2989                                                         labor union
418     Q490329       2982                                                 dong of South Korea
419     Q167270       2979                                                           trademark
420   Q60169073       2976                                                     ward of Nigeria
421   Q19860854       2965                                     destroyed building or structure
422   Q24036024       2961                                                            apylinkė
423     Q150784       2950                                                              canyon
424    Q2154519       2945                                          astrophysical X-ray source
425   Q20732405       2927                                        volost of the Russian Empire
426     Q591942       2923                                                        distributary
427      Q12140       2915                                                          medication
428   Q19658107       2912                                              neighborhood of Brazil
429    Q6784672       2889                                            municipality of Slovakia
430   Q13221722       2883                      third-level administrative country subdivision
431   Q15238777       2870                                                    legislative term
432   Q15056995       2867                                                      aircraft model
433   Q15623926       2866                                         Wikimedia set index article
434     Q659103       2862                                                  commune of Romania
435   Q34841063       2861                                           seat of the local council
436   Q19692072       2854                                United States Supreme Court decision
437      Q55818       2853                                                       impact crater
438      Q44782       2847                                                                port
439   Q12813115       2828                                                urban area in Sweden
440    Q2590631       2820                                             municipality of Hungary
441     Q856713       2791                                                      Christian hymn
442     Q185187       2780                                                           watermill
443     Q740752       2753                                                   transport company
444      Q35054       2743                                                         post office
445      Q40357       2741                                                              prison
446    Q1259759       2735                                                          miniseries
447    Q1785071       2727                                                                fort
448     Q202866       2721                                                       animated film
449   Q65770283       2688                                          association football final
450    Q1307276       2661                                         single-family detached home
451   Q16334295       2658                                                     group of humans
452     Q220505       2654                                                       film festival
453     Q188055       2648                                                               siege
454      Q33384       2645                                                             dialect
455   Q11722303       2628                            rural municipality of Sweden and Finland
456     Q182676       2628                                                        mountain hut
457  Q104635718       2628                                        Wikimedia artist discography
458  Q112826905       2627                                          class of anatomical entity
459   Q14752149       2609                                               amateur football club
460      Q65943       2597                                                             theorem
461    Q5741069       2597                                                          rock group
462   Q67201586       2596                                                    HI (21cm) source
463  Q107357104       2585                                                       type of dance
464    Q1366112       2577                                             drama television series
465   Q13220204       2568                     second-level administrative country subdivision
466     Q174736       2558                                                           destroyer
467   Q15125752       2556                                              rural district of Iran
468     Q744913       2555                                                   aviation accident
469     Q708676       2554                                             charitable organization
470    Q5773747       2548                                                      historic house
471   Q27020779       2541                                              ice hockey team season
472    Q2319498       2535                                                            landmark
473   Q15711870       2532                                                  animated character
474   Q18333556       2523                                     registration district in Sweden
475     Q695850       2517                                                             airbase
476   Q15081032       2503                                  historical motorcycle manufacturer
477     Q877358       2500                          United Nations Security Council resolution
478    Q1110794       2490                                                     daily newspaper
479   Q55075651       2487                                                              upland
480       Q8514       2475                                                              desert
481      Q38723       2474                                        higher education institution
482    Q1952852       2473                                              municipality of Mexico
483    Q1523821       2467                                                              socken
484    Q1852859       2458                                  populated place in the Netherlands
485     Q543654       2452                                                             rathaus
486   Q21869758       2419                                                   delegated commune
487     Q744099       2412                                                            hillfort
488   Q53532033       2410                                              volleyball team season
489      Q38720       2407                                                            windmill
490   Q17715832       2399                                                         castle ruin
491     Q190429       2386                                                          depression
492     Q742421       2377                                                   theatrical troupe
493     Q667509       2373                                             municipality of Austria
494     Q167346       2371                                                    botanical garden
495   Q15661340       2366                                                        ancient city
496   Q23866334       2361                                                    motorcycle model
497    Q2624046       2359                                                      mountain chain
498      Q14660       2329                                                                flag
499   Q10438042       2325                                                         bus company
500    Q2225692       2324                   fourth-level administrative division in Indonesia
501      Q79913       2321                                       non-governmental organization
502    Q2742167       2320                                                 religious community
503   Q29964144       2313                                                             year BC
504     Q106259       2295                                                              polder
505      Q45776       2291                                                               fjord
506      Q22687       2283                                                                bank
507        Q282       2273                                                                wine
508    Q2354973       2270                                                         road tunnel
509      Q28564       2258                                                      public library
510     Q131569       2251                                                              treaty
511   Q16352482       2251                                      dispersed settlement in Latvia
512     Q131596       2248                                                                farm
513     Q131734       2237                                                             brewery
514   Q15217609       2234                                                         titular see
515    Q4414032       2232                                        rural district of Kazakhstan
516    Q2785216       2225                                                municipality section
517     Q131436       2211                                                          board game
518   Q63952888       2209                                             anime television series
519   Q10594991       2207                                                         nature area
520   Q10590726       2199                                                         video album
521   Q50393057       2191                                           Paralympic sporting event
522   Q64578911       2190                                                     former hospital
523       Q7944       2183                                                          earthquake
524    Q9212979       2183                                                         musical duo
525     Q101072       2182                                                          definition
526      Q46169       2168                                                       national park
527   Q53534649       2163                                                 cycling team season
528   Q29791211       2159                                        sport in a geographic region
529   Q45400320       2158                                               open-access publisher
530   Q80096233       2157                                                    information list
531     Q483453       2154                                                            fountain
532     Q814648       2141                                                   parish of Denmark
533    Q4671329       2139                                                      academy school
534      Q37901       2138                                                              strait
535     Q811534       2129                                                     remarkable tree
536    Q3146899       2121                                      diocese of the Catholic Church
537     Q186117       2118                                                            timeline
538     Q210167       2106                                                video game developer
539    Q1320047       2102                                                      book publisher
540    Q1762059       2101                                             film production company
541     Q155239       2095                                        Indian reservation of Canada
542    Q2221906       2094                                                 geographic location
543   Q42744322       2092                                       urban municipality of Germany
544    Q1478437       2086                                    association football competition
545   Q80447738       2083                                                     anime character
546     Q494829       2078                                                         bus station
547     Q249556       2077                                                     railway company
548      Q83405       2073                                                             factory
549     Q655686       2069                                                 commercial building
550     Q131186       2068                                                               choir
551   Q19571328       2064                                                    electoral result
552   Q18524218       2056                                                    canton of France
553     Q151885       2043                                                             concept
554   Q11266439       2035                                                  Wikimedia template
555     Q160742       2029                                                               abbey
556   Q27787439       2028                                               film festival edition
557   Q46195901       2019                                              Paralympics delegation
558    Q1279564       2019                                              short story collection
559   Q17143723       1982                                                     Catholic parish
560     Q581714       1980                                                     animated series
561       Q2977       1978                                                           cathedral
562   Q12225020       1977                                                               uzlah
563     Q575759       1956                                                        war memorial
564     Q773668       1951                                                 open-access journal
565    Q2122052       1943                                                 qualification event
566     Q491713       1942                                                               sound
567   Q13366104       1942                                                         even number
568   Q15642541       1938                                 human-geographic territorial entity
569   Q27971968       1925                                constituency of the House of Commons
570     Q169534       1922                                                            division
571       Q2811       1919                                                           submarine
572    Q1569167       1919                                                video game character
573    Q2990963       1908                                          figure skating competition
574   Q18558301       1902                                                 college sports team
575   Q13366129       1899                                                          odd number
576      Q92026       1896                                                     Japanese castle
577   Q23827464       1896                                              centre of Saudi Arabia
578   Q15773347       1886                                                      film character
579     Q192078       1879                                                   lenticular galaxy
580    Q2179958       1878                                                    district of Peru
581   Q18759100       1873                                                           baronetcy
582     Q385337       1862                                                             deanery
583     Q811430       1861                                       human-made geographic feature
584    Q1210334       1861                                                      railway bridge
585   Q35823051       1859                      nation at the World Championships in Athletics
586   Q47150325       1858                                        calendar day of a given year
587   Q12708689       1856                                           National Secondary School
588   Q11862829       1856                                                 academic discipline
589    Q2198484       1849                                                  municipal district
590   Q14750991       1847              Commonwealth War Graves Commission maintained cemetery
591   Q11670533       1841                                                    elevated station
592    Q1021645       1839                                                     office building
593      Q42998       1832                                                           orchestra
594    Q2635894       1827                                                         radio drama
595   Q87576284       1826                                                     manga character
596     Q193622       1822                                                               order
597   Q10648343       1821                                                                 duo
598    Q2066754       1819                                                               manor
599   Q17350442       1808                                                               venue
600   Q12859788       1797                                                           steamship
601   Q12737077       1782                                                          occupation
602     Q902814       1773                                                         border town
603   Q17156793       1768                                              American football team
604   Q10977433       1765                                           protected area of Ukraine
605       Q8054       1760                                                             protein
606    Q1141470       1759                                                          double act
607    Q6558431       1749                                            coal-fired power station
608    Q1366722       1746                                                               final
609     Q102356       1737                                                             brigade
610        Q358       1734                                                       heritage site
611     Q744296       1734                                                       wooden church
612    Q2990946       1728                                                     golf tournament
613    Q1959314       1722                                            protected area of Russia
614   Q47154513       1722                              structural class of chemical compounds
615       Q8072       1722                                                             volcano
616       Q4886       1717                                                            cultivar
617      Q62049       1710                                                         county seat
618      Q43501       1706                                                                 zoo
619     Q101659       1704                                                              dolmen
620    Q1584134       1703                                                               mound
621   Q22807280       1698                                       flag of a country subdivision
622      Q82117       1697                                                           city gate
623   Q50823455       1687                                                Hebrew calendar year
624     Q641226       1686                                                               arena
625   Q12020836       1682                                                 timber-framed house
626    Q1445650       1682                                                             holiday
627   Q15056993       1681                                                     aircraft family
628   Q12488913       1677                                                    kampung of Papua
629     Q645883       1677                                                  military operation
630     Q162875       1674                                                           mausoleum
631    Q2618461       1673                                                legislative election
632    Q1077097       1671                                                              tambon
633   Q42211429       1667                                               township of Minnesota
634    Q1365179       1665                                                     private mansion
635  Q104146934       1661             United States of America State-level electoral district
636   Q23002054       1656                      private not-for-profit educational institution
637      Q15324       1650                                                       body of water
638   Q17205621       1648                                                    township of Iowa
639    Q5341295       1646                                            educational organization
640    Q3624938       1644                                             city district of Brazil
641    Q3331189       1643                                    version, edition, or translation
642     Q476068       1630                                Act of Congress in the United States
643   Q50846468       1628                                                         sports tour
644    Q5154047       1627                                         quarter/commune of Cambodia
645     Q166118       1626                                                             archive
646  Q113813711       1624                                                           coin type
647   Q66480449       1624                               Wikimedia surname disambiguation page
648     Q861951       1623                                                      police station
649    Q1497364       1617                                                    building complex
650   Q16024164       1615                                                     medical journal
651   Q23002039       1615                 public educational institution of the United States
652    Q1664720       1614                                                           institute
653    Q1969448       1609                                                                term
654     Q294414       1599                                                       public office
655   Q21100463       1597                              natural monument in the Czech Republic
656     Q687188       1593                                                     ward of Vietnam
657    Q7058673       1591                                                   video game series
658    Q5003624       1588                                                            memorial
659    Q3504085       1587                                        rural municipality of Poland
660      Q50053       1585                                                         binary star
661     Q955824       1585                                                     learned society
662     Q679165       1584                                                            squadron
663      Q25295       1581                                                     language family
664     Q655311       1577                                                               onsen
665    Q9035798       1574                                            township of Pennsylvania
666      Q25379       1573                                                                play
667     Q428661       1569                                                              U-boat
668    Q1666019       1565                                                      pressure group
669    Q7543083       1556                                                              avenue
670    Q2989398       1549                                                  commune of Algeria
671      Q75520       1546                                                             plateau
672    Q4287745       1546                                                medical organization
673    Q1371849       1532                                                         filmography
674   Q41982239       1530                                                  navigation channel
675    Q1993624       1530                                                spectroscopic binary
676    Q3699460       1524                                                              defile
677   Q21503788       1518                                             Landschaftsschutzgebiet
678    Q2151232       1516                                                            townland
679  Q104841013       1516                                                             hromada
680    Q5926864       1515                                                      group of lakes
681   Q12292478       1511                                                              estate
682     Q929833       1511                                                        rare disease
683   Q14645593       1510                                                    rugby union team
684     Q132821       1510                                                              murder
685     Q746549       1504                                                                dish
686   Q11448906       1503                                                       science award
687   Q15991303       1502                                         association football league
688      Q82799       1498                                                                name
689      Q24764       1495                                     municipality of the Philippines
690     Q712597       1490                                                             article
691   Q23039057       1488                                                           bus model
692      Q55043       1487                                                           gymnasium
693    Q2661988       1481                                          urban settlement in Russia
694    Q3240715       1479                                                              crater
695   Q21980538       1479                                             commercial organization
696     Q130003       1476                                                          ski resort
697  Q106179098       1471                                                      sailboat class
698   Q16363214       1470                                             small village in Latvia
699    Q1343246       1466                                               English country house
700    Q1128397       1466                                                             convent
701      Q43109       1461                                                          referendum
702    Q2087181       1459                                                        house museum
703     Q271669       1455                                                            landform
704    Q7210356       1452                                              political organization
705    Q1404150       1451                                                                crag
706    Q1311958       1444                                                      railway tunnel
707      Q22667       1442                                                             railway
708   Q17317604       1442                                        professional wrestling event
709   Q17198545       1442                                                township of Illinois
710    Q1048525       1432                                                         golf course
711   Q12206807       1430                                                               block
712   Q55710865       1427                                                                 pan
713   Q19723451       1424                                                    smartphone model
714     Q838948       1422                                                         work of art
715    Q2143825       1422                                                        hiking trail
716    Q9309832       1421                                                      nature reserve
717     Q190903       1420                                                    herbaceous plant
718    Q2039348       1417                                     municipality of the Netherlands
719      Q18142       1414                                                  high-rise building
720   Q19776628       1411                                                 Latin-script letter
721     Q958314       1411                                                       grape variety
722     Q640506       1409                                                             cabinet
723     Q422211       1402                                 Site of Special Scientific Interest
724    Q1289426       1400                                                     county of China
725   Q17205735       1399                                                  township of Kansas
726   Q15145593       1399                                                           tram line
727      Q46622       1398                                           controlled-access highway
728    Q6270791       1394                                                township of Missouri
729     Q192350       1389                                                            ministry
730   Q20088085       1387                                        dictionary page in Wikipedia
731    Q1491746       1383                                                        galaxy group
732   Q47443726       1382                                         recurring tennis tournament
733   Q63565252       1375                                                               brook
734   Q64037785       1374                                                   county courthouse
735     Q109607       1372                                                               ruins
736    Q1363599       1371                                                     passenger train
737    Q1802801       1369                                       rural municipality of Austria
738   Q15141321       1365                                                       train service
739   Q15893266       1361                                                       former entity
740     Q170584       1360                                                             project
741    Q1070912       1357                                                                zhou
742   Q17299750       1355                                                  snooker tournament
743     Q149621       1353                                                            district
744     Q702081       1348                                      water board in the Netherlands
745   Q17318027       1347                                            rural commune of Morocco
746   Q14752696       1344                                             ancient Roman structure
747   Q90834785       1344                                             racing automobile model
748   Q28371991       1340                                                    village of Yemen
749  Q105774620       1339          Public General Act of the Parliament of the United Kingdom
750    Q1329623       1336                                                     cultural center
751       Q9143       1333                                                programming language
752     Q178885       1325                                                               deity
753   Q15265344       1325                                                         broadcaster
754    Q1989945       1322                                                            agrotown
755   Q18340550       1316                                                  Wikimedia timeline
756   Q27968055       1315                                             recurrent event edition
757    Q1756006       1315                                                UCI Continental Team
758     Q500834       1313                                                          tournament
759    Q4387609       1313                                                  architectural firm
760   Q17198620       1311                                                    township of Ohio
761     Q211503       1309                                                               clans
762     Q685309       1305                                  former municipality of Switzerland
763    Q3192808       1302                                               commune of Madagascar
764    Q2922711       1300                                                           bowl game
765   Q13226383       1299                                                            facility
766   Q15640053       1293                                                         tram system
767    Q1589009       1291                                              privately held company
768     Q126807       1291                                                        kindergarten
769   Q25550691       1288                                                           town hall
770   Q15092344       1285                                                urban area in Norway
771      Q57831       1285                                                            fortress
772   Q24027556       1284                                                township of Arkansas
773   Q56242063       1282                                          protestant church building
774    Q7930989       1275                                                        city or town
775   Q56557504       1270                                                        city of Iran
776    Q6382533       1269                                                           battalion
777    Q1147395       1269                                                  district of Turkey
778    Q7694920       1268                                                     tehsil of India
779    Q1802963       1265                                                             mansion
780   Q28147344       1264                                        Dutch municipal coat of arms
781    Q6979593       1262                                  national association football team
782     Q245016       1258                                                       military base
783      Q41487       1257                                                               court
784     Q194408       1256                                                             nunatak
785   Q11415657       1253                                                   professional name
786   Q23828039       1250                                        village/town/city in Lebanon
787     Q644371       1241                                               international airport
788   Q10862618       1240                                                              saddle
789   Q10358588       1239                                                   infantry regiment
790     Q120560       1235                                                      minor basilica
791     Q902104       1233                                                  private university
792   Q35112127       1223                                                   historic building
793    Q1663017       1223                                                  engineering school
794     Q483515       1221                                                               myeon
795     Q131647       1219                                                               medal
796    Q3062294       1218                                                        Latin phrase
797     Q269770       1217                                                     boarding school
798   Q69106538       1216                                Chinese national-type primary school
799     Q191992       1215                                                            headland
800   Q10313934       1212                                                   intermittent lake
801     Q274153       1208                                                         water tower
802      Q41207       1206                                                                coin
803      Q11353       1206                                                    district of Iran
804     Q381885       1203                                                                tomb
805     Q695793       1199                                                              rapids
806    Q1187691       1194                                           roadside station in Japan
807        Q198       1192                                                                 war
808   Q20071150       1191                                          botanical natural monument
809    Q1068842       1191                                                          footbridge
810     Q178193       1189                                                           steamboat
811   Q29154515       1184                                                chapter of the Bible
812     Q133311       1178                                                               tribe
813    Q3329412       1174                                               archaeological museum
814   Q21278897       1172                                                 Wiktionary redirect
815      Q56019       1168                                                       military rank
816   Q15720476       1167                                                     volleyball team
817   Q18536800       1166                                            mixed martial arts event
818   Q23983664       1166                                   demographics of country or region
819     Q623605       1164                                                           gristmill
820    Q1595639       1163                                                        local museum
821    Q7930614       1162                                                   village of Taiwan
822    Q1302471       1159                                                      unit of volume
823     Q194195       1155                                                      amusement park
824     Q162602       1155                                                        river island
825   Q55468440       1147                                        underground irrigation canal
826      Q57821       1139                                                       fortification
827     Q875538       1133                                                   public university
828    Q1078765       1132                                                         train wreck
829    Q1440300       1131                                                   observation tower
830    Q1107656       1131                                                              garden
831    Q1348589       1131                                                        lunar crater
832   Q29168811       1127                                               animated feature film
833   Q24566025       1126                                                     Norse runestone
834    Q7379880       1125                                                   runic inscription
835   Q15075508       1123                                                          beer brand
836     Q637600       1120                                                              sabkha
837    Q6063801       1119                                                 parish of Venezuela
838    Q8776398       1117                               collective population entity of Spain
839    Q1229765       1116                                                          watercraft
840    Q7630601       1115                                                sub-county of Uganda
841   Q16735822       1115                                                      history museum
842    Q5650788       1113                                               Szlachta coat of arms
843   Q34379419       1112                                                           sand area
844   Q17205774       1111                                                township of Michigan
845   Q10517054       1110                                                       handball team
846   Q11229656       1109                                                   tank landing ship
847    Q2006279       1106                                           provincial park of Canada
848   Q21070568       1105                                          human who may be fictional
849    Q2555896       1104                                            municipality of Colombia
850     Q166142       1103                                                         application
851    Q1388729       1103                                                              picket
852   Q12513144       1103                                      vocational school in Indonesia
853   Q16887380       1103                                                               group
854   Q15726209       1100                                school district in the United States
855     Q158438       1094                                                         arch bridge
856     Q417841       1093                                                      protein family
857    Q9651979       1092                                               female beauty pageant
858     Q131263       1090                                                            barracks
859    Q1475691       1089                                                         Mars crater
860    Q1153690       1089                                           long-period variable star
861    Q4438121       1086                                                 sports organization
862   Q73364223       1086                                                     society journal
863   Q28739697       1085                                             ancient county of China
864   Q26132862       1084                                     Olympic sports discipline event
865     Q431603       1083                                                      advocacy group
866     Q223393       1083                                                      literary genre
867   Q56750657       1081                                                    hermitage church
868    Q1065118       1078                                                   district of China
869    Q2178147       1078                                                   trade association
870    Q1426271       1078                                                     route nationale
871   Q58092637       1077                                                               round
872   Q11755880       1072                                                residential building
873    Q8719053       1072                                                         music venue
874   Q65948724       1069                                               neighborhood in Japan
875   Q15079663       1068                                          rapid transit railway line
876    Q1141942       1067                                   Alpha² Canum Venaticorum variable
877     Q105190       1064                                                               levee
878   Q61881926       1063                                                 German noble family
879       Q2095       1063                                                                food
880   Q13526752       1061                                                               lugar
881   Q28738741       1061                                                interacting galaxies
882     Q757292       1058                                                   border checkpoint
883     Q188040       1053                                                              quarry
884    Q3199915       1051                                                            massacre
885   Q12812139       1050                                                      technical term
886    Q2022036       1044                                                           golf club
887    Q1195098       1043                                                                ward
888    Q4632675       1038                                                      dwelling place
889     Q188025       1037                                                           salt lake
890    Q1141225       1036                                                               kofun
891    Q1245089       1036                                                          promontory
892   Q53460949       1036                                                       manga chapter
893     Q838795       1035                                                         comic strip
894    Q2592651       1034                                           union council of Pakistan
895      Q24856       1033                                                         film series
896    Q4271324       1031                                                  mythical character
897    Q2651004       1030                                                             palazzo
898     Q105731       1027                                                                lock
899    Q2235308       1027                                                           ship type
900     Q815324       1026                                         town municipality of Turkey
901    Q1067164       1024                                                     transport route
902       Q9135       1023                                                    operating system
903      Q27590       1022                                                               heath
904   Q11483816       1022                                                        annual event
905     Q480477       1020                                               Local court (Germany)
906      Q25653       1020                                                          ferry ship
907    Q3253281       1019                                                                pond
908     Q428069       1015                                                       intermetallic
909    Q3117863       1014                                                isolated human group
910      Q47053       1011                                                             estuary
911   Q17201685       1009                                                 township of Indiana
912   Q72802508       1007                                                emission-line galaxy
913    Q1667921       1006                                                        novel series
914    Q2679157       1003                                              commune of Ivory Coast
915     Q732717       1001                                              law enforcement agency
916   Q12538685       1001                                                            roadshow
917   Q24034552        998                                                mathematical concept
918    Q1233637        998                                                         river mouth
919     Q582525        994                                                     ethnic township
920     Q212198        992                                                                 pub
921   Q24397514        989                     United States House of Representatives election
922     Q879146        985                                              Christian denomination
923     Q526877        984                                                          web series
924      Q54114        984                                                           boulevard
925     Q777120        979                                             borough of Pennsylvania
926    Q1760610        978                                                          comic book
927   Q15731356        976                                                      apple cultivar
928   Q15944511        970                                           association football team
929     Q269949        969                                                             highway
930     Q156362        965                                                              winery
931   Q56219051        964                                 family name prefixed with Mac or Mc
932     Q424857        964                                                                ōaza
933    Q4291381        961                                                 amalgamated hromada
934    Q3270632        960                                               national championship
935    Q2438638        958                                                            solitary
936   Q17202187        957                                federal electoral district of Canada
937     Q226730        956                                                         silent film
938       Q8142        956                                                            currency
939    Q1007870        954                                                         art gallery
940     Q356055        953                                                            teleplay
941   Q57733494        951                                                     badminton event
942    Q2992826        951                                                 athletic conference
943   Q33146843        950                                           municipality of Catalonia
944     Q164950        946                                                             dynasty
945   Q66715753        944                          Wikimedia list of persons by position held
946     Q142714        944                                                           card game
947   Q27968043        939                                                    festival edition
948       Q6999        937                                                 astronomical object
949     Q484652        936                                          international organization
950   Q55788864        933                           developmental defect during embryogenesis
951      Q55491        932                                         underground railway station
952     Q697295        929                                                              shrine
953    Q2772772        928                                                     military museum
954    Q3409032        926                                                   unisex given name
955    Q3301455        925                                                            precinct
956      Q14296        925                                                    top-level domain
957    Q1788582        925                                                       state highway
958     Q188913        922                                                          plantation
959    Q1788716        921                                                 military decoration
960  Q106071004        921                                                    town of New York
961    Q2579179        920                                                   parish of Ecuador
962  Q110095011        918                                                                None
963    Q2641403        915                                                        samba school
964    Q2135465        914                                                       legal concept
965     Q167037        913                                                         corporation
966    Q4229812        913                                                  commune of Moldova
967   Q28328984        912                                                  village in Armenia
968     Q811165        911                                     architectural heritage monument
969   Q26817508        911                                                       rose cultivar
970   Q14073567        910                                                         sibling duo
971    Q3297186        909                                                      limited series
972     Q748019        908                                                  scientific society
973     Q148837        907                                                               polis
974    Q2638480        907                                                 church congregation
975    Q1144661        904                                                      amusement ride
976   Q22969563        903                                                        Bodendenkmal
977     Q149918        902                                            communications satellite
978    Q3677932        902                                               barrio of Puerto Rico
979   Q15079786        902                                                              ballet
980    Q1631107        901                                                        bibliography
981   Q20741022        900                                                digital camera model
982    Q2309609        900                                                       wayside cross
983    Q1542966        900                                                Gymnasium in Germany
984      Q44377        899                                                              tunnel
985    Q2557101        899                         low-ionization nuclear emission-line region
986     Q193475        898                                                              menhir
987     Q645924        896                                        classical Kuiper belt object
988      Q57058        895                                             municipality of Croatia
989   Q15280243        894                                                    mayoral election
990   Q11812394        894                                                     theatre company
991    Q7216840        894                                    urban-type settlement in Ukraine
992    Q1821082        893                                            national road in Belgium
993   Q17487588        892                                             unavailable combination
994    Q1620908        891                                                   historical region
995    Q5135744        891                                      catholic religious institution
996    Q2630741        890                                                           community
997   Q12809484        890                        township-level division similar to townships
998    Q1254933        889                                            astronomical observatory
999      Q33837        889                                                         archipelago

I understand your concerns, but I'll start considering the "Instance Of" as the main "category" for the item. We could later try to cluster instances based on statement similarity, but I would leave that for a later stage.

Regarding how to measure "quality", I would consider the following dimensions (among others):

  • Completeness: The raw number of statements per item, compared with the others in the same "category".
  • Verifiability: The amount of statements supported by references.
  • Is updated?: Considering all the statements that have Start/End date qualifiers I would try to find a way to understand whether they are up to date or not (think of position held, population, etc.)
  • More than one editor: As we found in our study on controversies, there are many items that are just edited by one person. It might be interesting to measure how "collective" an item is, and check the correlation between the number of editors and the other dimensions proposed.
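A minimal sketch of the first two dimensions, assuming items in the standard Wikidata JSON layout (`claims` keyed by property, each statement carrying an optional `references` list). The function names and the percentile-style comparison against peers are illustrative assumptions, not part of any existing model.

```python
from bisect import bisect_right

def statement_stats(item):
    """Return (number of statements, share of statements with >= 1 reference)."""
    statements = [s for group in item.get("claims", {}).values() for s in group]
    if not statements:
        return 0, 0.0
    referenced = sum(1 for s in statements if s.get("references"))
    return len(statements), referenced / len(statements)

def completeness_percentile(num_statements, peer_counts):
    """Share of peer items (same instance-of) with at most this many statements."""
    if not peer_counts:
        return None
    ordered = sorted(peer_counts)
    return bisect_right(ordered, num_statements) / len(ordered)

# Toy item: four statements, two of which carry a reference.
item = {"claims": {
    "P31": [{"references": [{"snaks-order": ["P248"]}]}],
    "P17": [{}],
    "P625": [{"references": []}, {"references": [{"snaks-order": ["P854"]}]}],
}}
n, verifiability = statement_stats(item)
# Hypothetical statement counts for other items in the same "category".
percentile = completeness_percentile(n, [1, 2, 4, 4, 7, 10, 12, 30])
```

Comparing against peers rather than using the raw count keeps the measure meaningful for categories where a handful of statements is already typical.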

The question of whether to consider only items with sitelinks opens a wider problem of relevance. In the aforementioned study we found a correlation between the total number of pageviews (on sitelinks) and the number of editors collaborating on an item. I won't cut out all the items without sitelinks, but I think it would be useful to understand the relevance of an item, especially when the score is used as a ranking for editing/improving that item.

but I'll start considering the "Instance Of" as the main "category" for the item

Makes sense -- I'll continue exploration into that side

Completeness: The raw number of statements per item, compared with the others in the same "category".

Agreed, as with Wikipedia, more content generally is a good indicator of quality and this is the simplest to implement.

Verifiability: The amount of statements supported by references.

Yep. I need to take another look at Amaral et al. Assessing the Quality of Sources in Wikidata Across Languages to see how they break down the different types of sources and whether there is reusable code there. The ORES model has some features related to references too that might be reusable like the proportion of claims that have a reference (code) but I wonder if we want to stick with something generic like that or maintain a list of properties where references are particularly important.

Is updated?: Considering all the statements that have Start/End date qualifiers I would try to find a way to understand whether they are up to date or not (think of position held, population, etc.)

Hmm...that's interesting but tricky. I assumed that most things like population have a point-in-time but not a clear "this data will be out-of-date by X"? I'd have to dig through the data though. Getting much bigger than this quarterly project, but this is where having a good system to identify Wikidata-esque claims within Wikipedia templates could help a lot with detecting potentially stale data. But that might also be more useful as an input for a structured editing task than a quality model.

More than one editor: As we found in our study on controversies, there are many items that are just edited by one person. It might be interesting to measure how "collective" an item is, and check the correlation between the number of editors and the other dimensions proposed.

Yeah, I know Aaron had resisted using edit history information for the Wikipedia quality models because it's a weird feature. It's not easily actionable (to improve the quality, you have to find someone else to make edits on the page) and a single editor can build a very high-quality article. That said, Wikidata is quite different than Wikipedia and depending on how we frame a feature like this, it might make sense to include in the model or perhaps just as a means of prioritizing items to improve for a tool like Item Quality Evaluator.

I won't cut out all the items without sitelinks, but I think it would be useful to understand the relevance of an item, especially when the score is used as a ranking for editing/improving that item.

Yeah, I need to compute or find a nicer description of the different major subgraphs / types of items within Wikidata. For example:

  • items with Wikipedia article sitelinks (my initial focus)
  • items that look like Wikipedia topics but maybe don't have sitelinks yet like items about people with no Wikipedia articles or asteroids
  • scholarly articles
  • categories
  • Wikipedia list items
  • Wikipedia disambiguation items
  • items that are relevant to Commons images
  • others?

Some of these can be split by instance-of but I wonder whether we should have different thresholds for items with sitelinks vs. items without sitelinks or like you suggest just consider the presence of sitelinks to be mostly relevant to prioritization within recommender systems.

I'll also add that we should probably revisit the "important languages" part of the ORES model to see if that's still relevant. We'll definitely want something related to labels/descriptions. See: https://www.wikidata.org/wiki/Wikidata:Item_quality#Translations and ORES code

Weekly update:

  • Summarizing some past research shared / further examinations of the existing ORES model shared by LP:
    • We have to be careful to adjust expectations for a given claim depending on its property type (distribution of property types on Wikidata) -- e.g., no references for external-id properties. Current model uses a static list for this but we might want to re-evaluate.
    • Even though number of sitelinks might correlate positively with quality, it's a feature we should avoid as it's really a proxy for popularity and not item quality
    • Wikidata is constantly shifting in big ways and out-of-date data / rules can lead to models handling particular instance-ofs poorly. We should do our best to make aspects of the model unsupervised or not dependent on a fixed set of data so it can adapt easily.
    • The current model is actually pretty good, so maybe this is less about iterating on it significantly and more about redesigning it for the new LiftWing paradigm and making it less susceptible to data drift.
  • Something I've been mulling over is how to ensure the model is actionable in a way that aligns with community goals and points to specific steps a contributor could take to raise quality.
    • For instance, adding/improving references is quite actionable and important. For the verifiability component then, it's worthwhile to ensure that the model handles this well -- i.e. has a good sense of which statements do and do not need references and differentiates between the different types of references (external vs. Wikipedia).
    • If we're less concerned about making items super extensive but do want to "require" a core set of basic properties (similar to Schemas or inteGraality), we might try to identify that core set of properties for each instance-of and try to rely less on raw counts of statements in determining scores.
    • What about consistency -- is there some way to capture how well an item matches related ones? And if so, should an item be penalized for being "unique"?
  • LP also asked us to consider how to extend this to Lexemes and Properties. Will have to think through that and whether we can reuse some of the resulting model for those item types or if they require fully separate approaches.
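One way the "core set of properties for each instance-of" idea could be prototyped is a plain frequency threshold: a property counts as core for a class if it appears on a large enough share of that class's items. The sketch below assumes items have already been flattened into a hypothetical `{"instance_of": [...], "properties": [...]}` layout; the `min_share` and `min_items` cut-offs are arbitrary placeholders, not tuned values.

```python
from collections import Counter, defaultdict

def core_properties(items, min_share=0.8, min_items=2):
    """Map each instance-of value to the properties present on >= min_share of its items."""
    totals = Counter()                  # number of items per class
    prop_counts = defaultdict(Counter)  # class -> property -> number of items using it
    for item in items:
        props = set(item["properties"])
        for cls in item["instance_of"]:
            totals[cls] += 1
            prop_counts[cls].update(props)
    return {
        cls: sorted(p for p, c in prop_counts[cls].items() if c / n >= min_share)
        for cls, n in totals.items()
        if n >= min_items  # skip classes with too few items to say anything
    }

sample = [
    {"instance_of": ["Q5"], "properties": ["P31", "P21", "P569", "P106"]},
    {"instance_of": ["Q5"], "properties": ["P31", "P21", "P569"]},
    {"instance_of": ["Q5"], "properties": ["P31", "P21", "P27"]},
    {"instance_of": ["Q44377"], "properties": ["P31", "P17"]},
]
expected = core_properties(sample, min_share=0.66)
# expected == {"Q5": ["P21", "P31", "P569"]}
```

Because the table is recomputed from whatever data it is given, it can be regenerated at each training run instead of being maintained by hand.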

Update: past few weeks have been busy so I haven't had a chance to look into this but I'm hoping to get more time in December to focus on it.

Able to start thinking about this again and a few thoughts:

  • Machine-in-the-loop: when we built quality models for the Wikipedia language communities, it was with the idea that the models could potentially support the existing editor processes for assigning article quality scores -- e.g., https://en.wikipedia.org/wiki/Wikipedia:Content_assessment. This generally aligns with our machine-in-the-loop practice of only building models that clearly could support and receive feedback from existing community processes. For Wikidata, while there are reasonable guidelines for item quality, the only community-generated data was a one-off labeling campaign from 2020 via Wiki labels. This presents a major challenge: how do we improve on the existing ORES model to make it more maintainable / effective without a clear feedback loop that can be used to validate/update the model? One possible approach is to instead treat this as a task-identification model -- i.e. instead of seeking to model quality directly and therefore allowing vague features like the total # of references, we could design a model that seeks to explicitly build a list of missing/to-be-improved properties/aliases/descriptions/references. This list of changes could then always be converted into a quality score -- e.g., by computing a simple ratio of existing properties to missing properties or something like that -- but that would be secondary to the model. The community process that can provide feedback for this style of model then is just the regular editing process (albeit quite weakly because an edit doesn't tell you what else is missing). Eventually, it could feed into an actual interface similar to the Growth team's structured tasks that would provide even more direct feedback, but in the meantime this still feels much more machine-in-the-loop than a direct quality model.
  • Reducing data drift: alongside this shift in design from quality -> task identification, we can also make the model more sustainable by hard-coding fewer outliers (like asteroids) and redesigning the model to adapt to the structure of Wikidata at training time. For example, we could follow the approach previously taken for external identifiers / media, where the relevant data structures that inform the model are easy to auto-generate and thus could be updated with each model training. This could be extended to e.g., lists of properties that commonly have references and lists of properties that commonly appear for a given instance-of.
    • Then the model would take an item as input and perhaps go something like:
      • Extract its instance-of and sitelinks
      • Sitelinks would be used to help determine which aliases/descriptions should exist
      • Instance-ofs would be used to identify which properties are expected
      • For each of those expected properties, it would either be rated as missing, incomplete (missing reference etc.), or complete
      • And then all of this information could be compiled as specific tasks
      • And for the quality score, the list of tasks could be compared against the existing data to come to some general score.
    • The challenge then still is in the smart compiling of expected properties for a given instance-of, but I feel much better about the structure of this model because it's more transparent and anyone who is familiar with Wikidata could easily inspect the list of expected properties for a given instance-of and tweak it.
    • I'm now working on extracting the list of existing properties for each instance-of to see if most have a clear set of common properties
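The steps listed above could be sketched roughly as follows. Everything here is an assumption for illustration: the flattened item layout, the `expected_props` and `needs_reference` lookup tables, the task names, and the final ratio used as a score. Aliases are left out to keep the example short.

```python
def identify_tasks(item, expected_props, needs_reference):
    """Return (tasks, number of checks performed) for one simplified item."""
    tasks = []
    # Sitelinks determine which languages should have a label and a description.
    for lang in item["sitelink_langs"]:
        for field, task in (("labels", "add-label"), ("descriptions", "add-description")):
            if lang not in item[field]:
                tasks.append((task, lang))
    # Instance-of values determine which properties are expected.
    wanted = set()
    for cls in item["instance_of"]:
        wanted.update(expected_props.get(cls, []))
    for prop in sorted(wanted):
        statements = item["claims"].get(prop, [])
        if not statements:
            tasks.append(("add-statement", prop))   # missing
        elif prop in needs_reference and not all(s["references"] for s in statements):
            tasks.append(("add-reference", prop))   # incomplete
        # otherwise the property is treated as complete
    checks = 2 * len(item["sitelink_langs"]) + len(wanted)
    return tasks, checks

def quality_score(tasks, checks):
    """One possible score: the share of checks that did not produce a task."""
    return 1.0 if checks == 0 else 1 - len(tasks) / checks

item = {
    "instance_of": ["Q5"],
    "sitelink_langs": ["en", "ja"],
    "labels": {"en": "Example Person", "ja": "Example Person (ja)"},
    "descriptions": {"en": "example description"},
    "claims": {"P21": [{"references": []}], "P569": [{"references": ["some reference"]}]},
}
tasks, checks = identify_tasks(item, {"Q5": ["P21", "P569", "P106"]}, {"P569", "P106"})
score = quality_score(tasks, checks)
# tasks == [("add-description", "ja"), ("add-statement", "P106")]
```

The task list is the primary output; the score is derived from it and could be swapped for any other aggregation without touching the task logic.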

Weekly updates:

  • I focused on the references component of the model this week. I built heavily on Amaral, Gabriel, Alessandro Piscopo, Lucie-Aimée Kaffee, Odinaldo Rodrigues, and Elena Simperl. "Assessing the quality of sources in Wikidata across languages: a hybrid approach." Journal of Data and Information Quality (JDIQ) 13, no. 4 (2021): 1-35.
  • I wrote a Python function (code below) that takes the references for a claim and maps them to high-level categories that tell us about the quality of the reference -- e.g., has an External URL associated with it vs. referring to an internal Wikidata item or an import from another Wikimedia project. I can imagine weak and strong recommendations based on this -- e.g., high priority would be adding missing references, lower priority might be updating Imported from Wikimedia Project to an external URL, and very low priority might be adding a second reference.
  • Using that function, I can generate basic descriptive stats on reference distributions on Wikidata (table below) and split by property (top-100-most-common properties below). From this data, you can see that we might be able to automatically infer which properties definitely need references, which ones probably should have references, and which ones probably don't by just setting some basic heuristics. One challenge will be whether we use the current state of Wikidata (which is heavily bot-influenced so for certain properties, reflects the choice of a few people) or try to build a more nuanced dataset based on edit history of which properties have references when editors add them.
  • Next step will be returning to determining expected claim coverage for a given instance-of
# Code for categorizing the references for a claim per a simple taxonomy that, by proxy, tells us something about the authority/accessibility/usefulness of the reference
# types of references ordered from worst -> best
# so if a claim has two references and one is Internal-Stated and one is External-Direct, we keep External-Direct
# (maps reference type -> rank; enumerate yields (index, value) pairs)
REF_ORDER = {r: i for i, r in enumerate(
    ['Internal-Inferred', 'Internal-Stated', 'Internal-Wikimedia',
     'External-Identifier', 'External-Direct'])}

EXTERNAL_ID_PROPERTIES = set()
# all Wikidata properties that are external IDs -- used for detecting when used as part of a reference
# TODO: Maybe update to SPARQL query that is external identifier properties ONLY with URL formatter properties? (maybe that's essentially the same thing?)
# https://quarry.wmcloud.org/query/69919
with open('quarry-69919-wikidata-external-ids-run692643.tsv', 'r') as fin:
    for line in fin:
        EXTERNAL_ID_PROPERTIES.add(f'P{line.strip()}')

def getReferenceType(references):
    """Map references for a claim to different categories.
    
    Heavily inspired by: https://arxiv.org/pdf/2109.09405.pdf
    Also: https://www.wikidata.org/wiki/Help:Sources
    """
    if not references:  # None or an empty list (max() below would fail on an empty list)
        ref_count = 'unreferenced'
        best_ref_type = None
    else:
        ref_count = 'single' if len(references) == 1 else 'multiple'
        best_ref_types = []
        for ref in references:
            # reference URL OR official website OR archive URL OR full work available at URL OR URL OR external data available at
            if any(p in ref['snaksOrder'] for p in ('P854', 'P856', 'P1065', 'P953', 'P2699', 'P1325')):
                best_ref_types.append('External-Direct')
                break
            elif [p for p in ref['snaksOrder'] if p in EXTERNAL_ID_PROPERTIES]:
                best_ref_types.append('External-Identifier')
            # Wikimedia import URL OR imported from Wikimedia project
            elif 'P4656' in ref['snaksOrder'] or 'P143' in ref['snaksOrder']:
                best_ref_types.append('Internal-Wikimedia')
            # stated in
            elif 'P248' in ref['snaksOrder']:
                best_ref_types.append('Internal-Stated')
            # inferred from Wikidata item OR based on heuristic OR based on
            elif 'P3452' in ref['snaksOrder'] or 'P887' in ref['snaksOrder'] or 'P144' in ref['snaksOrder']:
                best_ref_types.append('Internal-Inferred')
            # title OR published in -- hard to interpret without more info but probably links to Wikidata item
            elif 'P1476' in ref['snaksOrder'] or 'P1433' in ref['snaksOrder']:
                best_ref_types.append('Internal-Stated')
            else:
                best_ref_types.append(f'Unknown: {ref["snaksOrder"]}')
        best_ref_type = max(best_ref_types, key=lambda x: REF_ORDER.get(x, -1))
    return (ref_count, best_ref_type)

High-level descriptive stats for every num_refs/best_ref category with over 1000 claims.
I manually inspected the top Unknown property combinations to check whether any of them should be
part of one of the official categories; otherwise they would end up being treated as unreferenced.

+------------+-----------------------------------------------------+----------+
|num_refs    |best_ref                                             |num_claims|
+------------+-----------------------------------------------------+----------+
|single      |External-Direct                                      |651044816 |
|unreferenced|null                                                 |339814593 |
|single      |Internal-Stated                                      |191615754 |
|single      |External-Identifier                                  |154045142 |
|single      |Internal-Wikimedia                                   |55315642  |
|single      |Internal-Inferred                                    |21253250  |
|multiple    |Internal-Stated                                      |3218113   |
|multiple    |External-Direct                                      |2825364   |
|multiple    |Internal-Wikimedia                                   |2791394   |
|multiple    |External-Identifier                                  |2262353   |
|single      |Unknown: ['P813']                                    |1243513   |
|single      |Unknown: ['P1640', 'P813']                           |101331    |
|multiple    |Internal-Inferred                                    |85786     |
|single      |Unknown: ['P1810', 'P813']                           |81642     |
|single      |Unknown: ['P6104']                                   |46210     |
|multiple    |Unknown: ['P813']                                    |15468     |
|single      |Unknown: ['P123']                                    |9992      |
|single      |Unknown: ['P195']                                    |7011      |
|multiple    |Unknown: ['P1640', 'P813']                           |4594      |
|single      |Unknown: ['P459']                                    |3949      |
|single      |Unknown: ['P217', 'P195']                            |3045      |
|single      |Unknown: ['P217']                                    |3019      |
|single      |Unknown: ['P373']                                    |2986      |
|multiple    |Unknown: ['P304']                                    |2812      |
|single      |Unknown: ['P195', 'P217']                            |2558      |
|single      |Unknown: ['P1683']                                   |1572      |
|single      |Unknown: ['P958']                                    |1549      |
|single      |Unknown: ['P3014']                                   |1509      |
|single      |Unknown: ['P10253']                                  |1348      |
|multiple    |Unknown: ['P1343']                                   |1285      |
|single      |Unknown: ['P304']                                    |1256      |
|single      |Unknown: ['P973']                                    |1194      |
|single      |Unknown: ['P407']                                    |1118      |
|single      |Unknown: ['P1343']                                   |1089      |
+------------+-----------------------------------------------------+----------+

Top-100 most common properties on Wikidata and reference distribution:
+--------+----------+-----------------+-------------+-------------+------------+
|property|num_claims|prop_unreferenced|prop_external|prop_internal|prop_unknown|
+--------+----------+-----------------+-------------+-------------+------------+
|P2860   |287780712 |0.001            |0.999        |0.0          |0.0         |
|P2093   |134861830 |0.118            |0.871        |0.011        |0.0         |
|P31     |106419606 |0.404            |0.386        |0.21         |0.0         |
|P1476   |43568811  |0.148            |0.837        |0.014        |0.001       |
|P577    |41854467  |0.119            |0.846        |0.035        |0.0         |
|P1433   |39313917  |0.115            |0.862        |0.022        |0.001       |
|P304    |36279684  |0.101            |0.886        |0.013        |0.0         |
|P478    |36170462  |0.1              |0.885        |0.014        |0.0         |
|P1215   |33122905  |0.0              |0.0          |1.0          |0.0         |
|P433    |33009310  |0.1              |0.897        |0.002        |0.0         |
|P698    |32069969  |0.019            |0.98         |0.001        |0.0         |
|P528    |28768139  |0.005            |0.004        |0.991        |0.0         |
|P356    |28598411  |0.16             |0.831        |0.009        |0.0         |
|P50     |27852061  |0.372            |0.604        |0.023        |0.0         |
|P921    |24859815  |0.291            |0.038        |0.671        |0.0         |
|P407    |16251420  |0.787            |0.123        |0.089        |0.0         |
|P17     |15428560  |0.541            |0.218        |0.241        |0.0         |
|P131    |11726405  |0.44             |0.266        |0.294        |0.0         |
|P106    |10064651  |0.604            |0.199        |0.196        |0.001       |
|P625    |9602347   |0.281            |0.299        |0.421        |0.0         |
|P21     |8314129   |0.592            |0.103        |0.306        |0.0         |
|P3083   |8152578   |0.996            |0.0          |0.004        |0.0         |
|P6257   |8094420   |0.0              |0.0          |1.0          |0.0         |
|P6258   |8094293   |0.0              |0.0          |1.0          |0.0         |
|P6259   |8079526   |0.0              |0.0          |1.0          |0.0         |
|P2671   |7391570   |0.998            |0.0          |0.002        |0.0         |
|P59     |7374426   |0.994            |0.0          |0.006        |0.0         |
|P735    |7089362   |0.888            |0.082        |0.029        |0.001       |
|P932    |6564521   |0.125            |0.872        |0.003        |0.0         |
|P569    |6051485   |0.186            |0.386        |0.428        |0.0         |
|P2214   |5843262   |0.0              |0.0          |1.0          |0.0         |
|P10752  |5226951   |0.0              |0.0          |1.0          |0.0         |
|P10751  |5221836   |0.0              |0.0          |1.0          |0.0         |
|P27     |4720704   |0.691            |0.085        |0.223        |0.0         |
|P373    |4671599   |0.842            |0.001        |0.156        |0.001       |
|P18     |4630968   |0.736            |0.016        |0.248        |0.0         |
|P2216   |4599813   |0.0              |0.0          |1.0          |0.0         |
|P5875   |4581754   |1.0              |0.0          |0.0          |0.0         |
|P361    |4507712   |0.454            |0.271        |0.273        |0.003       |
|P646    |4420934   |0.713            |0.0          |0.286        |0.0         |
|P684    |4321135   |0.003            |0.0          |0.997        |0.0         |
|P734    |4262859   |0.765            |0.113        |0.121        |0.001       |
|P1566   |3750650   |0.136            |0.0          |0.864        |0.0         |
|P171    |3629882   |0.88             |0.014        |0.106        |0.0         |
|P225    |3621880   |0.786            |0.022        |0.192        |0.0         |
|P105    |3618190   |0.864            |0.014        |0.122        |0.0         |
|P2583   |3489597   |0.0              |0.0          |1.0          |0.0         |
|P279    |3356545   |0.225            |0.479        |0.296        |0.0         |
|P2888   |3287592   |0.864            |0.133        |0.002        |0.0         |
|P214    |3194245   |0.498            |0.172        |0.318        |0.012       |
|P19     |3188730   |0.217            |0.156        |0.627        |0.0         |
|P570    |3099086   |0.185            |0.402        |0.413        |0.001       |
|P1087   |2934424   |0.0              |0.153        |0.847        |0.0         |
|P703    |2858721   |0.008            |0.488        |0.504        |0.0         |
|P276    |2709383   |0.447            |0.459        |0.094        |0.0         |
|P571    |2668156   |0.25             |0.319        |0.43         |0.0         |
|P846    |2574389   |0.013            |0.001        |0.956        |0.03        |
|P69     |2527872   |0.286            |0.278        |0.435        |0.0         |
|P1412   |2499485   |0.71             |0.197        |0.093        |0.0         |
|P1082   |2468600   |0.066            |0.5          |0.431        |0.004       |
|P971    |2464447   |0.8              |0.0          |0.199        |0.001       |
|P953    |2446948   |0.689            |0.278        |0.03         |0.003       |
|P10585  |2279526   |0.015            |0.985        |0.0          |0.0         |
|P1435   |2133882   |0.213            |0.344        |0.443        |0.0         |
|P527    |2092753   |0.473            |0.412        |0.109        |0.006       |
|P195    |2058009   |0.35             |0.609        |0.041        |0.0         |
|P421    |2023470   |0.778            |0.04         |0.182        |0.0         |
|P641    |1978144   |0.638            |0.138        |0.224        |0.0         |
|P6216   |1966767   |0.758            |0.172        |0.069        |0.0         |
|P281    |1954649   |0.391            |0.322        |0.287        |0.0         |
|P7859   |1918647   |0.023            |0.977        |0.0          |0.0         |
|P496    |1775935   |0.992            |0.006        |0.001        |0.002       |
|P856    |1756660   |0.375            |0.166        |0.455        |0.004       |
|P108    |1739754   |0.23             |0.636        |0.133        |0.001       |
|P1104   |1645473   |0.066            |0.046        |0.887        |0.0         |
|P136    |1639956   |0.496            |0.104        |0.4          |0.0         |
|P1448   |1623660   |0.089            |0.258        |0.651        |0.001       |
|P40     |1598046   |0.173            |0.017        |0.809        |0.0         |
|P213    |1584397   |0.551            |0.271        |0.176        |0.001       |
|P54     |1566111   |0.233            |0.011        |0.756        |0.0         |
|P6179   |1540572   |0.916            |0.084        |0.0          |0.0         |
|P39     |1528039   |0.359            |0.314        |0.327        |0.0         |
|P161    |1473134   |0.193            |0.062        |0.745        |0.0         |
|P495    |1471885   |0.619            |0.129        |0.252        |0.0         |
|P2326   |1468833   |1.0              |0.0          |0.0          |0.0         |
|P227    |1459531   |0.457            |0.173        |0.367        |0.004       |
|P244    |1458838   |0.383            |0.253        |0.362        |0.001       |
|P186    |1434411   |0.21             |0.576        |0.214        |0.001       |
|P166    |1427240   |0.562            |0.167        |0.271        |0.001       |
|P2044   |1374724   |0.265            |0.081        |0.654        |0.0         |
|P5055   |1362707   |0.003            |0.0          |0.997        |0.0         |
|P6375   |1347933   |0.612            |0.253        |0.135        |0.0         |
|P235    |1312921   |0.846            |0.142        |0.012        |0.0         |
|P234    |1304992   |0.857            |0.13         |0.013        |0.0         |
|P1343   |1296440   |0.543            |0.28         |0.093        |0.084       |
|P20     |1280216   |0.295            |0.21         |0.495        |0.0         |
|P1090   |1279195   |0.0              |0.0          |1.0          |0.0         |
|P155    |1267723   |0.653            |0.004        |0.343        |0.0         |
|P156    |1249070   |0.663            |0.005        |0.332        |0.0         |
|P680    |1206178   |0.001            |0.972        |0.027        |0.0         |
+--------+----------+-----------------+-------------+-------------+------------+
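As a rough illustration of the "basic heuristics" idea, the sketch below buckets properties by their `prop_unreferenced` share from the table above. The two thresholds and the bucket names are arbitrary assumptions, and because the input reflects the current, heavily bot-influenced state of Wikidata, the output should be read as a starting point for review rather than a rule.

```python
def reference_expectation(prop_unreferenced, low=0.2, high=0.8):
    """Bucket a property by how often its claims lack any reference."""
    if prop_unreferenced <= low:
        return "required"      # almost always referenced in practice
    if prop_unreferenced >= high:
        return "optional"      # almost never referenced in practice
    return "recommended"       # mixed practice

# prop_unreferenced values copied from the table above.
rows = {"P2860": 0.001, "P569": 0.186, "P31": 0.404, "P373": 0.842, "P5875": 1.0}
expectations = {prop: reference_expectation(share) for prop, share in rows.items()}
```

With these placeholder thresholds, P2860 and P569 fall into "required", P31 into "recommended", and P373 and P5875 into "optional".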

I'm trying to implement a link-prediction task on Wikidata, to be used as a proxy for claim coverage. I'm building on top of Goyal & Ferrara's work. The existing libraries might require some tweaks to work on the full Wikidata graph, but before addressing the scalability issues I want to test the approach on a small sample to check its suitability.

@Lydia_Pintscher I was reminded recently of Recoin (and the closely related PropertySuggester) and that got me wondering: is there a reason that the ORES model was used instead of Recoin? Or maybe more specifically, is there any reason not to use Recoin for assessing Wikidata item quality? What are its drawbacks?

Looking through it, my impression was that it's quite good and that my approach likely would have been very similar. I do see a few places we could augment it:

  • Also assessing references in a similar way (based on how often a property is referenced on other items) to identify claims where references are missing or could be improved (e.g., imported from wikipedia)
  • Also assessing labels/descriptions based on which language sitelinks exist for the item -- e.g., if Japanese Wikipedia article, should also have Japanese label/description

And then I know you asked about Properties / Lexemes -- presumably this same strategy could be adopted for them if it's indeed working well for items!
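A small sketch of the label/description rule, assuming the standard Wikidata JSON layout (`sitelinks` keyed by site ID such as `jawiki`, `labels` and `descriptions` keyed by language code). The site-ID-to-language mapping is a simplification: the `NON_LANGUAGE_WIKIS` set is a hand-picked, possibly incomplete assumption, and a few Wikipedia site IDs do not match their label language code, so a real implementation would need a proper lookup table.

```python
# Site IDs that end in "wiki" but are not language editions of Wikipedia (assumed list).
NON_LANGUAGE_WIKIS = {"commonswiki", "specieswiki", "metawiki",
                      "wikidatawiki", "mediawikiwiki", "sourceswiki"}

def wikipedia_languages(sitelinks):
    """Derive language codes from Wikipedia sitelinks, e.g. 'jawiki' -> 'ja'."""
    langs = set()
    for site in sitelinks:
        # Sister projects such as 'enwikisource' do not end in 'wiki' and are skipped.
        if site.endswith("wiki") and site not in NON_LANGUAGE_WIKIS:
            langs.add(site[: -len("wiki")].replace("_", "-"))
    return langs

def missing_terms(item):
    """List languages that have a Wikipedia article but no label or description."""
    langs = wikipedia_languages(item.get("sitelinks", {}))
    return {
        field: sorted(langs - set(item.get(field, {})))
        for field in ("labels", "descriptions")
    }

item = {
    "sitelinks": {
        "enwiki": {"title": "Example"},
        "jawiki": {"title": "Example"},
        "commonswiki": {"title": "Category:Example"},
        "enwikisource": {"title": "Example"},
    },
    "labels": {"en": {"value": "example"}, "ja": {"value": "example"}},
    "descriptions": {"en": {"value": "an example item"}},
}
missing = missing_terms(item)
# missing == {"labels": [], "descriptions": ["ja"]}
```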

I started a PAWS notebook where I will evaluate the proposed strategy (Recoin with the addition of reference/label rules) against the 2020 dataset (~4k items) of assessed Wikidata item qualities. This will allow me to assess the method relatively cheaply before trying to scale up.

Notebook: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/eval_wikidata_quality_model.ipynb

@Lydia_Pintscher I was reminded recently of Recoin (and the closely related PropertySuggester) and that got me wondering: is there a reason that the ORES model was used instead of Recoin? Or maybe more specifically, is there any reason not to use Recoin for assessing Wikidata item quality? What are its drawbacks?

Looking through it, my impression was that it's quite good and that my approach likely would have been very similar. I do see a few places we could augment it:

  • Also assessing references in a similar way (based on how often a property is referenced on other items) to identify claims where references are missing or could be improved (e.g., imported from wikipedia)
  • Also assessing labels/descriptions based on which language sitelinks exist for the item -- e.g., if Japanese Wikipedia article, should also have Japanese label/description

And then I know you asked about Properties / Lexemes -- presumably this same strategy could be adopted for them if it's indeed working well for items!

Recoin I believe didn't exist at that point. It was also not integrated in the existing production systems. I don't think we ever did a proper analysis of what it's currently capable of and how good it is for judging Item quality.