We have to specify features based on the quality criteria. These will be used for developing the prediction model.
|Status||Assigned||Task|
|Resolved||johl||T127047 Collection of topics for HPI hackathon|
|Open||None||T76230 [Epic] data quality and trust|
|Resolved||awight||T187836 [Epic] Audit of pending ORES GUI deployments|
|Resolved||Glorian_WD||T127470 Deploy item quality classification model for Wikidata|
|Resolved||Glorian_WD||T157498 Train/test item quality model for Wikidata|
|Resolved||Glorian_WD||T157497 Engineer features for item quality model|
|Resolved||Ladsgroup||T158430 Use suggested properties to get signal for completeness|
|Resolved||hoo||T164994 Enable wbgetsuggestions API to get recommended properties even if they have existed in an item|
@Glorian_WD and I have been discussing how we'll get features that give us some signal about which properties are expected for specific types of items. Here's my skeleton proposal:
- Query for the most used statements (e.g. instance-of:human).
- For the top N most used statements, query for the most used secondary statements (e.g. instance-of:human, occupation:author).
- For all items that pass some basic threshold of quality (e.g. has an external reference and >= N sitelinks), find the frequency of all other properties.
- Build an index on this so it can be looked up quickly during scoring.
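To make the proposal concrete, here is a minimal sketch of the last two steps, assuming the items have already been loaded into plain Python dicts. The item layout, the function names (`passes_threshold`, `build_property_index`, `missing_expected`) and the cut-off values are all hypothetical, the sample data is made up, and the index is keyed only on the instance-of value (P31) rather than on pairs of statements, so treat it as an illustration of the idea and not as the actual feature code.

```python
from collections import Counter, defaultdict

# Hypothetical, made-up sample items. "claims" maps property ID -> list of values.
ITEMS = [
    {"id": "Q1", "sitelinks": 5, "has_external_reference": True,
     "claims": {"P31": ["Q5"], "P106": ["Q36180"], "P569": ["1952"]}},
    {"id": "Q2", "sitelinks": 3, "has_external_reference": True,
     "claims": {"P31": ["Q5"], "P569": ["1970"]}},
    {"id": "Q3", "sitelinks": 0, "has_external_reference": False,
     "claims": {"P31": ["Q5"], "P21": ["Q6581097"]}},
]


def passes_threshold(item, min_sitelinks=2):
    """Basic quality filter: an external reference and >= N sitelinks (N is assumed)."""
    return item["has_external_reference"] and item["sitelinks"] >= min_sitelinks


def build_property_index(items, min_sitelinks=2):
    """Map each (P31, value) statement to the relative frequency of other properties."""
    counts = defaultdict(Counter)  # statement -> property -> number of items
    totals = Counter()             # statement -> number of qualifying items
    for item in items:
        if not passes_threshold(item, min_sitelinks):
            continue
        claims = item["claims"]
        for type_value in claims.get("P31", []):
            key = ("P31", type_value)
            totals[key] += 1
            for prop in claims:
                if prop != "P31":
                    counts[key][prop] += 1
    return {
        key: {prop: n / totals[key] for prop, n in counts[key].items()}
        for key in totals
    }


def missing_expected(item, index, min_freq=0.5):
    """Properties that are common for the item's type(s) but absent from the item."""
    expected = set()
    for type_value in item["claims"].get("P31", []):
        freqs = index.get(("P31", type_value), {})
        expected.update(p for p, f in freqs.items() if f >= min_freq)
    return sorted(expected - set(item["claims"]))


index = build_property_index(ITEMS)
missing_for_q3 = missing_expected(ITEMS[2], index)
```

With this sample data, Q3 fails the threshold and so does not contribute to the index, but it can still be scored against it: the number of entries in `missing_for_q3` could serve as a completeness signal. The plain dict stands in for whatever lookup structure is eventually used during scoring.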