|Resolved||johl||T127047 Collection of topics for HPI hackathon|
|Open||None||T76230 [Epic] data quality and trust|
|Resolved||awight||T187836 [Epic] Audit of pending ORES GUI deployments|
|Resolved||Glorian_WD||T127470 Deploy item quality classification model for Wikidata|
|Resolved||Glorian_WD||T157498 Train/test item quality model for Wikidata|
|Resolved||Glorian_WD||T157495 Complete Wikidata item quality campaign|
|Resolved||Halfak||T157493 Deploy Wikidata item quality campaign|
|Resolved||Halfak||T161002 Late march wikilabels deployment|
|Resolved||Halfak||T159570 Deploy the pilot of Wikidata item quality campaign|
|Resolved||Halfak||T155828 Design item_quality form for Wikidata|
|Resolved||Glorian_WD||T157489 [Discuss] item quality in Wikidata|
Aggregating two labels might be hard. Maybe we could ask for two labels of a set of items and run a follow-up campaign to get a 3rd label on the items where labelers disagree.
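A rough sketch of how that two-round approach could work, purely as an illustration: the function names, the item IDs, and the majority-vote rule are all assumptions, not something Wiki Labels provides.

```python
from collections import Counter


def split_by_agreement(labels):
    """Split first-round results into agreed items and disputed item IDs.

    `labels` maps an item ID to a list of exactly two labels.
    """
    agreed, disputed = {}, []
    for item_id, (first, second) in labels.items():
        if first == second:
            agreed[item_id] = first
        else:
            disputed.append(item_id)
    return agreed, disputed


def resolve_with_third_label(labels, third_labels):
    """Majority vote over three labels; None when all three differ."""
    resolved = {}
    for item_id, third in third_labels.items():
        votes = Counter(labels[item_id] + [third])
        label, count = votes.most_common(1)[0]
        resolved[item_id] = label if count >= 2 else None
    return resolved


# Placeholder item IDs and labels, for illustration only.
first_round = {"Q1": ["A", "A"], "Q2": ["B", "C"], "Q3": ["D", "E"]}
agreed, disputed = split_by_agreement(first_round)

# Hypothetical follow-up campaign: one extra label per disputed item.
follow_up = {"Q2": "C", "Q3": "C"}
resolved = resolve_with_third_label(first_round, follow_up)
```

Items that still come back as `None` after the third label would need some other tie-breaking rule, which this sketch leaves open.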
Where are the labels defined? What criteria separate a "C" from a "B"?
@Halfak @Ladsgroup : I believe that, in order to rate items, people could follow the showcase item guidelines (https://www.wikidata.org/wiki/Wikidata:Showcase_items).
Maybe @Lydia_Pintscher can confirm this.
Yes, people should in general follow the showcase item criteria. So A would be "meets all criteria" and E would be "meets none of the criteria"? In this case we should list the criteria and say that.
@Lydia_Pintscher, what do you think about the middle quality classes? Could we pick and choose criteria and make statements about what types of items belong at which level?
- E: Anything that doesn't meet the D criteria
- D: A few useful statements and a description in at least one language
- C: At least one non-trivial statement is referenced.
- B: Aliases and description are translated into >= 5 languages
- A: All Showcase criteria met
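To make the ordering of those checks concrete, here is a minimal sketch. It assumes the classes are cumulative (each class also requires everything below it), that "a few" means at least three statements, and that a pre-computed summary of the item is available. The `ItemSummary` fields and the threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ItemSummary:
    """Hypothetical pre-computed counts for one item."""
    statements: int                 # useful statements
    referenced_statements: int      # non-trivial statements with a reference
    description_languages: int      # languages with a description
    alias_languages: int            # languages with at least one alias
    meets_all_showcase_criteria: bool


MIN_USEFUL_STATEMENTS = 3  # assumption: one reading of "a few"


def quality_class(item):
    """Return the highest class whose criteria the item meets (cumulative)."""
    if item.statements < MIN_USEFUL_STATEMENTS or item.description_languages < 1:
        return "E"
    if item.referenced_statements < 1:
        return "D"
    if item.alias_languages < 5 or item.description_languages < 5:
        return "C"
    if not item.meets_all_showcase_criteria:
        return "B"
    return "A"
```

Judgments such as "useful" or "non-trivial" are hidden inside the counts here; a human labeler would still have to make those calls.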
This is just an example. For English Wikipedia's 1.0 assessments they have descriptions that are a bit more subjective and make references to process and the level of coverage.
- Stub: The article is either a very short article or a rough collection of information that will need much work to become a meaningful article. It is usually very short; but, if the material is irrelevant or incomprehensible, an article of any length falls into this category. Although Stub-class articles are the lowest class of the normal classes, they are adequate enough to be an accepted article, though they do have risks of being dropped from being an article all together.
- Start: The article has a usable amount of good content but is weak in many areas. Quality of the prose may be distinctly unencyclopedic, and MoS compliance non-existent. The article should satisfy fundamental content policies, such as BLP. Frequently, the referencing is inadequate, although enough sources are usually provided to establish verifiability. No Start-Class article should be in any danger of being speedily deleted.
- C: The article cites more than one reliable source and is better developed in style, structure, and quality than Start-Class, but it fails one or more of the criteria for B-Class. It may have some gaps or missing elements; need editing for clarity, balance, or flow; or contain policy violations, such as bias or original research. Articles on fictional topics are likely to be marked as C-Class if they are written from an in-universe perspective. It is most likely that C-Class articles have a reasonable encyclopedic style.
- B: The article is suitably referenced, with inline citations. The article reasonably covers the topic, and does not contain obvious omissions or inaccuracies. The article has a defined structure. The article is reasonably well-written. The article contains supporting materials where appropriate. The article presents its content in an appropriately understandable way.
- GA: Well written: the prose is clear and concise, and the spelling and grammar are correct. Verifiable and it contains no original research. It contains no copyright violations nor plagiarism. Broad in its coverage: it addresses the main aspects of the topic. Neutral: it represents viewpoints fairly and without editorial bias, giving due weight to each. Stable: it does not change significantly from day to day. Images are relevant to the topic, and have suitable captions.
- FA: See https://en.wikipedia.org/wiki/Wikipedia:Featured_article_criteria
Note that the more consistent our labels are, the better our test data will be. That means we'll be able to get test statistics that actually reflect reality, and our prediction probabilities will have more consistent and predictable properties.
@Halfak: Which kind of classifier would be used? (Particularly: Will/Can it create some continuous score or will it only put out one of the defined classes?)
I should say that I expect that a GradientBoosting or RandomForest model will likely fit this prediction problem well, but we might change to a different classifier strategy if it improves the fitness.
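On the continuous score question, one possible approach: classifiers of this kind can usually report a probability per class (in scikit-learn, both `GradientBoostingClassifier` and `RandomForestClassifier` provide `predict_proba`), and those probabilities can be collapsed into a single number with a weighted sum. The sketch below assumes evenly spaced weights for E through A, which is an illustrative choice and not a decided design.

```python
# Assumed weights: evenly spaced from E (lowest) to A (highest).
CLASS_WEIGHTS = {"E": 0, "D": 1, "C": 2, "B": 3, "A": 4}


def weighted_score(probabilities):
    """Collapse per-class probabilities into one score between 0 and 4."""
    total = sum(probabilities.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("class probabilities must sum to 1")
    return sum(CLASS_WEIGHTS[cls] * p for cls, p in probabilities.items())


def predicted_class(probabilities):
    """Return the single most likely class."""
    return max(probabilities, key=probabilities.get)


# Made-up probabilities for one item, for illustration only.
example = {"E": 0.0, "D": 0.1, "C": 0.2, "B": 0.5, "A": 0.2}
score = weighted_score(example)   # 0.1*1 + 0.2*2 + 0.5*3 + 0.2*4
label = predicted_class(example)
```

So the same model output could give both a discrete class and a score that moves smoothly as an item improves.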
@Halfak, please find the attached file which specifies the criteria for each grade. I made that by referring to the showcase item criteria.
After discussing this with Jan and Lydia, we think it is important to keep each criterion somewhat vague. In other words, we do not want to be too specific in defining the criteria, so that people can use their common sense in evaluating items.
What do you think about the attached criteria?
Updated form to include summaries of each class: https://github.com/wiki-ai/wikilabels-wmflabs-deploy/pull/29
Added HTML snippet support in https://github.com/wiki-ai/wikilabels/pull/162