
Design item_quality form for Wikidata
Closed, Resolved · Public

Event Timeline

Restricted Application added a project: User-Ladsgroup. · View Herald Transcript · Jan 20 2017, 1:38 PM

I wonder if we could provide some guidance for each quality level in the form -- to help people rate consistently. I'm worried that one labeler's "B" will be another labeler's "C"

I was thinking of two reviewers per task so the labels will be less subjective; overall, things should be more or less objective once we aggregate all labels and build a classifier on top of them.

Aggregating two labels might be hard. Maybe we could ask for two labels of a set of items and run a follow-up campaign to get a 3rd label on the items where labelers disagree.
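The two-label scheme with a follow-up campaign could be sketched as a simple majority-vote aggregator that flags items needing a third label. This is a hypothetical sketch, not the campaign tooling actually used; `aggregate_labels` is an assumed helper name.

```python
from collections import Counter

def aggregate_labels(labels):
    """Majority vote over the quality labels collected for one item.

    Returns (label, needs_followup). If no strict majority exists
    (e.g. the two labelers disagree), the item is flagged for a
    third label in a follow-up campaign.
    """
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    if n * 2 > len(labels):
        return label, False
    return None, True
```

With two labels, `["B", "B"]` resolves to `"B"` while `["B", "C"]` is flagged; after the follow-up, `["B", "C", "B"]` resolves to `"B"`.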

Where are the labels defined? What criteria separate a "C" from a "B"?

@Halfak @Ladsgroup : I believe that, in order to rate items, people could follow the guidelines for showcase items (https://www.wikidata.org/wiki/Wikidata:Showcase_items).

Maybe @Lydia_Pintscher can confirm this.

This comment was removed by Glorian_WD.

Yes people should in general follow the showcase item criteria. So A would be "meets all criteria" and E would be "meets none of the criteria"? In this case we should list the criteria and say that.

Halfak added a comment. · Edited · Jan 25 2017, 6:08 PM

@Lydia_Pintscher, what do you think about the middle quality classes? Could we pick and choose criteria and make statements about what types of items belong at which level?

E.g.

  • E: Anything that doesn't meet the D criteria
  • D: A few useful statements and a description in at least one language
  • C: At least one non-trivial statement is referenced.
  • B: Aliases and description are translated into >= 5 languages
  • A: All Showcase criteria met
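The ladder above reads as a top-down decision rule: assign the highest grade whose criteria are met. A minimal sketch, assuming hypothetical precomputed item features (the feature names here are illustrative; the real criteria were later defined on-wiki):

```python
def example_grade(item):
    """Grade an item per the example A-E ladder: return the top
    grade whose criteria are met. `item` is a dict of hypothetical
    precomputed features."""
    if item.get("meets_all_showcase_criteria"):
        return "A"
    if item.get("n_translated_languages", 0) >= 5:
        return "B"
    if item.get("n_referenced_nontrivial_statements", 0) >= 1:
        return "C"
    # "a few useful statements and a description in at least one language"
    if item.get("n_statements", 0) >= 3 and item.get("n_descriptions", 0) >= 1:
        return "D"
    return "E"
```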

This is just an example. For English Wikipedia's 1.0 assessments they have descriptions that are a bit more subjective and make references to process and the level of coverage.

  • Stub: The article is either a very short article or a rough collection of information that will need much work to become a meaningful article. It is usually very short; but, if the material is irrelevant or incomprehensible, an article of any length falls into this category. Although Stub-class articles are the lowest of the normal classes, they are adequate enough to be an accepted article, though they do risk being dropped from being an article altogether.
  • Start: The article has a usable amount of good content but is weak in many areas. Quality of the prose may be distinctly unencyclopedic, and MoS compliance non-existent. The article should satisfy fundamental content policies, such as BLP. Frequently, the referencing is inadequate, although enough sources are usually provided to establish verifiability. No Start-Class article should be in any danger of being speedily deleted.
  • C: The article cites more than one reliable source and is better developed in style, structure, and quality than Start-Class, but it fails one or more of the criteria for B-Class. It may have some gaps or missing elements; need editing for clarity, balance, or flow; or contain policy violations, such as bias or original research. Articles on fictional topics are likely to be marked as C-Class if they are written from an in-universe perspective. It is most likely that C-Class articles have a reasonable encyclopedic style.
  • B: The article is suitably referenced, with inline citations. The article reasonably covers the topic, and does not contain obvious omissions or inaccuracies. The article has a defined structure. The article is reasonably well-written. The article contains supporting materials where appropriate. The article presents its content in an appropriately understandable way.
  • GA: Well written: the prose is clear and concise, and the spelling and grammar are correct. Verifiable and it contains no original research. It contains no copyright violations nor plagiarism. Broad in its coverage: it addresses the main aspects of the topic. Neutral: it represents viewpoints fairly and without editorial bias, giving due weight to each. Stable: it does not change significantly from day to day. Images are relevant to the topic, and have suitable captions.
  • FA: See https://en.wikipedia.org/wiki/Wikipedia:Featured_article_criteria

@Halfak : I thought we did not have to define the criteria for each quality grade because the classifier would automatically define those for us. Am I right?

@Glorian_Yapinus, we'll need consistency in labeling in order for the classifier to be able to learn the distinctions between the middle classes. It's OK to be a little bit inconsistent (the classifier will make up for that), but we want to be as consistent as possible.

Note that the more consistent our labels, the better our test data will be. That means we'll get test statistics that actually reflect reality, and our prediction probabilities will have more consistent and predictable properties.
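Labeling consistency between two labelers can be quantified with a chance-corrected agreement statistic such as Cohen's kappa. A self-contained sketch (an illustration of the idea, not a tool that was part of this campaign):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two labelers, corrected for
    the agreement expected by chance given each rater's label rates.
    Assumes the raters labeled the same items in the same order and
    did not agree purely by chance (expected agreement < 1)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c]
                   for c in counts_a.keys() | counts_b.keys()) / n ** 2
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(list("ABBC"), list("ABCC"))` gives 7/11 (about 0.64): raw agreement is 3/4, but some of that is expected by chance.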

I talked to @Glorian_WD; I suggested taking a look at the creation process for grading rubrics and specifying the extremes first, then working towards the middle grades.

@Halfak: Which kind of classifier would be used? (Particularly: Will/Can it create some continuous score or will it only put out one of the defined classes?)

@Jan_Dittrich : the classifiers will classify items into the defined classes.

@Jan_Dittrich & @Glorian_Yapinus, the classifier will classify into defined classes, but it will also output probabilities that will let us linearize the scale.
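One common way to linearize an ordinal scale from class probabilities is a probability-weighted sum over the ordered classes. A sketch with made-up probability values (the function and its name are illustrative, not the model's actual output API):

```python
ORDER = ["E", "D", "C", "B", "A"]  # worst to best

def linearized_score(probas):
    """Collapse per-class probabilities into one continuous score
    in [0, 1] by taking the expected class index, normalized.
    `probas` maps class name -> predicted probability."""
    return sum(i * probas.get(cls, 0.0)
               for i, cls in enumerate(ORDER)) / (len(ORDER) - 1)
```

For instance, an item predicted as mostly "C" with some "B" mass scores just above the midpoint of the scale.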

I should say that I expect a GradientBoosting or RandomForest model will likely fit this prediction problem well, but we might change to a different classifier strategy if we can improve the fitness.
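The gradient-boosting setup described here could look like the following scikit-learn sketch. The features are synthetic stand-ins; in the real system, features (statement counts, references, labels, sitelinks, etc.) would be extracted from item revisions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: 6 numeric features, 5 quality classes (E..A).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=5, n_clusters_per_class=1,
                           random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# Per-class probabilities (one row per item, one column per class),
# which is what makes linearizing the scale possible.
probas = clf.predict_proba(X[:1])
```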

@Halfak, please find the attached file which specifies the criteria for each grade. I made that by referring to the showcase item criteria.

After discussing this with Jan and Lydia, we think maintaining some vagueness in each criterion is important. In other words, we do not want to be too specific in defining the criteria, so that people can use their common sense in evaluating items.

What do you think about the attached criteria?

Looks good. Maybe it's time to move it to the Wiki and to host a discussion about it. Once there's some buy-in, I think we'll be good to go.

Halfak renamed this task from Build item_quality form to Design item_quality form for Wikidata. · Feb 7 2017, 9:59 PM
Halfak moved this task from Untriaged to Ideas on the Scoring-platform-team board. · Feb 9 2017, 3:23 PM
Halfak closed this task as Resolved. · Mar 16 2017, 9:21 PM