
Add “integer” constraint
Closed, Resolved | Public | 3 Estimated Story Points

Description

Some properties are supposed to represent integers. As an editor I want to be able to define which properties these are so I can easily find cases where the property is used in a wrong way via the existing constraint system.

Violation message shown to the user on violating statements: "Integer constraint: This "$property" statement should only take integer values."
Constraint statement on the property: "property constraint" -> "integer constraint"

Examples where this is likely to be used:

  • number of children of a person
  • number of participants in an event


Event Timeline

Maybe Community-consensus-needed can be removed as this would mainly be needed for actually adding it to specific properties. Something that can evolve.

That being said, an integer datatype could be just as useful.

Could the output include an integer triple as well?

Which output? And what do you mean by an integer triple in this context? (Sorry, it's late...)

On Wikidata Query Service. Currently xsd:integer(?v) is needed on quantities.

Change 432004 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/WikibaseQualityConstraints@master] Add 'integer' constraint

https://gerrit.wikimedia.org/r/432004

Question that came up during code review: should a value like “7251±0.3” (integer value but non-integer bounds) be a violation or not? @Lydia_Pintscher Amir said you discussed this and it should be accepted?

From all I know about quantity values and uncertainty bounds (a.k.a. "precision") I would say that "7251" is still a valid integer, even if it is followed by "±0.3".

My main argument is: Would you consider "5±1" an integer? I think you would, right? If you do, then it would not make sense to consider a value like "5±0.999" that is a little bit more precise to not be an integer any more.

Or the other way around: It would be plausible to say "5±0.999" is not an integer. But if you do this, "5±1" cannot be an integer either.

For practical purposes I would introduce a separate "precise" constraint that disallows uncertainty bounds, if something like this is needed. I would implement the integer constraint in a way that is independent from that and ignores uncertainty bounds entirely.

For practical purposes I would introduce a separate "precise" constraint that disallows uncertainty bounds, if something like this is needed. I would implement the integer constraint in a way that is independent from that and ignores uncertainty bounds entirely.

That already exists: T170610: Add “no bounds” constraint – and in fact I expect that these constraints will typically be used together (effectively providing T112247: [RFC] Create a "number" datatype for exact values), so I think it’s mostly a theoretical question.

Yeah let's accept 7251±0.3 as a valid value.
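
The rule agreed on at this point (only the amount matters, bounds are ignored) can be sketched as follows. This is a minimal Python illustration, not the code from the Gerrit change; the function name is hypothetical, and the dict layout (signed decimal strings under "amount", "lowerBound", "upperBound") is assumed to follow the Wikibase JSON serialization of quantity values.

```python
from decimal import Decimal


def is_integer_amount(quantity: dict) -> bool:
    """Hypothetical helper: True if the amount is a whole number.

    Uncertainty bounds are deliberately not inspected here.
    """
    amount = Decimal(quantity["amount"])  # Decimal accepts a leading "+"
    return amount == amount.to_integral_value()


# "7251±0.3": integer amount, non-integer bounds
value = {
    "amount": "+7251",
    "lowerBound": "+7250.7",
    "upperBound": "+7251.3",
    "unit": "1",
}
print(is_integer_amount(value))
```

Under this rule "7251±0.3" passes, while a value such as "2.5±0.5" fails because its amount is not whole.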

Sorry for being late.

I have now been working with quantity datatype properties a lot and I have to disagree here. I think that we should allow only integer bounds when the value is integer, as bounds cannot be non-integers in those cases.

Let’s have a look at actual numbers: right now we have ~3.8M claims of quantity properties with integer constraint (~2.7M mainsnak, ~1.1M qualifier, barely any in reference). In exactly one case there is an integer value with non-integer bounds, which is the P1114 qualifier of https://www.wikidata.org/wiki/Q26882302#P186 – that claim has other issues anyway and needs to be fixed. (I removed ~10 other wrong uses of bounds this morning.)


Maybe I should add a more general rant about the quantity datatype here: users don’t understand it, which is why the vast majority of bounds and a substantial amount of units are wrong. Reasons:

  • The meaning of bounds in quantity datatype properties is not well-defined (particularly here: https://www.wikidata.org/wiki/Help:Data_type#quantity and https://www.mediawiki.org/wiki/Wikibase/DataModel#Quantities). The term “uncertainty interval” indicates that it should be used as measurement uncertainty, confidence intervals, etc., but this is actually not the case in Wikidata.
    • This leads to a situation where users use bounds as they personally prefer to, but one cannot rely on a particular meaning of any bounds given in Wikidata.
    • This also encourages users to abuse bounds for other purposes, e.g. compensate the lack of other datatypes.
    • General rule: valid bounds can also be found in the referenced sources of a claim. I’d say that clearly more than 95% of all bounds in Wikidata fail that criterion, as they are personal flavor of individual users or residuals of the automatic ±0 bounds addition of the software that we saw in the past.
  • Due to the lack of a “range” datatype, users add bounds as follows: source A claims a person has 2 children, and source B claims the same person has 3 children. Users add: 2.5±0.5 children, as this covers the range of values found in sources. (Yet I am not sure whether we should have a “range” datatype; multiple claims and use of ranks are the solution here.)
  • The lack of a “number” datatype makes the integer constraint necessary. This works out to some extent as we can see, but it is not optimal:
    • A “number” datatype would make accidental decimal places impossible.
    • A lot of wrong uses of units could be avoided as well (units such as “apple”, “passenger”, “train”, etc.) if the “number” datatype had a different kind of or even no unit attached.
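
The "2.5±0.5" pattern from the "range" bullet above is plain midpoint arithmetic. The following Python sketch (the function name is made up for illustration) shows how two integer values from different sources turn into a non-integer amount with non-integer bounds, which is exactly what an integer constraint would flag.

```python
from fractions import Fraction


def range_as_bounds(low: int, high: int) -> tuple:
    """Hypothetical helper: encode the range [low, high] as (midpoint, half-width)."""
    midpoint = Fraction(low + high, 2)
    half_width = Fraction(high - low, 2)
    return midpoint, half_width


# Source A says 2 children, source B says 3 children
mid, pm = range_as_bounds(2, 3)
print(f"{float(mid)}±{float(pm)}")  # 2.5±0.5
print(mid.denominator == 1)         # False: the amount is not an integer
```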

Change 432004 merged by jenkins-bot:
[mediawiki/extensions/WikibaseQualityConstraints@master] Add 'integer' constraint

https://gerrit.wikimedia.org/r/432004

[…] bounds cannot be non-integers in those cases.

I'm not sure I understand this. Let's say the size of a team was 50±0.5 people in 2017. Such a confidence interval tells me that there must have been some fluctuation over the year, but not a huge one. At the same time, it's common practice to round such numbers to integers because it would look odd to talk about an average including "half" people. The datatype allows modeling something like this. Why shouldn't it?

wrong uses of units […]

Why do you think "passenger" is not a valid unit?

Let's say the size of a team was 50±0.5 people in 2017. Such a confidence interval tells me that there must have been some fluctuation over the year, but not a huge one.

What’s the exact meaning of such “±0.5 bounds”? It just conveys the qualitative notion of “some fluctuation over the year” with a meaningless quantitative number that you somehow found appropriate. Say you had a team size of 49 for most of the year, which suddenly increased to 52 shortly before the end of the year, and you managed to somehow calculate the quantity as given above. Unfortunately, at no time were there ever 50 people in your team, your team size has never been within the bounds, and there have never been half team members, as you state yourself.

This is a good example of bounds being (ab)used in place of a range (49–52); alternatively, two independent claims (49 at the beginning of the year, 52 at the end of the year) should be used instead. If you just want to state that the value is an approximation or estimate, we typically add the qualifier “sourcing circumstances (P1480): circa (Q5727902)” to the claim. This way we avoid desperately trying to quantify a qualitative claim.

Why do you think "passenger" is not a valid unit?

Countable sets such as “number of passengers/apples/…” do not have a unit (or better: have unit 1, in Wikidata in some situations represented as http://www.wikidata.org/entity/Q199). Constraints in Wikidata typically respect that (e.g. “number of participants” property at https://www.wikidata.org/wiki/Property:P1132, but have a look at the violations: https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P1132#%22Units%22_violations).

@MisterSynergy So do we need a second constraint or a parameter for this one to configure it to handle the bounds?

Not sure. At this point I don’t know of any situation where we need integer values with non-integer bounds. This is also based on the observation that in 3.8M claims of 99 different quantity properties with integer constraint deployed, the situation of non-integer bounds does not occur. If anyone can come up with such an example where bounds are properly used, it would be great and I would immediately re-evaluate my position. If not, I think we need only one constraint with integer requirement for both amount and lowerBound/upperBound.

Maybe it is worth mentioning that as a physicist I have a rather scientific expectation about which purpose bounds should serve. However, the entire bounds concept is a scientific one, thus I think it is appropriate to use it scientifically in Wikidata as well.

The constraint system has already become more and more complex, and even as an editor who constantly works with it, I find it hard to keep up with all the new possibilities. Let’s keep things simple, as long as there is no reason to add another constraint or option.

Ok then let's also check the bounds and flag a violation if they are not integers.
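
A rough Python sketch of the stricter rule decided here, where the amount and both bounds must be integers. It is only an illustration under assumptions, not the actual IntegerChecker code: the function names are hypothetical, the dict layout is assumed to follow the Wikibase JSON serialization of quantities, and values without bounds are assumed to simply omit the "lowerBound"/"upperBound" keys.

```python
from decimal import Decimal
from typing import Optional


def _is_whole(number: str) -> bool:
    """True if the signed decimal string represents a whole number."""
    d = Decimal(number)
    return d == d.to_integral_value()


def integer_violation(quantity: dict) -> Optional[str]:
    """Hypothetical check: name of the first non-integer field, or None."""
    for field in ("amount", "lowerBound", "upperBound"):
        raw = quantity.get(field)  # bounds may be absent (assumption)
        if raw is not None and not _is_whole(raw):
            return field
    return None


print(integer_violation({"amount": "+7251", "lowerBound": "+7250.7", "upperBound": "+7251.3"}))
print(integer_violation({"amount": "+5", "lowerBound": "+4", "upperBound": "+6"}))
```

With this rule "7251±0.3" is reported because of its lower bound, while "5±1" and unbounded integer values still pass.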

Change 434743 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseQualityConstraints@master] Check bounds in IntegerChecker

https://gerrit.wikimedia.org/r/434743

Change 434743 merged by jenkins-bot:
[mediawiki/extensions/WikibaseQualityConstraints@master] Check bounds in IntegerChecker

https://gerrit.wikimedia.org/r/434743