
Consider ranks in constraint checks
Closed, ResolvedPublic

Description

@Nikki suggested that the “single value” constraint should not report a violation if an item has one normal rank statement and one deprecated rank statement, since the deprecated rank already marks the second value as “wrong”.

Are there any other constraints that could also take statement rank into account? For instance, should “item/value required statement” or “conflicts with” not count statements that have deprecated rank?
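To make the suggestion concrete, here is a minimal sketch of a "single value" check that skips deprecated statements. It is not the extension's actual code: the function name, the `ignore_deprecated` flag, and the simplified statement dictionaries (`property`, `value`, `rank`) are assumptions made for illustration, and the sample values are made up.

```python
from collections import defaultdict


def single_value_violations(statements, ignore_deprecated=True):
    """Return the property IDs that have more than one counted statement.

    `statements` is a list of simplified, hypothetical statement dicts;
    real Wikibase statements carry more structure than this.
    """
    counts = defaultdict(int)
    for statement in statements:
        # A deprecated statement is already marked as "wrong",
        # so it is not counted as a second value.
        if ignore_deprecated and statement["rank"] == "deprecated":
            continue
        counts[statement["property"]] += 1
    return sorted(pid for pid, n in counts.items() if n > 1)


# Made-up sample data: one normal-rank and one deprecated-rank statement.
sample = [
    {"property": "P20", "value": "value A", "rank": "normal"},
    {"property": "P20", "value": "value B", "rank": "deprecated"},
]
```

With `ignore_deprecated=True` the sample yields no violation; with `ignore_deprecated=False` the same data is reported for `P20`.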

Event Timeline

My gut feeling is we should generally not take deprecated into account. We should be careful with taking "best rank" into account. That might make sense for some and not for other constraints.
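For readers unfamiliar with the term, "best rank" selection can be sketched as follows: if a property has any preferred statements, only those count; otherwise the normal-rank ones do; deprecated statements never count. The function name and the simplified statement dictionaries are assumptions for illustration, not code from the extension.

```python
def best_rank_statements(statements):
    """Return the "best rank" statements of a single property.

    Preferred statements win if any exist; otherwise normal-rank
    statements are returned. Deprecated statements are never included.
    """
    preferred = [s for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s["rank"] == "normal"]


# Made-up sample data for one property.
sample = [
    {"value": "old", "rank": "normal"},
    {"value": "current", "rank": "preferred"},
    {"value": "wrong", "rank": "deprecated"},
]
```

This also shows why caution is needed: a constraint that only looks at best-rank statements would silently skip the normal-rank statement in the sample, which may or may not be what a given constraint type wants.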

thiemowmde triaged this task as Medium priority. Jun 28 2017, 6:46 PM
thiemowmde subscribed.

I agree with what @Lydia_Pintscher said. I suggest always ignoring deprecated statements in all constraint checks, except where it really, really makes sense to consider deprecated stuff.

Given that there are only 50k deprecated statements on Wikidata overall (compared to 160M normal ones and 500k preferred ones), we don’t want to spend too much time on figuring out what treatment of ranks makes sense for each constraint type. For now, we’ll hide constraint reports for deprecated statements in the gadget, and then see if more problems related to ranks pop up.

@Nikki did you notice this problem on any particular item which had a “single value” violation, or was it more of a general concern?

It probably happened on a particular item, but I have absolutely no idea which. :(

Given that there are only 50k deprecated statements on Wikidata overall (compared to 160M normal ones and 500k preferred ones), we don’t want to spend too much time on figuring out what treatment of ranks makes sense for each constraint type. For now, we’ll hide constraint reports for deprecated statements in the gadget, and then see if more problems related to ranks pop up.

Sorry but that's not good enough. 50k is a significant amount. This will and should grow. And we have the API that also needs to handle this.

50k is few enough that I have no clear picture of how deprecated rank is actually used on Wikidata, which makes it hard to figure out what makes sense here. It doesn’t help either that only ~400 of those statements have a “reason for deprecation” qualifier. (In fact, there are more normal-rank statements with “reason for deprecation” than actually deprecated ones.)


How about this proposal?

Don’t check constraints on the deprecated statement itself

  • Yes: conflicts with, difference within range, inverse, item requires claim, mandatory qualifier, multi value, one of, allowed qualifiers, range, single value, symmetric, value requires claim, type, unique value, value type
  • No: Commons link, used as qualifier, used as reference, used as value

Rationale: most constraints are more “logical” / “semantical”, and the deprecated rank already marks the statement as logically / semantically wrong. Commons link, used as qualifier, used as reference, and used as value, on the other hand, are more technical constraints, and should be checked even on those statements.

Ignore deprecated statements when checking the constraint

  • Don’t use other statements: Commons link, mandatory qualifier, allowed qualifiers, one of, range, used as qualifier, used as reference, used as value
  • Yes: conflicts with, difference within range, inverse, item requires claim, multi value, single value, symmetric, value requires claim, type, unique value, value type
  • No: none?

Do we need any distinctions here?
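The two lists above can be written down as lookup tables, which makes it easy to verify that every constraint type is covered exactly once per question. This is only a sketch of the proposal as stated in this comment; the set and function names are hypothetical and do not come from the extension.

```python
# Question 1: skip the check when the statement itself is deprecated?
SKIP_CHECK_ON_DEPRECATED = {
    "conflicts with", "difference within range", "inverse",
    "item requires claim", "mandatory qualifier", "multi value", "one of",
    "allowed qualifiers", "range", "single value", "symmetric",
    "value requires claim", "type", "unique value", "value type",
}
CHECK_EVEN_ON_DEPRECATED = {
    "Commons link", "used as qualifier", "used as reference", "used as value",
}

# Question 2: ignore deprecated *other* statements while checking?
USES_NO_OTHER_STATEMENTS = {
    "Commons link", "mandatory qualifier", "allowed qualifiers", "one of",
    "range", "used as qualifier", "used as reference", "used as value",
}
IGNORE_DEPRECATED_OTHERS = {
    "conflicts with", "difference within range", "inverse",
    "item requires claim", "multi value", "single value", "symmetric",
    "value requires claim", "type", "unique value", "value type",
}

ALL_TYPES = SKIP_CHECK_ON_DEPRECATED | CHECK_EVEN_ON_DEPRECATED


def should_check(constraint_type, rank):
    """Should this constraint be checked on a statement of the given rank?"""
    return not (rank == "deprecated"
                and constraint_type in SKIP_CHECK_ON_DEPRECATED)


def statements_to_consider(constraint_type, statements):
    """Filter the other statements a checker is allowed to look at."""
    if constraint_type in IGNORE_DEPRECATED_OTHERS:
        return [s for s in statements if s["rank"] != "deprecated"]
    return list(statements)
```

Under this reading, `should_check("single value", "deprecated")` is `False`, while `should_check("Commons link", "deprecated")` is `True`.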


As it turns out, the checkers for “single value” and “multi value” have been ignoring deprecated statements ever since the baseline commit in which they were first introduced. @Nikki: according to the IRC log, there never was a particular item, and if I’d just quickly checked the source code, this issue would never have been opened :/ One example item where you can verify that there’s no “single value” violation is Q12113 (“place of death” statements).

did you notice this problem on any particular item which had a “single value” violation, or was it more of a general concern?

An example would be:

https://www.wikidata.org/wiki/Q23015723#P496

which appears in:

https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P496#Single_value

Yeah, multiple statements as on Q12113#P20 are important. Note that they might all have normal rank.

For many constraints, statements with deprecated ranks don't need to be checked. I had thought that some of the daily reports already exclude them. Obviously, formatting checks might still apply.

For other constraints, checks on preferred values are sufficient.

Do we need a feature to define to which ranks a constraint applies?

@Pigsonthewing the database reports are powered by different code – the extension doesn’t report a violation on that statement (see https://www.wikidata.org/wiki/Special:ConstraintReport/Q23015723).

@Lucas_Werkmeister_WMDE then we need to fork this ticket for constraint violation reports of the kind I linked to.

@Pigsonthewing those reports are created by @Ivan_A_Krestinin’s KrBot – as far as I know, he doesn’t use Phabricator to track bugs.

Sorry but that's not good enough. 50k is a significant amount. This will and should grow. And we have the API that also needs to handle this.

@Lydia_Pintscher, I don’t find this sentence sufficient. What will "grow"?

Are you sure you got the "50k" right? There are only 50k deprecated statements in total. How many of these are involved in a constraint violation? A few hundred, maybe? And how many of these reports are semantically correct and worth fixing? A few dozen, maybe?

We discussed this in a larger group on Tuesday and concluded together that this is just not worth the effort, at least not right now. We decided not to show any constraint reports on statements that are deprecated anyway, but not to remove these reports from the API yet. If it turns out that a specific type of constraint is really valuable even on deprecated statements, it can quite easily be re-enabled later.
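The decision amounts to filtering on the display side only, which can be sketched as below. The report dictionaries and their keys (`statement_rank`, `constraint_type`) are hypothetical and do not reflect the actual API response format; the `always_show` parameter is likewise an assumption, standing in for constraint types that might be re-enabled later.

```python
def reports_for_display(api_reports, always_show=frozenset()):
    """Hide reports on deprecated statements, leaving the API data untouched.

    `always_show` lists constraint types that should be shown even on
    deprecated statements (empty by default).
    """
    return [
        report for report in api_reports
        if report["statement_rank"] != "deprecated"
        or report["constraint_type"] in always_show
    ]


# Made-up sample of what the API might return.
api_reports = [
    {"statement_rank": "normal", "constraint_type": "single value"},
    {"statement_rank": "deprecated", "constraint_type": "range"},
    {"statement_rank": "deprecated", "constraint_type": "Commons link"},
]
```

Because the input list is never modified, API consumers still receive all three reports, while the display shows only the first one unless a type is listed in `always_show`.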

The suggestion made in T167653#3406413 is perfectly fine, but it means we would spend a lot more time on this right now.

Okay, I created a bunch of subtasks where we can discuss the details (based on my earlier comment). Especially the tasks without assignee could do with some more input.