
Let wbgetclaims return best-rank statements
Closed, Declined | Public

Description

As a developer using the Wikibase Action API, I want to be able to easily access best-rank statements, in order to follow the same logic as SPARQL queries (wdt:) or Lua modules (getBestStatements()).

Problem:
The wbgetclaims action has a rank parameter, but it only lets you filter by preferred, normal, or deprecated rank. If you want to use best-rank statements (i.e. preferred-rank if any, else normal-rank, never deprecated-rank), you have to implement that logic yourself, at the risk of getting it wrong; this also requires getting the statements for all ranks, or potentially making two network requests (first with rank=preferred, then with rank=normal), adding some network overhead either way. (For certain properties, it’s not uncommon to have tons of non-best-rank statements.)
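For illustration, the two-request workaround described above could look roughly like the following Python sketch. The `fetch` callable is a hypothetical stand-in for whatever HTTP client is in use (it takes the query parameters and returns the decoded JSON response), and the response shape (a `claims` object keyed by property ID) is an assumption based on the existing wbgetclaims output.

```python
from typing import Callable, Dict, List

# Hypothetical fetcher: receives wbgetclaims query parameters and
# returns the decoded JSON response as a dict.
Fetch = Callable[[Dict[str, str]], dict]


def best_statements_two_requests(fetch: Fetch, entity: str, prop: str) -> List[dict]:
    """Emulate best-rank filtering with the existing single-rank filter.

    Asks for preferred-rank statements first and falls back to normal-rank
    ones only if there are none. Deprecated statements are never requested.
    """
    for rank in ("preferred", "normal"):
        response = fetch({
            "action": "wbgetclaims",
            "format": "json",
            "entity": entity,
            "property": prop,
            "rank": rank,
        })
        # Assumption: an empty result may come back as {} or []; treat both as empty.
        claims = response.get("claims") or {}
        statements = claims.get(prop, [])
        if statements:
            return statements
    return []
```

The second request is only made when the first one comes back empty, which is the extra round trip that a server-side best-rank filter would avoid.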

Example:

Screenshots/mockups:

BDD
GIVEN an Item with some preferred-rank, normal-rank and deprecated-rank Statements for the same Property
WHEN I call wbgetclaims with the given Item ID, Property ID, and rank=best
THEN only the preferred-rank Statements are returned

GIVEN an Item with some normal-rank and deprecated-rank Statements for the same Property
WHEN I call wbgetclaims with the given Item ID, Property ID, and rank=best
THEN only the normal-rank Statements are returned

GIVEN an Item with some deprecated-rank Statements for the same Property
WHEN I call wbgetclaims with the given Item ID, Property ID, and rank=best
THEN no Statements are returned
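The three scenarios above can be sketched as a small Python function with one check per scenario. This is only an illustration of the expected behaviour, not the Wikibase implementation; the statement dicts are reduced to a hypothetical `id` plus the `rank` field that the Wikibase JSON representation carries.

```python
from typing import Iterable, List


def best_statements(statements: Iterable[dict]) -> List[dict]:
    """Return preferred-rank statements if any, else normal-rank, never deprecated."""
    statements = list(statements)
    preferred = [s for s in statements if s.get("rank") == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.get("rank") == "normal"]


item = [
    {"id": "s1", "rank": "preferred"},
    {"id": "s2", "rank": "normal"},
    {"id": "s3", "rank": "deprecated"},
]

# Scenario 1: preferred, normal and deprecated present -> only preferred
assert [s["id"] for s in best_statements(item)] == ["s1"]
# Scenario 2: normal and deprecated present -> only normal
assert [s["id"] for s in best_statements(item[1:])] == ["s2"]
# Scenario 3: only deprecated present -> nothing
assert best_statements(item[2:]) == []
```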

Acceptance criteria:

  • the wbgetclaims rank parameter accepts a fourth value, best, which causes the API to only return best-rank statements
  • the documentation of the API parameter values makes it clear that unlike preferred/normal/deprecated, this is not a rank that an individual statement can have
    • (nevertheless, I think adding this to the rank parameter makes more sense than having it as a separate parameter, because it can’t be meaningfully combined with any other rank value)

Open questions:

Event Timeline

Now that I think of it, since this is not a value of rank, I'm not sure it should be passed to the rank query param; perhaps it should be its own boolean query param? I understand that having one parameter set renders the other useless, but in that case IMO we should implement a proper error or warning to alert the requester, rather than add another value that isn't actually valid for that parameter and is in fact a union of two possible values.

I disagree… it is a value that’s valid for that parameter, when we change the parameter accordingly. I don’t think it’s a strict requirement that the API parameter corresponds 1:1 to the string buried deeply inside the Wikibase JSON representation. And having two parameters that can never be used together seems unnecessarily annoying to me (action=query&prop=revisions has a similar problem, for example).

Of course, it's not a requirement, but it contributes to somewhat of a cognitive dissonance for people who are also working on these statements from the UI, where 'best' is not an option. Essentially, all this could be solved if the rank parameter could accept multiple values, something like rank=preferred|normal to get these 'Best Ranked' statements.


But that wouldn’t be the same thing – “all the preferred-rank and all the normal-rank statements” is different from the best-rank statements. Also, it would require users to list the ranks that make up best-rank statements by themselves, and allow them to shoot themselves in the foot by specifying something like rank=preferred|deprecated instead, which they probably don’t want.
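To make the difference concrete, here is a small Python sketch. `filter_by_ranks` is a hypothetical model of what a multi-valued rank filter would return; the sample statements and their `value` field are made up for illustration.

```python
from typing import Iterable, List, Set


def filter_by_ranks(statements: Iterable[dict], ranks: Set[str]) -> List[dict]:
    """What a multi-valued rank filter (e.g. rank=preferred|normal) would return."""
    return [s for s in statements if s["rank"] in ranks]


statements = [
    {"value": "A", "rank": "preferred"},
    {"value": "B", "rank": "normal"},
    {"value": "C", "rank": "normal"},
    {"value": "D", "rank": "deprecated"},
]

union = filter_by_ranks(statements, {"preferred", "normal"})
# A preferred-rank statement exists here, so the best statements are the preferred ones.
best = filter_by_ranks(statements, {"preferred"})
footgun = filter_by_ranks(statements, {"preferred", "deprecated"})

print([s["value"] for s in union])    # ['A', 'B', 'C']
print([s["value"] for s in best])     # ['A']
print([s["value"] for s in footgun])  # ['A', 'D']
```

The union still contains B and C, which are not best-rank statements for this item, and the last combination mixes a preferred statement with a deprecated one.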


Is it different, though? Receiving all the preferred and normal statements (and not the deprecated ones) is exactly what would enable them to properly choose the "Best Statement". Sure, there is still the question of ordering, but this is definitely something we could deal with.


But that still requires extra work by the API user (and adds network overhead if there are many non-best normal-rank statements). When I get a list of best statements, I expect that they’re all equivalent, and I don’t have to look at the rank anymore. (Most of the time, I also expect only a single best-rank statement – e.g. I built commons:Module:Statement on that assumption – but that’s beyond the scope of this API, I’d say.)

Making the rank parameter multi-valued would be possible in general (the same is true for the property parameter, I suppose), but I don’t see it as a solution for the use case of this task. I also think such a change should be mutually exclusive with allowing rank=best, because something like rank=best|normal has no sensible meaning IMHO. (On the other hand, while I still don’t like the separate-flag approach, I’ll concede that a separate best=1 flag and a multi-valued rank parameter could be implemented together – you’d get an error if you used both in the same request, but it’s not conceptually broken in the same way, I’d say.)

Best rank is a fairly important data reuse concept and one that should be standardized across all our access methods, i.e. not every reuser should be required to recompute it themselves.

Right, I agree that it shouldn't be re-computed by the consumer, but after a short discussion with @WMDE-leszek I realized that this "best rank" is not core to the data model that we keep referring to, although it is derived from it in the paragraphs below.

I think we can, however, add a union to the parameter to indicate that we want both ranks and ask to order them by the rank, so that all a re-user has to do is use the top of the list returned by the response from: rank=preferred|normal&orderby=rank. In my opinion, this is the best way to avoid delegating this sifting logic to the client, while maintaining the flexibility and simplicity that the data model affords as it is currently defined. I don't think we have to worry about consumers potentially requesting other ranks of statements since we currently allow them to do so, and giving re-users the opportunity to request statements beyond best-ranked will allow them to create their own interpretations of the data model. For instance, off the top of my head, an application could compare the top two or three statements returned from rank=preferred|normal&orderby=rank to determine which one is truly the "best ranked".

To add to that, if network overhead is truly a concern, then I don't think we should shy away from adding another HTTP API layer (I suppose it would be a module, if we speak in Action API terms) to really just get "the one" best ranked statement. But I don't really know if there's a sufficient business case for it in terms of product.

There can be more than one best ranked statement though.

Right, I agree that it shouldn't be re-computed by the consumer, but after a short discussion with @WMDE-leszek I realized that this "best rank" is not core to the data model that we keep referring to, although it is derived from it in the paragraphs below.

This feels like an artificial distinction to me. It’s part of the data model, and as Lydia said it’s an important concept; why is this “core” subset of the data model so important now?

I think we can, however, add a union to the parameter to indicate that we want both ranks and ask to order them by the rank, so that all a re-user has to do is use the top of the list returned by the response from: rank=preferred|normal&orderby=rank. In my opinion, this is the best way to avoid delegating this sifting logic to the client, while maintaining the flexibility and simplicity that the data model affords as it is currently defined. I don't think we have to worry about consumers potentially requesting other ranks of statements since we currently allow them to do so, and giving re-users the opportunity to request statements beyond best-ranked will allow them to create their own interpretations of the data model. For instance, off the top of my head, an application could compare the top two or three statements returned from rank=preferred|normal&orderby=rank to determine which one is truly the "best ranked".

To me this doesn’t address what I wrote before:

When I get a list of best statements, I expect that they’re all equivalent, and I don’t have to look at the rank anymore.

In my opinion, any solution here that still requires me to look at the rank adds very little value, regardless of what order the API returns statements in. If the API can return preferred-rank and normal-rank statements together, then that’s not best-rank statements, and I still need to implement the logic on my side.
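As a sketch of what "still need to implement the logic on my side" means in practice: even if the API returned preferred-rank and normal-rank statements sorted by rank (the proposed rank=preferred|normal&orderby=rank, which does not exist today), a client would need something like the following hypothetical Python helper to cut the list down to the best-rank statements. It assumes the input contains only preferred and normal statements, with preferred ones first.

```python
from itertools import takewhile
from typing import List


def best_from_rank_sorted(statements: List[dict]) -> List[dict]:
    """Keep only the leading run of statements that share the top rank.

    Assumes the list holds preferred and normal statements only,
    already sorted so that preferred ones come first.
    """
    if not statements:
        return []
    top_rank = statements[0]["rank"]
    # The client still has to inspect each statement's rank to find the cut-off.
    return list(takewhile(lambda s: s["rank"] == top_rank, statements))


mixed = [
    {"value": "A", "rank": "preferred"},
    {"value": "B", "rank": "normal"},
    {"value": "C", "rank": "normal"},
]
print([s["value"] for s in best_from_rank_sorted(mixed)])      # ['A']
print([s["value"] for s in best_from_rank_sorted(mixed[1:])])  # ['B', 'C']
```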


To add to that, if network overhead is truly a concern, then I don't think we should shy away from adding another HTTP API layer (I suppose it would be a module, if we speak in Action API terms) to really just get "the one" best ranked statement. But I don't really know if there's a sufficient business case for it in terms of product.

I was actually starting to think the same thing: a new wbgetbeststatements action could also be an option. (Or wbgetbestclaims for consistency, though IMHO we should stop saying “claims” when we really mean “statements”. Claims are main snak + qualifiers, statements are claim + references; both APIs return full statements, not just the claim part.)

why is this “core” subset of the data model so important now?

The requirements document states that quite clearly (emphasis mine).

"There are three possible ranks"

and

"This model is intentionally left coarse and simple. The three levels translate to different treatments in data access, UI (e.g., what is displayed by default), and export (one could, e.g., have an export with only the preferred and normal Statements). The ranks may also be useful for protecting Statements from editing (e.g., by protecting only preferred and normal statements). More fine-grained rankings do not seem to have such a clear interpretation and would thus increase the UI complexity unnecessarily. Having only two ranks (or no ranks at all), on the other hand, would make it harder to cope with Statements that are not trusted, known to contain wrong claims, or simply unpatrolled (if ranks are used for protection)."

As far as I can see, you can replace UI with any interface for that matter; users are users no matter which interface they choose to access the program's data and models with.

In my opinion, any solution here that still requires me to look at the rank adds very little value

What do you mean by "requires me to look at the rank"? In the ordered solution, you don't really have to look at the rank if you're only interested in "one" preferred statement (which is the actual implementation requirement that brought us to this point). If you mean "know about rank" in general, then neither solution applies, and we should really just go with this additional API layer.

Also, I'm not sure if it's as subjective as just saying "adds very little value" and having a done deal. Which value are you trying to extract here? In my case, I'd like to bring this task forward, and within the scope of this sprint, as it means that the client (and any future clients we develop) will not have to be responsible for logic that doesn't belong in them (order and filter belong at the API level, and I don't think this is really controversial either). The value I emphasize here is complexity, or to be more exact, maintainability and understandability of that particular client system and context, as well as that of the HTTP API.

I was actually starting to think the same thing: a new wbgetbeststatements action could also be an option. (Or wbgetbestclaims for consistency, though IMHO we should stop saying “claims” when we really mean “statements”. Claims are main snak + qualifiers, statements are claim + references; both APIs return full statements, not just the claim part.)

Yes, this might be a lot of work though, right? And that definitely is something @Lydia_Pintscher should call the shots on in terms of priority and planning.


Frankly, this feels a lot like cherry-picking, as you neglected to quote the very next paragraph from DataModel#Ranks_of_Statements:

Another useful concept can be constructed based on the ranks defined above: the "best rank" for the Statements about a given Property with respect to a given Item. If there is at least one Statement with preferred rank about the property (in the context of a given Item), the best rank for that property is preferred. Otherwise, the best rank is normal. Correspondingly, the "best Statements" about a given Property in the context of a given Item are the ones that have the best rank for that Property.

You can't just decide that the first two paragraphs are part of some ad hoc "core" of the data-model and the third paragraph is just supplementary and doesn't matter anymore. It has been made clear that the best-rank concept is in practice core to how our data is used and structured. If this wiki page doesn't sufficiently reflect that yet, then, I think, we should adjust the wording on the wiki page instead.


I, as a user of the API, want to get the best rank values for a property from an item, as defined in the data-model. This could explicitly be one or multiple best-ranked values and I already can get this via Lua. Getting a (sorted) list of preferred and normal-ranked statements does not gain me much, because I still have to look at the rank of each statement to figure out which values are of preferred rank and which are of normal rank.

For this story/subtask we want to get the best-ranked values as defined in the data-model, and then pick the first of them. Getting a (sorted) list of preferred and normal-ranked statements makes things less understandable and hides domain logic instead of making use of it.

In the Action API in particular, we should make sure that we stick to the domain, not least for reasons of understandability. That being said, I don't have a strong opinion yet on whether this should be an extension to wbgetclaims or a new wbgetbeststatements endpoint or something similar.

You can't just decide that the first two paragraphs are part of some ad hoc "core" of the data-model and the third paragraph is just supplementary and doesn't matter anymore.

I am really not sure where this interpretation is coming from. I'm not ignoring this paragraph, I referred to it before in this conversation and I believe it only strengthens my point: "can be constructed based on the ranks defined above" is exactly where I get that there are core values to rank, and that "best" isn't one of them.

At this point, I would like to stop and ask how useful this discussion is. I have not seen an argument that gives any examples of how expecting our interface consumers (that is, users) to know about the three values allowed in rank, and their ordering "preferred then normal, never deprecated", makes the system harder to use, harder to reason about, or even fails to implement some domain logic.

I would suggest focusing this debate on finding a solution rather than just continuing this circular argumentation. Likewise, I'm not here to target anyone, I try to present examples to back up my reasoning, and I would like to receive the same courtesy from you.

"Best rank" aka "truthy" should be a concept built in Wikibase software. We should make data reusers least impacted if additional ranks are added to Wikibase (Example proposal: https://www.wikidata.org/wiki/User:ChristianKl/Draft:New_Ranks, which also change how normal rank statements are regarded).


I think we need to distinguish here between the ranks a statement can have, and how you can filter statements by rank. This task isn’t proposing to add a new rank that a statement can actually have (there are such proposals in the community, including also T210961, but as far as I’m aware none of the proposed ranks are intended to be truthy / included in best-rank, so I would leave them aside here). But when it comes to filtering statements by rank, best-rank is a useful concept, and it’s part of the data model too. In fact I would argue it’s more useful than filtering by any single rank value (or list of rank values): in Lua, we have getBestStatements() and getAllStatements(), but getAllStatements() doesn’t have a rank filter, so best-rank filtering is the only kind of rank filter we make easy.

What do you mean by "requires me to look at the rank"? In the ordered solution, you don't really have to look at the rank if you're only interested in "one" preferred statement (which is the actual implementation requirement that brought us to this point). If you mean "know about rank" in general, then neither solution applies, and we should really just go with this additional API layer.

But there’s a difference between expecting only a single best-rank statement (while still being able to detect other best-rank statements, to flag such cases as unexpected, e.g. putting them in a maintenance category: that’s why Module:Statement returns a hasOther flag), and always picking the first best-rank statement regardless of whether any others exist.
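A minimal Python sketch of that distinction, loosely modelled on the hasOther idea mentioned above (it is not the actual Module:Statement code, and the `SingleBest` type and `single_best` name are hypothetical): the caller gets one statement plus a flag saying whether other best-rank statements were present, instead of silently dropping them.

```python
from typing import List, NamedTuple, Optional


class SingleBest(NamedTuple):
    statement: Optional[dict]  # the first best-rank statement, if any
    has_other: bool            # True if further best-rank statements exist


def single_best(best_statements: List[dict]) -> SingleBest:
    """Pick one best-rank statement but report whether it was the only one."""
    if not best_statements:
        return SingleBest(None, False)
    return SingleBest(best_statements[0], len(best_statements) > 1)


result = single_best([
    {"value": "A", "rank": "preferred"},
    {"value": "B", "rank": "preferred"},
])
if result.has_other:
    # e.g. add the page to a maintenance category instead of ignoring "B"
    print("unexpected: more than one best-rank statement")
```

This only works if the input is the full list of best-rank statements, which is why an interface that returns just the first one would not be enough for this use case.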

Also, there are plenty of use cases for multiple best-rank statements too, even if that’s not what we happen to need in T298142. All the other best-rank interfaces we provide that I’m aware of – in Lua and RDF – produce a list of best-rank statements, not a single one.

The value I emphasize here is complexity, or to be more exact, maintainability and understandability of that particular client system and context, as well as that of the HTTP API.

I think it’s likely that adding support for filtering by multiple ranks, and then also sorting by rank, actually adds more complexity to the system than returning best-rank statements (which we already have code for: StatementList::getBestStatements()).

After a discussion within the team, we have decided not to tackle this feature at this point in time and in the scope of this project. We will, however, keep this feature in mind for a later date and hopefully will be able to implement it in the new lexeme special page to mitigate any remaining tech debt.