Page MenuHomePhabricator

Duplicates in the result set
Closed, ResolvedPublic

Description

As an editor I want to obtain a unique result set in order to query for data with more than one conditions.

Problem:
We currently allow duplicate item and itemLabels to display in the result. For a limit= 5 (say), the query result is duplicated five times.

Screenshots/mockups:

BDD
GIVEN a visual query
WHEN running the query
THEN the result set only contains unique results

Acceptance criteria:

  • Omit the duplicates in the resultset

Event Timeline

amy_rc updated the task description. (Show Details)
amy_rc removed a subscriber: Aklapper.
amy_rc added a subscriber: Aklapper.

So I looked at this. It's a bigger problem in general and it's due to the way we handle "not matching". It's slightly complex, so bear with me.

so the query that query builder produces is this: https://w.wiki/unm (with some modifications):

SELECT ?item ?itemLabel ?instance ?instanceLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
  ?item (p:P106/ps:P106/(wdt:P279*)) wd:Q2526255.
  ?item (p:P31/ps:P31/(wdt:P279*)) ?instance.
  FILTER(?instance != wd:Q5)
}
LIMIT 5

What is happening? it's basically going up the ladder of P279 for not matching, so P279 of Q5 and its P279 and so on, so each item becomes several rows (as wdqs is a graph db of a triples not items).
So the result would be:

wd:Q7251Alan Turingwd:Q103940464continuant
wd:Q7251Alan Turingwd:Q99527517collection entity
wd:Q7251Alan Turingwd:Q53617489independent continuant
wd:Q7251Alan Turingwd:Q28813620set
wd:Q7251Alan Turingwd:Q27043950anatomical entity
wd:Q7251Alan Turingwd:Q16887380group
wd:Q7251Alan Turingwd:Q26720107subject of a right
wd:Q7251Alan Turingwd:Q35120entity
wd:Q7251Alan Turingwd:Q23958946individual entity
wd:Q7251Alan Turingwd:Q159344heterotroph
wd:Q7251Alan Turingwd:Q7239organism
wd:Q7251Alan Turingwd:Q24229398agent
wd:Q7251Alan Turingwd:Q18336849item with given name property
wd:Q7251Alan Turingwd:Q830077subject
wd:Q7251Alan Turingwd:Q795052individual
wd:Q7251Alan Turingwd:Q45983014organisms by adaptation
wd:Q7251Alan Turingwd:Q72638consumer
wd:Q7251Alan Turingwd:Q3778211legal person
wd:Q7251Alan Turingwd:Q215627person
wd:Q7251Alan Turingwd:Q164509omnivore
wd:Q7251Alan Turingwd:Q154954natural person
wd:Q7251Alan Turingwd:Q5human

And it only removes the last line (and leaves the rest) making the query both incorrect and full of duplicates.

I talked to @Lucas_Werkmeister_WMDE and came up with several solutions but each has pros and cons.

One:
https://w.wiki/unp

SELECT DISTINCT ?item ?itemLabel ?instance ?instanceLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
  ?item (p:P106/ps:P106/(wdt:P279*)) wd:Q2526255.
  ?item (p:P31/ps:P31) ?class.
  MINUS { ?item (p:P31/ps:P31/(wdt:P279)*) wd:Q5. }
}
LIMIT 5

Basically take every one who has P31, and remove anything that has the Q5 in the P279 ladder

Pros:

  • Correct

Con:

  • It times out

Two: https://w.wiki/unu
The other way to handle it is to actually discard P279 ladder for "not matching" part.

SELECT DISTINCT ?item ?itemLabel ?instance ?instanceLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
  ?item (p:P106/ps:P106/(wdt:P279*)) wd:Q2526255.
  MINUS {?item p:P31/ps:P31 wd:Q5. }
}
LIMIT 5

Pros:

  • It's fast

Cons:

  • It's limited, If I want to filter out galaxies from my result, it wouldn't exclude spiral galaxies, etc.

I don't know which way to go. I think @Lydia_Pintscher should decide here.

One:
https://w.wiki/unp

SELECT DISTINCT ?item ?itemLabel ?instance ?instanceLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
  ?item (p:P106/ps:P106/(wdt:P279*)) wd:Q2526255.
  ?item (p:P31/ps:P31) ?class.
  MINUS { ?item (p:P31/ps:P31/(wdt:P279)*) wd:Q5. }
}
LIMIT 5

Basically take every one who has P31, and remove anything that has the Q5 in the P279 ladder

Pros:

  • Correct

Con:

  • It times out

You can rearrange that query to optimize (compare T166139):

SELECT DISTINCT ?item ?itemLabel ?instance ?instanceLabel WHERE {
  {
    SELECT DISTINCT ?item ?instance WHERE {
      ?item (p:P106/ps:P106/(wdt:P279*)) wd:Q2526255.
      ?item (p:P31/ps:P31) ?instance.
      MINUS { ?item (p:P31/ps:P31/(wdt:P279)*) wd:Q5. }
    }
    LIMIT 5
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Then it doesn’t time out anymore (though it’s still expensive).

Edit: Note that this will still give you duplicate results when items have multiple P31 statements (fictional human + television character, for example).

Query tried during the discussion call:

SELECT DISTINCT ?item ?itemLabel ?instance ?instanceLabel WHERE {
  {
    SELECT DISTINCT ?item ?instance WHERE {
      ?item (p:P106/ps:P106/(wdt:P279*)) wd:Q2526255.
      ?item (p:P31/ps:P31) ?instance.
      MINUS { ?item (p:P31/ps:P31/(wdt:P279)*) wd:Q5. }
    }
    LIMIT 5
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}