Grouping by property is not powerful enough for some use-cases
Open, Needs Triage · Public

Description

For both the SOAP and VG dashboards, we expected a direct relationship between the items and the group: P195 (collection) is on each artwork item; P400 (platform) is on each game item, etc.

The idea was to have a very simple interface (i.e., simple parameters) so that folks with little SPARQL-fu could easily use it.

By doing so, however, the kind of grouping possible was arbitrarily restricted.

Typically, in some use cases, the grouping is one (or more) items away. For example, building a coverage dashboard of French churches per department: on the church item, P131 points to the city, and from the city P131 points to the department.

(An added complexity here is that the city will have more than one value for P131, which means we need to restrict further in SPARQL.)
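To make the church example concrete, here is a minimal sketch of the grouping fragment it would need. The helper name `two_hop_grouping`, the `?intermediate` variable and the use of Q6465 (assumed to be the "department of France" class) are illustrative assumptions, not anything the tool currently provides.

```python
def two_hop_grouping(first_hop: str, second_hop: str, grouping_class: str) -> str:
    """Return a SPARQL fragment binding ?grouping two statements away from ?entity.

    The last triple keeps only groupings of the wanted class, which handles
    intermediate items (e.g. cities) carrying several values for the second hop.
    """
    return "\n".join([
        f"?entity wdt:{first_hop} ?intermediate .",
        f"?intermediate wdt:{second_hop} ?grouping .",
        f"?grouping wdt:P31 wd:{grouping_class} .",
    ])


# Churches per French department: church -> city -> department.
# Q6465 is assumed to be the "department of France" class; check before use.
fragment = two_hop_grouping("P131", "P131", "Q6465")
print(fragment)
```

The point of the third triple is that a plain `wdt:P131/wdt:P131` path would also pick up every other administrative unit the city is located in.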


The general use case is reasonable overall. However, I’m unsure how to build a path towards it that maintains the original goal of accessibility:

  • |grouping_property=P195 → easy ;
  • |grouping_sparql=?entity wdt:P195 ?grouping . → less so

(Would also need to think through how to reconstruct the no-group query if allowing 'free-form' input).
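One possible way to reconstruct the no-group query from free-form input is to wrap the whole fragment in MINUS. The sketch below is only an assumption about how that could look: the function names are hypothetical, the generated queries are simplified, and the selector is borrowed from the kind of query used on these dashboards.

```python
def grouped_query(selector: str, grouping: str) -> str:
    """Count matching entities per ?grouping value."""
    return (
        "SELECT ?grouping (COUNT(DISTINCT ?entity) AS ?count) WHERE {\n"
        f"  {selector}\n"
        f"  {grouping}\n"
        "} GROUP BY ?grouping"
    )


def ungrouped_query(selector: str, grouping: str) -> str:
    """Count matching entities for which the grouping fragment matches nothing."""
    return (
        "SELECT (COUNT(DISTINCT ?entity) AS ?count) WHERE {\n"
        f"  {selector}\n"
        f"  MINUS {{ {grouping} }}\n"
        "}"
    )


# Hypothetical inputs: any selector and any fragment using ?entity and ?grouping.
selector = "?entity wdt:P17 wd:Q38 ; wdt:P625 [] ."
grouping = "?entity wdt:P195 ?grouping ."
print(grouped_query(selector, grouping))
print(ungrouped_query(selector, grouping))
```

Both queries take the same fragment, so the simple `|grouping_property=` case and a free-form case would go through one code path once the property has been expanded into a triple.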

Event Timeline

For grouping by some sub-national level, maybe a separate query to get levels first is the most efficient. A shorthand for that could be a P31 statement or a property (if there is one).

Below is a sample for Italy that could replace one that times out:

SELECT ?grouping (COUNT(DISTINCT ?entity) as ?count) (SAMPLE(?entity) as ?sample) 
WITH
{
    SELECT DISTINCT ?grouping {
        { ?grouping wdt:P31 wd:Q15089 } UNION { ?grouping wdt:P31 wd:Q15110 }
        MINUS { ?grouping wdt:P576 [] }
    }
} as %groupings
WHERE 
{ 
  INCLUDE %groupings 
  ?entity wdt:P17 wd:Q38; wdt:P625 []. 
  ?entity wdt:P131/wdt:P131* ?grouping . 
} 
GROUP BY ?grouping 
HAVING (?count > 100) 
ORDER BY DESC(?count) 
LIMIT 1000

For (non-contemporary) people, I think an interesting grouping could be by century, but I'm not entirely sure how that could work without a dedicated statement.

As mentioned on Facebook, I agree that the tool should retain grouping_property to make using this tool easy for simple cases. But for more complex types of grouping, I think having a separate field like grouping_sparql (that will be used in case grouping_property is not present) would be nice.

I was able to "abuse" the current tool because grouping_property is currently not being sanitized to match the pattern /^P[0-9]+$/ so I was able to insert arbitrary SPARQL to group the selected items by whatever I want (basically, the SPARQL version of the SQL injection exploit). However, I had to use really weird SPARQL clauses so that my "hack" works with both the positive and MINUS (for the "No grouping" row) SPARQL queries.
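For reference, the check that is currently missing could look roughly like the sketch below. It only illustrates the pattern mentioned above; `is_plain_property` is a hypothetical name, not a function in the tool.

```python
import re

# Matches a bare Wikidata property ID such as "P195" and nothing else.
PROPERTY_ID = re.compile(r"P[0-9]+")


def is_plain_property(value: str) -> bool:
    """Return True only when the whole value is a property ID."""
    # fullmatch anchors both ends, so no explicit ^ and $ are needed
    # (and a trailing newline cannot slip through).
    return PROPERTY_ID.fullmatch(value) is not None


for candidate in ["P195", "P131/wdt:P131", "P195\n"]:
    print(repr(candidate), is_plain_property(candidate))
```

With such a check in place, anything that is not a plain property ID would have to go through a dedicated free-form parameter instead.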

Trying to chart a path here:

  1. Do nothing. Assume that SPARQL injection (either "easy ones" like Maarten’s |grouping_property=P131/wdt:P131 or more complex ones like |grouping_property=p:P195 [ ps:P195 ?id ; pq:P2868 wd:Q29188408 ]) is enough.
  2. Allow arbitrary SPARQL:

One year later, some more thoughts:

  1. "SPARQL injection" has proven very powerful − see the examples at https://www.wikidata.org/wiki/Wikidata:Tools/inteGraality#Advanced_grouping
  2. Yet, there are some things that are not possible, or prohibitively complicated:
    • the “Restrict the groupings” example is very clever, but it plainly breaks the lookup queries
    • the constructs are "unnatural", using reverse path expressions and so on
    • it would still not be suitable for lexemes (see T294889), which might rather be grouped using predicates like wikibase:lexicalCategory
  3. (Also, for T236590, I have already carved a special grouping behaviour.)

Considering the above, I think the correct way forward is indeed to allow a more free-form grouping, through an additional parameter grouping_expression or grouping_sparql. There would be two possibilities:

  1. Soft free-form, where the fragment would be inserted between the variables, à la ?entity <grouping_expression> ?grouping
  2. Complete free-form, where the fragment is used as is − and it’s up to the user to use the correct variables ?entity and ?grouping at the right spots.

I am leaning towards the complete free-form − to allow users as much flexibility as possible, and have them deal with the complexity (which is fair as it is the power mode). The soft form would still disallow some use cases (like the "restrict grouping") for no good reason.
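To illustrate the difference between the two options, here is a rough sketch. The function names and the variable check are assumptions made for the example, not the planned implementation.

```python
REQUIRED_VARIABLES = ("?entity", "?grouping")


def soft_free_form(expression: str) -> str:
    """Insert the user's expression between the two fixed variables."""
    return f"?entity {expression} ?grouping ."


def complete_free_form(fragment: str) -> str:
    """Use the user's fragment as is, after a naive substring check
    that both expected variables appear somewhere in it."""
    missing = [name for name in REQUIRED_VARIABLES if name not in fragment]
    if missing:
        raise ValueError(f"fragment does not use: {', '.join(missing)}")
    return fragment


# A path fits the soft form...
print(soft_free_form("wdt:P131/wdt:P131"))
# ...but restricting the groupings needs a second triple pattern,
# which only the complete form can carry.
print(complete_free_form("?entity wdt:P131 ?grouping . ?grouping wdt:P31 wd:Q15089 ."))
```

The soft form can only ever produce a single triple pattern, which is why it cannot express the "restrict grouping" case.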

Unless there is opposition, that’s the plan.