Grouping by property is not powerful enough for some use-cases
Open, Needs Triage · Public

Assigned To

None

Authored By

	JeanFred
	May 23 2019, 2:38 PM

Description

For both the SOAP and VG dashboards, we expected a direct relationship between the items and the group: P-195 (collection) is on each artwork item; P-400 (platform) is on each game item, etc.

The idea was to have a very simple interface (i.e., simple parameters) so that folks with little SPARQL-fu could easily use it.

By doing so, however, the kind of grouping possible was arbitrarily restricted.

In some use cases, the grouping is one (or more) items away. For example, building a coverage dashboard of French churches per department: on the church item, P-131 points to the city, and from the city P-131 points to the department.

(An added complexity here is that the city will have more than one value of P-131, which means we need to restrict further in SPARQL.)
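To make the "one or more items away" case concrete, here is a minimal Python sketch that chains properties into a SPARQL property path. The helper name hop_pattern is hypothetical and not part of inteGraality; the ?entity and ?grouping variable names follow the ones used in this task. It does not address the multiple-values problem noted above.

```python
def hop_pattern(properties):
    """Build a triple pattern that walks the given properties in order.

    Hypothetical helper for illustration only; not inteGraality code.
    """
    if not properties:
        raise ValueError("at least one property is needed")
    # A SPARQL property path joins successive hops with "/".
    path = "/".join(f"wdt:{pid}" for pid in properties)
    return f"?entity {path} ?grouping ."


# Direct relationship: collection on each artwork item.
direct = hop_pattern(["P195"])

# Two hops: church -> city -> department, both through P131.
two_hops = hop_pattern(["P131", "P131"])
```

With a single property this yields the same shape as the existing direct grouping; with two it yields the church-to-department path.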

The general use case is reasonable overall; however, I’m unsure how to build a path towards it that maintains the original goal of accessibility:

|grouping_property=P195 → easy ;
|grouping_sparql=?entity wdt:P195 ?grouping . → less so

(Would also need to think through how to reconstruct the no-group query if allowing 'free-form' input).

Related Objects

Mentioned In: T236590: Allow grouping by date properties
Mentioned Here: T294889: Assess whether integraality works/makes sense for Lexemes
T236590: Allow grouping by date properties

Event Timeline

JeanFred created this task. May 23 2019, 2:38 PM

Restricted Application added a subscriber: Aklapper. May 23 2019, 2:38 PM

JeanFred added a subscriber: Ayack. May 23 2019, 2:38 PM

JeanFred updated the task description. May 23 2019, 2:40 PM

JeanFred updated the task description. May 23 2019, 2:57 PM

For grouping by some sub-national level, maybe a separate query to get levels first is the most efficient. A shorthand for that could be a P31 statement or a property (if there is one).

Below is a sample for Italy that could replace one that times out:

SELECT ?grouping (COUNT(DISTINCT ?entity) as ?count) (SAMPLE(?entity) as ?sample) 
WITH
{
    SELECT DISTINCT ?grouping {
        { ?grouping wdt:P31 wd:Q15089 } UNION { ?grouping wdt:P31 wd:Q15110 }
        MINUS { ?grouping wdt:P576 [] }
    }
} as %groupings
WHERE 
{ 
  INCLUDE %groupings 
  ?entity wdt:P17 wd:Q38; wdt:P625 []. 
  ?entity wdt:P131/wdt:P131* ?grouping . 
} 
GROUP BY ?grouping 
HAVING (?count > 100) 
ORDER BY DESC(?count) 
LIMIT 1000

For (non-contemporary) people, I think an interesting grouping could be by century, but I'm not entirely sure how that could work without a dedicated statement.

JeanFred moved this task from Backlog to Enhancements on the Tool-inteGraality board. May 24 2019, 9:00 AM

As mentioned on Facebook, I agree that the tool should retain grouping_property to make using this tool easy for simple cases. But for more complex types of grouping, I think having a separate field like grouping_sparql (that will be used in case grouping_property is not present) would be nice.

I was able to "abuse" the current tool because grouping_property is currently not being sanitized to match the pattern /^P[0-9]+$/ so I was able to insert arbitrary SPARQL to group the selected items by whatever I want (basically, the SPARQL version of the SQL injection exploit). However, I had to use really weird SPARQL clauses so that my "hack" works with both the positive and MINUS (for the "No grouping" row) SPARQL queries.
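For reference, the sanitization described above could look like the following Python sketch. This is an assumption about how such a check might be written, not the tool's actual code; the name is_plain_property is hypothetical. It uses re.fullmatch rather than ^...$ because in Python $ also matches just before a trailing newline.

```python
import re

# Same intent as /^P[0-9]+$/, expressed for re.fullmatch.
PROPERTY_ID = re.compile(r"P[0-9]+")


def is_plain_property(value):
    """Return True only if value is a bare property ID such as 'P195'."""
    return PROPERTY_ID.fullmatch(value) is not None


plain = is_plain_property("P195")
injected = is_plain_property("P131/wdt:P131")
```

A check like this would reject every injected fragment, so it only makes sense if a separate free-form parameter exists for the advanced cases.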

Multichill subscribed. Jan 13 2020, 10:31 PM

I did a SPARQL injection by adding |grouping_property=P131/wdt:P131, see https://www.wikidata.org/wiki/User:Multichill/Windmill_sandbox

JeanFred mentioned this in T236590: Allow grouping by date properties. Apr 16 2021, 10:07 AM

Trying to chart a path here:

Do nothing. Assume that SPARQL injection (either "easy ones" like Maarten’s |grouping_property=P131/wdt:P131 or more complex ones like |grouping_property=p:P195 [ ps:P195 ?id ; pq:P2868 wd:Q29188408 ]) is enough.
Allow arbitrary SPARQL:
- Make sure all current uses have a working alternative in the source code − see https://phabricator.wikimedia.org/source/tool-integraality/browse/master/integraality/property_statistics.py?grep=self.grouping_property
  - For the simple ?entity wdt:{self.grouping_property} ?grouping ., this can probably be straight replaced
  - Make sure this also works for MINUS { ?entity wdt:{self.grouping_property} [] . }
  - Work out how to substitute the grouping in the queries like ?entity wdt:{self.grouping_property} wd:{grouping} (probably mandate a variable name?)
- Decide whether this should be invoked by a separate template parameter (eg the grouping_sparql I used above) or smartly switched around (as suggested by @Lokal_Profil in T236590#6993285).
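The three query shapes listed above can be sketched in Python as follows, assuming the free-form pattern is required to use the variable names ?entity and ?grouping. The function names are hypothetical and the real queries in property_statistics.py may be structured differently; this only illustrates the substitution problem.

```python
import re

# Matches the ?grouping variable but not longer names such as ?groupings.
GROUPING_VAR = re.compile(r"\?grouping\b")


def positive_clause(pattern):
    """Clause for the per-group counts: the pattern is used unchanged."""
    return pattern


def no_group_clause(pattern):
    """Clause for the 'No grouping' row: exclude entities matching the pattern."""
    return f"MINUS {{ {pattern} }}"


def lookup_clause(pattern, grouping):
    """Clause for one specific group: bind ?grouping to a given item ID."""
    if not GROUPING_VAR.search(pattern):
        raise ValueError("pattern must mention ?grouping")
    # A function replacement avoids any backslash processing in the item ID.
    return GROUPING_VAR.sub(lambda _match: f"wd:{grouping}", pattern)


pattern = "?entity wdt:P195 ?grouping ."
no_group = no_group_clause(pattern)
lookup = lookup_clause(pattern, "Q19675")  # arbitrary item ID for illustration
```

For the simple case, the MINUS clause keeps ?grouping where the current code uses a blank node; since only ?entity is shared with the outer query, this should exclude the same entities.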

One year later, some more thoughts:

"SPARQL injection" has proven very powerful − see the examples at https://www.wikidata.org/wiki/Wikidata:Tools/inteGraality#Advanced_grouping
Yet, there are some things that are not possible, or prohibitively complicated:
- the “Restrict the groupings” example is very clever, but it plain breaks the lookup queries
- the constructs are "unnatural", using reverse path expressions and so on
- it would still not be suitable for lexemes (see T294889), which might rather be grouped using predicates like wikibase:lexicalCategory
(Also, for T236590, I have already carved out a special grouping behaviour.)

Considering the above, I think the correct way forward is indeed to allow a more free-form grouping, through an additional parameter grouping_expression or grouping_sparql. There would be two possibilities:

Soft free-form, where the fragment would be inserted between the variables, à la ?entity <grouping_expression> ?grouping
Complete free-form, where the fragment is used as is − and it’s up to the user to use the correct variables ?entity and ?grouping at the right spots.

I am leaning towards the complete free-form − to allow users as much flexibility as possible, and have them deal with the complexity (which is fair as it is the power mode). The soft form would still disallow some use cases (like the "restrict grouping") for no good reason.
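A minimal Python sketch of the two options, to show where they differ. The names soft_form and complete_form are hypothetical, and the variable check is only one possible safeguard. The example fragment reuses the path and the P31 value from the Italy query earlier in this task.

```python
REQUIRED_VARS = ("?entity", "?grouping")


def soft_form(expression):
    """Soft free-form: the user supplies only what goes between the variables."""
    return f"?entity {expression} ?grouping ."


def complete_form(fragment):
    """Complete free-form: the fragment is used as is.

    The only check here is that both expected variables appear somewhere.
    """
    missing = [var for var in REQUIRED_VARS if var not in fragment]
    if missing:
        raise ValueError("fragment is missing: " + ", ".join(missing))
    return fragment


# Expressible in both forms: a two-hop path.
soft = soft_form("wdt:P131/wdt:P131")

# Only expressible in the complete form: a second triple restricting the groupings.
restricted = complete_form(
    "?entity wdt:P131/wdt:P131* ?grouping . ?grouping wdt:P31 wd:Q15089 ."
)
```

The soft form cannot carry the second triple, which is the limitation mentioned above for the "restrict grouping" case.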

Unless there is opposition, that’s the plan.

JeanFred moved this task from Enhancements to Needs input on the Tool-inteGraality board. Nov 13 2021, 9:09 PM
