The current analysis parses queries and extracts:
- Operators (list, plus a map of usage counts)
- Nodes (variables, URIs, literals, blank nodes), as a map of usage counts
- Prefixes (map of usage counts)
- Services (map of usage counts)
- Wikidata names (URIs whose main value matches the regex "^[QP]\\d+$")
- Expressions
- Paths
The values used to identify operators, expressions, paths or nodes are strings: either the detailed name (for operators or nodes, for instance) or the full printed form of the subtree (for paths or expressions, for instance).
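As an illustration of the usage maps and of the Wikidata-name filter, here is a minimal Python sketch. It is not the analyzer's actual code: the helper names `local_name` and `wikidata_names` are hypothetical, and it assumes that the "main value" of a URI is whatever follows the last `/` or `#`.

```python
import re
from collections import Counter

# Same pattern as in the list above, written as a Python raw string
# (so the backslash does not need to be doubled).
WIKIDATA_NAME = re.compile(r"^[QP]\d+$")


def local_name(uri: str) -> str:
    """Return the part of a URI after the last '/' or '#'.

    Assumption: this is what the note calls the "main value" of a URI.
    """
    return re.split(r"[/#]", uri)[-1]


def wikidata_names(uris):
    """Map each Wikidata-looking name (Q... or P...) to its number of usages."""
    names = (local_name(u) for u in uris)
    return Counter(n for n in names if WIKIDATA_NAME.match(n))


# Made-up input, standing in for the URIs collected from parsed queries.
uris = [
    "http://www.wikidata.org/entity/Q42",
    "http://www.wikidata.org/prop/direct/P31",
    "http://www.wikidata.org/prop/direct/P31",
    "http://www.w3.org/2000/01/rdf-schema#label",
]
counts = wikidata_names(uris)
```

The same `Counter` shape would work for the other maps (operators, prefixes, services), keyed by whichever string identifies the element.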
One thing our analysis badly lacks is triple-pattern information: for each triple pattern encountered, which form it takes (<? - P - O> or <S - P - ?>, for instance) and which defined values it embeds (URIs, literals, etc.). With that information we should be able to describe triple-pattern usage in queries more precisely, and possibly get a better feel for which subgraphs are heavily used.
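One possible shape for that extraction, sketched in Python under stated assumptions: the `Term` type, its `kind` labels, and the `form` and `bound_values` helpers are all hypothetical, and the prefixed names in the sample data are made up. Blank nodes are counted as unbound here because they behave like variables inside a query pattern; that is a choice of this sketch, not something the note specifies.

```python
from collections import Counter
from typing import NamedTuple


class Term(NamedTuple):
    kind: str   # hypothetical labels: "var", "uri", "literal" or "bnode"
    value: str


UNBOUND = ("var", "bnode")  # kinds rendered as '?' in the form


def form(triple) -> str:
    """Return the pattern form, e.g. '<? - P - O>'.

    Each position shows '?' when unbound, otherwise its letter (S, P or O).
    """
    slots = ["?" if t.kind in UNBOUND else letter
             for letter, t in zip("SPO", triple)]
    return "<" + " - ".join(slots) + ">"


def bound_values(triple):
    """List (position, kind, value) for every defined value in the pattern."""
    return [(letter, t.kind, t.value)
            for letter, t in zip("SPO", triple)
            if t.kind not in UNBOUND]


# Made-up triple patterns, standing in for those met while walking a query.
triples = [
    (Term("var", "item"), Term("uri", "wdt:P31"), Term("uri", "wd:Q5")),
    (Term("var", "item"), Term("uri", "wdt:P31"), Term("uri", "wd:Q146")),
    (Term("uri", "wd:Q42"), Term("uri", "wdt:P19"), Term("var", "place")),
]

form_counts = Counter(form(t) for t in triples)
value_counts = Counter(v for t in triples for v in bound_values(t))
```

Counting forms and bound values separately keeps the result in the same "map of usage counts" shape as the existing outputs; keying on the pair (form, bound values) instead would be one way to approach the heavily used subgraphs.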