
Extract operator/nodes/triples/paths/exprs list from queries
Open, Low, Public


Augment the query-analysis QueryInfo with a list of operators, nodes, and paths (and possibly expressions), populated in AST-visit order and saved in Parquet.
One complication of this task is finding a common representation, suitable for Parquet, for these different kinds of items.

Event Timeline

Idea on how to store the SPARQL query as a list:
Let's make a list of a generic custom class QueryElem[T]. QueryElem contains elemType: String and elem: T.

A class needs to be created for each element type, e.g. NodeClass extends QueryElem. Class definitions for all elements are given below:

For nodes:
elemType = "Node", elem: NodeInfo = some_node (NodeInfo is a case class containing NodeType and NodeValue like ("NODE_VAR", "x") )

For expression:
elemType = "expression"
Expressions can get quite convoluted: they can involve 1 variable, 2 variables, or n variables, like BIND("AK" as ?x), (?x+?y as ?z), and (REGEX("[abc]*") as ?x) respectively. Moreover, they can nest very deeply as well, like FILTER(?x==1 || ?y==2 || ?z==3).
I am not entirely sure how to represent expressions.

For BGP:
elemType = "BGP", elem: List[TripleInfo] = List(triple1, triple2, triple3, triple4, ...) (TripleInfo contains NodeInfo for Sub, Pred and Obj)

For services:
elemType = "service", serviceName:"service_name", elem: BGP (service_name like wikibase:label)

For tables:
elemType = "table", elem: TableData

TableData is: tableVars: List[NodeInfo], tableRow: List[Row]
Row is: List[NodeInfo]

For paths (sub path obj) :
A path predicate is already identified as PATH in NodeType, so we can treat paths as ordinary triples, or create a special pathTriple:
elemType = "pathTriple", elem: TripleInfo

For filters:
elemType = "filter", elem: Expression (Expression class as described above)

For extends:
elemType = "extend", elem: Expression, expVar: NodeInfo (Expression class as described above)
e.g. (?x+?y as ?z): here ?z is the expVar and elem is ?x+?y
elem can be a single node as well: BIND("AK" as ?x)

Could it be anything else? This requires more thinking; I am not sure what to put in elem for extends.

Op Names:
elemType = "operations", elem = "join" (elem can be join, optional, project, etc. Sometimes elem will be redundant, as for BGP, path, table, etc., which have their own classes)
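
The element classes listed above could be sketched in Scala roughly as follows. This is only an illustration of the idea, not an agreed design: the subclass names (NodeElem, BgpElem, and so on) and the field names inside NodeInfo, TripleInfo and TableData are assumptions, and filter/extend are left out because the Expression representation is still undecided.

```scala
// Sketch only: subclass and field names are hypothetical.
object QueryElemSketch {
  // e.g. NodeInfo("NODE_VAR", "x")
  final case class NodeInfo(nodeType: String, nodeValue: String)
  // Subject, predicate and object of a triple
  final case class TripleInfo(sub: NodeInfo, pred: NodeInfo, obj: NodeInfo)
  // A table: its variables plus rows, each row being a list of nodes
  final case class TableData(tableVars: List[NodeInfo], tableRows: List[List[NodeInfo]])

  // Generic element: a type tag plus a payload of type T
  sealed trait QueryElem[T] {
    def elemType: String
    def elem: T
  }

  final case class NodeElem(elem: NodeInfo) extends QueryElem[NodeInfo] {
    val elemType = "Node"
  }
  final case class BgpElem(elem: List[TripleInfo]) extends QueryElem[List[TripleInfo]] {
    val elemType = "BGP"
  }
  final case class ServiceElem(serviceName: String, elem: BgpElem) extends QueryElem[BgpElem] {
    val elemType = "service"
  }
  final case class TableElem(elem: TableData) extends QueryElem[TableData] {
    val elemType = "table"
  }
  final case class PathTripleElem(elem: TripleInfo) extends QueryElem[TripleInfo] {
    val elemType = "pathTriple"
  }
  final case class OperationElem(elem: String) extends QueryElem[String] {
    val elemType = "operations"
  }

  // A tiny example list, in the order the elements would be visited
  val x: NodeInfo = NodeInfo("NODE_VAR", "x")
  val p: NodeInfo = NodeInfo("NODE_URI", "wdt:P31")
  val y: NodeInfo = NodeInfo("NODE_VAR", "y")
  val elems: List[QueryElem[_]] =
    List(OperationElem("join"), BgpElem(List(TripleInfo(x, p, y))), NodeElem(x))
}
```

Note that the list has to be typed as List[QueryElem[_]], since each element carries a different payload type.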

Let me know whether I am missing anything and, if so, what. How else can we represent a query as a list?

The problem I see with using a generic class in the QueryElem object is the conversion to Parquet. I don't think it will work out of the box, which means we would have to devise our own conversion. Let's brainstorm ideas on this, possibly in a meeting to make it faster :)
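
One possible direction to brainstorm, sketched below under assumptions: drop the type parameter and use a single non-generic record with a type tag and a set of optional fields, of which only the one matching elemType is filled in. The name FlatQueryElem and its fields are hypothetical, and whether this maps cleanly to the Parquet schema we want still needs to be checked.

```scala
// Sketch only: FlatQueryElem and its fields are hypothetical names.
object FlatElemSketch {
  final case class NodeInfo(nodeType: String, nodeValue: String)
  final case class TripleInfo(sub: NodeInfo, pred: NodeInfo, obj: NodeInfo)

  // One fixed, non-generic shape for every element of the list.
  // Only the field that corresponds to elemType is populated.
  final case class FlatQueryElem(
    elemType: String,
    node: Option[NodeInfo] = None,       // set when elemType == "Node"
    triples: Seq[TripleInfo] = Seq.empty, // set when elemType == "BGP"
    opName: Option[String] = None         // set when elemType == "operations"
  )

  val x: NodeInfo = NodeInfo("NODE_VAR", "x")
  val p: NodeInfo = NodeInfo("NODE_URI", "wdt:P31")
  val y: NodeInfo = NodeInfo("NODE_VAR", "y")

  val elems: Seq[FlatQueryElem] = Seq(
    FlatQueryElem("operations", opName = Some("join")),
    FlatQueryElem("BGP", triples = Seq(TripleInfo(x, p, y))),
    FlatQueryElem("Node", node = Some(x))
  )
}
```

The trade-off is that every row carries mostly empty fields, in exchange for a schema that is the same for all element types.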

Update 1 June 2021:

Had a chat with @JAllemandou and, based on the Wikidata Checkpoint Meeting of 27/5/2021, we will take up this ticket later as required. For now, we focus on productionizing the existing data extracted from SPARQL queries and getting the data flowing (T273854).

We will need more info on how to flatten the AST, but so far we have talked about making a simple list of tuples. The order of the list reflects how the AST was traversed, and each element in the list is a tuple of Type and Value.
e.g. (operator, join), (filter, ?x+?y = ?z), (node_var, x), (extend, ?x+?y as ?z) etc.
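
A minimal sketch of the tuple-list idea, assuming a toy algebra rather than the real query AST: the Op classes below are hypothetical stand-ins, expressions and triples are kept as plain strings, and the traversal is a simple pre-order visit.

```scala
// Sketch only: the Op hierarchy is a toy stand-in for the real AST.
object FlattenSketch {
  sealed trait Op
  final case class Bgp(triples: List[(String, String, String)]) extends Op
  final case class Join(left: Op, right: Op) extends Op
  final case class Filter(expr: String, sub: Op) extends Op
  final case class Extend(expr: String, sub: Op) extends Op

  // Pre-order visit: emit the current operator, then its children left to right.
  def flatten(op: Op): List[(String, String)] = op match {
    case Join(l, r)     => ("operator", "join") :: flatten(l) ::: flatten(r)
    case Filter(e, sub) => ("filter", e) :: flatten(sub)
    case Extend(e, sub) => ("extend", e) :: flatten(sub)
    case Bgp(ts) =>
      ("operator", "bgp") :: ts.map { case (s, p, o) => ("triple", s"$s $p $o") }
  }

  val query: Op =
    Filter("?x+?y = ?z",
      Join(
        Bgp(List(("?a", "wdt:P31", "?x"))),
        Extend("?x+?y as ?z", Bgp(List(("?x", "wdt:P279", "?y"))))))

  val flat: List[(String, String)] = flatten(query)
}
```

Because both tuple members are strings, a list like this has a fixed, simple schema, at the cost of losing the structure inside expressions.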