Page MenuHomePhabricator

When replacing Blazegraph, need to understand how the proprietary GAS Service is used, in order to replace it
Closed, ResolvedPublic

Description

Need to determine how the GAS SERVICE is used, and what could replace it with minimal impact.

There are examples of BFS (breadth first search) usage at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization#GAS_Service. And, a single example of SSSP (shortest path) usage at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples.

Much more information from the community is needed.

Event Timeline

@Jheald @Mahir256 Was it one of you who had this query trying to connect the longest continuous train line? Does that use this service?

Yes, w.wiki/bKf (for all train lines emanating from St Pancras) and w.wiki/4xyY (for the path from Narvik to Singapore) use that service.

This comment was removed by AWesterinen.

There are a few more examples of queries using the service that can be found by searching the archives of the Request-a-Query page, here, and a few more still if the search is widened further to all wiki pages on wikidata (here; limited to pages in English, to suppress translation duplicates).

I haven't completely looked through all the search returns exhaustively; but so far all the ones I have looked at are essentially similar to the two railway queries posted by Mahir above -- i.e. either
(i) find all the nodes that can be reached using a particular choice of properties from a particular starting point, with a count of how many hops they are away;
or alternatively
(ii) return the nodes that are on the shortest path from A to some particular B.

All of these queries I have seen use either the BFS "breadth first search" or the SSSP "single source shortest path" gas service, with the two seemly completely interchangeable -- the only difference seemingly to be that BFS returns a whole number for the number of hops found to each node, whereas SSSP seems to return a real number. But the performance of the two seems to be entirely similar for the queries they are being used on, wiith identical results if BFS is replaced by SSSP or vice-versa.

In some limited cases it is possible to do this without the GAS service -- for example here is a pure SPARQL query to find the ancestors of the composer Bach https://w.wiki/54eC, with their ancestral depth, which works by finding the ancestors then gets the depth by counting the number of intermediates (hat-tip to Tony Bowden and Andrew Gray, 2018) ; but
(i) the depth-counting algorithm is inefficient, scaling worse than order(N^2) as the number of generations available increases, causing it rapidly to blow up towards timeout as the number of found nodes increases, whereas the GAS service remains fast and efficient
(ii) it's not possible to set a maximum depth, nor to stop the search once a particular target is reached, whereas both of these are possible with the GAS service
(iii) the intermediate-counting approach assumes there will be exactly one and only one path to each ancestor from the starting point, and the count will be very wrong if this is not so; whereas the GAS service finds the shortest path
(iv) the intermediate-counting approach only works if the link-properties are directed (like "father" or "mother"). It fails when this is not the case (eg for properties like "adjacent station"), because the counting of intermediate adjacent stations can back-track on itself, eg going all the way to the end of the line, then coming back again, rather than counting the shortest path.

So this counting the number of steps and/or finding the shortest path seems to be the particular niche the GAS service has found. According to the API docs other algorithms could be implemented, and at least two others are supplied -- one to extract connected subgraphs (another variant of BFS ?), and one to calculate PageRank scores -- but I haven't so far found examples of either applied to wikidata.

((PS. Thanks for the appreciation in the two railway queries. Though I can't at all remember which the "original query" was that I'm being thanked for in the comment there!! Though whatever it was, I am sure that, like everything else, it was something that I in my turn had borrowed from other wiser heads!))

(Note: previous comment inadvertently saved when only 1/3 written, so it may be needed to check web version (if not here already) for full text).