
Investigate continuing long result sets
Closed, Declined (Public)

Description

Continuing result sets obviously works pretty well, since the GUI does it quickly. But how can WikiGrok et al. use it?

Event Timeline

Manybubbles raised the priority of this task from to Needs Triage.
Manybubbles updated the task description. (Show Details)
Manybubbles moved this task to Needs triage on the Discovery-ARCHIVED board.
Manybubbles subscribed.
Manybubbles set Security to None.

Bad news: the GUI loads the whole result set and pages through it using JS. Better news: there are named result sets, which can be created like this:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
INSERT INTO %s1
SELECT ?p ?d WHERE {
  ?p wdt:P580 ?d
}

and then queried like this:

SELECT * WHERE {
  INCLUDE %s1
} LIMIT 50 OFFSET 200
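Assuming named result sets behave as described, a client could walk through one with successive LIMIT/OFFSET queries. A minimal sketch (the helper name and page size are illustrative, not part of any Blazegraph API) that only builds the query strings:

```python
# Sketch: page through a Blazegraph named result set by issuing
# successive LIMIT/OFFSET queries against it. This only constructs
# the query text; sending it to the endpoint is out of scope here.

def page_query(set_name, page, page_size=50):
    """Build a SELECT that reads one page from a named result set."""
    offset = page * page_size
    return (
        "SELECT * WHERE {\n"
        f"  INCLUDE %{set_name}\n"
        f"}} LIMIT {page_size} OFFSET {offset}"
    )

# Page 4 with a page size of 50 starts at offset 200,
# matching the example query above.
print(page_query("s1", 4))
```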

But we don't know how heavy those are (i.e. whether it's OK to have thousands of them), how to make them short-lived, or whether they would work on non-master nodes in a cluster setup (AFAIK updates are only allowed on the master).

I'm pretty sure LIMIT/OFFSET won't guarantee a stable continuation without some kind of transaction isolation. LIMIT/OFFSET is also usually horribly inefficient in databases, and this one probably isn't different. The keyset strategy that works in relational databases (sort, then filter where id > last seen id) probably doesn't work well here either.
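For reference, the relational-style keyset strategy mentioned above would translate to SPARQL roughly as follows. This is a hedged sketch only (entity IRIs are illustrative, and as noted, Blazegraph may not optimize ordered IRI comparisons the way a relational database would):

```python
# Sketch of keyset ("seek") pagination in SPARQL terms: instead of
# OFFSET, each page filters on the last item seen so far. This only
# builds the query text for illustration.

def keyset_page(last_item=None, page_size=1000):
    """Build one page of a keyset-paginated query over wdt:P580."""
    filt = f"  FILTER(?item > {last_item})\n" if last_item else ""
    return (
        "SELECT ?item ?d WHERE {\n"
        "  ?item wdt:P580 ?d .\n"
        + filt +
        "}\nORDER BY ?item\n"
        f"LIMIT {page_size}"
    )

print(keyset_page(None))        # first page: no filter
print(keyset_page("wd:Q42"))    # next page: seek past the last item seen
```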

Well, from what the docs say about named result sets, LIMIT/OFFSET over them is much more efficient than re-running the original query, since all the join etc. work is already done. The question is how expensive it is.

OK, looks like we can't run this query on follower nodes, which means it's not usable for generic queries.

We're talking about not using follower nodes at all, and instead having each server run the updater itself. That would allow this kind of thing, I think.

Smalyshev changed the task status from Open to Stalled.May 26 2016, 9:57 PM
Smalyshev lowered the priority of this task from High to Low.
Aklapper changed the task status from Stalled to Open.Nov 1 2020, 10:42 PM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status, as tasks should not be stalled (and then potentially forgotten) for years for unclear reasons.

(Smallprint, as general orientation for task management:
If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead.
If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks… > Edit Subtasks.
If this task is stalled on an upstream project, then the Upstream tag should be added.
If this task requires info from the task reporter, then there should be instructions which info is needed.
If this task needs retesting, then the TestMe tag should be added.
If this task is out of scope and nobody should ever work on this, or nobody else managed to reproduce the situation described here, then it should have the "Declined" status.
If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

I believe the current main strategies to “paginate” query results are:

  • Load all the results, then paginate in memory. As Stas said, this is what the WDQS UI does. It works reasonably well if the number of results isn’t outrageously large, and you’re doing the pagination for the users’ benefit, not as much to save resources.
  • Paginate a small part of the query, then run the rest on the paginated subset. For example:
SELECT ?item ?dob ?dod WITH {
  SELECT ?item WHERE {
    ?item wdt:P31 wd:Q5.
  }
  OFFSET 20000
  LIMIT 1000
} AS %items WHERE {
  INCLUDE %items.
  ?item wdt:P569 ?dob;
        wdt:P570 ?dod.
  FILTER(?dob > ?dod)
}

As long as the paginated subquery only uses a single triple, this is reasonably reliable, as far as I'm aware. Alternatively, you can paginate that subquery based on some identifier instead, ideally one that allows adding the query hint "hint:Prior hint:rangeSafe true." to the triple.
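The first strategy above (load everything, then paginate in memory) is straightforward on the client side. A minimal sketch, with a small illustrative list standing in for the bindings of a full WDQS result set:

```python
# Minimal sketch of strategy 1: fetch the whole result set once, then
# slice it into pages in client memory, as the WDQS UI does. The result
# list here is a stand-in for SPARQL JSON result bindings.

def paginate(results, page_size):
    """Yield successive fixed-size pages of an in-memory result list."""
    for start in range(0, len(results), page_size):
        yield results[start:start + page_size]

results = [f"Q{n}" for n in range(1, 11)]  # pretend one query returned these
pages = list(paginate(results, 4))
print(pages)
# → [['Q1', 'Q2', 'Q3', 'Q4'], ['Q5', 'Q6', 'Q7', 'Q8'], ['Q9', 'Q10']]
```

This trades memory for simplicity: it only pays off when the full result set fits comfortably in the client, which is exactly the caveat noted above.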

The third strategy envisioned in this task seems to be to store temporary result sets on the server and then access them in later queries. However, this looks like a major undertaking: it would at least require partial write permissions on the servers, the ability to route each user's queries consistently to the same server (the same requirement that blocked T136479, which was ultimately declined), and probably some "garbage collection" of no-longer-used result sets. I therefore think it makes sense to just close this task. Feel free to reopen if you disagree.