
LDF endpoint ordering is not stable between servers when paging
Closed, ResolvedPublic

Description

When paging LDF results, duplicate or skipped results appear. The most probable cause is that the servers order LDF triples differently.

Event Timeline

Original email:

We're trying to iterate over all triples with predicate http://www.wikidata.org/prop/direct/P3417 via the LDF endpoint. The way we do that is as follows.

    Issue a GET request for https://query.wikidata.org/bigdata/ldf?subject=&predicate=http://www.wikidata.org/prop/direct/P3417 with a header accepting "application/ld+json"
    Iterate over the resulting JSON, storing all entries in the @graph field with @id matching the predicate P3417
    Find the nextPage entry and extract the url if it exists
    Repeat 1 - 3 using the url found in step (3)
    Terminate when nextPage isn't found
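The loop above can be sketched in Python. This is a minimal sketch, not the reporter's actual code: the fetch function is injected so the page-following logic stands alone, and the field names (`@graph`, `nextPage`) follow the description above, but the exact JSON-LD shape of the live endpoint (e.g. hydra vocabulary prefixes) should be verified against a real response.

```python
PREDICATE = "http://www.wikidata.org/prop/direct/P3417"
FIRST_PAGE = ("https://query.wikidata.org/bigdata/ldf"
              "?subject=&predicate=" + PREDICATE)

def iterate_triples(fetch, first_page=FIRST_PAGE):
    """Yield data entries page by page, following nextPage links until none is left."""
    url = first_page
    while url is not None:
        doc = fetch(url)  # fetch(url) -> parsed JSON-LD document (dict)
        next_url = None
        for entry in doc.get("@graph", []):
            if "nextPage" in entry:
                # Hypermedia control pointing at the next page, if any.
                next_url = entry["nextPage"]
            elif PREDICATE in entry:
                # A data entry carrying the target predicate.
                yield entry
        url = next_url
```

A real fetch would issue the GET with an `Accept: application/ld+json` header and parse the body; injecting it also makes the loop easy to exercise against canned pages.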

When doing the above we've found:

    We're processing the correct number of urls and all urls processed are unique
    The urls have roughly 100 entries per url (which is as expected)
    We're finding many repeat entries in the above iteration, so while the total number of entries processed is roughly equal to the total number of matching triples, the number of unique entries processed is far lower.
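These symptoms are exactly what inconsistent per-server ordering predicts. A toy simulation (hypothetical data, page size 3): two servers hold the same triples in different internal index orders, and successive page requests land on alternating servers, so offsets are consistent but the ordering behind each offset is not.

```python
# Toy model: the same 10 triples, indexed in a different order on each server.
triples = list(range(10))
order_a = sorted(triples)                 # server A's internal index order
order_b = sorted(triples, reverse=True)   # server B's internal index order

PAGE = 3

def page(order, offset):
    return order[offset:offset + PAGE]

# The client pages by offset; each request may hit either server.
seen = []
for i, offset in enumerate(range(0, len(triples), PAGE)):
    server = order_a if i % 2 == 0 else order_b
    seen.extend(page(server, offset))
```

The result reproduces the report: the total count matches the number of triples, but some items appear twice while others are never seen.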

For some more context:

LDF is a way to cheaply get large lists of triples from WDQS, offloading some logic onto the clients. Retrieving such a list is done page by page, and we already have use cases for this. The iteration order simply follows the underlying index, which might be different on each node, so a specific page returned by different nodes might have different content. Adding a sorting step would make the order consistent between nodes, but would increase the cost enough to defeat the main point of LDF (cheaply retrieving a list of triples).

Potential solutions:

client affinity

The same client always ending up on the same WDQS node would ensure that this client gets a consistent ordering. That consistent ordering would be cached at the Varnish level and provide a consistent answer to the next client. This breaks down if another client refreshes the same query at the time of invalidation. LVS is able to do source-hashing scheduling (see LVS docs or T151971), but in our case the source IP is Varnish's. We could remove the LVS in front of WDQS and use Varnish to do the balancing, but this is exactly the opposite of the current effort to standardise all our services on the same model.
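The idea behind source-hash scheduling can be illustrated with a minimal sketch (the hashing scheme here is illustrative, not LVS's actual algorithm, and the backend names are placeholders): the same source address always maps to the same backend. This is also why it fails here: if every request arrives with a Varnish frontend's IP as its source, affinity tracks the cache layer rather than the real client.

```python
import hashlib

# Hypothetical backend names; any stable hash of the source address
# works for the illustration.
BACKENDS = ["wdqs1003", "wdqs1004", "wdqs1005"]

def pick_backend(source_ip):
    """Deterministically map a source address to one backend."""
    digest = hashlib.sha256(source_ip.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]
```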

single LDF server

We can route all LDF requests to a single server, with a fallback mechanism to route traffic to another server in case the first one is down. This is not a scalable option, and we don't have anything in place (AFAIK) to manage an automatic fallback (it does not look like LVS has a scheduler that would work in this scenario).

make any WDQS node return consistent pages

As stated above, sorting results before paging would solve the issue, but is probably too expensive to consider (result sets are expected to be large). The LDF implementation that we use does not seem to support this. It might be possible to configure a different indexing strategy with a consistent iteration order, but that's an unknown. Basically, WDQS is stateless (as seen from a client) but accidentally exposes internal state in a subtle way.

At this point, we are mostly out of ideas. @BBlack, @ema, if you have any ideas, they would be welcome!

At this point, the only workable option is the "single LDF server" (apart from abandoning LDF completely). So let's see how we could implement that and what feedback we get.

limitations

  • As far as I can see, LVS does not support active/passive failover, so failover will be done by manually changing the active LDF endpoint with pybal. The LDF service will not be highly available.
  • Restricting to a single server is an inherently non-scalable solution (even if, for the foreseeable future, a single server should be able to easily handle the LDF load).

todo

  • new external endpoint (for example ldf.wikidata.org)
  • new LVS configuration
  • modification to the wdqs nginx configuration to route LDF and non-LDF traffic from different hostnames
  • anything else?
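The nginx item above could look roughly like the following sketch. Everything here is hypothetical (hostnames, upstream names), not the actual puppet configuration; it only illustrates routing LDF traffic from a dedicated hostname to a single pinned backend while the regular hostname keeps the full pool.

```nginx
# Hypothetical sketch: separate LDF and non-LDF traffic by hostname.
server {
    listen 80;
    server_name ldf.wikidata.org;     # LDF-only hostname -> pinned backend
    location /bigdata/ldf {
        proxy_pass http://wdqs-ldf-primary;   # single active LDF server
    }
}

server {
    listen 80;
    server_name query.wikidata.org;   # regular SPARQL traffic -> full pool
    location / {
        proxy_pass http://wdqs-pool;
    }
}
```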

Change 344197 had a related patch set uploaded (by Smalyshev):
[operations/puppet@production] Direct LDF requests to single host to solve paging issues

https://gerrit.wikimedia.org/r/344197

I don't think we need a new external endpoint - it looks like our VCL routing allows switching by path (unless I misunderstand what's going on).

Change 344197 merged by Gehel:
[operations/puppet@production] Direct LDF requests to single host to solve paging issues

https://gerrit.wikimedia.org/r/344197

Varnish patch deployed. I'll keep an eye on the logs to make sure all requests are routed as we expect. We still need to find a longer-term solution, but that's another ticket.