
LDF endpoint ordering is not stable between servers when paging
Closed, ResolvedPublic

Description

When paging LDF results, duplicate or skipped results appear. The most probable cause is that the servers order LDF triples differently.

Event Timeline

Original email:

We're trying to iterate over all triples with predicate http://www.wikidata.org/prop/direct/P3417 via the LDF endpoint. The way we do that is as follows.

    Issue a GET request for https://query.wikidata.org/bigdata/ldf?subject=&predicate=http://www.wikidata.org/prop/direct/P3417 with a header accepting "application/ld+json"
    Iterate over the resulting JSON, storing all entries in the @graph field with @id matching the predicate P3417
    Find the nextPage entry and extract the url if it exists
    Repeat 1 - 3 using the url found in step (3)
    Terminate when nextPage isn't found
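The loop above can be sketched in Python. This is a minimal sketch, not the reporter's actual code: the fetch function is injected so the page-following logic stands alone, and the field names (`@graph`, `nextPage`) follow the description above, but the exact JSON-LD shape of the live endpoint (e.g. hydra vocabulary prefixes) should be verified against a real response.

```python
PREDICATE = "http://www.wikidata.org/prop/direct/P3417"
FIRST_PAGE = ("https://query.wikidata.org/bigdata/ldf"
              "?subject=&predicate=" + PREDICATE)

def iterate_triples(fetch, first_page=FIRST_PAGE):
    """Yield data entries page by page, following nextPage links until none is left."""
    url = first_page
    while url is not None:
        doc = fetch(url)  # fetch(url) -> parsed JSON-LD document (dict)
        next_url = None
        for entry in doc.get("@graph", []):
            if "nextPage" in entry:
                # Hypermedia control pointing at the next page, if any.
                next_url = entry["nextPage"]
            elif PREDICATE in entry:
                # A data entry carrying the target predicate.
                yield entry
        url = next_url
```

A real fetch would issue the GET with an `Accept: application/ld+json` header and parse the body; injecting it also makes the loop easy to exercise against canned pages.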

When doing the above we've found:

    We're processing the correct number of urls and all urls processed are unique
    The urls have roughly 100 entries per url (which is as expected)
    We're finding many repeat entries in the above iteration, so while the total number of entries processed is roughly equal to the total number of matching triples, the number of unique entries processed is far lower.
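These symptoms are exactly what inconsistent per-server ordering predicts. A toy simulation (hypothetical data, page size 3): two servers hold the same triples in different internal index orders, and successive page requests land on alternating servers, so offsets are consistent but the ordering behind each offset is not.

```python
# Toy model: the same 10 triples, indexed in a different order on each server.
triples = list(range(10))
order_a = sorted(triples)                 # server A's internal index order
order_b = sorted(triples, reverse=True)   # server B's internal index order

PAGE = 3

def page(order, offset):
    return order[offset:offset + PAGE]

# The client pages by offset; each request may hit either server.
seen = []
for i, offset in enumerate(range(0, len(triples), PAGE)):
    server = order_a if i % 2 == 0 else order_b
    seen.extend(page(server, offset))
```

The result reproduces the report: the total count matches the number of triples, but some items appear twice while others are never seen.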

For some more context:

LDF is a way to cheaply get large lists of triples from WDQS, offloading some logic onto the clients. Retrieving such a list is done page by page, and we already have use cases for this. The iteration order simply follows the underlying index, which might be different on each node, so a specific page returned by different nodes might have different content. Adding a sorting step would make the order consistent between nodes, but would increase the cost enough to defeat the main point of LDF (cheaply retrieving a list of triples).

Potential solutions:

client affinity

The same client always ending up on the same WDQS node would ensure that this client gets a consistent ordering. That consistent ordering would be cached at the Varnish level and provide a consistent answer to the next client. This breaks down if another client refreshes the same query at the time of invalidation. LVS is able to do source-hashing scheduling (see LVS docs or T151971), but in our case the source IP is Varnish's. We could remove the LVS in front of WDQS and use Varnish to do the balancing, but this is exactly the opposite of the current effort to standardise all our services on the same model.
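The idea behind source-hash scheduling can be illustrated with a minimal sketch (the hashing scheme here is illustrative, not LVS's actual algorithm, and the backend names are placeholders): the same source address always maps to the same backend. This is also why it fails here: if every request arrives with a Varnish frontend's IP as its source, affinity tracks the cache layer rather than the real client.

```python
import hashlib

# Hypothetical backend names; any stable hash of the source address
# works for the illustration.
BACKENDS = ["wdqs1003", "wdqs1004", "wdqs1005"]

def pick_backend(source_ip):
    """Deterministically map a source address to one backend."""
    digest = hashlib.sha256(source_ip.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]
```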

single LDF server

We can route all LDF requests to a single server, with a fallback mechanism to route traffic to another server in case the first one is down. This is not a scalable option, and we don't have anything in place (AFAIK) to manage an automatic fallback (it does not look like LVS has a scheduler that would work in this scenario).

make any WDQS node return consistent pages

As stated above, sorting results before paging would solve the issue, but is probably too expensive to consider (result sets are expected to be large). The LDF implementation that we use does not seem to support this. It might be possible to configure a different indexing strategy with a consistent iteration order, but that's an unknown. Basically, WDQS is stateless (as seen from a client) but accidentally exposes internal state in a subtle way.

At this point, we are mostly out of ideas. @BBlack, @ema, if you have any ideas, they would be welcome!

At this point, the only workable option is the "single LDF server" (apart from abandoning LDF completely). So let's see how we could implement that and what feedback we get.

limitations

  • As far as I can see, LVS does not support active/passive failover, so failover will be done by manually changing the active LDF endpoint with pybal. The LDF service will not be highly available.
  • Restricting to a single server is an inherently non-scalable solution (even if, for the foreseeable future, a single server should be able to easily handle the LDF load).

todo

  • new external endpoint (for example ldf.wikidata.org)
  • new LVS configuration
  • modification to the wdqs nginx configuration to route LDF and non-LDF traffic from different hostnames
  • anything else?
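The nginx item above could look roughly like the following sketch. Everything here is hypothetical (hostnames, upstream names), not the actual puppet configuration; it only illustrates routing LDF traffic from a dedicated hostname to a single pinned backend while the regular hostname keeps the full pool.

```nginx
# Hypothetical sketch: separate LDF and non-LDF traffic by hostname.
server {
    listen 80;
    server_name ldf.wikidata.org;     # LDF-only hostname -> pinned backend
    location /bigdata/ldf {
        proxy_pass http://wdqs-ldf-primary;   # single active LDF server
    }
}

server {
    listen 80;
    server_name query.wikidata.org;   # regular SPARQL traffic -> full pool
    location / {
        proxy_pass http://wdqs-pool;
    }
}
```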

Change 344197 had a related patch set uploaded (by Smalyshev):
[operations/puppet@production] Direct LDF requests to single host to solve paging issues

https://gerrit.wikimedia.org/r/344197

I don't think we need a new external endpoint - it looks like our VCL routing allows switching by path (unless I misunderstand what's going on).

Change 344197 merged by Gehel:
[operations/puppet@production] Direct LDF requests to single host to solve paging issues

https://gerrit.wikimedia.org/r/344197

Varnish patch deployed. I'll keep an eye on the logs to make sure all requests are routed as we expect. We still need to find a longer-term solution, but that's another ticket.