Expose wikidata query service LDF endpoint in a scalable and available way
Open, Medium, Public

Description

T159574 has been fixed by sending all LDF traffic to a single server. This has obvious impacts on availability and scalability. This has to be addressed in some way.

The application has to present consistent responses to work correctly behind LVS. Any kind of client affinity can be put in place for optimisation, but not for correctness. It has to account for transient failures, routing quirks, ...

Event Timeline

Restricted Application added a subscriber: Aklapper.

The problem is as follows:

  • Blazegraph stores the triples in a BTree. The BTree does have an ordering, and it is stable, but what is stored are not strings but terms generated from strings, which seem to be identified and ordered by what is essentially a running ID, i.e. Term(1), Term(2), etc.
  • Since different servers are loaded at different times from different dump/update combinations, the same URIs/strings get different term IDs: the same URI may be Term(123) on one server and Term(3456) on another. This means the order of the triples in the indexes will differ from server to server.
  • When paginating through results, different page requests may be directed to different servers, thus creating a completely wrong picture in the summary data.
  • Due to the way we do load balancing (IP-based kernel balancing), we cannot ensure any request affinity, as the balancer does not even look inside the packets.
  • Using the client IP for balancing (pretty much the only thing we have that is not inside the packet) is not possible since we have Varnish in front of LVS, which means the client IPs are always those of the Varnish servers. We do have the real client IPs, but they are also inside the packets, so LVS can't use them.
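
To make the pagination issue above concrete, here is a minimal sketch in Python. It is only a toy model, not Blazegraph code: `build_term_ids`, `page`, the sample triples, and the two "servers" are hypothetical names and data invented for illustration. It assumes term IDs are handed out as a running counter in load order and that pages are cut from the ID-ordered index.

```python
from itertools import count

def build_term_ids(load_order):
    """Assign running IDs to URIs in the order a (hypothetical) server first loads them."""
    ids = {}
    counter = count(1)
    for uri in load_order:
        if uri not in ids:
            ids[uri] = next(counter)
    return ids

def page(triples, term_ids, page_no, page_size):
    """Return one page of triples, ordered by this server's own term IDs."""
    ordered = sorted(triples, key=lambda t: tuple(term_ids[x] for x in t))
    start = page_no * page_size
    return ordered[start:start + page_size]

# Same data on both servers (illustrative triples only).
TRIPLES = [
    ("wd:Q1", "wdt:P31", "wd:Q5"),
    ("wd:Q2", "wdt:P31", "wd:Q5"),
    ("wd:Q3", "wdt:P31", "wd:Q5"),
    ("wd:Q4", "wdt:P31", "wd:Q5"),
]

# Two servers that saw the same URIs in a different order while loading,
# so the same URI ends up with a different term ID on each.
server_a = build_term_ids(["wd:Q1", "wd:Q2", "wd:Q3", "wd:Q4", "wdt:P31", "wd:Q5"])
server_b = build_term_ids(["wd:Q4", "wd:Q3", "wd:Q2", "wd:Q1", "wdt:P31", "wd:Q5"])

# All pages from one server: every triple appears exactly once.
single = page(TRIPLES, server_a, 0, 2) + page(TRIPLES, server_a, 1, 2)

# Page 0 from server A, page 1 from server B: some triples are repeated
# and others never show up.
mixed = page(TRIPLES, server_a, 0, 2) + page(TRIPLES, server_b, 1, 2)

print("single server:", sorted(set(single)))
print("mixed servers:", sorted(set(mixed)))
print("missing when mixed:", sorted(set(TRIPLES) - set(mixed)))
```

In this toy setup the mixed run returns the Q1 and Q2 triples twice and never returns the Q3 and Q4 triples, which is the kind of inconsistency a client would see if consecutive page requests were balanced across servers with different term ID assignments.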
Smalyshev triaged this task as Medium priority. Feb 12 2018, 8:02 AM

The recent crash of wdqs1004 (T188045) had an impact on the LDF service, which was hosted on wdqs1004 at the time of the crash. The LDF service has been manually rerouted to wdqs1005, but this again raises concerns about the stability of this service.