Define the constraints of the new WDQS cluster
Closed, Resolved · Public

Description

Since we are creating a "more controlled WDQS cluster" (T178492), we should define more precisely what "more controlled" means and what kind of constraints we want to put in place on this new cluster. This is meant to be the start of a discussion, not a final decision yet:

Context
We want a WDQS cluster that can be used to serve synchronous user facing traffic. This requires high availability and fairly constant response times.

Rules

  • clients must be production services; no access from labs or from the general internet
  • cluster must be used only for synchronous user facing traffic, no batch jobs
  • requests are expected to be cheap
  • clients use a specific user agent

Event Timeline

Which timeout do we set for this cluster? If it’s only for synchronous, user facing queries, then I’d think that WDQS’s 60 seconds are actually too long. On the other hand, individual services can always lower the timeout using the maxQueryTimeMillis parameter.
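To illustrate the point about individual services lowering their own timeout, here is a minimal sketch of how a client might build such a request. It assumes `maxQueryTimeMillis` is passed as a query string parameter, as the comment above suggests; the endpoint hostname, the helper name `build_request` and the user agent string are hypothetical and only there for illustration. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: the real hostname of the new cluster is not given in this task.
ENDPOINT = "http://wdqs-internal.example/sparql"


def build_request(query, user_agent, timeout_ms):
    """Build (but do not send) a SPARQL GET request with a client-chosen timeout."""
    # Assumption: the timeout is passed as the maxQueryTimeMillis query parameter.
    params = urlencode({"query": query, "maxQueryTimeMillis": timeout_ms})
    headers = {
        # Identify the calling service so operators know who uses the cluster.
        "User-Agent": user_agent,
        "Accept": "application/sparql-results+json",
    }
    return Request(f"{ENDPOINT}?{params}", headers=headers)


req = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1", "my-service/1.0", 5000)
```

A service that knows its queries are cheap can pass a value well below the cluster-wide limit, so a misbehaving query fails fast instead of holding a user-facing request open.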

Without getting into the specific numbers (which we will tune based on experience), I agree that we could (and probably should) have a short timeout, since we expect requests to be short... I'd say that in the context of this task, the important point is to specify that we expect the requests to be cheap. Whether or not we need to put a hard constraint on this is at this point an implementation detail.

debt triaged this task as Medium priority. Jan 4 2018, 6:22 PM

Since this is an internal, controlled cluster, I'd keep the timeout short but not too short and let the clients self-police. One thing that I do want for this cluster is to enforce setting a user agent, so we know who uses it. If we notice clients are not self-policing well, we could also enforce an explicit timeout (i.e. the client should explicitly set the timeout via either a header or a query string parameter, which means some human made a decision on it, hopefully after long, careful thinking :) ).

In summary, I think we could start with:

  1. 30 secs timeout
  2. Requiring user-agent to be set
  3. Only allowing internal access

I am not 100% sure about labs: we might consider allowing some labs access, since some rather widely used tools run there and, while technically not production, they certainly have many people relying on them. But that is for the future; I think we should start with prod-only.

So the technical limitations are:

  1. 30 secs timeout
  2. Requiring user-agent to be set
  3. Only allowing internal access
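As a rough sketch of how the three technical limitations could be checked together, the snippet below validates one incoming request. It is not how the cluster actually enforces them; the function name `effective_timeout`, the `Rejected` exception, and the use of private IP ranges as a stand-in for "internal access" are all assumptions made for illustration.

```python
import ipaddress

# The 30 second limit proposed in this discussion, in milliseconds.
MAX_TIMEOUT_MS = 30_000


class Rejected(Exception):
    """Raised when a request violates one of the cluster constraints."""


def effective_timeout(headers, remote_addr, requested_ms=None):
    """Return the timeout to apply, or raise Rejected.

    `headers` is a plain dict keyed by the exact header name "User-Agent".
    """
    # Limitation 2: a non-empty user agent must be set.
    if not headers.get("User-Agent", "").strip():
        raise Rejected("User-Agent is required")
    # Limitation 3: internal access only.
    # Assumption: "internal" is approximated here by private address ranges.
    if not ipaddress.ip_address(remote_addr).is_private:
        raise Rejected("internal access only")
    # Limitation 1: clients may ask for less than 30 seconds, never more.
    if requested_ms is None:
        return MAX_TIMEOUT_MS
    return min(requested_ms, MAX_TIMEOUT_MS)
```

Capping rather than rejecting an over-long requested timeout keeps well-meaning clients working while still bounding the cost of any single query; rejecting outright would be the stricter alternative.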

I propose to keep the "conceptual" limitations of:

  1. cluster must be used only for synchronous user facing traffic, no batch jobs
  2. requests are expected to be cheap

What else?

Since there are no objections here, I'll move this to the wiki and then we can close this task.

I moved this to the wiki; we can evolve the constraints there if need be.

debt added a subscriber: debt.

👍