
Create a more controlled WDQS cluster
Closed, Resolved · Public

Description

WDQS is by design a fragile service: we allow arbitrary users to run arbitrary SPARQL queries. This is similar to allowing any user direct access to our production MySQL databases. That is fine as long as expectations are managed, but it is definitely wrong when services from which we expect stable response times and availability start depending on it.

Splitting WDQS into a public "do whatever you want" service and a separate, more controlled service makes sense. This would be similar to what we do with MySQL, which is split between the production service and the labs replicas, with different expectations.

My understanding is that the structured data on commons project will rely on WDQS for some of its functionalities. If we want that project to be stable, we need to address the WDQS stability issues.

Event Timeline

Restricted Application added a subscriber: Aklapper.

My understanding is that the structured data on commons project will rely on WDQS for some of its functionalities. If we want that project to be stable, we need to address the WDQS stability issues.

In what form does it rely on it?
What I can say is that Wikibase-Quality-Constraints depends on WDQS and also on short response times.
Could we maybe create an official list?

I am not familiar with the details of the dependent services (it would be nice to document them somewhere, or point me to the place where they are already documented :) but in general, unless you need an instant response, a 429 should not be a cause for immediate failure. For offline functionality, a 429 should cause a retry within the prescribed time. For online use it would depend on the case, but if it's a secondary UI element, the script can just try to load it later, maybe?
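To illustrate the retry-within-prescribed-time behaviour, here is a minimal sketch of a client that honours 429 responses from the public endpoint; the endpoint URL is the real public one, but the retry policy (attempt count, fallback delay, user agent) is just an assumption for illustration:

```python
import time
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def run_query(sparql, max_attempts=3, fallback_delay=5):
    """Run a SPARQL query, retrying on 429 within the server-prescribed time."""
    for _ in range(max_attempts):
        resp = requests.get(
            WDQS_ENDPOINT,
            params={"query": sparql, "format": "json"},
            headers={"User-Agent": "example-offline-job/0.1"},  # placeholder UA
        )
        if resp.status_code == 429:
            # Honour Retry-After (in seconds) if present, otherwise back off a bit.
            try:
                delay = int(resp.headers.get("Retry-After", fallback_delay))
            except ValueError:
                delay = fallback_delay
            time.sleep(delay)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("WDQS kept throttling the query; giving up for now")
```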

The Recommendation API service depends entirely on WDQS (and partially on AQS). While in theory we could (and should) account for 429s sent by WDQS in the service, the problem described in the task description is genuine. I second the effort of splitting the service (or its access) into two logical entities - one for external, the other for internal clients.

Random thoughts gathered in multiple discussions (note that this is very much brainstorming; nothing written below is an actual decision on a path to follow):

  • I have heard WDQS mentioned in the context of structured data on commons, but I have no idea how it is going to be used. If anyone has a pointer to some documentation / discussion, please let me know.
  • Since we have 2 active / active WDQS clusters (eqiad / codfw), we could use one of them to serve internal traffic and one as an external endpoint. This defeats the purpose of having a backup datacenter, so that's not a long-term solution.
  • The hard use case to support is the external endpoint, since it is subject to uncontrolled load. Do we want to continue supporting this use case? (I very much think we should, but it isn't my place to make that decision; the question should be asked, and answered.)
  • Since each WDQS node is independent and does its own updates by querying wikidata, increasing the number of nodes increases the load on wikidata. Is that an issue? Is there a way to better share the resources required for updates?

Perhaps a viable solution for the short to mid term is to have two LVS endpoints in each DC: one that serves external traffic and another internal one, much like we have api.svc.{site}.wmnet and api-async.svc.{site}.wmnet for the MW API, respectively.

But if those 2 LVS endpoints are served by the same cluster, we've not achieved anything in terms of isolation...

It might make sense to add whitelisting / authentication / authorization on the "controlled" WDQS cluster to ensure the usage stays under control. This might or might not be required. If we go that way, this is really a more general problem, and it might make sense to address it in a more general way.

I think for now limiting it by IP should probably work? I think IP ranges from production hosts, labs and outside are segregated?
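For what an IP-based restriction could look like, here is a minimal sketch of a per-request check, assuming the controlled endpoint sits behind something that can run such a filter; the listed ranges are placeholders, not the actual production / labs networks:

```python
from ipaddress import ip_address, ip_network

# Placeholder ranges -- the real production / labs networks would go here.
ALLOWED_RANGES = [
    ip_network("10.0.0.0/8"),      # e.g. production hosts
    ip_network("172.16.0.0/21"),   # e.g. labs instances, if we decide to allow them
]

def is_allowed(client_ip: str) -> bool:
    """Return True if the client IP falls inside one of the allowed ranges."""
    addr = ip_address(client_ip)
    return any(addr in net for net in ALLOWED_RANGES)

# Example: a production-like address passes, an outside address does not.
assert is_allowed("10.64.0.17")
assert not is_allowed("203.0.113.42")
```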

Random thoughts gathered in multiple discussions (note that this is very much brainstorming; nothing written below is an actual decision on a path to follow):

  • I have heard WDQS mentioned in the context of structured data on commons, but I have no idea how it is going to be used. If anyone has a pointer to some documentation / discussion, please let me know.

We will have a lot of structured data on Commons in the same way/format as it currently exists on Wikidata. We will want to provide people with the ability to search/query for media files on Commons based on the statements on these files. People will also want to combine the data on Commons and Wikidata in queries. To make this all work we'll want to load that data into WDQS as well (or set up a separate query service but iirc Stas said no and I can see the benefits of not doing it.) I am not sure we are much farther in planning than this.
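To make the "combine Commons and Wikidata data in one query" idea a bit more concrete, here is a rough sketch of what a federated query might look like from a client, assuming a hypothetical Commons SPARQL endpoint and a made-up "depicts" predicate; only query.wikidata.org and the wd:/wdt: prefixes are real today:

```python
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Rough sketch: find media files whose "depicts" statement points at an item
# that Wikidata classifies as a lighthouse. The Commons-side endpoint and
# predicate below are invented; the wd:/wdt: terms are built-in WDQS prefixes.
FEDERATED_QUERY = """
SELECT ?file ?item WHERE {
  SERVICE <https://commons-query.example.org/sparql> {   # hypothetical endpoint
    ?file <http://example.org/prop/direct/depicts> ?item .
  }
  ?item wdt:P31/wdt:P279* wd:Q39715 .   # instance of (a subclass of) lighthouse
}
LIMIT 10
"""

def run():
    resp = requests.get(
        WDQS_ENDPOINT,
        params={"query": FEDERATED_QUERY, "format": "json"},
        headers={"User-Agent": "example-bot/0.1"},  # placeholder UA
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]
```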

  • Since we have 2 active / active WDQS clusters (eqiad / codfw), we could use one of them to serve internal traffic and one as an external endpoint. This defeats the purpose of having a backup datacenter, so that's not a long-term solution.
  • The hard use case to support is the external endpoint, since it is subject to uncontrolled load. Do we want to continue supporting this use case? (I very much think we should, but it isn't my place to make that decision; the question should be asked, and answered.)

Yes, that is definitely a use case we want to continue to support.

  • Since each WDQS node is independent and does its own updates by querying wikidata, increasing the number of nodes increases the load on wikidata. Is that an issue? Is there a way to better share the resources required for updates?
  • I have heard WDQS mentioned in the context of structured data on commons, but I have no idea how it is going to be used. If anyone has a pointer to some documentation / discussion, please let me know.

We will have a lot of structured data on Commons in the same way/format as it currently exists on Wikidata. We will want to provide people with the ability to search/query for media files on Commons based on the statements on these files. People will also want to combine the data on Commons and Wikidata in queries. To make this all work we'll want to load that data into WDQS as well (or set up a separate query service but iirc Stas said no and I can see the benefits of not doing it.) I am not sure we are much farther in planning than this.

As I understand it, this would allow users to run queries in a fairly controlled way, not to run arbitrary SPARQL queries. This means we can probably fairly easily ensure that those queries have a reasonable cost and don't endanger the "controlled" WDQS cluster. As I understand it, this pattern is similar to what we do with any MySQL-backed service: we don't allow arbitrary SQL, but we take input from users and inject it into known queries.
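A minimal sketch of that pattern, assuming a hypothetical search feature where the query shape and its cost are fixed by us and the only user-controlled part is a validated item ID (the "depicts" predicate is a placeholder):

```python
import re

# Hypothetical sketch: the query shape and its LIMIT are fixed by us; the user
# only supplies an item ID, which is validated before being injected --
# conceptually the same as using parameterized SQL instead of raw SQL.
QUERY_TEMPLATE = """
SELECT ?file WHERE {
  ?file <http://example.org/prop/direct/depicts> wd:%s .   # placeholder predicate
}
LIMIT 100
"""

def build_query(user_item_id: str) -> str:
    """Build a fixed-cost query from untrusted user input."""
    if not re.fullmatch(r"Q[0-9]+", user_item_id):
        raise ValueError("not a valid item ID")
    return QUERY_TEMPLATE % user_item_id
```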

I'm not too worried about the additional data at this point (we have more headroom on storage space than on computational resources).

I believe people will also want to run arbitrary queries. However, I guess the majority, coming from users who are searching for files on the Commons website, will be fairly controlled.

I think for now limiting it by IP should probably work? I think IP ranges from production hosts, labs and outside are segregated?

Yes, allowing access to a controlled cluster only from production hosts is probably fine and is similar to what we do for all our services. The idea of more controlled access has been raised a few times (I remember a conversation with @BBlack about maps and similar concerns). Having API keys or a similar mechanism would allow us to open up a controlled cluster to trusted clients (even from labs), or to more easily identify / block abusive consumers, including production consumers.
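As a sketch of what such an API-key mechanism could look like in front of the controlled cluster, assuming keys are provisioned out of band and sent in a request header; the header name, client names and key values are all invented for illustration:

```python
from typing import Optional

# Hypothetical sketch: keys for trusted clients (production services, selected
# labs tools) would be provisioned out of band; names and values are made up.
TRUSTED_KEYS = {
    "recommendation-api": "secret-key-1",
    "wikibase-quality-constraints": "secret-key-2",
}

def client_for_request(headers: dict) -> Optional[str]:
    """Return the client name if the request carries a known API key, else None.

    Attributing every request to a named client is what makes it easy to
    identify and, if necessary, block an abusive consumer, internal or not.
    """
    key = headers.get("X-Api-Key")  # invented header name
    for client, secret in TRUSTED_KEYS.items():
        if key == secret:
            return client
    return None
```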

I believe people will also want to run arbitrary queries. However, I guess the majority, coming from users who are searching for files on the Commons website, will be fairly controlled.

So we want users to be able to run arbitrary searches on Commons, but not arbitrary SPARQL queries, right? (At least in the context of the structured data on Commons project; we of course still want users to be able to run arbitrary SPARQL queries on the current query.wikidata.org.) Feel free to ping me for a chat (IRC, hangouts, other, ...) to clarify that more if you think that's needed.

Gehel closed subtask Restricted Task as Resolved. Feb 20 2018, 9:35 AM
Smalyshev claimed this task.

New cluster is up and serving traffic.