Add client identifier to requests sent from Kartotherian to WDQS
Open, NormalPublic

Description

It would be nice if Kartotherian when calling Wikidata Query Service (WDQS) on behalf of the client would identify which client requested it. Since WDQS is an internal WMF service, there should be no PII proliferation concern about it, and we could have more efficient load controls and throttling for it.

Possible ways, in order of preference:

  • X-Client-IP (unless varnish would kill it)
  • Adding client IP to the User Agent
  • Adding another header
  • Adding client IP to request string as extra parameter
  • something else?
Smalyshev triaged this task as Normal priority.
Mholloway updated the task description. (Show Details)Jul 31 2018, 2:59 PM
Pchelolo moved this task from Backlog to watching on the Services board.Jul 31 2018, 7:15 PM
Pchelolo edited projects, added Services (watching); removed Services.
Mholloway added a comment.EditedJul 31 2018, 9:12 PM

It would be very easy to add the requesting IP to the user-agent string. We could also explore adding an X-Client-IP header; needs investigation on the Varnish end.

Does WDQS have an internal URI (similar to restbase.svc.codfw.wmnet) that we could/should use in production? Currently we're hard-coding the public endpoint.

config.wikidataQueryService = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql';

https://github.com/kartotherian/geoshapes/blob/1acbc3768278ef4478517f16cc1a317e364bde99/geoshapes.js#L36

We have internal cluster, but not sure about internal URL for public cluster or whether we should be using it (as it probably would circumvent Varnish). @Gehel, could you weigh in on this?

Nginx and Varnish already attach the X-Client-IP header to incoming requests (cf. this sample request), so all you have to do is actually use the provided header.

It would be very easy to add the requesting IP to the user-agent string.

I believe this violates multiple privacy policies we have in place, but as stated above, work-arounds are not even needed.

Does WDQS have an internal URI (similar to restbase.svc.codfw.wmnet) that we could/should use in production? Currently we're hard-coding the public endpoint.

Yes. there are two WDQS cluster: an external and an internal one. The former services external requests coming from query.wikidata.org, while the latter is to be used for requests originating from entities inside our production environment. Therefore, Kartotherian should use http://wdqs-internal.discovery.wmnet - the internal cluster.

I would also suggest using X-Request-ID which uniquely identifies a single request.

Nginx and Varnish already attach the X-Client-IP header to incoming requests

Yes, but we need it to be the IP of the client of Karthoterian, not the IP of the server running Karthoterian.

It would be very easy to add the requesting IP to the user-agent string.

I believe this violates multiple privacy policies we have in place.

Why do you think so? WDQS is an internal WMF service, and the webrequest logs track incoming IPs anyway. So no additional information is logged and no PII is sent out.

Therefore, Kartotherian should use http://wdqs-internal.discovery.wmnet - the internal cluster.

Not sure about that, since Kartotherian is acting as a proxy in this case - i.e. we have load generated by public requests driven by the clients everywhere, and queries written by the clients, not just internal WMF workloads with queries curated by us, as far as I can understand. I'd prefer using internal clusters only for curated queries for now, unless Kartotherian load is very small.

I would also suggest using X-Request-ID which uniquely identifies a single request.

But we do not need to uniquely identify a single request. We need to identify a client, sending multiple requests. That's the whole point of throttling.

Yes, but we need it to be the IP of the client of Karthoterian, not the IP of the server running Karthoterian.

Aren't clients making requests to Kartotherian? If they are, then X-Client-IP will be set to the external (to our prod environment) client issuing the request.

Why do you think so? WDQS is an internal WMF service, and the webrequest logs track incoming IPs anyway. So no additional information is logged and no PII is sent out.

This may constitute data manipulation not allowed by current policies. But, I am not an expert in this field, I just raised it as a potential legal problem. On the technical side, though, changing the UA for every client is IMHO a bad practice and should not even be considered a solution here.

Not sure about that, since Kartotherian is acting as a proxy in this case - i.e. we have load generated by public requests driven by the clients everywhere, and queries written by the clients, not just internal WMF workloads with queries curated by us, as far as I can understand. I'd prefer using internal clusters only for curated queries for now, unless Kartotherian load is very small.

There is precedent for this: the recommendation api service uses the internal WDQS cluster and issues requests to it upon receiving client requests. However, the requests sent to WDQS are strictly bounded (i.e. external clients cannot make arbitrary queries to WDQS). If that is the case with Katotherian as well, then the same principle can be applied. The original idea of the internal WDQS cluster was that internal services requiring it should function regardless of the external client load faced by WDQS directly.

But we do not need to uniquely identify a single request. We need to identify a client, sending multiple requests. That's the whole point of throttling.

It wasn't clear enough to me from the task desc that throttling is the main purpose of this.

Smalyshev added a comment.EditedAug 1 2018, 6:20 PM

Aren't clients making requests to Kartotherian? If they are, then X-Client-IP will be set to the external (to our prod environment) client issuing the request.

Yes. But when Kartotherian calls to WDQS, it calls external Varnish endpoint (now), which might replace X-Client-IP - not 100% sure about it, needs checking, but I think judging from the logs that's what is happening since the IP I see in the logs is one of Kartotherian.

On the technical side, though, changing the UA for every client is IMHO a bad practice and should not even be considered a solution here.

I am not sure why. What's wrong with changing the UA?
But I am not hard-set on the UA, that'd just make it work automatically, but if it puts it in other place as long as Java can extract it it'd be fine I think.

There is precedent for this: the recommendation api service uses the internal WDQS cluster and issues requests to it upon receiving client requests.

The difference here is whether the queries produced by the service are ours (i.e. written by WMF stuff or trusted volunteers, etc., approved as part of code review process, etc.) or just anybody can run any query there. In the former case, we are reasonably sure we can have certain guarantees. In the latter case, the guarantees are much worse - even with current throttling systems, misbehaving bots launching an avalanche of heavy queries can impact the service.

This is why we split endpoints into external - with less guarantees - and internal, which is supposed to only run known queries that we know who wrote them and why and be much more stable.

The differentiator here is not who initiates the query, but who writes the query itself. As I understand, Kartotherian runs arbitrary user queries, which can be very heavy and unoptimized, and we have little visibility into who produces these queries.

The original idea of the internal WDQS cluster was that internal services requiring it should function regardless of the external client load faced by WDQS directly.

True, but if we allow to run same queries on both services, then essentially we wouldn't be able to guarantee that. See T195559: Consider switching SPARQL endpoint for Karthoterian to WDQS internal cluster for discussion on this.

It wasn't clear enough to me from the task desc that throttling is the main purpose of this.

Sorry, yes, the main reason is (proper) throttling, see parent T195438. The problem is that now Kartotherian is treated as a single client, which means its throttling limits apply on all Kartotherian requests as if they were coming from a single client. With more visibility into Kartotherian clients, they will each have their own throttling limits, so one heavy client would not impact other, more benign, clients.

Aren't clients making requests to Kartotherian? If they are, then X-Client-IP will be set to the external (to our prod environment) client issuing the request.

Yes. But when Kartotherian calls to WDQS, it calls external Varnish endpoint (now), which might replace X-Client-IP - not 100% sure about it, needs checking, but I think judging from the logs that's what is happening since the IP I see in the logs is one of Kartotherian.

The flow is: client -> Nginx -> Varnish -> Kartotherian -> WDQS. Hence, when Kartotherian receives the request, that request will have the external client's IP in the X-Client-IP header and that's immutable regardless of what Kartotherian does next (i.e. whether the requests it issues pass through Varnish or not).

The ideal solution here would be for Kartotherian to call directly WDQS (via its discovery end point mentioned in the previous comments) and supply it the aforementioned header if the idea is to throttle requests on WDQS.

On the technical side, though, changing the UA for every client is IMHO a bad practice and should not even be considered a solution here.

I am not sure why. What's wrong with changing the UA?
But I am not hard-set on the UA, that'd just make it work automatically, but if it puts it in other place as long as Java can extract it it'd be fine I think.

By definition, the User-Agent header identifies the user agent making the request, not the client. Furthermore, putting semi-random information in that header (like IPs) makes analytics, metrics, logging and other use-cases harder, amongst other things.

The differentiator here is not who initiates the query, but who writes the query itself. As I understand, Kartotherian runs arbitrary user queries, which can be very heavy and unoptimized, and we have little visibility into who produces these queries.

Ah I see. In that case, perhaps it would worth investigating whether the set of queries could be restricted somehow?

The ideal solution here would be for Kartotherian to call directly WDQS (via its discovery end point mentioned in the previous comments) and supply it the aforementioned header if the idea is to throttle requests on WDQS.

I am not sure whether we have direct endpoint for external cluster, but if we do, calling it with proper X-Client-IP would work too. That would lose the caching though so if somebody uses the same query twice within a short period - e.g. reloading a page with the same map, etc. - that might have performance impact. Not sure though what are cache hit/miss rates for Kartotherian there, I guess we could check it somewhere in the data?

In that case, perhaps it would worth investigating whether the set of queries could be restricted somehow?

If you have any ideas about that, you're welcome, but right now my task is lass ambitious - just make sure whole Karthoterian->WDQS service is not blocked due to one over-eager client.

It could also work if Karthoterian implemented its own resource management system - i.e. tracked how heavy queries are for each client and throttled/denied heavy ones - but since WDQS already has such system in place, we might as well use it.

I did a quick check with Pivot, and it looks like about 7% of requests from Karthoterian to WDQS are cache hits. So maybe going around the cache is not a huge deal.

Mholloway added a comment.EditedSep 21 2018, 2:19 PM

I've got a patch to pass through the X-Client-IP header to WDQS. However, if I'm interpreting the VCL code correctly, it looks like it will be stripped in Varnish and replaced with the IP of the requesting Kartotherian host when hitting the public WDQS endpoint:

https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L186-L191

sub recv_fe_ip_processing {

	[...]

	if (client.ip !~ local_host) {
		// only the local nginx TLS terminator should set these at all -
		// there are no other internal exceptions to that rule
		unset req.http.X-Client-IP;
		unset req.http.X-Connection-Properties;
	}

	[...]

	if (!req.http.X-Client-IP) {
		unset req.http.via-nginx;
		set req.http.X-Client-IP = client.ip;
		if (!req.http.X-Client-IP) {
			[...]
		}
	}

        [...]

}

If we're sure that it's reasonably safe to bypass the cache and hit the internal WDQS endpoint instead, then that doesn't matter, of course. Updating Kartotherian to hit the internal endpoint is just a matter of deploying a config change. (I should note that we're blocked on deploying Kartotherian in production until updating the maps cluster to Stretch is complete, but we could still test this in the beta cluster in the meantime.)

Mholloway moved this task from Backlog to In progress on the Maps-Sprint board.Sep 24 2018, 5:51 AM
Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-releng) [2018-09-27T07:38:16Z] <mdholloway> deployment-maps04 updated tilerator and kartotherian node modules (T195513, T200594)

Gehel added a comment.Sep 27 2018, 8:38 AM

I'm late to the party, so a few notes in no particular order:

  • WDQS queries from Kartotherian are arbitrary, and it is not really possible to restrict them without heavily impacting functionality. In most cases they will come from a user editing a <mapframe/> tag, so we have some level of control there, but anyone could also access Kartotherian directly. The external cluster seems to be the right place to send those queries.
  • We have varnish in front of Kartotherian, so there is already some higher level caching in place, caching the requests themselves is probably not necessary. We could configure Kartotherian to use wdqs.discovery.wmnet which is the internal endpoint behind query.wikidata.org. This internal traffic should not need to go through our web proxy.
  • The problem described in this task is that the traffic from kartotherian is currently being throttled, as wdqs is throttling based on UA and IP, and Kartotherian funnels traffic from a multitude of clients. Note that at the moment, this appears when wdqs is already overloaded. Since Kartotherian is acting as a proxy to wdqs in this case, it make sense to have it publish an X-Client-IP header.