
WDQS requests from aluminium.wikimedia.org being throttled
Closed, Resolved · Public

Description

I am getting this in the WDQS logs on wdqs1003:

IP:aluminium.wikimedia.org UA:@kartotherian/geoshapes/0.0.13 (https://mediawiki.org/Maps) - A request is being throttled.

I'd like to figure out

  1. What is this workload, and why is it not using the internal cluster?
  2. Why is it sending so much traffic that it's being throttled?

Event Timeline

  1. aluminium is a proxy used for accessing the outside world from inside of the cluster. Why not directly? No idea.
  2. Because it's being used for map generation, e.g. https://www.mediawiki.org/wiki/Help:Extension:Kartographer#GeoShapes_via_Wikidata_Query

Looking at kartotherian configuration, I can't find a reference to wdqs, so I presume this is hardcoded.

We do configure a proxy in the kartotherian and tilerator configs. I'm not sure if it is used for anything else, but it sounds like a slightly bad idea to have kartotherian / tilerator retrieve stuff from the wild internet. If it is only used to access WDQS, we should remove it.

Since the traffic from any user will come with the same IP / UA, we can expect kartotherian to go over our usual throttling rules. I'm not sure if we should vastly increase the throttling limits for the internal WDQS cluster, or if we should add a whitelist for some UA.
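
To illustrate why a shared IP / UA pair trips the limit, here is a minimal sketch of a throttle that counts requests per (IP, User-Agent) bucket. This is not the actual WDQS throttling code: the `Throttle` class, the simple per-window counter, and the limit of 5 are all assumptions made for illustration.

```python
from collections import defaultdict


class Throttle:
    """Hypothetical throttle that counts requests per (ip, user_agent) bucket."""

    def __init__(self, limit):
        self.limit = limit              # max requests per bucket per window (assumed value)
        self.counts = defaultdict(int)  # bucket key -> requests seen in the current window

    def allow(self, ip, user_agent):
        key = (ip, user_agent)
        self.counts[key] += 1
        return self.counts[key] <= self.limit


throttle = Throttle(limit=5)
UA = "@kartotherian/geoshapes/0.0.13 (https://mediawiki.org/Maps)"

# Twenty map requests from many different end users, all arriving through the
# same proxy: the throttle sees one IP and one UA, so they share one bucket.
results = [throttle.allow("aluminium.wikimedia.org", UA) for _ in range(20)]
throttled = results.count(False)
```

In this sketch only the first five requests pass, no matter how many distinct users are behind the proxy.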

@Mholloway / @Pnorman: could you have a look as well?

We need to look at what kind of queries we get from kartotherian. In general, since these are basically user queries, we may want to apply the same limits as the rest of the user traffic. I've created T195559 for evaluating whether we need to switch it to the internal cluster or not.

Smalyshev triaged this task as Medium priority. May 25 2018, 5:48 AM

We don't want to use the internal endpoint since kartotherian allows arbitrary queries. We may want to set headers because it is essentially acting as a proxy, and throttling should be per user of kartotherian, not for all of kartotherian.
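
A rough sketch of what "setting headers" could look like on the proxying side, assuming the end user's address is passed along in `X-Forwarded-For`. The helper name `upstream_headers` and the choice of header are assumptions for illustration; this is not taken from the kartotherian code.

```python
def upstream_headers(client_ip, base_user_agent, incoming_forwarded_for=None):
    """Build headers for the outgoing WDQS request (hypothetical helper).

    Appends the end user's address to any existing X-Forwarded-For chain so
    the receiving side can tell individual users apart.
    """
    chain = [p.strip() for p in (incoming_forwarded_for or "").split(",") if p.strip()]
    chain.append(client_ip)
    return {
        "User-Agent": base_user_agent,
        "X-Forwarded-For": ", ".join(chain),
    }


headers = upstream_headers(
    "203.0.113.7",  # documentation-range address standing in for an end user
    "@kartotherian/geoshapes/0.0.13 (https://mediawiki.org/Maps)",
)
```

The User-Agent stays unchanged here, so the service remains identifiable while the extra header carries the per-user part.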

We don't want to use the internal endpoint since kartotherian allows arbitrary queries

Makes sense.

We may want to set headers because it is essentially acting as a proxy, and throttling should be per user of kartotherian, not for all of kartotherian.

Yes, that would be the best solution. Since throttling is keyed on IP + User-Agent, maybe add something to the user agent string? Or use an X-Forwarded-For header, or some other custom header?
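
On the receiving side, the idea could be sketched as choosing the throttling bucket from a forwarded-client header when the request comes from a known proxy. Everything here is hypothetical: `TRUSTED_PROXIES`, `bucket_key`, and the decision to honour `X-Forwarded-For` are assumptions, not the WDQS implementation.

```python
TRUSTED_PROXIES = {"aluminium.wikimedia.org"}  # assumed allow-list, for illustration only


def bucket_key(remote_host, headers):
    """Pick the throttling bucket for a request (hypothetical helper)."""
    user_agent = headers.get("User-Agent", "")
    forwarded = headers.get("X-Forwarded-For", "")
    if remote_host in TRUSTED_PROXIES and forwarded:
        # The last entry is the one appended by the trusted proxy itself;
        # earlier entries are client-supplied and could be spoofed.
        client = forwarded.split(",")[-1].strip()
        return (client, user_agent)
    # Unknown senders are bucketed by their own address; the header is ignored.
    return (remote_host, user_agent)


a = bucket_key("aluminium.wikimedia.org", {"User-Agent": "ua", "X-Forwarded-For": "203.0.113.7"})
b = bucket_key("aluminium.wikimedia.org", {"User-Agent": "ua", "X-Forwarded-For": "203.0.113.8"})
c = bucket_key("192.0.2.1", {"User-Agent": "ua", "X-Forwarded-For": "203.0.113.7"})
```

Two users behind the proxy now get separate buckets (`a` and `b`), while a sender outside the allow-list cannot pick its own bucket by setting the header (`c`).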

Gehel claimed this task.

The specifics of this task are being addressed in T205607 and T200594 (most specifically in https://github.com/kartotherian/geoshapes/pull/1). I'm closing this task as the actual work is being tracked.