WDQS requests from aluminium.wikimedia.org being throttled
Closed, Resolved · Public

Description

I am getting this in the WDQS logs on wdqs1003:

IP:aluminium.wikimedia.org UA:@kartotherian/geoshapes/0.0.13 (https://mediawiki.org/Maps) - A request is being throttled.

I'd like to figure out:

  1. What is this workload, and why isn't it using the internal cluster?
  2. Why is it sending so much traffic that it's being throttled?

Event Timeline

Restricted Application added projects: Wikidata, Discovery. · May 24 2018, 4:09 AM
Restricted Application added a subscriber: Aklapper.
Smalyshev updated the task description. · May 24 2018, 4:09 AM
MaxSem added a subscriber: MaxSem. · May 24 2018, 7:03 AM
  1. aluminium is a proxy used for accessing the outside world from inside of the cluster. Why not directly? No idea.
  2. Because it's being used for map generation, e.g. https://www.mediawiki.org/wiki/Help:Extension:Kartographer#GeoShapes_via_Wikidata_Query

Looking at the kartotherian configuration, I can't find a reference to WDQS, so I presume the endpoint is hardcoded.

We do configure a proxy in the kartotherian and tilerator configs. I'm not sure if it is used for anything else, but it sounds like a slightly bad idea to have kartotherian / tilerator retrieve stuff from the wild internet. If it is only used to access WDQS, we should remove it.

Since the traffic from every user arrives with the same IP / UA, we can expect kartotherian to exceed our usual throttling limits. I'm not sure if we should vastly increase the throttling limits for the internal WDQS cluster, or if we should add a whitelist for some UAs.
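To illustrate the point about a shared IP / UA, here is a minimal sketch of a throttle keyed on that pair. This is not the WDQS throttling code: the `FixedWindowThrottle` class, the `bucket_key` helper, and the limit of 3 are all hypothetical, chosen only to show that requests from different map viewers land in one bucket.

```python
from collections import defaultdict


class FixedWindowThrottle:
    """Toy throttle (hypothetical): at most `limit` requests per key in one window."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)

    def allow(self, key):
        # Count the request, then check it against the per-key limit.
        self.counts[key] += 1
        return self.counts[key] <= self.limit


def bucket_key(ip, user_agent):
    # Assumption: the throttle identifies a client by the (IP, User-Agent) pair.
    return (ip, user_agent)


throttle = FixedWindowThrottle(limit=3)
proxy_key = bucket_key("aluminium.wikimedia.org", "@kartotherian/geoshapes/0.0.13")

# Five different map viewers, but every request reaches the throttle with the same key.
results = [throttle.allow(proxy_key) for _viewer in range(5)]
print(results)  # [True, True, True, False, False]
```

With this kind of keying, the fourth and fifth viewers are rejected even though each of them sent a single request.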

@Mholloway / @Pnorman: could you have a look as well?

We need to look at what kind of queries we get from kartotherian. In general, since these are basically user queries, we may want to apply the same limits as for the rest of the user traffic. I've created T195559 for evaluating whether we need to switch it to the internal cluster or not.

Smalyshev triaged this task as Normal priority. · May 25 2018, 5:48 AM

We don't want to use the internal endpoint since kartotherian allows arbitrary queries. We may want to set headers because it is essentially acting as a proxy, and throttling should be per user of kartotherian, not for all of kartotherian.

Smalyshev added a comment. · Edited · Jul 10 2018, 7:35 PM

> We don't want to use the internal endpoint since kartotherian allows arbitrary queries.

Makes sense.

> We may want to set headers because it is essentially acting as a proxy, and throttling should be per user of kartotherian, not for all of kartotherian.

Yes, that would be the best solution. Since IP + User-Agent no longer identifies the individual user here, maybe add something to the user agent string? Or an X-Forwarded-For header, or some other custom header?
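As a rough sketch of the header idea, the proxying service could forward the end user's address in X-Forwarded-For, and the throttle could use it when the request comes from a known proxy. Everything here is an assumption for illustration: `outgoing_headers`, `throttle_key`, and the `TRUSTED_PROXIES` set are hypothetical names, and this is not how kartotherian or WDQS actually implement it.

```python
# Hypothetical set of peers whose X-Forwarded-For header is trusted.
TRUSTED_PROXIES = {"aluminium.wikimedia.org"}


def outgoing_headers(client_ip, incoming_xff=None,
                     base_ua="@kartotherian/geoshapes/0.0.13"):
    """Build headers for the proxied request, appending the end user's address."""
    chain = f"{incoming_xff}, {client_ip}" if incoming_xff else client_ip
    return {"User-Agent": base_ua, "X-Forwarded-For": chain}


def throttle_key(peer, headers):
    """Pick the throttling key: the forwarded client for trusted proxies, else the peer."""
    ua = headers.get("User-Agent", "")
    xff = headers.get("X-Forwarded-For")
    if peer in TRUSTED_PROXIES and xff:
        # The right-most entry is the one appended by the trusted proxy itself.
        client = xff.split(",")[-1].strip()
        return (client, ua)
    # Untrusted peers cannot claim someone else's address.
    return (peer, ua)


key_a = throttle_key("aluminium.wikimedia.org", outgoing_headers("203.0.113.7"))
key_b = throttle_key("aluminium.wikimedia.org", outgoing_headers("198.51.100.9"))
print(key_a != key_b)  # True: two viewers now fall into separate buckets
```

Trusting the header only from a known proxy matters: otherwise any external client could set X-Forwarded-For itself and sidestep the per-client limit.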

Gehel closed this task as Resolved. · Oct 3 2018, 8:50 AM
Gehel claimed this task.

The specifics of this task are being addressed in T205607 and T200594 (most specifically in https://github.com/kartotherian/geoshapes/pull/1). I'm closing this task as the actual work is being tracked there.