
WDQS should use internal endpoint to communicate to Wikidata
Open, High · Public · 5 Estimated Story Points

Description

As discovered in T199146, WDQS uses the external endpoint (www.wikidata.org) through a proxy to talk to Wikidata. It should talk directly to api.svc.${site}.wmnet instead.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Doesn't seem to work. If I go to https://api.svc.eqiad.wmnet/wiki/Special:EntityData/Q2408871.ttl?nocache=1530836328152&flavor=dump I get this:

<h1>Domain not configured</h1>
<p>This domain points to a <a href="https://www.wikimedia.org">Wikimedia Foundation</a> server, but is not configured on this server.</p>

I guess it needs some kind of a hostname to work?

@BBlack following your comments on T199146, do you know a way to access via api.svc but still have the request routed to the correct wiki?

Smalyshev changed the task status from Open to Stalled. Jul 11 2018, 6:35 PM
Smalyshev triaged this task as Low priority.

It's a complicated topic I think, on our end. There are ways to make it work today, but when I try to write down generic steps any internal service could take to talk to any other (esp MW or RB), it bogs down in complications that are probably less than ideal in various language/platform contexts.

For this very particular case, the simplest way would be to do your language/platform/library's equivalent of:

curl -H 'Host: www.wikidata.org' 'https://appservers-ro.discovery.wmnet/wiki/Special:EntityData/Q2408871.ttl?nocache=1530836328152&flavor=dump'

That is, use the internal service endpoint hostname in the URI for TLS connection purposes, but then explicitly set the request Host header to www.wikidata.org for use at the HTTP level.

Whether you need appservers-rw, api-ro, restbase-async (...) for a particular URL path underneath www.wikidata.org in other cases is the deep complication here....
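[Editor's note: the curl recipe above translates to most HTTP libraries as "connect to one hostname, send another in the Host header". A minimal Python sketch, using the hostnames from the comment above; the request is only constructed here, never actually sent:]

```python
import urllib.request

# Connect to the internal discovery endpoint, but present the external
# hostname at the HTTP level via an explicit Host header -- the equivalent of:
#   curl -H 'Host: www.wikidata.org' 'https://appservers-ro.discovery.wmnet/...'
INTERNAL = "https://appservers-ro.discovery.wmnet"
PATH = "/wiki/Special:EntityData/Q2408871.ttl?flavor=dump"

req = urllib.request.Request(INTERNAL + PATH,
                             headers={"Host": "www.wikidata.org"})

# The TCP/TLS connection targets the internal endpoint...
print(req.host)                # appservers-ro.discovery.wmnet
# ...while MediaWiki routes the request by the overridden Host header.
print(req.get_header("Host"))  # www.wikidata.org
```

Note that, as described above, the TLS layer (SNI and certificate validation) would use the internal `discovery.wmnet` name, while only the HTTP-level routing sees `www.wikidata.org`.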

@BBlack I am getting rather strange results with appservers-ro.discovery.wmnet: if I call the URL you provided, the call takes a lot of time:

real 0m4.270s

while if I call to www.wikidata.org, I get:

real 0m0.127s

Same with api-ro. appservers-rw is a bit faster:

real 0m0.320s

But still 3x slower than going through the frontend (and it's not caching: I changed the URL, the result is the same, and the Varnish responses all say "miss").

Is this still true? I see

deploy1001:~$ for i in appservers-ro appservers-rw api-ro api-rw ; do echo -n $i; time curl -s -o /dev/null -X GET -H 'Host: www.wikidata.org' "https://${i}.discovery.wmnet/wiki/Special:EntityData/Q2408871.ttl?nocache=1530836328152&flavor=dump" ; done
appservers-ro
real	0m0.097s
user	0m0.020s
sys	0m0.012s
appservers-rw
real	0m0.113s
user	0m0.028s
sys	0m0.000s
api-ro
real	0m0.113s
user	0m0.024s
sys	0m0.012s
api-rw
real	0m0.128s
user	0m0.028s
sys	0m0.004s

Note that it's quite important where the tests are run from. That is, the active DC is going to be faster anyway. Running them from codfw yields entirely different results, as there are back-and-forths between the DCs before the request can be served.

deploy2001:~$ for i in appservers-ro appservers-rw api-ro api-rw ; do echo -n $i; time curl -s -o /dev/null -H 'Host: www.wikidata.org' "https://${i}.discovery.wmnet/wiki/Special:EntityData/Q2408871.ttl?nocache=1530836328152&flavor=dump" ; done
appservers-ro
real	0m6.435s
user	0m0.024s
sys	0m0.012s
appservers-rw
real	0m0.323s
user	0m0.012s
sys	0m0.020s
api-ro
real	0m5.061s
user	0m0.028s
sys	0m0.004s
api-rw
real	0m0.276s
user	0m0.032s
sys	0m0.000s

Which are comparable to the numbers posted above.

One thing has changed though: WDQS no longer uses nocache for cache-busting in most common cases (see T217897 for more details). So I am not sure using the internal endpoint still makes sense.

It's not just about caches though. It's also about easier service level operations, e.g. switchover between DCs becomes easier and less error prone if the internal endpoint is used instead of the external one.

But won't we lose use of the varnish cache if we use the internal endpoint?

Yes that's true. That being said, is that particularly important? Will WDQS fail in spectacular ways if it requests objects over the uncached endpoints?

It won't fail, but it would increase WDQS's load on Wikidata several times over, since we have 14 servers (not counting external clients, test servers, etc.) that want this data. It probably won't be catastrophic (we ran for years without caching before) but it would tax the Wikidata servers more, and some loads would occasionally fail, which may lead to some data not being properly updated.

Given the recent issues with WDQS I would like this to have higher priority. Currently, any time I have checked, Wikidata's top requester is WDQS:

0: jdbc:hive2://an-coord1001.eqiad.wmnet:1000> select user_agent, count(*) as hitcount from wmf.webrequest where uri_host = 'www.wikidata.org' and year = 2019 and month = '08' and day = 6 and hour = 05 and (dt like '2019-08-06T05:38%') group by user_agent order by hitcount desc limit 50;
<...>
user_agent	hitcount
Wikidata Query Service Updater	14892
Object Revision Evaluation Service <ahalfaker@wikimedia.org>	3480

The hit count of the top external requester is an order of magnitude smaller than WDQS's (ORES is also internal, but that's another issue).

I strongly disagree with timing isolated requests and benchmarking just one request. Wikidata caches lots of bits of every request, so a second request for the same thing is usually faster. Not to mention database caches.

The other thing I want to mention, which was missing here, is the overhead of encryption and TLS handshakes. In @BBlack's example we still use TLS, but a plain HTTP request is considerably faster (saving the overhead of both encryption and decryption):

ladsgroup@mwmaint1002:~$ time curl -H 'Host: www.wikidata.org' 'http://appservers-ro.discovery.wmnet/wiki/Special:EntityData/Q7251.ttl?revision=992109551&flavor=dump' > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  123k    0  123k    0     0   508k      0 --:--:-- --:--:-- --:--:--  510k

real	0m0.256s
user	0m0.008s
sys	0m0.004s

Unless there's any reason to encrypt requests internally, I think this would help us greatly.

In the same DC the numbers are comparable, e.g.

akosiaris@deploy1001:$ ab -n 100 -c 5 -H "Host: www.wikidata.org" 'http://appservers-ro.discovery.wmnet/wiki/Special:EntityData/Q7251.ttl?revision=992109551&flavor=dump'
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking appservers-ro.discovery.wmnet (be patient).....done


Server Software:        mw1250.eqiad.wmnet
Server Hostname:        appservers-ro.discovery.wmnet
Server Port:            80

Document Path:          /wiki/Special:EntityData/Q7251.ttl?revision=992109551&flavor=dump
Document Length:        125989 bytes

Concurrency Level:      5
Time taken for tests:   5.984 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      12655928 bytes
HTML transferred:       12598900 bytes
Requests per second:    16.71 [#/sec] (mean)
Time per request:       299.186 [ms] (mean)
Time per request:       59.837 [ms] (mean, across all concurrent requests)
Transfer rate:          2065.49 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:   209  281 156.7    244    1324
Waiting:      208  280 156.7    243    1323
Total:        209  281 156.7    244    1324

Percentage of the requests served within a certain time (ms)
  50%    244
  66%    253
  75%    264
  80%    271
  90%    317
  95%    375
  98%   1044
  99%   1324
 100%   1324 (longest request)

vs

akosiaris@deploy1001:$ ab -n 100 -c 5 -H "Host: www.wikidata.org" 'https://appservers-ro.discovery.wmnet/wiki/Special:EntityData/Q7251.ttl?revision=992109551&flavor=dump'
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking appservers-ro.discovery.wmnet (be patient).....done


Server Software:        mw1269.eqiad.wmnet
Server Hostname:        appservers-ro.discovery.wmnet
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-AES256-GCM-SHA384,256,256
TLS Server Name:        www.wikidata.org

Document Path:          /wiki/Special:EntityData/Q7251.ttl?revision=992109551&flavor=dump
Document Length:        125989 bytes

Concurrency Level:      5
Time taken for tests:   5.385 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      12653400 bytes
HTML transferred:       12598900 bytes
Requests per second:    18.57 [#/sec] (mean)
Time per request:       269.239 [ms] (mean)
Time per request:       53.848 [ms] (mean, across all concurrent requests)
Transfer rate:          2294.77 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        2    3   1.5      3      14
Processing:   220  262  39.2    255     485
Waiting:      218  260  39.2    254     483
Total:        223  265  39.2    258     488

Percentage of the requests served within a certain time (ms)
  50%    258
  66%    264
  75%    269
  80%    275
  90%    294
  95%    331
  98%    475
  99%    488
 100%    488 (longest request)

In fact, if anything, HTTPS requests were typically faster in these benchmarks, proving your point about MediaWiki + caching. It seems like the slow part here is MediaWiki, not TLS termination.

Across DCs this is of course going to be very different, but depending on the payload, that is possibly where TLS is warranted the most.

Anyway, my point is more on the grounds "Don't ditch TLS unless it's absolutely warranted" than anything else. I do agree with you that this should be moved forward.

Ladsgroup changed the task status from Stalled to Open. Aug 6 2019, 3:03 PM

The TLS overhead will be relatively bigger for smaller items, but I understand: if SRE thinks it should be encrypted, then it should be encrypted.

I took the liberty to reopen this. Using the internal node even with TLS is still a noticeable performance gain, not to mention other benefits of it.

Gehel raised the priority of this task from Low to High. Feb 17 2021, 8:09 AM
Gehel moved this task from Operations to All WDQS-related tasks on the Wikidata-Query-Service board.

Let's reprioritize this given the recent issues we've seen.

For what is worth, we now have the services proxy (envoy based) with persistent connections and doing TLS on its own so any costs from switching to TLS connections to the internal LVS services will be largely mitigated. In fact, if anything I expect the latencies from that part of the equation to decrease since it won't have to go through a proxy and the edge caches. The question of whether bypassing the edge caches will hugely increase the load on mediawiki still stands, but there have been many changes on the mediawiki caching infrastructure too (e.g. we now have onhost memcached) so that might very well be largely mitigated as well.
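[Editor's note: the point about persistent connections amortizing TLS/connection setup can be illustrated generically. This toy Python sketch is not the envoy setup itself; it spins up a throwaway local HTTP server and reuses one keep-alive connection for several requests, so the connection setup (and, in the TLS case, the handshake) is paid only once:]

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    # HTTP/1.1 enables keep-alive, so the client can reuse the connection.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Throwaway local server on an ephemeral port (stand-in for any backend).
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One persistent connection, reused for several requests: a single TCP
# setup up front instead of one per request.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
bodies = []
for _ in range(3):
    conn.request("GET", "/")
    bodies.append(conn.getresponse().read())
conn.close()
server.shutdown()
print(bodies)
```

The services proxy does the same thing one layer down (plus TLS), which is why per-request handshake costs largely disappear from the comparison above.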

I think we ought to revisit this indeed. Having the updater go through 4 extra layers of the infrastructure (the outgoing proxy + 3 layers of edge caches), one of which (the outgoing proxy) is in no way deemed critical enough to have high availability, helps with neither easy debugging nor ease of operations during maintenance/emergencies.

The new Flink-based WDQS updater (T244590) will mitigate the potential caching issues, since it will run centrally rather than duplicating the work on each WDQS node. Since it is expected to be completed this quarter, it makes sense to leave the current updater as-is, but ensure we go through the proper channels for the new one.

The new updater is currently running on the analytics network (working on getting the k8s deployment ready). We could set it up to use appservers-ro, but I think a hole needs to be opened between the two networks (see the similar issue in T274951).

We're moving to production on k8s soon enough; I don't think we should fix this on the analytics network.

In terms of implementation in our new updater, the comment from @BBlack is the starting point:

For this very particular case, the simplest way would be to do your language/platform/library's equivalent of:

curl -H 'Host: www.wikidata.org' 'https://appservers-ro.discovery.wmnet/wiki/Special:EntityData/Q2408871.ttl?nocache=1530836328152&flavor=dump'

That is, use the internal service endpoint hostname in the URI for TLS connection purposes, but then explicitly set the request Host header to www.wikidata.org for use at the HTTP level.

Indeed. But with a minor correction: instead of appservers-ro, please use api-ro in order to hit the API cluster, as the appserver cluster is meant to serve end-user browsers.
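[Editor's note: putting the thread's conclusion together — internal api-ro endpoint, external hostname in the Host header — a small sketch of how an updater might build its fetch. The helper name and structure are illustrative only, not the actual WDQS updater code:]

```python
# Conclusion of this thread: connect to the internal API cluster endpoint
# (api-ro), route the request with an explicit Host header.
# entity_data_request is a hypothetical helper, not real updater code.
INTERNAL_ENDPOINT = "https://api-ro.discovery.wmnet"
WIKI_HOST = "www.wikidata.org"

def entity_data_request(entity_id: str, flavor: str = "dump"):
    """Return (url, headers) for fetching an entity's RDF via the internal endpoint."""
    url = f"{INTERNAL_ENDPOINT}/wiki/Special:EntityData/{entity_id}.ttl?flavor={flavor}"
    headers = {"Host": WIKI_HOST}
    return url, headers

url, headers = entity_data_request("Q2408871")
print(url)      # https://api-ro.discovery.wmnet/wiki/Special:EntityData/Q2408871.ttl?flavor=dump
print(headers)  # {'Host': 'www.wikidata.org'}
```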