- Wikifunctions 'dereferences' ZIDs by calling the Wikifunctions action API, and QIDs/etc. by calling Wikidata's
- These fetches take ~80ms when run from the (bare metal) deployment host into the mw-k8s service
- `time curl -Is 'https://www.wikifunctions.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'`
- `time curl -Is 'https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1'`
- These fetches seem to take ~120ms (recently down from ~240ms?) when run from inside the k8s containers ("seem to" because mere mortals can't run curl commands from within the running pods, so this is based on inferred results)
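When chasing numbers like these, curl's write-out timers can split a single request into DNS/TCP/TLS/TTFB phases. A minimal sketch, assuming it is run somewhere the mesh listener on 127.0.0.1:6501 exists (e.g. a deployment host); the `curl_phases` helper name is made up:

```shell
# Hypothetical helper: print per-phase timings for one request.
# $1 = URL, $2 = optional Host header (defaults to www.wikifunctions.org).
curl_phases() {
  curl -s -o /dev/null -H "Host: ${2:-www.wikifunctions.org}" \
    -w 'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
    "$1"
}

# e.g., via the mesh from a deployment host:
# curl_phases 'http://127.0.0.1:6501/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'
```

For plain-HTTP mesh requests the `tls` phase stays at 0; against the public HTTPS URLs it shows what the handshake costs.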
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | ecarg | T383157 [25Q3] Improve performance and drive down tech debt & developer inhibitors |
| Resolved | | cmassaro | T383806 [spike] find out where the delay is coming from |
| Resolved | | cmassaro | T388683 Fetching of JSON objects from the Wikifunctions and Wikidata APIs takes longer from the orchestrator in k8s than from the deployment host |
Event Timeline
Running some benchmarks, I see
For https://www.wikifunctions.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4
Via the mesh
From a deployment server
```
deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'http://127.0.0.1:6501/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4' --no-parser
New configuration template added to /home/akosiaris/.siege
Run siege -C to view the current settings in that file
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 64.44,
  "data_transferred": 95.23,
  "response_time": 0.06,
  "transaction_rate": 155.18,
  "throughput": 1.48,
  "concurrency": 9.78,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 2.24,
  "shortest_transaction": 0.04
}
```
From an orchestrator
```
wikikube-worker2082:~$ sudo nsenter -t 94455 -n siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'http://127.0.0.1:6501/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4' --no-parser
New configuration template added to /root/.siege
Run siege -C to view the current settings in that file
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 60.65,
  "data_transferred": 95.23,
  "response_time": 0.06,
  "transaction_rate": 164.88,
  "throughput": 1.57,
  "concurrency": 9.89,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 0.40,
  "shortest_transaction": 0.04
}
```
Most numbers are consistent between the two sets, including latencies around the 60ms mark.
Directly to the internal service
From a deployment server
```
deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'https://mw-wikifunctions.discovery.wmnet:4451/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4' --no-parser
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 110.32,
  "data_transferred": 95.23,
  "response_time": 0.11,
  "transaction_rate": 90.65,
  "throughput": 0.86,
  "concurrency": 9.93,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 0.64,
  "shortest_transaction": 0.09
}
```
From an orchestrator
```
wikikube-worker2082:~$ sudo nsenter -t 94455 -n siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'https://mw-wikifunctions.discovery.wmnet:4451/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4' --no-parser
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 116.81,
  "data_transferred": 95.23,
  "response_time": 0.11,
  "transaction_rate": 85.61,
  "throughput": 0.82,
  "concurrency": 9.80,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 2.52,
  "shortest_transaction": 0.09
}
```
Results look awfully similar again: 110ms response time. In this case siege has to do the HTTPS initialization itself, which showcases very nicely how expensive it is.
The latter example also has a higher longest transaction, which implies a higher p95. But p95 is very sensitive to capacity issues and thus varies a lot between runs (I did witness that). I assume what you are witnessing is consistent and doesn't happen 1 out of 20 times.
Via the CDN (straight URLs as given)
From deployment server
```
deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'https://www.wikifunctions.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4' --no-parser --no-follow
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 81.31,
  "data_transferred": 88.91,
  "response_time": 0.08,
  "transaction_rate": 122.99,
  "throughput": 1.09,
  "concurrency": 9.80,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 0.76,
  "shortest_transaction": 0.05
}
```
From an orchestrator
This can't be run. By design, workloads in WikiKube are less privileged than the legacy infrastructure and can't talk to the CDN.
For what it's worth, the response time here is in the 80ms range listed in the task description. Also, part of the latency here is probably because we are going through the CDN. I've double-checked the cache-control (private, must-revalidate, max-age=0) and x-cache (pass and miss values) headers; these aren't cached requests. The extra 20ms (over the 60ms via the mesh) is probably CDN overhead.
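That header check can be scripted; a sketch (the helper name is made up, and it obviously needs a host that is allowed to talk to the CDN):

```shell
# Hypothetical helper: show only the cache-relevant headers of a response.
cdn_cache_headers() {
  curl -sI "$1" | grep -iE '^(cache-control|x-cache):'
}

# e.g.:
# cdn_cache_headers 'https://www.wikifunctions.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'
```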
For https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1
Via the mesh
From a deployment server
```
deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'http://127.0.0.1:6501/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1' --no-parser --no-follow
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 144.39,
  "data_transferred": 547.38,
  "response_time": 0.14,
  "transaction_rate": 69.26,
  "throughput": 3.79,
  "concurrency": 9.85,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 2.23,
  "shortest_transaction": 0.10
}
```
From an orchestrator
```
wikikube-worker2082:~$ sudo nsenter -t 94455 -n siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'http://127.0.0.1:6501/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1' --no-parser --no-follow
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 143.48,
  "data_transferred": 547.38,
  "response_time": 0.14,
  "transaction_rate": 69.70,
  "throughput": 3.82,
  "concurrency": 9.89,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 1.35,
  "shortest_transaction": 0.10
}
```
Consistent again: 140ms latencies in both cases.
Directly to the internal service
From a deployment server
```
deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'https://mw-api-int.discovery.wmnet:4446/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1' --no-parser --no-follow
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 220.41,
  "data_transferred": 547.38,
  "response_time": 0.21,
  "transaction_rate": 45.37,
  "throughput": 2.48,
  "concurrency": 9.38,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 19.03,
  "shortest_transaction": 0.15
}
```
From an orchestrator
```
wikikube-worker2113:~$ sudo nsenter -t 620502 -n siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'https://mw-api-int.discovery.wmnet:4446/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1' --no-parser --no-follow
New configuration template added to /root/.siege
Run siege -C to view the current settings in that file
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 196.65,
  "data_transferred": 547.38,
  "response_time": 0.19,
  "transaction_rate": 50.85,
  "throughput": 2.78,
  "concurrency": 9.68,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 3.13,
  "shortest_transaction": 0.15
}
```
Numbers here are a bit different, about 20ms worse in the orchestrator case. Still the same order of magnitude, around the 200ms mark. Unless there is something wrong in the configuration of wikifunctions -> wikidata, this IS NOT the path being exercised, as it doesn't go via the mesh.
Again, siege has to pay the TLS toll that the mesh's HTTPS persistent connections avoid.
The difference in host and PID is because the orchestrator apparently isn't allowed to reach out to wikidata, but the wikifunctions.org MediaWiki is.
Via the CDN (straight URLs as given)
From deployment server
```
deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1' --no-parser --no-follow
{
  "transactions": 10000,
  "availability": 100.00,
  "elapsed_time": 160.23,
  "data_transferred": 534.23,
  "response_time": 0.16,
  "transaction_rate": 62.41,
  "throughput": 3.33,
  "concurrency": 9.92,
  "successful_transactions": 10000,
  "failed_transactions": 0,
  "longest_transaction": 0.88,
  "shortest_transaction": 0.12
}
```
From a wikifunctions orchestrator
This can't be run. By design, workloads in WikiKube are less privileged than the legacy infrastructure and can't talk to the CDN.
The same points as above apply here.
("seems to" because mere mortals can't run curl commands from within the running pods
Once we've upgraded to Kubernetes past 1.25, we should be able to address this with ephemeral debug containers (see https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/#ephemeral-container).
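For the record, that would look roughly like the following. This is a sketch only: the pod, container, and image names are hypothetical, and `kubectl debug` with `--target` needs a cluster recent enough to support ephemeral containers (stable in Kubernetes 1.25).

```shell
# Attach an ephemeral debug container that shares the target pod's
# namespaces, then run curl from inside it (names are hypothetical):
kubectl debug -it function-orchestrator-abc123 \
  --image=docker-registry.wikimedia.org/bookworm \
  --target=function-orchestrator -- \
  curl -s -H 'Host: www.wikifunctions.org' \
    'http://127.0.0.1:6501/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'
```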
> These fetches (seem to) take ~120ms (? recently down from ~240ms) when run from inside the k8s containers ("seems to" because mere mortals can't run curl commands from within the running pods, but this is based on surmised results)
Given that the benchmarks above don't corroborate this, may I ask how those results were surmised? More importantly, what exactly are you witnessing, and what kind of problems is it creating?
The ~120 ms number comes from our logging. You can see some details here.
We got those numbers by logging before and after the fetch calls from inside Node. But this is mostly in line with what you found, @akosiaris, with 110 ms. So is the discrepancy just due to HTTPS initialization? And perhaps a bit of overhead from Node's fetch?
The 110ms number was a case you shouldn't be exposed to: talking directly to the internal endpoints for the wikifunctions.org call. You should definitely be going via the mesh, which initializes the HTTPS connections and keeps them persistent precisely to avoid paying that cost repeatedly. Double-check your configuration, but if your code talks to http://127.0.0.1:6501 when reaching out to wikifunctions.org (with the proper Host header), it's good to go.
You should almost never find yourself paying HTTPS init costs. That's exactly what the service mesh was built to avoid.
The same goes for the wikidata.org benchmarks, although those endpoints are obviously more expensive anyway, which explains the different numbers.
> And perhaps a bit of overhead from Node's fetch?
It's a newer API as far as I know, so maybe there is something to that? I suggest running some timing benchmarks to prove or refute it.
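A quick way to time just Node's fetch end to end; a sketch assuming Node >= 18 (where `fetch` and `performance` are globals), with the helper name made up:

```shell
# Hypothetical helper: time one fetch, body download included.
# $1 = URL to fetch.
time_fetch() {
  node --input-type=module -e '
    const t0 = performance.now();
    const res = await fetch(process.argv[1]);
    await res.text(); // include the body download in the timing
    console.log(`status=${res.status} fetch_ms=${(performance.now() - t0).toFixed(1)}`);
  ' "$1"
}

# e.g. from inside the container, via the mesh:
# time_fetch 'http://127.0.0.1:6501/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'
```

Comparing that against a raw `http.request` timing of the same URL would isolate fetch's own overhead from the network cost.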