
Fetching of JSON objects from the Wikifunctions and Wikidata APIs takes longer from the orchestrator in k8s than from the deployment host
Closed, Resolved · Public

Description

  • Wikifunctions 'dereferences' ZIDs by calling Wikifunctions's action API, and QIDs/etc. by calling Wikidata's
  • These fetches take ~80ms when run from the (bare metal) deployment host into the mw-k8s service
    • time curl -Is 'https://www.wikifunctions.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'
    • time curl -Is 'https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1'
  • These fetches (seem to) take ~120ms (? recently down from ~240ms) when run from inside the k8s containers ("seems to" because mere mortals can't run curl commands from within the running pods, but this is based on surmised results)
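For reference, the per-phase breakdown behind numbers like the ~80ms above can be obtained with curl's `-w` write-out variables. A hedged sketch follows; the throwaway local server only exists to make it runnable anywhere, and in practice you would point curl at the wikifunctions.org or wikidata.org URLs above:

```shell
# Sketch: split a single request into phases with curl's -w timing
# variables (DNS, TCP connect, TLS handshake, time-to-first-byte, total).
# The local python server is a stand-in target; substitute the real API
# URL from the task description to measure the production endpoints.
python3 -m http.server 8099 --bind 127.0.0.1 >/dev/null 2>&1 &
SRV=$!
sleep 1
curl -so /dev/null \
  -w 'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  'http://127.0.0.1:8099/'
kill $SRV
```

Against an HTTPS endpoint, a large gap between `tls` and `tcp` points at handshake cost rather than server time.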

Event Timeline

Restricted Application added a subscriber: Aklapper.
Jdforrester-WMF renamed this task from Fetching of JSON objects from the Wikifunctions and Wikidata APIs takes longer in prod than from external machines, and even longer from the orchestrator in k8s than from the deployment host to Fetching of JSON objects from the Wikifunctions and Wikidata APIs takes longer from the orchestrator in k8s than from the deployment host.Mar 12 2025, 3:13 PM
Jdforrester-WMF removed cmassaro as the assignee of this task.
Jdforrester-WMF updated the task description. (Show Details)

Running some benchmarks, I see the following.

For https://www.wikifunctions.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4

Via the mesh

From a deployment server

deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'http://127.0.0.1:6501/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'  --no-parser
New configuration template added to /home/akosiaris/.siege
Run siege -C to view the current settings in that file

{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			       64.44,
	"data_transferred":		       95.23,
	"response_time":		        0.06,
	"transaction_rate":		      155.18,
	"throughput":			        1.48,
	"concurrency":			        9.78,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		        2.24,
	"shortest_transaction":		        0.04
}

From an orchestrator

wikikube-worker2082:~$ sudo nsenter -t 94455 -n siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'http://127.0.0.1:6501/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'  --no-parser
New configuration template added to /root/.siege
Run siege -C to view the current settings in that file

{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			       60.65,
	"data_transferred":		       95.23,
	"response_time":		        0.06,
	"transaction_rate":		      164.88,
	"throughput":			        1.57,
	"concurrency":			        9.89,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		        0.40,
	"shortest_transaction":		        0.04
}

Most numbers are consistent between the two sets, including latencies at the 60ms mark.

Directly to the internal service

From a deployment server

deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'https://mw-wikifunctions.discovery.wmnet:4451/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'  --no-parser

{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			      110.32,
	"data_transferred":		       95.23,
	"response_time":		        0.11,
	"transaction_rate":		       90.65,
	"throughput":			        0.86,
	"concurrency":			        9.93,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		        0.64,
	"shortest_transaction":		        0.09
}

From an orchestrator

wikikube-worker2082:~$ sudo nsenter -t 94455 -n siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'https://mw-wikifunctions.discovery.wmnet:4451/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'  --no-parser

{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			      116.81,
	"data_transferred":		       95.23,
	"response_time":		        0.11,
	"transaction_rate":		       85.61,
	"throughput":			        0.82,
	"concurrency":			        9.80,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		        2.52,
	"shortest_transaction":		        0.09
}

Results look very similar again: a 110ms response time. In this case siege has to do the HTTPS initialization itself, showcasing very nicely how expensive it is.

The latter example also has a higher longest transaction, which implies a higher p95. But p95 is very sensitive to capacity issues and thus varies a lot between runs (I did witness that). I assume what you are witnessing is consistent and doesn't happen just 1 out of 20 times.

Via the CDN (straight URLs as given)

From deployment server

deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikifunctions.org' 'https://www.wikifunctions.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z1|Z2|Z4'  --no-parser --no-follow

{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			       81.31,
	"data_transferred":		       88.91,
	"response_time":		        0.08,
	"transaction_rate":		      122.99,
	"throughput":			        1.09,
	"concurrency":			        9.80,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		        0.76,
	"shortest_transaction":		        0.05
}

From an orchestrator

This can't be run. By design, workloads in WikiKube are less privileged than the legacy infrastructure and can't talk to the CDN.

For what it's worth, response time here is in the 80ms range listed in the task. Part of the latency is probably because we are going through the CDN. I've double-checked the cache-control (private, must-revalidate, max-age=0) and x-cache (pass and miss values) headers; these aren't cached requests. The extra 20ms (up from 60ms) is probably CDN overhead.
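That header check boils down to a `curl -sI` plus a filter. A hedged sketch follows; the inline one-shot server only fakes the production header value so the command is runnable offline, and against production you would point curl at the real API URL and also inspect x-cache:

```shell
# Sketch: confirm a response is not cached by inspecting its cache headers.
# The inline server fakes the cache-control value observed in production;
# point curl at the real URL to check the live cache-control and x-cache.
python3 -c '
import http.server
class H(http.server.BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Cache-Control", "private, must-revalidate, max-age=0")
        self.end_headers()
    def log_message(self, *args): pass
http.server.HTTPServer(("127.0.0.1", 8098), H).handle_request()
' &
sleep 1
curl -sI 'http://127.0.0.1:8098/' | grep -i 'cache-control'
```

A `cache-control: private, must-revalidate, max-age=0` plus `x-cache: ... pass/miss` means the CDN served the request without a cache hit.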

For https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1

Via the mesh

From a deployment server

deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'http://127.0.0.1:6501/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1'  --no-parser --no-follow

{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			      144.39,
	"data_transferred":		      547.38,
	"response_time":		        0.14,
	"transaction_rate":		       69.26,
	"throughput":			        3.79,
	"concurrency":			        9.85,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		        2.23,
	"shortest_transaction":		        0.10
}

From an orchestrator

wikikube-worker2082:~$ sudo nsenter -t 94455 -n siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'http://127.0.0.1:6501/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1'  --no-parser --no-follow

{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			      143.48,
	"data_transferred":		      547.38,
	"response_time":		        0.14,
	"transaction_rate":		       69.70,
	"throughput":			        3.82,
	"concurrency":			        9.89,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		        1.35,
	"shortest_transaction":		        0.10
}

Consistent again: 140ms latencies in both cases.

Directly to the internal service

From a deployment server

deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'https://mw-api-int.discovery.wmnet:4446/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1'  --no-parser --no-follow


{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			      220.41,
	"data_transferred":		      547.38,
	"response_time":		        0.21,
	"transaction_rate":		       45.37,
	"throughput":			        2.48,
	"concurrency":			        9.38,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		       19.03,
	"shortest_transaction":		        0.15
}

From an orchestrator

wikikube-worker2113:~$ sudo nsenter -t 620502 -n siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'https://mw-api-int.discovery.wmnet:4446/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1'  --no-parser --no-follow
New configuration template added to /root/.siege
Run siege -C to view the current settings in that file

{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			      196.65,
	"data_transferred":		      547.38,
	"response_time":		        0.19,
	"transaction_rate":		       50.85,
	"throughput":			        2.78,
	"concurrency":			        9.68,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		        3.13,
	"shortest_transaction":		        0.15
}

Numbers here are a bit different, about 20ms better in the orchestrator case, but still the same order of magnitude and around the 200ms mark. Unless there is something wrong in the wikifunctions -> wikidata configuration, this IS NOT the path being exercised, as it is not via the mesh.

Again, siege has to pay the TLS toll that the mesh's HTTPS persistent connections avoid.

The difference in host and PID is because the orchestrator apparently isn't allowed to reach out to wikidata, while wikifunctions.org's MediaWiki is.

Via the CDN (straight URLs as given)

From deployment server

deploy2002:~$ siege -b -r 1000 -c 10 -H 'Host: www.wikidata.org' 'https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q42%7CL1'  --no-parser --no-follow

{	"transactions":			       10000,
	"availability":			      100.00,
	"elapsed_time":			      160.23,
	"data_transferred":		      534.23,
	"response_time":		        0.16,
	"transaction_rate":		       62.41,
	"throughput":			        3.33,
	"concurrency":			        9.92,
	"successful_transactions":	       10000,
	"failed_transactions":		           0,
	"longest_transaction":		        0.88,
	"shortest_transaction":		        0.12
}

From a wikifunctions orchestrator

This can't be run. By design, workloads in WikiKube are less privileged than the legacy infrastructure and can't talk to the CDN.

The same points as above apply here.

("seems to" because mere mortals can't run curl commands from within the running pods

Once we've upgraded Kubernetes past 1.25, we should be able to address this with ephemeral debug containers (see https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/#ephemeral-container).
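As a sketch of what that would look like (the pod and container names and the image below are placeholders, not real workload names), `kubectl debug` attaches an ephemeral container to the running pod, from whose network namespace the timing test could be run directly:

```shell
# Illustrative only: attach an ephemeral debug container to a running pod
# and run the timing test from inside its network namespace.
# Pod, container, and image names are placeholders; requires k8s >= 1.25.
kubectl debug -it function-orchestrator-main-abc123 \
  --image=some-registry/debug-image-with-curl \
  --target=function-orchestrator-main \
  -- curl -so /dev/null -w '%{time_total}\n' \
  'http://127.0.0.1:6501/w/api.php?action=wikilambda_fetch&format=json&zids=Z1'
```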

These fetches (seem to) take ~120ms (? recently down from ~240ms) when run from inside the k8s containers ("seems to" because mere mortals can't run curl commands from within the running pods, but this is based on surmised results)

Given that the benchmarks above don't corroborate this, may I ask how those results were surmised? More importantly, what exactly are you witnessing, and what kind of problems is it creating?


Ping @ecarg and @cmassaro to answer here for completion.

The ~120 ms number comes from our logging. You can see some details here.

We got those numbers by logging before and after the fetch calls from inside Node. But this is mostly in line with what you found, @akosiaris, at 110 ms. So the discrepancy is just due to HTTPS initialization? And perhaps a bit of overhead from Node's fetch?

Does this mean we understand the situation sufficiently to Resolve this?

Does this mean we understand the situation sufficiently to Resolve this?

Seems that way to me, but I don't know what others think.

The ~120 ms number comes from our logging. You can see some details here.

We got those numbers by logging before and after the fetch calls from inside Node. But this is mostly in line with what you found, @akosiaris, at 110 ms. So the discrepancy is just due to HTTPS initialization?

The 110ms number was for a case you shouldn't be exposed to: talking directly to the internal endpoints for the wikifunctions.org call. You should definitely be going via the mesh, which initializes the HTTPS connections and keeps them persistent, to avoid paying exactly that cost repeatedly. Double-check your configuration, but if your code talks to http://127.0.0.1:6501 when reaching out to wikifunctions.org (with the proper Host header), it's good to go.

You should almost never find yourself paying HTTPS init costs; avoiding them is exactly what the Service Mesh was built for.
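The effect of connection reuse is easy to demonstrate: two transfers in one curl invocation share a connection, so the second reports a near-zero connect time, which is the same cost profile the mesh's persistent HTTPS connections give you for the TLS handshake. A hedged local sketch (plain HTTP and a throwaway keep-alive server, since the point here is the setup cost, not TLS itself):

```shell
# Sketch: two transfers in one curl invocation reuse the connection, so
# the second one reports ~zero connect time -- the setup cost the mesh's
# persistent connections avoid paying on every request.
python3 -c '
import http.server
class H(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # keep-alive, so curl can reuse the socket
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args): pass
http.server.HTTPServer(("127.0.0.1", 8097), H).handle_request()
' &
sleep 1
curl -so /dev/null -w 'connect=%{time_connect}s\n' \
  'http://127.0.0.1:8097/' 'http://127.0.0.1:8097/'
```

Against an HTTPS endpoint the same comparison on `%{time_appconnect}` shows the TLS handshake being skipped on the reused connection.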

The same goes for the wikidata.org benchmarks, although there the endpoints are obviously more expensive anyway, which explains the different numbers.

And perhaps a bit of overhead from Node's fetch?

It's a newer API as far as I know, so maybe there is something there. I suggest running some timing benchmarks to prove or refute that.

DSantamaria changed the task status from Open to In Progress.Apr 4 2025, 1:59 PM
DSantamaria assigned this task to cmassaro.