
Simulate client dispatch in a single scrape
Closed, ResolvedPublic

Description

From @Krinkle's estimation discussions on tagging cost, we learned that the main issue isn't the cost of storing or processing on the Prometheus server, but dispatching the metrics from our client in a single scrape. That's something we can simulate.

I'm creating the task so we can document the findings.

Event Timeline

+1 on the simulation! As an additional data point on the other big constraint (i.e. ingestion/processing), to give a sense of scale for the cardinality numbers we're talking about: the eqiad Prometheus ops instance (our biggest instance) ingests around 150k samples/s, all more or less on a 60s schedule.

In other words, scraping 9M metrics on a 60s interval from a single target would add 9,000,000 / 60 = 150,000 samples/s, matching all of eqiad ops metrics scraped combined!

Change 858678 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/navtiming@master] Simulate navtiming-prom cardinality

https://gerrit.wikimedia.org/r/858678

Status quo: Before

Current state

Beta cluster
krinkle@deployment-webperf21:~$

Nov16  51:53 /usr/bin/python3 /srv/deployment/performance/navtiming/run_navtiming.py

$ curl -s localhost:9230 | grep -v '^#' | grep messages_total
webperf_consumed_messages_total 1447.0
webperf_handled_messages_total{schema="NavigationTiming"} 380.0
webperf_handled_messages_total{schema="PaintTiming"} 612.0
webperf_handled_messages_total{schema="FirstInputTiming"} 203.0
webperf_handled_messages_total{schema="CpuBenchmark"} 189.0
webperf_handled_messages_total{schema="SaveTiming"} 63.0

$ curl -s localhost:9230 | grep -v '^#' | wc -l
2,484 lines
Production
Nov16 829:01 /usr/bin/python3 /srv/deployment/performance/navtiming/run_navtiming.py

krinkle@webperf1003:~$ curl -s localhost:9230 | grep -v '^#' | grep messages_total
webperf_consumed_messages_total 1.3425974e+07
webperf_handled_messages_total{schema="FirstInputTiming"} 1.343128e+06
webperf_handled_messages_total{schema="QuickSurveyInitiation"} 894124.0
webperf_handled_messages_total{schema="PaintTiming"} 5.317784e+06
webperf_handled_messages_total{schema="SaveTiming"} 569949.0
webperf_handled_messages_total{schema="CpuBenchmark"} 2.139178e+06
webperf_handled_messages_total{schema="NavigationTiming"} 3.123438e+06
webperf_handled_messages_total{schema="QuickSurveysResponses"} 34733.0

[1 minute later]
krinkle@webperf1003:~$ curl -s localhost:9230 | grep -v '^#' | grep messages_total
webperf_consumed_messages_total 1.3426211e+07
webperf_handled_messages_total{schema="FirstInputTiming"} 1.343147e+06
webperf_handled_messages_total{schema="QuickSurveyInitiation"} 894141.0
webperf_handled_messages_total{schema="PaintTiming"} 5.317876e+06
webperf_handled_messages_total{schema="SaveTiming"} 569966.0
webperf_handled_messages_total{schema="CpuBenchmark"} 2.139216e+06
webperf_handled_messages_total{schema="NavigationTiming"} 3.123492e+06
webperf_handled_messages_total{schema="QuickSurveysResponses"} 34733.0

krinkle@webperf1003:~$ curl -s localhost:9230 | grep -v '^#' | wc -l
86,853 lines

[24h later]
$ curl -s localhost:9230 | grep -v '^#' | wc -l
89,796 lines

time curl -s localhost:9230 > /dev/null
real	0m1.214s	user	0m0.015s	sys	0m0.016s
real	0m1.096s	user	0m0.018s	sys	0m0.019s
real	0m1.050s	user	0m0.011s	sys	0m0.010s
$ time curl -s --compressed localhost:9230 > /dev/null
real	0m1.115s	user	0m0.008s	sys	0m0.008s
real	0m1.127s	user	0m0.020s	sys	0m0.011s

$ curl -s localhost:9230 | wc -c
11,971,264 bytes
$ curl -s localhost:9230 | gzip -9 - | wc -c
499,351 bytes

Simulation 1: Near-maximum

Simulation code in patchset 3 of https://gerrit.wikimedia.org/r/c/performance/navtiming/+/858678/. Basically: for every incoming beacon, instead of emitting only the exact tagset for that one value, use a for-loop that emits thousands of slight variations (e.g. ignore the real country and emit the same value for every shortlisted country; same for the other tags). The number of values emitted should be insignificant given Prometheus' pull model and pre-aggregated exposition format: the buffer is not a stream of "+1 for X" increments like statsd, but simply a snapshot of running counters, so once a label variant is known it doesn't grow over time beyond the counter value itself becoming a larger int.
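To make the approach concrete, here is a minimal sketch of the idea (illustrative only: the metric name, label names, shortlists, and prometheus_client usage are assumptions, not the actual patch):

```python
from itertools import product
from prometheus_client import Histogram

# Hypothetical shortlists; the real ones live in navtiming's own lookup tables.
COUNTRIES = ['US', 'DE', 'IN', 'JP', 'other']
PLATFORMS = ['desktop', 'mobile']
BROWSERS  = ['chrome', 'firefox', 'safari', 'other']

responsestart = Histogram(
    'navtiming_responseStart_seconds',   # illustrative metric name
    'Time to responseStart',
    ['country', 'platform', 'browser'],
)

def handle_beacon_real(event):
    # Normal behaviour: one observation for the beacon's own tagset.
    country = event['country'] if event['country'] in COUNTRIES else 'other'
    responsestart.labels(country, event['platform'], event['browser']).observe(event['responseStart'])

def handle_beacon_simulated(event):
    # Simulation 1: re-emit the same value under every shortlisted combination,
    # so the exposition reaches its near-maximum cardinality almost immediately.
    # The set of series stops growing once every combination has been seen;
    # only the counter values keep increasing.
    for country, platform, browser in product(COUNTRIES, PLATFORMS, BROWSERS):
        responsestart.labels(country, platform, browser).observe(event['responseStart'])
```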

Copied to my home directory on the server, and then invoked with three notable changes to the CLI parameters: 1) --dry-run to skip the Statsd/Graphite output that would overlap with the real instance, 2) a dedicated Kafka consumer group (with its own offsets), as otherwise either the simulation or the real instance would receive no input, and 3) the Prometheus client on an alternate port.

Simulation 1
/usr/bin/python3 navtiming_krinkle.py --brokers kafka-jumbo1001.eqiad.wmnet:9093,kafka-jumbo1002.eqiad.wmnet:9093,kafka-jumbo1003.eqiad.wmnet:9093,kafka-jumbo1004.eqiad.wmnet:9093,kafka-jumbo1005.eqiad.wmnet:9093,kafka-jumbo1006.eqiad.wmnet:9093,kafka-jumbo1007.eqiad.wmnet:9093,kafka-jumbo1008.eqiad.wmnet:9093,kafka-jumbo1009.eqiad.wmnet:9093 --security-protocol SSL --ssl-cafile /etc/ssl/certs/wmf-ca-certificates.crt --consumer-group navtiming_krinkle --dry-run --listen localhost:9442

krinkle@webperf1003:~$ curl -s localhost:9442 | grep -v '^#' | wc -l
187,883 lines

[after 1min]
krinkle@webperf1003:~$ curl -s localhost:9442 | grep -v '^#' | wc -l
189,629 lines

[after 5min]
krinkle@webperf1003:~$ curl -s localhost:9442 | grep -v '^#' | wc -l
294,371 lines

time curl -s localhost:9442 > /dev/null
real	0m2.123s	user	0m0.027s	sys	0m0.008s
real	0m2.249s	user	0m0.017s	sys	0m0.022s

$ cat dump | wc -c
64,412,897 bytes

$ cat dump | gzip - | wc -c
1,206,661 bytes

Simulation 2: Natural

Simulation code in patchset 4 of https://gerrit.wikimedia.org/r/c/performance/navtiming/+/858678/. This one simulates only the new mw_skin field and otherwise acts 1:1 on the real input data. Unlike the previous simulation, it has to run for longer in order to be representative of the cost we're trying to measure.
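Roughly, the shape of that change is as follows (again an illustrative sketch rather than the patchset; how mw_skin is synthesised here is an assumption):

```python
import random
from prometheus_client import Histogram

# Hypothetical skin shortlist; the real beacons have no mw_skin field yet.
SKINS = ['vector-2022', 'vector', 'minerva', 'monobook', 'other']

responsestart = Histogram(
    'navtiming_responseStart_seconds',   # illustrative metric name
    'Time to responseStart',
    ['platform', 'mw_skin'],
)

def handle_beacon(event):
    # All real labels pass through 1:1; only mw_skin is simulated. Cardinality
    # therefore grows only as fast as real label combinations actually arrive.
    mw_skin = random.choice(SKINS)
    responsestart.labels(event['platform'], mw_skin).observe(event['responseStart'])
```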

/usr/bin/python3 navtiming_krinkle_lessfake.py --brokers kafka-jumbo1001.eqiad.wmnet:9093,kafka-jumbo1002.eqiad.wmnet:9093,kafka-jumbo1003.eqiad.wmnet:9093,kafka-jumbo1004.eqiad.wmnet:9093,kafka-jumbo1005.eqiad.wmnet:9093,kafka-jumbo1006.eqiad.wmnet:9093,kafka-jumbo1007.eqiad.wmnet:9093,kafka-jumbo1008.eqiad.wmnet:9093,kafka-jumbo1009.eqiad.wmnet:9093 --security-protocol SSL --ssl-cafile /etc/ssl/certs/wmf-ca-certificates.crt --consumer-group navtiming_krinkle --dry-run --listen localhost:9442

$ curl -s localhost:9442 | grep -v '^#' | wc -l
7,787 lines

[after 40min]
$ curl -s localhost:9442 | grep -v '^#' | wc -l
56,306 lines

[after 24h]
$ curl -s localhost:9442 | grep -v '^#' | wc -l
135,639 lines

$ time curl -s --compressed localhost:9442 > /dev/null
real	0m1.926s	user	0m0.019s	sys	0m0.018s
real	0m1.887s	user	0m0.024s	sys	0m0.008s

$ curl -s localhost:9442 | gzip -9 - | wc -c
888,302 bytes

Misc data

Filippo (@fgiunchedi) raised the point of HTTP compression, which would make a big difference. And indeed it does: a big win in terms of bandwidth, though with some cost for compressing multi-megabyte payloads. I confirmed in upstream code that the Prometheus scraper prefers HTTP compression: https://github.com/prometheus/prometheus/blob/6b53aeb012080ab2a50acd229bfe9943125abfa6/scrape/scrape.go#L824.
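For reference, the same raw-vs-gzipped measurement as the curl pipelines above, as a self-contained Python script (port 9230 is the real exporter on webperf1003; adjust as needed):

```python
import gzip
import urllib.request

# Fetch the exposition once and compare raw vs gzip -9 sizes, mirroring
# `curl -s localhost:9230 | wc -c` and `curl -s localhost:9230 | gzip -9 - | wc -c`.
body = urllib.request.urlopen('http://localhost:9230').read()
print('raw bytes:    ', len(body))
print('gzipped bytes:', len(gzip.compress(body, compresslevel=9)))
```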

High-level host metrics (CPU, mem, network) at https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=webperf1003&from=1668470400000&to=1668988800000&var-datasource=thanos&var-cluster=webperf from Nov 19 08:00 to Nov 20 22:00.

Conclusion: Another order of magnitude removed

What I've done is: 1) modify the part where we emit to Prometheus so that, for every incoming navtiming beacon from Kafka, a for-loop simulates every possible combination (e.g. instead of x = x if x in known_values else 'other', do for x in known_values:), and 2) call it with curl a few times to get data points such as gzipped byte size, non-comment line count, and a crude time(1) measure.

For Simulation 1, this gives us a theoretical maximum. It came out another order of magnitude lower than my spreadsheet estimate. Our first crude estimate last month was ~45 million metrics (not including the x10 for histogram buckets). We then adjusted the spreadsheet to remove the cache_host tag from all metrics other than navtiming_responseStart (i.e. the same as today in Graphite). This reduced the estimate to ~2.7 million.

In implementing Simulation 1, I realized that the spreadsheet format led to a silly inefficiency. Given we're special-casing the navtiming_responseStart metric, we don't actually need the usual tags and the cache_host tag on the same metric. That is, rather than have navtiming_responseStart carry all the usual tags *and* cache_host, we can emit the usual one as usual, and emit navtiming_responseStart_by_cachehost with only a cache_host tag. This makes quite a big difference, and would actually be more idiomatic within Prometheus conventions, as it means the main metric names all carry the same set of tags (principle of least surprise/POLA).
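Sketched with prometheus_client (metric and label names are illustrative, not the final navtiming ones):

```python
from prometheus_client import Histogram

# Spreadsheet revision 2 would have put everything on one metric, so the
# cardinalities multiply:
#   navtiming_responseStart{usual tags..., cache_status, cache_host}
#   ~ 10,500 usual combos x 4 cache_status x 60 cache_host
#
# Revision 3 splits it into two metrics, so the cardinalities add instead:

responsestart = Histogram(
    'navtiming_responseStart_seconds',               # illustrative name
    'Time to responseStart',
    ['platform', 'country', 'browser'],              # the "usual" tags (illustrative subset)
)

responsestart_by_cachehost = Histogram(
    'navtiming_responseStart_by_cachehost_seconds',  # illustrative name
    'Time to responseStart, by cache host',
    ['cache_status', 'cache_host'],                  # only 4 x 60 = 240 label combinations
)

def handle_beacon(event):
    # Each beacon feeds both metrics once; the label sets never combine.
    responsestart.labels(event['platform'], event['country'], event['browser']).observe(event['responseStart'])
    responsestart_by_cachehost.labels(event['cache_status'], event['cache_host']).observe(event['responseStart'])
```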

Spreadsheet revision 2:

  • 17 metrics * 10,500 normal tag combos = 179,520
  • 1 metric (responseStart) * 10,500 * 4 cache_status * 60 cache_host = 2,520,000

Improved further (revision 3):

  • 18 metrics (inc responseStart) * 10,500 = 189,000
  • 1 metric (responseStart_by_cachehost) * 4 * 60 = 240

This brought the maxed out estimate down from ~3 million to under 0.2M.
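A quick back-of-the-envelope check of those two revisions (assuming the ~10,500 tag-combination figure used above; the spreadsheet's exact per-metric count differs slightly):

```python
usual_combos = 10_500   # approximate tag combinations per metric (from the spreadsheet)
cache_status = 4
cache_host   = 60

rev2 = 17 * usual_combos + usual_combos * cache_status * cache_host
rev3 = 18 * usual_combos + cache_status * cache_host

print(rev2)  # ~2.7 million
print(rev3)  # 189,240 -> under 0.2M
```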

Conclusion: Cost

The current scrape time from the (relatively empty) Prometheus-navtiming client is ~1.2 seconds for emitting 86K lines, as measured with curl -s --compressed.

With the Simulation 1 process, scrape time increased to ~2.0 seconds for 300K lines. Note that in the context of scraping, the line counts tend to be about 10X larger than the logical metric estimates, since the output also includes meta metrics about the Python process itself, histogram buckets, and other meta counters.

With the Simulation 2 process, scrape time was at ~1.8 seconds with 140K lines. CPU and memory usage looked comparable to the current Python process. On the host overview board I eyeballed it as ~200M of extra RAM, which is slightly more than the current process at 140M (the OS-level increase was the full 200M because I'm running the simulation next to the real one). The 140M->200M difference lines up almost perfectly with the result size of curl -s | wc -c before gzip compression, which in the raw data above was ~62M (compressed: 800KB).

Krinkle triaged this task as High priority.Nov 21 2022, 5:50 PM

Change 858678 abandoned by Krinkle:

[performance/navtiming@master] Simulate navtiming-prom cardinality

Reason:

https://gerrit.wikimedia.org/r/858678