Migrate EventStreams to k8s deployment pipeline
Closed, ResolvedPublic13 Estimated Story Points
Actions

Assigned To

Authored By

	Ottomata
	Nov 19 2019, 2:54 PM

Details

Subject	Repo	Branch	Lines +/-
eventstreams: Remove profile	operations/puppet	production	+0 -134
lvs: Rename eventstreams-tls to eventstreams	operations/puppet	production	+1 -1
eventstreams: Remove from scb role	operations/puppet	production	+2 -93
eventstreams: Remove old lvs service	operations/puppet	production	+0 -42
eventstreams: Remove all conftool data	operations/puppet	production	+10 -10
eventstreams: Switch to service_setup	operations/puppet	production	+1 -1
eventstreams: Switch to lvs_setup	operations/puppet	production	+1 -1
eventstreams: Switch scb LVS to monitoring_setup	operations/puppet	production	+1 -1
DNM: eventstreams: switch lvs to kubernetes	operations/puppet	production	+2 -2
eventstreams - reduce resource requests now that mem leak is gone	operations/deployment-charts	master	+6 -6
eventstreams - bump image version to 2020-03-23-173308-production	operations/deployment-charts	master	+3 -3
Bump KafkaSSE dependency to 400721a to get memory leak workaround	mediawiki/services/eventstreams	master	+2 -2
eventstreams - bump image version to 2020-03-12-155244-production	operations/deployment-charts	master	+3 -3
eventstreams - default client_ip_connection_limit: 2 per worker	operations/deployment-charts	master	+191 -192
Add eventstreams_client_connections_total metric	mediawiki/services/eventstreams	master	+29 -3
eventstreams: Pass 0 seconds as "0s"	operations/deployment-charts	master	+202 -184
eventstreams: use latest envoy image	operations/deployment-charts	master	+3 -3
_tls_helpers: Double quote UPSTREAM_TIMEOUT value	operations/deployment-charts	master	+202 -184
eventstreams/envoy: Disable upstream_timout	operations/deployment-charts	master	+331 -388
Use ottomata/KafkaSSE.git#disconnecting temporarily	mediawiki/services/eventstreams	master	+2 -2
eventstreams: Increase CPU limits by 25%	operations/deployment-charts	master	+3 -3
eventstreams: Swith caching layer to TLS enabled eventstreams	operations/puppet	production	+1 -1
eventstreams: Move new LVS TLS IP to production state	operations/puppet	production	+1 -1
eventstreams: Switch to monitoring_setup	operations/puppet	production	+1 -1
eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits	operations/deployment-charts	master	+231 -241
eventstreams-tls: Switch state to lvs_setup	operations/puppet	production	+1 -1
Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb	operations/puppet	production	+65 -23
eventstreams: Add kubernetes hosts to conftool	operations/puppet	production	+12 -12
eventstreams - Use schema.wm.org stream schema documentation links	operations/deployment-charts	master	+19 -19
eventstreams - Use kafka brokers in codfw in codfw k8s cluster	operations/deployment-charts	master	+1 -1
eventstreams - bump limits to 2000m cpu, 1900mi memory, 8 replicas	operations/deployment-charts	master	+189 -39
eventstreams - bump cpu limit to 2000m for benchmarking	operations/deployment-charts	master	+7 -0
eventstreams - scrape prometheus port	operations/deployment-charts	master	+175 -156
eventstreams - fix missing '-' in '---', and '.' in template reference	operations/deployment-charts	master	+165 -147
eventstreams - Bump image version to latest	operations/deployment-charts	master	+1 -1
Use new service-runner metrics for built in prometheus metrics	mediawiki/services/eventstreams	master	+46 -219
eventstreams - define tls.telemetry in values.yaml files	operations/deployment-charts	master	+202 -178
Bump eventstreams chart version and repackage	operations/deployment-charts	master	+170 -142
Fix chart name in eventstreams staging helmfile	operations/deployment-charts	master	+1 -1
eventstreams: Create kubernetes token	operations/puppet	production	+15 -0
evenstreams: Add forgotten admin/ values files	operations/deployment-charts	master	+6 -0
eventstreams: Add the namespace and calico rules	operations/deployment-charts	master	+694 -0
eventstreams: Add k8s tokens	labs/private	master	+8 -0
New eventstreams chart	operations/deployment-charts	master	+1 K -0
Remove now unused deployment-prep eventstreams config	operations/puppet	production	+0 -25
Correct eventstreams repo name to mediawiki/services/eventstreams	integration/config	master	+1 -1
Setup eventstreams service-pipeline-test-and-publish	integration/config	master	+5 -0

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Open		None	T198901 Migrate production services to kubernetes using the pipeline
		Resolved		Ottomata	T238658 Migrate EventStreams to k8s deployment pipeline

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Ping @colewhite on all the service runner prometheus stuff above in the last 2 comments! :) Re T205870: Fully migrate producers off statsd and https://github.com/wikimedia/service-runner/pull/221

Ottomata moved this task from In Code Review to Ready to Deploy on the Analytics-Kanban board.Mar 6 2020, 3:10 PM

merged and pooled 1 kubernetes host with a very low weight (1) in the pool and that on just in eqiad

AWESOME! yeah let's turn it on more next week.

those panels still reference service_runner_request_duration_seconds_bucket

Ah, oops, I changed them but never hit save!

you might also add a eventstreams_connected_clients_total that is a counter. It has the benefit of always increasing so it can show events that last less than the polling period.

Indeed, if _total is introduced I believe this gauge can be dropped as it'll be redundant.

Hm, will it? The rate of change is very small. Most folks connect and then stay connect for very long periods of time. If I want to track the number of concurrent connections at any given time. The total number of connections ever isn't very useful. Is it?

Agreed, bear in mind also cardinality here if we're adding metrics for a lot of paths.

For eventstreams, there aren't a lot of paths, but for other service-runner services there might be.

Not sure where best to discuss the overall service-runner prometheus behavior, but possibly on T205870? Maybe that ticket should have a subticket specificly for service-runner?

I created a PR to service-runner for the updates to heapwatch metrics. Thanks for the feedback!

The express_router_request_duration_seconds is a harder nut to crack. The metric is generated in a low-level request wrapper in Hyperswitch/service-template-node and is used nearly everywhere. Restbase in particular makes heavy use of the path label in its dashboard. Converting it to a histogram is easy enough, but in combination with the path variable will generate even more metrics (one per bucket plus count and sum) and corresponding cardinality in the labels.

Tracking duration metrics per path seems to me an anti-pattern, especially so when url path parameters enter the mix. Unfortunately, current functionality and dashboards will be broken without it. Tracking overall request duration seems to me a reasonable substitute, but loses the granularity to track down which requests are taking the longest.

Maybe tracking request duration, perhaps greater than a configurable lower bound, may be a job for logging?

Agreed, bear in mind also cardinality here if we're adding metrics for a lot of paths.

Perhaps using a single metric name e.g. 'express_router_request_duration_seconds' for all services is a bad idea? Maybe these should be named per service instead? Then the path label cardinality issue would only be an issue for those services with a lot of paths.

In T238658#5948568, @Ottomata wrote:

merged and pooled 1 kubernetes host with a very low weight (1) in the pool and that on just in eqiad

AWESOME! yeah let's turn it on more next week.

those panels still reference service_runner_request_duration_seconds_bucket

Ah, oops, I changed them but never hit save!

you might also add a eventstreams_connected_clients_total that is a counter. It has the benefit of always increasing so it can show events that last less than the polling period.

Indeed, if _total is introduced I believe this gauge can be dropped as it'll be redundant.

Hm, will it? The rate of change is very small. Most folks connect and then stay connect for very long periods of time. If I want to track the number of concurrent connections at any given time. The total number of connections ever isn't very useful. Is it?

It does have it's usefulness. As I pointed out, gauges have the problem that you will never get insights into events that last less than the current polling period (60s currently). Counters have the capability to expose that. But your point about the long lived connections is correct. I 'd say keep both?

Converting it to a histogram is easy enough, but in combination with the path variable will generate even more metrics (one per bucket plus count and sum) and corresponding cardinality in the labels.

That's true, on the other hand, it's can be useful at times. It might be one of the things that's fine to pay the cost for.

Tracking duration metrics per path seems to me an anti-pattern, especially so when url path parameters enter the mix.

Why is that? It's interesting at the very least to be able to have input into which parts of an application could benefit from looking into them more performance wise. For example, looking into https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&refresh=1m&fullscreen&panelId=20, you can tell that the most expensive endpoints are:

--domain_v1_feed_onthisday_events_--mm_--dd
--domain_v1_media_image_featured_--yyyy_--mm_--dd

In fact it's also useful after outages. e.g. this is Kobe Bryant for wikifeeds

https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1580052209253&to=1580087199801&fullscreen&panelId=20

Aside from the obvious things (e.g. getting the picture was performing badly, getting the featured page was also performing badly), there's information into the fact that the deaths endpoint could not return anything (it's conspicuously missing from the metrics).

Unfortunately, current functionality and dashboards will be broken without it. Tracking overall request duration seems to me a reasonable substitute, but loses the granularity to track down which requests are taking the longest.

Sure, but it does lack some of the things I showcased above. Maybe it's a cost worth paying?

In T238658#5953276, @akosiaris wrote:

It does have it's usefulness. As I pointed out, gauges have the problem that you will never get insights into events that last less than the current polling period (60s currently). Counters have the capability to expose that. But your point about the long lived connections is correct. I 'd say keep both?

+1 on keeping both in this case. The total connections is generally useful to calculate rates, which we don't expect to be high in this case but still!

In T238658#5949217, @colewhite wrote:

I created a PR to service-runner for the updates to heapwatch metrics. Thanks for the feedback!

Thank you! Following up there.

The express_router_request_duration_seconds is a harder nut to crack. The metric is generated in a low-level request wrapper in Hyperswitch/service-template-node and is used nearly everywhere. Restbase in particular makes heavy use of the path label in its dashboard. Converting it to a histogram is easy enough, but in combination with the path variable will generate even more metrics (one per bucket plus count and sum) and corresponding cardinality in the labels.

Tracking duration metrics per path seems to me an anti-pattern, especially so when url path parameters enter the mix. Unfortunately, current functionality and dashboards will be broken without it. Tracking overall request duration seems to me a reasonable substitute, but loses the granularity to track down which requests are taking the longest.

Maybe tracking request duration, perhaps greater than a configurable lower bound, may be a job for logging?

I agree, for troubleshooting purposes we should be logging which requests take longer than a configurable time.

For observability purposes and keeping in mind cardinality we could also aggregate status codes by hundreds to get [12345]xx counts too. Maybe for paths we could keep that but aggregate on e.g. first level path only.

In T238658#5949877, @Ottomata wrote:

Agreed, bear in mind also cardinality here if we're adding metrics for a lot of paths.

Perhaps using a single metric name e.g. 'express_router_request_duration_seconds' for all services is a bad idea? Maybe these should be named per service instead? Then the path label cardinality issue would only be an issue for those services with a lot of paths.

The metric won't be a single one for all services though, when Prometheus pulls from k8s services it'll attach tags based on e.g. the namespace name so we'll get a breakdown per service already, unless I'm missing something! Would that work?

The metric won't be a single one for all services though, when Prometheus pulls from k8s services it'll attach tags

By tags do you mean labels, or is this a prometheus thing I don't know about?! My understanding that the metric name here is 'express_router_request_duration_seconds', and every service-runner based app will emit a metric with the same name.

Change 578525 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb

https://gerrit.wikimedia.org/r/578525

Change 578525 merged by Alexandros Kosiaris:
[operations/puppet@production] Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb

https://gerrit.wikimedia.org/r/578525

Change 578537 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits

https://gerrit.wikimedia.org/r/578537

Change 578542 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams-tls: Switch state to lvs_setup

https://gerrit.wikimedia.org/r/578542

Change 578542 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams-tls: Switch state to lvs_setup

https://gerrit.wikimedia.org/r/578542

Change 578555 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams: Switch to monitoring_setup

https://gerrit.wikimedia.org/r/578555

Change 578537 merged by Ottomata:
[operations/deployment-charts@master] eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits

https://gerrit.wikimedia.org/r/578537

Ottomata mentioned this in rDEPLOYCHARTS2e72032bbfd3: eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits.Mar 10 2020, 5:10 PM

Change 578555 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams: Switch to monitoring_setup

https://gerrit.wikimedia.org/r/578555

Change 578896 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams: Move new LVS TLS IP to production state

https://gerrit.wikimedia.org/r/578896

Change 578896 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams: Move new LVS TLS IP to production state

https://gerrit.wikimedia.org/r/578896

Change 578921 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams: Swith caching layer to TLS enabled eventstreams

https://gerrit.wikimedia.org/r/578921

Change 578921 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams: Swith caching layer to TLS enabled eventstreams

https://gerrit.wikimedia.org/r/578921

@Ottomata, eventstreams has been switched to kubernetes and TLS.

The following 2 graphs show the migration of the users from the old infra to the new infra:

It does look like it's working pretty nicely. I 'd wait a few days (mid of next week?) to see if anything shows up and then proceed with tearing down the old infrastructure.

Change 578929 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] eventstreams: Increase CPU limits by 25%

https://gerrit.wikimedia.org/r/578929

Change 578929 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] eventstreams: Increase CPU limits by 25%

https://gerrit.wikimedia.org/r/578929

I 've bumped CPU limits for eventstreams itself by 25%. The cause was that inexplicable CPU throttling event[1] that seems to have some periodicity.

I seem to have caught the 2nd one in the midst of it. Interestingly, there seems to be no correlation with traffic levels, but there is some correlation with the amount of GCs (they seems to dropped during the peak of the events, with peaks right before and after. Memory increased too, but not up to the configurable limits.

In a wait and see situation now.

I 'll push the change to codfw as well which exhibits that behavior as well, albeit way less pronounced.

[1] https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams-k8s?orgId=1&from=1583927706274&to=1583928732533&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventstreams

Change 579029 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Use ottomata/KafkaSSE.git#disconnecting temporarily

https://gerrit.wikimedia.org/r/579029

Change 579029 merged by Ottomata:
[mediawiki/services/eventstreams@master] Use ottomata/KafkaSSE.git#disconnecting temporarily

https://gerrit.wikimedia.org/r/579029

I've added some debug logging in the eqiad canary pod and got a clue.

[2020-03-11T20:51:37.421Z]  INFO: eventstreams/5172 on eventstreams-canary-767476f8b4-7hzdb: Initializing sseClient and starting consume loop. (id=c246b97d-ad80-4dcd-b938-f4b3da7b4d50)
...
[2020-03-11T20:51:52.381Z] DEBUG: eventstreams/5172 on eventstreams-canary-767476f8b4-7hzdb: HTTP request aborted, calling KafkaSSE disconnect. (id=c246b97d-ad80-4dcd-b938-f4b3da7b4d50)

This seems to happen pretty regularly, about every 15 seconds. It looks like something is causing connections to close after 15 seconds, and clients are just reconnecting. (Note that the 'HTTP request aborted' is being caused by the firing of the HTTP ClientRequest 'aborted' (not 'abort') event. From the docs, it seems as if the ClientRequest shouldn't have an 'aborted' event, but an 'abort' one. Perhaps it is propagated up through the req somehow? Ping @Pchelolo for any insights here. Anyway...)

In all the cases I checked, the request is aborted and closed exactly 15 seconds after it is started. Something is making that happen. Is there a request timeout somewhere?

If I consume from the HTTP port on a pod directly, I don't get disconnected. If I consume via HTTPS, I do. Is envoy disconnecting after 15 seconds? Ping @Joe?

Via HTTPS 4982 port: response finishes after 15 seconds

date; time curl -H 'X-Client-IP: testotto0' -s --resolve 'eventstreams.discovery.wmnet:4982:10.64.64.205'  https://eventstreams.discovery.wmnet:4892/v2/stream/recentchange  | pv -l >/dev/null
Wed Mar 11 21:13:11 UTC 2020
1.17k 0:00:16 [  73 /s]

real	0m16.043s
user	0m0.056s
sys	0m0.032s

Via HTTP 8092 port: response continues indefintely.

date; time curl -H 'X-Client-IP: testotto0' -s --resolve 'eventstreams.discovery.wmnet:8092:10.64.64.205'  http://eventstreams.discovery.wmnet:8092/v2/stream/recentchange  | pv -l >/dev/null
Wed Mar 11 21:14:01 UTC 2020
66k 0:00:48 [ 110 /s]                                                                                                                                                                                                                                  

# ctrl-c
real	0m48.182s
user	0m0.060s
sys	0m0.108s

Change 579232 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] eventstreams/envoy: Disable upstream_timout

https://gerrit.wikimedia.org/r/579232

Change 579232 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] eventstreams/envoy: Disable upstream_timout

https://gerrit.wikimedia.org/r/579232

Change 579237 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] _tls_helpers: Double quote UPSTREAM_TIMEOUT value

https://gerrit.wikimedia.org/r/579237

Change 579237 merged by jenkins-bot:
[operations/deployment-charts@master] _tls_helpers: Double quote UPSTREAM_TIMEOUT value

https://gerrit.wikimedia.org/r/579237

Change 579244 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] eventstreams: use latest envoy image

https://gerrit.wikimedia.org/r/579244

Change 579244 merged by jenkins-bot:
[operations/deployment-charts@master] eventstreams: use latest envoy image

https://gerrit.wikimedia.org/r/579244

Change 579248 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] eventstreams: Pass 0 seconds as "0s"

https://gerrit.wikimedia.org/r/579248

Change 579248 merged by jenkins-bot:
[operations/deployment-charts@master] eventstreams: Pass 0 seconds as "0s"

https://gerrit.wikimedia.org/r/579248

In all the cases I checked, the request is aborted and closed exactly 15 seconds after it is started. Something is making that happen. Is there a request timeout somewhere?

@Ottomata, yes in envoy. There is a default upstream timeout of 15s. Have a look at 295abe7, de2cb54, 17e3b04 and last but not least 0bc3fe10ed6de5665e1f16458c3177a428bf2258 for the nice rabbithole we went down this morning with @Joe

If I consume from the HTTP port on a pod directly, I don't get disconnected. If I consume via HTTPS, I do. Is envoy disconnecting after 15 seconds? Ping @Joe?

Via HTTPS 4982 port: response finishes after 15 seconds
date; time curl -H 'X-Client-IP: testotto0' -s --resolve 'eventstreams.discovery.wmnet:4982:10.64.64.205'  https://eventstreams.discovery.wmnet:4892/v2/stream/recentchange  | pv -l >/dev/null

Actually, there is a typo there. 4982 vs 4892 and --resolve never works. Just so that you don't fight trying to figure out what is going on like I did for the last 1 hour.

But anyway, staging and codfw are now using the new images and the new charts.

eqiad I haven't touched because helmfile diff spewed out too much of a diff and decided to leave it to you.

Connections seem to be much more long lived now per https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams-k8s?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventstreams&from=now-1h&to=now (notice all those NaN). We 'll have to wait a bit to have a definite answer on the memory though. It does look promising however.

Actually, there is a typo there. 4982 vs 4892 and --resolve never works. Just so that you don't fight trying to figure out what is going on like I did for the last 1 hour.

AHHH I'm sorry about that! I had a hundred commands and tests and was trying to distill them to a clear one for you all. :/

Ok, I will deploy eqiad. The diff you are probably noticing is the fact that I have an upgraded release in canary, but not prod.

THANK YOU THANK YOU

Ottomata mentioned this in T247447: Pywikibot eventstreams.py 504 Server Error: Gateway Timeout.Mar 12 2020, 1:29 PM

TheSandDoctor awarded a token.Mar 12 2020, 1:56 PM

Perhaps using a single metric name e.g. 'express_router_request_duration_seconds' for all services is a bad idea? Maybe these should be named per service instead?

The metric won't be a single one for all services though, when Prometheus pulls from k8s services it'll attach tags

By tags do you mean labels, or is this a prometheus thing I don't know about?! My understanding that the metric name here is 'express_router_request_duration_seconds', and every service-runner based app will emit a metric with the same name.

@fgiunchedi my suggestion would be to use a per service app name metric for this, instead of using one for all services that happen to use express. I currently have eventstreams_connected_clients, so this would be eventstreams_request_duration_seconds with path specific labels.

Change 579297 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Add eventstreams_client_connections_total metric

https://gerrit.wikimedia.org/r/579297

Change 579297 merged by Ottomata:
[mediawiki/services/eventstreams@master] Add eventstreams_client_connections_total metric

https://gerrit.wikimedia.org/r/579297

Change 579312 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventstreams - default client_ip_connection_limit: 2 per worker

https://gerrit.wikimedia.org/r/579312

Change 579313 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventstreams - bump image version to 2020-03-12-155244-production

https://gerrit.wikimedia.org/r/579313

Change 579312 merged by Ottomata:
[operations/deployment-charts@master] eventstreams - default client_ip_connection_limit: 2 per worker

https://gerrit.wikimedia.org/r/579312

Change 579313 merged by Ottomata:
[operations/deployment-charts@master] eventstreams - bump image version to 2020-03-12-155244-production

https://gerrit.wikimedia.org/r/579313

I think there are still some issues, but most things seem to be working fine. There is a periodic spike of new connection, as well as a seemingly very large number of http reqs / second which doesn't make a lot of sense to me. I'm not working today so I won't be able to check in on this until Monday. As far as I can tell normal client connections are working fine.

While things do indeed look way better, the memory leak is most certainly still there. Looking at https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams-k8s?orgId=1&from=1583743681938&to=1584109361939&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventstreams&fullscreen&panelId=56 specifically makes it pretty clear

In T238658#5964111, @Ottomata wrote:

Perhaps using a single metric name e.g. 'express_router_request_duration_seconds' for all services is a bad idea? Maybe these should be named per service instead?

The metric won't be a single one for all services though, when Prometheus pulls from k8s services it'll attach tags

By tags do you mean labels, or is this a prometheus thing I don't know about?! My understanding that the metric name here is 'express_router_request_duration_seconds', and every service-runner based app will emit a metric with the same name.

@fgiunchedi my suggestion would be to use a per service app name metric for this, instead of using one for all services that happen to use express. I currently have eventstreams_connected_clients, so this would be eventstreams_request_duration_seconds with path specific labels.

Yes my mistake, I meant labels! I agree we need to be able to break down by service, some labels are added automatically by k8s based on where the service lives, e.g.

express_router_request_duration_seconds{app="eventstreams",chart="eventstreams",instance="10.64.64.248:9102",job="k8s-pods",kubernetes_namespace="eventstreams",kubernetes_pod_name="eventstreams-production-5556dcc9df-k49g4",method="GET",path="_info",pod_template_hash="5556dcc9df",release="production",routing_tag="eventstreams",service="eventstreams",status="200"}

That's why I mentioned that IMHO service from service-runner isn't very useful as a tag and we could key dashboards on app instead (or any other relevant label), hope that makes sense!

service from service-runner isn't very useful as a tag and we could key dashboards on app instead

Ya makes sense, but my suggestion was about trying to solve the possible cardinality issue for a path label for apps that have a lot of paths. If every one of these has the same metric name from all apps, the path cardinality may indeed get large.

But yeah, I agree the service-runner's service and k8s service DO NOT mean the same thing, and we should probably have a different label. As part of T242861 I chose app in eventstreams and eventgate, which is settable by the app/service's helmfile. For those apps, service from service-runner is redundant.

While things do indeed look way better, the memory leak is most certainly still there.

Indeed. The code is slightly different than what is on scb, but not in any places that I would think would cause a memory leak like this. Will continue investigating!

That's why I mentioned that IMHO service from service-runner isn't very useful as a tag and we could key dashboards on app instead (or any other relevant label), hope that makes sense!

Just some clarification (aka pedantic notes) from yours truly:

service isn't from service-runner. It's from the sidecar prometheus-statsd-exporter in most services, eventstreams (and eventgate IIRC) are the pioneers here. It's a compromise done by me back then when crafting the first charts and of course it can be revised and altered. I knew back then it was already limited, but it was one sane path forward.
I wouldn't suggest app instead as the semantics of it were meant to distinguish the actual application (the software) from the service. e.g. eventgate the application powers 4 services (eventgate-analytics, eventgate-logging-external, eventgate-logging, eventgate-main).
Overall I think the having a label called service makes much sense and it probably should be somehow set during deployment (I already like eventgate's approach of setting it explicitly, although I haven't studied the details of it yet).

service isn't from service-runner. It's from the sidecar prometheus-statsd-exporter in most services, eventstreams (and eventgate IIRC) are the pioneers here. It's a compromise done by me back then when crafting the first charts and of course it can be revised and altered. I knew back then it was already limited, but it was one sane path forward.

AH! That is interesting. evenstreams is using @colewhite's new service-runner prometheus stuff, which exports service explicitly as a label for backwards compatibility with this. Maybe we don't need that after all?

I wouldn't suggest app instead as the semantics of it were meant to distinguish the actual application (the software) from the service. e.g. eventgate the application powers 4 services (eventgate-analytics, eventgate-logging-external, eventgate-logging, eventgate-main).

Let's follow up with this more on T242861: Clarify multi-service instance concepts in helm charts and enable canary releases. 'service' is going to be really hard to reconcile with k8's Service resource naming. Also, does application necessarily mean the software e.g. apache or nginx or eventgate? I'm not sure that it does. Anyway yeah, I'm ready to bikeshed more on T242861 whenever you are :p

In T238658#5972100, @Ottomata wrote:

service isn't from service-runner. It's from the sidecar prometheus-statsd-exporter in most services, eventstreams (and eventgate IIRC) are the pioneers here. It's a compromise done by me back then when crafting the first charts and of course it can be revised and altered. I knew back then it was already limited, but it was one sane path forward.

AH! That is interesting. evenstreams is using @colewhite's new service-runner prometheus stuff, which exports service explicitly as a label for backwards compatibility with this. Maybe we don't need that after all?

Sure it might very well be. I am fine with dropping it from statsd-exporter/service-runner itself as long as we expose it in some other way (e.g. a kubernetes label) so that we don't end up breaking all grafana dashboards.

I wouldn't suggest app instead as the semantics of it were meant to distinguish the actual application (the software) from the service. e.g. eventgate the application powers 4 services (eventgate-analytics, eventgate-logging-external, eventgate-logging, eventgate-main).

Let's follow up with this more on T242861: Clarify multi-service instance concepts in helm charts and enable canary releases. 'service' is going to be really hard to reconcile with k8's Service resource naming. Also, does application necessarily mean the software e.g. apache or nginx or eventgate? I'm not sure that it does. Anyway yeah, I'm ready to bikeshed more on T242861 whenever you are :p

In T238658#5972092, @Ottomata wrote:

While things do indeed look way better, the memory leak is most certainly still there.

Indeed. The code is slightly different than what is on scb, but not in any places that I would think would cause a memory leak like this. Will continue investigating!

Could it be that some native library leaks (are there even any for eventstreams?) ?

Thanks for the context on service @akosiaris , now it is much more clear in my mind what the status quo is. In the interest of compatibility and time (and picking our battles) I'd say let's go ahead and keep service as it is, for sure the whole conversation is interesting but definitely for another task.

In T238658#5972237, @akosiaris wrote:

Sure it might very well be. I am fine with dropping it from statsd-exporter/service-runner itself as long as we expose it in some other way (e.g. a kubernetes label) so that we don't end up breaking all grafana dashboards.

In T238658#5972298, @fgiunchedi wrote:

Thanks for the context on service @akosiaris , now it is much more clear in my mind what the status quo is. In the interest of compatibility and time (and picking our battles) I'd say let's go ahead and keep service as it is, for sure the whole conversation is interesting but definitely for another task.

Do we have enough information to decide whether to remove the service label from service-runner or leave as is? I see the benefits of both approaches and am good either way.

In T238658#5954016, @fgiunchedi wrote:

Maybe tracking request duration, perhaps greater than a configurable lower bound, may be a job for logging?

I agree, for troubleshooting purposes we should be logging which requests take longer than a configurable time.

For observability purposes and keeping in mind cardinality we could also aggregate status codes by hundreds to get [12345]xx counts too. Maybe for paths we could keep that but aggregate on e.g. first level path only.

Do we have enough information to decide whether or not to exchange this metric for logs when taking longer than a configurable time?

Switching to aggregating by hundreds status code is easy enough. Keep in mind that this will apply to all consumers of hyperswitch, but I don't think this will be a problem.

In T238658#5973616, @colewhite wrote:

In T238658#5972237, @akosiaris wrote:

Sure it might very well be. I am fine with dropping it from statsd-exporter/service-runner itself as long as we expose it in some other way (e.g. a kubernetes label) so that we don't end up breaking all grafana dashboards.

In T238658#5972298, @fgiunchedi wrote:

Thanks for the context on service @akosiaris , now it is much more clear in my mind what the status quo is. In the interest of compatibility and time (and picking our battles) I'd say let's go ahead and keep service as it is, for sure the whole conversation is interesting but definitely for another task.

Do we have enough information to decide whether to remove the service label from service-runner or leave as is? I see the benefits of both approaches and am good either way.

Moved this to T247820 in order to not clutter this task with unrelated stuff.

In T238658#5954016, @fgiunchedi wrote:

Maybe tracking request duration, perhaps greater than a configurable lower bound, may be a job for logging?

I agree, for troubleshooting purposes we should be logging which requests take longer than a configurable time.

For observability purposes and keeping in mind cardinality we could also aggregate status codes by hundreds to get [12345]xx counts too. Maybe for paths we could keep that but aggregate on e.g. first level path only.

Do we have enough information to decide whether or not to exchange this metric for logs when taking longer than a configurable time?

Switching to aggregating by hundreds status code is easy enough. Keep in mind that this will apply to all consumers of hyperswitch, but I don't think this will be a problem.

Moved this to T247820 for the same reason.

@akosiaris when you find have a moment, I'm trying to set debug.enabled=true on the eventstreams canary pod. Some resource doesn't finish spawning, and eventually the helm upgrade times out. (Tiller just spits out Deployment is not ready for 10 minutes.) I'm having trouble finding out why. Setting debug.enabled=true in my docker-desktop dev env works fine. I'm guessing it has something either to do with containerPort allocation, or perhaps with spawning up the wmfdebug container sidecar. Is there somewhere I can find out what resource helm is waiting for? Some tiller logs I'm missing? Some kubernetes node logs maybe?

I could try and simplify the debug mode settings one by one to see what is causing this problem, but that will require several iterations of chart versions packaged and merged, which isn't ideal for production. Maybe there is some way I can use staging on deploy1001 to troubleshoot the chart like I would in development, without having to package and merge it?

I think I found the leak.

https://github.com/Blizzard/node-rdkafka/issues/731#issuecomment-600915799

Ottomata mentioned this in T240181: Documentation improvements for Eventstreams.Mar 20 2020, 2:22 PM

Change 582837 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Bump KafkaSSE dependency to 400721a to get memory leak workaround

https://gerrit.wikimedia.org/r/582837

Change 582837 merged by Ottomata:
[mediawiki/services/eventstreams@master] Bump KafkaSSE dependency to 400721a to get memory leak workaround

https://gerrit.wikimedia.org/r/582837

Change 582864 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventstreams - bump image version to 2020-03-23-173308-production

https://gerrit.wikimedia.org/r/582864

Change 582864 merged by Ottomata:
[operations/deployment-charts@master] eventstreams - bump image version to 2020-03-23-173308-production

https://gerrit.wikimedia.org/r/582864

Change 582866 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventstreams - reduce resource requests now that mem leak is gone

https://gerrit.wikimedia.org/r/582866

Change 582866 merged by Ottomata:
[operations/deployment-charts@master] eventstreams - reduce resource requests now that mem leak is gone

https://gerrit.wikimedia.org/r/582866

@akosiaris eventstreams in k8s looks good to me! Let's proceed removing it from scb. There are a couple of open changes associated with this ticket, but I'm not sure if they are relevant anymore or not.

I can take care of puppet stuff and stopping the service on scb, but I don't know if there are extra conftool things that need to happen.

Change 583071 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams: Switch scb LVS to monitoring_setup

https://gerrit.wikimedia.org/r/583071

Change 583072 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams: Switch to lvs_setup

https://gerrit.wikimedia.org/r/583072

Change 583073 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams: Switch to service_setup

https://gerrit.wikimedia.org/r/583073

Change 583074 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams: Remove old lvs service

https://gerrit.wikimedia.org/r/583074

Change 583075 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] lvs: Rename eventstreams-tls to eventstreams

https://gerrit.wikimedia.org/r/583075

Change 583076 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams: Remove from scb role

https://gerrit.wikimedia.org/r/583076

Change 583077 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] eventstreams: Remove profile

https://gerrit.wikimedia.org/r/583077

Change 566772 abandoned by Alexandros Kosiaris:
DNM: eventstreams: switch lvs to kubernetes

Reason:
Things have changed, the alternative is now in https://gerrit.wikimedia.org/r/#/q/topic:scb_eventstreams (status:open OR status:merged)

https://gerrit.wikimedia.org/r/566772

In T238658#5993144, @Ottomata wrote:

@akosiaris eventstreams in k8s looks good to me!

Nice! Congrats for solving it!

Let's proceed removing it from scb. There are a couple of open changes associated with this ticket, but I'm not sure if they are relevant anymore or not.

Not just a couple anymore. I 've uploaded https://gerrit.wikimedia.org/r/#/q/topic:scb_eventstreams+(status:open+OR+status:merged). I 'll do the LVS parts and conftool parts, I 'll leave the rest to you

Change 583071 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams: Switch scb LVS to monitoring_setup

https://gerrit.wikimedia.org/r/583071

Change 583072 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams: Switch to lvs_setup

https://gerrit.wikimedia.org/r/583072

Ah we need to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/583073 first, before the others, yes?

Change 583073 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams: Switch to service_setup

https://gerrit.wikimedia.org/r/583073

Change 566773 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams: Remove all conftool data

https://gerrit.wikimedia.org/r/566773

Change 583074 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams: Remove old lvs service

https://gerrit.wikimedia.org/r/583074

Change 583076 merged by Alexandros Kosiaris:
[operations/puppet@production] eventstreams: Remove from scb role

https://gerrit.wikimedia.org/r/583076

In T238658#5999287, @Ottomata wrote:

Ah we need to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/583073 first, before the others, yes?

Yes. Merged now, alongside with a few followups as I was at it. There's a couple of stranglers yet, but no rush.

Change 583075 merged by Alexandros Kosiaris:
[operations/puppet@production] lvs: Rename eventstreams-tls to eventstreams

https://gerrit.wikimedia.org/r/583075

Change 583077 merged by Ottomata:
[operations/puppet@production] eventstreams: Remove profile

https://gerrit.wikimedia.org/r/583077

Ottomata moved this task from Ready to Deploy to Done on the Analytics-Kanban board.Mar 30 2020, 7:37 PM

Ottomata set Final Story Points to 21.

@Ottomata I think we can resolve this. Feel free to reopen.

BTullis mentioned this in T294911: Apparent latency warning in 90th centile of eventgate-logging-external.Nov 3 2021, 11:53 AM

Migrate EventStreams to k8s deployment pipelineClosed, ResolvedPublic13 Estimated Story PointsActions

Details

Related ObjectsSearch...

Event Timeline

Migrate EventStreams to k8s deployment pipeline
Closed, ResolvedPublic13 Estimated Story Points
Actions

Related Objects
Search...