
Jaeger secure access to OpenSearch cluster
Closed, Resolved · Public

Description

Track the implementation of Jaeger secure access to OpenSearch. Items in no particular order (possibly incomplete list):

  • k8s cluster is allowed to talk to OpenSearch API (e.g. firewall rules)
  • Jaeger can talk https to OpenSearch (ideally with some kind of mutual authentication)

I (Filippo) have investigated this a little and so far there's a few options (in no particular order) for access:

  1. jaeger accesses opensearch as a "standard" service. In other words we point jaeger to a load-balanced HTTPS service/api backed by logstash hosts running logging::opensearch::collector (essentially stateless frontends).
  • pros: familiar pattern of deployment/traffic, we can depool/pool logstash hosts as we do e.g. for kibana, jaeger config has something like "https://logs-opensearch.discovery.wmnet"
    • cons: requires more work up front to set up (load balanced service, envoy configs, etc), access control needs to happen at the HTTP level
  2. jaeger accesses opensearch as a cluster. In this case jaeger talks HTTP on port 9200 directly to all opensearch cluster hosts (currently all hosts running logging::opensearch::data and logging::opensearch::collector); the cluster node list is discovered automatically via sniffing.
    • pros: easy to set up (open port 9200 on the logstash cluster to the k8s-aux workers)
    • cons: no TLS out of the box, need to bake some production hostnames into the jaeger config
  3. Jaeger outputs to kafka-logging as a buffer, jaeger-ingester (perhaps deployed within the logging cluster) reads from kafka-logging and persists to opensearch (from @herron)
    • Pros: reuses existing/understood logging pipeline architecture & monitoring, allows for queueing and recovery during backend outages/maintenances, helps backend cluster stability by absorbing bursts
    • Cons: presumably some setup/maintenance to be done around packaging, puppetizing, etc., plus monitoring for the jaeger-ingester service
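For illustration, options 1 and 2 would differ mainly in jaeger's storage configuration. A minimal sketch using upstream Jaeger's environment variables (derived from its `--es.*` flags); the hostnames below are placeholders from this discussion, not final production values:

```shell
# Sketch of jaeger-collector storage settings; hostnames are placeholders.
export SPAN_STORAGE_TYPE=elasticsearch

# Option 1: a single load-balanced HTTPS endpoint, no sniffing needed
export ES_SERVER_URLS=https://logs-opensearch.discovery.wmnet:443

# Option 2: point at cluster nodes directly on 9200 and discover the rest
# via sniffing (no TLS out of the box):
#   export ES_SERVER_URLS=http://logstash1023.eqiad.wmnet:9200
#   export ES_SNIFFER=true
```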

Event Timeline

I like option one better for a few reasons:

  1. centralizes the interface for consumers of opensearch data (solving the current problem of canary checks directly referencing a random host)
  2. allows us to apply some access control around the API that is not limited to IP-level controls
  3. gives us pool/depool capability
  4. eliminates the need to maintain a list of hosts in client configs

We're not yet in a position to have opensearch serve https natively, but option one would give us https termination in front of it.

Thank you for the feedback @colewhite ! Agreed going for option 1 seems like a good first step. I have a WIP patch here https://gerrit.wikimedia.org/r/c/operations/puppet/+/881839 for the apache bits, let me know what you think!

Both options have great merit, and choosing a favorite is hard. I would default to my preference for simplicity and fewer moving parts, and I think #2 would be easier to troubleshoot in the future. However, option #1 is closer to the general direction of having other services access OpenSearch securely, so it is most likely the better long-term decision. Apologies for the delay in commenting.

Potential option 3: Jaeger outputs to kafka-logging as a buffer, jaeger-ingester (perhaps deployed within the logging cluster) reads from kafka-logging and persists to opensearch

Pros: reuses existing/understood logging pipeline architecture & monitoring, allows for queueing and recovery during backend outages/maintenances, helps backend cluster stability by absorbing bursts

Cons: presumably some setup/maintenance to be done around packaging, puppetizing, etc., plus monitoring for the jaeger-ingester service
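For reference, option 3 would roughly correspond to the following upstream Jaeger settings (env var names come from Jaeger's flag-to-env mapping for `--kafka.*`; broker and topic names below are placeholders, not production values):

```shell
# jaeger-collector writes spans to Kafka instead of a storage backend
export SPAN_STORAGE_TYPE=kafka
export KAFKA_PRODUCER_BROKERS=kafka-logging1001.eqiad.wmnet:9092  # placeholder
export KAFKA_PRODUCER_TOPIC=jaeger-spans

# jaeger-ingester then consumes the topic and persists to OpenSearch, e.g.:
#   SPAN_STORAGE_TYPE=elasticsearch \
#   KAFKA_CONSUMER_BROKERS=kafka-logging1001.eqiad.wmnet:9092 \
#   ES_SERVER_URLS=https://logs-opensearch.discovery.wmnet:443 jaeger-ingester
```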

Thank you for the feedback! I think option 3 is interesting too, and something to be considered if we do run into the problems that the "pros" address! I'll move it to the task description so we don't lose track of it.

Change 881839 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch: reverse-proxy access to opensearch API

https://gerrit.wikimedia.org/r/881839

Change 881839 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch: reverse-proxy access to opensearch API

https://gerrit.wikimedia.org/r/881839

Change 888634 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] ssl: add public cert for kibana + logs-api

https://gerrit.wikimedia.org/r/888634

Change 888634 merged by Filippo Giunchedi:

[operations/puppet@production] ssl: add public cert for kibana + logs-api

https://gerrit.wikimedia.org/r/888634

Change 888639 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] ssl: update public cert for kibana + logs-api

https://gerrit.wikimedia.org/r/888639

Change 888639 merged by Filippo Giunchedi:

[operations/puppet@production] ssl: update public cert for kibana + logs-api

https://gerrit.wikimedia.org/r/888639

Change 888646 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch: add aliases to dashboards vhost

https://gerrit.wikimedia.org/r/888646

Change 888646 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch: add aliases to dashboards vhost

https://gerrit.wikimedia.org/r/888646

Change 888648 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: fix kibana7 vhost selection for pybal health checks

https://gerrit.wikimedia.org/r/888648

Change 888648 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: fix kibana7 vhost selection for pybal health checks

https://gerrit.wikimedia.org/r/888648

Change 888696 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: add logs-api svc records

https://gerrit.wikimedia.org/r/888696

Change 888700 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Add logs-api service

https://gerrit.wikimedia.org/r/888700

Change 888696 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: add logs-api svc records

https://gerrit.wikimedia.org/r/888696

Change 888700 merged by Filippo Giunchedi:

[operations/puppet@production] Add logs-api service

https://gerrit.wikimedia.org/r/888700

Change 889063 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] conftool-data: add logs-api codfw

https://gerrit.wikimedia.org/r/889063

Change 889063 merged by Filippo Giunchedi:

[operations/puppet@production] conftool-data: add logs-api codfw

https://gerrit.wikimedia.org/r/889063

Change 889066 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: logs-api to lvs_setup state

https://gerrit.wikimedia.org/r/889066

Change 889066 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: logs-api to lvs_setup state

https://gerrit.wikimedia.org/r/889066

Mentioned in SAL (#wikimedia-operations) [2023-02-14T09:50:14Z] <godog> roll-restart pybal in eqiad/codfw to pick up logs-api service - T320702

Change 889083 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] logs-api: allow GET / only for health check

https://gerrit.wikimedia.org/r/889083

Change 889083 merged by Filippo Giunchedi:

[operations/puppet@production] logs-api: allow GET / only for health check

https://gerrit.wikimedia.org/r/889083
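The change above restricts unauthenticated access to GET on / only, so the pybal health check keeps working while the rest of the API stays authenticated. A hypothetical Apache fragment illustrating the idea (the actual puppetized vhost, auth provider, and file paths may well differ):

```apache
# Require auth everywhere by default (paths and realm are illustrative)
<Location "/">
    AuthType Basic
    AuthName "logs-api"
    AuthUserFile /etc/apache2/logs-api.htpasswd
    Require valid-user
</Location>

# ...except GET on exactly "/", which the pybal health check uses
<LocationMatch "^/$">
    <Limit GET>
        Require all granted
    </Limit>
</LocationMatch>
```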

Change 889494 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set logs-api in 'production'

https://gerrit.wikimedia.org/r/889494

Change 889494 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set logs-api in 'production'

https://gerrit.wikimedia.org/r/889494

fgiunchedi claimed this task.

Calling this done! We have an (authenticated) logs-api service available for jaeger to use.
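As a hypothetical smoke test of the end result (the discovery name follows the logs-api svc records added above; the exact port, path, and credential mechanism are assumptions):

```shell
# Query cluster health through the authenticated reverse proxy;
# "user:pass" stands in for whatever credentials are provisioned.
curl -su "user:pass" https://logs-api.discovery.wmnet/_cluster/health
```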