
Jaeger secure access to OpenSearch cluster
Closed, Resolved · Public

Description

Track the implementation of Jaeger secure access to OpenSearch. Items in no particular order (possibly incomplete list):

  • k8s cluster is allowed to talk to OpenSearch API (e.g. firewall rules)
  • Jaeger can talk https to OpenSearch (ideally with some kind of mutual authentication)

I (Filippo) have investigated this a little and so far there's a few options (in no particular order) for access:

  1. jaeger accesses opensearch as a "standard" service. In other words we point jaeger to a load-balanced HTTPS service/api backed by logstash hosts running logging::opensearch::collector (essentially stateless frontends).
  • pros: familiar pattern of deployment/traffic, we can depool/pool logstash hosts as we do e.g. for kibana, jaeger config has something like "https://logs-opensearch.discovery.wmnet"
    • cons: requires more work up front to set up (load balanced service, envoy configs, etc), access control needs to happen at the HTTP level
  2. jaeger accesses opensearch as a cluster. In this case jaeger talks HTTP on port 9200 directly to all opensearch cluster hosts (currently all hosts running logging::opensearch::data and logging::opensearch::collector); the cluster node list is discovered automatically via sniffing.
    • pros: easy to set up (open port 9200 on the logstash cluster to the k8s-aux workers)
    • cons: no TLS out of the box, need to bake some production hostnames into the jaeger config
  3. Jaeger outputs to kafka-logging as a buffer, jaeger-ingester (perhaps deployed within the logging cluster) reads from kafka-logging and persists to opensearch (from @herron)
    • Pros: reuses existing/understood logging pipeline architecture & monitoring, allows for queueing and recovery during backend outages/maintenances, helps backend cluster stability by absorbing bursts
    • Cons: presumably some setup/maintenance to be done around packaging, puppetizing, etc., plus monitoring for the jaeger-ingester service
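For illustration, options 1 and 2 would differ mainly in jaeger's storage configuration. A minimal sketch using upstream Jaeger's environment variables (derived from its `--es.*` flags); the hostnames below are placeholders from this discussion, not final production values:

```shell
# Sketch of jaeger-collector storage settings; hostnames are placeholders.
export SPAN_STORAGE_TYPE=elasticsearch

# Option 1: a single load-balanced HTTPS endpoint, no sniffing needed
export ES_SERVER_URLS=https://logs-opensearch.discovery.wmnet:443

# Option 2: point at cluster nodes directly on 9200 and discover the rest
# via sniffing (no TLS out of the box):
#   export ES_SERVER_URLS=http://logstash1023.eqiad.wmnet:9200
#   export ES_SNIFFER=true
```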

Event Timeline

I like option one better for a few reasons:

  1. centralizes the interface for consumers of opensearch data (solving the current problem of canary checks directly referencing a random host)
  2. allows us to apply some access control around the API that is not limited to IP-level controls
  3. gives us pool/depool capability
  4. eliminates the need to maintain a list of hosts in client configs

We're not yet in a position to have opensearch serve https natively, but option one would give us https termination in front of it.

Thank you for the feedback @colewhite ! Agreed going for option 1 seems like a good first step. I have a WIP patch here https://gerrit.wikimedia.org/r/c/operations/puppet/+/881839 for the apache bits, let me know what you think!

Both options have great merit, and choosing a favorite is hard. I would default to my preference for simplicity and fewer moving parts, and I think #2 would be easier to troubleshoot in the future. However, option #1 is closer to the general direction of having other services access OpenSearch securely, so it is most likely the better long-term decision. Apologies for the delay in commenting.

Potential option 3: Jaeger outputs to kafka-logging as a buffer, jaeger-ingester (perhaps deployed within the logging cluster) reads from kafka-logging and persists to opensearch

Pros: reuses existing/understood logging pipeline architecture & monitoring, allows for queueing and recovery during backend outages/maintenances, helps backend cluster stability by absorbing bursts

Cons: presumably some setup/maintenance to be done around packaging, puppetizing, etc., plus monitoring for the jaeger-ingester service
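For reference, option 3 would roughly correspond to the following upstream Jaeger settings (env var names come from Jaeger's flag-to-env mapping for `--kafka.*`; broker and topic names below are placeholders, not production values):

```shell
# jaeger-collector writes spans to Kafka instead of a storage backend
export SPAN_STORAGE_TYPE=kafka
export KAFKA_PRODUCER_BROKERS=kafka-logging1001.eqiad.wmnet:9092  # placeholder
export KAFKA_PRODUCER_TOPIC=jaeger-spans

# jaeger-ingester then consumes the topic and persists to OpenSearch, e.g.:
#   SPAN_STORAGE_TYPE=elasticsearch \
#   KAFKA_CONSUMER_BROKERS=kafka-logging1001.eqiad.wmnet:9092 \
#   ES_SERVER_URLS=https://logs-opensearch.discovery.wmnet:443 jaeger-ingester
```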

Thank you for the feedback! I think option 3 is interesting too, and something to be considered if we do run into the problems that the "pros" address! I'll move it to the task description so we don't lose track of it.

Change 881839 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch: reverse-proxy access to opensearch API

https://gerrit.wikimedia.org/r/881839

Change 881839 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch: reverse-proxy access to opensearch API

https://gerrit.wikimedia.org/r/881839

Change 888634 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] ssl: add public cert for kibana + logs-api

https://gerrit.wikimedia.org/r/888634

Change 888634 merged by Filippo Giunchedi:

[operations/puppet@production] ssl: add public cert for kibana + logs-api

https://gerrit.wikimedia.org/r/888634

Change 888639 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] ssl: update public cert for kibana + logs-api

https://gerrit.wikimedia.org/r/888639

Change 888639 merged by Filippo Giunchedi:

[operations/puppet@production] ssl: update public cert for kibana + logs-api

https://gerrit.wikimedia.org/r/888639

Change 888646 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch: add aliases to dashboards vhost

https://gerrit.wikimedia.org/r/888646

Change 888646 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch: add aliases to dashboards vhost

https://gerrit.wikimedia.org/r/888646

Change 888648 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: fix kibana7 vhost selection for pybal health checks

https://gerrit.wikimedia.org/r/888648

Change 888648 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: fix kibana7 vhost selection for pybal health checks

https://gerrit.wikimedia.org/r/888648

Change 888696 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: add logs-api svc records

https://gerrit.wikimedia.org/r/888696

Change 888700 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Add logs-api service

https://gerrit.wikimedia.org/r/888700

Change 888696 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: add logs-api svc records

https://gerrit.wikimedia.org/r/888696

Change 888700 merged by Filippo Giunchedi:

[operations/puppet@production] Add logs-api service

https://gerrit.wikimedia.org/r/888700

Change 889063 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] conftool-data: add logs-api codfw

https://gerrit.wikimedia.org/r/889063

Change 889063 merged by Filippo Giunchedi:

[operations/puppet@production] conftool-data: add logs-api codfw

https://gerrit.wikimedia.org/r/889063

Change 889066 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: logs-api to lvs_setup state

https://gerrit.wikimedia.org/r/889066

Change 889066 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: logs-api to lvs_setup state

https://gerrit.wikimedia.org/r/889066

Mentioned in SAL (#wikimedia-operations) [2023-02-14T09:50:14Z] <godog> roll-restart pybal in eqiad/codfw to pick up logs-api service - T320702

Change 889083 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] logs-api: allow GET / only for health check

https://gerrit.wikimedia.org/r/889083

Change 889083 merged by Filippo Giunchedi:

[operations/puppet@production] logs-api: allow GET / only for health check

https://gerrit.wikimedia.org/r/889083
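The change above restricts unauthenticated access to GET on / only, so the pybal health check keeps working while the rest of the API stays authenticated. A hypothetical Apache fragment illustrating the idea (the actual puppetized vhost, auth provider, and file paths may well differ):

```apache
# Require auth everywhere by default (paths and realm are illustrative)
<Location "/">
    AuthType Basic
    AuthName "logs-api"
    AuthUserFile /etc/apache2/logs-api.htpasswd
    Require valid-user
</Location>

# ...except GET on exactly "/", which the pybal health check uses
<LocationMatch "^/$">
    <Limit GET>
        Require all granted
    </Limit>
</LocationMatch>
```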

Change 889494 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set logs-api in 'production'

https://gerrit.wikimedia.org/r/889494

Change 889494 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set logs-api in 'production'

https://gerrit.wikimedia.org/r/889494

fgiunchedi claimed this task.

Calling this done! We have an (authenticated) logs-api service available for jaeger to use.
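As a hypothetical smoke test of the end result (the discovery name follows the logs-api svc records added above; the exact port, path, and credential mechanism are assumptions):

```shell
# Query cluster health through the authenticated reverse proxy;
# "user:pass" stands in for whatever credentials are provisioned.
curl -su "user:pass" https://logs-api.discovery.wmnet/_cluster/health
```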