Page MenuHomePhabricator

ATS should alert if the number of total or active connections reached maximum
Closed, ResolvedPublic

Description

Our ATS configuration currently limits the number of active and total connections with the settings proxy.config.net.max_connections_in and proxy.config.net.max_connections_active_in respectively. We need to alert if either of those two values are reached.

Note that proxy.config.net.max_connections_active_in was renamed to proxy.config.net.max_requests_in in ATX 9.x:

https://github.com/apache/trafficserver/issues/6695#issuecomment-951426792

Event Timeline

ema triaged this task as High priority.Oct 8 2021, 11:53 AM

Change 826362 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Parse ATS config to node exporter text

https://gerrit.wikimedia.org/r/826362

BCornwall changed the task status from Open to Stalled.Aug 24 2022, 7:20 PM

Change needs some testing but will be stalled until https://phabricator.wikimedia.org/T309651 is fixed

Change of plans: Kwaku has expressed an interest in backwards-compatibility so ATS 8 support will be added.

BCornwall changed the task status from Stalled to In Progress.Sep 7 2022, 6:20 PM

Change 830686 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] ats: Enable node_ats_config monitoring

https://gerrit.wikimedia.org/r/830686

Change 830950 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] ats: Alert on high connection/request count

https://gerrit.wikimedia.org/r/830950

Change 826362 merged by BCornwall:

[operations/puppet@production] prometheus: Parse ATS config to node exporter text

https://gerrit.wikimedia.org/r/826362

Change 830686 merged by BCornwall:

[operations/puppet@production] ats: Enable node_ats_config monitoring

https://gerrit.wikimedia.org/r/830686

Change 831073 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] ats: Use variable for ATS 8 in ATS config monitor

https://gerrit.wikimedia.org/r/831073

Change 831073 merged by BCornwall:

[operations/puppet@production] ats: Use variable for ATS 8 in ATS config monitor

https://gerrit.wikimedia.org/r/831073

Change 831624 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Remove quotes from ATS config gauge

https://gerrit.wikimedia.org/r/831624

Change 831624 merged by BCornwall:

[operations/puppet@production] prometheus: Remove quotes from ATS config gauge

https://gerrit.wikimedia.org/r/831624

Change 832327 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Prometheus: Remove ATS gauge periods

https://gerrit.wikimedia.org/r/832327

Change 832327 merged by BCornwall:

[operations/puppet@production] Prometheus: Remove ATS gauge periods

https://gerrit.wikimedia.org/r/832327

Change 838911 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Add records for ATS percent usage

https://gerrit.wikimedia.org/r/838911

Change 838911 merged by BCornwall:

[operations/puppet@production] prometheus: Add records for ATS percent usage

https://gerrit.wikimedia.org/r/838911

Change 830950 merged by BCornwall:

[operations/alerts@master] ats: Alert on high connection/request count

https://gerrit.wikimedia.org/r/830950

Vgutierrez subscribed.

reopening this as I've found some issues:

metric names aren't consistent with existing ones, all the previous metrics are named using the prefix trafficserver, the ones added here are using the prefix ats.

the puppetization ignores the fact that several ATS instances can be deployed in one server. right now within define profile::trafficserver::monitoring we are instantiating the class like this:

class { 'prometheus::node_ats_config': }

this will make puppet fail if several instances of profile::trafficserver::monitoring are defined on the same server.

It also lacks support for gathering the metric from several ATS instances (ats-backend and ats-tls) this shouldn't be a big deal cause we're removing support of ats-tls in https://gerrit.wikimedia.org/r/c/operations/puppet/+/849178 but considering that all the existing puppetization supports multiple ATS instances running, we should keep providing that support unless we get rid of the multi instance support

oh, and we're seeing some errors like:

Oct 27 02:06:46 cp4043 prometheus-ats-config[1787]: Traffic Server: failed to fetch proxy.config.net.max_connections_in: [5] Error establishing socket connection.

I'm assuming that that's happening on reimage time where a race condition is hit and prometheus-ats-config runs while trafficserver hasn't started yet, we could leverage systemctl is-active trafficserver to prevent this from happening

Change 851139 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Rename ats_ metrics to trafficserver_

https://gerrit.wikimedia.org/r/851139

Change 851669 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Handle inactive trafficserver service

https://gerrit.wikimedia.org/r/851669

Change 853387 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Alter node_ats_config class to include

https://gerrit.wikimedia.org/r/853387

Change 853387 merged by Ssingh:

[operations/puppet@production] prometheus: Alter node_ats_config class to include

https://gerrit.wikimedia.org/r/853387

Change 851139 merged by BCornwall:

[operations/puppet@production] prometheus: Rename ats_ metrics to trafficserver_

https://gerrit.wikimedia.org/r/851139

Change 855494 had a related patch set uploaded (by Vgutierrez; author: BCornwall):

[operations/puppet@production] prometheus: Rename ats_ metrics to trafficserver_

https://gerrit.wikimedia.org/r/855494

Change 855494 merged by Vgutierrez:

[operations/puppet@production] prometheus: Rename ats_ metrics to trafficserver_

https://gerrit.wikimedia.org/r/855494

Change 856586 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] prometheus::ats_node_config: remove prometheus-ats-config

https://gerrit.wikimedia.org/r/856586

Change 856586 merged by Vgutierrez:

[operations/puppet@production] prometheus::ats_node_config: remove prometheus-ats-config

https://gerrit.wikimedia.org/r/856586

Change 856593 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] prometheus/trafficserver: Remove node_ats_config

https://gerrit.wikimedia.org/r/856593

Change 856593 merged by Vgutierrez:

[operations/puppet@production] prometheus/trafficserver: Remove node_ats_config

https://gerrit.wikimedia.org/r/856593

Change 857070 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Refactor ATS config monitoring

https://gerrit.wikimedia.org/r/857070

@Vgutierrez, while https://gerrit.wikimedia.org/r/c/operations/puppet/+/857070/ doesn't have strict support for multiple ATS instances, bblack suggested that by simplifying all this it would make it trivial to add that support should we decide to pursue it.

Is that acceptable or do you feel strongly that the support should be added now?

Change 851669 abandoned by BCornwall:

[operations/puppet@production] prometheus: Handle inactive trafficserver service

Reason:

I28f84282e875ea3db647334086bef177f995783a handles this in a better way

https://gerrit.wikimedia.org/r/851669

Change 857070 merged by BCornwall:

[operations/puppet@production] prometheus: Refactor ATS config monitoring

https://gerrit.wikimedia.org/r/857070

The new patch which was just deployed addresses all these concerns. I'll close the ticket but please feel free to re-open if something isn't quite right. Thanks!

Change 858418 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Remove old ats config export job

https://gerrit.wikimedia.org/r/858418

Change 858418 merged by BCornwall:

[operations/puppet@production] prometheus: Remove old ats config export job

https://gerrit.wikimedia.org/r/858418

Change 858658 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] node: Exclude trafficserver promfile mtime check

https://gerrit.wikimedia.org/r/858658

Change 858658 had a related patch set uploaded (by Vgutierrez; author: BCornwall):

[operations/alerts@master] node: Exclude trafficserver promfile mtime check

https://gerrit.wikimedia.org/r/858658

Change 858658 merged by Vgutierrez:

[operations/alerts@master] node: Exclude trafficserver promfile mtime check

https://gerrit.wikimedia.org/r/858658