increase of network errors on alert1001 after certspotter has been enabled
Closed, Resolved · Public

Assigned To

Authored By

	Vgutierrez
	Mar 11 2022, 10:02 AM

Description

alert1001 showed an increase in both network traffic and network errors after certspotter was enabled. I stopped it manually, and that triggered a decrease in both metrics: https://grafana.wikimedia.org/goto/qKcze3Y7k?orgId=1

Details

Subject	Repo	Branch	Lines +/-
P:icinga: add profile for performance tweaking	operations/puppet	production	+12 -0
certspotter: re-enable systemd timer	operations/puppet	production	+1 -1
certspotter: add -start_at_end to only fetch new logs	operations/puppet	production	+1 -1
certspotter: Temporarily disable certspotter	operations/puppet	production	+1 -1

Related Objects

Status	Assigned	Task
Open	None	T204994 Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes
Open	None	T204993 Update certspotter
Resolved	ssingh	T303593 increase of network errors on alert1001 after certspotter has been enabled

Event Timeline

Vgutierrez created this task. Mar 11 2022, 10:02 AM

Restricted Application removed a project: Patch-For-Review. Mar 11 2022, 10:02 AM

Change 769928 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] certspotter: Temporarily disable certspotter

https://gerrit.wikimedia.org/r/769928

gerritbot added a project: Patch-For-Review. Mar 11 2022, 10:04 AM

Change 769928 merged by Vgutierrez:

[operations/puppet@production] certspotter: Temporarily disable certspotter

https://gerrit.wikimedia.org/r/769928

Mentioned in SAL (#wikimedia-operations) [2022-03-11T10:25:34Z] <vgutierrez> disable certspotter - T303593

Vgutierrez triaged this task as Medium priority. Mar 11 2022, 10:26 AM

BTullis subscribed. Mar 11 2022, 11:05 AM

Volans subscribed. Mar 11 2022, 1:35 PM

Change 770000 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] certspotter: add -start_at_end to only fetch new logs

https://gerrit.wikimedia.org/r/770000

Change 770000 merged by Ssingh:

[operations/puppet@production] certspotter: add -start_at_end to only fetch new logs

https://gerrit.wikimedia.org/r/770000

Change 770012 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] certspotter: re-enable systemd timer

https://gerrit.wikimedia.org/r/770012

Change 770012 merged by Ssingh:

[operations/puppet@production] certspotter: re-enable systemd timer

https://gerrit.wikimedia.org/r/770012

Change 771610 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:icinga: add profile for performance tweaking

https://gerrit.wikimedia.org/r/771610

Change 771610 merged by Ssingh:

[operations/puppet@production] P:icinga: add profile for performance tweaking

https://gerrit.wikimedia.org/r/771610

In T303593#7796742, @gerritbot wrote:

Change 771610 merged by Ssingh:

[operations/puppet@production] P:icinga: add profile for performance tweaking

https://gerrit.wikimedia.org/r/771610

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=alert1001&var-datasource=thanos&var-cluster=alerting&from=now-6h&to=now

This has alleviated some of the network errors, so that's a good sign. (15:07 UTC mark).

I think we have two options to (try to) completely fix this issue:

1. Update certspotter to limit its concurrency when it fetches the logs. This does not mean reducing the number of CT log servers it queries, but limiting the concurrency so that it pauses between those queries.
2. As per the commit message above, "We first start by setting interface::rps to the alerting_host role in order to improve the network performance. In the next commit, we will try to adjust the RX queuelen, which is currently set to 200 (with a maximum support of 2047)." We can try doing this manually to check if it makes a difference perhaps and then puppetize it.
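To illustrate what "limiting the concurrency" in the first option could look like, here is a minimal sketch. It is not certspotter code (certspotter is written in Go); the function name `fetch_logs_throttled`, its parameters, and the caller-supplied `fetch` callable are all hypothetical. It only shows the general idea of keeping the full list of logs while capping how many are queried at once and pausing between queries.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_logs_throttled(log_urls, fetch, max_concurrent=2, pause_seconds=0.0):
    """Call fetch(url) for every URL, with at most max_concurrent in flight.

    Hypothetical helper: every log is still queried, only the parallelism
    is capped. Each worker sleeps pause_seconds after a fetch so that
    requests are spaced out. Returns a dict mapping url -> fetch(url).
    """
    def worker(url):
        result = fetch(url)
        time.sleep(pause_seconds)  # pause before this worker takes the next log
        return url, result

    # The pool size is the concurrency limit.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return dict(pool.map(worker, log_urls))
```

The number of logs stays the same; only `max_concurrent` and `pause_seconds` change how hard the host's network interface is hit at any one moment.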

No strong preference for either one. We will be modifying certspotter to exclude our own certificates, so that path has to be taken regardless. The network fixes did help, though, and are perhaps quicker, so they are worth the attempt.
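If the RX queue length is changed manually first, one way to check whether it makes a difference is to compare the kernel's per-interface receive counters before and after the change. The sketch below parses the text of `/proc/net/dev` (standard Linux layout: two header lines, then one line per interface). The helper names and the choice of the `errs`, `drop` and `fifo` columns are assumptions for illustration; this is not what the Grafana dashboard linked above computes.

```python
def parse_rx_counters(proc_net_dev_text):
    """Return {iface: {"errs": int, "drop": int, "fifo": int}} for RX.

    Expects the contents of /proc/net/dev: two header lines, then
    'iface: rx_bytes rx_packets rx_errs rx_drop rx_fifo ...' per interface.
    """
    counters = {}
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        fields = values.split()
        if len(fields) < 5:
            continue
        counters[name.strip()] = {
            "errs": int(fields[2]),
            "drop": int(fields[3]),
            "fifo": int(fields[4]),
        }
    return counters


def rx_delta(before, after, iface):
    """Difference in RX counters for one interface between two snapshots."""
    return {key: after[iface][key] - before[iface][key] for key in before[iface]}
```

Taking one snapshot before the manual change and one after a comparable interval of certspotter activity, then calling `rx_delta`, gives a rough before/after comparison without waiting for the dashboards to catch up.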

In T303593#7797822, @ssingh wrote:

as per the commit message above, "We first start by setting interface::rps to the alerting_host role in order to improve the network performance. In the next commit, we will try to adjust the RX queuelen, which is currently set to 200 (with a maximum support of 2047)." We can try doing this manually to check if it makes a difference perhaps and then puppetize it.

Sounds like a good next step to me 👍

BBlack moved this task from Backlog to Revive/Active? on the Traffic-Icebox board. Apr 7 2022, 9:04 PM

BCornwall raised the priority of this task from Medium to Needs Triage. Mar 30 2023, 8:39 PM

BCornwall edited projects, added Traffic; removed Traffic-Icebox.

Since the larger network issues have been fixed, I'm going to close this as resolved. Further improvements suggested by @ssingh would probably be better served in a new ticket for increased clarity.
