increase of network errors on alert1001 after certspotter has been enabled
Closed, ResolvedPublic

Description

alert1001 showed an increase in both network traffic and network errors after certspotter was enabled. I stopped it manually, and both metrics decreased as a result: https://grafana.wikimedia.org/goto/qKcze3Y7k?orgId=1

image.png (268×2 px, 58 KB)

Event Timeline

Change 769928 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] certspotter: Temporarily disable certspotter

https://gerrit.wikimedia.org/r/769928

Change 769928 merged by Vgutierrez:

[operations/puppet@production] certspotter: Temporarily disable certspotter

https://gerrit.wikimedia.org/r/769928

Change 770000 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] certspotter: add -start_at_end to only fetch new logs

https://gerrit.wikimedia.org/r/770000

Change 770000 merged by Ssingh:

[operations/puppet@production] certspotter: add -start_at_end to only fetch new logs

https://gerrit.wikimedia.org/r/770000

Change 770012 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] certspotter: re-enable systemd timer

https://gerrit.wikimedia.org/r/770012

Change 770012 merged by Ssingh:

[operations/puppet@production] certspotter: re-enable systemd timer

https://gerrit.wikimedia.org/r/770012

Change 771610 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:icinga: add profile for performance tweaking

https://gerrit.wikimedia.org/r/771610

Change 771610 merged by Ssingh:

[operations/puppet@production] P:icinga: add profile for performance tweaking

https://gerrit.wikimedia.org/r/771610

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=alert1001&var-datasource=thanos&var-cluster=alerting&from=now-6h&to=now

This has alleviated some of the network errors (see the 15:07 UTC mark), so that's a good sign.

I think we have two options to (try to) completely fix this issue:

  • update certspotter to limit the concurrency when it fetches the logs. This does not mean reducing the number of CT log servers it queries, but limiting the concurrency so that it pauses between those queries.
  • as per the commit message above, "We first start by setting interface::rps to the alerting_host role in order to improve the network performance. In the next commit, we will try to adjust the RX queuelen, which is currently set to 200 (with a maximum support of 2047)." We can try doing this manually to check if it makes a difference perhaps and then puppetize it.
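
As a rough illustration of the first option, here is a minimal sketch of bounded concurrency with a pause between queries. It is not certspotter's code (certspotter is written in Go); `fetch_all`, `fetch`, `max_concurrent` and `pause` are hypothetical names chosen only for this example.

```python
import asyncio


async def fetch_all(logs, fetch, max_concurrent=2, pause=0.0):
    """Run fetch(log) for every log, with at most max_concurrent in flight.

    `fetch` is a hypothetical coroutine function standing in for
    "query one CT log server"; it is an assumption, not a real API.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(log):
        async with sem:
            result = await fetch(log)
            # Keep holding the slot during the pause so that the next
            # query does not start immediately after this one ends.
            await asyncio.sleep(pause)
            return result

    # gather() returns results in the same order as the input logs.
    return await asyncio.gather(*(worker(log) for log in logs))
```

All logs are still queried; only the number of simultaneous requests and the spacing between them change, which is the distinction the first bullet draws.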

No strong preference for either one. We will be modifying certspotter anyway to exclude our own certificates, so that path has to be taken regardless. The network fixes did help, though, and they are probably the quicker route, so they are worth the attempt.
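
For trying the RPS part by hand, a small sketch of how the CPU bitmap for the kernel's `rps_cpus` sysfs file could be computed. This is only an illustration of the generic Linux RPS interface, not what the `interface::rps` Puppet code does; the helper names and the interface name in the usage comment are hypothetical.

```python
def rps_cpu_mask(cpus):
    """Return the hex bitmap string for a set of CPU numbers.

    The bitmap is emitted as comma-separated 32-bit hex words, most
    significant word first, which is the form the kernel uses for
    rps_cpus.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    words = []
    while True:
        words.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
        if mask == 0:
            break
    return ",".join(reversed(words))


def rps_path(iface, queue):
    """Path of the RPS bitmap file for one RX queue of an interface."""
    return f"/sys/class/net/{iface}/queues/rx-{queue}/rps_cpus"


# Hypothetical usage (needs root, and the interface name is made up):
#   with open(rps_path("eno1", 0), "w") as f:
#       f.write(rps_cpu_mask(range(8)))
```

Writing the value by hand like this does not survive a reboot, which is why puppetizing it afterwards would still be needed.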

  • as per the commit message above, "We first start by setting interface::rps to the alerting_host role in order to improve the network performance. In the next commit, we will try to adjust the RX queuelen, which is currently set to 200 (with a maximum support of 2047)." We can try doing this manually to check if it makes a difference perhaps and then puppetize it.

Sounds like a good next step to me 👍

Mar 30 2023, 8:39 PM
BCornwall raised the priority of this task from Medium to Needs Triage.
BCornwall edited projects, added Traffic; removed Traffic-Icebox.
BCornwall subscribed.

Since the larger network issues have been fixed, I'm going to close this as resolved. Further improvements suggested by @ssingh would probably be better served in a new ticket for increased clarity.