alert1001 showed an increase in both network traffic and network errors after certspotter was enabled. I stopped it manually, and that triggered a decrease in both metrics: https://grafana.wikimedia.org/goto/qKcze3Y7k?orgId=1
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T204994 Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes
Open | | None | T204993 Update certspotter
Resolved | | ssingh | T303593 increase of network errors on alert1001 after certspotter has been enabled
Event Timeline
Change 769928 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] certspotter: Temporarily disable certspotter
Change 769928 merged by Vgutierrez:
[operations/puppet@production] certspotter: Temporarily disable certspotter
Mentioned in SAL (#wikimedia-operations) [2022-03-11T10:25:34Z] <vgutierrez> disable certspotter - T303593
Change 770000 had a related patch set uploaded (by Ssingh; author: Ssingh):
[operations/puppet@production] certspotter: add -start_at_end to only fetch new logs
Change 770000 merged by Ssingh:
[operations/puppet@production] certspotter: add -start_at_end to only fetch new logs
Change 770012 had a related patch set uploaded (by Ssingh; author: Ssingh):
[operations/puppet@production] certspotter: re-enable systemd timer
Change 770012 merged by Ssingh:
[operations/puppet@production] certspotter: re-enable systemd timer
Change 771610 had a related patch set uploaded (by Ssingh; author: Ssingh):
[operations/puppet@production] P:icinga: add profile for performance tweaking
Change 771610 merged by Ssingh:
[operations/puppet@production] P:icinga: add profile for performance tweaking
This has alleviated some of the network errors (see the 15:07 UTC mark), so that's a good sign.
I think we have two options to (try to) completely fix this issue:
- update certspotter to limit the concurrency when it fetches the logs. This does not mean reducing the number of CT log servers it queries, but limiting the concurrency so that it pauses between those queries.
- as per the commit message above, "We first start by setting interface::rps to the alerting_host role in order to improve the network performance. In the next commit, we will try to adjust the RX queuelen, which is currently set to 200 (with a maximum support of 2047)." We can try doing this manually first to check whether it makes a difference, and then puppetize it.
No strong preference for either one. We will be modifying certspotter anyway to exclude our own certificates, so that path has to be taken at some point. The network fixes did help, and they are probably quicker, so they are worth the attempt.
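To illustrate the first option, here is a minimal sketch of bounded concurrency with a pause between queries. It is written in Python purely for illustration and is not certspotter code; `fetch_all`, `fetch`, `max_concurrency`, and `pause_seconds` are hypothetical names, and the actual fetching is left to a caller-supplied function.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(log_urls, fetch, max_concurrency=2, pause_seconds=0.5):
    """Call fetch(url) for every URL, at most max_concurrency at a time.

    Every log is still queried; only the number of in-flight queries is
    capped, and each worker pauses after a query before taking the next.
    All names and default values here are hypothetical.
    """
    def worker(url):
        try:
            return fetch(url)
        finally:
            # Pause so consecutive queries from this worker are spread out.
            time.sleep(pause_seconds)

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() returns results in the same order as log_urls.
        return list(pool.map(worker, log_urls))
```

The set of CT logs stays the same; only the burstiness of the traffic changes, which is the distinction made in the first option. Suitable values for the two knobs would have to be found by watching the network error graphs.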
Since the larger network issues have been fixed, I'm going to close this as resolved. Further improvements suggested by @ssingh would probably be better served in a new ticket for increased clarity.