Audit/log AM silences
Closed, Resolved (Public)

Assigned To

Authored By

	fgiunchedi
	Oct 25 2022, 3:27 PM

Description

While looking into T321547 I realized we're not logging which silences get POST'ed to the AM API. The spicerack silences are logged by other means, however we should have a generic mechanism to audit/log all silences (e.g. those set by humans via alerts.w.o).

To this end, the simplest approach seems to be to instruct apache to log POST bodies. Somewhat surprisingly to me, this is easier said than done. There are three approaches that emerged after a chat with @elukey (in no particular order):

mod_dumpio https://httpd.apache.org/docs/current/mod/mod_dumpio.html
mod_security e.g. https://serverfault.com/questions/728575/what-rule-can-i-use-in-modsecurity-to-log-post-payload-for-a-specific-site
mod_ext_filter https://httpd.apache.org/docs/current/mod/mod_ext_filter.html

Each has its pros and cons, to be evaluated here (and then implemented).

Next steps are to make sure AM clients go through apache:

amtool via /etc/prometheus/amtool.yml
karma via /etc/karma.yml
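
As an illustration of what "going through apache" means for a client, here is a minimal Python sketch that builds (but does not send) a silence-creation request aimed at the Apache-fronted endpoint. The host, path and payload fields follow the `POST /api/v2/silences` request format; the plain-http scheme, the `build_silence_request` helper and the example values are assumptions made for this sketch, not the configuration shipped in puppet.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: the Apache-fronted endpoint. The scheme is a guess; a real
# client would take the URL from its own configuration.
AM_URL = "http://alertmanager-eqiad.wikimedia.org/api/v2/silences"


def build_silence_request(instance_regex, comment, created_by, hours=2, url=AM_URL):
    """Build (but do not send) a POST request that creates a silence."""
    now = datetime.now(timezone.utc)
    payload = {
        "matchers": [{"name": "instance", "value": instance_regex, "isRegex": True}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "comment": comment,
        "createdBy": created_by,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values, for illustration only.
req = build_silence_request(
    r"^(kafka\-jumbo1008)(\..+)?(:[0-9]+)?$", "host reimage", "example@cumin1001"
)
print(req.get_method(), req.full_url)
```

Because the request targets the Apache vhost rather than the Alertmanager port directly, its body is what a POST-logging rule on the Apache side would get to see.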

Details

Subject	Repo	Branch	Lines +/-
alertmanager: sanitise silence audit log	operations/puppet	production	+4 -0
alertmanager: let karma use apache to access AM	operations/puppet	production	+2 -2
alertmanager: ship amtool.yml for AM api access	operations/puppet	production	+11 -0
alertmanager: also allow local access to the API	operations/puppet	production	+2 -0
alertmanager: allow api access for alertmanagers hosts too	operations/puppet	production	+10 -13
alertmanager::api: enable POST logging	operations/puppet	production	+19 -0


Related Objects

Mentioned In: T321547: PyBalBGPUnstable didn't report T321545
Mentioned Here: T321547: PyBalBGPUnstable didn't report T321545

Event Timeline

fgiunchedi created this task. Oct 25 2022, 3:27 PM

Restricted Application added a subscriber: Aklapper. Oct 25 2022, 3:27 PM

fgiunchedi mentioned this in T321547: PyBalBGPUnstable didn't report T321545. Oct 25 2022, 3:28 PM

RhinosF1 subscribed. Oct 26 2022, 12:51 PM

herron subscribed. Oct 26 2022, 2:17 PM

lmata subscribed. Oct 26 2022, 2:17 PM

colewhite subscribed. Oct 26 2022, 2:17 PM

TheresNoTime removed a subscriber: RhinosF1. Dec 15 2022, 11:35 PM

initial inclination/impressions:

mod_dumpio I think we could put on the back burner, since its documentation warns of generating extreme volumes of debug logs.

mod_security should be flexible enough to create custom POST data logging rules without introducing a lot of extra log noise. FWIW I've had success using it to evaluate, rate-limit and log POST data on the lists system, so there is a modest amount of existing modsec config in puppet.

mod_ext_filter strikes me as a more manual approach, writing our own filter script to capture the log data we want. IMO could be a backup option if for some reason mod_security doesn't pan out for this case.

I'll volunteer to get started on a POC for this using mod_security, unless someone chimes in with a strong preference otherwise!

Change 965785 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] alertmanager::api: enable POST logging

https://gerrit.wikimedia.org/r/965785

gerritbot added a project: Patch-For-Review. Oct 13 2023, 5:12 PM

Above is a patch for initial audit logging of POST data via modsec. Once we have some example data to work with we can refine the rules to log more human-readable entries.

Also, another option that came to mind while working on this is potentially increasing the logging verbosity of the AM process itself (untested, and with possible side effects).

fgiunchedi updated the task description. Oct 16 2023, 8:56 AM

fgiunchedi added a project: User-fgiunchedi. Oct 16 2023, 1:14 PM

Change 965785 merged by Herron:

[operations/puppet@production] alertmanager::api: enable POST logging

https://gerrit.wikimedia.org/r/965785

Maintenance_bot removed a project: Patch-For-Review. Oct 16 2023, 4:10 PM

lmata added a project: SRE Observability (FY2023/2024-Q2). Oct 17 2023, 12:22 AM

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Oct 23 2023, 10:09 AM

Change 967904 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: allow api access for alertmanagers hosts too

https://gerrit.wikimedia.org/r/967904

gerritbot added a project: Patch-For-Review. Oct 23 2023, 12:35 PM

Change 967904 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: allow api access for alertmanagers hosts too

https://gerrit.wikimedia.org/r/967904

Maintenance_bot removed a project: Patch-For-Review. Oct 23 2023, 3:10 PM

Change 968119 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: let karma use apache to access AM

https://gerrit.wikimedia.org/r/968119

gerritbot added a project: Patch-For-Review. Oct 24 2023, 7:26 AM

Change 968231 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: also allow local access to the API

https://gerrit.wikimedia.org/r/968231

Change 968232 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: ship amtool.yml for AM api access

https://gerrit.wikimedia.org/r/968232

Change 968231 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: also allow local access to the API

https://gerrit.wikimedia.org/r/968231

Change 968232 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: ship amtool.yml for AM api access

https://gerrit.wikimedia.org/r/968232

Change 968119 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: let karma use apache to access AM

https://gerrit.wikimedia.org/r/968119

Change 968615 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: sanitise silence audit log

https://gerrit.wikimedia.org/r/968615

Change 968615 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: sanitise silence audit log

https://gerrit.wikimedia.org/r/968615

Maintenance_bot removed a project: Patch-For-Review. Oct 26 2023, 8:10 AM

I believe we're in a good spot in terms of auditing silences now, for example:

--77ff9252-A--
[26/Oct/2023:08:24:38 +0000] ZToiRohqouyPQsWblnWeHQAAAAA 2620:0:861:103:10:64:32:25 37936 2620:0:861:3:208:80:154:88 80
--77ff9252-B--
POST /api/v2/silences HTTP/1.1
Host: alertmanager-eqiad.wikimedia.org
User-Agent: pywmflib/1.2.3 spicerack.alertmanager.AlertmanagerHosts +https://wikitech.wikimedia.org/wiki/Python/Wmflib
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 284
Content-Type: application/json

--77ff9252-C--
{"matchers": [{"name": "instance", "value": "^(kafka\\-jumbo1008)(\\..+)?(:[0-9]+)?$", "isRegex": true}], "startsAt": "2023-10-26T08:24:38.983943+00:00", "endsAt": "2023-10-26T10:24:38.983943+00:00", "comment": "host reimage - brouberol@cumin1001", "createdBy": "brouberol@cumin1001"}
--77ff9252-F--
HTTP/1.1 200 OK
Content-Type: application/json
Vary: Origin
Content-Length: 53
Backend-Timing: D=2684 t=1698308678996540
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive

Calling this done since we have a 30-day audit trail for silences issued via alertmanager-{codfw,eqiad}.wikimedia.org (i.e. by cumin, karma, amtool, etc.).
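
To make entries like the one above easier to read, a small script can split an entry into its sections and summarise the JSON body. The following is a rough Python sketch under the assumption that each entry uses the `--<id>-<letter>--` markers shown in the example, with the timestamp in section A and the request body in section C. The function names and the shortened sample entry are made up for illustration and are not part of the deployed setup.

```python
import json
import re

# Section markers look like "--77ff9252-A--": boundary id, then a section letter.
SECTION_RE = re.compile(r"^--([0-9a-f]+)-([A-Z])--$")


def parse_audit_entry(text):
    """Split one audit log entry into {section_letter: body}."""
    sections = {}
    current = None
    for line in text.splitlines():
        m = SECTION_RE.match(line.strip())
        if m:
            current = m.group(2)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}


def summarize_silence(entry_text):
    """Return a one-line summary of the silence found in section C."""
    sections = parse_audit_entry(entry_text)
    # Section A starts with "[timestamp] ..."; keep only the timestamp.
    timestamp = sections["A"].split("]")[0].lstrip("[")
    silence = json.loads(sections["C"])
    matchers = ", ".join(
        "{}{}{}".format(m["name"], "=~" if m.get("isRegex") else "=", m["value"])
        for m in silence["matchers"]
    )
    return "{} {} silenced [{}] until {}: {}".format(
        timestamp, silence["createdBy"], matchers, silence["endsAt"], silence["comment"]
    )


# Shortened, made-up entry in the same layout as the example above.
SAMPLE = """--abcd1234-A--
[26/Oct/2023:08:24:38 +0000] exampleid 192.0.2.1 37936 192.0.2.2 80
--abcd1234-B--
POST /api/v2/silences HTTP/1.1
Content-Type: application/json

--abcd1234-C--
{"matchers": [{"name": "instance", "value": "host1001", "isRegex": false}], "startsAt": "2023-10-26T08:24:38+00:00", "endsAt": "2023-10-26T10:24:38+00:00", "comment": "host reimage", "createdBy": "example@cumin1001"}
--abcd1234-F--
HTTP/1.1 200 OK
"""

print(summarize_silence(SAMPLE))
```

The sketch handles a single entry at a time; a log file with many entries would first need to be split on the section A markers.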

lmata moved this task from Inbox to Done on the SRE Observability (FY2023/2024-Q2) board. Jan 26 2024, 1:08 AM
