
Audit/log AM silences
Closed, Resolved (Public)

Description

While looking into T321547 I realized we're not logging which silences get POSTed to the AM API. The spicerack silences are logged by other means; however, we should have a generic mechanism to audit/log all silences (e.g. those set by humans via alerts.w.o).

To this end, the simplest approach seems to be to instruct apache to log POST bodies. Somewhat surprisingly to me, this is easier said than done. Three approaches emerged after a chat with @elukey (in no particular order):

  • mod_dumpio
  • mod_security
  • mod_ext_filter

Each has its pros and cons, to be evaluated here (and then implemented).

Next steps are to make sure AM clients go through apache:

  • amtool via /etc/prometheus/amtool.yml
  • karma via /etc/karma.yml
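
The same applies to any other client: it has to send its POST to the Apache vhost rather than to the Alertmanager port directly, otherwise the request never reaches the audit log. Below is a minimal Python sketch of such a request. It assumes the Apache-fronted endpoint is the alertmanager-eqiad host on port 80, as seen in the audit log example in this task. `build_silence_request` and its parameters are hypothetical names used for illustration, and the request is only built, not sent.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: Apache-fronted Alertmanager endpoint (host taken from the
# audit log example in this task, path from the Alertmanager v2 API).
AM_URL = "http://alertmanager-eqiad.wikimedia.org/api/v2/silences"


def build_silence_request(instance_regex, comment, created_by, hours=2, now=None):
    """Build (but do not send) a POST request that creates a silence."""
    now = now or datetime.now(timezone.utc)
    payload = {
        "matchers": [
            {"name": "instance", "value": instance_regex, "isRegex": True}
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "comment": comment,
        "createdBy": created_by,
    }
    return urllib.request.Request(
        AM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen()` would actually issue the request; that step is left out here on purpose.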

Event Timeline

initial inclination/impressions:

mod_dumpio I think we could put on the back burner, since its documentation warns of generating extreme volumes of debug logs.

mod_security should be flexible enough to create custom POST data logging rules without introducing a lot of extra log noise. FWIW I've had success using it to evaluate, rate-limit and log POST data on the lists system, so there is a modest amount of existing modsec config in puppet.

mod_ext_filter strikes me as a more manual approach: writing our own filter script to capture the log data we want. IMO it could be a backup option if for some reason mod_security doesn't pan out for this case.

I'll volunteer to get started on a POC for this using mod_security, unless someone chimes in with a strong preference otherwise!

Change 965785 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] alertmanager::api: enable POST logging

https://gerrit.wikimedia.org/r/965785

Above is a patch for initial audit logging of POST data via modsec. Once we have some example data to work with we can refine the rules to log more human readable entries.
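
To give an idea of what a more human-readable entry could look like, here is a rough Python sketch that condenses a silence POST body into a single line. This is only an illustration and not what the modsec rules do. `format_matcher` and `summarize_silence` are hypothetical helper names, and the output format is an assumption.

```python
import json


def format_matcher(matcher):
    """Render one matcher as name="value", or name=~"value" for regexes."""
    op = "=~" if matcher.get("isRegex") else "="
    return f'{matcher["name"]}{op}"{matcher["value"]}"'


def summarize_silence(body):
    """Turn the JSON body of a silence POST into a one-line summary."""
    silence = json.loads(body)
    matchers = ", ".join(format_matcher(m) for m in silence.get("matchers", []))
    return (
        f'{silence.get("createdBy", "unknown")} silenced {matchers} '
        f'from {silence.get("startsAt")} to {silence.get("endsAt")}: '
        f'{silence.get("comment", "")}'
    )
```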

Also, another option that came to mind while working on this is potentially increasing the logging verbosity of the AM process itself (untested, and with possible side effects).

Change 965785 merged by Herron:

[operations/puppet@production] alertmanager::api: enable POST logging

https://gerrit.wikimedia.org/r/965785

Change 967904 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: allow api access for alertmanagers hosts too

https://gerrit.wikimedia.org/r/967904

Change 967904 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: allow api access for alertmanagers hosts too

https://gerrit.wikimedia.org/r/967904

Change 968119 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: let karma use apache to access AM

https://gerrit.wikimedia.org/r/968119

Change 968231 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: also allow local access to the API

https://gerrit.wikimedia.org/r/968231

Change 968232 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: ship amtool.yml for AM api access

https://gerrit.wikimedia.org/r/968232

Change 968231 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: also allow local access to the API

https://gerrit.wikimedia.org/r/968231

Change 968232 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: ship amtool.yml for AM api access

https://gerrit.wikimedia.org/r/968232

Change 968119 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: let karma use apache to access AM

https://gerrit.wikimedia.org/r/968119

Change 968615 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: sanitise silence audit log

https://gerrit.wikimedia.org/r/968615

Change 968615 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: sanitise silence audit log

https://gerrit.wikimedia.org/r/968615

I believe we're in a good spot in terms of auditing silences now, for example:

--77ff9252-A--
[26/Oct/2023:08:24:38 +0000] ZToiRohqouyPQsWblnWeHQAAAAA 2620:0:861:103:10:64:32:25 37936 2620:0:861:3:208:80:154:88 80
--77ff9252-B--
POST /api/v2/silences HTTP/1.1
Host: alertmanager-eqiad.wikimedia.org
User-Agent: pywmflib/1.2.3 spicerack.alertmanager.AlertmanagerHosts +https://wikitech.wikimedia.org/wiki/Python/Wmflib
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 284
Content-Type: application/json

--77ff9252-C--
{"matchers": [{"name": "instance", "value": "^(kafka\\-jumbo1008)(\\..+)?(:[0-9]+)?$", "isRegex": true}], "startsAt": "2023-10-26T08:24:38.983943+00:00", "endsAt": "2023-10-26T10:24:38.983943+00:00", "comment": "host reimage - brouberol@cumin1001", "createdBy": "brouberol@cumin1001"}
--77ff9252-F--
HTTP/1.1 200 OK
Content-Type: application/json
Vary: Origin
Content-Length: 53
Backend-Timing: D=2684 t=1698308678996540
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
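
For anyone wanting to pull silences back out of entries like the one above, here is a minimal Python sketch. It assumes the log keeps this layout: section markers of the form `--<id>-<letter>--`, with B holding the request headers and C the request body. `parse_audit_entry` and `silence_from_entry` are hypothetical helper names, not existing tooling.

```python
import json
import re

# Section markers look like "--77ff9252-B--": a boundary id, then a letter.
SECTION_RE = re.compile(r"^--([0-9a-fA-F]+)-([A-Z])--$")


def parse_audit_entry(text):
    """Split one audit log entry into a {section letter: body text} dict."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = SECTION_RE.match(line.strip())
        if match:
            current = match.group(2)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {letter: "\n".join(lines).strip() for letter, lines in sections.items()}


def silence_from_entry(text):
    """Return the silence JSON as a dict, or None if the entry is not a silence POST."""
    sections = parse_audit_entry(text)
    request_headers = sections.get("B", "")
    body = sections.get("C", "")
    if not request_headers.startswith("POST /api/v2/silences") or not body:
        return None
    return json.loads(body)
```

For the entry above, `silence_from_entry(entry)["createdBy"]` would give the user and host that set the silence.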

fgiunchedi claimed this task.

Calling this done since we have a 30-day audit trail for silences issued via alertmanager-{codfw,eqiad}.wikimedia.org (i.e. cumin, karma, amtool, etc.).