
PoC alert/notification functionality with Elastic Stack
Open, Stalled, Normal, Public

Description

As referred to in T123243 and T211700, there has been talk for some time of looking into https://github.com/Yelp/elastalert (or alternatives?) for alerting on and correlating logs (this is mentioned in the logging design doc as well). One of the ideas here is that this would replace the work done in T208611 (which will make @Volans very happy).

I'm going to try to workshop this a bit in the logging cloud project and then possibly move demo functionality to deployment-prep, depending on how things go.

Details

Related Gerrit Patches:
[operations/puppet@production] elastalert: enable on logstash1007
[operations/puppet@production] elastalert: new module
[operations/puppet@production] aptrepo: add component/elastalert

Event Timeline

chasemp triaged this task as Normal priority. Jan 16 2019, 3:25 PM
chasemp created this task.
Restricted Application added a subscriber: Aklapper. Jan 16 2019, 3:25 PM
EBjune added a subscriber: Gehel. Jan 16 2019, 4:24 PM
chasemp added a project: Restricted Project. Apr 1 2019, 7:33 PM

Change 502773 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP elastalert module

https://gerrit.wikimedia.org/r/502773

Change 503014 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] aptrepo: add component/elastalert

https://gerrit.wikimedia.org/r/503014

chasemp reassigned this task from chasemp to fgiunchedi. Apr 11 2019, 7:29 PM

Reassigning to reflect the reality of Filippo's awesomeness

Change 503014 merged by Filippo Giunchedi:
[operations/puppet@production] aptrepo: add component/elastalert

https://gerrit.wikimedia.org/r/503014

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Apr 23 2019, 12:28 PM

Change 505762 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP elastalert: enable on logstash1007

https://gerrit.wikimedia.org/r/505762

Elastalert is running on deployment-logstash2 now. I had to fudge it a little because the instance runs jessie (cf. T218729), but other than that it will work as it would in production, i.e. with https://gerrit.wikimedia.org/r/c/operations/puppet/+/505762 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/502773 merged in production, as opposed to cherry-picked on the deployment-prep puppetmaster.

For now the rules live only on the host itself, for experimentation purposes. For the first iteration we'll keep the rules in private.git, and possibly in the future in a separate rules git repository that is private (in the sense of gerrit access) to enable self-service.

The service name is elastalert@security and config / rules live in /etc/elastalert/security. I left a badpass.yaml example file, feel free to change/tweak as needed! cc @Dsharpe and let me know how we can help!
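For anyone unfamiliar with what a "too many bad passwords" style rule does, here is a rough Python sketch of the general idea behind a frequency check: alert when at least N events land within a trailing time window. This is only an illustration under assumptions. It is not the contents of badpass.yaml, nor ElastAlert's actual implementation, and the function name, threshold, and sample timestamps are all made up.

```python
from collections import deque
from datetime import datetime, timedelta


def frequency_matches(timestamps, num_events, timeframe):
    """Return the timestamps at which at least `num_events` events
    fall within the trailing `timeframe` window (hypothetical helper)."""
    window = deque()
    matches = []
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop events that have fallen out of the trailing window.
        while ts - window[0] > timeframe:
            window.popleft()
        if len(window) >= num_events:
            matches.append(ts)
            # Simplifying assumption: start counting afresh after a match.
            window.clear()
    return matches


# Made-up data: five failed logins 10 seconds apart, then a lone one an hour later.
start = datetime(2019, 5, 1, 12, 0, 0)
failed_logins = [start + timedelta(seconds=10 * i) for i in range(5)]
failed_logins.append(start + timedelta(hours=1))

alerts = frequency_matches(
    failed_logins, num_events=5, timeframe=timedelta(minutes=5)
)
print(alerts)
```

With this sample data the burst of five events triggers one match, and the isolated event an hour later does not.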

fgiunchedi moved this task from Doing to Radar on the User-fgiunchedi board. May 13 2019, 8:56 AM
sbassett moved this task from Backlog to Waiting on the Security-Team board. Oct 4 2019, 5:03 PM
sbassett changed the task status from Open to Stalled. Oct 4 2019, 5:05 PM

Hey @fgiunchedi - I don't believe anyone on the Security-Team currently has access to deployment-logstash2 / https://logstash-beta.wmflabs.org/, so this isn't really feasible for us to test until 1) @chasemp returns 2) more of us get access. Setting to stalled for now.

fgiunchedi removed fgiunchedi as the assignee of this task. Oct 11 2019, 5:52 PM
fgiunchedi added a subscriber: fgiunchedi.

Hi @sbassett, apologies for the delayed reply! I'm not sure if deployment-prep access is all-or-nothing for services or shell access, in the sense that access to https://logstash-beta.wmflabs.org is via one shared user/password and the credentials are stored in a file on one of the deployment-prep hosts.

At any rate, I'm not sure when I'll have time / bandwidth to resume work on this and e.g. make sure elastalert on deployment-prep works as expected, so I'm unassigning for now.