Ingest webrequest sampled 1000 into logstash
Open, Medium, Public

Description

We have 5xx.json available in logstash (and on file on centrallog hosts in /srv/weblog/webrequest), which is useful to debug/investigate errors.

With ELK7, and more capacity (and headroom) in general, I (Filippo) think we should be ingesting the sampled (1/1000) webrequest stream, specifically for:

  • Access to dashboards to debug/investigate abuse incidents
  • Sharing dashboards and findings during incidents

The data has PII; however, I don't think it is at greater risk than the PII already in kafka/logstash (e.g. IP addresses and user agents).

Implementation wise, we currently funnel 5xx as such:

kafkatee -> grep/jq/logger -> rsyslog -> kafka -> logstash

The easy (not necessarily simple) thing to do is to replicate the same pipeline with the sampled-1000 kafkatee output (i.e. an additional load of at most ~200 logs/s).
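
For illustration only, the grep/jq filter stage of that pipeline could be sketched roughly as below. This is a minimal sketch, not the script actually deployed: the function name `filter_5xx` and the field name `http_status` are assumptions, and the input is assumed to be one JSON object per line as emitted by kafkatee.

```python
import json
import sys
from typing import Iterable, Iterator


def filter_5xx(lines: Iterable[str], status_field: str = "http_status") -> Iterator[str]:
    """Yield compact JSON for records whose status code is in the 500-599 range.

    `status_field` is a hypothetical field name; adjust it to the real schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than aborting the stream
        if not isinstance(record, dict):
            continue
        try:
            status = int(record.get(status_field, 0))
        except (TypeError, ValueError):
            continue  # status missing or not numeric
        if 500 <= status <= 599:
            yield json.dumps(record, separators=(",", ":"))


if __name__ == "__main__":
    # Reads JSON lines on stdin and prints matching records on stdout,
    # ready to be piped into logger/rsyslog.
    for out in filter_5xx(sys.stdin):
        print(out)
```

For the sampled-1000 output the status check would presumably be dropped, so that every sampled record is forwarded rather than only the 5xx ones.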

Event Timeline

My (perhaps dated or incorrect) understanding is that:

  1. We currently have no RBAC in Logstash;
  2. Everyone in the "NDA" group has access to all data stored in Logstash;
  3. Access to access logs in general is more restricted: it is limited to a subset of NDA users, namely the analytics-privatedata group (membership managed by the D/E team);
  4. sampled-1000 is a subset of access logs, available on the centrallog hosts, to which only ops/roots have access (so even more restricted).

Are any of these assumptions incorrect at this time? If not, does that mean that this task is effectively proposing to expand access to (a 1:1000 sample of) access logs to a wider group of individuals? Not saying (yet) whether this is a problem per se, but I think it'd be good to establish shared understanding of what is being proposed and discussed here -- specifically whether this is a technical change, or an access control or PII-sharing change. Thanks!

Thank you @faidon for the questions -- assumptions 1, 2 and 4 are correct TTBOMK. For 3, things are (in my opinion) a little fuzzier, since cn=nda has access to webrequest_sampled_128 via turnilo (though the raw data isn't available for download).

Hope that helps!

jbond triaged this task as Medium priority. Feb 16 2022, 4:58 PM

This is still valid, though nowadays the implementation will be much simpler: we can ingest webrequest_sampled directly from Kafka!