Given the analysis capabilities of kibana/logstash, I think it would be good to have 5xx there too. Breaking them down by different dimensions should help debug what's going on, which can be especially useful during outages.
Implementation-wise, we have started with kafkatee on oxygen sending logs to logstash. Other options include having a separate kafka topic just for errors, to be consumed by logstash, for example.
Initial implementation using kafkatee on oxygen:
- kafkatee blocks signals for its children, which among other things prevents SIGPIPE from cleaning up pipelines. For a proposed fix see https://gerrit.wikimedia.org/r/#/c/352591/
- Build, upload and upgrade to the new kafkatee version
- Kibana dashboards
- Move the hostname field into host so logs appear to originate from the varnish machines and not from oxygen https://gerrit.wikimedia.org/r/#/c/353853/
- Reconstruct the full URL into the url field so that e.g. the "Top URLs" visualization does the right thing https://gerrit.wikimedia.org/r/#/c/353282/
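
To illustrate the signal-blocking item in the list above: a process's blocked-signal mask is inherited across fork and exec, so a parent that blocks SIGPIPE passes that mask on to every child it spawns unless the child unblocks it first. The following is a minimal Python sketch of that general POSIX behaviour, not kafkatee's code and not necessarily what the proposed fix does; the helper names `child_has_sigpipe_blocked` and `_unblock_sigpipe` are made up for this example.

```python
import signal
import subprocess
import sys

# Child program: report whether SIGPIPE is in its own blocked-signal mask.
# Calling pthread_sigmask(SIG_BLOCK, []) changes nothing and returns the current mask.
CHILD = (
    "import signal;"
    "print(signal.SIGPIPE in signal.pthread_sigmask(signal.SIG_BLOCK, []))"
)


def _unblock_sigpipe():
    # Hypothetical helper: runs in the child between fork and exec.
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGPIPE})


def child_has_sigpipe_blocked(unblock_in_child: bool) -> bool:
    """Spawn a child while SIGPIPE is blocked in the parent and
    return True if the child still has SIGPIPE blocked."""
    # Block SIGPIPE in the parent (the situation described in the task).
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGPIPE})
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD],
            preexec_fn=_unblock_sigpipe if unblock_in_child else None,
            capture_output=True,
            text=True,
            check=True,
        )
    finally:
        # Restore the parent's original mask.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return result.stdout.strip() == "True"


if __name__ == "__main__":
    print("inherited mask blocks SIGPIPE:", child_has_sigpipe_blocked(False))
    print("after unblocking in child:   ", child_has_sigpipe_blocked(True))
```

With the mask inherited, a command at the head of a pipeline never receives SIGPIPE when its reader goes away, so whether it exits depends on how it handles the resulting write error. Resetting the mask in the child before exec restores the usual behaviour.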