
Store Kubernetes events for more than one hour
Closed, ResolvedPublic

Description

Kubernetes stores events in etcd for one hour (by default), and we can view them with kubectl.
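For illustration, this is roughly how those events can be inspected today (requires cluster access; the one-hour retention comes from the kube-apiserver `--event-ttl` default):

```shell
# List recent events across all namespaces, oldest first
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp

# Dump the raw event objects as JSON for ad-hoc filtering
kubectl get events --all-namespaces -o json
```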

Unfortunately, this does not allow for easy searching or aggregation, is not helpful for investigating things that happened more than an hour ago, and does not let us create alerts from those events.
I think we should store the events externally, and Elasticsearch is a good candidate for this.

There are two projects I know of for routing k8s events to elasticsearch/kafka (in no particular order):

And then there is Grafana Loki, but that is completely new and unknown territory, I guess.

Is there a chance we can push the events to elasticsearch, directly or via kafka?

Details

Project | Branch | Lines +/- | Subject
operations/deployment-charts | master | +2 -0 |
operations/deployment-charts | master | +18 -18 |
operations/docker-images/production-images | master | +7 -0 |
operations/software/heptiolabs/eventrouter | v0.3-wmf | +0 -16 |
operations/software/heptiolabs/eventrouter | v0.3-wmf | +5 -0 |
operations/deployment-charts | master | +1 -1 |
operations/deployment-charts | master | +45 -17 |
operations/deployment-charts | master | +13 -3 |
operations/deployment-charts | master | +61 -5 |
operations/software/heptiolabs/eventrouter | v0.3-wmf | +2 -2 |
operations/docker-images/production-images | master | +6 -0 |
operations/deployment-charts | master | +3 -2 |
operations/deployment-charts | master | +1 -1 |
operations/deployment-charts | master | +47 -12 |
operations/docker-images/production-images | master | +7 -1 |
operations/deployment-charts | master | +278 -0 |
operations/docker-images/production-images | master | +33 -0 |

Event Timeline

JMeybohm triaged this task as Medium priority.Sep 11 2020, 2:00 PM
JMeybohm created this task.
Restricted Application added a subscriber: Aklapper.Sep 11 2020, 2:00 PM
lmata added a subscriber: lmata.Sep 14 2020, 3:18 PM

Hello @JMeybohm, do you have some guidance on the priority of this task? Is it of interest for the next few weeks, or is it more of a nice-to-have? We have some thoughts and would like to accommodate this request in planning. Also, let us know how you'd like us to support you on this task.

Hi @lmata,
I would love to get this done at the beginning of next quarter (mainly because we're probably going to do a lot of Kubernetes upgrade work and I would like to have some event history by then). I'm happy to take a closer look at the mentioned projects and do the building/deploying work as well.

What I think I need from your side is mainly the "okay" to push those events to the logstash-* indices of the elasticsearch cluster (I can try to figure out what that would mean in terms of documents per day, size etc. - as you might need some numbers there I guess) and probably some support in how to access it (set up needed credentials/accounts etc.). But if you have any objections in general or better ideas on how to do this, please let me know.

herron added a subscriber: herron.Sep 14 2020, 4:18 PM

What I think I need from your side is mainly the "okay" to push those events to the logstash-* indices of the elasticsearch cluster (I can try to figure out what that would mean in terms of documents per day, size etc. - as you might need some numbers there I guess) and probably some support in how to access it (set up needed credentials/accounts etc.). But if you have any objections in general or better ideas on how to do this, please let me know.

Would it be possible to use/extend the approach in T207200 for this?

Would it be possible to use/extend the approach in T207200 for this?

We could of course just read the events from the k8s API from within a container, dump them to stdout, and have them picked up by the existing logging system. That would probably be the easiest way to implement this, but I'm not sure whether we'd lose structure/formatting along the way. Also, this adds the overhead of having to write everything to disk and then read and parse it again, whereas we could write structured data directly.
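The stdout approach described above can be sketched roughly as follows. This is a simplified illustration (not the actual eventrouter code); the field selection in `serialize_event` is an assumption, and `stream_events` assumes the official `kubernetes` Python client running with in-cluster credentials:

```python
import json


def serialize_event(event: dict) -> str:
    """Flatten a Kubernetes Event object into a single JSON line, so the
    node-level logging pipeline can pick it up from the container's
    stdout without losing structure."""
    record = {
        "namespace": event.get("metadata", {}).get("namespace"),
        "reason": event.get("reason"),
        "type": event.get("type"),
        "message": event.get("message"),
        "count": event.get("count"),
        "involvedObject": event.get("involvedObject"),
    }
    return json.dumps(record, sort_keys=True)


def stream_events() -> None:
    """Watch the API and print one JSON line per event (sketch only;
    requires a cluster and `pip install kubernetes`)."""
    from kubernetes import client, config, watch

    config.load_incluster_config()
    v1 = client.CoreV1Api()
    api = client.ApiClient()
    for item in watch.Watch().stream(v1.list_event_for_all_namespaces):
        # sanitize_for_serialization turns the V1Event model into a dict
        print(serialize_event(api.sanitize_for_serialization(item["object"])),
              flush=True)
```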

Thanks @JMeybohm, ok I think we should defer to your expertise with regard to the optimal way to output these logs from the Kubernetes environment.

From the logging perspective, we'll want these to be shipped first to kafka-logging where they will be picked up by logstash and then output to elasticsearch. If the rsyslog approach is workable it might save us some prep work in the logging config and possibly a bit of maintenance, but if another tool is better suited to the task we can support this as well. In the latter case we'll probably want to assign this its own Kafka topic(s).

herron moved this task from Inbox to In progress on the observability board.Sep 15 2020, 4:06 PM
JMeybohm moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

Change 634985 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/docker-images/production-images@master] Initial commit of eventrouter docker image

https://gerrit.wikimedia.org/r/634985

A quick chat on IRC revealed that we don't have a "good for Kubernetes" way to discover the Kafka brokers (like DNS SRV records); producing directly to kafka-logging would require some coupling with Puppet code and re-deployments on changes to the kafka-logging brokers (which is obviously bad).

As the volume of logs is expected to be rather low, I'll start with logging to stdout/stderr and have the default logging pipeline (rsyslog) pick this up.

Joe added a subscriber: Joe.Oct 20 2020, 7:56 AM

A quick chat on IRC revealed that we don't have a "good for Kubernetes" way to discover the Kafka brokers (like DNS SRV records); producing directly to kafka-logging would require some coupling with Puppet code and re-deployments on changes to the kafka-logging brokers (which is obviously bad).

This can and should be fixed, IMHO. Having said that

As the volume of logs is expected to be rather low, I'll start with logging to stdout/stderr and have the default logging pipeline (rsyslog) pick this up.

I think this is the best solution rn.

Change 635258 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Initial commit of eventrouter chart from stable/charts

https://gerrit.wikimedia.org/r/635258

Change 635259 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] admin: deploy eventrouter to all clusters

https://gerrit.wikimedia.org/r/635259

Change 634985 merged by JMeybohm:
[operations/docker-images/production-images@master] Initial commit of eventrouter docker image

https://gerrit.wikimedia.org/r/634985

Change 635258 merged by jenkins-bot:
[operations/deployment-charts@master] Initial commit of eventrouter chart from stable/charts

https://gerrit.wikimedia.org/r/635258

Change 635986 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/docker-images/production-images@master] eventrouter: always log to stderr

https://gerrit.wikimedia.org/r/635986

Change 635986 merged by JMeybohm:
[operations/docker-images/production-images@master] eventrouter: always log to stderr

https://gerrit.wikimedia.org/r/635986

Change 635259 merged by jenkins-bot:
[operations/deployment-charts@master] admin: deploy eventrouter to all clusters

https://gerrit.wikimedia.org/r/635259

Change 636008 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] admin: fix eventrouter chart reference

https://gerrit.wikimedia.org/r/636008

Change 636008 merged by jenkins-bot:
[operations/deployment-charts@master] admin: fix eventrouter chart reference

https://gerrit.wikimedia.org/r/636008

Change 636023 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] eventrouter: don't deploy to production clusters by now

https://gerrit.wikimedia.org/r/636023

Change 636023 merged by jenkins-bot:
[operations/deployment-charts@master] eventrouter: don't deploy to production clusters by now

https://gerrit.wikimedia.org/r/636023

Unfortunately it looks as if the logging pipeline does not parse the output of eventrouter by default:
https://logstash-next.wikimedia.org/goto/d8b98b06cbe6f8089e48c090f479bfc9

JMeybohm added a comment.EditedOct 23 2020, 2:52 PM

Unfortunately it looks as if the logging pipeline does not parse the output of eventrouter by default:
https://logstash-next.wikimedia.org/goto/d8b98b06cbe6f8089e48c090f479bfc9

That was me unable to read the code properly. Without glog (using "stdoutsink") only the JSON is written to stdout. Unfortunately that leads to indexing errors in elasticsearch: https://logstash-next.wikimedia.org/app/discover#/doc/acba6310-f6d3-11ea-b848-090a7444f26c/logstash-syslog-2020.10.23?id=0HDuVXUB4ZFhBDlTjQgt

[2020-10-23T14:48:16,108][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-syslog-2020.10.23", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x7261c9c7>], :response=>{"index"=>{"_index"=>"logstash-syslog-2020.10.23", "_type"=>"_doc", "_id"=>"bvnuVXUBVJIw0T9Wg7nf", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [event] of type [text] in document with id 'bvnuVXUBVJIw0T9Wg7nf'. Preview of field's value: '{firstTimestamp=2020-10-23T14:40:31Z, reason=SuccessfulCreate, metadata={uid=b3db4dd4-153d-11eb-bac4-aa00007d1a57, resourceVersion=35025809, namespace=kube-system, name=eventrouter-5d7cfcc4f7.1640a60aa2feb311, creationTimestamp=2020-10-23T14:40:31Z, selfLink=/api/v1/namespaces/kube-system/events/eventrouter-5d7cfcc4f7.1640a60aa2feb311}, involvedObject={uid=4d76a337-1535-11eb-bac4-aa00007d1a57, apiVersion=apps/v1, kind=ReplicaSet, resourceVersion=35021131, namespace=kube-system, name=eventrouter-5d7cfcc4f7}, reportingInstance=, lastTimestamp=2020-10-23T14:40:31Z, eventTime=null, count=1, source={component=replicaset-controller}, type=Normal, message=Created pod: eventrouter-5d7cfcc4f7-x8tvj, reportingComponent=}'", "caused_by"=>{"type"=>"illegal_state_exception", "reason"=>"Can't get text on a START_OBJECT at 1:1074"}}}}}

This is a known issue with the current Logstash configuration and one of the primary drivers behind adopting a Common Logging Schema (T234565).

In a nutshell, a field gets assigned its type based on the first log message with that field present, whether that's an object, string, or number. Any subsequent messages with that field present must be of the same type or Elasticsearch will throw this mapping exception. In this case, the event field was a string when it was first encountered, and this event's event field is an object.

One possible workaround is to rename the event field key to something currently not in use.
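The rename workaround could look like the following sketch. The mapping itself is hypothetical (the actual names chosen ended up in the "Use less generic field names" patches below); the point is only that conflicting top-level keys get moved to unused names before the document reaches Elasticsearch:

```python
# Hypothetical mapping: generic top-level keys that collide with fields
# already typed in the logstash-* indices, renamed to unused names.
RENAMES = {
    "event": "kubernetes_event",
    "verb": "kubernetes_event_verb",
}


def rename_fields(doc: dict) -> dict:
    """Return a copy of the log document with conflicting keys renamed,
    leaving all other keys untouched."""
    return {RENAMES.get(key, key): value for key, value in doc.items()}
```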

Change 636354 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/software/heptiolabs/eventrouter@v0.3-wmf] Rename the fields in output json

https://gerrit.wikimedia.org/r/636354

Change 636354 merged by JMeybohm:
[operations/software/heptiolabs/eventrouter@v0.3-wmf] Rename the fields in output json

https://gerrit.wikimedia.org/r/636354

Change 636358 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/docker-images/production-images@master] eventrouter: Use less generic field names in output json

https://gerrit.wikimedia.org/r/636358

Change 636358 merged by JMeybohm:
[operations/docker-images/production-images@master] eventrouter: Use less generic field names in output json

https://gerrit.wikimedia.org/r/636358

Change 636363 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] eventrouter: Various chart improvements

https://gerrit.wikimedia.org/r/636363

Change 636364 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] eventrouter: Update image version and set kubernetesApi

https://gerrit.wikimedia.org/r/636364

Change 636363 merged by jenkins-bot:
[operations/deployment-charts@master] eventrouter: Various chart improvements

https://gerrit.wikimedia.org/r/636363

Change 636364 merged by jenkins-bot:
[operations/deployment-charts@master] eventrouter: Update image version and set kubernetesApi

https://gerrit.wikimedia.org/r/636364

Change 636412 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] eventrouter: Fix values for all environments

https://gerrit.wikimedia.org/r/636412

Change 636412 merged by JMeybohm:
[operations/deployment-charts@master] eventrouter: Fix values for all environments

https://gerrit.wikimedia.org/r/636412

Change 636418 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] eventrouter: Fix link to eventrouter helmfile

https://gerrit.wikimedia.org/r/636418

Change 636418 merged by jenkins-bot:
[operations/deployment-charts@master] eventrouter: Fix link to eventrouter helmfile

https://gerrit.wikimedia.org/r/636418

I've changed the field names to be more specific so events are indexed now.

Also, I created a fancy dashboard using my limited Kibana skills: Kubernetes Events

The first thing that sticks out is that events appear multiple times, possibly caused by the sharedInformer event handler not checking for syncInterval resyncs.
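The idea behind the dedup fix (the real one landed in the eventrouter patches below) is that an informer resync re-delivers cached objects unchanged, so an event should only be forwarded when its (UID, resourceVersion) pair hasn't been seen before. A minimal sketch, with all names hypothetical:

```python
# Maps event UID -> last resourceVersion we forwarded to the sink.
_seen: dict = {}


def should_forward(event: dict) -> bool:
    """Return False for events the informer re-delivers unchanged on a
    resync; return True only for genuinely new or updated events."""
    meta = event.get("metadata", {})
    uid = meta.get("uid")
    rv = meta.get("resourceVersion")
    if _seen.get(uid) == rv:
        return False  # identical object re-delivered by a resync
    _seen[uid] = rv
    return True
```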

Change 636553 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/software/heptiolabs/eventrouter@v0.3-wmf] Don't send duplicate events from resync to sink

https://gerrit.wikimedia.org/r/636553

Change 636554 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/software/heptiolabs/eventrouter@v0.3-wmf] Lower label cardinality of prometheus metrics

https://gerrit.wikimedia.org/r/636554

Change 636553 merged by JMeybohm:
[operations/software/heptiolabs/eventrouter@v0.3-wmf] Don't send duplicate events from resync to sink

https://gerrit.wikimedia.org/r/636553

Change 636554 merged by JMeybohm:
[operations/software/heptiolabs/eventrouter@v0.3-wmf] Lower label cardinality of prometheus metrics

https://gerrit.wikimedia.org/r/636554

Change 636555 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/docker-images/production-images@master] eventrouter: don't send duplicate events, fix metrics

https://gerrit.wikimedia.org/r/636555

Change 636555 merged by JMeybohm:
[operations/docker-images/production-images@master] eventrouter: don't send duplicate events, fix metrics

https://gerrit.wikimedia.org/r/636555

Change 636556 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] eventrouter: Bump image version and resources

https://gerrit.wikimedia.org/r/636556

Change 636556 merged by jenkins-bot:
[operations/deployment-charts@master] eventrouter: Bump image version and resources

https://gerrit.wikimedia.org/r/636556

Two more eventrouter patches in. I must say I'm a bit disappointed by my decision to go with that one, but I *think* it should be good now. I'll revisit over the week to see whether the duplicate events are gone, re-check on resources, etc.

Change 636912 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] eventrouter: deploy to codfw and eqiad

https://gerrit.wikimedia.org/r/636912

Change 636912 merged by jenkins-bot:
[operations/deployment-charts@master] eventrouter: deploy to codfw and eqiad

https://gerrit.wikimedia.org/r/636912

JMeybohm closed this task as Resolved.Oct 28 2020, 2:07 PM

Eventrouter is deployed to all clusters now.