Logging for the session storage service
Closed, Resolved · Public

Description

The session storage service should be capable of logging to our remote logstash infrastructure.

Related Objects

Event Timeline

Eevans triaged this task as Normal priority.Nov 8 2018, 8:51 PM
Eevans created this task.
Eevans updated the task description. (Show Details)Jan 8 2019, 3:52 PM

My two cents WRT logging would be that messages should end up on syslog, either by way of syslog() (i.e. writing to /dev/log) or emitted to stdout/stderr, which then would be picked up by journald (or k8s/docker) and ultimately received by rsyslog and ingested into the logging pipeline. As far as the format goes, JSON (prefixed with @cee:) is supported when put into the syslog payload (once T213189 is resolved).

HTH!

My two cents WRT logging would be that messages should end up on syslog, either by way of syslog() (i.e. writing to /dev/log) or emitted to stdout/stderr, which then would be picked up by journald (or k8s/docker) and ultimately received by rsyslog and ingested into the logging pipeline.

We already have a basic logger implemented, and it does exactly that.

As far as the format goes, JSON (prefixed with @cee:) is supported when put into the syslog payload (once T213189 is resolved).

From an implementation POV, we'll create a struct to represent a log message, and serialize that to JSON to accomplish this; Let us know if there are any standardized attribute names you think we should be utilizing.
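
For illustration only, a minimal sketch (not Kask's actual type; the attribute names are placeholders pending whatever standardized names come out of this discussion) of a log-message struct serialized to JSON:

```
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// logMessage is a hypothetical log record; field names are placeholders.
type logMessage struct {
	Msg      string `json:"msg"`
	Appname  string `json:"appname"`
	Time     string `json:"time"`
	Severity string `json:"severity"`
}

func main() {
	m := logMessage{
		Msg:      "connected to Cassandra",
		Appname:  "sessionstore",
		Time:     time.Now().UTC().Format(time.RFC3339),
		Severity: "info",
	}
	b, _ := json.Marshal(m)
	fmt.Println(string(b)) // e.g. {"msg":"connected to Cassandra",...}
}
```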

CDanis added a subscriber: CDanis.Jan 15 2019, 2:35 PM

Let us know if there are any standardized attribute names you think we should be utilizing.

I think this is more of a curiosity than anything else, but I spent a little time on it and wanted to share.

The rsyslog docs (and a few other sources) cite a list of standard field names that was established by the 'Lumberjack' project. That project seems to be defunct, as does MITRE's 'CEE' project from which the @cee: prefix originates. But I was able to dig up this:
http://web.archive.org/web/20161110090322/https://fedorahosted.org/lumberjack/wiki/FieldList

However, IMO many of the fields are so generic as to be of questionable use.

Change 485854 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/kask@master] Basic CEE/lumberjack specs added to log messages

https://gerrit.wikimedia.org/r/485854

Change 485854 merged by Eevans:
[mediawiki/services/kask@master] Basic CEE/lumberjack specs added to log messages

https://gerrit.wikimedia.org/r/485854

Kask now implements a logger that is a simple wrapper around syslog, w/ CEE compatible messages. A question has come up though about which log levels/severities to implement, and when/how they should be used.
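
For context, a rough sketch of what such a wrapper can look like using the standard library's log/syslog, with the JSON payload prefixed by the @cee: cookie (the names below are illustrative, not Kask's actual API):

```
package logging

import (
	"encoding/json"
	"log/syslog"
)

// ceeLogger is a hypothetical wrapper; Kask's real logger is not shown here.
type ceeLogger struct {
	w *syslog.Writer
}

func newCeeLogger(tag string) (*ceeLogger, error) {
	w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_LOCAL0, tag)
	if err != nil {
		return nil, err
	}
	return &ceeLogger{w: w}, nil
}

// Info serializes the message to JSON and writes it with the @cee: cookie.
func (l *ceeLogger) Info(msg string) error {
	b, err := json.Marshal(map[string]string{"msg": msg, "severity": "info"})
	if err != nil {
		return err
	}
	return l.w.Info("@cee:" + string(b))
}
```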

IME, application loggers typically support (up to) 6 levels:

Level     Meaning
TRACE     Fine-grained debug message; Used to track flow through an application
DEBUG     General debugging
INFO      Purely informational
WARNING   Moderate urgency messages; Something that could lead to an error
ERROR     An application error, though possibly recoverable
FATAL     Severe, un-recoverable errors
NOTE: Not all loggers support TRACE (Python for example)
NOTE: Some loggers use CRITICAL instead of FATAL to represent unrecoverable errors

The analogue to the levels above in syslog is severity; There are 8 severities in syslog.

I don't believe the full list of syslog severities makes sense for our applications (some have a very system-specific context), so I'm more inclined to follow industry practice for application loggers, but that means mapping 5 or 6 typical levels to the 8 syslog severities. And, more important than whatever that specific mapping will be, is ensuring that we are consistent.

I propose the following:

Level     Syslog severity   Meaning
DEBUG     debug             General debugging
INFO      info              Purely informational
WARNING   warning           Moderate urgency messages; Something that could lead to an error
ERROR     err               An application error, though possibly recoverable
FATAL     critical          Severe, un-recoverable errors
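
Expressed against the standard library's log/syslog priorities, this would look roughly like the following (the Level type is illustrative, not Kask's actual API):

```
package logging

import "log/syslog"

// Level is an illustrative application log level type.
type Level int

const (
	DEBUG Level = iota
	INFO
	WARNING
	ERROR
	FATAL
)

// severities encodes the proposed mapping onto log/syslog priorities.
var severities = map[Level]syslog.Priority{
	DEBUG:   syslog.LOG_DEBUG,
	INFO:    syslog.LOG_INFO,
	WARNING: syslog.LOG_WARNING,
	ERROR:   syslog.LOG_ERR,
	FATAL:   syslog.LOG_CRIT,
}
```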

@fgiunchedi , @Joe, @herron, @mobrovac, @Pchelolo, @Clarakosi, @holger.knust thoughts?

Pchelolo added a comment.EditedFeb 6 2019, 8:51 PM

In node we have an implementation of a bunyan -> syslog adapter and it has a bit of a different mapping. It's not used yet, but it seems to conform to most of the mappings other node packages use. (Not saying we need to do what node does, just adding 2 cents of info)

One of the most popular Go logging frameworks, logrus, has a syslog adapter which uses the following mapping.

Thanks for kickstarting the discussion @Eevans @Pchelolo !

The proposed mapping looks good to me, and agreed on consistency. Re: bunyan-udp-syslog, it looks like it almost matches the mapping, except for fatal -> emerg, which would be fatal -> critical instead?

I was looking into other logging frameworks and e.g. python's logging seems to DTRT already when emitting to syslog

Clarakosi moved this task from Backlog to In-Progress on the User-Clarakosi board.Feb 8 2019, 5:43 PM

Since critical is the next Syslog severity level above error, I agree with the above mapping outlined by @Eevans for FATAL.

There appears to be some possibility that instead of running the Kask service stand-alone on the session storage cluster, we may run it inside a security-fenced k8s environment.

If we do find ourselves deployed to k8s, then log output will need to be sent to stdout, formatted as JSON (and w/o the "@cee" cookie). This means including the severity as a JSON attribute (the convention is an attribute named level). It also means including an attribute with the timestamp; the format convention is ISO 8601, and the attribute name among service-runner services is time, though others have suggested dt (see: T212529: Standardize datetimes/timestamps in the Data Lake).
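
A minimal sketch of what that would mean in practice, assuming plain JSON-per-line on stdout; attribute names other than level and time are illustrative:

```
package main

import (
	"encoding/json"
	"os"
	"time"
)

// logInfo writes one JSON object per line to stdout.
func logInfo(msg string) {
	rec := map[string]string{
		"level":   "info",
		"time":    time.Now().UTC().Format(time.RFC3339), // ISO 8601 / RFC 3339
		"message": msg,
		"appname": "sessionstore",
	}
	json.NewEncoder(os.Stdout).Encode(rec) // note: no "@cee:" cookie on stdout
}

func main() {
	logInfo("listening on :8081")
}
```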

Change 493609 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] Convert logging from syslog to stdout

https://gerrit.wikimedia.org/r/493609

Change 493609 merged by Clarakosi:
[mediawiki/services/kask@master] Convert logging from syslog to stdout

https://gerrit.wikimedia.org/r/493609

Change 495050 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/kask@master] Add request id to log messages

https://gerrit.wikimedia.org/r/495050

To better keep track of request errors we have been asked to add a request id to log messages from requests that include it in the X-Request-ID header. I’m adding it as request_id but @fgiunchedi please let me know if there is another field name that is currently being used or preferred.
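
A hedged sketch of the idea (not the actual patch): a middleware that copies X-Request-ID, when present, into a request_id attribute on the request's log entry:

```
package logging

import (
	"encoding/json"
	"net/http"
	"os"
)

// withRequestID is a hypothetical middleware sketch: it copies the
// X-Request-ID header (when present) into a request_id log attribute.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		entry := map[string]string{
			"level":   "info",
			"message": "dispatching " + r.Method + " " + r.URL.Path,
		}
		if reqID := r.Header.Get("X-Request-ID"); reqID != "" {
			entry["request_id"] = reqID
		}
		json.NewEncoder(os.Stdout).Encode(entry)
		next.ServeHTTP(w, r)
	})
}
```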

To better keep track of request errors we have been asked to add a request id to log messages from requests that include it in the X-Request-ID header. I’m adding it as request_id but @fgiunchedi please let me know if there is another field name that is currently being used or preferred.

Indeed, AFAICS request_id is what service-runner based services already use (?).

As a side note for posterity: there's also a mixture of unique_id and reqId between mediawiki and parsoid (!) that I could find from a quick search, likely other variations too.

Indeed, AFAICS request_id is what service-runner based services already use (?).

Yup, that's correct.

As a side note for posterity: there's also a mixture of unique_id and reqId between mediawiki and parsoid (!) that I could find from a quick search, likely other variations too.

Is there a master ticket for this? We should probably create sub-tickets for the respective services.

Indeed, AFAICS request_id is what service-runner based services already use (?).

Yup, that's correct.

As a side note for posterity: there's also a mixture of unique_id and reqId between mediawiki and parsoid (!) that I could find from a quick search, likely other variations too.

Is there a master ticket for this? We should probably create sub-tickets for the respective services.

Not yet no, although I believe we'll get started on getting a standard set of fields in the near future, and filing related tasks.

Change 495050 merged by jenkins-bot:
[mediawiki/services/kask@master] Add request id to log messages

https://gerrit.wikimedia.org/r/495050

Eevans moved this task from Backlog to In-Progress on the User-Eevans board.Apr 25 2019, 3:50 PM

I believe this issue is basically complete, but has been left open because we weren't certain whether or not we needed the @cee token to be prepended to log messages; @akosiaris, is logging working as expected in k8s?

Also, there are some exceptional situations where the Go http module (and perhaps gocql as well) does its own unstructured logging to stdout (and/or stderr?). I don't believe there are any situations where these error modes wouldn't be handled (and logged) appropriately, but we should probably make sure this output doesn't cause harm elsewhere.

Eevans added a comment.Jul 8 2019, 6:38 PM

[ ... ]
Also, there are some exceptional situations where the Go http module (and perhaps gocql as well) does its own unstructured logging to stdout (and/or stderr?). I don't believe there are any situations where these error modes wouldn't be handled (and logged) appropriately, but we should probably make sure this output doesn't cause harm elsewhere.

See T227514: k8s liveness check(?) generating session storage log noise for a concrete example of this.

WDoranWMF moved this task from mop to remove cpt on the Core Platform Team board.Jul 26 2019, 6:28 PM
WDoranWMF removed a project: Core Platform Team.

Please note that EventGate is currently logging directly to logstash via UDP. It's logging into stdout as well, but those logs are not getting delivered to logstash. So, I guess T209110 is waiting on this to be figured out.

Change 537744 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] logging: Logger optimizations

https://gerrit.wikimedia.org/r/537744

After startup, Kask doesn't log much outside of exceptional errors; To invoke log output try issuing a HEAD request:

$ curl -k -I https://staging.svc.eqiad.wmnet:8081/sessions/v1/jericevans

TTBMK, this is the only remaining blocker to fully moving sessionstore to production.

Change 537744 merged by jenkins-bot:
[mediawiki/services/kask@master] logging: Logger optimizations

https://gerrit.wikimedia.org/r/537744

@Eevans Logs from kubernetes make it to logstash now, albeit we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.

Anyway, and just partially related, I've noticed that there is a lot of

http: TLS handshake error from 10.2.2.16:36374: EOF

I am guessing this is the Go http server logging this before kask logging has been initialized? Should we do something about this? It's logging quite a lot, and in a non-JSON format as well. This is probably the kubernetes tcpSocket liveness probe. Possible solutions that I can think of are:

  • Avoid logging these
  • Logging these in JSON with some level that gets configured by logstash to ignore.
  • Drop the liveness probe (the caveat being that if kask gets stuck and stops answering on TCP it will not be restarted)
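
A sketch of the first option, assuming these lines come from net/http's ErrorLog (which is where the Go server writes its "http: TLS handshake error ..." messages): point ErrorLog at a writer that filters them out, leaving other server errors untouched. The filtering policy here is illustrative.

```
package logging

import (
	"bytes"
	"log"
	"net/http"
	"os"
)

// dropTLSNoise swallows handshake errors triggered by TCP-only probes
// and passes every other server error message through to stderr.
type dropTLSNoise struct{}

func (dropTLSNoise) Write(p []byte) (int, error) {
	if bytes.Contains(p, []byte("TLS handshake error")) {
		return len(p), nil
	}
	return os.Stderr.Write(p)
}

// newServer wires the filter in as the server's ErrorLog.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:     addr,
		Handler:  h,
		ErrorLog: log.New(dropTLSNoise{}, "", 0),
	}
}
```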

@Eevans Logs from kubernetes make it to logstash now, albeit we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.

Great!

Are the logs searchable now? If so, how do we need to search given this JSON-in-JSON format?

We've been calling this out as a blocker to moving session storage to production, so I guess what I'm trying to determine is: Are we still blocked?

Anyway, and just partially related, I've noticed that there is a lot of
http: TLS handshake error from 10.2.2.16:36374: EOF
I am guessing this is the Go http server logging this before kask logging has been initialized? Should we do something about this? It's logging quite a lot, and in a non-JSON format as well. This is probably the kubernetes tcpSocket liveness probe. Possible solutions that I can think of are:

  • Avoid logging these
  • Logging these in JSON with some level that gets configured by logstash to ignore.
  • Drop the liveness probe (the caveat being that if kask gets stuck and stops answering on TCP it will not be restarted)

Yes, this is T227514: k8s liveness check(?) generating session storage log noise. I de-prioritized this because I wasn't sure how serious a problem it was. I can escalate it if need be.

@Eevans Logs from kubernetes make it to logstash now, albeit we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.

Great!

We unfortunately had to roll that back. An unrelated service on kubernetes was logging so badly that it caused a big lag in the pipeline

Are the logs searchable now?

Yes, for a definition of "searchable". A look at it would be https://logstash.wikimedia.org/goto/90ab9dc9f656b65af8328361ecf1dc0a. As you can tell, that JSON in JSON parsing is really needed.

If so, how do we need to search given this JSON-in-JSON format?

https://gerrit.wikimedia.org/r/539978 should resolve this

We've been calling this out as a blocker to moving session storage to production, so I guess what I'm trying to determine is: Are we still blocked?

For a few more days, it looks like yes.

Anyway, and just partially related, I've noticed that there is a lot of
http: TLS handshake error from 10.2.2.16:36374: EOF
I am guessing this is the Go http server logging this before kask logging has been initialized? Should we do something about this? It's logging quite a lot, and in a non-JSON format as well. This is probably the kubernetes tcpSocket liveness probe. Possible solutions that I can think of are:

  • Avoid logging these
  • Logging these in JSON with some level that gets configured by logstash to ignore.
  • Drop the liveness probe (the caveat being that if kask gets stuck and stops answering on TCP it will not be restarted)

Yes, this is T227514: k8s liveness check(?) generating session storage log noise. I de-prioritized this because I wasn't sure how serious a problem it was. I can escalate it if need be.

Already seen the change, LGTM

@Eevans Logs from kubernetes make it to logstash now, albeit we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.

Great!

We unfortunately had to roll that back. An unrelated service on kubernetes was logging so badly that it caused a big lag in the pipeline

Are the logs searchable now?

Yes, for a definition of "searchable". A look at it would be https://logstash.wikimedia.org/goto/90ab9dc9f656b65af8328361ecf1dc0a. As you can tell, that JSON in JSON parsing is really needed.

If so, how do we need to search given this JSON-in-JSON format?

https://gerrit.wikimedia.org/r/539978 should resolve this

Patch has been merged and resulting logs LGTM in the sense that nested JSON is now parsed (although more non-blocking refinements would certainly be needed), PTAL!

We've been calling this out as a blocker to moving session storage to production, so I guess what I'm trying to determine is: Are we still blocked?

For a few more days, it looks like yes.

I think this means the block has been removed.

Eevans closed this task as Resolved.Tue, Oct 8, 2:47 PM

We've been calling this out as a blocker to moving session storage to production, so I guess what I'm trying to determine is: Are we still blocked?

For a few more days, it looks like yes.

I think this means the block has been removed.

Indeed; I made a dashboard

Thanks!