Page MenuHomePhabricator

Logging for the session storage service
Open, NormalPublic

Description

The session storage service should be capable of logging to our remote logstash infrastructure.

Related Objects

Event Timeline

Eevans triaged this task as Normal priority.Nov 8 2018, 8:51 PM
Eevans created this task.
Eevans updated the task description. (Show Details)Jan 8 2019, 3:52 PM

My two cents WRT logging would be that messages should end up on syslog, either by way of syslog() (i.e. writing to /dev/log) or emitted to stdout/stderr, which then would be picked up by journald (or k8s/docker) and ultimately received by rsyslog and ingested into the logging pipeline. As far as the format goes, JSON (prefixed with @cee:) is supported when put into the syslog payload (once T213189 is resolved).

HTH!

My two cents WRT logging would be that messages should end up on syslog, either by way of syslog() (i.e. writing to /dev/log) or emitted to stdout/stderr, which then would be picked up by journald (or k8s/docker) and ultimately received by rsyslog and ingested into the logging pipeline.

We already have a basic logger implemented, and it does exactly that.

As far as the format goes, JSON (prefixed with @cee:) is supported when put into the syslog payload (once T213189 is resolved).

From an implementation POV, we'll create a struct to represent a log message, and serialize that to JSON to accomplish this; Let us know if there are any standardized attribute names you think we should be utilizing.

CDanis added a subscriber: CDanis.Jan 15 2019, 2:35 PM

Let us know if there are any standardized attribute names you think we should be utilizing.

I think this is more of a curiosity than anything else, but I spent a little time on it and wanted to share.

The rsyslog docs (and a few other sources) cite a list of standard field names that was established by the 'Lumberjack' project. That project seems to be defunct, as does MITRE's 'CEE' project from which the @cee: prefix originates. But I was able to dig up this:
http://web.archive.org/web/20161110090322/https://fedorahosted.org/lumberjack/wiki/FieldList

However, IMO many of the fields are so generic as to be of questionable use.

Change 485854 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/kask@master] Basic CEE/lumberjack specs added to log messages

https://gerrit.wikimedia.org/r/485854

Change 485854 merged by Eevans:
[mediawiki/services/kask@master] Basic CEE/lumberjack specs added to log messages

https://gerrit.wikimedia.org/r/485854

Kask now implements a logger that is a simple wrapper around syslog, w/ CEE compatible messages. A question has come up though about which log levels/severities to implement, and when/how they should be used.

IME, application loggers typically support (up to) 6 levels:

LevelMeaning
TRACEFine-grained debug message; Used to track flow through an application
DEBUGGeneral debugging
INFOPurely informational
WARNINGModerate urgency messages; Something that could lead to an error
ERRORAn application error, though possibly recoverable
FATALSevere, un-recoverable errors
NOTE: Not all loggers support TRACE (Python for example)
NOTE: Some loggers use CRITICAL instead of FATAL to represent unrecoverable errors

The analogue to the levels above in syslog is severity; There are 8 severities in syslog.

I don't believe the full list of syslog severities make sense for our applications (some have a very system-specific context), so I'm more inclined to follow industry practice for application loggers, but that means mapping 5 or 6 typical levels to the 8 syslog severities. And, more important than whatever that specific mapping will be, is ensuring that we are consistent.

I propose the following:

LevelSyslog severityMeaning
DEBUGdebugGeneral debugging
INFOinfoPurely informational
WARNINGwarningModerate urgency messages; Something that could lead to an error
ERRORerrAn application error, though possibly recoverable
FATALcriticalSevere, un-recoverable errors

@fgiunchedi , @Joe, @herron, @mobrovac, @Pchelolo, @Clarakosi, @holger.knust thoughts?

Pchelolo added a comment.EditedFeb 6 2019, 8:51 PM

In node we have an implementation of bynuan -> syslog adapter and it has a bit of a different mapping. It's not used yet, but this seems to conform to most of the mappings other node packages do. (Not saying we need to do what node does, just adding 2 cents of info)

One of the most popular Go logging frameworks, logrus has a syslog adapter which uses the following mapping.

Thanks for kickstarting the discussion @Eevans @Pchelolo !

The proposed mapping looks good to me, and agreed on consistency. re: bunyan-udp-syslog it looks like it almost matches the mapping except for fatal -> emerg which would be fatal -> critical instead ?

I was looking into other logging frameworks and e.g. python's logging seems to DTRT already when emitting to syslog

Clarakosi moved this task from Backlog to In-Progress on the User-Clarakosi board.Feb 8 2019, 5:43 PM

Since critical is the next Syslog severity level above error, I agree with the above mapping outlined by @Eevans for FATAL.

There appears to be some possibility that instead of running the Kask service stand-alone on the session storage cluster, we may run it inside a security-fenced k8s environment.

If we do find ourselves deployed to k8s, then log output will need to be sent to stdout, formatted as JSON (and w/o the "@cee" cookie). This means including the severity as a JSON attribute, (convention is an attribute named level). It also means including an attribute with the timestamp, format convention is iso8601, the attribute name among service-runner services is time, though others have suggested dt (see: T212529: Standardize datetimes/timestamps in the Data Lake).

Change 493609 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] Convert logging from syslog to stdout

https://gerrit.wikimedia.org/r/493609

Change 493609 merged by Clarakosi:
[mediawiki/services/kask@master] Convert logging from syslog to stdout

https://gerrit.wikimedia.org/r/493609

Change 495050 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/kask@master] Add request id to log messages

https://gerrit.wikimedia.org/r/495050

To better keep track of request errors we have been asked to add a request id to log messages from requests that include it in the X-Request-ID header. I’m adding it as request_id but @fgiunchedi please let me know if there is another field name that is currently being used or preferred.

To better keep track of request errors we have been asked to add a request id to log messages from requests that include it in the X-Request-ID header. I’m adding it as request_id but @fgiunchedi please let me know if there is another field name that is currently being used or preferred.

Indeed, AFAICS request_id is what service-runner based services already use (?).

As a side note for posterity: there's also a mixture of unique_id and reqId between mediawiki and parsoid (!) that I could find from a quick search, likely other variations too.

Indeed, AFAICS request_id is what service-runner based services already use (?).

Yup, that's correct.

As a side note for posterity: there's also a mixture of unique_id and reqId between mediawiki and parsoid (!) that I could find from a quick search, likely other variations too.

Is there a master ticket for this? We should probably create sub-tickets for the respective services.

Indeed, AFAICS request_id is what service-runner based services already use (?).

Yup, that's correct.

As a side note for posterity: there's also a mixture of unique_id and reqId between mediawiki and parsoid (!) that I could find from a quick search, likely other variations too.

Is there a master ticket for this? We should probably create sub-tickets for the respective services.

Not yet no, although I believe we'll get started on getting a standard set of fields in the near future, and filing related tasks.

Change 495050 merged by jenkins-bot:
[mediawiki/services/kask@master] Add request id to log messages

https://gerrit.wikimedia.org/r/495050

Eevans moved this task from Backlog to In-Progress on the User-Eevans board.Apr 25 2019, 3:50 PM

I believe this issue is basically complete, but was been left open because we weren't certain whether or not we needed the @cee token to be prepended to log messages; @akosiaris is logging working as expected in k8s?

Also, there are some exceptional situations where the Go http module (and perhaps gocql) as well do their own unstructured logging to stdout (and/or stderr?). I don't believe there are any situations were these errors modes wouldn't be handled (and logged) appropriately, but we should probably make sure this output doesn't cause harm elsewhere.

Eevans added a comment.Jul 8 2019, 6:38 PM

[ ... ]
Also, there are some exceptional situations where the Go http module (and perhaps gocql) as well do their own unstructured logging to stdout (and/or stderr?). I don't believe there are any situations were these errors modes wouldn't be handled (and logged) appropriately, but we should probably make sure this output doesn't cause harm elsewhere.

See T227514: k8s liveness check(?) generating session storage log noise for a concrete example of this.

WDoranWMF moved this task from mop to remove cpt on the Core Platform Team board.Fri, Jul 26, 6:28 PM
WDoranWMF removed a project: Core Platform Team.