Logging for the session storage service
Closed, Resolved · Public

Description

The session storage service should be capable of logging to our remote logstash infrastructure.

Related Objects

Event Timeline

Eevans triaged this task as Normal priority.Nov 8 2018, 8:51 PM
Eevans created this task.
Eevans updated the task description. (Show Details)Jan 8 2019, 3:52 PM

My two cents WRT logging would be that messages should end up on syslog, either by way of syslog() (i.e. writing to /dev/log) or emitted to stdout/stderr, which then would be picked up by journald (or k8s/docker) and ultimately received by rsyslog and ingested into the logging pipeline. As far as the format goes, JSON (prefixed with @cee:) is supported when put into the syslog payload (once T213189 is resolved).

HTH!

My two cents WRT logging would be that messages should end up on syslog, either by way of syslog() (i.e. writing to /dev/log) or emitted to stdout/stderr, which then would be picked up by journald (or k8s/docker) and ultimately received by rsyslog and ingested into the logging pipeline.

We already have a basic logger implemented, and it does exactly that.

As far as the format goes, JSON (prefixed with @cee:) is supported when put into the syslog payload (once T213189 is resolved).

From an implementation POV, we'll create a struct to represent a log message, and serialize that to JSON to accomplish this; Let us know if there are any standardized attribute names you think we should be utilizing.
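
For illustration only, a minimal sketch (not Kask's actual type; the attribute names are placeholders pending whatever standardized names come out of this discussion) of a log-message struct serialized to JSON:

```
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// logMessage is a hypothetical log record; field names are placeholders.
type logMessage struct {
	Msg      string `json:"msg"`
	Appname  string `json:"appname"`
	Time     string `json:"time"`
	Severity string `json:"severity"`
}

func main() {
	m := logMessage{
		Msg:      "connected to Cassandra",
		Appname:  "sessionstore",
		Time:     time.Now().UTC().Format(time.RFC3339),
		Severity: "info",
	}
	b, _ := json.Marshal(m)
	fmt.Println(string(b)) // e.g. {"msg":"connected to Cassandra",...}
}
```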

CDanis added a subscriber: CDanis.Jan 15 2019, 2:35 PM

Let us know if there are any standardized attribute names you think we should be utilizing.

I think this is more of a curiosity than anything else, but I spent a little time on it and wanted to share.

The rsyslog docs (and a few other sources) cite a list of standard field names that was established by the 'Lumberjack' project. That project seems to be defunct, as does MITRE's 'CEE' project from which the @cee: prefix originates. But I was able to dig up this:
http://web.archive.org/web/20161110090322/https://fedorahosted.org/lumberjack/wiki/FieldList

However, IMO many of the fields are so generic as to be of questionable use.

Change 485854 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/kask@master] Basic CEE/lumberjack specs added to log messages

https://gerrit.wikimedia.org/r/485854

Change 485854 merged by Eevans:
[mediawiki/services/kask@master] Basic CEE/lumberjack specs added to log messages

https://gerrit.wikimedia.org/r/485854

Kask now implements a logger that is a simple wrapper around syslog, w/ CEE compatible messages. A question has come up though about which log levels/severities to implement, and when/how they should be used.
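
For context, a rough sketch of what such a wrapper can look like using the standard library's log/syslog, with the JSON payload prefixed by the @cee: cookie (the names below are illustrative, not Kask's actual API):

```
package logging

import (
	"encoding/json"
	"log/syslog"
)

// ceeLogger is a hypothetical wrapper; Kask's real logger is not shown here.
type ceeLogger struct {
	w *syslog.Writer
}

func newCeeLogger(tag string) (*ceeLogger, error) {
	w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_LOCAL0, tag)
	if err != nil {
		return nil, err
	}
	return &ceeLogger{w: w}, nil
}

// Info serializes the message to JSON and writes it with the @cee: cookie.
func (l *ceeLogger) Info(msg string) error {
	b, err := json.Marshal(map[string]string{"msg": msg, "severity": "info"})
	if err != nil {
		return err
	}
	return l.w.Info("@cee:" + string(b))
}
```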

IME, application loggers typically support (up to) 6 levels:

Level     Meaning
TRACE     Fine-grained debug message; Used to track flow through an application
DEBUG     General debugging
INFO      Purely informational
WARNING   Moderate urgency messages; Something that could lead to an error
ERROR     An application error, though possibly recoverable
FATAL     Severe, un-recoverable errors
NOTE: Not all loggers support TRACE (Python for example)
NOTE: Some loggers use CRITICAL instead of FATAL to represent unrecoverable errors

The analogue to the levels above in syslog is severity; There are 8 severities in syslog.

I don't believe the full list of syslog severities makes sense for our applications (some have a very system-specific context), so I'm more inclined to follow industry practice for application loggers, but that means mapping 5 or 6 typical levels to the 8 syslog severities. And, more important than whatever that specific mapping will be, is ensuring that we are consistent.

I propose the following:

Level     Syslog severity   Meaning
DEBUG     debug             General debugging
INFO      info              Purely informational
WARNING   warning           Moderate urgency messages; Something that could lead to an error
ERROR     err               An application error, though possibly recoverable
FATAL     critical          Severe, un-recoverable errors
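
Expressed against the standard library's log/syslog priorities, this would look roughly like the following (the Level type is illustrative, not Kask's actual API):

```
package logging

import "log/syslog"

// Level is an illustrative application log level type.
type Level int

const (
	DEBUG Level = iota
	INFO
	WARNING
	ERROR
	FATAL
)

// severities encodes the proposed mapping onto log/syslog priorities.
var severities = map[Level]syslog.Priority{
	DEBUG:   syslog.LOG_DEBUG,
	INFO:    syslog.LOG_INFO,
	WARNING: syslog.LOG_WARNING,
	ERROR:   syslog.LOG_ERR,
	FATAL:   syslog.LOG_CRIT,
}
```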

@fgiunchedi , @Joe, @herron, @mobrovac, @Pchelolo, @Clarakosi, @holger.knust thoughts?

Pchelolo added a comment.EditedFeb 6 2019, 8:51 PM

In node we have an implementation of a bunyan -> syslog adapter and it has a bit of a different mapping. It's not used yet, but it seems to conform to most of the mappings other node packages use. (Not saying we need to do what node does, just adding 2 cents of info)

One of the most popular Go logging frameworks, logrus, has a syslog adapter which uses the following mapping.

Thanks for kickstarting the discussion @Eevans @Pchelolo !

The proposed mapping looks good to me, and agreed on consistency. Re: bunyan-udp-syslog, it looks like it almost matches the mapping, except for fatal -> emerg, which would be fatal -> critical instead?

I was looking into other logging frameworks and e.g. python's logging seems to DTRT already when emitting to syslog

Clarakosi moved this task from Backlog to In-Progress on the User-Clarakosi board.Feb 8 2019, 5:43 PM

Since critical is the next Syslog severity level above error, I agree with the above mapping outlined by @Eevans for FATAL.

There appears to be some possibility that instead of running the Kask service stand-alone on the session storage cluster, we may run it inside a security-fenced k8s environment.

If we do find ourselves deployed to k8s, then log output will need to be sent to stdout, formatted as JSON (and w/o the "@cee" cookie). This means including the severity as a JSON attribute (the convention is an attribute named level). It also means including an attribute with the timestamp; the format convention is ISO 8601, and the attribute name among service-runner services is time, though others have suggested dt (see: T212529: Standardize datetimes/timestamps in the Data Lake).
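
A minimal sketch of what that would mean in practice, assuming plain JSON-per-line on stdout; attribute names other than level and time are illustrative:

```
package main

import (
	"encoding/json"
	"os"
	"time"
)

// logInfo writes one JSON object per line to stdout.
func logInfo(msg string) {
	rec := map[string]string{
		"level":   "info",
		"time":    time.Now().UTC().Format(time.RFC3339), // ISO 8601 / RFC 3339
		"message": msg,
		"appname": "sessionstore",
	}
	json.NewEncoder(os.Stdout).Encode(rec) // note: no "@cee:" cookie on stdout
}

func main() {
	logInfo("listening on :8081")
}
```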

Change 493609 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] Convert logging from syslog to stdout

https://gerrit.wikimedia.org/r/493609

Change 493609 merged by Clarakosi:
[mediawiki/services/kask@master] Convert logging from syslog to stdout

https://gerrit.wikimedia.org/r/493609

Change 495050 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/kask@master] Add request id to log messages

https://gerrit.wikimedia.org/r/495050

To better keep track of request errors we have been asked to add a request id to log messages from requests that include it in the X-Request-ID header. I’m adding it as request_id but @fgiunchedi please let me know if there is another field name that is currently being used or preferred.
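
A hedged sketch of the idea (not the actual patch): a middleware that copies X-Request-ID, when present, into a request_id attribute on the request's log entry:

```
package logging

import (
	"encoding/json"
	"net/http"
	"os"
)

// withRequestID is a hypothetical middleware sketch: it copies the
// X-Request-ID header (when present) into a request_id log attribute.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		entry := map[string]string{
			"level":   "info",
			"message": "dispatching " + r.Method + " " + r.URL.Path,
		}
		if reqID := r.Header.Get("X-Request-ID"); reqID != "" {
			entry["request_id"] = reqID
		}
		json.NewEncoder(os.Stdout).Encode(entry)
		next.ServeHTTP(w, r)
	})
}
```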

To better keep track of request errors we have been asked to add a request id to log messages from requests that include it in the X-Request-ID header. I’m adding it as request_id but @fgiunchedi please let me know if there is another field name that is currently being used or preferred.

Indeed, AFAICS request_id is what service-runner based services already use (?).

As a side note for posterity: there's also a mixture of unique_id and reqId between mediawiki and parsoid (!) that I could find from a quick search, likely other variations too.

Indeed, AFAICS request_id is what service-runner based services already use (?).

Yup, that's correct.

As a side note for posterity: there's also a mixture of unique_id and reqId between mediawiki and parsoid (!) that I could find from a quick search, likely other variations too.

Is there a master ticket for this? We should probably create sub-tickets for the respective services.

Indeed, AFAICS request_id is what service-runner based services already use (?).

Yup, that's correct.

As a side note for posterity: there's also a mixture of unique_id and reqId between mediawiki and parsoid (!) that I could find from a quick search, likely other variations too.

Is there a master ticket for this? We should probably create sub-tickets for the respective services.

Not yet no, although I believe we'll get started on getting a standard set of fields in the near future, and filing related tasks.

Change 495050 merged by jenkins-bot:
[mediawiki/services/kask@master] Add request id to log messages

https://gerrit.wikimedia.org/r/495050

Eevans moved this task from Backlog to In-Progress on the User-Eevans board.Apr 25 2019, 3:50 PM

I believe this issue is basically complete, but has been left open because we weren't certain whether or not we needed the @cee token to be prepended to log messages; @akosiaris, is logging working as expected in k8s?

Also, there are some exceptional situations where the Go http module (and perhaps gocql as well) does its own unstructured logging to stdout (and/or stderr?). I don't believe there are any situations where these error modes wouldn't be handled (and logged) appropriately, but we should probably make sure this output doesn't cause harm elsewhere.

Eevans added a comment.Jul 8 2019, 6:38 PM

[ ... ]
Also, there are some exceptional situations where the Go http module (and perhaps gocql as well) does its own unstructured logging to stdout (and/or stderr?). I don't believe there are any situations where these error modes wouldn't be handled (and logged) appropriately, but we should probably make sure this output doesn't cause harm elsewhere.

See T227514: k8s liveness check(?) generating session storage log noise for a concrete example of this.

WDoranWMF moved this task from mop to remove cpt on the Core Platform Team board.Jul 26 2019, 6:28 PM
WDoranWMF removed a project: Core Platform Team.

Please note that EventGate is currently logging directly to logstash via UDP. It's logging into stdout as well, but those logs are not getting delivered to logstash. So, I guess T209110 is waiting on this to be figured out.

Change 537744 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] logging: Logger optimizations

https://gerrit.wikimedia.org/r/537744

After startup, Kask doesn't log much outside of exceptional errors; To invoke log output try issuing a HEAD request:

$ curl -k -I https://staging.svc.eqiad.wmnet:8081/sessions/v1/jericevans

TTBMK, this is the only remaining blocker to fully moving sessionstore to production.

Change 537744 merged by jenkins-bot:
[mediawiki/services/kask@master] logging: Logger optimizations

https://gerrit.wikimedia.org/r/537744

@Eevans Logs from kubernetes make it to logstash now, albeit we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.

Anyway, and just partially related, I've noticed that there is a lot of

http: TLS handshake error from 10.2.2.16:36374: EOF

I am guessing this is the Go http server logging this before kask logging has been initialized? Should we do something about this? It's logging quite a lot, and in a non-JSON format as well. This is probably the kubernetes tcpSocket liveness probe. Possible solutions that I can think of are:

  • Avoid logging these
  • Logging these in JSON with some level that gets configured by logstash to ignore.
  • Drop the liveness probe (the caveat being that if kask gets stuck and stops answering on TCP it will not be restarted)
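
A sketch of the first option, assuming these lines come from net/http's ErrorLog (which is where the Go server writes its "http: TLS handshake error ..." messages): point ErrorLog at a writer that filters them out, leaving other server errors untouched. The filtering policy here is illustrative.

```
package logging

import (
	"bytes"
	"log"
	"net/http"
	"os"
)

// dropTLSNoise swallows handshake errors triggered by TCP-only probes
// and passes every other server error message through to stderr.
type dropTLSNoise struct{}

func (dropTLSNoise) Write(p []byte) (int, error) {
	if bytes.Contains(p, []byte("TLS handshake error")) {
		return len(p), nil
	}
	return os.Stderr.Write(p)
}

// newServer wires the filter in as the server's ErrorLog.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:     addr,
		Handler:  h,
		ErrorLog: log.New(dropTLSNoise{}, "", 0),
	}
}
```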

@Eevans Logs from kubernetes make it to logstash now, albeit we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.

Great!

Are the logs searchable now? If so, how do we need to search given this JSON-in-JSON format?

We've been calling this out as a blocker to moving session storage to production, so I guess what I'm trying to determine is: Are we still blocked?

Anyway, and just partially related, I've noticed that there is a lot of
http: TLS handshake error from 10.2.2.16:36374: EOF
I am guessing this is the Go http server logging this before kask logging has been initialized? Should we do something about this? It's logging quite a lot, and in a non-JSON format as well. This is probably the kubernetes tcpSocket liveness probe. Possible solutions that I can think of are:

  • Avoid logging these
  • Logging these in JSON with some level that gets configured by logstash to ignore.
  • Drop the liveness probe (the caveat being that if kask gets stuck and stops answering on TCP it will not be restarted)

Yes, this is T227514: k8s liveness check(?) generating session storage log noise. I de-prioritized this because I wasn't sure how serious a problem it was. I can escalate it if need be.

@Eevans Logs from kubernetes make it to logstash now, albeit we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.

Great!

We unfortunately had to roll that back. An unrelated service on kubernetes was logging so badly that it caused a big lag in the pipeline

Are the logs searchable now?

Yes, for a definition of "searchable". A look at it would be https://logstash.wikimedia.org/goto/90ab9dc9f656b65af8328361ecf1dc0a. As you can tell, that JSON in JSON parsing is really needed.

If so, how do we need to search given this JSON-in-JSON format?

https://gerrit.wikimedia.org/r/539978 should resolve this

We've been calling this out as a blocker to moving session storage to production, so I guess what I'm trying to determine is: Are we still blocked?

For a few more days, it looks like yes.

Anyway, and just partially related, I've noticed that there is a lot of
http: TLS handshake error from 10.2.2.16:36374: EOF
I am guessing this is the Go http server logging this before kask logging has been initialized? Should we do something about this? It's logging quite a lot, and in a non-JSON format as well. This is probably the kubernetes tcpSocket liveness probe. Possible solutions that I can think of are:

  • Avoid logging these
  • Logging these in JSON with some level that gets configured by logstash to ignore.
  • Drop the liveness probe (the caveat being that if kask gets stuck and stops answering on TCP it will not be restarted)

Yes, this is T227514: k8s liveness check(?) generating session storage log noise. I de-prioritized this because I wasn't sure how serious a problem it was. I can escalate it if need be.

Already seen the change, LGTM

@Eevans Logs from kubernetes make it to logstash now, albeit we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.

Great!

We unfortunately had to roll that back. An unrelated service on kubernetes was logging so badly that it caused a big lag in the pipeline

Are the logs searchable now?

Yes, for a definition of "searchable". A look at it would be https://logstash.wikimedia.org/goto/90ab9dc9f656b65af8328361ecf1dc0a. As you can tell, that JSON in JSON parsing is really needed.

If so, how do we need to search given this JSON-in-JSON format?

https://gerrit.wikimedia.org/r/539978 should resolve this

Patch has been merged and resulting logs LGTM in the sense that nested JSON is now parsed (although more non-blocking refinements would certainly be needed), PTAL!

We've been calling this out as a blocker to moving session storage to production, so I guess what I'm trying to determine is: Are we still blocked?

For a few more days, it looks like yes.

I think this means the block has been removed.

Eevans closed this task as Resolved.Tue, Oct 8, 2:47 PM

We've been calling this out as a blocker to moving session storage to production, so I guess what I'm trying to determine is: Are we still blocked?

For a few more days, it looks like yes.

I think this means the block has been removed.

Indeed; I made a dashboard

Thanks!