Page MenuHomePhabricator

Move service-runner to new logging infrastructure
Open, NormalPublic

Description

Similarly to T211124: Move mediawiki to new logging infrastructure this task is about tracking the necessary changes to service-runner to use the new logging infrastructure.

The general transports/protocols involved are syslog + JSON, specifically a @cee: token/cookie is required to prefix the json payload to properly recognize said message as json/structured logging (see also https://www.rsyslog.com/doc/v8-stable/configuration/modules/mmjsonparse.html). For transport the preferred way is through unix socket /dev/log (managed by journald, then messages are read by rsyslogd) though a compatibility udp endpoint can also be provided if needed.

Additional information/context: https://wikitech.wikimedia.org/wiki/Logstash/SRE_onboard

Event Timeline

fgiunchedi triaged this task as Normal priority.Dec 4 2018, 4:25 PM
fgiunchedi created this task.
fgiunchedi updated the task description. (Show Details)Dec 5 2018, 11:03 AM

So, currently we only support sending to syslog via UDP using the node-bunyan-syslog-upd package created by @mobrovac It's pure-js implementation since we want to avoid native bindings.

Searching for existing libraries for bynuan-syslog adapters doesn't reveal a lot and it's a mess. Most of the packages do not look supported or well-used. The only implementation that claims to support Unix dgram sockets seems abandoned, plus it uses unix-dgram package under the hood that is introducing a native binding. There are some implementations for just syslog clients, without bunyan support, but even though we just need a domain socket (supported out of the box in node) they all rely on unix-dgram package.

To sum up, I would not be comfortable relying on any of the existing implementations. How much more preferable Unix sockets would be over UDP as a transport? I will try to implement a proof of concept implementation that does not rely on a bulk of the native code, let's see how that goes.

Oh.... Apparently, apple have broken syslog in newer OSX versions, so I'm most certainly not the best persong to tackle this task. Supposedly, if we support receiving logs over UDP everything should work out of the box. If you're not too opposed to UDP transport, we could test it in beta almost immediately.

So, currently we only support sending to syslog via UDP using the node-bunyan-syslog-upd package created by @mobrovac It's pure-js implementation since we want to avoid native bindings.
Searching for existing libraries for bynuan-syslog adapters doesn't reveal a lot and it's a mess. Most of the packages do not look supported or well-used. The only implementation that claims to support Unix dgram sockets seems abandoned, plus it uses unix-dgram package under the hood that is introducing a native binding. There are some implementations for just syslog clients, without bunyan support, but even though we just need a domain socket (supported out of the box in node) they all rely on unix-dgram package.
To sum up, I would not be comfortable relying on any of the existing implementations. How much more preferable Unix sockets would be over UDP as a transport? I will try to implement a proof of concept implementation that does not rely on a bulk of the native code, let's see how that goes.

One significant difference between the two is that logs logged through the unix socket have additional metadata associated provided by the kernel (trusted properties as rsyslog calls those). One other difference is blocking behavior of AF_INET vs AF_UNIX sockets e.g. when there are no readers. Eventually I think we'd want all log producers to unix sockets, though local UDP will do as a compatibility shim until we do so.

I'd say let's go with UDP in beta at least to kick the tires! Which beta hosts (or better: puppet roles/profiles) should have the udp endpoint enabled?

There's not that much logging happening in beta services, I would guess we should start big, so deployment-restbase0[1,2]?

Will the logs go to logstash-beta.wmflabs.org or will just be local to the node in syslog for now?

There's not that much logging happening in beta services, I would guess we should start big, so deployment-restbase0[1,2]?

Sounds good to me! You should be able to send the cee-prefixed json logs already on those hosts in beta, udp on localhost:10514 (i.e. same port as logstash syslog)

Will the logs go to logstash-beta.wmflabs.org or will just be local to the node in syslog for now?

The former, logs will be ingested into logstash

We need to locally test whether the bynuan-syslog-udp actually works and modify it to add the @cee annotation. Then we can proceed with testing in beta cluster.

Change 496813 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Add rsyslog kafka to service nodes.

https://gerrit.wikimedia.org/r/496813

After enabling logging over syslog for RESTBase in deployment-prep, we have identified a number of disparities between node services and, for example, mediawiki.

In the old pipeline, we've used gelf protocol to deliver the logs directly from services to logstash. So, there were 2 mappings:

  • gelf-stream library to map 'bunyan' log format into gelf. Examples of mappins are renaming msg to short_message, name to facility etc.
  • A special gelf formatter for logstash that additionally modified the message. For example renaming 'facility' into 'type' etc.

After a switch to syslog both of the mappings have been broken. We need to figure out what to do here. Given that most of these disparities are withing the JSON payload of the log entry, there's no real protocol to follow.

Our options are:

  1. Try to modify bunyan-syslog-udp to conform to current rsyslog mappings as much as possible. This will not be possible completely since some property names in bunyan are not configurable. For example, msg. So we would need to create logstash mappings to remove the msg anyway. Additionally, it will make the bunyan-syslog-udp library quite opinionated and tied to WMF production environment.
  2. Create custom syslog mapper for node services in addition to a special formatter in logstash. In a first glance the configuration will be more complex, but I think in the end it is better.

@fgiunchedi what do you think?

Change 497321 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Map syslogseverity-text to severity_label.

https://gerrit.wikimedia.org/r/497321

Change 497321 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Map syslog{severity,facility}-text to {severity,facility}_label

https://gerrit.wikimedia.org/r/497321

Change 498417 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Create node-specific logstash filters for syslog.

https://gerrit.wikimedia.org/r/498417

Change 497321 abandoned by Ppchelko:
Map syslog{severity,facility}-text to {severity,facility}_label

Reason:
godog: Pchelolo: ah, yeah that's correct the _label fields are artifacts of syslog-input plugin
godog: I'd rather drop having both number and text representation while we're at it and go with severity/facility in text form

https://gerrit.wikimedia.org/r/497321

Change 498872 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Enable rsyslog logging for node services.

https://gerrit.wikimedia.org/r/498872

Change 498899 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/restbase/deploy@master] Enable rsyslog logging in deployment-prep.

https://gerrit.wikimedia.org/r/498899

Change 498417 merged by Filippo Giunchedi:
[operations/puppet@production] Create node-specific logstash filters for syslog.

https://gerrit.wikimedia.org/r/498417

Change 496813 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Ppchelko):
[operations/puppet@production] Add rsyslog kafka to service nodes.

https://gerrit.wikimedia.org/r/496813

Change 496813 merged by Giuseppe Lavagetto:
[operations/puppet@production] Add rsyslog kafka to service nodes.

https://gerrit.wikimedia.org/r/496813

Change 498872 merged by Giuseppe Lavagetto:
[operations/puppet@production] Expose rsyslog_udp_port to services configs.

https://gerrit.wikimedia.org/r/498872

Change 498899 merged by Ppchelko:
[mediawiki/services/restbase/deploy@master] Enable rsyslog logging in deployment-prep.

https://gerrit.wikimedia.org/r/498899

Change 500767 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/restbase/deploy@master] Enable rsyslog logging for production.

https://gerrit.wikimedia.org/r/500767

Change 500767 merged by Ppchelko:
[mediawiki/services/restbase/deploy@master] Enable rsyslog logging for production.

https://gerrit.wikimedia.org/r/500767

Mentioned in SAL (#wikimedia-operations) [2019-04-02T17:56:04Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@2cb53a7] (dev-cluster): Kafka logging pipeline, dev cluster only T211125

Mentioned in SAL (#wikimedia-operations) [2019-04-02T17:59:29Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@2cb53a7] (dev-cluster): Kafka logging pipeline, dev cluster only T211125 (duration: 03m 25s)

Mentioned in SAL (#wikimedia-operations) [2019-04-02T18:06:23Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, canary on restbase2010 T211125

Mentioned in SAL (#wikimedia-operations) [2019-04-02T18:08:56Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, canary on restbase2010 T211125 (duration: 02m 33s)

Mentioned in SAL (#wikimedia-operations) [2019-04-02T18:20:09Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, full deploy T211125

Mentioned in SAL (#wikimedia-operations) [2019-04-02T18:20:13Z] <ppchelko@deploy1001> deploy aborted: Kafka logging pipeline, full deploy T211125 (duration: 00m 03s)

Mentioned in SAL (#wikimedia-operations) [2019-04-02T18:20:22Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, full deploy T211125

Mentioned in SAL (#wikimedia-operations) [2019-04-02T18:41:10Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, full deploy T211125 (duration: 20m 49s)

I have deployed a new pipeline for RESTBase in production and it all looks great. Next step - convert other services. I will try it out on change-prop and create subtasks for individual services.

Change 500813 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/deploy@master] Use rsyslog for logging.

https://gerrit.wikimedia.org/r/500813

Change 500818 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch logging to rsyslog.

https://gerrit.wikimedia.org/r/500818

Change 500813 merged by Ppchelko:
[mediawiki/services/change-propagation/deploy@master] Use rsyslog for logging.

https://gerrit.wikimedia.org/r/500813

Mentioned in SAL (#wikimedia-operations) [2019-04-04T17:45:47Z] <ppchelko@deploy1001> Started deploy [changeprop/deploy@f69dc9c]: Switch to new logging infrastructure T211125

Mentioned in SAL (#wikimedia-operations) [2019-04-04T17:47:31Z] <ppchelko@deploy1001> Finished deploy [changeprop/deploy@f69dc9c]: Switch to new logging infrastructure T211125 (duration: 01m 44s)

Change 500818 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch logging to rsyslog.

https://gerrit.wikimedia.org/r/500818

Mentioned in SAL (#wikimedia-operations) [2019-04-04T17:55:21Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@922cbc0]: Switch to new logging infrastructure T211125

Mentioned in SAL (#wikimedia-operations) [2019-04-04T17:59:24Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@922cbc0]: Switch to new logging infrastructure T211125 (duration: 04m 03s)

ChangeProp and JobQueue ChangeProp has been moved to the new logging infra as well.

Change 512064 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Bump to service-runner@2.6.18

https://gerrit.wikimedia.org/r/512064

Change 512064 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Bump to service-runner@2.6.19

https://gerrit.wikimedia.org/r/512064

Change 512403 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid/deploy@master] Switch logging to rsyslog

https://gerrit.wikimedia.org/r/512403

Change 512403 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy@master] Switch logging to rsyslog

https://gerrit.wikimedia.org/r/512403