Migrate network device syslogs to Kafka logging pipeline
Closed, Resolved | Public | 0 Estimated Story Points

Description

Currently, network devices send logs directly to Logstash via syslog. Over time this approach has proven fragile, and as such it has been deprecated in favor of the Kafka-enabled logging pipeline.

To support devices that do not have native Kafka producer capability, we'll provide a relay feature on the central syslog hosts. Syslogs can be sent to the nearest central rsyslog host, and from there will be forwarded to the Kafka logging pipeline and, optionally, logged locally.
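
The relay idea described above can be sketched in Python. This is only an illustration of the concept, not the rsyslog configuration that was actually deployed: the `publish` callback is a stand-in for a Kafka producer, and the function names and record fields are assumptions made for this example.

```python
import socket
from typing import Callable, Dict, Tuple


def parse_pri(datagram: bytes) -> Tuple[int, int, str]:
    """Split a BSD-style syslog datagram into (facility, severity, message).

    The leading "<PRI>" header encodes facility * 8 + severity.
    """
    text = datagram.decode("utf-8", errors="replace")
    if not text.startswith("<") or ">" not in text:
        raise ValueError("missing <PRI> header")
    pri_str, message = text[1:].split(">", 1)
    pri = int(pri_str)  # raises ValueError if the header is not numeric
    return pri // 8, pri % 8, message


def relay_once(sock: socket.socket, publish: Callable[[Dict], None]) -> None:
    """Receive one UDP syslog datagram and hand it to `publish`.

    `publish` is hypothetical: in a real relay it would wrap a Kafka
    producer; here it can be any callable that accepts a dict.
    """
    datagram, (source_ip, _source_port) = sock.recvfrom(8192)
    facility, severity, message = parse_pri(datagram)
    publish(
        {
            "source": source_ip,
            "facility": facility,
            "severity": severity,
            "message": message,
        }
    )
```

Keeping the Kafka side behind a plain callable means the receive-and-parse path can be exercised locally without a broker.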

Creating this task to track the deployment of the syslog -> Kafka relay, testing, and migration of network devices away from the syslog input.

Event Timeline

herron triaged this task as Medium priority. May 22 2019, 2:52 PM
herron created this task.

Change 495980 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] rsyslog: add netdev_kafka_relay compatibility endpoint

https://gerrit.wikimedia.org/r/495980

Change 495980 merged by Herron:
[operations/puppet@production] rsyslog: add netdev_kafka_relay compatibility endpoint

https://gerrit.wikimedia.org/r/495980

Change 514376 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] rsyslog: remove syslog json template from netdev_kafka_relay

https://gerrit.wikimedia.org/r/514376

Change 514376 merged by Herron:
[operations/puppet@production] rsyslog: remove syslog json template from netdev_kafka_relay

https://gerrit.wikimedia.org/r/514376

A syslog UDP listener on port 10514 is now running on lithium/wezen, and forwarding messages received to the Kafka logging pipeline.

@ayounsi do you have a test device or two to point at, say, lithium.eqiad.wmnet:10514/udp as a smoke test?
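
For a smoke test without a spare network device, a hand-built message can be sent to the listener over UDP. The following is a minimal sketch, assuming a simplified BSD-style payload with no timestamp; the helper names, hostname, and tag are hypothetical, and whether the relay accepts this exact format is not confirmed here.

```python
import socket


def build_syslog_message(
    facility: int, severity: int, hostname: str, tag: str, text: str
) -> bytes:
    """Build a simplified BSD-style syslog payload (timestamp omitted)."""
    pri = facility * 8 + severity  # PRI value as defined for syslog
    return f"<{pri}>{hostname} {tag}: {text}".encode("utf-8")


def send_syslog(host: str, port: int, payload: bytes) -> None:
    """Send one datagram to a UDP syslog listener (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

For example, `send_syslog("lithium.eqiad.wmnet", 10514, build_syslog_message(16, 6, "testhost", "smoketest", "hello"))` would send a local0/info message to the listener mentioned above. UDP gives no delivery confirmation, so success has to be checked on the receiving side.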

Before moving production logs to this I think we should decide on some cnames for the service, so we avoid needing to reconfigure clients if/when the backend syslog hosts change.

Actually, we already have CNAMEs set up as syslog.codfw.wmnet and syslog.eqiad.wmnet, so let's use those.

Mentioned in SAL (#wikimedia-operations) [2019-06-04T19:48:08Z] <XioNoX> replace logstash.svc.eqiad.wmnet syslog target with syslog.codfw.wmnet on cr4-ulsfo - T224128

The test was successful; the next step is to roll the change out to all devices.

Note that this would be a great use of anycast. Only 1 target hostname to configure and automatic failover in case of outages.

Mentioned in SAL (#wikimedia-operations) [2019-06-18T13:31:35Z] <XioNoX> push new syslog target to mr* - T224128

Mentioned in SAL (#wikimedia-operations) [2019-06-18T13:42:08Z] <XioNoX> push new syslog target to msw* - T224128

Mentioned in SAL (#wikimedia-operations) [2019-06-19T07:01:59Z] <XioNoX> jnt push to ulsfo, remove old protect-old-lvs-servers term + update syslog target T224128

Mentioned in SAL (#wikimedia-operations) [2019-06-19T07:13:22Z] <XioNoX> jnt push to eqsin, remove old protect-old-lvs-servers term + update syslog target T224128

Mentioned in SAL (#wikimedia-operations) [2019-06-19T07:17:24Z] <XioNoX> jnt push to eqord, remove old protect-old-lvs-servers term + update syslog target T224128

Mentioned in SAL (#wikimedia-operations) [2019-06-19T07:18:30Z] <XioNoX> jnt push to eqdfw, remove old protect-old-lvs-servers term + update syslog target T224128

Mentioned in SAL (#wikimedia-operations) [2019-06-19T08:51:51Z] <XioNoX> jnt push to codfw, remove old protect-old-lvs-servers term + update syslog target T224128

Mentioned in SAL (#wikimedia-operations) [2019-06-19T09:19:59Z] <XioNoX> jnt push to esams, remove old protect-old-lvs-servers term + update syslog target T224128

Mentioned in SAL (#wikimedia-operations) [2019-06-19T14:48:19Z] <XioNoX> jnt push to eqiad, remove old protect-old-lvs-servers term + update syslog target T224128

Mentioned in SAL (#wikimedia-operations) [2019-06-19T14:55:01Z] <XioNoX> jnt push to knams, remove old protect-old-lvs-servers term + update syslog target (T224128) + replace /28 with /29 (T211254)

Mentioned in SAL (#wikimedia-operations) [2019-06-19T14:57:58Z] <XioNoX> update syslog target on frack network devices (T224128)

Change 518813 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Allow $NETWORK_INFRA to use syslog/kafka

https://gerrit.wikimedia.org/r/518813

Change 518813 merged by Ayounsi:
[operations/puppet@production] Allow $NETWORK_INFRA to use syslog/kafka

https://gerrit.wikimedia.org/r/518813

Change 518818 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Rename facility_label to facility

https://gerrit.wikimedia.org/r/518818

Change 518818 merged by Herron:
[operations/puppet@production] Rename facility_label to facility

https://gerrit.wikimedia.org/r/518818