
Migrate >=90% of existing Logstash traffic to the logging pipeline
Closed, Resolved · Public

Description

This task tracks migration of current Logstash producers to the logging pipeline, see also the table at https://phabricator.wikimedia.org/T198756#4552987

Migration checklist:

  • Which transport is the application going to use?
  • When the log daemon is involved, what is the application's behavior when the log daemon is down?
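For transports that go through a local daemon, an unconnected UDP datagram socket is one way to keep applications from blocking when the daemon is down — a minimal sketch (the port number is illustrative, not the actual configuration):

```python
import json
import socket

def log_nonblocking(record, host="127.0.0.1", port=10514):
    """Fire-and-forget logging: sendto() on an unconnected UDP socket
    neither blocks nor raises if nothing is listening on the port."""
    payload = (json.dumps(record) + "\n").encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sent = sock.sendto(payload, (host, port))
    return sent

# Even with no listener, the datagram is handed off without blocking.
log_nonblocking({"message": "test", "level": "INFO"})
```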

Event Timeline

fgiunchedi updated the task description. Oct 1 2018, 2:39 PM
herron triaged this task as Normal priority. Oct 2 2018, 5:26 PM

Change 475352 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP rsyslog: udp input json_lines shim

https://gerrit.wikimedia.org/r/475352

A first iteration on this might look like https://gerrit.wikimedia.org/r/475352, which adds a new UDP listener on localhost that accepts "json_lines" (i.e. one JSON object per line) and forwards them to Kafka for ingestion by Logstash. This is meant to provide a compatibility shim between what we're doing now (transport over syslog/gelf/etc., with JSON payloads) and transporting messages over Kafka. UDP on localhost is meant to provide a "safe" transition, where applications continue not to block on logging in case of errors. Additionally, the network MTU restriction is lifted, allowing bigger messages (most notably for MediaWiki).
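The json_lines contract the shim expects can be exercised end-to-end with a local UDP socket — a sketch where the listener stands in for the rsyslog UDP input (the real port is configured in puppet):

```python
import json
import socket

# Stand-in listener, playing the role of the rsyslog UDP input.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))          # ephemeral port, for the demo only
port = listener.getsockname()[1]

# Producer side: one JSON object per line ("json_lines").
record = {"message": "hello", "level": "INFO", "type": "mediawiki"}
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto((json.dumps(record) + "\n").encode("utf-8"), ("127.0.0.1", port))

# Receiver side: each datagram carries one newline-terminated JSON object.
data, _ = listener.recvfrom(65535)
decoded = json.loads(data.decode("utf-8").rstrip("\n"))
print(decoded["level"])   # INFO
```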

Kafka topic selection depends on a few factors: lines that can't be parsed as JSON objects are sent to a separate Kafka topic for later examination. When JSON parsing succeeds, the object is sent as-is to a Kafka topic for ingestion by Logstash. Additionally, objects containing a level key are sent to a level-specific Kafka topic; this is done because most log messages from services have standardized on level, and it helps prioritize message processing if needed (e.g. process error or critical messages first).
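A rough sketch of the routing rules described above (topic names here are illustrative, not the actual configuration):

```python
import json

def pick_topic(line, prefix="logging"):
    """Route a raw log line to a Kafka topic per the rules above:
    unparseable lines go to a catch-all topic for later examination,
    objects with a 'level' key go to a level-specific topic, and
    everything else goes to a default topic."""
    try:
        obj = json.loads(line)
    except ValueError:
        return f"{prefix}-undecodable"             # keep for later examination
    if isinstance(obj, dict) and "level" in obj:
        return f"{prefix}-{obj['level'].lower()}"  # e.g. logging-error
    return f"{prefix}-default"

print(pick_topic('{"message": "boom", "level": "ERROR"}'))  # logging-error
print(pick_topic("not json"))                               # logging-undecodable
```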

Testing the endpoint above on deployment-mediawiki-07 yields, for example, this message produced on Kafka:

{
  "@timestamp": "2018-11-28T08:21:51.000000+00:00",
  "@version": 1,
  "channel": "TitleBlacklist-cache",
  "facility": "user",
  "host": "deployment-mediawiki-07",
  "http_method": "POST",
  "ip": "172.16.4.21",
  "level": "INFO",
  "logsource": "deployment-mediawiki-07",
  "message": "Updated commonswiki:title_blacklist_entries with 762 entries.",
  "mwversion": "1.33.0-alpha",
  "normalized_message": "Updated commonswiki:title_blacklist_entries with 762 entries.",
  "private": false,
  "program": "mediawiki",
  "referrer": "https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:CreateAccount&returnto=MediaWiki+talk%3AGadget-DefaultSearch.js%2FCompatibility",
  "reqId": "W-5QHqwQBHcAAG6D9NgAAAAG",
  "server": "commons.wikimedia.beta.wmflabs.org",
  "severity": "info",
  "shard": "s3",
  "timestamp": "2018-11-28T08:21:51+00:00",
  "type": "mediawiki",
  "unique_id": "W-5QHqwQBHcAAG6D9NgAAAAG",
  "url": "/w/index.php?title=Special:CreateAccount&returnto=MediaWiki+talk:Gadget-DefaultSearch.js/Compatibility",
  "wiki": "commonswiki"
}

And its equivalent as consumed into Logstash/Elasticsearch (JSON from Kibana):

{
  "_index": "logstash-2018.11.28",
  "_type": "mediawiki",
  "_id": "AWdZaPoEvie9JhiTaPOL",
  "_version": 1,
  "_score": null,
  "_source": {
    "server": "commons.wikimedia.beta.wmflabs.org",
    "private": false,
    "wiki": "commonswiki",
    "channel": "TitleBlacklist-cache",
    "program": "mediawiki",
    "type": "mediawiki",
    "http_method": "POST",
    "@version": 1,
    "host": "deployment-mediawiki-07",
    "shard": "s3",
    "timestamp": "2018-11-28T08:21:51+00:00",
    "unique_id": "W-5QHqwQBHcAAG6D9NgAAAAG",
    "level": "INFO",
    "ip": "172.16.4.21",
    "mwversion": "1.33.0-alpha",
    "logsource": "deployment-mediawiki-07",
    "message": "Updated commonswiki:title_blacklist_entries with 762 entries.",
    "normalized_message": "Updated commonswiki:title_blacklist_entries with 762 entries.",
    "url": "/w/index.php?title=Special:CreateAccount&returnto=MediaWiki+talk:Gadget-DefaultSearch.js/Compatibility",
    "reqId": "W-5QHqwQBHcAAG6D9NgAAAAG",
    "tags": [
      "rsyslog-shipper",
      "kafka",
      "es"
    ],
    "referrer": "https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:CreateAccount&returnto=MediaWiki+talk%3AGadget-DefaultSearch.js%2FCompatibility",
    "@timestamp": "2018-11-28T08:21:51.000Z",
    "facility": "user"
  },
  "fields": {
    "@timestamp": [
      1543393311000
    ]
  },
  "highlight": {
    "tags": [
      "@kibana-highlighted-field@rsyslog-shipper@/kibana-highlighted-field@"
    ]
  },
  "sort": [
    1543393311000
  ]
}

Change 476228 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/mediawiki-config@master] LabsServices: ship logs locally

https://gerrit.wikimedia.org/r/476228

Change 476472 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: add new logging kafka consumer

https://gerrit.wikimedia.org/r/476472

Change 476473 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: copy 'severity' into 'level' where needed

https://gerrit.wikimedia.org/r/476473
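The change above can be sketched roughly as the following normalization step (field names follow the message example earlier; the actual implementation is a Logstash filter managed in puppet):

```python
def normalize_level(event):
    """Copy 'severity' into 'level' when 'level' is absent, so consumers
    can rely on a single field. Assumes syslog severities map to
    upper-case level names (e.g. 'info' -> 'INFO')."""
    if "level" not in event and "severity" in event:
        event["level"] = event["severity"].upper()
    return event

event = normalize_level({"severity": "info", "message": "hi"})
print(event["level"])  # INFO
```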

Change 476228 merged by jenkins-bot:
[operations/mediawiki-config@master] LabsServices: ship logs locally

https://gerrit.wikimedia.org/r/476228

Change 478617 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/mediawiki-config@master] Revert "LabsServices: ship logs locally"

https://gerrit.wikimedia.org/r/478617

Change 478617 merged by Filippo Giunchedi:
[operations/mediawiki-config@master] Revert "LabsServices: ship logs locally"

https://gerrit.wikimedia.org/r/478617

Change 475352 merged by Filippo Giunchedi:
[operations/puppet@production] rsyslog: add UDP localhost compatibility endpoint

https://gerrit.wikimedia.org/r/475352

Change 478902 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] deployment-prep: bump logstash heap memory

https://gerrit.wikimedia.org/r/478902

Change 478902 merged by Filippo Giunchedi:
[operations/puppet@production] deployment-prep: bump logstash heap memory

https://gerrit.wikimedia.org/r/478902

Change 476473 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: copy 'severity' into 'level' where needed

https://gerrit.wikimedia.org/r/476473

Change 479677 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/mediawiki-config@master] LabsServices: ship logs locally

https://gerrit.wikimedia.org/r/479677

Change 479677 merged by jenkins-bot:
[operations/mediawiki-config@master] LabsServices: ship logs locally

https://gerrit.wikimedia.org/r/479677

Change 476472 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: add new logging kafka consumer

https://gerrit.wikimedia.org/r/476472

Looks like group1 is done now; I checked the Kibana MediaWiki dashboards and everything seems in order (i.e. no regressions or disappeared logs). We're producing ~150 messages/s at the moment. Next up is adding partitions to the Kafka topics to spread the load amongst all brokers.

fgiunchedi closed this task as Resolved.Jan 16 2019, 10:18 AM
fgiunchedi claimed this task.

This is complete, with MediaWiki logging fully switched to the new logging infrastructure:

Expanding from the graph above with this expression:

sum by (plugin_id) (rate(logstash_node_plugin_events_out_total{plugin_id=~"input/.*"}[5m])) / scalar(sum(rate(logstash_node_plugin_events_out_total{plugin_id=~"input/.*"}[5m])))

Over the last week we've averaged 87% of logs onto the new logging pipeline (86% MediaWiki + 1% from onboarded apps).