Audit log producers across the infrastructure and plan their transition to centralized logging.
Closed, Resolved (Public)

An easy target for producers that are not in logstash but should be is central syslog and the programs that log there. I don't think we can add everything in bulk for capacity reasons, so here's a partial tally of the top producers as seen by lithium:

root@lithium:/srv/syslog# zcat archive/syslog.log-20180716.gz | awk '{print $5}' | sed -e 's@\[.*\]:@@'  | sort | uniq -c | sort -nr
19730448 systemd
8094006 sshd
2741188 systemd-logind
2334521 memcached-keys:
1939734 bash
1751169 eventlogging_sync.sh
1087272 pdns_recursor
 992420 puppet-master
 907107 charon:
 880225 prometheus-node-exporter
 710906 swift-container-reconciler:
 563598 mysqld
 538136 puppet-agent
 452220 uwsgi-graphite-web
 393450 /usr/local/sbin/maintain-dbusers
 332239 diamond
 281068 kernel:
 247282 puppet-agent-cronjob:
 234986 tileratorui
 209794 apertium-apy
 194429 thumbor@8826
 192329 icinga:
 182732 turnilo
 174646 pdfrender
 145275 prometheus-blackbox-exporter
 128942 uwsgi-labspuppetbackend
 122414 neutron-server
 118994 dbus
 108256 pybal
 100410 kafka-server-start
  99894 mcelog:
  97884 smartd
  97589 eventlogging-consumer@mysql-m4-master-00
  92437 nova-compute
  78472 python
  75545 ssh-agent
  63536 thumbor@8814
  61176 thumbor@8818
  51919 nodepoold
  51451 varnishospital
  46534 varnishstatsd
  45796 prometheus@ops
  42572 hhvm
  42203 hhvm:
  39869 prometheus-druid-exporter
  38698 nodejs
  38029 java
  31347 neutron-metadata-agent
  28764 varnishd
  26328 pdns
  26209 vhtcpd
  26191 druid
  25383 thumbor@8809
  23099 thumbor@8801
  23014 prometheus-burrow-exporter
  22787 systemd-timedated
  21059 rsyncd
  19082 prometheus-openldap-exporter
  17991 thumbor@8827
  17722 nfs-exportd
  16830 thumbor@8823
  16242 dnsmasq-dhcp
  15709 thumbor@8806
  15549 thumbor@8812
  13730 thumbor@8813
  13158 thumbor@8830
  13063 thumbor@8831
  13055 thumbor@8815
  12940 thumbor@8810
  12873 thumbor@8821
  12665 thumbor@8824
  12559 thumbor@8808
  12509 thumbor@8829
  12410 thumbor@8817
  12389 thumbor@8820
  12358 thumbor@8825
  12351 thumbor@8803
  12306 thumbor@8816
  12277 thumbor@8832
  12226 thumbor@8828
  12204 thumbor@8811
  12194 thumbor@8819
  12166 thumbor@8804
  12087 thumbor@8807
  11928 thumbor@8805
  11884 thumbor@8802
  11787 thumbor@8822
  11506 prometheus-mysqld-exporter
   9571 nova-api
   8007 thumbor@8840
   7339 grafana-server
   6209 thumbor@8836
   5880 thumbor@8833
   5793 thumbor@8839
   5756 /usr/local/bin/exim-to-gmetric
   5752 prometheus-mcrouter-exporter
   5501 thumbor@8835
   5430 thumbor@8837
   5371 thumbor@8838
   5288 thumbor@8834
   5024 acct
   4946 eventlogging-processor@client-side-04
   4904 varnishkafka
   4796 eventlogging-processor@client-side-01
   4736 eventlogging-processor@client-side-05
   4697 eventlogging-processor@client-side-11
   4625 eventlogging-processor@client-side-06
   4606 eventlogging-processor@client-side-08
   4545 eventlogging-processor@client-side-10
   4533 eventlogging-processor@client-side-09
   4510 eventlogging-processor@client-side-03
   4501 eventlogging-processor@client-side-02
   4455 eventlogging-processor@client-side-00
   4413 confd
   4410 eventlogging-processor@client-side-07
   4330 ykval
   4317 nfacctd
   4280 git-daemon
   3840 wmf-auto-restart:
   3759 debmonitor-client
   3532 etcd
   3319 parsoid
   3292 prometheus@global
   2816 nova-scheduler
   2761 udpmxircecho.py
   2658 docker-registry
   2596 zuul-server
   2373 prometheus@analytics
   2315 prometheus@k8s
   2268 liblogging-stdlog:
   2231 prometheus@services
   1912 Jul
   1488 drive-audit:
   1336 rsyslogd0:
   1194 prometheus@k8s-staging
   1166 prometheus@labs
   1093 rsyslogd-2359:
   1012 rsyslogd:
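
If the tally needs further processing (grouping the per-instance thumbor@NNNN units, for example), the same count can be sketched in Python. This is only an approximation of the shell pipeline above: the function names are made up for illustration, and unlike awk it skips lines with fewer than five fields instead of counting them as empty.

```python
import gzip
import re
from collections import Counter

# Mirrors sed -e 's@\[.*\]:@@': drop a "[pid]:" suffix from the tag.
_PID_SUFFIX = re.compile(r"\[.*\]:")


def tally_programs(lines):
    """Count syslog lines per program tag (5th whitespace-separated field)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # awk would print an empty field here; we skip instead
        counts[_PID_SUFFIX.sub("", fields[4], count=1)] += 1
    return counts


def tally_file(path):
    """Tally a gzip-compressed syslog archive, tolerating undecodable bytes."""
    with gzip.open(path, "rt", errors="replace") as fh:
        return tally_programs(fh)


if __name__ == "__main__":
    # Same archive as in the shell example, relative to /srv/syslog.
    for tag, count in tally_file("archive/syslog.log-20180716.gz").most_common():
        print(f"{count:8d} {tag}")
```

Tags without a pid suffix keep their trailing colon (e.g. "kernel:"), the same as in the output above.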

Change 446318 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] wdqs: use syslogidentifier in systemd units

https://gerrit.wikimedia.org/r/446318

Change 446324 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] dumps/snapshot: use syslogidentifier in systemd units

https://gerrit.wikimedia.org/r/446324

Change 446325 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mediawiki: use syslogidentifier in systemd units

https://gerrit.wikimedia.org/r/446325

Change 446324 merged by Filippo Giunchedi:
[operations/puppet@production] dumps/snapshot: use syslogidentifier in systemd units

https://gerrit.wikimedia.org/r/446324

Change 446318 merged by Filippo Giunchedi:
[operations/puppet@production] wdqs: use syslogidentifier in systemd units

https://gerrit.wikimedia.org/r/446318

Change 446325 merged by Filippo Giunchedi:
[operations/puppet@production] mediawiki: use syslogidentifier in systemd units

https://gerrit.wikimedia.org/r/446325

As for producers already in logstash, here's what we have for the last 30 days:

| type | records | percent | transport | format | library | migrating to | notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mediawiki | 1,668,972,724 | 70.78% | udp | json | monolog | see notes | mw/monolog+kafka blocked on PHP7 (see https://phabricator.wikimedia.org/T188136) |
| parsoid | 266,472,759 | 11.30% | udp | gelf | service-runner | local syslog | |
| wdqs | 124,565,124 | 5.28% | udp | json and syslog | logback | local syslog (udp) | |
| ores | 117,729,989 | 4.99% | udp | json | wsgi's logging | local syslog | |
| elasticsearch | 77,215,960 | 3.27% | udp | gelf | log4j | local syslog (udp) | |
| logback | 61,349,489 | 2.60% | udp | json | logback | local syslog | a mix of thumbor/wdqs/varnishslowlog/varnishhospital |
| syslog | 11,520,139 | 0.49% | udp | syslog | proprietary | remote syslog | should be "junos" |
| restbase | 8,601,422 | 0.36% | udp | gelf | service-runner | local syslog | |
| cpjobqueue | 4,507,883 | 0.19% | udp | gelf | service-runner | local syslog | |
| parsoid-tests | 3,482,283 | 0.15% | udp | gelf | service-runner | local syslog | |
| hhvm | 3,199,859 | 0.14% | udp | syslog | custom, via rsyslog | local syslog | |
| webrequest | 2,822,727 | 0.12% | tcp | json | via socat, from oxygen | local syslog | |
| citoid | 1,417,667 | 0.06% | udp | gelf | service-runner | local syslog | |
| eventstreams | 1,409,819 | 0.06% | udp | gelf | service-runner | local syslog | |
| cassandra | 1,354,945 | 0.06% | udp | json | logback | local syslog (udp) | |
| kartotherian | 738,976 | 0.03% | udp | gelf | service-runner | local syslog | |
| scap | 732,939 | 0.03% | udp2log | json | python logging | local syslog | |
| changeprop | 451,425 | 0.02% | udp | gelf | service-runner | local syslog | |
| apache2 | 418,541 | 0.02% | udp | syslog | custom, via rsyslog | local syslog | |
| graphoid | 406,727 | 0.02% | udp | gelf | service-runner | local syslog | |
| aqs | 375,295 | 0.02% | udp | gelf | service-runner | local syslog | |
| striker | 298,652 | 0.01% | udp | json | python-logging | local syslog | |
| root | 28,213 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
| restbase-dev | 26,689 | 0.00% | udp | gelf | service-runner | local syslog | |
| log4j | 8,629 | 0.00% | udp | json | log4j | local syslog (udp) | should be "gerrit" or "gitiles" |
| kafka.producer.sender | 4,372 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
| mjolnir | 3,401 | 0.00% | tcp | json | python-logstash | local syslog | |
| tornado.access | 2,243 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
| cxserver | 1,422 | 0.00% | udp | gelf | service-runner | local syslog | |
| proton | 1,202 | 0.00% | udp | gelf | service-runner | local syslog | |
| %{facility} | 614 | 0.00% | | gelf | json? | | json parse failure |
| tilerator | 590 | 0.00% | udp | gelf | service-runner | local syslog | |
| mobileapps | 255 | 0.00% | udp | gelf | service-runner | local syslog | |
| tornado.general | 125 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
| kafka.client | 25 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
| mathoid | 23 | 0.00% | udp | gelf | service-runner | local syslog | |
| recommendation_api | 23 | 0.00% | udp | gelf | service-runner | local syslog | |
| kafka.cluster | 2 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
| tileratorui | 1 | 0.00% | udp | gelf | service-runner | local syslog | |

Another source of logs that are not in logstash today is logs on disk. I ran a crude audit by looking at directories under /srv/log and /var/log across the fleet. Note that some services listed here might already be sending their structured logs to logstash. Also, the list is for audit purposes; not all services listed here will necessarily have their logs in logstash.

/srv/log/apertium
/srv/log/aqs
/srv/log/changeprop
/srv/log/citoid
/srv/log/cpjobqueue
/srv/log/cxserver
/srv/log/debmonitor
/srv/log/eventlogging
/srv/log/eventstreams
/srv/log/graphoid
/srv/log/kartotherian
/srv/log/keystone-admin
/srv/log/keystone-public
/srv/log/mathoid
/srv/log/mobileapps
/srv/log/mw-log
/srv/log/netbox
/srv/log/ores
/srv/log/parsoid
/srv/log/pdfrender
/srv/log/proton
/srv/log/puppetboard
/srv/log/recommendation_api
/srv/log/restbase
/srv/log/striker
/srv/log/thumbor
/srv/log/tilerator
/srv/log/tileratorui
/srv/log/trendingedits
/srv/log/webrequest
/srv/log/zotero
/var/log/apache2
/var/log/apertium
/var/log/aphlict
/var/log/archiva
/var/log/bacula
/var/log/burrow
/var/log/calico
/var/log/camus
/var/log/carbon
/var/log/cassandra
/var/log/categoriesrdf
/var/log/changeprop
/var/log/cirrusdump
/var/log/citoid
/var/log/confluent
/var/log/containers
/var/log/cumin
/var/log/cxserver
/var/log/debdeploy
/var/log/designate
/var/log/druid
/var/log/dumps
/var/log/elasticsearch
/var/log/etherpad-lite
/var/log/eventlogging
/var/log/eventlogging_cleaner
/var/log/eventlogging_sync
/var/log/ganeti
/var/log/glance
/var/log/grafana
/var/log/graphite
/var/log/graphite-web
/var/log/graphoid
/var/log/gunicorn
/var/log/hadoop-0.20-mapreduce
/var/log/hadoop-hdfs
/var/log/hadoop-httpfs
/var/log/hadoop-mapreduce
/var/log/hadoop-yarn
/var/log/hbase
/var/log/hhvm
/var/log/hive
/var/log/hive-hcatalog
/var/log/hue
/var/log/icinga
/var/log/jenkins
/var/log/jupyterhub
/var/log/kafka
/var/log/kartotherian
/var/log/keystone
/var/log/l10nupdatelog
/var/log/landscape
/var/log/libvirt
/var/log/logstash
/var/log/logster
/var/log/mathoid
/var/log/matomo
/var/log/mcrouter
/var/log/mediawiki
/var/log/memkeys
/var/log/mobileapps
/var/log/mongodb
/var/log/mtail
/var/log/mysql
/var/log/neutron
/var/log/nginx
/var/log/nodepool
/var/log/nova
/var/log/nutcracker
/var/log/oozie
/var/log/osm
/var/log/osm_replication
/var/log/osmosis
/var/log/parsoid
/var/log/phd
/var/log/pig
/var/log/piwik
/var/log/planet
/var/log/pods
/var/log/postgresql
/var/log/prometheus
/var/log/puppet
/var/log/puppetlabs
/var/log/quagga
/var/log/rabbitmq
/var/log/rancid
/var/log/redis
/var/log/refinery
/var/log/restbase
/var/log/solr
/var/log/spark
/var/log/spicerack
/var/log/sqoop
/var/log/squid3
/var/log/superset
/var/log/swift
/var/log/tilerator
/var/log/tileratorui
/var/log/tinyproxy
/var/log/trafficserver
/var/log/translationnotifications
/var/log/turnilo
/var/log/udp2log
/var/log/unattended-upgrades
/var/log/uwsgi
/var/log/varnish
/var/log/visualdiff
/var/log/waterlines
/var/log/wdqs
/var/log/wikidata
/var/log/wikidatadump
/var/log/wmf-auto-reimage
/var/log/zookeeper
/var/log/zotero
/var/log/zuul
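
For reference, the per-host part of such an audit can be sketched as below. This is an illustration only: the function name is made up, it covers a single host, and how the results were collected and merged across the fleet is not described in this task.

```python
from pathlib import Path


def list_log_dirs(roots=("/srv/log", "/var/log")):
    """Return the sorted immediate subdirectories of each existing root."""
    found = []
    for root in roots:
        base = Path(root)
        if not base.is_dir():
            continue  # e.g. hosts without /srv/log
        found.extend(str(p) for p in base.iterdir() if p.is_dir())
    return sorted(found)


if __name__ == "__main__":
    print("\n".join(list_log_dirs()))
```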

I've been researching what the logback/log4j/log4j2 migration could look like, first with syslog + unix socket:

As for Python programs using python-logstash, we should be able to use logging.handlers.SysLogHandler instead and format logs as json.
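
A minimal sketch of what that could look like, using only the standard library. The names JsonFormatter and make_syslog_logger, the field names in the payload and the "/dev/log" default are assumptions for illustration. The "@cee:" prefix is the cookie rsyslog's mmjsonparse looks for by default; whether the local rsyslog is configured to expect it is also an assumption.

```python
import json
import logging
import logging.handlers


class JsonFormatter(logging.Formatter):
    """Render each record as a syslog tag followed by one JSON object."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag  # hypothetical program name, shown as the syslog tag

    def format(self, record):
        payload = {
            "message": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # "@cee:" is mmjsonparse's default cookie (assumed to be in use here).
        return f"{self.tag}: @cee: {json.dumps(payload)}"


def make_syslog_logger(name, address="/dev/log"):
    """Attach a SysLogHandler on the local unix socket with JSON formatting."""
    handler = logging.handlers.SysLogHandler(address=address)
    handler.setFormatter(JsonFormatter(tag=name))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    make_syslog_logger("demo-app").info("hello from %s", "python")
```

Because this goes over the unix socket, the receiving syslog daemon can still obtain the sender's credentials, unlike with udp.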

I experimented with emulating a json_lines-compatible local udp endpoint with rsyslog, so that json sent over udp ends up in kafka. It is only an experiment, but it shows this is possible for cases where syslog on a unix socket + json would be overly complex or unmaintainable. Most notably, using udp on localhost loses SCM_CREDENTIALS support out of the box; in other words, we can't attach the uid/gid/pid/etc. of the calling process to the logs if we want to.

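# Load the modules used below (omit any already loaded elsewhere in the config).
module(load="imudp")
module(load="mmjsonparse")
module(load="omkafka")

# Accept json datagrams on a local udp port and hand them to a dedicated ruleset.
input(type="imudp" port="11514" address="localhost" ruleset="udp_json_kafka")

# Emit the parsed json tree as-is.
template(name="json_json" type="list") {
  property(name="$!all-json")
}

ruleset(name="udp_json_kafka") {
  # An empty cookie makes mmjsonparse parse the message without a "@cee:" prefix.
  action(type="mmjsonparse" cookie="" name="mmjsonparse")
  # Forward to kafka only when json parsing succeeded.
  if $parsesuccess == "OK" then {
    action(
      broker=["logging-stretch01.logging.eqiad.wmflabs:9092"]
      type="omkafka"
      name="udp_json_kafka"
      topic="logging.rsyslog"
      template="json_json"
      # confParam is an array with one kafka setting per element.
      confParam=["queue.buffering.max.ms=50", "batch.num.messages=1000"]
    )
  }
}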
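
A producer can exercise an endpoint like this with a few lines of Python. The sketch below sends one json object per udp datagram; the function name and the event fields are made up for illustration, and the default port simply matches the imudp input above.

```python
import json
import socket


def send_json_udp(event, host="localhost", port=11514):
    """Send one JSON object as a single UDP datagram; return bytes sent."""
    data = json.dumps(event).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(data, (host, port))


if __name__ == "__main__":
    # Hypothetical event; the field names are not prescribed by this task.
    send_json_udp({"message": "hello", "level": "INFO", "program": "demo"})
```

Since udp is fire-and-forget, a successful send says nothing about whether rsyslog received or parsed the message; that has to be checked on the kafka side.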
fgiunchedi claimed this task.

The audit/list of current and future log producers has been carried out as part of this task (see the table at https://phabricator.wikimedia.org/T198756#4552987).