Details
Status | Assigned | Task |
---|---|---|
Resolved | fgiunchedi | T198753 Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) |
Resolved | fgiunchedi | T198756 Audit log producers across the infrastructure and plan their transition to centralized logging. |
Duplicate | None | T63781 Add syslog logs to logstash |
Event Timeline
Linking in some related tasks that we might be able to resolve or decline during this goal.
An easy target for producers that are not in logstash but should be is central syslog and the programs that log there. I don't think we can add everything in bulk for capacity reasons, so here's a partial tally of the top producers as seen by lithium.
root@lithium:/srv/syslog# zcat archive/syslog.log-20180716.gz | awk '{print $5}' | sed -e 's@\[.*\]:@@' | sort | uniq -c | sort -nr
19730448 systemd
8094006 sshd
2741188 systemd-logind
2334521 memcached-keys:
1939734 bash
1751169 eventlogging_sync.sh
1087272 pdns_recursor
992420 puppet-master
907107 charon:
880225 prometheus-node-exporter
710906 swift-container-reconciler:
563598 mysqld
538136 puppet-agent
452220 uwsgi-graphite-web
393450 /usr/local/sbin/maintain-dbusers
332239 diamond
281068 kernel:
247282 puppet-agent-cronjob:
234986 tileratorui
209794 apertium-apy
194429 thumbor@8826
192329 icinga:
182732 turnilo
174646 pdfrender
145275 prometheus-blackbox-exporter
128942 uwsgi-labspuppetbackend
122414 neutron-server
118994 dbus
108256 pybal
100410 kafka-server-start
99894 mcelog:
97884 smartd
97589 eventlogging-consumer@mysql-m4-master-00
92437 nova-compute
78472 python
75545 ssh-agent
63536 thumbor@8814
61176 thumbor@8818
51919 nodepoold
51451 varnishospital
46534 varnishstatsd
45796 prometheus@ops
42572 hhvm
42203 hhvm:
39869 prometheus-druid-exporter
38698 nodejs
38029 java
31347 neutron-metadata-agent
28764 varnishd
26328 pdns
26209 vhtcpd
26191 druid
25383 thumbor@8809
23099 thumbor@8801
23014 prometheus-burrow-exporter
22787 systemd-timedated
21059 rsyncd
19082 prometheus-openldap-exporter
17991 thumbor@8827
17722 nfs-exportd
16830 thumbor@8823
16242 dnsmasq-dhcp
15709 thumbor@8806
15549 thumbor@8812
13730 thumbor@8813
13158 thumbor@8830
13063 thumbor@8831
13055 thumbor@8815
12940 thumbor@8810
12873 thumbor@8821
12665 thumbor@8824
12559 thumbor@8808
12509 thumbor@8829
12410 thumbor@8817
12389 thumbor@8820
12358 thumbor@8825
12351 thumbor@8803
12306 thumbor@8816
12277 thumbor@8832
12226 thumbor@8828
12204 thumbor@8811
12194 thumbor@8819
12166 thumbor@8804
12087 thumbor@8807
11928 thumbor@8805
11884 thumbor@8802
11787 thumbor@8822
11506 prometheus-mysqld-exporter
9571 nova-api
8007 thumbor@8840
7339 grafana-server
6209 thumbor@8836
5880 thumbor@8833
5793 thumbor@8839
5756 /usr/local/bin/exim-to-gmetric
5752 prometheus-mcrouter-exporter
5501 thumbor@8835
5430 thumbor@8837
5371 thumbor@8838
5288 thumbor@8834
5024 acct
4946 eventlogging-processor@client-side-04
4904 varnishkafka
4796 eventlogging-processor@client-side-01
4736 eventlogging-processor@client-side-05
4697 eventlogging-processor@client-side-11
4625 eventlogging-processor@client-side-06
4606 eventlogging-processor@client-side-08
4545 eventlogging-processor@client-side-10
4533 eventlogging-processor@client-side-09
4510 eventlogging-processor@client-side-03
4501 eventlogging-processor@client-side-02
4455 eventlogging-processor@client-side-00
4413 confd
4410 eventlogging-processor@client-side-07
4330 ykval
4317 nfacctd
4280 git-daemon
3840 wmf-auto-restart:
3759 debmonitor-client
3532 etcd
3319 parsoid
3292 prometheus@global
2816 nova-scheduler
2761 udpmxircecho.py
2658 docker-registry
2596 zuul-server
2373 prometheus@analytics
2315 prometheus@k8s
2268 liblogging-stdlog:
2231 prometheus@services
1912 Jul
1488 drive-audit:
1336 rsyslogd0:
1194 prometheus@k8s-staging
1166 prometheus@labs
1093 rsyslogd-2359:
1012 rsyslogd:
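For anyone who wants to redo this tally without the shell pipeline, here is a rough Python sketch of the same counting logic. It assumes the same line layout as the awk command (program name in the fifth whitespace-separated field); the function name `tally_programs` is made up for illustration, and unlike awk it skips lines with fewer than five fields instead of counting them as empty.

```python
import re
from collections import Counter

# Same pattern as the sed expression: strip a "[pid]:" suffix from the program field.
PID_SUFFIX = re.compile(r"\[.*\]:")


def tally_programs(lines):
    """Count log lines per program name, most frequent first."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            # Assumption: ignore short/continuation lines rather than count them.
            continue
        # sed without the "g" flag replaces only the first match, hence count=1.
        program = PID_SUFFIX.sub("", fields[4], count=1)
        counts[program] += 1
    return counts.most_common()
```

Something like `tally_programs(gzip.open(path, "rt", errors="replace"))` should then give the same ordering as `sort | uniq -c | sort -nr`, modulo ties.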
Change 446318 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] wdqs: use syslogidentifier in systemd units
Change 446324 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] dumps/snapshot: use syslogidentifier in systemd units
Change 446325 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mediawiki: use syslogidentifier in systemd units
Change 446324 merged by Filippo Giunchedi:
[operations/puppet@production] dumps/snapshot: use syslogidentifier in systemd units
Change 446318 merged by Filippo Giunchedi:
[operations/puppet@production] wdqs: use syslogidentifier in systemd units
Change 446325 merged by Filippo Giunchedi:
[operations/puppet@production] mediawiki: use syslogidentifier in systemd units
As for producers already in logstash here's what we have for the last 30 days:
type | records | percent | transport | format | library | migrating to | notes |
---|---|---|---|---|---|---|---|
mediawiki | 1,668,972,724 | 70.78% | udp | json | monolog | see notes | mw/monolog+kafka blocked on PHP7 (see https://phabricator.wikimedia.org/T188136) |
parsoid | 266,472,759 | 11.30% | udp | gelf | service-runner | local syslog | |
wdqs | 124,565,124 | 5.28% | udp | json and syslog | logback | local syslog (udp) | |
ores | 117,729,989 | 4.99% | udp | json | wsgi's logging | local syslog | |
elasticsearch | 77,215,960 | 3.27% | udp | gelf | log4j | local syslog (udp) | |
logback | 61,349,489 | 2.60% | udp | json | logback | local syslog | a mix of thumbor/wdqs/varnishslowlog/varnishhospital |
syslog | 11,520,139 | 0.49% | udp | syslog | proprietary | remote syslog | should be "junos" |
restbase | 8,601,422 | 0.36% | udp | gelf | service-runner | local syslog | |
cpjobqueue | 4,507,883 | 0.19% | udp | gelf | service-runner | local syslog | |
parsoid-tests | 3,482,283 | 0.15% | udp | gelf | service-runner | local syslog | |
hhvm | 3,199,859 | 0.14% | udp | syslog | custom, via rsyslog | local syslog | |
webrequest | 2,822,727 | 0.12% | tcp | json | via socat, from oxygen | local syslog | |
citoid | 1,417,667 | 0.06% | udp | gelf | service-runner | local syslog | |
eventstreams | 1,409,819 | 0.06% | udp | gelf | service-runner | local syslog | |
cassandra | 1,354,945 | 0.06% | udp | json | logback | local syslog (udp) | |
kartotherian | 738,976 | 0.03% | udp | gelf | service-runner | local syslog | |
scap | 732,939 | 0.03% | udp2log | json | python logging | local syslog | |
changeprop | 451,425 | 0.02% | udp | gelf | service-runner | local syslog | |
apache2 | 418,541 | 0.02% | udp | syslog | custom, via rsyslog | local syslog | |
graphoid | 406,727 | 0.02% | udp | gelf | service-runner | local syslog | |
aqs | 375,295 | 0.02% | udp | gelf | service-runner | local syslog | |
striker | 298,652 | 0.01% | udp | json | python-logging | local syslog | |
root | 28,213 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
restbase-dev | 26,689 | 0.00% | udp | gelf | service-runner | local syslog | |
log4j | 8,629 | 0.00% | udp | json | log4j | local syslog (udp) | should be "gerrit" or "gitiles" |
kafka.producer.sender | 4,372 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
mjolnir | 3,401 | 0.00% | tcp | json | python-logstash | local syslog | |
tornado.access | 2,243 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
cxserver | 1,422 | 0.00% | udp | gelf | service-runner | local syslog | |
proton | 1,202 | 0.00% | udp | gelf | service-runner | local syslog | |
%{facility} | 614 | 0.00% | gelf | json | ? | json parse failure | |
tilerator | 590 | 0.00% | udp | gelf | service-runner | local syslog | |
mobileapps | 255 | 0.00% | udp | gelf | service-runner | local syslog | |
tornado.general | 125 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
kafka.client | 25 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
mathoid | 23 | 0.00% | udp | gelf | service-runner | local syslog | |
recommendation_api | 23 | 0.00% | udp | gelf | service-runner | local syslog | |
kafka.cluster | 2 | 0.00% | udp | gelf | python graypy | local syslog | should be "eventbus" |
tileratorui | 1 | 0.00% | udp | gelf | service-runner | local syslog | |
Another source of logs not currently in logstash is logs on disk. I ran a crude audit by looking at directories under /srv/log and /var/log across the fleet. Note that some services listed here might already be sending their structured logs to logstash. Also, the list is for audit purposes; not all services listed here will necessarily have their logs in logstash.
/srv/log/apertium
/srv/log/aqs
/srv/log/changeprop
/srv/log/citoid
/srv/log/cpjobqueue
/srv/log/cxserver
/srv/log/debmonitor
/srv/log/eventlogging
/srv/log/eventstreams
/srv/log/graphoid
/srv/log/kartotherian
/srv/log/keystone-admin
/srv/log/keystone-public
/srv/log/mathoid
/srv/log/mobileapps
/srv/log/mw-log
/srv/log/netbox
/srv/log/ores
/srv/log/parsoid
/srv/log/pdfrender
/srv/log/proton
/srv/log/puppetboard
/srv/log/recommendation_api
/srv/log/restbase
/srv/log/striker
/srv/log/thumbor
/srv/log/tilerator
/srv/log/tileratorui
/srv/log/trendingedits
/srv/log/webrequest
/srv/log/zotero
/var/log/apache2
/var/log/apertium
/var/log/aphlict
/var/log/archiva
/var/log/bacula
/var/log/burrow
/var/log/calico
/var/log/camus
/var/log/carbon
/var/log/cassandra
/var/log/categoriesrdf
/var/log/changeprop
/var/log/cirrusdump
/var/log/citoid
/var/log/confluent
/var/log/containers
/var/log/cumin
/var/log/cxserver
/var/log/debdeploy
/var/log/designate
/var/log/druid
/var/log/dumps
/var/log/elasticsearch
/var/log/etherpad-lite
/var/log/eventlogging
/var/log/eventlogging_cleaner
/var/log/eventlogging_sync
/var/log/ganeti
/var/log/glance
/var/log/grafana
/var/log/graphite
/var/log/graphite-web
/var/log/graphoid
/var/log/gunicorn
/var/log/hadoop-0.20-mapreduce
/var/log/hadoop-hdfs
/var/log/hadoop-httpfs
/var/log/hadoop-mapreduce
/var/log/hadoop-yarn
/var/log/hbase
/var/log/hhvm
/var/log/hive
/var/log/hive-hcatalog
/var/log/hue
/var/log/icinga
/var/log/jenkins
/var/log/jupyterhub
/var/log/kafka
/var/log/kartotherian
/var/log/keystone
/var/log/l10nupdatelog
/var/log/landscape
/var/log/libvirt
/var/log/logstash
/var/log/logster
/var/log/mathoid
/var/log/matomo
/var/log/mcrouter
/var/log/mediawiki
/var/log/memkeys
/var/log/mobileapps
/var/log/mongodb
/var/log/mtail
/var/log/mysql
/var/log/neutron
/var/log/nginx
/var/log/nodepool
/var/log/nova
/var/log/nutcracker
/var/log/oozie
/var/log/osm
/var/log/osm_replication
/var/log/osmosis
/var/log/parsoid
/var/log/phd
/var/log/pig
/var/log/piwik
/var/log/planet
/var/log/pods
/var/log/postgresql
/var/log/prometheus
/var/log/puppet
/var/log/puppetlabs
/var/log/quagga
/var/log/rabbitmq
/var/log/rancid
/var/log/redis
/var/log/refinery
/var/log/restbase
/var/log/solr
/var/log/spark
/var/log/spicerack
/var/log/sqoop
/var/log/squid3
/var/log/superset
/var/log/swift
/var/log/tilerator
/var/log/tileratorui
/var/log/tinyproxy
/var/log/trafficserver
/var/log/translationnotifications
/var/log/turnilo
/var/log/udp2log
/var/log/unattended-upgrades
/var/log/uwsgi
/var/log/varnish
/var/log/visualdiff
/var/log/waterlines
/var/log/wdqs
/var/log/wikidata
/var/log/wikidatadump
/var/log/wmf-auto-reimage
/var/log/zookeeper
/var/log/zotero
/var/log/zuul
I've been researching what the logback/log4j/log4j2 migration could look like, starting with syslog + unix socket:
- AFAICS for logback the syslog appender doesn't support unix socket, only a remote host: https://logback.qos.ch/apidocs/ch/qos/logback/classic/net/SyslogAppender.html
- A stackoverflow post outlining how to do logback/log4j syslog with syslog4j (which looks like it is dead/unmaintained) https://stackoverflow.com/questions/32053768/using-syslogs-unix-socket-with-log4j2
- Java doesn't seem to support unix socket (AF_UNIX) out of the box, though that's possible with https://github.com/mcfunley/juds
As for Python programs using python-logstash, we should be able to use logging.handlers.SysLogHandler instead and format logs as JSON.
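A minimal sketch of what that could look like, using only the standard library. The JSON field names, the `JsonFormatter` class and the `make_syslog_logger` helper are illustrative assumptions, not an agreed schema; `/dev/log` is assumed to be the local syslog socket.

```python
import json
import logging
import logging.handlers


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object (field names are illustrative)."""

    def format(self, record):
        payload = {
            "message": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def make_syslog_logger(name, address="/dev/log"):
    """Hypothetical helper: attach a JSON-formatting SysLogHandler to a logger."""
    handler = logging.handlers.SysLogHandler(address=address)
    # SysLogHandler prepends `ident` to every message; used here as the syslog tag.
    handler.ident = f"{name}: "
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage would be along the lines of `make_syslog_logger("myapp").info("started")`, leaving it to rsyslog to parse the JSON part of the message.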
I experimented with emulating a json_lines-compatible local UDP endpoint with rsyslog, so that JSON sent over UDP ends up in Kafka. It is only an experiment, but it shows this is feasible for cases where syslog on a unix socket + JSON would be overly complex or unmaintainable. The main drawback is that UDP on localhost has no SCM_CREDENTIALS support out of the box, so we can't attach the uid/gid/pid/etc. of the calling process to the logs if we want to.
input(type="imudp" port="11514" address="localhost" ruleset="udp_json_kafka")

template(name="json_json" type="list") {
  property(name="$!all-json")
}

ruleset(name="udp_json_kafka") {
  action(type="mmjsonparse" cookie="" name="mmjsonparse")
  if $parsesuccess == "OK" then {
    action(
      broker=["logging-stretch01.logging.eqiad.wmflabs:9092"]
      type="omkafka"
      name="udp_json_kafka"
      topic="logging.rsyslog"
      template="json_json"
      confParam="queue.buffering.max.ms=50 batch.num.messages=1000"
    )
  }
}
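To exercise an endpoint like this from the producer side, something as small as the following should be enough. It is a sketch only: `send_json_udp` is a made-up helper, port 11514 is taken from the config above, and 127.0.0.1 is assumed to be what "localhost" resolves to for the listener.

```python
import json
import socket


def send_json_udp(event, host="127.0.0.1", port=11514):
    """Send one JSON-encoded event as a single UDP datagram; returns bytes sent."""
    payload = json.dumps(event).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

For example `send_json_udp({"message": "test", "level": "INFO"})`. Being UDP, the call succeeds whether or not anything is listening, so delivery has to be checked on the Kafka side.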
The audit/list of current and future log producers has been carried out as part of this task (see table at https://phabricator.wikimedia.org/T198756#4552987)