Page MenuHomePhabricator

Send cassandra3 (new hosts) logs to logstash
Closed, ResolvedPublic

Event Timeline

BTullis triaged this task as High priority.
BTullis subscribed.

I'll have a look at this, because I agree that it would be really beneficial to get this sorted before completing the migration.

Is it this particular dashboard cassandra_aqs upon which you're expecting to see cassandra logs @JAllemandou ?

Is it this particular dashboard cassandra_aqs upon which you're expecting to see cassandra logs @JAllemandou ?

Yes @BTullis :)

I have been looking into this, but have yet to find the reason that they are not making it all the way through to the Logstash dashboard.

The elements of the logging from cassandra all appear to be in place:

  • A logback appender specified in /etc/cassandra-[ab]/logback.xml
<appender name="UDP" class="net.logstash.logback.appender.LogstashSocketAppender">
  <host>localhost</host>
  <port>11514</port>
  <customFields>{"program":"cassandra", "cluster":"Analytics Query Service Storage", "instance_name":"a"}</customFields>
  <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
    <level>INFO</level>
  </filter>
</appender>
  • An rsyslog listener running on port 11514
btullis@aqs1010:/etc/cassandra-a$ sudo netstat -alntpu|grep 11514
udp        0      0 127.0.0.1:11514         0.0.0.0:*                           1013/rsyslogd       
udp6       0      0 ::1:11514               :::*                                1013/rsyslogd
  • Kafka output to the kafka-logging cluster is specified in /etc/rsyslog.d/50-udp-json-logback-compat.conf
  • I have verified that when restarting the cassandra-a service on aqs1010, lots of TCP streams are sent to kafka-logging100[1-3] on port 9093 by using tcpdump.

So this suggests that the logs are successfully leaving the AQS hosts and arriving at the Kafka logging cluster.

On second thoughts, I'm not so sure about this now.

It looks like Cassandra on the new hosts isn't sending any traffic to localhost port 11514 when the service restarts.

I ran the following on aqs1004 and aqs1010:

sudo tcpdump -i any udp port 11514

When I restarted the cassandra-a service on aqs1004 it generated lots of packets to that port, in conjunction with lines being written to /var/log/cassandra/system-a.log

btullis@aqs1004:~$ sudo tcpdump -i any udp port 11514
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes


14:09:19.203774 IP localhost.43306 > localhost.11514: UDP, length 342
14:09:30.473945 IP localhost.58783 > localhost.11514: UDP, length 5170
14:09:30.474799 IP localhost.58783 > localhost.11514: UDP, length 361
14:09:30.697930 IP localhost.58783 > localhost.11514: UDP, length 364
14:09:30.698429 IP localhost.58783 > localhost.11514: UDP, length 365
14:09:30.710946 IP localhost.58783 > localhost.11514: UDP, length 393
14:09:30.769827 IP localhost.58783 > localhost.11514: UDP, length 337

When I restarted the cassandra-a service on aqs1010, no traffic was generated.

btullis@aqs1010:~$ sudo tcpdump -i any udp port 11514
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

So this makes me think that it is something to do with the logback configuration on these hosts.

@Eevans - would you be able to provide any insight on this at all? Thanks.

Oh, I have something!

Compare:

btullis@aqs1004:/etc/cassandra-a$ ls -lt /srv/deployment/cassandra/logstash-logback-encoder/lib/logstash-logback-encoder-4.2.jar
-rw-r--r-- 1 deploy-service deploy-service 735565 Jul  2  2018 /srv/deployment/cassandra/logstash-logback-encoder/lib/logstash-logback-encoder-4.2.jar

With:

btullis@aqs1010:/usr/share/cassandra/lib$ ls -lt /srv/deployment/cassandra/logstash-logback-encoder/lib/logstash-logback-encoder-4.2.jar
-rw-r--r-- 1 deploy-service deploy-service 74 Mar 12  2021 /srv/deployment/cassandra/logstash-logback-encoder/lib/logstash-logback-encoder-4.2.jar

It looks like we haven't successfully deployed that logstash-logback-encoder-4.2.jar file.

Change 751948 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/software/logstash-logback-encoder@master] Add the aqs_next hosts to the deployment targets

https://gerrit.wikimedia.org/r/751948

I have created a CR to add the aqs_next hosts to the deployment targets for logstash-logback-encoder.

Also I've checked that @Eevans is happy for me to deploy the software to the new hosts.

I will use a procedure along the lines of:

ssh deploy1002.eqiad.wmnet
cd /srv/deployment/cassandra/logstash-logback-encoder
git pull
scap deploy -l aqs101* "Deploying logstash-logback-encoder to production"

Change 751948 merged by Btullis:

[operations/software/logstash-logback-encoder@master] Add the aqs_next hosts to the deployment targets

https://gerrit.wikimedia.org/r/751948

I am having some difficulty with the deploy.

There's a paste here P18407 with some useful output.

@Ottomata tried a deploy and it appeared successful, but the jars were still not deployed.

I have successfully deployed the logstash-logback-encoder and restarted aqs1010-a. It send the logs through to logstash as expected.

image.png (815×1 px, 153 KB)

I will schedule a rolling reboot of all aqs_next instances except one tomorrow. We will leave one instance running in order to support research for T298516: Investigate high levels of garbage collection on new AQS nodes.

Sincere thanks to @Eevans for his help with this work.

I have restarted all instances expect aqs1014-b (which is being kept running in support of T298516 - When it is restarted it will pick up the new logging configuration.