Invalid Cassandra seeds list is spamming the debug logs
Closed, Resolved (Public)

Description

The list of Cassandra seed nodes is generated from the entire list of Cassandra instances (excluding the current one), and additionally includes the main hostname of each host. Since the main hosts where the instances reside are not actually running Cassandra, this results in a very large number of Gossip-related connection failure log messages (which conspire to obscure valuable log data).
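
To make the intended behaviour concrete, here is a minimal sketch of a seed list that keeps only instance addresses and leaves out the main hostnames. It is not the Puppet template itself; the `build_seeds` function, the host map, and all `.example` names are hypothetical.

```python
def build_seeds(hosts, current_instance):
    """Return seed addresses: every Cassandra instance except the current one.

    `hosts` maps a main hostname to the instance addresses running on it.
    Main hostnames are deliberately left out, because no Cassandra
    process listens on them.
    """
    seeds = []
    for main_host in sorted(hosts):
        for instance in hosts[main_host]:
            if instance != current_instance:
                seeds.append(instance)
    return seeds


# Hypothetical two-host cluster with two instances per host.
hosts = {
    "host1.example": ["host1-a.example", "host1-b.example"],
    "host2.example": ["host2-a.example", "host2-b.example"],
}
print(",".join(build_seeds(hosts, "host1-a.example")))
```

With this shape, Gossip only ever tries addresses where an instance is expected to be listening.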

Event Timeline

Eevans added a subscriber: Joe.

Change 370554 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] Cassandra: Do not include the main DNS in the list of seeds

https://gerrit.wikimedia.org/r/370554

Change 370554 merged by Filippo Giunchedi:
[operations/puppet@production] Cassandra: Do not include the main DNS in the list of seeds

https://gerrit.wikimedia.org/r/370554

mobrovac added a subscriber: mobrovac.

This has been merged and Puppet has been run. The main IPs are no longer in the seeds lists, so resolving.

Whoops, just noticed that this wiped out the seeds list in deployment-prep:

eevans@deployment-restbase01:~$ grep -B 9 -A 3 seeds: /etc/cassandra/cassandra.yaml
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          # Omit own host name / IP in multi-node clusters (see
          # https://phabricator.wikimedia.org/T91617).
          # Also disregard the main DNS interfaces of each node when
          # multiple instances are colocated on the same node (see
          # https://phabricator.wikimedia.org/T172610)
          
         - seeds: 
# For workloads with more data than can fit in memory, Cassandra's
# bottleneck will be reads that need to fetch data from
# disk. "concurrent_reads" should be set to (16 * number_of_drives) in
eevans@deployment-restbase01:~$

Reopening...
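
A regression like the one above could be caught with a small check on the rendered file. The sketch below is an assumption-laden illustration, not part of the deployed tooling: `parse_seeds` is a hypothetical helper that scans the text line by line rather than using a YAML parser.

```python
import re


def parse_seeds(yaml_text):
    """Return the seed addresses from the first non-comment `seeds:` entry.

    An empty list means the entry is missing or has no addresses.
    """
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comments, including ones that mention "seeds"
        match = re.match(r"^-?\s*seeds:\s*(.*)$", stripped)
        if match:
            value = match.group(1).strip().strip("\"'")
            return [s.strip() for s in value.split(",") if s.strip()]
    return []


broken = """
          # seeds is actually a comma-delimited list of addresses.
         - seeds: 
"""
if not parse_seeds(broken):
    print("seeds list is empty")
```

On a real host the text would come from /etc/cassandra/cassandra.yaml instead of the inline string.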

I've looked at this, and it's not clear to me why this is happening. Assuming the cause is the new conditional added to filter seeds (wikimedia/puppet/.../cassandra.yaml-3.x.erb), I would expect that conditional to be true by virtue of @instance_count == 1.

That said, this is all starting to look quite brittle to me (for example, we're now matching on IPv4 addresses and specific hostnames). With or without a fix for this particular issue, I'd be in favor of moving to a small statically configured list of seeds.
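
As a rough illustration of the static alternative, the sketch below picks seeds from a short fixed list instead of deriving them from every instance. The `STATIC_SEEDS` values and the `seeds_for` helper are hypothetical; falling back to the full list when nothing else remains is an assumption based on the "multi-node clusters" wording in the template comment.

```python
# Hypothetical fixed seed list; real values would be chosen per cluster.
STATIC_SEEDS = ["seed-a.example", "seed-b.example", "seed-c.example"]


def seeds_for(own_address, static_seeds=STATIC_SEEDS):
    """Return the static seeds, omitting this node's own address.

    If omitting the node's own address would leave the list empty
    (a single-node cluster), the full list is returned instead.
    """
    others = [seed for seed in static_seeds if seed != own_address]
    return others if others else list(static_seeds)


print(",".join(seeds_for("seed-a.example")))
```

No address or hostname pattern matching is involved, so there is no filter that can accidentally empty the list.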

Change 377997 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] Cassandra: Include only instance DNS' in the list of seeds

https://gerrit.wikimedia.org/r/377997

Change 377997 merged by Gehel:
[operations/puppet@production] Cassandra: Include only instance DNS' in the list of seeds

https://gerrit.wikimedia.org/r/377997

mobrovac removed a project: Patch-For-Review.

Ok, the above patch truly fixed the issue. There were problems in the seed list in both labs and staging, and they have now been remedied.