Improve cassandra JBOD integration post-reimage
Closed, Resolved, Public

Authored By: fgiunchedi, Jan 18 2019, 1:50 PM

Description

Today I reimaged restbase1016 (T212418: Memory error on restbase1016), and there are some things we should improve post-reimage:

1. /dev/sd*4 filesystems are formatted but not present in /etc/fstab
2. cassandra instances try to start at the first puppet run
3. the default cassandra process stays running after the first puppet run, even though we're explicitly marking the cassandra service as stopped

For the first item, ideally partman takes care of it; failing that, we can use puppet.

For the second item, on a newly imaged host puppet will start all cassandra instances, which will all try to bootstrap (and eventually all but one will fail). I think we should avoid that and instead selectively enable the instance(s) to start post-provisioning. We used to mask cassandra instances, but that no longer works as intended (T211027). I think the next best thing would be to use a flag file and add [ConditionPathExists](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#ConditionArchitecture=) to the cassandra systemd units: by default the file isn't there, and operators enabling / bootstrapping cassandra touch the file to enable the unit.
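
To illustrate the flag-file idea, here is a minimal sketch of a systemd drop-in. The unit name (cassandra-a), the drop-in path and the flag file path are all hypothetical, chosen only for illustration; this is not the change that was eventually merged.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/cassandra-a.service.d/flag.conf
# Unit name and flag path are illustrative assumptions.
[Unit]
# When the file is absent the start is skipped; a failed condition does not
# put the unit into the "failed" state.
# The path is written unquoted.
ConditionPathExists=/etc/cassandra-a/service-enabled
```

With something like this in place, an operator bootstrapping the instance would first `touch` the (hypothetical) flag file and then start the unit; until then, start requests leave the instance down.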

Details

Subject	Repo	Branch	Lines +/-
cassandra: add init.d 'stop' action	operations/puppet	production	+13 -1
cassandra: unquote ConditionPathExists argument	operations/puppet	production	+1 -1
cassandra: check for flag file before service startup	operations/puppet	production	+3 -0

Related Objects

Mentioned In: T222960: Fix restbase1017's physical rack
Mentioned Here: T222960: Fix restbase1017's physical rack
T211027: puppet (systemd::service) attempts to start manually masked units
T212418: Memory error on restbase1016

Event Timeline

fgiunchedi created this task. Jan 18 2019, 1:50 PM

Restricted Application added a subscriber: Aklapper. Jan 18 2019, 1:50 PM

fgiunchedi updated the task description. Jan 18 2019, 1:51 PM

Aklapper added projects: RESTBase-Cassandra, Services. Jan 18 2019, 4:17 PM

Aklapper removed subscribers: RESTBase-Cassandra, Services.

That's a good idea @fgiunchedi ! +1

But, would that work in conjunction with puppet? AFAIK, puppet will check the service's state and still declare a failed run (I'm thinking about cases where for maintenance reasons we may want to stop certain instances).

In T214166#4892815, @mobrovac wrote:

That's a good idea @fgiunchedi ! +1

But, would that work in conjunction with puppet? AFAIK, puppet will check the service's state and still declare a failed run (I'm thinking about cases where for maintenance reasons we may want to stop certain instances).

It might also result in the puppet run succeeding but the "units failed" icinga alert firing. Either way I think that's ok, since maintenance is an exceptional event and we might as well know about it one way or another.

Change 509409 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: check for flag file before service startup

https://gerrit.wikimedia.org/r/509409

gerritbot added a project: Patch-For-Review. May 10 2019, 2:08 PM

fgiunchedi added a project: User-fgiunchedi. May 13 2019, 9:03 AM

Change 509409 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: check for flag file before service startup

https://gerrit.wikimedia.org/r/509409

Change 510195 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: unquote ConditionPathExists argument

https://gerrit.wikimedia.org/r/510195

Change 510195 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: unquote ConditionPathExists argument

https://gerrit.wikimedia.org/r/510195

fgiunchedi updated the task description. May 16 2019, 7:55 AM

For the last problem, where the default cassandra instance keeps running: the unit is stopped according to systemd, but the process and cgroup are still there:

root@restbase1022:~# systemctl status cassandra
● cassandra.service
   Loaded: loaded (/etc/init.d/cassandra; generated; vendor preset: enabled)
   Active: inactive (dead) since Thu 2019-05-16 08:03:25 UTC; 1min 33s ago
     Docs: man:systemd-sysv-generator(8)
   CGroup: /system.slice/cassandra.service
           └─123225 java -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-Use

May 16 08:02:07 restbase1022 systemd[1]: Starting LSB: distributed storage system for structured data...
May 16 08:02:07 restbase1022 systemd[1]: Started LSB: distributed storage system for structured data.
May 16 08:03:25 restbase1022 systemd[1]: Stopping cassandra.service...
May 16 08:03:25 restbase1022 systemd[1]: Stopped cassandra.service.

cassand+ 123225 19.5  6.9 11091936 9116440 ?    SLl  08:02   0:28 java -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB -XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:+CMSClassUnloadingEnabled -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHea...

Change 510695 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: add init.d 'stop' action

https://gerrit.wikimedia.org/r/510695
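
For context, the status output above shows a unit generated from /etc/init.d/cassandra that systemd considers stopped while the java process survives. A 'stop' action has to actually terminate the process. The following is only a rough sketch of what such an action could call; the function name, the pidfile argument and the timeout are assumptions for illustration and do not reflect the contents of the patch.

```shell
#!/bin/sh
# Hypothetical helper for an init.d 'stop' action: terminate the process
# recorded in a pidfile and wait for it to exit.
# $1: pidfile path, $2: seconds to wait before giving up (default 10)
stop_by_pidfile() {
    pidfile=$1
    timeout=${2:-10}
    # Nothing to do when there is no pidfile.
    [ -r "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    # Process already gone: clean up the stale pidfile.
    if ! kill -0 "$pid" 2>/dev/null; then
        rm -f "$pidfile"
        return 0
    fi
    kill -TERM "$pid" 2>/dev/null
    while [ "$timeout" -gt 0 ]; do
        # kill -0 sends no signal; it only checks that the process exists.
        if ! kill -0 "$pid" 2>/dev/null; then
            rm -f "$pidfile"
            return 0
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    # Still running after the timeout: report failure to the caller.
    return 1
}
```

An init script could then dispatch to it from its `case "$1" in stop)` branch, passing whatever pidfile the service actually writes.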

fgiunchedi updated the task description. May 16 2019, 8:59 AM

Change 510695 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: add init.d 'stop' action

https://gerrit.wikimedia.org/r/510695

All done! Documentation updated at https://wikitech.wikimedia.org/wiki/Cassandra#Add_a_new_host_to_a_multi-instance_cluster

\o/ thank you @fgiunchedi !

Reopening since the first item isn't fixed, cf. https://phabricator.wikimedia.org/T222960#5327124

fgiunchedi mentioned this in T222960: Fix restbase1017's physical rack. Jul 12 2019, 8:07 AM

In T214166#5327456, @fgiunchedi wrote:

Reopening since the first item isn't fixed, cf. https://phabricator.wikimedia.org/T222960#5327124

I ran into the same issue with restbase1018, but it turned out to be only a missing Hiera setting. After merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536201/ it worked as expected, so I'm closing this task again.

MoritzMuehlenhoff updated the task description. Sep 12 2019, 3:08 PM
