Improve cassandra JBOD integration post-reimage
Closed, Resolved, Public

Description

Today I reimaged restbase1016 as part of T212418: Memory error on restbase1016, though there are some things we should improve post-reimage:

  • /dev/sd*4 filesystems are formatted but not present in /etc/fstab
  • cassandra instances try to start at the first puppet run
  • the default cassandra process stays running after the first puppet run, even though we're explicitly marking the cassandra service as stopped

For the first item, ideally partman takes care of that; failing partman, we can use puppet.

For the second item, on a newly imaged host puppet starts all cassandra instances, which then all try to bootstrap (and eventually all but one fail). I think we should avoid that and selectively enable which instance(s) to start post-provisioning. We used to mask cassandra instances, though that's no longer working as intended (T211027). I think the next best thing would be to use a flag file and add [ConditionPathExists](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#ConditionArchitecture=) to the cassandra systemd units: by default the file isn't there, and operators enabling / bootstrapping cassandra will touch the file to enable the unit.
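
To make the flag-file idea concrete, here is a minimal sketch of what such a unit fragment could look like. The unit name (cassandra-a.service), the drop-in location and the flag file path are made up for illustration and are not taken from the actual puppet change:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/cassandra-a.service.d/flag.conf
# Unit name and flag file path are illustrative assumptions.
[Unit]
# If this file does not exist, systemd skips starting the unit;
# the unit stays inactive rather than entering the failed state.
# The path is written bare, without surrounding quotes.
ConditionPathExists=/etc/cassandra-a/service-enabled
```

With something like this in place, an operator would `touch` the flag file (here the assumed /etc/cassandra-a/service-enabled) and then start the instance; until then a start request for the unit is skipped.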

Event Timeline

mobrovac subscribed.

That's a good idea @fgiunchedi ! +1

But, would that work in conjunction with puppet? AFAIK, puppet will check the service's state and still declare a failed run (I'm thinking about cases where for maintenance reasons we may want to stop certain instances).

It might also result in the puppet run succeeding but the "units failed" icinga alert failing. Either way I think that's ok, as the maintenance is an exceptional event and we might as well know about it one way or another.

Change 509409 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: check for flag file before service startup

https://gerrit.wikimedia.org/r/509409

Change 509409 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: check for flag file before service startup

https://gerrit.wikimedia.org/r/509409

Change 510195 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: unquote ConditionPathExists argument

https://gerrit.wikimedia.org/r/510195

Change 510195 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: unquote ConditionPathExists argument

https://gerrit.wikimedia.org/r/510195

For the last problem, where the default cassandra instance keeps running: it is stopped according to systemd, but the process and its cgroup are still there:

root@restbase1022:~# systemctl status cassandra
● cassandra.service
   Loaded: loaded (/etc/init.d/cassandra; generated; vendor preset: enabled)
   Active: inactive (dead) since Thu 2019-05-16 08:03:25 UTC; 1min 33s ago
     Docs: man:systemd-sysv-generator(8)
   CGroup: /system.slice/cassandra.service
           └─123225 java -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-Use

May 16 08:02:07 restbase1022 systemd[1]: Starting LSB: distributed storage system for structured data...
May 16 08:02:07 restbase1022 systemd[1]: Started LSB: distributed storage system for structured data.
May 16 08:03:25 restbase1022 systemd[1]: Stopping cassandra.service...
May 16 08:03:25 restbase1022 systemd[1]: Stopped cassandra.service.
cassand+ 123225 19.5  6.9 11091936 9116440 ?    SLl  08:02   0:28 java -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB -XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:+CMSClassUnloadingEnabled -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHea...

Change 510695 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: add init.d 'stop' action

https://gerrit.wikimedia.org/r/510695

Change 510695 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: add init.d 'stop' action

https://gerrit.wikimedia.org/r/510695

fgiunchedi removed fgiunchedi as the assignee of this task.
fgiunchedi updated the task description.

Reopening since the first item isn't fixed, cfr https://phabricator.wikimedia.org/T222960#5327124

MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff subscribed.

I ran into the same issue with restbase1018, but it turned out it was only missing a Hiera setting. After merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536201/ it worked as expected, so I'm closing this task again.