
Revisit default settings for c-foreach-restart
Open, Normal, Public

Description

With the default settings of c-foreach-restart it's no longer possible to reliably restart Cassandra on the restbase hosts (probably due to increased data set sizes or higher I/O), resulting in tracebacks like the following:

2018-06-14 08:46:12,288 INFO     [b] Disabling client ports...
2018-06-14 08:46:21,521 INFO     [b] Draining...
2018-06-14 08:51:33,693 INFO     [b] Stopping service cassandra-b
2018-06-14 08:51:37,530 INFO     [b] Starting service cassandra-b
2018-06-14 08:51:37,539 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:51:43,539 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:51:49,540 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:51:55,543 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:01,547 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:07,553 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:13,559 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:19,563 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:25,567 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:31,572 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:37,575 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:43,579 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:49,583 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:52:55,587 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:53:01,592 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:53:07,598 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:53:13,605 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:53:19,611 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:53:25,615 WARNING  [b] CQL (10.192.32.135:9042) not listening (will retry)...
2018-06-14 08:53:37,616 ERROR    [b] CQL (10.192.32.135:9042) DOWN
Traceback (most recent call last):
  File "/usr/bin/c-foreach-restart", line 60, in <module>
    main()
  File "/usr/bin/c-foreach-restart", line 29, in main
    post_shutdown=args.execute_post_shutdown
  File "/usr/share/cassandra-tools-wmf/cassandra/tools/instances.py", line 92, in restart
    raise Exception("{} restart FAILED".format(self.service_name))
Exception: cassandra-b restart FAILED

After some discussion with Eric I switched to "c-foreach-restart -d 10 -a 20 -r 12" which worked reliably (the remaining ~15 servers restarted without any issues with these settings).
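For illustration, the wait that times out in the log above can be sketched as a bounded polling loop. This is a minimal sketch and not the code in instances.py: the function names are made up, and it assumes that -a is the number of attempts and -r the pause in seconds between them (the log shows checks 6 seconds apart and a failure 120 seconds after the service start).

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_up(probe, attempts=20, retry_interval=6.0, sleep=time.sleep):
    """Poll probe() up to `attempts` times, pausing after each failed check.

    Hypothetical helper: `attempts` and `retry_interval` stand in for what
    the -a and -r flags are assumed to control.
    """
    for _ in range(attempts):
        if probe():
            return True
        sleep(retry_interval)
    return False


def worst_case_wait(attempts, retry_interval):
    """Seconds spent sleeping before giving up (connect time not included)."""
    return attempts * retry_interval


if __name__ == "__main__":
    print(worst_case_wait(20, 6))   # spacing seen in the log above
    print(worst_case_wait(20, 12))  # with "-a 20 -r 12"
```

Under those assumptions, 20 attempts at 6 seconds give CQL about 120 seconds to come up, and "-a 20 -r 12" doubles that to about 240 seconds. A real check would pass something like `lambda: port_is_open("10.192.32.135", 9042)` as the probe.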

Given that cassandra-tools-wmf is an internal package, should we update the defaults so that this works out of the box?

Completion criteria

  • Update defaults
  • Tag the debian branch (debian/1.0.3-1)
  • Build a package (ala gbp)
  • Upload package to apt repo
  • Upgrade machines

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 4 2018, 11:49 AM
MoritzMuehlenhoff triaged this task as Normal priority. · Jul 4 2018, 11:50 AM
mobrovac claimed this task. · Jul 4 2018, 3:46 PM
mobrovac added a project: Services (doing).
mobrovac added a subscriber: mobrovac.

SGTM. The default number of attempts is already 20, so only the retries and delay need to be changed.

Change 443848 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/debs/cassandra-tools-wmf@master] c-for-each: Increase retry and delay defaults

https://gerrit.wikimedia.org/r/443848

Eevans added a comment. · Jul 5 2018, 7:12 PM

> With the default settings of c-foreach-restart it's no longer possible to reliably restart Cassandra on the restbase hosts (probably due to increased data set sizes or higher I/O), resulting in tracebacks like the following:

ACK; The startups run long because the hosts are quite IO-bound.

> After some discussion with Eric I switched to "c-foreach-restart -d 10 -a 20 -r 12" which worked reliably (the remaining ~15 servers restarted without any issues with these settings).

Just for posterity's sake, the -d argument controls the delay between (successful) instance restarts. Adding some delay has proven to ease the rate of timeouts (and subsequent RESTBase 500s) by giving the driver time to sort out dead connections before moving on. So it is unrelated to the failures caused by long startups.

That said, having some default delay does make sense, I think (I always use this argument).
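To make the role of -d concrete, here is a rough sketch of a restart loop that pauses after each successful restart. It is not the actual c-foreach-restart code: `restart_each` and its parameters are hypothetical names, and skipping the pause after the last instance is a choice made for this sketch only.

```python
import time


def restart_each(instances, restart, delay=10.0, sleep=time.sleep):
    """Restart instances one by one, pausing `delay` seconds between them.

    `restart` is a caller-supplied callable that raises on failure, so a
    failed restart aborts the loop before any later instance is touched.
    `delay` stands in for the -d argument.
    """
    restarted = []
    for index, name in enumerate(instances):
        restart(name)  # an exception here stops the whole run
        restarted.append(name)
        if index < len(instances) - 1:
            # Give client drivers time to sort out dead connections.
            sleep(delay)
    return restarted
```

The pause only happens after a restart has succeeded, which is why it has no bearing on how long a single slow startup is allowed to take.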

Change 443848 merged by Eevans:
[operations/debs/cassandra-tools-wmf@master] c-foreach-restart: Increase retry and delay defaults

https://gerrit.wikimedia.org/r/443848

Eevans added a comment. · Jul 5 2018, 7:14 PM

The change is merged, but we still need to get a package built, have it added to the APT repo, and then upgrade the machines.

Eevans moved this task from Backlog to Next on the User-Eevans board.

> The change is merged, but we still need to get a package built, have it added to the APT repo, and then upgrade the machines.

I'll take care of that.

I had a look at the repo; the 1.0.2-1 package currently deployed to production has a date stamp from May 24 2017, while the 1.0.2 git tag is from November. Could you please double-check whether the changes in between are also okay for production rollout? Then we can tag it as 1.0.3 or so.

> I had a look at the repo; the 1.0.2-1 package currently deployed to production has a date stamp from May 24 2017, while the 1.0.2 git tag is from November. Could you please double-check whether the changes in between are also okay for production rollout? Then we can tag it as 1.0.3 or so.

The tag is correct, not the changelog. The changelog was bumped in 1e3e221 in May 2017 but only tagged in November, which is what is currently deployed in production. Between these two events there are commits fb9c5ae, fd67f89 and 7619c61, and all of these commits' code is in production, so the only difference between the code in production and master is 180f390. I think we can safely proceed with tagging and releasing v1.0.3.

Eevans added a comment. · Jul 6 2018, 4:04 PM

So (just to be clear), I use gbp on this repo, and the Debian packaging is in the debian branch; changes to master get merged there before building. I've done that (the merge), and tagged master as upstream/1.0.3. I also updated the changelog to 1.0.3-1.

@MoritzMuehlenhoff, assuming you are still planning to build/upload, I'll leave tagging debian to you, just in case any other changes are needed.

Eevans added a comment. · Jul 6 2018, 4:21 PM

> So (just to be clear), I use gbp on this repo, and the Debian packaging is in the debian branch; changes to master get merged there before building. I've done that (the merge), and tagged master as upstream/1.0.3. I also updated the changelog to 1.0.3-1.
> @MoritzMuehlenhoff, assuming you are still planning to build/upload, I'll leave tagging debian to you, just in case any other changes are needed.

Actually, I made a bit of a mess of the above by going through Gerrit for changes to the debian branch, and having them applied to master instead. I've fixed that now, but it required a force-push to master, so if you happened to update your clone between my last comment, and this one, then I apologize. :(

The version of c-foreach-restart as currently deployed on restbase* doesn't seem to use "c-foreach-restart -d 10 -a 20 -r 12" by default.

@Eevans Can we close this?

Eevans added a comment. · Jul 5 2019, 9:25 PM

> @Eevans Can we close this?

It's not done unfortunately. The changes discussed are merged to operations/debs/cassandra-tools-wmf, but the debian branch needs to be tagged (as debian/1.0.3-1), then a package needs to be built, uploaded, and the machines upgraded.

Eevans awarded a token. · Jul 5 2019, 9:25 PM
Eevans updated the task description. · Jul 5 2019, 9:28 PM