
Replace expiring Cassandra SSL certificates (sessionstore cluster)
Closed, Resolved, Public

Assigned To
Authored By
Eevans
Jan 23 2023, 5:48 PM
Referenced Files
F36578077: image.png
Feb 1 2023, 9:24 PM
F36578075: image.png
Feb 1 2023, 9:24 PM
F36578073: image.png
Feb 1 2023, 9:24 PM

Description

All six sessionstore cluster nodes have SSL certificates that will expire in the coming month, and need to be replaced.
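
One way to confirm how much time is left on a certificate is to compare its `notAfter` date with the current time. The following is a minimal Python sketch, assuming the `notAfter` string has already been read from the certificate (for example from the output of `openssl x509 -enddate`). The function name and the date used below are made up for illustration and are not taken from the sessionstore nodes.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate notAfter date.

    `not_after` uses the "%b %d %H:%M:%S %Y GMT" form,
    e.g. "Feb 20 12:00:00 2023 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires - now.timestamp()) / 86400


# Hypothetical expiry date and reference time, for illustration only.
example_not_after = "Feb 20 12:00:00 2023 GMT"
reference_time = datetime(2023, 1, 23, 12, 0, 0, tzinfo=timezone.utc)
print(f"{days_until_expiry(example_not_after, reference_time):.0f} days left")
```

A negative result would mean the certificate has already expired.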


Edit 2023-01-25:

After the events of https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues, we should exercise additional care when completing this work.

Proposal:

  • codfw:
    • De-pool sessionstore in codfw
    • Replace SSL certificates
    • Perform a rolling restart of Cassandra
    • Perform a rolling restart of the sessionstore service (Kask) (skipped)
    • Re-pool codfw
  • eqiad:
    • De-pool sessionstore in eqiad (skipped)
    • Replace SSL certificates
    • Perform a rolling restart of Cassandra
    • Perform a rolling restart of the sessionstore service (Kask) (skipped)
    • Re-pool eqiad (skipped)
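
The proposal above can also be written down as an ordered checklist, which makes the skipped steps explicit. This is only a sketch: the step strings are descriptive labels copied from the list, not real commands, and `Step` and `plan_for` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    skipped: bool = False


def plan_for(site: str, depool: bool) -> list[Step]:
    """Build the per-datacenter checklist from the proposal."""
    return [
        Step(f"De-pool sessionstore in {site}", skipped=not depool),
        Step("Replace SSL certificates"),
        Step("Perform a rolling restart of Cassandra"),
        # The Kask restart is marked as skipped for both datacenters.
        Step("Perform a rolling restart of the sessionstore service (Kask)", skipped=True),
        Step(f"Re-pool {site}", skipped=not depool),
    ]


# codfw was de-pooled for the work; for eqiad the de-pool and re-pool are marked skipped.
plans = {
    "codfw": plan_for("codfw", depool=True),
    "eqiad": plan_for("eqiad", depool=False),
}

for site, steps in plans.items():
    print(f"{site}:")
    for step in steps:
        mark = "skip" if step.skipped else "todo"
        print(f"  [{mark}] {step.description}")
```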

Event Timeline

Eevans triaged this task as Medium priority on Jan 23 2023, 5:49 PM.

Mentioned in SAL (#wikimedia-operations) [2023-02-01T20:43:40Z] <urandom> depooling sessionstore —codfw— in preparation for Cassandra restarts — T327675

Mentioned in SAL (#wikimedia-operations) [2023-02-01T21:02:18Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore200*: Applying new TLS certificates — T327675 - eevans@cumin1001

Eevans updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-02-01T21:19:48Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore200*: Applying new TLS certificates — T327675 - eevans@cumin1001

I ran siege against sessionstore.svc.codfw.wmnet while codfw was de-pooled and the rolling restart was in progress.

eevans@deploy2002:~/T327954$ date -Iseconds; siege -f urls.txt -i -c 64 -t 15M -d 0.25
2023-02-01T20:59:29+00:00
** SIEGE 4.0.4
** Preparing 64 concurrent users for battle.
The server is now under siege...
Lifting the server siege...
Transactions:                 322624 hits
Availability:                 100.00 %
Elapsed time:                 899.24 secs
Data transferred:              35.98 MB
Response time:                  0.05 secs
Transaction rate:             358.77 trans/sec
Throughput:                     0.04 MB/sec
Concurrency:                   18.95
Successful transactions:      319320
Failed transactions:               3
Longest transaction:            1.11
Shortest transaction:           0.04
 
eevans@deploy2002:~/T327954

There were only 3 failed HTTP transactions, and 10 errors logged (all of the expected variety, "connection reset", "EOF", "connection refused").
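
To put the failure count in proportion, the arithmetic behind the siege summary can be reproduced in a few lines of Python. The figures are copied from the output above. The availability formula (transactions divided by transactions plus failures) is an assumption about how siege derives that number, not something stated in this task.

```python
# Figures copied from the siege summary above.
transactions = 322624   # "Transactions" line
failed = 3              # "Failed transactions" line
elapsed_secs = 899.24   # "Elapsed time" line

# Assumption: availability = transactions / (transactions + failed).
attempts = transactions + failed
availability_pct = 100 * transactions / attempts
failure_pct = 100 * failed / attempts
rate = transactions / elapsed_secs

print(f"availability:     {availability_pct:.2f} %")
print(f"failure rate:     {failure_pct:.5f} %")
print(f"transaction rate: {rate:.2f} trans/sec")
```

Under that assumption, three failures out of more than 320,000 requests is below one thousandth of a percent, which is why the reported availability rounds to 100.00 %.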

image.png (810×1 px, 297 KB)

image.png (233×928 px, 29 KB)

image.png (232×924 px, 17 KB)

Mentioned in SAL (#wikimedia-operations) [2023-02-01T21:39:39Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore100*: Applying new TLS certificates — T327675 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-02-01T21:57:45Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore100*: Applying new TLS certificates — T327675 - eevans@cumin1001

Eevans claimed this task.
Eevans updated the task description.

Done!