Upgrade Cassandra to latest 3.11.x (3.11.14)
Closed, Resolved, Public

Description

Upgrade Cassandra clusters to 3.11.14:

  • cassandra-dev
  • sessionstore
  • restbase
  • aqs
  • deployment-restbase04
  • deployment-echostore02
  • deployment-sessionstore04
  • ml-cache

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 25 2023, 8:26 PM
Eevans triaged this task as Medium priority.

Change 911934 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra_dev: Upgrade cluster to 'dev' version (3.11.14)

https://gerrit.wikimedia.org/r/911934

Change 911934 merged by Eevans:

[operations/puppet@production] cassandra_dev: Upgrade cluster to 'dev' version (3.11.14)

https://gerrit.wikimedia.org/r/911934

Change 913951 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/913951

Change 913951 merged by Eevans:

[operations/puppet@production] sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/913951

Mentioned in SAL (#wikimedia-operations) [2023-05-01T16:03:55Z] <urandom> upgrading sessionstore2001 to Cassandra 3.11.14 - T335383

Change 913954 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore: upgrade sessionstore200[2-3] to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/913954

Change 913954 merged by Eevans:

[operations/puppet@production] sessionstore: upgrade sessionstore200[2-3] to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/913954

Mentioned in SAL (#wikimedia-operations) [2023-05-01T16:19:55Z] <urandom> upgrading sessionstore2002 to Cassandra 3.11.14 - T335383

Mentioned in SAL (#wikimedia-operations) [2023-05-01T16:22:30Z] <urandom> upgrading sessionstore2003 to Cassandra 3.11.14 - T335383

Change 913983 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore: upgrade eqiad servers to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/913983

Change 913983 merged by Eevans:

[operations/puppet@production] sessionstore: upgrade eqiad servers to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/913983

Mentioned in SAL (#wikimedia-operations) [2023-05-01T20:42:41Z] <urandom> upgrading sessionstore1001 to Cassandra 3.11.14 - T335383

Mentioned in SAL (#wikimedia-operations) [2023-05-01T20:45:06Z] <urandom> upgrading sessionstore1002 to Cassandra 3.11.14 - T335383

Mentioned in SAL (#wikimedia-operations) [2023-05-01T20:47:05Z] <urandom> upgrading sessionstore1003 to Cassandra 3.11.14 - T335383

@hnowlan & @BTullis let me know if you have any concerns about upgrading restbase & aqs respectively; otherwise I will probably start rolling this out sometime later this week!

Sounds good to me, thank you!

Change 914797 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] restbase: upgrade Cassandra on restbase2012 & restbase1016

https://gerrit.wikimedia.org/r/914797

Change 914797 merged by Eevans:

[operations/puppet@production] restbase: upgrade Cassandra on restbase2012 & restbase1016

https://gerrit.wikimedia.org/r/914797

Mentioned in SAL (#wikimedia-operations) [2023-05-03T14:53:10Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2012.codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:03:24Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2012.codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:07:00Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:17:15Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Change 914851 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] restbase: upgrade cluster to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/914851

Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:24:23Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:34:36Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Change 914851 merged by Eevans:

[operations/puppet@production] restbase: upgrade cluster to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/914851

Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:48:45Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[13-27].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T18:26:30Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[13-27].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T18:26:48Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[17-33].eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T21:31:49Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[17-33].eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Change 914897 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/914897

Change 914897 merged by Eevans:

[operations/puppet@production] aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/914897

Mentioned in SAL (#wikimedia-operations) [2023-05-03T22:00:26Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T22:08:04Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T22:11:28Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-03T22:19:40Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

/cc @BTullis, @JAllemandou

I've upgraded aqs1010 & aqs2001 to 3.11.14 as canaries, ahead of a planned cluster-wide upgrade tomorrow (I don't expect issues, but if there were any, the daily import would be as likely to provoke them as anything).

Change 915675 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: upgrade cluster to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/915675

Change 915675 merged by Eevans:

[operations/puppet@production] aqs: upgrade cluster to Cassandra 3.11.14

https://gerrit.wikimedia.org/r/915675

Mentioned in SAL (#wikimedia-operations) [2023-05-04T14:12:20Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-04T14:34:27Z] <eevans@cumin1001> END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-04T14:36:49Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-04T15:57:32Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-04T15:58:29Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs10[11-21].eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-05-04T17:22:25Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs10[11-21].eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001

Change 915846 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] deployment-prep: upgrade Cassandra (restbase) to 3.11.14

https://gerrit.wikimedia.org/r/915846

Change 915846 merged by Eevans:

[operations/puppet@production] deployment-prep: upgrade Cassandra (restbase) to 3.11.14

https://gerrit.wikimedia.org/r/915846

@elukey are we OK to upgrade ml-cache to 3.11.14?

While upgrading Cassandra on deployment-restbase04 I discovered existing Puppet errors:

eevans@deployment-restbase04:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Class[Lvs::Realserver]:
  parameter 'realserver_ips' variant 0 expects size to be at least 1, got 0
  parameter 'realserver_ips' variant 1 expects a Hash value, got Array (file: /etc/puppet/modules/profile/manifests/lvs/realserver.pp, line: 27, column: 5) on node deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
eevans@deployment-restbase04:~$

Looks like it's been this way since early March(!) when @jbond committed b45eb896361. Apparently realserver_ips was always unset (and the outcome was OK as far as this env goes). Presumably the answer is to pass in a value now, but what? AFAICT, we're not actually using LVS in deployment-prep, it gets pulled in because we're using role::restbase::production, but I'd hate to introduce networking issues. @hnowlan any ideas?
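
For anyone reading the error text above: it suggests realserver_ips now accepts either a non-empty array or a hash, and an empty array fails both variants. A minimal Python sketch of that constraint, inferred only from the error message (the function name is hypothetical, the treatment of an empty hash is an assumption, and this is not the actual Puppet type definition):

```python
def realserver_ips_is_valid(value):
    """Mimic the check implied by the Puppet error message.

    Assumption: any dict passes, since the error only says a Hash is expected.
    """
    if isinstance(value, dict):
        return True  # "variant 1 expects a Hash value"
    if isinstance(value, list):
        return len(value) >= 1  # "variant 0 expects size to be at least 1"
    return False


# The unset case seen on deployment-restbase04 resolves to an empty array.
print(realserver_ips_is_valid([]))
# A single mocked address would satisfy variant 0.
print(realserver_ips_is_valid(["127.0.0.1"]))
```

Under this reading, the catalog compiles again as soon as any address is supplied, which is why the open question is which value to pass rather than whether to pass one.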

Change 917407 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] ml-cache: upgrade Cassandra to 3.11.14

https://gerrit.wikimedia.org/r/917407

Change 917407 merged by Eevans:

[operations/puppet@production] ml-cache: upgrade Cassandra to 3.11.14

https://gerrit.wikimedia.org/r/917407

@jbond any suggestions?

@Eevans Sorry for the delay in responding. The patch above was created to make sure we spot issues in production quickly. This has the side effect that this profile now needs real information instead of just quietly ignoring the issue. This is achieved by adding the following hiera data to Horizon:

profile::lvs::realserver::pools:
  restbase-https:
    services:
    - restbase
    - envoyproxy

Going further, Puppet uses the information above to search the service catalogue for the correct IP address. This is when I noticed that the service catalogue, AFAICT, doesn't work in beta. As such, we need some data for service::catalog; I added the following to Horizon to mock this service, but it will likely need updating with some real IPs, ports, etc.:

service::catalog:
  restbase-https:
    description: RESTBase, restbase.svc.%{::site}.wmnet - HTTPS
    discovery:
    - active_active: true
      dnsdisc: restbase
    - active_active: true
      dnsdisc: restbase-async
    encryption: true
    ip:
      eqiad:
        default: 127.0.0.1
    lvs:
      class: low-traffic
      conftool:
        cluster: restbase
        service: restbase-ssl
      depool_threshold: '.5'
      enabled: true
      monitors:
        IdleConnection:
          max-delay: 300
          timeout-clean-reconnect: 3
        ProxyFetch:
          url:
          - https://localhost/
      scheduler: wrr
    port: 7443
    probes:
    - type: http
    - type: swagger
    sites:
    - eqiad
    - codfw
    state: production
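
The lookup described above (pool name, then service catalogue entry, then per-site IP) can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the actual Puppet code: the function name is hypothetical, and the dictionaries only copy the fields from the hiera data above that matter for the lookup.

```python
def resolve_realserver_ips(pools, catalog, site):
    """Collect the IPs of every pooled service that has an address for `site`."""
    ips = []
    for pool_name in pools:
        entry = catalog.get(pool_name)
        if entry is None:
            continue  # no catalogue entry, so there is nothing to resolve
        site_ips = entry.get("ip", {}).get(site, {})
        ips.extend(site_ips.values())
    return ips


# Trimmed copies of the hiera data shown above.
pools = {"restbase-https": {"services": ["restbase", "envoyproxy"]}}
catalog = {
    "restbase-https": {
        "ip": {"eqiad": {"default": "127.0.0.1"}},
        "port": 7443,
    }
}

# With no catalogue data (the situation in beta), the result is empty.
print(resolve_realserver_ips(pools, {}, "eqiad"))
# With the mocked entry, the placeholder address is found.
print(resolve_realserver_ips(pools, catalog, "eqiad"))
```

An empty result in the first call corresponds to the "expects size to be at least 1, got 0" failure; the mocked entry only makes the lookup succeed and says nothing about whether 127.0.0.1 is the right address.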

I would have thought that the service catalogue was useful to many other services, and something should probably get put into the deployment-prep global space (for now I added all of these to the deployment-restbase prefix).

At this point we have an LVS IP address, and so envoy wants to set up a service_proxy. For this I needed to add the following two bits of data. These override values in the global space for deployment-prep, so it might be better to just add these to that space. However, if you do, I think you will likely need service::catalog entries for all services listed in profile::services_proxy::envoy::enabled_listeners:

profile::services_proxy::envoy::enabled_listeners:
- restbase-https
profile::services_proxy::envoy::listeners:
- name: restbase-https
  port: 8443
  service: restbase-https
  timeout: 60s
  upstream: '%{facts.networking.fqdn}'

With all this, Puppet is running, but like I said, some of the hiera will need to be audited and likely corrected.
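
One part of that audit, checking that every enabled listener points at a service present in service::catalog, could be sketched in Python as below. This is a hedged illustration only: the function name is hypothetical, the data is hand-copied from the hiera snippets above, and "example-service" is an invented entry added purely to show a failing case.

```python
def audit_listeners(enabled, listeners, catalog):
    """Return a list of problems; an empty list means nothing was flagged."""
    by_name = {listener["name"]: listener for listener in listeners}
    problems = []
    for name in enabled:
        listener = by_name.get(name)
        if listener is None:
            problems.append(f"{name}: enabled but not defined in listeners")
        elif listener["service"] not in catalog:
            problems.append(
                f"{name}: service {listener['service']!r} missing from service::catalog"
            )
    return problems


listeners = [
    {"name": "restbase-https", "port": 8443, "service": "restbase-https", "timeout": "60s"},
]
catalog = {"restbase-https": {"port": 7443}}

# The data as added for deployment-restbase: nothing is flagged.
print(audit_listeners(["restbase-https"], listeners, catalog))
# "example-service" is hypothetical and has no listener definition, so it is flagged.
print(audit_listeners(["restbase-https", "example-service"], listeners, catalog))
```

A check like this only covers naming consistency; the mocked IPs and ports would still need to be reviewed by hand.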