Upgrade Cassandra clusters to 3.11.14:
- cassandra-dev
- sessionstore
- restbase
- aqs
- deployment-restbase04
- deployment-echostore02
- deployment-sessionstore04
- ml-cache
Change 911934 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] cassandra_dev: Upgrade cluster to 'dev' version (3.11.14)
Change 911934 merged by Eevans:
[operations/puppet@production] cassandra_dev: Upgrade cluster to 'dev' version (3.11.14)
Change 913951 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14
Change 913951 merged by Eevans:
[operations/puppet@production] sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14
Mentioned in SAL (#wikimedia-operations) [2023-05-01T16:03:55Z] <urandom> upgrading sessionstore2001 to Cassandra 3.11.14 - T335383
Change 913954 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] sessionstore: upgrade sessionstore200[2-3] to Cassandra 3.11.14
Change 913954 merged by Eevans:
[operations/puppet@production] sessionstore: upgrade sessionstore200[2-3] to Cassandra 3.11.14
Mentioned in SAL (#wikimedia-operations) [2023-05-01T16:19:55Z] <urandom> upgrading sessionstore2002 to Cassandra 3.11.14 - T335383
Mentioned in SAL (#wikimedia-operations) [2023-05-01T16:22:30Z] <urandom> upgrading sessionstore2003 to Cassandra 3.11.14 - T335383
Change 913983 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] sessionstore: upgrade eqiad servers to Cassandra 3.11.14
Change 913983 merged by Eevans:
[operations/puppet@production] sessionstore: upgrade eqiad servers to Cassandra 3.11.14
Mentioned in SAL (#wikimedia-operations) [2023-05-01T20:42:41Z] <urandom> upgrading sessionstore1001 to Cassandra 3.11.14 - T335383
Mentioned in SAL (#wikimedia-operations) [2023-05-01T20:45:06Z] <urandom> upgrading sessionstore1002 to Cassandra 3.11.14 - T335383
Mentioned in SAL (#wikimedia-operations) [2023-05-01T20:47:05Z] <urandom> upgrading sessionstore1003 to Cassandra 3.11.14 - T335383
Change 914797 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] restbase: upgrade Cassandra on restbase2012 & restbase1016
Change 914797 merged by Eevans:
[operations/puppet@production] restbase: upgrade Cassandra on restbase2012 & restbase1016
Mentioned in SAL (#wikimedia-operations) [2023-05-03T14:53:10Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2012.codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:03:24Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2012.codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:07:00Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:17:15Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Change 914851 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] restbase: upgrade cluster to Cassandra 3.11.14
Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:24:23Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:34:36Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Change 914851 merged by Eevans:
[operations/puppet@production] restbase: upgrade cluster to Cassandra 3.11.14
Mentioned in SAL (#wikimedia-operations) [2023-05-03T15:48:45Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase20[13-27].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T18:26:30Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase20[13-27].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T18:26:48Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[17-33].eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T21:31:49Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[17-33].eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Change 914897 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14
Change 914897 merged by Eevans:
[operations/puppet@production] aqs: canary aqs1010 & aqs2001 to Cassandra 3.11.14
Mentioned in SAL (#wikimedia-operations) [2023-05-03T22:00:26Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T22:08:04Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T22:11:28Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-03T22:19:40Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
/cc @BTullis, @JAllemandou
I've upgraded aqs1010 & aqs2001 to 3.11.14 as canaries, ahead of a planned cluster-wide upgrade tomorrow (I don't expect issues, but if there were any, the daily import would be as likely to provoke them as anything).
Change 915675 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] aqs: upgrade cluster to Cassandra 3.11.14
Change 915675 merged by Eevans:
[operations/puppet@production] aqs: upgrade cluster to Cassandra 3.11.14
Mentioned in SAL (#wikimedia-operations) [2023-05-04T14:12:20Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-04T14:34:27Z] <eevans@cumin1001> END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-04T14:36:49Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-04T15:57:32Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-04T15:58:29Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs10[11-21].eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Mentioned in SAL (#wikimedia-operations) [2023-05-04T17:22:25Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs10[11-21].eqiad.wmnet: Upgrade Cassandra - T335383 - eevans@cumin1001
Change 915846 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] deployment-prep: upgrade Cassandra (restbase) to 3.11.14
Change 915846 merged by Eevans:
[operations/puppet@production] deployment-prep: upgrade Cassandra (restbase) to 3.11.14
While upgrading Cassandra on deployment-restbase04 I discovered existing Puppet errors:
eevans@deployment-restbase04:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Class[Lvs::Realserver]:
  parameter 'realserver_ips' variant 0 expects size to be at least 1, got 0
  parameter 'realserver_ips' variant 1 expects a Hash value, got Array (file: /etc/puppet/modules/profile/manifests/lvs/realserver.pp, line: 27, column: 5) on node deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
eevans@deployment-restbase04:~$
Looks like it's been this way since early March(!) when @jbond committed b45eb896361. Apparently realserver_ips was always unset (and the outcome was OK as far as this env goes). Presumably the answer is to pass in a value now, but what? AFAICT, we're not actually using LVS in deployment-prep; it only gets pulled in because we're using role::restbase::production, but I'd hate to introduce networking issues. @hnowlan any ideas?
Change 917407 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] ml-cache: upgrade Cassandra to 3.11.14
Change 917407 merged by Eevans:
[operations/puppet@production] ml-cache: upgrade Cassandra to 3.11.14
@Eevans Sorry for the delay in responding. The patch above was created to make sure we spot issues in production quickly. This has the side effect that this profile now needs real information instead of just quietly ignoring the issue. This is achieved by adding the following hiera data to Horizon:
profile::lvs::realserver::pools:
  restbase-https:
    services:
      - restbase
      - envoyproxy
Going further, Puppet uses the information above to search the service catalogue for the correct IP address. This is when I noticed that the service catalogue, AFAICT, doesn't work in beta. As such we need some data for service::catalog; I added the following to Horizon to mock this service, but it will likely need updating with some real IPs/ports etc.:
service::catalog:
  restbase-https:
    description: RESTBase, restbase.svc.%{::site}.wmnet - HTTPS
    discovery:
      - active_active: true
        dnsdisc: restbase
      - active_active: true
        dnsdisc: restbase-async
    encryption: true
    ip:
      eqiad:
        default: 127.0.0.1
    lvs:
      class: low-traffic
      conftool:
        cluster: restbase
        service: restbase-ssl
      depool_threshold: '.5'
      enabled: true
      monitors:
        IdleConnection:
          max-delay: 300
          timeout-clean-reconnect: 3
        ProxyFetch:
          url:
            - https://localhost/
      scheduler: wrr
    port: 7443
    probes:
      - type: http
      - type: swagger
    sites:
      - eqiad
      - codfw
    state: production
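To make the lookup described above concrete, here is a minimal Python sketch of it. This is not the actual Puppet code: the function name resolve_realserver_ips is hypothetical, and it assumes that pool names match service::catalog keys (as they do for restbase-https here) and that IPs live under ip -> site -> label. With no catalog entry the result is an empty list, which corresponds to the "expects size to be at least 1, got 0" failure seen on deployment-restbase04.

```python
from typing import Any, Dict, List


def resolve_realserver_ips(
    pools: Dict[str, Any], catalog: Dict[str, Any], site: str
) -> List[str]:
    """Collect the IPs that the catalog lists for each pool at one site.

    Hypothetical helper: assumes pool names are also service::catalog keys.
    """
    ips = set()
    for pool_name in pools:
        entry = catalog.get(pool_name, {})
        # Assumed shape, mirroring the hiera: ip -> <site> -> <label> -> address
        ips.update(entry.get("ip", {}).get(site, {}).values())
    return sorted(ips)


# Data shaped like the hiera added to Horizon (only the keys used here).
pools = {"restbase-https": {"services": ["restbase", "envoyproxy"]}}
catalog = {
    "restbase-https": {"ip": {"eqiad": {"default": "127.0.0.1"}}, "port": 7443}
}

# With the mocked catalog entry there is one IP to hand to the realserver class.
print(resolve_realserver_ips(pools, catalog, "eqiad"))

# Without any catalog data the list is empty, i.e. the failing case.
print(resolve_realserver_ips(pools, {}, "eqiad"))
```

The 127.0.0.1 value is only the mock from the hiera above; a real address would be needed before anything actually relies on it.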
I would have thought that the service catalogue was useful to many other services, and something should probably get put into the deployment-prep global space (for now I added all of these to the deployment-restbase prefix).
At this point we have an LVS IP address, and so envoy wants to set up a service_proxy. For this I needed to add the following two bits of data. These override values in the global space for deployment-prep, so it might be better to just add these to that space. However, if you do, I think you will likely need service::catalog entries for all services listed in profile::services_proxy::envoy::enabled_listeners:
profile::services_proxy::envoy::enabled_listeners:
  - restbase-https
profile::services_proxy::envoy::listeners:
  - name: restbase-https
    port: 8443
    service: restbase-https
    timeout: 60s
    upstream: '%{facts.networking.fqdn}'
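If these values are moved to the global deployment-prep space, a quick audit along the following lines could show which enabled listeners still lack a service::catalog entry. This is a rough Python sketch, not existing tooling: the function name missing_catalog_entries is hypothetical, and it assumes each listener's service field names its catalog key, falling back to the listener name when no listener definition exists.

```python
from typing import Any, Dict, List


def missing_catalog_entries(
    enabled_listeners: List[str],
    listeners: List[Dict[str, Any]],
    catalog: Dict[str, Any],
) -> List[str]:
    """Return the enabled listeners whose service has no catalog entry."""
    by_name = {listener["name"]: listener for listener in listeners}
    missing = []
    for name in enabled_listeners:
        listener = by_name.get(name)
        # Assumption: 'service' is the service::catalog key for this listener.
        service = listener["service"] if listener else name
        if service not in catalog:
            missing.append(name)
    return missing


# Data shaped like the hiera above (only the keys used here).
enabled = ["restbase-https"]
listeners = [
    {"name": "restbase-https", "port": 8443, "service": "restbase-https"}
]
catalog = {"restbase-https": {"port": 7443}}

# Nothing is missing while the catalog entry exists.
print(missing_catalog_entries(enabled, listeners, catalog))

# With an empty catalog every enabled listener is reported.
print(missing_catalog_entries(enabled, listeners, {}))
```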
With all this, Puppet is running, but like I said, some of the hiera will need to be audited and likely corrected.