
Eevans (Eric Evans)
Staff Site Reliability Engineer

User Details

User Since
Feb 27 2015, 10:47 PM (478 w, 8 h)
Availability
Available
IRC Nick
urandom
LDAP User
Eevans
MediaWiki User
EEvans (WMF)

Recent Activity

Today

Eevans committed rLPRI45411f7ef9b2: Rename cassandra user.
Sat, Apr 27, 1:34 AM
Eevans added a project to T363615: puppetserver1001.eqiad.wmnet is unresponsive: Infrastructure-Foundations.
Sat, Apr 27, 1:09 AM · Infrastructure-Foundations, SRE
Eevans added a comment to T363615: puppetserver1001.eqiad.wmnet is unresponsive.

Restarted via the drac and everything seems OK now. I skimmed the logs and didn't see anything that seemed unusual prior to the event.

Sat, Apr 27, 1:08 AM · Infrastructure-Foundations, SRE
Eevans added a comment to T363615: puppetserver1001.eqiad.wmnet is unresponsive.

Also unable to login via the serial console.

Sat, Apr 27, 12:52 AM · Infrastructure-Foundations, SRE
Eevans created T363615: puppetserver1001.eqiad.wmnet is unresponsive.
Sat, Apr 27, 12:45 AM · Infrastructure-Foundations, SRE

Yesterday

Eevans committed rLPRI4b94269d9183: cassandra: add (faux) password for cassandra-devel user.
Fri, Apr 26, 8:45 PM
Eevans added a comment to T362841: Degraded RAID on aqs1014.

The first device is done rebuilding:

Fri, Apr 26, 2:01 PM · Cassandra, SRE, ops-eqiad

Thu, Apr 25

Eevans added a comment to T362033: Degraded RAID on aqs1013.

Ok, the rebuild is complete.

Thu, Apr 25, 2:27 PM · Cassandra, SRE, ops-eqiad

Wed, Apr 24

Eevans added a comment to T362841: Degraded RAID on aqs1014.

2:23 PM <jclark-ctr> i am swapping sdf again
2:24 PM <jclark-ctr> swapped with one that was just erased

Wed, Apr 24, 7:32 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

Having some trouble adding sdf2 back into the array: mdadm: Cannot open /dev/sdf2: Device or resource busy :/

Wed, Apr 24, 6:07 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

Looking at lshw.log and the inventory on the iDRAC, it looks like all the drives are in order except that sdf and sdh are swapped in their slots. After sdf rebuilds, I can swap sdh.

Wed, Apr 24, 5:57 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

@Eevans Replaced drive

Wed, Apr 24, 3:24 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

I was asked to provide feedback from a MariaDB perspective (and on how consistent we want to be across different technologies within the same team).

We don't usually hand over dev accounts to the staging environment. Much development/staging work gets done on MariaDB instances outside of production, most notably the beta cluster (which has its own issues, but I assume setting up a dedicated project for Cassandra in Cloud VPS and giving access to that wouldn't be too hard). Given that it's outside of prod, the impact of mistakes or compromise is quite limited. It also discourages "testing in production" situations. I know the staging cluster has different data, but it's still in prod infra, with all the complexities/downsides that brings.

Wed, Apr 24, 2:58 PM · Patch-For-Review, Cassandra
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Wed, Apr 24, 1:33 PM · Patch-For-Review, Data-Persistence, Cassandra

Tue, Apr 23

Eevans added a comment to T362033: Degraded RAID on aqs1013.

Here is a transcript of everything done (for posterity's sake):

Tue, Apr 23, 11:15 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

@Eevans This one is out of warranty as well. Let me know if I am able to swap the drive; I can take care of it in the morning.

Tue, Apr 23, 11:04 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

Hi, is there any update with dev access for PCS devs?

Tue, Apr 23, 7:44 PM · Patch-For-Review, Cassandra
Eevans assigned T362841: Degraded RAID on aqs1014 to Jclark-ctr.

Hey @Jclark-ctr: I hope it's OK to assign this one to you as well.

Tue, Apr 23, 7:21 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362033: Degraded RAID on aqs1013.

@Eevans Hey, it looks like the same drive as in T354499 has failed again. Let me know if I can replace it again.

Sure, go ahead.

P.S. I think this is the 4th time. Are we just really unlucky, or is there some underlying factor at work?

Tue, Apr 23, 7:20 PM · Cassandra, SRE, ops-eqiad

Fri, Apr 19

Eevans added a comment to T362697: Create Cassandra tables for Commons Impact Metrics.

[ ... ]

Finally, something for you to consider: For image suggestions we created something we're calling Cassandra HTTP Gateway. The idea is that if you are persisting results exactly as you hope to return them, right down to attribute names, then we could give you an HTTP (REST) interface to retrieve them from. This wouldn't be a substitute for your service, but your service could make a simple HTTP request rather than query Cassandra directly. This should make things much simpler for you. The gateway returns JSON-encoded results as an object with a single attribute called rows, which is an array of JSON-encoded row objects ({rows: [{...}, {...}, ]}). For your case, we'd be able to set that up so that you could plug that array right into items: [] in your response objects. Let me know what you think!
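
To make the "plug that array right into items" idea concrete, here is a minimal Go sketch of a service re-wrapping a gateway payload. Only the rows envelope and the items attribute come from the comment above; the type names, the wrapRows helper, and the sample row fields (wiki, count) are hypothetical and chosen purely for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gatewayResponse mirrors the envelope described above: one "rows"
// attribute holding an array of row objects. Rows are kept as raw JSON
// so they can be passed through without knowing their schema.
type gatewayResponse struct {
	Rows []json.RawMessage `json:"rows"`
}

// serviceResponse is a hypothetical response object of the calling
// service, with the rows placed under "items".
type serviceResponse struct {
	Items []json.RawMessage `json:"items"`
}

// wrapRows decodes a gateway payload and re-encodes its rows as items.
func wrapRows(payload []byte) ([]byte, error) {
	var gw gatewayResponse
	if err := json.Unmarshal(payload, &gw); err != nil {
		return nil, fmt.Errorf("decoding gateway response: %w", err)
	}
	if gw.Rows == nil {
		// Emit an empty array rather than null when there are no rows.
		gw.Rows = []json.RawMessage{}
	}
	return json.Marshal(serviceResponse{Items: gw.Rows})
}

func main() {
	// Sample payload; the field names are made up for this example.
	payload := []byte(`{"rows": [{"wiki": "enwiki", "count": 3}, {"wiki": "dewiki", "count": 1}]}`)
	out, err := wrapRows(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping the rows as json.RawMessage means the service never has to model the table's columns, which matches the premise that results are persisted exactly as they should be returned.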

Fri, Apr 19, 7:25 PM · Cassandra, Data Products (Data Products Sprint 12), Commons-Impact-Metrics

Thu, Apr 18

Eevans added a comment to T362697: Create Cassandra tables for Commons Impact Metrics.

@Eevans I believe you are the owner of the production Cassandra instance.

Thu, Apr 18, 7:44 PM · Cassandra, Data Products (Data Products Sprint 12), Commons-Impact-Metrics
Eevans moved T362841: Degraded RAID on aqs1014 from Backlog to Next on the Cassandra board.
Thu, Apr 18, 2:03 PM · Cassandra, SRE, ops-eqiad
Eevans added a project to T362841: Degraded RAID on aqs1014: Cassandra.
Thu, Apr 18, 2:03 PM · Cassandra, SRE, ops-eqiad
Eevans moved T362033: Degraded RAID on aqs1013 from Backlog to Next on the Cassandra board.
Thu, Apr 18, 2:02 PM · Cassandra, SRE, ops-eqiad
Eevans moved T362697: Create Cassandra tables for Commons Impact Metrics from Backlog to Next on the Cassandra board.
Thu, Apr 18, 2:02 PM · Cassandra, Data Products (Data Products Sprint 12), Commons-Impact-Metrics
Eevans moved T362697: Create Cassandra tables for Commons Impact Metrics from Next to Backlog on the Cassandra board.
Thu, Apr 18, 2:02 PM · Cassandra, Data Products (Data Products Sprint 12), Commons-Impact-Metrics
Eevans moved T362697: Create Cassandra tables for Commons Impact Metrics from Backlog to Next on the Cassandra board.
Thu, Apr 18, 2:02 PM · Cassandra, Data Products (Data Products Sprint 12), Commons-Impact-Metrics
Eevans added a project to T362033: Degraded RAID on aqs1013: Cassandra.
Thu, Apr 18, 2:02 PM · Cassandra, SRE, ops-eqiad

Wed, Apr 17

Eevans triaged T362840: (Unexpectedly?) large image suggestions set size as Medium priority.
Wed, Apr 17, 11:53 PM · Cassandra, Structured-Data-Backlog, Structured Data Engineering
Eevans created T362840: (Unexpectedly?) large image suggestions set size.
Wed, Apr 17, 11:53 PM · Cassandra, Structured-Data-Backlog, Structured Data Engineering

Tue, Apr 16

Eevans added a project to T362697: Create Cassandra tables for Commons Impact Metrics: Cassandra.
Tue, Apr 16, 9:34 PM · Cassandra, Data Products (Data Products Sprint 12), Commons-Impact-Metrics
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Tue, Apr 16, 5:52 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

I propose we carry on with the migration to PKI, accepting that Cassandra-based golang services will have to have verification disabled for now. It's not a regression, so I don't think we should let it hold up this work.

+1, makes sense. The plan is to enable host verification etc. only after the move, so let's proceed! What is the best way forward? Complete AQS, do Restbase and then finally Session Store?

Tue, Apr 16, 3:24 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans updated subscribers of T297944: Set up regular-repairs for AQS cassandra cluster tables.
Tue, Apr 16, 3:22 PM · Cassandra, Data-Engineering
Eevans merged T225694: Create cookbook to do `nodetool repair` across cassandra cluster into T297944: Set up regular-repairs for AQS cassandra cluster tables.
Tue, Apr 16, 3:20 PM · Cassandra, Data-Engineering
Eevans merged task T225694: Create cookbook to do `nodetool repair` across cassandra cluster into T297944: Set up regular-repairs for AQS cassandra cluster tables.
Tue, Apr 16, 3:20 PM · Cassandra, SRE-tools, User-Joe, SRE
Eevans lowered the priority of T297944: Set up regular-repairs for AQS cassandra cluster tables from High to Low.

We've made the upgrade to 4.x already, and we did so without a migration. If I've understood the context above, that was the reason for elevating the priority, so I'm going to drop it down now. Please feel free to readjust if that's wrong.

Tue, Apr 16, 3:19 PM · Cassandra, Data-Engineering
Eevans moved T361964: Golang-based Cassandra clients do not perform TLS host verification from Backlog to In-Progress on the Cassandra board.
Tue, Apr 16, 2:56 PM · AQS2.0, Data Products, Cassandra
Eevans added a comment to T361964: Golang-based Cassandra clients do not perform TLS host verification.

We've encountered a problem enabling verification for gocql-based clients (see: T352647#9715110). We'll need to implement a custom HostDialer for Cassandra-connecting golang services before this work can continue.

Tue, Apr 16, 2:56 PM · AQS2.0, Data Products, Cassandra
Eevans renamed T361964: Golang-based Cassandra clients do not perform TLS host verification from (some?) golang-based Cassandra clients do not perform TLS host verification to Golang-based Cassandra clients do not perform TLS host verification.
Tue, Apr 16, 2:55 PM · AQS2.0, Data Products, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

@Eevans sigh :( I found https://github.com/gocql/gocql/issues/1611 that may help, I didn't have time to check the code though.

Oh boy...

First, for those playing along at home, a brief summary of gocql/gocql/issues/1611 is that the list of discovered hosts used to populate gocql's connection pool comes from Cassandra's system.peers table, as IP addresses. The maintainers have fielded other issues that couldn't be solved as easily as resolving the IP to its FQDN, so they're opting for a pluggable approach.

So gocql/gocql/issues/1611 is still open and hasn't been updated in a couple of years, but it gave rise to gocql/gocql/pull/1629 which was merged some time ago. They've implemented a DIY approach using an interface called HostDialer, basically a hook to override the driver's connection setup with code of your own. It landed in v1.2.0, and we're (for Kask at least) using v1.2.1, so all of the pieces should be there.
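
As a rough illustration of the kind of logic a custom dialer hook could carry, the sketch below shows only the name-resolution step: turning a peer IP into an FQDN and using it as the TLS server name. It deliberately uses just the Go standard library and does not implement gocql's HostDialer interface; the function name tlsConfigForPeer, the injectable lookupFunc type, and the example.org host name are all assumptions made for this example. In a real service, net.LookupAddr would be the natural lookup function, and whether reverse DNS is sufficient depends on the environment.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"strings"
)

// lookupFunc has the same shape as net.LookupAddr. It is injectable so
// the example (and its tests) can run without touching DNS.
type lookupFunc func(addr string) ([]string, error)

// tlsConfigForPeer returns a copy of base whose ServerName is the FQDN
// that peerIP reverse-resolves to, so certificate verification is done
// against a host name instead of the bare IP address.
func tlsConfigForPeer(base *tls.Config, peerIP string, lookup lookupFunc) (*tls.Config, error) {
	names, err := lookup(peerIP)
	if err != nil {
		return nil, fmt.Errorf("reverse lookup of %s: %w", peerIP, err)
	}
	if len(names) == 0 {
		return nil, fmt.Errorf("no PTR record for %s", peerIP)
	}
	if base == nil {
		base = &tls.Config{}
	}
	cfg := base.Clone()
	// PTR results usually carry a trailing dot, which certificates do not.
	cfg.ServerName = strings.TrimSuffix(names[0], ".")
	return cfg, nil
}

func main() {
	// Fake resolver standing in for net.LookupAddr; the name is made up.
	fake := func(string) ([]string, error) {
		return []string{"cassandra-a.example.org."}, nil
	}
	cfg, err := tlsConfigForPeer(&tls.Config{}, "192.0.2.10", fake)
	if err != nil {
		panic(err)
	}
	fmt.Println("verify against:", cfg.ServerName)
}
```

A dialer built on this would then open the TCP connection to the IP as usual and wrap it with tls.Client using the returned configuration.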

Tue, Apr 16, 2:51 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

@Eevans sigh :( I found https://github.com/gocql/gocql/issues/1611 that may help, I didn't have time to check the code though.

Tue, Apr 16, 2:48 PM · Patch-For-Review, Data-Persistence, Cassandra

Mon, Apr 15

Eevans moved T355730: Provide developer access to the cassandra-dev cluster from Backlog to In-Progress on the Cassandra board.
Mon, Apr 15, 11:49 PM · Patch-For-Review, Cassandra
Eevans moved T352647: Move Cassandra clusters to PKI from Backlog to In-Progress on the Cassandra board.
Mon, Apr 15, 11:49 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans moved T328778: Cassandra test cluster as a staged pathway to production for image suggestions data pipelines from Next to Backlog on the Cassandra board.
Mon, Apr 15, 11:49 PM · Section-Level-Image-Suggestions, Cassandra
Eevans moved T354970: Upgrade Cassandra to 4.1.5 from Next to Backlog on the Cassandra board.
Mon, Apr 15, 11:49 PM · Cassandra
Eevans triaged T362181: Encrypt Airflow connections to AQS Cassandra as Medium priority.
Mon, Apr 15, 11:48 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Data-Engineering, Data-Persistence, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.
{"msg":"error: failed to connect to \"[HostInfo hostname=\\\"10.192.48.54\\\" connectAddress=\\\"10.192.48.54\\\" peer=\\\"10.192.48.54\\\" rpc_address=\\\"10.192.48.54\\\" broadcast_address=\\\"\u003cnil\u003e\\\" preferred_ip=\\\"\u003cnil\u003e\\\" connect_addr=\\\"10.192.48.54\\\" connect_addr_source=\\\"connect_address\\\" port=9042 data_centre=\\\"codfw\\\" rack=\\\"A_D\\\" host_id=\\\"5bfa3453-48f8-4c3c-82ea-478c460b6ee5\\\" version=\\\"v4.1.1\\\" state=UP num_tokens=256]\" due to error: x509: cannot validate certificate for 10.192.48.54 because it doesn't contain any IP SANs","appname":"sessionstore","time":"2024-04-15T18:25:35Z","level":"WARNING"}
Mon, Apr 15, 6:56 PM · Patch-For-Review, Data-Persistence, Cassandra
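
The x509 error in the log line above can be reproduced in isolation. The following Go sketch, using only the standard library, creates a throwaway self-signed certificate whose only subject alternative name is a DNS name and then checks it against an IP address and against that name. The helper selfSignedDNSOnly and the host name cassandra-a.example.org are hypothetical; the IP is the one from the log.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSignedDNSOnly creates a short-lived self-signed certificate whose
// only SAN is the given DNS name (no IP SANs at all).
func selfSignedDNSOnly(dnsName string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: dnsName},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     []string{dnsName},
	}
	// Template and parent are the same, which makes it self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := selfSignedDNSOnly("cassandra-a.example.org")
	if err != nil {
		panic(err)
	}
	// Checking against an IP fails because the certificate has no IP SANs.
	fmt.Println("by IP:  ", cert.VerifyHostname("10.192.48.54"))
	// Checking against the DNS name it was issued for succeeds.
	fmt.Println("by name:", cert.VerifyHostname("cassandra-a.example.org"))
}
```

The first check should report the same "doesn't contain any IP SANs" condition seen in the sessionstore log, which is why a client that connects by IP needs either IP SANs in the certificates or a host name to verify against.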
Eevans added a comment to T352647: Move Cassandra clusters to PKI.
{"msg":"error: failed to connect to \"[HostInfo hostname=\\\"10.192.48.54\\\" connectAddress=\\\"10.192.48.54\\\" peer=\\\"10.192.48.54\\\" rpc_address=\\\"10.192.48.54\\\" broadcast_address=\\\"\u003cnil\u003e\\\" preferred_ip=\\\"\u003cnil\u003e\\\" connect_addr=\\\"10.192.48.54\\\" connect_addr_source=\\\"connect_address\\\" port=9042 data_centre=\\\"codfw\\\" rack=\\\"A_D\\\" host_id=\\\"5bfa3453-48f8-4c3c-82ea-478c460b6ee5\\\" version=\\\"v4.1.1\\\" state=UP num_tokens=256]\" due to error: x509: cannot validate certificate for 10.192.48.54 because it doesn't contain any IP SANs","appname":"sessionstore","time":"2024-04-15T18:25:35Z","level":"WARNING"}
Mon, Apr 15, 6:42 PM · Patch-For-Review, Data-Persistence, Cassandra

Thu, Apr 11

Eevans added a comment to T362033: Degraded RAID on aqs1013.

@Eevans Hey looks like same drive as T354499 is failed again let me know if i can replace it again

Thu, Apr 11, 3:02 PM · Cassandra, SRE, ops-eqiad

Wed, Apr 10

Eevans added a comment to T361964: Golang-based Cassandra clients do not perform TLS host verification.

@Eevans medium is going to mean that it will likely only make it into a sprint at the end of this quarter. Is that ok or is this a risk?

Wed, Apr 10, 8:42 PM · AQS2.0, Data Products, Cassandra
Eevans updated the task description for T361964: Golang-based Cassandra clients do not perform TLS host verification.
Wed, Apr 10, 1:10 PM · AQS2.0, Data Products, Cassandra

Mon, Apr 8

Eevans updated subscribers of T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid.

[ ... ]

So the question becomes, what shall we call this new chart? What about...

  • http-gateway
  • cassandra-druid-http-gateway
  • aqs-http-gateway
  • combined-http-gateway
Mon, Apr 8, 4:38 PM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Mon, Apr 8, 2:08 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Mon, Apr 8, 2:05 PM · Patch-For-Review, Data-Persistence, Cassandra

Fri, Apr 5

Eevans changed the status of T334130: Access to AQS keyspaces for cassandra from Open to Stalled.

Once the new AQS services are in production, will there be any on-going need?

Possibly. At least, there will be an on-going need for additional testing data. I've seen informal discussion of additional AQS endpoints, which would require additional data to develop against. It is also reasonable to assume that we'll at least occasionally find/create a bug for which we'd like additional tests to prevent future regressions, and for which we need additional test data to run local tests.

Now, that doesn't necessarily mean we need to continue extracting production data in the way we are now. Maybe there's another way to get the data. Or maybe we switch to using mock data representative of production rather than actual extracted data.

Fri, Apr 5, 9:23 PM · Cassandra
Eevans added a comment to T350882: Query additional sample data for AQS testing.

The original scope of this ticket was a very specific request to retrieve data, and that request has been met, so I'll close this ticket now.

Fri, Apr 5, 9:17 PM · Cassandra
Eevans closed T350882: Query additional sample data for AQS testing as Resolved.
Fri, Apr 5, 9:17 PM · Cassandra
Eevans closed T320831: Section Level Image Suggestions - Data Persistence Request as Resolved.
Fri, Apr 5, 8:52 PM · Section-Level-Image-Suggestions, Cassandra, Image-Suggestions
Eevans closed T320831: Section Level Image Suggestions - Data Persistence Request, a subtask of T311814: [EPIC] Section-level image suggestions data pipeline, as Resolved.
Fri, Apr 5, 8:52 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Research-Freezer, Epic
Eevans added a comment to T320831: Section Level Image Suggestions - Data Persistence Request.

We have a cluster that exists for testing, for some value of "testing". I think in this context the meaning is probably closer to staging than it is to experimenting, for which Cloud VPS is probably a better fit.
So for example, as a sink to receive data from a pipeline that hasn't yet been cleared for production, but maybe not the place to actively develop new applications against. At least, this is my current thinking. I'd like to hear what folks need first.

Fully agree, a staged pathway to production is exactly what we need.
In other words, something we can feel free to feed with data and to eventually wipe it clean.

So as a step 1, can you create a separate ticket with your requirements here?

Here you are: T328778: Cassandra test cluster as a staged pathway to production for image suggestions data pipelines

As a subtask for you @Eevans , T328670: Add section title column to image_suggestions.suggestions table schema would be the final requirement for production.

I think we can safely close this ticket as soon as the subtasks are resolved.

Fri, Apr 5, 8:52 PM · Section-Level-Image-Suggestions, Cassandra, Image-Suggestions
Eevans triaged T320831: Section Level Image Suggestions - Data Persistence Request as Medium priority.
Fri, Apr 5, 8:45 PM · Section-Level-Image-Suggestions, Cassandra, Image-Suggestions
Eevans triaged T328778: Cassandra test cluster as a staged pathway to production for image suggestions data pipelines as Medium priority.
Fri, Apr 5, 8:45 PM · Section-Level-Image-Suggestions, Cassandra
Eevans triaged T360548: Cassandra quorum read timeouts during node decommissions as Medium priority.
Fri, Apr 5, 8:45 PM · Cassandra
Eevans triaged T343855: AQS 2.0 differentially private pageviews deploy API as Medium priority.
Fri, Apr 5, 8:44 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE
Eevans changed the status of T343855: AQS 2.0 differentially private pageviews deploy API from Open to Stalled.
Fri, Apr 5, 8:44 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE
Eevans raised the priority of T360548: Cassandra quorum read timeouts during node decommissions from High to Needs Triage.
Fri, Apr 5, 8:40 PM · Cassandra
Eevans changed the status of T360548: Cassandra quorum read timeouts during node decommissions from Open to Stalled.
Fri, Apr 5, 8:39 PM · Cassandra
Eevans triaged T361964: Golang-based Cassandra clients do not perform TLS host verification as Medium priority.
Fri, Apr 5, 6:24 PM · AQS2.0, Data Products, Cassandra
Eevans created T361964: Golang-based Cassandra clients do not perform TLS host verification.
Fri, Apr 5, 6:24 PM · AQS2.0, Data Products, Cassandra
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Fri, Apr 5, 3:35 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

I tried to check the AQS Cassandra clients and how they trust/validate TLS certificates. IIUC all the clients are on k8s and use cassandra-http-gateway as their chart, which renders a config file like /etc/cassandra-http-gateway/config.yaml containing various info about how to connect to a Cassandra cluster, and among those I found:

tls:
  ca: /etc/ssl/certs/wmf-ca-certificates.crt

This is great and it follows what we currently want on k8s and in production, namely that a daemon/service/etc. that connects to Cassandra/Kafka/etc. trusts the bundle composed of the Puppet Root CA and the PKI Root CA, so we can move towards cfssl freely. I am very puzzled by Cassandra, since it uses ca-manager and self-signed CAs IIRC, so those TLS certificates shouldn't be trusted by something that uses /etc/ssl/certs/wmf-ca-certificates.crt (the TLS connection should fail TLS cert verification).

I tried to dig a bit more and ended up in the generated-data-platform/aqs/device-analytics repo. Afaics we use github.com/gocql/gocql to manage TLS connections to Cassandra, and the only explanation that I can give is that we don't set either InsecureSkipVerify or EnableHostVerification so we skip TLS cert verification (see this commit).
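
For contrast, this is roughly what a verifying client configuration looks like with Go's standard crypto/tls package alone. It is a sketch, not the AQS services' code, and it makes no claim about how gocql maps its own options onto tls.Config. The bundle path is the one quoted above; the helper name verifyingTLSConfig and the server name cassandra-a.example.org are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// verifyingTLSConfig builds a client TLS configuration that trusts only
// the CAs in caPEM and verifies the peer against serverName.
func verifyingTLSConfig(caPEM []byte, serverName string) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA bundle")
	}
	return &tls.Config{
		RootCAs:    pool,
		ServerName: serverName,
		MinVersion: tls.VersionTLS12,
		// InsecureSkipVerify is left at its zero value (false), so chain
		// and host name verification both stay enabled.
	}, nil
}

func main() {
	caPEM, err := os.ReadFile("/etc/ssl/certs/wmf-ca-certificates.crt")
	if err != nil {
		fmt.Println("could not read CA bundle:", err)
		return
	}
	cfg, err := verifyingTLSConfig(caPEM, "cassandra-a.example.org")
	if err != nil {
		fmt.Println("could not build TLS config:", err)
		return
	}
	fmt.Println("verification enabled:", !cfg.InsecureSkipVerify)
}
```

With a configuration like this, a server certificate signed by a CA outside the bundle would be rejected during the handshake, which is the behaviour the comment above expected to see.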

If the above makes sense, it simplifies our work a lot, since we are able to move Cassandra instances to cfssl without modifying any of the k8s clients; we just need to turn on TLS cert validation once the cluster is fully migrated. Does it make sense?

Am I missing any other big clients that hit AQS?

Fri, Apr 5, 3:34 PM · Patch-For-Review, Data-Persistence, Cassandra

Thu, Apr 4

Eevans added a comment to T350507: Update mobileapps k8s deployment chart for Cassandra credentials.

carltondance

Thu, Apr 4, 2:33 PM · Content-Transform-Team, Patch-For-Review, Page Content Service, serviceops, RESTBase Sunsetting
jijiki awarded T350507: Update mobileapps k8s deployment chart for Cassandra credentials a Love token.
Thu, Apr 4, 12:48 PM · Content-Transform-Team, Patch-For-Review, Page Content Service, serviceops, RESTBase Sunsetting

Wed, Apr 3

Eevans added a subtask for T360548: Cassandra quorum read timeouts during node decommissions: T354970: Upgrade Cassandra to 4.1.5.
Wed, Apr 3, 7:21 PM · Cassandra
Eevans added a parent task for T354970: Upgrade Cassandra to 4.1.5: T360548: Cassandra quorum read timeouts during node decommissions.
Wed, Apr 3, 7:21 PM · Cassandra
Eevans renamed T354970: Upgrade Cassandra to 4.1.5 from Upgrade Cassandra to 4.1.4 to Upgrade Cassandra to 4.1.5.
Wed, Apr 3, 7:20 PM · Cassandra
Eevans moved T354970: Upgrade Cassandra to 4.1.5 from Backlog to Next on the Cassandra board.
Wed, Apr 3, 7:15 PM · Cassandra
Eevans moved T360548: Cassandra quorum read timeouts during node decommissions from Backlog to Blocked on the Cassandra board.
Wed, Apr 3, 7:15 PM · Cassandra
Eevans added a comment to T360548: Cassandra quorum read timeouts during node decommissions.

I'm not sure what to make of the results of disabling read-repair. It did not stop the errors entirely, but we can't say there is no change either. The decommissions are now complete, which makes further experimentation difficult. I think CASSANDRA-19120 is the most promising thing, so I propose that we upgrade to Cassandra 4.1.5 when it becomes available, and leave this issue open until the next decommission is needed.

Wed, Apr 3, 7:08 PM · Cassandra
Eevans updated the task description for T360548: Cassandra quorum read timeouts during node decommissions.
Wed, Apr 3, 6:55 PM · Cassandra
Eevans updated the task description for T360548: Cassandra quorum read timeouts during node decommissions.
Wed, Apr 3, 6:53 PM · Cassandra
Eevans closed T354561: Hardware refresh: Decommission restbase10[19-27] as Resolved.

Done!

Wed, Apr 3, 6:51 PM · Patch-For-Review, Cassandra
Eevans closed T354561: Hardware refresh: Decommission restbase10[19-27], a subtask of T354560: Provision new RESTBase cluster nodes: restbase10[34-42], as Resolved.
Wed, Apr 3, 6:50 PM · Cassandra
Eevans updated the task description for T354561: Hardware refresh: Decommission restbase10[19-27].
Wed, Apr 3, 6:50 PM · Patch-For-Review, Cassandra
Eevans updated the task description for T361372: decommission restbase10[19-27].
Wed, Apr 3, 6:49 PM · SRE, ops-eqiad, decommission-hardware
Eevans updated the task description for T354561: Hardware refresh: Decommission restbase10[19-27].
Wed, Apr 3, 2:00 PM · Patch-For-Review, Cassandra

Tue, Apr 2

Eevans triaged T361645: AQS Cassandra cluster: host/service failures should notify Data Persistence as High priority.
Tue, Apr 2, 8:47 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Cassandra
Eevans created T361645: AQS Cassandra cluster: host/service failures should notify Data Persistence.
Tue, Apr 2, 8:46 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Cassandra
Eevans added a comment to T340105: PHP-based alternative to wikimedia/service-template-node.

One more hypothesis for PHP IMO is that it requires significantly less long-term maintenance effort than Node when base OS versions (and as a consequence language versions) change, because the library ecosystem tends to be more stable. (Not sure how to quantify and test that, though; it's just a personal impression based on how abandoned Node services fared. Also, no idea if the same is true for Go.)

Tue, Apr 2, 8:40 PM · MediaWiki-Engineering, service-template-node, API Guidelines, Kubernetes
Eevans assigned T361603: aqs2001.codfw.wmnet down to Jhancock.wm.
Tue, Apr 2, 3:45 PM · SRE, Cassandra, ops-codfw
Eevans triaged T361603: aqs2001.codfw.wmnet down as High priority.
Tue, Apr 2, 2:53 PM · SRE, Cassandra, ops-codfw
Eevans created T361603: aqs2001.codfw.wmnet down.
Tue, Apr 2, 2:53 PM · SRE, Cassandra, ops-codfw

Mon, Apr 1

Eevans updated the task description for T354561: Hardware refresh: Decommission restbase10[19-27].
Mon, Apr 1, 4:43 PM · Patch-For-Review, Cassandra
Eevans updated the task description for T354561: Hardware refresh: Decommission restbase10[19-27].
Mon, Apr 1, 4:41 PM · Patch-For-Review, Cassandra
Eevans updated the task description for T354561: Hardware refresh: Decommission restbase10[19-27].
Mon, Apr 1, 3:27 PM · Patch-For-Review, Cassandra
Eevans updated the task description for T354561: Hardware refresh: Decommission restbase10[19-27].
Mon, Apr 1, 3:26 PM · Patch-For-Review, Cassandra

Sun, Mar 31

Eevans updated the task description for T354561: Hardware refresh: Decommission restbase10[19-27].
Sun, Mar 31, 12:46 AM · Patch-For-Review, Cassandra

Sat, Mar 30

Eevans updated the task description for T354561: Hardware refresh: Decommission restbase10[19-27].
Sat, Mar 30, 12:53 PM · Patch-For-Review, Cassandra
Eevans updated the task description for T354561: Hardware refresh: Decommission restbase10[19-27].
Sat, Mar 30, 12:49 PM · Patch-For-Review, Cassandra