
Eevans (Eric Evans)
Staff Site Reliability Engineer

User Details

User Since
Feb 27 2015, 10:47 PM (479 w, 5 d)
Availability
Available
IRC Nick
urandom
LDAP User
Eevans
MediaWiki User
EEvans (WMF)

Recent Activity

Yesterday

Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

I am trying to get access and face the following error:

> ssh cassandra-dev2001.codfw.wmnet
jgiannelos@cassandra-dev2001:~$ sudo -u cassandra_dev cqlsh cassandra-dev2001-a
sudo: unknown user: cassandra_dev
sudo: error initializing audit plugin sudoers_audit
Wed, May 8, 2:43 PM · Cassandra

Tue, May 7

Eevans added a comment to T362033: Degraded RAID on aqs1013.

Maybe a somewhat drastic option, but could we try to reimage one of those 2 servers and wait a few days?
That will surely wipe clean any manual procedure that was carried out on the host since the first disk swap. If it happens again, it is probably unrelated to anything done on the host, and more likely points to a hardware issue or a more general software problem.

Tue, May 7, 6:38 PM · Cassandra, SRE, ops-eqiad
Eevans triaged T364422: Reimage aqs1013 as High priority.
Tue, May 7, 6:37 PM · Patch-For-Review, Cassandra, SRE, ops-eqiad
Eevans created T364422: Reimage aqs1013.
Tue, May 7, 6:37 PM · Patch-For-Review, Cassandra, SRE, ops-eqiad
Eevans closed T355730: Provide developer access to the cassandra-dev cluster as Resolved.

Ok @Jgiannelos this should now be set up:

Tue, May 7, 6:05 PM · Cassandra

Thu, May 2

Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

The final changeset —r1026194— is up, but will need to pass review by Infrastructure Foundations (reviewed weekly, on Mondays).

Thu, May 2, 3:57 PM · Cassandra
Eevans closed T363377: Requesting access to deployment shell access for Jsn.sherman as Resolved.

Hi @jsn.sherman, You've been added to the deployment group, your shell username is jsn (same as wmf cloud). Let me know if you have any issues!

Thu, May 2, 3:47 PM · SRE, SRE-Access-Requests

Tue, Apr 30

Eevans added a comment to T362841: Degraded RAID on aqs1014.

The rebuild is complete:

Tue, Apr 30, 9:25 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362033: Degraded RAID on aqs1013.

Maybe a somewhat drastic option, but could we try to reimage one of those 2 servers and wait a few days?
That will surely wipe clean any manual procedure that was carried out on the host since the first disk swap. If it happens again, it is probably unrelated to anything done on the host, and more likely points to a hardware issue or a more general software problem.

Tue, Apr 30, 9:23 PM · Cassandra, SRE, ops-eqiad
Eevans updated the task description for T363377: Requesting access to deployment shell access for Jsn.sherman.
Tue, Apr 30, 7:05 PM · SRE, SRE-Access-Requests
Eevans claimed T363377: Requesting access to deployment shell access for Jsn.sherman.
Tue, Apr 30, 3:35 PM · SRE, SRE-Access-Requests
Eevans updated subscribers of T363377: Requesting access to deployment shell access for Jsn.sherman.

@thcipriani It looks like you're the approver for group deployment... do you?

Tue, Apr 30, 3:35 PM · SRE, SRE-Access-Requests
Eevans added a comment to T363377: Requesting access to deployment shell access for Jsn.sherman.

@jsn.sherman Could you please do one of the following? Either:

Tue, Apr 30, 3:34 PM · SRE, SRE-Access-Requests
Eevans added a comment to T362697: Create Cassandra tables for Commons Impact Metrics.

@Eevans thanks for following up on.

Did we leave Data_Gateway in an unclear space of ownership, or where does it live? Is it on Data Product's side of the road (if so, it is not currently anywhere on our list of all things)?

If we can sort that out then I think we can roll on.

Tue, Apr 30, 2:47 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

Restbase done!

Next steps:

  • Roll out the new truststore for session store - do we need to schedule maintenance time and depool kask, etc.?
Tue, Apr 30, 2:01 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans added a comment to T362697: Create Cassandra tables for Commons Impact Metrics.

Since it is experimental, I do not think we should go in the HTTP Data Gateway direction at this time.

@WDoranWMF is this interesting to think through as a future release?

Tue, Apr 30, 1:01 AM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics

Mon, Apr 29

Eevans added a comment to T362841: Degraded RAID on aqs1014.

Ok, sdf has been replaced again, here is a transcript of what was done to add it back to the array:

Mon, Apr 29, 7:25 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

Ok, so to summarize what has happened so far:

Mon, Apr 29, 6:44 PM · Cassandra, SRE, ops-eqiad
Eevans updated subscribers of T362697: Create Cassandra tables for Commons Impact Metrics.

@Eevans yes, of course.
Let's freeze this task for now.
We will meet as a team, since these changes would require some rework.
We'll decide, and we'll get back to you with all the changes to this task and the design doc.
🙏

Mon, Apr 29, 3:44 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics

Sat, Apr 27

Eevans added a comment to T362697: Create Cassandra tables for Commons Impact Metrics.

Yes @xcollazo I also wish I knew about this sooner.
And thanks @Eevans for the careful review!

As for the more technical deets, I have a few things here as well:

First, a lot of the metric values in your proposed schema use bigint. Is a 64-bit signed integer warranted here, or would a 32-bit (int) value work?

I'd argue that for all the values that store pageview counts, we'd be safer keeping bigint, considering we do a bunch of rollups and pageview counts can be very big numbers. But agreed that many, like category_metrics_snapshot's media_file_count, could very well be ints. @mforns WDYT?

Agree! Everything that's not pageviews can be INT. I believe media file counts per category are not going over 1M so far, so we'd have quite some slack there. I see some pageview counts that reach 30+B though, so we need a BIGINT there.
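
The sizing argument above comes down to simple arithmetic: a CQL int is a 32-bit signed integer and a bigint is a 64-bit signed integer. A minimal Python sketch shows which side of the boundary each quoted figure falls on. The helper name smallest_cql_type is hypothetical, and the two sample values are only the rough figures mentioned in the thread (about 1M media files per category, 30+B pageviews).

```python
INT_MAX = 2**31 - 1      # largest value a CQL int (32-bit signed) can hold
BIGINT_MAX = 2**63 - 1   # largest value a CQL bigint (64-bit signed) can hold


def smallest_cql_type(max_expected: int) -> str:
    """Return the narrowest CQL integer type that can hold max_expected."""
    if max_expected <= INT_MAX:
        return "int"
    if max_expected <= BIGINT_MAX:
        return "bigint"
    return "varint"


# Rough figures quoted in the thread, used here only as sample inputs.
print(smallest_cql_type(1_000_000))        # int
print(smallest_cql_type(30_000_000_000))   # bigint
```

With these inputs, media file counts fit an int with plenty of headroom, while a 30-billion pageview total exceeds the 32-bit limit of 2,147,483,647.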

Sat, Apr 27, 2:12 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans added a comment to T362697: Create Cassandra tables for Commons Impact Metrics.

@Eevans thank you for the extensive comments, and apologies for taking long to respond.

[ ... ]
Second, a number of these tables use an attribute called top_data that is a JSON-encoded blob. Unless there is a very good reason, I would strongly discourage doing this. That serialized value is a part of the data model, a part that has been elided from this schema. As far as I can tell, each example of this could be modeled entirely in Cassandra to produce the same results. Take commons.top_pages_by_category as an example. If you change the table to this:

CREATE TABLE commons.top_pages_by_category (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    year                INT,
    month               INT,
    page_title          text,
    page_view_count     int,
    rank                int,
    PRIMARY KEY ((category, category_scope, wiki, year, month), page_title)
);

The same query —using predicates for category, category_scope, wiki, year, and month— will produce the results that you are JSON encoding.

I'll let @mforns expand/refute, but I think the rationale for encoding this inside the top_data was to have point queries and thus avoid range queries like the one you propose. If you believe Cassandra would be happy with range queries that could have up to 1000 rows, perhaps we should reconsider the schema indeed.
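
To make the two alternatives in this exchange concrete, here is a minimal Python sketch that uses an in-memory dictionary as a stand-in for the proposed table. It is not a Cassandra client. The function names and sample rows are hypothetical, and the category_scope value "deep" is made up. It shows that reading one partition (equality predicates on every partition key column) yields the same items that the top_data JSON blob would have carried.

```python
import json
from collections import defaultdict

# In-memory stand-in for commons.top_pages_by_category, keyed by the full
# partition key (category, category_scope, wiki, year, month).
partitions = defaultdict(list)


def insert_row(category, category_scope, wiki, year, month,
               page_title, page_view_count, rank):
    key = (category, category_scope, wiki, year, month)
    partitions[key].append(
        {"page_title": page_title, "page_view_count": page_view_count, "rank": rank}
    )


def select_partition(category, category_scope, wiki, year, month):
    """Mimic a query with equality predicates on every partition key column.

    Rows come back ordered by the clustering column (page_title), as they
    would from the proposed table.
    """
    rows = partitions.get((category, category_scope, wiki, year, month), [])
    return sorted(rows, key=lambda r: r["page_title"])


def as_top_data(rows):
    """Build a JSON blob like the one the original schema would have stored."""
    return json.dumps(sorted(rows, key=lambda r: r["rank"]))


# Made-up sample data, for illustration only.
insert_row("Example_category", "deep", "en.wikipedia", 2024, 4, "Page_B", 120, 2)
insert_row("Example_category", "deep", "en.wikipedia", 2024, 4, "Page_A", 300, 1)

rows = select_partition("Example_category", "deep", "en.wikipedia", 2024, 4)
print(as_top_data(rows))
```

Because page_title is the clustering column in the proposed table, ordering by rank is done by the caller in this sketch.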

Sat, Apr 27, 2:09 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans committed rLPRI45411f7ef9b2: Rename cassandra user.
Rename cassandra user
Sat, Apr 27, 1:34 AM
Eevans added a project to T363615: puppetserver1001.eqiad.wmnet is unresponsive: Infrastructure-Foundations.
Sat, Apr 27, 1:09 AM · Infrastructure-Foundations, SRE
Eevans added a comment to T363615: puppetserver1001.eqiad.wmnet is unresponsive.

Restarted via the drac and everything seems OK now. I skimmed the logs and didn't see anything that seemed unusual prior to the event.

Sat, Apr 27, 1:08 AM · Infrastructure-Foundations, SRE
Eevans added a comment to T363615: puppetserver1001.eqiad.wmnet is unresponsive.

Also unable to login via the serial console.

Sat, Apr 27, 12:52 AM · Infrastructure-Foundations, SRE
Eevans created T363615: puppetserver1001.eqiad.wmnet is unresponsive.
Sat, Apr 27, 12:45 AM · Infrastructure-Foundations, SRE

Fri, Apr 26

Eevans committed rLPRI4b94269d9183: cassandra: add (faux) password for cassandra-devel user.
cassandra: add (faux) password for cassandra-devel user
Fri, Apr 26, 8:45 PM
Eevans added a comment to T362841: Degraded RAID on aqs1014.

The first device is done rebuilding:

Fri, Apr 26, 2:01 PM · Cassandra, SRE, ops-eqiad

Thu, Apr 25

Eevans added a comment to T362033: Degraded RAID on aqs1013.

Ok, the rebuild is complete.

Thu, Apr 25, 2:27 PM · Cassandra, SRE, ops-eqiad

Wed, Apr 24

Eevans added a comment to T362841: Degraded RAID on aqs1014.

2:23 PM <jclark-ctr> i am swapping sdf again
2:24 PM <jclark-ctr> swapped with one that was just erased

Wed, Apr 24, 7:32 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

Having some trouble adding sdf2 back into the array: mdadm: Cannot open /dev/sdf2: Device or resource busy :/

Wed, Apr 24, 6:07 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

Looking at lshw.log and the inventory on idrac, it looks like all the drives are in order, except that sdf and sdh are swapped in their slots. After sdf rebuilds, I can swap sdh.

Wed, Apr 24, 5:57 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

@Eevans Replaced drive

Wed, Apr 24, 3:24 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

I was asked to provide feedback from mariadb perspective (and how consistent we want to be across different technologies but in the same team).

We don't usually hand over dev accounts to the staging environment. Much development/staging work gets done on mariadb instances outside of production, most notably beta cluster (which has its own issues, but I assume setting up a dedicated project for cassandra in cloud VPS and giving access to that wouldn't be too hard). Given that it's outside of prod, the impact of mistakes or compromise is quite limited. It also discourages "testing in production" situations. I know the staging cluster has different data, but it's still in prod infra, with all the complexities/downsides that brings.

Wed, Apr 24, 2:58 PM · Cassandra
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Wed, Apr 24, 1:33 PM · Patch-For-Review, Data-Persistence, Cassandra

Tue, Apr 23

Eevans added a comment to T362033: Degraded RAID on aqs1013.

Here is a transcript of everything done (for posterity's sake):

Tue, Apr 23, 11:15 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

@Eevans this one is out of warranty also. Let me know if I am able to swap the drive; I can take care of it in the morning.

Tue, Apr 23, 11:04 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

Hi, is there any update with dev access for PCS devs?

Tue, Apr 23, 7:44 PM · Cassandra
Eevans assigned T362841: Degraded RAID on aqs1014 to Jclark-ctr.

Hey @Jclark-ctr: I hope it's OK to assign this one to you as well.

Tue, Apr 23, 7:21 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362033: Degraded RAID on aqs1013.

@Eevans Hey, it looks like the same drive as in T354499 has failed again. Let me know if I can replace it again.

Sure, go ahead.

P.S. I think this is the 4th time, are we just really unlucky, or is there some underlying factor at work?

Tue, Apr 23, 7:20 PM · Cassandra, SRE, ops-eqiad

Fri, Apr 19

Eevans added a comment to T362697: Create Cassandra tables for Commons Impact Metrics.

[ ... ]

Finally, something for you to consider: For image suggestions we created something we're calling Cassandra HTTP Gateway. The idea is that if you are persisting results exactly as you hope to return them, right down to attribute names, then we could give you an HTTP interface (REST) to retrieve them from. This wouldn't be a substitute for your service, but your service could make a simple HTTP request rather than directly query Cassandra. This should make things much simpler for you. The gateway returns JSON-encoded results as an object with a single attribute called rows, that is an array of JSON-encoded row objects ({rows: [{...}, {...}, ]}). For your case, we'd be able to set that up so that you could plug that array right into items: [] in your response objects. Let me know what you think!
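
As a rough illustration of the hand-off described above, the following Python sketch takes a gateway-style body of the form {rows: [...]} and places that array into items in a service response. The function name to_service_response and the row attribute names are hypothetical; only the rows and items keys come from the comment.

```python
import json


def to_service_response(gateway_body: str) -> dict:
    """Place the gateway's rows array into the items array of a response."""
    payload = json.loads(gateway_body)
    return {"items": payload["rows"]}


# Made-up gateway body; the row attribute names are illustrative only.
body = '{"rows": [{"page_title": "Page_A", "rank": 1}, {"page_title": "Page_B", "rank": 2}]}'
print(json.dumps(to_service_response(body)))
```

The service does no per-row transformation here, which is the point of persisting results exactly as they are meant to be returned.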

Fri, Apr 19, 7:25 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics

Thu, Apr 18

Eevans added a comment to T362697: Create Cassandra tables for Commons Impact Metrics.

@Eevans I believe you are the owner of the production Cassandra instance.

Thu, Apr 18, 7:44 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans moved T362841: Degraded RAID on aqs1014 from Backlog to Next on the Cassandra board.
Thu, Apr 18, 2:03 PM · Cassandra, SRE, ops-eqiad
Eevans added a project to T362841: Degraded RAID on aqs1014: Cassandra.
Thu, Apr 18, 2:03 PM · Cassandra, SRE, ops-eqiad
Eevans moved T362033: Degraded RAID on aqs1013 from Backlog to Next on the Cassandra board.
Thu, Apr 18, 2:02 PM · Cassandra, SRE, ops-eqiad
Eevans moved T362697: Create Cassandra tables for Commons Impact Metrics from Backlog to Next on the Cassandra board.
Thu, Apr 18, 2:02 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans moved T362697: Create Cassandra tables for Commons Impact Metrics from Next to Backlog on the Cassandra board.
Thu, Apr 18, 2:02 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans moved T362697: Create Cassandra tables for Commons Impact Metrics from Backlog to Next on the Cassandra board.
Thu, Apr 18, 2:02 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans added a project to T362033: Degraded RAID on aqs1013: Cassandra.
Thu, Apr 18, 2:02 PM · Cassandra, SRE, ops-eqiad

Wed, Apr 17

Eevans triaged T362840: (Unexpectedly?) large image suggestions set size as Medium priority.
Wed, Apr 17, 11:53 PM · Cassandra, Structured Data Engineering, Structured-Data-Backlog
Eevans created T362840: (Unexpectedly?) large image suggestions set size.
Wed, Apr 17, 11:53 PM · Cassandra, Structured Data Engineering, Structured-Data-Backlog

Tue, Apr 16

Eevans added a project to T362697: Create Cassandra tables for Commons Impact Metrics: Cassandra.
Tue, Apr 16, 9:34 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Tue, Apr 16, 5:52 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

I propose we carry on with the migration to PKI, accepting that Cassandra-based golang services will have to have verification disabled for now. It's not a regression, so I don't think we should let it hold up this work.

+1 makes sense, the plan is to enable host verification etc.. only after the move, so let's proceed! What is the best way forward? Complete AQS, do Restbase and then finally Session Store?

Tue, Apr 16, 3:24 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans updated subscribers of T297944: Set up regular-repairs for AQS cassandra cluster tables.
Tue, Apr 16, 3:22 PM · Cassandra, Data-Engineering
Eevans merged T225694: Create cookbook to do `nodetool repair` across cassandra cluster into T297944: Set up regular-repairs for AQS cassandra cluster tables.
Tue, Apr 16, 3:20 PM · Cassandra, Data-Engineering
Eevans merged task T225694: Create cookbook to do `nodetool repair` across cassandra cluster into T297944: Set up regular-repairs for AQS cassandra cluster tables.
Tue, Apr 16, 3:20 PM · Cassandra, SRE-tools, User-Joe, SRE
Eevans lowered the priority of T297944: Set up regular-repairs for AQS cassandra cluster tables from High to Low.

We've made the upgrade to 4.x already, and we did so without a migration. If I've understood the context above, that was the reason for elevating the priority, so I'm going to drop it down now. Please feel free to readjust if that's wrong.

Tue, Apr 16, 3:19 PM · Cassandra, Data-Engineering
Eevans moved T361964: Golang-based Cassandra clients do not perform TLS host verification from Backlog to In-Progress on the Cassandra board.
Tue, Apr 16, 2:56 PM · AQS2.0, Data Products, Cassandra
Eevans added a comment to T361964: Golang-based Cassandra clients do not perform TLS host verification.

We've encountered a problem enabling verification for gocql-based clients (see: T352647#9715110). We'll need to implement a custom HostDialer for Cassandra-connecting golang services before this work can continue.

Tue, Apr 16, 2:56 PM · AQS2.0, Data Products, Cassandra
Eevans renamed T361964: Golang-based Cassandra clients do not perform TLS host verification from (some?) golang-based Cassandra clients do not perform TLS host verification to Golang-based Cassandra clients do not perform TLS host verification.
Tue, Apr 16, 2:55 PM · AQS2.0, Data Products, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

@Eevans sigh :( I found https://github.com/gocql/gocql/issues/1611 that may help, I didn't have time to check the code though.

Oh boy...

First, for those playing along at home, a brief summary of gocql/gocql/issues/1611 is that the list of discovered hosts used to populate gocql's connection pool comes from Cassandra's system.peers table, as IP addresses. The maintainers have fielded other issues that couldn't be solved as easily as resolving the IP to its FQDN, so they're opting toward a pluggable approach.


So gocql/gocql/issues/1611 is still open and hasn't been updated in a couple of years, but it gave rise to gocql/gocql/pull/1629 which was merged some time ago. They've implemented a DIY approach using an interface called HostDialer, basically a hook to override the driver's connection setup with code of your own. It landed in v1.2.0, and we're (for Kask at least) using v1.2.1, so all of the pieces should be there.
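
The idea behind such a hook can be sketched outside of Go. The Python below is not gocql's HostDialer API. It only illustrates the approach discussed above, assuming that a reverse lookup of the peer IP returns the name present in the node's certificate. The function names fqdn_for and dial_verified are hypothetical.

```python
import socket
import ssl


def fqdn_for(ip: str, resolve=socket.gethostbyaddr) -> str:
    """Reverse-resolve a peer IP to the name its certificate is expected to carry."""
    hostname, _aliases, _addresses = resolve(ip)
    return hostname


def dial_verified(ip: str, port: int, ca_file: str, resolve=socket.gethostbyaddr):
    """Connect to the IP, but verify the certificate against the resolved FQDN.

    This needs a reachable TLS endpoint, so it is defined here but not called.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    sock = socket.create_connection((ip, port))
    return ctx.wrap_socket(sock, server_hostname=fqdn_for(ip, resolve))


# A fake resolver keeps the example runnable without DNS (names are made up).
fake = lambda ip: ("node-a.example.test", [], [ip])
print(fqdn_for("192.0.2.10", resolve=fake))
```

The connection still goes to the address discovered from the cluster; only the name used for certificate verification changes.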

Tue, Apr 16, 2:51 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

@Eevans sigh :( I found https://github.com/gocql/gocql/issues/1611 that may help, I didn't have time to check the code though.

Tue, Apr 16, 2:48 PM · Patch-For-Review, Data-Persistence, Cassandra

Mon, Apr 15

Eevans moved T355730: Provide developer access to the cassandra-dev cluster from Backlog to In-Progress on the Cassandra board.
Mon, Apr 15, 11:49 PM · Cassandra
Eevans moved T352647: Move Cassandra clusters to PKI from Backlog to In-Progress on the Cassandra board.
Mon, Apr 15, 11:49 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans moved T328778: Cassandra test cluster as a staged pathway to production for image suggestions data pipelines from Next to Backlog on the Cassandra board.
Mon, Apr 15, 11:49 PM · Section-Level-Image-Suggestions, Cassandra
Eevans moved T354970: Upgrade Cassandra to 4.1.5 from Next to Backlog on the Cassandra board.
Mon, Apr 15, 11:49 PM · Cassandra
Eevans triaged T362181: Encrypt Airflow connections to AQS Cassandra as Medium priority.
Mon, Apr 15, 11:48 PM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Data-Engineering, Data-Persistence, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.
{"msg":"error: failed to connect to \"[HostInfo hostname=\\\"10.192.48.54\\\" connectAddress=\\\"10.192.48.54\\\" peer=\\\"10.192.48.54\\\" rpc_address=\\\"10.192.48.54\\\" broadcast_address=\\\"\u003cnil\u003e\\\" preferred_ip=\\\"\u003cnil\u003e\\\" connect_addr=\\\"10.192.48.54\\\" connect_addr_source=\\\"connect_address\\\" port=9042 data_centre=\\\"codfw\\\" rack=\\\"A_D\\\" host_id=\\\"5bfa3453-48f8-4c3c-82ea-478c460b6ee5\\\" version=\\\"v4.1.1\\\" state=UP num_tokens=256]\" due to error: x509: cannot validate certificate for 10.192.48.54 because it doesn't contain any IP SANs","appname":"sessionstore","time":"2024-04-15T18:25:35Z","level":"WARNING"}
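
The x509 error in this log line follows from how subject alternative names (SANs) are matched: when the client dials an IP address, only IP-type SAN entries can satisfy verification. The Python sketch below is a deliberately simplified matcher (no wildcard handling) over SAN tuples in the shape that Python's ssl module reports them. The function name san_matches and the certificate contents are hypothetical.

```python
import ipaddress


def san_matches(target: str, subject_alt_names) -> bool:
    """Simplified SAN check: IP targets need an IP SAN, names need a DNS SAN."""
    try:
        target_ip = ipaddress.ip_address(target)
    except ValueError:
        target_ip = None  # target is a hostname, not an IP address

    for kind, value in subject_alt_names:
        if target_ip is not None:
            if kind == "IP Address" and ipaddress.ip_address(value) == target_ip:
                return True
        elif kind == "DNS" and value.lower() == target.lower():
            return True
    return False


# A made-up certificate that carries only a DNS SAN.
sans = [("DNS", "node-a.example.test")]
print(san_matches("10.192.48.54", sans))         # False
print(san_matches("node-a.example.test", sans))  # True
```

Verifying against the node's name instead of its IP, or issuing certificates that include IP SANs, are the two ways to make such a check pass.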
Mon, Apr 15, 6:56 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.
{"msg":"error: failed to connect to \"[HostInfo hostname=\\\"10.192.48.54\\\" connectAddress=\\\"10.192.48.54\\\" peer=\\\"10.192.48.54\\\" rpc_address=\\\"10.192.48.54\\\" broadcast_address=\\\"\u003cnil\u003e\\\" preferred_ip=\\\"\u003cnil\u003e\\\" connect_addr=\\\"10.192.48.54\\\" connect_addr_source=\\\"connect_address\\\" port=9042 data_centre=\\\"codfw\\\" rack=\\\"A_D\\\" host_id=\\\"5bfa3453-48f8-4c3c-82ea-478c460b6ee5\\\" version=\\\"v4.1.1\\\" state=UP num_tokens=256]\" due to error: x509: cannot validate certificate for 10.192.48.54 because it doesn't contain any IP SANs","appname":"sessionstore","time":"2024-04-15T18:25:35Z","level":"WARNING"}
Mon, Apr 15, 6:42 PM · Patch-For-Review, Data-Persistence, Cassandra

Thu, Apr 11

Eevans added a comment to T362033: Degraded RAID on aqs1013.

@Eevans Hey, it looks like the same drive as in T354499 has failed again. Let me know if I can replace it again.

Thu, Apr 11, 3:02 PM · Cassandra, SRE, ops-eqiad

Wed, Apr 10

Eevans added a comment to T361964: Golang-based Cassandra clients do not perform TLS host verification.

@Eevans medium is going to mean that it will likely only make it into a sprint at the end of this quarter. Is that ok or is this a risk?

Wed, Apr 10, 8:42 PM · AQS2.0, Data Products, Cassandra
Eevans updated the task description for T361964: Golang-based Cassandra clients do not perform TLS host verification.
Wed, Apr 10, 1:10 PM · AQS2.0, Data Products, Cassandra

Apr 8 2024

Eevans updated subscribers of T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid.

[ ... ]

So the question becomes, what shall we call this new chart? What about...

  • http-gateway
  • cassandra-druid-http-gateway
  • aqs-http-gateway
  • combined-http-gateway
Apr 8 2024, 4:38 PM · Data Products (Data Products Sprint 12), Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Apr 8 2024, 2:08 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Apr 8 2024, 2:05 PM · Patch-For-Review, Data-Persistence, Cassandra

Apr 5 2024

Eevans changed the status of T334130: Access to AQS keyspaces for cassandra from Open to Stalled.

Once the new AQS services are in production, will there be any on-going need?

Possibly. At least, there will be an on-going need for additional testing data. I've seen informal discussion of additional AQS endpoints, which would require additional data to develop against. It is also reasonable to assume that we'll at least occasionally find/create a bug for which we'd like additional tests to prevent future regressions, and for which we need additional test data to run local tests.

Now, that doesn't necessarily mean we need to continue extracting production data in the way we are now. Maybe there's another way to get the data. Or maybe we switch to using mock data representative of production rather than actual extracted data.

Apr 5 2024, 9:23 PM · Cassandra
Eevans added a comment to T350882: Query additional sample data for AQS testing.

The original scope of this ticket was a very specific request to retrieve data, and that request has been met, so I'll close this ticket now.

Apr 5 2024, 9:17 PM · Cassandra
Eevans closed T350882: Query additional sample data for AQS testing as Resolved.
Apr 5 2024, 9:17 PM · Cassandra
Eevans closed T320831: Section Level Image Suggestions - Data Persistence Request as Resolved.
Apr 5 2024, 8:52 PM · Section-Level-Image-Suggestions, Cassandra, Image-Suggestions
Eevans closed T320831: Section Level Image Suggestions - Data Persistence Request, a subtask of T311814: [EPIC] Section-level image suggestions data pipeline, as Resolved.
Apr 5 2024, 8:52 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Research-Freezer, Epic
Eevans added a comment to T320831: Section Level Image Suggestions - Data Persistence Request.

We have a cluster that exists for testing, for some value of "testing". I think in this context the meaning is probably closer to staging than it is to experimenting, for which Cloud VPS is probably a better fit.
So for example, as a sink to receive data from a pipeline that hasn't yet been cleared for production, but maybe not the place to actively develop new applications against. At least, this is my current thinking. I'd like to hear what folks need first.

Fully agree, a staged pathway to production is exactly what we need.
In other words, something we can feel free to feed with data and to eventually wipe it clean.

So as a step 1, can you create a separate ticket with your requirements here?

Here you are: T328778: Cassandra test cluster as a staged pathway to production for image suggestions data pipelines

As a subtask for you @Eevans , T328670: Add section title column to image_suggestions.suggestions table schema would be the final requirement for production.

I think we can safely close this ticket as soon as the subtasks are resolved.

Apr 5 2024, 8:52 PM · Section-Level-Image-Suggestions, Cassandra, Image-Suggestions
Eevans triaged T320831: Section Level Image Suggestions - Data Persistence Request as Medium priority.
Apr 5 2024, 8:45 PM · Section-Level-Image-Suggestions, Cassandra, Image-Suggestions
Eevans triaged T328778: Cassandra test cluster as a staged pathway to production for image suggestions data pipelines as Medium priority.
Apr 5 2024, 8:45 PM · Section-Level-Image-Suggestions, Cassandra
Eevans triaged T360548: Cassandra quorum read timeouts during node decommissions as Medium priority.
Apr 5 2024, 8:45 PM · Cassandra
Eevans triaged T343855: AQS 2.0 differentially private pageviews deploy API as Medium priority.
Apr 5 2024, 8:44 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE
Eevans changed the status of T343855: AQS 2.0 differentially private pageviews deploy API from Open to Stalled.
Apr 5 2024, 8:44 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE
Eevans raised the priority of T360548: Cassandra quorum read timeouts during node decommissions from High to Needs Triage.
Apr 5 2024, 8:40 PM · Cassandra
Eevans changed the status of T360548: Cassandra quorum read timeouts during node decommissions from Open to Stalled.
Apr 5 2024, 8:39 PM · Cassandra
Eevans triaged T361964: Golang-based Cassandra clients do not perform TLS host verification as Medium priority.
Apr 5 2024, 6:24 PM · AQS2.0, Data Products, Cassandra
Eevans created T361964: Golang-based Cassandra clients do not perform TLS host verification.
Apr 5 2024, 6:24 PM · AQS2.0, Data Products, Cassandra
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Apr 5 2024, 3:35 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

I tried to check the Cassandra AQS' clients and how they trust/validate TLS certificates. IIUC all the clients are on k8s and using the cassandra-http-gateway as chart, that renders a config file like /etc/cassandra-http-gateway/config.yaml containing various info about how to connect to a Cassandra cluster, and among those I found:

tls:
  ca: /etc/ssl/certs/wmf-ca-certificates.crt

This is great and it follows what we currently want on k8s and in production, namely that a daemon/service/etc. that connects to Cassandra/Kafka/etc. trusts the bundle composed of the Puppet Root CA and the PKI Root CA, so we can move towards cfssl freely. I am very puzzled by Cassandra, though, since it uses ca-manager and self-signed CAs IIRC, so those TLS certificates shouldn't be trusted by something that uses /etc/ssl/certs/wmf-ca-certificates.crt (the TLS connection should fail TLS cert verification, etc.).

I tried to dig a bit more and ended up in the generated-data-platform/aqs/device-analytics repo. Afaics we use github.com/gocql/gocql to manage TLS connections to Cassandra, and the only explanation that I can give is that we don't set either InsecureSkipVerify or EnableHostVerification so we skip TLS cert verification (see this commit).

If the above makes sense, it simplifies our work a lot, since we are able to move Cassandra instances to cfssl without modifying any of the k8s clients; we just need to turn on TLS cert validation once the cluster is fully migrated. Does it make sense?
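
The distinction drawn above, between configuring a CA bundle and actually verifying against it, can be illustrated with Python's standard ssl module. This is a sketch of the general TLS behaviour, not of gocql or the AQS services. The function name client_context is hypothetical.

```python
import ssl


def client_context(ca_file=None, verify=True):
    """Build a client-side TLS context.

    ca_file: path to a CA bundle (None falls back to the system defaults).
    verify:  when False, the peer certificate is not checked at all, so the
             configured CA bundle has no effect on whether a connection succeeds.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    if not verify:
        # check_hostname must be cleared before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


strict = client_context()
lax = client_context(verify=False)
print(strict.verify_mode == ssl.CERT_REQUIRED, strict.check_hostname)
print(lax.verify_mode == ssl.CERT_NONE, lax.check_hostname)
```

A client built like lax would connect to a server presenting a certificate from any CA, which is consistent with the observation that the existing clients kept working despite the trust bundle they were given.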

Am I missing any other big clients that hit AQS?

Apr 5 2024, 3:34 PM · Patch-For-Review, Data-Persistence, Cassandra

Apr 4 2024

Eevans added a comment to T350507: Update mobileapps k8s deployment chart for Cassandra credentials.

carltondance

Apr 4 2024, 2:33 PM · Content-Transform-Team, Patch-For-Review, Page Content Service, serviceops, RESTBase Sunsetting
jijiki awarded T350507: Update mobileapps k8s deployment chart for Cassandra credentials a Love token.
Apr 4 2024, 12:48 PM · Content-Transform-Team, Patch-For-Review, Page Content Service, serviceops, RESTBase Sunsetting

Apr 3 2024

Eevans added a subtask for T360548: Cassandra quorum read timeouts during node decommissions: T354970: Upgrade Cassandra to 4.1.5.
Apr 3 2024, 7:21 PM · Cassandra
Eevans added a parent task for T354970: Upgrade Cassandra to 4.1.5: T360548: Cassandra quorum read timeouts during node decommissions.
Apr 3 2024, 7:21 PM · Cassandra
Eevans renamed T354970: Upgrade Cassandra to 4.1.5 from Upgrade Cassandra to 4.1.4 to Upgrade Cassandra to 4.1.5.
Apr 3 2024, 7:20 PM · Cassandra
Eevans moved T354970: Upgrade Cassandra to 4.1.5 from Backlog to Next on the Cassandra board.
Apr 3 2024, 7:15 PM · Cassandra
Eevans moved T360548: Cassandra quorum read timeouts during node decommissions from Backlog to Blocked on the Cassandra board.
Apr 3 2024, 7:15 PM · Cassandra