Fri, May 24

Eevans added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

certs updated in all DCs, alerts resolved. I sincerely hope we will have the mesh migration resolved so we can avoid having to update echostore's certificates in October, but in case something prevents that and for reference the process was:

  • puppet cert revoke sessionstore.discovery.wmnet
  • In the puppet repo on your local checkout, run ./utils/create_ecdsa_cert sessionstore.discovery.wmnet sessionstore.svc.eqiad.wmnet sessionstore.svc.codfw.wmnet
  • On the puppetmaster, put the contents of /var/lib/puppet/server/ssl/ca/signed/sessionstore.discovery.wmnet.pem into certs.kask.cert in helmfile.d
  • Add the contents of the new private key from ./modules/secret/secrets/ssl/sessionstore.discovery.wmnet.key to hieradata/role/common/deployment_server/kubernetes.yaml
  • Validate the files and make sure everything looks okay using openssl ec/openssl x509, then git commit your changes in private
  • Follow the Helm rollout process as normal, keeping an eye on the sessionstore graphs and the session loss graphs
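
As a complement to the validation step above, here is a minimal sketch for checking how much headroom a certificate has left. It assumes the input is the notAfter= line in the form printed by openssl x509 -noout -enddate; the function name and the sample dates are hypothetical and not part of the documented process.

```python
from datetime import datetime, timezone


def days_until_expiry(enddate_line: str, now: datetime) -> int:
    """Return whole days between `now` and a 'notAfter=...' line.

    Assumes the line looks like 'notAfter=May 31 00:00:00 2024 GMT'.
    """
    value = enddate_line.strip().split("=", 1)[1]
    not_after = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    not_after = not_after.replace(tzinfo=timezone.utc)
    return (not_after - now).days


# Hypothetical example: a cert expiring at the end of May, checked on May 22.
remaining = days_until_expiry(
    "notAfter=May 31 00:00:00 2024 GMT",
    datetime(2024, 5, 22, tzinfo=timezone.utc),
)
```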
Fri, May 24, 2:32 PM · serviceops, Data-Persistence

Thu, May 23

Eevans added a comment to T364583: Consider whether we want to set a TTL on the Cassandra tables.

Oh! It's great that we can do range deletions from Cassandra!
I think with the deprecation of Druid, we are adopting Cassandra as our main serving layer, and we should be able to delete data from it given a time range.
This would help do idempotent reruns, and implement deletion policies.

I think TTL is convenient, because it's easy. But I'm not sure it's a good paradigm for long-lived data.
It has some problems, like:

  • If we backfill say 1 year of data, then all that data is going to be deleted at the same time.
  • If we re-run past data, then the updated data will live longer than more recent data.
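
Both problems follow from the TTL clock starting at write time rather than at the period the data describes. A minimal sketch, assuming a hypothetical one-year retention and made-up load dates:

```python
from datetime import date, timedelta

TTL = timedelta(days=365)  # hypothetical retention period


def expires_on(written_on: date) -> date:
    """A TTL counts from the day the row was written."""
    return written_on + TTL


# Backfill: twelve monthly datasets, all written on the same day,
# so all twelve share a single expiry date.
backfill_day = date(2024, 5, 23)
backfill_expiries = {expires_on(backfill_day) for _month in range(1, 13)}

# Re-run: January's data is rewritten after February's data was loaded,
# so the older period now outlives the more recent one.
february_loaded = date(2024, 3, 1)
january_rerun = date(2024, 4, 15)
```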
Thu, May 23, 6:45 PM · Data Products (Data Products Sprint 14), Commons-Impact-Metrics

Wed, May 22

Eevans merged T365337: Degraded RAID on aqs1013 into T362033: Degraded RAID on aqs1013.
Wed, May 22, 2:05 PM · DC-Ops, Cassandra, SRE, ops-eqiad
Eevans merged task T365337: Degraded RAID on aqs1013 into T362033: Degraded RAID on aqs1013.
Wed, May 22, 2:05 PM · Data-Persistence, DC-Ops, SRE, ops-eqiad
Eevans added a comment to T365337: Degraded RAID on aqs1013.

We're aware, yeah, current efforts being tracked in T362033: Degraded RAID on aqs1013 (no shortage of tickets on this one 😞).

Wed, May 22, 2:04 PM · Data-Persistence, DC-Ops, SRE, ops-eqiad
Eevans added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

Steps that I see:

  • Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services
  • Add toggleable mesh support
  • Get a baseline for performance in staging by running siege for a few hours
  • Deploy mesh-enabled version to staging, compare performance before and after
  • Roll forward if everything looks okay

Only concern around testing in staging is having reasonably representative infrastructure, given that it's both using cassandra-dev, and in codfw from pods in eqiad. That said, we're just looking for a baseline and not identical numbers.

+1! Not sure how to renew the existing cert since IIUC cergen wasn't used, but we can try to check in old tasks to see how it was done.

Wed, May 22, 1:55 PM · serviceops, Data-Persistence

Tue, May 21

Eevans raised the priority of T309229: Make Cassandra client encryption non-optional (AQS cluster) from Medium to High.
Tue, May 21, 11:31 PM · Data-Engineering-Radar, Cassandra
Eevans raised the priority of T310820: Encrypt Spark-Cassandra connection from Medium to High.
Tue, May 21, 11:31 PM · Data-Engineering, Data Pipelines, Cassandra
Eevans raised the priority of T305600: Properly add aqsloader user (w/ secrets) from Medium to High.
Tue, May 21, 11:30 PM · Data-Platform-SRE (2024.05.27 - 2024.06.16), Data-Engineering-Kanban, Cassandra, User-Eevans
Eevans claimed T362697: Create Staging Cassandra tables for Commons Impact Metrics.
Tue, May 21, 11:09 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

Steps that I see:

  • Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services

[ ... ]

Tue, May 21, 1:40 PM · serviceops, Data-Persistence

Mon, May 20

Eevans added a comment to T364583: Consider whether we want to set a TTL on the Cassandra tables.

...

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki), year, month, rank)
);

The one obstacle I see in a situation like this is cardinality. Changing it like this means that we're growing the partition from being based on the number of unique rank attributes (for each (category_scope, wiki, year, month)), to 12 * num_years * num_ranks (assuming the number of unique ranks is relatively low, that's not much). The other side of the equation is distribution. Changing it like this means we're shrinking the number of partitions from the number of unique (category_scope, wiki, year, month) tuples, to the number of unique (category_scope, wiki). If the latter is overly small, then distribution won't be as good.

Do you have some idea of the cardinality of these datasets?

If a change like this will fit your data, I think the schema is all we'd need to fix; All we'd be doing is changing the way Cassandra organizes the data, the schema insofar as your client code is concerned should work as-is.

The cardinality of rank is 1000. That is, we calculate a max of 1000 rows per (category_scope, wiki, year, month). category_scope can only be one of shallow or deep, so its cardinality is 2. wiki can be any of our ~900+ wikis.

Calculating over 5 years of data:

Unique (category_scope, wiki, year, month) should be about 2 * 900 * 5 * 12 = 108,000.

Unique (category_scope, wiki) should be about 2 * 900 = 1,800.

So definitely a stark contrast for this particular table.
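
The arithmetic above, together with the 12 * num_years * num_ranks bound on partition size mentioned earlier in the thread, can be written out as a short sketch. The inputs are the approximate figures quoted above (about 900 wikis, 5 years); the variable names are illustrative only.

```python
SCOPES = 2        # shallow or deep
WIKIS = 900       # approximate number of wikis
YEARS = 5
MONTHS = 12
MAX_RANKS = 1000  # at most 1000 rows per (category_scope, wiki, year, month)

# PRIMARY KEY ((category_scope, wiki, year, month), rank)
partitions_by_month = SCOPES * WIKIS * YEARS * MONTHS
rows_per_partition_by_month = MAX_RANKS

# PRIMARY KEY ((category_scope, wiki), year, month, rank)
partitions_by_wiki = SCOPES * WIKIS
rows_per_partition_by_wiki = MONTHS * YEARS * MAX_RANKS

# The total number of rows is the same; only its grouping changes.
total_rows = partitions_by_month * rows_per_partition_by_month
```

Under these assumptions the second layout has 60 times fewer partitions, each holding up to 60 times more rows.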

Mon, May 20, 5:50 PM · Data Products (Data Products Sprint 14), Commons-Impact-Metrics
Eevans added a comment to T364583: Consider whether we want to set a TTL on the Cassandra tables.

Couldn't we add a dummy TIMESTAMP for DELETE purposes?

That is, for all top_* tables, we add:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    internal_dt         TIMESTAMP,                 -- <<< new column
    PRIMARY KEY ((category_scope, wiki, year, month), rank)
);

Then the deletes could be like DELETE FROM top_viewed_categories_monthly WHERE internal_dt < '2020-01-01' since that column is not part of the PRIMARY KEY. Looks like that would cost us a 64-bit integer, but seems like it would solve the issue without modifications to the PRIMARY KEY, and no TTLs needed as well.

Actually, it could be a DATE type, and then it would only cost 32-bits.

Mon, May 20, 5:26 PM · Data Products (Data Products Sprint 14), Commons-Impact-Metrics
Eevans added a comment to T362033: Degraded RAID on aqs1013.

The array has rebuilt, but I could swear I hear it ticking...

Mon, May 20, 1:54 AM · DC-Ops, Cassandra, SRE, ops-eqiad

Sat, May 18

Eevans added a comment to T364583: Consider whether we want to set a TTL on the Cassandra tables.

This should work for any time-series datasets
...
Other CIM tables (like commons.top_viewed_categories_monthly) could probably be re-thought to make this possible as well.

All top_* tables have rows that belong to a particular year and month. Example:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki, year, month), rank)
);

So wouldn't the tombstone thingy work similarly? As in:

DELETE FROM top_viewed_categories_monthly WHERE year < 2020

Semantically, that DELETE is ok with our data model.

No, because year & month are a part of the partition key; The category_scope, wiki, year, and month attributes are for all intents and purposes concatenated together. A range delete like this is something that has to be done against the (indexed) partition itself.
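
To illustrate the distinction, here is a toy in-memory model. It is a sketch only, not how Cassandra is implemented: each dict key stands in for a partition key, and each value for that partition's rows sorted by clustering columns. The sample rows and function names are hypothetical.

```python
from bisect import bisect_left

# PRIMARY KEY ((category_scope, wiki, year, month), rank)
by_month = {
    ("deep", "enwiki", 2019, 12): [1, 2, 3],
    ("deep", "enwiki", 2020, 1): [1, 2, 3],
}

# PRIMARY KEY ((category_scope, wiki), year, month, rank)
by_wiki = {
    ("deep", "enwiki"): [(2019, 12, 1), (2019, 12, 2), (2020, 1, 1)],
}


def partitions_before(table, year):
    """With year inside the partition key there is no ordering to exploit:
    finding 'year < N' means inspecting every partition key."""
    return [key for key in table if key[2] < year]


def range_delete(table, partition, year):
    """With year as a clustering column, rows inside one partition are
    sorted, so 'year < N' is one contiguous slice of that partition."""
    rows = table[partition]
    cut = bisect_left(rows, (year,))
    table[partition] = rows[cut:]
    return cut  # number of rows removed
```

In the first layout the model has to enumerate keys to find old data; in the second, a single addressed partition can drop a leading range of its rows.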

Sat, May 18, 1:02 AM · Data Products (Data Products Sprint 14), Commons-Impact-Metrics

Fri, May 17

Eevans added a comment to T364583: Consider whether we want to set a TTL on the Cassandra tables.

This should work for any time-series datasets
...
Other CIM tables (like commons.top_viewed_categories_monthly) could probably be re-thought to make this possible as well.

All top_* tables have rows that belong to a particular year and month. Example:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki, year, month), rank)
);

So wouldn't the tombstone thingy work similarly? As in:

DELETE FROM top_viewed_categories_monthly WHERE year < 2020

Semantically, that DELETE is ok with our data model.

Fri, May 17, 7:58 PM · Data Products (Data Products Sprint 14), Commons-Impact-Metrics
Eevans added a comment to T364583: Consider whether we want to set a TTL on the Cassandra tables.

One thing worth considering: Since TTLs are recorded at write-time, on a per-record basis, it makes changing your retention policy later very costly (probably prohibitively so).

Fri, May 17, 7:38 PM · Data Products (Data Products Sprint 14), Commons-Impact-Metrics
Eevans added a comment to T364583: Consider whether we want to set a TTL on the Cassandra tables.

@JAllemandou could you elaborate on the performance issues?

Fri, May 17, 7:16 PM · Data Products (Data Products Sprint 14), Commons-Impact-Metrics
Eevans added a comment to T364921: Commons Impact Metrics: Data Gateway endpoints.

Many thanks for getting the image builds running and setting up the data_gateway role, @Eevans.

With the exception of a private-puppet patch [0] to wire in the credentials for the data_gateway role, I think all of the patches needed to get us to a working service are ready. I'll start sending those out for review soon.

[0] https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Add_private_data/secrets_(optional)

Fri, May 17, 3:10 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE
Eevans added a comment to T362697: Create Staging Cassandra tables for Commons Impact Metrics.

For posterity's sake: I will make a note here that while most of the AQS datasets make use of the Leveled compaction strategy, I've used Size-tiered here. Leveled was chosen (way back when) to minimize read latency (which comes at the expense of significantly more write I/O). Experience has proven though that the tiering you get from STCS typically matches what you see from Leveled, especially for a total-ordered dataset like this; I suspect that latency will be on par, and that we can extend the life of our SSDs in the process. I'll be monitoring this as the dataset grows though, and if that doesn't prove to be true, we can always change it later.

Fri, May 17, 3:07 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans added a comment to T362697: Create Staging Cassandra tables for Commons Impact Metrics.

I have updated the DDL for Cassandra in the description of this task.

The changes are:

  • Adopted the recommendations from T362697#9728035 related to demoting some columns from BIGINTs to INTs.
  • Adopted the recommendations from T362697#9728035 related to removing the top_data blob and adopting a range query strategy.
  • Table names and some column names have been updated as per changes in our Service Layer Design.

Kindly please review, and if no issues, let's deploy to the Cassandra Staging cluster whenever you folks have the time.

Thanks @xcollazo, I'll get started and update you when it's done.

Fri, May 17, 2:35 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans added a comment to T362697: Create Staging Cassandra tables for Commons Impact Metrics.

@Eevans it should take around 1 sprint (3 weeks) for us to adapt the service to the latest naming changes, do the integration tests and QA. So, it will be when you come back.

Fri, May 17, 2:34 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans added a comment to T362697: Create Staging Cassandra tables for Commons Impact Metrics.

I have updated the DDL for Cassandra in the description of this task.

The changes are:

  • Adopted the recommendations from T362697#9728035 related to demoting some columns from BIGINTs to INTs.
  • Adopted the recommendations from T362697#9728035 related to removing the top_data blob and adopting a range query strategy.
  • Table names and some column names have been updated as per changes in our Service Layer Design.

Kindly please review, and if no issues, let's deploy to the Cassandra Staging cluster whenever you folks have the time.

Fri, May 17, 1:34 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics

Thu, May 16

Eevans triaged T365172: Update {session,echo}store to make use of external-services-networkpolicies as Medium priority.
Thu, May 16, 3:30 PM · Cassandra
Eevans created T365172: Update {session,echo}store to make use of external-services-networkpolicies.
Thu, May 16, 3:30 PM · Cassandra
Eevans updated subscribers of T365118: Requesting access to cassandra-staging-devs for sg912.

@KOfori as group approver for cassandra-staging-devs...do you? :)

Thu, May 16, 1:43 PM · SRE, SRE-Access-Requests
Eevans updated the task description for T365118: Requesting access to cassandra-staging-devs for sg912.
Thu, May 16, 1:41 PM · SRE, SRE-Access-Requests
Eevans updated the task description for T365118: Requesting access to cassandra-staging-devs for sg912.
Thu, May 16, 1:36 PM · SRE, SRE-Access-Requests
Eevans committed rLPRI684a9a628276: cassandra: add faux creds for data_gateway role.
cassandra: add faux creds for data_gateway role
Thu, May 16, 1:30 PM

Wed, May 15

Eevans closed T364588: Requesting access to cassandra-staging-devs for xcollazo as Resolved.
Wed, May 15, 10:13 PM · SRE, SRE-Access-Requests
Eevans closed T364588: Requesting access to cassandra-staging-devs for xcollazo, a subtask of T364584: Create accounts in the Cassandra staging cluster for the Data Platform team members, as Resolved.
Wed, May 15, 10:13 PM · Commons-Impact-Metrics
Eevans added a comment to T364588: Requesting access to cassandra-staging-devs for xcollazo.

This is now done. The document is here: https://wikitech.wikimedia.org/wiki/Cassandra/Staging (it's still quite bare, so if you have any questions, let me know).

Wed, May 15, 10:12 PM · SRE, SRE-Access-Requests
Eevans moved T352647: Move Cassandra clusters to PKI from In-Progress to Backlog on the Cassandra board.
Wed, May 15, 9:10 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans moved T361964: Golang-based Cassandra clients do not perform TLS host verification from In-Progress to Backlog on the Cassandra board.
Wed, May 15, 9:10 PM · AQS2.0, Data Products, Cassandra
Eevans moved T350567: Migrate Cassandra to Java 11 from Backlog to Next on the Cassandra board.
Wed, May 15, 9:10 PM · Cassandra, Data-Persistence, SRE
Eevans added a comment to T364584: Create accounts in the Cassandra staging cluster for the Data Platform team members.

@Eevans if you can point us to any existing documentation about this staging Cassandra cluster I'd appreciate it.

The existing documentation (such as it is) is: https://wikitech.wikimedia.org/wiki/Cassandra/Staging

The process for being added to the cassandra-staging-devs group is the same that is used everywhere else, see: https://wikitech.wikimedia.org/wiki/SRE/Production_access#Access_Request_Process. For those that already have shell access, that will probably just boil down to filing the templated request (one request per user).

Wed, May 15, 9:06 PM · Commons-Impact-Metrics
Eevans raised the priority of T364921: Commons Impact Metrics: Data Gateway endpoints from High to Needs Triage.
Wed, May 15, 9:03 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE
Eevans removed a project from T364584: Create accounts in the Cassandra staging cluster for the Data Platform team members: Cassandra.
Wed, May 15, 9:03 PM · Commons-Impact-Metrics
Eevans triaged T364921: Commons Impact Metrics: Data Gateway endpoints as High priority.
Wed, May 15, 9:01 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE
Eevans moved T364921: Commons Impact Metrics: Data Gateway endpoints from Backlog to Next on the Cassandra board.
Wed, May 15, 9:01 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE
Eevans updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Wed, May 15, 8:36 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE
Eevans added a comment to T364921: Commons Impact Metrics: Data Gateway endpoints.

A Docker image is now published:

Wed, May 15, 8:29 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE
Eevans updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Wed, May 15, 8:24 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE
Eevans updated the task description for T364588: Requesting access to cassandra-staging-devs for xcollazo.
Wed, May 15, 5:08 PM · SRE, SRE-Access-Requests
Eevans added a comment to T362033: Degraded RAID on aqs1013.

The array has rebuilt, but I could swear I hear it ticking...

Wed, May 15, 12:08 AM · DC-Ops, Cassandra, SRE, ops-eqiad

Tue, May 14

Eevans updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Tue, May 14, 9:47 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE
Eevans updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Tue, May 14, 9:36 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE
Eevans created T364921: Commons Impact Metrics: Data Gateway endpoints.
Tue, May 14, 8:27 PM · Data Products, Patch-For-Review, Cassandra, serviceops, Service-deployment-requests, SRE

Mon, May 13

Eevans added a comment to T362033: Degraded RAID on aqs1013.

The failed device (sdd) was replaced; This time we're using sfdisk to copy the partition table.

Mon, May 13, 7:36 PM · DC-Ops, Cassandra, SRE, ops-eqiad
Eevans added a comment to T362033: Degraded RAID on aqs1013.

That didn't take long:

Mon, May 13, 1:19 PM · DC-Ops, Cassandra, SRE, ops-eqiad

Fri, May 10

Eevans closed T327524: Kask: gocql pool errors after repeated Cassandra outages as Invalid.

This no longer happens in deployment-prep (hasn't in a very long time), and was never happening in production, so I am going to close this.

Fri, May 10, 2:53 PM · Cassandra
Eevans added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

Yes it was a deliberate decision. There is some back story but the TL;DR is

  • Kask predates the service mesh, needed TLS, and so implemented it itself
  • Sessionstore is pretty critical and could benefit from not having the extra latency the service mesh possibly adds
  • A change to sessionstore was somewhat high risk and altering the behavior of a well functioning component didn't seem like it had any kind of significant payoff back then.

If we're not using the mesh, kask should support reloading certs from disk on changes. With that we could just replace the cergen certs with auto-renewing ones like we do in the mesh.

Sure, but I don't know if it actually supports reloading certs from disk, adding @Eevans.

Fri, May 10, 2:16 PM · serviceops, Data-Persistence
Eevans added a comment to T362033: Degraded RAID on aqs1013.

The machine has been reimaged and the instances bootstrapped. 🤞

Fri, May 10, 2:08 PM · DC-Ops, Cassandra, SRE, ops-eqiad
Eevans closed T364422: Reimage aqs1013 as Resolved.

The reimage is complete, and both instances have been bootstrapped. Closing.

Fri, May 10, 2:08 PM · Cassandra, SRE, ops-eqiad
Eevans closed T364422: Reimage aqs1013, a subtask of T362033: Degraded RAID on aqs1013, as Resolved.
Fri, May 10, 2:07 PM · DC-Ops, Cassandra, SRE, ops-eqiad

Thu, May 9

Eevans updated subscribers of T364588: Requesting access to cassandra-staging-devs for xcollazo.
  • access request (or expansion) has sign off of group approver indicated by the approval field in data.yaml
Thu, May 9, 11:01 PM · SRE, SRE-Access-Requests
Eevans updated the task description for T364588: Requesting access to cassandra-staging-devs for xcollazo.
Thu, May 9, 10:58 PM · SRE, SRE-Access-Requests
Eevans updated the task description for T364588: Requesting access to cassandra-staging-devs for xcollazo.
Thu, May 9, 10:55 PM · SRE, SRE-Access-Requests
Eevans edited projects for T364584: Create accounts in the Cassandra staging cluster for the Data Platform team members, added: Cassandra; removed Data-Persistence.
Thu, May 9, 9:32 PM · Commons-Impact-Metrics
Eevans added a comment to T364584: Create accounts in the Cassandra staging cluster for the Data Platform team members.

@Eevans if you can point us to any existing documentation about this staging Cassandra cluster I'd appreciate it.

Thu, May 9, 9:32 PM · Commons-Impact-Metrics
Eevans updated the task description for T364422: Reimage aqs1013.
Thu, May 9, 6:51 PM · Cassandra, SRE, ops-eqiad
Eevans awarded T362181: Encrypt Airflow connections to AQS Cassandra a Cookie token.
Thu, May 9, 3:19 PM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Data-Engineering, Data-Persistence, Cassandra

Wed, May 8

Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

I am trying to get access and face the following error:

> ssh cassandra-dev2001.codfw.wmnet
jgiannelos@cassandra-dev2001:~$ sudo -u cassandra_dev cqlsh cassandra-dev2001-a
sudo: unknown user: cassandra_dev
sudo: error initializing audit plugin sudoers_audit
Wed, May 8, 2:43 PM · Cassandra

Tue, May 7

Eevans added a comment to T362033: Degraded RAID on aqs1013.

Maybe a slightly drastic option, but could we try to reimage one of those 2 servers and wait a few days?
That will surely wipe clean any manual procedure that was carried out on the host since the first disk swap. If it happens again, it is probably unrelated to anything done on the host and more likely points to some hardware issue or a more general software problem.

Tue, May 7, 6:38 PM · DC-Ops, Cassandra, SRE, ops-eqiad
Eevans triaged T364422: Reimage aqs1013 as High priority.
Tue, May 7, 6:37 PM · Cassandra, SRE, ops-eqiad
Eevans created T364422: Reimage aqs1013.
Tue, May 7, 6:37 PM · Cassandra, SRE, ops-eqiad
Eevans closed T355730: Provide developer access to the cassandra-dev cluster as Resolved.

Ok @Jgiannelos this should now be setup:

Tue, May 7, 6:05 PM · Cassandra

May 2 2024

Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

The final changeset —r1026194— is up, but will need to pass review by Infrastructure Foundations (reviewed weekly, on Mondays).

May 2 2024, 3:57 PM · Cassandra
Eevans closed T363377: Requesting access to deployment shell access for Jsn.sherman as Resolved.

Hi @jsn.sherman, You've been added to the deployment group, your shell username is jsn (same as wmf cloud). Let me know if you have any issues!

May 2 2024, 3:47 PM · SRE, SRE-Access-Requests

Apr 30 2024

Eevans added a comment to T362841: Degraded RAID on aqs1014.

The rebuild is complete:

Apr 30 2024, 9:25 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362033: Degraded RAID on aqs1013.

Maybe a slightly drastic option, but could we try to reimage one of those 2 servers and wait a few days?
That will surely wipe clean any manual procedure that was carried out on the host since the first disk swap. If it happens again, it is probably unrelated to anything done on the host and more likely points to some hardware issue or a more general software problem.

Apr 30 2024, 9:23 PM · DC-Ops, Cassandra, SRE, ops-eqiad
Eevans updated the task description for T363377: Requesting access to deployment shell access for Jsn.sherman.
Apr 30 2024, 7:05 PM · SRE, SRE-Access-Requests
Eevans claimed T363377: Requesting access to deployment shell access for Jsn.sherman.
Apr 30 2024, 3:35 PM · SRE, SRE-Access-Requests
Eevans updated subscribers of T363377: Requesting access to deployment shell access for Jsn.sherman.

@thcipriani It looks like you're the approver for group deployment... do you?

Apr 30 2024, 3:35 PM · SRE, SRE-Access-Requests
Eevans added a comment to T363377: Requesting access to deployment shell access for Jsn.sherman.

@jsn.sherman Could you please do one of the following? Either:

Apr 30 2024, 3:34 PM · SRE, SRE-Access-Requests
Eevans added a comment to T362697: Create Staging Cassandra tables for Commons Impact Metrics.

@Eevans thanks for following up on this.

Did we leave Data_Gateway in an unclear space of ownership, or where does it live? Is it on Data Product's side of the road (if so, it is not currently anywhere on our list of all things)?

If we can sort that out then I think we can roll on.

Apr 30 2024, 2:47 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans added a comment to T352647: Move Cassandra clusters to PKI.

Restbase done!

Next steps:

  • Rollout the new truststore for session store - do we need to schedule maintenance time and depool kask etc..?
Apr 30 2024, 2:01 PM · Patch-For-Review, Data-Persistence, Cassandra
Eevans added a comment to T362697: Create Staging Cassandra tables for Commons Impact Metrics.

Since it is experimental, I do not think we should go in the HTTP Data Gateway direction at this time.

@WDoranWMF is this interesting to think through as a future release?

Apr 30 2024, 1:01 AM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics

Apr 29 2024

Eevans added a comment to T362841: Degraded RAID on aqs1014.

Ok, sdf has been replaced again, here is a transcript of what was done to add it back to the array:

Apr 29 2024, 7:25 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

Ok, so to summarize what has happened so far:

Apr 29 2024, 6:44 PM · Cassandra, SRE, ops-eqiad
Eevans updated subscribers of T362697: Create Staging Cassandra tables for Commons Impact Metrics.

@Eevans yes, of course.
Let's freeze this task for now.
We will meet as a team, since these changes would require some rework.
We'll decide, and we'll get back to you with all the changes to this task and the design doc.
🙏

Apr 29 2024, 3:44 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics

Apr 27 2024

Eevans added a comment to T362697: Create Staging Cassandra tables for Commons Impact Metrics.

Yes @xcollazo I also wish I knew about this sooner.
And thanks @Eevans for the careful review!

As for the more technical deets, I have a few things here as well:

First, a lot of the metric values in your proposed schema use bigint, is a 64 bit signed integer warranted here, or would a 32 bit (int) value work?

I'd argue that for all the values that store pageview counts, it'd be safer to keep them as bigint, considering we do a bunch of rollups and pageview counts can be very big numbers. But agreed that many, like category_metrics_snapshot's media_file_count, could very well be ints. @mforns WDYT?

Agree! Everything that's not pageviews can be INT. I believe media file counts per category are not going over 1M so far, so we'd have quite some slack there. I see some pageview counts that reach 30+B though, so we need a BIGINT there.
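
The sizing argument can be checked against the ranges of the CQL integer types (int is 32-bit signed, bigint is 64-bit signed). A minimal sketch; the helper name is hypothetical, and the sample values are the rough figures quoted above.

```python
INT32_MAX = 2**31 - 1   # upper bound of a CQL int
INT64_MAX = 2**63 - 1   # upper bound of a CQL bigint


def smallest_cql_type(max_value: int) -> str:
    """Pick the narrowest CQL integer type that can hold max_value."""
    if max_value <= INT32_MAX:
        return "int"
    if max_value <= INT64_MAX:
        return "bigint"
    return "varint"


media_file_count_type = smallest_cql_type(1_000_000)       # ~1M files per category
pageview_count_type = smallest_cql_type(30_000_000_000)    # 30+B pageviews
```

A 32-bit int tops out at roughly 2.1 billion, so media file counts around 1M leave ample slack, while pageview counts in the tens of billions do not fit.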

Apr 27 2024, 2:12 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans added a comment to T362697: Create Staging Cassandra tables for Commons Impact Metrics.

@Eevans thank you for the extensive comments, and apologies for taking long to respond.

[ ... ]
Second, a number of these tables use an attribute called top_data that is a JSON-encoded blob. Unless there is a very good reason, I would strongly discourage doing this. That serialized value is a part of the data model, a part that has been elided from this schema. As far as I can tell, each example of this could be modeled entirely in Cassandra to produce the same results. Take commons.top_pages_by_category as an example. If you change the table to this:

CREATE TABLE commons.top_pages_by_category (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    year                INT,
    month               INT,
    page_title          text,
    page_view_count     int,
    rank                int,
    PRIMARY KEY ((category, category_scope, wiki, year, month), page_title)
);

The same query —using predicates for category, category_scope, wiki, year, and month— will produce the results that you are JSON encoding.
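
As a rough illustration of that equivalence: the rows of one partition can be turned into a ranked JSON payload on the client side. This is a sketch under assumptions. The shape of the original top_data blob is not shown here, so the payload format, the sample rows, and the function name are all hypothetical.

```python
import json

# Rows as a driver might return them for a single partition, i.e. one
# (category, category_scope, wiki, year, month). They arrive ordered by
# the clustering column page_title, not by rank.
rows = [
    {"page_title": "Apple", "page_view_count": 120, "rank": 2},
    {"page_title": "Banana", "page_view_count": 300, "rank": 1},
    {"page_title": "Cherry", "page_view_count": 45, "rank": 3},
]


def as_top_data(partition_rows):
    """Order the partition's rows by rank and serialize them as JSON."""
    ranked = sorted(partition_rows, key=lambda row: row["rank"])
    return json.dumps(ranked)


payload = as_top_data(rows)
```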

I'll let @mforns expand/refute, but I think the rationale for encoding this inside the top_data was to have point queries and thus avoid range queries like the one you propose. If you believe Cassandra would be happy with range queries that could have up to 1000 rows, perhaps we should reconsider the schema indeed.

Apr 27 2024, 2:09 PM · Data Products (Data Products Sprint 13), Cassandra, Commons-Impact-Metrics
Eevans committed rLPRI45411f7ef9b2: Rename cassandra user.
Rename cassandra user
Apr 27 2024, 1:34 AM
Eevans added a project to T363615: puppetserver1001.eqiad.wmnet is unresponsive: Infrastructure-Foundations.
Apr 27 2024, 1:09 AM · Infrastructure-Foundations, SRE
Eevans added a comment to T363615: puppetserver1001.eqiad.wmnet is unresponsive.

Restarted via the drac and everything seems OK now. I skimmed the logs and didn't see anything that seemed unusual prior to the event.

Apr 27 2024, 1:08 AM · Infrastructure-Foundations, SRE
Eevans added a comment to T363615: puppetserver1001.eqiad.wmnet is unresponsive.

Also unable to login via the serial console.

Apr 27 2024, 12:52 AM · Infrastructure-Foundations, SRE
Eevans created T363615: puppetserver1001.eqiad.wmnet is unresponsive.
Apr 27 2024, 12:45 AM · Infrastructure-Foundations, SRE

Apr 26 2024

Eevans committed rLPRI4b94269d9183: cassandra: add (faux) password for cassandra-devel user.
cassandra: add (faux) password for cassandra-devel user
Apr 26 2024, 8:45 PM
Eevans added a comment to T362841: Degraded RAID on aqs1014.

The first device is done rebuilding:

Apr 26 2024, 2:01 PM · Cassandra, SRE, ops-eqiad

Apr 25 2024

Eevans added a comment to T362033: Degraded RAID on aqs1013.

Ok, the rebuild is complete.

Apr 25 2024, 2:27 PM · DC-Ops, Cassandra, SRE, ops-eqiad

Apr 24 2024

Eevans added a comment to T362841: Degraded RAID on aqs1014.

2:23 PM <jclark-ctr> i am swapping sdf again
2:24 PM <jclark-ctr> swapped with one that was just erased

Apr 24 2024, 7:32 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

Having some trouble adding sdf2 back into the array: mdadm: Cannot open /dev/sdf2: Device or resource busy :/

Apr 24 2024, 6:07 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

Looking at lshw.log and the inventory on idrac, it looks like all the drives are in order except sdf and sdh, which are swapped in slots. After sdf rebuilds I can swap sdh.

Apr 24 2024, 5:57 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

@Eevans Replaced drive

Apr 24 2024, 3:24 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

I was asked to provide feedback from mariadb perspective (and how consistent we want to be across different technologies but in the same team).

We don't usually hand over dev accounts to the staging environment. Much development/staging work gets done on mariadb instances outside of production, most notably the beta cluster (which has its own issues, but I assume setting up a dedicated project for cassandra in cloud VPS and giving access to that wouldn't be too hard). Given that it's outside of prod, the impact of mistakes or compromise is quite limited. It also discourages "testing in production" situations. I know the staging cluster has different data, but still it's in prod infra with all the complexities/downsides that it brings with itself.

Apr 24 2024, 2:58 PM · Cassandra
Eevans updated the task description for T352647: Move Cassandra clusters to PKI.
Apr 24 2024, 1:33 PM · Patch-For-Review, Data-Persistence, Cassandra

Apr 23 2024

Eevans added a comment to T362033: Degraded RAID on aqs1013.

Here is a transcript of everything done (for posterity's sake):

Apr 23 2024, 11:15 PM · DC-Ops, Cassandra, SRE, ops-eqiad
Eevans added a comment to T362841: Degraded RAID on aqs1014.

@Eevans this one is out of warranty also. Let me know if I am able to swap the drive; I can take care of it in the morning.

Apr 23 2024, 11:04 PM · Cassandra, SRE, ops-eqiad
Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

Hi, is there any update with dev access for PCS devs?

Apr 23 2024, 7:44 PM · Cassandra