Phabricator

Eevans (Eric Evans)
Staff Site Reliability Engineer

Projects

Today

  • Clear sailing ahead.

Tomorrow

  • Clear sailing ahead.

Thursday

  • Clear sailing ahead.

User Details

User Since
Feb 27 2015, 10:47 PM (472 w, 3 d)
Availability
Available
IRC Nick
urandom
LDAP User
Eevans
MediaWiki User
EEvans (WMF)

Recent Activity

Sun, Mar 17

Eevans updated the task description for T354561: Decommission restbase10[19-27].
Sun, Mar 17, 7:03 PM · Cassandra

Sat, Mar 16

Eevans closed T354560: Provision new RESTBase cluster nodes: restbase10[34-42] as Resolved.
Sat, Mar 16, 5:05 AM · Cassandra
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Sat, Mar 16, 5:04 AM · Cassandra

Fri, Mar 15

Eevans updated the task description for T354561: Decommission restbase10[19-27].
Fri, Mar 15, 9:43 AM · Cassandra

Wed, Mar 13

Eevans updated the task description for T354561: Decommission restbase10[19-27].
Wed, Mar 13, 4:36 PM · Cassandra
Eevans updated the task description for T354561: Decommission restbase10[19-27].
Wed, Mar 13, 4:28 PM · Cassandra

Tue, Mar 12

Eevans updated the task description for T354561: Decommission restbase10[19-27].
Tue, Mar 12, 7:04 AM · Cassandra

Sun, Mar 10

Eevans updated the task description for T354561: Decommission restbase10[19-27].
Sun, Mar 10, 2:15 PM · Cassandra
Eevans updated the task description for T354561: Decommission restbase10[19-27].
Sun, Mar 10, 2:11 PM · Cassandra
Eevans updated the task description for T354561: Decommission restbase10[19-27].
Sun, Mar 10, 2:07 PM · Cassandra
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Sun, Mar 10, 2:04 PM · Cassandra
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Sun, Mar 10, 2:00 PM · Cassandra
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Sun, Mar 10, 2:00 PM · Cassandra
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Sun, Mar 10, 1:44 PM · Cassandra

Sat, Mar 9

Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Sat, Mar 9, 5:39 PM · Cassandra
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Sat, Mar 9, 5:32 PM · Cassandra

Fri, Mar 8

Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Fri, Mar 8, 1:55 PM · Cassandra

Thu, Mar 7

Eevans placed T357739: Package logstash-logback-encoder for Debian up for grabs.
Thu, Mar 7, 9:09 PM · Release-Engineering-Team (Now this 🫠), Cassandra
Eevans assigned T357739: Package logstash-logback-encoder for Debian to dancy.
Thu, Mar 7, 9:09 PM · Release-Engineering-Team (Now this 🫠), Cassandra
Eevans renamed T357739: Package logstash-logback-encoder for Debian from "Find alternative to scap for deployment of logstash-logback-encoder" to "Package logstash-logback-encoder for Debian".
Thu, Mar 7, 9:08 PM · Release-Engineering-Team (Now this 🫠), Cassandra
Eevans reopened T357739: Package logstash-logback-encoder for Debian as "Open".

I think that ultimately we may still want to do this (even if our backs are no longer up against the wall). We don't routinely update these, and having a separate deployment like this just adds moving parts.

Thu, Mar 7, 9:06 PM · Release-Engineering-Team (Now this 🫠), Cassandra
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Thu, Mar 7, 3:53 AM · Cassandra

Wed, Mar 6

Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Wed, Mar 6, 3:40 PM · Cassandra
Eevans added a comment to T359234: ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad).

Hi @Eevans, I'm a bit perplexed as to why you think serviceops should be able to assist with this issue. From the looks of it, this seems like an application bug triggered by external traffic.

I would assume either Traffic on the SRE side, or Content-Transform-Team on the development side should be able to help.

Retagged accordingly, please let me know if we can help in any way.

Wed, Mar 6, 2:44 PM · Content-Transform-Team-WIP, Content-Transform-Team, Data-Persistence, Traffic
Eevans triaged T359234: ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad) as High priority.
Wed, Mar 6, 3:11 AM · Content-Transform-Team-WIP, Content-Transform-Team, Data-Persistence, Traffic
Eevans created T359234: ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad).
Wed, Mar 6, 3:10 AM · Content-Transform-Team-WIP, Content-Transform-Team, Data-Persistence, Traffic

Tue, Mar 5

Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Tue, Mar 5, 2:08 PM · Cassandra

Thu, Feb 29

Eevans added a comment to T358793: Decommission AQS 1.0.

@WDoranWMF, just to make sure I understand: this is about decommissioning the AQS servers (/^aqs10(1[0-9]|2[0-1])\.eqiad\./ and /^aqs200[1-9]|aqs201[0-2]\.codfw\./), including some cleanup of puppet code, with the service already not in use, but still running. There is also Cassandra running on those hosts, but that can also be trashed, with no need to preserve any of the data?

Thu, Feb 29, 10:07 PM · Data-Persistence, Data-Platform-SRE (2024.03.04 - 2024.03.24), Cassandra, Data Products
Eevans added a comment to T358793: Decommission AQS 1.0.

This would be great; long term, decoupling the service from the storage cluster is going to simplify maintenance a great deal. In the nearer term, this will unblock upgrading to Debian Bookworm (the transition to Bullseye was already problematic).

Thu, Feb 29, 10:03 PM · Data-Persistence, Data-Platform-SRE (2024.03.04 - 2024.03.24), Cassandra, Data Products
Eevans awarded T358793: Decommission AQS 1.0 a Cookie token.
Thu, Feb 29, 10:03 PM · Data-Persistence, Data-Platform-SRE (2024.03.04 - 2024.03.24), Cassandra, Data Products
Eevans edited P58263 (An Untitled Masterwork).
Thu, Feb 29, 5:49 PM
Eevans created P58263 (An Untitled Masterwork).
Thu, Feb 29, 5:49 PM

Wed, Feb 28

Eevans added a comment to T355730: Provide developer access to the cassandra-dev cluster.

Hey @Eevans, is there any update on that? We are picking up the Cassandra/PCS work, and having dev access in place would be useful for testing things on staging.

Wed, Feb 28, 4:27 PM · Cassandra

Mon, Feb 26

Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Mon, Feb 26, 2:47 PM · Cassandra

Fri, Feb 23

Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Fri, Feb 23, 3:08 PM · Cassandra

Thu, Feb 22

Eevans lowered the priority of T352647: Move Cassandra clusters to PKI from High to Medium.
Thu, Feb 22, 5:51 PM · Data-Persistence, Cassandra
Eevans triaged T334130: Access to AQS keyspaces for cassandra as Low priority.
Thu, Feb 22, 5:49 PM · Cassandra
Eevans triaged T310168: Refactor Puppet definitions of Cassandra 'rack' as Low priority.
Thu, Feb 22, 5:48 PM · Cassandra
Eevans triaged T310820: Encrypt Spark-Cassandra connection as Medium priority.
Thu, Feb 22, 5:48 PM · Data-Engineering, Data Pipelines, Cassandra
Eevans triaged T228294: Cassandra PHP driver evaluation as Low priority.
Thu, Feb 22, 5:47 PM · Cassandra, Platform Engineering Roadmap Decision Making, User-Eevans
Eevans triaged T309619: Automate joining Cassandra cluster as Low priority.
Thu, Feb 22, 5:47 PM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Cassandra
Eevans triaged T295897: Automated application of grants for Cassandra as Low priority.
Thu, Feb 22, 5:46 PM · Platform Team Workboards (Platform Engineering Reliability), Cassandra
Eevans triaged T323692: Create puppet defined type for adding/updating/deleting secrets or other small files on HDFS as High priority.
Thu, Feb 22, 5:45 PM · Data-Engineering, Data Pipelines, Cassandra
Eevans triaged T353189: Issue(s) bootstrapping Cassandra nodes as Medium priority.
Thu, Feb 22, 5:45 PM · Cassandra
Eevans triaged T352647: Move Cassandra clusters to PKI as High priority.
Thu, Feb 22, 5:44 PM · Data-Persistence, Cassandra
Eevans triaged T354970: Upgrade Cassandra to 4.1.4 as Medium priority.
Thu, Feb 22, 5:44 PM · Cassandra
Eevans triaged T356446: image suggestions DAG should not use aqsloader Cassandra role as High priority.
Thu, Feb 22, 5:44 PM · Structured-Data-Backlog, Structured Data Engineering, Data-Engineering, Cassandra
Eevans triaged T350567: Migrate Cassandra to Java 11 as Medium priority.
Thu, Feb 22, 5:44 PM · Cassandra, Data-Persistence, SRE
Eevans moved T328778: Cassandra test cluster as a staged pathway to production for image suggestions data pipelines from Backlog to Next on the Cassandra board.
Thu, Feb 22, 5:42 PM · Section-Level-Image-Suggestions, Cassandra
Eevans moved T315517: Document best-practice for hinted-handoff from Next to Backlog on the Cassandra board.
Thu, Feb 22, 5:38 PM · Data-Persistence, SRE-Sprint-Week-Sustainability-March2023, SRE-OnFire, Sustainability (Incident Followup), Cassandra
Eevans triaged T315517: Document best-practice for hinted-handoff as Low priority.
Thu, Feb 22, 5:38 PM · Data-Persistence, SRE-Sprint-Week-Sustainability-March2023, SRE-OnFire, Sustainability (Incident Followup), Cassandra
Eevans moved T307035: Relocate hosts: aqs10[3-5] from Next to Backlog on the Cassandra board.
Thu, Feb 22, 5:37 PM · SRE, DC-Ops, ops-eqiad, Cassandra, User-Eevans
Eevans moved T305102: Erroneous node placement (AQS Cassandra cluster) from Next to Backlog on the Cassandra board.
Thu, Feb 22, 5:37 PM · Cassandra, User-Eevans
Eevans moved T317364: [M] Stop unbounded image suggestions dataset growth and clean up legacy results from Next to Backlog on the Cassandra board.
Thu, Feb 22, 5:37 PM · Growth-Team, Structured Data Engineering, Cassandra
Eevans moved T309229: Make Cassandra client encryption non-optional (AQS cluster) from Next to Backlog on the Cassandra board.
Thu, Feb 22, 5:36 PM · Data-Engineering-Radar, Cassandra

Wed, Feb 21

Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Wed, Feb 21, 10:20 PM · Cassandra
Eevans committed rLPRIf36d604d0d54: restbase: (phony) keys & certs for missing/new hosts (authored by Eevans).
Wed, Feb 21, 10:19 PM
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Wed, Feb 21, 10:19 PM · Cassandra
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Wed, Feb 21, 8:57 PM · Cassandra
Eevans closed T354893: Q3:rack/setup/install restbase10[34-42] as Resolved.

@Eevans yes, we've done it already in T305568#7992643 :(
I've created the records for 3 cassandra instances (-a, -b and -c) in Netbox.

[ ... ]

I've also run the sre.dns.netbox cookbook to propagate those records to the DNS, they are now live.

Wed, Feb 21, 8:55 PM · SRE, RESTBase, ops-eqiad, DC-Ops
Eevans committed rODCTW18d322b20495: c-cqlsh is now deprecated; long live cqlsh-instance (authored by Eevans).
Wed, Feb 21, 8:47 PM
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Wed, Feb 21, 8:03 PM · Cassandra
Eevans triaged T358141: sre.cassandra.roll-restart cookbook can fail if it overlaps with a puppet run as Medium priority.
Wed, Feb 21, 7:07 PM · Cassandra
Eevans created T358141: sre.cassandra.roll-restart cookbook can fail if it overlaps with a puppet run.
Wed, Feb 21, 7:07 PM · Cassandra
Eevans updated subscribers of T354893: Q3:rack/setup/install restbase10[34-42].

@Jclark-ctr it looks like these hosts weren't allocated the additional IP addresses, do you know what is required to assign them after the fact?

Wed, Feb 21, 12:50 AM · SRE, RESTBase, ops-eqiad, DC-Ops

Feb 16 2024

Eevans added a comment to T357739: Package logstash-logback-encoder for Debian.

It looks like this could be as easy as a) installing liblogstash-logback-encoder-java (and deps¹), and then b) adding the jars to extra_classpath in hiera:

Feb 16 2024, 8:38 PM · Release-Engineering-Team (Now this 🫠), Cassandra
Eevans added a comment to T357739: Package logstash-logback-encoder for Debian.

A Debian package already exists (in testing):

Feb 16 2024, 8:00 PM · Release-Engineering-Team (Now this 🫠), Cassandra
Eevans updated the task description for T357791: Cassandra upgrades to Debian Bookworm.
Feb 16 2024, 5:18 PM · Cassandra
Eevans updated the task description for T357791: Cassandra upgrades to Debian Bookworm.
Feb 16 2024, 5:12 PM · Cassandra
Eevans updated subscribers of T357791: Cassandra upgrades to Debian Bookworm.

Officially, Cassandra supports running with Java 11 (https://cassandra.apache.org/doc/4.1/cassandra/getting_started/java11.html), which doesn't ship in Bookworm. I reckon that Java 17 ought to work (we could test it and see how things go), but if we're the only ones running it that way, there will always be some risk. @MoritzMuehlenhoff, what are our options for Java 11? Would we be able to import it from unstable? What would that mean for security support?

Feb 16 2024, 5:09 PM · Cassandra
Eevans updated the task description for T357791: Cassandra upgrades to Debian Bookworm.
Feb 16 2024, 4:58 PM · Cassandra
Eevans triaged T357791: Cassandra upgrades to Debian Bookworm as Medium priority.
Feb 16 2024, 4:58 PM · Cassandra
Eevans created T357791: Cassandra upgrades to Debian Bookworm.
Feb 16 2024, 4:42 PM · Cassandra
Eevans reopened T354893: Q3:rack/setup/install restbase10[34-42] as "Open".

@Jclark-ctr it looks like these hosts weren't allocated the additional IP addresses, do you know what is required to assign them after the fact?

Feb 16 2024, 4:21 PM · SRE, RESTBase, ops-eqiad, DC-Ops
Eevans updated the task description for T354560: Provision new RESTBase cluster nodes: restbase10[34-42].
Feb 16 2024, 4:16 PM · Cassandra
Eevans updated the task description for T354561: Decommission restbase10[19-27].
Feb 16 2024, 4:12 PM · Cassandra
Eevans merged T357788: Decommission EOL hosts: restbase10[19-27] into T354561: Decommission restbase10[19-27].
Feb 16 2024, 4:09 PM · Cassandra
Eevans moved T354561: Decommission restbase10[19-27] from Backlog to In-Progress on the Cassandra board.
Feb 16 2024, 4:08 PM · Cassandra
Eevans moved T354560: Provision new RESTBase cluster nodes: restbase10[34-42] from Backlog to In-Progress on the Cassandra board.
Feb 16 2024, 4:08 PM · Cassandra
Eevans moved T357788: Decommission EOL hosts: restbase10[19-27] from Backlog to In-Progress on the Cassandra board.
Feb 16 2024, 4:07 PM · Cassandra
Eevans created T357788: Decommission EOL hosts: restbase10[19-27].
Feb 16 2024, 4:05 PM · Cassandra
Eevans closed T353550: Cassandra (logstash) logging broken as Resolved.

The jars are deployed everywhere, all that remains are restarts:

  • restbase
  • aqs
  • cassandra-dev
  • sessionstore
  • ml-cache
Feb 16 2024, 2:05 AM · Cassandra
Eevans added a comment to T353550: Cassandra (logstash) logging broken.

@MoritzMuehlenhoff, I know we're trying to be done with python2 (and that python-is-python2 is something of a hack), but is there an alternative I should be aware of? With debmonitor showing 421 installs I'm guessing not, but I thought I should ask. :)

No one has yet taken on the work to port git-fat to Python 3; we have https://phabricator.wikimedia.org/T279509 to track this. As a workaround, we can still enable it for now (I'll follow up on the patch), but there needs to be a solution eventually, as Python 2 is completely gone in Bookworm.

Feb 16 2024, 1:11 AM · Cassandra
Eevans triaged T357739: Package logstash-logback-encoder for Debian as Medium priority.
Feb 16 2024, 1:09 AM · Release-Engineering-Team (Now this 🫠), Cassandra
Eevans created T357739: Package logstash-logback-encoder for Debian.
Feb 16 2024, 1:09 AM · Release-Engineering-Team (Now this 🫠), Cassandra
Eevans added a comment to T353550: Cassandra (logstash) logging broken.

The jars are deployed everywhere, all that remains are restarts:

  • restbase
  • aqs
  • cassandra-dev
  • sessionstore
  • ml-cache
Feb 16 2024, 12:05 AM · Cassandra

Feb 15 2024

Eevans added a comment to T353550: Cassandra (logstash) logging broken.

The jars are deployed everywhere, all that remains are restarts:

Feb 15 2024, 5:28 PM · Cassandra

Feb 14 2024

Eevans updated subscribers of T353550: Cassandra (logstash) logging broken.

I attempted a scap deploy for the restbase cluster (after updating the list of targets), but it failed because git-fat was missing. Installing git-fat and rerunning the deploy (and in some cases re-rerunning it with -f after some jars still failed to hydrate) ultimately worked. After a Cassandra restart, logging messages are now showing up in OpenSearch.

Feb 14 2024, 8:55 PM · Cassandra
Eevans reopened T354893: Q3:rack/setup/install restbase10[34-42] as "Open".

@Jclark-ctr did restbase1036 get imaged? I don't see any comments from the cookbook...

Feb 14 2024, 4:17 PM · SRE, RESTBase, ops-eqiad, DC-Ops

Feb 13 2024

Eevans closed T229475: Add monitoring to Kask service as Invalid.

@CCicalese_WMF: If you remember details and if this task is still valid, could you elaborate on which "the test" codebase and/or team this relates to, and make the task title more specific? Thanks.

Feb 13 2024, 8:12 PM · Observability-Alerting, Cassandra, Story, Code-Health
Eevans added a comment to T229475: Add monitoring to Kask service.

@Aklapper, this service is currently unowned, and I'm not sure how to move forward here. To reflect reality I'll remove MediaWiki-Engineering, but I'm happy to discuss it further to find a final solution.

Feb 13 2024, 8:02 PM · Observability-Alerting, Cassandra, Story, Code-Health
Eevans closed T356828: Decommission EOL hosts: sessionstore200[1-3] as Resolved.

Handed over to dcops via T357356; closing.

Feb 13 2024, 2:42 PM · Patch-For-Review, Cassandra
Eevans updated the task description for T357356: Decommission sessionstore200[1-3].
Feb 13 2024, 2:40 PM · SRE, ops-codfw, Cassandra, decommission-hardware

Feb 12 2024

Eevans updated the task description for T356828: Decommission EOL hosts: sessionstore200[1-3].
Feb 12 2024, 11:02 PM · Patch-For-Review, Cassandra
Eevans created T357356: Decommission sessionstore200[1-3].
Feb 12 2024, 11:01 PM · SRE, ops-codfw, Cassandra, decommission-hardware
Eevans claimed T356828: Decommission EOL hosts: sessionstore200[1-3].
Feb 12 2024, 10:55 PM · Patch-For-Review, Cassandra
Eevans closed T356829: Provision new Cassandra hosts: sessionstore200[4-6] as Resolved.

Done.

Feb 12 2024, 9:59 PM · Cassandra
Eevans closed T356829: Provision new Cassandra hosts: sessionstore200[4-6], a subtask of T356828: Decommission EOL hosts: sessionstore200[1-3], as Resolved.
Feb 12 2024, 9:59 PM · Patch-For-Review, Cassandra
Eevans moved T356828: Decommission EOL hosts: sessionstore200[1-3] from Backlog to In-Progress on the Cassandra board.
Feb 12 2024, 9:58 PM · Patch-For-Review, Cassandra