hnowlan (Hugh Nowlan)
User

User Details

User Since
Jan 6 2020, 12:19 PM (99 w, 6 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
HNowlan (WMF)

Recent Activity

Mon, Nov 29

hnowlan added a comment to T294445: API Gateway has missed its write latency SLO.

I believe this graph contains the relevant 99th percentile timing metric we need to determine how much time Envoy adds to requests. Unfortunately, we currently cannot differentiate between read and write traffic for this metric. In the short term we have two options: rewrite the SLO to use this metric as a v1.5, or rewrite our cluster definitions to give us separate read and write versions of this metric.
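
As a rough illustration of what separate read and write versions of the metric would give us, the sketch below computes a 99th percentile per traffic class from raw samples. It is not how Envoy or the gateway dashboards compute the figure; the method-to-class mapping, the function names, and the sample format are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: methods that do not modify state count as "read" traffic.
READ_METHODS = {"GET", "HEAD", "OPTIONS"}


def classify(method):
    """Label a request as read or write based on its HTTP method."""
    return "read" if method.upper() in READ_METHODS else "write"


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Ceiling division in integer arithmetic avoids float rounding surprises.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]


def p99_by_class(samples):
    """samples: iterable of (http_method, added_latency_ms) pairs."""
    buckets = defaultdict(list)
    for method, latency_ms in samples:
        buckets[classify(method)].append(latency_ms)
    return {name: percentile(vals, 99) for name, vals in buckets.items()}
```

With the two classes split like this, a read SLO and a write SLO can be evaluated against their own percentile instead of a single combined one.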

Mon, Nov 29, 5:38 PM · Patch-For-Review, Platform Team Initiatives (API Gateway), Platform Team Workboards (Platform Engineering Reliability)

Thu, Nov 25

hnowlan added a project to T295956: Proposal: add a per-service rate limit setting to API Gateway: Platform Team Workboards (Platform Engineering Reliability).
Thu, Nov 25, 4:34 PM · Platform Team Workboards (Platform Engineering Reliability), Patch-For-Review, Machine-Learning-Team, Platform Team Initiatives (API Gateway)
hnowlan added a comment to T296448: Restbase/Cassandra TLS cert expiration warnings.

Thanks for reporting this!

Thu, Nov 25, 12:32 PM · Platform Team Workboards (Platform Engineering Reliability), RESTBase-Cassandra, SRE
hnowlan closed T296448: Restbase/Cassandra TLS cert expiration warnings as Resolved.
Thu, Nov 25, 12:32 PM · Platform Team Workboards (Platform Engineering Reliability), RESTBase-Cassandra, SRE
hnowlan claimed T296448: Restbase/Cassandra TLS cert expiration warnings.
Thu, Nov 25, 11:29 AM · Platform Team Workboards (Platform Engineering Reliability), RESTBase-Cassandra, SRE

Tue, Nov 23

hnowlan added a project to T295375: Restbase migration to Buster: Generated Data Platform.
Tue, Nov 23, 5:30 PM · Generated Data Platform, Patch-For-Review, RESTBase, Platform Team Workboards (Platform Engineering Reliability)
hnowlan moved T295897: Automated application of grants for Cassandra from Work in Progress ⚙️ to QA/Review ❓ on the Generated Data Platform board.
Tue, Nov 23, 5:13 PM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Generated Data Platform, Cassandra
hnowlan moved T295897: Automated application of grants for Cassandra from Backlog to Work in Progress ⚙️ on the Generated Data Platform board.
Tue, Nov 23, 5:13 PM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Generated Data Platform, Cassandra
hnowlan moved T290149: Configure replication slots on Postgres masters from Backlog to Watching on the Platform Team Workboards (Platform Engineering Reliability) board.
Tue, Nov 23, 2:45 PM · Platform Team Workboards (Platform Engineering Reliability), Tech-Product API Roadmap, Code-Health-Objective, Platform Engineering Roadmap, Product Infrastructure Roadmap, Maps, Epic, Product-Infrastructure-Team-Backlog
hnowlan created T296288: API Gateway needs a dual logging solution.
Tue, Nov 23, 11:15 AM · Platform Team Workboards (Platform Engineering Reliability), Platform Team Initiatives (API Gateway)

Fri, Nov 19

hnowlan moved T290756: Migrate restbase production service to node12 from Backlog to Watching on the Platform Team Workboards (Platform Engineering Reliability) board.
Fri, Nov 19, 12:18 PM · Platform Engineering, Platform Team Workboards (Platform Engineering Reliability), RESTBase
hnowlan moved T268836: Documentation for Similarusers from Backlog to Watching on the Platform Team Workboards (Platform Engineering Reliability) board.
Fri, Nov 19, 12:18 PM · Platform Team Workboards (Platform Engineering Reliability), Documentation
hnowlan moved T295375: Restbase migration to Buster from Backlog to Doing on the Platform Team Workboards (Platform Engineering Reliability) board.
Fri, Nov 19, 12:18 PM · Generated Data Platform, Patch-For-Review, RESTBase, Platform Team Workboards (Platform Engineering Reliability)
hnowlan moved T295897: Automated application of grants for Cassandra from Backlog to Doing on the Platform Team Workboards (Platform Engineering Reliability) board.
Fri, Nov 19, 12:18 PM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Generated Data Platform, Cassandra
hnowlan added a project to T296081: Cassandra on Maps and AQS don't use inter-node encryption: Platform Team Workboards (Platform Engineering Reliability).
Fri, Nov 19, 12:18 PM · Maps, Platform Team Workboards (Platform Engineering Reliability), Cassandra
hnowlan created T296081: Cassandra on Maps and AQS don't use inter-node encryption.
Fri, Nov 19, 12:17 PM · Maps, Platform Team Workboards (Platform Engineering Reliability), Cassandra

Thu, Nov 18

hnowlan added a comment to T295897: Automated application of grants for Cassandra.

To be clear: will these be applied on every Puppet run, or only when the file has changed?

Thu, Nov 18, 10:32 AM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Generated Data Platform, Cassandra

Wed, Nov 17

hnowlan created T295897: Automated application of grants for Cassandra.
Wed, Nov 17, 3:04 PM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Generated Data Platform, Cassandra
hnowlan closed T235299: Cassandra cluster management support for multi-tenancy as Resolved.
Wed, Nov 17, 3:01 PM · Generated Data Platform, Platform Team Workboards (Platform Engineering Reliability), Platform Engineering (Icebox), Cassandra, User-Eevans
hnowlan added a comment to T235299: Cassandra cluster management support for multi-tenancy.

All clusters now support the definition of multiple users and grants - automatic application of grants files for users has not yet been implemented.

Wed, Nov 17, 12:48 PM · Generated Data Platform, Platform Team Workboards (Platform Engineering Reliability), Platform Engineering (Icebox), Cassandra, User-Eevans
hnowlan added a comment to T197470: find a way to systematically update the deployment server name across all repos.

For the immediate term, if there are no objections, I will replace all instances of git_server: deploy1001.eqiad.wmnet with git_server: deploy1002.eqiad.wmnet on deploy1002 (there are about 27 with the old host defined).

No objection.
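
A minimal sketch of the bulk replacement described above, assuming each repository keeps its DEPLOY_HEAD at <repo>/.git/DEPLOY_HEAD somewhere under one root directory and that the file holds simple "key: value" lines. The function names and the dry-run flag are hypothetical, not part of any existing tool.

```python
from pathlib import Path

OLD_HOST = "deploy1001.eqiad.wmnet"
NEW_HOST = "deploy1002.eqiad.wmnet"


def rewrite_deploy_head(text, old=OLD_HOST, new=NEW_HOST):
    """Point `git_server: <old>` lines at <new>; leave every other line untouched."""
    out = []
    for line in text.splitlines(keepends=True):
        key, sep, value = line.partition(":")
        if sep and key.strip() == "git_server" and value.strip() == old:
            # Preserve whatever line ending the original line had.
            ending = line[len(line.rstrip("\r\n")):]
            line = f"{key}: {new}{ending}"
        out.append(line)
    return "".join(out)


def update_tree(root, dry_run=True):
    """Return the DEPLOY_HEAD files under root that need a change; write them unless dry_run."""
    changed = []
    for path in sorted(Path(root).glob("**/.git/DEPLOY_HEAD")):
        original = path.read_text()
        updated = rewrite_deploy_head(original)
        if updated != original:
            changed.append(path)
            if not dry_run:
                path.write_text(updated)
    return changed
```

Running update_tree("/srv/deployment") with the default dry_run=True only lists the affected files, which makes it easy to confirm the count before writing anything.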

Wed, Nov 17, 10:42 AM · Release-Engineering-Team (Priority Backlog 📥), Scap, SRE

Tue, Nov 16

hnowlan added a comment to T197470: find a way to systematically update the deployment server name across all repos.

This happens because DEPLOY_HEAD retains the last-used deploy server name, and unless explicitly told to ignore it, that name is reused after the first clone:

grep deploy1001 /srv/deployment/3d2png/deploy/.git/DEPLOY_HEAD
git_server: deploy1001.eqiad.wmnet

The deploy server can be overridden on the command line: https://gerrit.wikimedia.org/r/c/operations/puppet/+/670784/1#message-96812b744409eea5e74a934fc912834fed0e7e9b

Tue, Nov 16, 11:08 AM · Release-Engineering-Team (Priority Backlog 📥), Scap, SRE
hnowlan added a comment to T269778: Introduce API versioning to Sockpuppet.

Not sure what to do to be honest, but closing seems like the easiest option - there are a lot of good-idea-but-when tickets in the Sockpuppet epic

Tue, Nov 16, 10:57 AM · Platform Engineering
hnowlan added a comment to T295717: Logstash Kafka Consumer Lag alert firing every hour.

These spikes are caused by logging on the gateway being too verbose - I've filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/739130 to turn off the debug logging, which will hopefully greatly reduce the impact of these requests on Logstash.

Tue, Nov 16, 10:48 AM · observability

Mon, Nov 15

hnowlan committed rLPRIb92951c0b0e7: cassandra: add stub values for new credentials format (authored by hnowlan).
Mon, Nov 15, 4:50 PM
hnowlan edited projects for T268836: Documentation for Similarusers, added: Platform Team Workboards (Platform Engineering Reliability); removed Platform Engineering, Platform Team Workboards (Green).
Mon, Nov 15, 3:22 PM · Platform Team Workboards (Platform Engineering Reliability), Documentation
hnowlan placed T269261: Language and project configuration for Sockpuppet up for grabs.
Mon, Nov 15, 3:00 PM · Platform Engineering, Platform Team Workboards (Green)
hnowlan added a project to T290756: Migrate restbase production service to node12: Platform Team Workboards (Platform Engineering Reliability).
Mon, Nov 15, 3:00 PM · Platform Engineering, Platform Team Workboards (Platform Engineering Reliability), RESTBase
hnowlan edited projects for T291620: Better observability/visualization for MediaWiki jobs, added: Platform Team Workboards (Platform Engineering Reliability); removed Platform Engineering.
Mon, Nov 15, 3:00 PM · Platform Team Workboards (Platform Engineering Reliability), Data-Engineering, serviceops, Wikibase change dispatching scripts to jobs
hnowlan removed a project from T294377: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet: Platform Engineering.
Mon, Nov 15, 3:00 PM · Platform Team Workboards (Platform Engineering Reliability), SRE, RESTBase, ops-codfw, DC-Ops
hnowlan removed a project from T294372: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet: Platform Engineering.
Mon, Nov 15, 3:00 PM · Platform Team Workboards (Platform Engineering Reliability), SRE, RESTBase, ops-eqiad, DC-Ops

Fri, Nov 12

hnowlan added a comment to T295485: [SPIKE] Investigate Approach for Shipping Airflow/Data Pipeline Metrics.

Some challenges present in this work from the outset - we need workers to push their metrics to the WMF Pushgateway given that the jobs are short-lived and will easily be missed by the Prometheus scrape process.
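
To make the push model concrete, here is a sketch of a short-lived job pushing gauges to a Pushgateway over plain HTTP using only the standard library. The gateway address, job name, label names, and metric names are placeholders; the URL layout and text format follow the public Pushgateway conventions, and nothing here reflects how the WMF Pushgateway is actually configured.

```python
import urllib.request
from urllib.parse import quote


def render_gauges(metrics):
    """Render {name: value} as gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def push_url(gateway, job, grouping=None):
    """Build /metrics/job/<job>/<label>/<value>... for the given grouping labels.

    Assumption: job and label values contain no "/" characters.
    """
    parts = ["metrics", "job", quote(job, safe="")]
    for label, value in sorted((grouping or {}).items()):
        parts += [quote(label, safe=""), quote(value, safe="")]
    return gateway.rstrip("/") + "/" + "/".join(parts)


def push(gateway, job, metrics, grouping=None, timeout=5):
    """PUT the metrics, replacing whatever this job and grouping pushed before."""
    request = urllib.request.Request(
        push_url(gateway, job, grouping),
        data=render_gauges(metrics).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A worker would call push() once just before exiting, so the values stay available for the next scrape even though the process itself is gone.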

Fri, Nov 12, 12:48 PM · Spike, Generated Data Platform

Thu, Nov 11

hnowlan added a comment to T235299: Cassandra cluster management support for multi-tenancy.

After considering all of the above solutions, we're ultimately going to keep things simple: do some mild Puppet rewrites, keep the grants as CQL files, and use the existing Cassandra tooling for management rather than anything more custom. Changes to grants are idempotent, so applying them automatically will not pose much danger to us, and Puppet will handle transport and versioning for free.

Thu, Nov 11, 10:27 AM · Generated Data Platform, Platform Team Workboards (Platform Engineering Reliability), Platform Engineering (Icebox), Cassandra, User-Eevans

Tue, Nov 9

hnowlan updated subscribers of T295375: Restbase migration to Buster.
Tue, Nov 9, 4:07 PM · Generated Data Platform, Patch-For-Review, RESTBase, Platform Team Workboards (Platform Engineering Reliability)
hnowlan created T295375: Restbase migration to Buster.
Tue, Nov 9, 4:05 PM · Generated Data Platform, Patch-For-Review, RESTBase, Platform Team Workboards (Platform Engineering Reliability)
hnowlan added a comment to T295324: AuthZN Sockpuppet Model via Liftwing API.

I have concerns about whether this work is the right thing to do and whether the API gateway is the right place for it.
If the API gateway were to implement authentication and authorisation, it would have to do so in a manner that was totally reusable for all WMF applications rather than specific to one application. That not only assumes an amount of time and effort invested in doing so, it also means implementation of secondary services for the gateway itself to communicate with (https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ext_authz/v3/ext_authz.proto#envoy-v3-api-msg-extensions-filters-http-ext-authz-v3-httpservice). What will be the source of truth for requests - MediaWiki? Another source?

Tue, Nov 9, 10:03 AM · Foundational Technology Requests
hnowlan closed T265722: New Service Request: Sockpuppet Detection, a subtask of T259471: Sockpuppet detection API [low effort], as Resolved.
Tue, Nov 9, 9:57 AM · Product-Feature, Platform Engineering Roadmap, Tech-Product API Roadmap, Anti-Harassment
hnowlan closed T265722: New Service Request: Sockpuppet Detection as Resolved.
Tue, Nov 9, 9:57 AM · Platform Engineering, Platform Team Workboards (Green)

Mon, Nov 8

hnowlan moved T281257: Build helm chart for the service using the docker container from In review to Watching on the Platform Team Workboards (Platform Engineering Reliability) board.
Mon, Nov 8, 11:55 AM · Platform Team Workboards (Platform Engineering Reliability)
hnowlan moved T289583: Define API Gateway Staging Environment from In review to Done on the Platform Team Workboards (Platform Engineering Reliability) board.
Mon, Nov 8, 11:55 AM · Platform Team Workboards (Platform Engineering Reliability), Platform Team Initiatives (API Gateway), API Platform
hnowlan added a comment to T289583: Define API Gateway Staging Environment.

Staging service hierarchy has been added to the API gateway and is live. I've updated the wikitech docs to reflect this with some basics on how to use it: https://wikitech.wikimedia.org/wiki/API_Gateway#Staging

Mon, Nov 8, 11:54 AM · Platform Team Workboards (Platform Engineering Reliability), Platform Team Initiatives (API Gateway), API Platform

Nov 4 2021

hnowlan closed T261966: Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java as Resolved.
Nov 4 2021, 5:55 PM · Platform Team Workboards (Platform Engineering Reliability), Patch-For-Review, Cassandra, SRE
hnowlan closed T261966: Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java, a subtask of T264174: Migrate remaining services using Java to profile::java , as Resolved.
Nov 4 2021, 5:55 PM · User-MoritzMuehlenhoff, SRE
hnowlan moved T261966: Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java from Doing to Done on the Platform Team Workboards (Platform Engineering Reliability) board.
Nov 4 2021, 5:54 PM · Platform Team Workboards (Platform Engineering Reliability), Patch-For-Review, Cassandra, SRE
hnowlan moved T285857: Deploy wikidiff2 1.13.0 from Doing to Done on the Platform Team Workboards (Platform Engineering Reliability) board.
Nov 4 2021, 5:54 PM · User-notice, Community-Tech, Platform Team Workboards (Platform Engineering Reliability), serviceops, SRE, wikidiff2
hnowlan moved T261966: Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java from Backlog to Doing on the Platform Team Workboards (Platform Engineering Reliability) board.
Nov 4 2021, 5:14 PM · Platform Team Workboards (Platform Engineering Reliability), Patch-For-Review, Cassandra, SRE
hnowlan closed T295056: Puppet failing on deployment-restbase03 as Resolved.
Nov 4 2021, 4:20 PM · Beta-Cluster-Infrastructure, RESTBase
hnowlan added a comment to T295056: Puppet failing on deployment-restbase03.

Thanks for the report - looks like this was fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/736784

Nov 4 2021, 4:18 PM · Beta-Cluster-Infrastructure, RESTBase

Nov 3 2021

hnowlan added a comment to T285857: Deploy wikidiff2 1.13.0.

PHP 7.4 images have been bumped via https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/4de99a08bc2eb983f3aca798ff33fd22976c65e8

Nov 3 2021, 3:54 PM · User-notice, Community-Tech, Platform Team Workboards (Platform Engineering Reliability), serviceops, SRE, wikidiff2
hnowlan updated the task description for T285857: Deploy wikidiff2 1.13.0.
Nov 3 2021, 3:53 PM · User-notice, Community-Tech, Platform Team Workboards (Platform Engineering Reliability), serviceops, SRE, wikidiff2
hnowlan added a comment to T285857: Deploy wikidiff2 1.13.0.

The deployment to codfw's API servers will finish in the next few minutes; all others are complete. Please let me know if the changes are present and whether there are any issues!

Nov 3 2021, 3:53 PM · User-notice, Community-Tech, Platform Team Workboards (Platform Engineering Reliability), serviceops, SRE, wikidiff2

Nov 2 2021

hnowlan created P17655 (An Untitled Masterwork).
Nov 2 2021, 3:10 PM
hnowlan added a comment to T285857: Deploy wikidiff2 1.13.0.

I've deployed wikidiff2-1.13.0-1 to the canaries and will deploy to the rest of production tomorrow. For reference this is the debdeploy file used:

hnowlan@cumin1001:~$ cat 2021-11-01-wikidiff2.yaml
comment: RTL compatibility fixes T285857
fixes:
  bullseye: ''
  buster: 1.13.0-1
  stretch: ''
libraries: []
source: wikidiff2
transitions: {}
update_type: library
Nov 2 2021, 2:41 PM · User-notice, Community-Tech, Platform Team Workboards (Platform Engineering Reliability), serviceops, SRE, wikidiff2

Nov 1 2021

hnowlan created P17647 wikidiff2 deploy.
Nov 1 2021, 3:15 PM

Oct 27 2021

hnowlan triaged T294445: API Gateway has missed its write latency SLO as High priority.
Oct 27 2021, 4:20 PM · Patch-For-Review, Platform Team Initiatives (API Gateway), Platform Team Workboards (Platform Engineering Reliability)
hnowlan moved T235299: Cassandra cluster management support for multi-tenancy from Backlog to In-Progress on the Cassandra board.
Oct 27 2021, 4:02 PM · Generated Data Platform, Platform Team Workboards (Platform Engineering Reliability), Platform Engineering (Icebox), Cassandra, User-Eevans
hnowlan edited projects for T290756: Migrate restbase production service to node12, added: Platform Engineering; removed Platform Team Workboards (Platform Engineering Reliability).
Oct 27 2021, 3:55 PM · Platform Engineering, Platform Team Workboards (Platform Engineering Reliability), RESTBase
hnowlan moved T294445: API Gateway has missed its write latency SLO from Backlog to Doing on the Platform Team Workboards (Platform Engineering Reliability) board.
Oct 27 2021, 3:50 PM · Patch-For-Review, Platform Team Initiatives (API Gateway), Platform Team Workboards (Platform Engineering Reliability)
hnowlan moved T294377: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet from Backlog to Watching on the Platform Team Workboards (Platform Engineering Reliability) board.
Oct 27 2021, 3:50 PM · Platform Team Workboards (Platform Engineering Reliability), SRE, RESTBase, ops-codfw, DC-Ops
hnowlan moved T294372: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet from Backlog to Watching on the Platform Team Workboards (Platform Engineering Reliability) board.
Oct 27 2021, 3:50 PM · Platform Team Workboards (Platform Engineering Reliability), SRE, RESTBase, ops-eqiad, DC-Ops
hnowlan created T294445: API Gateway has missed its write latency SLO.
Oct 27 2021, 3:45 PM · Patch-For-Review, Platform Team Initiatives (API Gateway), Platform Team Workboards (Platform Engineering Reliability)
hnowlan updated the task description for T294377: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet.
Oct 27 2021, 11:40 AM · Platform Team Workboards (Platform Engineering Reliability), SRE, RESTBase, ops-codfw, DC-Ops
hnowlan updated the task description for T294372: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet.
Oct 27 2021, 11:40 AM · Platform Team Workboards (Platform Engineering Reliability), SRE, RESTBase, ops-eqiad, DC-Ops
hnowlan reassigned T294372: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet from hnowlan to Papaul.
Oct 27 2021, 11:37 AM · Platform Team Workboards (Platform Engineering Reliability), SRE, RESTBase, ops-eqiad, DC-Ops
hnowlan reassigned T294377: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet from hnowlan to Papaul.
Oct 27 2021, 11:06 AM · Platform Team Workboards (Platform Engineering Reliability), SRE, RESTBase, ops-codfw, DC-Ops

Oct 19 2021

hnowlan added a comment to T141541: Certs from cassandra-ca-manager should have the FQDN in cert's CN.

I've seen this exception once or twice before, but only ever when there was something wrong with the file itself. Not to say this couldn't be a red herring of some sort though...

Could it be file permissions or something?

Oct 19 2021, 4:47 PM · Platform Team Workboards (Platform Engineering Reliability), Platform Team Legacy (Later), Services (later), Cassandra

Oct 18 2021

hnowlan added a comment to T235299: Cassandra cluster management support for multi-tenancy.

Current WIP format for grants management:

roles:
  cassandra:
    login: true
    member_of: []
    permissions:
      data/local_group_default_T_pageviews_per_project_v2:
      - ALTER
      - AUTHORIZE
      - CREATE
      - DROP
      - MODIFY
      - SELECT
      data/local_group_default_T_pageviews_per_project_v2/data:
      - ALTER
      - AUTHORIZE
      - DROP
      - MODIFY
      - SELECT
      data/local_group_default_T_pageviews_per_project_v2/meta:
      - ALTER
      - AUTHORIZE
      - DROP
      - MODIFY
      - SELECT
      functions/local_group_default_T_pageviews_per_project_v2:
      - ALTER
      - AUTHORIZE
      - CREATE
      - DROP
      - EXECUTE
      roles/itest_role:
      - ALTER
      - AUTHORIZE
      - DROP
      roles/itest_user:
      - ALTER
      - AUTHORIZE
      - DROP
    superuser: true
  itest_role:
    login: false
    member_of: []
    permissions:
      data/local_group_default_T_pageviews_per_project_v2/data:
      - ALTER
    superuser: false
  itest_user:
    login: true
    member_of:
    - itest_role
    permissions:
      data/local_group_default_T_pageviews_per_project_v2/data:
      - DROP
      data/local_group_default_T_pageviews_per_project_v2/meta:
      - ALTER
      - AUTHORIZE
      - DROP
      - MODIFY
      - SELECT
    superuser: false
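
One way a structure like the one above could be turned into CQL is sketched below. It is an illustration under assumptions, not the tooling in use: it takes the already-parsed roles mapping as a plain dict, reads data/<keyspace>[/<table>], functions/<keyspace> and roles/<role> as Cassandra resource paths, and only emits GRANT statements (role creation, login and superuser flags, and revocation are left out). The helper names are hypothetical.

```python
def quote_ident(name):
    """Double-quote a CQL identifier so mixed-case names keep their case."""
    return '"' + name.replace('"', '""') + '"'


def resource_to_cql(resource):
    """Map a resource path from the grants structure to a CQL resource clause."""
    kind, _, rest = resource.partition("/")
    parts = rest.split("/") if rest else []
    if kind == "data" and len(parts) == 1:
        return f"KEYSPACE {quote_ident(parts[0])}"
    if kind == "data" and len(parts) == 2:
        return f"TABLE {quote_ident(parts[0])}.{quote_ident(parts[1])}"
    if kind == "functions" and len(parts) == 1:
        return f"ALL FUNCTIONS IN KEYSPACE {quote_ident(parts[0])}"
    if kind == "roles" and len(parts) == 1:
        return f"ROLE {quote_ident(parts[0])}"
    raise ValueError(f"unsupported resource: {resource}")


def render_grants(roles):
    """Return GRANT statements for role membership and per-resource permissions."""
    statements = []
    for role, spec in sorted(roles.items()):
        for parent in spec.get("member_of", []):
            statements.append(f"GRANT {quote_ident(parent)} TO {quote_ident(role)};")
        for resource, perms in sorted(spec.get("permissions", {}).items()):
            target = resource_to_cql(resource)
            for perm in sorted(perms):
                statements.append(f"GRANT {perm} ON {target} TO {quote_ident(role)};")
    return statements
```

The output is sorted, so rendering the same input twice yields byte-identical CQL, which keeps diffs small when the file is versioned.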
Oct 18 2021, 2:13 PM · Generated Data Platform, Platform Team Workboards (Platform Engineering Reliability), Platform Engineering (Icebox), Cassandra, User-Eevans

Oct 14 2021

Eevans awarded T291738: Degraded RAID on sessionstore1003 a Cookie token.
Oct 14 2021, 9:17 PM · Platform Engineering, SRE
hnowlan closed T291738: Degraded RAID on sessionstore1003 as Resolved.
Oct 14 2021, 4:49 PM · Platform Engineering, SRE
hnowlan added a comment to T291738: Degraded RAID on sessionstore1003.

Thanks @Jclark-ctr and @Cmjohnson! I have remirrored the disk via:

sfdisk -d /dev/sda | sfdisk /dev/sdb
Oct 14 2021, 2:07 PM · Platform Engineering, SRE
hnowlan removed a project from T291738: Degraded RAID on sessionstore1003: ops-eqiad.
Oct 14 2021, 2:05 PM · Platform Engineering, SRE

Oct 12 2021

hnowlan claimed T235299: Cassandra cluster management support for multi-tenancy.
Oct 12 2021, 2:18 PM · Generated Data Platform, Platform Team Workboards (Platform Engineering Reliability), Platform Engineering (Icebox), Cassandra, User-Eevans

Oct 8 2021

hnowlan added a comment to T291472: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances.

Just a note - if you're importing tables from two racks, doing a nodetool cleanup will probably return a significant amount of space from orphaned sstables, but will also probably take a while given the size of the data being imported.

Oct 8 2021, 2:40 PM · Data-Engineering-Kanban, Data-Engineering, Analytics, Analytics-Kanban

Oct 5 2021

hnowlan added a comment to T141541: Certs from cassandra-ca-manager should have the FQDN in cert's CN.

The attempted FQDN-use method appears to have failed - Cassandra claims there is an issue with the keystore format despite it being the same format/method as before:

INFO  [main] 2021-10-05 11:43:09,282 IndexSummaryManager.java:80 - Initializing index summary manager with a memory pool size of 614 MB and a resize interval of 60 minutes
ERROR [main] 2021-10-05 11:43:09,297 CassandraDaemon.java:749 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Unable to create ssl socket
        at org.apache.cassandra.net.MessagingService.getServerSockets(MessagingService.java:701) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.net.MessagingService.listen(MessagingService.java:681) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.net.MessagingService.listen(MessagingService.java:665) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:796) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:683) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:632) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:388) [apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:620) [apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:732) [apache-cassandra-3.11.4.jar:3.11.4]
Caused by: java.io.IOException: Error creating the initializing the SSL Context
        at org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:201) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.security.SSLFactory.getServerSocket(SSLFactory.java:61) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.net.MessagingService.getServerSockets(MessagingService.java:697) ~[apache-cassandra-3.11.4.jar:3.11.4]
        ... 8 common frames omitted
Caused by: java.io.IOException: Invalid keystore format
        at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:666) ~[na:1.8.0_302]
        at sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:57) ~[na:1.8.0_302]
        at sun.security.provider.KeyStoreDelegator.engineLoad(KeyStoreDelegator.java:224) ~[na:1.8.0_302]
        at sun.security.provider.JavaKeyStore$DualFormatJKS.engineLoad(JavaKeyStore.java:71) ~[na:1.8.0_302]
        at java.security.KeyStore.load(KeyStore.java:1445) ~[na:1.8.0_302]
        at org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:179) ~[apache-cassandra-3.11.4.jar:3.11.4]
        ... 10 common frames omitted
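
A quick first check for an "Invalid keystore format" error is to look at the leading bytes of the keystore file. The sketch below does that in Python; it relies on the well-known JKS and JCEKS magic numbers and treats a leading 0x30 byte as a hint of DER-encoded data such as PKCS12. It is a diagnostic aid under those assumptions and does not validate the keystore contents or password.

```python
JKS_MAGIC = bytes.fromhex("feedfeed")
JCEKS_MAGIC = bytes.fromhex("cececece")


def sniff_keystore(header):
    """Guess the container type from the first bytes of a keystore file."""
    if not header:
        return "empty"
    if header.startswith(JKS_MAGIC):
        return "JKS"
    if header.startswith(JCEKS_MAGIC):
        return "JCEKS"
    if header.startswith(b"-----BEGIN"):
        return "PEM"
    if header[:1] == b"\x30":
        # 0x30 is an ASN.1 SEQUENCE tag; PKCS12 files start this way.
        return "DER (possibly PKCS12)"
    return "unknown"


def sniff_keystore_file(path):
    """Read the first 16 bytes of the file at path and classify them."""
    with open(path, "rb") as handle:
        return sniff_keystore(handle.read(16))
```

If the result is "PEM", "empty" or "unknown", the file was probably written in a different form than the configured store type expects, which fits the earlier observation that this exception tends to appear when something is wrong with the file itself.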
Oct 5 2021, 11:52 AM · Platform Team Workboards (Platform Engineering Reliability), Platform Team Legacy (Later), Services (later), Cassandra

Oct 4 2021

hnowlan added a comment to T291912: Clarify in Wikifeeds documention the request flows.

@elukey, @hnowlan Let me know if this is more clear now.

Oct 4 2021, 11:59 AM · serviceops, Sustainability (Incident Followup), Wikifeeds

Sep 30 2021

hnowlan committed rLPRIbaed37f78e35: secrets: Clean up restbase stub certificates (authored by hnowlan).
Sep 30 2021, 4:01 PM
hnowlan updated the task description for T285857: Deploy wikidiff2 1.13.0.
Sep 30 2021, 12:17 PM · User-notice, Community-Tech, Platform Team Workboards (Platform Engineering Reliability), serviceops, SRE, wikidiff2
hnowlan updated the task description for T285857: Deploy wikidiff2 1.13.0.
Sep 30 2021, 12:09 PM · User-notice, Community-Tech, Platform Team Workboards (Platform Engineering Reliability), serviceops, SRE, wikidiff2
hnowlan added a comment to T285857: Deploy wikidiff2 1.13.0.

wikidiff2 1.13.0 is now installed on the beta cluster.

Sep 30 2021, 12:09 PM · User-notice, Community-Tech, Platform Team Workboards (Platform Engineering Reliability), serviceops, SRE, wikidiff2

Sep 29 2021

hnowlan moved T249755: Cassandra3 migration for Analytics AQS from Backlog to Watching on the Platform Team Workboards (Platform Engineering Reliability) board.
Sep 29 2021, 11:45 AM · Analytics-Clusters, Platform Team Workboards (Platform Engineering Reliability), Data-Engineering-Kanban, Epic, Data-Engineering, Cassandra
hnowlan moved T235299: Cassandra cluster management support for multi-tenancy from Backlog to Doing on the Platform Team Workboards (Platform Engineering Reliability) board.
Sep 29 2021, 11:40 AM · Generated Data Platform, Platform Team Workboards (Platform Engineering Reliability), Platform Engineering (Icebox), Cassandra, User-Eevans
hnowlan moved T285857: Deploy wikidiff2 1.13.0 from Backlog to Doing on the Platform Team Workboards (Platform Engineering Reliability) board.
Sep 29 2021, 11:38 AM · User-notice, Community-Tech, Platform Team Workboards (Platform Engineering Reliability), serviceops, SRE, wikidiff2
hnowlan closed T289852: Maps postgres read replicas throws errors on eqiad as Resolved.
Sep 29 2021, 11:38 AM · Platform Team Workboards (Platform Engineering Reliability), Product-Data-Infrastructure (Backlog), SRE, Maps
hnowlan moved T141541: Certs from cassandra-ca-manager should have the FQDN in cert's CN from Backlog to In review on the Platform Team Workboards (Platform Engineering Reliability) board.
Sep 29 2021, 11:38 AM · Platform Team Workboards (Platform Engineering Reliability), Platform Team Legacy (Later), Services (later), Cassandra
hnowlan moved T289583: Define API Gateway Staging Environment from Backlog to In review on the Platform Team Workboards (Platform Engineering Reliability) board.
Sep 29 2021, 11:37 AM · Platform Team Workboards (Platform Engineering Reliability), Platform Team Initiatives (API Gateway), API Platform
hnowlan changed the status of T141541: Certs from cassandra-ca-manager should have the FQDN in cert's CN from Open to In Progress.
Sep 29 2021, 11:37 AM · Platform Team Workboards (Platform Engineering Reliability), Platform Team Legacy (Later), Services (later), Cassandra

Sep 27 2021

hnowlan added a comment to T291738: Degraded RAID on sessionstore1003.

Given that this host is in a single rack and the /srv/ partition is on an array rather than JBOD, I think this host can just be shut down and the disk replaced.

Sep 27 2021, 10:18 AM · Platform Engineering, SRE
hnowlan claimed T291738: Degraded RAID on sessionstore1003.
Sep 27 2021, 9:40 AM · Platform Engineering, SRE

Sep 20 2021

hnowlan closed T288269: Make Cassandra puppet configuration multi-tenant as Declined.
Sep 20 2021, 3:12 PM · Platform Team Workboards (Platform Engineering Reliability)

Sep 16 2021

hnowlan added a comment to T187260: Reporting of wide Cassandra partitions.

Is this work still relevant? https://phabricator.wikimedia.org/T187255#5066229 implies not but I'm not certain. Are wide partitions not a greatly reduced concern under 3.11?
The relevant panel is now missing from the dashboard and may need updating/recreating.

Sep 16 2021, 12:36 PM · Platform Team Legacy (Later), Services (next), RESTBase-Cassandra, Cassandra
hnowlan closed T275350: Postgres replication lagging on maps[12]008 as Resolved.
Sep 16 2021, 10:51 AM · Discovery-Search, Maps

Sep 15 2021

hnowlan committed rODCTW8f57e3c74f12: Warn when no instance name is passed. (authored by hnowlan).
Sep 15 2021, 10:10 PM
hnowlan added a comment to T288134: Deploy prototype API.

@hnowlan As I understand it today, a deployment to staging k8s typically needs a +2 from SRE. Is there anything you have in mind that we could incorporate as a check to make this more automated and less manual?

Sep 15 2021, 11:39 AM · API Platform

Sep 14 2021

hnowlan added a comment to T288131: Stand up storage.

@hnowlan , do you disagree with anything I said? I'd love it if there were a better answer.

Sep 14 2021, 10:10 AM · API Platform

Sep 9 2021

hnowlan closed T264292: Migrate maps to Buster as Resolved.
Sep 9 2021, 9:52 AM · Product-Infrastructure-Team-Backlog, Maps, SRE
hnowlan closed T264292: Migrate maps to Buster, a subtask of T247045: Migrate all of production metal and VMs to Buster or later, as Resolved.
Sep 9 2021, 9:52 AM · SRE, Epic
hnowlan closed T269582: [OSM] perform imposm3 migration in production as Resolved.
Sep 9 2021, 9:51 AM · Product-Infrastructure-Team-Backlog (Kanban), Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Maps
hnowlan closed T269582: [OSM] perform imposm3 migration in production, a subtask of T260456: [Maps] Reduce Map Sync Latency with OpenStreetMaps (OSM), as Resolved.
Sep 9 2021, 9:50 AM · Product Infrastructure Roadmap, Maps, Epic, Product-Infrastructure-Team-Backlog
hnowlan closed T269582: [OSM] perform imposm3 migration in production, a subtask of T264292: Migrate maps to Buster, as Resolved.
Sep 9 2021, 9:50 AM · Product-Infrastructure-Team-Backlog, Maps, SRE
hnowlan moved T269582: [OSM] perform imposm3 migration in production from Watching to Done on the Platform Team Workboards (Platform Engineering Reliability) board.
Sep 9 2021, 9:50 AM · Product-Infrastructure-Team-Backlog (Kanban), Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), Maps