Stevemunene (Stevemunene)
Disabled

User Details

User Since
Nov 1 2022, 1:30 PM (162 w, 6 d)
Roles
Disabled
LDAP User
Stevemunene
MediaWiki User
SMunene-WMF [ Global Accounts ]

Recent Activity

Nov 5 2025

Stevemunene claimed T409271: Airflow image on DSE failing to get inspected by Debmonitor.
Nov 5 2025, 12:15 PM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
Stevemunene renamed T409151: Upgrade the superset memcached image to Trixie. from Upgrade the superset-production-memcached image to Trixie. to Upgrade the superset memcached image to Trixie..
Nov 5 2025, 11:56 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
Stevemunene renamed T409151: Upgrade the superset memcached image to Trixie. from Upgrade the superset-production-memcached image to bullseye. to Upgrade the superset-production-memcached image to Trixie..
Nov 5 2025, 11:55 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
Stevemunene moved T409271: Airflow image on DSE failing to get inspected by Debmonitor from Backlog - project to Tracking on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Nov 5 2025, 11:31 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
Stevemunene edited projects for T409271: Airflow image on DSE failing to get inspected by Debmonitor, added: Data-Platform-SRE (2025.10.17 - 2025.11.07); removed Data-Platform-SRE.

Hi @elukey ,
We are currently testing a new image for airflow over on T408711, so this image should be deprecated soon.

Nov 5 2025, 11:31 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
Stevemunene added a comment to T408711: Deploy airflow images from airflow-dags repository build.

The issue was with the port. Re-exposed it and was able to access the pages.

Nov 5 2025, 10:43 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
Stevemunene added a comment to T408711: Deploy airflow images from airflow-dags repository build.

Created a devenv to test the recent upgrade of flask-appbuilder, which is meant to solve some of the initial challenges we had accessing the connection/list/ and variable/list/ pages.
To confirm, these are currently accessible using the default image, as shown below.

image.png (342×743 px, 28 KB)

image.png (393×743 px, 38 KB)

Nov 5 2025, 10:10 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
Stevemunene created T409262: Remove rundev from stat hosts.
Nov 5 2025, 8:16 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
Stevemunene updated the task description for T405232: User Migration from Run dev instances to airflow devenv..
Nov 5 2025, 8:14 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28), Epic
Stevemunene updated the task description for T405232: User Migration from Run dev instances to airflow devenv..
Nov 5 2025, 8:13 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28), Epic
Stevemunene closed T408180: Post-Migration Support for Airflow Development Environments, a subtask of T405232: User Migration from Run dev instances to airflow devenv., as Resolved.
Nov 5 2025, 8:12 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28), Epic
Stevemunene closed T408180: Post-Migration Support for Airflow Development Environments as Resolved.
Nov 5 2025, 8:12 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
Stevemunene updated the task description for T408180: Post-Migration Support for Airflow Development Environments.
Nov 5 2025, 8:11 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
Stevemunene added a comment to T408180: Post-Migration Support for Airflow Development Environments.

Results from the recent survey indicate that the overall experience with the new environment has been quite favourable.

image.png (545×758 px, 38 KB)

The migration process scored a 4/5
image.png (545×758 px, 40 KB)

There were some bugs during the migration process, which were recorded as promptly fixed, namely:
https://phabricator.wikimedia.org/T405485, https://phabricator.wikimedia.org/T399235
And the availability of documentation and support for the process scored a 3/5
image.png (545×758 px, 41 KB)

Nov 5 2025, 8:11 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
Stevemunene added a comment to T406222: Add druid coordinator service to LVS for the druid_public cluster..

We have had some delays due to scheduling conflicts and PTO. However, we have found some middle ground and have a slot for the deploy on 5th Nov, anytime between 10:00 GMT and 12:00 GMT.

Nov 5 2025, 7:44 AM · Traffic, Data-Platform-SRE (2025.11.07 - 2025.11.28), Patch-For-Review, Essential-Work
Stevemunene added a comment to T409151: Upgrade the superset memcached image to Trixie..

We currently do not have a memcached image later than Buster. We are running memcached 1.6.6 from a custom backport on Buster, while 1.6.38 is being built on Trixie. The changes should be minimal, and this version should work the same without requiring any other changes. We shall deploy the new image once it is ready.

Nov 5 2025, 7:42 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)

Nov 4 2025

Stevemunene closed T396072: Monitor flink-operator in dse-k8s-eqiad as Resolved.

The flink-kubernetes-operator monitoring is already set up and sends alerts to the Data Platform SREs via team-data-platform. I don't think there is more to do for the ticket, so I shall move to close it.

Nov 4 2025, 8:16 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene updated the task description for T408643: OpenSearch on K8s: Deploy an OpenSearch cluster in dse-k8s-codfw.
Nov 4 2025, 4:07 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), OKR-Work
Stevemunene moved T408123: WDQS: Log `x-ja3n` and `x-is-browser` in nginx from Backlog - operations to In Progress on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Nov 4 2025, 12:10 PM · Essential-Work, Wikidata, Wikidata-Query-Service, Data-Platform-SRE (2025.10.17 - 2025.11.07)
Stevemunene updated the task description for T409151: Upgrade the superset memcached image to Trixie..
Nov 4 2025, 11:04 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
Stevemunene moved T409151: Upgrade the superset memcached image to Trixie. from Backlog - project to In Progress on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.

Only the memcached container is running Buster.

Nov 4 2025, 10:08 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
Stevemunene created T409151: Upgrade the superset memcached image to Trixie..
Nov 4 2025, 9:20 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
Stevemunene added a comment to T408711: Deploy airflow images from airflow-dags repository build.

Since we still have some dumps running, we shall continue the tests on a devenv and proceed on the test-k8s instance once all is verified.

Nov 4 2025, 8:46 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work

Oct 31 2025

Stevemunene closed T408779: Create an opensearch namespace for codfw as Resolved.
Oct 31 2025, 9:58 AM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
Stevemunene closed T408779: Create an opensearch namespace for codfw, a subtask of T408643: OpenSearch on K8s: Deploy an OpenSearch cluster in dse-k8s-codfw, as Resolved.
Oct 31 2025, 9:58 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), OKR-Work

Oct 30 2025

Stevemunene updated subscribers of T408711: Deploy airflow images from airflow-dags repository build.
Oct 30 2025, 3:41 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
Stevemunene added a comment to T408711: Deploy airflow images from airflow-dags repository build.

Deployed this, but getting an error on Kerberos.

Oct 30 2025, 2:28 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
Stevemunene added a comment to T408180: Post-Migration Support for Airflow Development Environments.

We had the first support/office hours session, albeit with no turnout. I think we can move such sessions to an as-needed basis.
Sharing a feedback survey form to rate the general sentiments on the move and possible improvements: https://docs.google.com/forms/d/e/1FAIpQLScbc1wKiFG5PU4teTFN4cYFR-iKw80uLKzod3ZNuNrgMj8nHA/viewform?usp=dialog

Oct 30 2025, 12:53 PM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
Stevemunene renamed T403955: Switch all hard coded druid_public host urls to druid-public-coordinator svc url from Switch all hard coded druid_public host urls to druid-public broker svc url to Switch all hard coded druid_public host urls to druid-public-coordinator svc url.
Oct 30 2025, 11:58 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Patch-For-Review
Stevemunene created T408779: Create an opensearch namespace for codfw.
Oct 30 2025, 8:20 AM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
Stevemunene claimed T408643: OpenSearch on K8s: Deploy an OpenSearch cluster in dse-k8s-codfw.
Oct 30 2025, 8:19 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), OKR-Work

Oct 29 2025

Stevemunene moved T408711: Deploy airflow images from airflow-dags repository build from Backlog - project to In Progress on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Oct 29 2025, 3:54 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
Stevemunene created T408711: Deploy airflow images from airflow-dags repository build.
Oct 29 2025, 3:34 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
Stevemunene closed T408189: Increase the size of the Druid broker cache size from 2GB to 4GB as Resolved.

Ran puppet on the hosts, then restarted the druid daemons.

Oct 29 2025, 2:13 PM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering
Stevemunene closed T405485: airflow-devenv fails to create when --name is long-ish as Resolved.
Oct 29 2025, 9:40 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene claimed T408123: WDQS: Log `x-ja3n` and `x-is-browser` in nginx.
Oct 29 2025, 9:36 AM · Essential-Work, Wikidata, Wikidata-Query-Service, Data-Platform-SRE (2025.10.17 - 2025.11.07)

Oct 28 2025

Stevemunene moved T408189: Increase the size of the Druid broker cache size from 2GB to 4GB from In Progress to To Be Deployed on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Oct 28 2025, 3:21 PM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering
Stevemunene moved T406222: Add druid coordinator service to LVS for the druid_public cluster. from In Progress to To Be Deployed on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Oct 28 2025, 3:18 PM · Traffic, Data-Platform-SRE (2025.11.07 - 2025.11.28), Patch-For-Review, Essential-Work
Stevemunene moved T408189: Increase the size of the Druid broker cache size from 2GB to 4GB from Backlog - operations to In Progress on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Oct 28 2025, 12:37 PM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering
Stevemunene claimed T408189: Increase the size of the Druid broker cache size from 2GB to 4GB.
Oct 28 2025, 7:21 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering
Stevemunene claimed T396072: Monitor flink-operator in dse-k8s-eqiad.
Oct 28 2025, 7:20 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene moved T405485: airflow-devenv fails to create when --name is long-ish from Blocked/Waiting to Done on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.

T408365 is unblocked and we have deployed the new version.

Oct 28 2025, 7:19 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work

Oct 27 2025

Stevemunene closed T408365: Airflow devenv deb package build failure as Resolved.

Then update the sources and upgrade from the cumin host

sudo cumin 'deploy*' 'apt-get update && apt-get upgrade -y airflow-devenv'
Oct 27 2025, 12:41 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene closed T408365: Airflow devenv deb package build failure, a subtask of T405485: airflow-devenv fails to create when --name is long-ish, as Resolved.
Oct 27 2025, 12:41 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene added a comment to T408365: Airflow devenv deb package build failure.

We solved this by building for bookworm, which had all the prerequisite packages already available.

Oct 27 2025, 12:30 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene claimed T408365: Airflow devenv deb package build failure.
Oct 27 2025, 8:30 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene moved T408180: Post-Migration Support for Airflow Development Environments from Backlog - operations to In Progress on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Oct 27 2025, 7:50 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
Stevemunene moved T405485: airflow-devenv fails to create when --name is long-ish from In Progress to Blocked/Waiting on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Oct 27 2025, 7:30 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene updated the task description for T405485: airflow-devenv fails to create when --name is long-ish.
Oct 27 2025, 6:42 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene created T408365: Airflow devenv deb package build failure.
Oct 27 2025, 6:41 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work

Oct 24 2025

Stevemunene added a comment to T405485: airflow-devenv fails to create when --name is long-ish.

Tried building with bookworm packages from https://packages.debian.org/search?keywords=pybuild-plugin-pyproject and https://packages.debian.org/bookworm/python3-poetry-core but the end result is still the same with both of these invocations:
APT_USE_BUILT=yes BUILDRESULT=/home/stevemunene/pydep/pybuild-plugin-pyproject_5.20230130+deb12u1_all.deb BUILDRESULT=/home/stevemunene/pydep/python3-poetry-core_1.4.0-4_all.deb DIST=bullseye pdebuild
APT_USE_BUILT=yes BUILDRESULT=/home/stevemunene/pydep DIST=bullseye pdebuild

Oct 24 2025, 9:43 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene created T408180: Post-Migration Support for Airflow Development Environments.
Oct 24 2025, 7:22 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)

Oct 23 2025

Stevemunene added a comment to T405485: airflow-devenv fails to create when --name is long-ish.

Some observations,
When building with BACKPORTS=yes DIST=bullseye-wikimedia pdebuild, all the required packages are built, even though there is no backports mirror and apt reports: E: The repository 'http://mirrors.wikimedia.org/debian bullseye-backports Release' does not have a Release file.

Oct 23 2025, 12:36 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work

Oct 22 2025

Stevemunene closed T407799: Increase the nginx proxy timeouts in superset to 185 seconds as Resolved.

Added the two options to nginx, and we no longer seem to be hitting the timeouts previously seen on some charts.
Marking this as done while monitoring for any potential issues.

Oct 22 2025, 3:45 PM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)
Stevemunene added a comment to T405485: airflow-devenv fails to create when --name is long-ish.

The error above was due to T383557. Removed the backports option, but I am running into this error now

Oct 22 2025, 3:42 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene added a comment to T405485: airflow-devenv fails to create when --name is long-ish.

Getting an error setting up the build on the build2002 host, which I am looking into.

Oct 22 2025, 2:44 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene updated the task description for T405232: User Migration from Run dev instances to airflow devenv..
Oct 22 2025, 1:40 PM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28), Epic
Stevemunene moved T403955: Switch all hard coded druid_public host urls to druid-public-coordinator svc url from In Progress to Blocked/Waiting on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.

Moving this to the waiting column while I work on T406222.

Oct 22 2025, 8:53 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Patch-For-Review
Stevemunene moved T406222: Add druid coordinator service to LVS for the druid_public cluster. from Backlog - operations to In Progress on the Data-Platform-SRE (2025.10.17 - 2025.11.07) board.
Oct 22 2025, 8:53 AM · Traffic, Data-Platform-SRE (2025.11.07 - 2025.11.28), Patch-For-Review, Essential-Work
Stevemunene claimed T407799: Increase the nginx proxy timeouts in superset to 185 seconds.
Oct 22 2025, 6:22 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07)

Oct 21 2025

Stevemunene updated the task description for T405232: User Migration from Run dev instances to airflow devenv..
Oct 21 2025, 3:22 PM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28), Epic

Oct 8 2025

Stevemunene added a comment to T404073: Airflow instance for wikidata platform.

Deployed the airflow-wikidata instance but getting some challenges on the kerberos, scheduler and webserver pods.

Oct 8 2025, 7:33 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Wikidata, Wikidata-Query-Service
Stevemunene closed T405340: User Migration from Run dev instances to airflow devenv. Pre-Migration Planning, a subtask of T405232: User Migration from Run dev instances to airflow devenv., as Resolved.
Oct 8 2025, 7:20 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28), Epic
Stevemunene closed T405340: User Migration from Run dev instances to airflow devenv. Pre-Migration Planning as Resolved.
Oct 8 2025, 7:20 AM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)

Oct 7 2025

Stevemunene reassigned T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster from Stevemunene to BTullis.
Oct 7 2025, 3:19 PM · OKR-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
Stevemunene added a comment to T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster.

Checking the permissions we have for our key in codfw

Oct 7 2025, 7:37 AM · OKR-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
Stevemunene edited projects for T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster, added: Data-Platform-SRE (2025.09.26 - 2025.10.17); removed Data-Platform-SRE (2025.09.05 - 2025.09.26).
Oct 7 2025, 7:20 AM · OKR-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
Stevemunene reopened T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster, a subtask of T396478: EPIC: Build dse-k8s-codfw Kubernetes cluster, as Open.
Oct 7 2025, 7:19 AM · OKR-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Epic
Stevemunene reopened T404576: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s-codfw cluster as "Open".

Reopening this task since we have had some issues using Ceph on dse-k8s-codfw.
To test the integration, we tried a simple PVC definition as a raw block device

Oct 7 2025, 7:19 AM · OKR-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)

Oct 6 2025

Stevemunene moved T405340: User Migration from Run dev instances to airflow devenv. Pre-Migration Planning from In Progress to Done on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.

Reached out individually to the affected members and also shared communication with the wider team on the upcoming transition.

Oct 6 2025, 12:55 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
Stevemunene updated the task description for T405340: User Migration from Run dev instances to airflow devenv. Pre-Migration Planning.
Oct 6 2025, 12:54 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
Stevemunene updated the task description for T405340: User Migration from Run dev instances to airflow devenv. Pre-Migration Planning.
Oct 6 2025, 11:47 AM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
Stevemunene moved T405485: airflow-devenv fails to create when --name is long-ish from Backlog - project to In Progress on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Oct 6 2025, 8:59 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work
Stevemunene closed T403207: Add analytics-research user to stat boxes as Resolved.
Oct 6 2025, 8:58 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Data-Engineering-Radar, Research-engineering

Oct 3 2025

Stevemunene moved T403801: decommission druid100[7-8].eqiad.wmnet from In Progress to Tracking on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.
Oct 3 2025, 7:24 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Patch-For-Review, decommission-hardware
Stevemunene added a comment to T403207: Add analytics-research user to stat boxes.

Thanks @fkaelin this has been updated.

Oct 3 2025, 2:09 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Data-Engineering-Radar, Research-engineering
Stevemunene closed T405446: Decommission druid100[7-8].eqiad.wmnet, a subtask of T403801: decommission druid100[7-8].eqiad.wmnet, as Resolved.
Oct 3 2025, 1:50 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Patch-For-Review, decommission-hardware
Stevemunene closed T405446: Decommission druid100[7-8].eqiad.wmnet as Resolved.
Oct 3 2025, 1:50 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work
Stevemunene updated the task description for T405446: Decommission druid100[7-8].eqiad.wmnet.
Oct 3 2025, 1:45 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work
Stevemunene added a comment to T405446: Decommission druid100[7-8].eqiad.wmnet.

removed the keytabs

Oct 3 2025, 1:44 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work
Stevemunene placed T403801: decommission druid100[7-8].eqiad.wmnet up for grabs.
Oct 3 2025, 1:19 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Patch-For-Review, decommission-hardware
Stevemunene added a comment to T403801: decommission druid100[7-8].eqiad.wmnet.

cookbooks.sre.hosts.decommission executed by btullis@cumin1003 for hosts: druid1008.eqiad.wmnet

  • druid1008.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above

Oct 3 2025, 1:15 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Patch-For-Review, decommission-hardware
Stevemunene moved T403207: Add analytics-research user to stat boxes from Backlog - operations to Done on the Data-Platform-SRE (2025.09.26 - 2025.10.17) board.

The analytics-research user has been added to the stat hosts. @fkaelin please confirm that this works as expected.

Oct 3 2025, 9:05 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Data-Engineering-Radar, Research-engineering
Stevemunene added a comment to T396478: EPIC: Build dse-k8s-codfw Kubernetes cluster.

The initial error was due to the namespace not being defined in the list of available tenantNamespaces. After adding it, we recreated the PVC in the namespace and are now seeing a different error message.

root@deploy2002:~# kubectl apply -f /home/stevemunene/raw-block-pvc.yaml 
persistentvolumeclaim/raw-block-pvc created
root@deploy2002:~# kubectl apply -f /home/stevemunene/raw-block-pod.yaml
pod/pod-with-raw-block-volume created
root@deploy2002:~# kubectl -n stevemunene-pvc-tests get events -w
LAST SEEN   TYPE      REASON                 OBJECT                                MESSAGE
9s          Warning   FailedScheduling       pod/pod-with-raw-block-volume         0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
11s         Normal    Provisioning           persistentvolumeclaim/raw-block-pvc   External provisioner is provisioning volume for claim "stevemunene-pvc-tests/raw-block-pvc"
14s         Normal    ExternalProvisioning   persistentvolumeclaim/raw-block-pvc   Waiting for a volume to be created either by the external provisioner 'rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
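For readers who want to reproduce a test like the one above, here is a minimal sketch of what a raw block PVC and a pod consuming it could look like. It is not the content of the actual raw-block-pvc.yaml or raw-block-pod.yaml files, which are not shown in this comment. The namespace, PVC name and pod name are taken from the session above; the storage class name, container image, requested size and device path are hypothetical placeholders. The manifests are emitted as JSON, which kubectl accepts alongside YAML.

```python
import json

# Names taken from the kubectl session above.
NAMESPACE = "stevemunene-pvc-tests"
PVC_NAME = "raw-block-pvc"
POD_NAME = "pod-with-raw-block-volume"

# Hypothetical placeholders: substitute the real storage class and image.
STORAGE_CLASS = "example-rbd-storage-class"
IMAGE = "example.invalid/test-image:latest"


def raw_block_pvc(size="1Gi"):
    """Build a PVC that requests a raw block device instead of a filesystem."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": PVC_NAME, "namespace": NAMESPACE},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            # volumeMode "Block" is what makes this a raw block claim.
            "volumeMode": "Block",
            "storageClassName": STORAGE_CLASS,
            "resources": {"requests": {"storage": size}},
        },
    }


def raw_block_pod(device_path="/dev/xvda"):
    """Build a pod that attaches the PVC as a device via volumeDevices."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": POD_NAME, "namespace": NAMESPACE},
        "spec": {
            "containers": [
                {
                    "name": "tester",
                    "image": IMAGE,
                    "command": ["sleep", "3600"],
                    # Raw block volumes use volumeDevices, not volumeMounts.
                    "volumeDevices": [
                        {"name": "data", "devicePath": device_path}
                    ],
                }
            ],
            "volumes": [
                {
                    "name": "data",
                    "persistentVolumeClaim": {"claimName": PVC_NAME},
                }
            ],
        },
    }


def render():
    """Return both manifests wrapped in a v1 List, serialized as JSON."""
    items = [raw_block_pvc(), raw_block_pod()]
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": items}, indent=2
    )


if __name__ == "__main__":
    print(render())
```

The output can be piped to kubectl apply -f - on the deployment host. If the provisioner cannot create the volume, the pod would be expected to stay pending with the same "unbound immediate PersistentVolumeClaims" warning seen in the events above.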
Oct 3 2025, 8:54 AM · OKR-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Epic
Stevemunene added a comment to T403207: Add analytics-research user to stat boxes.

created the keytabs for the stat hosts and added them

Oct 3 2025, 7:56 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Data-Engineering-Radar, Research-engineering

Oct 2 2025

Stevemunene created T406222: Add druid coordinator service to LVS for the druid_public cluster..
Oct 2 2025, 1:13 PM · Traffic, Data-Platform-SRE (2025.11.07 - 2025.11.28), Patch-For-Review, Essential-Work
Stevemunene added a comment to T403955: Switch all hard coded druid_public host urls to druid-public-coordinator svc url.

Following discussions on this, we decided to add the druid-coordinator service for the public cluster to LVS and get a single usable URL for the service. However, there were concerns about the need for this, as druid host changes occur rarely, i.e. every 3 years or once per server lifecycle.
Moreover, there are some discussions on having druid on k8s at some point in the future.

Oct 2 2025, 1:07 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Patch-For-Review
Stevemunene added a comment to T396478: EPIC: Build dse-k8s-codfw Kubernetes cluster.

Going through the ceph-csi-releases for any change that might have impacted us in the 1.31 upgrade.

Oct 2 2025, 10:23 AM · OKR-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Epic

Oct 1 2025

Stevemunene closed T404551: Check home/HDFS leftovers of jly as Resolved.

Thanks @SLopes-WMF the files have been cleared and we can close the task.

Oct 1 2025, 3:14 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
Stevemunene added a comment to T404551: Check home/HDFS leftovers of jly.

In communication with @SLopes-WMF and the files have been shared for review.
Thanks @Gehel

Oct 1 2025, 1:58 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
Stevemunene updated the task description for T404073: Airflow instance for wikidata platform.
Oct 1 2025, 11:55 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Wikidata, Wikidata-Query-Service
Stevemunene added a comment to T404073: Airflow instance for wikidata platform.

Completed the Hadoop setup by creating the UNIX user/group and ops group then

Oct 1 2025, 11:48 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Wikidata, Wikidata-Query-Service
Stevemunene added a comment to T405557: Request for airflow-wikidata-ops primary group.

Thanks @MoritzMuehlenhoff, approval requests should be approved by @gmodena

Oct 1 2025, 11:44 AM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Infrastructure-Foundations, Wikidata, Wikidata-Query-Service

Sep 30 2025

Stevemunene added a comment to T405340: User Migration from Run dev instances to airflow devenv. Pre-Migration Planning.

To help with identifying the current users, I have set up sessions with some of the active users to see what they are working on and to find a better way of automating the discovery.

Sep 30 2025, 12:39 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17)
Stevemunene added a comment to T396478: EPIC: Build dse-k8s-codfw Kubernetes cluster.

The dse-k8s-codfw cluster is up and running and the components can be verified as below:
A simple PVC definition as a raw block device

root@deploy2002:~# kube_env admin dse-k8s-codfw
root@deploy2002:~# kubectl create namespace stevemunene-pvc-tests
namespace/stevemunene-pvc-tests created
root@deploy2002:~# kubectl get namespaces
NAME                    STATUS   AGE
analytics-test          Active   47h
cert-manager            Active   10d
default                 Active   10d
echoserver              Active   10d
external-services       Active   10d
istio-system            Active   10d
kube-node-lease         Active   10d
kube-public             Active   10d
kube-system             Active   10d
opensearch-ipoid        Active   10d
opensearch-ipoid-test   Active   10d
opensearch-operator     Active   10d
opensearch-test         Active   10d
sidecar-controller      Active   10d
stevemunene-pvc-tests   Active   13s
Sep 30 2025, 11:59 AM · OKR-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Epic

Sep 29 2025

Stevemunene updated the task description for T405446: Decommission druid100[7-8].eqiad.wmnet.
Sep 29 2025, 3:48 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work
Stevemunene added a comment to T405446: Decommission druid100[7-8].eqiad.wmnet.

Depooled and removed the hosts from LVS

Sep 29 2025, 3:44 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work

Sep 25 2025

Stevemunene added a comment to T405446: Decommission druid100[7-8].eqiad.wmnet.

The Druid hosts are done decommissioning. Next is removing them from LVS.

image.png (432×1 px, 77 KB)

Sep 25 2025, 5:18 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work
Stevemunene closed T403358: Check home/HDFS leftovers of mszabo as Resolved.

Thanks, user files have been deleted.

Sep 25 2025, 5:10 PM · Data-Platform-SRE (2025.09.05 - 2025.09.26)
Stevemunene updated the task description for T404073: Airflow instance for wikidata platform.
Sep 25 2025, 11:11 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Wikidata, Wikidata-Query-Service