Stevemunene (Stevemunene)
User

User Details

User Since
Nov 1 2022, 1:30 PM (77 w, 1 d)
Availability
Available
LDAP User
Stevemunene
MediaWiki User
SMunene-WMF [ Global Accounts ]

Recent Activity

Today

Stevemunene updated the task description for T361185: Move datahub to dse-k8s cluster.
Thu, Apr 25, 11:26 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)

Yesterday

Stevemunene added a subtask for T361185: Move datahub to dse-k8s cluster: T363304: Create saved views for datahub deployment logs on dse-k8s.
Wed, Apr 24, 8:52 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a parent task for T363304: Create saved views for datahub deployment logs on dse-k8s: T361185: Move datahub to dse-k8s cluster.
Wed, Apr 24, 8:52 AM · Data-Platform-SRE
Stevemunene created T363304: Create saved views for datahub deployment logs on dse-k8s.
Wed, Apr 24, 8:52 AM · Data-Platform-SRE
Stevemunene added a subtask for T361185: Move datahub to dse-k8s cluster: T363301: Monitor the availability of datahub deployment on dse-k8s.
Wed, Apr 24, 8:50 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a parent task for T363301: Monitor the availability of datahub deployment on dse-k8s: T361185: Move datahub to dse-k8s cluster.
Wed, Apr 24, 8:50 AM · Data-Platform-SRE
Stevemunene created T363301: Monitor the availability of datahub deployment on dse-k8s.
Wed, Apr 24, 8:49 AM · Data-Platform-SRE
Stevemunene added a subtask for T361185: Move datahub to dse-k8s cluster: T363300: Move datahub and datahub-staging helmfile deployments to dse-k8s.
Wed, Apr 24, 8:44 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a parent task for T363300: Move datahub and datahub-staging helmfile deployments to dse-k8s: T361185: Move datahub to dse-k8s cluster.
Wed, Apr 24, 8:44 AM · Data-Platform-SRE
Stevemunene created T363300: Move datahub and datahub-staging helmfile deployments to dse-k8s.
Wed, Apr 24, 8:44 AM · Data-Platform-SRE
Stevemunene added a parent task for T363299: create Internal service DNS record for datahub pointing to dse-k8s: T361185: Move datahub to dse-k8s cluster.
Wed, Apr 24, 8:40 AM · Data-Platform-SRE
Stevemunene added a subtask for T361185: Move datahub to dse-k8s cluster: T363299: create Internal service DNS record for datahub pointing to dse-k8s.
Wed, Apr 24, 8:40 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene created T363299: create Internal service DNS record for datahub pointing to dse-k8s.
Wed, Apr 24, 8:40 AM · Data-Platform-SRE
Stevemunene added a subtask for T361185: Move datahub to dse-k8s cluster: T363298: Create datahub namespaces on dse-k8s.
Wed, Apr 24, 8:38 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a parent task for T363298: Create datahub namespaces on dse-k8s: T361185: Move datahub to dse-k8s cluster.
Wed, Apr 24, 8:38 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene created T363298: Create datahub namespaces on dse-k8s.
Wed, Apr 24, 8:38 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)

Tue, Apr 23

Stevemunene moved T356231: Package versions in Conda-Analytics are not pinned from Blocked / Waiting to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

We hit a delay with this while building the new conda package. So far we have updated the conda-analytics-clone command to accept the --pinned flag, so that the pinned file is available to everyone who uses it, as per Creating_a_new_environment.
However, we ran into a Debian-related challenge: the buster-backports repo is no longer available, so apt update fails and we cannot build any image that includes this repo. This is being tracked in T362518, and we plan to solve it by rebuilding our container on Bullseye (T362648) to unblock progress.

Tue, Apr 23, 2:36 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene closed T362648: Rebuild conda-analytics container on Bullseye as Resolved.

Run an apt update on the Hadoop workers:

stevemunene@cumin1002:~$ sudo cumin A:hadoop-worker 'apt update'
106 hosts will be targeted:
an-worker[1078-1175].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
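
The host expression above can be checked against the reported count of 106 targets. The following is a minimal sketch, assuming only the simple `prefix[lo-hi]suffix` form shown in this output; `expand_hosts` is a hypothetical helper and not Cumin's own parser.

```python
import re

# Hypothetical helper: handles only the simple "prefix[lo-hi]suffix" form
# seen in the output above. It is not Cumin's own host-expression parser.
_RANGE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<lo>\d+)-(?P<hi>\d+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(expression: str) -> list[str]:
    """Expand a comma-separated list of simple bracketed host ranges."""
    hosts = []
    for part in expression.split(","):
        part = part.strip()
        match = _RANGE.match(part)
        if match is None:
            # No bracketed range: treat the part as a single hostname.
            hosts.append(part)
            continue
        lo, hi = int(match["lo"]), int(match["hi"])
        width = len(match["lo"])  # preserve any zero padding
        for number in range(lo, hi + 1):
            hosts.append(f"{match['prefix']}{number:0{width}d}{match['suffix']}")
    return hosts


targets = expand_hosts(
    "an-worker[1078-1175].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet"
)
print(len(targets))  # 98 an-worker hosts + 8 analytics hosts = 106
```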
Tue, Apr 23, 1:05 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a comment to T362648: Rebuild conda-analytics container on Bullseye.

To upgrade the version

Tue, Apr 23, 12:26 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)

Thu, Apr 18

Stevemunene updated the task description for T361688: Upgrade datahub to v0.12.1.
Thu, Apr 18, 2:45 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene closed T361688: Upgrade datahub to v0.12.1 as Resolved.
Thu, Apr 18, 2:45 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene added a comment to T361688: Upgrade datahub to v0.12.1.

The deployment was successful and we are now running datahub v0.12.1

Thu, Apr 18, 2:30 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review

Wed, Apr 17

Stevemunene added a comment to T362648: Rebuild conda-analytics container on Bullseye.

Could we also run a sanity Spark test? See T344910#9331963 for an example that just runs two Spark SQL queries.

Wed, Apr 17, 5:11 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a comment to T361688: Upgrade datahub to v0.12.1.

Ran into the same issue as before, even with BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE set to its default. The datahub-main-system-update-job-lnk9h job ran to completion and successfully reindexed the indices, then the remaining pods (datahub frontend, mae-consumer, mce-consumer-main, gms-main) were all recreated without any error.
The current error comes from the datahub-main-nocode-migration-job, which is unable to access gms-main: the endpoint returns a 503.

Wed, Apr 17, 1:06 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene added a comment to T362648: Rebuild conda-analytics container on Bullseye.

New package installs correctly and the conda functionality seems unaffected.

Wed, Apr 17, 7:52 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)

Tue, Apr 16

Stevemunene updated the task description for T362648: Rebuild conda-analytics container on Bullseye.
Tue, Apr 16, 3:35 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a comment to T361688: Upgrade datahub to v0.12.1.

From the community and docs, it seems we do need to update BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE (https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/datahub/values.yaml#L96) when upgrading datahub to a new version, since the system update job performs tasks on upgrades. Re: very informative Slack discussion. The downside is some downtime, which is/was expected for the v0.11.0 upgrade; details below:

Tue, Apr 16, 2:57 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene moved T356230: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy from In Progress to Blocked / Waiting on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
Tue, Apr 16, 12:26 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene added a comment to T356230: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy.

Crossposting an update for visibility

Tue, Apr 16, 12:26 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene moved T356231: Package versions in Conda-Analytics are not pinned from In Progress to Blocked / Waiting on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

We hit a delay with this while building the new conda package. So far we have updated the conda-analytics-clone command to accept the --pinned flag, so that the pinned file is available to everyone who uses it, as per Creating_a_new_environment.
However, we ran into a Debian-related challenge: the buster-backports repo is no longer available, so apt update fails and we cannot build any image that includes this repo. This is being tracked in T362518, and we plan to solve it by rebuilding our container on Bullseye (T362648) to unblock progress.

Tue, Apr 16, 12:25 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene moved T362648: Rebuild conda-analytics container on Bullseye from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
Tue, Apr 16, 12:24 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene created T362648: Rebuild conda-analytics container on Bullseye.
Tue, Apr 16, 12:23 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a comment to T361688: Upgrade datahub to v0.12.1.

datahub-gms-main was in an error state and was rolled back before I could get any error logs from it. Should we revert BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE to the default? mce and mae were both OK.

Tue, Apr 16, 11:55 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene updated subscribers of T361688: Upgrade datahub to v0.12.1.

The first upgrade attempt failed on codfw with some errors in the datahub-main-nocode-migration-job.

Tue, Apr 16, 11:35 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review

Mon, Apr 15

Stevemunene updated the task description for T361688: Upgrade datahub to v0.12.1.
Mon, Apr 15, 2:36 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene moved T361688: Upgrade datahub to v0.12.1 from In Progress to To Be Deployed on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

Datahub v0.12.1 is running successfully in staging without any issues, so we can proceed with the main upgrade.

Mon, Apr 15, 2:35 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review

Fri, Apr 5

Stevemunene updated the task description for T361688: Upgrade datahub to v0.12.1.
Fri, Apr 5, 11:33 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene changed the status of T361894: Superset account does not receive sql_lab role despite wmf LDAP membership from Open to In Progress.
Fri, Apr 5, 8:05 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), superset.wikimedia.org, Movement-Insights

Thu, Apr 4

Stevemunene updated the task description for T361185: Move datahub to dse-k8s cluster.
Thu, Apr 4, 1:13 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a comment to T356231: Package versions in Conda-Analytics are not pinned.

Thanks @nshahquinn-wmf. At the moment the pinned file is only included in a clone if the user opts in. There is not yet a default way to provide it, which is not the optimal UX.
The pinned file can be included by passing the --pinned flag when cloning, as shown below:

Thu, Apr 4, 9:11 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene renamed T361688: Upgrade datahub to v0.12.1 from Upgrade datahub to v0.13.0 to Upgrade datahub to v0.12.1.
Thu, Apr 4, 8:38 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review

Wed, Apr 3

Stevemunene added a comment to T361688: Upgrade datahub to v0.12.1.

Got the following errors during the build process

Wed, Apr 3, 2:49 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene added a comment to T361688: Upgrade datahub to v0.12.1.

We have some failed container builds which I am looking into https://gitlab.wikimedia.org/repos/data-engineering/datahub/-/pipelines/47394

Wed, Apr 3, 12:35 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene added a comment to T356231: Package versions in Conda-Analytics are not pinned.

New package introduces a pinned file for the base environment

stevemunene@an-test-client1002:~$ cat /opt/conda-analytics/conda-meta/pinned 
# https://phabricator.wikimedia.org/T356230
numpy <1.24
# https://phabricator.wikimedia.org/T356230
pandas <2.2

The current pinned file is a baseline, and we might need a more standardised production pinned file. cc @nshahquinn-wmf
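
As a rough illustration of what the pinned file above contains, here is a minimal sketch that reads its `name constraint` lines into a mapping. `parse_pinned` is a hypothetical helper that only handles the simple form shown here; it is not conda's own parser.

```python
def parse_pinned(text: str) -> dict[str, str]:
    """Map each package name in a pinned file to its version constraint."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        # Assumes "name constraint" separated by a single space.
        name, _, constraint = line.partition(" ")
        pins[name] = constraint.strip()
    return pins


# Contents copied from the cat output above.
PINNED = """\
# https://phabricator.wikimedia.org/T356230
numpy <1.24
# https://phabricator.wikimedia.org/T356230
pandas <2.2
"""

print(parse_pinned(PINNED))  # {'numpy': '<1.24', 'pandas': '<2.2'}
```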

Wed, Apr 3, 12:12 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene triaged T361688: Upgrade datahub to v0.12.1 as High priority.
Wed, Apr 3, 10:16 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene added a parent task for T361688: Upgrade datahub to v0.12.1: Unknown Object (Task).
Wed, Apr 3, 10:15 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene created T361688: Upgrade datahub to v0.12.1.
Wed, Apr 3, 10:15 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
Stevemunene added a comment to T356231: Package versions in Conda-Analytics are not pinned.

We have introduced a conda-analytics pinned file with pandas and numpy versions for starters, and built the dev deb package, which we are going to test on an-test-client1003.

Wed, Apr 3, 10:09 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights

Thu, Mar 28

Stevemunene created T361185: Move datahub to dse-k8s cluster.
Thu, Mar 28, 7:16 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)

Feb 27 2024

Stevemunene moved T356484: Monitor the availability of the superset deployments from In Progress to To Be Deployed on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 27 2024, 3:08 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Engineering
Stevemunene claimed T356230: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy.
Feb 27 2024, 3:05 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene moved T356231: Package versions in Conda-Analytics are not pinned from Backlog to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 27 2024, 3:05 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene moved T356230: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy from Backlog to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 27 2024, 3:05 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene claimed T356231: Package versions in Conda-Analytics are not pinned.
Feb 27 2024, 10:07 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Movement-Insights
Stevemunene closed T356486: [superset k8s] Update the wikitech page with our production readiness checklist as Resolved.

Ack, the checklist has been moved to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset/Administration which is where we will also have the runbook.

Feb 27 2024, 8:19 AM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Engineering
Stevemunene closed T356486: [superset k8s] Update the wikitech page with our production readiness checklist, a subtask of T347710: Migrate the Analytics Superset instances to our DSE Kubernetes cluster, as Resolved.
Feb 27 2024, 8:19 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Epic, Data-Engineering

Feb 26 2024

Stevemunene moved T356484: Monitor the availability of the superset deployments from Needs Review to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 26 2024, 11:25 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Engineering

Feb 23 2024

Stevemunene moved T356484: Monitor the availability of the superset deployments from In Progress to Needs Review on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 23 2024, 11:08 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Engineering
Stevemunene moved T356484: Monitor the availability of the superset deployments from Backlog to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 23 2024, 8:57 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Engineering
Stevemunene moved T356484: Monitor the availability of the superset deployments from Quarterly Goals to 2024.02.12 - 2024.03.03 on the Data-Platform-SRE board.
Feb 23 2024, 8:57 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Engineering
Stevemunene moved T356486: [superset k8s] Update the wikitech page with our production readiness checklist from In Progress to Done on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.

The superset and superset-next checklists have been added; the only missing values are the metrics and monitor URLs, which are a WIP.

Feb 23 2024, 8:56 AM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Engineering
Stevemunene moved T356485: Create saved views for the superset deployment logs from In Progress to Done on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.

We now have some saved views for superset and superset-next:
superset
superset-next

Feb 23 2024, 8:54 AM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Engineering
Stevemunene moved T336040: Bring stat1010 into service with GPU from stat1005 from To Be Deployed to Done on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 23 2024, 8:32 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene moved T354526: Bring stat1011 into service from To Be Deployed to Done on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 23 2024, 8:32 AM · Patch-For-Review, Data-Platform-SRE (2024.02.12 - 2024.03.03)
Stevemunene updated the task description for T354526: Bring stat1011 into service.
Feb 23 2024, 8:32 AM · Patch-For-Review, Data-Platform-SRE (2024.02.12 - 2024.03.03)
Stevemunene updated the task description for T336040: Bring stat1010 into service with GPU from stat1005.
Feb 23 2024, 8:30 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene moved T356485: Create saved views for the superset deployment logs from Backlog to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 23 2024, 7:48 AM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Engineering
Stevemunene moved T356485: Create saved views for the superset deployment logs from Quarterly Goals to 2024.02.12 - 2024.03.03 on the Data-Platform-SRE board.
Feb 23 2024, 7:46 AM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Engineering

Feb 20 2024

Stevemunene moved T336040: Bring stat1010 into service with GPU from stat1005 from Needs Review to To Be Deployed on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 20 2024, 3:11 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene moved T354526: Bring stat1011 into service from Needs Review to To Be Deployed on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 20 2024, 3:10 PM · Patch-For-Review, Data-Platform-SRE (2024.02.12 - 2024.03.03)
Stevemunene moved T356486: [superset k8s] Update the wikitech page with our production readiness checklist from Backlog to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.
Feb 20 2024, 3:09 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Engineering

Feb 15 2024

Stevemunene moved T336040: Bring stat1010 into service with GPU from stat1005 from In Progress to Needs Review on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.

Once the patch to Add stat1010 and stat1011 to scap targets is merged, we shall add a note on the ops week deployment and keep an eye out during the deployment in case of any issues with the two hosts.

Feb 15 2024, 1:56 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene moved T354526: Bring stat1011 into service from In Progress to Needs Review on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.

Once the patch to Add stat1010 and stat1011 to scap targets is merged, we shall add a note on the ops week deployment and keep an eye out during the deployment in case of any issues with the two hosts.

Feb 15 2024, 1:55 PM · Patch-For-Review, Data-Platform-SRE (2024.02.12 - 2024.03.03)
Stevemunene added a comment to T336040: Bring stat1010 into service with GPU from stat1005.

Adding a link to the rsync-published.service resolution and potential discussion on the stat1011 ticket: https://phabricator.wikimedia.org/T354526#9546019

Feb 15 2024, 1:54 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene added a comment to T354526: Bring stat1011 into service.

The rsync-published.service error is similar to what we encountered when bringing up stat1009 on T336036.
The error occurs when we try to rsync $source to $destination/$::hostname using the published-sync script. The destination host is expected to use the statistics::published class to merge the $::hostname directories into a single directory.
However, in this case the rsync service user does not have sufficient permissions to create the $::hostname directory in the destination folder on an-web1001, which results in the error below when starting:

Feb 15 2024, 1:35 PM · Patch-For-Review, Data-Platform-SRE (2024.02.12 - 2024.03.03)

Feb 13 2024

Stevemunene claimed T336040: Bring stat1010 into service with GPU from stat1005.
Feb 13 2024, 1:47 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
Stevemunene claimed T354526: Bring stat1011 into service.
Feb 13 2024, 1:04 PM · Patch-For-Review, Data-Platform-SRE (2024.02.12 - 2024.03.03)
Stevemunene moved T339015: Check log rotation settings on airflow instances from In Progress to Done on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.

Spent some time looking at Advanced logging configuration and how we use Airflow at the Foundation. This approach can be explored if there are issues with our current logs or if specific teams want to fine-tune their logs; otherwise it was not a viable solution for this task, especially on the SRE side.
Furthermore, the script used to delete the logs, clean_logs.sh, uses find to locate files and directories older than 90 days and deletes them. It works as intended and needs no modification. I tried some variations and all checked out.
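
The age-based selection described above can be sketched as follows. This is an illustration only, not the contents of clean_logs.sh: `find_stale_paths` is a hypothetical name, the cutoff is a plain modification-time comparison rather than find's exact day rounding, and the sketch only lists candidates without deleting anything.

```python
import os
import time
from typing import Optional


def find_stale_paths(
    root: str, max_age_days: int = 90, now: Optional[float] = None
) -> list[str]:
    """List files and directories under root last modified over max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    stale = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are judged by their own mtime.
            if os.lstat(path).st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling `find_stale_paths("/some/log/dir")` (the path is a placeholder) returns the candidates so they can be reviewed before any deletion step is added.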

Feb 13 2024, 12:52 PM · Patch-For-Review, Data-Platform-SRE (2024.02.12 - 2024.03.03)

Feb 8 2024

Stevemunene closed T353776: Bring an-worker11[57-75] into service, a subtask of T353784: Decommission an-worker10[78-95] & an-worker1116, as Resolved.
Feb 8 2024, 9:54 AM · Data-Platform-SRE
Stevemunene closed T353776: Bring an-worker11[57-75] into service as Resolved.
Feb 8 2024, 9:54 AM · Data-Platform-SRE (2024.01.22 - 2024.02.11)

Feb 7 2024

BTullis awarded T353776: Bring an-worker11[57-75] into service a Party Time token.
Feb 7 2024, 2:10 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene moved T353776: Bring an-worker11[57-75] into service from To Be Deployed to Done on the Data-Platform-SRE (2024.01.22 - 2024.02.11) board.

The hosts are slowly balancing in the cluster and should help with the low capacity warnings we were getting.

Feb 7 2024, 1:48 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)

Feb 6 2024

Stevemunene moved T353776: Bring an-worker11[57-75] into service from In Progress to To Be Deployed on the Data-Platform-SRE (2024.01.22 - 2024.02.11) board.

Sure, I'll try the manual failover and restart of the services probably during our sync

Feb 6 2024, 4:21 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene added a comment to T353776: Bring an-worker11[57-75] into service.

Hosts are visible on the namenode UI and should rebalance with time.

Feb 6 2024, 2:08 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene added a comment to T353776: Bring an-worker11[57-75] into service.

The hosts have been added to net_topology and assigned the right role. They are also running OK without any RAID-related alerts. However, some hosts are in the default rack, so we will need to run a roll restart of the masters:
sudo cookbook sre.hadoop.roll-restart-masters analytics
This was checked with:
stevemunene@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology
The hosts in the default rack are some of the newly added hosts:

Rack: /eqiad/default/rack
   10.64.153.2:50010 (an-worker1157.eqiad.wmnet)
   10.64.153.3:50010 (an-worker1158.eqiad.wmnet)
   10.64.153.4:50010 (an-worker1159.eqiad.wmnet)
   10.64.155.4:50010 (an-worker1162.eqiad.wmnet)
   10.64.157.2:50010 (an-worker1163.eqiad.wmnet)
   10.64.157.3:50010 (an-worker1164.eqiad.wmnet)
   10.64.157.4:50010 (an-worker1165.eqiad.wmnet)
   10.64.161.2:50010 (an-worker1166.eqiad.wmnet)
   10.64.161.4:50010 (an-worker1168.eqiad.wmnet)
   10.64.165.2:50010 (an-worker1172.eqiad.wmnet)
   10.64.165.4:50010 (an-worker1174.eqiad.wmnet)
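
A check like this can be scripted. The sketch below groups hostnames by rack from text in the layout shown above (a `Rack:` line followed by indented `address (hostname)` lines). `hosts_by_rack` is a hypothetical helper, and the sample is a shortened copy of the listing above.

```python
import re

# Matches an indented "address:port (hostname)" line as in the listing above.
_NODE = re.compile(r"^\s+(?P<addr>\S+)\s+\((?P<host>[^)]+)\)\s*$")


def hosts_by_rack(topology: str) -> dict[str, list[str]]:
    """Group datanode hostnames by the rack line they appear under."""
    racks: dict[str, list[str]] = {}
    current = None
    for line in topology.splitlines():
        if line.startswith("Rack:"):
            current = line.split(":", 1)[1].strip()
            racks[current] = []
        elif current is not None:
            match = _NODE.match(line)
            if match:
                racks[current].append(match["host"])
    return racks


SAMPLE = """\
Rack: /eqiad/default/rack
   10.64.153.2:50010 (an-worker1157.eqiad.wmnet)
   10.64.153.3:50010 (an-worker1158.eqiad.wmnet)
"""

misplaced = hosts_by_rack(SAMPLE).get("/eqiad/default/rack", [])
print(misplaced)  # ['an-worker1157.eqiad.wmnet', 'an-worker1158.eqiad.wmnet']
```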

Keeping an eye on the cluster
HDFS Capacity remaining
HDFS Available vs used space

Feb 6 2024, 12:41 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene updated the task description for T353776: Bring an-worker11[57-75] into service.
Feb 6 2024, 12:06 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)

Feb 2 2024

Stevemunene added a comment to T353776: Bring an-worker11[57-75] into service.

Thanks @BTullis, updating wikitech and the cookbook once I get the right commands

Feb 2 2024, 1:07 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)

Feb 1 2024

Stevemunene added a comment to T353776: Bring an-worker11[57-75] into service.

From the iDRAC interface we can verify that the hosts have been set to RAID0 and that the virtual drives are visible as expected.

Still trying to understand why they are not visible via megacli commands. Exploring reimaging an-worker1157 with the default playbook and setting up the disks to be visible with:
sudo megacli -CfgEachDskRaid0 WB RA Direct NoCachedBadBBU -a0

Feb 1 2024, 4:09 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)

Jan 31 2024

Stevemunene updated subscribers of T353776: Bring an-worker11[57-75] into service.

Saw some comments on some RAID config issues here https://phabricator.wikimedia.org/T349936#9360470 by @Papaul from the rack/setup task but not as detailed.

Jan 31 2024, 1:20 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene added a comment to T353776: Bring an-worker11[57-75] into service.

The new an-worker1157-1175 hosts do not have any virtual drive configured; however, the datanode disks/partitions are initialized as expected. Comparing new host an-worker1157 to the previously initialized host an-worker1156:

Jan 31 2024, 12:00 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)

Jan 30 2024

Stevemunene updated the task description for T353776: Bring an-worker11[57-75] into service.
Jan 30 2024, 4:10 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene added a comment to T353776: Bring an-worker11[57-75] into service.

Adding to this, I had to run the cookbook sre.hadoop.init-hadoop-workers to install megacli on all the hosts first, then pass the megacli command to create a RAID0 volume on each of the physical disks in the array. So we can edit the cookbook to check the number of drives on the hosts and run the command to create the volumes. However, comparing the parameters passed with those available in the cookbook's MegaCLI-specific settings, there are some variations in the WB value and NoCachedBadBBU, which I would like to look into a bit further.
Currently looking into an error from the playbook.

Jan 30 2024, 4:09 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene added a comment to T353776: Bring an-worker11[57-75] into service.

Thanks @BTullis, we can add it to the cookbook for future reference. I did some further reading on the RAID configuration input options used, from the StorCLI Reference Manual:
RA: Read Ahead
WB: Write Back
Direct: Direct I/O. Sets the logical drive cache policy. Direct I/O is the default.
CachedBadBBU: Enable bad BBU caching. Enables caching when the BBU is not functioning. Disabled is the default.
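
To show how these options line up with the megacli invocation quoted elsewhere in this feed (`megacli -CfgEachDskRaid0 WB RA Direct NoCachedBadBBU -a0`), here is a minimal sketch that assembles the argument list. The function and parameter names are hypothetical; only the option values from this feed are used, and nothing is executed.

```python
def raid0_each_disk_command(
    write_policy: str = "WB",                # Write Back
    read_policy: str = "RA",                 # Read Ahead
    io_policy: str = "Direct",               # Direct I/O
    bad_bbu_policy: str = "NoCachedBadBBU",  # no caching when the BBU is bad
    adapter: int = 0,
) -> list[str]:
    """Build the argument list for the per-disk RAID0 command quoted in this feed."""
    return [
        "megacli",
        "-CfgEachDskRaid0",
        write_policy,
        read_policy,
        io_policy,
        bad_bbu_policy,
        f"-a{adapter}",
    ]


print(" ".join(raid0_each_disk_command()))
# megacli -CfgEachDskRaid0 WB RA Direct NoCachedBadBBU -a0
```

Keeping each policy as a named parameter makes it easier to see which value differs when comparing against the cookbook's settings.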

Jan 30 2024, 12:56 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene added a comment to T353776: Bring an-worker11[57-75] into service.

The sre.hadoop.init-hadoop-workers cookbook fails when creating new partitions. Running it for an-worker1157 fails with the details below:

Jan 30 2024, 11:29 AM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene updated the task description for T353776: Bring an-worker11[57-75] into service.
Jan 30 2024, 11:27 AM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene committed rLPRIeed113b83da8: Add dummy keytabs for new an-worker1157-1175.
Jan 30 2024, 10:29 AM

Jan 29 2024

Stevemunene updated the task description for T353776: Bring an-worker11[57-75] into service.
Jan 29 2024, 2:02 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene updated the task description for T353776: Bring an-worker11[57-75] into service.
Jan 29 2024, 12:28 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene updated the task description for T353776: Bring an-worker11[57-75] into service.
Jan 29 2024, 12:24 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Stevemunene claimed T353776: Bring an-worker11[57-75] into service.
Jan 29 2024, 10:07 AM · Data-Platform-SRE (2024.01.22 - 2024.02.11)