User Details
- User Since: Nov 1 2022, 1:30 PM
- Roles: Disabled
- LDAP User: Stevemunene
- MediaWiki User: SMunene-WMF [ Global Accounts ]
Nov 5 2025
The issue was with the port; re-exposed it and the service is accessible again.
Created a devenv to test the recent flask-appbuilder upgrade, which should resolve some of the initial challenges we had accessing the connection/list/ and variable/list/ pages.
To confirm, these pages are currently accessible using the default image.
Results from the recent survey indicate that the overall experience with the new environment has been quite favourable.
The migration process scored 4/5.
There were some bugs during the migration, which were reported and promptly fixed, namely:
https://phabricator.wikimedia.org/T405485, https://phabricator.wikimedia.org/T399235
The availability of documentation and support for the process scored 3/5.
We have had some delays due to scheduling conflicts and PTO. However, we have found some middle ground and have a slot between 10:00 and 12:00 GMT on 5 Nov for the deploy.
We currently do not have a memcached image newer than Buster. We are running memcached 1.6.6 from a custom backport on Buster, and 1.6.38 is being built on Trixie. The changes should be minimal, and the new version should work the same without requiring any other changes. We shall deploy the new image once it is ready.
Nov 4 2025
The flink-kubernetes-operator monitoring is already set up and sends alerts to the Data Platform SREs via team-data-platform. I don't think there is more to do for this ticket, so I shall move to close it.
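For reference, alert routing of this kind is typically a matter of matching a team label to a receiver. A minimal Alertmanager-style sketch (the receiver name matches the task; the label key and contact address are illustrative assumptions, not the actual production config):

```yaml
# Illustrative sketch only: routing flink-kubernetes-operator alerts
# to the team-data-platform receiver via a team label.
route:
  routes:
    - match:
        team: data-platform        # assumed label on the alerting rules
      receiver: team-data-platform
receivers:
  - name: team-data-platform
    email_configs:
      - to: data-platform-alerts@example.org   # placeholder address
```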
Only the memcached container is still running Buster.
Since we still have some dumps running, we shall continue the tests on a devenv and proceed to the test-k8s instance once everything is verified.
Oct 31 2025
Oct 30 2025
Deployed this but am getting a Kerberos error.
We had the first support/office-hours session, albeit with no turnout. I think we can move such sessions to an as-needed basis.
Sharing a feedback survey form to gauge general sentiment on the move and gather possible improvements: https://docs.google.com/forms/d/e/1FAIpQLScbc1wKiFG5PU4teTFN4cYFR-iKw80uLKzod3ZNuNrgMj8nHA/viewform?usp=dialog
Oct 29 2025
Ran Puppet on the hosts, then restarted the Druid daemons.
Oct 28 2025
T408365 is unblocked and we have deployed the new version.
Oct 27 2025
Then update the sources and upgrade from the cumin host
sudo cumin 'deploy*' 'apt-get update && apt-get upgrade -y airflow-devenv'
We solved this by building for Bookworm, which had all the prerequisite packages already available.
Oct 24 2025
Tried building with bookworm packages from https://packages.debian.org/search?keywords=pybuild-plugin-pyproject and https://packages.debian.org/bookworm/python3-poetry-core but the end result is still the same.
Tried both:
APT_USE_BUILT=yes BUILDRESULT=/home/stevemunene/pydep/pybuild-plugin-pyproject_5.20230130+deb12u1_all.deb BUILDRESULT=/home/stevemunene/pydep/python3-poetry-core_1.4.0-4_all.deb DIST=bullseye pdebuild
and
APT_USE_BUILT=yes BUILDRESULT=/home/stevemunene/pydep DIST=bullseye pdebuild
Oct 23 2025
Some observations:
When building with BACKPORTS=yes DIST=bullseye-wikimedia pdebuild, all the required packages are built, despite the backports mirror being unavailable: E: The repository 'http://mirrors.wikimedia.org/debian bullseye-backports Release' does not have a Release file.
Oct 22 2025
Added the two options to nginx, and we no longer seem to be hitting the timeouts previously seen on some charts.
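The log does not record which two options were used; for long-running chart queries behind nginx, a common pair is the upstream read/send timeouts. A sketch (directive choice and values are assumptions, not the actual change):

```nginx
# Illustrative only: raise upstream timeouts so slow chart queries
# are not cut off by nginx's 60s defaults.
proxy_read_timeout 300s;
proxy_send_timeout 300s;
```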
Marking this as done while monitoring for any potential issues.
The error above was due to T383557; removed the backports option, but I am running into this error now
Getting an error setting up the build on the build2002 host, which I am looking into.
Moving this to the waiting column as I work on T406222
Oct 21 2025
Oct 8 2025
Deployed the airflow-wikidata instance but seeing some issues with the Kerberos, scheduler, and webserver pods.
Oct 7 2025
Checking the permissions we have for our key in codfw
Reopening this task since we have had some issues using Ceph on dse-k8s-codfw.
To test the integration, we tried a simple PVC definition as a raw block device.
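A minimal sketch of such a raw-block PVC (the storage class name and size are assumptions; the actual definition used may differ):

```yaml
# Illustrative raw-block PVC: volumeMode: Block requests a raw device
# from the RBD provisioner instead of a mounted filesystem.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block            # raw block device rather than Filesystem
  resources:
    requests:
      storage: 1Gi             # illustrative size
  storageClassName: ceph-rbd   # assumed RBD storage class name
```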
Oct 6 2025
Reached out individually to the affected members and also shared communication with the wider team about the upcoming transition.
Oct 3 2025
Thanks @fkaelin this has been updated.
Removed the keytabs.
The analytics-research user has been added to the stat hosts; @fkaelin please confirm that this works as expected.
The initial error was because we did not have the namespace defined in the list of available tenantNamespaces. This was added; we then recreated the PVC in the namespace and are now seeing a different error message.
root@deploy2002:~# kubectl apply -f /home/stevemunene/raw-block-pvc.yaml
persistentvolumeclaim/raw-block-pvc created
root@deploy2002:~# kubectl apply -f /home/stevemunene/raw-block-pod.yaml
pod/pod-with-raw-block-volume created
root@deploy2002:~# kubectl -n stevemunene-pvc-tests get events -w
LAST SEEN   TYPE      REASON                 OBJECT                                MESSAGE
9s          Warning   FailedScheduling       pod/pod-with-raw-block-volume         0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
11s         Normal    Provisioning           persistentvolumeclaim/raw-block-pvc   External provisioner is provisioning volume for claim "stevemunene-pvc-tests/raw-block-pvc"
14s         Normal    ExternalProvisioning   persistentvolumeclaim/raw-block-pvc   Waiting for a volume to be created either by the external provisioner 'rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Created the keytabs for the stat hosts and added them.
Oct 2 2025
From discussions on this, we decided to add the druid-coordinator service for the public cluster to LVS, to get a single usable URL for the service. However, there were concerns about whether this is needed, as Druid host changes occur rarely (roughly every 3 years, per the server lifecycle).
Moreover, there are some discussions about running Druid on k8s at some point in the future.
Going through the ceph-csi releases for any change that might have impacted us in the 1.31 upgrade.
Oct 1 2025
Thanks @SLopes-WMF the files have been cleared and we can close the task.
In communication with @SLopes-WMF and the files have been shared for review.
Thanks @Gehel
Completed the Hadoop setup by creating the UNIX user/group and the ops group.
Thanks @MoritzMuehlenhoff; approval requests should be approved by @gmodena.
Sep 30 2025
To help identify the current users, I have set up sessions with some of the active users to see what they are working on and to find a better way of automating the discovery.
The dse-k8s-codfw cluster is up and running and the components can be verified as below:
A simple PVC definition as a raw block device
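The consuming side of the test presumably references the claim via volumeDevices, matching the pod-with-raw-block-volume seen in the events below. A minimal sketch (image and device path are assumptions):

```yaml
# Illustrative pod consuming the raw-block PVC: volumeDevices exposes the
# claim as a device node inside the container instead of a mount.
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-raw-block-volume
spec:
  containers:
    - name: test
      image: debian:bookworm-slim     # illustrative image
      command: ["sleep", "infinity"]
      volumeDevices:
        - name: data
          devicePath: /dev/xvda       # assumed device path in the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc
```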
root@deploy2002:~# kube_env admin dse-k8s-codfw
root@deploy2002:~# kubectl create namespace stevemunene-pvc-tests
namespace/stevemunene-pvc-tests created
root@deploy2002:~# kubectl get namespaces
NAME                    STATUS   AGE
analytics-test          Active   47h
cert-manager            Active   10d
default                 Active   10d
echoserver              Active   10d
external-services       Active   10d
istio-system            Active   10d
kube-node-lease         Active   10d
kube-public             Active   10d
kube-system             Active   10d
opensearch-ipoid        Active   10d
opensearch-ipoid-test   Active   10d
opensearch-operator     Active   10d
opensearch-test         Active   10d
sidecar-controller      Active   10d
stevemunene-pvc-tests   Active   13s
Sep 29 2025
Depooled and removed the hosts from LVS.
Sep 25 2025
Druid hosts are done decommissioning; next is removing them from LVS.
Thanks, user files have been deleted.