User Details
- User Since: Apr 2 2020, 9:01 AM (305 w, 1 d)
- Availability: Available
- IRC Nick: jayme
- LDAP User: JMeybohm
- MediaWiki User: JMeybohm (WMF)
Today
Thank you all for untangling and documenting this!
I would suggest decoupling this from the k8s upgrade procedure. It surfaced there, but it is actually a MediaWiki bootstrapping problem that might bite us in disaster recovery or similar scenarios as well. I'm not totally sure about this, but if scap was capable of bootstrapping MediaWiki in the past, shouldn't it still be able to do so? The comparison is probably flawed since we were running support releases outside of scap's reach in the past (like statsd-exporter, for example), but it also feels off to have to maintain knowledge about what to do when (like the list of mw namespaces and support releases) in multiple places (scap and wikitech/charlie/...).
I don't think it makes much sense to maintain a list of namespaces where something is deployed inside the artifact that is being deployed.
My aim here was to add an annotation or label to the namespace objects in Kubernetes. This could easily be done during namespace creation in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/main.yaml#14 and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/helmfile_namespaces.yaml. The hard part is figuring out which identifier to add (Phabricator tag, team name, ... ideally not something that changes every quarter) and the actually responsible group (as usual).
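As a sketch, such ownership metadata could look roughly like this on a namespace object (the label/annotation keys and the team name are made up for illustration; picking the real identifier scheme is exactly the open question above):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example-service
  labels:
    # hypothetical ownership label; key and value scheme to be decided
    wikimedia.org/owner: example-team
  annotations:
    # hypothetical pointer to the owning team's Phabricator tag
    wikimedia.org/phabricator-tag: "#example-team"
```

Labels would allow selecting namespaces by owner (`kubectl get ns -l wikimedia.org/owner=example-team`), while annotations are better suited for free-form values like tags or contact info.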
We have rough documentation of our policy and process around adopting upstream Helm charts, which can be found here: https://wikitech.wikimedia.org/wiki/Kubernetes/Upstream_Helm_charts_policy
Yesterday
I disagree that this is a duplicate of T412693: Ensure all Chart.yaml files include required metadata fields, as that one aims at chart ownership while this task aims at ownership of k8s namespaces (or groups of deployments, if that makes more sense). Those might contain deployments of charts maintained by the same group/team - but that's not a requirement.
This is done for all wikikube clusters, so we can move this to radar on our side. Still relevant for:
Tue, Jan 20
Removing serviceops since we won't be working on this.
Summarizing the current state and our recent discussion about this:
I've already tried to make lsof exclude a bunch of mount points with:
lsof -nXd DEL $(findmnt -t tmpfs,nsfs,overlay,proc,sysfs,cgroup2,devtmpfs,devpts,securityfs,pstore,bpf,hugetlbfs,mqueue,debugfs,tracefs,fusectl,configfs,ramfs -o TARGET -n --list | sed 's/^/-e /')
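The command substitution above just prefixes every findmnt target with `-e` so that lsof exempts those file systems. A minimal illustration of the sed part, using a made-up list of mount points instead of the real findmnt output:

```shell
# Turn a newline-separated list of mount points into "-e PATH" arguments,
# one per line, as consumed by lsof via word splitting in $(...)
printf '/proc\n/sys\n/dev\n' | sed 's/^/-e /'
```

This prints `-e /proc`, `-e /sys`, `-e /dev` on separate lines, which the shell then splits into individual arguments for lsof.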
Mon, Jan 19
I'm resolving this since we updated the wikikube clusters quite some time ago; the remaining work, cleanups, etc. will be handled in subtasks.
Thu, Jan 15
The X-Cache-Status failures are gone now:
jayme@cumin1003:~$ sudo httpbb /srv/deployment/httpbb-tests/docker-registry/test_docker-registry.yaml --hosts 'registry2004.codfw.wmnet'
Sending to registry2004.codfw.wmnet...
https://docker-registry.wikimedia.org/v2/ml/nonexistent/manifests/latest (/srv/deployment/httpbb-tests/docker-registry/test_docker-registry.yaml:106)
Status code: expected 404, got 401.
https://docker-registry.wikimedia.org/v2/ml/nonexistent/blobs/upload (/srv/deployment/httpbb-tests/docker-registry/test_docker-registry.yaml:110)
Status code: expected 404, got 401.
===
FAIL: 22 requests sent to registry2004.codfw.wmnet. 2 requests with failed assertions.
Since there is clearly no need for optimization here, I'll resolve this now.
The 403 vs. 401 or 404 discrepancies are the result of the tests being run against a read-only (profile::docker_registry::read_only_mode) instance of the registry. I have updated the wikitech page accordingly.
I would assume this is not related to your personal account (since the SQL query is clearly done as the research user). Data-Platform-SRE || Data-Engineering, can you help with this?
@MoritzMuehlenhoff assigning to you so the next clinic duty person knows you're working on this with Dan, thanks!
Wed, Jan 14
I have added you to the analytics-privatedata-users group. If that does not grant you the required privileges, please take a look at https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#Access_Levels and try to figure out (maybe with help from Data-Engineering folks) what access level you require.
@Clement_Goubert this looks done, is it?
@thcipriani this needs sign-off from you as the approver for the deployment group
Tue, Jan 13
The key has been verified and the patch merged. You should have access within ~30 minutes.
The kerberos principal has been created.
For out-of-band verification of the SSH key, please confirm the key by putting it onto your wiki user page.
Your account is already a member of the group (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/data/data.yaml#459). Would you please take a look at https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#Access_Levels and try to describe what you are missing or what you are trying to do that does not work?
I have created the trixie components and copied the packages:
We just got this as a red herring during a registry outage where nginx was failing to start (so nothing was listening).
Mon, Jan 12
Fri, Jan 9
@trueg could you please specify what access level you're requesting/what you need access to (see https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#What_access_should_I_request?)?
I see that you currently don't have shell access, but given you provided an SSH key I assume you're requesting shell access and analytics-privatedata-user membership?
Welcome!
Grafana access is granted by having an LDAP account. Please request access to logstash via Wikimedia IDM at https://idm.wikimedia.org.
Feel free to reopen this ticket in case you run into issues!
Merged the patch prepared by @Dzahn (thanks).
Thu, Jan 8
@KFrancis could you please confirm NDA status?
Release-Engineering-Team: Could you help with removing +2?
Access to the wmf group needs to be requested via Using_the_Wikimedia_Identity_Management_System nowadays. If you run into issues, please feel free to reopen this task.
@KFrancis could you please confirm NDA status?
I've moved the file out of the way to /root/See_T413433 in case someone lost a session.
Dec 19 2025
FWIW, there is the concept of helm test (https://helm.sh/docs/topics/chart_tests/), which is unused for most of our services, although we do create a test based on service-checker by default for all new charts: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/_scaffold/service/_skel/templates/tests/test-service-checker.yaml
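For reference, a helm test is just a pod template carrying the test hook annotation, run on demand via `helm test <release>`. A minimal sketch (the image, port, and endpoint are placeholders, not our actual service-checker template):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: "{{ .Release.Name }}-connection-test"
  annotations:
    # marks this pod as a test, executed only by `helm test`
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: curlimages/curl:8.8.0
      # placeholder check: fail the test pod unless the service answers over HTTP
      args: ["--fail", "http://{{ .Release.Name }}:8080/healthz"]
```

The test passes if the pod exits 0 and fails otherwise, so any container that probes the deployed service can serve as a test.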
Dec 18 2025
Dec 16 2025
Since we package envoy binaries now, this is no longer required.
Thanks for volunteering to remove the remaining certs and cergen config during your January cleanup!
With T352245: Migrate the etcd main cluster to cfssl-based PKI resolved, this has now been completed.
Closing again since it mostly seems to work fine and we can't reproduce the failures.
This is done. I've created T412805: Migrate ipblocks from fetch_external_clouds_vendors_nets.py to HIDDENPARMA for the follow-up work.
Two questions/suggestions in this regard:
- I see that we also have wikikube-ctrl2006 racked (T406596); would it make sense to do all three at once?
- Given we moved to UEFI as the default (and wikikube-ctrl2006 seems to require it anyway), I would suggest switching wikikube-ctrl200[4-5] to UEFI as well (so we don't have to do that later), see: https://wikitech.wikimedia.org/wiki/UEFI_Boot
