User Details
- User Since: Apr 2 2020, 9:01 AM (220 w, 20 h)
- Availability: Available
- IRC Nick: jayme
- LDAP User: JMeybohm
- MediaWiki User: JMeybohm (WMF)
Mon, Jun 17
Any objections to just removing the VM, since we moved to (re-)packaging upstream (https://wikitech.wikimedia.org/wiki/Envoy#Building_envoy_for_WMF)?
Tue, Jun 11
Not sure if this is the source of it, but full CI runs do fail because services/mw-videoscaler/staging fails to render.
Finally gone...
I just had to deploy machinetranslation for T346638: Rename the envoy's uses_ingress option to sets_sni and noticed container startup times of around 5 minutes (and thought something had gone totally wrong). I'm still seeing data getting pulled from peopleweb - are the plans to improve this still ongoing?
Fri, Jun 7
The ratelimit service has been deployed to staging and prod wikikube clusters.
What's left to be done is to configure cirrus-streaming-updater to use it (see https://wikitech.wikimedia.org/wiki/Ratelimit#Enable/opt_in_to_rate_limiting). Looking at the values files, I'm not sure which components (all of them?) should be rate limited, so I'd like to leave that change to you @pfischer / @bking / @dcausse. Feel free to send it my way for review, or sync with me for the deployment so we can verify everything works as expected.
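For illustration only: the opt-in would presumably be a small values change for the components in question, along these lines. The key names below are placeholders - the real ones are whatever the chart and the linked wiki page define:

# hypothetical sketch of a values file addition for cirrus-streaming-updater
ratelimit:
  enabled: true        # opt this component in to rate limiting
  service: ratelimit   # placeholder reference to the deployed ratelimit service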
Thu, Jun 6
I see that we run nginx with the default Debian nginx.conf, which has worker_connections 768; and no worker_rlimit_nofile set. The generic tlsproxy module in puppet uses worker_connections 131072 (no idea where that number comes from) and worker_rlimit_nofile 131072 * 2.
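For reference, the two setups boil down to roughly this in nginx.conf terms (a sketch of the directives mentioned above, not a copy of either file):

# default Debian nginx.conf (relevant parts): no worker_rlimit_nofile set
events {
    worker_connections 768;
}

# generic tlsproxy module in puppet (relevant parts)
worker_rlimit_nofile 262144;   # 131072 * 2
events {
    worker_connections 131072;
}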
Wed, Jun 5
Unfortunately this is actually nginx complaining:
Mon, Jun 3
As I see it, we're currently also still running the ganeti etcd instances in codfw and eqiad, which I think limits the performance of the etcd cluster by quite a bit. Was it a deliberate decision not to remove them?
This is basically T287491: Allow to address Kubernetes API servers from NetworkPolicy
IMHO the easiest and least intrusive way to do this with an upstream helm chart is to just add a calico NetworkPolicy template to the chart (the file could even be prefixed with wmf-) that creates just that one policy. The linked phab task should contain some examples for that.
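Roughly what I have in mind - an untested sketch, where the selector, address and port are placeholders (the real API server values are in the linked task):

# templates/wmf-networkpolicy.yaml (hypothetical file added to the upstream chart)
apiVersion: crd.projectcalico.org/v1
kind: NetworkPolicy
metadata:
  name: wmf-allow-kube-apiserver-egress
spec:
  selector: app == 'upstream-chart-app'   # placeholder, should match the chart's pods
  types:
    - Egress
  egress:
    - action: Allow
      protocol: TCP
      destination:
        nets:
          - 198.51.100.10/32              # placeholder API server address
        ports:
          - 6443                          # placeholder API server port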
Thu, May 23
kubernetes2023 is still cordoned and depooled for additional tests of the vlan move process
After the reimage I needed to run the following for calico to start up properly:
Wed, May 22
I've cleared out kubernetes2023 and kubernetes2032 for you to run tests. As the hosts are pooled=inactive and cordoned in k8s, all you have to do is downtime them (which the cookbooks probably do).
@ayounsi I think we could test the rename cookbook together with T350152: Automation to change a server's vlan on the already cordoned kubernetes2023.codfw.wmnet, right?
May 21 2024
I might be missing something obvious here, but I have two questions:
- Why add the statsd deployment to the mediawiki chart instead of using a statsd chart and adding a statsd release to the mediawiki helmfile.yaml's?
- Why do we need to tunnel statsd through the local envoy? Can't mediawiki use $namespace.$environment.$release-statsd directly? (See the sketch below.)
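For the second question, what I have in mind is that a separate statsd release would come with its own Service that mediawiki could talk to directly. A generic Kubernetes sketch with placeholder names and ports, not actual chart output:

apiVersion: v1
kind: Service
metadata:
  name: main-statsd    # placeholder; would be reachable as main-statsd.<namespace>.svc.cluster.local
  namespace: mw-web    # placeholder namespace
spec:
  selector:
    app: statsd        # placeholder label on the statsd pods
  ports:
    - name: statsd
      protocol: UDP
      port: 8125
      targetPort: 8125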
May 17 2024
Both staging clusters have been migrated to stacked control-planes
For T362408: Migration to containerd and away from docker we're planning to backport containerd from bookworm to bullseye. Maybe it would be feasible to backport runc as well (although that won't help you with T363191: Test if we can avoid ROCm debian packages on k8s nodes, of course)?
May 16 2024
Successfully published image docker-registry.discovery.wmnet/ratelimit:9.0.2-20240503.3fcc360, supporting hot reload of gRPC certs. This should unblock deploying the ratelimit service.
May 14 2024
You got me @elukey :-p
For reasons I have not yet tried to understand, the mknod cgroup permission is the culprit. Without it, the access() call fails:
Two more data points that don't help at all:
jayme@ml-staging2001:~$ sudo docker exec -it --user 0 k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False
jayme@ml-staging2001:~$ sudo docker exec -it --user 0 --privileged k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False
Also worth noting that mediawiki already calls sessionstore via its envoy sidecar, so we do have telemetry data from prod and we should be able to see the impact there pretty quickly as well: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=sessionstore
It's not clear to me what happened here. The makevm call was unable to detect a puppet run; at that time the sudo rules were not present on the machine and puppet had not run. I terminated makevm and ran a reimage with more or less the same result, but at the time that cookbook was waiting for puppet, sudo wasn't even installed. install_console was not accessible in either case, and manual root logins via the ganeti console did not work either.
Overnight things seem to have cleared up. Even though the cookbook failed, puppet did run, I'm able to log in, and sudo works (as do successive puppet runs)...
May 13 2024
kubestagemaster2005 got stuck at:
kubestagemaster2004 is done (I messed up the phab ID in the cumin command, so the report ended up in https://phabricator.wikimedia.org/T363310#9790605)
I tend to agree, also for the sake of aligning sessionstore with the rest of our services. Unfortunately this feels like the more involved change (A change to sessionstore was somewhat high risk ...) - but I think it's also true that it should not add much latency.
To be precise here: if the service backing this as well as all consumers run in the same k8s cluster, we could implement network policies that only allow access from certain workloads in the cluster (roughly like the sketch after this list). But I would advise against relying on that because:
- We still have mediawiki appservers on hardware, and there will probably be some snowflakes for which I don't know the implications of this
- We won't be able to use this service cross-dc (as we do with all other active/active services), e.g. for depooling in an emergency etc. (which would make this a snowflake)
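To illustrate what "only allow access from certain workloads" would mean in practice - a generic Kubernetes sketch, with labels and namespace as placeholders rather than anything we actually use:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sessionstore-allow-consumers-only   # illustrative name
  namespace: sessionstore                    # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: kask                              # placeholder label for the sessionstore pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: mw-web   # placeholder consumer namespace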
May 8 2024
I've not looked at this in detail (and I probably won't be able to before the end of next week), but what immediately worries me is the daemonset running with superpowers. We decided not to do this with calico, for example, and instead distribute the CNI plugin via debian packages - would that potentially be an option here as well, or has it been considered?
May 7 2024
I deployed the cert-manager changes to aux, @brouberol did dse, and @klausman will take care of the ml clusters. Thanks all!
May 6 2024
Pushing both branches worked now, thanks!
@akosiaris maybe you recall if there was a deliberate decision not to use the service mesh for kask/sessionstore?
Thank you! Almost there. It now fails with:
May 2 2024
staging-eqiad has been migrated to /28 blocks as well