Tue, Jul 7
That sounds nice!
I would suggest updating the image version in the helmfile.d values (e.g. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/staging/blubberoid/values.yaml#33) instead of the chart itself, though. In general a new chart release is (or at least should be) only needed when substantial changes have been made to the containerized application (changes that would change the way the container is deployed/run, not changes to what is run inside the container).
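For concreteness, something along these lines is what I mean (the key names are assumed from the linked blubberoid values.yaml and the tag value is illustrative):
  $ sed -n '/main_app:/,/version:/p' helmfile.d/services/staging/blubberoid/values.yaml
  main_app:
    image: wikimedia/blubberoid
    version: 2020-07-06-120000-production   # bump this tag to the new image; no new chart release needed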
Mon, Jul 6
VMs created, installed and ran puppet insetup role successfully. Both came up fine after reboot.
Fri, Jul 3
Thu, Jul 2
This is resolved now. For anyone coming across this later, please see: https://wikitech.wikimedia.org/wiki/Docker#Deleting_an_image_(from_registry) and T242604
Wed, Jul 1
Unfortunately removing all tags of an image (i.e. a repository) does not remove the repository itself from the registry. What that means is that the "image" will still be listed in the catalog (GET /v2/_catalog).
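For example, the repository keeps showing up here even with zero tags left (output trimmed, repository name illustrative):
  $ curl -s https://docker-registry.discovery.wmnet/v2/_catalog
  {"repositories":[..., "wikimedia/some-deleted-image", ...]}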
This is the old Puppet CA that some docker daemons have still loaded.
Unfortunately a docker reload does not reload the CA, so we need to do a docker restart on: kubernetes[2001-2004].codfw.wmnet, kubernetes[1001-1004].eqiad.wmnet. Newer Kubernetes nodes already started with the updated CA and are fine.
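Something along these lines should do it, one node at a time (the exact cumin invocation here is only a sketch):
  $ sudo cumin -b 1 'kubernetes[2001-2004].codfw.wmnet,kubernetes[1001-1004].eqiad.wmnet' 'systemctl restart docker'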
Raising prio as we do have the same situation on prod clusters.
It's only docker that is totally sure that the certificate is not valid, so I guess it does not reload ca-certificates (even on SIGHUP).
Still getting ErrImagePull in kubectl get events:
73s  Normal   Pulling  Pod  pulling image "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-mobileapps:2020-06-29-163540-production"
73s  Warning  Failed   Pod  Failed to pull image "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-mobileapps:2020-06-29-163540-production": rpc error: code = Unknown desc = Error response from daemon: Get https://docker-registry.discovery.wmnet/v1/_ping: x509: certificate has expired or is not yet valid
73s  Warning  Failed   Pod  Error: ErrImagePull
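A quick way to compare what the registry endpoint actually serves against what the docker daemon still trusts (just a diagnostic sketch):
  $ echo | openssl s_client -connect docker-registry.discovery.wmnet:443 2>/dev/null | openssl x509 -noout -issuer -dates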
Tue, Jun 30
This led to failing docker-reporter-base-images.service on deneb. I'm definitely missing something here...
Seems it is required to fetch the tag list once while bypassing the caches to have the lingering references removed:
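Roughly this kind of request, i.e. the tags/list endpoint with a no-cache header (whether that header is what actually bypasses our cache layer is an assumption on my part):
  $ curl -s -H 'Cache-Control: no-cache' https://docker-registry.discovery.wmnet/v2/<image-name>/tags/list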
Mon, Jun 29
I tried to delete the tags/image with the process described here but unfortunately the tags can still be pulled after successful DELETE (another DELETE even returns HTTP 404).
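For context, the standard registry V2 delete flow looks roughly like this (image name, tag and digest are placeholders):
  $ curl -sI -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' https://docker-registry.discovery.wmnet/v2/<image-name>/manifests/<tag> | grep -i docker-content-digest
  $ curl -s -X DELETE https://docker-registry.discovery.wmnet/v2/<image-name>/manifests/<sha256-digest>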
I guess a garbage-collection run is needed to actually remove the tags from the registry. I tried that (--dry-run) on registry2001, where it has been running for 5 hours now and is still going. According to the output (which seems to go over all images in alphabetical order) it has reached the last image, but it's still doing a lot of swift requests, so it's probably not stuck...
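For reference, the run in question is the registry's own garbage collector, roughly like this (binary name and config path may differ on our registry hosts):
  $ docker-registry garbage-collect --dry-run /etc/docker-registry/config.yml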
Fri, Jun 26
Turns out our swift cluster only supports Swift V1 auth, which ChartMuseum does not. I've tried the S3 API as well, but that only supports "v2 signatures", which ChartMuseum ... does not (because the official aws-sdk-go only supports v4 signatures).
This is done and the account is working, thanks @fgiunchedi !
Wed, Jun 24
That could help but the alert should always be actionable. For that to happen the owner needs to acknowledge the need for it, which might not happen at the same time for all services.
With kube-state-metrics (sorry for me repeating this over and over 😂 ) there is kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason which can be used to detect OOM on containers.
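For example, a query along these lines could drive such an alert (the Prometheus URL is illustrative and the expression is only a sketch):
  $ curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum by (namespace, pod, container) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) > 0'
and similarly increase(kube_pod_container_status_restarts_total[1h]) > 0 for plain container restarts.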
Commit in private is e427c266f2d6ac0a937bf5d972b759933a9f9a18
I seem unable to screenshot the tooltip, but it contains the repo name and the commit message.
Tue, Jun 23
We don't expect private data in the charts at all.
In addition, they are already publicly accessible via https://releases.wikimedia.org/charts/ and https://gerrit.wikimedia.org/g/operations/deployment-charts ofc.
Thanks for writing this up @akosiaris! I think it would be nice to have the follow up tasks linked here. Like the removal of the service-runner and splitting up changeprop into multiple deployments (one per topic?).
Maybe we should also add a follow-up to alert/warn on OOM kills / container restarts?
Mon, Jun 22
I need to make decisions regarding TLS and storage:
Fri, Jun 19
@Joe I think cxserver is missing the last two steps as well, correct?
Wed, Jun 17
Tue, Jun 16
@Michael thanks for writing this up!
Thu, Jun 11
All clusters are now free of the envoy-tls-local-proxy image!
Wed, Jun 10
Tue, Jun 9
Jun 5 2020
Add everywhere except eventstream and eventgate.
Oh, my bad. Then we'll create them for you ofc.
Unfortunately starting with TLS right away would not permit the gradual traffic shift Alex was suggesting so it's probably better to start without and migrate to TLS in a second step. :-/
Jun 4 2020
If you want to start with TLS (via envoy) right away (which would be great!), you need to go through the extra steps of generating certificates (current document draft at https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/Add_Tls_On_Kubernetes) and "registering" a TCP port at https://wikitech.wikimedia.org/wiki/Service_ports
Jun 3 2020
And I now see T242861, so please ignore what I said (or at least what I was suggesting).
I'll evaluate the route of merging the common_templates v0.2 changes into eventgate/eventstream forks instead to not have this blocked.
Oh. I see that the current canary setup will not work with my suggestions, and as far as I can tell there is currently no way to do it with the default scaffold/templates.
So eventgate and eventstream use forked tls_helpers (currently even the forks slightly differ).
May 29 2020
May 28 2020
Just to have the reference here. I guess it's: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry
May 27 2020
tiller has been updated in all clusters and namespaces so this is resolved now