I'm Arturo Borrero Gonzalez from Seville, Spain. I'm a Site Reliability Engineer (SRE) on the Wikimedia Cloud Services team, part of the Wikimedia Foundation staff.
You may also find me in some FLOSS projects, such as Netfilter and Debian.
this could be related to work on the parent task
In T327919#8711755, @Papaul wrote:
In T327919#8699523, @cmooney wrote: In terms of the move we need to work with @aborrero and the team to decide when is good to do the work. We can do it in a number of batches or all in one go, whatever you guys think is best. I can move the interfaces in Netbox and configure the new switch in advance in either case.
In T332191#8698703, @dcaro wrote: I've got a question: will these be shared in https://config-master.wikimedia.org/known_hosts.ecdsa ?
I use https://pypi.org/project/wm-ssh/, and it uses that url to fetch lists of hosts, just curious
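For context, that file is (as far as I know) a plain OpenSSH known_hosts file, so any client can consume it directly; a rough sketch (the local path and target hostname below are just placeholders):

curl -sf -o ~/.ssh/wmf_known_hosts https://config-master.wikimedia.org/known_hosts.ecdsa
ssh -o UserKnownHostsFile=~/.ssh/wmf_known_hosts some-host.eqiad.wmnet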
In T324992#8697715, @aborrero wrote: cloudlb2001-dev lacks the right switch vlan trunk in the main interface: https://netbox.wikimedia.org/dcim/interfaces/16653/ We need to enable 2151 there like in the other cloudlb hosts (example: https://netbox.wikimedia.org/dcim/interfaces/28615/)
cloudlb2001-dev lacks the right switch vlan trunk in the main interface: https://netbox.wikimedia.org/dcim/interfaces/16653/ We need to enable 2151 there like in the other cloudlb hosts (example: https://netbox.wikimedia.org/dcim/interfaces/28615/)
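If it helps with double-checking, the trunk config can be read straight from the Netbox API; a quick sketch (assumes an API token in $NETBOX_TOKEN, field names are the standard Netbox DCIM ones):

curl -s -H "Authorization: Token $NETBOX_TOKEN" https://netbox.wikimedia.org/api/dcim/interfaces/16653/ \
  | jq '{untagged: .untagged_vlan.vid, tagged: [.tagged_vlans[].vid]}'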
just rebased https://gerrit.wikimedia.org/r/c/operations/puppet/+/868731 to introduce the BIRD configuration. Please take a look and comment.
We will need to think about the public IPv4 address to use as VIP before merging it.
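Once the patch is merged, something along these lines should confirm the VIP is actually exported over BGP from the host (the prefix below is only a documentation placeholder, not the real VIP):

sudo birdc show protocols
sudo birdc show route for 192.0.2.10 all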
In T293649#8676822, @aborrero wrote: I would like to know more details about why maintain_harbor is planned to run as a Toolforge jobs framework cronjob rather than a standalone application (or cronjob) in the kubernetes cluster.
I'm mentioning this because I feel that tying the two things together can make it cumbersome to operate (both things) in the future, for little added value.
The change to run as a standalone cronjob deployment in k8s would be very small.
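To illustrate how small that standalone deployment would be, a minimal sketch of a plain Kubernetes CronJob (namespace, image, schedule and entry point below are all made up):

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: maintain-harbor
  namespace: maintain-harbor
spec:
  schedule: "*/10 * * * *"          # placeholder schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: maintain-harbor
              image: example-registry.local/maintain-harbor:latest   # placeholder image
              command: ["python3", "-m", "maintain_harbor"]          # placeholder entry point
EOF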
In T327919#8679314, @cmooney wrote: In T327919#8664016, @aborrero wrote: Please let me know if there is something I can do to help with this (no switch config but perhaps testing, double checking stuff, IP allocation, connectivity, etc)
Thanks @aborrero
Config of the new switch is progressing well, just waiting on two cable moves (see T331470#8676018) and I will migrate the uplink/gateway for the cloud vlans from CR routers to the new switch.
Once that's done we can try to reimage / install the OS on the new cloudlbs. If that goes to plan we can migrate the existing hosts over from the old switch to the new one. If you can have a think about what's involved to do both of those that'd be great. No IPs etc. need to change so I think it should just be a matter of arranging the downtime and co-ordinating with DC-Ops. Thanks!
I can't explain how it's possible that the code was working before. PSPs were in policy/v1beta1 in 1.21: https://v1-21.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.21/#podsecuritypolicy-v1beta1-policy
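For the record, this is easy to double-check against a running cluster:

# list which API group/version PodSecurityPolicy is served under, if at all
kubectl api-resources --api-group=policy
kubectl api-versions | grep '^policy/'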
In T326758#8676367, @aborrero wrote: Another option is to make sure we're using an aio1 hostname from the beginning. Will try that!
I would like to know more details about why maintain_harbor is planned to run as a Toolforge jobs framework cronjob rather than a standalone application (or cronjob) in the kubernetes cluster.
In T326758#8676365, @aborrero wrote: Therefore I don't see an easy way to play with openstack-ansible in AIO mode within Cloud VPS VMs.
In T326758#8644613, @aborrero wrote: Note, by default the openstack-ansible all-in-one setup renames the VM hostname, introducing a severe drift with respect to the rest of the Cloud VPS context (Puppet, etc.), therefore making it difficult to operate inside Cloud VPS for evaluation & testing purposes. Will investigate next whether this renaming can be disabled.
In T324992#8671971, @cmooney wrote: @aborrero I'd propose to allocate the following for the cloud-private subnets / vlans, does that look ok to you?
"supernet": 172.20.0.0/16
Sent a ping to @Marostegui regarding clouddb[1013-1014,1021]
In T286856#8671463, @taavi wrote: In T286856#8669296, @aborrero wrote: That could work, but mind that it is Easter/Holy Week and some countries (including mine) have at least 2 bank holidays, so it could be a short week. Anyway, I'm planning to be on the laptop Monday to Wednesday.
I'm aware but I think it's fine if we do it early in the week. Would Monday work for you? Do you have any time preferences?
In T286856#8669077, @taavi wrote: Good to know, thanks. Next week does not work for me, nor does the week starting the 27th, so looks like this needs to be pushed into April. What about the week starting April 3rd?
In T286856#8668933, @taavi wrote: All of the blockers have been resolved so we can now start thinking about timelines for the actual upgrade. A change to the timeline this time is that PAWS no longer uses the same Puppetization and will be upgraded separately. With that and my personal schedules in mind I propose the following:
- toolsbeta: Upgrade this week, either tomorrow or on Wednesday.
- tools: Upgrade on Wednesday, March 22nd.
Any objections?
In T328539#8668284, @dcaro wrote: Will this include creating a repo for toolforge where we bundle up all these components or similar? (something like a toolforge repo with a helmfile pulling all the others)
I'm a bit concerned about the sprawl of components without keeping track of the combinations that are deployed.
If so, that might belong to that repository.
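For the record, the bundle repo @dcaro describes could look roughly like this (component names and chart paths are invented, just to show the shape):

cat > helmfile.yaml <<'EOF'
releases:
  - name: jobs-framework-api          # example component
    namespace: jobs-api
    chart: ./charts/jobs-framework-api
  - name: maintain-kubeusers          # example component
    namespace: maintain-kubeusers
    chart: ./charts/maintain-kubeusers
EOF
helmfile diff    # show what would change in the cluster
helmfile apply   # roll out this pinned combination of components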
In T328539#8614259, @taavi wrote: The base RBAC is a good candidate for moving into the maintain-kubeusers repository, I think.
thanks for the update!
This ticket had little activity in the last month. Did something happen offline that wasn't recorded in here?
Note, by default the openstack-ansible all-in-one setup renames the VM hostname, introducing a severe drift with respect to the rest of the Cloud VPS context (Puppet, etc.), therefore making it difficult to operate inside Cloud VPS for evaluation & testing purposes. Will investigate next whether this renaming can be disabled.
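For reference, the drift is easy to spot on an affected VM, something like:

# compare the kernel hostname (renamed to "aio1" by the AIO bootstrap) with what Puppet expects
hostnamectl --static
puppet config print certname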
Created https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Local_customization_and_hacks with this information
This should be done now.
On a quick read perhaps Option 4 provides the most value/flexibility?
Created some additional docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/lima-kilo
Next question would be:
Patching kolla containers: https://docs.openstack.org/kolla/latest/admin/image-building.html
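The short version of that doc, as I read it, is a template override plus a rebuild; a rough sketch (the block name, image name and patch content are just examples):

cat > template-override.j2 <<'EOF'
{% extends parent_template %}

{% block base_footer %}
RUN echo "local patch applied" > /patched    # example patch, content is made up
{% endblock %}
EOF
kolla-build --template-override template-override.j2 keystone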
Question for NetOps: these hosts live in the cloud-hosts vlan. Is it OK if some hosts attached to that VLAN use a high MTU and others don't?
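To verify from a given host, something like this works (interface name and target host are placeholders):

# check the MTU currently configured on the interface
ip link show dev eno1 | head -1
# test that jumbo frames actually reach a peer without fragmentation
# (8972 bytes of payload = 9000 MTU minus 20 IP + 8 ICMP header bytes)
ping -c 3 -M do -s 8972 some-peer-host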
tools.arturo-test-tool@tools-sgebastion-11:~$ toolforge-jobs images
Short name    Container image URL
------------  ----------------------------------------------------------------------
bullseye      docker-registry.tools.wmflabs.org/toolforge-bullseye-sssd:latest
golang1.11    docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest
jdk17         docker-registry.tools.wmflabs.org/toolforge-jdk17-sssd-base:latest
mariadb       docker-registry.tools.wmflabs.org/toolforge-mariadb-sssd-base:latest
mono6.8       docker-registry.tools.wmflabs.org/toolforge-mono68-sssd-base:latest
node16        docker-registry.tools.wmflabs.org/toolforge-node16-sssd-base:latest
perl5.32      docker-registry.tools.wmflabs.org/toolforge-perl532-sssd-base:latest
php7.4        docker-registry.tools.wmflabs.org/toolforge-php74-sssd-base:latest
python3.9     docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest
ruby2.1       docker-registry.tools.wmflabs.org/toolforge-ruby21-sssd-base:latest
ruby2.7       docker-registry.tools.wmflabs.org/toolforge-ruby27-sssd-base:latest
tcl8.6        docker-registry.tools.wmflabs.org/toolforge-tcl86-sssd-base:latest
tools.arturo-test-tool@tools-sgebastion-11:~$ toolforge-jobs run mariadb --command 'sleep 3600' --image mariadb
tools.arturo-test-tool@tools-sgebastion-11:~$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
mariadb-49nnz           1/1     Running   0          8s
test-6d76568d94-mbzrh   1/1     Running   1          23d
test2-bcb6c74d9-6ncd7   1/1     Running   0          3d16h
tools.arturo-test-tool@tools-sgebastion-11:~$ kubectl exec -it mariadb-49nnz -- bash
tools.arturo-test-tool@mariadb-49nnz:~$ sql -h
usage: sql [-h] [-v] [-N] [--cluster {analytics,web}] DATABASE ...
[..]
tools.arturo-test-tool@mariadb-49nnz:~$ mysql --help
mysql  Ver 15.1 Distrib 10.5.18-MariaDB, for debian-linux-gnu (x86_64) using EditLine wrapper
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
We now have a new image available which contains both curl and some mysql client tools, see:
In T320178#8618628, @Urbanecm wrote: Thanks for the answer! It's possible I'm missing something here, but I don't understand how the bullseye image would help in this case. As far as I can see, the bullseye image has neither mysql nor wget available (both of those utilities are needed by the script):
What about having more pods in the deployment?
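(For example, roughly like this; the deployment name is a placeholder:)

kubectl scale deployment/my-deployment --replicas=2
kubectl get pods    # confirm the extra replica came up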
In T320178#8546721, @Urbanecm wrote: Since the script also runs a Python script, I first tried tf-python39, where there is no wget. I was unable to find wget installed in other containers, too. By shelling into the jobs container, I also figured that mysql is missing as well.
How can I migrate similar simple shell scripts (in tools.wmcz and elsewhere), please? Thanks!
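Not an authoritative answer, but with the mariadb image shown above (which carries the mysql client and curl) such a script can probably be wrapped along these lines (job and script names are placeholders):

# one-off run to test the script inside the image
toolforge-jobs run test-run --image mariadb --command './my-script.sh'
# then schedule it, roughly matching the old crontab entry
toolforge-jobs run nightly-run --image mariadb --command './my-script.sh' --schedule '0 3 * * *'
toolforge-jobs list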
In T319700#8305968, @UkrFace wrote: Hello.
Thank you for letting me know.
Will the jlocal command be kept (for small tasks)?
<taavi> Feb 14 13:21:28 tools-sgecron-2 collector-runner[4667]: 2023-02-14 13:21:28,798 Service monitor run completed, 283 webservices restarted
Some additional information.
Option #2 has been implemented (drop statsd support). The service is now up and running.
Removed them.
The webservicemonitor doing its thing:
I think the actual fix here may be to fix T329467: remove webservicemonitor (down due to DNS errors) and let it recover the webservices instead of me doing it manually.