User Details
- User Since
- Jun 29 2021, 9:56 AM (250 w, 13 h)
- Availability
- Available
- IRC Nick
- btullis
- LDAP User
- Btullis
- MediaWiki User
- BTullis (WMF) [ Global Accounts ]
Today
@elukey - I think that you can consider cephosd[2001-2003].codfw.wmnet as lower risk for this work, so you can bring it forward.
Yesterday
Reassigning this back to myself.
There was a little confusion, but I'm confident that we have the metrics we need to proceed.
Thu, Apr 2
I'm reassigning this to you for now @bking - as I'll be out next week and you've been making solid progress on the telemetry work. Hope that's OK.
Thanks. I've now added all 7 of those users to the Airflow-DAGs project in GitLab.
4 Kerberos principals created and welcome emails sent.
btullis@krb1002:~$ sudo manage_principals.py create wmf-ldlulisa --email_address=ldlulisa@wikimedia.org
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to ldlulisa@wikimedia.org
You should also now be able to start configuring and testing your SSH access to production, as outlined here:
https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_SSH_config
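For reference, the end result is typically a `~/.ssh/config` stanza along these lines. This is an illustrative sketch only; the bastion hostname, username, and key filename here are assumptions, and the wikitech page above is the authoritative guide.

```
# Illustrative sketch only; follow the wikitech instructions above.
# The bastion hostname and key filename are assumptions.
Host *.eqiad.wmnet *.codfw.wmnet
    User your-shell-username
    ProxyJump bastion.wikimedia.org
    IdentityFile ~/.ssh/id_ed25519_production
    IdentitiesOnly yes
```

The `IdentitiesOnly yes` line stops ssh from offering unrelated keys to the bastion, which is a common cause of "too many authentication failures".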
I have now modified the airflow-platform-eng-ops group.
btullis@ldap-maint1001:~$ sudo modify-ldap-group airflow-platform-eng-ops
Searching in: dc=wikimedia,dc=org
1 entry read
Searching in: ou=groups
1 entry read
Search failed: No such object
No search results.
add: 0, rename: 0, modify: 1, delete: 0
Action? [yYqQvVebB*rsf+?] y
Done.
Checked that the changes were applied.
btullis@ldap-maint1001:~$ ldapsearch -x cn=airflow-platform-eng-ops | egrep '(wmf-ldlulisa|kmontalva-wmf|renilthomas|hshaikh|eenabulele|ptiwary|sg912)'
member: uid=sg912,ou=people,dc=wikimedia,dc=org
member: uid=eenabulele,ou=people,dc=wikimedia,dc=org
member: uid=hshaikh,ou=people,dc=wikimedia,dc=org
member: uid=kmontalva-wmf,ou=people,dc=wikimedia,dc=org
member: uid=ptiwary,ou=people,dc=wikimedia,dc=org
member: uid=renilthomas,ou=people,dc=wikimedia,dc=org
member: uid=wmf-ldlulisa,ou=people,dc=wikimedia,dc=org
You should now all have ops-level access on https://airflow-platform-eng.wikimedia.org
This patch for the POSIX groups is ready to go, I believe: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1267031
I have run cross-validate-accounts for all new production access requests, with no issues detected.
btullis@ldap-maint1001:~$ cross-validate-accounts --username wmf-ldlulisa --uid 46469 --email ldlulisa@wikimedia.org --real-name "Luvo Dlulisa" --ssh-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHYyrHgfVH5J5ahLjEzuGEbP7Yq0afDvZUDNuKEuYf9J luvodlulisa@wmf3275" --kerberos
btullis@ldap-maint1001:~$ echo $?
0
Great! Thanks @taavi
Although, that being said, I fully support Sandra's having these rights (or I wouldn't have added her in the first place).
Apologies. I had granted access for @Snwachukwu without following due process.
Namely: https://www.mediawiki.org/wiki/Gerrit/Privilege_policy
I have validated all SSH keys via out-of-band communication channels.
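As an illustration of what that check looks like in practice (the key path below is an assumption, not the actual procedure): compute the SHA256 fingerprint of the submitted public key locally, then compare it against the fingerprint the user reads out over a separate channel.

```shell
# Illustrative sketch of the out-of-band fingerprint check; the key
# path is an assumption. Compare this output with the fingerprint the
# user reads out over a separate channel (e.g. a video call).
KEYFILE="${KEYFILE:-$HOME/.ssh/id_ed25519.pub}"
if [ -f "$KEYFILE" ]; then
    # Prints bit length, SHA256 fingerprint, comment, and key type.
    ssh-keygen -lf "$KEYFILE"
else
    echo "no public key found at $KEYFILE"
fi
```

If both sides see the same `SHA256:` string, the key on file is the one the user actually holds.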
Wed, Apr 1
Hi @VRiley-WMF - Apologies for the delay in getting back to you. We haven't had a chance to do the fiddly bit with this yet, so I will reopen it and assign it to myself.
I have created the kerberos principal.
btullis@krb1002:~$ sudo manage_principals.py create andreawest --email_address=arwesterinen@gmail.com
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to arwesterinen@gmail.com
The email should contain the temporary password, along with instructions on how to reset it on first use.
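Roughly, the first-use reset looks like the following on a Kerberos-enabled host. These are standard MIT Kerberos commands, shown here as a sketch only; the instructions in the welcome email are authoritative.

```
# Not run here; requires a Kerberos-enabled production host.
kinit andreawest      # authenticate with the temporary password
kpasswd andreawest    # then set a new password of your own
```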
I believe that the GeoIP files may now be mounted by Airflow task pods.
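For anyone checking how that surfaces in a pod spec, a hostPath mount has roughly the shape below. The names and paths are illustrative assumptions; the actual mount is managed by the charts and the pod-security exemptions shown in the namespace labels further down.

```yaml
# Illustrative only: typical shape of a hostPath GeoIP mount.
volumes:
  - name: geoip
    hostPath:
      path: /usr/share/GeoIP   # assumed path
      type: Directory
containers:
  - name: task
    volumeMounts:
      - name: geoip
        mountPath: /usr/share/GeoIP
        readOnly: true
```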
I believe that this is now applying the correct pod security standards, so this can be closed.
btullis@deploy1003:~$ kubectl get namespaces mediawiki-dumps-legacy -oyaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    meta.helm.sh/release-name: namespaces
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-12-19T17:18:06Z"
  labels:
    app: raw
    app.kubernetes.io/managed-by: Helm
    chart: raw-0.3.0
    heritage: Helm
    istio-injection: disabled
    kubernetes.io/metadata.name: mediawiki-dumps-legacy
    pod-security.wmf.org/allow-hostpath-geoip: include
    pod-security.wmf.org/disallow-capabilities-adding-capabilities: exclude
    pod-security.wmf.org/disallow-capabilities-except-ptrace: include
    pod-security.wmf.org/disallow-capabilities-strict-adding-capabilities-strict: exclude
    pod-security.wmf.org/disallow-host-path: exclude
    pod-security.wmf.org/profile: restricted
    pod-security.wmf.org/restrict-volume-types-restricted-volumes: exclude
    release: namespaces
  name: mediawiki-dumps-legacy
  resourceVersion: "1282738711"
  uid: ddb55dfc-e4b9-443e-85f0-4e504793b9aa
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
Please feel free to reopen if there is anything still amiss.
I will manually add @RThomas-WMF to the wmf LDAP group. This is usually performed as self-service, as noted here: https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access but on this occasion I will add his account manually, using the method described here: https://wikitech.wikimedia.org/wiki/SRE/LDAP#Method_1
btullis@ldap-maint1001:~$ sudo modify-ldap-group wmf
Searching in: dc=wikimedia,dc=org
1 entry read
Searching in: ou=groups
1 entry read
Search failed: No such object
No search results.
add: 0, rename: 0, modify: 1, delete: 0
Action? [yYqQvVebB*rsf+?] y
Done.
btullis@ldap-maint1001:~$ ldapsearch -x cn=wmf|grep renilthomas
member: uid=renilthomas,ou=people,dc=wikimedia,dc=org
btullis@ldap-maint1001:~$
I'll follow up with a puppet patch to track this, even before switching this to a full shell account.
In the meantime, we will need an SSH key for those users who don't already have one for production shell access.
That's the following users:
In terms of manager approvals, @HShaikh is the manager for all 7 of the other users listed, and @HShaikh in turn reports to @lanebecker.
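For anyone who doesn't yet have a dedicated production key, generating one is quick. This is a sketch: the path and comment are illustrative, the empty passphrase (`-N ""`) only keeps the example non-interactive, and in practice you should set a strong passphrase and never reuse a key from elsewhere (e.g. Cloud VPS or Gerrit).

```shell
# Sketch only: generate a fresh ed25519 key pair for production access.
# Path and comment are illustrative; use ~/.ssh and a real passphrase.
KEYDIR="$(mktemp -d)"
ssh-keygen -t ed25519 -f "$KEYDIR/id_ed25519_production" -N "" -C "you@example"
# The .pub file is the part you share in the access request.
cat "$KEYDIR/id_ed25519_production.pub"
```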
OK, thanks for all of the input so far. I've got some clarification on the procedures from the SRE team and I have a way forward.
So far, we have only discussed the requirement to have ops-level access to https://airflow-platform-eng.wikimedia.org so that they can trigger DAG runs and re-run any failed tasks etc.
I can add the users to the required LDAP group to grant this privilege.
Tue, Mar 31
I can also take the druid[1012-1013].eqiad.wmnet hosts.
dbstore1007 is shortly to be refreshed by dbstore1010 when T417948: Q3:rack/setup/install dbstore1010 is complete.
Agreed. I'm happy to decom this server. As per the original description, this drive bay keeps connecting and dropping, and it costs us time every time we try to re-add a data volume to Hadoop.
I approve membership of analytics-admin
I believe that this is all ready to go now. I'll resolve the ticket, but please feel free to let me know if you have any problems with your access @AWesterinen.
Mon, Mar 30
@LDlulisa-WMF , @RThomas-WMF , @E.Enabulele - I think that the next step in this will probably be for you to request membership of the wmf group in LDAP.
It seems strange that you weren't already added to this group during your onboarding, but perhaps this is related to the fact that you are primarily working on WME, so the authentication system will be a little different.
I can pick this up and work with you on the details, to make sure that we get the level of access correct for your needs.
There is also a Slack thread around this request, which might be useful for reference.
I believe that this should be fixed now. I noticed that there were quite a few out-of-date dummy keytabs for old hosts in the labs-private repo, so I carried out a one-off sync of the directory structure and filenames from the private repo to the labs-private repo.
I propose to resolve this ticket now.
The Data-Engineering team is still working on T415941: [EPIC] Move SystemD timer based jobs to Airflow to migrate the remaining systemd timer jobs to Airflow, but a large proportion of them have already moved, and the alert noise associated with those that remain is very low.
I'm resolving this ticket, since it is historical.
Thu, Mar 26
Wed, Mar 25
How does this look, now that T416345 is finished and all of the dse-k8s-worker hosts are on 10 Gbps? Do you think that we can close it?
This is blocked on T414484: Upgrade DSE clusters to kubernetes 1.31 - since we need to use the ValidatingAdmissionPolicy for this.
Tue, Mar 24
I believe that this should be working now. Please let me know if it doesn't behave as you expect.
Mon, Mar 23
I believe that this is now fixed. Thanks @Jclark-ctr.
We are going to wait until the dust has settled slightly on T414484: Upgrade DSE clusters to kubernetes 1.31 before implementing this.
Technically, the changes should be orthogonal, but it is probably best to keep the number of concurrent changes down as much as possible.
We are planning to upgrade the dse-k8s-eqiad cluster on Thursday March 26th.
I'll send out the communications today and start preparing patches.
Fri, Mar 20
The first reimage failed because of a partman issue.
I'll put the host into insetup mode to carry out the reimage.
Thu, Mar 19
These have all now been added to the cluster.
The transient networking issues are described here: T419992: Alert if calico BGP sessions are not established on any kubernetes worker
