Thu, Aug 22

aborrero updated the task description for T230981: prometheus-openstack-exporter: add information about agents.
Thu, Aug 22, 9:37 AM · cloud-services-team (Kanban)
aborrero created T230981: prometheus-openstack-exporter: add information about agents.
Thu, Aug 22, 9:30 AM · cloud-services-team (Kanban)

Aug 18 2019

aborrero created T230674: shinken: issue with shinkengen.
Aug 18 2019, 8:56 AM · Shinken, cloud-services-team (Kanban)

Aug 15 2019

aborrero created T230537: keystone/horizon: character encoding issue in username.
Aug 15 2019, 10:17 AM · cloud-services-team (Kanban)
aborrero added a comment to T229871: relocate/reimage cloudvirt1023 with 10G interfaces.

I managed to bypass that issue by running

sudo wmf-auto-reimage-host --no-verify -p T229871 cloudvirt1023.mgmt.eqiad.wmnet

but it looks like manual intervention is required at the step below

Aug 15 2019, 8:25 AM · ops-eqiad, DC-Ops, Operations, Epic, cloud-services-team (Kanban)
aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

ack!

Aug 15 2019, 8:22 AM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes

Aug 14 2019

aborrero closed T230344: repool integration-slave-docker-1040 as Resolved.
Aug 14 2019, 2:44 PM · Continuous-Integration-Config, Release-Engineering-Team-TODO (201908)
aborrero reassigned T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade from aborrero to Bstorm.

@Bstorm I'm assigning this task to you since it's mostly you working on this right now (the maintain-kubeusers).

Aug 14 2019, 1:51 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero moved T220051: Puppet cleanup around OpenStack from Doing to Important on the cloud-services-team (Kanban) board.
Aug 14 2019, 1:47 PM · cloud-services-team (Kanban)

Aug 12 2019

aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

To put my mind at ease that we aren't going to end up limited to under 3000 tools, and because I want to understand a bit better, which one of these are you currently testing @aborrero? https://github.com/nginxinc/kubernetes-ingress/blob/master/docs/nginx-ingress-controllers.md
It seems the answers to many of these questions change depending on which nginx controller we are using.

Aug 12 2019, 3:46 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero added a comment to T230147: Toolforge: collect prometheus node exporter metrics from new k8s worker nodes.

That makes sense:

Aug 12 2019, 3:21 PM · cloud-services-team (Kanban)
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 12 2019, 12:00 PM · cloud-services-team (Kanban)

Aug 9 2019

aborrero updated subscribers of T230147: Toolforge: collect prometheus node exporter metrics from new k8s worker nodes.

This may be a good starting task for @Phamhi, apart from the ones we have already.

Aug 9 2019, 4:46 PM · cloud-services-team (Kanban)
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 9 2019, 4:33 PM · cloud-services-team (Kanban)
aborrero added a comment to T230126: LDAP: multiples accounts for Phamhi.

ok, ACK, thanks.

Aug 9 2019, 4:20 PM · Patch-For-Review, LDAP, cloud-services-team (Kanban)
aborrero updated the task description for T230126: LDAP: multiples accounts for Phamhi.
Aug 9 2019, 4:20 PM · Patch-For-Review, LDAP, cloud-services-team (Kanban)
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 9 2019, 4:15 PM · cloud-services-team (Kanban)
aborrero added a comment to T230126: LDAP: multiples accounts for Phamhi.

@bd808 the admin module changes have been merged. I don't see the changes in the LDAP directory. Let me know if I should do them myself.

Aug 9 2019, 1:45 PM · Patch-For-Review, LDAP, cloud-services-team (Kanban)
aborrero added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

That plan sounds good. Remember that you may need to manually restart ferm in some places because we pass FQDNs directly to the ferm config and thus the backing IP change for m5-master.eqiad.wmnet won't be detected as a puppet change (so puppet agent won't restart ferm)
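
A minimal sketch of how one could check whether a host needs that manual ferm restart, assuming the IP the FQDN resolved to before the switchover was recorded somewhere. The function name `ferm_restart_needed`, the `resolve` parameter and the example addresses are hypothetical and only for illustration; they are not part of ferm or puppet.

```python
import socket
from typing import Callable


def ferm_restart_needed(
    fqdn: str,
    last_known_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True when the FQDN now resolves to a different IPv4 address.

    Assumption: last_known_ip is the address the FQDN pointed to the last
    time ferm loaded its rules. How that value is stored is out of scope.
    """
    current_ip = resolve(fqdn)
    return current_ip != last_known_ip


if __name__ == "__main__":
    # Placeholder addresses from the documentation range, not real values.
    fake_dns = {"m5-master.eqiad.wmnet": "192.0.2.20"}
    changed = ferm_restart_needed(
        "m5-master.eqiad.wmnet", "192.0.2.10", fake_dns.__getitem__
    )
    print("restart ferm" if changed else "nothing to do")
```

The resolver is injectable so the check can be exercised without real DNS. By default it uses `socket.gethostbyname`, which only returns a single IPv4 address.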

Aug 9 2019, 1:10 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA

Aug 8 2019

aborrero triaged T230147: Toolforge: collect prometheus node exporter metrics from new k8s worker nodes as Normal priority.
Aug 8 2019, 5:37 PM · cloud-services-team (Kanban)
aborrero created T230147: Toolforge: collect prometheus node exporter metrics from new k8s worker nodes.
Aug 8 2019, 5:37 PM · cloud-services-team (Kanban)
aborrero triaged T230126: LDAP: multiples accounts for Phamhi as High priority.
Aug 8 2019, 12:26 PM · Patch-For-Review, LDAP, cloud-services-team (Kanban)
aborrero added a comment to T230126: LDAP: multiples accounts for Phamhi.

We are using the username from one account and the uid from another one. This is a bit confusing. I'm not sure if it's even possible/desirable to drop accounts (or whether we just deactivate them or whatever).

Aug 8 2019, 12:26 PM · Patch-For-Review, LDAP, cloud-services-team (Kanban)
aborrero renamed T230126: LDAP: multiples accounts for Phamhi from LDAP account: multiples account for Phamhi to LDAP: multiples accounts for Phamhi.
Aug 8 2019, 12:21 PM · Patch-For-Review, LDAP, cloud-services-team (Kanban)
aborrero created T230126: LDAP: multiples accounts for Phamhi.
Aug 8 2019, 12:20 PM · Patch-For-Review, LDAP, cloud-services-team (Kanban)

Aug 7 2019

aborrero added a comment to T149589: Puppet tab in Horizon unusably slow.

Yes, the puppet information in Horizon is extremely slow, especially the Prefix Puppet pages. That in particular is a known issue with no short-term fix :-(
35 seconds for me:

Aug 7 2019, 4:40 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, Puppet, Cloud-Services
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 7 2019, 3:55 PM · cloud-services-team (Kanban)
aborrero added a comment to T149589: Puppet tab in Horizon unusably slow.

Horizon really is unbearably slow, to the point of being almost unusable.

Aug 7 2019, 3:02 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, Puppet, Cloud-Services
aborrero added a comment to T229156: Degraded RAID on cloudvirt1018.

We had really high IO usage on this server the other day, along with a very high load average.

Aug 7 2019, 2:56 PM · cloud-services-team (Kanban), ops-eqiad, Operations
aborrero edited projects for T229156: Degraded RAID on cloudvirt1018, added: cloud-services-team (Kanban); removed cloud-services-team.
Aug 7 2019, 2:53 PM · cloud-services-team (Kanban), ops-eqiad, Operations
aborrero closed T229833: SRE: root access for Hieu Pham, SRE @ WMCS, a subtask of T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services, as Resolved.
Aug 7 2019, 11:29 AM · cloud-services-team (Kanban)
aborrero closed T229833: SRE: root access for Hieu Pham, SRE @ WMCS as Resolved.

This should be done. Anybody please reopen if there are any related issues.

Aug 7 2019, 11:28 AM · Operations, SRE-Access-Requests, cloud-services-team (Kanban)
aborrero added a comment to T229786: Create a service account to manage traffic.wmflabs.org. from acme-chief.

For the record, I just created a basic documentation page: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Service_accounts

Aug 7 2019, 9:57 AM · cloud-services-team (Kanban), Horizon, Acme-chief
aborrero triaged T230003: openstack: cleanup neutron user as Normal priority.
Aug 7 2019, 9:39 AM · cloud-services-team (Kanban)
aborrero updated the task description for T230003: openstack: cleanup neutron user.
Aug 7 2019, 9:39 AM · cloud-services-team (Kanban)
aborrero created T230003: openstack: cleanup neutron user.
Aug 7 2019, 9:36 AM · cloud-services-team (Kanban)

Aug 6 2019

aborrero awarded T229936: Examine and prioritize work to change labs namespace to cloud in gerrit as well as groups a Love token.
Aug 6 2019, 5:34 PM · cloud-services-team (Kanban)
aborrero closed T229786: Create a service account to manage traffic.wmflabs.org. from acme-chief as Resolved.

Here you go @Vgutierrez. Ping me on IRC if you would like to have some assistance when creating the Wikitech or Striker account.

Aug 6 2019, 4:45 PM · cloud-services-team (Kanban), Horizon, Acme-chief
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 6 2019, 4:01 PM · cloud-services-team (Kanban)
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 6 2019, 1:10 PM · cloud-services-team (Kanban)
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 6 2019, 1:06 PM · cloud-services-team (Kanban)
aborrero added a member for Security: Phamhi.
Aug 6 2019, 1:05 PM
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 6 2019, 12:53 PM · cloud-services-team (Kanban)
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 6 2019, 12:20 PM · cloud-services-team (Kanban)
aborrero updated the task description for T228942: Onboard Hieu Pham to Wikimedia Foundation as SRE in Cloud Services.
Aug 6 2019, 11:45 AM · cloud-services-team (Kanban)
aborrero added a member for acl*sre-team: Phamhi.
Aug 6 2019, 11:44 AM
aborrero added a member for Trusted-Contributors: Phamhi.
Aug 6 2019, 11:42 AM
aborrero added a member for WMF-NDA-Requests: Phamhi.
Aug 6 2019, 11:40 AM
aborrero updated subscribers of T224585: Migrate labmon* to Stretch (or Buster, better yet!).
Aug 6 2019, 11:36 AM · cloud-services-team (Kanban), Operations
aborrero updated subscribers of T229237: Set up tools-buster repository in aptly to allow toolforge servers to be installed on buster.
Aug 6 2019, 11:36 AM · cloud-services-team (Kanban)
aborrero triaged T229920: WMCS: migrate python2 scripts to python3 as Normal priority.
Aug 6 2019, 11:35 AM · Epic, cloud-services-team (Kanban)
aborrero created T229920: WMCS: migrate python2 scripts to python3.
Aug 6 2019, 11:35 AM · Epic, cloud-services-team (Kanban)
aborrero removed the point value for T229237: Set up tools-buster repository in aptly to allow toolforge servers to be installed on buster.
Aug 6 2019, 11:23 AM · cloud-services-team (Kanban)
aborrero moved T229237: Set up tools-buster repository in aptly to allow toolforge servers to be installed on buster from Inbox to Important on the cloud-services-team (Kanban) board.
Aug 6 2019, 11:23 AM · cloud-services-team (Kanban)

Aug 5 2019

aborrero triaged T229833: SRE: root access for Hieu Pham, SRE @ WMCS as High priority.
Aug 5 2019, 3:37 PM · Operations, SRE-Access-Requests, cloud-services-team (Kanban)
aborrero created T229833: SRE: root access for Hieu Pham, SRE @ WMCS.
Aug 5 2019, 3:37 PM · Operations, SRE-Access-Requests, cloud-services-team (Kanban)
aborrero triaged T229786: Create a service account to manage traffic.wmflabs.org. from acme-chief as Normal priority.

For the WMCS team meeting, needs discussion: how to better handle this. I'm not aware of the current workflow for creating a service account in openstack.

Aug 5 2019, 10:03 AM · cloud-services-team (Kanban), Horizon, Acme-chief
aborrero triaged T229787: Toolforge: sudden issues in both gridengine and k8s webservices as Normal priority.

I could normally operate both grid webservices and k8s webservices. There is no apparent reason for this issue.
Toolschecker didn't like that I managed the webservices on my own, so I had to stop them, and restart all the webservices again using toolscheckerctl restart.

Aug 5 2019, 10:00 AM · cloud-services-team (Kanban)
aborrero added a comment to T229787: Toolforge: sudden issues in both gridengine and k8s webservices.

At first sight both k8s and gridengine webservices look fine. Indeed, there was an issue with toolschecker-related webservices. I'm restarting them by hand to see what happens.

Aug 5 2019, 9:29 AM · cloud-services-team (Kanban)
aborrero created T229787: Toolforge: sudden issues in both gridengine and k8s webservices.
Aug 5 2019, 9:11 AM · cloud-services-team (Kanban)
aborrero closed T229783: Unable to create DNS zone traffic.wmflabs.org. in Horizon as Resolved.

This should be done now:

Aug 5 2019, 8:45 AM · cloud-services-team (Kanban), Acme-chief, Horizon
aborrero added a comment to T229783: Unable to create DNS zone traffic.wmflabs.org. in Horizon.

For the record: https://wikitech.wikimedia.org/wiki/Help:Horizon_FAQ#Can_I_create_a_new_DNS_domain/zone_for_my_project,_or_records_under_the_wmflabs.org_domain?

Aug 5 2019, 8:40 AM · cloud-services-team (Kanban), Acme-chief, Horizon
aborrero claimed T229783: Unable to create DNS zone traffic.wmflabs.org. in Horizon.
Aug 5 2019, 8:36 AM · cloud-services-team (Kanban), Acme-chief, Horizon

Aug 2 2019

aborrero triaged T229660: Horizon warning for instance deletion does not include instance name as Normal priority.
Aug 2 2019, 1:12 PM · Upstream, Horizon, cloud-services-team (Kanban)
aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

note to self, for next time I get into this again (hopefully next week):

  • I will need to re-deploy my testing tool into its own namespace to better test all this stuff
  • the ingress object should be namespaced: $tool-ingress
  • the nginx-ingress pod should not listen on 80/tcp, but on some other, unprivileged port
  • the nginx-ingress-svc object should be added to modules/toolforge/files/k8s/kubeadm-nginx-ingress.yaml
Aug 2 2019, 1:04 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

Ok, 2019-10-03 works for us. I will let my team know, since I won't be around.

Aug 2 2019, 12:45 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA
aborrero added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

Ok, so I'm proposing two dates:

Aug 2 2019, 12:42 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA
aborrero edited projects for T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC, added: cloud-services-team (Kanban); removed cloud-services-team.

I think we could either do this next week or wait until September, because the WMCS team will be traveling for Wikimania + offsite.

Aug 2 2019, 12:33 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA
aborrero moved T229587: Introduction to Wikimedia Cloud Services session from Inbox to Doing on the cloud-services-team (Kanban) board.
Aug 2 2019, 11:26 AM · cloud-services-team (Kanban), Wikimania-Hackathon-2019
aborrero triaged T229587: Introduction to Wikimedia Cloud Services session as Normal priority.
Aug 2 2019, 11:26 AM · cloud-services-team (Kanban), Wikimania-Hackathon-2019
aborrero added a subtask for T221394: Wikimania Hackathon Focus Area: Small Wiki Toolkits: T229587: Introduction to Wikimedia Cloud Services session.
Aug 2 2019, 10:50 AM · International-Developer-Events, Wikimania-Hackathon-2019
aborrero added a parent task for T229587: Introduction to Wikimedia Cloud Services session: T221394: Wikimania Hackathon Focus Area: Small Wiki Toolkits.
Aug 2 2019, 10:50 AM · cloud-services-team (Kanban), Wikimania-Hackathon-2019

Aug 1 2019

aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

I do have a question on it: What namespace does it run in, and do we need to whitelist the namespace in the docker registry restrictions or to construct a container in our internal registry? The latter might be the better option. What do you think @aborrero?

Aug 1 2019, 3:57 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

Ok, I think I have a working setup that may (or may not) be headed in the right direction. First iteration anyway.

Aug 1 2019, 1:30 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero created T229571: labpuppetmaster1001: puppet catalog error related to canary_host.
Aug 1 2019, 11:44 AM · cloud-services-team (Kanban)
aborrero triaged T229559: CloudVPS: codfw1dev: database backup for clouddb2001-dev.codfw.wmnet as Normal priority.
Aug 1 2019, 9:15 AM · Cloud-VPS, cloud-services-team (Kanban)
aborrero created T229559: CloudVPS: codfw1dev: database backup for clouddb2001-dev.codfw.wmnet.
Aug 1 2019, 9:14 AM · Cloud-VPS, cloud-services-team (Kanban)
aborrero triaged T229441: CloudVPS: codfw1dev: missing bits as Normal priority.
Aug 1 2019, 9:08 AM · cloud-services-team (Kanban)
aborrero removed a subtask for T217891: CloudVPS: rework codfw deployments: T228974: CloudVPS: codfw1dev: proper DNS setup.
Aug 1 2019, 9:08 AM · Cloud-VPS, cloud-services-team (Kanban)
aborrero added a subtask for T229441: CloudVPS: codfw1dev: missing bits: T228974: CloudVPS: codfw1dev: proper DNS setup.
Aug 1 2019, 9:08 AM · cloud-services-team (Kanban)
aborrero edited parent tasks for T228974: CloudVPS: codfw1dev: proper DNS setup, added: T229441: CloudVPS: codfw1dev: missing bits; removed: T217891: CloudVPS: rework codfw deployments.
Aug 1 2019, 9:08 AM · Cloud-VPS, cloud-services-team (Kanban)
aborrero removed a subtask for T217891: CloudVPS: rework codfw deployments: T228972: CloudVPS: codfw1dev: refresh glance images.
Aug 1 2019, 9:07 AM · Cloud-VPS, cloud-services-team (Kanban)
aborrero added a subtask for T229441: CloudVPS: codfw1dev: missing bits: T228972: CloudVPS: codfw1dev: refresh glance images.
Aug 1 2019, 9:07 AM · cloud-services-team (Kanban)
aborrero edited parent tasks for T228972: CloudVPS: codfw1dev: refresh glance images, added: T229441: CloudVPS: codfw1dev: missing bits; removed: T217891: CloudVPS: rework codfw deployments.
Aug 1 2019, 9:07 AM · cloud-services-team (Kanban)
aborrero added a comment to T226778: Install new PDUs in rows A/B (Top level tracking task).

On the dates you mention, the WMCS team will be barely available because of travel, Wikimania, offsites, etc. Since the racks are "easy" for us, this shouldn't be a blocker though. Our servers are mostly ready for the operations, and we will re-review them a day before to ensure nothing new (an important VM) was scheduled to run there.
So, ACK, good to go.

Aug 1 2019, 9:00 AM · DC-Ops, Operations, ops-eqiad

Jul 31 2019

aborrero added a subtask for T217891: CloudVPS: rework codfw deployments: T229441: CloudVPS: codfw1dev: missing bits.
Jul 31 2019, 4:18 PM · Cloud-VPS, cloud-services-team (Kanban)
aborrero added a parent task for T229441: CloudVPS: codfw1dev: missing bits: T217891: CloudVPS: rework codfw deployments.
Jul 31 2019, 4:18 PM · cloud-services-team (Kanban)
aborrero created T229441: CloudVPS: codfw1dev: missing bits.
Jul 31 2019, 4:17 PM · cloud-services-team (Kanban)
aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

A couple of questions for @Bstorm and @bd808:

Jul 31 2019, 2:55 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero added a comment to T215529: Puppetize/stand up a load balancer for K8s API servers.

Just noticed a thing about this setup. We don't preserve the source address of the original client, which may complicate things in case we need debugging.
I wonder if we should consider another proxy approach, like using an L3/L4 load balancer instead (NAT based).

Jul 31 2019, 1:09 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero added a comment to T229372: Remove swap partitions from VPS base images.

For the record, this is what we do for the new k8s in Toolforge: profile::toolforge::k8s::kubeadm::preflight_checks in file modules/profile/manifests/toolforge/k8s/kubeadm/preflight_checks.pp.

Jul 31 2019, 8:52 AM · cloud-services-team (Kanban), Cloud-VPS

Jul 30 2019

aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

ok, so nginx-ingress requires TLS to be present by default even if we don't use it. Using the default self-signed certs from the upstream example.

Jul 30 2019, 12:15 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

That was it. Now at least the pod starts with errors :-)

Jul 30 2019, 10:43 AM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

Serviceaccounts are wrong?

Jul 30 2019, 10:38 AM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero added a comment to T228500: Toolforge: evaluate ingress mechanism.

Ok, this is my latest attempt to deploy nginx-ingress:

Jul 30 2019, 10:38 AM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
aborrero updated subscribers of T229156: Degraded RAID on cloudvirt1018.
Jul 30 2019, 9:30 AM · cloud-services-team (Kanban), ops-eqiad, Operations
aborrero closed T229274: python files under modules/openstack/files/mitaka/admin_scripts fail pep8, a subtask of T220051: Puppet cleanup around OpenStack, as Resolved.
Jul 30 2019, 9:25 AM · cloud-services-team (Kanban)
aborrero closed T229274: python files under modules/openstack/files/mitaka/admin_scripts fail pep8 as Resolved.
Jul 30 2019, 9:25 AM · cloud-services-team (Kanban)
aborrero added a subtask for T220051: Puppet cleanup around OpenStack: T229274: python files under modules/openstack/files/mitaka/admin_scripts fail pep8.
Jul 30 2019, 9:24 AM · cloud-services-team (Kanban)
aborrero added a parent task for T229274: python files under modules/openstack/files/mitaka/admin_scripts fail pep8: T220051: Puppet cleanup around OpenStack.
Jul 30 2019, 9:24 AM · cloud-services-team (Kanban)