
Jun 24 2020

aborrero awarded T242455: Investigate options to improve CloudVPS backend database architecture a Party Time token.
Jun 24 2020, 9:17 AM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)

May 14 2020

JHedden added a comment to T252831: cloudvirt ceph nodes can't launch new VMs.

Full error after upgrading the qemu packages to match package versions:

May 14 2020, 9:53 PM · cloud-services-team (Kanban)

May 12 2020

JHedden added a comment to P11186 rabbitmqctl cluster status cloudcontrol1004.
 Starting RabbitMQ 3.7.8 on Erlang 21.2.6
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-05-12 20:20:57.663 [info] <0.261.0> 
 node           : rabbit@cloudcontrol1003
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.config
 cookie hash    : cnbu5mjmX6EnA/KuQQ1WwQ==
 log(s)         : /var/log/rabbitmq/rabbit@cloudcontrol1003.log
                : /var/log/rabbitmq/rabbit@cloudcontrol1003_upgrade.log
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@cloudcontrol1003
2020-05-12 20:20:58.737 [info] <0.269.0> Memory high watermark set to 25717 MiB (26966397747 bytes) of 64292 MiB (67415994368 bytes) total
2020-05-12 20:20:58.743 [info] <0.271.0> Enabling free disk space monitoring
2020-05-12 20:20:58.743 [info] <0.271.0> Disk free limit set to 50MB
2020-05-12 20:20:58.747 [info] <0.274.0> Limiting to approx 65436 file handles (58890 sockets)
2020-05-12 20:20:58.748 [info] <0.275.0> FHC read buffering:  OFF
2020-05-12 20:20:58.748 [info] <0.275.0> FHC write buffering: ON
2020-05-12 20:20:58.755 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-05-12 20:21:28.756 [warning] <0.261.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-05-12 20:21:28.756 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2020-05-12 20:21:58.757 [warning] <0.261.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-05-12 20:21:58.757 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2020-05-12 20:22:28.758 [warning] <0.261.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-05-12 20:22:28.759 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
May 12 2020, 8:23 PM
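
The startup log in the paste above shows RabbitMQ repeatedly waiting for Mnesia tables while the retry counter drops from 9 to 6. As a minimal sketch (not part of the original paste), the wait lines can be pulled out of such a log to see how far the retry counter has fallen. The regular expression and the function name `mnesia_wait_progress` are assumptions made for illustration, matched against the line format shown here rather than any documented RabbitMQ log schema.

```python
import re

# Pattern for the "Waiting for Mnesia tables" info lines as they appear in
# the paste; this is inferred from the sample, not from a formal log spec.
WAIT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[info\] \S+ "
    r"Waiting for Mnesia tables for (?P<timeout_ms>\d+) ms, "
    r"(?P<retries>\d+) retries left"
)


def mnesia_wait_progress(lines):
    """Return (timestamp, timeout_ms, retries_left) for each wait line."""
    progress = []
    for line in lines:
        match = WAIT_RE.match(line.strip())
        if match:
            progress.append(
                (match["ts"], int(match["timeout_ms"]), int(match["retries"]))
            )
    return progress


# Lines copied from the paste; the FHC line is included to show that
# unrelated log lines are skipped.
sample = [
    "2020-05-12 20:20:58.748 [info] <0.275.0> FHC read buffering:  OFF",
    "2020-05-12 20:20:58.755 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 9 retries left",
    "2020-05-12 20:22:28.759 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 6 retries left",
]

for ts, timeout_ms, retries in mnesia_wait_progress(sample):
    print(f"{ts}: waiting {timeout_ms} ms, {retries} retries left")
```

A steadily falling retry count with the same table list in every warning suggests the node is still waiting on its cluster peers rather than making progress.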
JHedden added a comment to P11186 rabbitmqctl cluster status cloudcontrol1004.

2020-05-12 20:16:04.072 [error] emulator Discarding message {'$gen_call',{<0.2146.0>,#Ref<0.2094199940.271056901.235963>},stat} from <0.2146.0> to <0.7921.0> in an old incarnation (2) of this node (1)

May 12 2020, 8:16 PM
JHedden created P11186 rabbitmqctl cluster status cloudcontrol1004.
May 12 2020, 8:15 PM
JHedden created P11184 cloudcontrol1005 neutron.
May 12 2020, 3:24 PM

May 11 2020

JHedden added a comment to T250846: (Need By: TBD) rack/setup/install cloudceph200[123]-dev.

These servers should mimic the network configuration we have in production:

May 11 2020, 3:32 PM · Cloud-Services, Operations, ops-codfw, DC-Ops
JHedden created P11178 sample template for mariadb config with galera.
May 11 2020, 1:53 PM

May 7 2020

JHedden closed T241884: Degraded RAID on cloudvirt1024, a subtask of T199125: rack/setup/install cloudvirt102[34], as Resolved.
May 7 2020, 9:23 PM · cloud-services-team (Kanban), ops-eqiad, Cloud-VPS, Operations
JHedden closed T241884: Degraded RAID on cloudvirt1024 as Resolved.

The virtual drive rebuild process was MUCH faster, the firmware upgrades completed successfully, and all drives have remained online.

May 7 2020, 9:23 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

Thanks! I've imported the RAID config, restored the boot order settings and will verify it's fixed.

May 7 2020, 3:00 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations

May 5 2020

JHedden added a comment to T250846: (Need By: TBD) rack/setup/install cloudceph200[123]-dev.

You can use this partman config: echo partman/standard.cfg partman/raid1-2dev.cfg

May 5 2020, 9:09 PM · Cloud-Services, Operations, ops-codfw, DC-Ops
JHedden closed T248923: Puppet failures with tlsproxy::envoy on cloud-vps as Resolved.
May 5 2020, 4:50 PM · cloud-services-team (Kanban), Cloud-VPS
JHedden assigned T249022: Track and list the services that Cloud Services that connect to internal network endpoints to Bstorm.

Check whether we have any NetFlow data from the network devices that would allow us to query source and destination traffic.

May 5 2020, 4:49 PM · cloud-services-team (Kanban)
JHedden closed T250428: Investigate nova metadata issues that appeared during the Rocky upgrade, a subtask of T248635: upgrade cloud-vps openstack to Openstack version 'Rocky', as Resolved.
May 5 2020, 4:44 PM · cloud-services-team (Kanban), Cloud-VPS
JHedden closed T250428: Investigate nova metadata issues that appeared during the Rocky upgrade as Resolved.
May 5 2020, 4:44 PM · cloud-services-team (Kanban), Cloud-VPS
JHedden assigned T250717: Paging setup for WMCS to Bstorm.
May 5 2020, 4:42 PM · observability, cloud-services-team (Kanban)
JHedden closed T94608: Create a simple checklist to follow for announcing / doing planned maintenance (on labs) as Resolved.
May 5 2020, 4:39 PM · Sustainability (Incident Followup), cloud-services-team (Kanban), Incident-20150331-LabsNFS-Filesystem-Switch, Labs-Q4-Sprint-1
JHedden assigned T175964: Set up mail rate limiting for tools-mail to aborrero.
May 5 2020, 4:38 PM · Sustainability (Incident Followup), cloud-services-team (Kanban), Toolforge
JHedden updated the task description for T175964: Set up mail rate limiting for tools-mail .
May 5 2020, 4:37 PM · Sustainability (Incident Followup), cloud-services-team (Kanban), Toolforge
JHedden moved T249035: Requests to production are sometimes timing out or giving empty response from Inbox to Watching on the cloud-services-team (Kanban) board.
May 5 2020, 4:35 PM · cloud-services-team (Kanban), Traffic, Cloud-Services, Operations
JHedden moved T251297: Refactor the toolforge::k8s::kubeadm* modules from Inbox to Doing on the cloud-services-team (Kanban) board.
May 5 2020, 4:34 PM · Toolforge, cloud-services-team (Kanban), PAWS
JHedden moved T251298: Design the resource limits, RBAC and PSP needed for the PAWS Kubernetes cluster from Inbox to Soon! on the cloud-services-team (Kanban) board.
May 5 2020, 4:33 PM · Patch-For-Review, cloud-services-team (Kanban), PAWS
JHedden assigned T251598: Clean up wb_terms related views to Bstorm.
May 5 2020, 4:33 PM · cloud-services-team (Kanban), Data-Services
JHedden moved T251598: Clean up wb_terms related views from Inbox to Soon! on the cloud-services-team (Kanban) board.
May 5 2020, 4:32 PM · cloud-services-team (Kanban), Data-Services
JHedden changed the status of T169286: labstore1005 A PCIe link training failure error on boot, a subtask of T169289: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues, from Open to Stalled.
May 5 2020, 4:30 PM · cloud-services-team (Kanban), Toolforge
JHedden changed the status of T169286: labstore1005 A PCIe link training failure error on boot from Open to Stalled.

Waiting for the next reboot of this host

May 5 2020, 4:30 PM · cloud-services-team (Hardware), DC-Ops, Operations
JHedden lowered the priority of T225621: improve sync process to wikitech-static from High to Medium.
May 5 2020, 4:28 PM · cloud-services-team (Kanban), MediaWiki-Export-or-Import, wikitech.wikimedia.org
JHedden moved T247432: Preserve the ability to make interwiki links to Toolforge tools under the host based routing scheme from Inbox to Soon! on the cloud-services-team (Kanban) board.
May 5 2020, 4:24 PM · Toolforge, cloud-services-team (Kanban)
JHedden moved T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 from Inbox to Watching on the cloud-services-team (Kanban) board.
May 5 2020, 4:22 PM · Upstream, cloud-services-team (Kanban), DBA
JHedden closed T251027: "signatures" tool has failed job pods on Kubernetes cluster as Resolved.
May 5 2020, 4:21 PM · cloud-services-team (Kanban), Toolforge, Tools
JHedden triaged T251295: Plan the integration of new WMCS naming schemes into PAWS as Medium priority.
May 5 2020, 4:18 PM · Toolforge, cloud-services-team (Kanban), PAWS
JHedden triaged T250863: Upgrade calico to a more recent version (current is 3.14.0) as Low priority.
May 5 2020, 4:16 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
JHedden moved T250863: Upgrade calico to a more recent version (current is 3.14.0) from Inbox to Soon! on the cloud-services-team (Kanban) board.
May 5 2020, 4:16 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
JHedden assigned T250867: Script the process of upgrading a node with kubeadm to 1.16.9 to Bstorm.
May 5 2020, 4:15 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
JHedden moved T250867: Script the process of upgrading a node with kubeadm to 1.16.9 from Inbox to Doing on the cloud-services-team (Kanban) board.
May 5 2020, 4:15 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
JHedden triaged T250874: Refresh external certs for the toolforge k8s cluster after the upgrade as Medium priority.
May 5 2020, 4:14 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
JHedden raised the priority of T250874: Refresh external certs for the toolforge k8s cluster after the upgrade from Medium to Needs Triage.
May 5 2020, 4:14 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
JHedden triaged T250874: Refresh external certs for the toolforge k8s cluster after the upgrade as Medium priority.
May 5 2020, 4:14 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
JHedden triaged T251065: Reading Wikidatadump at NFS share instance from the wcdo Cloud VPS project is too slow as Medium priority.
May 5 2020, 4:13 PM · VPS-Projects, Data-Services, cloud-services-team (Kanban)
JHedden triaged T251294: Upgrade cloud-vps control plane to Debian Buster as Medium priority.
May 5 2020, 4:10 PM · cloud-services-team (Kanban)
JHedden triaged T251558: multilevel domains in the 'maps' project don't use tls as High priority.
May 5 2020, 4:09 PM · cloud-services-team (Kanban), Cloud-Services
JHedden raised the priority of T251558: multilevel domains in the 'maps' project don't use tls from High to Needs Triage.
May 5 2020, 4:09 PM · cloud-services-team (Kanban), Cloud-Services
JHedden triaged T251558: multilevel domains in the 'maps' project don't use tls as High priority.
May 5 2020, 4:09 PM · cloud-services-team (Kanban), Cloud-Services
JHedden triaged T251628: Serve some default well known files for Toolforge webservices as Low priority.
May 5 2020, 4:08 PM · cloud-services-team (Kanban), Regression, Kubernetes, Toolforge
JHedden added a project to T251719: Quarry or the Analytics wikireplicas role creates lots of InnoDB Purge Lag: Quarry.
May 5 2020, 4:06 PM · Quarry, Data-Services, cloud-services-team (Kanban)
JHedden triaged T251719: Quarry or the Analytics wikireplicas role creates lots of InnoDB Purge Lag as Medium priority.
May 5 2020, 4:04 PM · Quarry, Data-Services, cloud-services-team (Kanban)
JHedden added a comment to T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken.

Added some documentation at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS

May 5 2020, 2:43 PM · cloud-services-team (Kanban), Cloud-VPS (Debian Jessie Deprecation)
JHedden closed T145703: Horizon loses credentials every day, a subtask of T239352: CloudVPS: horizon improvements, as Resolved.
May 5 2020, 1:59 PM · Horizon, Epic, cloud-services-team (Kanban)
JHedden closed T145703: Horizon loses credentials every day as Resolved.

Increasing the memcached cache size definitely helped.

May 5 2020, 1:58 PM · Security, cloud-services-team (Kanban), Horizon

May 1 2020

JHedden created P11112 buster apt gnupg2.
May 1 2020, 6:05 PM
JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

The RAID card took drive 9 offline again during the virtual disk rebuild. We cannot update the SATA drive firmware until all the devices are healthy, and since that is never the case we cannot apply the update.

May 1 2020, 1:50 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations

Apr 30 2020

JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

I've cleared the foreign configuration on drives 4 and 9 again. Once the rebuild completes, I'll attempt the SATA firmware and system BIOS upgrades.

Apr 30 2020, 10:16 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

I'd also like to point out that we have another system purchased in the same batch (T192119), and 6 more with the same configuration (T201352), that are running the same workloads without any problems.

Apr 30 2020, 10:11 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

I'm unable to upgrade the SATA firmware because of the failed drive state:

Apr 30 2020, 10:02 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

I used the BIOS versions in that last log message; the correct iDRAC versions and log output are below.

Apr 30 2020, 8:11 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations

Apr 28 2020

JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

Great! Thanks for the update. This host is currently out of service and can be taken offline anytime.

Apr 28 2020, 8:58 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations

Apr 27 2020

JHedden added a comment to T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken.

Added a Grafana dashboard for detailed instance metrics using the metricsinfra prometheus server: https://grafana-labs.wikimedia.org/d/000000590/metricsinfra-cloudvps-instance-details

Apr 27 2020, 10:52 PM · cloud-services-team (Kanban), Cloud-VPS (Debian Jessie Deprecation)
JHedden added a comment to T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken.

It looks like things will be noisy if we add the disk space alert rules right now.
https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1h&g0.expr=100%20-%20(node_filesystem_avail_bytes%7Bfstype%3D%22ext4%22%7D%2Fnode_filesystem_size_bytes%20*%20100)%20%3E%3D%2080&g0.tab=1

Apr 27 2020, 9:32 PM · cloud-services-team (Kanban), Cloud-VPS (Debian Jessie Deprecation)
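
The graph link in the comment above carries a URL-encoded PromQL expression. As a rough sketch (an illustration, not something from the task), the expression can be decoded with the Python standard library, and the same arithmetic can be applied to a single filesystem to see when it would cross the 80% mark. The helper names `graph_expression`, `used_percent` and `would_alert` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the comment above, split only for line length.
GRAPH_URL = (
    "https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1h&g0.expr="
    "100%20-%20(node_filesystem_avail_bytes%7Bfstype%3D%22ext4%22%7D%2F"
    "node_filesystem_size_bytes%20*%20100)%20%3E%3D%2080&g0.tab=1"
)


def graph_expression(url):
    """Decode the first graph panel's expression from a Prometheus graph URL."""
    query = parse_qs(urlsplit(url).query)
    return query["g0.expr"][0]


def used_percent(avail_bytes, size_bytes):
    """Mirror the expression: 100 - (avail / size * 100)."""
    return 100 - (avail_bytes / size_bytes * 100)


def would_alert(avail_bytes, size_bytes, threshold=80):
    """True when used space is at or above the threshold, as in '>= 80'."""
    return used_percent(avail_bytes, size_bytes) >= threshold


print(graph_expression(GRAPH_URL))
print(would_alert(avail_bytes=1, size_bytes=8))  # 87.5% used
print(would_alert(avail_bytes=1, size_bytes=4))  # 75.0% used
```

Every ext4 filesystem that is already at or above 80% used would fire as soon as such a rule is enabled, which is the noise the comment refers to.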
JHedden updated the task description for T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken.
Apr 27 2020, 7:11 PM · cloud-services-team (Kanban), Cloud-VPS (Debian Jessie Deprecation)
JHedden added a comment to T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken.

Email-based alert notifications are now enabled for the tools and cloudinfra projects.

Apr 27 2020, 6:23 PM · cloud-services-team (Kanban), Cloud-VPS (Debian Jessie Deprecation)

Apr 21 2020

JHedden renamed T250869: cloudvirt1004 failed RAID controller from cloudvirt1004 lost access to all drives to cloudvirt1004 failed RAID controller.
Apr 21 2020, 9:55 PM · cloud-services-team (Hardware)
JHedden added a comment to T250869: cloudvirt1004 failed RAID controller.

List of affected virtual machines

/etc/libvirt/qemu/i-00000406.xml:      <nova:name>toolsbeta-sgewebgrid-generic-0901</nova:name>
/etc/libvirt/qemu/i-00001507.xml:      <nova:name>incubator-mw</nova:name>
/etc/libvirt/qemu/i-00001d3c.xml:      <nova:name>tools-sgeexec-0901</nova:name>
/etc/libvirt/qemu/i-00002cf4.xml:      <nova:name>tools-sgewebgrid-lighttpd-0918</nova:name>
/etc/libvirt/qemu/i-00002cf5.xml:      <nova:name>tools-sgewebgrid-lighttpd-0919</nova:name>
/etc/libvirt/qemu/i-0000735c.xml:      <nova:name>media-streaming</nova:name>
/etc/libvirt/qemu/i-00007e14.xml:      <nova:name>wikilink-prod</nova:name>
/etc/libvirt/qemu/i-00007e7c.xml:      <nova:name>commonsarchive-mwtest</nova:name>
/etc/libvirt/qemu/i-000081a8.xml:      <nova:name>wikidata-autodesc</nova:name>
/etc/libvirt/qemu/i-000088a9.xml:      <nova:name>deployment-schema-2</nova:name>
/etc/libvirt/qemu/i-0000892a.xml:      <nova:name>discovery-testing-02</nova:name>
/etc/libvirt/qemu/i-00009819.xml:      <nova:name>visionoid</nova:name>
/etc/libvirt/qemu/i-0001027b.xml:      <nova:name>deployment-echostore01</nova:name>
/etc/libvirt/qemu/i-000105b2.xml:      <nova:name>Esther-outreachy-intern</nova:name>
/etc/libvirt/qemu/i-00012d1a.xml:      <nova:name>tools-k8s-worker-38</nova:name>
/etc/libvirt/qemu/i-00012d29.xml:      <nova:name>tools-k8s-worker-52</nova:name>
/etc/libvirt/qemu/i-00014212.xml:      <nova:name>canary1004-01</nova:name>
Apr 21 2020, 9:50 PM · cloud-services-team (Hardware)
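
The list above looks like grep output over the libvirt domain XML files, with one `<nova:name>` match per file (that reading is an assumption; the command used is not shown). A minimal sketch of turning such lines into a mapping from instance ID to VM name follows. The function name `affected_instances` is hypothetical.

```python
import re
from pathlib import PurePosixPath

# "<file path>: <nova:name>NAME</nova:name>", as in the list above.
NAME_RE = re.compile(
    r"^(?P<path>[^:]+):\s*<nova:name>(?P<name>[^<]+)</nova:name>\s*$"
)


def affected_instances(grep_lines):
    """Map libvirt instance IDs (file stems) to their nova display names."""
    instances = {}
    for line in grep_lines:
        match = NAME_RE.match(line)
        if match:
            instance_id = PurePosixPath(match["path"]).stem
            instances[instance_id] = match["name"]
    return instances


# Two lines copied from the list above.
sample = [
    "/etc/libvirt/qemu/i-00000406.xml:      <nova:name>toolsbeta-sgewebgrid-generic-0901</nova:name>",
    "/etc/libvirt/qemu/i-00014212.xml:      <nova:name>canary1004-01</nova:name>",
]

for instance_id, name in affected_instances(sample).items():
    print(instance_id, name)
```

A mapping like this makes it easier to group the affected VMs by project prefix when notifying their owners.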
JHedden created T250869: cloudvirt1004 failed RAID controller.
Apr 21 2020, 9:48 PM · cloud-services-team (Hardware)
JHedden added a comment to T94608: Create a simple checklist to follow for announcing / doing planned maintenance (on labs).

We should use the WMCS SRE run-book enhancement proposal for this

Apr 21 2020, 4:57 PM · Sustainability (Incident Followup), cloud-services-team (Kanban), Incident-20150331-LabsNFS-Filesystem-Switch, Labs-Q4-Sprint-1
JHedden lowered the priority of T94608: Create a simple checklist to follow for announcing / doing planned maintenance (on labs) from High to Medium.
Apr 21 2020, 4:56 PM · Sustainability (Incident Followup), cloud-services-team (Kanban), Incident-20150331-LabsNFS-Filesystem-Switch, Labs-Q4-Sprint-1
JHedden added a comment to T143639: Write a simple script that handles failovering proxies (or move behind HA proxy!).

Using a service virtual IP could be an option here. More notes on that are at https://wikitech.wikimedia.org/wiki/User:Jhedden/notes/keepalived

Apr 21 2020, 4:51 PM · Sustainability (Incident Followup), cloud-services-team (Kanban), Cloud-Services
JHedden added a subtask for T249237: Fix Cloud VPS and Toolforge mail servers to work with the modern internet: T175964: Set up mail rate limiting for tools-mail .
Apr 21 2020, 4:47 PM · Goal, cloud-services-team (Kanban), Toolforge, Cloud-VPS, Epic
JHedden added a parent task for T175964: Set up mail rate limiting for tools-mail : T249237: Fix Cloud VPS and Toolforge mail servers to work with the modern internet.
Apr 21 2020, 4:47 PM · Sustainability (Incident Followup), cloud-services-team (Kanban), Toolforge
JHedden triaged T175964: Set up mail rate limiting for tools-mail as Medium priority.
Apr 21 2020, 4:47 PM · Sustainability (Incident Followup), cloud-services-team (Kanban), Toolforge
JHedden changed the status of T216733: cloudvirts: ensure we're running the latest raid controller firmware, a subtask of T216218: Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure, from Open to Stalled.
Apr 21 2020, 4:45 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden changed the status of T216733: cloudvirts: ensure we're running the latest raid controller firmware from Open to Stalled.

Waiting on Ceph storage which will allow easier hypervisor reboots

Apr 21 2020, 4:45 PM · Sustainability (Incident Followup), cloud-services-team (Kanban), Cloud-VPS
JHedden merged T234830: CloudVPS: m5-master databases for openstack may require re-encoding into T242455: Investigate options to improve CloudVPS backend database architecture .
Apr 21 2020, 4:42 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden merged task T234830: CloudVPS: m5-master databases for openstack may require re-encoding into T242455: Investigate options to improve CloudVPS backend database architecture .
Apr 21 2020, 4:42 PM · Wikimedia-Incident, cloud-services-team (Kanban), Cloud-VPS
JHedden triaged T249035: Requests to production are sometimes timing out or giving empty response as Medium priority.
Apr 21 2020, 4:40 PM · cloud-services-team (Kanban), Traffic, Cloud-Services, Operations
JHedden moved T226537: Follow up on past WMCS #wikimedia-incident tasks from Clinic Duty to Epics on the cloud-services-team (Kanban) board.
Apr 21 2020, 4:36 PM · cloud-services-team (Kanban), Epic
JHedden triaged T218713: Clean up non-proxy entries in wmflabs.org zone as Low priority.
Apr 21 2020, 4:33 PM · cloud-services-team (Kanban), Cloud-VPS
JHedden moved T218713: Clean up non-proxy entries in wmflabs.org zone from Clinic Duty to Inbox on the cloud-services-team (Kanban) board.
Apr 21 2020, 4:33 PM · cloud-services-team (Kanban), Cloud-VPS
JHedden triaged T250087: Designate tooz coordinator is a spof as Medium priority.
Apr 21 2020, 4:31 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden triaged T249941: refactor openstack puppet code to use lists of servers as Medium priority.
Apr 21 2020, 4:30 PM · cloud-services-team (Kanban)
JHedden triaged T249774: Grant "Cloud admin" rights to Reedy as Medium priority.
Apr 21 2020, 4:30 PM · User-bd808, cloud-services-team (Kanban), Cloud-VPS
JHedden triaged T249636: Audit Toolforge account approvals between 2020-03-30 and 2020-04-07 to ensure that database and LDAP state agree as High priority.
Apr 21 2020, 4:30 PM · cloud-services-team (Kanban), Toolforge
JHedden moved T249774: Grant "Cloud admin" rights to Reedy from Soon! to Doing on the cloud-services-team (Kanban) board.
Apr 21 2020, 4:28 PM · User-bd808, cloud-services-team (Kanban), Cloud-VPS
JHedden triaged T250098: Audit usage of *.tools.wmflabs.org GlobalSign TLS certificate and migrate any usage to LE as High priority.
Apr 21 2020, 4:28 PM · cloud-services-team (Kanban), Toolforge
JHedden triaged T249114: E-mails from noreply@pypi.org to tools.pywikibot@tools.wmflabs.org are not forwarded to certain recipients due to SPF as Medium priority.
Apr 21 2020, 4:27 PM · Pywikibot, Toolforge, cloud-services-team (Kanban)
JHedden removed a project from T188449: Get wikitech search logs from hadoop for documentation research: cloud-services-team (Kanban).
Apr 21 2020, 4:26 PM · wikitech.wikimedia.org, Documentation
JHedden triaged T247336: Do we still need /data/project/toolserver-home-archive/archive-2014-06-05.tar.xz as Medium priority.
Apr 21 2020, 4:24 PM · Data-Services, cloud-services-team (Kanban)
JHedden assigned T248923: Puppet failures with tlsproxy::envoy on cloud-vps to Andrew.
Apr 21 2020, 4:21 PM · cloud-services-team (Kanban), Cloud-VPS
JHedden moved T249079: Naming collision between "toolforge" Python packages from Inbox to Doing on the cloud-services-team (Kanban) board.
Apr 21 2020, 4:19 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden raised the priority of T249114: E-mails from noreply@pypi.org to tools.pywikibot@tools.wmflabs.org are not forwarded to certain recipients due to SPF from Medium to Needs Triage.
Apr 21 2020, 4:19 PM · Pywikibot, Toolforge, cloud-services-team (Kanban)
JHedden triaged T249114: E-mails from noreply@pypi.org to tools.pywikibot@tools.wmflabs.org are not forwarded to certain recipients due to SPF as Medium priority.
Apr 21 2020, 4:19 PM · Pywikibot, Toolforge, cloud-services-team (Kanban)
JHedden triaged T249787: Create Docker image for Toolforge that is purpose built to run pywikibot scripts as Medium priority.
Apr 21 2020, 4:18 PM · Patch-For-Review, Pywikibot, cloud-services-team (Kanban), Toolforge
JHedden triaged T250428: Investigate nova metadata issues that appeared during the Rocky upgrade as Medium priority.

In the past, the agents were going offline due to missed RabbitMQ heartbeat messages. Consider creating a Prometheus exporter to monitor the OpenStack nova and neutron agents and watch for up/down state.

Apr 21 2020, 4:17 PM · cloud-services-team (Kanban), Cloud-VPS
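
The exporter idea in the comment above could take many shapes. The sketch below covers only the output side: it renders agent up/down state in the Prometheus text exposition format. The metric name `openstack_agent_up`, the label set and the sample hosts are all hypothetical. A real exporter would fetch agent state from the OpenStack APIs and serve this text over HTTP, neither of which is shown here.

```python
def render_agent_metrics(agents):
    """Render agent liveness as Prometheus text-format gauge samples.

    Each agent is a dict with 'service', 'binary', 'host' and 'alive' keys.
    Label values are assumed not to need escaping in this sketch.
    """
    lines = [
        "# HELP openstack_agent_up Whether the agent is reported alive (1) or not (0).",
        "# TYPE openstack_agent_up gauge",
    ]
    ordered = sorted(agents, key=lambda a: (a["service"], a["binary"], a["host"]))
    for agent in ordered:
        labels = ",".join(
            f'{key}="{agent[key]}"' for key in ("service", "binary", "host")
        )
        value = 1 if agent["alive"] else 0
        lines.append(f"openstack_agent_up{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


# Made-up sample data; these hosts do not come from the task.
sample_agents = [
    {"service": "nova", "binary": "nova-compute",
     "host": "hypervisor-a.example", "alive": True},
    {"service": "neutron", "binary": "neutron-l3-agent",
     "host": "net-a.example", "alive": False},
]

print(render_agent_metrics(sample_agents), end="")
```

With a gauge like this, an alert rule on `openstack_agent_up == 0` would flag an agent that has stopped reporting, for example after missed heartbeats.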
JHedden triaged T250457: Consider adding support for more optional ingress annotations to `webservice` for the Kubernetes backend as Low priority.
Apr 21 2020, 4:10 PM · cloud-services-team (Kanban), Toolforge
JHedden added a comment to T250787: remove cloud "dev" hosts from Icinga?.

The hosts in codfw are used for platform testing and staging. It's useful to have these in Icinga, but we don't need email notifications for them, and they don't need to appear on the alerts sub-page dashboard. Potentially we can add a host and service downtime for a _very_ long time.

Apr 21 2020, 4:09 PM · cloud-services-team (Kanban), Operations, observability
JHedden triaged T250787: remove cloud "dev" hosts from Icinga? as Medium priority.
Apr 21 2020, 4:06 PM · cloud-services-team (Kanban), Operations, observability

Apr 20 2020

JHedden updated the task description for T236606: Rebuild Toolforge elasticsearch cluster with Stretch or Buster.
Apr 20 2020, 1:29 PM · cloud-services-team (Kanban), Toolforge, Cloud-VPS (Debian Jessie Deprecation)
JHedden closed T247530: refill-api tool elasticsearch migration, a subtask of T236606: Rebuild Toolforge elasticsearch cluster with Stretch or Buster, as Resolved.
Apr 20 2020, 1:27 PM · cloud-services-team (Kanban), Toolforge, Cloud-VPS (Debian Jessie Deprecation)
JHedden closed T247530: refill-api tool elasticsearch migration as Resolved.

The elasticsearch version 5 cluster is being shut down today. Your tool account credentials have been migrated to the new cluster, which can be reached at http://elasticsearch.svc.tools.eqiad1.wikimedia.cloud

Apr 20 2020, 1:27 PM · Toolforge, Cloud-VPS (Debian Jessie Deprecation)
JHedden closed T247527: strephit tool elasticsearch migration, a subtask of T236606: Rebuild Toolforge elasticsearch cluster with Stretch or Buster, as Resolved.
Apr 20 2020, 1:26 PM · cloud-services-team (Kanban), Toolforge, Cloud-VPS (Debian Jessie Deprecation)