JHedden (Jason Hedden)
User

User Details

User Since
May 28 2019, 6:09 PM (35 w, 17 h)
Availability
Available
LDAP User
Jhedden
MediaWiki User
JHedden (WMF)

Recent Activity

Fri, Jan 24

JHedden edited P10264 cloudcontrol stretch python3 upgrades.
Fri, Jan 24, 7:26 PM
JHedden created P10264 cloudcontrol stretch python3 upgrades.
Fri, Jan 24, 7:24 PM
JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

This host has been depooled from production and has no running workloads.

Fri, Jan 24, 4:50 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

During the next rebuild the RAID array kicked out drive 4. Either we have three bad drives (2, 4, and 9) or the RAID adapter is bad. I'll send the TSR for this host to @Jclark-ctr.

Fri, Jan 24, 4:45 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden merged T243605: Degraded RAID on cloudvirt1024 into T241884: Degraded RAID on cloudvirt1024.
Fri, Jan 24, 3:33 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden merged task T243605: Degraded RAID on cloudvirt1024 into T241884: Degraded RAID on cloudvirt1024.
Fri, Jan 24, 3:33 PM · ops-eqiad, Operations
JHedden reopened T241884: Degraded RAID on cloudvirt1024, a subtask of T199125: rack/setup/install cloudvirt102[34], as Open.
Fri, Jan 24, 2:42 PM · cloud-services-team (Kanban), ops-eqiad, Cloud-VPS, Operations
JHedden reopened T241884: Degraded RAID on cloudvirt1024 as "Open".

Drive 9 reported a lot of errors while rebuilding the RAID array, and now drives 2, 4, and 9 are missing from the RAID set again. I'll leave drive 9 out of the pool and test rebuilding the array with only 2 and 4.

Fri, Jan 24, 2:42 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations

Thu, Jan 23

JHedden moved T222950: (OoW) cloudvirt1006 - RAID battery failed from Backlog to Hardware faults on the cloud-services-team (Hardware) board.
Thu, Jan 23, 10:00 PM · cloud-services-team (Hardware), User-jbond, ops-eqiad, Operations
JHedden closed T243555: Degraded RAID on cloudvirt1024 as Invalid.

This is not a failure; the drive is currently rebuilding (see task T241884).

Thu, Jan 23, 9:59 PM · ops-eqiad, Operations
JHedden edited projects for T222950: (OoW) cloudvirt1006 - RAID battery failed, added: cloud-services-team (Hardware); removed cloud-services-team.
Thu, Jan 23, 9:43 PM · cloud-services-team (Hardware), User-jbond, ops-eqiad, Operations
JHedden closed T241884: Degraded RAID on cloudvirt1024 as Resolved.

Drives 2 and 4 had a foreign configuration. I've cleared the configuration and reassigned them as global host spares.

Thu, Jan 23, 9:42 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden closed T241884: Degraded RAID on cloudvirt1024, a subtask of T199125: rack/setup/install cloudvirt102[34], as Resolved.
Thu, Jan 23, 9:42 PM · cloud-services-team (Kanban), ops-eqiad, Cloud-VPS, Operations
JHedden moved T243536: cloudvirt1022 memory errors causing host to crash from Backlog to Hardware faults on the cloud-services-team (Hardware) board.
Thu, Jan 23, 6:38 PM · DC-Ops, ops-eqiad, Operations, cloud-services-team (Hardware)
JHedden created T243536: cloudvirt1022 memory errors causing host to crash.
Thu, Jan 23, 6:37 PM · DC-Ops, ops-eqiad, Operations, cloud-services-team (Hardware)
JHedden created P10250 wmcs-cold-migrate.
Thu, Jan 23, 3:38 PM

Wed, Jan 22

JHedden moved T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory from Backlog to Hardware faults on the cloud-services-team (Hardware) board.
Wed, Jan 22, 4:34 PM · cloud-services-team (Hardware), Operations, ops-eqiad, DC-Ops, User-Zppix
JHedden edited projects for T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory, added: cloud-services-team (Hardware); removed cloud-services-team (Kanban).
Wed, Jan 22, 4:34 PM · cloud-services-team (Hardware), Operations, ops-eqiad, DC-Ops, User-Zppix

Tue, Jan 21

JHedden added a comment to T243355: puppet panel: Can't add new prefixes.

Labweb logs show 2020-01-21 22:51:43.638032 Forbidden (CSRF token missing or incorrect.): /project/prefixpuppet/

Tue, Jan 21, 10:57 PM · Horizon
JHedden triaged T243327: Test virtual machine migrations using Ceph based storage as High priority.
Tue, Jan 21, 10:29 PM · Epic, cloud-services-team (Kanban)
JHedden created P10238 cloudvirt1014 vms.
Tue, Jan 21, 8:38 PM
JHedden created T243327: Test virtual machine migrations using Ceph based storage.
Tue, Jan 21, 7:26 PM · Epic, cloud-services-team (Kanban)

Wed, Jan 15

JHedden closed T242460: Fix cloudmetrics icinga prometheus check as Resolved.

I updated Prometheus to bind only to the loopback interface and configured Apache to proxy requests for the server's FQDN through to Prometheus. These changes bring the cloudmetrics configuration in line with production and clear up the Icinga errors when checking this service.

Wed, Jan 15, 9:52 PM · cloud-services-team (Kanban)
JHedden added a comment to T242817: m5 ran out of connections after openstack upgrade to "Pike".

The neutron APIs look good too.

Wed, Jan 15, 9:48 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden moved T242460: Fix cloudmetrics icinga prometheus check from Inbox to Doing on the cloud-services-team (Kanban) board.
Wed, Jan 15, 9:22 PM · cloud-services-team (Kanban)
JHedden claimed T242460: Fix cloudmetrics icinga prometheus check.
Wed, Jan 15, 8:45 PM · cloud-services-team (Kanban)
JHedden closed T242893: puppetmaster broken in the cloudstore project as Resolved.
Wed, Jan 15, 6:20 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden added a comment to T242893: puppetmaster broken in the cloudstore project.

/usr/local/bin/git-sync-upstream was having a hard time with the git repository in /var/lib/git/operations/puppet and consuming all available memory on the VM. I moved the git repo to /var/lib/git/operations/puppet-save-from-gtirloni and pulled down a fresh copy of the repo. I also confirmed that the puppet agent is working on all the hosts in the cloudstore project now.

Wed, Jan 15, 6:20 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden moved T242893: puppetmaster broken in the cloudstore project from Inbox to Doing on the cloud-services-team (Kanban) board.
Wed, Jan 15, 5:36 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden reopened T242893: puppetmaster broken in the cloudstore project as "Open".

Reopening to track work on fixing the puppet master configuration.

Wed, Jan 15, 5:36 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden closed T242893: puppetmaster broken in the cloudstore project as Resolved.
Wed, Jan 15, 5:33 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden added a comment to T242893: puppetmaster broken in the cloudstore project.

Hrm. Now I cannot seem to ssh to it. :)

Wed, Jan 15, 5:25 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden claimed T242893: puppetmaster broken in the cloudstore project.
Wed, Jan 15, 5:24 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden closed T90364: Test Ceph for instance storage, a subtask of T207590: Research CephFS as a replacement for NFS, as Resolved.
Wed, Jan 15, 4:01 PM · Data-Services, cloud-services-team (Kanban)
JHedden closed T90364: Test Ceph for instance storage, a subtask of T216218: Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure, as Resolved.
Wed, Jan 15, 4:01 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden closed T90364: Test Ceph for instance storage as Resolved.
Wed, Jan 15, 4:01 PM · Epic, Goal, Wikimedia-Incident, cloud-services-team (Kanban), Cloud-Services
JHedden closed T90364: Test Ceph for instance storage, a subtask of T225320: Ceph Proof of Concept Build and Testing, as Resolved.
Wed, Jan 15, 4:01 PM · Epic, cloud-services-team (Kanban)
JHedden closed T90364: Test Ceph for instance storage, a subtask of T220020: Action items and work for retro 20190403, as Resolved.
Wed, Jan 15, 4:01 PM · Epic, cloud-services-team (Kanban)
JHedden updated the task description for T240718: Perform failover tests on Ceph storage cluster.
Wed, Jan 15, 3:57 PM · Epic, cloud-services-team (Kanban)
JHedden triaged T240715: Configure prometheus monitoring for Ceph as Medium priority.
Wed, Jan 15, 3:57 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden closed T240715: Configure prometheus monitoring for Ceph, a subtask of T225320: Ceph Proof of Concept Build and Testing, as Resolved.
Wed, Jan 15, 3:57 PM · Epic, cloud-services-team (Kanban)
JHedden closed T240715: Configure prometheus monitoring for Ceph as Resolved.
Wed, Jan 15, 3:57 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)

Tue, Jan 14

JHedden closed T240871: CloudVPS: nova messing with instance disks, a subtask of T240851: CloudVPS: stretch base images fails to boot, as Resolved.
Tue, Jan 14, 10:36 PM · cloud-services-team (Kanban)
JHedden closed T240871: CloudVPS: nova messing with instance disks as Resolved.

Cleaned up all the stale entries with virsh undefine <domain id>

Tue, Jan 14, 10:36 PM · cloud-services-team (Kanban)
JHedden moved T240871: CloudVPS: nova messing with instance disks from Important to Doing on the cloud-services-team (Kanban) board.
Tue, Jan 14, 5:47 PM · cloud-services-team (Kanban)
JHedden claimed T240871: CloudVPS: nova messing with instance disks.
Tue, Jan 14, 5:47 PM · cloud-services-team (Kanban)
JHedden added a comment to T240871: CloudVPS: nova messing with instance disks.

This happens when a VM is migrated with the wmcs cold migration script without being undefined in virsh.

Tue, Jan 14, 5:12 PM · cloud-services-team (Kanban)

Mon, Jan 13

JHedden updated the task description for T225320: Ceph Proof of Concept Build and Testing.
Mon, Jan 13, 3:49 PM · Epic, cloud-services-team (Kanban)

Fri, Jan 10

JHedden added a comment to T242472: Degraded RAID on cloudvirt1013.

Multiple hardware errors have been reported for this host; see T241313.

Fri, Jan 10, 10:44 PM · cloud-services-team (Hardware), ops-eqiad, Operations
JHedden triaged T242472: Degraded RAID on cloudvirt1013 as High priority.
Fri, Jan 10, 10:27 PM · cloud-services-team (Hardware), ops-eqiad, Operations
JHedden moved T242472: Degraded RAID on cloudvirt1013 from Backlog to Hardware faults on the cloud-services-team (Hardware) board.
Fri, Jan 10, 10:26 PM · cloud-services-team (Hardware), ops-eqiad, Operations
JHedden added a project to T242472: Degraded RAID on cloudvirt1013: cloud-services-team (Hardware).
Fri, Jan 10, 10:26 PM · cloud-services-team (Hardware), ops-eqiad, Operations
JHedden triaged T242462: cloudcontrol200[13]-dev linux bridge agent errors as Medium priority.
Fri, Jan 10, 8:47 PM · cloud-services-team (Kanban)
JHedden updated the task description for T242462: cloudcontrol200[13]-dev linux bridge agent errors.
Fri, Jan 10, 8:47 PM · cloud-services-team (Kanban)
JHedden created T242462: cloudcontrol200[13]-dev linux bridge agent errors.
Fri, Jan 10, 8:43 PM · cloud-services-team (Kanban)
JHedden triaged T242460: Fix cloudmetrics icinga prometheus check as Low priority.
Fri, Jan 10, 8:25 PM · cloud-services-team (Kanban)
JHedden created T242460: Fix cloudmetrics icinga prometheus check.
Fri, Jan 10, 8:25 PM · cloud-services-team (Kanban)
bd808 awarded T242455: Investigate options to improve CloudVPS backend database architecture a Love token.
Fri, Jan 10, 8:12 PM · Cloud-VPS, cloud-services-team (Kanban)
JHedden created T242455: Investigate options to improve CloudVPS backend database architecture.
Fri, Jan 10, 8:00 PM · Cloud-VPS, cloud-services-team (Kanban)

Thu, Jan 9

JHedden updated the task description for T240715: Configure prometheus monitoring for Ceph.
Thu, Jan 9, 11:16 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)

Wed, Jan 8

JHedden updated the task description for T240715: Configure prometheus monitoring for Ceph.
Wed, Jan 8, 10:39 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden updated the task description for T240715: Configure prometheus monitoring for Ceph.
Wed, Jan 8, 10:36 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)

Tue, Jan 7

JHedden moved T240715: Configure prometheus monitoring for Ceph from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, Jan 7, 10:24 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden closed T241635: Request creation of commons-corruption-checker VPS project as Resolved.

Hi @TheSandDoctor, your CloudVPS project has been created.

Tue, Jan 7, 9:23 PM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
JHedden added a comment to T242088: CloudVPS: wrong operation reject based on quota limit.

Try it with OS_PROJECT_ID=testlabs

Tue, Jan 7, 2:09 PM · cloud-services-team (Kanban)

Mon, Jan 6

JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

And slot 4!

Mon, Jan 6, 3:05 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden added a comment to T241884: Degraded RAID on cloudvirt1024.

Looks like we're missing drives in slots 2 and 9 on this host.

Mon, Jan 6, 3:05 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden merged T241881: Degraded RAID on cloudvirt1024 into T241884: Degraded RAID on cloudvirt1024.
Mon, Jan 6, 3:04 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
JHedden merged task T241881: Degraded RAID on cloudvirt1024 into T241884: Degraded RAID on cloudvirt1024.
Mon, Jan 6, 3:04 PM · ops-eqiad, Operations

Thu, Jan 2

JHedden added a comment to T228238: Remove nfsiostat collector for diamond if possible, which may be broken on tools workers.

I enabled the node exporter mountstats plugin to help diagnose the "slowness" our users have been reporting on tools-sgebastion-07.tools.eqiad.wmflabs. Being able to line up multiple system metrics next to each other with a historical timeline can help identify usage patterns and resource contention.

Thu, Jan 2, 10:33 PM · cloud-services-team (Kanban)

Dec 23 2019

JHedden closed T240722: Fix Icingia disk space check on cloudcephosd100[1-3] servers, a subtask of T225320: Ceph Proof of Concept Build and Testing, as Resolved.
Dec 23 2019, 8:52 PM · Epic, cloud-services-team (Kanban)
JHedden closed T240722: Fix Icingia disk space check on cloudcephosd100[1-3] servers as Resolved.
Dec 23 2019, 8:52 PM · Epic, cloud-services-team (Kanban)
JHedden closed T240965: Enable private network interface on Ceph OSD and MON hosts, a subtask of T225320: Ceph Proof of Concept Build and Testing, as Resolved.
Dec 23 2019, 4:03 PM · Epic, cloud-services-team (Kanban)
JHedden closed T240965: Enable private network interface on Ceph OSD and MON hosts as Resolved.
Dec 23 2019, 4:03 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)

Dec 20 2019

JHedden committed rLPRI192e199fa23d: Refactor ceph keyring data (authored by JHedden).
Refactor ceph keyring data
Dec 20 2019, 10:50 PM

Dec 18 2019

JHedden added a comment to T240965: Enable private network interface on Ceph OSD and MON hosts.

Thanks for the review. I had the wrong subnet here, but I configured the hosts on the correct public 208.80.154.128/26 subnet.

Dec 18 2019, 6:27 AM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden updated the task description for T240965: Enable private network interface on Ceph OSD and MON hosts.
Dec 18 2019, 6:26 AM · Patch-For-Review, Epic, cloud-services-team (Kanban)

Dec 17 2019

JHedden updated the task description for T240715: Configure prometheus monitoring for Ceph.
Dec 17 2019, 10:18 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden updated the task description for T240715: Configure prometheus monitoring for Ceph.
Dec 17 2019, 10:15 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden moved T240965: Enable private network interface on Ceph OSD and MON hosts from Inbox to Doing on the cloud-services-team (Kanban) board.
Dec 17 2019, 4:38 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden claimed T240965: Enable private network interface on Ceph OSD and MON hosts.
Dec 17 2019, 4:38 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden updated subscribers of T240965: Enable private network interface on Ceph OSD and MON hosts.

@ayounsi and @Bstorm could you please review the vlan and subnet for the private interface? I think it's the right one but would like confirmation.

Dec 17 2019, 4:37 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden created T240965: Enable private network interface on Ceph OSD and MON hosts.
Dec 17 2019, 4:35 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden closed T239918: Deploy Ceph Nautilus on Buster as Resolved.
Dec 17 2019, 3:22 PM · Epic, cloud-services-team (Kanban)
JHedden closed T239918: Deploy Ceph Nautilus on Buster, a subtask of T225320: Ceph Proof of Concept Build and Testing, as Resolved.
Dec 17 2019, 3:22 PM · Epic, cloud-services-team (Kanban)
JHedden moved T239918: Deploy Ceph Nautilus on Buster from Inbox to Doing on the cloud-services-team (Kanban) board.
Dec 17 2019, 3:22 PM · Epic, cloud-services-team (Kanban)
JHedden moved T240718: Perform failover tests on Ceph storage cluster from Inbox to Doing on the cloud-services-team (Kanban) board.
Dec 17 2019, 3:22 PM · Epic, cloud-services-team (Kanban)

Dec 16 2019

JHedden added a comment to T240851: CloudVPS: stretch base images fails to boot.

We could also work around this with another hack: disabling SPICE and adding just the ttyS1 serial interface in the nova libvirt guest config process.

Dec 16 2019, 10:36 PM · cloud-services-team (Kanban)
JHedden added a comment to T240851: CloudVPS: stretch base images fails to boot.

This is the commit that broke console output on the stretch hosts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/554151

Dec 16 2019, 6:54 PM · cloud-services-team (Kanban)

Dec 13 2019

JHedden added a comment to T240715: Configure prometheus monitoring for Ceph.

Grafana dashboards that work with the Ceph Prometheus plugin can be found at https://github.com/ceph/ceph/tree/master/monitoring/grafana

Dec 13 2019, 10:24 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden created T240722: Fix Icingia disk space check on cloudcephosd100[1-3] servers.
Dec 13 2019, 9:17 PM · Epic, cloud-services-team (Kanban)
JHedden created T240718: Perform failover tests on Ceph storage cluster.
Dec 13 2019, 8:15 PM · Epic, cloud-services-team (Kanban)
JHedden created T240715: Configure prometheus monitoring for Ceph.
Dec 13 2019, 8:11 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
JHedden committed rLPRI27c8161bb1d5: update ceph keydata key names (authored by JHedden).
update ceph keydata key names
Dec 13 2019, 2:16 PM

Dec 12 2019

JHedden committed rLPRI66d8d2af1843: add fake ceph rbd client key (authored by JHedden).
add fake ceph rbd client key
Dec 12 2019, 12:00 AM

Dec 11 2019

JHedden updated the task description for T225320: Ceph Proof of Concept Build and Testing.
Dec 11 2019, 8:03 PM · Epic, cloud-services-team (Kanban)

Dec 10 2019

JHedden committed rLPRI691766c64e37: add fake keys for ceph osd profile (authored by JHedden).
add fake keys for ceph osd profile
Dec 10 2019, 11:21 PM
JHedden committed rLPRIe4ada9c3fb96: update fake ceph mon secret paths (authored by JHedden).
update fake ceph mon secret paths
Dec 10 2019, 9:06 PM
JHedden committed rLPRI679156bda3db: add ceph fake keys (authored by JHedden).
add ceph fake keys
Dec 10 2019, 9:03 PM
JHedden closed T239917: Import Buster packages for Ceph Nautilus, a subtask of T239918: Deploy Ceph Nautilus on Buster, as Resolved.
Dec 10 2019, 6:11 PM · Epic, cloud-services-team (Kanban)