Fri, Jan 24
This host has been depooled from production and has no running workloads.
During the next rebuild, the RAID array kicked out drive 4. Either we have three bad drives (2, 4, and 9) or the RAID adapter is bad. I'll send the TSR for this host to @Jclark-ctr.
Drive 9 reported a lot of errors while rebuilding the RAID array, and now drives 2, 4, and 9 are missing from the RAID set again. I'll leave drive 9 out of the pool and test rebuilding the array with only 2 and 4.
Thu, Jan 23
This is not a failure; the drive is currently rebuilding as a result of task T241884.
Drives 2 and 4 had a foreign configuration. I've cleared the configuration and reassigned them as global hot spares.
Wed, Jan 22
Tue, Jan 21
Labweb logs show 2020-01-21 22:51:43.638032 Forbidden (CSRF token missing or incorrect.): /project/prefixpuppet/
Wed, Jan 15
I updated Prometheus to bind only on the loopback interface and configured Apache to proxy requests for the server's FQDN to Prometheus. These changes sync the cloudmetrics configuration with production and clear up the Icinga errors when checking this service.
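One way to confirm a loopback-only bind is to inspect the listening sockets. The sketch below is hedged: the helper name `loopback_only` is hypothetical, and port 9090 is only Prometheus's default, not something stated in this log. It reads `ss -ltn` style output on stdin and succeeds only if every listener on the given port is bound to 127.0.0.1 or [::1].

```shell
# Hypothetical helper: exit 0 only if the given port has at least one
# listener and all of its listeners are bound to a loopback address.
# Expects `ss -ltn` style lines on stdin (local address:port in column 4).
loopback_only() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")
            if (parts[n] == port) {
                found = 1
                # Everything before the final ":port" is the bind address.
                addr = substr($4, 1, length($4) - length(parts[n]) - 1)
                if (addr != "127.0.0.1" && addr != "[::1]") bad = 1
            }
        }
        END { exit (found && !bad) ? 0 : 1 }
    '
}

# Example (not run here; 9090 is an assumed port):
#   ss -ltn | loopback_only 9090 && echo "bound to loopback only"
```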
The neutron APIs look good too.
/usr/local/bin/git-sync-upstream was having a hard time with the git repository in /var/lib/git/operations/puppet and consuming all available memory on the VM. I moved the git repo to /var/lib/git/operations/puppet-save-from-gtirloni and pulled down a fresh copy of the repo. I also confirmed that the puppet agent is working on all the hosts in the cloudstore project now.
Reopening to track work on fixing the puppet master configuration.
Tue, Jan 14
Cleaned up all the stale entries with virsh undefine <domain id>
This happens when a VM is migrated with the wmcs cold migration script without being undefined in virsh.
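A rough sketch of how stale entries could be spotted before running `virsh undefine`: compare the domains libvirt has defined with the instances expected on the hypervisor. The helper name `stale_domains` and both file names are hypothetical, and how the expected list is produced is left to the operator; nothing here reflects the cold migration script itself.

```shell
# Hypothetical helper: print names listed in file $1 (domains defined in
# libvirt) that are absent from file $2 (instances expected on this
# hypervisor). Both files hold one name per line.
stale_domains() {
    defined_sorted=$(mktemp)
    expected_sorted=$(mktemp)
    sort -u "$1" > "$defined_sorted"
    sort -u "$2" > "$expected_sorted"
    # comm -23 keeps only the lines unique to the first (sorted) input.
    comm -23 "$defined_sorted" "$expected_sorted"
    rm -f "$defined_sorted" "$expected_sorted"
}

# Example (not run here); review the output before undefining anything:
#   virsh list --all --name | sed '/^$/d' > defined.txt
#   stale_domains defined.txt expected.txt
#   virsh undefine <domain id>
```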
Mon, Jan 13
Fri, Jan 10
Multiple hardware errors have been reported for this host; see T241313.
Thu, Jan 9
Wed, Jan 8
Tue, Jan 7
Hi @TheSandDoctor, your CloudVPS project has been created.
Try it with OS_PROJECT_ID=testlabs
Mon, Jan 6
And slot 4!
Looks like we're missing the drives in slots 2 and 9 on this host.
Thu, Jan 2
I enabled the node exporter mountstats plugin to help diagnose the "slowness" our users have been reporting on tools-sgebastion-07.tools.eqiad.wmflabs. Being able to line up multiple system metrics next to each other with a historical timeline can help identify usage patterns and resource contention.
Dec 23 2019
Dec 20 2019
Dec 18 2019
Thanks for the review. I had the wrong subnet here, but I configured the hosts on the correct public 126.96.36.199/26 subnet.
Dec 17 2019
Dec 16 2019
We could also work around this with another hack, disabling spice and adding in just the ttyS1 serial interface to the nova libvirt guest config process.
This is the commit that broke console output on the stretch hosts. https://gerrit.wikimedia.org/r/c/operations/puppet/+/554151
Dec 13 2019
Grafana dashboards that work with the Ceph Prometheus plugin can be found at https://github.com/ceph/ceph/tree/master/monitoring/grafana