User Details
- User Since
- May 28 2019, 6:09 PM (209 w, 1 d)
- Roles
- Disabled
- LDAP User
- Jhedden
- MediaWiki User
- JHedden (WMF) [ Global Accounts ]
Jun 24 2020
May 14 2020
Full error after upgrading the qemu packages to match package versions:
May 12 2020
Starting RabbitMQ 3.7.8 on Erlang 21.2.6 Copyright (C) 2007-2018 Pivotal Software, Inc. Licensed under the MPL. See http://www.rabbitmq.com/ 2020-05-12 20:20:57.663 [info] <0.261.0> node : rabbit@cloudcontrol1003 home dir : /var/lib/rabbitmq config file(s) : /etc/rabbitmq/rabbitmq.config cookie hash : cnbu5mjmX6EnA/KuQQ1WwQ== log(s) : /var/log/rabbitmq/rabbit@cloudcontrol1003.log : /var/log/rabbitmq/rabbit@cloudcontrol1003_upgrade.log database dir : /var/lib/rabbitmq/mnesia/rabbit@cloudcontrol1003 2020-05-12 20:20:58.737 [info] <0.269.0> Memory high watermark set to 25717 MiB (26966397747 bytes) of 64292 MiB (67415994368 bytes) total 2020-05-12 20:20:58.743 [info] <0.271.0> Enabling free disk space monitoring 2020-05-12 20:20:58.743 [info] <0.271.0> Disk free limit set to 50MB 2020-05-12 20:20:58.747 [info] <0.274.0> Limiting to approx 65436 file handles (58890 sockets) 2020-05-12 20:20:58.748 [info] <0.275.0> FHC read buffering: OFF 2020-05-12 20:20:58.748 [info] <0.275.0> FHC write buffering: ON 2020-05-12 20:20:58.755 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 9 retries left 2020-05-12 20:21:28.756 [warning] <0.261.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]} 2020-05-12 20:21:28.756 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 8 retries left 2020-05-12 20:21:58.757 [warning] <0.261.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]} 2020-05-12 20:21:58.757 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 7 retries left 2020-05-12 20:22:28.758 [warning] <0.261.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]} 2020-05-12 20:22:28.759 [info] <0.261.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2020-05-12 20:16:04.072 [error] emulator Discarding message {'$gen_call',{<0.2146.0>,#Ref<0.2094199940.271056901.235963>},stat} from <0.2146.0> to <0.7921.0> in an old incarnation (2) of this node (1)
May 11 2020
These servers should mimic the network configuration we have in production:
May 7 2020
The virtual drive rebuild process was MUCH faster, the firmware upgrades completed successfully and all drives have remained online.
Thanks! I've imported the RAID config, restored the boot order settings and will verify it's fixed.
May 5 2020
You can use the partman config echo partman/standard.cfg partman/raid1-2dev.cfg
Check if we have any netflow data from the network devices that would allow us to query src and dest traffic
Waiting for the next reboot of this host
Added some documentation at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS
Increasing the memcached cache size definitely helped.
May 1 2020
The RAID card took drive 9 offline again during the virtual disk rebuild. We cannot update the SATA drive firmware until all the devices are healthy, and since that is never the case we cannot apply the update.
Apr 30 2020
I've cleared the foreign configuration on drives 4 and 9 again, once the rebuild completes I'll attempt the SATA firmware and system BIOS upgrades.
I'd also like to point out that we have another system purchased in the same batch T192119, and 6 more with the same configuration T201352 that are running the same workloads without any problems.
I'm unable to upgrade the SATA because of the failed drive state:
Used the BIOS versions in that last log message, the correct iDRAC versions and log output are below
Apr 28 2020
Great! Thanks for the update. This host is currently out of service and can be taken offline anytime.
Apr 27 2020
Added a Grafana dashboard for detailed instance metrics using the metricsinfra prometheus server: https://grafana-labs.wikimedia.org/d/000000590/metricsinfra-cloudvps-instance-details
It looks like things will be noisy if we add the alert space rules right now.
https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1h&g0.expr=100%20-%20(node_filesystem_avail_bytes%7Bfstype%3D%22ext4%22%7D%2Fnode_filesystem_size_bytes%20*%20100)%20%3E%3D%2080&g0.tab=1
Email based alert notifications are now enabled for the tools and cloudinfra projects.
Apr 21 2020
List of effected virtual machines
/etc/libvirt/qemu/i-00000406.xml: <nova:name>toolsbeta-sgewebgrid-generic-0901</nova:name> /etc/libvirt/qemu/i-00001507.xml: <nova:name>incubator-mw</nova:name> /etc/libvirt/qemu/i-00001d3c.xml: <nova:name>tools-sgeexec-0901</nova:name> /etc/libvirt/qemu/i-00002cf4.xml: <nova:name>tools-sgewebgrid-lighttpd-0918</nova:name> /etc/libvirt/qemu/i-00002cf5.xml: <nova:name>tools-sgewebgrid-lighttpd-0919</nova:name> /etc/libvirt/qemu/i-0000735c.xml: <nova:name>media-streaming</nova:name> /etc/libvirt/qemu/i-00007e14.xml: <nova:name>wikilink-prod</nova:name> /etc/libvirt/qemu/i-00007e7c.xml: <nova:name>commonsarchive-mwtest</nova:name> /etc/libvirt/qemu/i-000081a8.xml: <nova:name>wikidata-autodesc</nova:name> /etc/libvirt/qemu/i-000088a9.xml: <nova:name>deployment-schema-2</nova:name> /etc/libvirt/qemu/i-0000892a.xml: <nova:name>discovery-testing-02</nova:name> /etc/libvirt/qemu/i-00009819.xml: <nova:name>visionoid</nova:name> /etc/libvirt/qemu/i-0001027b.xml: <nova:name>deployment-echostore01</nova:name> /etc/libvirt/qemu/i-000105b2.xml: <nova:name>Esther-outreachy-intern</nova:name> /etc/libvirt/qemu/i-00012d1a.xml: <nova:name>tools-k8s-worker-38</nova:name> /etc/libvirt/qemu/i-00012d29.xml: <nova:name>tools-k8s-worker-52</nova:name> /etc/libvirt/qemu/i-00014212.xml: <nova:name>canary1004-01</nova:name>
We should use the WMCS SRE run-book enhancement proposal for this
Using a service virtual IP could be an option here, more notes on that at https://wikitech.wikimedia.org/wiki/User:Jhedden/notes/keepalived
Waiting on Ceph storage which will allow easier hypervisor reboots
In the past the agents were going offline due to missed rabbitMQ heartbeat messages. Consider creating a prometheus exporter to monitor the OpenStack nova and neutron agents to watch for up/down state.
The hosts in codfw are used for platform testing and staging. It's useful to have these in Icinga, but we don't need email notifications or on the alerts sub-page dashboard. Potentially we can add a host and service downtime for a _very_ long time.
Apr 20 2020
The elasticsearch version 5 cluster is being shutdown today. Your tool account credentials have been migrated to the new cluster which can be reached at http://elasticsearch.svc.tools.eqiad1.wikimedia.cloud