At 8:03 Icinga flagged errors for the instances running on ganeti1005 (and also for ganeti1005 itself). When the alert came in, the host was in fact inaccessible, but it recovered while I was logging in via mgmt.
dmesg is full of errors like:
[13206136.352034] ehci_hcd lrw gf128mul tg3 ablk_helper ptp cryptd libata megaraid_sas pps_core usbcore libphy usb_common scsi_mod
[13206136.352043] CPU: 23 PID: 4183 Comm: gnt-node Tainted: G B W 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.25-1~bpo8+3
[13206136.352044] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
[13206136.352045] 0000000000000000 ffffffffa6329be5 fffff0bf23dbc240 ffffffffa6a00df0
[13206136.352048] ffffffffa618313c 0000000000000010 0000000000000009 fffff0bf23dbade0
[13206136.352050] ffffffffa6186af3 ffff904500000001 ffff9045ae544c30 ffff904500000001
[13206136.352052] Call Trace:
[13206136.352060] [<ffffffffa6329be5>] ? dump_stack+0x5c/0x77
[13206136.352062] [<ffffffffa618313c>] ? bad_page+0xbc/0x120
[13206136.352065] [<ffffffffa6186af3>] ? get_page_from_freelist+0x993/0xad0
[13206136.352067] [<ffffffffa6187c27>] ? __alloc_pages_nodemask+0xf7/0x270
[13206136.352070] [<ffffffffa61da7d0>] ? alloc_pages_vma+0xb0/0x240
[13206136.352074] [<ffffffffa61f93ac>] ? mem_cgroup_commit_charge+0x7c/0xf0
[13206136.352076] [<ffffffffa61b6d31>] ? handle_mm_fault+0x1441/0x1700
[13206136.352080] [<ffffffffa605fe53>] ? __do_page_fault+0x253/0x510
[13206136.352085] [<ffffffffa66069d8>] ? page_fault+0x28/0x30
[13206136.352086] BUG: Bad page state in process gnt-node pfn:8f6f0a
Since this appeared out of the blue, my guess is a memory error. I'd suggest we take the host down per https://wikitech.wikimedia.org/wiki/Ganeti#Shutdown_a_node_for_a_prolonged_period_of_time and run a memory test.
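Before the memory test, it may help to check whether the bad pages cluster in one physical address range, which would be consistent with a single faulty DIMM. Below is a rough Python sketch, not something that was run on ganeti1005: it pulls the pfn values out of saved dmesg text and converts them to physical addresses. The function names and the 4 KiB page size are assumptions (4 KiB is the usual x86_64 default); the pfn is taken to be hexadecimal, as the kernel prints it in the "Bad page state" line.

```python
import re
from collections import Counter

# Assumption: 4 KiB pages, the usual default on x86_64.
PAGE_SIZE = 4096

# Matches lines such as:
#   BUG: Bad page state in process gnt-node pfn:8f6f0a
BAD_PAGE_RE = re.compile(
    r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-fA-F]+)"
)


def bad_pages(dmesg_text):
    """Return a list of (process, pfn, physical_address) tuples."""
    results = []
    for match in BAD_PAGE_RE.finditer(dmesg_text):
        process = match.group(1)
        pfn = int(match.group(2), 16)  # the pfn is printed in hex
        results.append((process, pfn, pfn * PAGE_SIZE))
    return results


def pfn_counts(dmesg_text):
    """Count how often each pfn is reported, to spot repeat offenders."""
    return Counter(pfn for _, pfn, _ in bad_pages(dmesg_text))
```

Feeding it the output of `dmesg` saved to a file and printing the addresses with `hex()` gives a quick view of whether the reports fall in one region; mapping an address to a specific DIMM still needs the hardware's own tooling.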