Were you able to verify in any way that the issue is the network card being unavailable after reboot?
I tried running the cookbook with the option --only-check and it worked fine 3 times in a row.
Wed, Sep 14
We verified that the problem is with the drive and not with the controller, because @Jclark-ctr moved the disk to cloudcephosd1034 and I could reproduce the issue there.
Tue, Sep 13
Mon, Sep 12
Fri, Sep 9
The "kicked off" part is explained by @Jclark-ctr rebooting the instance. The disappearing partition, on the other hand, is another symptom of some fault with the drive or the bay. It happened a second time: I created a partition with fdisk, formatted it with mkfs.ext4, and after a few minutes the partition was gone.
No worries, I was able to SSH in now. I created a test partition /dev/sde1 and indeed mkfs.ext4 /dev/sde1 did not throw any errors this time... Then I was kicked off the instance. I SSH-ed in again and now I can no longer see the partition I created 5 minutes ago. I'm confused 🤔
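Instead of only noticing after re-SSH-ing that the partition is gone, the host could poll what the kernel reports in /proc/partitions and record when the device drops out. This is a minimal Python sketch. It assumes the test partition is sde1 as above. The helper names (`list_partitions`, `partition_present`, `watch_partition`) are hypothetical and are not part of any existing cookbook.

```python
from __future__ import annotations

import time
from pathlib import Path


def list_partitions(text: str) -> set[str]:
    """Return the device names found in /proc/partitions-formatted text."""
    names: set[str] = set()
    for line in text.splitlines():
        fields = line.split()
        # Data lines have four columns: major, minor, #blocks, name.
        # The header line and blank lines are skipped.
        if len(fields) == 4 and fields[0].isdigit():
            names.add(fields[3])
    return names


def partition_present(name: str, proc_path: str = "/proc/partitions") -> bool:
    """True if the kernel currently lists `name` (e.g. "sde1")."""
    return name in list_partitions(Path(proc_path).read_text())


def watch_partition(name: str, checks: int = 20, interval_s: float = 30.0) -> int | None:
    """Poll until `name` disappears; return the index of the first failed check, or None."""
    for i in range(checks):
        if not partition_present(name):
            return i
        time.sleep(interval_s)
    return None
```

Running `watch_partition("sde1")` on the host (e.g. inside tmux, so it survives a dropped SSH session) would give a timestamp that can be compared with dmesg.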
@Jclark-ctr right now I cannot connect to cloudcephosd1030.mgmt.eqiad.wmnet with SSH.
Wed, Sep 7
Thanks @Jclark-ctr -- FYI I have temporarily removed this node from the Ceph cluster, so it can be safely rebooted/shut down if necessary.
cloudcephosd1030 and cloudcephosd1031 have hardware issues (see related tasks). I'm going to remove osd.231, osd.232 and osd.233 from the cluster (they're on cloudcephosd1030).
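For reference, this is a generic way to take OSDs out of a cluster, following the manual removal steps in the upstream Ceph documentation. It is not necessarily the procedure or cookbook used here. The Python sketch below only builds and prints the commands for the three OSDs above. `osd_removal_plan` is a hypothetical helper, and the `ceph-osd@<id>` systemd unit name assumes a non-cephadm deployment.

```python
import shlex


def osd_removal_plan(osd_ids):
    """Return, in order, the commands to remove the given OSDs (dry run only).

    Follows the generic manual-removal sequence from the upstream Ceph docs;
    nothing is executed here.
    """
    ids = [str(i) for i in osd_ids]
    # 1. Mark the OSDs out so Ceph migrates their placement groups elsewhere.
    plan = [["ceph", "osd", "out", *ids]]
    # 2. After rebalancing finishes, confirm no data would be lost.
    plan.append(["ceph", "osd", "safe-to-destroy", *ids])
    for i in ids:
        # 3. Stop the daemon (run on the OSD host itself).
        plan.append(["systemctl", "stop", f"ceph-osd@{i}"])
        # 4. Remove the OSD from the CRUSH map, auth keys and OSD map in one step.
        plan.append(["ceph", "osd", "purge", i, "--yes-i-really-mean-it"])
    return plan


if __name__ == "__main__":
    for cmd in osd_removal_plan([231, 232, 233]):
        print(shlex.join(cmd))
```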
Tue, Sep 6
Please note that the instance is not currently in use: it is part of a new group of hosts that are being added to a Ceph cluster. I won't add this one to the cluster until the power supply issue is resolved.
Fri, Sep 2
@Volans I've now run the sre.dns.netbox cookbook, which completed successfully.
Wed, Aug 31
Tue, Aug 30
I downloaded the support logs from the Dell "Support Assist" interface, but the zip file is too big to upload here. Let me know if you need it.
Mon, Aug 29
Aug 25 2022
Aug 24 2022
Aug 17 2022
This seems to confirm that Ceph heartbeats are using Jumbo frames: https://github.com/ceph/ceph/pull/15535
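If heartbeats depend on jumbo frames, an MTU mismatch anywhere on the path would break them. A common end-to-end check is a ping with the don't-fragment flag and a payload that fills the MTU. For IPv4 without options, that payload is the MTU minus the 20-byte IP header and the 8-byte ICMP header. For example, a 9000-byte MTU gives 8972, as in `ping -M do -s 8972 <host>` with Linux iputils. This is a small Python sketch of that arithmetic. The function names are hypothetical, and the host is a placeholder.

```python
from __future__ import annotations

IPV4_HEADER = 20  # bytes, without IP options
IPV6_HEADER = 40  # bytes, fixed header only
ICMP_HEADER = 8   # bytes, same for ICMP and ICMPv6 echo


def max_ping_payload(mtu: int, ipv6: bool = False) -> int:
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - ICMP_HEADER


def jumbo_ping_cmd(host: str, mtu: int = 9000, ipv6: bool = False) -> list[str]:
    """iputils ping command that fails if the path cannot carry `mtu`-sized packets."""
    cmd = ["ping", "-M", "do", "-c", "3", "-s", str(max_ping_payload(mtu, ipv6)), host]
    if ipv6:
        cmd.insert(1, "-6")
    return cmd
```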
Aug 13 2022
Aug 11 2022
Aug 10 2022
Aug 9 2022
Aug 5 2022
Aug 4 2022
Hi @CristianCantoro -- I can ssh into your instance and try to debug the issue. Do you have an example SQLite query that is getting stuck?
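If the query can be reproduced from Python, one way to tell a slow query from a blocked one is to handle the two cases separately. A progress handler can abort the query after a deadline, and a lock timeout makes contention surface as "database is locked" instead of a hang. This is a sketch using the standard `sqlite3` module. The database path and query are placeholders, and `run_with_deadline` is a hypothetical helper.

```python
import sqlite3
import time


def run_with_deadline(db_path, query, params=(), deadline_s=10.0, lock_timeout_s=5.0):
    """Run `query`, aborting it if it runs longer than `deadline_s` seconds.

    Raises sqlite3.OperationalError ("interrupted") when the deadline is hit,
    or ("database is locked") when waiting for a lock exceeds `lock_timeout_s`.
    """
    conn = sqlite3.connect(db_path, timeout=lock_timeout_s)
    start = time.monotonic()
    # Called every 1000 SQLite VM instructions; a non-zero return aborts the query.
    conn.set_progress_handler(lambda: int(time.monotonic() - start > deadline_s), 1000)
    try:
        return conn.execute(query, params).fetchall()
    finally:
        conn.close()
```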
Aug 1 2022
Jul 29 2022
Yes, happy to move forward with 16.
Oops, I completely missed that @taavi had already added a patch for this task :facepalm:
If anything, the two patches look very similar! 😄