
Q1:rack/setup/install backup1012
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of backup1012

Hostname / Racking / Installation Details

Hostnames: backup1012
Racking Proposal: Any rack not shared with backup1004 / backup1005 / backup1006 / backup1007 / backup1011, to the extent possible
Networking Setup: # of Connections: 1; Speed: 10G; VLAN: Private/Public/Other (specify); AAAA records: Y/N; Additional IP records (Cassandra)? Yes/No
Partitioning/RAID: see backup1011 in T326684 for the same setup

  • First SSD: not part of a RAID.
  • Second SSD: not part of a RAID.
  • All others (HDDs): RAID6.
  • The OS should end up seeing 3 disks: sda and sdb (approx. 480 GB each) and sdc (around 160 TB available).
  • The recipe after that is: partman/custom/backup-format.cfg (a post-install sanity check is sketched below).
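If the recipe works, a quick post-install check should show the layout described above. A minimal sketch, assuming the controller enumerates the disks in the order given (device names and exact sizes may differ):

# Verify the installer-visible disk layout:
lsblk -d -o NAME,SIZE,TYPE,MODEL
# Expected, approximately:
#   sda  ~480G  disk  (first SSD, no RAID)
#   sdb  ~480G  disk  (second SSD, no RAID)
#   sdc  ~160T  disk  (hardware RAID6 over all HDDs)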

OS Distro: Bookworm
Sub-team Technical Contact: Data Persistence: @ABran-WMF @Marostegui (Jaime will be out)

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

backup1012
  • Receive in system on procurement task T368926 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml and site.pp, with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook (illustrative invocations are sketched after this list)
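For illustration, the cookbook steps above are run from a cumin host roughly as follows. This is a sketch: the exact argument shapes are assumptions, so check each cookbook's --help for the real interface.

sudo cookbook sre.dns.netbox "Add records for backup1012"    # propagate Netbox changes to DNS
sudo cookbook sre.hosts.provision backup1012                 # configure BIOS/BMC settings
sudo cookbook sre.hardware.upgrade-firmware backup1012       # bring firmware up to date
sudo cookbook sre.hosts.reimage --os bookworm backup1012     # PXE install and first Puppet run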

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

@Marostegui,

Please note there has been a slight change in the workflow for racking and installing hosts. The DC ops team, as a whole, does not have merge rights/root on the puppet repo, so we're now assigning racking tasks to the SRE sub-team at the time of hardware order, which gives each sub-team a week or two to push the puppet repo updates for site.pp and preseed.yaml. Please add these new hosts to site.pp with the insetup role (not their final role) and update preseed.yaml with the partition info.

Once you have merged these changes live (please feel free to reference this task in the patchset), please unassign yourself as task assignee and leave the task unassigned in the 'racking tasks' column of ops-eqiad.

Change #1058295 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] backups: Add backup1012

https://gerrit.wikimedia.org/r/1058295

Change #1058295 merged by Marostegui:

[operations/puppet@production] backups: Add backup1012

https://gerrit.wikimedia.org/r/1058295

> @Marostegui,
>
> Please note there has been a slight change in the workflow for racking and installing hosts. The DC ops team, as a whole, does not have merge rights/root on the puppet repo, so we're now assigning racking tasks to the SRE sub-team at the time of hardware order, which gives each sub-team a week or two to push the puppet repo updates for site.pp and preseed.yaml. Please add these new hosts to site.pp with the insetup role (not their final role) and update preseed.yaml with the partition info.
>
> Once you have merged these changes live (please feel free to reference this task in the patchset), please unassign yourself as task assignee and leave the task unassigned in the 'racking tasks' column of ops-eqiad.

This is done.
Please keep in mind that I will be gone from 2 September and @jcrespo will be back on the 9th, so for any questions please talk to @ABran-WMF.

Any ETA for this and the codfw equivalent, DC-ops? I know there may be some delays due to the vendor peculiarities, but my "Need by date" was 2024-09-08 and I haven't seen any update. Thank you (this is not an emergency, but we are soon running out of space on a shard of media backups).

Jclark-ctr added subscribers: elukey, Jclark-ctr.

I was having issues connecting to mgmt on a new server I just racked. It might have to do with the troubleshooting @elukey is doing:
jclark@cumin1002:~$ nslookup backup1012.mgmt.eqiad.wmnet
Server: 10.3.0.1
Address: 10.3.0.1#53

** server can't find backup1012.mgmt.eqiad.wmnet: NXDOMAIN

Switching to the IP address, it connected with no issues:
ssh -L 8000:[IP_of_backup1012]:443 cumin1002.eqiad.wmnet
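With that tunnel up, the BMC web UI should be reachable locally (assuming the BMC serves it over HTTPS on port 443, which the port forward above implies):

# Once the tunnel above is established, open in a local browser:
#   https://localhost:8000/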

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm executed with errors:

  • backup1012 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console backup1012.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Hi folks! Not sure what is special about the server, but in this case the Supermicro Network settings for the BMC can't be applied cleanly and I need to figure out why. I hope to have an update soon; for the moment, please don't proceed further with the server.

@elukey thanks, that is all I needed: some info on where we stand. I wasn't aware of the ongoing progress during my sabbatical, which is understandable. No worries. :-)

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm

This is the current error:

Applying Network changes to the BMC.
Error while configuring BIOS or mgmt interface: PATCH https://10.65.4.91/redfish/v1/Managers/1/EthernetInterfaces/1 returned HTTP 400 with message:
{"error":{"code":"Base.v1_10_3.GeneralError","Message":"A general error has occurred. See ExtendedInfo for more information.","@Message.ExtendedInfo":[{"MessageId":"Base.1.10.PropertyNotWritable","Severity":"Warning","Resolution":"Remove the property from the request body and resubmit the request if the operation failed.","Message":"The property StaticNameServers is a read only property and cannot be assigned a value.","MessageArgs":["StaticNameServers"],"RelatedProperties":[""]},{"MessageId":"Base.1.10.PropertyNotWritable","Severity":"Warning","Resolution":"Remove the property from the request body and resubmit the request if the operation failed.","Message":"The property StatelessAddressAutoConfig is a read only property and cannot be assigned a value.","MessageArgs":["StatelessAddressAutoConfig"],"RelatedProperties":[""]},{"MessageId":"Base.1.10.PropertyUnknown","Severity":"Warning","Resolution":"Remove the unknown property from the request body and resubmit the request if the operation failed.","Message":"The property StatelessAddressAutoConfig is not in the list of valid properties for the resource.","MessageArgs":["StatelessAddressAutoConfig"],"RelatedProperties":["StatelessAddressAutoConfig"]},{"MessageId":"Base.1.10.PropertyUnknown","Severity":"Warning","Resolution":"Remove the unknown property from the request body and resubmit the request if the operation failed.","Message":"The property StaticNameServers is not in the list of valid properties for the resource.","MessageArgs":["StaticNameServers"],"RelatedProperties":["StaticNameServers"]}]}}

I checked the firmware version of the BMC and I got: 'Oem': {'Supermicro': {'UniqueFilename': 'BMC_X12AST2600-ROT-5201MS_20221105_01.01.34_STDsp.bin'}}

This is what I get for mc-misc2001, a new node that I worked on this morning: 'UniqueFilename': 'BMC_X12AST2600-ROT-5201MS_20240816_06.04.04_OEMsp.bin'

The backup1012 host for some reason runs a 2022 firmware, while the others run 2024 ones. I think that we should upgrade the BMC firmware and retry the provision cookbook, @Jclark-ctr.
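For reference, one way to inspect which properties this BMC actually exposes on the failing resource; a sketch, where the IP and resource path come from the error above, the user name is a placeholder, and jq is assumed to be available:

# List the property names the BMC exposes on the interface resource;
# curl prompts for the password, -k accepts the self-signed cert.
curl -sk -u ADMIN 'https://10.65.4.91/redfish/v1/Managers/1/EthernetInterfaces/1' | jq 'keys'
# Properties absent from (or read-only in) the response, such as
# StaticNameServers and StatelessAddressAutoConfig here, cannot be
# PATCHed; that matches the PropertyUnknown/PropertyNotWritable
# warnings in the error above.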

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm completed:

  • backup1012 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410091427_jclark_3193230_backup1012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually
Jclark-ctr claimed this task.
Jclark-ctr updated the task description. (Show Details)

Please keep it open until we solve the firmware/BMC issue :)

@Jclark-ctr do you mind passing me the BMC admin password via chat or email?

I've updated the firmware to version 20231203_01.03.10 but I keep seeing the same problem:

{"error":{"code":"Base.v1_10_3.GeneralError","message":"A general error has occurred. See ExtendedInfo for more information.","@Message.ExtendedInfo":[{"MessageId":"Base.1.10.PropertyNotWritable","Severity":"Warning","Resolution":"Remove the property from the request body and resubmit the request if the operation failed.","Message":"The property StatelessAddressAutoConfig is a read only property and cannot be assigned a value.","MessageArgs":["StatelessAddressAutoConfig"],"RelatedProperties":[""]},{"MessageId":"Base.1.10.PropertyUnknown","Severity":"Warning","Resolution":"Remove the unknown property from the request body and resubmit the request if the operation failed.","Message":"The property StatelessAddressAutoConfig is not in the list of valid properties for the resource.","MessageArgs":["StatelessAddressAutoConfig"],"RelatedProperties":["StatelessAddressAutoConfig"]}]}}

I checked an-conf1004, which should be a similar host, and this is its firmware version: 20240313_01.04.04

Can't find it on the website though, so I think we'll have to reach out to Supermicro.

Email sent to Supermicro; we'll see if they are able to provide us the right firmware.

@jcrespo the host is up and reimaged, but the mgmt interface is not reachable. If you want to start configuring the host, go ahead; I hope to receive the right firmware during the next days, and after that I'll just need to reboot the host (so if it is not an issue for you, go ahead configuring!).

MatthewVernon mentioned this in Unknown Object (Task). Oct 17 2024, 8:50 AM

I tried a new firmware but it didn't work, same error. I noticed that the hosts showing the issue are the same exact model (https://netbox.wikimedia.org/dcim/devices/?device_type_id=338), while we don't have issues with the other ones. I think that we'll need to make an exception in the provision cookbook :(

Made it work: I had to factory reset after the firmware upgrade to get the new default 'calvin' password to work correctly, plus all the new expected settings. Ran the provision cookbook and everything seems set up correctly.
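For future reference, a BMC factory reset can often be triggered over Redfish as well. A sketch: Manager.ResetToDefaults is part of the standard Redfish Manager schema, but not every firmware implements it, and the IP and user are placeholders:

# Ask the BMC to reset all settings to factory defaults:
curl -sk -u ADMIN -X POST -H 'Content-Type: application/json' \
  -d '{"ResetToDefaultsType": "ResetAll"}' \
  'https://10.65.4.91/redfish/v1/Managers/1/Actions/Manager.ResetToDefaults'
# If the firmware does not implement the action, the BMC web UI's
# maintenance menu is the fallback.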

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm executed with errors:

  • backup1012 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console backup1012.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm completed:

  • backup1012 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410211814_elukey_1795361_backup1012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Tested the reimage as well, all works! @jcrespo thanks for the patience, green light for production :)

Thanks, I can start provisioning it; however, there seems to be an issue with the disk monitoring. Our puppet installation seems to be identifying the RAID controller:

$ lspci -nn
...
98:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx [1000:10e2]

But the command that sets up monitoring doesn't seem to be working for this host. I don't know if it is a utility issue, a driver issue, or a puppet detection issue, but at the moment the host cannot show its disk/controller status:

$ perccli64 show J
{
"Controllers":[
{
        "Command Status" : {
                "CLI Version" : "007.1910.0000.0000 Oct 08, 2021",
                "Operating system" : "Linux 6.1.0-26-amd64",
                "Status Code" : 0,
                "Status" : "Success",
                "Description" : "None"
        },
        "Response Data" : {
                "Number of Controllers" : 0,
                "Host Name" : "backup1012",
                "Operating System " : "Linux 6.1.0-26-amd64"
        }
}
]
}
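One way to narrow down whether this is a driver or a utility problem; a sketch, assuming storcli64 (Broadcom's own build of the same tool) can be installed on the host:

# Check which kernel driver has claimed the controller reported by lspci:
lspci -nnk -d 1000:10e2
# If megaraid_sas is bound, the kernel sees the card and the problem is
# likely the management utility: perccli is Dell's build of Broadcom's
# storcli and can report zero controllers for non-Dell-branded cards.
# Trying the vendor-neutral tool distinguishes the two cases:
storcli64 show J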
wiki_willy added subscribers: RobH, wiki_willy.

Re-opening this task, as the server has the incorrect RAID controller. We're working with Supermicro to get an upgraded RAID controller sent on-site to replace it and hopefully resolve the performance issues being seen. @RobH - can you provide frequent updates in this task and work closely with Supermicro on getting the part, until we have this issue resolved? Thanks, Willy

This is not great for me. I need these hosts by 2024-09-08, as documented at T368926, and I am running out of space. If I start resharding now, we will likely lose all data on them due to the RAID solution changing.

Hi @jcrespo - thanks for your feedback on this. My apologies that these Config J servers have been causing a lot of headaches. Unfortunately, we still have to figure out how to best resolve the performance issues from the RAID controller. In your opinion, what would work best? For example, would it work better if we set up a Config J server with the upgraded RAID controller first, and then migrated the data after? Let me know your preference, and we'll do our best to accommodate that.

Also, @elukey - the RAID controller kit that Supermicro is currently suggesting for us is the BTR-CV3908-FT1 - 8GB cache module, which is non-volatile so no battery is needed. Let me know if you have any initial thoughts or opinions on this one.

Thanks,
Willy

> This is not great for me. I need these hosts by 2024-09-08, as documented at T368926, and I am running out of space. If I start resharding now, we will likely lose all data on them due to the RAID solution changing.

I can wait a few extra days for that. But I need certainty on dates, or to know that a new testing period that could take weeks (?) is ahead of us. In the latter case, I would prefer to use what we have now to solve the backup storage need, and do the testing elsewhere, under less stressful constraints. Later we can migrate or do whatever, but I really need to start working on it now and know those files won't be lost in a few weeks. Migrating files on these hosts can take weeks. I would prefer taking any kind of performance penalty in exchange for certainty and stability.

Alternatively, I would wipe one of the other existing backup hosts and have the entire media backup cluster on the same platform, then do any tests on a less mission-critical host, where losing all data is less time-consuming, migrating a less critical service to these new hosts instead.

Thanks for the context, Jaime. Based on your current needs and the time constraints, it sounds like it'll be better to have you continue working on the host in its current state. While we're escalating everything with Supermicro, it's been a bit difficult getting some solid ETAs in place. There's also the possibility that unexpected issues could pop up, and I don't want to potentially delay things any further.

Thank you for the alternative plan in getting the entire Media backup cluster on one of the existing hosts as well.
That's a great idea to shift things around, so that a less critical service is on the newer host. Please let us know if there's anything we can do to help on this front. I don't know if it would be useful or not, but we could bump up the 4x backup expansion and refresh hosts budgeted for next quarter and order them in Q2 instead, if it helps in any way.

@wiki_willy I am going to remove backup1010 and backup2010 from bacula and use them for mediabackups instead. This will solve my immediate needs, but please note I am only delaying the inevitable: those hosts were for the general bacula expansion, and we can delay that a bit more, as the growth there was much smaller (and we will impact another team instead of mine :-S).

Feel free to use and replace/service backup1012 and backup2012 as you want.

> Also, @elukey - the RAID controller kit that Supermicro is currently suggesting for us is the BTR-CV3908-FT1 - 8GB cache module, which is non-volatile so no battery is needed. Let me know if you have any initial thoughts or opinions on this one.

@wiki_willy forked the conversation to T378584 :)

Thanks so much @jcrespo, I appreciate your flexibility and patience on this.

> @wiki_willy I am going to remove backup1010 and backup2010 from bacula and use them for mediabackups instead. This will solve my immediate needs, but please note I am only delaying the inevitable: those hosts were for the general bacula expansion, and we can delay that a bit more, as the growth there was much smaller (and we will impact another team instead of mine :-S).
>
> Feel free to use and replace/service backup1012 and backup2012 as you want.

We have received the test controller kit. We are ready to install it whenever you're ready!

> Feel free to use and replace/service backup1012 and backup2012 as you want.

This has been added to the unit. Please test when possible.

Change #1091731 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] backup: Move Dell bacula hosts to mediabackups

https://gerrit.wikimedia.org/r/1091731

Change #1091731 merged by Jcrespo:

[operations/puppet@production] backup: Move Dell bacula hosts to mediabackups

https://gerrit.wikimedia.org/r/1091731

@VRiley-WMF I am a bit confused by this task: did you install a battery module on the existing RAID card? That's what the OS tells me, but I wanted to check with you what actually happened physically, because my understanding after my last discussion with Willy was that a different controller card had been suggested. Can you confirm?

Were DRAM memory modules installed on the card?

The reason I ask was because the last suggestion was the possibility of installing a:

> BTR-CV3908-FT1 - 8GB cache module, which is non-volatile so no battery is needed

Which is what I thought we were going to test, but it makes no sense, as a RAID controller change would need reformatting and a battery wouldn't show up.

@jcrespo correct, we didn't replace the RAID controller; we just added the battery to the existing RAID controller.

Thanks, that works for me, I was just confused. I also checked and there is some integrated RAM on the chip. I will soon share a summary of my benchmarks, with probably interesting results, at T378584.

Change #1193137 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] backup1012: reimage

https://gerrit.wikimedia.org/r/1193137

Change #1193137 merged by JHathaway:

[operations/puppet@production] backup1012: reimage

https://gerrit.wikimedia.org/r/1193137

@jcrespo it took me a bit of time to coerce the box back into BIOS mode. I then tried reimaging with bookworm, but the RAID step failed due to the existence of the RAID6 volume. After a couple of failed attempts, I booted off a rescue image and removed the RAID6 volume with storcli.

Unfortunately the MD RAID setup still fails with a strange error: "partman-auto-raid: Error: incorrect directory for /dev/md0".

When I look at the console, the volumes all seem to be set up correctly (from /proc/mdstat):

md2 : active raid1 sdb3[1] sda3[0]
      418913280 blocks super 1.2 [2/2] [UU]
        resync=DELAYED
      bitmap: 4/4 pages [16KB], 65536KB chunk

md1 : active raid1 sdb2[1] sda2[0]
      975872 blocks super 1.2 [2/2] [UU]
        resync=DELAYED

md0 : active raid1 sdb1[1] sda1[0]
      48793600 blocks super 1.2 [2/2] [UU]
      [====>................]  resync = 22.6% (11067520/48793600) finish=3.0min speed=203722K/sec

It is getting very late here, so unfortunately I will need to debug further tomorrow.
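For reference, the rescue-image cleanup described above looks roughly like this. A sketch: the controller/VD indices and device names are assumptions to be confirmed with the show commands first, and every step is destructive:

# 1. Remove the stale hardware RAID6 virtual drive with storcli
#    (VD 0 on controller 0 is an assumption; verify before deleting):
storcli64 /c0/vall show
storcli64 /c0/v0 del force
# 2. If partman still trips over leftover MD metadata, stop the arrays
#    and wipe their superblocks so the installer sees blank disks:
mdadm --stop --scan
mdadm --zero-superblock /dev/sda1 /dev/sda2 /dev/sda3 /dev/sdb1 /dev/sdb2 /dev/sdb3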

Change #1193370 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] installserver: Revert backup1012 to manual setup

https://gerrit.wikimedia.org/r/1193370

Change #1193370 merged by Jcrespo:

[operations/puppet@production] installserver: Revert backup1012 to manual setup

https://gerrit.wikimedia.org/r/1193370

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1003 for host backup1012.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1003 for host backup1012.eqiad.wmnet with OS bookworm completed:

  • backup1012 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510030933_jynus_1071830_backup1012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

It is normal that the recipe was not working: not only was the logical configuration destroyed, the hardware RAID was not set up either, so there was no virtual RAID configuration to install onto.

> It is normal that the recipe was not working: not only was the logical configuration destroyed, the hardware RAID was not set up either, so there was no virtual RAID configuration to install onto.

Okay, I'm a bit confused now: what steps did you take to re-image it?

> what steps did you take to re-image it?

I had to redo the HW RAID setup through the mgmt interface; it was missing from the configuration of the host (it was there before). Then I did a manual partitioning (which may not have been needed, but I did it based on your feedback that the partitioning was correct). All of that was quite painful.

>> what steps did you take to re-image it?
>
> I had to redo the HW RAID setup through the mgmt interface; it was missing from the configuration of the host (it was there before). Then I did a manual partitioning (which may not have been needed, but I did it based on your feedback that the partitioning was correct). All of that was quite painful.

Indeed, that sounds painful.

How did you configure the RAID through the mgmt interface? I couldn't figure out how to do that last night, which is why I installed storcli on a systemrescue boot image.

> How did you configure the RAID through the mgmt interface? I couldn't figure out how to do that last night, which is why I installed storcli on a systemrescue boot image.

The HTTPS interface has a terrible GUI, a bit hidden between submenus.

> The HTTPS interface has a terrible GUI, a bit hidden between submenus.

Ugh, I see that now; either from:

  1. Dashboard -> Storage
  2. System -> Storage Monitoring

I assumed it was a separate Broadcom utility.
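For the record, the same virtual drive can also be created from the CLI without the GUI; a sketch, where the enclosure:slot IDs are placeholders that must be read from the controller first:

# List physical drives to get the real enclosure:slot IDs:
storcli64 /c0 show
# Create the RAID6 virtual drive over the HDDs (252:0-11 is a placeholder):
storcli64 /c0 add vd type=raid6 drives=252:0-11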