User Details
- User Since
- Dec 18 2014, 3:39 PM (491 w, 5 d)
- Availability
- Available
- LDAP User
- Papaul
- MediaWiki User
- Unknown
Today
complete
Yesterday
@Jhancock.wm it looks like we have another sretest2002 setup in b7, and the switch already has that configuration, so I went ahead and deleted the one in b7 since you have another in a8.
@Eevans like you mentioned on IRC, "it's the same slot(s) that are having issues", so I think we need to replace the main board and see. We have 4 decommissioned PowerEdge R440s. I will ping @Jclark-ctr or @VRiley-WMF to see if they can coordinate with you, pull the main board from one of those servers, and use it to replace the one in aqs1013. After that, you can try to re-image the server.
@Jclark-ctr @VRiley-WMF please see above if you have time to work with @Eevans on this.
Thanks
Sat, May 18
@Jhancock.wm thank you for working on this. Like I mentioned to you this morning, kafka-main2009 was failing because it was contacting the wrong Puppet server for the cert request (see below). What I did was delete the cert request on the puppetmaster and restart the re-image.
Fri, May 17
Thank you, will do
Tue, May 14
All the old mgmt switches are back in place
Wed, May 8
@cmooney I think this is just a human error issue. We were racking all the lsw1-d* yesterday and maybe we accidentally bumped into the cable. We will check once on site.
Mon, May 6
Fri, May 3
Resolved by rebooting both switches
@Jclark-ctr @VRiley-WMF when the task was auto-generated, it showed that disk sdg1 failed; see the line from the task description below (F)
Sat, Apr 27
Apr 18 2024
Apr 17 2024
@ssingh After 2 days working on this issue, I finally got to the bottom of the problem. After many reboots on cp1115, I checked the model of the NIC (Broadcom 57414) and decided to test every single firmware version available on the Dell web site.
- All the 22.xx versions: the server PXE boots but gives the error "Failed to load ldlinux.c32"
- Versions 21.8x: the server boots sometimes and other times gets stuck
- The last version, 21.60.22.11, which was not listed on the Dell product-support web site https://www.dell.com/support/home/en-us/product-support/servicetag/0-bTkxNWhsYWF2OFdQRm04TmF3QjhwZz090/drivers, is the only working version. I installed this version and rebooted cp1115 six times, and all six times it booted into PXE without an issue.
Complete
@Jhancock.wm anything else left to be done on this task?
Apr 16 2024
@blink is there anything left for DC-ops to do on this task? Thanks
@ssingh unfortunately using the FS DAC didn't fix the issue, so we are back to square one. I am still working on it.
Since Monday I have had the Juniper switch set up as the management switch in racks D1 and D2, and so far there have been no issues. I had to:
- Set the root password to the same value as the server management password
- Disable the management interface
- Disable the chassis alert for the management interface
- Set up the switch as a management switch in Netbox to stop some LibreNMS and network alerts
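The three on-switch steps above could be expressed as Junos `set` commands along the lines of the sketch below. This is an assumption about how such a setup might look, not a record of what was actually typed: the management interface name `me0`, the placeholder password hash, and the helper name `mgmt_switch_commands` are all hypothetical, and the Netbox step is not covered.

```python
def mgmt_switch_commands(root_password_hash, mgmt_interface="me0"):
    """Build Junos 'set' commands for the on-switch steps.

    Assumptions (not from the original comment): the management
    interface is named 'me0' and the root password is supplied as an
    already-encrypted hash.
    """
    if not root_password_hash:
        raise ValueError("root_password_hash must not be empty")
    return [
        # Step 1: root password (same value as the server management password)
        f'set system root-authentication encrypted-password "{root_password_hash}"',
        # Step 2: disable the management interface
        f"set interfaces {mgmt_interface} disable",
        # Step 3: silence the chassis alarm for the management link being down
        "set chassis alarm management-ethernet link-down ignore",
    ]


# Placeholder hash for illustration only
for command in mgmt_switch_commands("$6$EXAMPLE-HASH-PLACEHOLDER"):
    print(command)
```

Generating the commands from one function keeps the basic setup identical across switches, which matters when the same steps have to be repeated by hand on each one.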
Apr 15 2024
@Jhancock.wm cloud-hosts1-b1-codfw (2118)
Apr 13 2024
@ssingh I also checked cp2042: we are using an FS.com DAC there.
FYI W2W = Wave2Wave
Apr 12 2024
@ssingh one difference I found between the server NIC and the switch interface is the vendor. In eqiad I checked 3 nodes, cp1115, 1113 and 1100: all show W2W as the vendor under Transceiver inventory, while in esams the vendor is FS. Since @ayounsi mentioned this morning that the request was not reaching the switch, I focused on the media type used in esams and in eqiad. It looks like both connections are Direct Attach Copper but from different vendors.
Apr 11 2024
Apr 8 2024
@ayounsi yes, you are right: since it will have an IP address, it will be managed, so I was thinking it over. The idea is to disable the mgmt interface, just set up the root password on the switch, and use it as an L2 switch so we don't have to deal with managing it.
Apr 5 2024
@ayounsi @cmooney thanks for all the input. What I am asking is to use the old Juniper switches as dumb switches (L2 config). I need no automation or monitoring on them; I would like to use them just like the existing switches. I just don't want to go into the 15 switches manually to do the initial basic setup, which is why I was asking whether it is possible to set up ZTP to work with those switches too. If it is too much work on the ZTP side, I can set them up manually. Please let me know if you have more questions.
Thanks.
Apr 4 2024
Apr 2 2024
Apr 1 2024
@bking the PXE boot was failing because both the 10G and the 1G NICs were set to PXE boot. I disabled PXE boot on the 1G NIC and all is good now.
You can resume the re-image
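A quick way to catch this kind of conflict before a re-image is to check how many NICs have PXE enabled. The sketch below assumes the per-NIC PXE settings have already been collected into a plain mapping; the NIC names and the helper names are hypothetical and do not come from any real tool.

```python
def pxe_enabled_nics(nics):
    """Return the sorted names of NICs that have PXE boot enabled.

    `nics` maps a NIC name to True/False (PXE enabled or not).
    """
    return sorted(name for name, pxe in nics.items() if pxe)


def has_pxe_conflict(nics):
    """True when more than one NIC is set to PXE boot."""
    return len(pxe_enabled_nics(nics)) > 1


# Hypothetical settings mirroring the situation described above
before = {"nic-10g": True, "nic-1g": True}
after = {"nic-10g": True, "nic-1g": False}

for label, settings in (("before", before), ("after", after)):
    print(label, pxe_enabled_nics(settings), "conflict:", has_pxe_conflict(settings))
```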
Mar 28 2024
complete
Mar 27 2024
@Jhancock.wm this is what 2003 is showing on the console:
┌───────────────────────┤ [!!] Partition disks ├────────────────────────┐
│                                                                       │
│ Failed to partition the selected disk                                 │
│ This probably happened because the selected disk or free space is too │
│ small to be automatically partitioned.                                │
│                                                                       │
│     <Go Back>
With the 2 SSDs back in the server, I hit the same issue. Doing more troubleshooting, I found out that when the server was first re-imaged, it created 2 LVM volumes, one on the HDDs and the other on the SSDs, so each time I deleted and recreated the HW RAID, the LVM volume was for some reason not being deleted. I had to recreate the HW RAID with just the HDDs, install the OS, and then, once the OS install was complete, create the RAID on the SSDs.
@jcrespo all yours
Mar 26 2024
dbprov2005 was failing after installing the OS many times.
After some troubleshooting: when the server reboots into the OS after the OS install, the cookbook fails. On the server console during the OS boot I get:
[ TIME ] Timed out waiting for device -4977-ac03-d76d6926114e.
[ TIME ] Timed out waiting for device -6d79-4292-8546-2e76e67a0aa0.
[DEPEND] Dependency failed for /dev…3-849e-4977-ac03-d76d6926114e.
[DEPEND] Dependency failed for Swap.
As a next step, I removed both SSDs and recreated the HW RAID with only the HDDs, and the server didn't have any issues.
I am going to try putting the 2 SSDs back and re-imaging the server to see if it fails.
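The "Timed out waiting for device" and "Dependency failed for Swap" lines are what systemd prints when a device it was told to wait for never shows up. A minimal sketch for spotting such stale references follows, assuming they come from `UUID=` entries in `/etc/fstab` (the comment does not say where they come from). The sample UUIDs and the helper names are hypothetical.

```python
import re
from pathlib import Path

UUID_RE = re.compile(r"^UUID=([0-9A-Fa-f-]+)$")


def fstab_uuids(fstab_text):
    """Return {uuid: mount_point} for fstab entries that use UUID=..."""
    found = {}
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed entry, ignore
        match = UUID_RE.match(fields[0])
        if match:
            found[match.group(1).lower()] = fields[1]
    return found


def missing_uuids(fstab_text, present):
    """Return fstab UUID entries whose device is not in `present`."""
    present = {uuid.lower() for uuid in present}
    return {u: mp for u, mp in fstab_uuids(fstab_text).items() if u not in present}


def present_uuids(by_uuid_dir="/dev/disk/by-uuid"):
    """List UUIDs the running system currently exposes (empty set if absent)."""
    directory = Path(by_uuid_dir)
    return {p.name for p in directory.iterdir()} if directory.is_dir() else set()


# Made-up sample data, not taken from dbprov2005
SAMPLE_FSTAB = """\
# <file system> <mount point> <type> <options> <dump> <pass>
UUID=11111111-2222-3333-4444-555555555555 /    ext4 errors=remount-ro 0 1
UUID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee none swap sw 0 0
/dev/mapper/vg-srv /srv xfs defaults 0 2
"""
SAMPLE_PRESENT = {"11111111-2222-3333-4444-555555555555"}

for uuid, mount_point in missing_uuids(SAMPLE_FSTAB, SAMPLE_PRESENT).items():
    print(f"fstab references missing device UUID={uuid} ({mount_point})")
```

On a live host, `missing_uuids(Path("/etc/fstab").read_text(), present_uuids())` would run the same comparison against the real system.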
Mar 25 2024
@klausman hello, please see @Jhancock.wm's comment above. Thank you.
Mar 22 2024
@cmooney what works for you works for me as well
Mar 21 2024
@MoritzMuehlenhoff I tried the re-image again. Once the server rebooted after the OS install, the cookbook failed with the error below.
Exception raised while executing cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 250, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 658, in run
    fingerprint = self.puppet_installer.regenerate_certificate()[self.fqdn]
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 294, in regenerate_certificate
    raise PuppetHostsError(
spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are:
The dbprov2005 re-image is stuck at the Puppet run. When I log in to the server and try to run Puppet manually, I get the error below.
Error: The CRL issued by 'CN=Wikimedia_Internal_Root_CA,OU=Cloud Services,O=Wikimedia Foundation\, Inc,L=San Francisco,ST=California,C=US' has expired, verify time is synchronized
Error: The CRL issued by 'CN=Wikimedia_Internal_Root_CA,OU=Cloud Services,O=Wikimedia Foundation\, Inc,L=San Francisco,ST=California,C=US' has expired, verify time is synchronized
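The error leaves two possibilities: the CRL really is past its validity, or the host clock is wrong. One way to tell them apart is to compare the CRL's nextUpdate field with the current time. The sketch below assumes the field was obtained with something like `openssl crl -in <crl file> -noout -nextupdate`, which prints a line of the form `nextUpdate=Mar  1 00:00:00 2024 GMT`. The sample date and helper names are hypothetical, and month-name parsing assumes an English/C locale.

```python
from datetime import datetime, timezone


def parse_next_update(line):
    """Parse an openssl 'nextUpdate=...' line into an aware UTC datetime."""
    prefix = "nextUpdate="
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError(f"unexpected line: {line!r}")
    parsed = datetime.strptime(line[len(prefix):], "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def crl_expired(next_update_line, now=None):
    """True when `now` (default: current UTC time) is at or past nextUpdate."""
    now = now or datetime.now(timezone.utc)
    return now >= parse_next_update(next_update_line)


# Illustrative values only, not the real CRL dates
sample_line = "nextUpdate=Mar  1 00:00:00 2024 GMT"
check_time = datetime(2024, 3, 21, tzinfo=timezone.utc)
print("expired:", crl_expired(sample_line, now=check_time))
```

If the CRL is not expired according to a trusted clock but the host still reports it as expired, the host's time synchronization is the more likely culprit.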
Mar 20 2024
Removed all old cables and unracked 4 switches out of 8
Mar 19 2024
@Jhancock.wm please proceed with this task and let me know if you have any issues.
@RKemper Thank you.
This server is in the process of being decommissioned. No action needed. Resolving this task.
@fgiunchedi hello, I will be working with you on this tomorrow since @Jhancock.wm has some things to take care of at 16 UTC.
@RKemper hello, please see @Jhancock.wm's comment above.
Zeroize done on all the old switches in role a