User Details
- User Since
- Dec 18 2014, 3:39 PM (467 w, 2 d)
- Availability
- Available
- LDAP User
- Papaul
- MediaWiki User
- Unknown
Yesterday
@BTullis when you are back on Monday, can you please add ceph200[1-3] to partman-early-command.sh so that partman/custom/csphod.cfg can work with those nodes?
Thanks
@bking all yours
Thu, Nov 30
@Clement_Goubert @Joe all yours
fix
@Jhancock.wm I think you forgot to set up the 3 additional IPs for those nodes (Networking Setup: Speed:1G - VLAN:Private(?)/Public/Other(Specify) : AAAA records:Yes(?), Additional IP records (Cassandra)? Yes (3)). As a side note for next time: when you are provisioning any restbase node in Netbox, please add the number of instances listed on the rack/setup task. For example, this task says 3, so under "How many Cassandra instances" you pick 3. I will try to fix this tomorrow.
@Eevans if I add the other 3 IP addresses manually, will you be good, or do we have to re-image all the hosts?
@Eevans I don't know, since @Jhancock.wm did the provisioning and I just did the OS install, but I will check and let you know tomorrow. Thanks
@MoritzMuehlenhoff all yours
Wed, Nov 29
@bking all yours
@colewhite all yours
@Eevans all yours
Tue, Nov 28
The new restbase nodes use a different partman recipe than the one in the apt_repo.yaml file, so adding the right partman recipe for the new nodes is breaking puppet on the apt node:
'restbase201[3-9]|restbase202[0-7]':
  - reuse-parts.cfg
  - partman/custom/reuse-cassandrahosts-3ssd-jbod.cfg
'restbase202[8-9]|restbase203[0-5]':
  - partman/custom/cassandrahosts-3ssd-jbod.cfg:
On the apt node I get:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Class[Install_server::Preseed_server]: parameter 'preseed_per_hostname' entry 'restbase202[8-9]|restbase203[0-5]' index 0 expects a match for Install_server::Preseed_host::Config = Pattern[/[-\/\w]+\.cfg/], got Struct (file: /etc/puppet/modules/profile/manifests/installserver/preseed.pp, line: 23, column: 3) on node apt1001.wikimedia.org
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
I am about to roll back the changes.
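A likely reading of the error above: the trailing ':' after the last recipe makes YAML parse that list item as a mapping instead of a string, which is why Puppet reports "got Struct". The sketch below is only an illustration, not the Puppet type check itself; the function name invalid_entries and the hand-written config dict (standing in for parsed YAML) are assumptions.

```python
import re

# Same regex as in the Puppet error message. Puppet's Pattern type is
# unanchored, so re.search is used here rather than re.fullmatch.
CFG_PATTERN = re.compile(r"[-/\w]+\.cfg")


def invalid_entries(preseed_per_hostname):
    """Return (host_pattern, index, value) for entries that are not .cfg strings."""
    problems = []
    for hosts, recipes in preseed_per_hostname.items():
        for index, recipe in enumerate(recipes):
            if not (isinstance(recipe, str) and CFG_PATTERN.search(recipe)):
                problems.append((hosts, index, recipe))
    return problems


# Hand-written stand-in for what a YAML parser would produce: the trailing ':'
# turns the last list item into a mapping with a null value.
config = {
    "restbase201[3-9]|restbase202[0-7]": [
        "reuse-parts.cfg",
        "partman/custom/reuse-cassandrahosts-3ssd-jbod.cfg",
    ],
    "restbase202[8-9]|restbase203[0-5]": [
        {"partman/custom/cassandrahosts-3ssd-jbod.cfg": None},
    ],
}

problems = invalid_entries(config)
for hosts, index, value in problems:
    print(f"entry {hosts!r} index {index}: expected a .cfg string, got {type(value).__name__}")
```

Dropping the trailing ':' makes the item a plain string again, which this check would then accept.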
@Jhancock.wm this is now ready for OS install. I did a test on restbase2028: PASS. On 2029 it looks like the network cable is not plugged into NIC 1, since the MAC address I am seeing on the switch is different from the MAC address the server is trying to boot from. If you get on site tomorrow before me, you can take a look and update the task.
Thanks
Mon, Nov 27
- First issue on 1157: the serial port address was set to COM1 and not COM2.
- Second issue on 1157: the boot order was set to network then disk, making the server keep rebooting into PXE after each OS install.
- Third issue: the way the RAID was set up prevented booting from the correct disk (RAID 1 on the SSDs and RAID 0 on each other disk). I had to delete the RAID configuration, create only RAID 1 on the SSDs, and install the OS.
- Fourth issue: 1157 is sending the CSR to the puppetmaster and not to the puppetserver when you enter puppet 7 (I am having John and Richardo check this).
@Jclark-ctr can you double check the other hosts for issues 1 to 3 and fix them, but do not start the OS install until I hear something from the automation team.
Thanks
Nov 2 2023
@cmooney cable is in place, connected to lsw1-a2-codfw ge-0/0/46, ID 00756
@cmooney the order works for me
Nov 1 2023
@Jclark-ctr @VRiley-WMF those are now ready for OS install. Thanks
Oct 31 2023
Yes, we always pick the lower-numbered unit for a 2U host.
@Volans to get the prefix (ge vs xe), maybe use the rack. In codfw we have only 10G servers racked in the 10G racks, and those racks are A2, A4 and A7 in row A and B2, B4 and B7 in row B, so all the servers in those racks will be xe and those in the other racks in rows A and B will be ge. Hope this helps.
Thanks
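The rack rule described above could be sketched as follows. This is only an illustration: the function name interface_prefix is hypothetical, and the rule is only stated for codfw rows A and B, so other rows are rejected rather than guessed.

```python
# 10G racks in codfw rows A and B, as listed in the comment above.
TEN_G_RACKS = {"A2", "A4", "A7", "B2", "B4", "B7"}


def interface_prefix(rack):
    """Return 'xe' for a 10G rack and 'ge' for any other rack in rows A and B."""
    rack = rack.strip().upper()
    if rack[:1] not in ("A", "B"):
        # The rule was only described for rows A and B.
        raise ValueError(f"no rule known for rack {rack!r}")
    return "xe" if rack in TEN_G_RACKS else "ge"


for rack in ("A2", "A3", "B7"):
    print(rack, interface_prefix(rack))
```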
Oct 30 2023
@cmooney cable is in place from mr1-codfw ge-0/0/3 to lsw1-a2-codfw ge-0/0/47, ID 00745
Oct 27 2023
@cmooney for the cross-rack link it does make sense to use copper with 1000BaseT since we already have those on site.
On the other hand, since A2 is a 10G rack, we don't have any 1G servers racked in that rack and we will not have any in the future, so I think we can reserve the last 4 ports as 1G and connect the mr1 to one of those ports (xe-0/0/0/47).
Let me know what you think.
Oct 20 2023
@cmooney @Jhancock.wm checked the server: there was no IP address set on it, and she did reset it but that didn't resolve the issue. I asked her to upgrade the iDRAC firmware. I have seen that if the iDRAC version is too old, the Redfish API will sometimes not be able to connect to the server, so let's see what she comes up with tomorrow after updating the iDRAC. Thanks
Oct 18 2023
@nskaggs you are correct, even 1 additional rack isn't possible at this time. Sorry about that.
Oct 17 2023
@Jclark-ctr this is now fixed. You can try running the re-image again
Oct 16 2023
@nskaggs hello, it is true that codfw will be moving to the EVPN/VXLAN design, but codfw doesn't have enough racks to dedicate 2 racks to WMCS. We have a total of 32 racks in codfw: 2 network racks, 1 fundraising rack, 1 dedicated WMCS rack and 1 rack that is already full, which leaves us with 27 racks. If we have to dedicate another 2 racks to WMCS, we have 25 racks available, and right now all the other racks are getting close to full. So if we have to go with the 2 dedicated racks option for WMCS, we need to think about expansion.
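The rack count above works out as follows; a minimal sketch that only restates the numbers from the comment (the variable names are illustrative).

```python
total_racks = 32

# Racks that are not available for general use, per the comment above.
unavailable = {
    "network": 2,
    "fundraising": 1,
    "dedicated WMCS": 1,
    "already full": 1,
}

racks_left = total_racks - sum(unavailable.values())
racks_left_with_two_more_wmcs = racks_left - 2

print(racks_left)                     # 27
print(racks_left_with_two_more_wmcs)  # 25
```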
This is complete.
Oct 13 2023
@MatthewVernon you're welcome
Oct 12 2023
@Eevans thanks. What about next week, Monday the 16th at 10:00am CT?
@Jclark-ctr @Andrew this is now complete. I updated the switch ports as recommended @ https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network#Datacenter_network
@ayounsi yes, I checked the VLAN config on the switch and confirmed that the interface is in the right VLAN. The reason you cannot ssh into the host is that I left the OS install at the stage where I get the error message, so there is no OS or puppet run on the host. If I hit continue at that message, the OS install completes and the host reboots without an issue, but the security.debian.org lines in the sources.list file are commented out:
# Line commented out by installer because it failed to verify:
#deb http://security.debian.org/debian-security bullseye-security main contrib non-free
# Line commented out by installer because it failed to verify:
#deb-src http://security.debian.org/debian-security bullseye-security main contrib non-free
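To check other freshly installed hosts for the same symptom, something like the sketch below could list the entries the installer disabled. It is an assumption-laden illustration: the function name disabled_sources is hypothetical, and it relies only on the marker comment text shown above.

```python
MARKER = "# Line commented out by installer because it failed to verify:"


def disabled_sources(sources_list_text):
    """Return the apt source lines that the installer commented out."""
    lines = sources_list_text.splitlines()
    found = []
    # Each disabled entry is a '#deb...' line directly after the marker comment.
    for previous, current in zip(lines, lines[1:]):
        if previous.strip() == MARKER and current.lstrip().startswith("#deb"):
            found.append(current.lstrip()[1:].strip())
    return found


sample = """\
# Line commented out by installer because it failed to verify:
#deb http://security.debian.org/debian-security bullseye-security main contrib non-free
# Line commented out by installer because it failed to verify:
#deb-src http://security.debian.org/debian-security bullseye-security main contrib non-free
"""

disabled = disabled_sources(sample)
for entry in disabled:
    print(entry)
```

On a real host the text would come from reading /etc/apt/sources.list instead of the sample string.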
@cmooney hey, I am working on 2 nodes, cloudvirt1063 and 64, in the same rack (E4), and getting the message below. Can you please see why those nodes cannot reach the apt node? Thanks
Cannot access repository
The repository on security.debian.org couldn't be accessed, so its updates will not be made available to you at this time. You should investigate this later.
Oct 11 2023
@MatthewVernon sorry to hear that you are having issues with this server. I was able to set all the disks as JBOD like you asked. However, the installer is not able to install GRUB on /dev/sda, so I will leave that to you. Let me know if you have any questions.
@MoritzMuehlenhoff thanks
Oct 10 2023
Looking at the gerrit history for the late command, I also see that there were some changes made today. @jbond @Volans can you please also see if those changes could cause the errors we are getting above? Thanks
https://gerrit.wikimedia.org/r/q/late_command.sh
@MoritzMuehlenhoff I was getting the error above on cloudvirt1064 and wanted to drop into the virtual console to see the syslog, but when I restarted the image for the second time I got the error below during installation.
Cannot access repository
The repository on security.debian.org couldn't be accessed, so its updates will not be made available to you at this time. You should investigate this later.
Commented out entries for security.debian.org have been added to the /etc/apt/sources.list file.
<Go Back>
@Jclark-ctr OK, then the only thing left is to change it in netbox to use the public VLAN.
On cloudvirt1064 I am getting the error below during install. When you reboot the server you get the login prompt on the console, but since the install didn't complete, the cookbook failed, so I am looking into why the server is not able to run the command below.
Failed to run preseeded command
Execution of preseeded command "wget -O /tmp/late_command http://apt.wikimedia.org/autoinstall/scripts/late_command.sh && sh /tmp/late_command" failed with exit code 1.
We (dc-ops) have been receiving a lot of interface error alerts in the past month or so. Would it be possible to silence those alerts going to dc-ops until this is fully in place and working? It looks like dc-ops cannot do anything about those alerts.
{FIRING:5} InterfaceErros analytics dcops (enp130s0f0 node ops warning eqiad prometheus)
Thanks
Same server we already worked on this
Oct 6 2023
@akosiaris this is ready for service.
Please note this is the first time we are putting a Supermicro server in production, so any feedback will be great.
Thanks.
@Jhancock.wm when I was setting up the other kubernetes nodes I had the line below for all the nodes, but for some reason that line was replaced with:
node /^kubernetes20(0[5-9]|[1-4][0-9]|5[012356])\.codfw\./ {
    role(kubernetes::worker)
}
and this doesn't cover 2054. I will update site.pp with another patch. Thanks
https://gerrit.wikimedia.org/r/c/operations/puppet/+/951921/2/manifests/site.pp#1652
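A quick way to see which hosts the node regex above misses is to test it against a range of hostnames. This is a sketch under stated assumptions: the .wmnet suffix in the test hostnames and the widened character class in CANDIDATE_RE are illustrative guesses, not the actual patch.

```python
import re

# Regex copied from the site.pp node definition quoted above.
NODE_RE = re.compile(r"^kubernetes20(0[5-9]|[1-4][0-9]|5[012356])\.codfw\.")

# Hypothetical alternative that widens the last class to 5[0-6].
CANDIDATE_RE = re.compile(r"^kubernetes20(0[5-9]|[1-4][0-9]|5[0-6])\.codfw\.")


def uncovered(pattern, first, last):
    """Return the host numbers in [first, last] that the pattern does not match."""
    return [
        number
        for number in range(first, last + 1)
        # The .wmnet suffix is an assumption made for this example.
        if not pattern.search(f"kubernetes{number}.codfw.wmnet")
    ]


missing = uncovered(NODE_RE, 2005, 2056)
missing_after = uncovered(CANDIDATE_RE, 2005, 2056)
print(missing)        # [2054]
print(missing_after)  # []
```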
@Eevans hello, when do you think would be the best day for us to coordinate with you on relocating this node, so that we are not blocked by it during the codfw switch migration from VC to VXLAN/EVPN?
Oct 5 2023
@bking I tried to re-image cloudelastic1007, and the re-image finished the OS install without an issue. The part that failed was the puppet run, the reason being that the server was not in site.pp. Another issue is that the server is supposed to be on the public VLAN, but netbox is showing that it is placed in the private VLAN.
@VRiley-WMF @Jclark-ctr can someone add those servers to site.pp and fix netbox by putting those servers in the public VLAN, please?
@Jhancock.wm first things first, you need to upgrade the iDRAC.
@cmooney no problem
Oct 4 2023
I am thinking this is something to consider when doing server refreshes or racking new servers.
@cmooney this would be a complication if we had a mix of 1G and 10G servers within the same rack, which is not the case. In all existing 1G racks only 1G servers are racked, and likewise in the 10G racks only 10G servers are racked. The exception is rack x, which is a 10G rack where we have only one 1G server racked. I am already working on moving that server to a 1G rack before the migration to make things easy: https://phabricator.wikimedia.org/T348142.
Oct 3 2023
papaul@fasw-c-codfw# show |compare
[edit interfaces interface-range disabled]
member "ge-[0-1]/0/16" { ... }
+ member "ge-[0-1]/0/17";
[edit interfaces]
- ge-0/0/17 {
- description frauth2001:eth0;
- }
- ge-1/0/17 {
- description frauth2001:eth1;
- }
Hello all, I took a look at the AMD GPU and it uses 2x 8-pin connectors for power. According to the specs of the ml-staging servers we have in codfw, they come with the GPU-ready configuration cable install kit and 2 power supplies of 1600 W. The GPU needs a max of 300 W, and on the main board of the server we have 2x RSR of 225 W (see image below). For the GPU cable, we have some on site.
Oct 2 2023
We need to clean up the interfaces on the switch.
Sep 8 2023
@cmooney thanks for the update. I think we can reuse the MPO.
Sep 7 2023
Hey @MoritzMuehlenhoff, here is the decom task. Once done, you can just assign it back to me.
We will be using contint2001 and thumbor2004 to test the new codfw spine/leaf design. contint2001 will be renamed to sretest2003 and thumbor2004 to sretest2004. Both servers are plugged into port 41.
Please leave this task assigned to me and don't close it. Once the testing is done, please decommission the nodes and update this task.
Sep 6 2023
Thank you for putting the summary together. Another scenario I was thinking about while reading the document is upgrading Junos on devices that already have a configuration (upgrading existing switches/routers to a new Junos release). We cannot use ZTP in this case because we know it will reset the device to factory defaults.