User Details
- User Since
- Dec 18 2014, 3:39 PM (572 w, 2 d)
- Availability
- Available
- LDAP User
- Papaul
- MediaWiki User
- Unknown
Tue, Dec 2
@ssingh yes we have to depool the site, yes 10 AM CT
@ssingh We are planning on doing the first phase (loopback IP change on core routers and management router) of the ULSFO refresh next week, Dec 09th at 10:00am. Please let me know if this works for you and your team.
Wed, Nov 26
@RobH I updated the task description with all the connections that we need for phase 1 in December. Please don't forget the cable IDs. Please let me know if you have any questions. Thanks
Thu, Nov 20
@ayounsi sretest1005 is the same as 2004, see below. What you could maybe check is the Redfish/iDRAC version on sretest2004 and 1005.
Wed, Nov 19
@ayounsi as for the feedback, I will work on it
I think I am wrong about the public VLAN for rack 22. We will not be re-imaging the servers in that rack with the public VLAN, just changing the network mask from /28 to /27
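As an aside, a /28-to-/27 mask change keeps every host address and simply doubles the subnet. A minimal sketch with Python's `ipaddress` module (the prefix below is a documentation example, not the real rack 22 subnet):

```python
import ipaddress

# Hypothetical prefix for illustration; the actual subnet is not named here.
old = ipaddress.ip_network("198.51.100.0/28")
new = ipaddress.ip_network("198.51.100.0/27")

# 16 total addresses become 32; the old range is fully contained in the new one,
# so hosts keep their IPs and only the mask on each server changes.
print(old.num_addresses, new.num_addresses)
assert old.subnet_of(new)
```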
Both switches in drmrs are now running Junos 23.4R2-S5.8. @cmooney I am sending the task to you since you wanted to do the cloud switches.
@ayounsi Please see below the steps to disable LLDP in the BIOS for Dell servers.
Tue, Nov 18
I took a look at xe-1/0/8; you mentioned it was cp5002, but I saw dns5004, and then realized that this task has been open since 2020, 5 years ago. So now on port xe-1/0/8 we have dns5004.
papaul@asw1-eqsin> show lldp neighbors
Local Interface    Parent Interface    Chassis Id           Port info                                                                        System Name
[----]
xe-1/0/8           -                   84:16:0c:5d:9c:70    NIC 1/10/25Gb SFP+ DA Broadcom Adv. Dual 25Gb Ethernet fw_version:AFW_218.0.219.9
[---]

papaul@asw1-eqsin> show lldp neighbors interface xe-1/0/8
LLDP Neighbor Information:
Local Information:
Index: 734 Time to live: 120 Time mark: Mon Nov 17 21:42:59 2025 Age: 7 secs
Local Interface    : xe-1/0/8
Parent Interface   : -
Local Port ID      : 559
Ageout Count       : 0
@cmooney @ayounsi I updated the task with all the IPv4 and IPv6 addresses for the links, irbs and loopbacks. Please review and let me know if there is anything I need to change or add.
Mon, Nov 17
@ayounsi yes I can look into it. Thanks.
Thu, Nov 13
After swapping both PEM 2 and 3
re0.cr1-codfw> show chassis environment pem
PEM 0 status:
State Online
Temperature OK
DC Output Voltage(V) Current(A) Power(W) Load(%)
58 1 58 2
PEM 1 status:
State Online
Temperature OK
DC Output Voltage(V) Current(A) Power(W) Load(%)
58 32 1856 90
PEM 2 status:
State Online
Temperature OK
DC Output Voltage(V) Current(A) Power(W) Load(%)
58 0 0 0
PEM 3 status:
State Online
Temperature OK
DC Output Voltage(V) Current(A) Power(W) Load(%)
58 2 116 5

re0.cr2-codfw> show chassis environment pem
PEM 0 status:
State Online
Temperature OK
DC Output Voltage(V) Current(A) Power(W) Load(%)
59 0 0 0
PEM 1 status:
State Online
Temperature OK
DC Output Voltage(V) Current(A) Power(W) Load(%)
60 13 780 38
PEM 2 status:
State Online
Temperature OK
DC Output Voltage(V) Current(A) Power(W) Load(%)
57 0 0 0
PEM 3 status:
State Online
Temperature OK
DC Output Voltage(V) Current(A) Power(W) Load(%)
55 0 0 0

Nov 5 2025
@ssingh @Vgutierrez planning on doing this on Nov 19th at 10:00am CT. Thank you
Both firewalls are now running Junos 23.4R2-S5.5. Thanks to @Jgreen and @Dwisehaupt.
Closing this task now
Nov 3 2025
@cmooney I updated all the IPs to match the other POP sites. I will be re-running the configuration and validation sometime this week in my lab and will post back the results. I also updated the irb interface configuration. I will also update the IP addresses of the links to eqsin and codfw later in the description.
Oct 30 2025
@Dwisehaupt yes Wednesday 11/5 is ok with me. Let us do 10:00am CT. Thank you.
Oct 29 2025
@Dwisehaupt hello, yes we can do this during the maintenance window in November. Any day you prefer for that week? Thank you
We still have an ongoing email thread with Juniper to understand why the power is balanced across all PEMs in eqiad but not in codfw. Please see below for the last update we had from Juniper. Thanks.
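For reference, the cr1-codfw rows from the `show chassis environment pem` output quoted earlier can be sanity-checked numerically (my own illustration, not from the task): each row obeys Power = Volts × Amps, and almost the entire draw sits on a single PEM, which is the imbalance being discussed with Juniper.

```python
# PEM: (volts, amps, watts, load_pct) from re0.cr1-codfw above.
pems_cr1 = {
    0: (58, 1, 58, 2),
    1: (58, 32, 1856, 90),
    2: (58, 0, 0, 0),
    3: (58, 2, 116, 5),
}

# Each row is internally consistent: Power(W) == Voltage(V) * Current(A).
for pem, (v, a, w, load) in pems_cr1.items():
    assert v * a == w, f"PEM {pem}: {v}*{a} != {w}"

# Share of total chassis draw carried by PEM 1 alone.
total = sum(w for _, _, w, _ in pems_cr1.values())
share_pem1 = pems_cr1[1][2] / total
print(f"PEM 1 carries {share_pem1:.0%} of cr1-codfw's draw")  # → 91%
```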
Oct 28 2025
@cmooney thanks for the feedback, I will update the diagram to match the 100G links between the core routers and the switches and the type of transceivers needed.
Oct 23 2025
@elukey no problem
@ssingh thanks for the update. I am planning on doing it before Thanksgiving; any day during the week of November 17th works for me. Let me know if that works for you and I can get back to you with the exact day and time.
Oct 22 2025
While trying to use the firmware upgrade cookbook with "sudo cookbook sre.hardware.upgrade-firmware ms-be2078 --new" I get the error below, so I have to run the cookbook passing the flag for each component.
"sudo cookbook sre.hardware.upgrade-firmware ms-be2078 -c bios --new" works only for the BIOS, and when doing the same for the iDRAC I get the second error below.
Is it possible please to look into the code and see why this is failing? In the meantime I was able to manually upgrade the iDRAC. Thanks
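The workaround described above amounts to one cookbook invocation per component. A small sketch that only builds the command strings (nothing is executed here; the host and components are the ones quoted in this comment):

```python
def upgrade_cmd(host: str, component: str) -> str:
    """Build a per-component sre.hardware.upgrade-firmware invocation."""
    return f"sudo cookbook sre.hardware.upgrade-firmware {host} -c {component} --new"

# One run per component, since the combined run currently fails.
for component in ("bios", "idrac"):
    print(upgrade_cmd("ms-be2078", component))
```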
@elukey I think the next step will be to try to install the OS without setting up the boot disk and let the OS take care of it. Maybe this is one of the many cases where it is not possible to set the boot disk before the OS install.
Thanks.
@elukey can you please provide me with one of the nodes that is working, like you said, so I can check what is different between that node and the one that is not working?
@elukey @MatthewVernon thank you, that was very helpful information. Now I can answer your question:
"In UEFI Boot Mode, fixed media (see Hard Disk items in the earlier section) may or may not be added to the
boot sequence. Unlike legacy Boot Mode, in UEFI Boot Mode, the OS has the ability to add to and modify the
boot sequence"
@ssingh @Vgutierrez hello just checking in to see if you have a day and time for this for drmrs.
Thanks
Oct 21 2025
Can you please provide me with some context here on what we are trying to do? The only thing I see in the task is that we are testing UEFI mode on the node.
1- Are we moving from Debian 11 to Debian 12?
2- What partman recipe are we using for testing?
Oct 16 2025
I do agree with you that we should have a redundant link to another switch. I have also been thinking long term about the mgmt network design. If we run 2 links from the mr* to two different switches, we fix the issue of losing access to the mgmt network when we lose 1 switch, but we are not addressing what happens if we lose the mgmt router itself. I know some will say the mgmt network is not that critical. But if we are redesigning the mgmt network to have 2 links to 2 different switches, then we are putting in some type of redundancy; it is not full redundancy, though, because if the mgmt router goes down those 2 links are useless.
Sep 30 2025
The node sent the puppet request to the wrong puppet master. I cleaned it up; you can re-run the cookbook with the --no-pxe flag
pt1979@puppetmaster1001:~$ sudo puppet cert --list
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
  "dbprov1007.eqiad.wmnet" (SHA256) 52:F7:CA:C9:25:18:85:D7:1C:C7:6B:DA:77:51:80:41:C2:1F:83:FC:EF:AA:2B:82:FB:A3:C2:48:A6:56:8D:9A
@Jhancock.wm see below why the server is failing. You have 2 options: change the role in site.pp to the insetup role to finish the install, or have the server owner fix the puppet error below.
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Operator '[]' is not applicable to an Undef Value. (file: /srv/puppet_code/environments/production/modules/profile/manifests/kubernetes/node.pp, line: 140, column: 15) on node dse-k8s-worker2003.codfw.wmnet
Sep 29 2025
We decided that the next time we have an open window for this, @cmooney will drive the test himself. For now, the move is complete.
I had a meeting today with @Jgreen about the new switch configuration. What we will be doing is moving the frack-fundraising VLAN to the new rack. See below for the process:
- Create reth1 and add both et-0/1/0 and et-7/1/0 interfaces to it
redundancy-group 2 {
node 0 priority 100;
node 1 priority 1;
interface-monitor {
et-0/1/0 weight 255;
et-7/1/0 weight 255;
}
}
- Create and add interface reth1.2135 to the trust security zone after removing the IP address from reth0.2135
- Set up interface et-0/0/47 on both switches as tagged
- Have one server in f5 to test before moving all the servers
Sep 26 2025
Sep 25 2025
Sep 23 2025
last update from Juniper yesterday
Sep 22 2025
I added the second fabric link xe-0/2/2
We moved pfw1b-codfw today from rack C8 in DH7 to rack F5 in DH5 and all is back up online. Before the move we did some testing:
1- Disconnected the second HA link from both firewalls
All was good
2- Disconnected the first and second HA links; we lost connectivity to node 0 (unknown state) and node 1 passed to the ineligible state. After about 1 minute node 1 went into the disabled state, but we still had connectivity from eqiad to codfw
3- Connected back HA link 1; node 0 came back online and node 1 automatically rebooted and went into the hold state. After about a minute it went into the secondary state
Redundancy group: 1, Failover count: 0
    node0    0    lost        n/a    n/a    n/a
    node1    1    disabled    no     no     None
@Jclark-ctr cables are plugged into the wrong switch ports: nic 1 is connected to port xe-0/0/21 and nic 2 is connected to port xe-0/0/20; it should be the other way around, see Netbox
https://netbox.wikimedia.org/dcim/devices/3980/interfaces/
Output on the switch is showing that the MAC address ending with 91, which is nic 2, is connected to xe-0/0/20
Sep 18 2025
Update from Juniper after our phone call today:
Hello Teams, thank you for your time on our call. During our call we replaced the PEM, rebooted the chassis, and removed the PEMs one by one; we confirmed the power load balance behavior and we did not lose the router. We confirmed that the core issue of "unbalanced" power is not related to the physical cabling. It seems to be part of the router's design: the router will continue to pull power from only the PEMs it needs, leaving others in a standby state for efficiency and redundancy. The load that was on PEM 1, for example, will not shift to PEM 2 or PEM 3 just because you switched the cables. However, I will ask for time to continue checking internally; I will do an internal consultation and we will share updates in 24 hours or so. In the meantime, I will ask you to share the logs and outputs along with the session logs.
Output of today's troubleshooting:
Last login: Tue May 20 13:04:15 on ttyu0
Sep 16 2025
The BIO reader is now installed and working, so I am closing this task.
@cmooney we have the spare PEM on site. I need to get on a call with Juniper to troubleshoot this. Do you think Thursday would be a good day to put the router in maintenance mode? I can communicate the day and time to Juniper so we can work with them on this. I replaced PEM0 with the spare PEM sent by Juniper; there is a little bit of change: PEM0 is now getting power, but PEM1 is still pulling a lot of power.
Sep 15 2025
@VRiley-WMF es1056 added, you can resume with your install.
Sep 12 2025
@VRiley-WMF the issue is that es1056 is missing in this patch:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1172182/1/modules/profile/data/profile/installserver/preseed.yaml
You can add es1056 and add me as a reviewer so I can +2 your change.
We tested the last 2 cross cage links for the frack migration and all is working now. We are ready for the move on the 22nd.
Juniper shipped out a new PEM to replace PEM0 and see if that will fix the issue.
Complete
Sep 10 2025
Done on the switch side
Done on the switch side
mr1-eqsin and cr2/3-eqsin are now running BGP for the management network. Resolving this task. Thanks @ayounsi
Sep 9 2025
Tested all the cross cage links (7); only 2 links are not coming up. I will do more testing tomorrow.
Sep 8 2025
BGP is up between mr1-eqsin and cr2/3-eqsin
mr1-eqsin# run show route protocol ospf

