codfw: Testing Out Sample PDUs
Closed, ResolvedPublicRequest

Assigned To
Authored By
wiki_willy
Oct 14 2020, 12:09 AM
Referenced Files
F34688676: PXL_20211014_143809881.jpg
Oct 14 2021, 3:11 PM
F34469889: CamScanner 05-26-2021 10.38.pdf
May 26 2021, 4:22 PM
F34429804: UnityMib.txt
Apr 27 2021, 1:55 PM
F34185706: PXL_20210324_160518168.jpg
Mar 24 2021, 4:22 PM
F34114558: TS1051092PDF.PDF
Feb 20 2021, 12:44 AM
F33913382: EMOCMX36JJ2J2_2.pdf
Nov 12 2020, 8:12 PM
F32383878: 20200909-ECX-WIKI [48C13]_DEV01.PDF
Oct 14 2020, 12:09 AM

Description

This task is to review the technical specs for the following PDU options:

  • Chatsworth 48port -
  • Enconnex 48port -

And then test out the samples at codfw, to determine if they are better alternatives to our existing ServerTech PDUs.

Thanks,
Willy

Event Timeline

wiki_willy created this task.
Restricted Application added a subscriber: Aklapper.

Latest update - looks like we need to order a minimum of 100 of the Enconnex (because it's customized), so let's scrap that one. Some additional details I gathered for the Chatsworth are below:

Model Number - EA-3126-C
Cost - $1650
Master/Expansion - None, but the PDUs can be daisy chained together
Evaluation Period - Typically 2-3 weeks

Thanks,
Willy

Specs for Eaton PDU attached:

No Master/Expansion, but PDUs can be linked together
Sample PDU can be sent in 2-3 weeks

fgiunchedi added a subscriber: fgiunchedi.

Once the PDUs are installed, please let observability know. At minimum we'd need to test LibreNMS discovery and get their SNMP MIB into snmp-exporter for pulling power data into Prometheus.

Thanks @fgiunchedi . Now that the holidays are over, I'm re-engaging the vendor on discussions. After all the paperwork and stuff, my guess is we'll probably have the sample PDUs onsite in about a month or so, but we'll be sure to update the Phab task as we make progress. Thanks, Willy

Updating task with the new single-row Chatsworth design. It's not already supported by LibreNMS, so it looks like we would have to add it. A few other notes I took from our meeting with the account reps:

  • 3-year warranty (they can usually send the RMA in 2 days, then we ship the broken PDU back)
  • 31 days to test the sample PDU (though we can keep it longer if needed)
  • 3-phase is color coded
  • clips hold the power plugs in
  • switching capability is available on other models
  • the controller module can be swapped
  • field failure rate is less than 0.5%
  • MTBF of 1.7532 million hours
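
As a rough sanity check on the vendor figures above, the quoted MTBF can be converted into an annualized failure rate. This is only a sketch: it assumes a constant failure rate (exponential model) and a 365.25-day year, neither of which comes from the vendor.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours; assumed average year length


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that a unit fails within one year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


mtbf_hours = 1.7532e6  # MTBF quoted by the account reps
afr = annualized_failure_rate(mtbf_hours)
print(f"MTBF in years: {mtbf_hours / HOURS_PER_YEAR:.1f}")
print(f"Annualized failure rate: {afr:.3%}")
```

Under those assumptions the quoted MTBF works out to about 200 years, and the annual failure rate lands just under 0.5%, which is in line with the field failure rate quoted by the reps.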

Received the test PDU. I tested both colored cables, the one we bought first (red in the image) and the one that was sent to us for testing (blue in the image); both fit well with the lock mechanism on the PDU. The lock on the PDU secured the cable very well, better than the

PXL_20210324_160518168.jpg (4×3 px, 3 MB)

I will open a ticket with CY1 to come and connect the new PDU for testing

First issues:

  • The mounting buttons don't align with the mounting bracket in the rack
  • The PDU uses a USB temperature and humidity sensor, whereas we have RJ11 temperature and humidity sensors
  • The panel is upside down
  • The PDU can only be linked with the same PDU or a PDU that has an RJ45 port

Change 676098 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add test pdu ps-test-d8-codfw

https://gerrit.wikimedia.org/r/676098

Change 676098 merged by Papaul:

[operations/puppet@production] Add test pdu ps-test-d8-codfw

https://gerrit.wikimedia.org/r/676098

The test PDU is online. It is being monitored in LibreNMS:
https://librenms.wikimedia.org/device/203
The only thing left is the setup in Icinga.

@fgiunchedi ^

Sounds good @Papaul ! So in Icinga we're monitoring each phase to see if it hits 80%/85% of the 30A breaker, and in Prometheus we're collecting most of what we can via snmp (current, voltage, sensors).
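
To make the thresholds above concrete, here is a small sketch of a per-phase check. The function and variable names are hypothetical, the readings are made up, and mapping 80% to warning and 85% to critical is an assumption; the actual Icinga check may be implemented differently.

```python
BREAKER_AMPS = 30.0
WARN_FRACTION = 0.80  # assumed to be the warning level
CRIT_FRACTION = 0.85  # assumed to be the critical level


def phase_status(current_amps: float, breaker_amps: float = BREAKER_AMPS) -> str:
    """Classify one phase reading against the breaker rating."""
    if current_amps >= breaker_amps * CRIT_FRACTION:
        return "CRITICAL"
    if current_amps >= breaker_amps * WARN_FRACTION:
        return "WARNING"
    return "OK"


# Hypothetical readings for the three phases, in amps.
readings = {"L1": 12.4, "L2": 24.3, "L3": 26.0}
statuses = {phase: phase_status(amps) for phase, amps in readings.items()}
print(statuses)
```

With a 30A breaker the two thresholds work out to 24 A and 25.5 A per phase.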

@wiki_willy looks like the PDU is (somewhat) supported by LibreNMS so that's good news! (although I'm not seeing sensors data and inlet power, maybe that's expected for the sensors?).
Could we ask the vendor which SNMP OIDs we need to use to get input phase(s) data (volt/amp/etc) and sensors data? We could probably work it out ourselves via the MIB, but I think it's better to ask. Thank you!

Thanks for the feedback @fgiunchedi. We plan on setting up a follow-up meeting with the vendor next week to provide them some feedback, so we'll be sure to pass along your questions. Let us know though, if you have any other questions/concerns that pop up in the meantime. Thanks, Willy

@fgiunchedi please see below for the information we received from the vendor

Please find the unity MIB attached here. Our team uses a MIB browser, such as iReasoning to access the MIB. You should see all the OID information here. If you would like our team to go over the MIB file, we are certainly happy to do so. Please let me know and I’ll set up a call.
https://www.ireasoning.com/

Thank you @Papaul, could you forward the attached mib? I'll take a look, though I think a call will be best

@fgiunchedi please see the document you requested

Thank you @Papaul, today I poked a little at the LibreNMS Chatsworth support and it looks like the current support is not complete (for sure not as complete as sentry3/sentry4): we'd need to add support for inbound current and environmental monitors. I can dedicate some time this quarter to this. @wiki_willy, what's the timeline for the testing phase?

Also a few questions that popped up while reading the MIB:

  • Are we going to use multiple PDUs chained together?
  • For input current the MIB has "line" and "branch" concepts. I'm not super familiar with these, and it would be great to clarify what's what in the daisy-chained case too
  • Temperature is in degrees F (not a big deal, need to check if librenms can convert for us under the hood)
  • @Papaul I see the sensor connected, there is one reading for temp and one for humidity; and two other readings for temp/humidity with bogus values. Does the sensor have two probes physically or just one ?

Hi @fgiunchedi - Chatsworth has been pretty flexible with the amount of time we have for testing it, so I think we should be ok keeping it for a longer duration. Just let us know approximately how long we would need to keep it for, and we can pass the info along to them. Or if this PDU doesn't seem that great from a monitoring and software compatibility aspect, let us know as well. We want to make sure the PDU makes sense for everyone across the org...and we can always pass on this and try out the next manufacturer if it makes more sense. Thanks for all the help, it's definitely appreciated. ~Willy

Thank you @wiki_willy that makes sense! I think we should aim at keeping the PDU until the end of the quarter if possible, if not maybe another 3-4 weeks?

WRT monitoring/software support, certainly having a PDU already fully supported by librenms is ideal. We have to do some work anyways for the Prometheus bits (and we did some LibreNMS work when moving from sentry3 to sentry4 IIRC) so definitely not a blocker on my end. If the vendor will work with us to improve LibreNMS support upstream for their products that will of course benefit everyone I think!

@fgiunchedi I changed the temperature to Celsius. The sensor has only 1 probe, but the PDU can take up to 2 sensors.

Thank you for the update on the sensor/probe! I can see the change F -> C in the SNMP option but not in the sensor values; it looks to me like the values are fixed in F. That's fine as we can convert. I wonder what the setting is for though, perhaps it changes the web interface (?)

-CPI-UNIFIED-MIB::celsiusTemp.0 = INTEGER: fahrenheit(0)
+CPI-UNIFIED-MIB::celsiusTemp.0 = INTEGER: celsius(1)
-CPI-UNIFIED-MIB::cpiPduSensorValue.1.'.000ED'.51.48.49.53.51.56.52 = INTEGER: 8832
-CPI-UNIFIED-MIB::cpiPduSensorValue.2.'.000ED'.51.48.49.53.51.56.52 = INTEGER: 3532
+CPI-UNIFIED-MIB::cpiPduSensorValue.1.'.000ED'.51.48.49.53.51.56.52 = INTEGER: 8826
+CPI-UNIFIED-MIB::cpiPduSensorValue.2.'.000ED'.51.48.49.53.51.56.52 = INTEGER: 3524
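
For the conversion mentioned above, a minimal sketch. It assumes the raw cpiPduSensorValue integers are hundredths of a unit, so that 8826 would be 88.26 degrees F; that scaling is inferred from the values and has not been confirmed against the MIB. The function names are hypothetical.

```python
def raw_to_fahrenheit(raw: int) -> float:
    """Assumed scaling: the raw value is hundredths of a degree Fahrenheit."""
    return raw / 100.0


def fahrenheit_to_celsius(deg_f: float) -> float:
    """Standard Fahrenheit to Celsius conversion."""
    return (deg_f - 32.0) * 5.0 / 9.0


raw_temp = 8826  # cpiPduSensorValue.1 after the setting change
deg_f = raw_to_fahrenheit(raw_temp)
deg_c = fahrenheit_to_celsius(deg_f)
print(f"{deg_f:.2f} F = {deg_c:.2f} C")
```

If the scaling assumption holds, the reading is roughly 31 degrees C, which would be plausible for a sensor sitting in a rack.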

re: chaining, do you know if we're going to use multiple PDUs per rack ?

I did some work on this last week; there are temporary patches on netmon1002 to get things going at least minimally and collect voltage/current/power/etc from the PDU's branches. I ran into trouble with conditional discovery and asked upstream about it: https://community.librenms.org/t/skipping-values-based-on-oids-in-another-table-with-yaml-discovery/15689

Received the second test PDU

On the Raritan PDU all the power cords fit well, and the PDU also mounts well in our rack. The only problem is that the plugs are in 2 rows and not in a single row like the other PDUs. I will open a ticket with CY1 to plug it in tomorrow to test the software part.

Ticket no. 1979021 was created to request plugging in the Raritan PDU in rack D8.

Change 698007 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] ADD new Raritan test PDU

https://gerrit.wikimedia.org/r/698007

Change 698007 merged by Papaul:

[operations/puppet@production] ADD new Raritan test PDU

https://gerrit.wikimedia.org/r/698007

@fgiunchedi new Raritan test PDU is racked in D8
https://netbox.wikimedia.org/dcim/devices/3427/
still working on librenms
https://librenms.wikimedia.org/device/206
the host is already in Icinga, but you still need to do what you usually do to make it work
you can download the full MIB by logging in to the PDU web interface and then clicking on maintenance

Let me know if you have any questions

Thank you @Papaul ! I see the device in librenms but looks like discovery isn't working. I've removed and added the device again without success: https://librenms.wikimedia.org/device/208/overview

And indeed even running discovery.php from command line shows snmpget fails, which I can't reproduce:

# ./discovery.php -h ps2-test-d8-codfw.mgmt.codfw.wmnet -d                    
LibreNMS Discovery                                 
SQL[select `migration` from `migrations` order by `id` desc limit 1 [] 1.59ms]
                                                   
SQL[select count(*) as aggregate from `migrations` limit 1 [] 2.44ms]
                                                   
SQL[SELECT version() [] 1.13ms]                    
                                                   
===================================                
Version info:                                      
Commit SHA: 692b5d5f5390041b34749c4cc3f1241759d21b51
Commit Date: 1618994912
DB Schema: 2021_04_08_151101_add_foreign_keys_to_port_group_port_table (1240)
PHP: 7.3.27-1~deb10u1
MySQL: 10.4.18-MariaDB-log
RRDTool: 1.7.1
SNMP: NET-SNMP 5.7.3
==================================DEBUG!
Updating os_def.cache  
SQL[SELECT * FROM `devices` WHERE disabled = 0 AND snmp_disable = 0 AND `hostname` LIKE 'ps2-test-d8-codfw.mgmt.codfw.wmnet' ORDER BY device_id DESC [] 2.47ms] 
  
SQL[select * from `devices` where `device_id` = ? limit 1 [208] 1.87ms] 
  
SQL[select * from `devices_attribs` where `devices_attribs`.`device_id` = ? and `devices_attribs`.`device_id` is not null [208] 1.57ms] 
  
ps2-test-d8-codfw.mgmt.codfw.wmnet 208 generic SQL[select `hostname`, `overwrite_ip` from `devices` where `hostname` = ? limit 1 ["ps2-test-d8-codfw.mgmt.codfw.wmnet"] 1.33ms] 
  
[FPING] '/usr/bin/fping' '-e' '-q' '-c' '3' '-p' '500' '-t' '500' 'ps2-test-d8-codfw.mgmt.codfw.wmnet'
  
response:  {"xmt":3,"rcv":3,"loss":0,"min":32.0,"max":32.1,"avg":32.0,"dup":0,"exitcode":0} 
SQL[select `device_groups`.*, `device_group_device`.`device_id` as `pivot_device_id`, `device_group_device`.`device_group_id` as `pivot_device_group_id` from `device_groups` inner join `device_group_device` on `device_groups`.`id` = `device_group_device`.`device_group_id` where `device_group_device`.`device_id` = ? [208] 2.24ms] 
  
SQL[select exists(select * from `alert_schedule` where (`start` <= ? and `end` >= ? and (`recurring` = ? or (`recurring` = ? and ((time(`start`) < time(`end`) and time(`start`) <= ? and time(`end`) > ?) or (time(`start`) > time(`end`) and (time(`end`) <= ? or time(`start`) > ?))) and (`recurring_day` like ? or `recurring_day` is null)))) and (exists (select * from `devices` inner join `alert_schedulables` on `devices`.`device_id` = `alert_schedulables`.`alert_schedulable_id` where `alert_schedule`.`schedule_id` = `alert_schedulables`.`schedule_id` and `alert_schedulables`.`alert_schedulable_type` = ? and `alert_schedulables`.`alert_schedulable_id` = ?) or exists (select * from `device_groups` inner join `alert_schedulables` on `device_groups`.`id` = `alert_schedulables`.`alert_schedulable_id` where `alert_schedule`.`schedule_id` = `alert_schedulables`.`schedule_id` and `alert_schedulables`.`alert_schedulable_type` = ? and `alert_schedulables`.`alert_schedulable_id` in (?)))) as `exists` ["2021-06-04T10:21:03.544519Z","2021-06-04T10:21:03.544519Z",0,1,"10:21:03","10:21:03","10:21:03","10:21:03","%","device",208,"device_group",10] 2.52ms]
SQL[INSERT IGNORE INTO `device_perf` (`xmt`,`rcv`,`loss`,`min`,`max`,`avg`,`device_id`,`timestamp`,`debug`)  VALUES (:xmt,:rcv,:loss,:min,:max,:avg,:device_id,NOW(),:debug) {"xmt":3,"rcv":3,"loss":0,"min":32,"max":32.1,"avg":32,"device_id":208,"debug":"[]"} 2.13ms]
SQL[UPDATE `devices` set `last_ping`=?,`last_ping_timetaken`=? WHERE device_id=? ["2021-06-04 10:21:03",32,208] 1.82ms] 
  
SNMP Check response code: 1  
SNMP['/usr/bin/snmpget' '-v2c' '-c' 'COMMUNITY' '-Oqv' '-m' 'SNMPv2-MIB' '-M' '/srv/deployment/librenms/librenms/mibs' 'udp:HOSTNAME:0' 'sysObjectID.0']
Exitcode: 1                                        
snmpget: Failure in sendto (Sub-id not found: (top) -> sysObjectID) (Invalid argument)
                                                   
snmpget: Failure in sendto (Sub-id not found: (top) -> sysObjectID) (Invalid argument)
SNMP Unreachable
./discovery.php ps2-test-d8-codfw.mgmt.codfw.wmnet 2021-06-04 10:21:03 - 0 devices discovered in 1.323 secs
SNMP [2/0.02s]: Get[2/0.02s] Getnext[0/0.00s] Walk[0/0.00s]
MySQL [3/0.01s]: Cell[1/0.00s] Row[-1/-0.00s] Rows[1/0.00s] Column[0/0.00s] Update[1/0.00s] Insert[1/0.00s] Delete[0/0.00s]
Graphite connection made to graphite-in.eqiad.wmnet
Graphite [0/0.00s]:                                
RRD [0/0.00s]:

Also oddly enough adding the device from command line via addhost.php doesn't pass the initial validation when trying to pull data via snmp whereas adding the device from the web ui works.

Ok I think I got the correct addhost.php incantation to add the device and get discovery to work properly:

./addhost.php ps2-test-d8-codfw.mgmt.codfw.wmnet COMMUNITY v2c 161 udp

Without "161 udp" for some reason the snmpget call tried port 0 which obviously doesn't work. New device is at https://librenms.wikimedia.org/device/209 and we'll see what librenms makes of it.
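
The failure above can be illustrated with a small sketch that builds an snmpget invocation similar to the one in the debug output, but refuses a missing or zero port. The helper name is hypothetical, and this is not how LibreNMS builds the command internally; it only shows why the transport and port need to be explicit in the target.

```python
from typing import List


def build_snmpget_cmd(host: str, community: str, port: int = 161,
                      transport: str = "udp") -> List[str]:
    """Build an snmpget v2c command line with an explicit transport and port."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid SNMP port: {port}")
    target = f"{transport}:{host}:{port}"
    return ["snmpget", "-v2c", "-c", community, "-Oqv",
            "-m", "SNMPv2-MIB", target, "sysObjectID.0"]


cmd = build_snmpget_cmd("ps2-test-d8-codfw.mgmt.codfw.wmnet", "COMMUNITY")
print(" ".join(cmd))
```

Passing port 0, as in the failing discovery run, raises a ValueError here instead of producing a target like udp:HOSTNAME:0.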

The discovery has worked as expected now and data is collected, I'm seeing only one input for volts/hertz/current as opposed to individual phases, and for current the outlets (all zeros) are also displayed. Next up will be to figure out if we can get distinct phases monitored

Environment sensors are not attached and thus not displayed, @Papaul will connect them.

@wiki_willy The Raritan PDU uses a different type of sensor than the Chatsworth, so we will have to ask Rahi to send us a Raritan smart sensor. Thanks

Thanks @Papaul. So in terms of feedback for Raritan, so far it's:

  • convert PDU to one row of plugs (instead of 2 rows)
  • request a Raritan smart sensor

@fgiunchedi or @Papaul - do you have any additional feedback or requests for Raritan? We can always set up a meeting with them if there are a lot of questions.

Thanks,
Willy

I have a question for us: ATM librenms supports inlet and outlet power monitoring for this Raritan PDU, although I think we'd need/want individual phases monitoring (for inlet) ? That's something the PDU supports AFAICT but we'd need to add to librenms.

Hi @fgiunchedi - sorry for the delay. Just to do a quick check before you put a lot of time and effort in....from your perspective on the monitoring side of things, do you have a preference between the Chatsworth or Raritan PDU?

Raritan worked out of the box in librenms so that's certainly a plus, but there's no phase monitoring ATM so I think we'll need to add that (? not sure, see my previous question). I think so far my vote goes to Raritan (all other things being equal, e.g. if the temperature sensors work, and the PDU as a whole is ok with dcops)

Hi Filippo - is the following link the one used for phase monitoring?

https://librenms.wikimedia.org/device/device=108/tab=health/metric=current/

If so, that's correct...we'll need that part to work in librenms. There's actually just a few fields that we really set thresholds for in monitoring the PDUs, which are listed here: T247358#6118215. Hope this helps a little, but feel free to DM me and we can go over it more if needed. Much appreciated for all the help! Thanks, Willy

Yes that's correct

Ok thank you! I'll look into having per-phase data as well, and will reach out if I have questions

@fgiunchedi the sensors for the Raritan PDU are in place

Thank you @Papaul ! I can confirm temperature/humidity now work out of the box: https://librenms.wikimedia.org/device/209/health

What's left is phase monitoring which I'll be looking into next

Back to @Papaul since work is on hold AIUI

Received the PDU today.
Mounting the PDU in the rack was very easy: I took the PDU out of the box and put it straight in the rack with no adjustment. It is also very light compared to the other PDUs.
The power cable fits well in the PDU and the locking mechanism is great. The only issue I am seeing is that it is plastic, so it can break very easily.
I will coordinate with CY1 to plug the PDU in rack D1 sometime next week for testing.

Downtimed ps1-d1-codfw until 2021-11-08 14:13:13 UTC on Icinga

Note: The new Eaton PDU has only SNMP V1 and V3

@fgiunchedi I got in touch with the Eaton technical support team; they told me that the only options for SNMP are v1 and v3. I really don't like the idea of using SNMP v1, and I don't know if we are thinking of using SNMP v3 since we don't use it anywhere in the infrastructure. Let me know your thoughts on this.
The PDU is already online but not in LibreNMS.

Thanks.

Thanks @Papaul, that's unfortunate re: v2 support but thanks for investigating further. We could try with v3 and see what happens though

I spent some time today with this, I can successfully snmpwalk the device, yet librenms refuses to add it. So on the device's end things seem to be working as expected, still figuring out the librenms side
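
For reference, a v3 walk along these lines can be scripted as below. This is a sketch only: the host, user name and passphrases are placeholders, the choice of authPriv with SHA/AES is an assumption (the algorithms the Eaton PDU supports are not stated in this task), and the flags are the standard net-snmp ones.

```python
import shlex
from typing import List


def build_snmpwalk_v3_cmd(host: str, user: str, auth_pass: str, priv_pass: str,
                          auth_proto: str = "SHA",
                          priv_proto: str = "AES") -> List[str]:
    """Build an snmpwalk command using SNMPv3 with authentication and privacy."""
    return ["snmpwalk", "-v3", "-l", "authPriv", "-u", user,
            "-a", auth_proto, "-A", auth_pass,
            "-x", priv_proto, "-X", priv_pass, host]


# All values below are placeholders, not real credentials or hosts.
cmd = build_snmpwalk_v3_cmd("pdu.example.org", "monitor", "AUTHPASS", "PRIVPASS")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run as-is, which avoids shell quoting problems with passphrases.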

Ok device added (with a temporary password), apparently librenms with zero v3 configuration specified won't work

@Papaul the device is now collecting data, however AFAICS only outlets are discovered for now. Let's see when more data is accumulated if the numbers are sensible.

Upstream issue: https://github.com/librenms/librenms/issues/13390

Change 732298 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] librenms: add snmp v3 dummy section

https://gerrit.wikimedia.org/r/732298

@fgiunchedi thank you for getting the monitoring part up and running.

Change 732298 merged by Filippo Giunchedi:

[operations/puppet@production] librenms: add snmp v3 dummy section

https://gerrit.wikimedia.org/r/732298

I did some more work on Eaton PDU and librenms today. From the snmpwalk output at P17569 I think all the data we need (per-phase data, environmental data at least) is there, and shouldn't be a whole lot of effort to add to librenms (perhaps a week of work). Out of the box we get per-outlet current data only in librenms: https://librenms.wikimedia.org/device/215/health

I think LibreNMS support is doable with a little bit of work, which would of course be contributed upstream, so it'll help us and others too. Let me know what you think @Papaul @wiki_willy

Mentioned in SAL (#wikimedia-operations) [2021-11-15T18:52:19Z] <volans@cumin2002> START - Cookbook sre.hosts.downtime for 40 days, 0:00:00 on ps1-d1-codfw with reason: Testing new PDU devices T265435

Mentioned in SAL (#wikimedia-operations) [2021-11-15T18:52:23Z] <volans@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 40 days, 0:00:00 on ps1-d1-codfw with reason: Testing new PDU devices T265435

This is complete.