This task is to review the technical specs for the following PDU options:
- Chatsworth 48port -
- Enconnex 48port -
And then test out the samples at codfw, to determine if they are better alternatives to our existing ServerTech PDUs.
Thanks,
Willy
wiki_willy | |
Oct 14 2020, 12:09 AM |
F34688676: PXL_20211014_143809881.jpg | |
Oct 14 2021, 3:11 PM |
F34469889: CamScanner 05-26-2021 10.38.pdf | |
May 26 2021, 4:22 PM |
F34429804: UnityMib.txt | |
Apr 27 2021, 1:55 PM |
F34185706: PXL_20210324_160518168.jpg | |
Mar 24 2021, 4:22 PM |
F34114558: TS1051092PDF.PDF | |
Feb 20 2021, 12:44 AM |
F33913382: EMOCMX36JJ2J2_2.pdf | |
Nov 12 2020, 8:12 PM |
F32383878: 20200909-ECX-WIKI [48C13]_DEV01.PDF | |
Oct 14 2020, 12:09 AM |
Latest update - looks like we need to order a minimum of 100 of the Enconnex (because it's customized), so let's scrap that one. Some additional details I gathered for the Chatsworth are below:
Model Number - EA-3126-C
Cost - $1650
Master/Expansion - None, but the PDUs can be daisy chained together
Evaluation Period - Typically 2-3 weeks
Thanks,
Willy
Specs for Eaton PDU attached:
No Master/Expansion, but PDUs can be linked together
Sample PDU can be sent in 2-3 weeks
Once the PDUs are installed, please let observability know. At minimum we'd need to test LibreNMS discovery, and feed their SNMP MIB to snmp-exporter for pulling power data into Prometheus.
Thanks @fgiunchedi . Now that the holidays are over, I'm re-engaging the vendor on discussions. After all the paperwork and stuff, my guess is we'll probably have the sample PDUs onsite in about a month or so, but we'll be sure to update the Phab task as we make progress. Thanks, Willy
Updating the task with the new single-row Chatsworth design. It's not already supported by LibreNMS, so it looks like we would have to add it. A few other notes I took from our meeting with the account reps: 3-year warranty (they can usually send the RMA in 2 days, then we ship the broken PDU back), 31 days to test the sample PDU (though we can keep it longer if needed), 3-phase is color coded, clips hold the power plugs in, switching capability is available on other models, the controller module can be swapped, the field failure rate is less than 0.5%, and the MTBF is 1.7532 million hours.
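As a rough sanity check on the two reliability figures quoted above, the MTBF can be converted into an annualized failure rate. This is only a sketch: it assumes a constant failure rate (exponential model) and an 8766-hour year, neither of which the vendor stated, and the function name is made up for illustration.

```python
import math

# Assumption: one year = 365.25 days * 24 h. The vendor did not say which
# convention (if any) is behind their numbers.
HOURS_PER_YEAR = 8766


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability of at least one failure in a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


mtbf_hours = 1.7532e6  # MTBF quoted by the account reps
mtbf_years = mtbf_hours / HOURS_PER_YEAR
afr = annualized_failure_rate(mtbf_hours)

print(f"MTBF ~ {mtbf_years:.0f} years, AFR ~ {afr:.3%}")
# prints: MTBF ~ 200 years, AFR ~ 0.499%
```

Under those assumptions the quoted MTBF works out to about 200 years, which lines up with the "less than 0.5%" field failure rate if that figure is meant per year.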
Received the test PDU. I tested both colored cables, the one we bought first (red in the image) and the one that was sent to us for testing (blue in the image); both fit well with the lock mechanism on the PDU. The lock on the PDU secures the cable very well, better than the
I will open a ticket with CY1 to come and connect the new PDU for testing.
First issue :
Change 676098 had a related patch set uploaded (by Papaul; author: Papaul):
[operations/puppet@production] Add test pdu ps-test-d8-codfw
Change 676098 merged by Papaul:
[operations/puppet@production] Add test pdu ps-test-d8-codfw
The test PDU is online. It's being monitored in LibreNMS:
https://librenms.wikimedia.org/device/203
The only thing left is the setup in icinga
Sounds good @Papaul ! So in Icinga we're monitoring each phase to see if it hits 80%/85% of the 30A breaker, and in Prometheus we're collecting most of what we can via snmp (current, voltage, sensors).
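The per-phase check described above boils down to comparing each phase's current against 80% and 85% of the breaker rating, i.e. 24 A and 25.5 A for a 30 A breaker. A minimal sketch of that logic follows; the function name, the status strings, and the use of ">=" at the boundaries are assumptions for illustration, not the actual Icinga check.

```python
BREAKER_AMPS = 30.0    # breaker rating mentioned in the comment above
WARN_FRACTION = 0.80   # warning at 80% of the breaker rating
CRIT_FRACTION = 0.85   # critical at 85% of the breaker rating


def phase_status(current_amps: float, breaker_amps: float = BREAKER_AMPS) -> str:
    """Classify one phase's current draw against the breaker thresholds."""
    if current_amps >= breaker_amps * CRIT_FRACTION:
        return "CRITICAL"
    if current_amps >= breaker_amps * WARN_FRACTION:
        return "WARNING"
    return "OK"


# Hypothetical readings for three phases, in amps.
for phase, amps in {"L1": 12.3, "L2": 24.6, "L3": 26.1}.items():
    print(phase, amps, phase_status(amps))
```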
@wiki_willy looks like the PDU is (somewhat) supported by LibreNMS so that's good news! (although I'm not seeing sensors data and inlet power, maybe that's expected for the sensors?).
Could we ask the vendor which SNMP OIDs we need to use to get input phase(s) data (volt/amp/etc.) and sensor data? We could probably work it out ourselves via the MIB, but I think it's better to ask. Thank you!
Thanks for the feedback @fgiunchedi. We plan on setting up a follow-up meeting with the vendor next week to provide them some feedback, so we'll be sure to pass along your questions. Let us know though, if you have any other questions/concerns that pop up in the meantime. Thanks, Willy
@fgiunchedi please see below for the information we received from the vendor
Please find the unity MIB attached here. Our team uses a MIB browser, such as iReasoning to access the MIB. You should see all the OID information here. If you would like our team to go over the MIB file, we are certainly happy to do so. Please let me know and I’ll set up a call.
https://www.ireasoning.com/
Thank you @Papaul, could you forward the attached mib? I'll take a look, though I think a call will be best
Thank you @Papaul, today I poked a little at LibreNMS Chatsworth support and it looks like the current support is not complete (for sure not as complete as sentry3/sentry4); we'd need to add support for inbound current and environmental monitors. I can dedicate some time this quarter to this. @wiki_willy what's the timeline for the testing phase?
Also a few questions that popped up while reading the MIB:
Hi @fgiunchedi - Chatsworth has been pretty flexible with the amount of time we have for testing it, so I think we should be ok keeping it for a longer duration. Just let us know approximately how long we would need to keep it for, and we can pass the info along to them. Or if this PDU doesn't seem that great from a monitoring and software compatibility aspect, let us know as well. We want to make sure the PDU makes sense for everyone across the org...and we can always pass on this and try out the next manufacturer if it makes more sense. Thanks for all the help, it's definitely appreciated. ~Willy
Thank you @wiki_willy that makes sense! I think we should aim at keeping the PDU until the end of the quarter if possible, if not maybe another 3-4 weeks?
WRT monitoring/software support, certainly having a PDU already fully supported by librenms is ideal. We have to do some work anyways for the Prometheus bits (and we did some LibreNMS work when moving from sentry3 to sentry4 IIRC) so definitely not a blocker on my end. If the vendor will work with us to improve LibreNMS support upstream for their products that will of course benefit everyone I think!
@fgiunchedi I changed the temperature to Celsius. The sensor has only 1 probe, but the PDU can take up to 2 sensors.
Thank you for the update on the sensor/probe! I can see the change F -> C in the SNMP option but not in the sensor values; it looks to me like the values are fixed in F. That's fine as we can convert. I wonder what the setting is for though, perhaps it changes the web interface (?)
-CPI-UNIFIED-MIB::celsiusTemp.0 = INTEGER: fahrenheit(0)
+CPI-UNIFIED-MIB::celsiusTemp.0 = INTEGER: celsius(1)
-CPI-UNIFIED-MIB::cpiPduSensorValue.1.'.000ED'.51.48.49.53.51.56.52 = INTEGER: 8832
-CPI-UNIFIED-MIB::cpiPduSensorValue.2.'.000ED'.51.48.49.53.51.56.52 = INTEGER: 3532
+CPI-UNIFIED-MIB::cpiPduSensorValue.1.'.000ED'.51.48.49.53.51.56.52 = INTEGER: 8826
+CPI-UNIFIED-MIB::cpiPduSensorValue.2.'.000ED'.51.48.49.53.51.56.52 = INTEGER: 3524
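Since the sensor values appear to stay in Fahrenheit regardless of the setting, the conversion would have to happen on our side. The sketch below assumes that cpiPduSensorValue.1 is the temperature probe and that the raw integer is in hundredths of a degree (so 8832 would be 88.32 °F); both are guesses from the walk output above and should be checked against the MIB. The function name is made up for illustration.

```python
def hundredths_f_to_celsius(raw_value: int) -> float:
    """Convert a raw reading in hundredths of a degree Fahrenheit to degrees Celsius.

    Assumption: the PDU reports temperature scaled by 100; this is inferred
    from the sample values, not confirmed by the vendor or the MIB.
    """
    fahrenheit = raw_value / 100.0
    return (fahrenheit - 32.0) * 5.0 / 9.0


# The two temperature readings seen before and after the setting change.
for raw in (8832, 8826):
    print(raw, round(hundredths_f_to_celsius(raw), 2))
```

With that scaling both readings land at roughly 31 °C, which at least looks plausible for a rack inlet or exhaust probe.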
re: chaining, do you know if we're going to use multiple PDUs per rack ?
I did some work on this last week; there are temporary patches on netmon1002 to get things going at least minimally and collect voltage/current/power/etc. from the PDU's branches. I ran into trouble with conditional discovery and asked upstream about it: https://community.librenms.org/t/skipping-values-based-on-oids-in-another-table-with-yaml-discovery/15689
On the Raritan PDU all the power cords fit well. The PDU also mounts well on our rack. The only problem is that the plugs are in 2 rows and not in a single row like on the other PDUs. I will open a ticket with CY1 to plug it in tomorrow to test the software part.
Change 698007 had a related patch set uploaded (by Papaul; author: Papaul):
[operations/puppet@production] ADD new Raritan test PDU
Change 698007 merged by Papaul:
[operations/puppet@production] ADD new Raritan test PDU
@fgiunchedi new Raritan test PDU is racked in D8
https://netbox.wikimedia.org/dcim/devices/3427/
still working on librenms
https://librenms.wikimedia.org/device/206
The host is already in Icinga, but you still need to do what you usually do to make it work.
You can download the full MIB by logging in to the PDU web interface, then clicking on Maintenance.
Let me know if you have any questions
Thank you @Papaul ! I see the device in librenms but looks like discovery isn't working. I've removed and added the device again without success: https://librenms.wikimedia.org/device/208/overview
And indeed even running discovery.php from command line shows snmpget fails, which I can't reproduce:
# ./discovery.php -h ps2-test-d8-codfw.mgmt.codfw.wmnet -d
LibreNMS Discovery
SQL[select `migration` from `migrations` order by `id` desc limit 1 [] 1.59ms]
SQL[select count(*) as aggregate from `migrations` limit 1 [] 2.44ms]
SQL[SELECT version() [] 1.13ms]
===================================
Version info:
Commit SHA: 692b5d5f5390041b34749c4cc3f1241759d21b51
Commit Date: 1618994912
DB Schema: 2021_04_08_151101_add_foreign_keys_to_port_group_port_table (1240)
PHP: 7.3.27-1~deb10u1
MySQL: 10.4.18-MariaDB-log
RRDTool: 1.7.1
SNMP: NET-SNMP 5.7.3
==================================
DEBUG!
Updating os_def.cache
SQL[SELECT * FROM `devices` WHERE disabled = 0 AND snmp_disable = 0 AND `hostname` LIKE 'ps2-test-d8-codfw.mgmt.codfw.wmnet' ORDER BY device_id DESC [] 2.47ms]
SQL[select * from `devices` where `device_id` = ? limit 1 [208] 1.87ms]
SQL[select * from `devices_attribs` where `devices_attribs`.`device_id` = ? and `devices_attribs`.`device_id` is not null [208] 1.57ms]
ps2-test-d8-codfw.mgmt.codfw.wmnet 208 generic
SQL[select `hostname`, `overwrite_ip` from `devices` where `hostname` = ? limit 1 ["ps2-test-d8-codfw.mgmt.codfw.wmnet"] 1.33ms]
[FPING] '/usr/bin/fping' '-e' '-q' '-c' '3' '-p' '500' '-t' '500' 'ps2-test-d8-codfw.mgmt.codfw.wmnet'
response: {"xmt":3,"rcv":3,"loss":0,"min":32.0,"max":32.1,"avg":32.0,"dup":0,"exitcode":0}
SQL[select `device_groups`.*, `device_group_device`.`device_id` as `pivot_device_id`, `device_group_device`.`device_group_id` as `pivot_device_group_id` from `device_groups` inner join `device_group_device` on `device_groups`.`id` = `device_group_device`.`device_group_id` where `device_group_device`.`device_id` = ? [208] 2.24ms]
SQL[select exists(select * from `alert_schedule` where (`start` <= ? and `end` >= ? and (`recurring` = ? or (`recurring` = ? and ((time(`start`) < time(`end`) and time(`start`) <= ? and time(`end`) > ?) or (time(`start`) > time(`end`) and (time(`end`) <= ? or time(`start`) > ?))) and (`recurring_day` like ? or `recurring_day` is null)))) and (exists (select * from `devices` inner join `alert_schedulables` on `devices`.`device_id` = `alert_schedulables`.`alert_schedulable_id` where `alert_schedule`.`schedule_id` = `alert_schedulables`.`schedule_id` and `alert_schedulables`.`alert_schedulable_type` = ? and `alert_schedulables`.`alert_schedulable_id` = ?) or exists (select * from `device_groups` inner join `alert_schedulables` on `device_groups`.`id` = `alert_schedulables`.`alert_schedulable_id` where `alert_schedule`.`schedule_id` = `alert_schedulables`.`schedule_id` and `alert_schedulables`.`alert_schedulable_type` = ? and `alert_schedulables`.`alert_schedulable_id` in (?)))) as `exists` ["2021-06-04T10:21:03.544519Z","2021-06-04T10:21:03.544519Z",0,1,"10:21:03","10:21:03","10:21:03","10:21:03","%","device",208,"device_group",10] 2.52ms]
SQL[INSERT IGNORE INTO `device_perf` (`xmt`,`rcv`,`loss`,`min`,`max`,`avg`,`device_id`,`timestamp`,`debug`) VALUES (:xmt,:rcv, :loss,:min,:max,:avg,:device_id,NOW(),:debug) {"xmt":3,"rcv":3,"loss":0,"min":32,"max":32.1,"avg":32,"device_id":208,"debug":"[]"} 2.13ms]
SQL[UPDATE `devices` set `last_ping`=?,`last_ping_timetaken`=? WHERE device_id=? ["2021-06-04 10:21:03",32,208] 1.82ms]
SNMP Check response code: 1
SNMP['/usr/bin/snmpget' '-v2c' '-c' 'COMMUNITY' '-Oqv' '-m' 'SNMPv2-MIB' '-M' '/srv/deployment/librenms/librenms/mibs' 'udp:HOSTNAME:0' 'sysObjectID.0']
Exitcode: 1
snmpget: Failure in sendto (Sub-id not found: (top) -> sysObjectID) (Invalid argument)
snmpget: Failure in sendto (Sub-id not found: (top) -> sysObjectID) (Invalid argument)
SNMP Unreachable
./discovery.php ps2-test-d8-codfw.mgmt.codfw.wmnet 2021-06-04 10:21:03 - 0 devices discovered in 1.323 secs
SNMP [2/0.02s]: Get[2/0.02s] Getnext[0/0.00s] Walk[0/0.00s]
MySQL [3/0.01s]: Cell[1/0.00s] Row[-1/-0.00s] Rows[1/0.00s] Column[0/0.00s] Update[1/0.00s] Insert[1/0.00s] Delete[0/0.00s]
Graphite connection made to graphite-in.eqiad.wmnet
Graphite [0/0.00s]:
RRD [0/0.00s]:
Also oddly enough adding the device from command line via addhost.php doesn't pass the initial validation when trying to pull data via snmp whereas adding the device from the web ui works.
Ok I think I got the correct addhost.php incantation to add the device and get discovery to work properly:
./addhost.php ps2-test-d8-codfw.mgmt.codfw.wmnet COMMUNITY v2c 161 udp
Without "161 udp" for some reason the snmpget call tried port 0 which obviously doesn't work. New device is at https://librenms.wikimedia.org/device/209 and we'll see what librenms makes of it.
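To illustrate the failure mode: the snmpget call in the debug output used a target of the form transport:host:port, and with the port stored as 0 the request could never reach the agent. The helper below is purely hypothetical (it is not LibreNMS code) and only sketches the kind of defaulting that would avoid the problem, falling back to the standard SNMP port 161 when none is set.

```python
from typing import Optional

DEFAULT_SNMP_PORT = 161  # standard SNMP agent port


def snmp_target(hostname: str, port: Optional[int] = None, transport: str = "udp") -> str:
    """Build a Net-SNMP style 'transport:host:port' target string.

    Hypothetical helper: a missing or zero port falls back to 161 instead of
    being passed through as ':0'.
    """
    if not port:
        port = DEFAULT_SNMP_PORT
    if not 0 < port < 65536:
        raise ValueError(f"invalid SNMP port: {port}")
    return f"{transport}:{hostname}:{port}"


# A port of 0, as seen in the failing snmpget call, becomes 161.
print(snmp_target("ps2-test-d8-codfw.mgmt.codfw.wmnet", 0))
```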
The discovery has worked as expected now and data is collected, I'm seeing only one input for volts/hertz/current as opposed to individual phases, and for current the outlets (all zeros) are also displayed. Next up will be to figure out if we can get distinct phases monitored
Environment sensors are not attached and thus not displayed, @Papaul will connect them.
@wiki_willy The Raritan PDU uses a different type of sensor than the Chatsworth, so we will have to ask Rahi to send us a Raritan smart sensor. Thanks
Thanks @Papaul. So in terms of feedback for Raritan, so far it's:
@fgiunchedi or @Papaul - do you have any additional feedback or requests for Raritan? We can always set up a meeting with them if there are a lot of questions.
Thanks,
Willy
I have a question for us: ATM librenms supports inlet and outlet power monitoring for this Raritan PDU, although I think we'd need/want individual phases monitoring (for inlet) ? That's something the PDU supports AFAICT but we'd need to add to librenms.
Hi @fgiunchedi - sorry for the delay. Just to do a quick check before you put a lot of time and effort in....from your perspective on the monitoring side of things, do you have a preference between the Chatsworth or Raritan PDU?
Raritan worked out of the box in librenms so that's certainly a plus, but there's no phase monitoring ATM so I think we'll need to add that (? not sure, see my previous question). I think so far my vote goes to Raritan (all other things being equal, e.g. if the temperature sensors work, and the PDU as a whole is ok with dcops)
Hi Filippo - is this following link the one used for phase monitoring?
https://librenms.wikimedia.org/device/device=108/tab=health/metric=current/
If so, that's correct...we'll need that part to work in librenms. There's actually just a few fields that we really set thresholds for in monitoring the PDUs, which are listed here: T247358#6118215. Hope this helps a little, but feel free to DM me and we can go over it more if needed. Much appreciated for all the help! Thanks, Willy
Yes that's correct
Ok thank you! I'll look into having per-phase data as well, and will reach out if I have questions
Thank you @Papaul ! I can confirm temperature/humidity now work out of the box: https://librenms.wikimedia.org/device/209/health
What's left is phase monitoring which I'll be looking into next
Received the PDU today.
Mounting the PDU in the rack was very easy: I took the PDU out of the box and it went straight into the rack with no adjustment. It is also very light compared to the other PDUs.
The power cable fits well in the PDU and the locking mechanism is great. The only issue I am seeing is that it is plastic, so it can break very easily.
I will coordinate with CY1 to plug the PDU in rack D1 sometime next week for testing.
@fgiunchedi I got in touch with the Eaton technical support team, and they told me that the only options for SNMP are v1 and v3. I really don't like the idea of using SNMP v1, and I don't know if we are thinking of using SNMP v3 since we don't use it anywhere in the infrastructure. Let me know what your thoughts are on this.
The PDU is already online but not in LibreNMS.
Thanks.
Thanks @Papaul, that's unfortunate re: v2 support but thanks for investigating further. We could try with v3 and see what happens though
I spent some time today with this, I can successfully snmpwalk the device, yet librenms refuses to add it. So on the device's end things seem to be working as expected, still figuring out the librenms side
Ok device added (with a temporary password), apparently librenms with zero v3 configuration specified won't work
@Papaul the device is now collecting data, however AFAICS only outlets are discovered for now. Let's see when more data is accumulated if the numbers are sensible.
Change 732298 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] librenms: add snmp v3 dummy section
Change 732298 merged by Filippo Giunchedi:
[operations/puppet@production] librenms: add snmp v3 dummy section
I did some more work on the Eaton PDU and LibreNMS today. From the snmpwalk output at P17569 I think all the data we need (per-phase data, environmental data at least) is there, and it shouldn't be a whole lot of effort to add to LibreNMS (perhaps a week of work). Out of the box we get per-outlet current data only in LibreNMS: https://librenms.wikimedia.org/device/215/health
I think LibreNMS support is doable with a little bit of work, which we would of course contribute upstream so it'll help us and others too. Let me know what you think @Papaul @wiki_willy
Mentioned in SAL (#wikimedia-operations) [2021-11-15T18:52:19Z] <volans@cumin2002> START - Cookbook sre.hosts.downtime for 40 days, 0:00:00 on ps1-d1-codfw with reason: Testing new PDU devices T265435
Mentioned in SAL (#wikimedia-operations) [2021-11-15T18:52:23Z] <volans@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 40 days, 0:00:00 on ps1-d1-codfw with reason: Testing new PDU devices T265435