It looks like the errors and timeouts on the newer ms-be machines might be due to a missing firmware upgrade for the P840 hardware RAID controller. We can test the upgrade on one of the codfw machines first, then expand to the other HP machines that need it, possibly including other controller models.
HP RAID firmware audit:
root@cumin1001:~# cumin 'F:manufacturer = HP' '[ -x /usr/sbin/hpssacli ] && cat /sys/class/scsi_disk/*\:1\:0\:0/device/rev /sys/class/scsi_device/*\:0\:0\:0/device/rev 2>/dev/null'
===== NODE GROUP =====
(1) db2034.codfw.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
7.02
7.02
===== NODE GROUP =====
(1) db1089.eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
5.04
5.04
===== NODE GROUP =====
(1) cloudcontrol1004.wikimedia.org
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
HPG2
HPG2
HPG2
HPG2
===== NODE GROUP =====
(22) cloudvirt1020.eqiad.wmnet,db1082.eqiad.wmnet,ms-be[2028-2029,2031-2033,2035-2036,2038-2039].codfw.wmnet,ms-be[1029-1039].eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
4.52
4.52
===== NODE GROUP =====
(2) labstore[1006-1007].wikimedia.org
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
4.52
4.52
4.52
4.52
===== NODE GROUP =====
(1) cloudvirt1019.eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
6.60
5.04
6.60
===== NODE GROUP =====
(7) ms-be[2017-2018,2020-2021,2030].codfw.wmnet,ms-be[1017,1028].eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
6.60
6.60
===== NODE GROUP =====
(5) ms-be[2016,2019].codfw.wmnet,ms-be[1019-1021].eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
3.00
3.00
===== NODE GROUP =====
(3) lvs[1010-1012].eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
6.64
6.64
===== NODE GROUP =====
(10) db[2036,2038-2041].codfw.wmnet,lvs[2001,2003-2006].codfw.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
5.42
5.42
===== NODE GROUP =====
(7) restbase[2007-2009].codfw.wmnet,restbase[1010-1011,1013,1015].eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
6.06
===== NODE GROUP =====
(2) ms-be[2023,2037].codfw.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
6.06
6.06
===== NODE GROUP =====
(9) cloudvirt[1013-1014].eqiad.wmnet,db1092.eqiad.wmnet,labvirt1012.eqiad.wmnet,ms-be[2025,2027].codfw.wmnet,ms-be[1022-1023,1027].eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
4.02
4.02
===== NODE GROUP =====
(1) db2060.codfw.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
6.68
6.68
===== NODE GROUP =====
(1) ms-be2034.codfw.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
6.30
6.30
===== NODE GROUP =====
(7) db[2035,2037,2044,2048-2049,2068].codfw.wmnet,lvs2002.codfw.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
8.00
8.00
===== NODE GROUP =====
(33) db[1074-1081,1083-1088,1090-1091,1093-1095].eqiad.wmnet,labsdb[1009-1011].eqiad.wmnet,ms-be[2022,2024,2026].codfw.wmnet,ms-be[1024-1026].eqiad.wmnet,restbase[1012,1014].eqiad.wmnet,snapshot[1005-1007].eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
3.56
3.56
===== NODE GROUP =====
(32) db[2033,2043,2045-2047,2050-2056,2058-2059,2061-2063,2065-2067,2069-2070].codfw.wmnet,dbstore2001.codfw.wmnet,labvirt[1001-1009].eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
6.00
6.00
===== NODE GROUP =====
(3) db[2042,2057].codfw.wmnet,dbstore2002.codfw.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
8.32
8.32
===== NODE GROUP =====
(2) ms-be[1016,1018].eqiad.wmnet
----- OUTPUT of '[ -x /usr/sbin/h.../rev 2>/dev/null' -----
1.34
1.34
Links to firmware downloads:
- Smart Array H240ar, H240nr, H240, H241, H244br, P240nr, P244br, P246br, P440ar, P440, P441, P542D, P741m, P840, P840ar, and P841 version 6.88 (2 Apr 2019)
- Smart Array P220i, P222, P420i, P420, P421, P721m, and P822 version 8.32 (2 Nov 2017)
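
To decide which hosts from the audit are behind, the reported revisions can be compared against the versions listed above. The following is only a rough sketch: `parse_rev`, `needs_upgrade` and the `SAMPLE` selection are hypothetical names introduced here, and it assumes revisions always have the `major.minor` shape seen in the audit. The audit does not say which controller model each host has, so the 6.88 target is only meaningful for hosts confirmed to carry a controller from the H240/P440/P840 family.

```python
# Target taken from the download list; valid only for the H240/P440/P840
# family (assumption: the host has been confirmed to use such a controller).
TARGET = "6.88"


def parse_rev(rev):
    """Return (major, minor) for a revision like '4.52', or None if not numeric."""
    parts = rev.strip().split(".")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    return int(parts[0]), int(parts[1])


def needs_upgrade(rev, target=TARGET):
    """True if rev is older than target, False if not, None if either is unparseable."""
    current = parse_rev(rev)
    wanted = parse_rev(target)
    if current is None or wanted is None:
        return None
    return current < wanted


# A few revisions copied from the audit (hypothetical selection; the
# controller model of these hosts has not been verified here).
SAMPLE = {
    "ms-be2034.codfw.wmnet": "6.30",
    "ms-be[2023,2037].codfw.wmnet": "6.06",
    "db[2035,2037,2044,2048-2049,2068].codfw.wmnet,lvs2002.codfw.wmnet": "8.00",
    "cloudcontrol1004.wikimedia.org": "HPG2",
}

for hosts, rev in SAMPLE.items():
    print(hosts, rev, needs_upgrade(rev))
```

Non-numeric values such as `HPG2` yield `None` rather than a guess, so those hosts can be checked by hand.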