
rack/setup/deploy ms-be102[2-7]
Closed, Resolved, Public

Description

This task will track the planning for the racking and then implementation of the new swift backends ordered for eqiad.

ms-be102[2-7] were ordered on T136336.

We'll need to plan out where these 6 new systems will rack, and whether they need to be in wholly different racks and rows. Since we have 4 rows but 6 systems, some will end up sharing a row. I didn't want to assume, based on the current config, that these new machines would simply be distributed evenly across it. Since these are newer machines, there may be plans to re-balance the backends.

ms-be1022

  • - receive in normally via T136336 (includes racktables update)
  • - apply hostname label, update racktables
  • - rack in X
  • - add mgmt (hostname & asset tag) and production (hostname) dns entries
  • - update bios settings and ilom settings
  • - update install_server module
  • - install OS
  • - accept/sign puppet/salt keys
  • - service implementation

ms-be1023

  • - receive in normally via T136336 (includes racktables update)
  • - apply hostname label, update racktables
  • - rack in X
  • - add mgmt (hostname & asset tag) and production (hostname) dns entries
  • - update bios settings and ilom settings
  • - update install_server module
  • - install OS
  • - accept/sign puppet/salt keys
  • - service implementation

ms-be1024

  • - receive in normally via T136336 (includes racktables update)
  • - apply hostname label, update racktables
  • - rack in X
  • - add mgmt (hostname & asset tag) and production (hostname) dns entries
  • - update bios settings and ilom settings
  • - update install_server module
  • - install OS
  • - accept/sign puppet/salt keys
  • - service implementation

ms-be1025

  • - receive in normally via T136336 (includes racktables update)
  • - apply hostname label, update racktables
  • - rack in X
  • - add mgmt (hostname & asset tag) and production (hostname) dns entries
  • - update bios settings and ilom settings
  • - update install_server module
  • - install OS
  • - accept/sign puppet/salt keys
  • - service implementation

ms-be1026

  • - receive in normally via T136336 (includes racktables update)
  • - apply hostname label, update racktables
  • - rack in X
  • - add mgmt (hostname & asset tag) and production (hostname) dns entries
  • - update bios settings and ilom settings
  • - update install_server module
  • - install OS
  • - accept/sign puppet/salt keys
  • - service implementation

ms-be1027

  • - receive in normally via T136336 (includes racktables update)
  • - apply hostname label, update racktables
  • - rack in X
  • - add mgmt (hostname & asset tag) and production (hostname) dns entries
  • - update bios settings and ilom settings
  • - update install_server module
  • - install OS
  • - accept/sign puppet/salt keys
  • - service implementation

Event Timeline

RobH created this task. May 31 2016, 6:07 PM
Restricted Application added subscribers: Zppix, Southparkfan. May 31 2016, 6:07 PM
Southparkfan updated the task description. May 31 2016, 7:34 PM
RobH added a comment. Edited Jun 7 2016, 3:46 PM

We'll need to plan out where these 6 new systems will rack, and whether they need to be in wholly different racks and rows. Since we have 4 rows but 6 systems, some will end up sharing a row. I didn't want to assume, based on the current config, that these new machines would simply be distributed evenly across it. Since these are newer machines, there may be plans to re-balance the backends.

@fgiunchedi is out until June 10th. He would be the ideal person to advise on where these should rack, in terms of their planned configuration/service groups.

@faidon may also know, so I'm flagging him in this task (and dropping him a PM about it as well.) If he isn't certain of the planned service group config, it will likely need to wait for @fgiunchedi.

RobH updated the task description. Jun 9 2016, 4:44 PM
RobH assigned this task to fgiunchedi. Jun 9 2016, 4:52 PM

Filippo is back shortly, so assigning to him for input. Please provide feedback and assign back to me for followup, thank you!

fgiunchedi reassigned this task from fgiunchedi to RobH. Jun 13 2016, 2:22 PM

Thanks @RobH, let's allocate 2x in each of rows A/B/D, space permitting (especially in row A, I think, @Cmjohnson?).
If there are 10G ports available let's go with those; if not, 1G works too.

@fgiunchedi Row A is pretty full at the moment. I can add to A5 which is a 10G rack.

A5/C8 and D8 are all 10G racks

I can add 1 each to 10G and 1 each to 1G if you like, or 2 per 10G switch. Let me know your preference.

Restricted Application added a subscriber: Steinsplitter. Jun 22 2016, 6:06 PM

Thanks @Cmjohnson! I was checking the allocation again and there's a correction: row A isn't needed. Please go with 2x machines in each of B/C/D.
Wrt 10G vs 1G, let's go with 2x per 10G switch, thanks!

@fgiunchedi That will be 2 each in rows A/C/D for 10G.

@Cmjohnson ok! let's stick with B/C/D for rows and 10G for C/D and 1G for B
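The agreed layout (2x per row in B/C/D, 10G in C/D and 1G in B) can be written down as a small sketch. The thread does not say which host goes in which row, so the host-to-row assignment below is purely illustrative, and `allocate` is a hypothetical helper rather than an existing tool.

```python
# Illustrative only: the thread fixes rows and link speeds, not which host
# lands in which row. Hosts are simply assigned in name order.
HOSTS = [f"ms-be10{n}" for n in range(22, 28)]
ROW_LINK = {"B": "1G", "C": "10G", "D": "10G"}


def allocate(hosts, row_link, per_row=2):
    """Map each host to (row, link speed), filling rows per_row at a time."""
    rows = list(row_link)
    if len(hosts) > per_row * len(rows):
        raise ValueError("not enough row slots for all hosts")
    plan = {}
    for index, host in enumerate(hosts):
        row = rows[index // per_row]
        plan[host] = (row, row_link[row])
    return plan


PLAN = allocate(HOSTS, ROW_LINK)
for host, (row, link) in PLAN.items():
    print(f"{host}: row {row}, {link}")
```

Raising on overflow rather than wrapping around keeps the "2x per row" constraint explicit if more hosts are added later.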

ms-be1022 1H IN A 10.65.4.65
wmf6970 1H IN A 10.65.4.65
ms-be1023 1H IN A 10.65.4.66
wmf6971 1H IN A 10.65.4.66
ms-be1024 1H IN A 10.65.4.67
wmf6972 1H IN A 10.65.4.67
ms-be1025 1H IN A 10.65.4.68
wmf6973 1H IN A 10.65.4.68
ms-be1026 1H IN A 10.65.4.69

@Cmjohnson I see ms-be1027 is missing from the list? also what's the status of these servers? thanks!
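A gap like this can be caught mechanically. The following is a minimal sketch, assuming the records are available as plain text in the `name TTL IN A ip` shape pasted above; `parse_a_records` and `audit` are hypothetical names, not part of any existing DNS tooling.

```python
# Records copied verbatim from the comment above.
RECORDS = """\
ms-be1022 1H IN A 10.65.4.65
wmf6970 1H IN A 10.65.4.65
ms-be1023 1H IN A 10.65.4.66
wmf6971 1H IN A 10.65.4.66
ms-be1024 1H IN A 10.65.4.67
wmf6972 1H IN A 10.65.4.67
ms-be1025 1H IN A 10.65.4.68
wmf6973 1H IN A 10.65.4.68
ms-be1026 1H IN A 10.65.4.69
"""

EXPECTED_HOSTS = [f"ms-be10{n}" for n in range(22, 28)]


def parse_a_records(text):
    """Return {name: ip} for lines shaped like 'name TTL IN A ip'."""
    records = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[2:4] == ["IN", "A"]:
            records[parts[0]] = parts[4]
    return records


def audit(records, expected_hosts):
    """Return (hosts with no record, names that are alone on their IP)."""
    missing = [host for host in expected_hosts if host not in records]
    names_by_ip = {}
    for name, ip in records.items():
        names_by_ip.setdefault(ip, []).append(name)
    # An IP with a single name lacks its hostname/asset-tag counterpart.
    unpaired = [names[0] for names in names_by_ip.values() if len(names) == 1]
    return missing, unpaired


MISSING, UNPAIRED = audit(parse_a_records(RECORDS), EXPECTED_HOSTS)
print("missing:", MISSING)
print("unpaired:", UNPAIRED)
```

On the list as pasted, this reports ms-be1027 as missing and ms-be1026 as having no matching asset-tag entry.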

Change 298978 had a related patch set uploaded (by Filippo Giunchedi):
add ms-be102[2-7]

https://gerrit.wikimedia.org/r/298978

Change 298978 merged by Filippo Giunchedi:
add ms-be102[2-7]

https://gerrit.wikimedia.org/r/298978

Mentioned in SAL [2016-07-18T08:54:53Z] <godog> swift eqiad-prod: ms-be102[3-6] to weight 500 - T136631

disks ordered for ms-be1027

Reedy added a subscriber: Reedy. Jul 30 2016, 8:51 PM

These all seem to be flapping with

[21:27:18] <icinga-wm> PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[21:35:27] <icinga-wm> RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
Jul 30 18:18:38 ms-be1026 kernel: [1391836.163876] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff8812d751b4c0Tag:0x00000000:00000310 CDBLen: 6 CDB: 0x1201... SN: 0x0  BEING SENT
Jul 30 18:18:38 ms-be1026 kernel: [1391836.163885] hpsa 0000:08:00.0: scsi 0:3:0:0: Aborting command RAID              HP       P840             controller SSDSmartPathCap- En- Exp=1
Jul 30 18:18:38 ms-be1026 kernel: [1391836.303832] hpsa 0000:08:00.0: CDB 12018000600000000000000000000000 was aborted with status 0x0
Jul 30 18:18:38 ms-be1026 kernel: [1391836.304431] hpsa 0000:08:00.0: invalid command: LUN:0000000000000000 CDB:00000000100300000000000000000000
Jul 30 18:18:38 ms-be1026 kernel: [1391836.304578] hpsa 0000:08:00.0: probably means device no longer present
Jul 30 18:18:38 ms-be1026 kernel: [1391836.304580] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff8812d751b4c0Tag:0x00000000:00000310 CDBLen: 6 CDB: 0x1201... SN: 0x0  SENT, FAILED
Jul 30 18:18:38 ms-be1026 kernel: [1391836.304732] hpsa 0000:08:00.0: scsi 0:3:0:0: FAILED to abort command RAID              HP       P840             controller SSDSmartPathCap- En- Exp=1

from syslog on 1026

Jul 30 19:39:50 ms-be1025 kernel: [1396757.715929] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff8817d41c3200Tag:0x00000000:000004b0 CDBLen: 6 CDB: 0x1201... SN: 0x0  BEING SENT
Jul 30 19:39:50 ms-be1025 kernel: [1396757.715936] hpsa 0000:08:00.0: scsi 0:3:0:0: Aborting command RAID              HP       P840             controller SSDSmartPathCap- En- Exp=1
Jul 30 19:39:50 ms-be1025 kernel: [1396757.726701] hpsa 0000:08:00.0: CDB 12018000600000000000000000000000 was aborted with status 0x0
Jul 30 19:39:50 ms-be1025 kernel: [1396757.735898] hpsa 0000:08:00.0: invalid command: LUN:0000000000000000 CDB:00000000b00400000000000000000000
Jul 30 19:39:50 ms-be1025 kernel: [1396757.735901] hpsa 0000:08:00.0: probably means device no longer present
Jul 30 19:39:50 ms-be1025 kernel: [1396757.735903] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff8817d41c3200Tag:0x00000000:000004b0 CDBLen: 6 CDB: 0x1201... SN: 0x0  SENT, FAILED
Jul 30 19:39:50 ms-be1025 kernel: [1396757.735906] hpsa 0000:08:00.0: scsi 0:3:0:0: FAILED to abort command RAID              HP       P840             controller SSDSmartPathCap- En- Exp=1

on ms-be1025

Jul 30 15:33:04 ms-be1023 kernel: [1315245.373281] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff881ac39803c0Tag:0x00000000:000001f0 CDBLen: 6 CDB: 0x1201... SN: 0x0  BEING SENT
Jul 30 15:33:04 ms-be1023 kernel: [1315245.373449] hpsa 0000:08:00.0: scsi 0:3:0:0: Aborting command RAID              HP       P840             controller SSDSmartPathCap- En- Exp=1
Jul 30 15:33:04 ms-be1023 kernel: [1315245.374664] hpsa 0000:08:00.0: CDB 12018000600000000000000000000000 was aborted with status 0x0
Jul 30 15:33:04 ms-be1023 kernel: [1315245.376607] hpsa 0000:08:00.0: invalid command: LUN:0000000000000000 CDB:00000000f00100000000000000000000
Jul 30 15:33:04 ms-be1023 kernel: [1315245.376608] hpsa 0000:08:00.0: probably means device no longer present
Jul 30 15:33:04 ms-be1023 kernel: [1315245.376610] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff881ac39803c0Tag:0x00000000:000001f0 CDBLen: 6 CDB: 0x1201... SN: 0x0  SENT, FAILED
Jul 30 15:33:04 ms-be1023 kernel: [1315245.376612] hpsa 0000:08:00.0: scsi 0:3:0:0: FAILED to abort command RAID              HP       P840             controller SSDSmartPathCap- En- Exp=1

from 1023,

Jul 30 21:00:39 ms-be1022 kernel: [296846.232622] blk_update_request: critical target error, dev sda, sector 314451976
Jul 30 21:00:39 ms-be1022 kernel: [296846.273787] sd 0:1:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 30 21:00:39 ms-be1022 kernel: [296846.273794] sd 0:1:0:0: [sda] tag#0 Sense Key : Hardware Error [current] 
Jul 30 21:00:39 ms-be1022 kernel: [296846.273799] sd 0:1:0:0: [sda] tag#0 Add. Sense: Logical unit failure
Jul 30 21:00:39 ms-be1022 kernel: [296846.273803] sd 0:1:0:0: [sda] tag#0 CDB: Read(10) 28 00 06 fc 20 08 00 00 08 00
Jul 30 21:00:39 ms-be1022 kernel: [296846.273806] blk_update_request: critical target error, dev sda, sector 117186568
Jul 30 21:00:39 ms-be1022 kernel: [296846.309537] sd 0:1:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 30 21:00:39 ms-be1022 kernel: [296846.309544] sd 0:1:0:0: [sda] tag#0 Sense Key : Hardware Error [current] 
Jul 30 21:00:39 ms-be1022 kernel: [296846.309548] sd 0:1:0:0: [sda] tag#0 Add. Sense: Logical unit failure
Jul 30 21:00:39 ms-be1022 kernel: [296846.309552] sd 0:1:0:0: [sda] tag#0 CDB: Read(10) 28 00 12 be 28 08 00 00 08 00
Jul 30 21:00:39 ms-be1022 kernel: [296846.309555] blk_update_request: critical target error, dev sda, sector 314451976

and lots more from 1022.

Driver? Kernel?

Jul 30 15:24:39 ms-be1023 kernel: [1314740.414164] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff880a17153640Tag:0x00000000:000002f0 CDBLen: 6 CDB: 0x1201... SN: 0x0 BEING SENT
Jul 30 15:24:39 ms-be1023 kernel: [1314740.414168] hpsa 0000:08:00.0: scsi 0:3:0:0: Aborting command RAID HP P840 controller SSDSmartPathCap- En- Exp=1
Jul 30 15:24:39 ms-be1023 kernel: [1314740.838079] hpsa 0000:08:00.0: CDB 12018000600000000000000000000000 was aborted with status 0x0
Jul 30 15:24:39 ms-be1023 kernel: [1314740.839242] hpsa 0000:08:00.0: invalid command: LUN:0000000000000000 CDB:00000000f00200000000000000000000
Jul 30 15:24:39 ms-be1023 kernel: [1314740.839245] hpsa 0000:08:00.0: probably means device no longer present
Jul 30 15:24:39 ms-be1023 kernel: [1314740.839416] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff880a17153640Tag:0x00000000:000002f0 CDBLen: 6 CDB: 0x1201... SN: 0x0 SENT, FAILED
Jul 30 15:24:39 ms-be1023 kernel: [1314740.839420] hpsa 0000:08:00.0: scsi 0:3:0:0: FAILED to abort command RAID HP P840 controller SSDSmartPathCap- En- Exp=1
Jul 30 15:25:01 ms-be1023 CRON[30416]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 30 15:25:01 ms-be1023 CRON[30418]: (swift) CMD ( test -x /usr/bin/swift-recon-cron && /usr/bin/swift-recon-cron /etc/swift/object-server.conf)
Jul 30 15:30:01 ms-be1023 CRON[34337]: (swift) CMD ( test -x /usr/bin/swift-recon-cron && /usr/bin/swift-recon-cron /etc/swift/object-server.conf)
Jul 30 15:33:01 ms-be1023 CRON[36874]: (root) CMD (/usr/local/sbin/puppet-run > /dev/null 2>&1)
Jul 30 15:33:04 ms-be1023 kernel: [1315245.373281] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff881ac39803c0Tag:0x00000000:000001f0 CDBLen: 6 CDB: 0x1201... SN: 0x0 BEING SENT
Jul 30 15:33:04 ms-be1023 kernel: [1315245.373449] hpsa 0000:08:00.0: scsi 0:3:0:0: Aborting command RAID HP P840 controller SSDSmartPathCap- En- Exp=1
Jul 30 15:33:04 ms-be1023 kernel: [1315245.374664] hpsa 0000:08:00.0: CDB 12018000600000000000000000000000 was aborted with status 0x0
Jul 30 15:33:04 ms-be1023 kernel: [1315245.376607] hpsa 0000:08:00.0: invalid command: LUN:0000000000000000 CDB:00000000f00100000000000000000000
Jul 30 15:33:04 ms-be1023 kernel: [1315245.376608] hpsa 0000:08:00.0: probably means device no longer present
Jul 30 15:33:04 ms-be1023 kernel: [1315245.376610] hpsa 0000:08:00.0: scsi 0:0:0:0 Aborting command ffff881ac39803c0Tag:0x00000000:000001f0 CDBLen: 6 CDB: 0x1201... SN: 0x0 SENT, FAILED
Jul 30 15:33:04 ms-be1023 kernel: [1315245.376612] hpsa 0000:08:00.0: scsi 0:3:0:0: FAILED to abort command RAID HP P840 controller SSDSmartPathCap- En- Exp=1

on ms-be1023
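To compare hosts without eyeballing syslog, the excerpts above can be tallied per host and error type. This is a rough sketch: the regular expression assumes the syslog line layout shown in this task, and the two category names are made up for illustration.

```python
import re
from collections import Counter

# Assumes lines shaped like: "Jul 30 18:18:38 host kernel: [uptime] message"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*[\d.]+\]\s+(?P<msg>.*)$"
)

# Four lines copied verbatim from the excerpts in this task.
SAMPLE = [
    "Jul 30 18:18:38 ms-be1026 kernel: [1391836.304732] hpsa 0000:08:00.0: scsi 0:3:0:0: FAILED to abort command RAID              HP       P840             controller SSDSmartPathCap- En- Exp=1",
    "Jul 30 19:39:50 ms-be1025 kernel: [1396757.735906] hpsa 0000:08:00.0: scsi 0:3:0:0: FAILED to abort command RAID              HP       P840             controller SSDSmartPathCap- En- Exp=1",
    "Jul 30 21:00:39 ms-be1022 kernel: [296846.232622] blk_update_request: critical target error, dev sda, sector 314451976",
    "Jul 30 21:00:39 ms-be1022 kernel: [296846.273806] blk_update_request: critical target error, dev sda, sector 117186568",
]


def classify(message):
    """Return an illustrative category name for a kernel message, or None."""
    if message.startswith("hpsa") and "FAILED to abort command" in message:
        return "hpsa_abort_failed"
    if message.startswith("blk_update_request: critical target error"):
        return "block_io_error"
    return None


def summarize(lines):
    """Count categorized kernel messages per (host, category)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        kind = classify(match.group("msg"))
        if kind is not None:
            counts[(match.group("host"), kind)] += 1
    return counts


COUNTS = summarize(SAMPLE)
for (host, kind), count in sorted(COUNTS.items()):
    print(host, kind, count)
```

Keeping the two categories apart matters here because ms-be1022 shows block-layer read errors on sda, while the other hosts show failed hpsa command aborts.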

It looks like the controller isn't responding under load, which makes the icinga alarms flap. Given that this is new hardware and we have the same gen9 hardware for e.g. ms-be1018, I checked the firmware version and it is indeed different:

root@ms-be1018:~# hpssacli controller slot=1 show | grep -i firmware
   Firmware Version: 1.34
root@ms-be1024:~# hpssacli controller slot=3 show | grep -i firmware
   Firmware Version: 3.56

Looking for the firmware: version 4.02(B) for the P840 was released on 28 Apr 2016: http://h20564.www2.hpe.com/hpsc/swd/public/detail?sp4ts.oid=7274906&swItemId=MTX_4d3778d5c6644691afc2ba751e&swEnvOid=4176#tab-history
So it looks likely that we'll need to upgrade the firmware at the very least.
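The same comparison could be scripted across hosts. The sketch below only parses the `Firmware Version:` line in the form shown above. It assumes the `hpssacli ... show` output has already been collected per host, and `firmware_version` is a hypothetical helper.

```python
import re


def firmware_version(show_output):
    """Extract 'Firmware Version' as a tuple of ints, or None if absent."""
    match = re.search(r"Firmware Version:\s*(\d+(?:\.\d+)*)", show_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Output lines as pasted above; collecting them from each host is out of scope.
OBSERVED = {
    "ms-be1018": "   Firmware Version: 1.34",
    "ms-be1024": "   Firmware Version: 3.56",
}
# 4.02(B) is the P840 release mentioned above; the "(B)" suffix is ignored.
P840_LATEST = firmware_version("Firmware Version: 4.02")

VERSIONS = {host: firmware_version(text) for host, text in OBSERVED.items()}
for host, version in sorted(VERSIONS.items()):
    print(host, ".".join(str(number) for number in version))
print("versions differ:", len(set(VERSIONS.values())) > 1)
print("ms-be1024 behind 4.02:", VERSIONS["ms-be1024"] < P840_LATEST)
```

Versions are compared as integer tuples so that, for example, 3.56 sorts below 4.02 without relying on string ordering.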

fgiunchedi reassigned this task from RobH to Cmjohnson. Aug 1 2016, 9:21 AM

@Cmjohnson we'll likely require firmware upgrades on these machines, see also T141756

on ms-be1023 with the firmware upgraded I'm still seeing a couple of kernel messages from hpsa about aborted commands and check_hpssacli takes around 40s to complete.

Another difference from other gen9 hardware we have (e.g. ms-be1020) is the distribution and kernel. To test this theory we could either reimage one of the new systems in codfw with trusty or reimage ms-be1022/ms-be1027 with trusty (not yet in service, but their disks are replaced now). Note that data loss from reimaging is limited to the SSDs, where the container databases are stored; those will be rebuilt, and the spinning disks are untouched.

Mentioned in SAL [2016-08-02T09:49:50Z] <godog> temporarily stop puppet and disable check_hpssacli on ms-be1023 T136631

Another difference is load average and system CPU % utilization. I've tried running perf top on a couple of machines:

ms-be1023 (gen9, jessie, kernel 4.4, swift 2.2)

45.94%  [kernel]                                                  [k] native_queued_spin_lock_slowpath                           
 4.62%  python2.7                                                 [.] PyEval_EvalFrameEx                                         
 3.03%  [kernel]                                                  [k] acpi_os_write_port                                         
 1.93%  [kernel]                                                  [k] update_sd_lb_stats                                         
 1.78%  python2.7                                                 [.] PyObject_GetAttr                                           
 0.89%  [kernel]                                                  [k] copy_page                                                  
 0.77%  python2.7                                                 [.] PyDict_GetItem                                             
 0.72%  [kernel]                                                  [k] load_balance                                               
 0.71%  [kernel]                                                  [k] __schedule                                                 
 0.63%  [kernel]                                                  [k] source_load                                                
 0.55%  [kernel]                                                  [k] _raw_spin_lock                                             
 0.55%  [kernel]                                                  [k] idle_cpu                                                   
 0.53%  [kernel]                                                  [k] _find_next_bit.part.0                                      
 0.47%  python2.7                                                 [.] PyFrame_New                                                
 0.47%  [kernel]                                                  [k] menu_select                                                
 0.40%  [kernel]                                                  [k] dequeue_entity                                             
 0.39%  [kernel]                                                  [k] int_sqrt                                                   
 0.37%  [kernel]                                                  [k] cpumask_next_and                                           
 0.36%  [kernel]                                                  [k] pick_next_task_fair                                        
 0.35%  python2.7                                                 [.] 0x00000000000a72a4                                         
 0.35%  python2.7                                                 [.] 0x00000000000a72a0                                         
 0.30%  [kernel]                                                  [k] _raw_spin_lock_irqsave                                     
 0.29%  [kernel]                                                  [k] find_busiest_group                                         
 0.28%  [kernel]                                                  [k] kmem_cache_free                                            
 0.27%  [kernel]                                                  [k] enqueue_entity

and ms-be1020 (gen9, trusty, kernel 3.13, swift 1.13)

9.03%  python2.7                      [.] 0x000000000014f565
8.18%  python2.7                      [.] PyEval_EvalFrameEx
7.88%  [unknown]                      [.] 0x00007f8403353763
6.56%  [kernel]                       [k] _raw_spin_lock_irqsave
5.11%  libc-2.19.so                   [.] 0x000000000007ef40
3.99%  hpssacli                       [.] 0x000000000031bdb0
3.52%  libsqlite3.so.0.8.6            [.] 0x000000000001e2a1
2.95%  python2.7                      [.] PyObject_GetAttr
2.08%  [kernel]                       [k] _raw_spin_lock
2.06%  [kernel]                       [k] copy_page_rep
1.83%  [kernel]                       [k] _raw_spin_lock_bh
1.54%  [kernel]                       [k] find_busiest_group
1.51%  python2.7                      [.] PyDict_GetItem
1.02%  libcrypto.so.1.0.0             [.] 0x000000000006533a
0.83%  [kernel]                       [k] page_fault
0.83%  python2.7                      [.] PyFrame_New
0.82%  [kernel]                       [k] acpi_os_write_port
0.76%  libc-2.19.so                   [.] malloc
0.53%  [kernel]                       [k] __schedule
0.51%  [kernel]                       [k] idle_cpu
0.47%  [kernel]                       [k] update_blocked_averages
0.42%  python2.7                      [.] PyTuple_New
0.41%  perf                           [.] 0x0000000000058510
0.41%  python2.7                      [.] _PyObject_GenericSetAttrWithDict
0.41%  [kernel]                       [k] __switch_to
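To put a number on the difference, the kernel-side share of each listing can be summed. This is a small sketch using only the top five rows of each listing above, so the totals are partial. `kernel_share` is a hypothetical helper that relies on the `[k]` marker perf prints for kernel symbols.

```python
# Top five rows of each perf top listing above, whitespace collapsed.
MS_BE1023_TOP = """\
45.94%  [kernel]   [k] native_queued_spin_lock_slowpath
 4.62%  python2.7  [.] PyEval_EvalFrameEx
 3.03%  [kernel]   [k] acpi_os_write_port
 1.93%  [kernel]   [k] update_sd_lb_stats
 1.78%  python2.7  [.] PyObject_GetAttr
"""

MS_BE1020_TOP = """\
9.03%  python2.7     [.] 0x000000000014f565
8.18%  python2.7     [.] PyEval_EvalFrameEx
7.88%  [unknown]     [.] 0x00007f8403353763
6.56%  [kernel]      [k] _raw_spin_lock_irqsave
5.11%  libc-2.19.so  [.] 0x000000000007ef40
"""


def kernel_share(perf_rows):
    """Sum the percentages of rows whose symbol is marked [k] (kernel)."""
    total = 0.0
    for row in perf_rows.splitlines():
        parts = row.split()
        if len(parts) >= 4 and parts[0].endswith("%") and parts[2] == "[k]":
            total += float(parts[0].rstrip("%"))
    return total


print("ms-be1023 kernel share (top 5 rows):", round(kernel_share(MS_BE1023_TOP), 2))
print("ms-be1020 kernel share (top 5 rows):", round(kernel_share(MS_BE1020_TOP), 2))
```

Even on these excerpts, almost all of the kernel time on ms-be1023 comes from `native_queued_spin_lock_slowpath`, which is absent from the ms-be1020 listing.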

Mentioned in SAL (#wikimedia-operations) [2016-10-04T14:50:08Z] <godog> eqiad-prod: ms-be1022 to weight 1000 T136631

Mentioned in SAL (#wikimedia-operations) [2016-10-10T11:52:20Z] <godog> swift eqiad-prod: ms-be1022 to weight 2000 T136631

Mentioned in SAL (#wikimedia-operations) [2016-10-18T15:47:29Z] <godog> eqiad-prod: ms-be1022 to weight 3000 T136631

Reedy removed a subscriber: Reedy. Oct 18 2016, 3:52 PM

Mentioned in SAL (#wikimedia-operations) [2016-11-11T00:23:07Z] <godog> swift eqiad-prod: ms-be1027 to weight 1000 - T136631

Mentioned in SAL (#wikimedia-operations) [2016-11-21T19:28:53Z] <godog> swift eqiad-prod ms-be1027 to weight 250 - T136631

Mentioned in SAL (#wikimedia-operations) [2016-11-22T17:46:22Z] <godog> swift eqiad-prod: ms-be1027 to weight 1000 T136631

Mentioned in SAL (#wikimedia-operations) [2016-11-23T19:24:51Z] <godog> swift eqiad-prod: ms-be1027 to weight 2000 T136631

Mentioned in SAL (#wikimedia-operations) [2016-11-28T19:29:09Z] <godog> swift eqiad-prod: ms-be1027 to weight 3000 - T136631
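The gradual weight increases logged above (1000, then 2000, then 3000) follow a simple stepped ramp. A minimal sketch of such a schedule is below. The step size and target are read off the log entries, `ramp` is a hypothetical helper, and applying each weight to the ring is left out.

```python
def ramp(target, step, start=0):
    """Yield successive weights above start, ending exactly at target."""
    if step <= 0:
        raise ValueError("step must be positive")
    weight = start
    while weight < target:
        weight = min(weight + step, target)
        yield weight


# Matches the steps logged for ms-be1022 and the final steps for ms-be1027.
SCHEDULE = list(ramp(target=3000, step=1000))
print(SCHEDULE)
```

The log also shows ms-be1027 being dropped back to 250 before resuming, so in practice each step was presumably applied only after checking how the cluster coped with the previous one.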

fgiunchedi closed this task as Resolved. Dec 1 2016, 6:30 PM

This is complete!