Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | jcrespo | T206203 Implement database binary backups into the production infrastructure
Resolved | | jcrespo | T213406 Purchase and setup remaining hosts for database backups
Resolved | | RobH | T216175 HP Gen9 onboard controller review
Resolved | | fgiunchedi | T220787 Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool
Resolved | | jcrespo | T220572 Productionize eqiad and codfw source backup hosts & codfw backup test host
Resolved | | Papaul | T219463 rack/setup/install (5) codfw dedicated dump slaves
Resolved | | Papaul | T219461 rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups
Resolved | | Cmjohnson | T218985 rack/setup/install db1139\|db1140.eqiad.wmnet (2 dump slaves)

Unknown Object (Task)
Event Timeline
These new hosts have an HP P408i controller, and I have noticed this:
@MoritzMuehlenhoff is kindly taking a look :)
This is weird. Do we have a second server of that model for comparison? I don't even see the controller in lspci (it should identify as "Subsystem: Hewlett-Packard Company Smart Array P408i-a SR Gen10"), so I'd like to rule out a hardware/connection issue with that specific server.
@MoritzMuehlenhoff db2097 is online now and it is one of the new ones, same batch as db2102. You can also check there.
Keep in mind that even if the controller doesn't appear to be there, the storage on /srv looks good on both db2102 and db2097.
The RAID controller shows up in early device detection by the kernel:
```
[    4.385654] smartpqi 0000:5c:00.0: added scsi 0:1:0:0: Direct-Access HPE LOGICAL VOLUME RAID-1(ADM) SSDSmartPathCap+ En+ Exp+ qd=192
[    4.400970] scsi 0:2:0:0: RAID HPE P408i-a SR Gen10 1.98 PQ: 0 ANSI: 5
[    4.437509] smartpqi 0000:5c:00.0: added scsi 0:2:0:0: RAID HPE P408i-a SR Gen10 SSDSmartPathCap- En- Exp+ qd=0
```
But I don't see a RAID controller in "lspci -v" on either db2097 or db2102. This controller should be supported by the smartpqi driver, but even if the current smartpqi driver in Linux 4.9 did not support our model, it should still be listed in lspci.
@Papaul Are we maybe missing a step to enable the controller somewhere in the firmware/BIOS or maybe some method to actually expose it to the host?
@MoritzMuehlenhoff I can take a look. Since those are new Gen10 servers, there are a lot of changes in the BIOS regarding where to find things.
Thanks @Papaul!
Feel free to take either db2097 or db2102 down anytime you want to check them. They have no data.
@MoritzMuehlenhoff unfortunately I do not see anything helpful in the BIOS setting on db2097.
Doing some reading on the HP site to see if i can find anything.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00018944en_us
So, to sum up:
We can use the storage:
```
root@db2102:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   3.5T  3.6G  3.5T   1% /srv
root@db2102:~# touch /srv/test
root@db2102:~# rm /srv/test
root@db2102:~# fdisk -l
Disk /dev/sda: 3.5 TiB, 3840699359232 bytes, 7501365936 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 bytes
Disklabel type: gpt
Disk identifier: 308170B4-330A-43CA-850C-7B6F344BA9DC

Device        Start        End    Sectors  Size Type
/dev/sda1      2048   78125055   78123008 37.3G Linux filesystem
/dev/sda2  78125056   93749247   15624192  7.5G Linux swap
/dev/sda3  93749248 7501365247 7407616000  3.5T Linux LVM

Disk /dev/mapper/tank-data: 3.5 TiB, 3792695197696 bytes, 7407607808 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 bytes

root@db2102:~# smartctl --all /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HPE
Product:              LOGICAL VOLUME
Revision:             1.98
User Capacity:        3,840,699,359,232 bytes [3.84 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Rotation Rate:        Solid State Device
Logical Unit id:      0x600508b1001cd269aa739b4484818ff5
Serial number:        PEYHC0DRHBZ75K
Device type:          disk
Local Time is:        Thu Apr 11 16:50:58 2019 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported
Device does not support Self Test logging
```
But we cannot really see the controller or the disks with hpssacli:
```
root@db2102:~# hpssacli controller all show config

Error: No controllers detected. Possible causes:
       - The driver for the installed controller(s) is not loaded.
       - On LINUX, the scsi_generic (sg) driver module is not loaded.
         See the README file for more details.

root@db2102:~# lsmod | egrep -i "sg|hp"
sg                     32768  0
hpwdt                  16384  0
hpilo                  20480  0
shpchp                 36864  0
ipmi_msghandler        49152  2 ipmi_devintf,ipmi_si
scsi_mod              225280  5 smartpqi,sd_mod,ses,scsi_transport_sas,sg
```
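The lsmod output above already shows that the smartpqi driver is in use even though hpssacli finds no controller. As a sketch (not part of the original investigation), that driver check can be scripted; here it runs against a captured lsmod snippet, while on a live host you would pipe real `lsmod` output instead:

```shell
# Captured sample, taken from the lsmod output in this task.
lsmod_sample='sg                     32768  0
hpwdt                  16384  0
hpilo                  20480  0
shpchp                 36864  0
ipmi_msghandler        49152  2 ipmi_devintf,ipmi_si
scsi_mod              225280  5 smartpqi,sd_mod,ses,scsi_transport_sas,sg'

# Look for smartpqi as a whole word (it can appear as a scsi_mod user).
if printf '%s\n' "$lsmod_sample" | grep -qw smartpqi; then
    driver_status="smartpqi driver loaded"
else
    driver_status="smartpqi driver NOT loaded"
fi
echo "$driver_status"
```

This separates "driver missing" (the first cause hpssacli suggests) from "tool cannot talk to the controller", which is what turned out to be the case here.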
@Papaul have you double-checked that the RAID controller is not set up to work in HBA mode?
@MoritzMuehlenhoff I do see the controller with lsscsi:
```
root@db2102:~# lsscsi
[0:0:0:0]    enclosu HPE      Smart Adapter    1.98  -
[0:1:0:0]    disk    HPE      LOGICAL VOLUME   1.98  /dev/sda
[0:2:0:0]    storage HPE      P408i-a SR Gen10 1.98  -
```
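For completeness, a hedged sketch of how the controller line can be picked out of that lsscsi output mechanically (the sample is the capture above; on a live host you would pipe `lsscsi` directly):

```shell
# Captured sample, taken from the lsscsi output in this task.
lsscsi_sample='[0:0:0:0]    enclosu HPE      Smart Adapter    1.98  -
[0:1:0:0]    disk    HPE      LOGICAL VOLUME   1.98  /dev/sda
[0:2:0:0]    storage HPE      P408i-a SR Gen10 1.98  -'

# The controller exposes itself with device type "storage" (column 2);
# columns 3-6 are vendor and model.
controller=$(printf '%s\n' "$lsscsi_sample" \
    | awk '$2 == "storage" { print $3, $4, $5, $6 }')
echo "controller: $controller"
```

So the device is visible to the SCSI layer; only the HP tool fails to find it, which points at the tooling rather than the hardware.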
HPE renamed the tool, I installed "ssacli" and now "ssacli controller all show config" works fine.
Great catch @MoritzMuehlenhoff thanks!
I have created T220787 to follow up on the tooling and monitoring changes needed to adapt to the new Gen10 hosts.
Mentioned in SAL (#wikimedia-operations) [2019-04-12T07:12:24Z] <marostegui> Manually install ssacli on db2[097|098|099|100|101|102] T220787 T220572
Change 506697 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Productionize db2097 for backup source of s1 and s6
Change 506697 merged by Jcrespo:
[operations/puppet@production] mariadb: Productionize db2097 for backup source of s1 and s6
Mentioned in SAL (#wikimedia-operations) [2019-04-26T16:18:52Z] <jynus> stop s6 mariadb instance on dbstore2001 T220572
Mentioned in SAL (#wikimedia-operations) [2019-04-27T12:37:12Z] <jynus> stopping dbstore2002:s6 to clone it to db2097 T220572
Mentioned in SAL (#wikimedia-operations) [2019-04-27T12:37:46Z] <jynus> correcting last log, stopping dbstore2002:s1 to clone it to db2097 T220572
Change 506871 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications for db2097
Change 506871 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications for db2097
Change 506948 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Productionize db2098 for backup source of s2 and s3
Change 506948 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Productionize db2098 for backup source of s2 and s3
Mentioned in SAL (#wikimedia-operations) [2019-04-29T08:25:03Z] <jynus> stop dbstore2001:s2 for cloning to db2098 T220572
Change 506957 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Productionize db2099 for backup source of s4 and s5
Mentioned in SAL (#wikimedia-operations) [2019-04-29T09:13:51Z] <jynus> stop dbstore2002:s4 for cloning to db2099 T220572
Change 506957 merged by Jcrespo:
[operations/puppet@production] mariadb: Productionize db2099 for backup source of s4 and s5
Mentioned in SAL (#wikimedia-operations) [2019-04-29T12:07:14Z] <jynus> stop dbstore2002:s3 and dbstore2001:s5 for cloning to db2098/99 T220572
Change 507269 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1
Mentioned in SAL (#wikimedia-operations) [2019-04-30T10:08:42Z] <jynus> stop s7 and x1 instances on dbstore2* for cloning T220572
Change 507269 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1
Mentioned in SAL (#wikimedia-operations) [2019-04-30T15:18:39Z] <jynus> stop s8 instance on dbstore2001 for cloning to db2100 T220572
Change 507407 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications for db2098-db2101
Change 507407 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications for db2098-db2101
Change 507746 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Set db2102 as a backup test host on codfw
Change 507746 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Set db2102 as a backup test host on codfw
Mentioned in SAL (#wikimedia-operations) [2019-05-02T09:07:01Z] <jynus> reboot db2102 T220572
Mentioned in SAL (#wikimedia-operations) [2019-05-02T09:41:57Z] <jynus> testing backups on db2102 (increased network and disk usage) T220572
db2102 is set up, pending data loading, which is being done now while at the same time testing the latest recover_dump.py version and the generated backup.
Mentioned in SAL (#wikimedia-operations) [2019-05-02T12:42:09Z] <jynus> stopping several instances at dbstore1001 to clone them to db1139/40 T220572
eqiad is complete too, pending only possible recompressions to save space, like most of the codfw servers here.
I didn't realize how slow myloader is with InnoDB compression; it may take a few more hours to load from the logical backup.
This is close to being finished; the most important pending item is the reconfiguration of the backup sources.
Change 507867 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Setup db1139 and db1140 as the new eqiad backup sources
Change 507867 merged by Jcrespo:
[operations/puppet@production] mariadb: Setup db1139 and db1140 as the new eqiad backup sources
Mentioned in SAL (#wikimedia-operations) [2019-05-03T08:27:24Z] <jynus> starting table recompression on new backup source hosts on eqiad and codfw (stop replication) T220572
Change 507925 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications on db1139 and db1140
Change 507942 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add db1139 and db1140 mysql instances to prometheus
Change 507942 merged by Jcrespo:
[operations/puppet@production] mariadb: Add db1139 and db1140 mysql instances to prometheus
All the hosts have been set up and provisioned. The only pending patch to deploy is https://gerrit.wikimedia.org/r/507925. There are, however, a few iterations of table optimization and compression still to run.
The db2102 logical load took 3161m43.477s (over 2 days 4 hours), but it resulted in the MySQL datadir taking only 827GB before start and 859GB after a day of replication, while the current compressed enwiki hosts take more like 976G, so there could be some room for defragmentation to save on backup space.
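As a back-of-the-envelope check of that claim (numbers taken from the figures above; "G" used loosely as in the original):

```shell
current=976   # GB used by the current compressed enwiki hosts
reloaded=859  # GB after the logical reload plus a day of replication

# Absolute and relative savings from reloading/defragmenting.
savings=$((current - reloaded))
pct=$(( (savings * 100) / current ))
echo "potential savings: ${savings}G (~${pct}%)"
```

That is roughly a tenth of the enwiki footprint recoverable per host, which is what makes the extra recompression/defragmentation passes worthwhile.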
Change 507925 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications on db1139 and db1140