Productionize eqiad and codfw source backup hosts & codfw backup test host
Closed, Resolved (Public)

Description

The following hosts need to be productionized:

Codfw backup source hosts:

  • db2097
  • db2098
  • db2099
  • db2100
  • db2101

Codfw backup test host:

  • db2102

Eqiad backup source hosts:

  • db1139
  • db1140

Event Timeline

These new hosts have an HP P408i controller, and I have noticed this:

root@db2102:~# hpssacli controller all show config

Error: No controllers detected. Possible causes:
 - The driver for the installed controller(s) is not loaded.
 - On LINUX, the scsi_generic (sg) driver module is not loaded.
 See the README file for more details

root@db2102:~# lsmod | grep sg
sg 32768 0
ipmi_msghandler 49152 2 ipmi_devintf,ipmi_si
scsi_mod 225280 5 smartpqi,sd_mod,ses,scsi_transport_sas,sg

@MoritzMuehlenhoff is kindly taking a look :)

This is weird. Do we have a second server of that model for comparison? I don't even see the controller in lspci (it should identify as "Subsystem: Hewlett-Packard Company Smart Array P408i-a SR Gen10"), so I'd like to rule out a hardware/connection issue with that specific server.

We will have them today or tomorrow as Papaul is installing them right now.

@MoritzMuehlenhoff db2097 is online now and it is one of the new ones, same batch as db2102. You can also check there.
Keep in mind that even if the controller doesn't appear to be there, the storage on /srv looks good on both db2102 and db2097.

The RAID controller shows up in early device detection by the kernel:

[    4.385654] smartpqi 0000:5c:00.0: added scsi 0:1:0:0: Direct-Access     HPE      LOGICAL VOLUME   RAID-1(ADM)  SSDSmartPathCap+ En+ Exp+ qd=192
[    4.400970] scsi 0:2:0:0: RAID              HPE      P408i-a SR Gen10 1.98 PQ: 0 ANSI: 5
[    4.437509] smartpqi 0000:5c:00.0: added scsi 0:2:0:0: RAID              HPE      P408i-a SR Gen10              SSDSmartPathCap- En- Exp+ qd=0

But I don't see a RAID controller in "lspci -v" on either db2097 or db2102. This controller should be supported by the smartpqi driver, but even if the current smartpqi driver in Linux 4.9 did not support our model, it should still be listed in lspci.

@Papaul Are we maybe missing a step to enable the controller somewhere in the firmware/BIOS or maybe some method to actually expose it to the host?

@MoritzMuehlenhoff I can take a look. Since those are new Gen10 servers, there are a lot of changes in the BIOS regarding where to find things.

Thanks @Papaul!
Feel free to take either db2097 or db2102 down anytime you want to check them. They have no data.

@MoritzMuehlenhoff unfortunately I do not see anything helpful in the BIOS settings on db2097.

So, to sum up:
We can use the storage:

root@db2102:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   3.5T  3.6G  3.5T   1% /srv
root@db2102:~# touch /srv/test
root@db2102:~# rm /srv/test
root@db2102:~#
root@db2102:~# fdisk -l
Disk /dev/sda: 3.5 TiB, 3840699359232 bytes, 7501365936 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 bytes
Disklabel type: gpt
Disk identifier: 308170B4-330A-43CA-850C-7B6F344BA9DC

Device        Start        End    Sectors  Size Type
/dev/sda1      2048   78125055   78123008 37.3G Linux filesystem
/dev/sda2  78125056   93749247   15624192  7.5G Linux swap
/dev/sda3  93749248 7501365247 7407616000  3.5T Linux LVM


Disk /dev/mapper/tank-data: 3.5 TiB, 3792695197696 bytes, 7407607808 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 bytes

root@db2102:~# smartctl  --all /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HPE
Product:              LOGICAL VOLUME
Revision:             1.98
User Capacity:        3,840,699,359,232 bytes [3.84 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Rotation Rate:        Solid State Device
Logical Unit id:      0x600508b1001cd269aa739b4484818ff5
Serial number:        PEYHC0DRHBZ75K
Device type:          disk
Local Time is:        Thu Apr 11 16:50:58 2019 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging

But we cannot really see the controller or the disks with hpssacli:

root@db2102:~# hpssacli controller all show config

Error: No controllers detected. Possible causes:
       	- The driver for the installed controller(s) is not loaded.
       	- On LINUX, the scsi_generic (sg) driver module is not loaded.
       	See the README file for more details.

root@db2102:~# lsmod | egrep -i "sg|hp"
sg                     32768  0
hpwdt                  16384  0
hpilo                  20480  0
shpchp                 36864  0
ipmi_msghandler        49152  2 ipmi_devintf,ipmi_si
scsi_mod              225280  5 smartpqi,sd_mod,ses,scsi_transport_sas,sg

@Papaul have you double-checked that the RAID controller is not set up to work in HBA mode?

@MoritzMuehlenhoff I do see the controller with lsscsi:

root@db2102:~# lsscsi
[0:0:0:0]    enclosu HPE      Smart Adapter    1.98  -
[0:1:0:0]    disk    HPE      LOGICAL VOLUME   1.98  /dev/sda
[0:2:0:0]    storage HPE      P408i-a SR Gen10 1.98  -
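
For anyone scripting this check, here is a minimal Python sketch of how the lsscsi output above could be parsed to confirm that the kernel sees a RAID controller. The `ScsiDevice` type, the `parse_lsscsi` helper, and the regular expression are illustrative assumptions, not part of any existing tooling, and the parser only targets the default column layout shown above.

```python
import re
from typing import List, NamedTuple


class ScsiDevice(NamedTuple):
    """One row of default `lsscsi` output (hypothetical helper type)."""
    address: str      # e.g. "0:2:0:0"
    dev_type: str     # e.g. "disk", "storage", "enclosu"
    description: str  # vendor, model and revision columns, unsplit
    node: str         # device node, or "-" when there is none


# Assumption: "[H:C:T:L]  type  vendor model rev  node", as in the output above.
LSSCSI_LINE = re.compile(
    r"^\[(?P<addr>[\d:]+)\]\s+(?P<type>\S+)\s+(?P<rest>.*?)\s+(?P<node>\S+)\s*$"
)


def parse_lsscsi(output: str) -> List[ScsiDevice]:
    """Parse default lsscsi output into a list of ScsiDevice rows."""
    devices = []
    for line in output.splitlines():
        match = LSSCSI_LINE.match(line.strip())
        if match:
            devices.append(
                ScsiDevice(match["addr"], match["type"], match["rest"], match["node"])
            )
    return devices


# Sample copied from the db2102 output above.
SAMPLE = """\
[0:0:0:0]    enclosu HPE      Smart Adapter    1.98  -
[0:1:0:0]    disk    HPE      LOGICAL VOLUME   1.98  /dev/sda
[0:2:0:0]    storage HPE      P408i-a SR Gen10 1.98  -
"""

# lsscsi reports RAID controllers with the "storage" type.
controllers = [d for d in parse_lsscsi(SAMPLE) if d.dev_type == "storage"]
print(controllers)
```

This only shows that the SCSI layer sees the controller; it says nothing about whether the vendor CLI can talk to it.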

I tried kernel 4.19 on db2102; it doesn't make a difference.

HPE renamed the tool, I installed "ssacli" and now "ssacli controller all show config" works fine.
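
Since older hosts may still carry "hpssacli" while Gen10 hosts need "ssacli", a wrapper could pick whichever binary is installed. The following is a rough Python sketch under that assumption; `find_raid_cli` and `show_controller_config` are hypothetical names, and the only command line used is the "controller all show config" invocation quoted in this task.

```python
import shutil
import subprocess
from typing import Callable, Optional, Sequence

# Prefer the new tool name, fall back to the old one.
CANDIDATE_TOOLS = ("ssacli", "hpssacli")


def find_raid_cli(
    candidates: Sequence[str] = CANDIDATE_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first candidate tool found on PATH, or None."""
    for name in candidates:
        if which(name):
            return name
    return None


def show_controller_config(tool: str) -> str:
    """Run '<tool> controller all show config' and return its stdout."""
    result = subprocess.run(
        [tool, "controller", "all", "show", "config"],
        capture_output=True,
        text=True,
        check=False,  # the tool may exit non-zero when no controller is detected
    )
    return result.stdout


# Example with a fake lookup, so the sketch runs without either tool installed:
# pretend only the old binary exists on this host.
def only_old(name: str) -> Optional[str]:
    return "/usr/sbin/hpssacli" if name == "hpssacli" else None


print(find_raid_cli(which=only_old))
```

On a real host one would call `find_raid_cli()` with the default lookup and pass the result to `show_controller_config`.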

Great catch @MoritzMuehlenhoff thanks!
I have created T220787 to follow up on the changes our tools and monitoring need in order to adapt to the new Gen10 servers.

Mentioned in SAL (#wikimedia-operations) [2019-04-12T07:12:24Z] <marostegui> Manually install ssacli on db2[097|098|099|100|101|102] T220787 T220572

Change 506697 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Productionize db2097 for backup source of s1 and s6

https://gerrit.wikimedia.org/r/506697

Change 506697 merged by Jcrespo:
[operations/puppet@production] mariadb: Productionize db2097 for backup source of s1 and s6

https://gerrit.wikimedia.org/r/506697

Mentioned in SAL (#wikimedia-operations) [2019-04-26T16:18:52Z] <jynus> stop s6 mariadb instance on dbstore2001 T220572

Mentioned in SAL (#wikimedia-operations) [2019-04-27T12:37:12Z] <jynus> stopping dbstore2002:s6 to clone it to db2097 T220572

Mentioned in SAL (#wikimedia-operations) [2019-04-27T12:37:46Z] <jynus> correcting last log, stopping dbstore2002:s1 to clone it to db2097 T220572

Change 506871 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications for db2097

https://gerrit.wikimedia.org/r/506871

Recompression is still ongoing on db2097, but technically the host is done.

Change 506871 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications for db2097

https://gerrit.wikimedia.org/r/506871

Change 506948 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Productionize db2098 for backup source of s2 and s3

https://gerrit.wikimedia.org/r/506948

Change 506948 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Productionize db2098 for backup source of s2 and s3

https://gerrit.wikimedia.org/r/506948

Mentioned in SAL (#wikimedia-operations) [2019-04-29T08:25:03Z] <jynus> stop dbstore2001:s2 for cloning to db2098 T220572

Change 506957 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Productionize db2099 for backup source of s4 and s5

https://gerrit.wikimedia.org/r/506957

Mentioned in SAL (#wikimedia-operations) [2019-04-29T09:13:51Z] <jynus> stop dbstore2002:s4 for cloning to db2099 T220572

Change 506957 merged by Jcrespo:
[operations/puppet@production] mariadb: Productionize db2099 for backup source of s4 and s5

https://gerrit.wikimedia.org/r/506957

Mentioned in SAL (#wikimedia-operations) [2019-04-29T12:07:14Z] <jynus> stop dbstore2002:s3 and dbstore2001:s5 for cloning to db2098/99 T220572

db2098 and db2099 are done, although they need recompression (especially s3).

Change 507269 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1

https://gerrit.wikimedia.org/r/507269

Mentioned in SAL (#wikimedia-operations) [2019-04-30T10:08:42Z] <jynus> stop s7 and x1 instances on dbstore2* for cloning T220572

Change 507269 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1

https://gerrit.wikimedia.org/r/507269

Mentioned in SAL (#wikimedia-operations) [2019-04-30T15:18:39Z] <jynus> stop s8 instance on dbstore2001 for cloning to db2100 T220572

Change 507407 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications for db2098-db2101

https://gerrit.wikimedia.org/r/507407

Change 507407 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications for db2098-db2101

https://gerrit.wikimedia.org/r/507407

Change 507746 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Set db2102 as a backup test host on codfw

https://gerrit.wikimedia.org/r/507746

Change 507746 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Set db2102 as a backup test host on codfw

https://gerrit.wikimedia.org/r/507746

Mentioned in SAL (#wikimedia-operations) [2019-05-02T09:41:57Z] <jynus> testing backups on db2102 (increased network and disk usage) T220572

db2102 is set up, pending data loading, which is being done now while testing the latest recover_dump.py version and the generated backup at the same time.

Mentioned in SAL (#wikimedia-operations) [2019-05-02T12:42:09Z] <jynus> stopping several instances at dbstore1001 to clone them to db1139/40 T220572

eqiad is complete too; like most of the codfw servers here, only possible recompressions to save space are pending.

I didn't realize how slow myloader is with InnoDB compression; it may take a few more hours to load from the logical backup.

This is close to being finished; the most important pending item is the reconfiguration of backup sources.

Change 507867 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Setup db1139 and db1140 as the new eqiad backup sources

https://gerrit.wikimedia.org/r/507867

Change 507867 merged by Jcrespo:
[operations/puppet@production] mariadb: Setup db1139 and db1140 as the new eqiad backup sources

https://gerrit.wikimedia.org/r/507867

Mentioned in SAL (#wikimedia-operations) [2019-05-03T08:27:24Z] <jynus> starting table recompression on new backup source hosts on eqiad and codfw (stop replication) T220572

Change 507925 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications on db1139 and db1140

https://gerrit.wikimedia.org/r/507925

Change 507942 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add db1139 and db1140 mysql instances to prometheus

https://gerrit.wikimedia.org/r/507942

Change 507942 merged by Jcrespo:
[operations/puppet@production] mariadb: Add db1139 and db1140 mysql instances to prometheus

https://gerrit.wikimedia.org/r/507942

All the hosts have been set up and provisioned. The only pending patch to deploy is https://gerrit.wikimedia.org/r/507925. There are, however, a few iterations of table optimization and compression still to run.

The db2102 logical load took 3161m43.477s (over 2 days 4 hours), but it resulted in the MySQL datadir taking only 827GB before replication started, and 859GB after a day of replication, while current compressed enwiki hosts take more like 976G, so there could be some room for defragmentation to save on backup space.
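
As a quick sanity check of the figures above, this small Python sketch converts the "time"-style duration into days and hours and estimates the space saved relative to a 976G host. The helper names are made up for illustration, and it assumes "GB" and "G" in the comment refer to the same unit.

```python
import re
from typing import Tuple


def parse_time_output(real: str) -> float:
    """Convert a shell `time`-style string such as '3161m43.477s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if match is None:
        raise ValueError(f"unrecognised duration: {real!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def split_duration(seconds: float) -> Tuple[int, int, int, int]:
    """Split seconds into whole (days, hours, minutes, seconds)."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return days, hours, minutes, secs


def saving_percent(smaller: float, larger: float) -> float:
    """Relative saving of `smaller` versus `larger`, in percent."""
    return 100.0 * (larger - smaller) / larger


days, hours, minutes, secs = split_duration(parse_time_output("3161m43.477s"))
print(f"load time: {days}d {hours}h {minutes}m {secs}s")

# Assumption: 827GB, 859GB and 976G are all expressed in the same unit.
print(f"after load:        {saving_percent(827, 976):.1f}% smaller")
print(f"after replication: {saving_percent(859, 976):.1f}% smaller")
```

Under those assumptions, the freshly loaded datadir is roughly 15% smaller than a current compressed enwiki host, and still about 12% smaller after a day of replication.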

Change 507925 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications on db1139 and db1140

https://gerrit.wikimedia.org/r/507925

Compression has finished for these hosts.