
Investigation if Fluorine needs bigger disks or we retain too much data
Closed, ResolvedPublic

Description

I just chopped down the log-retention time on fluorine, but this problem is likely to keep recurring. If we had a server with 4 TB instead of 2 TB we'd be able to go back to 180 days of logs.

I don't know if it's possible to expand fluorine in place or if we need to move services over to a new box.

Event Timeline

Andrew raised the priority of this task from to High.
Andrew updated the task description.
Andrew added subscribers: Andrew, gerritbot, hoo and 6 others.

We don't have any 4TB disks on site, but they could be ordered. What is the overall capacity and raid requirements for the logging server?

I forgot to ask speed requirements for disks.

This might be moot -- sounds like we're maybe just retaining way more logs than anyone actually wants. Stay tuned...

yuvipanda lowered the priority of this task from High to Medium.Mar 18 2015, 7:55 AM

Back to normal, since there don't seem to have been any alerts of late.

(Assigning to @Andrew as the last person to have touched it :) )

No need to do anything about this yet, I still need to gather info.

I'm going to pull the hardware-requests project off this, and update the subject of the task accordingly.

RobH renamed this task from Fluorine needs bigger disks to Investigation if Fluorine needs bigger disks or we retain too much data.Apr 16 2015, 6:10 PM
RobH removed a project: hardware-requests.

From the sound of the last updates to T88393, I think we do still need to buy some bigger disks here, right?

fgiunchedi added a subscriber: fgiunchedi.

(claiming it, I'll take a look)

unclear if we'll need the disks; let's wait and see how big api.log gets once rotated

so api.log went from 35M to 18G

-rw-r--r-- 1 udp2log udp2log 35M Apr 28 06:25 api.log-20150428.gz
-rw-r--r-- 1 udp2log udp2log 35M Apr 29 06:25 api.log-20150429.gz
-rw-r--r-- 1 udp2log udp2log 18G Apr 30 06:25 api.log-20150430.gz

meaning we can store ~47 days of api.log at that rate; let's say even 30 days, since as @Anomie suggested there wouldn't be much disk available otherwise
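As a back-of-the-envelope check (numbers taken from the figures above: ~18G of compressed api.log per day, and the df output below showed 781G available on /a at the time), the retention estimate is just free space divided by daily volume:

```shell
# Rough retention estimate for api.log, assuming the observed sizes hold.
avail_gb=781   # free space on /a per df (illustrative snapshot)
daily_gb=18    # compressed api.log per day after the growth
echo "$(( avail_gb / daily_gb )) days"   # → 43 days
```

This lands in the same ballpark as the ~47 day figure; the exact number depends on which free-space snapshot you plug in.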

partitioning on fluorine is a mess

fluorine:/a/mw-log/archive$ df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/md1              74G   17G   54G  24% /
udev                 3.9G  4.0K  3.9G   1% /dev
tmpfs                798M  356K  798M   1% /run
none                 5.0M     0  5.0M   0% /run/lock
none                 3.9G     0  3.9G   0% /run/shm
/dev/mapper/vg0-lv0  1.9T  1.1T  781G  59% /a
fluorine:/a/mw-log/archive$ sudo pvs
  PV         VG   Fmt  Attr PSize   PFree  
  /dev/md2   vg0  lvm2 a-   383.80g 383.80g
  /dev/md3   vg0  lvm2 a-     1.82t 376.00m
fluorine:/a/mw-log/archive$ cat /proc/partitions 
major minor  #blocks  name

   8        0  488386584 sda
   8        1          1 sda1
   8        2   78125056 sda2
   8        3  402448384 sda3
   8        5    7811072 sda5
   8       16  488386584 sdb
   8       17    7811072 sdb1
   8       18   78125056 sdb2
   8       19          1 sdb3
   8       21  402447360 sdb5
   8       32 1953514584 sdc
   8       33 1953513472 sdc1
   8       48 1953514584 sdd
   8       49 1953514542 sdd1
   9        3 1953512312 md3
   9        0    7810036 md0
   9        2  402446200 md2
   9        1   78123960 md1
 252        0 1953124352 dm-0
fluorine:/a/mw-log/archive$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 sdb2[1] sda2[0]
      78123960 blocks super 1.2 [2/2] [UU]
      
md2 : active raid1 sdb5[1] sda3[0]
      402446200 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda5[0] sdb1[1]
      7810036 blocks super 1.2 [2/2] [UU]
      
md3 : active raid1 sdc1[0] sdd1[1]
      1953512312 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
fluorine:/a/mw-log/archive$

so, plan to do this online:

  • swap sda with 2tb disk
  • rebuild sda from sdb with same partitioning scheme
  • swap sdb with 2tb disk
  • rebuild sdb from sda
  • expand the data partitions (sda3 and sdb5) to use the full 2tb
  • extend vg0 to span md2 too
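A hedged sketch of one iteration of the online swap, with device and array names taken from the mdstat output above (not a tested procedure; run as root, and note the old sda/sdb partition numbering differs, see /proc/partitions):

```shell
# Mark the outgoing disk's partitions failed and pull them from each array:
mdadm /dev/md0 --fail /dev/sda5 --remove /dev/sda5
mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2
mdadm /dev/md2 --fail /dev/sda3 --remove /dev/sda3
# Hot-swap the disk, partition the new one (e.g. by copying the survivor's
# MBR layout -- only a sketch, since sda and sdb are partitioned differently):
sfdisk -d /dev/sdb | sfdisk /dev/sda
# Re-add the new partitions so the RAID1 mirrors rebuild from sdb:
mdadm /dev/md0 --add /dev/sda5
mdadm /dev/md1 --add /dev/sda2
mdadm /dev/md2 --add /dev/sda3
cat /proc/mdstat   # wait for the resync to finish before touching sdb
```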

offline:

  • find and setup a spare machine with right specs (4 raid10 2tb/3tb disks?)
  • transfer data from fluorine
  • switchover (?)

@Cmjohnson do we have spare 2tb/3tb disks that could fit into fluorine?

*-disk:0
     description: ATA Disk
     product: SAMSUNG HE502HJ
     physical id: 0
     bus info: scsi@0:0.0.0
     logical name: /dev/sda
     version: 1AJ3
     serial: S2B6J90ZC12911
     size: 465GiB (500GB)
     capabilities: partitioned partitioned:dos
     configuration: ansiversion=5 signature=00054da8
8        0 2930266584 sda
8        1    7811072 sda1
8        2   78125056 sda2
8        3 1867188224 sda3
8        4  977140736 sda4

@Cmjohnson let's swap sdb today as early as possible (i.e. before SWAT)

sdb swapped, root and swap arrays rebuilt already, data arrays rebuilding

md5 : active raid1 sdb4[1] sda4[0]
      977009472 blocks super 1.2 [2/2] [UU]
      	resync=DELAYED
      
md4 : active raid1 sdb3[1] sda3[0]
      1867056960 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.0% (1522624/1867056960) finish=449.2min speed=69210K/sec

note: due to the msdos (MBR) partition table's per-partition size limit, each disk carries two data partitions (2TB + 1TB) instead of a single 3TB one. both will be joined to the existing vg0
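The limit behind the split: a classic msdos/MBR partition table stores 32-bit sector counts, so with 512-byte sectors a single partition tops out at 2 TiB:

```shell
# 2^32 sectors * 512 bytes/sector = the MBR per-partition ceiling, in GiB
echo "$(( (1 << 32) * 512 / 1024 / 1024 / 1024 )) GiB"   # → 2048 GiB (2 TiB)
```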

ok gave another 500G to /a on fluorine

fluorine:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md5 : active raid1 sdb4[1] sda4[0]
      977009472 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdb3[1] sda3[0]
      1867056960 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdb1[3] sda1[2]
      7810036 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[3] sda2[2]
      78123960 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdd1[1] sdc1[0]
      1953512312 blocks super 1.2 [2/2] [UU]

unused devices: <none>
root@fluorine:~# pvcreate /dev/md4
  Physical volume "/dev/md4" successfully created
root@fluorine:~# vgextend --verbose vg0 /dev/md4
    Checking for volume group "vg0"
    Archiving volume group "vg0" metadata (seqno 12).
    Wiping cache of LVM-capable devices
    Adding physical volume '/dev/md4' to volume group 'vg0'
    Volume group "vg0" will be extended by 1 new physical volumes
    Creating volume group backup "/etc/lvm/backup/vg0" (seqno 13).
  Volume group "vg0" successfully extended
root@fluorine:~# lvresize -rv --size +500G vg0/lv0
    Finding volume group vg0
    Executing: fsadm --verbose check /dev/vg0/lv0
fsadm: "xfs" filesystem found on "/dev/mapper/vg0-lv0"
fsadm: Skipping filesystem check for device "/dev/mapper/vg0-lv0" as the filesystem is mounted on /a
    fsadm failed: 3
    Archiving volume group "vg0" metadata (seqno 13).
  Extending logical volume lv0 to 2.31 TiB
    Found volume group "vg0"
    Found volume group "vg0"
    Loading vg0-lv0 table (252:0)
    Suspending vg0-lv0 (252:0) with device flush
    Found volume group "vg0"
    Resuming vg0-lv0 (252:0)
    Creating volume group backup "/etc/lvm/backup/vg0" (seqno 14).
  Logical volume lv0 successfully resized
    Executing: fsadm --verbose resize /dev/vg0/lv0 2477412352K
fsadm: "xfs" filesystem found on "/dev/mapper/vg0-lv0"
fsadm: Device "/dev/mapper/vg0-lv0" size is 2536870248448 bytes
fsadm: Parsing xfs_info "/a"
fsadm: Resizing Xfs mounted on "/a" to fill device "/dev/mapper/vg0-lv0"
fsadm: Executing xfs_growfs /a
meta-data=/dev/mapper/vg0-lv0    isize=256    agcount=4, agsize=122070272 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=488281088, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=238418, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 488281088 to 619353088
root@fluorine:~# df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/md1              74G   17G   54G  24% /
udev                 3.9G  4.0K  3.9G   1% /dev
tmpfs                798M  352K  798M   1% /run
none                 5.0M     0  5.0M   0% /run/lock
none                 3.9G     0  3.9G   0% /run/shm
/dev/mapper/vg0-lv0  2.4T  1.8T  585G  76% /a

new disks on fluorine, resolving; let's follow up on related T88393

also note that I've added only one raid1 (md4) as a PV in vg0 at the moment; we can add the other (md5) if needed

root@fluorine:/a/mw-log/archive# pvs
  PV         VG   Fmt  Attr PSize   PFree  
  /dev/md3   vg0  lvm2 a-     1.82t      0 
  /dev/md4   vg0  lvm2 a-     1.74t 780.93g
  /dev/md5        lvm2 a-   931.75g 931.75g
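For the record, if /a runs low again, md5 (already an LVM PV per the pvs output above, just not in any VG) could be added the same way md4 was; a sketch, run as root:

```shell
vgextend vg0 /dev/md5              # md5 is already a PV, only needs joining vg0
lvresize -r -l +100%FREE vg0/lv0   # -r grows the mounted XFS via fsadm/xfs_growfs
```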