
Improve logging for HDFS Namenodes
Closed, Declined · Public

Description

The current setup of LVM volumes on the namenodes is not optimal:

elukey@an-master1001:~$ df -h
Filesystem                            Size  Used Avail Use% Mounted on
/dev/md0                               46G   31G   13G  72% /
/dev/mapper/an--master1001--vg-lvol0  173G  8.0G  165G   5% /var/lib/hadoop/name

elukey@an-master1001:~$ sudo lvs
  LV    VG               Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol0 an-master1001-vg -wi-ao---- 175.95g

elukey@an-master1001:~$ sudo pvs
  PV         VG               Fmt  Attr PSize   PFree
  /dev/md2   an-master1001-vg lvm2 a--  175.95g    0
elukey@an-master1002:~$ df -h
Filesystem                               Size  Used Avail Use% Mounted on
/dev/md0                                  46G   19G   25G  44% /
/dev/mapper/an--master1002--vg-backup    138G  119G   20G  87% /srv
/dev/mapper/an--master1002--vg-namenode   35G  8.1G   27G  24% /var/lib/hadoop/name

elukey@an-master1002:~$ sudo lvs
  LV       VG               Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  backup   an-master1002-vg -wi-ao---- 140.75g
  namenode an-master1002-vg -wi-ao----  35.19g

elukey@an-master1002:~$ sudo pvs
  PV         VG               Fmt  Attr PSize   PFree
  /dev/md2   an-master1002-vg lvm2 a--  175.95g    0

On an-master1001 we don't really use the LVM volume, and on an-master1002 we still use the vg-backup volume that shouldn't be needed anymore (it is an LVM snapshot from an-coord1001, created before we had the MySQL replication to db1108). Moreover, we store the /var/log/hadoop-hdfs/* logs on the root partition, which is not great since it is tiny.

What we should do is:

  • resize/remove the LVM volumes/partitions that are not needed.
  • consider putting /var/log/hadoop-hdfs on an LVM volume, and increase the logging retention (hdfs-audit.log, hdfs-namenode.log, etc.)
  • verify the partman config of these nodes and how it will change with the Buster migration (the SRE team has standardized a lot of partman recipes; the default is now to have data under /srv).
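For the retention part, a hedged sketch of where the knobs would live. The appender names (RFA for daemon logs, RFAAUDIT for hdfs-audit.log) are as in stock Hadoop log4j.properties; the exact names and current values on our hosts come from puppet, so treat everything below as an assumption to verify, not a recipe:

```shell
# Retention knobs in /etc/hadoop/conf/log4j.properties (sketch only;
# appender names from stock Hadoop configs, values are just examples):
#   log4j.appender.RFA.MaxFileSize=256MB
#   log4j.appender.RFA.MaxBackupIndex=20        # daemon logs (hdfs-namenode.log, ...)
#   log4j.appender.RFAAUDIT.MaxBackupIndex=30   # hdfs-audit.log
```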

Event Timeline

After today's ops-sync, here's what we want: to put the Hadoop logs on their own LVM partition.

  • Create a new LVM partition and mount it at /var/log/hadoop
  • Symlink /var/log/hadoop-* into /var/log/hadoop, e.g. /var/log/hadoop-hdfs -> /var/log/hadoop/hadoop-hdfs
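The two steps above could look roughly like this. The VG name, LV name, and size are placeholders (there is currently no free space in the VGs, see below), so the LVM commands are shown as comments; the symlink layout itself is demonstrated in a scratch directory so it can be exercised without root:

```shell
# On the real host (root; VG name, LV name and size are assumptions):
#   lvcreate -L 30G -n hadooplogs an-master1001-vg
#   mkfs.ext4 /dev/an-master1001-vg/hadooplogs
#   mount /dev/an-master1001-vg/hadooplogs /var/log/hadoop

# The symlink layout, demonstrated under a temp dir standing in for /:
root=$(mktemp -d)
mkdir -p "$root/var/log/hadoop/hadoop-hdfs"
ln -s "$root/var/log/hadoop/hadoop-hdfs" "$root/var/log/hadoop-hdfs"
# Daemons keep writing to /var/log/hadoop-hdfs; the data lands on the LV mount:
echo audit-line >> "$root/var/log/hadoop-hdfs/hdfs-audit.log"
cat "$root/var/log/hadoop/hadoop-hdfs/hdfs-audit.log"
```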

@elukey after the new LVM partition is created, should it be declared in puppet with a mount resource? And should we declare the directories and symlinks in puppet as well?

Change 661391 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] hadoop: Add hiera setting to symlink hadoop logs to /var/log/hadoop

https://gerrit.wikimedia.org/r/661391

@Ottomata and I discussed next steps for this ticket, and came up with the following:

  • Create a puppet patch that allows a hiera setting for symlinking hadoop logs into /var/log/hadoop (https://gerrit.wikimedia.org/r/661391)
  • On the test cluster, on the standby master node, we can manually make /var/log/hadoop a directory which will behave like a mounted logical volume, then use hiera to enable linking hadoop logs into /var/log/hadoop and test that logging works as it should
  • On the main cluster, on the standby node (an-master1002), we can stop the hadoop daemons, manually mount a logging logical volume to /var/log/hadoop, temporarily backup the logs, enable the symlinks via hiera, copy the logs back, and restart the hadoop daemons
  • Promote the standby node to active and repeat for an-master1001

These are a lot of manual steps; the plan can be expanded if the high-level approach looks good. What do you think @elukey?

@Ottomata and I discussed next steps for this ticket, and came up with the following:

Added some comments, thanks for kicking off the work!

  • On the test cluster, on the standby master node, we can manually make /var/log/hadoop a directory which will behave like a mounted logical volume, then use hiera to enable linking hadoop logs into /var/log/hadoop and test that logging works as it should

I'd prefer that we test the whole procedure properly in Hadoop test. I quickly checked, and the /srv partition is probably not really used; it doesn't contain anything. You could test your procedure of creating logical volumes there, and apply the symlink change in puppet as well.

  • On the main cluster, on the standby node (an-master1002), we can stop the hadoop daemons, manually mount a logging logical volume to /var/log/hadoop, temporarily backup the logs, enable the symlinks via hiera, copy the logs back, and restart the hadoop daemons
  • Promote the standby node to active and repeat for an-master1001

The main doubt that I have is what to do with a use case like an-master1001, where there is no space left for new logical volumes. As the task description shows, /var/lib/hadoop/name is clearly oversized and should be shrunk if we want to add another logical volume and partition, since there is no free space left (see the output of pvs). The same goes for an-master1002: there is no space left for new volumes.

Maybe I am missing something, but what is the idea? I thought that we agreed on resizing the existing volumes first (that is the difficult part of the task), but I may have missed something. Let me know :)

I'd prefer that we test the whole procedure properly in Hadoop test. I quickly checked, and the /srv partition is probably not really used; it doesn't contain anything. You could test your procedure of creating logical volumes there, and apply the symlink change in puppet as well.

Ah cool, we talked about testing this there but just by creating the dir, not the LVM partition. If we can do LVM there, then cool!

I have a plan for this task, let me know your thoughts!

Assumption: the SRE team decided to make /srv the canonical place for RAID-based volumes/partitions and for important data that must be preserved (on a more reliable medium). All the partman recipes now follow this convention, so in my opinion we should try to adapt for simple use cases like this one.

  • an-master1002
  1. stop all daemons
  2. copy the content of /var/lib/hadoop/name to /srv
  3. delete the namenode LV, expand the backup one and rename it to -srv
  4. expand the (renamed) -srv LV's ext4 partition
  5. instruct the hdfs namenode to look into /srv for the namenode's data (as we do in Hadoop test).
  • an-master1001
  1. stop all daemons
  2. rename the volume to -srv
  3. change the LV mountpoint to /srv
  4. adapt the HDFS Namenode to look into /srv for the namenode's data (as we do in hadoop test).

In both cases we'll end up with a single /srv partition, so we can use the same partman recipe to preserve /srv when reimaging to Buster. We'd also have fewer problems with a single volume/partition filling up, and we'll be able to put the HDFS Namenode logs under /srv once we figure out how to do it in T276906

Change 680179 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::analytics_cluster::hadoop::standby: move hadoop dirs to /srv

https://gerrit.wikimedia.org/r/680179

Procedure for an-master1002:

add downtime for the host
puppet disable
merge https://gerrit.wikimedia.org/r/680179

systemctl stop hadoop-hdfs-namenode
systemctl stop hadoop-hdfs-zkfc
systemctl stop hadoop-yarn-resourcemanager

mv /var/lib/hadoop /srv/hadoop

umount /srv
umount /var/lib/hadoop/name
lvrename an-master1002-vg backup srv
lvremove /dev/an-master1002-vg/namenode

change /etc/fstab to remove the namenode mountpoint, and update /srv's entry with the new UUID (from `blkid`)

lvextend -l +100%FREE /dev/an-master1002-vg/srv
resize2fs /dev/an-master1002-vg/srv
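The fstab step above could look like the following sketch; the UUID is a placeholder that must come from `blkid` on the host, so this is shown as comments rather than a runnable recipe:

```shell
# After lvrename/lvremove, /etc/fstab needs updating. Get the new UUID with:
#   blkid /dev/an-master1002-vg/srv
# then drop the old /var/lib/hadoop/name line and point /srv at the renamed
# LV, e.g. (UUID below is a placeholder):
#   UUID=<uuid-from-blkid>  /srv  ext4  defaults  0  2
```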

Change 680179 merged by Elukey:

[operations/puppet@production] role::analytics_cluster::hadoop::standby: move hadoop dirs to /srv

https://gerrit.wikimedia.org/r/680179

an-master1002 done:

elukey@an-master1002:~$ sudo lsblk -i
NAME                         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                            8:0    0 223.6G  0 disk  
|-sda1                         8:1    0  46.6G  0 part  
| `-md0                        9:0    0  46.5G  0 raid1 /
|-sda2                         8:2    0   954M  0 part  
| `-md1                        9:1    0 953.4M  0 raid1 [SWAP]
`-sda3                         8:3    0 176.1G  0 part  
  `-md2                        9:2    0   176G  0 raid1 
    `-an--master1002--vg-srv 253:0    0   176G  0 lvm   /srv
sdb                            8:16   0 223.6G  0 disk  
|-sdb1                         8:17   0  46.6G  0 part  
| `-md0                        9:0    0  46.5G  0 raid1 /
|-sdb2                         8:18   0   954M  0 part  
| `-md1                        9:1    0 953.4M  0 raid1 [SWAP]
`-sdb3                         8:19   0 176.1G  0 part  
  `-md2                        9:2    0   176G  0 raid1 
    `-an--master1002--vg-srv 253:0    0   176G  0 lvm   /srv


elukey@an-master1002:~$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                 63G     0   63G   0% /dev
tmpfs                                13G  1.3G   12G  10% /run
/dev/md0                             46G   19G   25G  43% /
tmpfs                                63G     0   63G   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                                63G     0   63G   0% /sys/fs/cgroup
tmpfs                                13G     0   13G   0% /run/user/13926
/dev/mapper/an--master1002--vg-srv  173G  115G   58G  67% /srv

elukey@an-master1002:~$ ls -l /srv/
total 24
drwxr-xr-x 3 root analytics-admins  4096 Oct 13  2020 backup
drwxr-xr-x 4 root root              4096 Sep 25  2018 hadoop
drwx------ 2 root root             16384 Sep 25  2018 lost+found
elukey@an-master1002:~$ ls -l /srv/hadoop/
total 8
drwxr-xr-x 3 root root 4096 Sep 25  2018 extra
drwx------ 4 hdfs hdfs 4096 Apr 16 08:42 name

Change 680259 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::analytics_cluster::hadoop::master: move hadoop dirs under /srv

https://gerrit.wikimedia.org/r/680259

Procedure for an-master1001:

disable puppet
merge https://gerrit.wikimedia.org/r/680259

failover hdfs and yarn to an-master1002

systemctl stop hadoop-hdfs-namenode
systemctl stop hadoop-hdfs-zkfc
systemctl stop hadoop-yarn-resourcemanager

umount /var/lib/hadoop/name
lvrename an-master1001-vg lvol0 srv

change /etc/fstab to mount the volume as /srv

create a /srv/hadoop directory and move the rest of the /srv content under it (since the volume now mounted at /srv previously held /var/lib/hadoop/name)
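A sketch of that reshuffle. On the real host SRV would be /srv (run as root); here a scratch directory with simulated namenode content (`current`, `lost+found`) stands in so the logic can be exercised without root, and the /srv/hadoop/name destination is an assumption based on the an-master1002 layout above:

```shell
# SRV=/srv on the real host; scratch dir + fake content here for a dry run:
SRV=${SRV:-$(mktemp -d)}
mkdir -p "$SRV/current" "$SRV/lost+found"   # simulate the old namenode data
mkdir -p "$SRV/hadoop/name"
# Move everything except the new hadoop/ dir and lost+found under hadoop/name/:
for f in "$SRV"/*; do
  b=$(basename "$f")
  [ "$b" = hadoop ] || [ "$b" = "lost+found" ] || mv "$f" "$SRV/hadoop/name/"
done
ls "$SRV/hadoop/name"
```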

run puppet to change config and start daemons

Once all looks good, failover yarn and hdfs from 1002 to 1001

Change 680259 merged by Elukey:

[operations/puppet@production] role::analytics_cluster::hadoop::master: move hadoop dirs under /srv

https://gerrit.wikimedia.org/r/680259

Mentioned in SAL (#wikimedia-analytics) [2021-04-19T06:50:29Z] <elukey> move /var/lib/hadoop/name partition under /srv/hadoop/name on an-master1001 - T265126

@razzi @Ottomata given the good results in T276906 (the .gz files are way smaller), we may want to stop this task here and avoid moving the logs under /srv for the moment. In case we want to do it, it should be sufficient to add something like -Dhadoop.log.dir=/srv/hadoop/logs to the target daemon's options (Namenode, etc.) in hadoop-env.sh.
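If we ever do it, the hadoop-env.sh change could look like the sketch below. The variable names are the Hadoop 2.x defaults (HADOOP_LOG_DIR, HADOOP_NAMENODE_OPTS); our puppetized template may differ, so treat them as assumptions:

```shell
# In /etc/hadoop/conf/hadoop-env.sh (sketch; variable names per Hadoop 2.x):
# either move all daemon logs:
export HADOOP_LOG_DIR=/srv/hadoop/logs
# or only the Namenode's, as suggested above:
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} -Dhadoop.log.dir=/srv/hadoop/logs"
```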

Let me know your thoughts :)

I'm fine with this either way! Really, if we were to do this, we should do it for all Hadoop daemons, just to be consistent. So, perhaps we can just not do this until it is needed after all?

I'm fine with this either way! Really, if we were to do this, we should do it for all Hadoop daemons, just to be consistent. So, perhaps we can just not do this until it is needed after all?

With the .gz files deployed across all hosts, in theory we should be good. My original point was to keep the audit log on a big enough partition for security purposes (hence only the Namenodes), but we can also keep everything on root, since the files are not that big now and root is on RAID as well. If we wanted to do it, it would be a quick code review, so all the work to discover how it should be done will stay in this task if needed.

Declining for now, let's re-open if we decide to do this later.

Change 661391 abandoned by Elukey:

[operations/puppet@production] hadoop: Add hiera setting to symlink hadoop logs to /var/log/hadoop

Reason:

https://gerrit.wikimedia.org/r/661391