
Improve logging for HDFS Namenodes
Closed, Declined · Public

Description

The current setup of LVM volumes on the namenodes is not optimal:

elukey@an-master1001:~$ df -h
Filesystem                            Size  Used Avail Use% Mounted on
/dev/md0                               46G   31G   13G  72% /
/dev/mapper/an--master1001--vg-lvol0  173G  8.0G  165G   5% /var/lib/hadoop/name

elukey@an-master1001:~$ sudo lvs
  LV    VG               Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol0 an-master1001-vg -wi-ao---- 175.95g

elukey@an-master1001:~$ sudo pvs
  PV         VG               Fmt  Attr PSize   PFree
  /dev/md2   an-master1001-vg lvm2 a--  175.95g    0
elukey@an-master1002:~$ df -h
Filesystem                               Size  Used Avail Use% Mounted on
/dev/md0                                  46G   19G   25G  44% /
/dev/mapper/an--master1002--vg-backup    138G  119G   20G  87% /srv
/dev/mapper/an--master1002--vg-namenode   35G  8.1G   27G  24% /var/lib/hadoop/name

elukey@an-master1002:~$ sudo lvs
  LV       VG               Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  backup   an-master1002-vg -wi-ao---- 140.75g
  namenode an-master1002-vg -wi-ao----  35.19g

elukey@an-master1002:~$ sudo pvs
  PV         VG               Fmt  Attr PSize   PFree
  /dev/md2   an-master1002-vg lvm2 a--  175.95g    0

On an-master1001 we don't really use the LVM volume, and on an-master1002 we still use the vg-backup volume that shouldn't be needed anymore (it is an LVM snapshot from an-coord1001, created before we had the MySQL replication to db1108). Moreover, we store the /var/log/hadoop-hdfs/* logs on the root partition, which is not great since it is tiny.

What we should do is:

  • resize/remove the LVM volumes/partitions that are not needed.
  • consider putting /var/log/hadoop-hdfs on an LVM volume, and increase the logging retention (hdfs-audit.log, hdfs-namenode.log, etc.)
  • verify the partman config of these nodes and how it will change with the Buster migration (the SRE team has standardized a lot of partman recipes; the default is now to have data under /srv).
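For the retention part, a hedged sketch of where the knobs would live. The appender names (RFA for daemon logs, RFAAUDIT for hdfs-audit.log) are as in stock Hadoop log4j.properties; the exact names and current values on our hosts come from puppet, so treat everything below as an assumption to verify, not a recipe:

```shell
# Retention knobs in /etc/hadoop/conf/log4j.properties (sketch only;
# appender names from stock Hadoop configs, values are just examples):
#   log4j.appender.RFA.MaxFileSize=256MB
#   log4j.appender.RFA.MaxBackupIndex=20        # daemon logs (hdfs-namenode.log, ...)
#   log4j.appender.RFAAUDIT.MaxBackupIndex=30   # hdfs-audit.log
```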

Event Timeline

After today's ops-sync, here's what we want: to put the Hadoop logs on their own LVM partition.

  • Create a new LVM partition and mount it at /var/log/hadoop
  • Symlink /var/log/hadoop-* into /var/log/hadoop, e.g. /var/log/hadoop-hdfs -> /var/log/hadoop/hadoop-hdfs
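The two steps above could look roughly like this. The VG name, LV name, and size are placeholders (there is currently no free space in the VGs, see below), so the LVM commands are shown as comments; the symlink layout itself is demonstrated in a scratch directory so it can be exercised without root:

```shell
# On the real host (root; VG name, LV name and size are assumptions):
#   lvcreate -L 30G -n hadooplogs an-master1001-vg
#   mkfs.ext4 /dev/an-master1001-vg/hadooplogs
#   mount /dev/an-master1001-vg/hadooplogs /var/log/hadoop

# The symlink layout, demonstrated under a temp dir standing in for /:
root=$(mktemp -d)
mkdir -p "$root/var/log/hadoop/hadoop-hdfs"
ln -s "$root/var/log/hadoop/hadoop-hdfs" "$root/var/log/hadoop-hdfs"
# Daemons keep writing to /var/log/hadoop-hdfs; the data lands on the LV mount:
echo audit-line >> "$root/var/log/hadoop-hdfs/hdfs-audit.log"
cat "$root/var/log/hadoop/hadoop-hdfs/hdfs-audit.log"
```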

@elukey after the new LVM partition is created, should it be declared in puppet with a mount resource? And should we declare the directories and symlinks in puppet as well?

Change 661391 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] hadoop: Add hiera setting to symlink hadoop logs to /var/log/hadoop

https://gerrit.wikimedia.org/r/661391

@Ottomata and I discussed next steps for this ticket, and came up with the following:

  • Create a puppet patch that allows a hiera setting for symlinking hadoop logs into /var/log/hadoop (https://gerrit.wikimedia.org/r/661391)
  • On the test cluster, on the standby master node, we can manually make /var/log/hadoop a directory which will behave like a mounted logical volume, then use hiera to enable linking hadoop logs into /var/log/hadoop and test that logging works as it should
  • On the main cluster, on the standby node (an-master1002), we can stop the hadoop daemons, manually mount a logging logical volume to /var/log/hadoop, temporarily backup the logs, enable the symlinks via hiera, copy the logs back, and restart the hadoop daemons
  • Promote the standby node to active and repeat for an-master1001

These are a lot of manual steps; the plan can be expanded if the high-level approach looks good. What do you think @elukey?

@Ottomata and I discussed next steps for this ticket, and came up with the following:

Added some comments, thanks for kicking off the work!

  • On the test cluster, on the standby master node, we can manually make /var/log/hadoop a directory which will behave like a mounted logical volume, then use hiera to enable linking hadoop logs into /var/log/hadoop and test that logging works as it should

I'd prefer that we test the whole procedure properly in Hadoop test. I quickly checked, and the /srv partition is probably not really used; it doesn't contain anything. You could test your procedure of creating logical volumes there, and apply the symlink change in puppet as well.

  • On the main cluster, on the standby node (an-master1002), we can stop the hadoop daemons, manually mount a logging logical volume to /var/log/hadoop, temporarily backup the logs, enable the symlinks via hiera, copy the logs back, and restart the hadoop daemons
  • Promote the standby node to active and repeat for an-master1001

The main doubt that I have is what to do with a use case like an-master1001, where there is no space left for new logical volumes. As the task description shows, /var/lib/hadoop/name is clearly oversized and should be shrunk if we want to add another logical volume and partition, since there is no free space left (see the output of pvs). The same goes for an-master1002: there is no space left for new volumes.

Maybe I am missing something, but what is the idea? I thought that we agreed on resizing the existing volumes first (that is the difficult part of the task), but I may have missed something. Let me know :)

I'd prefer that we test the whole procedure properly in Hadoop test. I quickly checked, and the /srv partition is probably not really used; it doesn't contain anything. You could test your procedure of creating logical volumes there, and apply the symlink change in puppet as well.

Ah cool, we talked about testing this there but just by creating the dir, not the LVM partition. If we can do LVM there, then cool!

I have a plan for this task, let me know your thoughts!

Assumption: the SRE team decided to make /srv the canonical place for RAID-based volumes/partitions and for important data that must be preserved (on a more reliable medium). All the partman recipes now follow this convention, so in my opinion we should try to adapt for simple use cases like this one.

  • an-master1002
  1. stop all daemons
  2. copy the content of /var/lib/hadoop/name to /srv
  3. delete the namenode LV, expand the backup one and rename it to -srv
  4. expand the (renamed) -srv LV's ext4 partition
  5. instruct the hdfs namenode to look into /srv for the namenode's data (as we do in Hadoop test).
  • an-master1001
  1. stop all daemons
  2. rename the volume to -srv
  3. change the LV mountpoint to /srv
  4. adapt the HDFS Namenode to look into /srv for the namenode's data (as we do in hadoop test).

In both cases we'll end up with a single /srv partition, so we can use the same partman recipe to preserve /srv when reimaging to Buster. We'd also have fewer problems with a single volume/partition filling up, and we'll be able to put the HDFS Namenode logs under /srv once we figure out how to do it in T276906

Change 680179 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::analytics_cluster::hadoop::standby: move hadoop dirs to /srv

https://gerrit.wikimedia.org/r/680179

Procedure for an-master1002:

add downtime for the host
puppet disable
merge https://gerrit.wikimedia.org/r/680179

systemctl stop hadoop-hdfs-namenode
systemctl stop hadoop-hdfs-zkfc
systemctl stop hadoop-yarn-resourcemanager

mv /var/lib/hadoop /srv/hadoop

umount /srv
umount /var/lib/hadoop/name
lvrename an-master1002-vg backup srv
lvremove /dev/an-master1002-vg/namenode

change /etc/fstab to remove the namenode mountpoint, and update /srv's entry with the new UUID (from `blkid`)

lvextend -l +100%FREE /dev/an-master1002-vg/srv
resize2fs /dev/an-master1002-vg/srv
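The fstab step above could look like the following sketch; the UUID is a placeholder that must come from `blkid` on the host, so this is shown as comments rather than a runnable recipe:

```shell
# After lvrename/lvremove, /etc/fstab needs updating. Get the new UUID with:
#   blkid /dev/an-master1002-vg/srv
# then drop the old /var/lib/hadoop/name line and point /srv at the renamed
# LV, e.g. (UUID below is a placeholder):
#   UUID=<uuid-from-blkid>  /srv  ext4  defaults  0  2
```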

Change 680179 merged by Elukey:

[operations/puppet@production] role::analytics_cluster::hadoop::standby: move hadoop dirs to /srv

https://gerrit.wikimedia.org/r/680179

an-master1002 done:

elukey@an-master1002:~$ sudo lsblk -i
NAME                         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                            8:0    0 223.6G  0 disk  
|-sda1                         8:1    0  46.6G  0 part  
| `-md0                        9:0    0  46.5G  0 raid1 /
|-sda2                         8:2    0   954M  0 part  
| `-md1                        9:1    0 953.4M  0 raid1 [SWAP]
`-sda3                         8:3    0 176.1G  0 part  
  `-md2                        9:2    0   176G  0 raid1 
    `-an--master1002--vg-srv 253:0    0   176G  0 lvm   /srv
sdb                            8:16   0 223.6G  0 disk  
|-sdb1                         8:17   0  46.6G  0 part  
| `-md0                        9:0    0  46.5G  0 raid1 /
|-sdb2                         8:18   0   954M  0 part  
| `-md1                        9:1    0 953.4M  0 raid1 [SWAP]
`-sdb3                         8:19   0 176.1G  0 part  
  `-md2                        9:2    0   176G  0 raid1 
    `-an--master1002--vg-srv 253:0    0   176G  0 lvm   /srv


elukey@an-master1002:~$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                 63G     0   63G   0% /dev
tmpfs                                13G  1.3G   12G  10% /run
/dev/md0                             46G   19G   25G  43% /
tmpfs                                63G     0   63G   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                                63G     0   63G   0% /sys/fs/cgroup
tmpfs                                13G     0   13G   0% /run/user/13926
/dev/mapper/an--master1002--vg-srv  173G  115G   58G  67% /srv

elukey@an-master1002:~$ ls -l /srv/
total 24
drwxr-xr-x 3 root analytics-admins  4096 Oct 13  2020 backup
drwxr-xr-x 4 root root              4096 Sep 25  2018 hadoop
drwx------ 2 root root             16384 Sep 25  2018 lost+found
elukey@an-master1002:~$ ls -l /srv/hadoop/
total 8
drwxr-xr-x 3 root root 4096 Sep 25  2018 extra
drwx------ 4 hdfs hdfs 4096 Apr 16 08:42 name

Change 680259 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::analytics_cluster::hadoop::master: move hadoop dirs under /srv

https://gerrit.wikimedia.org/r/680259

Procedure for an-master1001:

disable puppet
merge https://gerrit.wikimedia.org/r/680259

failover hdfs and yarn to an-master1002

systemctl stop hadoop-hdfs-namenode
systemctl stop hadoop-hdfs-zkfc
systemctl stop hadoop-yarn-resourcemanager

umount /var/lib/hadoop/name
lvrename an-master1001-vg lvol0 srv

change /etc/fstab to mount the volume as /srv

create a /srv/hadoop directory and move the rest of the /srv content under it (since the volume now mounted at /srv previously held /var/lib/hadoop/name)
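A sketch of that reshuffle. On the real host SRV would be /srv (run as root); here a scratch directory with simulated namenode content (`current`, `lost+found`) stands in so the logic can be exercised without root, and the /srv/hadoop/name destination is an assumption based on the an-master1002 layout above:

```shell
# SRV=/srv on the real host; scratch dir + fake content here for a dry run:
SRV=${SRV:-$(mktemp -d)}
mkdir -p "$SRV/current" "$SRV/lost+found"   # simulate the old namenode data
mkdir -p "$SRV/hadoop/name"
# Move everything except the new hadoop/ dir and lost+found under hadoop/name/:
for f in "$SRV"/*; do
  b=$(basename "$f")
  [ "$b" = hadoop ] || [ "$b" = "lost+found" ] || mv "$f" "$SRV/hadoop/name/"
done
ls "$SRV/hadoop/name"
```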

run puppet to change config and start daemons

Once all looks good, failover yarn and hdfs from 1002 to 1001

Change 680259 merged by Elukey:

[operations/puppet@production] role::analytics_cluster::hadoop::master: move hadoop dirs under /srv

https://gerrit.wikimedia.org/r/680259

Mentioned in SAL (#wikimedia-analytics) [2021-04-19T06:50:29Z] <elukey> move /var/lib/hadoop/name partition under /srv/hadoop/name on an-master1001 - T265126

@razzi @Ottomata given the good results in T276906 (the .gz files are way smaller), we may want to stop this task here and avoid moving the logs under /srv for the moment. In case we want to do it, it should be sufficient to add something like -Dhadoop.log.dir=/srv/hadoop/logs to the target daemon's options (Namenode, etc.) in hadoop-env.sh.
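If we ever do it, the hadoop-env.sh change could look like the sketch below. The variable names are the Hadoop 2.x defaults (HADOOP_LOG_DIR, HADOOP_NAMENODE_OPTS); our puppetized template may differ, so treat them as assumptions:

```shell
# In /etc/hadoop/conf/hadoop-env.sh (sketch; variable names per Hadoop 2.x):
# either move all daemon logs:
export HADOOP_LOG_DIR=/srv/hadoop/logs
# or only the Namenode's, as suggested above:
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} -Dhadoop.log.dir=/srv/hadoop/logs"
```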

Let me know your thoughts :)

I'm fine with this either way! Really, if we were to do this, we should do it for all Hadoop daemons, just to be consistent. So, perhaps we can just not do this until it is needed after all?

I'm fine with this either way! Really, if we were to do this, we should do it for all Hadoop daemons, just to be consistent. So, perhaps we can just not do this until it is needed after all?

With the .gz files deployed across all hosts, in theory we should be good. My original point was to keep the audit log on a big enough partition for security purposes (hence only the Namenodes), but we can also keep everything on root, since the files are not that big now and root is on RAID as well. If we wanted to do it, it would be a quick code review, so all the work to discover how it should be done will stay in this task if needed.

Declining for now, let's re-open if we decide to do this later.

Change 661391 abandoned by Elukey:

[operations/puppet@production] hadoop: Add hiera setting to symlink hadoop logs to /var/log/hadoop

Reason:

https://gerrit.wikimedia.org/r/661391