Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition
Closed, Resolved · Public

Authored By: Ottomata, May 31 2022, 6:36 PM
Referenced Files
Restricted File (Jun 16 2022, 1:44 PM)
Restricted File (Jun 16 2022, 10:52 AM)
F35211739: image.png (Jun 6 2022, 1:06 PM)

Description

Incident report: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure

WIP

Action Items:

  • Make sure old journalnode edits files are cleaned up properly now that the namenodes are back online and saving fsimage snapshots.
  • Reduce profile::hadoop::backup::namenode::fsimage_retention_days; 20 days is too many.
  • Possibly separate the fsimage backup storage from the namenode data storage partitions (won't do yet).
  • hdfs dfsadmin -fetchImage should have kept failing and not recovered.
  • gobblin did not fail with proper error codes while the NameNodes were offline (covered by other alerts).
  • Make sure the journalnodes alert sooner about the journalnode disk partition filling up.
  • Check that bacula backups of the fsimage snapshots are available and usable.
  • Check that the alerting for disk space is correct on the an-master hosts, since we do not seem to have been alerted to /srv/ becoming full on an-master1002.

Event Timeline

BTullis moved this task from Incoming to Ops on the Data-Engineering board.

Change 802169 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Reduce the retention time for hadoop namenode fsimages

https://gerrit.wikimedia.org/r/802169

Change 802169 merged by Btullis:

[operations/puppet@production] Reduce the retention time for hadoop namenode fsimages

https://gerrit.wikimedia.org/r/802169
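For context, the retention setting controls how many days' worth of fsimage snapshots are kept under /srv/backup/hadoop/namenode. The effect is roughly equivalent to a find-based prune like the sketch below (illustrative only: the real cleanup is managed by the Puppet profile, and the 10-day value here is a made-up example, not the value in the patch).

# Hypothetical equivalent of the retention cleanup; the actual number of days comes from
# profile::hadoop::backup::namenode::fsimage_retention_days.
find /srv/backup/hadoop/namenode -name 'fsimage_*' -mtime +10 -delete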

Regarding the alerts: Icinga did send IRC alerts, but only to #wikimedia-operations and not to #wikimedia-analytics.
That channel isn't watched as closely by the Data Engineering team, so we missed the notification.

image.png (439×846 px, 78 KB)

https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-operations/20220522.txt

Perhaps we should modify the contact lists generally for this kind of host, so that we get alerted to the common checks like disk space.

I have double-checked that the backups are correctly configured for the namenode fsimage backups.
From the file system:

root@an-master1002:/srv/backup/hadoop/namenode# ls -lh
total 52G
-rw-r--r-- 1 hdfs hdfs 7.4G May 31 00:10 fsimage_0000000008180591445
-rw-r--r-- 1 hdfs hdfs 7.5G Jun  1 00:10 fsimage_0000000008235321589
-rw-r--r-- 1 hdfs hdfs 7.6G Jun  2 00:10 fsimage_0000000008242051352
-rw-r--r-- 1 hdfs hdfs 7.6G Jun  3 00:10 fsimage_0000000008246103989
-rw-r--r-- 1 hdfs hdfs 7.4G Jun  4 00:10 fsimage_0000000008255690800
-rw-r--r-- 1 hdfs hdfs 7.3G Jun  5 00:09 fsimage_0000000008262212942
-rw-r--r-- 1 hdfs hdfs 7.2G Jun  6 00:09 fsimage_0000000008265491409

From bacula:

$ pwd
cwd is: /srv/backup/hadoop/namenode/
$ ls
fsimage_0000000008180591445
fsimage_0000000008235321589
fsimage_0000000008242051352
fsimage_0000000008246103989
fsimage_0000000008255690800
fsimage_0000000008262212942
fsimage_0000000008265491409

So that's fine.

Change 804593 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the analytics contact group to all relevant hosts in icinga

https://gerrit.wikimedia.org/r/804593

I have added the following patch to add the analytics contact group to all of our relevant hosts.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/804593

I believe that this will also affect services on that host, such as the Disk Space check that is most relevant to this task.

Change 804593 merged by Btullis:

[operations/puppet@production] Add the analytics contact group to all relevant hosts in icinga

https://gerrit.wikimedia.org/r/804593

I have verified that we will now get notified via #wikimedia-analytics and email to analytics-alerts@wikimedia.org about all service checks on relevant hosts.

  • Possibly separate the fsimage backup storage from the namenode data storage partitions

This is a bit tricky given the current setup, because all of the space in the volume group has already been assigned to /srv/.

btullis@an-master1002:~$ df -h -t ext4
Filesystem                          Size  Used Avail Use% Mounted on
/dev/md0                             46G  7.1G   37G  17% /
/dev/mapper/an--master1002--vg-srv  173G   93G   80G  54% /srv
btullis@an-master1002:~$ sudo vgs
  VG               #PV #LV #SN Attr   VSize    VFree
  an-master1002-vg   1   1   0 wz--n- <175.95g    0
btullis@an-master1002:~$ sudo lvs
  LV   VG               Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  srv  an-master1002-vg -wi-ao---- <175.95g
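To illustrate: carving out a separate logical volume for the fsimage backups would need free extents in the volume group, and VFree is 0, so a hypothetical command along these lines (size made up for illustration) would simply be refused:

# Hypothetical: create a dedicated LV for the fsimage backups.
# With 0 free extents in an-master1002-vg, LVM would reject this.
sudo lvcreate -n backup -L 50G an-master1002-vg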

I'm going to decline this recommendation for now, but when it comes to refreshing the an-master* servers I will review the storage configuration to try to permit this level of separation.

These are the two remaining things for now:

  • hdfs dfsadmin -fetchImage should have kept failing and not recovered.
  • gobblin did not fail with proper error codes while NameNodes were offline

I'll carry on working on them.

BTullis triaged this task as Medium priority. Jun 14 2022, 2:39 PM

The gobblin problem is a known issue - we have set up alerts (that worked!) that cover us for this.

Having looked into the issue with the hdfs dfsadmin -fetchImage job, I'm not sure that this is the part that needs fixing.

The job is supposed to run once per day. We only saw one failure, with every other scheduled run completing successfully.

journalctl -u hadoop-namenode-backup-fetchimage | grep systemd | grep hadoop-
<snip>
May 20 00:10:04 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
May 21 00:10:11 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
May 22 00:10:31 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
May 23 00:07:11 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Main process exited, code=exited, status=255/EXCEPTION
May 23 00:07:11 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Failed with result 'exit-code'.
May 24 00:10:03 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
May 25 00:10:02 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
May 26 00:10:00 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
May 27 00:10:02 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
May 28 00:10:01 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
May 29 00:10:03 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
May 30 00:10:00 an-master1002 systemd[1]: hadoop-namenode-backup-fetchimage.service: Succeeded.
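For reference, this systemd unit essentially wraps the hdfs dfsadmin -fetchImage call, which can also be run by hand against the active namenode. A minimal sketch, assuming a valid Kerberos ticket for the hdfs user and the backup directory shown earlier:

# Fetch the latest fsimage from the active namenode into the backup directory.
sudo -u hdfs hdfs dfsadmin -fetchImage /srv/backup/hadoop/namenode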

We also have another check that measures the age of the most recent backup file, and this alerted us at the correct time too:
{F35246439,width=60%}
However, we know that some of these runs were backing up the latest image saved to /srv/hadoop/name/current, which hadn't been updated by the namenode process.

This critical issue was reported in /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1002.log, but we don't currently generate an alert from this.

2022-05-23 00:07:10,335 ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: Unable to save image for /srv/hadoop/name
2022-05-23 00:07:11,191 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory /srv/hadoop/name

So the question is, what's the best way to alert us to this type of occurrence again?

I looked into /usr/local/lib/nagios/plugins/check_journal_pattern, which is already available, but it only calls journalctl and these log entries aren't captured by the journal.
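A file-based pattern match would be needed instead; for example, something along these lines would have caught the failure (a sketch only, not something we implemented):

# Count fsimage save failures in the namenode log file; the journal never sees these,
# because the namenode logs via log4j to /var/log/hadoop-hdfs/.
grep -c 'ERROR org.apache.hadoop.hdfs.server.namenode.FSImage' /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1002.log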

Change 806205 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a new check for the age of the standby namenode fsimage

https://gerrit.wikimedia.org/r/806205

This was discussed with @fgiunchedi in #wikimedia-observability.

Whilst it would be nice to use Prometheus for this, we don't have any convenient way of exporting a file's modification time yet. prometheus-file-flag.py was discussed as a possible way of doing that, but it will require work before it is usable.
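For the record, one hypothetical route would be the node-exporter textfile collector, with a small timer writing the fsimage mtime as a metric. Everything in this sketch (metric name, output path) is made up for illustration and is not what we deployed:

# Export the mtime of the standby namenode's fsimage directory as a Prometheus metric.
# The .prom output path depends on how the node-exporter textfile collector is configured.
echo "hadoop_namenode_fsimage_mtime_seconds $(stat -c %Y /srv/hadoop/name/current)" \
  > /var/lib/prometheus/node.d/hadoop_fsimage.prom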

Therefore we took a pragmatic approach and used the check_file_age plugin for Icinga. I've created a patch that adds this alert.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/806205

It generates a warning if the image is more than 90 minutes old and a critical alert if it is more than 2 hours old.
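For illustration, the check boils down to the standard check_file_age plugin with a 5400-second warning and a 7200-second critical threshold, run with enough privileges to read the namenode directory (hence the sudo_user follow-up below). Roughly like this, although the exact target and arguments are managed by Puppet:

# Approximation of the Icinga check; the path argument here is an assumption.
sudo -u hdfs /usr/lib/nagios/plugins/check_file_age -w 5400 -c 7200 -f /srv/hadoop/name/current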

There is a risk that the alert will fire during a rolling restart of the namenodes (sre.hadoop.roll-restart-masters), but with a 90-minute warning window this might be OK.

It will definitely fire if we run with an-master1002 as the active namenode for any period of time, because an-master1001 will be the standby server in that case and it will be writing the fsimage files instead.

Change 806205 merged by Btullis:

[operations/puppet@production] Add a new check for the age of the standby namenode fsimage

https://gerrit.wikimedia.org/r/806205

Change 806212 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a sudo_user parameter to the hadoop fsimage freshness check

https://gerrit.wikimedia.org/r/806212

Change 806212 merged by Btullis:

[operations/puppet@production] Add a sudo_user parameter to the hadoop fsimage freshness check

https://gerrit.wikimedia.org/r/806212

This check is now in place and working. I will resolve this ticket.
{F35246833,width=80%}

I have marked the incident report as in-review: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure

Is there a process for getting it reviewed and set to final? @Ottomata ?

I don't know! Asking in #wikimedia-sre IRC

It looks like we just wait for the ONFIRE team to review it and either ask for revisions or set it to final themselves.