
Investigate Gobblin dataloss during namenode failure
Closed, Resolved, Public

Description

During the namenode outage a few weeks ago, and also during a failed namenode failover today, gobblin seems to have lost data without erroring about it.

hdfs dfs -ls /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 3 items
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:12 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users      23154 2022-06-23 12:11 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655986216453_179_1.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz

The Gobblin job launched at 13:10 wrote a 0-length .gz file. This 0-length .gz file is causing Unexpected end of input stream errors when Refine tries to decompress and ingest it.
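
(A quick way to reproduce the Refine error by hand, as a sketch assuming gzip is available on the client host: a 0-byte file fails gzip's integrity test with "unexpected end of file", which is presumably the same failure Refine surfaces as Unexpected end of input stream.)

hdfs dfs -cat /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz | gzip -t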

This time around the failures are in 2022-06-23 hours 12 and 13 for the following legacy eventlogging topics:

# 'analytics'
eventlogging_MobileWikiAppInstallReferrer
eventlogging_MobileWikiAppWatchlist
eventlogging_ContentTranslationCTA
eventlogging_CentralAuth
eventlogging_MobileWikiAppEdit
eventlogging_ContentTranslationSuggestion

# 'legacy' (migrated to Event Platform)
eventlogging_ReferencePreviewsBaseline
eventlogging_HelpPanel
eventlogging_SuggestedTagsAction

Once we fix the gobblin data, the proper re-refine commands will be:

sudo -u analytics kerberos-run-command analytics refine_eventlogging_analytics --ignore_failure_flag=true --table_include_regex='mobilewikiappedit|contenttranslationsuggestion|centralauth|mobilewikiappinstallreferrer|contenttranslationcta|contenttranslationsuggestion|mobilewikiappwatchlist|centralauth' --since='2022-06-22T12:00:00.000Z' --until='2022-06-23T15:00:00.000Z'

sudo -u analytics kerberos-run-command analytics refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='helppanel|referencepreviewsbaseline|suggestedtagsaction' --since='2022-06-22T12:00:00.000Z' --until='2022-06-23T15:00:00.000Z'

Investigation into why gobblin failed and how to restore the data will go below in comments.

Event Timeline

Okay, in this gobblin task attempt, I found the following logs:
https://yarn.wikimedia.org/jobhistory/logs/an-worker1135.eqiad.wmnet:8041/container_e44_1655808530211_10727_01_000002/attempt_1655808530211_10727_m_000000_0/analytics/syslog/?start=0
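
(The same container logs can also be pulled with the YARN CLI, assuming log aggregation still has them; something like the following, possibly wrapped in kerberos-run-command like the other commands in this task:)

yarn logs -applicationId application_1655808530211_10727 -containerId container_e44_1655808530211_10727_01_000002 -appOwner analytics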

2022-06-23 13:11:14,006 WARN [Task-committing-pool-0] org.apache.gobblin.writer.RetryWriter: Caught exception. Operation will be retried. Attempt #1
java.io.IOException: Failed to commit all writers.
	at org.apache.gobblin.writer.PartitionedDataWriter.commit(PartitionedDataWriter.java:280)
	at org.apache.gobblin.writer.RetryWriter$3.call(RetryWriter.java:131)
	at org.apache.gobblin.writer.RetryWriter$3.call(RetryWriter.java:128)
	at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78)
	at com.github.rholder.retry.Retryer.call(Retryer.java:160)
	at com.github.rholder.retry.Retryer$RetryerCallable.call(Retryer.java:318)
	at org.apache.gobblin.writer.RetryWriter.callWithRetry(RetryWriter.java:141)
	at org.apache.gobblin.writer.RetryWriter.commit(RetryWriter.java:136)
	at org.apache.gobblin.runtime.fork.Fork.commitData(Fork.java:647)
	at org.apache.gobblin.runtime.fork.Fork.commit(Fork.java:384)
	at org.apache.gobblin.runtime.Task.commit(Task.java:919)
	at org.apache.gobblin.runtime.GobblinMultiTaskAttempt$1$1.call(GobblinMultiTaskAttempt.java:248)
	at org.apache.gobblin.runtime.GobblinMultiTaskAttempt$1$1.call(GobblinMultiTaskAttempt.java:243)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.gobblin.util.executors.MDCPropagatingRunnable.run(MDCPropagatingRunnable.java:39)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
2022-06-23 13:11:15,015 INFO [Task-committing-pool-0] org.apache.gobblin.writer.FsDataWriter: Moving data from /wmf/gobblin/task_working/eventlogging_legacy/job_eventlogging_legacy_1655989815701/task-staging/eventlogging_ContentTranslationCTA/attempt_1655808530211_10727_m_000000_0/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz to /wmf/gobblin/task_working/eventlogging_legacy/job_eventlogging_legacy_1655989815701/task-output/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz

2022-06-23 13:11:15,030 INFO [Task-committing-pool-0] org.apache.gobblin.runtime.fork.Fork-0: Fork 0 of task task_eventlogging_legacy_1655989815701_0 successfully committed data

WHY IS THIS A WARN!?!??! THIS SHOULD BE A FAILURE!!! Okay fine, retry it, but the job does not retry!?

Looking at gobblin metrics for this task attempt: this task workunit was responsible for 3 topics (partitions): eventlogging_ExternalExtension, eventlogging_ChangesListHighlights, and eventlogging_ContentTranslationCTA. There are KafkaExtractorTopicMetadata metrics for each of these three, but taskCommitted events only for the eventlogging_ExternalExtension and eventlogging_ChangesListHighlights tasks. It looks like there is a final taskCommitted for the overall workunit?

hdfs dfs -cat /wmf/gobblin/metrics/gobblin.metrics.job_eventlogging_legacy_1655989815701/* | grep attempt_1655808530211_10727_m_000000_0

In any case, it looks like even though the write to HDFS failed, Gobblin considered its task successful.

Looking into the source code for this error message shows that there is a gobblin.writer.retry.max_attempts config that defaults to 1? Well okay fine, but it should fail after one try. Just checking code... the Retryer it is making will retry only on 'non transient' exceptions, which Gobblin defines as RuntimeExceptions. IOException is not a RuntimeException, so it should pass the rejectionPredicate and then just cause the stopStrategy to fire and result in a RetryException, but it doesn't! WHY?

BTW, it looks like all eventlogging analytics gobblin jobs fail to extract a timestamp, which causes a WARN message, including the event JSON itself, to be output to the application logs for every event. That's a lot of duplicate data!

Oh, but Caught exception. Operation will be retried. Attempt #1 should only be logged if attempt # < RETRY_MAX_ATTEMPTS. If it was using the default, maxAttempts would be 1, attempt # would be 1, and it would not log. So maxAttempts must be more than 1? We don't set gobblin.writer.retry.max_attempts anywhere that I can find.
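
(A rough way to sanity-check that from the logs themselves, assuming the container syslog has been saved locally as the application_1655808530211_10727_failed_gobblin.T311263.13_00_run.log file used further down: count how many retried attempts were actually logged.)

grep -c 'Operation will be retried' application_1655808530211_10727_failed_gobblin.T311263.13_00_run.log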

On a second reading, I think that retry attempt #2 must be successful, and that is why we eventually see the Fork 0 of task task_eventlogging_legacy_1655989815701_0 successfully committed data message. Also, I suppose we are looking for an error earlier than this one? At this point, part.task_eventlogging_legacy_1655989815701_0_0.txt.gz has already been written in the task-staging directory. Something caused it to be 0 bytes when it was written, not when it was committed. Will continue to look...

I'm looking in the Gobblin source for where our SimpleStringWriter#write() method is called from, but I'm not familiar enough with the code to find it! There are other IOExceptions being thrown in those logs, like:

2022-06-23 13:11:13,777 INFO [Thread-16] org.apache.hadoop.hdfs.DataStreamer: Exception in createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as 10.64.21.122:50010
	at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:121)
	at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1751)
	at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1655)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:710)

But none of them seem to raise an exception up through the write() call. I'd really expect gobblin to fail if it fails writing to its HDFS staging dir! GRRRrrr.

Another curious question: why is this job not able to extract the timestamps from the dt field? It certainly exists!

Let me try to summarize the findings Joseph just explained to me.

Somewhere around 13:08 the standby namenode (an-master1001 at the time) was restarted. On restart, it asked all of the datanodes to provide a full block report. At 13:10, the gobblin_eventlogging_legacy job was launched. Because the datanodes were busy communicating with an-master1001, communication with some datanodes resulted in socket errors (this may also be why namenode startup times are taking SO long right now). Gobblin was writing blocks for the task-staging hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz file during this time. Gobblin tried 7 times to write this block, but was never actually successful. Each time it tried and failed, it logged something like:

2022-06-23 13:11:15,189 WARN [Thread-23] org.apache.hadoop.hdfs.DataStreamer: Abandoning BP-1552854784-10.64.21.110-1405114489661:blk_1967363852_893703988
2022-06-23 13:11:15,194 WARN [Thread-23] org.apache.hadoop.hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[10.64.36.140:50010,DS-f51dac9b-db68-4444-8d01-d95687867317,DISK]

Even so, gobblin proceeded as if the file was successfully written, and committed (moved) it to its final output location. Then the gobblin job finished.

20 minutes later, at about the same time that the Refine job tried to read and ingest this file, the active namenode printed the following in its log:

$ zgrep "/wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz" hadoop-hdfs-namenode-an-master1002.log.2.gz 

2022-06-23 13:31:17,204 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_attempt_1655808530211_10727_m_000000_0_163302783_1, pending creates: 1], src=/wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz

2022-06-23 13:31:17,204 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz closed.

attempt_1655808530211_10727_m_000000_0 is the same map task we were just looking at. It looks like gobblin never properly closed its lease on this open file, even though it committed (moved) it to its output location? And, because the lease was not closed, HDFS eventually decided to time out the open lease and restore the file from the last entry it had committed to the journal log for this file, which in this case would have been the file creation at 0 bytes.

Qs:

  • Why/how could Gobblin not fail correctly when it was not able to write this block? Why did it 'succeed' while not closing the lease on the file? (One way to spot-check for other still-open files is sketched just after this list.)
  • What is the HDFS timeout logic for an open and inactive lease?
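
(One way to spot-check whether any other raw files were left with open leases is hdfs fsck's -openforwrite option; a sketch, which may need appropriate kerberos credentials:)

hdfs fsck /wmf/data/raw/eventlogging_legacy -files -openforwrite | grep OPENFORWRITE
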
Backfill

Backfilling the data. Joseph had a good idea: instead of discovering the right offsets, we can just run a custom gobblin job to import the offending topics into a temp dir, and then replace the raw data with the partitioned hours we import into the temp dir. We can do this because these topics are small; we just import all data for these topics.

$ cat ./gobblin_eventlogging_legacy_backfill.T311263.properties

include=/srv/deployment/analytics/refinery/gobblin/common/kafka_to_hdfs_hourly.properties

# Prep by creating these directories and then chown -R analytics:privatedata-users
gobblin.cluster.work.dir=/user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/gobblin
data.publisher.final.dir=/user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw

state.store.dir=${gobblin.cluster.work.dir}/state_store
qualitychecker.row.err.file=${gobblin.cluster.work.dir}/err
metrics.log.dir=${gobblin.cluster.work.dir}/metrics
mr.job.root.dir=${gobblin.cluster.work.dir}/mr_working
task.data.root.dir=${gobblin.cluster.work.dir}/task_working

job.name=gobblin_eventlogging_legacy_otto_backfill_T311263
job.group=gobblin
extract.namespace=org.wikimedia.analytics.eventlogging_legacy


topic.include=^eventlogging_(MobileWikiAppInstallReferrer|MobileWikiAppWatchlist|ContentTranslationCTA|CentralAuth|MobileWikiAppEdit|ContentTranslationSuggestion|ReferencePreviewsBaseline|HelpPanel|SuggestedTagsAction)$

# Going to import all from kafka for this backfill
bootstrap.with.offset=earliest

writer.partition.timestamp.columns=meta.dt,dt

mr.job.max.mappers=10

# from an-launcher1002
PYTHONPATH=/srv/deployment/analytics/refinery/python sudo -u analytics --preserve-env=PYTHONPATH kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/gobblin ./gobblin_eventlogging_legacy_backfill.T311263.properties

# application_1655808530211_15844

For reference, these are all the 0 size files from hours 12 and 13:

-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_146_0.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=13/part.task_eventlogging_legacy_1655989815701_146_1.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_12_0.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=13/part.task_eventlogging_legacy_1655989815701_12_1.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_HelpPanel/year=2022/month=06/day=23/hour=13/part.task_eventlogging_legacy_1655989815701_185_1.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppEdit/year=2022/month=06/day=23/hour=13/part.task_eventlogging_legacy_1655989815701_162_1.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppInstallReferrer/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_43_0.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppWatchlist/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_61_0.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_ReferencePreviewsBaseline/year=2022/month=06/day=23/hour=13/part.task_eventlogging_legacy_1655989815701_155_1.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-23 13:31 /wmf/data/raw/eventlogging_legacy/eventlogging_SuggestedTagsAction/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_99_0.txt.gz

(Note how they all have the same 13:31 timestamp: HDFS 'restored' them about 20 minutes after the gobblin job finished.)
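
(For the record, a listing like that can be regenerated with something along these lines; the awk field positions assume the standard hdfs dfs -ls output of permissions, replication, owner, group, size, date, time, path:)

hdfs dfs -ls -R /wmf/data/raw/eventlogging_legacy/ | awk '$5 == 0 && $8 ~ /day=23\/hour=1[23]\/part\..*\.txt\.gz$/'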

I'm going to fully replace all of the data from my backfill in each of these hours.

First, move away the bad hours:

$ cat ./gobblin_eventlogging_legacy_backfill.T311263.move_bad_data.sh
#!/bin/bash

hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_CentralAuth/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=12 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_CentralAuth/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_CentralAuth/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=13 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_CentralAuth/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=12 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=13 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_HelpPanel/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_HelpPanel/year=2022/month=06/day=23/hour=13 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_HelpPanel/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_MobileWikiAppEdit/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppEdit/year=2022/month=06/day=23/hour=13 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_MobileWikiAppEdit/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_MobileWikiAppInstallReferrer/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppInstallReferrer/year=2022/month=06/day=23/hour=12 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_MobileWikiAppInstallReferrer/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_MobileWikiAppWatchlist/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppWatchlist/year=2022/month=06/day=23/hour=12 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_MobileWikiAppWatchlist/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_ReferencePreviewsBaseline/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_ReferencePreviewsBaseline/year=2022/month=06/day=23/hour=13 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_ReferencePreviewsBaseline/year=2022/month=06/day=23/
hdfs dfs -mkdir -p  /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_SuggestedTagsAction/year=2022/month=06/day=23 &&  hdfs dfs -mv /wmf/data/raw/eventlogging_legacy/eventlogging_SuggestedTagsAction/year=2022/month=06/day=23/hour=12 /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw_orig_bad_hours_with_0_size_files/eventlogging_SuggestedTagsAction/year=2022/month=06/day=23/

sudo -u analytics kerberos-run-command analytics ./gobblin_eventlogging_legacy_backfill.T311263.move_bad_data.sh

Then, move in the newly imported good hours from the custom gobblin job's temp dir:

$ cat ./gobblin_eventlogging_legacy_backfill.T311263.move_good_data.sh

#!/bin/bash

hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=12 /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=13 /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=12 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=13 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_HelpPanel/year=2022/month=06/day=23/hour=13 /wmf/data/raw/eventlogging_legacy/eventlogging_HelpPanel/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_MobileWikiAppEdit/year=2022/month=06/day=23/hour=13 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppEdit/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_MobileWikiAppInstallReferrer/year=2022/month=06/day=23/hour=12 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppInstallReferrer/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_MobileWikiAppWatchlist/year=2022/month=06/day=23/hour=12 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppWatchlist/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_ReferencePreviewsBaseline/year=2022/month=06/day=23/hour=13 /wmf/data/raw/eventlogging_legacy/eventlogging_ReferencePreviewsBaseline/year=2022/month=06/day=23/
hdfs dfs -mv /user/otto/gobblin_eventlogging_legacy_otto_backfill_T311263/raw/eventlogging_SuggestedTagsAction/year=2022/month=06/day=23/hour=12 /wmf/data/raw/eventlogging_legacy/eventlogging_SuggestedTagsAction/year=2022/month=06/day=23/

sudo -u analytics kerberos-run-command analytics ./gobblin_eventlogging_legacy_backfill.T311263.move_good_data.sh

Checking backfilled raw dirs:

-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=12/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users     166147 2022-06-24 14:44 /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=12/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_6_219.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=13/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users     166926 2022-06-24 14:45 /wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=13/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_6_220.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users     124377 2022-06-24 14:44 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_4_219.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=12/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users      48877 2022-06-24 14:44 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=12/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_2_199.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=13/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users      56294 2022-06-24 14:44 /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=13/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_2_200.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_HelpPanel/year=2022/month=06/day=23/hour=13/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users    1374836 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_HelpPanel/year=2022/month=06/day=23/hour=13/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_8_177.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppEdit/year=2022/month=06/day=23/hour=13/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users     163422 2022-06-24 14:45 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppEdit/year=2022/month=06/day=23/hour=13/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_5_220.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppInstallReferrer/year=2022/month=06/day=23/hour=12/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users      57280 2022-06-24 14:44 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppInstallReferrer/year=2022/month=06/day=23/hour=12/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_3_303.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppWatchlist/year=2022/month=06/day=23/hour=12/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users      14540 2022-06-24 14:44 /wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppWatchlist/year=2022/month=06/day=23/hour=12/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_1_155.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_ReferencePreviewsBaseline/year=2022/month=06/day=23/hour=13/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users     399377 2022-06-24 14:45 /wmf/data/raw/eventlogging_legacy/eventlogging_ReferencePreviewsBaseline/year=2022/month=06/day=23/hour=13/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_7_228.txt.gz
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-06-24 14:46 /wmf/data/raw/eventlogging_legacy/eventlogging_SuggestedTagsAction/year=2022/month=06/day=23/hour=12/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users       4759 2022-06-24 14:44 /wmf/data/raw/eventlogging_legacy/eventlogging_SuggestedTagsAction/year=2022/month=06/day=23/hour=12/part.task_gobblin_eventlogging_legacy_otto_backfill_T311263_1656081797704_0_172.txt.gz

And finally, re-running Refine:

sudo -u analytics kerberos-run-command analytics refine_eventlogging_analytics --ignore_failure_flag=true --table_include_regex='mobilewikiappedit|contenttranslationsuggestion|centralauth|mobilewikiappinstallreferrer|contenttranslationcta|contenttranslationsuggestion|mobilewikiappwatchlist|centralauth' --since='2022-06-22T12:00:00.000Z' --until='2022-06-23T15:00:00.000Z'

# application_1655808530211_15928

22/06/24 15:10:18 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=12 -> `event`.`centralauth` /wmf/data/event/centralauth/year=2022/month=6/day=23/hour=12. (# refined records: 3445)
22/06/24 15:10:19 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppWatchlist/year=2022/month=06/day=23/hour=12 -> `event`.`mobilewikiappwatchlist` /wmf/data/event/mobilewikiappwatchlist/year=2022/month=6/day=23/hour=12. (# refined records: 159)
22/06/24 15:10:19 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=12 -> `event`.`contenttranslationsuggestion` /wmf/data/event/contenttranslationsuggestion/year=2022/month=6/day=23/hour=12. (# refined records: 856)
22/06/24 15:10:21 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppInstallReferrer/year=2022/month=06/day=23/hour=12 -> `event`.`mobilewikiappinstallreferrer` /wmf/data/event/mobilewikiappinstallreferrer/year=2022/month=6/day=23/hour=12. (# refined records: 645)
22/06/24 15:10:23 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_MobileWikiAppEdit/year=2022/month=06/day=23/hour=13 -> `event`.`mobilewikiappedit` /wmf/data/event/mobilewikiappedit/year=2022/month=6/day=23/hour=13. (# refined records: 2017)
22/06/24 15:10:24 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12 -> `event`.`contenttranslationcta` /wmf/data/event/contenttranslationcta/year=2022/month=6/day=23/hour=12. (# refined records: 1641)
22/06/24 15:10:27 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_CentralAuth/year=2022/month=06/day=23/hour=13 -> `event`.`centralauth` /wmf/data/event/centralauth/year=2022/month=6/day=23/hour=13. (# refined records: 3492)
22/06/24 15:10:34 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationSuggestion/year=2022/month=06/day=23/hour=13 -> `event`.`contenttranslationsuggestion` /wmf/data/event/contenttranslationsuggestion/year=2022/month=6/day=23/hour=13. (# refined records: 1046)

sudo -u analytics kerberos-run-command analytics refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='helppanel|referencepreviewsbaseline|suggestedtagsaction' --since='2022-06-22T12:00:00.000Z' --until='2022-06-23T15:00:00.000Z'

# application_1655808530211_15936

22/06/24 15:12:37 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_SuggestedTagsAction/year=2022/month=06/day=23/hour=12 -> `event`.`suggestedtagsaction` /wmf/data/event/suggestedtagsaction/year=2022/month=6/day=23/hour=12. (# refined records: 37)
22/06/24 15:12:40 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_ReferencePreviewsBaseline/year=2022/month=06/day=23/hour=13 -> `event`.`referencepreviewsbaseline` /wmf/data/event/referencepreviewsbaseline/year=2022/month=6/day=23/hour=13. (# refined records: 5409)
22/06/24 15:12:56 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_HelpPanel/year=2022/month=06/day=23/hour=13 -> `event`.`helppanel` /wmf/data/event/helppanel/year=2022/month=6/day=23/hour=13. (# refined records: 9965)

Looks good! That's 11/11 datasets backfilled and refined.

So, data saved this time. We likely lost some data from when we had the namenode outage a few weeks ago and didn't notice this.

Mentioned in SAL (#wikimedia-analytics) [2022-06-24T15:14:44Z] <ottomata> backfilled eventlogging data lost during failed gobblin job - T311263

Thanks a lot @Ottomata for the backfill.

So, data saved this time. We likely lost some data from when we had the namenode outage a few weeks ago and didn't notice this.

I confirm I completely missed the dataloss during the previous outage. I was careful about making sure there was no loss but messed up my reasoning :(
Here are the topics for which we have incurred some loss for day 2022-06-09 (hours 13 or 14, marked per topic):

  • eventlogging_ReferencePreviewsCite [14]
  • eventlogging_MobileWikiAppFindInPage [14]
  • eventlogging_TwoColConflictExit [13]
  • eventlogging_SpecialMuteSubmit [13]
  • eventlogging_CentralNoticeImpression [13]
  • eventlogging_WikipediaPortal [13/14]

After doing some more log reading and gobblin and HDFS code reading, I have a few more findings:

In the log we are investigating, there is actually one failure log line about finally failing to create the block.

2022-06-23 13:11:13,870 WARN [Thread-16] org.apache.hadoop.hdfs.DataStreamer: DataStreamer Exception
java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1669)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:710)
2022-06-23 13:11:13,871 WARN [Thread-16] org.apache.hadoop.hdfs.DataStreamer: Could not get block locations. Source file "/wmf/gobblin/task_working/eventlogging_legacy/job_eventlogging_legacy_1655989815701/task-staging/eventlogging_ContentTranslationCTA/attempt_1655808530211_10727_m_000000_0/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz" - Aborting...

This IOException is thrown from the HDFS DataStreamer after all of its block create/write attempts have failed. I don't quite understand why gobblin catches this in its commit chain, instead of its write/writeEnvelope chain? This exception is caught here in PartitionedDataWriter#commit(), where it is just logged. Only later is a new IOException("Failed to commit all writers.") thrown, because at least one of the partitionWriters didn't commit. However, note that this exception is for the part.task_eventlogging_legacy_1655989815701_0_0.txt.gz file and is logged in [Task-committing-pool-0]. It really does look like it just retries again and succeeds in 'committing'?
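
(The relative ordering of those log lines can be seen by grepping the saved container log for the three messages, e.g.:)

grep -nE 'DataStreamer Exception|Failed to commit all writers|successfully committed data' application_1655808530211_10727_failed_gobblin.T311263.13_00_run.log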

One more thing: Is this closing code in Fork.java relevant? fork.closeWriterOnCompletion defaults to false.

List of datanodes that were excluded during attempt_1655808530211_10727_m_000000_0:

grep 'Excluding datanode'  application_1655808530211_10727_failed_gobblin.T311263.13_00_run.log | awk -F 'DatanodeInfoWithStorage\\[' '{print $2}' | awk -F ':' '{print $1}' | sort | uniq -c | sort -nr
    10 10.64.36.140
     8 10.64.53.32
     8 10.64.36.144
     6 10.64.5.42
     6 10.64.5.27
     6 10.64.21.122
     5 10.64.53.42
     5 10.64.36.8
     5 10.64.36.106
     5 10.64.21.121
     5 10.64.21.120
     4 10.64.53.46
     3 10.64.53.43
     3 10.64.5.30
     3 10.64.5.13
     3 10.64.36.143
     3 10.64.21.131
     3 10.64.21.119
     2 10.64.5.41
     2 10.64.53.8
     2 10.64.53.45
     2 10.64.53.27
     2 10.64.53.12
     2 10.64.5.21
     2 10.64.5.15
     2 10.64.5.12
     2 10.64.36.9
     2 10.64.36.145
     2 10.64.36.139
     2 10.64.36.138
     2 10.64.36.115
     2 10.64.21.8
     2 10.64.21.130
     1 10.64.53.47
     1 10.64.53.36
     1 10.64.53.33
     1 10.64.5.22
     1 10.64.5.16
     1 10.64.36.105
     1 10.64.21.9
     1 10.64.21.13
     1 10.64.21.129

With hostnames? :)

14:51:44 [@an-launcher1002:/home/otto] $ for dn in $(grep 'Excluding datanode'  application_1655808530211_10727_failed_gobblin.T311263.13_00_run.log | awk -F 'DatanodeInfoWithStorage\\[' '{print $2}' | awk -F ':' '{print $1}'); do dig -x $dn | grep 'eqiad.wmnet'; done | awk '{print $NF}' | sort | uniq -c  | sort -nr
     10 an-worker1108.eqiad.wmnet.
      8 an-worker1099.eqiad.wmnet.
      8 analytics1076.eqiad.wmnet.
      6 an-worker1096.eqiad.wmnet.
      6 an-worker1086.eqiad.wmnet.
      6 an-worker1078.eqiad.wmnet.
      5 an-worker1132.eqiad.wmnet.
      5 an-worker1112.eqiad.wmnet.
      5 an-worker1085.eqiad.wmnet.
      5 an-worker1084.eqiad.wmnet.
      5 analytics1066.eqiad.wmnet.
      4 an-worker1116.eqiad.wmnet.
      3 an-worker1129.eqiad.wmnet.
      3 an-worker1113.eqiad.wmnet.
      3 an-worker1111.eqiad.wmnet.
      3 an-worker1098.eqiad.wmnet.
      3 an-worker1083.eqiad.wmnet.
      3 an-worker1081.eqiad.wmnet.
      2 an-worker1140.eqiad.wmnet.
      2 an-worker1136.eqiad.wmnet.
      2 an-worker1134.eqiad.wmnet.
      2 an-worker1133.eqiad.wmnet.
      2 an-worker1125.eqiad.wmnet.
      2 an-worker1123.eqiad.wmnet.
      2 an-worker1115.eqiad.wmnet.
      2 an-worker1107.eqiad.wmnet.
      2 an-worker1106.eqiad.wmnet.
      2 an-worker1103.eqiad.wmnet.
      2 an-worker1100.eqiad.wmnet.
      2 an-worker1097.eqiad.wmnet.
      2 an-worker1091.eqiad.wmnet.
      2 analytics1067.eqiad.wmnet.
      2 analytics1058.eqiad.wmnet.
      1 an-worker1141.eqiad.wmnet.
      1 an-worker1128.eqiad.wmnet.
      1 an-worker1126.eqiad.wmnet.
      1 an-worker1117.eqiad.wmnet.
      1 an-worker1101.eqiad.wmnet.
      1 an-worker1094.eqiad.wmnet.
      1 analytics1077.eqiad.wmnet.
      1 analytics1065.eqiad.wmnet.
      1 analytics1059.eqiad.wmnet.
EChetty renamed this task from Gobblin dataloss during namenode failure to Investigate Gobblin dataloss during namenode failure. Jun 30 2022, 5:24 PM

I'm not working on this anymore. Should we close? We never quite figured out exactly why Gobblin failed in this strange way, but fingers crossed it doesn't happen again.

odimitrijevic subscribed.

Adding to data-pipelines to assess whether there is data loss that should be noted for the first outage, where the backfill might not have been run.