
Druid loading of navigationtiming gets stuck
Open, Medium, Public

Description

Intermittently, the systemd service that uses HiveToDruid to load event.navigationtiming into Druid gets stuck for days, until we have to kill the job. This leaves holes in the datasource (and in Turnilo charts). We should backfill those holes, investigate why this is happening, and try to fix it.
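
One way to find the holes to backfill: list the intervals the datasource currently serves via the Coordinator API and look for missing hourly chunks (a sketch; COORDINATOR is a placeholder for whichever host runs the Druid Coordinator, default port 8081):

# List served intervals for the datasource, sorted, to spot missing hours.
curl -s "http://$COORDINATOR:8081/druid/coordinator/v1/datasources/event_navigationtiming/intervals" | jq -r '.[]' | sort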

Event Timeline

I assume that making druid-indexation-jobs fail in non-silent ways (T271568) will help us react faster to this issue.
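
As a stopgap until T271568, a cron-able check against the Overlord API could flag tasks that have been running suspiciously long (a rough sketch; the Overlord host and the 24h threshold are my assumptions):

# Print ids of indexation tasks running for more than ~24 hours.
OVERLORD=an-druid1002.eqiad.wmnet:8090   # assumed Overlord host, default port
CUTOFF=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S)
curl -s "http://$OVERLORD/druid/indexer/v1/runningTasks" \
  | jq -r --arg cutoff "$CUTOFF" '.[] | select(.createdTime < $cutoff) | .id'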

Razzi and I killed the two Spark jobs that were stuck the other day (to be able to reboot an-launcher), since they had started days earlier and seemed stuck. During a chat with Marcel we checked the Druid UI and noticed that the two jobs are still listed in Druid, one in RUNNING state and the other in WAITING (note the dates, the 25th and 21st, versus more recent ones like the 29th, which is today).

Screenshot from 2021-01-29 17-20-02.png (313×1 px, 100 KB)
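
The same RUNNING/WAITING lists can be pulled from the Overlord API directly, which is handy when the UI is unavailable (same assumed host as above):

curl -s "http://$OVERLORD/druid/indexer/v1/runningTasks" | jq -r '.[].id'
curl -s "http://$OVERLORD/druid/indexer/v1/waitingTasks" | jq -r '.[].id'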

I grepped druid1001's MiddleManager logs and found:

elukey@druid1001:/var/log/druid$ grep "index_hadoop_event_navigationtiming_jjolhekc" /var/log/druid/middlemanager.log 
2021-01-21T18:00:40,186 INFO org.apache.druid.indexing.worker.WorkerTaskManager: Task[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z] started.
2021-01-21T18:00:40,187 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Running command: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /etc/druid:/etc/druid/middlemanager:/usr/share/druid
[..cut..]
2021-01-21T18:00:40,191 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Logging task index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z output to: /srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z/log
2021-01-21T18:00:46,079 INFO org.apache.druid.cli.CliPeon: * druid.metrics.emitter.dimension.taskId: index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z
2021-01-21T18:00:46,086 INFO org.apache.druid.cli.CliPeon: * sun.java.command: org.apache.druid.cli.Main internal peon /srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z/task.json /srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z/1f4b792a-80df-4c9e-9600-d62062b17ae9/status.json /srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z/1f4b792a-80df-4c9e-9600-d62062b17ae9/report.json
  "id" : "index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z",
  "groupId" : "index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z",
    "availabilityGroup" : "index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z",
2021-01-21T18:00:47,995 INFO org.apache.druid.indexing.worker.executor.ExecutorLifecycle: Attempting to lock file[/srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z/lock].
2021-01-21T18:00:47,998 INFO org.apache.druid.indexing.worker.executor.ExecutorLifecycle: Acquired lock file[/srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z/lock] in 2ms.
2021-01-21T18:00:48,106 INFO org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner: Running task: index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z
2021-01-21T18:00:58,708 INFO org.apache.druid.indexer.JobHelper: MR job id [job_1609949395033_58007] is written to the file [/srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z/mapReduceJobId.json]
  "id" : "index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z",
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner: Starting graceful shutdown of task[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z].

== day changed ==

2021-02-01T14:48:41,203 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Shutdown [index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z] because: [shut down request via HTTP endpoint]
2021-02-01T14:48:41,203 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Closing output stream to task[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z].
2021-02-01T14:48:41,221 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Shutdown [index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z] because: [shut down request via HTTP endpoint]
2021-02-01T14:48:41,221 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Closing output stream to task[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z].
2021-02-01T14:48:41,660 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Process exited with status[2] for task: index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z
2021-02-01T14:48:41,663 INFO org.apache.druid.indexing.common.tasklogs.FileTaskLogs: Wrote task log to: /srv/druid/indexing-logs/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z.log
2021-02-01T14:48:41,664 INFO org.apache.druid.indexing.common.tasklogs.FileTaskLogs: Wrote task report to: /srv/druid/indexing-logs/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z.report.json
2021-02-01T14:48:41,666 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Removing task directory: /srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z
2021-02-01T14:48:41,671 INFO org.apache.druid.indexing.worker.WorkerTaskManager: Task [index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z] completed with status [FAILED].
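
Note the "shut down request via HTTP endpoint" above: that is the Overlord's task shutdown API, presumably the kill we issued. The curl equivalent would be something like (same assumed Overlord host as above):

# Ask the Overlord to shut down a task by id.
curl -s -X POST "http://$OVERLORD/druid/indexer/v1/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z/shutdown"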

On /srv/druid/task there is still the directory of the job started on the 21st (the one referenced in the logs above):

elukey@druid1001:/srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z$ ls -l
total 64
drwxr-xr-x 2 druid druid  4096 Jan 21 18:01 1f4b792a-80df-4c9e-9600-d62062b17ae9
-rw-r--r-- 1 druid druid     0 Jan 21 18:00 lock
-rw-r--r-- 1 druid druid 28132 Jan 28 15:51 log
-rw-r--r-- 1 druid druid    25 Jan 21 18:00 mapReduceJobId.json
-rw-r--r-- 1 druid druid 24737 Jan 21 18:00 task.json

After inspecting some files, it seems that the indexation failed (due to our manual kill), but the lock wasn't released until then, leaving the job apparently running for days (or something similar).
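
If it re-happens, the unreleased lock should also be visible in the Druid metadata store (a sketch, assuming a MySQL metadata DB with the default druid_ table prefix):

# On the metadata DB host: locks still held for navigationtiming tasks.
mysql druid -e "SELECT id, task_id FROM druid_tasklocks WHERE task_id LIKE 'index_hadoop_event_navigationtiming%';"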

fdans triaged this task as Unbreak Now! priority. Feb 1 2021, 4:52 PM
fdans moved this task from Incoming to Ops Week on the Analytics board.

In the Overlord logs on an-druid1002 I can see this interesting sequence, which then repeats over and over:

/var/log/druid/overlord.2.log:2021-01-21T18:00:40,152 INFO org.apache.druid.indexing.overlord.MetadataTaskStorage: Inserting task index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z with status: TaskStatus{id=index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z, status=RUNNING, duration=-1, errorMsg=null}
/var/log/druid/overlord.2.log:2021-01-21T18:00:40,168 INFO org.apache.druid.indexing.overlord.TaskLockbox: Adding task[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z] to activeTasks
/var/log/druid/overlord.2.log:2021-01-21T18:00:40,169 INFO org.apache.druid.indexing.overlord.TaskLockbox: Added task[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z] to TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z', dataSource='event_navigationtiming', interval=2021-01-21T12:00:00.000Z/2021-01-21T13:00:00.000Z, version='2021-01-21T18:00:40.169Z', priority=50, revoked=false}]
/var/log/druid/overlord.2.log:2021-01-21T18:00:40,169 INFO org.apache.druid.indexing.overlord.MetadataTaskStorage: Adding lock on interval[2021-01-21T12:00:00.000Z/2021-01-21T13:00:00.000Z] version[2021-01-21T18:00:40.169Z] for task: index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z
/var/log/druid/overlord.2.log:2021-01-21T18:00:40,173 INFO org.apache.druid.indexing.overlord.TaskQueue: Asking taskRunner to run: index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z
/var/log/druid/overlord.2.log:2021-01-21T18:00:40,174 INFO org.apache.druid.indexing.overlord.RemoteTaskRunner: Added pending task index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z
/var/log/druid/overlord.2.log:2021-01-21T18:00:40,174 INFO org.apache.druid.indexing.overlord.RemoteTaskRunner: Coordinator asking Worker[druid1001.eqiad.wmnet:8091] to add task[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z]
/var/log/druid/overlord.2.log:2021-01-21T18:00:40,181 INFO org.apache.druid.indexing.overlord.RemoteTaskRunner: Task index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z switched from pending to running (on [druid1001.eqiad.wmnet:8091])
/var/log/druid/overlord.2.log:2021-01-21T18:00:40,191 INFO org.apache.druid.indexing.overlord.RemoteTaskRunner: Worker[druid1001.eqiad.wmnet:8091] wrote RUNNING status for task [index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z] on [TaskLocation{host='null', port=-1, tlsPort=-1}]
/var/log/druid/overlord.2.log:2021-01-21T18:00:40,193 INFO org.apache.druid.indexing.overlord.RemoteTaskRunner: Worker[druid1001.eqiad.wmnet:8091] wrote RUNNING status for task [index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z] on [TaskLocation{host='druid1001.eqiad.wmnet', port=8201, tlsPort=-1}]
/var/log/druid/overlord.2.log:2021-01-21T18:00:48,061 INFO org.apache.druid.indexing.overlord.TaskLockbox: Task[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z] already present in TaskLock[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z]
/var/log/druid/overlord.2.log:2021-01-23T21:11:44,801 INFO org.apache.druid.indexing.overlord.RemoteTaskRunner: Shutdown [index_kafka_wmf_netflow_a742513f5b29bf9_blpmoekl] because: [task is not in knownTaskIds[[index_kafka_wmf_netflow_83b3aa7dc0e95f3_ccdkpgmo, index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z, index_kafka_wmf_netflow_83b3aa7dc0e95f3_bfjgpjhb, index_kafka_wmf_netflow_83b3aa7dc0e95f3_kakpjkao]]]
/var/log/druid/overlord.2.log:2021-01-25T00:02:52,484 INFO org.apache.druid.indexing.overlord.TaskLockbox: Cannot create a new taskLockPosse for request[TimeChunkLockRequest{lockType=EXCLUSIVE, groupId='index_hadoop_event_navigationtiming_hbnibbkj_2021-01-25T00:02:52.474Z', dataSource='event_navigationtiming', interval=2021-01-21T00:00:00.000Z/2021-01-22T00:00:00.000Z, preferredVersion='null', priority=50, revoked=false}] because existing locks[[TaskLockPosse{taskLock=TimeChunkLock{type=EXCLUSIVE, groupId='index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z', dataSource='event_navigationtiming', interval=2021-01-21T12:00:00.000Z/2021-01-21T13:00:00.000Z, version='2021-01-21T18:00:40.169Z', priority=50, revoked=false}, taskIds=[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z]}]] have same or higher priorities
/var/log/druid/overlord.2.log:2021-01-25T00:02:57,367 INFO org.apache.druid.indexing.overlord.TaskLockbox: Cannot create a new taskLockPosse for request[TimeChunkLockRequest{lockType=EXCLUSIVE, groupId='index_hadoop_event_navigationtiming_hbnibbkj_2021-01-25T00:02:52.474Z', dataSource='event_navigationtiming', interval=2021-01-21T00:00:00.000Z/2021-01-22T00:00:00.000Z, preferredVersion='null', priority=50, revoked=false}] because existing locks[[TaskLockPosse{taskLock=TimeChunkLock{type=EXCLUSIVE, groupId='index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z', dataSource='event_navigationtiming', interval=2021-01-21T12:00:00.000Z/2021-01-21T13:00:00.000Z, version='2021-01-21T18:00:40.169Z', priority=50, revoked=false}, taskIds=[index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z]}]] have same or higher priorities
[.. the last bit repeats..]

Nothing really stands out from the Yarn application logs. I wanted to inspect /srv/druid/task/index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z on druid1001, but the directory is now gone (since we killed the job). If this re-happens, let's save it (before killing) for later investigation.
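
For next time, something like this before killing should preserve the evidence (TASK_ID is whichever task is stuck at the time; the id below is just this incident's as an example):

# Snapshot the task working directory before killing the stuck job.
TASK_ID=index_hadoop_event_navigationtiming_jjolhekc_2021-01-21T18:00:40.142Z
sudo tar -C /srv/druid/task -czf "/tmp/${TASK_ID}.tar.gz" "${TASK_ID}"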

elukey lowered the priority of this task from Unbreak Now! to Medium. Feb 2 2021, 7:57 AM