
Investigate why netflow hive_to_druid job is so slow
Closed, Resolved · Public

Description

We saw that the netflow-to-druid daily and hourly jobs are taking 16h and 5h respectively.
This might have just coincided with the mediawiki history monthly jobs,
but we should make sure the netflow data and jobs are sustainable.

Event Timeline

mforns created this task.Jun 3 2020, 5:31 PM
Restricted Application added a subscriber: Aklapper.Jun 3 2020, 5:31 PM
elukey added a subscriber: elukey.Jun 4 2020, 6:01 AM

I have restarted the daily job; yesterday I killed it to reboot an-launcher1001 (new memory settings), and it was still showing hours and hours of running time. What I found was that only the Spark driver of the job was still running, waiting for the Druid map-reduce indexation job. I then tracked it down to druid1002's MiddleManager, and the only log available was:

WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(com.google.protobuf.InvalidProtocolBufferException): Protocol message contained an invalid tag (zero).

I then tried killing the job, restarting the MiddleManager and re-running the indexation, but the warning persists. I checked the other Druid hosts and found the same message there, so I think it is probably benign and something Druid has been logging for a while (my theory is that the HDFS client bundled with Druid is more recent than our old Hadoop 2.6 CDH version, which ends up producing protobuf warnings).

So the question remains: why does the map-reduce Druid indexation job take ages?
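
For context, a minimal sketch of how stuck indexation tasks can be inspected through the Druid Overlord HTTP API rather than grepping MiddleManager logs; the Overlord host/port below are placeholders, not our actual setup.

```python
# Sketch only: list running Druid indexation tasks and tail their logs
# via the Overlord HTTP API.
import requests

OVERLORD = "http://druid-overlord.example.org:8090"  # placeholder host

tasks = requests.get(f"{OVERLORD}/druid/indexer/v1/runningTasks").json()
for task in tasks:
    print(task["id"], task.get("type"), task.get("createdTime"))
    # A negative offset fetches the tail of the task log.
    log_tail = requests.get(
        f"{OVERLORD}/druid/indexer/v1/task/{task['id']}/log",
        params={"offset": -10000},
    )
    print(log_tail.text)
```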

elukey added a comment.Jun 4 2020, 7:04 AM

There is something weird going on: the last daily segment for wmf_netflow that I can see in the coordinator console is 2020-03-05; after that it is all hourly segments.

elukey added a comment.Jun 4 2020, 8:23 AM

From the coordinator console I missed something very clear, namely that after 2020-03-05 we got a huge increase in segment sizes. The number of dimensions moved from 9 to 14.
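
As an aside, a hedged sketch (the Broker host below is a placeholder) of how the per-day storage for wmf_netflow can be pulled from the Druid SQL sys.segments table, which is where this jump shows up as numbers rather than coordinator-console eyeballing:

```python
# Sketch only: per-day segment size and count for wmf_netflow via Druid SQL.
import requests

BROKER = "http://druid-broker.example.org:8082"  # placeholder host

query = """
SELECT SUBSTRING("start", 1, 10) AS segment_day,
       SUM("size") / 1024.0 / 1024.0 AS size_mb,
       COUNT(*) AS segments
FROM sys.segments
WHERE datasource = 'wmf_netflow'
GROUP BY SUBSTRING("start", 1, 10)
ORDER BY segment_day DESC
"""

rows = requests.post(f"{BROKER}/druid/v2/sql", json={"query": query}).json()
for row in rows:
    print(row["segment_day"], round(row["size_mb"], 1), "MB in", row["segments"], "segments")
```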

Good catch @elukey!
I think we probably agreed to keep netflow data indefinitely at its original size (~300 MB/day). Now that it's ~12 GB/day, we need to discuss retention :)
Storage is not an issue as of now (cluster total size is 13.7 TB, and we currently use 3.8 TB).
However, ~12 GB/day is roughly 1 TB per quarter, and we won't sustain that for more than a few quarters :)

See T229682#5402701: we don't need all that data past x months, and it's totally fine to anonymize it, drop some dimensions, and reduce the granularity.

Ack @ayounsi :)
Data anonymization/schema change (dropping columns) means re-indexation. It's not very complicated, but it means we need to set up another job.
@ayounsi Can you please give us specs for which dimensions to keep/drop, and possibly for reducing the query time-granularity?
Thanks :)
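
To make the re-indexation idea more concrete, here is a rough sketch of the kind of task it would involve, following the upstream Druid Hadoop re-indexing spec format: read the existing wmf_netflow segments back from Druid, keep only a subset of dimensions, and coarsen the query granularity. The dimension names, metric, interval and Overlord host are placeholders for illustration, not a spec for our cluster.

```python
# Sketch only: submit a Hadoop re-indexation task that reads wmf_netflow
# segments back from Druid with fewer dimensions and coarser granularity.
import requests

OVERLORD = "http://druid-overlord.example.org:8090"  # placeholder host
INTERVAL = "2020-03-06/2020-06-01"  # placeholder interval

reindex_task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "wmf_netflow",
            "parser": {
                "type": "hadoopyString",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "auto"},
                    "dimensionsSpec": {
                        # only the dimensions we decide to keep (placeholders)
                        "dimensions": ["as_src", "as_dst", "ip_proto"]
                    },
                },
            },
            "metricsSpec": [
                {"type": "longSum", "name": "bytes", "fieldName": "bytes"}
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "HOUR",  # coarser than the original
                "intervals": [INTERVAL],
            },
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {
                "type": "dataSource",
                "ingestionSpec": {"dataSource": "wmf_netflow", "intervals": [INTERVAL]},
            },
        },
    },
}

resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=reindex_task)
print(resp.status_code, resp.text)
```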

mforns added a comment.Jun 4 2020, 1:07 PM

@JAllemandou and @ayounsi

We already have a netflow druid sanitization job set up that drops some fields, see:
https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/druid_load.pp#L66
The size of the data after sanitization (which is applied after 90 days) is ~300 MB per day, which should be fine to keep indefinitely.

CDanis added a subscriber: CDanis.Jun 4 2020, 2:27 PM

Ack @mforns - No work needed from you @ayounsi then :)

Milimetric assigned this task to elukey.Jun 4 2020, 3:46 PM
Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Ops Week on the Analytics board.
elukey added a comment.Jun 5 2020, 5:55 AM

Summary of what I have gathered with Joseph and Marcel (please correct me if I am wrong):

  • the data shows a big jump around March because that is exactly 90 days ago, i.e. the sanitization cutoff (sanitization seems to reduce the size of the data a lot).
  • due to the above, we don't really know for how long the hourly and daily indexations have been failing (maybe intermittently). The Hive2Druid settings were not suited to the size of the netflow data, and most of the indexations ended up timing out.
  • the data left was the realtime-indexation data, causing issues like T254161 (a brief interruption of the realtime indexation means "holes" until sanitization replaces the segments).
  • all the Spark jobs run in cluster mode, so the driver is on YARN and its return code is not propagated when spark-submit returns to the systemd timer (so no non-zero return codes to trigger Icinga alarms). We could potentially move to local mode (Spark driver on an-launcher), but it would require more resources there (a rough sketch of one possible mitigation follows this list).
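
To illustrate the last point, a hedged sketch of one possible mitigation (not something we have deployed): wrap spark-submit, parse the YARN application id from its output, and ask YARN for the final application status so the systemd unit exits non-zero when the driver fails. Paths, regexes and arguments below are assumptions for illustration.

```python
# Sketch only: make a cluster-mode Spark job failure visible to systemd/Icinga
# by checking the final YARN application state after spark-submit returns.
import re
import subprocess
import sys


def run_and_check(spark_submit_args):
    proc = subprocess.run(
        ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"] + spark_submit_args,
        capture_output=True,
        text=True,
    )
    output = proc.stdout + proc.stderr

    # spark-submit logs the YARN application id while it waits for completion.
    match = re.search(r"(application_\d+_\d+)", output)
    if not match:
        sys.exit("could not find a YARN application id in spark-submit output")
    app_id = match.group(1)

    # `yarn application -status <id>` prints a report that includes "Final-State".
    report = subprocess.run(
        ["yarn", "application", "-status", app_id],
        capture_output=True,
        text=True,
    ).stdout
    state = re.search(r"Final-State\s*:\s*(\w+)", report)
    if not state or state.group(1) != "SUCCEEDED":
        sys.exit(f"{app_id} finished with state {state.group(1) if state else 'UNKNOWN'}")


if __name__ == "__main__":
    run_and_check(sys.argv[1:])
```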

After Joseph's fixes:

  • the daily indexations seem to be working.
  • the hourly indexations seem to be working, but we noticed that the tasks stay locked for some time before kicking in, we suspect because of the realtime indexations.

We are not going to re-index all the past months, since sanitization will slowly do the job and the only reported hole in the data has been solved.

fdans closed this task as Resolved.Jun 15 2020, 4:18 PM