Page MenuHomePhabricator

Dropping data from druid takes down aqs hosts
Open, NormalPublic


We had issues with dropping data from druid and revisited our deleting strategy via this commit:
that disables a datasource and later sends a delete task. Still last month when the system timer kicked in that did the deletion we trigger an outage, documented here:

There has to be a bigger issue with deletion of data from deep storage, we need to research that.


Related Gerrit Patches:

Event Timeline

Nuria created this task.Jun 18 2019, 4:42 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 18 2019, 4:42 PM
Milimetric assigned this task to Nuria.Jun 20 2019, 4:38 PM
Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric added a project: Analytics-Kanban.
elukey claimed this task.Jun 24 2019, 11:57 AM
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.
elukey added a project: User-Elukey.

Sent the following to the Druid users google group:

Hi everybody,
I am Luca from the Wikimedia foundation's Analytics team, we are happy users of a cluster of 3 nodes with 0.12.3. This cluster is primarily used to be the backend of a HTTP API that serves a datasource that we call "MediaWiki history reduced". We re-generate it every month (for complex reasons we are still not able to append new data), and index this data into druid. We end up having monthly "snapshots" (every one having ~4800 files / 220GB on the segments on disk cache for every node), and we periodically drop the older ones via automated scripts. Our procedure to drop the data source is:

  • disable the target datasource via the coordinator's HTTP API.
  • drop data from deep storage (in our case, HDFS) sending a kill task to the overlord's API.

The strange thing that happens is that most of the times when the above commands are issued, all the Druid brokers stop serving traffic until they get restarted. We haven't collected jstack data yet, but I am wondering if this is a known bug/use-case that somebody else already experienced in the past. Our suspicion is that the brokers get stuck on one or more historical slowing down, but we don't have more ideas about a more precise root cause since we didn't find anything useful in the logs. As far as we can see when we drop the old datasources there is no traffic coming from the brokers requesting them.
Our configuration for timeouts is the following (if it helps):

  • broker PT30S druid.server.http.defaultQueryTimeout: 60000
  • historical druid.processing.numMergeBuffers: 10
    1. 20 Druid http client threads * 3 brokers druid.server.http.numThreads: 60 druid.historical.cache.useCache: false druid.historical.cache.populateCache: false druid.server.http.defaultQueryTimeout: 20000

Any suggestion would be really appreciated :)
Thanks in advance,

elukey moved this task from Backlog to In Progress on the User-Elukey board.Jun 24 2019, 12:12 PM

Change 519181 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics|public::worker: set stricter query timeouts

Change 519181 merged by Elukey:
[operations/puppet@production] role::druid::analytics|public::worker: set stricter query timeouts

elukey added a comment.EditedJun 26 2019, 10:22 AM

I applied the changes, but from the grafana dashboard ( I can still see requests taking more than 5s to complete. So I went directly on the host, and checked the various -request.log files on which we should have the list of metrics that the prometheus exporter sees. A quick way to get those values is:

grep 2019-06-26 /var/log/druid/historical-requests.log | egrep -o 'query/time\":[0-9]*' | sort | uniq -c
grep 2019-06-26 /var/log/druid/broker-requests.log | egrep -o 'query/time\":[0-9]*' | sort | uniq -c

The above greps don't return any value above than 5s. If I see a p90 of 7s for example, I'd expect to find at least one request that took that long.

How the prometheus exporter work?

It follows basically what listed in
In the code we define some buckets:

The idea is that for every request, we +1 every bucket that has a value bigger than the request's one. For example, say that query/time for a request is 4s. The exporter does the following:

  • bucket '10' +=1
  • bucket '100' += 1
  • bucket '500' += 1
  • bucket '1000 +=1
  • bucket '10000' +=1
  • bucket 'inf' (nothing)
  • sum += query/time ms value (this is the total sum of the latency)

The histogram_quantile prometheus function (used in grafana) should then make the magic to render latency data.

The following might happen:

  • the code that I wrote to implement the above is not correct
  • the buckets granularity are not enough to get precise data

What I mean with the second one is that we only have two buckets in the range 1s->10s, that might skew calculations. Maybe adding more buckets in that range will make a difference?

Got an answer from druid-users:

I'm not aware of an existing bug/issue that matches the situation described, would you be able to provide the Broker jstack data when this happens?
The broker does maintain a view of what segments are currently being served by historicals, it seems possible that something is breaking/getting stuck due to a large number of segment drops/deletes.

Change 519361 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::data_purge: avoid saturday

Change 519361 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::data_purge: avoid saturday

Change 519365 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/software/druid_exporter@master] Add more granularity to query/time|size buckets

Mentioned in SAL (#wikimedia-operations) [2019-06-27T13:15:18Z] <elukey> start druid drop datasource test - might affect AQS - T226035

Nuria added a comment.Jun 27 2019, 3:32 PM

No alarms so test went well just with changing timeouts? or better, random bug we were not able to repro.

Sadly I believe the latter, I didn't see any impact on brokers.. Joseph loaded two snapshots and dropped them, I think that we missed to recreate some other factor that played a big role during the past outage.

elukey lowered the priority of this task from High to Normal.Jul 5 2019, 8:31 AM

Triggered manually the timer to drop the segment, that deleted 2019-02 from the public cluster. I did notice a very brief increase in latency for the druid1004 broker, but then it self recovered without causing AQS 50x. The timeouts change that we did probably had some positive effects, or we are still not able to reproduce the issue.

Nuria added a comment.Jul 5 2019, 4:41 PM

At this point I think we can close this ticket and re-open if it is to happen again?

@Nuria probably yes, I agree!

Nuria closed this task as Resolved.Jul 15 2019, 2:43 PM
elukey reopened this task as Open.Aug 16 2019, 10:53 AM

Very interesting - yesterday the systemd timer dropped old snapshots and the AQS alert fired, but only for a brief moment and then it self-recovered. This should be the proof of the theory that our fix in is the right path to follow. Re-opening to see if anything can be done to avoid any alarm to fire at all.

Change 519365 merged by Elukey:
[operations/software/druid_exporter@master] Add more granularity to query/time|size buckets

Today I added some changes to the prometheus-druid-exporter:

elukey moved this task from In Progress to Stalled on the User-Elukey board.Sep 17 2019, 10:26 AM
elukey moved this task from In Progress to Paused on the Analytics-Kanban board.Oct 4 2019, 1:42 PM