
Keep webrequest_sampled_128's druid segments for more than a week
Closed, Resolved · Public · 1 Estimated Story Points

Description

During the SRE summit it was asked whether webrequest_sampled_128 data could be kept for more than a week in Turnilo/Druid (ideally as long as possible, in accordance with our 3-month retention policy), since it would be useful for various use cases.

Event Timeline

elukey triaged this task as Medium priority. Jun 21 2019, 5:41 AM
elukey created this task.
elukey renamed this task from "Keep webrequest_sampled_128's druid segments for more than a week in Druid" to "Keep webrequest_sampled_128's druid segments for more than a week". Jun 21 2019, 5:41 AM

This data set's size in Druid is 100 GB per week.
We can increase it to a month with our current capacity.
Would that be OK?

Yes, definitely. A month would surely help Brandon and the Traffic team a lot (they need to look at week-over-week trends), and surely Faidon as well.

But in the Druid admin UI I just see 1 week?

Screen Shot 2019-06-24 at 10.13.23 AM.png (230×386 px, 36 KB)

I see, there was a rule in the Druid UI dropping this data every 7 days. I think we should move away from that and let our Puppet config be the one that dictates how much data we keep. cc @JAllemandou in case there is any issue.
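
(For reference, a retention change like this can also be applied programmatically rather than through the UI. The following is only a minimal sketch against the Druid coordinator's rules API; the coordinator host:port, tier name, and replica count are assumptions, not the actual cluster values.)

```
# Minimal sketch (assumed hostnames/values): keep the last 30 days of
# webrequest_sampled_128 loaded on historicals, drop anything older.
import json
import requests

COORDINATOR = "http://druid-coordinator.example.org:8081"  # assumed host:port
DATASOURCE = "webrequest_sampled_128"

# Rules are evaluated top to bottom.
rules = [
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"_default_tier": 2}},  # replica count is an assumption
    {"type": "dropForever"},
]

resp = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/rules/{DATASOURCE}",
    data=json.dumps(rules),
    headers={
        "Content-Type": "application/json",
        # Optional audit headers recorded in the coordinator's rule history.
        "X-Druid-Author": "ops",
        "X-Druid-Comment": "keep 30 days of webrequest_sampled_128",
    },
)
resp.raise_for_status()
```

Note that dropping segments with rules only unloads them from the historical nodes; the segments stay in deep storage until a kill task removes them, which is the 30-days-in-Druid / 60-days-in-deep-storage split discussed below.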

Change 518769 had a related patch set uploaded (by Nuria; owner: Nuria):
[operations/puppet@production] Keeping webrequest_sampled data for 30 days

https://gerrit.wikimedia.org/r/518769

@Nuria: We set it up this way on purpose, in order to facilitate loading data back into Druid in case it is needed (the data is present in deep storage for 60 days) while still saving space on Druid.
Having agreed that we should keep 1 month of data in Druid, I still recommend using rules to unload data after 1 month and keeping 60 days in deep storage, as 2 months would mean about 2 TB per server in Druid, which is probably too much.

The thing is that in 2 years of having this data available we have never needed to load data from deep storage. Also, since there was no documentation about this, I do not think any of us realized that loading from deep storage was a possibility.

I would prefer to keep 1 month of data, and if we think we need more we can make the case for the hardware, without having to do any "special loading", but rather just have the data available for everyone. Do you see any problems with this? There might be something I am missing regarding deep storage, so please let me know.

With a better/more precise explanation:

  • In order for data to be dropped from deep storage, it first needs to be unloaded from the historical nodes. This can be done in two ways: disabling a full datasource, or disabling segments using rules.
  • Once segments are disabled, you can run the kill task to drop them (sketched below).

Given the need to use rules to disable segments from the historicals, I'd rather keep the maximum amount of data in Hadoop (no storage issue there so far).
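
(To illustrate the second bullet above: a kill task is submitted to the overlord and permanently deletes only segments that have already been disabled. This is a hedged sketch with a placeholder hostname and interval, not the actual refinery tooling.)

```
# Minimal sketch (assumed values): submit a Druid kill task to permanently
# delete segments that are already disabled/unused in the given interval.
import requests

OVERLORD = "http://druid-overlord.example.org:8090"  # assumed host:port

kill_task = {
    "type": "kill",
    "dataSource": "webrequest_sampled_128",
    # Only already-disabled segments within this interval are deleted from
    # deep storage; the interval here is a placeholder.
    "interval": "2019-01-01/2019-04-01",
}

resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=kill_task)
resp.raise_for_status()
print("kill task id:", resp.json().get("task"))
```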

So, let me understand: the https://github.com/wikimedia/analytics-refinery/blob/master/bin/refinery-drop-druid-deep-storage-data script requires that we disable segments using rules before we can drop data?

Correct (see https://druid.apache.org/docs/latest/tutorials/tutorial-delete-data.html, the "How to permanently delete data" paragraph). We can also use API calls to mark segments as unused if we prefer not to use rules.
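
(A hedged sketch of that API alternative, assuming a Druid version that exposes the coordinator's markUnused endpoint; the hostname and interval are placeholders, and older versions offer per-segment or per-datasource disable endpoints instead.)

```
# Minimal sketch (assumed values): mark the segments of one interval as unused
# via the coordinator API, as an alternative to retention rules, so that a
# kill task can later delete them from deep storage.
import requests

COORDINATOR = "http://druid-coordinator.example.org:8081"  # assumed host:port
DATASOURCE = "webrequest_sampled_128"

resp = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}/markUnused",
    json={"interval": "2019-01-01/2019-02-01"},  # placeholder interval
)
resp.raise_for_status()
```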

Change 518769 abandoned by Nuria:
Keeping webrequest_sampled data for 30 days

Reason:
...

https://gerrit.wikimedia.org/r/518769

OK, I have edited the UI rules to have 30 days of retention; deep storage will keep 60 days of data, so no changes are needed there.

Nuria changed the point value for this task from 0 to 1.

Leaving this in Done for a couple of weeks to make sure the settings are taking effect.
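
(One way to check this, sketched here with an assumed coordinator host:port: list the intervals still loaded on the historicals and look at the oldest one.)

```
# Minimal sketch (assumed host:port): list the loaded intervals of
# webrequest_sampled_128 to verify that the 30-day retention is taking effect.
import requests

COORDINATOR = "http://druid-coordinator.example.org:8081"  # assumed host:port
DATASOURCE = "webrequest_sampled_128"

resp = requests.get(
    f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}/intervals"
)
resp.raise_for_status()
intervals = sorted(resp.json())
print(f"{len(intervals)} loaded intervals; oldest: {intervals[0]}")
```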

I can see 8 days of data now, so I think we are good. Closing.