
Increase webrequest_sampled_live Druid datasource's retention
Open, Needs Triage, Public

Description

Hi folks!

Filippo and I were wondering if we could increase the data retention for webrequest_sampled_live in Druid Analytics. From what I can see in the coordinator's console, the datasource is around 20G, so maybe setting the retention to 3 days or more is ok/doable? It would help folks doing traffic analysis a lot :)

Let us know!

Event Timeline

I see no problem with keeping a few more days of the webrequest_sampled_live data in Druid.

Related question: do we wish to remove webrequest_sampled_128 and keep only webrequest_sampled_live?

Mentioned in SAL (#wikimedia-analytics) [2023-05-25T12:31:28Z] <elukey> set "loadByPeriod(P3D+future), dropForever" for webrequest_sampled_live in druid-analytics - T337460

> I see no problem with keeping a few more days of the webrequest_sampled_live data in Druid.

Thanks! I've set 3 days in the coordinator's settings. Let's see how it goes, then we can ramp up to, say, 7 or 8 days?
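
For reference, here is a minimal sketch of what the "loadByPeriod(P3D+future), dropForever" rule set could look like as a Druid retention-rules payload. The helper name `build_retention_rules` and the replicant count are assumptions made for illustration, not the values actually configured on druid-analytics.

```python
import json


def build_retention_rules(period="P3D", include_future=True, replicants=None):
    """Build a Druid rule list equivalent to 'loadByPeriod(<period>+future), dropForever'.

    Rules are evaluated in order: segments inside the period are loaded,
    everything else falls through to dropForever.
    """
    if replicants is None:
        # Assumption: two replicas on the default tier; the real cluster may differ.
        replicants = {"_default_tier": 2}
    return [
        {
            "type": "loadByPeriod",
            "period": period,                 # ISO 8601 period, e.g. P3D or P30D
            "includeFuture": include_future,  # the "+future" part of the rule
            "tieredReplicants": replicants,
        },
        {"type": "dropForever"},
    ]


if __name__ == "__main__":
    # Print the payload that would be sent for a 3-day retention.
    print(json.dumps(build_retention_rules("P3D"), indent=2))
```

A payload like this is normally submitted either through the coordinator console or by POSTing it to the coordinator's `/druid/coordinator/v1/rules/<datasource>` endpoint. Ramping up to 7 or 30 days would then only mean changing the period to `P7D` or `P30D`.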

> Related question: do we wish to remove webrequest_sampled_128 and keep only webrequest_sampled_live?

@Volans Any preference?

What do you mean by removing it? Making it start after 3 days?
It's very useful to be able to check things a few days later too. In incident documents in particular, it's very useful to be able to link the _128 one, so that the linked data lasts for a month.
I personally don't mind whether we keep the new live one or the old one, but I think that having one dataset that lasts longer (say a month) is very useful!

I think Joseph meant to suggest keeping only webrequest_sampled_live, with 30 days of retention, and deprecating webrequest_sampled_128 (this is my understanding after rereading the proposal).

The total size of the datasource should be around ~20GB x 30 = 600GB, which is more or less the size of webrequest_sampled_128 right now. My only doubt is about segment size: I don't recall whether Druid automatically re-compacts, so queries over big time ranges may be slower?
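
As a rough sanity check of the estimate above, here is a small sketch that scales the observed size linearly with retention. It assumes the ~20GB figure corresponds to roughly one day of data (as the x 30 multiplication implies) and ignores replication and any savings from compaction; the function name is made up for illustration.

```python
def estimate_datasource_size_gb(daily_size_gb, retention_days):
    """Linear estimate: size grows proportionally with the number of retained days."""
    return daily_size_gb * retention_days


if __name__ == "__main__":
    # Assumption: ~20GB per day, taken from the coordinator console figure.
    for days in (3, 7, 30):
        print(f"{days:>2} days -> ~{estimate_datasource_size_gb(20, days)} GB")
```

With these assumptions, 30 days comes out at about 600 GB, matching the figure quoted above.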

I am pretty ignorant about this part of Druid, so if Joseph likes the idea I am +1 to ramp up live to 30 days :)