
Increase webrequest_sampled_live Druid datasource's retention
Open, Needs Triage, Public

Description

Hi folks!

Filippo and I were wondering if we could increase the data retention for webrequest_sampled_live in Druid Analytics. The datasource's size seems to be around ~20 GB from what I can see in the coordinator's console, so maybe setting the retention to 3 days or more would be ok/doable? It would help folks doing traffic analysis a lot :)

Let us know!
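
For context, the size figure above can also be pulled from the coordinator's API instead of the console. A minimal sketch in Python, assuming a hypothetical coordinator host and the response shape documented for Druid's datasources endpoint (verify against the deployed version):

  import requests

  COORDINATOR = "http://druid-coordinator.example.org:8081"  # hypothetical host
  DATASOURCE = "webrequest_sampled_live"

  # Full metadata for one datasource; the response carries a "segments"
  # section with the total size and segment count.
  resp = requests.get(f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}")
  resp.raise_for_status()
  segments = resp.json()["properties"]["segments"]
  print(f"{segments['count']} segments, {segments['size'] / 1e9:.1f} GB")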

Event Timeline

I see no problem with keeping a few more days of the webrequest_sampled_live data in Druid.

Related question: do we wish to remove webrequest_sampled_128 and only keep webrequest_sampled_live?

Mentioned in SAL (#wikimedia-analytics) [2023-05-25T12:31:28Z] <elukey> set "loadByPeriod(P3D+future), dropForever" for webrequest_sampled_live in druid-analytics - T337460

I see no problem with keeping a few more days of the webrequest_sampled_live data in Druid.

Thanks! I've set 3 days in the coordinator's settings; let's see how it goes, then we can ramp up to, say, 7-8 days?
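
For reference, a rule set like "loadByPeriod(P3D+future), dropForever" can also be applied through the coordinator's rules API rather than the console. A minimal sketch, assuming a hypothetical coordinator host and a guessed replicant count for the default tier:

  import requests

  COORDINATOR = "http://druid-coordinator.example.org:8081"  # hypothetical host
  DATASOURCE = "webrequest_sampled_live"

  # loadByPeriod(P3D+future): keep the last 3 days (and future data)
  # loaded on the historicals; dropForever: drop everything older.
  rules = [
      {
          "type": "loadByPeriod",
          "period": "P3D",
          "includeFuture": True,
          "tieredReplicants": {"_default_tier": 2},  # replicant count is an assumption
      },
      {"type": "dropForever"},
  ]

  resp = requests.post(
      f"{COORDINATOR}/druid/coordinator/v1/rules/{DATASOURCE}",
      json=rules,
  )
  resp.raise_for_status()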

Related question: do we wish to remove webrequest_sampled_128 and only keep webrequest_sampled_live?

@Volans Any preference?

What do you mean by removing it? Making it start after 3 days?
It's very useful to be able to check things a few days later as well, and in incident documents in particular it's very useful to be able to link the _128 one, since it lasts for a month.
I personally don't mind whether we keep the new live one or the old one, but I think that having one dataset that lasts longer (say, a month) is very useful!

I think that Joseph meant to ask whether we want to keep only 30 days of webrequest_sampled_live and deprecate webrequest_sampled_128 (this is my understanding after re-reading the proposal).

The total size of the datasource should be around ~20 GB/day × 30 days = 600 GB, which is more or less the size of webrequest_sampled_128 right now. The only doubt I have is related to segment size: I don't recall whether Druid automatically re-compacts or not, so queries may be slower for big time ranges?

I am pretty ignorant about this part of Druid, so if Joseph likes the idea I am +1 to ramping live up to 30 days :)
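
On the re-compaction doubt: as far as I know, Druid only compacts when an auto-compaction config is submitted to the coordinator. A hedged sketch of what enabling it could look like (apart from the datasource name, every value here is an assumption):

  import requests

  COORDINATOR = "http://druid-coordinator.example.org:8081"  # hypothetical host

  # Auto-compaction: the coordinator periodically issues tasks that merge
  # the small hourly segments into daily ones, skipping the most recent
  # day where streaming ingestion is still appending.
  config = {
      "dataSource": "webrequest_sampled_live",
      "skipOffsetFromLatest": "P1D",  # assumption: leave the live day alone
      "granularitySpec": {"segmentGranularity": "DAY"},
  }

  resp = requests.post(
      f"{COORDINATOR}/druid/coordinator/v1/config/compaction",
      json=config,
  )
  resp.raise_for_status()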

I compared the segments stored in Druid for the webrequest_sampled_live and webrequest_sampled_128 datasources. There are 2 segments per hour for each datasource; the _128 one spreads the data better than the _live one (one big segment and one small segment instead of 2 medium segments). This shouldn't impact performance or storage much. Another difference is that data for _128 is re-compacted daily, leading to 32 files instead of the 48 (24×2) for _live. This is not a big difference either.

Those small differences make me think that keeping only the _live datasource with 30 days of data is feasible.
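
For anyone who wants to reproduce the comparison, the segment metadata is available from the coordinator; a rough sketch (host is hypothetical):

  import requests
  from collections import Counter

  COORDINATOR = "http://druid-coordinator.example.org:8081"  # hypothetical host

  for ds in ("webrequest_sampled_live", "webrequest_sampled_128"):
      resp = requests.get(
          f"{COORDINATOR}/druid/coordinator/v1/datasources/{ds}/segments?full"
      )
      resp.raise_for_status()
      segments = resp.json()
      # Count segments per interval and eyeball the size distribution.
      per_interval = Counter(s["interval"] for s in segments)
      sizes = sorted(s["size"] for s in segments)
      print(ds, "segments:", len(segments),
            "max per interval:", max(per_interval.values()),
            "median size (MB):", round(sizes[len(sizes) // 2] / 1e6, 1))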

Last but not least: I have double-checked, and the segments generated for the _live datasource are saved on HDFS (as one would expect) - we need to remember to drop them from there after 90 days (see https://github.com/wikimedia/operations-puppet/blob/7dbf4af33e71d3e4aeddf1164b2a83f40ddaef6d/modules/profile/manifests/analytics/refinery/job/data_purge.pp#L154)!
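
Worth noting: dropForever only unloads segments from the historicals; the copies in deep storage (HDFS here) stay around until they are explicitly purged. Besides a refinery purge job, Druid's native way to do this is a kill task submitted to the overlord. A hedged sketch (host and interval are assumptions):

  import requests

  OVERLORD = "http://druid-overlord.example.org:8090"  # hypothetical host

  # A kill task permanently deletes segments in the given interval from
  # both the metadata store and deep storage. It only touches segments
  # already marked unused, e.g. the ones dropped by the retention rules.
  task = {
      "type": "kill",
      "dataSource": "webrequest_sampled_live",
      "interval": "2023-01-01/2023-03-01",  # assumption: older than 90 days
  }

  resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=task)
  resp.raise_for_status()
  print("submitted kill task:", resp.json()["task"])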

Mentioned in SAL (#wikimedia-analytics) [2023-05-31T07:29:42Z] <elukey> set "loadByPeriod(P8D+future), dropForever" for webrequest_sampled_live in druid-analytics - T337460

Mentioned in SAL (#wikimedia-analytics) [2023-06-07T08:02:28Z] <elukey> set "loadByPeriod(P15D+future), dropForever" for webrequest_sampled_live in druid-analytics - T337460

Change 927976 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] analytics refinery: add a data purge job for webrequest_sampled_live

https://gerrit.wikimedia.org/r/927976

Change 927976 merged by Elukey:

[operations/puppet@production] analytics refinery: add a data purge job for webrequest_sampled_live

https://gerrit.wikimedia.org/r/927976

Next steps:

  • Wait a couple of weeks to get webrequest_sampled_live retention to 30 days, check Druid metrics, etc.
  • Discuss with DE and SRE about deprecating webrequest_sampled_128

@Volans asking your opinion here since you built the Superset dashboard for SRE. Deprecating webrequest_sampled_128 means that if we have a Benthos issue, or if we lose data that is streamed to the Kafka topic, we'll see "holes" in the related data in Druid until they fall out of retention. We don't have this problem with webrequest_sampled_128, since its data comes from Hadoop, so it would be a new failure scenario for us. I think it should be fine to keep only sampled_live, but let me know your thoughts!

After a chat with @elukey I now better understand the two pipelines:

webrequest_sampled_128:

varnish -> varnishkafka -> kafka jumbo -> HDFS -> druid

webrequest_sampled_live:

varnish -> varnishkafka -> kafka jumbo -> benthos -> kafka jumbo -> druid

There are various moving parts in both, and I don't think that moving to the _live one would open us up to more failures, just to different failure scenarios.
In addition, as a recent issue has shown (T337088), the existing _128 can also end up with missing data. But it is possible to backfill it.

One issue with maintaining both is that when an incident occurs, people share links to the live Superset dashboard; then, when you want to revisit them or discuss the incident during the incident review ritual, you have to re-create all the links against the _128 dashboard, because the data behind the live links is past its retention period.
So having a single dashboard to maintain and use would also simplify usage and sharing.

In light of this, I have no problem with moving to webrequest_sampled_live only: we'd maintain a single Superset dashboard and reduce data duplication and confusion.

My only question is whether, in the future, it would be possible to evaluate ways to backfill data into Druid in case of issues in the pipeline; currently I would not consider it a blocker.

My 2 cents :)

My only question is whether, in the future, it would be possible to evaluate ways to backfill data into Druid in case of issues in the pipeline; currently I would not consider it a blocker.

We could switch the batch loading job from webrequest_sampled_128 to webrequest_sampled_live, making this a real lambda architecture :)
The only downside I can see here is that the sampling and the various enhancement functions would be done differently (Java vs Python), so the newly indexed batch data (every hour) would change the existing streamed data. Normally the changes would not be statistically significant, but maybe it could be problematic for use cases I'm not thinking of.
Let's continue to brainstorm and make this a single Druid datasource :)