Page MenuHomePhabricator

Move some analytics jobs to day time in Virginia
Open, MediumPublic

Description

Analytics is the second most power hungry cluster in production (just behind k8s). Given that it consumes 38kW on average. And looking PJM grid data (436g for each kWh for three months) this roughly translates to 397kg of CO2 every day or 145 tonnes of CO2 every year. That's roughly 350 people doing a trans-Atlantic flight (Based on ICAO calculator, FRA-JFK)

Other clusters have an extremely stable consumption so we can't move them but analytics varies a lot:
https://grafana.wikimedia.org/d/f64mmDzMz/power-usage?orgId=1&viewPanel=110

image.png (1,900×817 px, 190 KB)

PJM grid is greener during the day (because of the solar the average drops to 395g per kWh in Winter and I estimate it drops to ~336 g per kWh) and for every 10% of power consumption moved to day time we save 3.7kg CO2 in winter and 9.4kg CO2 in summer every day. That translates to 2.4 tonnes of CO2 saved every year (assuming average of summer and winter). That's roughly 6 people doing a trans-Atlantic flight. Noting that the impact will constantly increase since several solar projects are connecting to PJM including Fox Squirrel Solar (last part connected this December) and BayWa r.e. solar projects (to be finished by end of 2025).

Event Timeline

Related: T371321: [Idea] Collect pageview data using client-side instrumentation. I'd expect that energy consumption would go down if we stopped searching for needles in the webrequest haystack.

Ottomata renamed this task from Move some anlaytics jobs to day time in Virginia to Move some analytics jobs to day time in Virginia.Feb 18 2025, 6:54 PM

FWIW. Almost all of the power consumption in analytics infra (understandably) is coming from an-worker nodes. There is some large consumption coming from presto cluster too: https://grafana.wikimedia.org/goto/ytdntxGDR?orgId=1

Since the vast majority of analytic jobs are scheduled via Airflow, we could certainly change the schedule cron definitions of each DAG to achieve this, but it would require careful planning, as today we rely heavily on the @daily alias, which makes everything run at midnight UTC.

We could certainly pick one heavy job and do an experimental rescheduling as a start.

Since the vast majority of analytic jobs are scheduled via Airflow, we could certainly change the schedule cron definitions of each DAG to achieve this, but it would require careful planning, as today we rely heavily on the @daily alias, which makes everything run at midnight UTC.

That seems to be 7pm virgina time. If we push the default a bit earlier, that could automatically have the intended effect.

We could certainly pick one heavy job and do an experimental rescheduling as a start.

That sounds good to me!

In T384166#11565913, @xcollazo wrote:
Since the vast majority of analytic jobs are scheduled via Airflow, we could certainly change the schedule cron definitions of each DAG to achieve this, but it would require careful planning, as today we rely heavily on the @daily alias, which makes >>everything run at midnight UTC.

That seems to be 7pm virgina time. If we push the default a bit earlier, that could automatically have the intended effect.

Hm, I don't think we can push to earlier. We need the last hour of the UTC day worth of source data to compute the day. Which is when the daily job triggers currently.
We could shift the end/start of day to earlier (maybe that's what you meant). Like instead of computing days from 00:00UTC to 23:59UTC, we could do it from 20:00UTC to 19:59UTC? This should work the same in theory, but might be quite confusing for people consuming the resulting data though...

For monthly jobs maybe we could delay them 12 hours or so. So that when they trigger it's 7am in Virginia, instead of 7pm? For daily jobs that would be too much delay...

For monthly jobs maybe we could delay them 12 hours or so.

Right, we would have to be selective, and certainly we should not delay anything that runs hourly or that generally has a tight SLA.

Perhaps we should figure what are our most compute intensive jobs (i.e. biggest bang for the buck) and go from there?

Perhaps we should figure what are our most compute intensive jobs (i.e. biggest bang for the buck) and go from there?

+1

Ahoelzl triaged this task as Medium priority.May 1 2026, 3:43 PM

Moving daily and monthly jobs will incur a 12 hour data delivery delay which will impact downstream pipelines.
The estimated climate impact reduction is less than 2% though. Keeping this in the backlog for now until we can clarify the delivery delay is acceptable.

Another idea: Maybe already implemented. A lot of my queries for my work are done on webrequest raw because most the derivative tables won't work for my case and they are over long period of time which makes my queries slow and expensive, OTOH, I don't need an exact number or value. I just want to get the sense of how things are. In fact, the only times, that I needed an exact result were when I had to forensics which had exact time was known and the queries were fast.

I said all of this to suggest that maybe having a derivative 1:128 sample table of webrequest could help a lot. We could move some jobs to use the sampled version and just multiply the result by 128 at the end. We could change the partition to be daily for example. This would help me a lot actually.

I know there is one in druid but I couldn't get it to work on spark-sql, it was saying the table doesn't exist. If I'm doing something wrong, then it'd be nice to switch some of the queries in airflow dags directly to the druid table maybe.

Another idea: Maybe already implemented. A lot of my queries for my work are done on webrequest raw because most the derivative tables won't work for my case and they are over long period of time which makes my queries slow and expensive, OTOH, I don't need an exact number or value. I just want to get the sense of how things are. In fact, the only times, that I needed an exact result were when I had to forensics which had exact time was known and the queries were fast.

I said all of this to suggest that maybe having a derivative 1:128 sample table of webrequest could help a lot. We could move some jobs to use the sampled version and just multiply the result by 128 at the end. We could change the partition to be daily for example. This would help me a lot actually.

I know there is one in druid but I couldn't get it to work on spark-sql, it was saying the table doesn't exist. If I'm doing something wrong, then it'd be nice to switch some of the queries in airflow dags directly to the druid table maybe.

What about if you did the 1/128 sampling directly within your spark-sql query? Would this be efficient?

I was thinking that you could simply add a condition like this:

WHERE mod(sequence,128) = 0

I tried this in a SQL Lab query here: https://superset.wikimedia.org/sqllab?savedQueryId=1422

Maybe it won't be any quicker than examining the full webrequest table, but it might be worth a go.

oh that's a clever trick. I tried it on a query and it went from 235s to 175s which is nice (granted, the data might have gotten to cache, etc. so it's not 100% accurate, I didn't benchmark) but if the data isn't there to run any checks on, it could be even faster, we can increase the partition time (from hour to day for example) plus it's more explicit for everyone to use (I didn't think of this trick until now). Definitely a good idea to start from.

And/or, TABLESAMPLE? Although I'm not sure if on our version of Spark that will reduce IO. Spark 4 looks like it has TABLESAMPLE SYSTEM which will skip blocks it doesn't need to read?

derivative 1:128 sample table of webrequest

Also FYI in case you aren't aware of this one: https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Traffic/Pageview_actor (but that might not work for what you are doing)

Also, there is a sampled 1:128 webrequest in Druid! The easiest ways to query it are via Turnilo or via Superset SQL lab:

Screenshot 2026-05-19 at 20.56.43.png (910×560 px, 46 KB)

This dataset in Druid is fed live from a stream, but I think only keeps the past one month of data.

We also download the 1:128 webrequest sampled data on the hadoop cluster, but we have not created a table of it. Could be worth it.

Having a table for that would be really nice for me. I used pageview_actor and it worked sometimes but all cases (e.g. when I'm working to extract insight on multimedia usage)

1:128 webrequest sampled data

And/or we could add Hive bucketing to webrequest?