All transclusion-related events are sent to the change-prop.transcludes.resource-change topic, which currently receives around 800 events per second. One of the ChangeProp workers is constantly at around 90% CPU usage, which means it's almost at its limit, since a worker can only use one CPU core.
Most likely this is the worker doing Varnish purges for the transcludes topic: constructing HTCP packets is pretty CPU-intensive and isn't bound on any IO. But we need to verify that. A brutal way to verify it would be to kill the worker and look at the graphs, but I'm not sure that's a good idea. A less invasive way would be to add some sampled logging that includes the worker pid.
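A rough sketch of what the sampled logging could look like. This is not existing ChangeProp code: `makeSampledLogger`, its parameters, and the 0.001 sample rate in the usage comment are all assumptions made for illustration. The pid is passed in explicitly so the helper stays testable.

```typescript
// Hypothetical helper, not part of ChangeProp: logs a sampled fraction of
// processed events together with the pid of the worker that handled them.
type LogSink = (line: string) => void;

function makeSampledLogger(
    pid: number,                          // worker pid, e.g. process.pid
    sampleRate: number,                   // fraction of events to log, 0..1
    sink: LogSink,                        // where log lines go
    random: () => number = Math.random    // injectable for testing
): (topic: string, rule: string) => void {
    return (topic, rule) => {
        // Emit a line for roughly `sampleRate` of all calls
        if (random() < sampleRate) {
            sink(JSON.stringify({ pid, topic, rule }));
        }
    };
}

// Possible usage inside a worker (names are illustrative):
//   const logSample = makeSampledLogger(process.pid, 0.001, console.log);
//   logSample('change-prop.transcludes.resource-change', 'purge_varnish');
```

Counting the sampled lines per pid and per rule should show which worker handles the purge rule, and that pid can then be matched against the one sitting at 90% CPU.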
We need to consider partitioning the transcludes topic and adding support for partitioned topics to ChangeProp. Support for partitioning will also come in handy when implementing T157088. Each rule should have a parameter that selects whether to use one worker for all partitions or a worker per partition, since we only want one rule to respect partitioning.
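To make the two modes concrete, here is a minimal sketch of how partitions might be mapped to workers. The mode names, the `assignPartitions` function, and the round-robin fallback when there are more partitions than workers are all assumptions, not a proposal for the actual config format.

```typescript
// Hypothetical per-rule setting; the names are illustrative only.
type PartitionMode = 'single_worker' | 'worker_per_partition';

// Returns a map of worker id -> list of partition numbers it consumes.
function assignPartitions(
    partitionCount: number,
    workerIds: number[],
    mode: PartitionMode
): Record<number, number[]> {
    if (workerIds.length === 0) {
        throw new Error('at least one worker is required');
    }
    const assignment: Record<number, number[]> = {};
    for (let p = 0; p < partitionCount; p++) {
        // single_worker: everything goes to the first worker.
        // worker_per_partition: spread partitions round-robin, which gives
        // one worker per partition when there are enough workers.
        const worker = mode === 'single_worker'
            ? workerIds[0]
            : workerIds[p % workerIds.length];
        if (!assignment[worker]) {
            assignment[worker] = [];
        }
        assignment[worker].push(p);
    }
    return assignment;
}
```

With 4 partitions and workers 10 and 11, `worker_per_partition` would give worker 10 partitions 0 and 2 and worker 11 partitions 1 and 3, while `single_worker` would give all four to worker 10.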