As part of the recent switch of changeprop processing from eqiad to codfw, we noticed a temporary dip in Parsoid p99 latencies:
However, p99 duly returned to its previous level after a few hours.
I think I have noticed this pattern before. I have also seen heap usage of specific workers grow over time, which did affect latency when usage grew beyond about 1G.
The current Parsoid heap limit is set at 800m, which is fairly high. Service-runner heap limits are only enforced after being breached for several minutes in a row, so they allow for temporary spikes. It is thus not necessary to set the limit to the absolute maximum memory you expect a sane request to consume.
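To illustrate why a lower limit still tolerates short spikes, here is a rough sketch of the "breached several checks in a row" behaviour. This is not service-runner's actual code: the `HeapWatchdog` class, its option names, the number of consecutive checks, and the sample values are all made up for illustration.

```typescript
// Hypothetical sketch of sustained-breach enforcement; not service-runner's implementation.
interface WatchdogOptions {
  limitMb: number;                // configured heap limit (assumed unit: MB)
  maxConsecutiveBreaches: number; // checks in a row over the limit before a restart
}

class HeapWatchdog {
  private breaches = 0;
  private readonly opts: WatchdogOptions;

  constructor(opts: WatchdogOptions) {
    this.opts = opts;
  }

  // Record one heap sample; returns true once the worker should be restarted.
  check(heapUsedMb: number): boolean {
    if (heapUsedMb > this.opts.limitMb) {
      this.breaches += 1;
    } else {
      // Any sample back under the limit resets the streak, so a spike is forgiven.
      this.breaches = 0;
    }
    return this.breaches >= this.opts.maxConsecutiveBreaches;
  }
}

// Made-up samples: one isolated spike, then sustained growth above a 600 MB limit.
const watchdog = new HeapWatchdog({ limitMb: 600, maxConsecutiveBreaches: 3 });
const samples = [450, 700, 500, 650, 680, 710];
const decisions = samples.map((mb) => watchdog.check(mb));
console.log(decisions);
```

With these samples, the isolated 700 MB spike does not trigger a restart; only the third consecutive sample above the limit does. The limit therefore needs to sit above the sustained heap size of a healthy worker, not above the largest momentary peak.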
Reducing this heap limit can potentially reduce p99 latencies by restarting large workers sooner. It might also reduce the rate of non-graceful worker restarts. I would thus propose to gradually lower the configured heap limit, perhaps to 600m first, followed by 500m or 400m.