Starting today, 2020-10-15, we have noticed an increase in system CPU usage by the systemd unit trafficserver.service. The first occurrence was between 09:19 and 11:30.
The problem seems to be affecting cache_text nodes in multiple DCs. We have observed it in eqsin, ulsfo, and esams. While the issue is ongoing, frontend traffic is not negatively affected.
While this is happening, purged does not manage to send PURGEs quickly enough and starts queuing locally to the point of triggering the following alert, which resolves itself after a couple of minutes:
14:59 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060
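The self-resolving nature of the alert is consistent with a simple backlog model: the queue grows while PURGEs arrive faster than they can be sent, and drains once the send rate recovers. The sketch below only illustrates that reasoning. The function name and all rates are hypothetical values chosen for the example, not measurements from purged or trafficserver.

```python
def simulate_backlog(arrival_rate, drain_rates):
    """Return the queue depth after each second.

    arrival_rate: messages arriving per second (assumed constant).
    drain_rates: messages that can be sent in each successive second.
    """
    depth = 0
    depths = []
    for drain in drain_rates:
        # The queue cannot go below zero: spare capacity is simply unused.
        depth = max(0, depth + arrival_rate - drain)
        depths.append(depth)
    return depths


# Hypothetical numbers: 100 PURGEs/s arrive; the backend accepts 150/s
# normally and only 40/s during a 10-second slowdown.
drain_rates = [150] * 5 + [40] * 10 + [150] * 15
depths = simulate_backlog(100, drain_rates)

print("peak backlog:", max(depths))
print("final backlog:", depths[-1])
```

Under these assumed rates the backlog peaks at the end of the slowdown and returns to zero once the spare send capacity has worked through the queued messages, which matches the pattern of an alert that fires and then clears on its own.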