While checking the telemetry metrics for jobrunner -> wdqs-internal, we found unusual patterns in the error rates.
On closer inspection, wdqs-internal appears to be serving more requests (of the fallback type) and is therefore throttling more of them
(cf. https://grafana-rw.wikimedia.org/d/000000344/wikidata-quality?orgId=1&refresh=30s).
The system is reacting as it was configured to, but should we adapt the service to this new behavior if it persists?
Are there ways to measure the actual user-impact of these errors?
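One possible first step, sketched here under assumptions: compare the share of throttled requests before and after the change, to tell apart "more traffic, same throttling rate" from "throttling growing faster than traffic". The function names and the request counts below are hypothetical and are not taken from the dashboard; this does not measure user impact by itself.

```python
from typing import Tuple

# A window is (throttled_requests, total_requests) over some time range.
Window = Tuple[int, int]


def throttle_ratio(throttled: int, total: int) -> float:
    """Return the fraction of requests that were throttled (0.0 if no traffic)."""
    if throttled < 0 or total < 0 or throttled > total:
        raise ValueError("expected 0 <= throttled <= total")
    return throttled / total if total else 0.0


def throttling_outpaces_traffic(before: Window, after: Window) -> bool:
    """Return True if the throttled share of requests rose between the two windows."""
    return throttle_ratio(*after) > throttle_ratio(*before)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    before = (50, 1000)
    after = (300, 2000)
    print(throttle_ratio(*before), throttle_ratio(*after))
    print(throttling_outpaces_traffic(before, after))
```

If the ratio is stable while the absolute error count rises, the extra errors mostly reflect the extra load; if the ratio rises, the current throttling settings are rejecting a larger share of requests than before.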
AC:
- determine whether any action needs to be taken
- if so, configure the system to support this load; otherwise, decline the task