We want a reliable way to follow and monitor general usage of the endpoint but, more importantly, we want to check that our assumptions about edge cases are correct and remain so. There are already a couple of tasks to add specific measurements; this ticket serves as the more holistic view of how (and what) to monitor.
- Monitor how many requests result in a parser cache miss, and therefore a parsing action.
- Monitor how many requests result in fetching from FlaggedRevs cache (T421013: Add monitoring for parse requests in Attribution API)
- Monitor how many requests result in a failure from the FlaggedRevs cache (and therefore produce no answer for the number of citations)
- Monitor how many requests return HTML elements with display:none, which suggests invisible watermarks that result in doubled values
- Latency exceeding 400ms
- 'null' value rates across fields --> Which fields are returning null, how frequently
- Log null events for data fetch failures to logstash; ignore unimplemented features (like contributor count)
- 404 spikes --> Differentiate image lookup failures from standard poor formatting
- Potentially group 404s by namespace (to easily differentiate files from articles, for example)
- Timing metrics (in lieu of OpenTelemetry) --> Break down the different sub-calls & workflows so we can more easily identify bottlenecks
- Signal usage --> Track query parameter usage within 'expand' to determine how often essential-only vs higher levels of attribution are requested
NOTE: Per team discussions, we will create a dedicated Attribution API dashboard to house the metrics listed above.
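
To make a few of the measurements above concrete, here is a minimal sketch of how per-request records could be rolled up into dashboard counters (slow requests over 400ms, null rates per field, 404s by namespace, 'expand' usage). This is an illustration only, not the endpoint's implementation: the `RequestRecord` shape, the field names, the `essential_only` label, and the `summarize` helper are all hypothetical. The namespace IDs follow the standard MediaWiki numbering (0 for articles, 6 for files).

```python
from collections import Counter
from dataclasses import dataclass, field

LATENCY_THRESHOLD_MS = 400  # threshold named in the list above

# Hypothetical: fields that are not implemented yet, so their nulls are ignored.
UNIMPLEMENTED_FIELDS = {"contributor_count"}


@dataclass
class RequestRecord:
    """One handled request. Every attribute name here is an assumption."""
    status: int                 # HTTP status returned to the client
    latency_ms: float           # total time spent serving the request
    namespace: int              # MediaWiki namespace ID (0 = article, 6 = File)
    expand: tuple = ()          # values passed in the 'expand' query parameter
    response_fields: dict = field(default_factory=dict)


def summarize(records):
    """Aggregate a batch of records into the counters a dashboard could chart."""
    null_counts = Counter()
    not_found_by_namespace = Counter()
    expand_usage = Counter()
    slow_requests = 0

    for record in records:
        if record.latency_ms > LATENCY_THRESHOLD_MS:
            slow_requests += 1
        if record.status == 404:
            not_found_by_namespace[record.namespace] += 1
        for name, value in record.response_fields.items():
            # Count nulls only for fields that are expected to have a value.
            if value is None and name not in UNIMPLEMENTED_FIELDS:
                null_counts[name] += 1
        # A request without 'expand' is counted under a hypothetical label.
        for level in record.expand or ("essential_only",):
            expand_usage[level] += 1

    total = len(records)
    null_rates = {
        name: count / total for name, count in null_counts.items()
    } if total else {}

    return {
        "total": total,
        "slow_requests": slow_requests,
        "null_rates": null_rates,
        "not_found_by_namespace": dict(not_found_by_namespace),
        "expand_usage": dict(expand_usage),
    }
```

Grouping 404s by namespace ID keeps image lookup failures (namespace 6) separate from article misses (namespace 0) without inspecting titles. In production these counters would more likely be emitted as metrics at request time than computed from a batch, but the breakdown would be the same.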