
📊 Enable Observability & Monitoring for Image Matching Algorithm
Closed, Declined · Public

Description

Context

We want the Image Recommendation data pipeline to respect system and data quality SLOs. The system we are developing is coupled with data generated by external processes (user interactions, MySQL dumps, analytics data pipelines). While we should strive for proper unit and integration testing to ensure the correctness of our code, there is a category of failure scenarios that will require introspection, instrumentation, and analysis of the system.

Acceptance Criteria

System Metrics
Spark sinks

  • in/out records
  • CPU Usage
  • Memory Usage
  • Executor Counts
  • Runtime and resource utilization for the algorithm
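
To make the Spark-side collection concrete, here is a minimal sketch of enabling Spark's built-in metrics system for the image-matching job. It assumes a Graphite/StatsD-compatible sink; the host, port, and prefix are placeholders, not decided values.

```
# Minimal sketch: route Spark's metrics system (executor counts, JVM
# CPU/memory, task in/out record counts) to an external sink.
# Host, port, and prefix are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("image-matching-metrics-sketch")
    # Send every metric instance (driver, executors) to a Graphite sink.
    .config("spark.metrics.conf.*.sink.graphite.class",
            "org.apache.spark.metrics.sink.GraphiteSink")
    .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.org")
    .config("spark.metrics.conf.*.sink.graphite.port", "2003")
    .config("spark.metrics.conf.*.sink.graphite.prefix", "image_matching")
    # Also expose JVM heap/GC stats per executor.
    .config("spark.metrics.conf.*.source.jvm.class",
            "org.apache.spark.metrics.source.JvmSource")
    .getOrCreate()
)

# Per-task in/out record counts are also visible in the Spark UI and event
# logs; a custom SparkListener could forward them to the same sink if needed.
spark.stop()
```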

Service-level Metrics

  • Response/Request Time
  • Number of records served
  • Errors per minute
  • Average latency
  • API Usage
    • Call volume per second
    • Unique app clients
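
As a sketch of how the service-level numbers could be exposed, the snippet below instruments a request handler with the Prometheus Python client. The metric names, labels, and port are illustrative assumptions; errors per minute, average latency, and call volume per second would then be derived from rate queries over these series.

```
# Sketch: service-level metrics with the Prometheus Python client.
# Metric names, labels, and the /metrics port are placeholders.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("imagerec_requests_total",
                   "Recommendation requests served", ["wiki", "status"])
ERRORS = Counter("imagerec_errors_total",
                 "Errors while serving recommendations")
LATENCY = Histogram("imagerec_request_latency_seconds",
                    "Request/response time in seconds")

def fetch_recommendations(wiki: str) -> list:
    # Placeholder for the real lookup against the recommendation dataset.
    return []

def serve_recommendation(wiki: str) -> list:
    """Handle one API call, recording volume, errors, and latency."""
    start = time.monotonic()
    try:
        records = fetch_recommendations(wiki)
        REQUESTS.labels(wiki=wiki, status="ok").inc()
        return records
    except Exception:
        ERRORS.inc()
        REQUESTS.labels(wiki=wiki, status="error").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    serve_recommendation("enwiki")
```

Counting unique app clients would need an additional label or counter keyed on a client identifier, depending on how API consumers identify themselves.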

Dataset Metrics

  • Total number of records
    • Per Wiki
  • Total number of images per page
    • Per Wiki
  • Summary of population statistics
  • Size and counts of intermediate and final datasets
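
A minimal sketch of how these dataset metrics could be computed per pipeline run is below. It assumes the final dataset is a Parquet table with wiki and page_id columns; the path and column names are placeholders.

```
# Sketch: per-run dataset metrics. Path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("imagerec-dataset-metrics").getOrCreate()
recs = spark.read.parquet("/path/to/image_recommendations")

# Total number of records, overall and per wiki.
total_records = recs.count()
records_per_wiki = recs.groupBy("wiki").count()

# Number of recommended images per page (per wiki), plus summary
# statistics (count/mean/stddev/min/max) over that distribution.
images_per_page = recs.groupBy("wiki", "page_id").count()
population_summary = images_per_page.describe("count")

records_per_wiki.show()
population_summary.show()
print(f"total records: {total_records}")
```

Running the same computation over intermediate outputs and comparing counts between stages would also answer whether records are being dropped along the pipeline (see Open Questions below).
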
Out of Scope
  • Collecting the Android app's feedback (user judgments) and incorporating it back into the algorithm
Open Questions
  • Are we dropping records from one stage of the pipeline to another?
  • Are there unexpected/malformed inputs?
  • Do we see significant changes in input/output compared to previous runs?
  • Who is responsible for owning and maintaining system-level metrics... PET? SRE?
  • How should we collect metrics?
    • StatsD has a Prometheus bridge (e.g. prometheus/statsd_exporter); we should investigate this (see the sketch after this list).
  • CI.
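
A sketch of the StatsD path mentioned above: the application emits plain StatsD metrics over UDP, and a bridge such as prometheus/statsd_exporter translates them into a Prometheus scrape endpoint. The host, port, prefix, and metric names are placeholders.

```
# Sketch: emit StatsD metrics from the pipeline/service; a bridge like
# prometheus/statsd_exporter can expose them to Prometheus.
# Host, port, prefix, and metric names are placeholders.
import statsd

client = statsd.StatsClient("localhost", 8125, prefix="image_matching")

client.incr("records_served")            # counter: records served
client.timing("request_latency_ms", 42)  # timer: request/response time
client.gauge("dataset_rows", 1_250_000)  # gauge: size of the final dataset
```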