
Improve ipoid's grafana dashboard
Closed, Declined (Public)

Description

TSP monitors ipoid to ensure that it's still up, running its daily imports, and serving data. The grafana dashboard contains some useful information toward this end, but it could be more useful if it documented patterns and things to watch out for, similar to MediaModeration's dashboard.

The following information would be useful to add to the dashboard:

  • ipoid serves comparatively little data. Queries occasionally fail on a timeout (visible in logstash as well), but otherwise it is expected to maintain ~99% uptime on queries.
  • ipoid's import is the main point of failure and the memory usage is the fastest health check.
    • the import runs on a cron job that kicks off every 4 hours, and only if no other import container is already running
    • the container has a timeout of a week starting from its initialization
    • a successful import, as seen on the memory usage chart, shows: 1. a sharp peak (as the feed is ingested and processed), 2. a sharp decline (sorting/processing done), 3. a slow decline to no memory usage (as the feed data is imported)
    • if memory usage does not fall back to zero after the spike, something has interrupted the feed (timeout, error, etc.)
    • periods of no memory usage shouldn't last longer than ~12 hours (for now, as it's almost guaranteed that another feed needs to be imported, given how long imports take these days)
    • imports take ~24 hours (an estimate, as all recent imports have been interrupted/restarted) and can be observed through memory usage
    • if the import's memory usage looks abnormal, investigate further with logstash; if logstash yields no valuable data, check the container logs
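
The memory-usage checks in the list above could be sketched as a small script, for example to prototype an alert rule before adding it to the dashboard. This is only an illustration: the function name, the sample format, and the `IDLE_BYTES` and `SPIKE_BYTES` thresholds are assumptions, not values taken from ipoid or its dashboard, and treating "a second sharp rise before memory returns to idle" as a sign of an interrupted or restarted import is one possible reading of the pattern described here.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (time in hours, memory usage in bytes)

MB = 1024 ** 2
IDLE_BYTES = 50 * MB     # assumed: at or below this counts as "no memory usage"
SPIKE_BYTES = 500 * MB   # assumed: a rise this large between samples is a new spike
MAX_IDLE_HOURS = 12.0    # from the list above: idle periods should stay under ~12 hours


def check_memory_series(samples: List[Sample]) -> List[str]:
    """Return warnings for a time-ordered memory usage series (hypothetical helper)."""
    warnings: List[str] = []
    idle_since: Optional[float] = None  # start of the current idle stretch, if any
    spikes_in_run = 0                   # sharp rises since memory last reached idle
    prev_mem: Optional[float] = None

    for t, mem in samples:
        if mem <= IDLE_BYTES:
            # Memory is back at (roughly) zero: the import finished or stopped.
            if idle_since is None:
                idle_since = t
            spikes_in_run = 0
        else:
            if idle_since is not None:
                # An idle stretch just ended; check how long it lasted.
                if t - idle_since > MAX_IDLE_HOURS:
                    warnings.append(
                        f"idle for {t - idle_since:.1f}h starting at t={idle_since:.1f}h"
                    )
                idle_since = None
            if prev_mem is not None and mem - prev_mem >= SPIKE_BYTES:
                spikes_in_run += 1
                if spikes_in_run > 1:
                    # A fresh spike before reaching idle suggests a restart.
                    warnings.append(
                        f"new spike at t={t:.1f}h before memory returned to idle"
                    )
        prev_mem = mem

    # The series may end in the middle of a long idle stretch.
    if idle_since is not None and samples[-1][0] - idle_since > MAX_IDLE_HOURS:
        warnings.append(
            f"idle for {samples[-1][0] - idle_since:.1f}h starting at t={idle_since:.1f}h"
        )
    return warnings
```

An empty result means the series matches the expected shape under these assumed thresholds; any warning is a prompt to look at logstash and then the container logs, as described above.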

Since the feed is our main concern, we should also prioritize that graph, and maybe move it up or make it span more columns.
