
Collect and display basic metrics for all tools (service groups)
Open, Medium, Public

Description

Track basic metrics for each service group:

  • CPU hours used (I think we can get this from qacct)
  • disk space used (du -s; see the sketch below the description)
  • database usage (number of rows in service group db)
  • number of raw hits to tools.wmflabs.org/service-group

Provide aggregate reports and reports per service group with daily granularity.
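For the disk-space item above, something like the following Python sketch could walk the tool home directories once a day and emit CSV rows for later aggregation. It assumes tool homes live under /data/project/<tool>; BASE_DIR and the output format are placeholders, not a settled design.

#!/usr/bin/env python3
"""Rough sketch: record per-tool disk usage from `du -s` as daily CSV rows."""
import csv
import datetime
import pathlib
import subprocess
import sys

BASE_DIR = pathlib.Path("/data/project")  # assumed location of tool home dirs


def disk_usage_kib(path: pathlib.Path) -> int:
    """Return the size of *path* in KiB, as reported by `du -s`."""
    out = subprocess.run(["du", "-s", str(path)],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.split()[0])


def main() -> None:
    today = datetime.date.today().isoformat()
    writer = csv.writer(sys.stdout)
    for tool_dir in sorted(p for p in BASE_DIR.iterdir() if p.is_dir()):
        writer.writerow([today, tool_dir.name, disk_usage_kib(tool_dir)])


if __name__ == "__main__":
    main()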

Event Timeline

Restricted Application added a subscriber: Aklapper.
chasemp triaged this task as Medium priority. · Apr 4 2016, 2:00 PM

One use of this would be proactive monitoring for large databases, like the ones being looked at in T132431: labsdb1001 and labsdb1003 short on available space.

A big hammer method for checking user/tool database sizes:

SELECT
    table_schema
  , sum( data_length ) as data_bytes
  , sum( index_length ) as index_bytes
  , sum( table_rows ) as row_count
  , count(1) as tables
FROM information_schema.TABLES
WHERE table_schema regexp '^[psu][0-9]'
GROUP BY table_schema;

Something could run that once per day (or maybe even once a week) on each distinct database host for labs and log the data in a way that could be used to produce nice timeseries reports.
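A sketch of what that daily job might look like in Python, assuming pymysql and a replica.my.cnf-style credentials file; the host list and output file are placeholders, not the real configuration:

#!/usr/bin/env python3
"""Run the schema-size query on each labs database host once a day and
append the results to a CSV that could feed timeseries reports."""
import csv
import datetime
import pymysql

QUERY = """
SELECT table_schema,
       SUM(data_length), SUM(index_length), SUM(table_rows), COUNT(1)
FROM information_schema.TABLES
WHERE table_schema REGEXP '^[psu][0-9]'
GROUP BY table_schema
"""

HOSTS = ["tools.labsdb", "c1.labsdb", "c3.labsdb"]  # placeholder host list


def main() -> None:
    today = datetime.date.today().isoformat()
    with open("db-usage.csv", "a", newline="") as fh:
        writer = csv.writer(fh)
        for host in HOSTS:
            conn = pymysql.connect(host=host,
                                   read_default_file="~/replica.my.cnf")
            try:
                with conn.cursor() as cur:
                    cur.execute(QUERY)
                    for schema, data_b, idx_b, rows, tables in cur.fetchall():
                        writer.writerow([today, host, schema,
                                         data_b, idx_b, rows, tables])
            finally:
                conn.close()


if __name__ == "__main__":
    main()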

On 2016-04-21 there were 223 distinct user/tool schemas on tools.labsdb. Other hosts have far fewer (c1=58, c3=35).

https://tools.wmflabs.org/tool-db-usage/ now provides point-in-time database usage information.

I wonder if nowadays all this should be ingested by Prometheus and exposed through Grafana dashboards instead of implementing specific tools to show each bit of data.
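If the Prometheus route were taken, a minimal exporter could reuse the same query and publish per-schema gauges for Grafana to graph. This is only a sketch assuming the prometheus_client and pymysql libraries; the metric names, port, and host are made up for illustration.

#!/usr/bin/env python3
"""Minimal sketch of a Prometheus exporter for per-schema database usage."""
import time

import pymysql
from prometheus_client import Gauge, start_http_server

DATA_BYTES = Gauge("tool_db_data_bytes", "Data bytes per schema", ["schema"])
INDEX_BYTES = Gauge("tool_db_index_bytes", "Index bytes per schema", ["schema"])
ROW_COUNT = Gauge("tool_db_row_count", "Approximate rows per schema", ["schema"])

QUERY = """
SELECT table_schema, SUM(data_length), SUM(index_length), SUM(table_rows)
FROM information_schema.TABLES
WHERE table_schema REGEXP '^[psu][0-9]'
GROUP BY table_schema
"""


def collect(host: str = "tools.labsdb") -> None:
    """Run the size query once and update the gauges."""
    conn = pymysql.connect(host=host, read_default_file="~/replica.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for schema, data_b, idx_b, rows in cur.fetchall():
                DATA_BYTES.labels(schema=schema).set(data_b or 0)
                INDEX_BYTES.labels(schema=schema).set(idx_b or 0)
                ROW_COUNT.labels(schema=schema).set(rows or 0)
    finally:
        conn.close()


if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    while True:
        collect()
        time.sleep(3600)     # refresh once an hour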

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag for this task. Thanks!