Maniphest T207040

Graphite1001 disk usage at 96%
Closed, Resolved · Public

Assigned To

Authored By

	fgiunchedi
	Oct 15 2018, 2:57 PM

Description

graphite1004 still needs to be put in service (T196484); in the meantime, graphite1001 has reached 96% disk utilization.

Growth during the last ~20d looks like it has been driven mostly by ores and zuul (metrics created per top-level namespace):

220856 ores
 34543 zuul
  8889 servers
  2943 MediaWiki
  2612 frontend
  2518 webpagetest
  1092 eventstreams
   633 daily
   562 restbase
   523 librenms
   372 varnish
   308 aqs
   286 test_joal
   108 nodepool
    78 mw
    48 parsoid
    46 graphoid
    36 tilerator
    28 labstore
    24 logstash
    18 wikibase
    18 swift
    12 thumbor
    12 mobileapps
    12 changeprop
     6 proton
     6 eventlogging
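
A minimal sketch of how a per-namespace count like the one above could be produced, assuming a list of newly created metric names (dotted Graphite paths) is already available, for example extracted from carbon's creation log. The function names `count_by_prefix` and `format_counts` are hypothetical and not part of any Graphite tooling; this is not necessarily how the numbers above were obtained.

```python
from collections import Counter


def count_by_prefix(metric_names, depth=1):
    """Count metrics by their first `depth` dotted path components."""
    counts = Counter()
    for name in metric_names:
        name = name.strip()
        if not name:
            continue  # skip blank lines in the input
        prefix = ".".join(name.split(".")[:depth])
        counts[prefix] += 1
    return counts


def format_counts(counts):
    """Render counts in `sort | uniq -c | sort -rn` style, largest first."""
    return "\n".join(f"{n:>7} {prefix}" for prefix, n in counts.most_common())
```

Raising `depth` (e.g. `depth=3`) would break a large namespace such as ores down further to show which subtree is growing.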

Related Objects

Mentioned Here: T183454: Deprovision Diamond collectors no longer in use
T196484: rack/setup/install graphite1004

Event Timeline

fgiunchedi created this task. Oct 15 2018, 2:57 PM

Restricted Application added a subscriber: Aklapper. Oct 15 2018, 2:57 PM

ores appears to be capturing worker-specific metrics at ores.<server_name>.uwsgi.worker.<worker id>.(...). The <worker id> field appears to be variable and unpredictable. Depending on the implementation, this could be the source of the ballooning usage, and the resulting metrics would contain sparse data (e.g. if <worker id> is a thread number with a short lifespan). Total in uwsgi alone: ~183,000 metrics.
zuul might be a concern too; it may be worth evaluating how useful its metrics are. Each extension name in zuul.pipeline.(postmerge|gate-and-submit|test).mediawiki.extensions.<extension name> gets 18 metrics and appears to follow files. I can imagine these metrics becoming useless once an extension is removed or renamed. Total: ~39,000 metrics.
There are 637 servers no longer reporting data to servers.<hostname>.(...), and 577 of them do not appear in monitoring. It's possible that the non-reporting nodes that are still in monitoring are explained by T183454. Total: ~151,000 metrics.
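
To check whether <worker id> really is unbounded, one could count the distinct worker ids seen per server. The sketch below assumes only the path layout quoted above (ores.<server_name>.uwsgi.worker.<worker id>.(...)); the function name and the regular expression are illustrative, not existing tooling.

```python
import re
from collections import defaultdict

# Matches ores.<server_name>.uwsgi.worker.<worker id>.<anything>
WORKER_RE = re.compile(r"^ores\.([^.]+)\.uwsgi\.worker\.([^.]+)\.")


def worker_ids_per_server(metric_names):
    """Return {server_name: number of distinct worker ids} for ores uwsgi metrics."""
    ids = defaultdict(set)
    for name in metric_names:
        match = WORKER_RE.match(name)
        if match:
            server, worker_id = match.groups()
            ids[server].add(worker_id)
    return {server: len(workers) for server, workers in ids.items()}
```

A count far above the number of configured uwsgi workers on a host would support the theory that short-lived ids keep creating new, sparse metrics.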

I suggest two things:

Disable the ores uwsgi metrics collection.
Remove the hosts in servers.<hostname> that are dead or no longer reporting metrics.
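
For the second suggestion, a rough way to find candidate hosts is to look at the newest modification time of the whisper files under each servers.<hostname> directory. This is a sketch under assumptions: it relies on whisper's one-.wsp-file-per-metric layout, the directory passed in (the servers/ directory under the whisper storage root) depends on the local carbon configuration, and the 30-day threshold and function names are hypothetical. It only lists candidates and deletes nothing.

```python
import os
import time


def newest_mtime(directory):
    """Return the most recent mtime of any .wsp file under `directory` (0.0 if none)."""
    newest = 0.0
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            if filename.endswith(".wsp"):
                path = os.path.join(root, filename)
                newest = max(newest, os.path.getmtime(path))
    return newest


def stale_hosts(servers_dir, max_age_days=30, now=None):
    """List host directories whose metrics have not been written for `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in sorted(os.listdir(servers_dir)):
        path = os.path.join(servers_dir, entry)
        # A host directory with no .wsp files at all also counts as stale.
        if os.path.isdir(path) and newest_mtime(path) < cutoff:
            stale.append(entry)
    return stale
```

The resulting list would still need to be cross-checked against monitoring before anything is removed, since some non-reporting hosts may be alive and affected by T183454.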

I estimate that at the current rate of utilization the disk will be full in less than 15 days (sooner if there is a run on creates).
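
An estimate like this can be made with a simple linear extrapolation. The sketch below assumes growth stays constant; the function name is hypothetical, and the growth rate is an input to be read off the disk usage graph, since it is not stated in this task.

```python
def days_until_full(used_pct, growth_pct_per_day):
    """Linear estimate of days until the disk reaches 100% utilization."""
    if growth_pct_per_day <= 0:
        return float("inf")  # not growing, so it never fills at this rate
    return max(0.0, (100.0 - used_pct) / growth_pct_per_day)
```

With the disk at 96%, any sustained growth above roughly 0.27 percentage points per day gives less than 15 days; a burst of metric creates raises the rate and shortens the estimate.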

jijiki assigned this task to fgiunchedi. Oct 23 2018, 2:06 PM

jijiki triaged this task as Medium priority.

colewhite moved this task from Inbox to Up next on the observability board. Nov 26 2018, 4:07 PM

Resolving: we're on new graphite hardware now, with more resources.
