
cpjobqueue spamming statsd metrics
Closed, Resolved · Public

Description

Noticed this via the graphite "drops" alert; it looks like cpjobqueue is sending an increasing amount of statsd traffic, clogging everything up:

U 10.64.48.29:51915 -> 10.64.32.155:8125
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:243082|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
cpjobqueue.gc.minor:295770|ms
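
For reference, each line above is a single statsd timer datagram in the plain text protocol (<metric>:<value>|ms); a well-behaved client emits one such sample per GC event, not dozens of identical copies per packet. A minimal sketch of a single manual emission (statsd host/port taken from the capture above; nc flags may differ by variant):

# send one statsd timer sample by hand (illustrative only)
echo -n "cpjobqueue.gc.minor:243082|ms" | nc -u -w1 10.64.32.155 8125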

Event Timeline

fgiunchedi triaged this task as Unbreak Now! priority. Nov 25 2017, 1:37 PM

Mentioned in SAL (#wikimedia-operations) [2017-11-25T13:40:32Z] <godog> drop incoming statsd from scb to graphite1001 temporarily - T181333
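
The exact firewall change isn't recorded in the SAL entry; a temporary drop like this could look roughly as follows (rule placement and the scb source network are assumptions, not what was actually run):

# hypothetical: drop statsd UDP from the scb network on graphite1001 while the senders are fixed
iptables -I INPUT -p udp --dport 8125 -s 10.64.48.0/22 -j DROP
# ...and remove the rule again afterwards
iptables -D INPUT -p udp --dport 8125 -s 10.64.48.0/22 -j DROP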

More spam, from the kafka topic metrics for cpjobqueue; note the repeatedly nested retry_change-prop_retry_change-prop_… prefix in the topic names below (a cleanup sketch follows the listing):

root@graphite1001:/var/log/carbon# ls -la /var/lib/carbon/whisper/kafka/mirror/main-eqiad_to_main-codfw/kafka-mirror/kafka2001_codfw_wmnet_9997/kafka/producer/producer-topic-metrics
total 140
drwxr-xr-x 112 _graphite _graphite 16384 Nov 16 07:42 .
drwxr-xr-x   5 _graphite _graphite    86 Nov  8  2016 ..
drwxr-xr-x   2 _graphite _graphite   136 Jun 29 20:45 eqiad_change-prop_backlinks_resource-change_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  8  2016 eqiad_change-prop_error_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Jun 29 21:07 eqiad_change-prop_retry_change-prop_backlinks_resource-change_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite  4096 Nov  6 15:08 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_flaggedrevs_CacheUpdate_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite  4096 Nov  6 15:08 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_updateBetaFeaturesUserCounts_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite  4096 Nov  6 15:07 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_flaggedrevs_CacheUpdate_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  6 15:06 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_updateBetaFeaturesUserCounts_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  6 15:11 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_flaggedrevs_CacheUpdate_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  6 15:05 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_updateBetaFeaturesUserCounts_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite  4096 Nov  6 15:11 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_flaggedrevs_CacheUpdate_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  6 15:01 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_updateBetaFeaturesUserCounts_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  6 15:07 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_flaggedrevs_CacheUpdate_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite  4096 Nov  6 15:09 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_updateBetaFeaturesUserCounts_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  6 14:58 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_flaggedrevs_CacheUpdate_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite  4096 Nov  6 14:57 eqiad_change-prop_retry_change-prop_retry_change-prop_retry_mediawiki_job_updateBetaFeaturesUserCounts_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  6 14:57 eqiad_change-prop_retry_change-prop_retry_mediawiki_job_flaggedrevs_CacheUpdate_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  6 14:56 eqiad_change-prop_retry_change-prop_retry_mediawiki_job_updateBetaFeaturesUserCounts_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  8  2016 eqiad_change-prop_retry_change-prop_transcludes_continue_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
drwxr-xr-x   2 _graphite _graphite   136 Nov  8  2016 eqiad_change-prop_retry_change-prop_transcludes_resource-change_kafka-mirror-main-eqiad_to_main-codfw-kafka2001-producer
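
Locating the whisper directories created by these recursively prefixed metric names could be done with a find along these lines (a sketch, not what was actually run; the pattern matches names with at least two nested retry_change-prop prefixes, and nothing is deleted until the matches are verified):

cd /var/lib/carbon/whisper/kafka
find . -type d -name '*retry_change-prop_retry_change-prop*' -print
# once the matches are verified as bogus, the same pattern with -exec rm -r {} + could remove them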

Mentioned in SAL (#wikimedia-operations) [2017-11-25T14:10:37Z] <godog> roll-restart cpjobqueue to alleviate metrics leak - T181333
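
The roll-restart itself isn't captured here; a hypothetical sketch of doing it by hand across the scb hosts (host list, unit name and pause are assumptions):

for h in scb100{1..4}.eqiad.wmnet; do
  ssh "$h" 'sudo systemctl restart cpjobqueue'   # unit name assumed
  sleep 30                                       # let each instance settle before the next
done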

Looks like cxserver is having the same problem wrt repeated gc metrics

14:24:39.537511 IP scb1003.eqiad.wmnet.50926 > graphite1001.eqiad.wmnet.8125: UDP, length 1428
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms
cxserver.gc.minor:229441|ms

Mentioned in SAL (#wikimedia-operations) [2017-11-25T14:26:53Z] <godog> restart cxserver on scb100[34] - T181333

Mentioned in SAL (#wikimedia-operations) [2017-11-25T14:34:04Z] <godog> rolling restart of cxserver to alleviate metrics leak - T181333

Statsite and network graphs from graphite1001

2017-11-25-160423_1022x281_scrot.png (281×1,022 px, 62 KB)

2017-11-25-160430_501x290_scrot.png (290×501 px, 29 KB)

Top metric prefixes sent from the scb hosts in eqiad, counted over a one-minute capture per host (an annotated breakdown of the pipeline follows the output):

scb1002:~$ sudo timeout 1m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | sed -e 's/:.*//' | sort | cut -d. -f1 | uniq -dc | sort -rn | head -20

1140470 changeprop
1133078 mobileapps
 222126 cpjobqueue
  16937 ores
   1431 cxserver
    422 graphoid
    408 trendingedits
    192 recommendation-api
    188 citoid
    165 mathoid
    152 eventstreams
    106 service_checker
scb1003:~$ sudo timeout 1m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | sed -e 's/:.*//' | sort | cut -d. -f1 | uniq -dc | sort -rn | head -20
1444927 changeprop
 339528 cpjobqueue
  24190 ores
  15768 mobileapps
   3524 cxserver
    673 graphoid
    376 trendingedits
    225 mathoid
    211 citoid
    207 recommendation-api
    190 eventstreams
    132 service_checker
scb1004:~$ sudo timeout 1m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | sed -e 's/:.*//' | sort | cut -d. -f1 | uniq -dc | sort -rn | head -20
1473690 changeprop
 309392 cpjobqueue
  22288 ores
  13766 mobileapps
   3703 cxserver
    629 graphoid
    389 trendingedits
    219 mathoid
    216 recommendation-api
    213 citoid
    181 eventstreams
    129 service_checker
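
For readability, an annotated breakdown of the capture pipeline used above (comments only; the command itself is unchanged):

# sudo timeout 1m ngrep -q -W byline . udp dst port 8125   -> capture 1 minute of UDP to the statsd port, payload printed line by line
# | grep -v -e '^U ' -e '^$'                               -> drop ngrep's "U src -> dst" header lines and blanks
# | sed -e 's/:.*//'                                       -> strip the value/type suffix, keeping only the metric name
# | sort | cut -d. -f1                                     -> reduce each name to its first dot-separated component (the service prefix)
# | uniq -dc | sort -rn | head -20                         -> count repeated prefixes and list the biggest senders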

Mentioned in SAL (#wikimedia-operations) [2017-11-25T15:37:17Z] <volans> restarted statsd-proxy on graphite1001 (died during investigation) T181333

Mentioned in SAL (#wikimedia-operations) [2017-11-25T15:45:19Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@e35aa05]: Rollback. Disable GC metric reporting T181333

Mentioned in SAL (#wikimedia-operations) [2017-11-25T15:45:50Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@e35aa05]: Rollback. Disable GC metric reporting T181333 (duration: 00m 31s)

Mentioned in SAL (#wikimedia-operations) [2017-11-25T16:05:11Z] <godog> unban statsd traffic from scb on graphite1001 - T181333

fgiunchedi lowered the priority of this task from Unbreak Now! to High. Nov 27 2017, 9:32 AM

This has been fixed temporarily by reverting to a previous version of service-runner; downgrading severity.

A new version of service-runner has been published that only sends the GC metric once per second; that should prevent this from happening.
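
A quick way to spot-check the new rate once the fixed service-runner is deployed (hedged sketch; with one sample per worker per second, a 30-second capture should yield roughly seconds × workers samples, not the hundreds of thousands seen above):

sudo timeout 30 ngrep -q -W byline 'cpjobqueue.gc.minor' udp dst port 8125 | grep -c 'cpjobqueue.gc.minor'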

fgiunchedi claimed this task.

Resolving; we haven't seen adverse effects from service-runner's latest version.