Page MenuHomePhabricator

logster should not resolve statsd's IP every time it sends a metric
Closed, ResolvedPublic

Description

Avoiding continuous DNS resolutions in cachestats considerably decreased the amount of requests sent to our DNS recursors:

Screenshot from 2017-07-21 17-30-32.png (900×1 px, 140 KB)

However, logster is also performing one DNS request every time it sends a metric. We might want to patch it to cache the statsd IP.

Event Timeline

ema triaged this task as Medium priority.Jul 21 2017, 3:36 PM

Logster is run in the root crontab every minute, we pass the statsd hostname:port via:

# Sets up Logster to read from the Varnishkafka instance stats JSON file
# and report metrics to statsd.
varnishkafka::monitor::statsd { 'webrequest':
    graphite_metric_prefix => $graphite_metric_prefix,
    statsd_host_port       => hiera('statsd'),
}

So one solution could be to leverage ipresolve in puppet with a reasonable TTL to pass statsd's ip directly to logster, avoiding the DNS queries almost completely.

Change 367370 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/debs/logster@master] Reduce the DNS queries for the statsd domain

https://gerrit.wikimedia.org/r/367370

Change 367370 merged by Elukey:
[operations/debs/logster@master] Reduce the DNS queries for the statsd domain

https://gerrit.wikimedia.org/r/367370

Change 367382 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/debs/logster@master] Reduce the DNS queries for the statsd domain

https://gerrit.wikimedia.org/r/367382

Change 367382 merged by Elukey:
[operations/debs/logster@master] Reduce the DNS queries for the statsd domain

https://gerrit.wikimedia.org/r/367382

Made a mess in the git repo for logster, tried to amend it as best as I could removing unnecessary cruft to avoid future mistakes. When I merged the first code review, filed for the debian branch, I didn't specific in git review the appropriate remote branch and 'master' was taken. +2, submit damage done. I tried to revert but nothing changed much, master's history was the same as debian. So since there is not much point in my opinion in keeping two branches like this (last change was 2y ago) I deleted the debian branch and removed my last two changes from master's history.

Change 367383 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/debs/logster@master] Update changelog for version 0.11-2~jessie

https://gerrit.wikimedia.org/r/367383

Diff between the current version on cp1008 and the new one:

elukey@copper:~/logster$ debdiff logster_0.0.10-1~jessie1.dsc  /var/cache/pbuilder/result/jessie-amd64/logster_0.0.11-2~jessie1.dsc
dpkg-source: warning: extracting unsigned source package (/home/elukey/logster/logster_0.0.10-1~jessie1.dsc)
dpkg-source: warning: extracting unsigned source package (/var/cache/pbuilder/result/jessie-amd64/logster_0.0.11-2~jessie1.dsc)
diff -Nru logster-0.0.10/debian/changelog logster-0.0.11/debian/changelog
--- logster-0.0.10/debian/changelog     2015-03-09 20:25:06.000000000 +0000
+++ logster-0.0.11/debian/changelog     2017-07-24 11:11:58.000000000 +0000
@@ -1,4 +1,18 @@
-logster (0.0.10-1~jessie1) wikimedia-trusty; urgency=medium
+logster (0.0.11-2~jessie1) jessie-wikimedia; urgency=medium
+
+  * Reduce DNS queries for the statsd domain.
+  * Moved the version numbering to jessie-wikimedia since we don't use
+    precise/trusty anymore.
+
+ -- Luca Toscano <ltoscano@wikimedia.org>  Mon, 24 Jul 2017 15:52:47 +0000
+
+logster (0.0.11-1~trusty1) wikimedia-trusty; urgency=medium
+
+  * Add line count to LineCountLogster
+
+ -- Andrew Otto (WMF) <vagrant@mediawiki-vagrant.dev>  Mon, 30 Mar 2015 15:52:47 +0000
+
+logster (0.0.10-1~trusty1) wikimedia-trusty; urgency=medium

   * Don't send non-numeric metrics to statsd
   * Add debian/logrotate
diff -Nru logster-0.0.10/logster/logster.py logster-0.0.11/logster/logster.py
--- logster-0.0.10/logster/logster.py   2015-03-09 20:02:40.000000000 +0000
+++ logster-0.0.11/logster/logster.py   2017-07-24 11:09:38.000000000 +0000
@@ -272,6 +272,7 @@
 def submit_statsd(metrics, options):
     if (not options.dry_run):
         host = options.statsd_host.split(':')
+        host[0] = socket.gethostbyname(host[0])

     for metric in metrics:
         metric_name = metric.name
diff -Nru logster-0.0.10/logster/parsers/LineCountLogster.py logster-0.0.11/logster/parsers/LineCountLogster.py
--- logster-0.0.10/logster/parsers/LineCountLogster.py  2015-03-09 20:02:40.000000000 +0000
+++ logster-0.0.11/logster/parsers/LineCountLogster.py  2017-07-24 11:09:38.000000000 +0000
@@ -53,6 +53,7 @@

         # Return a list of metrics objects
         return [
-            MetricObject('line_rate',  (float(self.line_count) / float(self.duration)), 'lines per sec', type='float', slope='both'),
+            MetricObject('count',  self.line_count, 'lines on log since last run', type='int'),
+            MetricObject('rate',  (float(self.line_count) / float(self.duration)), 'lines per sec', type='float', slope='both'),
         ]

So overall the diff looks good and it seems that I didn't do much damage with my merge. The only diff that I'd need to check with my team is:

https://github.com/wikimedia/operations-debs-logster/commit/0ca3cae3c937cbeed848c0b14b7b4af7a1b0efaa

Related phab task is https://phabricator.wikimedia.org/T94193 and it seems safe to put in prod, but just wanted to double check first. We use the JsonLogster parser and not the LineCountLogster, it was probably related to a old instance of logster not running anymore.

The history diff looks weird but those are currently the last two entries on cp1008:

logster (0.0.10-1~jessie1) wikimedia-trusty; urgency=medium

  * Don't send non-numeric metrics to statsd
  * Add debian/logrotate

 -- Andrew Otto (WMF) <otto@wikimedia.org>  Mon, 09 Mar 2015 20:03:58 +0000

logster (0.0.9-1~precise1) wikimedia-precise; urgency=low

  * Fix for multiple outputs

 -- Andrew Otto (WMF) <otto@wikimedia.org>  Mon, 29 Dec 2014 16:42:46 +0000

Pull request merged from upstream: https://github.com/etsy/logster/commit/a24249886391a8b885d21c882f6fcaa95e29b015

We could proceed in two ways:

  1. package 0.0.11-1~jessie1 even if upstream moved from 0.0.10 to 1.x. We created 0.0.11 internally to support a small use case (mentioned before) that is probably not used anymore.
  2. package 1.0.1 from upstream, test and deploy it.

After a chat with Moritz and Ema we decided to pick the current jessie version and apply the patch on top of it. In another task I'll check the current fork that we maintain in wikimedia (very very old), see if any followup with upstream is needed and then package 1.x.

In the meantime, this is the new package:

elukey@copper:~/logster/logster-0.0.10$ lintian /var/cache/pbuilder/result/jessie-amd64/logster_0.0.10-2~jessie1_amd64.changes
W: logster source: out-of-date-standards-version 3.9.5 (current is 3.9.6)
W: logster: binary-without-manpage usr/bin/logster
elukey@copper:~/logster/logster-0.0.10$ cd ..
elukey@copper:~/logster$ debdiff logster_0.0.10-1~jessie1.dsc  /var/cache/pbuilder/result/jessie-amd64/logster_0.0.10-2~jessie1.dsc
dpkg-source: warning: extracting unsigned source package (/home/elukey/logster/logster_0.0.10-1~jessie1.dsc)
dpkg-source: warning: extracting unsigned source package (/var/cache/pbuilder/result/jessie-amd64/logster_0.0.10-2~jessie1.dsc)
diff -Nru logster-0.0.10/debian/changelog logster-0.0.10/debian/changelog
--- logster-0.0.10/debian/changelog	2015-03-09 20:25:06.000000000 +0000
+++ logster-0.0.10/debian/changelog	2017-07-26 08:16:18.000000000 +0000
@@ -1,3 +1,10 @@
+logster (0.0.10-2~jessie1) jessie-wikimedia; urgency=medium
+
+  * Reduce DNS lookups for the statsd domain (Bug: #T171318)
+  * Add myself to the Uploaders list
+
+ -- Luca Toscano <ltoscano@wikimedia.org>  Wed, 26 Jul 2017 07:53:58 +0000
+
 logster (0.0.10-1~jessie1) wikimedia-trusty; urgency=medium

   * Don't send non-numeric metrics to statsd
diff -Nru logster-0.0.10/debian/control logster-0.0.10/debian/control
--- logster-0.0.10/debian/control	2015-03-09 20:25:06.000000000 +0000
+++ logster-0.0.10/debian/control	2017-07-26 08:11:33.000000000 +0000
@@ -1,5 +1,6 @@
 Source: logster
 Maintainer: Andrew Otto (WMF) <otto@wikimedia.org>
+Uploaders: Luca Toscano <ltoscano@wikimedia.org>
 Section: python
 Priority: optional
 Build-Depends:  python, python-setuptools, debhelper (>= 9),
diff -Nru logster-0.0.10/debian/patches/reduce_dns_lookups_statsd.diff logster-0.0.10/debian/patches/reduce_dns_lookups_statsd.diff
--- logster-0.0.10/debian/patches/reduce_dns_lookups_statsd.diff	1970-01-01 00:00:00.000000000 +0000
+++ logster-0.0.10/debian/patches/reduce_dns_lookups_statsd.diff	2017-07-26 07:51:41.000000000 +0000
@@ -0,0 +1,10 @@
+--- a/logster/logster.py
++++ b/logster/logster.py
+@@ -272,6 +272,7 @@
+ def submit_statsd(metrics, options):
+     if (not options.dry_run):
+         host = options.statsd_host.split(':')
++        host[0] = socket.gethostbyname(host[0])
+
+     for metric in metrics:
+         metric_name = metric.name
diff -Nru logster-0.0.10/debian/patches/series logster-0.0.10/debian/patches/series
--- logster-0.0.10/debian/patches/series	1970-01-01 00:00:00.000000000 +0000
+++ logster-0.0.10/debian/patches/series	2017-07-26 07:50:26.000000000 +0000
@@ -0,0 +1 @@
+reduce_dns_lookups_statsd.diff

Ideally we upgrade to 1.x in Debian, the version currently in the archive is from 2014 and hasn't been touched since.

New package uploaded to jessie-wikimedia and rolled out to role cache::misc/upload/text/canary.

elukey claimed this task.

Impact to maerlant and acamar's traffic:

Screen Shot 2017-07-26 at 11.14.38 AM.png (745×1 px, 70 KB)

Screen Shot 2017-07-26 at 11.14.51 AM.png (782×1 px, 93 KB)

In eqiad hydrongen and chromium didn't register such a high drop since their normal baseline of UDP traffic is way higher than their friends in other DCs.

Mentioned in SAL (#wikimedia-releng) [2017-07-26T09:24:10Z] <hashar> deployment-cache-upload04 deployment-cache-text04 upgraded logster 0.0.10-1~jessie1 -> 0.0.10-2~jessie1 - T171318

Change 367383 abandoned by Elukey:
Update changelog for version 0.0.11-2~jessie

https://gerrit.wikimedia.org/r/367383