rack/setup/install graphite1004
Closed, Resolved, Public

Description

This task will track the receiving, racking, setup, and installation of graphite1004.

Racking Proposal: This new graphite host should be located in a different rack from any existing graphite hosts at the site. graphite100[1-3] are located in C4, B1, & A3; do not place this host in those racks. Place it in any other 1G rack.

graphite1004:

  • receive in system on procurement task T194862
  • rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • network port setup (description, enable, vlan)
    • end on-site specific steps
  • production dns entries added
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation
  • puppet accept/initial run
  • handoff for service implementation
Cmjohnson updated the task description. Jul 3 2018, 12:48 PM
fgiunchedi moved this task from Backlog to Externally blocked on the monitoring board.

@Cmjohnson what's the status of graphite1004?

@fgiunchedi I am currently working through 14 racking tasks... the CP's are the highest priority. I am not sure where the graphite falls in the priority list, but I am working through them. I am set to do the dbxproxy servers next.

Thanks for the update @Cmjohnson, not particularly urgent, but it would be nice to have graphite1004 before the end of the quarter.

Change 449529 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt/production dns graphite1004

https://gerrit.wikimedia.org/r/449529

Change 449529 merged by Cmjohnson:
[operations/dns@master] Adding mgmt/production dns graphite1004

https://gerrit.wikimedia.org/r/449529

Cmjohnson updated the task description. Jul 31 2018, 8:01 PM

Change 449715 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding graphite1004 mgmt/production dns

https://gerrit.wikimedia.org/r/449715

Cmjohnson updated the task description. Aug 2 2018, 7:19 PM
Cmjohnson reassigned this task from Cmjohnson to RobH.
Cmjohnson moved this task from Racking Tasks to Blocked on the ops-eqiad board.

This server is ready for install; assigning to @RobH for help with installation.

Change 450252 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting graphite1004 install params

https://gerrit.wikimedia.org/r/450252

Change 450252 merged by RobH:
[operations/puppet@production] setting graphite1004 install params

https://gerrit.wikimedia.org/r/450252

RobH reassigned this task from RobH to fgiunchedi. Aug 3 2018, 6:27 PM
RobH removed projects: Patch-For-Review, ops-eqiad.

Please note I set this to role spare, since I wasn't sure whether setting it to any other role might produce logging spam/traffic/alerts to the other graphite hosts. When in doubt, go for the smaller-impact role choice before service implementation.

I've assigned this to @fgiunchedi for service implementation.

RobH updated the task description. Aug 3 2018, 6:27 PM

Thanks @RobH! Yeah, role spare makes sense in this case.

Change 452576 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Assign graphite1004 its role

https://gerrit.wikimedia.org/r/452576

Change 452576 merged by Filippo Giunchedi:
[operations/puppet@production] Assign graphite1004 its role

https://gerrit.wikimedia.org/r/452576

Change 452578 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: fix graphite-manage stretch command

https://gerrit.wikimedia.org/r/452578

Change 452579 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: mirror metric traffic to graphite1004

https://gerrit.wikimedia.org/r/452579

Change 452578 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: fix graphite-manage stretch command

https://gerrit.wikimedia.org/r/452578

Change 452579 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: mirror metric traffic to graphite1004

https://gerrit.wikimedia.org/r/452579

Mentioned in SAL (#wikimedia-operations) [2018-08-13T23:24:33Z] <godog> restart carbon-c-relay on graphite1001 to mirror traffic to graphite1004 - T196484

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Aug 21 2018, 12:56 AM

Mentioned in SAL (#wikimedia-operations) [2018-08-22T16:58:26Z] <godog> start backfilling metrics from graphite1001 into graphite1004 - T196484

Change 454695 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: install dummy carbonate.conf

https://gerrit.wikimedia.org/r/454695

Change 454695 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: install dummy carbonate.conf

https://gerrit.wikimedia.org/r/454695

Change 454872 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Shift carbon/statsd write traffic to graphite1004

https://gerrit.wikimedia.org/r/454872

Change 454874 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] diamond: send metrics to graphite1004

https://gerrit.wikimedia.org/r/454874

Change 454875 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: add graphite1004 to cluster_servers

https://gerrit.wikimedia.org/r/454875

Change 454876 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] varnish: move to graphite1004

https://gerrit.wikimedia.org/r/454876

Change 454877 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: move alerting to graphite1004

https://gerrit.wikimedia.org/r/454877

Change 454878 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] calico: allow statsd traffic to graphite1004

https://gerrit.wikimedia.org/r/454878

Change 454878 merged by Alexandros Kosiaris:
[operations/puppet@production] calico: allow statsd traffic to graphite1004

https://gerrit.wikimedia.org/r/454878

Change 454874 merged by Filippo Giunchedi:
[operations/puppet@production] diamond: send metrics to graphite1004

https://gerrit.wikimedia.org/r/454874

Change 454872 abandoned by Filippo Giunchedi:
Shift carbon/statsd write traffic to graphite1004

Reason:
Superseded

https://gerrit.wikimedia.org/r/454872

Change 455808 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Switch statsd/carbon to graphite1004

https://gerrit.wikimedia.org/r/455808

Change 455808 merged by Filippo Giunchedi:
[operations/dns@master] Switch statsd/carbon to graphite1004

https://gerrit.wikimedia.org/r/455808

Mentioned in SAL (#wikimedia-operations) [2018-08-29T09:04:43Z] <godog> switch statsd and carbon CNAMEs to graphite1004 - T196484

This got reverted because the carbon-c-relay frontend started dropping metrics. Stretch ships a newer version (2.5) than the one we run on jessie (1.11), so we'll need to investigate what changed and whether the even newer version in buster (3.2) might help.

fgiunchedi moved this task from Doing to Up next on the User-fgiunchedi board. Oct 3 2018, 1:16 PM
fgiunchedi moved this task from In progress to Up next on the monitoring board. Oct 15 2018, 2:16 PM

Change 468009 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/dns@master] move statsd cname to graphite1004

https://gerrit.wikimedia.org/r/468009

Change 468009 merged by Cwhite:
[operations/dns@master] move statsd cname to graphite1004

https://gerrit.wikimedia.org/r/468009

Change 468388 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] graphite: add interface::rps settings to graphite hosts

https://gerrit.wikimedia.org/r/468388

Mentioned in SAL (#wikimedia-operations) [2018-10-18T19:11:28Z] <shdubsh> upping ring buffer size on graphite1004 in an attempt to mitigate dropped packets at the interface -- T196484

Mentioned in SAL (#wikimedia-operations) [2018-10-19T07:05:58Z] <godog> bump /proc/sys/net/core/rmem_default temporarily to 1MB and bounce statsd-proxy statsite-instances on graphite1004 - T196484

Mentioned in SAL (#wikimedia-operations) [2018-10-19T07:50:21Z] <godog> bump /proc/sys/net/core/rmem_default temporarily to 2MB and bounce statsd-proxy statsite-instances on graphite1004 - T196484

Setting a 2MB socket receive buffer has helped get errors down to ~0. Unfortunately neither statsd-proxy nor statsite supports setting the SO_RCVBUF socket option via configuration, so I did this to temporarily raise the default buffer to 2MB, restart the daemons so that their new sockets pick it up, and then put the default back:

echo 2097152 > /proc/sys/net/core/rmem_default
systemctl restart statsd-proxy statsite-instances
echo 212992 > /proc/sys/net/core/rmem_default
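
For reference, a daemon that exposed the receive buffer as a setting would typically request it on its own socket with setsockopt instead of relying on rmem_default. A minimal Python sketch of that idea follows. The function name and the 2MB figure are illustrative only, and this is not the statsd-proxy or statsite code. On Linux the kernel caps the request at net.core.rmem_max and reports back double the value it stored, so the sketch reads the effective size back rather than assuming it.

```python
import socket
from typing import Tuple


def open_udp_socket_with_rcvbuf(requested_bytes: int) -> Tuple[socket.socket, int]:
    """Create a UDP socket, request a receive buffer size, and report what the kernel granted.

    Hypothetical helper for illustration; not taken from statsd-proxy or statsite.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask for the buffer explicitly; without this call the socket gets rmem_default.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    # Read the value back: the kernel may cap it (rmem_max) or adjust it.
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, effective


if __name__ == "__main__":
    sock, effective = open_udp_socket_with_rcvbuf(2 * 1024 * 1024)
    print("requested 2097152 bytes, kernel reports", effective)
    sock.close()
```

If the reported value is well below the request, net.core.rmem_max is likely the limit and would need raising as well.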

There's also ~1 packet/s dropped by the NIC, which we'll need to tackle too.
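
One way to tell socket-buffer drops apart from NIC drops on Linux is the Udp RcvbufErrors counter in /proc/net/snmp, which counts datagrams discarded because a socket's receive buffer was full. The Python sketch below parses that counter, assuming the usual layout of a "Udp:" header line followed by a "Udp:" values line. The sample text is made up for illustration, and whether this was the counter being watched on graphite1004 is an assumption.

```python
import os


def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict of ints."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("expected a Udp header line and a Udp values line")
    header, values = udp_lines[0], udp_lines[1]
    # Skip the leading "Udp:" token and pair each counter name with its value.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


# Made-up sample in the same layout, used when /proc/net/snmp is not available.
SAMPLE = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1000 2 5 900 5 0\n"
)

if __name__ == "__main__":
    path = "/proc/net/snmp"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE
    counters = parse_udp_counters(text)
    print("RcvbufErrors:", counters.get("RcvbufErrors"))
```

Sampling the counter twice and dividing the difference by the interval gives a drop rate that can be compared before and after a buffer change.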

Mentioned in SAL (#wikimedia-operations) [2018-10-19T09:37:05Z] <godog> bump /proc/sys/net/core/rmem_default temporarily to 6MB and bounce statsd-proxy statsite-instances on graphite1004 - T196484

Tried 6MB per thread now: we're ingesting about 30MB/s of UDP traffic, so with 4 statsd-proxy threads each one should be able to buffer its share of the bandwidth (7.5MB/s) for a bit under 1s (6MB / 7.5MB/s = 0.8s).
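
The sizing above is simple enough to check by hand, but a small Python sketch makes it easy to try other values. It assumes traffic is spread evenly across threads, which is a simplification, and the function name is made up for illustration.

```python
def buffer_seconds(total_mb_per_s, threads, buffer_mb):
    """Seconds of traffic one thread's buffer can absorb, assuming an even split."""
    per_thread_mb_per_s = total_mb_per_s / threads
    return buffer_mb / per_thread_mb_per_s


if __name__ == "__main__":
    # Figures from the comment above: 30MB/s total, 4 threads, 6MB buffer each.
    print(buffer_seconds(30, 4, 6))
    # The earlier 2MB setting, for comparison.
    print(buffer_seconds(30, 4, 2))
```

With the figures from this task the 6MB buffer covers 0.8s of a thread's share, while the earlier 2MB setting covers roughly 0.27s.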

fgiunchedi moved this task from Up next to Doing on the User-fgiunchedi board. Oct 19 2018, 3:13 PM
fgiunchedi moved this task from Up next to In progress on the monitoring board. Oct 29 2018, 1:59 PM

Change 470410 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/dns@master] update graphite-in to use graphite1004

https://gerrit.wikimedia.org/r/470410

Change 470512 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/debs/statsd-proxy@wmf_v0.0.10] add socket_bufsize option to make SO_RCVBUF tunable

https://gerrit.wikimedia.org/r/470512

Change 470557 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] lower TTL for graphite CNAMEs before failover

https://gerrit.wikimedia.org/r/470557

Change 470557 merged by Filippo Giunchedi:
[operations/dns@master] lower TTL for graphite CNAMEs before failover

https://gerrit.wikimedia.org/r/470557

Change 470410 merged by Cwhite:
[operations/dns@master] update graphite-in to use graphite1004

https://gerrit.wikimedia.org/r/470410

Mentioned in SAL (#wikimedia-operations) [2018-10-30T16:27:11Z] <shdubsh> updated graphite-in cname to graphite1004 - T196484

Change 470659 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] graphite: add queue_depth and batch_size options to carbon-c-relay

https://gerrit.wikimedia.org/r/470659

Change 470659 merged by Cwhite:
[operations/puppet@production] graphite: add queue_depth and batch_size options to carbon-c-relay

https://gerrit.wikimedia.org/r/470659

Change 454877 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: move alerting to graphite1004

https://gerrit.wikimedia.org/r/454877

Change 468388 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: add interface::rps settings to graphite hosts

https://gerrit.wikimedia.org/r/468388

Change 454875 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: add graphite1004 to cluster_servers

https://gerrit.wikimedia.org/r/454875

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:44:42Z] <godog> pool graphite1004 for reads - T196484

Change 471965 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: add graphite1004 to cluster_servers

https://gerrit.wikimedia.org/r/471965

Change 471965 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: add graphite1004 to cluster_servers

https://gerrit.wikimedia.org/r/471965

Change 454876 merged by Filippo Giunchedi:
[operations/puppet@production] Move read traffic to graphite1004

https://gerrit.wikimedia.org/r/454876

Change 471986 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: remove old graphite hardware from graphite-web cluster

https://gerrit.wikimedia.org/r/471986

Change 471987 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: remove old graphite hardware from receiving metrics

https://gerrit.wikimedia.org/r/471987

Change 471988 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Use graphite2003 when codfw graphite is enabled

https://gerrit.wikimedia.org/r/471988

Change 471988 merged by Filippo Giunchedi:
[operations/puppet@production] Use graphite2003 when codfw graphite is enabled

https://gerrit.wikimedia.org/r/471988

Change 471994 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] role: alias static path on graphite

https://gerrit.wikimedia.org/r/471994

Change 471994 merged by Cwhite:
[operations/puppet@production] role: alias static path on graphite

https://gerrit.wikimedia.org/r/471994

Change 471986 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: remove old graphite hardware from graphite-web cluster

https://gerrit.wikimedia.org/r/471986

Change 472002 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: backup graphite/django sqlite database

https://gerrit.wikimedia.org/r/472002

Change 472002 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: backup graphite/django sqlite database

https://gerrit.wikimedia.org/r/472002

Change 470512 merged by Cwhite:
[operations/debs/statsd-proxy@wmf_v0.0.10] add socket_bufsize option to make SO_RCVBUF tunable

https://gerrit.wikimedia.org/r/470512

Change 472489 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] statsd_proxy: add socket_receive_bufsize parameter

https://gerrit.wikimedia.org/r/472489

Change 472489 merged by Cwhite:
[operations/puppet@production] statsd_proxy: add socket_receive_bufsize parameter

https://gerrit.wikimedia.org/r/472489

Change 471987 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: remove old graphite hardware from receiving metrics

https://gerrit.wikimedia.org/r/471987

fgiunchedi closed this task as Resolved. Nov 13 2018, 1:24 PM

graphite1004 is fully in service; the follow-up for decom is T209357: Return graphite100[13] to spares pool (or decom)

fgiunchedi updated the task description. Nov 13 2018, 1:24 PM