Move coal from graphite#001 nodes to webperf#001
Closed, Resolved, Public

Description

At the moment coal runs on the graphite machines, though this is not strictly needed; it could run on a dedicated machine (bare metal or VM).

  • Weed out coal puppet module dependencies from graphite module
  • Not done: Backup /var/lib/coal (or /var/lib/carbon/whisper) in bacula.
  • Migrate coal's retention config to graphite in puppet. – https://gerrit.wikimedia.org/r/427945
  • Update coal to write to graphite1001 via its line protocol instead of to disk directly. – https://gerrit.wikimedia.org/r/427664
  • Move coal processor from graphite1001/graphite2001 to webperf1001/webperf2001 - https://gerrit.wikimedia.org/r/#/c/429252
  • Move coal web from graphite1001/graphite2001 to webperfx001
  • Remove coal-web from graphitex001

Event Timeline

Ottomata triaged this task as Medium priority.
Krinkle changed the task status from Open to Stalled. (Aug 3 2017, 4:04 AM)
Krinkle removed Krinkle as the assignee of this task.
Krinkle lowered the priority of this task from Medium to Low.
Krinkle subscribed.

The current thinking at T158837 might result in the removal of the coal and coal-web services. Shelving this task for now.

Change 427664 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/coal@master] coal: convert to using graphite instead of writing to whisper

https://gerrit.wikimedia.org/r/427664

@fgiunchedi mentioned in #wikimedia-perf that he thought he remembered there being a reason why submitting metrics via graphite wouldn't work.

First, here's a demonstration that it does. Red lines are several different coal metrics, still being written directly to whisper files by coal, just as they have been. Blue lines are those same metrics, being written by a second coal process to a separate key, via graphite (coal2.* instead of coal.*):

image.png (1×2 px, 391 KB)

As you can see, they line up perfectly. Manual examination of the whisper files shows that the same values are being written by both processes, at the same timestamps.

This works because of two changes that were made during the recent refactors:

  1. We only commit an offset in Kafka once all five 5-minute periods that include that offset have been processed and submitted.
  2. We align data boundaries to the 0-second mark (e.g., each data boundary has a timestamp such that ts % 60 == 0).
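
A minimal sketch of the two properties above, not coal's actual code. It assumes that the "5-minute periods" are sliding windows ending on each minute boundary, so that any event falls into exactly five of them. The names `align`, `windows_containing` and `can_commit` are hypothetical.

```python
WINDOW_SPAN = 300  # assumed window length: 5 minutes, in seconds
STEP = 60          # data boundaries fall where ts % 60 == 0


def align(ts: int, step: int = STEP) -> int:
    """Round a timestamp down to the nearest boundary (ts % step == 0)."""
    return ts - (ts % step)


def windows_containing(ts: int, span: int = WINDOW_SPAN, step: int = STEP) -> list:
    """End boundaries of every window [end - span, end) that includes ts."""
    first_end = align(ts, step) + step  # first boundary strictly after ts
    return [first_end + i * step for i in range(span // step)]


def can_commit(ts: int, submitted: set) -> bool:
    """An offset is safe to commit only once every window
    containing its event has been processed and submitted."""
    return all(end in submitted for end in windows_containing(ts))


if __name__ == "__main__":
    print(windows_containing(125))
    print(can_commit(125, {180, 240, 300, 360}))
    print(can_commit(125, {180, 240, 300, 360, 420}))
```

Under these assumptions, a restart replays only events whose windows were not all submitted, and the recomputed values land on the same boundaries as before.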

Neither of these was previously true. ZeroMQ, the message broker that was being used, doesn't retain data. And the prior incarnation of coal aligned its data boundaries based on startup time (and restarted several dozen times a day due to various ZeroMQ timeouts).

Carbon, meanwhile, does align its data boundaries to the same %60 that we're using at the moment, which means we're submitting data with a timestamp that matches the timestamp that carbon is going to write. (This was not true previously, when the timestamp would have been adjusted by carbon to the appropriate boundary.) Additionally, carbon will write the last value received for any given period, and the prior version of coal could have sent multiple, different values for a given metric within the same window (especially if multiple instances were running). This is no longer the case.
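
To illustrate the point about timestamps and last-value-wins, here is a hedged sketch rather than the code from the patch. It formats a data point in Carbon's plaintext line protocol (`<metric path> <value> <timestamp>\n`) with a pre-aligned timestamp, and models how several values sent within one period collapse to the last one. The metric name `coal2.example` and the helper names are hypothetical. Port 2003 is Carbon's default plaintext port, and the host is left as a parameter.

```python
import socket

STEP = 60  # seconds; boundaries satisfy ts % 60 == 0


def carbon_line(metric: str, value, ts: int, step: int = STEP) -> str:
    """Format one data point for Carbon's plaintext protocol,
    with the timestamp rounded down to the period boundary."""
    aligned = ts - (ts % step)
    return f"{metric} {value} {aligned}\n"


def last_write_wins(points, step: int = STEP) -> dict:
    """Model of what ends up stored: one value per period,
    the last one received for that period."""
    stored = {}
    for ts, value in points:
        stored[ts - (ts % step)] = value
    return stored


def send_lines(host: str, lines, port: int = 2003) -> None:
    """Send pre-formatted lines to a Carbon plaintext listener (not called here)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


if __name__ == "__main__":
    print(carbon_line("coal2.example", 1.5, 125), end="")
    # Two senders disagreeing within one period: only the later value survives.
    print(last_write_wins([(121, 1), (150, 2), (185, 3)]))
```

If each period receives exactly one value, already stamped with the boundary timestamp, then what is sent is what gets stored. This is consistent with the matching whisper values described above.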

Thanks @Imarlier for the explanation and insight! Makes sense to me. The other thing I suggest checking is coal's whisper file aggregation/retention periods vs graphite's defaults (modules/role/manifests/graphite/base.pp), to see if they need adjusting when new coal whisper files are created.

Change 427664 merged by jenkins-bot:
[performance/coal@master] coal: convert to using graphite instead of writing to whisper

https://gerrit.wikimedia.org/r/427664

Change 427958 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/coal@master] coal: incorporate etcd to test for master datacenter

https://gerrit.wikimedia.org/r/427958

Krinkle renamed this task from "Move coal from graphite machine(s)" to "Move coal from graphite#001 nodes to webperf#001". (Apr 24 2018, 4:16 PM)
Krinkle raised the priority of this task from Low to Medium.
Krinkle updated the task description.

Change 427958 merged by jenkins-bot:
[performance/coal@master] coal: incorporate etcd to test for master datacenter

https://gerrit.wikimedia.org/r/427958

@Imarlier I landed it as-is.

Never mind about using the /etc/wikimedia-cluster file (puppet, usage). Passing it as a parameter from Puppet is probably as good as or better than using that file. They both come from the same source, and other services use the ::realm variable for this purpose as well. In Git, the only use of the /etc/ file is actually MediaWiki PHP, which is that way because it can't easily get values from Puppet. But for pretty much anything else, the parameter works well.

Change 430601 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/coal@master] coal: deploy to webperf machines as well

https://gerrit.wikimedia.org/r/430601

Change 430601 merged by jenkins-bot:
[performance/coal@master] coal: deploy to webperf machines as well

https://gerrit.wikimedia.org/r/430601

Change 430605 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/coal@master] coal: scap config is new to me...

https://gerrit.wikimedia.org/r/430605

Change 430605 merged by jenkins-bot:
[performance/coal@master] coal: scap config is new to me...

https://gerrit.wikimedia.org/r/430605

Change 430625 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/coal@master] coal_web: Use the graphite API to fetch data instead of whisper

https://gerrit.wikimedia.org/r/430625

Change 430625 merged by jenkins-bot:
[performance/coal@master] coal_web: Use the graphite API to fetch data instead of whisper

https://gerrit.wikimedia.org/r/430625

Change 431583 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] coal: require requests module; deploy to webperf

https://gerrit.wikimedia.org/r/431583

Ops - would appreciate a merge on the patch above ^

Change 431583 merged by Alexandros Kosiaris:
[operations/puppet@production] coal: require requests module; deploy to webperf

https://gerrit.wikimedia.org/r/431583

Change 431615 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] coal: require requests module; deploy to webperf

https://gerrit.wikimedia.org/r/431615

Change 431615 merged by Dzahn:
[operations/puppet@production] coal: require requests module; deploy to webperf

https://gerrit.wikimedia.org/r/431615

Change 431636 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] performance::site: require libapache2-mod-uwsgi

https://gerrit.wikimedia.org/r/431636

Change 431636 merged by Dzahn:
[operations/puppet@production] performance::site: require libapache2-mod-uwsgi

https://gerrit.wikimedia.org/r/431636

Change 431638 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] performance::site: load apache mod proxy

https://gerrit.wikimedia.org/r/431638

Change 431638 merged by Dzahn:
[operations/puppet@production] performance::site: load apache mod proxy

https://gerrit.wikimedia.org/r/431638

Change 431644 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] coal-web: needs proxy_http module as well

https://gerrit.wikimedia.org/r/431644

Change 431644 merged by Dzahn:
[operations/puppet@production] coal-web: needs proxy_http module as well

https://gerrit.wikimedia.org/r/431644

Change 431659 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] performance.wikimedia.org: serve from webperfX001

https://gerrit.wikimedia.org/r/431659

Change 431659 merged by Dzahn:
[operations/puppet@production] performance.wikimedia.org: serve from webperfX001

https://gerrit.wikimedia.org/r/431659

Change 431779 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] performance website: allow traffic

https://gerrit.wikimedia.org/r/431779

Change 431779 merged by Dzahn:
[operations/puppet@production] performance website: allow traffic

https://gerrit.wikimedia.org/r/431779

Change 431792 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] performance website: remove from graphite hosts

https://gerrit.wikimedia.org/r/431792

Change 431792 merged by Dzahn:
[operations/puppet@production] performance website: remove from graphite hosts

https://gerrit.wikimedia.org/r/431792

Change 432095 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/coal@master] coal: don't deploy to graphite

https://gerrit.wikimedia.org/r/432095

Change 432095 merged by jenkins-bot:
[performance/coal@master] coal: don't deploy to graphite

https://gerrit.wikimedia.org/r/432095

Also ticking off the "backup files in bacula" checkbox, because we now use the regular carbon storage, which at some point between 2015 and now has been added to the backup process (despite being considerably larger than the subset of coal metrics).

Resource: puppet:/modules/profile/manifests/backup/director.pp#L145

Carbon files are not backed up in bacula (too many files). The part in puppet is only the fileset declaration; that fileset is never actually applied to the graphite role with backup::set, and thus is never backed up.

@fgiunchedi Ah, I see now. b44b1213518cf5 added it, but acc4a831bc623 left the declaration unused.