
Toolforge: prometheus: refresh setup
Closed, Resolved, Public

Description

The Prometheus setup for Toolforge may need a refresh. The most obvious issue is role::prometheus::tools, which is very old and uses a deprecated layout (it needs refactoring).

  • puppet refactor into a modern layout
  • buster support
  • drop old jessie instances

Event Timeline

aborrero created this task. Nov 12 2019, 3:42 PM

Change 550506 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: refactor prometheus role

https://gerrit.wikimedia.org/r/550506

Change 550506 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: refactor prometheus role

https://gerrit.wikimedia.org/r/550506

The next step would be to have the VMs rebuilt on Debian Buster rather than Debian Jessie.

aborrero triaged this task as Medium priority. Nov 22 2019, 10:14 AM

Will take care of this "soon".

aborrero moved this task from Soon! to Doing on the cloud-services-team (Kanban) board.
aborrero updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2020-01-30T10:20:14Z] <arturo> create new VM instance tools-prometheus-03 (T238096)

Change 568953 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: wmcs_scripts: refresh package requirements

https://gerrit.wikimedia.org/r/568953

Change 568953 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: wmcs_scripts: refresh package requirements

https://gerrit.wikimedia.org/r/568953

Mentioned in SAL (#wikimedia-operations) [2020-01-30T12:22:08Z] <arturo> add prometheus 2.7.1+ds-3+k8s+buster to buster-wikimedia T238096 (basically a rebuild from stretch)

Mentioned in SAL (#wikimedia-cloud) [2020-01-30T12:57:54Z] <arturo> created domain tools.wmcloud.org in the tools project after some back and forth with designated, permissions and the database. I plan to use this domain to test the new Debian Buster-based prometheus setup (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-01-30T12:59:46Z] <arturo> associated floating IPv4 185.15.56.60 to tools-prometheus-03 (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-01-30T13:09:38Z] <arturo> created FQDN prometheus.tools.wmcloud.org pointing to IPv4 185.15.56.60 (tools-prometheus-03) to test T238096

Mentioned in SAL (#wikimedia-cloud) [2020-01-30T13:14:59Z] <arturo> drop floating IP 185.15.56.60 and FQDN prometheus.tools.wmcloud.org because this is not how the prometheus setup is right now. Use a web proxy instead tools-prometheus-new.wmflabs.org (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-01-30T13:42:06Z] <arturo> disable puppet in prometheus servers while syncing metric data (T238096)

Change 569019 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-labs-targets: use python-keystoneauth1 for sessions

https://gerrit.wikimedia.org/r/569019

Change 569021 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: wmcs_scripts: drop package requirements

https://gerrit.wikimedia.org/r/569021

Change 569019 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-labs-targets: use python-keystoneauth1 for sessions

https://gerrit.wikimedia.org/r/569019

Change 569021 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: wmcs_scripts: drop package requirements

https://gerrit.wikimedia.org/r/569021

Mentioned in SAL (#wikimedia-cloud) [2020-01-30T16:25:29Z] <arturo> point tools-prometheus.wmflabs.org proxy to tools-prometheus-03 (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-01-30T16:27:54Z] <arturo> create VM tools-prometheus-04 as cold standby of tools-prometheus-03 (T238096)

Does this include updates to the toolforge prometheus instance? I wanted to funnel PAWS metrics there but it was such an old version when I tried it I gave up on getting it to work.

aborrero updated the task description. Jan 31 2020, 12:11 PM

Does this include updates to the toolforge prometheus instance? I wanted to funnel PAWS metrics there but it was such an old version when I tried it I gave up on getting it to work.

Indeed. But the prometheus software itself is not changing version at the moment.

I think there is a certain amount of support in ops/puppet.git for running your own prometheus setup in the paws project, like we do in tools.

Chicocvenancio added a comment. Edited Jan 31 2020, 1:17 PM

Yeah, my plan then was to actually go the full k8s route with prometheus-operator and such. I have not gotten around to implementing it yet.

Mentioned in SAL (#wikimedia-cloud) [2020-01-31T14:00:19Z] <arturo> syncing again prometheus data from tools-prometheus-01 to tools-prometheus-0{3,4} due to some inconsistencies preventing prometheus from starting (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-01-31T14:05:58Z] <arturo> leave tools-prometheus-01 as the backend for tools-prometheus.wmflabs.org for the weekend so grafana dashboards keep working (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-02-03T09:38:06Z] <arturo> tools-prometheus-01: systemctl stop prometheus@tools. Another try to migrate data to tools-prometheus-{03,04} (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-02-03T12:48:19Z] <arturo> shutdown tools-prometheus-01 and tools-prometheus-02, after fixing the proxy tools-prometheus.wmflabs.org to tools-prometheus-03, data synced (T238096)

Bstorm added a comment. Feb 3 2020, 9:39 PM

FYI @aborrero, the current instances are generating alerts because the /srv/ drives are filling up with the current settings and setup.

Right now that's looking like 90% full on 03 and 97% full on 04. It's using a huge amount of space already, so unless the retention is configured way off, we may need larger disks for these with all the new metrics we send. Just pinging you because they are noisy at the moment. We figured you are probably already aware, but just in case.
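
For reference, the trade-off between retention and disk size mentioned above can be roughed out with the capacity-planning rule of thumb from the Prometheus storage documentation: needed disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes). The sketch below is only an illustration under that assumption. The function names, the ingestion rate and the 80% headroom are hypothetical and were not measured on the Toolforge servers.

```python
# Rough Prometheus disk sizing, based on the rule of thumb:
#   needed_disk = retention_seconds * samples_per_second * bytes_per_sample
# All concrete numbers below are hypothetical, not taken from this task.

SECONDS_PER_DAY = 24 * 60 * 60


def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate the disk space (bytes) needed to keep `retention_days` of data."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def max_retention_days(disk_bytes, samples_per_second, bytes_per_sample=2.0, headroom=0.8):
    """Estimate how many days of data fit if only `headroom` of the disk is used."""
    usable_bytes = disk_bytes * headroom
    return usable_bytes / (samples_per_second * bytes_per_sample * SECONDS_PER_DAY)


if __name__ == "__main__":
    rate = 50_000  # hypothetical ingestion rate, in samples per second
    disk = 300 * 10**9  # a 300GB volume, expressed in bytes

    needed_gb = estimate_disk_bytes(30, rate) / 10**9
    print(f"30 days at {rate} samples/s needs about {needed_gb:.0f} GB")
    print(f"A 300GB disk holds about {max_retention_days(disk, rate):.1f} days")
```

The actual ingestion rate would have to be read from the running server before drawing any conclusion about whether a bigger disk or a shorter retention is the better fix.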

Mentioned in SAL (#wikimedia-cloud) [2020-02-04T11:37:41Z] <arturo> re-create tools-prometheus-03/04 as 'bigdisk2' instances (300GB) T238096

Mentioned in SAL (#wikimedia-cloud) [2020-02-04T11:38:05Z] <arturo> start again tools-prometheus-01 again to sync data to the new tools-prometheus-03/04 VMs (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-02-05T11:22:31Z] <arturo> restarting ferm fleet-wide to account for prometheus servers changed IP (but same hostname) (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-02-06T10:27:08Z] <arturo> shutdown again tools-prometheus-01, no longer in use (T238096)

Mentioned in SAL (#wikimedia-cloud) [2020-02-07T10:55:49Z] <arturo> drop jessie VM instances tools-prometheus-{01,02} which were shutdown (T238096)

aborrero closed this task as Resolved. Feb 7 2020, 10:56 AM

Work here is done. Please reopen if required.