Serve >= 50% of production Prometheus systems with Prometheus v2
Open, Normal, Public

Description

Prometheus 2.x was released some time ago, and we should migrate to it for its performance and storage benefits:
https://prometheus.io/blog/2017/11/08/announcing-prometheus-2-0/

Migration-wise a few things have changed, in particular the on-disk format. There's a migration guide here: https://prometheus.io/docs/prometheus/2.0/migration/. However, we'll try to convert v1 storage into v2 with
https://gitlab.com/gitlab-org/prometheus-storage-migrator as suggested by @colewhite.

Proposed plan of attack:

  • Build a Prometheus 2 Debian package with k8s support (we currently re-build the stock Debian package with k8s support added back).
  • Add the relevant puppetization to be able to use 2.x instead of 1.x on a given host
  • Build internally a prometheus-storage-migrator Debian package and upload to stretch-wikimedia
  • Test the conversion in beta first
    • Setup another deployment-prometheus instance with Prometheus 2 (deployment-prometheus02)
    • Copy storage from old instance to new, and convert with storage migrator
    • Verify 2.x works as expected (e.g. metrics are preserved from v1, new metrics are being ingested, etc)
  • Convert production Prometheus instances
    • For sites with pairs of Prometheus hosts we can take one host out of rotation and perform the migration there
    • Once migration is done verify queries work as expected and put the host back in service
    • For PoPs (single Prometheus host) we'll have to find strategies to minimize downtime

All of the above assumes the storage migrator works as expected (e.g. doesn't run out of memory). If that fails:

  • Setup 2.x to read from 1.x on the same host for missing data
  • Flip traffic to redirect queries to 2.x instead of 1.x (modulo features removed from the query language; I don't think what got removed is widely used in our environment)
  • Once the retention period has passed and/or enough data has accumulated in 2.x, remove 1.x instances

Migration checklist for codfw/eqiad hosts

  • Depool host and stop puppet
  • Take a LV snapshot for all instances and mount it
  • rsync snapshotted data to graphite2001 (spare host, data migration will happen there)
  • Reimage prometheus host with stretch
  • Set prometheus::server::prometheus_v2 flag in hiera for prometheus host
  • Install prometheus 2.7.1 package on prometheus host (forcing block duration to 2h with --storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h temporarily)
  • Start puppet
  • Validate metrics are being collected, v2 storage will start empty
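The final validation step above can be sketched against the Prometheus HTTP API, which serves instant queries at `/api/v1/query` and returns a JSON document with `status` and `data.result`. This is a minimal sketch rather than the procedure actually used here: the base URL in the usage comment is hypothetical, and treating `up == 1` as "being collected" is an assumption.

```python
import json
import urllib.parse
import urllib.request


def fetch_instant_query(base_url, expr, timeout=10):
    """Run an instant query against the Prometheus HTTP API and return the decoded JSON."""
    query = urllib.parse.urlencode({"query": expr})
    url = base_url.rstrip("/") + "/api/v1/query?" + query
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_up(response):
    """Count targets reporting up == 1 and list the instances that do not."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    up_count, down = 0, []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # instant vector sample: [ts, "value"]
        if float(value) == 1.0:
            up_count += 1
        else:
            down.append(series["metric"].get("instance", "<unknown>"))
    return up_count, sorted(down)


# Hypothetical usage; the URL below is an assumption, not taken from this task:
# up_count, down = summarize_up(fetch_instant_query("http://localhost:9090", "up"))
```

Since v2 storage starts empty, a non-zero count of `up` targets shortly after the first puppet run is a quick signal that scraping works, before looking at individual dashboards.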

Migration / backfill procedure

  • Start prometheus-storage-migrator on rsync'd data
  • Once migration has finished, rsync data back to prometheus host
  • Confirm no overlapping block directories are present between migrated data and new data
  • Stop puppet on prometheus host
  • Stop prometheus
  • Move migrated data into prometheus storage directory
  • Remove --storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h from prometheus server flags
  • Start prometheus
  • Start puppet
  • Confirm historical metrics are present and new metrics are collected
  • Repool prometheus host
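The overlap check in the procedure above can be sketched by reading each block's `meta.json`, which in the Prometheus v2 storage format records the block's `minTime` and `maxTime` in milliseconds. This is a sketch under assumptions, not the check actually run here: the directory paths in the usage comment are hypothetical, and block ranges are treated as half-open intervals.

```python
import json
import os


def read_blocks(data_dir):
    """Return (name, minTime, maxTime) for every block directory under data_dir."""
    blocks = []
    for name in sorted(os.listdir(data_dir)):
        meta_path = os.path.join(data_dir, name, "meta.json")
        if not os.path.isfile(meta_path):
            continue  # skip wal/, lock files, and anything else that is not a block
        with open(meta_path) as f:
            meta = json.load(f)
        blocks.append((name, meta["minTime"], meta["maxTime"]))
    return blocks


def find_overlaps(migrated, live):
    """Pair up blocks whose [minTime, maxTime) ranges intersect."""
    overlaps = []
    for m_name, m_min, m_max in migrated:
        for l_name, l_min, l_max in live:
            if m_min < l_max and l_min < m_max:
                overlaps.append((m_name, l_name))
    return overlaps


# Hypothetical usage; both paths are assumptions:
# overlaps = find_overlaps(read_blocks("/srv/migrated/ops"), read_blocks("/srv/prometheus/ops/metrics"))
```

An empty result means the migrated blocks end before the first block written by the new instance begins, so the migrated directories can be moved into place without two blocks covering the same time range.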

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 22 2018, 10:16 AM
MoritzMuehlenhoff triaged this task as Normal priority. · Feb 28 2018, 11:02 AM
fgiunchedi updated the task description. · Jan 14 2019, 3:14 PM
fgiunchedi renamed this task from Upgrade to Prometheus 2.x to Serve >= 50% of production Prometheus systems with Prometheus v2.
fgiunchedi updated the task description. · Jan 17 2019, 4:30 PM

Converting beta prometheus worked well (sans puppetization), and I'll be testing a conversion on production data from codfw on graphite2001 (spare host).

fgiunchedi updated the task description. · Jan 17 2019, 4:37 PM

There was an open question re: having the Prometheus 2 package co-installable with Prometheus 1. I think it is simpler to keep the package name the same, so that dpkg treats it as a package upgrade. This means we have to convert all Prometheus instances on a given host at the same time, though I believe that's acceptable.

For a single-host upgrade I was thinking something like this (assuming storage conversion works):

  • Convert Prometheus data to v2 (TODO estimate how long this takes, since it'll be roughly the time datapoints will be missing from v2)
  • Upgrade Prometheus package
  • Switch Puppet to use Prometheus 2 for that host (config syntax and command line flags have changed)
fgiunchedi updated the task description. · Jan 17 2019, 4:53 PM

Mentioned in SAL (#wikimedia-operations) [2019-01-18T08:12:57Z] <godog> depool and take snapshots of prometheus data on prometheus2003 to test v2 conversion - T187987

Change 486051 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP prometheus: add feature flag for v2 compat

https://gerrit.wikimedia.org/r/486051

Change 486251 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: set retention period for v2 compatibility

https://gerrit.wikimedia.org/r/486251

Change 486251 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: set retention period for v2 compatibility

https://gerrit.wikimedia.org/r/486251

I ran a test conversion on graphite2001 using prometheus-storage-migrator with parallelism 10, on a snapshot of data taken from prometheus2003:

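| Instance  | Retention | # metrics | v1 data size | v2 data size | Conversion time |
| analytics | 4032h     | 650       | 72MB         | 668MB        | 5m              |
| k8s       | 4032h     | 24109     | 26.5GB       | 18GB         | 9.5h            |
| global    | 10920h    | 115503    | 173GB        | 152GB        | 18h             |
| ops       | 2190h     | 144804    | 1254GB       | 384GB        | 73h             |
| services  | 4032h     | 47841     | 360GB        | 177GB        | 67h             |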

Change 486051 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add feature flag for v2 compat

https://gerrit.wikimedia.org/r/486051

As discussed on IRC: Let's upgrade to 2.7.1 next week as that fixes a security issue (CVE-2019-3826) in the internal UI (not exposed in production, but in https://beta-prometheus.wmflabs.org/). Change is already pending in Salsa: https://salsa.debian.org/go-team/packages/prometheus/commit/1cd743bc0012935842adb5941258c9ed8bff85fe

Mentioned in SAL (#wikimedia-operations) [2019-02-11T14:16:49Z] <godog> depool and take a snapshot of prometheus data for all instances on prometheus2003 - T187987

fgiunchedi updated the task description. · Mon, Feb 11, 2:28 PM
fgiunchedi updated the task description. · Mon, Feb 11, 3:10 PM
jbond added a subscriber: jbond. · Mon, Feb 11, 5:35 PM

A "big rsync + snapshot prometheus + final rsync" approach takes ~2h30m for the final rsync to run, with the bottleneck being a gazillion files on a spinning disk for the global Prometheus instance. IOW ~3h (rsync + reimage) will be our gap between new and migrated data in Prometheus v2.
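Once migrated and new data sit side by side, the expected ~3h gap can be located in the output of a range query. The Prometheus HTTP API's `/api/v1/query_range` endpoint returns each series as a list of `[timestamp, "value"]` pairs under `data.result[i].values`. The sketch below scans one such list for holes; the function name and the 1.5× tolerance are assumptions chosen for illustration.

```python
def find_gaps(values, step, tolerance=1.5):
    """Return (start, end) timestamp pairs where consecutive samples
    are more than tolerance * step seconds apart."""
    gaps = []
    timestamps = [float(ts) for ts, _value in values]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > tolerance * step:
            gaps.append((prev, cur))
    return gaps
```

With a 60s query step, a single reported gap of roughly 10800 seconds around the reimage time would match the estimate above. Additional or much longer gaps would suggest data was lost during the rsync or the conversion.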

Change 490325 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: use stretch for prometheus hosts

https://gerrit.wikimedia.org/r/490325

Change 490325 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use stretch for prometheus hosts

https://gerrit.wikimedia.org/r/490325

Change 486059 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: use Prometheus 2 on prometheus2003

https://gerrit.wikimedia.org/r/486059

Change 486059 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: use Prometheus 2 on prometheus2003

https://gerrit.wikimedia.org/r/486059

Change 490375 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: use rules_ops.yml for prometheus 2

https://gerrit.wikimedia.org/r/490375

Mentioned in SAL (#wikimedia-operations) [2019-02-13T18:06:58Z] <godog> reimage prometheus2003 - T187987

Change 490375 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: use yaml rules files for prometheus v2

https://gerrit.wikimedia.org/r/490375

Change 490582 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add rules_k8s.yml converted from rules_k8s.conf

https://gerrit.wikimedia.org/r/490582

Change 490582 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add rules_k8s.yml converted from rules_k8s.conf

https://gerrit.wikimedia.org/r/490582

Status update: yesterday I reimaged prometheus2003, and Prometheus 2.7.1 has been running there since. The host is still depooled but is collecting metrics similarly to its counterpart prometheus2004 (ATM ~22.5k samples/s).

I've started migrating prometheus2003 data on graphite2001. The rsync I ran yesterday wasn't complete, so I've been rsync'ing the missing data from prometheus2004, and I've begun the migration for the analytics and services instances.

Change 490834 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: don't require Prometheus::Server when writing k8s token

https://gerrit.wikimedia.org/r/490834