Prometheus 2.x was released some time ago and we should migrate to it, since it brings performance and storage benefits:
https://prometheus.io/blog/2017/11/08/announcing-prometheus-2-0/
Migration-wise a few things changed; in particular the on-disk format is new. There's a migration guide at https://prometheus.io/docs/prometheus/2.0/migration/, however we'll be trying to convert v1 storage into v2 with
https://gitlab.com/gitlab-org/prometheus-storage-migrator, as suggested by @colewhite.
Proposed plan of attack:
[x] Build a Prometheus 2 Debian package with k8s support (we currently re-build the stock Debian package with k8s support added back).
[x] Add the relevant puppetization to be able to use 2.x instead of 1.x on a given host
[x] Build internally a prometheus-storage-migrator Debian package and upload to `stretch-wikimedia`
[x] Test the conversion in beta first
[x] Setup another deployment-prometheus instance with Prometheus 2 (`deployment-prometheus02`)
[x] Copy storage from old instance to new, and convert with storage migrator
[x] Verify 2.x works as expected (e.g. metrics are preserved from v1, new metrics are being ingested, etc)
[] Convert production Prometheus instances
[] For sites with pairs of Prometheus hosts we can take one host out of rotation and perform the migration there
[] Once migration is done verify queries work as expected and put the host back in service
[] For PoPs (single Prometheus host) we'll have to find strategies to minimize downtime
All of the above assumes the storage migrator works as expected (e.g. doesn't run out of memory). If it fails:
[] Set up 2.x to read from 1.x on the same host for historical data
[] Flip traffic so queries hit 2.x instead of 1.x (modulo removed query-language features; I don't think what got removed is widely used in our environment)
[] Once the retention period has passed and/or enough data has accumulated in 2.x, remove 1.x instances
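The fallback above can be sketched as follows. This assumes the 1.x instance is demoted to an alternate port (9094 here is illustrative, not our actual config) and that 2.x reads historical data from it via the remote read API, which Prometheus 1.8+ serves at `/api/v1/read`:

```shell
# Sketch, untested: run 1.x and 2.x side by side on the same host.
# First restart the 1.x instance on an alternate port (1.x uses single-dash flags), e.g.:
#   prometheus -web.listen-address ":9094" -storage.local.path /srv/prometheus/v1 ...

# Then point the 2.x instance at it via remote_read in its prometheus.yml
# (config file path is illustrative):
cat >> /etc/prometheus/prometheus.yml <<'EOF'
remote_read:
  - url: "http://localhost:9094/api/v1/read"
EOF
```

With this in place, queries against 2.x transparently pull pre-migration data from 1.x until the retention window has passed.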
= Migration checklist for codfw/eqiad hosts
[] Depool host and stop puppet
[] Take a LV snapshot for all instances and mount it
[] rsync snapshotted data to graphite2001 (spare host, data migration will happen there)
[] Reimage prometheus host with stretch
[] Set `prometheus::server::prometheus_v2` flag in hiera for prometheus host
[] Install prometheus 2.7.1 package on prometheus host (forcing block duration to 2h with `--storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h` temporarily)
[] Start puppet
[] Validate metrics are being collected; v2 storage will start empty
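The snapshot-and-copy steps above look roughly like this. All names here (volume group, LV naming, mount points, destination path on graphite2001) are illustrative assumptions, not our actual layout; run as root on the prometheus host, once per instance:

```shell
# Sketch, untested: snapshot one prometheus instance's LV and copy it off-host.
INSTANCE=ops    # repeat for each instance on the host

# Take a read-only LVM snapshot of the instance's data volume:
lvcreate --snapshot --size 10G \
  --name "prometheus-${INSTANCE}-snap" "/dev/vg0/prometheus-${INSTANCE}"
mkdir -p "/mnt/prometheus-${INSTANCE}-snap"
mount -o ro "/dev/vg0/prometheus-${INSTANCE}-snap" "/mnt/prometheus-${INSTANCE}-snap"

# Copy the snapshotted v1 data to the spare host where migration will run:
rsync -aS "/mnt/prometheus-${INSTANCE}-snap/" \
  "graphite2001.codfw.wmnet:/srv/prometheus-migration/${INSTANCE}/"

# Clean up the snapshot once the copy is verified:
umount "/mnt/prometheus-${INSTANCE}-snap"
lvremove -y "/dev/vg0/prometheus-${INSTANCE}-snap"
```

Snapshotting first means the copy is consistent even while prometheus keeps writing, and the reimage can proceed as soon as the rsync finishes.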
== Migration / backfill procedure
[] Start `prometheus-storage-migrator` on rsync'd data
[] Once migration has finished, rsync data back to prometheus host
[] Confirm no overlapping block directories are present between migrated data and new data
[] Stop puppet on prometheus host
[] Stop prometheus
[] Move migrated data into prometheus storage directory
[] Remove `--storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h` from prometheus server flags
[] Start prometheus
[] Start puppet
[] Confirm historical metrics are present and new metrics are collected
[] Repool prometheus host
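The backfill procedure can be sketched as below. Paths, hostnames, and the systemd unit name are illustrative assumptions; the storage-migrator's exact invocation is deliberately left as a placeholder (consult its README rather than this sketch):

```shell
# Sketch, untested. On graphite2001: convert v1 data to v2 blocks.
# Placeholder only -- see the prometheus-storage-migrator README for real flags:
#   prometheus-storage-migrator ... /srv/prometheus-migration/ops/v1 /srv/prometheus-migration/ops/v2

# Copy the converted blocks back to the (reimaged) prometheus host:
rsync -aS /srv/prometheus-migration/ops/v2/ \
  prometheus-host.codfw.wmnet:/srv/prometheus-migration/ops/

# On the prometheus host, with puppet disabled:
systemctl stop prometheus@ops            # unit name illustrative
mv /srv/prometheus-migration/ops/* /srv/prometheus/ops/metrics/
# Eyeball the block directories: migrated and freshly-written blocks must not
# cover overlapping time ranges, since 2.x (pre-overlap support) refuses them.
ls -l /srv/prometheus/ops/metrics/
systemctl start prometheus@ops
```

The 2h min/max block-duration flags from the checklist matter here: forcing new data into 2h blocks keeps the freshly-ingested blocks from being compacted into ranges that would overlap the migrated ones.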
== Status
[] http://prometheus-labmon.eqiad.wmnet/labs
[x] http://prometheus.svc.codfw.wmnet/analytics
[x] http://prometheus.svc.codfw.wmnet/global
[x] http://prometheus.svc.codfw.wmnet/k8s
[x] http://prometheus.svc.codfw.wmnet/ops
[x] http://prometheus.svc.codfw.wmnet/services
[x] http://prometheus.svc.eqiad.wmnet/analytics
[x] http://prometheus.svc.eqiad.wmnet/global
[x] http://prometheus.svc.eqiad.wmnet/k8s
[x] http://prometheus.svc.eqiad.wmnet/k8s-staging
[x] http://prometheus.svc.eqiad.wmnet/ops
[x] http://prometheus.svc.eqiad.wmnet/services
[x] http://prometheus.svc.eqsin.wmnet/ops
[] http://prometheus.svc.esams.wmnet/ops
[x] http://prometheus.svc.ulsfo.wmnet/ops