100% of Prometheus traffic served by Prometheus v2
Closed, Resolved, Public

Description

Prometheus 2.x was released some time ago and we should migrate to it for its performance and storage benefits:
https://prometheus.io/blog/2017/11/08/announcing-prometheus-2-0/

Migration-wise a few things have changed, in particular the on-disk format. There's a migration guide here: https://prometheus.io/docs/prometheus/2.0/migration/. However, we'll be trying to convert v1 storage into v2 with
https://gitlab.com/gitlab-org/prometheus-storage-migrator as suggested by @colewhite.

Proposed plan of attack:

  • Build a Prometheus 2 Debian package with k8s support (we currently re-build the stock Debian package with k8s support added back).
  • Add the relevant puppetization to be able to use 2.x instead of 1.x on a given host
  • Build internally a prometheus-storage-migrator Debian package and upload to stretch-wikimedia
  • Test the conversion in beta first
    • Set up another deployment-prometheus instance with Prometheus 2 (deployment-prometheus02)
    • Copy storage from old instance to new, and convert with storage migrator
    • Verify 2.x works as expected (e.g. metrics are preserved from v1, new metrics are being ingested, etc)
  • Convert production Prometheus instances
    • For sites with pairs of Prometheus hosts we can take one host out of rotation and perform the migration there
    • Once migration is done verify queries work as expected and put the host back in service
    • For PoPs (single Prometheus host) we'll have to find strategies to minimize downtime

All of the above assumes the storage migrator works as expected (e.g. doesn't run out of memory). If that fails:

  • Set up 2.x to read from 1.x on the same host for missing data
  • Flip traffic to redirect queries to 2.x instead of 1.x (modulo removed features of query language, I don't think what got removed is widely used in our environment)
  • Once the retention period has passed and/or enough data has accumulated in 2.x, remove 1.x instances

Migration checklist for codfw/eqiad hosts

  • Depool host and stop puppet
  • Take a LV snapshot for all instances and mount it
  • rsync snapshotted data to graphite2001 (spare host, data migration will happen there)
  • Reimage prometheus host with stretch
  • Set prometheus::server::prometheus_v2 flag in hiera for prometheus host
  • Install prometheus 2.7.1 package on prometheus host (forcing block duration to 2h with --storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h temporarily)
  • Start puppet
  • Validate metrics are being collected (v2 storage will start empty)
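
One way to do the validation step at the end of the checklist above is to ask the freshly started instance for the `up` metric over the Prometheus HTTP API and count how many targets report 1. This is a minimal sketch, not the procedure actually used on these hosts; the function names are made up for illustration and the base URL in the usage note is a placeholder.

```python
import json
import urllib.parse
import urllib.request
from typing import Tuple


def count_up_targets(response: dict) -> Tuple[int, int]:
    """Return (targets with up == 1, total targets) from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response!r}")
    samples = response["data"]["result"]
    # Each sample is {"metric": {...}, "value": [timestamp, "string value"]}.
    healthy = sum(1 for s in samples if float(s["value"][1]) == 1.0)
    return healthy, len(samples)


def query_up(base_url: str) -> dict:
    """Run the instant query `up` against the Prometheus server at base_url."""
    params = urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=10) as resp:
        return json.load(resp)
```

For example, `count_up_targets(query_up("http://localhost:9090"))` (placeholder address) should return a non-zero total shortly after the first scrapes complete.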

Migration / backfill procedure

  • Start prometheus-storage-migrator on rsync'd data
  • Once migration has finished, rsync data back to prometheus host
  • Confirm no overlapping block directories are present between migrated data and new data
  • Stop puppet on prometheus host
  • Stop prometheus
  • Move migrated data into prometheus storage directory
  • Remove --storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h from prometheus server flags
  • Start prometheus
  • Start puppet
  • Confirm historical metrics are present and new metrics are collected
  • Repool prometheus host
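
The "no overlapping blocks" check in the procedure above can be scripted. The sketch below relies on each Prometheus v2 block directory containing a `meta.json` with `minTime` and `maxTime` (milliseconds); it treats each range as half-open and reports any pair of blocks whose ranges intersect. The helper names and the directory paths in the usage note are hypothetical.

```python
import json
from pathlib import Path
from typing import List, Tuple

Block = Tuple[int, int, str]  # (minTime, maxTime, block directory name)


def load_blocks(storage_dir: str) -> List[Block]:
    """Read (minTime, maxTime, name) for every block directory under storage_dir."""
    blocks = []
    for meta_path in Path(storage_dir).glob("*/meta.json"):
        meta = json.loads(meta_path.read_text())
        blocks.append((meta["minTime"], meta["maxTime"], meta_path.parent.name))
    return sorted(blocks)


def find_overlaps(blocks: List[Block]) -> List[Tuple[str, str]]:
    """Return pairs of block names whose [minTime, maxTime) ranges intersect."""
    overlaps = []
    ordered = sorted(blocks)
    for i, (_lo, hi, name) in enumerate(ordered):
        for other_lo, _other_hi, other_name in ordered[i + 1:]:
            if other_lo >= hi:
                break  # sorted by minTime, so no later block can overlap this one
            overlaps.append((name, other_name))
    return overlaps
```

Usage would be along the lines of `find_overlaps(load_blocks("/path/to/migrated") + load_blocks("/path/to/live"))`; an empty list means the two sets of blocks can be placed in the same storage directory without overlapping in time.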

Status

Details

Related Gerrit Patches:
[operations/puppet@production] prometheus: remove v1 rules files
[operations/puppet@production] prometheus: remove v2 feature flag
[operations/puppet@production] hieradata: bast3002 to prometheus v2
[operations/puppet@production] hieradata: labmon1001 to prometheus v2
[operations/puppet@production] hieradata: labmon1002 to prometheus v2
[operations/puppet@production] hieradata: bast4002 to Prometheus v2
[operations/puppet@production] hieradata: prometheus v2 on bast5001
[operations/puppet@production] prometheus: don't require Prometheus::Server when writing k8s token
[operations/puppet@production] hieradata: run Prometheus v2 on prometheus2004
[operations/puppet@production] prometheus: set v2 max block duration to 24h
[operations/puppet@production] hieradata: Prometheus v2 for prometheus1004
[operations/puppet@production] prometheus: use yaml rules for prometheus v2 k8s-staging
[operations/puppet@production] hieradata: run prometheus 2 on prometheus1003
[operations/puppet@production] prometheus: add rules_k8s.yml converted from rules_k8s.conf
[operations/puppet@production] prometheus: use yaml rules files for prometheus v2
[operations/puppet@production] hieradata: use Prometheus 2 on prometheus2003
[operations/puppet@production] install_server: use stretch for prometheus hosts
[operations/puppet@production] prometheus: add feature flag for v2 compat
[operations/puppet@production] prometheus: set retention period for v2 compatibility

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-01-18T08:12:57Z] <godog> depool and take snapshots of prometheus data on prometheus2003 to test v2 conversion - T187987

Change 486051 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP prometheus: add feature flag for v2 compat

https://gerrit.wikimedia.org/r/486051

Change 486251 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: set retention period for v2 compatibility

https://gerrit.wikimedia.org/r/486251

Change 486251 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: set retention period for v2 compatibility

https://gerrit.wikimedia.org/r/486251

I ran a test conversion on graphite2001 using prometheus-storage-migrator and a snapshot of data taken from prometheus2003 and parallelism 10:

Instance  | Retention | # metrics / v1 data size | v2 data size | Conversion time
analytics | 4032h     | 65072MB                  | 668MB        | 5m
k8s       | 4032h     | 2410926.5GB              | 18GB         | 9.5h
global    | 10920h    | 115503173GB              | 152G         | 18h
ops       | 2190h     | 1448041254GB             | 384G         | 73h
services  | 4032h     | 47841360GB               | 177G         | 67h

Change 486051 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add feature flag for v2 compat

https://gerrit.wikimedia.org/r/486051

As discussed on IRC: Let's upgrade to 2.7.1 next week as that fixes a security issue (CVE-2019-3826) in the internal UI (not exposed in production, but in https://beta-prometheus.wmflabs.org/). Change is already pending in Salsa: https://salsa.debian.org/go-team/packages/prometheus/commit/1cd743bc0012935842adb5941258c9ed8bff85fe

Mentioned in SAL (#wikimedia-operations) [2019-02-11T14:16:49Z] <godog> depool and take a snapshot of prometheus data for all instances on prometheus2003 - T187987

fgiunchedi updated the task description. Feb 11 2019, 2:28 PM
fgiunchedi updated the task description. Feb 11 2019, 3:10 PM
jbond added a subscriber: jbond. Feb 11 2019, 5:35 PM

A "big rsync + snapshot prometheus + final rsync" yields ~2h30m for the final rsync to run, with the bottleneck being a gazillion files on a spinning disk for the global prometheus instance. IOW ~3h (rsync + reimage) will be our gap between new and migrated data in Prometheus v2.

Change 490325 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: use stretch for prometheus hosts

https://gerrit.wikimedia.org/r/490325

Change 490325 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use stretch for prometheus hosts

https://gerrit.wikimedia.org/r/490325

Change 486059 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: use Prometheus 2 on prometheus2003

https://gerrit.wikimedia.org/r/486059

Change 486059 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: use Prometheus 2 on prometheus2003

https://gerrit.wikimedia.org/r/486059

Change 490375 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: use rules_ops.yml for prometheus 2

https://gerrit.wikimedia.org/r/490375

Mentioned in SAL (#wikimedia-operations) [2019-02-13T18:06:58Z] <godog> reimage prometheus2003 - T187987

Change 490375 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: use yaml rules files for prometheus v2

https://gerrit.wikimedia.org/r/490375

Change 490582 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add rules_k8s.yml converted from rules_k8s.conf

https://gerrit.wikimedia.org/r/490582

Change 490582 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add rules_k8s.yml converted from rules_k8s.conf

https://gerrit.wikimedia.org/r/490582

Status update: yesterday I reimaged prometheus2003 and Prometheus 2.7.1 has been running there. The host is still depooled but is collecting metrics similarly to its counterpart prometheus2004 (ATM ~22.5k samples/s).

I've started migrating prometheus2003 data on graphite2001. The rsync I ran yesterday wasn't complete, so I've been rsync'ing the missing data from prometheus2004 and have begun the migration for the analytics and services instances.

Change 490834 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: don't require Prometheus::Server when writing k8s token

https://gerrit.wikimedia.org/r/490834

Update: while converting the global instance with the full 10500h retention, the migrator ran into a problem that aborted the migration. I've shortened the retention to 8500h (i.e. 354 days) and with that the migration finished successfully. I've reported the issue upstream as https://gitlab.com/gitlab-org/prometheus-storage-migrator/issues/4 in case it rings a bell for them.

Backfilling the global storage is what's blocking putting prometheus2003 back in production at this point, so I'm leaning towards merging the migrated data even with a retention 83 days shorter than the original.
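
For reference, the retention figures above convert to days as follows. This is plain arithmetic on the two numbers quoted in this comment; the variable names are only illustrative.

```python
HOURS_PER_DAY = 24

full_retention_h = 10500      # retention that failed to migrate
migrated_retention_h = 8500   # retention that migrated successfully

migrated_days = migrated_retention_h / HOURS_PER_DAY
trimmed_days = (full_retention_h - migrated_retention_h) / HOURS_PER_DAY

print(f"migrated: {migrated_days:.1f} days, trimmed: {trimmed_days:.1f} days")
# prints "migrated: 354.2 days, trimmed: 83.3 days"
```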

Volans added a subscriber: Volans. Feb 26 2019, 10:33 AM

From the error reported in the upstream issue it seems that it is data-dependent. Have you by any chance tried any other retention between 8500h and 10500h?
Can we enable any more debugging to get a better idea of which metric is throwing the error so that maybe we can just skip it instead?

Indeed, I have tried with 9000h for example and that failed too, one additional complication is that each try can take ~15-20h before failing :(

There's no more debugging from the code as it is now AFAICS.

Since the 8500h migration worked I'm thinking of going with that for prometheus2003 so that host is done and can be put in production, then start converting prometheus2004 (or prometheus1003) and see how far we get.

Would that mean that the missing hours will be totally lost? In that case it's probably better to ask the users that were asking for longer retention, to make sure we're not losing any required data. (my 2 cents)

Yes that's what that means, they'd be lost from prometheus2003 and possibly others too if we can't migrate more retention.

re: affected users, I'm looping in Performance-Team since IIRC they were involved in https://gerrit.wikimedia.org/r/c/operations/puppet/+/404434 originally, to ask if they'd be ok with losing ~84 days at the end of the current retention period for the global instance.

Gilles added a subscriber: Gilles. Feb 26 2019, 2:34 PM

What does that mean, exactly? Losing 84 days worth of data that's already 15 months old?

That's correct yes, essentially trimming off the retention

Sounds fine to me!

Thanks @Gilles !

I have tried again a 9000h migration and that worked, I've now merged the migrated storage with existing storage and will be putting prometheus2003 back in production today.

And so I did. It turns out, however, that vector queries (e.g. rate()) in Prometheus 2 don't work when results have the same labelset but are different metrics, as in our case of migrated storage plus the already-running v2 storage (see the upstream issue).

What I thought was possible: spin up a Prometheus 2 collecting data while the v1 storage was migrating, and then merge the two (migrated v1 + brand-new v2) to minimize the gap during migration. As experienced above, however, while this is possible it is not safe/reliable in the face of vector queries.

So, change of plans: migrate v1 storage to v2 but don't start a v2 instance while migration is in progress, start v2 only when migration is fully complete and new metrics can start flowing. The downside of this approach is that the gap will be as big as the migration time. So far this time has been in the order of 24h for our biggest instances (ops and global).

The other approach mentioned in the task description is to set up remote read from v1. I think however we should stick with the migrator and take a one-time hit of a metric gap, instead of having to keep v1 around until the full retention has passed.

Thoughts / input welcome as always!

To test the new plan above I've started an rsync + migration of all instances of prometheus2003, starting from a snapshot of data from prometheus2004. As soon as the migration is done successfully then prometheus is started, so the gap is rsync time + migration time. For big instances hosted on spinning disks (global and services) the rsync time might be significant, in that case we can consider transferring the snapshot block device itself instead of looking at the filesystem.

Status update:

The migrations have finished for all instances on prometheus2003, however the same symptom remains. To explain in detail what's happening, here is an example for the global instance and the cluster_input:rsyslog_input_submitted:rate5m metric before/after the migration (see the screenshot below). The green line is pre-migration (i.e. data from Prometheus v1) and the red line is post-migration (i.e. when Prometheus v2 started ingesting new data again). The two series have the same labels yet are regarded by Prometheus as two separate metrics (see also the JSON below, where the result list has two elements). Having such a result was not a problem in itself before Prometheus 2.5, which shipped a fix for this issue that makes operations such as rate() impossible on results where two distinct metrics share the same labels.

I have been debugging why the new metrics are regarded as "new" post-migration, and at the same time why testing the migration on the analytics instance worked as expected (i.e. no duplicated metrics, e.g. for prometheus_engine_queries), without much success so far.

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "cluster_input:rsyslog_input_submitted:rate5m",
          "cluster": "syslog",
          "input": "imtcp(6514)",
          "job": "federate-ops",
          "site": "codfw"
        },
        "values": [
          [
            1552317531.172,
            "6227.620833333333"
          ],
          [
            1552320987.172,
            "6622.216666666666"
          ],
          [
            1552324443.172,
            "5789.508333333333"
          ],
          [
            1552327899.172,
            "5787.7375"
          ],
          [
            1552331355.172,
            "5859.45"
          ],
          [
            1552334811.172,
            "5783.85"
          ],
          [
            1552338267.172,
            "5989.416666666667"
          ],
          [
            1552341723.172,
            "5358.641666666666"
          ],
          [
            1552345179.172,
            "5011.729166666667"
          ],
          [
            1552348635.172,
            "4530.6"
          ],
          [
            1552352091.172,
            "4072.0625"
          ],
          [
            1552355547.172,
            "3930.0833333333335"
          ],
          [
            1552359003.172,
            "5058.954166666666"
          ],
          [
            1552362459.172,
            "4158.95"
          ],
          [
            1552365915.172,
            "3677.170833333333"
          ],
          [
            1552369371.172,
            "3602.7875"
          ],
          [
            1552372827.172,
            "3982.9791666666665"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "cluster_input:rsyslog_input_submitted:rate5m",
          "cluster": "syslog",
          "input": "imtcp(6514)",
          "job": "federate-ops",
          "site": "codfw"
        },
        "values": [
          [
            1551508827.172,
            "4084.982979237587"
          ],
          [
            1551512283.172,
            "4148.945833333333"
          ],
          [
            1551515739.172,
            "4971.770833333333"
          ],
          [
            1551519195.172,
            "5200.245833333333"
          ],
          [
            1551522651.172,
            "4739.388384301308"
          ],
          [
            1551526107.172,
            "4740.575609715445"
          ],
          [
            1551529563.172,
            "4838.3149456702895"
          ],
          [
            1551533019.172,
            "6006.662722090737"
          ],
          [
            1551536475.172,
            "6072.666291690102"
          ],
          [
            1551539931.172,
            "5938.783333333334"
          ],
          [
            1551543387.172,
            "5362.189436141446"
          ],
          [
            1551546843.172,
            "5840.691666666667"
          ],
          [
            1551550299.172,
            "6249.484364257588"
          ],
          [
            1551553755.172,
            "5367.790134207775"
          ],
          [
            1551557211.172,
            "5077.970508456215"
          ],
          [
            1551560667.172,
            "5863.291666666667"
          ],
          [
            1551564123.172,
            "5608.3375"
          ],
          [
            1551567579.172,
            "4994.033333333334"
          ],
          [
            1551571035.172,
            "4321.425"
          ],
          [
            1551574491.172,
            "4545.67272636364"
          ],
          [
            1551577947.172,
            "4845.540346271641"
          ],
          [
            1551581403.172,
            "4758.295833333334"
          ],
          [
            1551584859.172,
            "4427.995833333333"
          ],
          [
            1551588315.172,
            "4090.535613841378"
          ],
          [
            1551591771.172,
            "3827.046711137592"
          ],
          [
            1551595227.172,
            "4151.033333333334"
          ],
          [
            1551598683.172,
            "4141.483333333334"
          ],
          [
            1551602139.172,
            "5255.3125"
          ],
          [
            1551605595.172,
            "5603.320833333333"
          ],
          [
            1551609051.172,
            "5243.138856135761"
          ],
          [
            1551612507.172,
            "5008.645833333333"
          ],
          [
            1551615963.172,
            "5560.741666666667"
          ],
          [
            1551619419.172,
            "5331.166666666667"
          ],
          [
            1551622875.172,
            "6407.958333333333"
          ],
          [
            1551626331.172,
            "6129.576384895534"
          ],
          [
            1551629787.172,
            "6187.153654491971"
          ],
          [
            1551633243.172,
            "5989.604166666667"
          ],
          [
            1551636699.172,
            "5726.383333333333"
          ],
          [
            1551640155.172,
            "5470.391666666666"
          ],
          [
            1551643611.172,
            "5667.6125"
          ],
          [
            1551647067.172,
            "5855.615418911542"
          ],
          [
            1551650523.172,
            "5806.896637930675"
          ],
          [
            1551653979.172,
            "4737.064691127574"
          ],
          [
            1551657435.172,
            "4633.558333333333"
          ],
          [
            1551660891.172,
            "4225.195833333333"
          ],
          [
            1551664347.172,
            "5110.029166666666"
          ],
          [
            1551667803.172,
            "4603.169311926797"
          ],
          [
            1551671259.172,
            "4626.004200140005"
          ],
          [
            1551674715.172,
            "3860.6541666666667"
          ],
          [
            1551678171.172,
            "3823.0355453890156"
          ],
          [
            1551681627.172,
            "4011.3375"
          ],
          [
            1551685083.172,
            "4494.039409482776"
          ]
        ]
      }
    ]
  }
}
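
A response like the one above can be checked for this symptom mechanically. The following is a minimal sketch that counts how many times each label set appears in the `result` list of a query response; any count above 1 means two series share identical labels. The function name is made up for illustration.

```python
from collections import Counter
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


def duplicate_labelsets(response: dict) -> Dict[LabelSet, int]:
    """Return the label sets that occur more than once in a query result, with counts."""
    counts = Counter(
        # Sort label pairs so that key order in the JSON object doesn't matter.
        tuple(sorted(series["metric"].items()))
        for series in response["data"]["result"]
    )
    return {labels: n for labels, n in counts.items() if n > 1}
```

Run against the JSON above, this would report the cluster_input:rsyslog_input_submitted:rate5m label set with a count of 2; on a healthy instance it should return an empty dict.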

After focused code reading and head scratching, it turns out the root cause is that persisted metrics weren't sorted during migration. This used to work fine, but in Prometheus >= 2.5 it is a requirement. The upstream issue now has an MR to fix this problem: https://gitlab.com/gitlab-org/prometheus-storage-migrator/issues/5.

For good measure I'll reimage prometheus2003 and start again a migration to validate things are working as expected.

I've just repooled prometheus2003 and things seem to be working as expected!

The gap in metrics starts at March 14th ~14 UTC. The widest gap is for ops instance, which ends at March 16th at ~23 UTC (i.e. ~2.5 days) other instances have shorter gaps due to taking less time to migrate.
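
The widest gap can be worked out from the two approximate timestamps above. The year (2019) is an assumption taken from the surrounding timeline, and both times are rounded to the hour as quoted.

```python
from datetime import datetime, timezone

# Year 2019 is assumed from the surrounding event timeline.
gap_start = datetime(2019, 3, 14, 14, 0, tzinfo=timezone.utc)
gap_end = datetime(2019, 3, 16, 23, 0, tzinfo=timezone.utc)

gap_hours = (gap_end - gap_start).total_seconds() / 3600
print(f"{gap_hours:.0f}h, i.e. {gap_hours / 24:.1f} days")
# prints "57h, i.e. 2.4 days"
```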

Change 490834 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: don't require Prometheus::Server when writing k8s token

https://gerrit.wikimedia.org/r/490834

Change 497746 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: run prometheus 2 on prometheus1003

https://gerrit.wikimedia.org/r/497746

Change 497746 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: run prometheus 2 on prometheus1003

https://gerrit.wikimedia.org/r/497746

Change 497761 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: use yaml rules for prometheus v2 k8s-staging

https://gerrit.wikimedia.org/r/497761

Change 497761 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: use yaml rules for prometheus v2 k8s-staging

https://gerrit.wikimedia.org/r/497761

Status update: all instances but ops on prometheus1003 have finished migrating. This time I changed the strategy to actually backfill new + migrated storage (which was the plan all along!) and the maximum gap observed is ~1.5h which is acceptable IMHO.

Next steps once prometheus1003 has finished migrating:

  1. Put prometheus1003 in rotation to serve traffic by itself, to prove it works as expected
  2. Reimage prometheus1004 with stretch and install Prometheus 2
  3. Copy data from prometheus1003 onto prometheus1004
  4. Put prometheus1004 in service

Change 499742 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: set v2 max block duration to 24h

https://gerrit.wikimedia.org/r/499742

Status update: codfw and eqiad are served by Prometheus v2 now (single host in each, prometheus2003 and prometheus1003 respectively). The Prometheus v1 hosts are depooled now and I'll reimage them early next week after no problems are reported. Backfill for those will be simpler because there's no migration involved: we'll be copying the already migrated storage from the existing v2 hosts.

As far as eqsin/esams/ulsfo are concerned I've thought about rsync'ing the data off hosts for migration, however there are a lot of metric files involved and rsync is going to take multiple hours before it finishes. During the rsync + migration time there will be no metric history, however new metrics will be collected as soon as Prometheus v2 is started.

A faster strategy is to bypass the filesystem and copy an LV snapshot off the host instead, which usually happens at 50+ MB/s. Unfortunately the VGs on the bastions don't have free extents, so to take snapshots we'll have to resize the LV and the filesystem, which entails Prometheus downtime.
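
To get a feel for what 50 MB/s means for a snapshot copy, here is a rough estimate. The sizes in the loop are hypothetical, since the LV sizes on the bastions aren't stated here, and the calculation assumes a sustained rate and decimal units (1 GB = 1000 MB).

```python
def transfer_hours(size_gb: float, rate_mb_s: float = 50.0) -> float:
    """Hours needed to copy size_gb gigabytes at a sustained rate_mb_s megabytes/second."""
    return size_gb * 1000 / rate_mb_s / 3600


for size_gb in (100, 250, 500):  # hypothetical LV sizes, not measured values
    print(f"{size_gb} GB at 50 MB/s: {transfer_hours(size_gb):.1f}h")
```

At that rate a block-level copy scales only with the LV size, whereas the rsync time discussed above is dominated by the number of files.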

To recap the two strategies:

rsync data and start Prometheus v2

Pros:

  • Safe, minimal Prometheus downtime

Cons:

  • Will take longer to transfer metrics off-host and begin migration on e.g. graphite2002 (exact numbers TBD) during which time metrics history won't be available

shrink existing FS and transfer a snapshot

Pros:

  • Faster data transfer time, migration can begin sooner

Cons:

  • There will be Prometheus downtime when the FS is shrinking, which can take some time (number unknown)

fgiunchedi renamed this task from "Serve >= 50% of production Prometheus systems with Prometheus v2" to "100% of Prometheus traffic served by Prometheus v2". Apr 4 2019, 1:06 PM
fgiunchedi moved this task from Backlog to In progress on the observability board. Apr 15 2019, 2:50 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-17T08:47:06Z] <godog> reimage prometheus1004 - T187987

Mentioned in SAL (#wikimedia-operations) [2019-04-17T12:10:16Z] <godog> briefly stop all prometheus on prometheus1003 to finish metrics rsync - T187987

Change 504540 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: Prometheus v2 for prometheus1004

https://gerrit.wikimedia.org/r/504540

Change 504540 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: Prometheus v2 for prometheus1004

https://gerrit.wikimedia.org/r/504540

Change 499742 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: set v2 max block duration to 24h

https://gerrit.wikimedia.org/r/499742

Mentioned in SAL (#wikimedia-operations) [2019-04-17T13:47:51Z] <godog> reimage prometheus2004 - T187987

Change 504552 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: run Prometheus v2 on prometheus2004

https://gerrit.wikimedia.org/r/504552

Change 504552 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: run Prometheus v2 on prometheus2004

https://gerrit.wikimedia.org/r/504552

Both prometheus1004 and prometheus2004 are now in service with Prometheus v2! So far no issues, syncing the whole storage from their counterparts took ~2h each.

Change 490834 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: don't require Prometheus::Server when writing k8s token

https://gerrit.wikimedia.org/r/490834

Change 505712 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: prometheus v2 on bast5001

https://gerrit.wikimedia.org/r/505712

Change 505712 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: prometheus v2 on bast5001

https://gerrit.wikimedia.org/r/505712

Mentioned in SAL (#wikimedia-operations) [2019-04-23T09:23:15Z] <godog> upgrade prometheus to v2 on bast5001, previous metrics will not be available until migration and backfill are complete - T187987

Mentioned in SAL (#wikimedia-operations) [2019-04-29T08:39:55Z] <godog> begin migration of bast4002 to prometheus v2 - T187987

Change 506950 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: bast4002 to Prometheus v2

https://gerrit.wikimedia.org/r/506950

Change 506950 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: bast4002 to Prometheus v2

https://gerrit.wikimedia.org/r/506950

Mentioned in SAL (#wikimedia-operations) [2019-04-30T08:22:36Z] <godog> bounce prometheus on bast4002 after backfill has finished - T187987

fgiunchedi updated the task description. Apr 30 2019, 3:31 PM
fgiunchedi added a comment. Edited Apr 30 2019, 3:49 PM

Labmon hosts are in active/standby pairs for graphite, whereas prometheus runs on both independently but only labmon1001 is used for queries via prometheus-labmon.eqiad.wmnet. There's very little prometheus data on those hosts, so my plan is to run the migrator on the host itself, starting with labmon1002, let prometheus 2 run for a while, then copy the data over to labmon1001.

In other words, the steps on labmon1002 will be:

  • stop puppet
  • stop prometheus
  • start migrating data
  • upgrade prometheus to v2
  • run puppet
  • once migration has finished, backfill migrated data

cc cloud-services-team as I'd like to perform this work on Thurs May 2nd on labmon1002, and upgrade labmon1001 early next week.

works for me! Ping me on IRC if you need anything.

Change 507751 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: labmon1002 to prometheus v2

https://gerrit.wikimedia.org/r/507751

Change 507751 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: labmon1002 to prometheus v2

https://gerrit.wikimedia.org/r/507751

labmon1002 has been migrated and seems to be working; I'll upgrade labmon1001 early next week.

Change 508280 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: labmon1001 to prometheus v2

https://gerrit.wikimedia.org/r/508280

Change 508280 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: labmon1001 to prometheus v2

https://gerrit.wikimedia.org/r/508280

fgiunchedi updated the task description. May 6 2019, 9:03 AM

Mentioned in SAL (#wikimedia-operations) [2019-05-06T09:03:20Z] <godog> upgrade labmon1001 to prometheus 2 - T187987

Change 508503 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: bast3002 to prometheus v2

https://gerrit.wikimedia.org/r/508503

Change 508503 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: bast3002 to prometheus v2

https://gerrit.wikimedia.org/r/508503

Mentioned in SAL (#wikimedia-operations) [2019-05-07T07:27:03Z] <godog> upgrade prometheus on bast3002 - T187987

Change 509052 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: remove v2 feature flag

https://gerrit.wikimedia.org/r/509052

Change 509052 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: remove v2 feature flag

https://gerrit.wikimedia.org/r/509052

fgiunchedi closed this task as Resolved. May 13 2019, 8:52 AM
fgiunchedi claimed this task.
fgiunchedi updated the task description.

Completed! The entire production and WMCS Prometheus fleet has been migrated to Prometheus 2.

Change 509822 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: remove v1 rules files

https://gerrit.wikimedia.org/r/509822

Change 509822 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: remove v1 rules files

https://gerrit.wikimedia.org/r/509822