100% of Prometheus traffic served by Prometheus v2
Closed, Resolved (Public)

Assigned To

Authored By

	fgiunchedi
	Feb 22 2018, 10:16 AM

Details

Subject	Repo	Branch	Lines +/-
prometheus: remove v1 rules files	operations/puppet	production	+0 -485
prometheus: remove v2 feature flag	operations/puppet	production	+35 -139
hieradata: bast3002 to prometheus v2	operations/puppet	production	+1 -0
hieradata: labmon1001 to prometheus v2	operations/puppet	production	+1 -0
hieradata: labmon1002 to prometheus v2	operations/puppet	production	+1 -0
hieradata: bast4002 to Prometheus v2	operations/puppet	production	+1 -0
hieradata: prometheus v2 on bast5001	operations/puppet	production	+1 -0
prometheus: don't require Prometheus::Server when writing k8s token	operations/puppet	production	+0 -3
hieradata: run Prometheus v2 on prometheus2004	operations/puppet	production	+1 -0
prometheus: set v2 max block duration to 24h	operations/puppet	production	+1 -1
hieradata: Prometheus v2 for prometheus1004	operations/puppet	production	+1 -0
prometheus: use yaml rules for prometheus v2 k8s-staging	operations/puppet	production	+10 -3
hieradata: run prometheus 2 on prometheus1003	operations/puppet	production	+1 -0
prometheus: add rules_k8s.yml converted from rules_k8s.conf	operations/puppet	production	+12 -0
prometheus: use yaml rules files for prometheus v2	operations/puppet	production	+50 -15
hieradata: use Prometheus 2 on prometheus2003	operations/puppet	production	+1 -0
install_server: use stretch for prometheus hosts	operations/puppet	production	+0 -8
prometheus: add feature flag for v2 compat	operations/puppet	production	+92 -17
prometheus: set retention period for v2 compatibility	operations/puppet	production	+8 -8

Related Objects

Status	Assigned	Task
Resolved	fgiunchedi	T220104 TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal)
Resolved	fgiunchedi	T187987 100% of Prometheus traffic served by Prometheus v2
Resolved	• ema	T212312 prometheus-based graph significantly slower than statsd equivalent
Resolved	• ema	T190992 prometheus: slow dashboards due to suboptimal query_range performance
Resolved	fgiunchedi	T214058 Convert prometheus-labs-targets to use nova API instead of wikitech's api.php
Resolved	fgiunchedi	T215272 Upgrade Prometheus to 2.7 in deployment-prep and tools

Event Timeline


fgiunchedi closed subtask T214058: Convert prometheus-labs-targets to use nova API instead of wikitech's api.php as Resolved.Jan 21 2019, 2:07 PM

Change 486051 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP prometheus: add feature flag for v2 compat

https://gerrit.wikimedia.org/r/486051

gerritbot added a project: Patch-For-Review.Jan 23 2019, 10:43 AM

Change 486251 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: set retention period for v2 compatibility

https://gerrit.wikimedia.org/r/486251

Mentioned in SAL (#wikimedia-operations) [2019-01-24T14:50:41Z] <godog> roll restart prometheus after https://gerrit.wikimedia.org/r/c/operations/puppet/+/486251 - T187987

Change 486251 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: set retention period for v2 compatibility

https://gerrit.wikimedia.org/r/486251

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board.Jan 26 2019, 9:48 PM

I ran a test conversion on graphite2001 using prometheus-storage-migrator with parallelism 10, on a snapshot of data taken from prometheus2003:

Instance	Retention	# metrics	v1 data size	v2 data size	Conversion time
analytics	4032h	650	72MB	668MB	5m
k8s	4032h	241092	6.5GB	18GB	9.5h
global	10920h	115503	173GB	152GB	18h
ops	2190h	1448041	254GB	384GB	73h
services	4032h	478413	60GB	177GB	67h
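To make the table easier to compare across instances, here is a minimal sketch that derives the v2/v1 size ratio and the v1 data converted per hour. The figures are copied from the table (with 72MB and 668MB written as decimal GB, and 5m as 5/60 h); the names `ROWS` and `summarize` are illustrative only.

```python
# Figures copied from the test-conversion table above.
# Sizes are in GB (decimal), conversion time is in hours.
ROWS = [
    # instance, v1_gb, v2_gb, hours
    ("analytics", 0.072, 0.668, 5 / 60),
    ("k8s", 6.5, 18, 9.5),
    ("global", 173, 152, 18),
    ("ops", 254, 384, 73),
    ("services", 60, 177, 67),
]


def summarize(rows):
    """Return {instance: (v2/v1 size ratio, v1 GB converted per hour)}."""
    return {
        name: (round(v2 / v1, 2), round(v1 / hours, 2))
        for name, v1, v2, hours in rows
    }


if __name__ == "__main__":
    for name, (ratio, rate) in summarize(ROWS).items():
        print(f"{name:10s} size ratio {ratio:5.2f}x  {rate:6.2f} GB/h")
```

On these numbers, global is the only instance whose v2 data is smaller than its v1 data.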

Change 486051 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add feature flag for v2 compat

https://gerrit.wikimedia.org/r/486051

As discussed on IRC: Let's upgrade to 2.7.1 next week as that fixes a security issue (CVE-2019-3826) in the internal UI (not exposed in production, but in https://beta-prometheus.wmflabs.org/). Change is already pending in Salsa: https://salsa.debian.org/go-team/packages/prometheus/commit/1cd743bc0012935842adb5941258c9ed8bff85fe

fgiunchedi closed subtask T215272: Upgrade Prometheus to 2.7 in deployment-prep and tools as Resolved.Feb 11 2019, 10:54 AM

Mentioned in SAL (#wikimedia-operations) [2019-02-11T14:16:49Z] <godog> depool and take a snapshot of prometheus data for all instances on prometheus2003 - T187987

fgiunchedi updated the task description. (Show Details)Feb 11 2019, 2:28 PM

fgiunchedi updated the task description. (Show Details)Feb 11 2019, 3:10 PM

jbond subscribed.Feb 11 2019, 5:35 PM

A "big rsync + snapshot prometheus + final rsync" sequence takes ~2h30m for the final rsync to run, the bottleneck being the huge number of files on a spinning disk for the global Prometheus instance. IOW ~3h (rsync + reimage) will be our gap between new and migrated data in Prometheus v2.

Change 490325 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: use stretch for prometheus hosts

https://gerrit.wikimedia.org/r/490325

Change 490325 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use stretch for prometheus hosts

https://gerrit.wikimedia.org/r/490325

Change 486059 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: use Prometheus 2 on prometheus2003

https://gerrit.wikimedia.org/r/486059

Change 486059 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: use Prometheus 2 on prometheus2003

https://gerrit.wikimedia.org/r/486059

Change 490375 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: use rules_ops.yml for prometheus 2

https://gerrit.wikimedia.org/r/490375

Mentioned in SAL (#wikimedia-operations) [2019-02-13T18:06:58Z] <godog> reimage prometheus2003 - T187987

Change 490375 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: use yaml rules files for prometheus v2

https://gerrit.wikimedia.org/r/490375

Change 490582 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add rules_k8s.yml converted from rules_k8s.conf

https://gerrit.wikimedia.org/r/490582

Change 490582 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add rules_k8s.yml converted from rules_k8s.conf

https://gerrit.wikimedia.org/r/490582

Status update: yesterday I reimaged prometheus2003, and Prometheus 2.7.1 has been running there since. The host is still depooled but is collecting metrics similarly to its counterpart prometheus2004 (ATM ~22.5k samples/s).

I've started migrating prometheus2003 data on graphite2001. The rsync I ran yesterday wasn't complete, though, so I've been rsync'ing the missing data from prometheus2004, and have begun the migration for the analytics and services instances.

Change 490834 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: don't require Prometheus::Server when writing k8s token

https://gerrit.wikimedia.org/r/490834

fgiunchedi mentioned this in T216253: prometheus-pdns-rec-exporter invalid metric output.Feb 15 2019, 4:36 PM

Update: while converting the global instance with the full retention of 10500h, the migrator ran into a problem that aborts the migration. I've shortened the retention to 8500h (i.e. 354 days) and with that the migration finished successfully. I've reported the issue upstream as https://gitlab.com/gitlab-org/prometheus-storage-migrator/issues/4 in case it rings a bell for them.

Backfilling the global storage is what's blocking putting prometheus2003 back in production at this point, so I'm leaning towards merging the migrated data even with a retention 83 days shorter than the original.
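For reference, the retention arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch; the constant names are illustrative.

```python
FULL_RETENTION_H = 10500      # full retention attempted for the global instance
MIGRATED_RETENTION_H = 8500   # retention that migrated successfully


def hours_to_days(hours):
    """Convert a duration in hours to days."""
    return hours / 24


kept_days = hours_to_days(MIGRATED_RETENTION_H)
lost_days = hours_to_days(FULL_RETENTION_H - MIGRATED_RETENTION_H)

if __name__ == "__main__":
    print(f"kept ~{kept_days:.0f} days, lost ~{lost_days:.0f} days")
```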

From the error reported in the upstream issue, it seems to be data-dependent. Have you by any chance tried any other retention between 8500h and 10500h?
Can we enable any more debugging to get a better idea of which metric is throwing the error so that maybe we can just skip it instead?

In T187987#4984153, @Volans wrote:

From the error reported in the upstream issue, it seems to be data-dependent. Have you by any chance tried any other retention between 8500h and 10500h?

Indeed, I tried 9000h, for example, and that failed too. One additional complication is that each attempt can take ~15-20h before failing :(

Can we enable any more debugging to get a better idea of which metric is throwing the error so that maybe we can just skip it instead?

There's no more debugging from the code as it is now AFAICS.

Since the 8500h migration worked I'm thinking of going with that for prometheus2003 so that host is done and can be put in production, then start converting prometheus2004 (or prometheus1003) and see how far we get.

Would that mean that the missing hours will be totally lost? In that case it's probably better to ask the users who were asking for longer retention, to make sure we're not losing any required data. (my 2 cents)

In T187987#4984247, @Volans wrote:

Would that mean that the missing hours will be totally lost? In that case it's probably better to ask the users who were asking for longer retention, to make sure we're not losing any required data. (my 2 cents)

Yes that's what that means, they'd be lost from prometheus2003 and possibly others too if we can't migrate more retention.

re: affected users, I'm looping in Performance-Team since IIRC they were originally involved in https://gerrit.wikimedia.org/r/c/operations/puppet/+/404434, to ask if they'd be ok with losing ~84 days at the end of the current retention period for the global instance.

What does that mean, exactly? Losing 84 days worth of data that's already 15 months old?

In T187987#4984611, @Gilles wrote:

What does that mean, exactly? Losing 84 days worth of data that's already 15 months old?

That's correct yes, essentially trimming off the retention

Sounds fine to me!

Thanks @Gilles !

I tried a 9000h migration again and that worked. I've now merged the migrated storage with the existing storage and will be putting prometheus2003 back in production today.

In T187987#4990906, @fgiunchedi wrote:

Thanks @Gilles !

I tried a 9000h migration again and that worked. I've now merged the migrated storage with the existing storage and will be putting prometheus2003 back in production today.

And so I did. It turns out, however, that vector queries (e.g. rate()) in Prometheus 2 don't work when the results contain different metrics with the same labelset, as in our case of migrated storage plus the already-running v2 storage (see the upstream issue).

What I thought was possible: spin up a Prometheus 2 instance collecting data while the v1 storage is being migrated, then merge the two (migrated v1 + brand-new v2) to minimize the gap during migration. As experienced above, however, while this is possible it is not safe/reliable in the face of vector queries.

So, change of plans: migrate the v1 storage to v2 but don't start a v2 instance while the migration is in progress; start v2 only when the migration is fully complete and new metrics can start flowing. The downside of this approach is that the gap will be as big as the migration time. So far this time has been on the order of 24h for our biggest instances (ops and global).

The other approach mentioned in the task description is to set up remote read from v1. However, I think we should stick with the migrator and take a one-time hit of a metric gap, instead of having to keep v1 around until the full retention has passed.

Thoughts / input welcome as always!

To test the new plan above I've started an rsync + migration of all instances of prometheus2003, starting from a snapshot of data from prometheus2004. Prometheus is started as soon as the migration completes successfully, so the gap is rsync time + migration time. For big instances hosted on spinning disks (global and services) the rsync time might be significant; in that case we can consider transferring the snapshot block device itself instead of going through the filesystem.

Status update:

The migrations have finished for all instances on prometheus2003, but the same symptom remains. To explain what's happening in detail, here is an example for the global instance and the cluster_input:rsyslog_input_submitted:rate5m metric before and after the migration, shown in the screenshot below. The green line is pre-migration (i.e. data from Prometheus v1) and the red line is post-migration (i.e. from when Prometheus v2 started ingesting new data again). The two series have the same labels, yet Prometheus regards them as two separate metrics (see also the JSON below, where the result list has two elements). Such a result wasn't a problem before Prometheus 2.5, which shipped a fix for this issue that makes operations such as rate() impossible on results where two distinct metrics share the same labels.

I have been debugging why the post-migration metrics are regarded as "new", and also why the test migration of the analytics instance worked as expected (i.e. no duplicated metrics, e.g. for prometheus_engine_queries), though without much success so far.

Screenshot: 2019-03-12-121312_1256x703_scrot.png (81 KB)

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "cluster_input:rsyslog_input_submitted:rate5m",
          "cluster": "syslog",
          "input": "imtcp(6514)",
          "job": "federate-ops",
          "site": "codfw"
        },
        "values": [
          [
            1552317531.172,
            "6227.620833333333"
          ],
          [
            1552320987.172,
            "6622.216666666666"
          ],
          [
            1552324443.172,
            "5789.508333333333"
          ],
          [
            1552327899.172,
            "5787.7375"
          ],
          [
            1552331355.172,
            "5859.45"
          ],
          [
            1552334811.172,
            "5783.85"
          ],
          [
            1552338267.172,
            "5989.416666666667"
          ],
          [
            1552341723.172,
            "5358.641666666666"
          ],
          [
            1552345179.172,
            "5011.729166666667"
          ],
          [
            1552348635.172,
            "4530.6"
          ],
          [
            1552352091.172,
            "4072.0625"
          ],
          [
            1552355547.172,
            "3930.0833333333335"
          ],
          [
            1552359003.172,
            "5058.954166666666"
          ],
          [
            1552362459.172,
            "4158.95"
          ],
          [
            1552365915.172,
            "3677.170833333333"
          ],
          [
            1552369371.172,
            "3602.7875"
          ],
          [
            1552372827.172,
            "3982.9791666666665"
          ]
        ]
      },
      {
        "metric": {
          "__name__": "cluster_input:rsyslog_input_submitted:rate5m",
          "cluster": "syslog",
          "input": "imtcp(6514)",
          "job": "federate-ops",
          "site": "codfw"
        },
        "values": [
          [
            1551508827.172,
            "4084.982979237587"
          ],
          [
            1551512283.172,
            "4148.945833333333"
          ],
          [
            1551515739.172,
            "4971.770833333333"
          ],
          [
            1551519195.172,
            "5200.245833333333"
          ],
          [
            1551522651.172,
            "4739.388384301308"
          ],
          [
            1551526107.172,
            "4740.575609715445"
          ],
          [
            1551529563.172,
            "4838.3149456702895"
          ],
          [
            1551533019.172,
            "6006.662722090737"
          ],
          [
            1551536475.172,
            "6072.666291690102"
          ],
          [
            1551539931.172,
            "5938.783333333334"
          ],
          [
            1551543387.172,
            "5362.189436141446"
          ],
          [
            1551546843.172,
            "5840.691666666667"
          ],
          [
            1551550299.172,
            "6249.484364257588"
          ],
          [
            1551553755.172,
            "5367.790134207775"
          ],
          [
            1551557211.172,
            "5077.970508456215"
          ],
          [
            1551560667.172,
            "5863.291666666667"
          ],
          [
            1551564123.172,
            "5608.3375"
          ],
          [
            1551567579.172,
            "4994.033333333334"
          ],
          [
            1551571035.172,
            "4321.425"
          ],
          [
            1551574491.172,
            "4545.67272636364"
          ],
          [
            1551577947.172,
            "4845.540346271641"
          ],
          [
            1551581403.172,
            "4758.295833333334"
          ],
          [
            1551584859.172,
            "4427.995833333333"
          ],
          [
            1551588315.172,
            "4090.535613841378"
          ],
          [
            1551591771.172,
            "3827.046711137592"
          ],
          [
            1551595227.172,
            "4151.033333333334"
          ],
          [
            1551598683.172,
            "4141.483333333334"
          ],
          [
            1551602139.172,
            "5255.3125"
          ],
          [
            1551605595.172,
            "5603.320833333333"
          ],
          [
            1551609051.172,
            "5243.138856135761"
          ],
          [
            1551612507.172,
            "5008.645833333333"
          ],
          [
            1551615963.172,
            "5560.741666666667"
          ],
          [
            1551619419.172,
            "5331.166666666667"
          ],
          [
            1551622875.172,
            "6407.958333333333"
          ],
          [
            1551626331.172,
            "6129.576384895534"
          ],
          [
            1551629787.172,
            "6187.153654491971"
          ],
          [
            1551633243.172,
            "5989.604166666667"
          ],
          [
            1551636699.172,
            "5726.383333333333"
          ],
          [
            1551640155.172,
            "5470.391666666666"
          ],
          [
            1551643611.172,
            "5667.6125"
          ],
          [
            1551647067.172,
            "5855.615418911542"
          ],
          [
            1551650523.172,
            "5806.896637930675"
          ],
          [
            1551653979.172,
            "4737.064691127574"
          ],
          [
            1551657435.172,
            "4633.558333333333"
          ],
          [
            1551660891.172,
            "4225.195833333333"
          ],
          [
            1551664347.172,
            "5110.029166666666"
          ],
          [
            1551667803.172,
            "4603.169311926797"
          ],
          [
            1551671259.172,
            "4626.004200140005"
          ],
          [
            1551674715.172,
            "3860.6541666666667"
          ],
          [
            1551678171.172,
            "3823.0355453890156"
          ],
          [
            1551681627.172,
            "4011.3375"
          ],
          [
            1551685083.172,
            "4494.039409482776"
          ]
        ]
      }
    ]
  }
}
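A response like the one above can be checked for this symptom mechanically. The following is a minimal sketch, not something taken from the task: `duplicate_labelsets` is an illustrative helper and `SAMPLE` is a trimmed-down stand-in for the JSON above, keeping one value per series.

```python
import json
from collections import Counter


def duplicate_labelsets(response):
    """Return label sets that occur more than once in a 'matrix' result."""
    counts = Counter(
        tuple(sorted(series["metric"].items()))
        for series in response["data"]["result"]
    )
    return [dict(labels) for labels, n in counts.items() if n > 1]


# Trimmed-down stand-in for the response above: two series, identical labels.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {"metric": {"__name__": "cluster_input:rsyslog_input_submitted:rate5m",
                  "cluster": "syslog", "input": "imtcp(6514)",
                  "job": "federate-ops", "site": "codfw"},
       "values": [[1552317531.172, "6227.620833333333"]]},
      {"metric": {"__name__": "cluster_input:rsyslog_input_submitted:rate5m",
                  "cluster": "syslog", "input": "imtcp(6514)",
                  "job": "federate-ops", "site": "codfw"},
       "values": [[1551508827.172, "4084.982979237587"]]}
    ]
  }
}
""")

if __name__ == "__main__":
    for labels in duplicate_labelsets(SAMPLE):
        print("duplicate series:", labels)
```

An empty return value means every series in the result has a distinct label set, which is what a healthy storage should produce.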

After focused code reading and head scratching, it turns out the root cause is that persisted metrics weren't sorted during migration. This used to work fine, but in Prometheus >= 2.5 sorting is a requirement. The upstream issue now has an MR to fix this problem: https://gitlab.com/gitlab-org/prometheus-storage-migrator/issues/5.
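As a toy illustration of why ordering can matter, the sketch below merges two lists of series keys with a two-pointer merge that assumes both inputs are sorted. This is not Prometheus or migrator code, and how Prometheus actually merges series internally is an assumption here. The function name and the keys are made up.

```python
def merge_assuming_sorted(a, b):
    """Two-pointer merge of two key lists; equal keys are combined.

    Only correct when both inputs are sorted. With unsorted input,
    equal keys may never meet and are emitted twice.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


if __name__ == "__main__":
    new_block = ["cpu", "disk", "net"]           # sorted keys
    migrated_sorted = ["cpu", "disk", "net"]     # sorted keys
    migrated_unsorted = ["net", "cpu", "disk"]   # same keys, unsorted
    print(merge_assuming_sorted(migrated_sorted, new_block))
    print(merge_assuming_sorted(migrated_unsorted, new_block))
```

With the unsorted input, "cpu" and "disk" come out twice. That resembles the symptom seen here, where identical label sets showed up as separate series.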

For good measure I'll reimage prometheus2003 and start again a migration to validate things are working as expected.

I've just repooled prometheus2003 and things seem to be working as expected!

The gap in metrics starts on March 14th at ~14 UTC. The widest gap is for the ops instance, ending on March 16th at ~23 UTC (i.e. ~2.5 days); other instances have shorter gaps because they took less time to migrate.
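The size of the ops gap follows from the two approximate timestamps. A small sketch, assuming the year is 2019 (taken from the surrounding events) and rounding both times to the hour:

```python
from datetime import datetime, timezone

# Approximate timestamps from the comment above (year assumed to be 2019, UTC).
gap_start = datetime(2019, 3, 14, 14, 0, tzinfo=timezone.utc)
gap_end_ops = datetime(2019, 3, 16, 23, 0, tzinfo=timezone.utc)

gap = gap_end_ops - gap_start
gap_hours = gap.total_seconds() / 3600

if __name__ == "__main__":
    print(f"ops gap: {gap_hours:.0f}h (~{gap_hours / 24:.1f} days)")
```

This gives 57 hours, in line with the "~2.5 days" quoted above.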

Change 490834 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: don't require Prometheus::Server when writing k8s token

https://gerrit.wikimedia.org/r/490834

Change 497746 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: run prometheus 2 on prometheus1003

https://gerrit.wikimedia.org/r/497746

Change 497746 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: run prometheus 2 on prometheus1003

https://gerrit.wikimedia.org/r/497746

Change 497761 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: use yaml rules for prometheus v2 k8s-staging

https://gerrit.wikimedia.org/r/497761

Change 497761 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: use yaml rules for prometheus v2 k8s-staging

https://gerrit.wikimedia.org/r/497761

Status update: all instances but ops on prometheus1003 have finished migrating. This time I changed the strategy to actually backfill new + migrated storage (which was the plan all along!) and the maximum gap observed is ~1.5h which is acceptable IMHO.

Next steps once prometheus1003 has finished migrating:

Put prometheus1003 in rotation to serve traffic by itself, to prove it works as expected
Reimage prometheus1004 with stretch and install Prometheus 2
Copy data from prometheus1003 onto prometheus1004
Put prometheus1004 in service

Change 499742 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: set v2 max block duration to 24h

https://gerrit.wikimedia.org/r/499742

Status update: codfw and eqiad are served by Prometheus v2 now (single host in each, prometheus2003 and prometheus1003 respectively). The Prometheus v1 hosts are depooled now and I'll reimage them early next week after no problems are reported. Backfill for those will be simpler because there's no migration involved: we'll be copying the already migrated storage from the existing v2 hosts.

As far as eqsin/esams/ulsfo are concerned I've thought about rsync'ing the data off hosts for migration, however there are a lot of metric files involved and rsync is going to take multiple hours before it finishes. During the rsync + migration time there will be no metric history, however new metrics will be collected as soon as Prometheus v2 is started.

A faster strategy is to bypass the filesystem and copy an LV snapshot off the host instead, which usually happens at 50+ MB/s. Unfortunately the VGs on the bastions don't have free extents, so to take snapshots we'll have to resize the LV and the filesystem, which entails Prometheus downtime.
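To get a feel for what 50 MB/s means in practice, here is a rough estimate of snapshot copy time. The LV sizes in the loop are hypothetical, since the task does not give the bastion data sizes, and `transfer_hours` is an illustrative helper.

```python
def transfer_hours(size_gb, rate_mb_s=50):
    """Hours to copy size_gb gigabytes at rate_mb_s MB/s (decimal units)."""
    return size_gb * 1000 / rate_mb_s / 3600


if __name__ == "__main__":
    # Hypothetical LV sizes, not measured values from the bastions.
    for size_gb in (50, 100, 250):
        print(f"{size_gb} GB at 50 MB/s: ~{transfer_hours(size_gb):.1f}h")
```

At that rate 180 GB takes exactly one hour, so the copy time scales at roughly 20 seconds per GB.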

To recap the two strategies:

1. rsync data and start Prometheus v2
   Pros: safe, minimal Prometheus downtime.
   Cons: takes longer to transfer metrics off-host and begin migration on e.g. graphite2002 (exact numbers TBD), during which time metrics history won't be available.

2. Shrink the existing FS and transfer a snapshot
   Pros: faster data transfer, so migration can begin sooner.
   Cons: Prometheus will be down while the FS is shrinking, which can take some time (number unknown).

• ema closed subtask T190992: prometheus: slow dashboards due to suboptimal query_range performance as Resolved.Apr 1 2019, 4:00 PM

• ema closed subtask T212312: prometheus-based graph significantly slower than statsd equivalent as Resolved.

fgiunchedi renamed this task from Serve >= 50% of production Prometheus systems with Prometheus v2 to 100% of Prometheus traffic served by Prometheus v2.Apr 4 2019, 1:06 PM

fgiunchedi edited parent tasks, added: T220104: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal); removed: T213288: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal).

fgiunchedi moved this task from Inbox to In progress on the observability board.Apr 15 2019, 2:50 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-17T08:47:06Z] <godog> reimage prometheus1004 - T187987

Mentioned in SAL (#wikimedia-operations) [2019-04-17T12:10:16Z] <godog> briefly stop all prometheus on prometheus1003 to finish metrics rsync - T187987

Change 504540 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: Prometheus v2 for prometheus1004

https://gerrit.wikimedia.org/r/504540

Change 504540 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: Prometheus v2 for prometheus1004

https://gerrit.wikimedia.org/r/504540

Change 499742 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: set v2 max block duration to 24h

https://gerrit.wikimedia.org/r/499742

Mentioned in SAL (#wikimedia-operations) [2019-04-17T13:47:51Z] <godog> reimage prometheus2004 - T187987

Change 504552 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: run Prometheus v2 on prometheus2004

https://gerrit.wikimedia.org/r/504552

Change 504552 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: run Prometheus v2 on prometheus2004

https://gerrit.wikimedia.org/r/504552

Both prometheus1004 and prometheus2004 are now in service with Prometheus v2! So far no issues, syncing the whole storage from their counterparts took ~2h each.

Change 490834 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: don't require Prometheus::Server when writing k8s token

https://gerrit.wikimedia.org/r/490834

Change 505712 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: prometheus v2 on bast5001

https://gerrit.wikimedia.org/r/505712

Change 505712 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: prometheus v2 on bast5001

https://gerrit.wikimedia.org/r/505712

Mentioned in SAL (#wikimedia-operations) [2019-04-23T09:23:15Z] <godog> upgrade prometheus to v2 on bast5001, previous metrics will not be available until migration and backfill are complete - T187987

Mentioned in SAL (#wikimedia-operations) [2019-04-29T08:39:55Z] <godog> begin migration of bast4002 to prometheus v2 - T187987

Change 506950 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: bast4002 to Prometheus v2

https://gerrit.wikimedia.org/r/506950

Change 506950 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: bast4002 to Prometheus v2

https://gerrit.wikimedia.org/r/506950

Mentioned in SAL (#wikimedia-operations) [2019-04-30T08:22:36Z] <godog> bounce prometheus on bast4002 after backfill has finished - T187987

fgiunchedi updated the task description. (Show Details)Apr 30 2019, 3:31 PM

fgiunchedi updated the task description. (Show Details)

Labmon hosts are in active/standby pairs for graphite, whereas Prometheus runs on both independently, but only labmon1001 is used for queries via prometheus-labmon.eqiad.wmnet. There's very little Prometheus data on those hosts, so my plan is to run the migrator on the host itself, starting with labmon1002, let Prometheus 2 run for a while, then copy the data over to labmon1001.

In other words, the steps on labmon1002 will be:

stop puppet

stop prometheus

start migrating data

upgrade prometheus to v2

run puppet

once migration has finished, backfill migrated data

cc cloud-services-team as I'd like to perform this work on Thurs May 2nd on labmon1002, and upgrade labmon1001 early next week.

In T187987#5147514, @fgiunchedi wrote:

Labmon hosts are in active/standby pairs for graphite, whereas Prometheus runs on both independently, but only labmon1001 is used for queries via prometheus-labmon.eqiad.wmnet. There's very little Prometheus data on those hosts, so my plan is to run the migrator on the host itself, starting with labmon1002, let Prometheus 2 run for a while, then copy the data over to labmon1001.

In other words, the steps on labmon1002 will be:

stop puppet

stop prometheus

start migrating data

upgrade prometheus to v2

run puppet

once migration has finished, backfill migrated data

cc cloud-services-team as I'd like to perform this work on Thurs May 2nd on labmon1002, and upgrade labmon1001 early next week.

works for me! Ping me on IRC if you need anything.

Change 507751 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: labmon1002 to prometheus v2

https://gerrit.wikimedia.org/r/507751

Change 507751 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: labmon1002 to prometheus v2

https://gerrit.wikimedia.org/r/507751

labmon1002 has been migrated and seems to be working, I'll upgrade labmon1001 early next week.

Change 508280 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: labmon1001 to prometheus v2

https://gerrit.wikimedia.org/r/508280

Change 508280 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: labmon1001 to prometheus v2

https://gerrit.wikimedia.org/r/508280

fgiunchedi updated the task description. (Show Details)May 6 2019, 9:03 AM

Mentioned in SAL (#wikimedia-operations) [2019-05-06T09:03:20Z] <godog> upgrade labmon1001 to prometheus 2 - T187987

Change 508503 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: bast3002 to prometheus v2

https://gerrit.wikimedia.org/r/508503

Change 508503 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: bast3002 to prometheus v2

https://gerrit.wikimedia.org/r/508503

Mentioned in SAL (#wikimedia-operations) [2019-05-07T07:27:03Z] <godog> upgrade prometheus on bast3002 - T187987

Change 509052 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: remove v2 feature flag

https://gerrit.wikimedia.org/r/509052

Change 509052 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: remove v2 feature flag

https://gerrit.wikimedia.org/r/509052

Completed! The whole production and WMCS Prometheus fleet has been migrated to Prometheus 2.

Change 509822 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: remove v1 rules files

https://gerrit.wikimedia.org/r/509822

Change 509822 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: remove v1 rules files

https://gerrit.wikimedia.org/r/509822

fgiunchedi mentioned this in Blog Post: The journey to Prometheus 2.Dec 16 2019, 1:49 PM
