Setting the stage
In July 2016 the SRE team at the Wikimedia Foundation began investigating Prometheus as a modern metrics framework. The experience has been very positive and has brought benefits both in eliminating technical debt (such as deprecating Ganglia and Diamond) and in adding support for newer technologies like Kubernetes.
Upstream development continued while we were rolling Prometheus out, and Prometheus 2 was released in November 2017. The most important change was the redesigned storage engine, which brought impressive gains in resource usage and paved the way for remote long-term storage.
About a year later, we were ready to migrate Prometheus 1 to Prometheus 2, with as little downtime and data loss as possible.
Migration to Prometheus 2 was necessary, and there were a couple of options.
One way was to set up Prometheus 2 to scrape new data and use the "remote read" feature to fetch missing metrics from Prometheus 1. That would require running both 1 and 2 in parallel for the full duration of the retention period. We were prepared to do this, but we knew it would be resource-intensive and should be kept as a last resort.
An alternative presented itself in the form of prometheus-storage-migrator, which migrates Prometheus 1 storage to Prometheus 2 in place. We expected this method to be cheaper to run and more reliable. We ran a small-scale smoke test of the storage-migrator, and it proved to be successful!
Our plan to migrate a Prometheus host looked like this:
- Stop Prometheus 1
- Upgrade to Prometheus 2, server binary and configuration
- Start Prometheus 2 with empty storage
- Migrate Prometheus 1 storage to Prometheus 2
- Merge migrated storage with the newly started one
This sounds simple, right?
In theory, yes! But there were some challenges along the way:
- The trickiest steps are the ones dealing with persistent data on disk, namely the last two steps. We had a large volume of metrics to convert, and these are potentially long, non-interruptible operations.
- Until the migration was finished and the storage had been merged, no past metrics would be available for querying.
The storage architecture of Prometheus 2 introduced the concept of "blocks" and "compactions" of blocks; as time goes by, smaller blocks (time spans) get compacted into bigger blocks that encompass larger time spans.
During regular operation, it is a useful feature to have compactions running in the background for multiple reasons: storage space gets more compressed, queries are faster because fewer files need to be accessed, and so on.
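To make the idea concrete, here is a toy model of compaction. It is only an illustration under simplifying assumptions: blocks are reduced to (start, end) time spans in milliseconds, and the function name and window size are made up for this example. It is not how Prometheus plans its compactions.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000


def compact(blocks, window_ms):
    """Toy compaction: merge blocks that start in the same aligned window.

    `blocks` is a list of (min_time, max_time) tuples in milliseconds.
    Hypothetical helper for illustration, not Prometheus' real planner.
    """
    grouped = defaultdict(list)
    for min_t, max_t in blocks:
        # Blocks whose start falls in the same window end up in one group.
        grouped[min_t // window_ms].append((min_t, max_t))

    merged = []
    for window in sorted(grouped):
        members = grouped[window]
        merged.append((min(m[0] for m in members), max(m[1] for m in members)))
    return merged


# Six consecutive 2h blocks covering 12 hours...
two_hour_blocks = [(i * 2 * HOUR_MS, (i + 1) * 2 * HOUR_MS) for i in range(6)]
# ...become two 6h blocks covering the same 12 hours.
six_hour_blocks = compact(two_hour_blocks, 6 * HOUR_MS)
```

The time range covered stays the same; only the number of blocks (and therefore files to open at query time) goes down.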
However, compactions need to be disabled during migration. prometheus-storage-migrator doesn't run compactions itself; for Prometheus 2, compactions are disabled by setting --storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h.
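As a rough sketch, the command line for the migration window could be assembled like this. The two block-duration flags are the ones quoted above; the helper name and the file paths are hypothetical and only serve the example.

```python
def prometheus2_args(config_file, storage_path, compactions_enabled):
    """Build a Prometheus 2 command line (hypothetical helper).

    When compactions are disabled, the minimum and maximum block
    durations are pinned to the same value, as described in the text.
    """
    args = [
        "prometheus",
        f"--config.file={config_file}",
        f"--storage.tsdb.path={storage_path}",
    ]
    if not compactions_enabled:
        args += [
            "--storage.tsdb.max-block-duration=2h",
            "--storage.tsdb.min-block-duration=2h",
        ]
    return args


# Example paths are assumptions, not our production layout.
migration_args = prometheus2_args(
    "/etc/prometheus/prometheus.yml",
    "/srv/prometheus/metrics",
    compactions_enabled=False,
)
```

Once the merge has been validated, dropping the two extra flags and restarting gives compactions back.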
The resulting storage can then be merged by copying the non-overlapping migrated storage blocks into Prometheus 2's storage.
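A minimal sketch of that copy step follows. It relies on each Prometheus 2 block directory carrying a meta.json with minTime and maxTime; the function names are made up, time ranges are treated here as half-open, and this is not the exact procedure we ran in production.

```python
import json
import shutil
from pathlib import Path


def block_ranges(storage_dir):
    """Map each block directory to its (minTime, maxTime) from meta.json."""
    ranges = {}
    for meta in Path(storage_dir).glob("*/meta.json"):
        data = json.loads(meta.read_text())
        ranges[meta.parent] = (data["minTime"], data["maxTime"])
    return ranges


def overlaps(a, b):
    # Assumption: ranges are half-open, [minTime, maxTime).
    return a[0] < b[1] and b[0] < a[1]


def copy_non_overlapping(migrated_dir, live_dir):
    """Copy migrated blocks that do not overlap any block already in live_dir.

    Returns the names of the copied block directories, oldest first.
    """
    live = list(block_ranges(live_dir).values())
    copied = []
    migrated = sorted(block_ranges(migrated_dir).items(), key=lambda item: item[1])
    for block, span in migrated:
        if any(overlaps(span, other) for other in live):
            continue  # skip blocks covering time the new storage already has
        shutil.copytree(block, Path(live_dir) / block.name)
        copied.append(block.name)
    return copied
```

Running something like this against a copy of the storage first, rather than the live directory, keeps a mistake from being expensive.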
Finally, once the merged storage has been validated for correctness (e.g. are all metrics there? Are datapoints present for the migrated time span? etc.), compactions should be re-enabled in Prometheus 2.
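One cheap check along these lines is to look for holes between the time spans of the merged blocks. The sketch below works on plain (start, end) tuples and is only an assumption about how such a check could look; it inspects block time ranges, not the samples themselves, so it complements rather than replaces querying actual metrics.

```python
def coverage_gaps(spans):
    """Return the (start, end) gaps between block time spans.

    `spans` is an iterable of (min_time, max_time) tuples; overlapping
    or touching spans are treated as continuous coverage.
    """
    gaps = []
    covered_until = None
    for min_t, max_t in sorted(spans):
        if covered_until is not None and min_t > covered_until:
            gaps.append((covered_until, min_t))
        if covered_until is None or max_t > covered_until:
            covered_until = max_t
    return gaps


# Hours used as units for readability: blocks cover 0-4 and 7-9.
example_gaps = coverage_gaps([(0, 2), (2, 4), (7, 9)])
```

In this example the only hole is between hours 4 and 7, which is the kind of gap left between migrated and newly written storage.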
Problems are to be expected when dealing with any significant amount of state, and this case was no different. While smoke testing storage-migrator on non-production datasets showed no obvious problems, migrating production metrics turned out to be less straightforward.
The first issue was storage-migrator bailing with:
unable to add to v2 storage: unknown series: not found
After some code reading and help from upstream in issue 4, we were able to skip these metrics and carry on with the migration.
The second issue we ran into was more difficult to track down and fix.
The migration would complete successfully, and the results appeared correct. However, rate() on time spans that included both old and new metrics would break with the error "instant/range vector cannot contain two metrics with the same labelset".
The resulting storage was unusable. This was not great! After much frustration, code reading and fmt.Printf-based debugging, we tracked down the source of the problem.
Labels in Prometheus 1 storage were not always sorted, and storage-migrator would write such labels as-is, without sorting. Such results were tolerated by Prometheus until 2.5 but would cause errors in our Prometheus 2.7 instance.
The end result was a one-line patch to prometheus-storage-migrator to sort metric labels before writing to Prometheus v2 storage.
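The effect of label ordering can be illustrated with a few lines of Python. This is not the actual patch (storage-migrator is written in Go); the label names and values are invented, and labels are modelled simply as a list of (name, value) pairs.

```python
def sorted_labels(labels):
    """Return the labels ordered by label name."""
    return sorted(labels, key=lambda pair: pair[0])


# The same series as it might appear in unsorted Prometheus 1 storage...
migrated = [("job", "node"), ("__name__", "up"), ("instance", "host1:9100")]
# ...and as written with labels in sorted order.
fresh = [("__name__", "up"), ("instance", "host1:9100"), ("job", "node")]

# Compared as ordered sequences, the two look like different series.
looks_different = migrated != fresh
# After sorting, they are recognized as the same labelset.
same_after_sort = sorted_labels(migrated) == fresh
```

Anything that identifies a series by its ordered labels sees two entries here until the order is normalized, which matches the duplicate labelset symptom we observed.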
After a multi-month effort, we were able to successfully migrate the whole production fleet to Prometheus 2. The fleet consists of 9 hosts, including redundant pairs, holds more than 1TB of metrics, and monitors about 1500 hosts across 5 different locations.
The migration was performed with minimal downtime and minimal loss of redundancy, and the maximum time gap between new and migrated storage was on the order of 2-3 hours. Huge thanks to GitLab for their prometheus-storage-migrator! If you are interested in the gory details, see T187987 and related tasks.
Thank you @fgiunchedi for sharing this story. Congratulations for working through some difficult bugs to roll out this important upgrade.
You are very welcome! Happy to share. Also thanks to all folks that provided feedback on this post!