prometheus: upgrade to >= 2.12
Closed, Resolved, Public

Description

Follow-up for T222112 and https://wikitech.wikimedia.org/wiki/Incident_documentation/20190425-prometheus.

bbrazil says we might be helped by upgrading to >= 2.9.2, citing "various improvements to the [tsdb] postings logic" made between 2.7.2 and then.

Event Timeline

+1 to testing/PoC 2.9.2; we're using the Debian Prometheus packages mostly verbatim, but adding back the k8s discovery + dependencies, as they are not shipped in Debian. For testing purposes we can even simply use upstream's binary tho.

Dzahn triaged this task as Medium priority. Apr 30 2019, 9:33 PM
fgiunchedi renamed this task from prometheus: upgrade to 2.9.2 to prometheus: upgrade to 2.11. Jul 8 2019, 1:09 PM
fgiunchedi updated the task description.
fgiunchedi renamed this task from prometheus: upgrade to 2.11 to prometheus: upgrade to 2.12. Aug 19 2019, 1:00 PM
fgiunchedi renamed this task from prometheus: upgrade to 2.12 to prometheus: upgrade to >= 2.12. Jul 6 2020, 12:05 PM

I took the chance of the upcoming Bullseye upgrade to build a Prometheus 2.24.1 + k8s package in wmf/bullseye in the operations/debs/prometheus repo; the package works fine on Buster too, so we can use it to upgrade Prometheus across the board.

The package works fine on Buster, though Stretch is trickier (prometheus codfw/eqiad are stretch) because of missing or old dependencies, namely:

  • libjs-popper.js isn't in stretch
  • libjs-jquery (>= 3.5.1~) is needed

Since the codfw/eqiad Prometheus hosts are going to be replaced with new HW in Q2, I'm going to force-install prometheus on the stretch hosts for now. This means the UI won't necessarily be functional in the meantime (i.e. when accessed via ssh tunnels); however, nowadays the UI at https://thanos.wikimedia.org should be used for queries instead.

I think we might need some equivs dummy packages, though: if we force-install packages, this will be flagged by the Icinga checks for the dpkg state, I think.
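For reference, a minimal sketch of what such an equivs dummy package could look like, assuming the placeholder only needs to satisfy the missing libjs-popper.js dependency. The helper name write_equivs_control, the version string and the description are made up for illustration; only the package name comes from the dependency list above.

```shell
#!/bin/sh
# Sketch only: print an equivs control file for an empty placeholder package.
# write_equivs_control is a hypothetical helper; the version and description
# passed in below are assumptions, not values used in this task.
write_equivs_control() {
    pkg="$1"
    version="$2"
    cat <<EOF
Section: misc
Priority: optional
Standards-Version: 3.9.2

Package: ${pkg}
Version: ${version}
Architecture: all
Description: placeholder to satisfy a dependency on ${pkg}
 Empty package built with equivs; ships no files.
EOF
}

# Print the control file for the package that is missing on stretch.
write_equivs_control libjs-popper.js 1.0

# With the equivs package installed, the output could be saved to a file
# and turned into a .deb with:
#   equivs-build libjs-popper.js.control
```

The versioned dependency on libjs-jquery (>= 3.5.1~) is deliberately left out: a placeholder there would shadow the real library, which is a different trade-off.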

Good point. AFAICT the package isn't reported as broken and the Icinga check is happy (I force-installed the same version on a stretch Pontoon host, since I had already upgraded prometheus there for testing, but you get the idea):

filippo@ms-fe-02:~$ sudo dpkg -i --force-depends prometheus_2.24.1+ds-1+wmf1_amd64.deb 
(Reading database ... 75243 files and directories currently installed.)
Preparing to unpack prometheus_2.24.1+ds-1+wmf1_amd64.deb ...
Unpacking prometheus (2.24.1+ds-1+wmf1) over (2.24.1+ds-1+wmf1) ...
dpkg: prometheus: dependency problems, but configuring anyway as you requested:
 prometheus depends on fonts-glyphicons-halflings; however:
  Package fonts-glyphicons-halflings is not installed.
 prometheus depends on libjs-bootstrap4; however:
  Package libjs-bootstrap4 is not installed.
 prometheus depends on libjs-jquery (>= 3.5.1~); however:
  Version of libjs-jquery on system is 3.1.1-2+deb9u2.
 prometheus depends on libjs-popper.js; however:
  Package libjs-popper.js is not installed.

Setting up prometheus (2.24.1+ds-1+wmf1) ...
Processing triggers for systemd (232-25+deb9u13) ...
Processing triggers for man-db (2.7.6.1-2) ...
filippo@ms-fe-02:~$ /usr/local/lib/nagios/plugins/check_dpkg
All packages OK
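
The check presumably stays green because the force-installed package still ends up fully configured, so dpkg records it as installed. A rough sketch of a state check along those lines follows; this is not the code of check_dpkg (which isn't shown here), and report_bad_dpkg_state is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: read "<status-abbrev> <package>" lines on stdin, as printed by
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n'
# and list packages left in a transitional or error state.
# Fully installed (ii), removed-with-config (rc) and not-installed (un, pn)
# entries are treated as fine. Returns non-zero if anything was listed.
report_bad_dpkg_state() {
    awk '
        BEGIN { bad = 0 }
        {
            state = substr($1, 2, 1)   # second character: current status
            err   = substr($1, 3, 1)   # third character: error flag, if any
            if ((state != "i" && state != "n" && state != "c") || err != "") {
                print $2 ": " $1
                bad = 1
            }
        }
        END { exit bad }
    '
}

# Usage on a Debian host:
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' | report_bad_dpkg_state
```

Unmet dependencies are a separate question from package state; a check like this would not notice them.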

Mentioned in SAL (#wikimedia-operations) [2021-08-03T11:18:32Z] <godog> upgrade prometheus5001 to 2.24.1+ds-1+wmf1 - T222113

Mentioned in SAL (#wikimedia-operations) [2021-08-03T11:28:22Z] <godog> upgrade prometheus3001 to 2.24.1+ds-1+wmf1 - T222113

Mentioned in SAL (#wikimedia-operations) [2021-08-04T08:41:54Z] <godog> pool prometheus1003 (and depool prometheus1004 for testing 1003 only) - T222113

@jcrespo reported issues with minio scraping by prometheus, and indeed Prometheus' TLS certificate validation changed due to a change in Go 1.15. The error:

Get "https://backup1004.eqiad.wmnet:9000/minio/v2/metrics/cluster": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

Since around 8-8:40 UTC, minio scraping is failing on all backup* hosts, with:

Aug 04 08:08:57 backup1004 minio[2621]: http: TLS handshake error from 10.64.0.123:41564: remote error: tls: bad certificate

We had this issue before because requests were not using the FQDN, but that was fixed here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/705694 and scraping had worked nicely since then.

Using curl to scrape metrics seems to work correctly on both localhost (backup1004) and from prometheus1003.

FYI, I am (re)using, perhaps incorrectly, the automatically generated host puppet certs for this (minio), in case someone else is doing the same and gets bitten by this error.
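
The x509 error above says the certificate relies on the Common Name field and has no usable SANs. A quick way to confirm that for a given certificate might look like the sketch below; has_san is a hypothetical helper that only inspects the text dump from openssl, and the curl success would be consistent with curl's TLS library still accepting a CN-only certificate.

```shell
#!/bin/sh
# Sketch only: read `openssl x509 -noout -text` output on stdin and report
# whether the certificate carries a Subject Alternative Name extension.
# has_san is a made-up name for illustration.
has_san() {
    if grep -q 'X509v3 Subject Alternative Name'; then
        echo "SAN present"
    else
        echo "no SAN found: certificate relies on the Common Name only"
        return 1
    fi
}

# Usage against the endpoint from the error message (needs network access):
#   openssl s_client -connect backup1004.eqiad.wmnet:9000 </dev/null 2>/dev/null \
#     | openssl x509 -noout -text | has_san
```

If no SAN is found, reissuing the certificate with SANs is the durable fix; the GODEBUG=x509ignoreCN=0 setting named in the error is described there as temporary.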

Mentioned in SAL (#wikimedia-operations) [2021-08-04T12:23:15Z] <godog> depool prometheus2004 for upgrade - T222113

Mentioned in SAL (#wikimedia-operations) [2021-08-04T14:17:22Z] <godog> depool prometheus2004 and pool prometheus2003 - T222113

Mentioned in SAL (#wikimedia-operations) [2021-08-04T14:28:46Z] <godog> upgrade prometheus on prometheus4001 - T222113

Mentioned in SAL (#wikimedia-operations) [2021-08-04T14:30:40Z] <godog> upgrade prometheus on cloudmetrics hosts - T222113

Change 710491 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Switch TLS certificates to PKI rather than puppet

https://gerrit.wikimedia.org/r/710491

Change 710491 merged by Jcrespo:

[operations/puppet@production] mediabackups: Switch TLS certificates to PKI rather than puppet

https://gerrit.wikimedia.org/r/710491

Change 710503 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Make /etc/minio/ssl writeable by owner

https://gerrit.wikimedia.org/r/710503

Change 710503 merged by Jcrespo:

[operations/puppet@production] mediabackups: Make /etc/minio/ssl writeable by owner

https://gerrit.wikimedia.org/r/710503

Mentioned in SAL (#wikimedia-operations) [2021-08-09T08:41:44Z] <godog> upgrade prometheus on prometheus1004 - T222113

Mentioned in SAL (#wikimedia-operations) [2021-08-09T08:46:20Z] <godog> upgrade prometheus on prometheus2004 - T222113

fgiunchedi claimed this task.

This is complete! 2.24.1+ds-1+wmf1 is running in production