Maniphest T205870

Fully migrate producers off statsd
Open, Medium, Public

Assigned To

None

Authored By

	fgiunchedi
	Oct 1 2018, 2:25 PM

Details

Subject	Repo	Branch	Lines +/-
services: update the rec-api's Docker image	operations/deployment-charts	master	+2 -4
Upgrade to service-runner 3.1.0	mediawiki/services/recommendation-api	master	+3 K -8 K
services: update rec-api's staging Docker image	operations/deployment-charts	master	+1 -1
services: update Docker image and settings for Recommendation API	operations/deployment-charts	master	+2 -7
recommendation-api: update monitoring config	operations/deployment-charts	master	+29 -5
Move the only metric produced from Gauge to Histogram	mediawiki/services/recommendation-api	master	+4 -3
Use "set" instead of "endTiming" in makeMetric	mediawiki/services/recommendation-api	master	+1 -1
update service-runner to 2.8.0 and implement new metrics api	mediawiki/services/recommendation-api	master	+15 -11
scb: enable statsd_exporter and add matching rules	operations/puppet	production	+37 -0
citoid: switch to native prometheus metrics	operations/deployment-charts	master	+8 -83
Move remaining metrics to prometheus	mediawiki/services/citoid	master	+85 -22
Update to service-template-node 0.10.0.	mediawiki/services/citoid	master	+24 -16
Metrics: Wire up MetricsFactory into ServiceWiring and emit steps	mediawiki/core	master	+60 -0
Metrics: Wire up MetricsFactory into ServiceWiring	mediawiki/core	master	+22 -0
Metrics: Perform MetricsFactory->flush() in emitBufferedStatsdData()	mediawiki/core	master	+12 -2
Metrics: Implement statsd-exporter compatible Metrics interface	mediawiki/core	master	+1 K -0
Metrics: Implement and enable statsd-exporter compatible Metrics interface	mediawiki/core	master	+996 -7
Switch to native prometheus latency histograms	mediawiki/services/mathoid	master	+17 -11
mathoid: Bump deployed version	operations/deployment-charts	master	+1 -1
update service-runner dependency to 2.8.0 and implement new metrics api	mediawiki/services/chromium-render	master	+85 -26
update service-runner to 2.8.0 and implement new metrics api	mediawiki/services/mobileapps	master	+27 -17
logstash: output webrequest 5xx metrics	operations/puppet	production	+8 -0
profile: disable statsd_exporter relay for ores	operations/puppet	production	+3 -1
Use new service-runner metrics for built in prometheus metrics	mediawiki/services/eventstreams	master	+46 -219
lvs, monitoring: prometheus expects params value as string[] type	operations/puppet	production	+16 -16
monitoring, profile, prometheus: bugfix, prometheus params values	operations/puppet	production	+26 -18
lvs, monitoring, prometheus: bugfix openapi exports	operations/puppet	production	+13 -9
lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes	operations/puppet	production	+211 -83
scb: add graphoid matching rules and deploy statsd exporter to scb cluster	operations/puppet	production	+59 -0
update service-runner to 2.8.0 and hyperswitch to 0.14.0	analytics/aqs	master	+2 -2
proton: enable statsd_exporter and add matching rules to profile::proton	operations/puppet	production	+79 -0
hiera: update ores to pass statsd through the statsd exporter	operations/puppet	production	+2 -0
swift: stop relaying to statsd/statsite	operations/puppet	production	+0 -0
hiera: update ores to pass statsd through statsd_exporter	operations/puppet	production	+2 -0
profile, prometheus, role: install swagger exporter on prometheus nodes	operations/puppet	production	+16 -1
logstash: stop relaying to central statsd	operations/puppet	production	+3 -1
hiera: disable statsd_exporter::relay_address on logstash nodes	operations/puppet	production	+4 -2
profile: use prometheus for logstash alerting	operations/puppet	production	+29 -29
prometheus: make statsd.relay-address toggle-able	operations/puppet	production	+7 -2
thumbor: stop relaying to statsd/statsite	operations/puppet	production	+6 -2
swift: remove statsite	operations/puppet	production	+3 -0
swift: stop relaying to statsd/statsite	operations/puppet	production	+2 -2
swift: port alerts to Prometheus	operations/puppet	production	+64 -64
grafana: use Prometheus swift metrics for dashboard	operations/puppet	production	+89 -94
prometheus: collect swift account/container stats globally	operations/puppet	production	+3 -0
hiera: fix statsd rules	operations/puppet	production	+9 -9
logstash: update statsd exporter mappings and use exporter	operations/puppet	production	+22 -18
add statsd_exporter config to mathoid	operations/deployment-charts	master	+20 -0
mediawiki: enable statsd_exporter and add matching rules to appserver	operations/puppet	production	+915 -0
varnish: enable statsd_exporter and add matching rules	operations/puppet	production	+164 -0
profile: enable statsd_exporter and add matching rules to logstash::collector	operations/puppet	production	+11 -0
profile: enable statsd_exporter and add matching rules to ores::worker	operations/puppet	production	+204 -0
ci: use statsite for localhost statsd aggregation	operations/puppet	production	+3 -2
statsite: move from role to profile	operations/puppet	production	+6 -6
hieradata: send periodic swift stats to localhost	operations/puppet	production	+3 -0
hieradata: switch all swift statsd traffic to statsd_exporter	operations/puppet	production	+3 -0
swift: add statsd mappings for periodic metrics	operations/puppet	production	+42 -0
swift: turn on statsd_exporter in eqiad	operations/puppet	production	+2 -0
swift: turn on statsd_exporter in codfw	operations/puppet	production	+2 -0
thumbor: relay statsd_exporter metrics to localhost	operations/puppet	production	+3 -1
swift: add statsd_port parameter	operations/puppet	production	+17 -10
hieradata: add statsd_exporter mappings for swift::storage	operations/puppet	production	+187 -2
hieradata: rename swift proxy statsd_exporter mapping	operations/puppet	production	+28 -28
swift: set statsd_exporter to relay to local statsd	operations/puppet	production	+6 -2
hieradata: add statsd_exporter mappings for swift-proxy	operations/puppet	production	+114 -2
swift: enable statsd_exporter	operations/puppet	production	+2 -0
prometheus: set defaults for statsd_exporter	operations/puppet	production	+14 -1
prometheus: add jobs for statsd_exporter	operations/puppet	production	+18 -1
thumbor: add missing statsd_exporter mappings	operations/puppet	production	+9 -0
thumbor: use statsd_exporter	operations/puppet	production	+3 -0
thumbor: add prometheus-statsd-exporter	operations/puppet	production	+96 -0
thumbor: set name for all statsd_exporter metrics	operations/puppet	production	+1 -0
thumbor: fix missing statsd_exporter mappings	operations/puppet	production	+8 -0
statsd_exporter: fix commandline flags	operations/puppet	production	+1 -1
New class: prometheus::statsd_exporter	operations/puppet	production	+76 -0
debian: add patch for inline udp usage	operations/debs/prometheus-statsd-exporter	master	+75 -0

Related Objects

Status	Assigned	Task
Open	None	T228380 Tech debt: sunsetting of Graphite
Open	None	T205870 Fully migrate producers off statsd
Resolved	colewhite	T233089 Export zuul metrics to Prometheus
Resolved	• ACraze	T233448 Review prometheus ORES rules for completeness
Declined	colewhite	T239833 StatsD Exporter drops relayed metrics
Resolved	colewhite	T240685 MediaWiki Prometheus support
Resolved	colewhite	T249164 RFC: Better interface for generating metrics in MediaWiki
Resolved	Krinkle	T292311 Create project tag for MediaWiki-libs-Metrics
Resolved	Krinkle	T292269 Decouple Profiler class from WebRequest and RequestContext
Resolved	Krinkle	T344748 MediaWiki Core - Review and merge StatsLib patch
Resolved	herron	T344751 Decide on default histogram buckets for MediaWiki timers
Open	None	T240995 AQS is not OpenAPI 3 compliant
Resolved	• Pchelolo	T241176 Review and release service-runner 2.8.0
Resolved	colewhite	T247820 Decide on `service-runner` aggregated prometheus metrics and use of `service` label
Resolved	Jgiannelos	T277857 Proton metrics broken
Open	None	T175087 Create a navtiming processor for Prometheus
Declined	None	T190936 navtiming.py: When processing metrics, include effectiveConnectionType as a factor
Resolved	Peter	T323124 Replace navtiming Platform tag ("site") with mw_skin
Resolved	Krinkle	T323129 Simulate client dispatch in a single scrape
Open	None	T321398 Move performance metrics from Graphite to Prometheus
Open	None	T325282 Update Grafana alerts to use metrics from Prometheus
Open	Peter	T325283 Update navtiming dashboards to use Prometheus metrics
Open	None	T325284 Update documentation to use Prometheus instead of Graphite
Open	None	T336764 Simplify navtiming multi-dc logic
Open	None	T293761 statsd and gunicorn metrics for superset

Event Timeline

There are a very large number of changes, so older changes are hidden.

JMeybohm mentioned this in T277857: Proton metrics broken.Mar 19 2021, 10:28 AM

lmata moved this task from In progress to Epics In Progress on the observability board.Jun 14 2021, 3:43 PM

• Pchelolo closed subtask T241176: Review and release service-runner 2.8.0 as Resolved.Jun 17 2021, 10:55 PM

lmata edited projects, added SRE Observability (FY2021/2022-Q1); removed observability.Jul 12 2021, 2:41 AM

lmata moved this task from Inbox to Epics In Progress on the SRE Observability (FY2021/2022-Q1) board.

Mvolz subscribed.Jul 13 2021, 10:40 AM

colewhite updated the task description. (Show Details)Jul 26 2021, 3:41 PM

lmata mentioned this in T288617: Metrics at the WMF are consolidated on the Prometheus stack and dashboards are managed as code.Aug 11 2021, 1:09 PM

akosiaris closed subtask T277857: Proton metrics broken as Resolved.Aug 23 2021, 2:53 PM

akosiaris updated the task description. (Show Details)Aug 30 2021, 3:31 PM

Change 693429 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[mediawiki/services/mathoid@master] Switch to native prometheus latency histograms

https://gerrit.wikimedia.org/r/693429

Change 693429 merged by jenkins-bot:

[mediawiki/services/mathoid@master] Switch to native prometheus latency histograms

https://gerrit.wikimedia.org/r/693429

• Mholloway unsubscribed.Sep 1 2021, 1:33 PM

Change 717115 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mathoid: Bump deployed version

https://gerrit.wikimedia.org/r/717115

Change 717115 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Bump deployed version

https://gerrit.wikimedia.org/r/717115

akosiaris updated the task description. (Show Details)Sep 3 2021, 9:17 AM

Change 721626 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Implement statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/721626

Change 721627 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Add metrics configuration options

https://gerrit.wikimedia.org/r/721627

Change 721628 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring

https://gerrit.wikimedia.org/r/721628

Change 721629 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Perform MetricsFactory->flush() in emitBufferedStatsdData()

https://gerrit.wikimedia.org/r/721629

Change 721630 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: send MetricsFactory to emit step

https://gerrit.wikimedia.org/r/721630

Change 556420 merged by jenkins-bot:

[mediawiki/services/citoid@master] Update to service-template-node 0.10.0.

https://gerrit.wikimedia.org/r/556420

Change 724129 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] pass MetricsFactory instance to emitBufferedStatsdData in MWLBFactory

https://gerrit.wikimedia.org/r/724129

lmata edited projects, added SRE Observability (FY2021/2022-Q2); removed SRE Observability (FY2021/2022-Q1).Sep 28 2021, 1:46 PM

Change 721626 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Implement statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/721626

Change 585032 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Implement and enable statsd-exporter compatible Metrics interface

Reason:

in favor of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721626/

https://gerrit.wikimedia.org/r/585032

lmata added a project: Goal.Sep 30 2021, 8:48 PM

lmata moved this task from Inbox to Up next on the SRE Observability (FY2021/2022-Q2) board.

Change 721629 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Perform MetricsFactory->flush() in emitBufferedStatsdData()

Reason:

in favor of I46f0a09f4dab38fa4c9495aa2da9ecab60376ca7

https://gerrit.wikimedia.org/r/721629

Change 721628 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring

Reason:

in favor of I46f0a09f4dab38fa4c9495aa2da9ecab60376ca7

https://gerrit.wikimedia.org/r/721628

Change 721627 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring and emit steps

https://gerrit.wikimedia.org/r/721627

ReleaseTaggerBot added a project: MW-1.38-notes (1.38.0-wmf.4; 2021-10-12).Oct 7 2021, 4:00 PM

fgiunchedi added a subtask: T175087: Create a navtiming processor for Prometheus.Oct 19 2021, 12:16 PM

• dpifke subscribed.Oct 19 2021, 3:29 PM

Ottomata mentioned this in T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats.Dec 17 2021, 7:21 PM

lmata edited projects, added SRE Observability (FY2021/2022-Q3); removed SRE Observability (FY2021/2022-Q2).Jan 12 2022, 9:43 PM

Change 767180 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/citoid@master] Move remaining metrics to prometheus

https://gerrit.wikimedia.org/r/767180

Change 767180 merged by jenkins-bot:

[mediawiki/services/citoid@master] Move remaining metrics to prometheus

https://gerrit.wikimedia.org/r/767180

Change 776233 had a related patch set uploaded (by Mvolz; author: PipelineBot):

[operations/deployment-charts@master] citoid: switch to native prometheus metrics

https://gerrit.wikimedia.org/r/776233

Change 776233 merged by jenkins-bot:

[operations/deployment-charts@master] citoid: switch to native prometheus metrics

https://gerrit.wikimedia.org/r/776233

Mvolz updated the task description. (Show Details)Apr 7 2022, 11:22 AM

This is now deployed for citoid.

I have updated grafana for the most part. However, this broke a few (minor) metrics that relied on the service-runner native ones: quantiles by status and method are broken, but overall quantiles still work. I'm not sure how to fix them, but I'm also not sure how essential they are, since it's the same info, just broken down further.

Garbage collection metrics have been broken for a while, as have memory pod metrics, and that's not related to this change.

https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid

In T205870#7837817, @Mvolz wrote:

This is now deployed for citoid.

This is great to see! Thanks for your help @Mvolz

I have updated grafana for the most part. However, this broke a few (minor) metrics that relied on the service-runner native ones: quantiles by status and method are broken, but overall quantiles still work. I'm not sure how to fix them, but I'm also not sure how essential they are, since it's the same info, just broken down further.

Garbage collection metrics have been broken for a while, as have memory pod metrics, and that's not related to this change.

ack, please feel free to contact us (SRE o11y) for assistance with the missing/broken metrics if needed

In T205870#7838013, @fgiunchedi wrote:

In T205870#7837817, @Mvolz wrote:

This is now deployed for citoid.

This is great to see! Thanks for your help @Mvolz

I have updated grafana for the most part. However, this broke a few (minor) metrics that relied on the service-runner native ones: quantiles by status and method are broken, but overall quantiles still work. I'm not sure how to fix them, but I'm also not sure how essential they are, since it's the same info, just broken down further.

Garbage collection metrics have been broken for a while, as have memory pod metrics, and that's not related to this change.

ack, please feel free to contact us (SRE o11y) for assistance with the missing/broken metrics if needed

I looked into it a while ago and didn't make any progress, and I won't be able to look at it in the next two weeks either because I'll be away, so if you'd like to have a look, be my guest! GC is broken for mathoid too, and a bunch of zotero metrics also don't work (though I'm not sure they ever did, as it's not really tooled very well).

In T205870#7838787, @Mvolz wrote:

I looked into it a while ago and didn't make any progress, and I won't be able to look at it in the next two weeks either because I'll be away, so if you'd like to have a look, be my guest! GC is broken for mathoid too, and a bunch of zotero metrics also don't work (though I'm not sure they ever did, as it's not really tooled very well).

GC metrics were removed in service-runner 2.9.0

In T205870#7837817, @Mvolz wrote:

I have updated grafana for the most part. However, this broke a few (minor) metrics that relied on the service-runner native ones: quantiles by status and method are broken, but overall quantiles still work. I'm not sure how to fix them, but I'm also not sure how essential they are, since it's the same info, just broken down further.

I took a stab at fixing citoid quantiles by (method, endpoint, status) as well as total memory, top 5 pods memory, and traffic by http status. Please have a look to see if they're fixed in a way you would expect. If something seems amiss, please feel free to make any modification you deem appropriate.
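
For readers unfamiliar with how per-label quantiles are derived from Prometheus histograms, here is a simplified sketch of the linear interpolation that `histogram_quantile` is documented to perform on cumulative buckets. It is not the actual dashboard query: the label values and bucket counts are made up for illustration, and Prometheus's special cases (for example a lowest bucket with a non-positive upper bound) are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with math.inf as the last bound.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0) if i == 0 else buckets[i - 1]
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical series keyed by (method, status); the counts are invented.
series = {
    ("GET", "200"): [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)],
    ("GET", "500"): [(0.1, 0), (0.5, 2), (1.0, 4), (math.inf, 4)],
}
p95_by_group = {labels: histogram_quantile(0.95, b) for labels, b in series.items()}
```

Because each label combination keeps its own buckets, the breakdown by method and status only works when the histogram is exported with those labels.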

lmata edited projects, added SRE Observability (FY2021/2022-Q4); removed SRE Observability (FY2021/2022-Q3).Apr 11 2022, 1:00 PM

colewhite edited projects, added SRE Observability (FY2022/2023-Q1); removed SRE Observability (FY2021/2022-Q4).Jul 1 2022, 10:59 PM

A quick update on "high frequency" statsd producers sampled over 10 minutes on graphite1004. The list is getting shorter and shorter and that's great to see!

546061489 MediaWiki
35414078 restbase
1333791 aqs
 832382 frontend
 182788 kartotherian
   7935 ve
   4908 wikibase
   3380 service_checker
   3185 restbase-dev
   3013 mw
   2061 tilerator
   1915 performance
   1620 Vector
   1554 growthExperiments
    297 tileratorui
    120 gunicorn
     59 Wikidata
     16 eventlogging
     14 cloudvps
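
A tally like the one above could be produced along these lines. This is only a sketch: it assumes the sample is a list of raw statsd lines and that the "producer" is the first dotted component of the metric name. How the numbers above were actually collected is not stated here, and the sample lines are invented.

```python
from collections import Counter


def producer_of(line):
    """Return the first dotted component of a statsd line's metric name."""
    name = line.strip().split(":", 1)[0]
    return name.split(".", 1)[0]


def tally(lines):
    """Count lines per producer, most frequent first."""
    counts = Counter(producer_of(line) for line in lines if line.strip())
    return counts.most_common()


# Invented sample in the statsd "name:value|type" line format.
sample = [
    "MediaWiki.foo.bar:1|c",
    "MediaWiki.baz:12|ms",
    "restbase.req:3|ms",
    "aqs.x:1|c",
]

for producer, count in tally(sample):
    print(f"{count:>9} {producer}")
```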

fgiunchedi mentioned this in T320620: Port openapi/swagger checks/alerts to Prometheus.Oct 12 2022, 12:01 PM

lmata edited projects, added Observability-Metrics; removed SRE Observability (FY2022/2023-Q1).Nov 7 2022, 12:38 PM

lmata moved this task from Inbox to Prioritized on the Observability-Metrics board.Jan 16 2023, 5:35 PM

Change 484586 abandoned by Cwhite:

[operations/puppet@production] scb: enable statsd_exporter and add matching rules

Reason:

https://gerrit.wikimedia.org/r/484586

Krinkle removed a project: Performance-Team (Radar).Aug 18 2023, 8:14 PM

Change 558184 merged by jenkins-bot:

[mediawiki/services/recommendation-api@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/558184

elukey mentioned this in rMSRA0ec1466aad8d: update service-runner to 2.8.0 and implement new metrics api.Nov 29 2023, 6:22 PM

Hi @colewhite! I worked with James to port the Recommendation-api to nodejs 18, and one of the patches that we merged is:

https://gerrit.wikimedia.org/r/c/mediawiki/services/recommendation-api/+/558184

When we deploy the new code, I see this error (and no metrics reported):

{"name":"recommendation-api","hostname":"recommendation-api-production-5459988bb6-g4n7q","pid":17,"level":"ERROR","levelPath":"error/metrics","msg":"endTiming() unsupported for metric type Gauge","time":"2023-12-05T15:18:14.563Z","v":0}

We rolled back, but I am wondering if you have more info (I have limited knowledge about what service-runner does behind the scenes). If we solve this problem we'll be able to deploy anytime :)
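
The error message points to a mismatch between the metric type and the operation being called on it. The toy model below illustrates the general idea under stated assumptions: `Gauge`, `Histogram`, and `end_timing` are hypothetical stand-ins written for this example and are not service-runner's API.

```python
import time


class Gauge:
    """Hypothetical stand-in: holds a single current value."""

    def __init__(self, name):
        self.name = name
        self.value = 0.0

    def set(self, value):
        self.value = value

    def end_timing(self, start):
        # A gauge keeps no distribution of durations, so a timing call
        # is rejected here, similar in spirit to the logged error.
        raise TypeError("end_timing() unsupported for metric type Gauge")


class Histogram:
    """Hypothetical stand-in: counts observations per upper-bound bucket."""

    def __init__(self, name, buckets):
        self.name = name
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot is +Inf
        self.total = 0.0

    def observe(self, value):
        self.total += value
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

    def end_timing(self, start):
        # Record the elapsed time since `start` as one observation.
        elapsed = time.monotonic() - start
        self.observe(elapsed)
        return elapsed
```

In this model a gauge can still record a duration through `set`, but it only keeps the latest value, whereas a histogram retains the distribution needed for latency quantiles.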

Change 981645 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/recommendation-api@master] Use "set" instead of "endTiming" in makeMetric

https://gerrit.wikimedia.org/r/981645

Change 981645 merged by Elukey:

[mediawiki/services/recommendation-api@master] Use "set" instead of "endTiming" in makeMetric

https://gerrit.wikimedia.org/r/981645

elukey mentioned this in rMSRA0b555eb148a2: Use "set" instead of "endTiming" in makeMetric.Dec 11 2023, 9:35 AM

Change 982047 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/recommendation-api@master] Move the only metric produced from Gauge to Histogram

https://gerrit.wikimedia.org/r/982047

@colewhite hi again! I added some context to https://gerrit.wikimedia.org/r/c/mediawiki/services/recommendation-api/+/982047, now I have a better idea about what's happening. Lemme know what's best and if I am missing something!

Jdforrester-WMF subscribed.Dec 11 2023, 1:55 PM

Change 982047 merged by jenkins-bot:

[mediawiki/services/recommendation-api@master] Move the only metric produced from Gauge to Histogram

https://gerrit.wikimedia.org/r/982047

elukey mentioned this in rMSRA642bd433d821: Move the only metric produced from Gauge to Histogram.Dec 11 2023, 2:01 PM

Change 983403 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] recommendation-api: update statsd configuration

https://gerrit.wikimedia.org/r/983403

Change 983403 merged by jenkins-bot:

[operations/deployment-charts@master] recommendation-api: update monitoring config

https://gerrit.wikimedia.org/r/983403

Change 983694 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update Docker image and settings for Recommendation API

https://gerrit.wikimedia.org/r/983694

Change 983694 merged by Elukey:

[operations/deployment-charts@master] services: update Docker image and settings for Recommendation API

https://gerrit.wikimedia.org/r/983694

I tried to deploy rec-api without the statsd exporter. The deployment went fine, but the metrics are still not 100% ok: from a quick look, we define the new metric as "seconds", but the value it carries is actually in ms.
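
To see why such a unit mismatch matters, consider what happens when a millisecond value lands in a histogram whose buckets are expressed in seconds. The sketch below uses illustrative bucket bounds and a made-up latency; it is not the recommendation-api configuration.

```python
# Illustrative bucket upper bounds, in seconds (an assumption for this sketch).
SECONDS_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def bucket_for(value, buckets=SECONDS_BUCKETS):
    """Return the upper bound of the first bucket holding `value`, or inf."""
    for bound in buckets:
        if value <= bound:
            return bound
    return float("inf")


def ms_to_seconds(milliseconds):
    return milliseconds / 1000.0


latency_ms = 120  # made-up request latency

wrong = bucket_for(latency_ms)                 # inf: read as 120 seconds
right = bucket_for(ms_to_seconds(latency_ms))  # 0.25: 0.12 s fits the 0.25 s bucket
```

Without the conversion, nearly every request ends up in the overflow bucket, so any quantile computed from the histogram is meaningless.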

In T205870#9413501, @elukey wrote:

I tried to deploy rec-api without the statsd exporter. The deployment went fine, but the metrics are still not 100% ok: from a quick look, we define the new metric as "seconds", but the value it carries is actually in ms.

This was resolved in service-runner 3.1.0. Will recommendation-api work with that version?

Change 984103 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/recommendation-api@master] Upgrade to service-runner 3.1.0

https://gerrit.wikimedia.org/r/984103

Change 984103 merged by jenkins-bot:

[mediawiki/services/recommendation-api@master] Upgrade to service-runner 3.1.0

https://gerrit.wikimedia.org/r/984103

Change 984131 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update rec-api's staging Docker image

https://gerrit.wikimedia.org/r/984131

Change 984131 merged by Elukey:

[operations/deployment-charts@master] services: update rec-api's staging Docker image

https://gerrit.wikimedia.org/r/984131

elukey mentioned this in rMSRA12719caf28ff: Upgrade to service-runner 3.1.0.Dec 19 2023, 10:20 AM

In T205870#9413962, @colewhite wrote:

In T205870#9413501, @elukey wrote:

I tried to deploy rec-api without the statsd exporter. The deployment went fine, but the metrics are still not 100% ok: from a quick look, we define the new metric as "seconds", but the value it carries is actually in ms.

This was resolved in service-runner 3.1.0. Will recommendation-api work with that version?

Done! It is running in staging :)

DAlangi_WMF subscribed.Jan 24 2024, 3:05 PM

hashar closed subtask T233089: Export zuul metrics to Prometheus as Declined.Mar 29 2024, 2:59 PM

thcipriani reopened subtask T233089: Export zuul metrics to Prometheus as Open.Mar 29 2024, 5:59 PM

Change #1018717 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update the rec-api's Docker image

https://gerrit.wikimedia.org/r/1018717

colewhite closed subtask T240685: MediaWiki Prometheus support as Resolved.Apr 15 2024, 3:12 PM

fgiunchedi mentioned this in T228380: Tech debt: sunsetting of Graphite.May 20 2024, 1:31 PM

Change #1018717 merged by Elukey:

[operations/deployment-charts@master] services: update the rec-api's Docker image

https://gerrit.wikimedia.org/r/1018717

Mentioned in SAL (#wikimedia-operations) [2024-06-10T13:36:40Z] <elukey> move recommendation-api on wikikube to prometheus metrics (offboarded from statsd) - T205870

@colewhite o/ I finally deployed recommendation-api, and this time it looks good. I also updated its dashboard:

https://grafana.wikimedia.org/d/Y5wk80oGk/recommendation-api?orgId=1

I see some differences between the old and new metrics, but I believe they are due to the better granularity of the Prometheus metrics.

This is the snapshot before/after the deployment: https://grafana.wikimedia.org/d/Y5wk80oGk/recommendation-api?orgId=1&from=1718024904966&to=1718029628948

We should be good; if so, we can tick off rec-api :)

fgiunchedi updated the task description. (Show Details)Aug 14 2024, 8:42 AM

colewhite changed the status of subtask T233089: Export zuul metrics to Prometheus from Open to In Progress.Sep 6 2024, 9:56 PM

Aklapper edited projects, added Patch-Needs-Improvement; removed Patch-For-Review.Oct 10 2024, 12:35 PM

Aklapper removed subscribers: • dpifke, • Pchelolo.

colewhite updated the task description. (Show Details)Oct 11 2024, 6:30 PM

colewhite updated the task description. (Show Details)Nov 13 2024, 5:07 PM

@colewhite: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on October 11th.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

colewhite closed subtask T233089: Export zuul metrics to Prometheus as Resolved.Thu, Dec 19, 12:41 AM

colewhite reopened subtask T233089: Export zuul metrics to Prometheus as Open.Thu, Dec 19, 3:48 PM

colewhite closed subtask T233089: Export zuul metrics to Prometheus as Resolved.Thu, Dec 19, 8:21 PM

elukey mentioned this in T382408: Prometheus metrics for Kartotherian on k8s.Fri, Dec 20, 8:59 AM

Description

statsv-produced metrics, see also T180105

navtiming-produced metrics, see also T175087

TODO

Use global aggregation / percentiles
