I've chatted with @JMeybohm about this. Since a seemingly related fix (https://github.com/rsyslog/rsyslog/pull/5012/commits/e8ac82e09f930bf99421cc323c24a9dbf215f9da) is present in the Debian testing repos (plus potentially other fixes), I've backported rsyslog to bullseye as 8.2404.0-1~bpo11+1. It is a straight backport (no changes needed), although there's a flaky imfile test; I haven't verified whether it is equally flaky on sid.

Fri, Apr 12, 12:11 PM · Patch-For-Review, Observability-Logging, serviceops

fgiunchedi added a comment to T362239: Reformat IRC alerts to be more useful.

I'm +1 on moving resolved / firing to the beginning and seeing what the feedback is.

Fri, Apr 12, 10:48 AM · Patch-For-Review, Observability-Alerting

fgiunchedi created T362387: Clean up logstash7 consumer groups for mediawiki.httpd.accesslog.

Fri, Apr 12, 8:16 AM · Observability-Logging

Thu, Apr 11

fgiunchedi added a comment to T349521: Prometheus/Pyrra: establish backfill process for recording rules.

Following up from a chat yesterday:

Thu, Apr 11, 9:36 AM · Patch-For-Review, User-herron, Observability-Metrics

fgiunchedi committed rLPRI0794ac5576fd: add opensearch dashboards secrets.

add opensearch dashboards secrets

Thu, Apr 11, 7:18 AM

Wed, Apr 10

fgiunchedi added a project to T246998: Enable SSO for Kibana: SRE Observability (FY2023/2024-Q4).

Wed, Apr 10, 2:45 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE

fgiunchedi created T362230: Skip black color for wikibugs task updates.

Wed, Apr 10, 1:32 PM · Wikibugs

fgiunchedi updated the task description for T360414: Phase out cergen for Observability services.

Wed, Apr 10, 9:26 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE

Tue, Apr 9

fgiunchedi added a comment to T355963: Gather feedback from users of the 'unmanaged' debian-12.0-nopuppet image.

In T355963#9699633, @Andrew wrote:

<long pause> Thanks for the feedback!

Tue, Apr 9, 9:59 AM · Patch-For-Review, cloud-services-team, Cloud-VPS

fgiunchedi added a comment to T360414: Phase out cergen for Observability services.

Indeed, I think grafana_labs.certs.yaml as a whole can be ditched

Tue, Apr 9, 9:51 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE

Mon, Apr 8

fgiunchedi added a comment to T357333: SystemdUnitFailed alerts are too noisy for data-persistence.

The configuration hasn't changed, though we did upgrade to Bookworm and together with that came a new version of Alertmanager, thus it might be a regression

Mon, Apr 8, 8:26 AM · Data-Persistence, Observability-Alerting

fgiunchedi reopened T351927: Decide and tweak Thanos retention as "Open".

Yes we'll be trimming the retention more this week @MatthewVernon

Mon, Apr 8, 8:25 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

Fri, Apr 5

fgiunchedi added a comment to T361229: titan200[12] RAM/SSD upgrade coordination.

Thank you @Jhancock.wm @herron !

Fri, Apr 5, 9:28 AM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw

Thu, Apr 4

fgiunchedi added a comment to T361706: 2024-04-03 calico/typha down.

In T361706#9685435, @taavi wrote:

Does this need to be private?

Thu, Apr 4, 8:18 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident

fgiunchedi changed the visibility for T361706: 2024-04-03 calico/typha down.

Thu, Apr 4, 8:17 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident

fgiunchedi removed a project from T361706: 2024-04-03 calico/typha down: WMF-NDA.

Thu, Apr 4, 8:16 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident

Wed, Apr 3

fgiunchedi created T361706: 2024-04-03 calico/typha down.

Wed, Apr 3, 1:50 PM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident

fgiunchedi removed a project from T357747: Capacity planning/estimation for Thanos: SRE Observability (FY2023/2024-Q4).

Moving off Q4 board since we have hw in capex spreadsheet and it'll be coming next FY

Wed, Apr 3, 12:59 PM · SRE-swift-storage, Observability-Metrics

fgiunchedi closed T351927: Decide and tweak Thanos retention as Resolved.

Resolving this since capacity is under control now and we have more coming next FY as per T357747: Capacity planning/estimation for Thanos

Wed, Apr 3, 12:58 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

fgiunchedi edited projects for T349626: Migrate SRE repositories to GitLab - operations/alerts, added: Observability-Alerting; removed SRE Observability (FY2023/2024-Q4).

I'm moving this off this Q board since I believe it'll happen further down the line

Wed, Apr 3, 11:53 AM · Observability-Alerting, Patch-For-Review, GitLab (Project Migration), collaboration-services

fgiunchedi closed T354399: Prometheus @ k8s OOM loop as Resolved.

I haven't observed OOMs related to WAL when doing a Prometheus rolling restart today, so I'm optimistically resolving the task; it can be reopened if things change

Wed, Apr 3, 10:00 AM · User-fgiunchedi, Observability-Metrics

fgiunchedi closed T360537: Bump prometheus instances allocated space as Resolved.

This is done, available space for Prometheus data has been expanded

Wed, Apr 3, 10:00 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

fgiunchedi closed T358647: hieradata for syslog/centralserver should use hash instead of array for env_variables as Resolved.

Done by @Fabfur, thank you!

Wed, Apr 3, 8:08 AM · SRE Observability (FY2023/2024-Q3), observability

fgiunchedi added a comment to T351698: Linting problems found for NovafullstackSustainedFailures.

Indeed, I don't see the alert at https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DAlertLintProblem

Wed, Apr 3, 8:04 AM · Cloud-VPS, cloud-services-team

Tue, Apr 2

fgiunchedi updated the task description for T350192: On-call batphone escalation configuration holidays FY2023-24.

Tue, Apr 2, 1:58 PM · SRE Observability (FY2023/2024-Q4)

fgiunchedi added a comment to T361566: Request creation of o11y VPS project to replace monitoring.

In T361566#9680089, @taavi wrote:

In T361566#9680047, @fgiunchedi wrote:

Just a note that we now have https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cloud_VPS_servers_from_the_internet#wmcloud.org_zone_delegations - depending on your use case could that replace the floating IP?

Thank you, to clarify what I need is *.o11y.wmcloud.org (HTTPS only) to be answered by an instance/backend. If the generic proxy can also do zone delegation then I'm all for it! Last I checked this wasn't possible, hence the floating IP, though things might have changed

That's indeed supported too, I clarified the docs.

Tue, Apr 2, 1:52 PM · Cloud-VPS (Project-requests)

fgiunchedi added a comment to T361566: Request creation of o11y VPS project to replace monitoring.

Thank you folks for the quick action on this! Appreciate it

Tue, Apr 2, 1:17 PM · Cloud-VPS (Project-requests)

fgiunchedi created T361566: Request creation of o11y VPS project to replace monitoring.

Tue, Apr 2, 9:22 AM · Cloud-VPS (Project-requests)

Fri, Mar 29

fgiunchedi closed T344954: Configure Jaeger to follow dot-delimited daily index date convention as Resolved.

This is done, we're using . as separator

Fri, Mar 29, 10:43 AM · Observability-Tracing

fgiunchedi added a project to T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver: User-fgiunchedi.

Fri, Mar 29, 10:32 AM · User-fgiunchedi, Patch-For-Review, Pontoon

fgiunchedi closed T351179: LVM vg0 close to getting full on prometheus eqiad as Resolved.

This is done, we have more space for prometheus

Fri, Mar 29, 10:31 AM · SRE Observability (FY2023/2024-Q4), Observability-Metrics

fgiunchedi closed T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 as Resolved.

I pushed forward with this to be in a stable/known state ASAP, i.e. alert1001 and alert2001 are both on puppet 7 now and catalogs compile successfully

Fri, Mar 29, 9:42 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE

fgiunchedi closed T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.

Fri, Mar 29, 9:40 AM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)

fgiunchedi added a comment to T350192: On-call batphone escalation configuration holidays FY2023-24.

Set batphone for today until Monday COB

Fri, Mar 29, 8:54 AM · SRE Observability (FY2023/2024-Q4)

Thu, Mar 28

fgiunchedi reassigned T361229: titan200[12] RAM/SSD upgrade coordination from fgiunchedi to herron.

Thank you @RobH, I've coordinated with @herron and he'll be helping with this

Thu, Mar 28, 4:27 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw

fgiunchedi added a comment to T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7.

I've been working on debugging this too, here's my understanding:

naggen2 is used to generate icinga configuration for nagios_host and nagios_service exported resources, runs as a generator on puppet master/server
naggen2 reads /etc/puppet/puppetdb.conf to discover the puppetdb url
on puppetmaster naggen2 works because the puppetdb url points to port 8443, which has ssl cert validation as optional
this is not the case on puppetserver, thus naggen2 can't query puppetdb
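
As a rough illustration of the discovery step described above, here is a minimal sketch of reading a PuppetDB URL out of a puppetdb.conf-style file. It is not naggen2's actual code: the function name puppetdb_urls and the example hostname are hypothetical, and it assumes the file uses the INI layout with a server_urls key under [main].

```python
import configparser


def puppetdb_urls(conf_text):
    """Return the URLs listed under [main] server_urls (assumed layout)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # server_urls may hold a comma-separated list; an absent key yields [].
    raw = parser.get("main", "server_urls", fallback="")
    return [url.strip() for url in raw.split(",") if url.strip()]


# Hypothetical file contents, for illustration only.
example = """
[main]
server_urls = https://puppetdb.example.org:8443
"""

print(puppetdb_urls(example))
```

Whether the query then succeeds depends on what the discovered port requires for TLS client validation, which is the difference between puppetmaster and puppetserver noted above.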

Thu, Mar 28, 9:20 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE

Wed, Mar 27

fgiunchedi raised the priority of T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag from Low to Medium.

Wed, Mar 27, 2:28 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging

fgiunchedi added a project to T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag: SRE Observability (FY2023/2024-Q4).

Wed, Mar 27, 2:28 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging

fgiunchedi updated subscribers of T360862: Degraded RAID on centrallog1002.

Also cc @VRiley-WMF if you could help with this? thank you!

Wed, Mar 27, 10:58 AM · SRE, ops-eqiad

Tue, Mar 26

fgiunchedi edited projects for T288622: All Prometheus based alerts move from Icinga to alert manager exclusively, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4)

fgiunchedi edited projects for T302373: Upgrade prometheus-statsd-exporter, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Observability-Metrics

fgiunchedi edited projects for T350694: Infrastructure Foundation Alerts to migrate, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), Patch-For-Review, Infrastructure-Foundations, Observability-Alerting

fgiunchedi edited projects for T351710: ossl rsyslog errors post-migration, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Patch-For-Review, Cloud-VPS, SRE, observability

fgiunchedi edited projects for T349626: Migrate SRE repositories to GitLab - operations/alerts, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · Observability-Alerting, Patch-For-Review, GitLab (Project Migration), collaboration-services

fgiunchedi edited projects for T343529: Prometheus doesn't reload or alert on expired client certificates, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), Prod-Kubernetes, Observability-Metrics, User-fgiunchedi, Kubernetes, serviceops-radar

fgiunchedi edited projects for T321808: Port most/all Icinga checks to Prometheus/Alertmanager, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), Observability-Alerting

fgiunchedi edited projects for T351179: LVM vg0 close to getting full on prometheus eqiad, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), Observability-Metrics

fgiunchedi edited projects for T353457: Karma UI shows duplicate alerts, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), cloud-services-team, Observability-Alerting

fgiunchedi edited projects for T356788: thanos-query probedown due to OOM of both eqiad titan frontends, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Sustainability (Incident Followup), SRE, observability

fgiunchedi edited projects for T354255: Alert in need of triage: AlertLintProblem (instance localhost:9123), added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), sre-alert-triage

fgiunchedi edited projects for T357747: Capacity planning/estimation for Thanos, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE-swift-storage, Observability-Metrics

fgiunchedi edited projects for T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).

Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics

fgiunchedi closed T359198: Icinga BFD check failing as Resolved.

This is fixed, I've undone my symlink bandaid. I've also reported the issue at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1067768

Tue, Mar 26, 1:54 PM · SRE Observability (FY2023/2024-Q3), Patch-For-Review, netops, SRE

fgiunchedi closed T359198: Icinga BFD check failing, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.

Tue, Mar 26, 1:52 PM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)

fgiunchedi moved T360537: Bump prometheus instances allocated space from Backlog to Doing on the User-fgiunchedi board.

Tue, Mar 26, 11:14 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

fgiunchedi moved T359633: Strategy for Envoy metrics and Prometheus from Backlog to Doing on the User-fgiunchedi board.

Tue, Mar 26, 11:13 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s

fgiunchedi moved T354399: Prometheus @ k8s OOM loop from Backlog to Doing on the User-fgiunchedi board.

Tue, Mar 26, 11:13 AM · User-fgiunchedi, Observability-Metrics

fgiunchedi added a project to T354399: Prometheus @ k8s OOM loop: User-fgiunchedi.

Tue, Mar 26, 11:13 AM · User-fgiunchedi, Observability-Metrics

fgiunchedi added a project to T359633: Strategy for Envoy metrics and Prometheus: User-fgiunchedi.

Tue, Mar 26, 11:13 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s

Mon, Mar 25

fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Good news and bad news, in the sense that I can't reproduce the OOM in prometheus k8s in codfw. I suspect my fix at https://gerrit.wikimedia.org/r/1013515 to fetch fewer Envoy metrics significantly reduced load, and thus replaying the WAL doesn't pose a memory problem anymore.

Mon, Mar 25, 1:43 PM · User-fgiunchedi, Observability-Metrics

fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Thank you @andrea.denisse for the suggestions! Let's indeed discuss further what the best options are going forward

Mon, Mar 25, 1:04 PM · User-fgiunchedi, Observability-Metrics

fgiunchedi added a comment to T359633: Strategy for Envoy metrics and Prometheus.

Promising results, samples/s in eqiad went from ~200k/s to ~110k/s after the change (and slightly increasing)

Mon, Mar 25, 9:53 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s

fgiunchedi updated subscribers of T360862: Degraded RAID on centrallog1002.

@Jclark-ctr it looks like one of the new SSDs from {T359452} isn't happy, I've located the drive so it should be blinking; could we replace it ASAP? please ping me on IRC when you can, thank you !

Mon, Mar 25, 8:20 AM · SRE, ops-eqiad

Fri, Mar 22

fgiunchedi added a comment to T360703: Replace or remove Debian Buster VMs in 'monitoring' cloud-vps project.

Thank you for the heads up; for context I'm working on T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver which will enable us to rebuild the whole o11y stack with Bullseye/Bookworm VMs

Fri, Mar 22, 8:52 AM · Cloud-VPS (Debian Buster Deprecation), cloud-services-team

Thu, Mar 21

fgiunchedi added a project to T360537: Bump prometheus instances allocated space: User-fgiunchedi.

Thu, Mar 21, 2:50 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Thank you @andrea.denisse for taking a look!

Thu, Mar 21, 11:42 AM · User-fgiunchedi, Observability-Metrics

Wed, Mar 20

fgiunchedi created T360537: Bump prometheus instances allocated space.

Wed, Mar 20, 3:51 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

fgiunchedi closed T359631: install (2) 1.92TB SSDs from decom into prometheus200[56] as Resolved.

This is done, thank you @Papaul

Wed, Mar 20, 3:49 PM · ops-codfw, SRE

fgiunchedi added a comment to T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.

In T359640#9642256, @fgiunchedi wrote:

We (o11y) have brainstormed this issue a little at the offsite, and one partial solution would be to get a prometheus dedicated mw instance, to at least contain the blast radius.

We'll have to brainstorm a little more, though even with moderately-sized histograms I can see statsd-exporter per-pod not being manageable when we're talking big histograms and hundreds of pods

Wed, Mar 20, 1:27 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics

fgiunchedi renamed T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag from apache2 cpu-stuck on logstash1032 causes kafka logging lag to apache2 cpu-stuck on logstash hosts causes kafka logging lag.

Wed, Mar 20, 11:14 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging

Tue, Mar 19

fgiunchedi updated subscribers of T359632: install (2) 1.92TB SSDs from decom into prometheus100[56].

@Jclark-ctr @VRiley-WMF please ping me on irc when you get on site tomorrow and we can coordinate, I'll be around, thank you!

Tue, Mar 19, 3:51 PM · ops-eqiad, SRE, procurement

fgiunchedi added a comment to T359631: install (2) 1.92TB SSDs from decom into prometheus200[56].

Thank you @Jhancock.wm ! how's tomorrow at 16 UTC for you? we'll be doing both hosts one at a time, and just to confirm: the drives are hot swap (?)

Tue, Mar 19, 3:48 PM · ops-codfw, SRE

fgiunchedi created T360444: Validate thanos/prometheus rules in puppet CI.

Tue, Mar 19, 3:19 PM · Patch-For-Review, Observability-Metrics

fgiunchedi added a comment to T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.

In T359640#9625172, @Krinkle wrote:

Can we disable host-level instance for MediaWiki's statsd exporter? (Or substitute with a constant?) I believe that would save 100x or 2 orders of magnitude. I can't imagine that ever being relevant for service/domain-specific stats from the MediaWiki application. I imagine of the hypothetical use cases that we don't yet have today, 99% would be covered by site="codfw", if we keep that.

Tue, Mar 19, 2:28 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics

fgiunchedi added a comment to T359497: StatsD Exporter: gracefully handle metric signature changes.

Good point re: statsd_exporter_events_conflict_total. Looking at a mw-on-k8s world, I think linking the statsd-exporter lifecycle to mw seems the easiest? Which also raises the question: maybe it does happen already during mw deployments as pods are cycled?

Tue, Mar 19, 1:57 PM · Observability-Metrics

fgiunchedi created T360433: Thumbor statsd-exporter metrics conflicts.

Tue, Mar 19, 1:57 PM · Thumbor

fgiunchedi added a comment to T355837: Add Prometheus support to statsd.js via mw.track().

For awareness, see also https://phabricator.wikimedia.org/T359178#9640223 re: statsv in the context of varnishkafka deprecation/removal.

Tue, Mar 19, 1:35 PM · Grafana, MediaWiki-Platform-Team (Radar), MediaWiki-extensions-WikimediaEvents, Observability-Metrics

Mar 8 2024

fgiunchedi added a comment to T326322: Add per-output queue monitoring for Juniper network devices.

Yeah having some ballpark numbers will be a great help @cmooney, unless we're talking hundreds of thousands more metrics than we have now I think we're good to go, tens of thousands we can do without much effort/resources

Mar 8 2024, 4:26 PM · Patch-For-Review, SRE, Infrastructure-Foundations, netops

fgiunchedi created T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.

Mar 8 2024, 3:53 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics

fgiunchedi added a comment to T359633: Strategy for Envoy metrics and Prometheus.

Ah yes indeed, thank you @JMeybohm !

Mar 8 2024, 2:29 PM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s

fgiunchedi created T359633: Strategy for Envoy metrics and Prometheus.

Mar 8 2024, 2:09 PM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s

fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Indeed the WAL grew quite fast (faster than I expected anyways) as the mw-on-k8s migration progressed (we're at ~50% now)

Mar 8 2024, 1:39 PM · User-fgiunchedi, Observability-Metrics

Mar 6 2024

fgiunchedi closed T359292: ircecho doesn't attempt to open log files created after startup as Resolved.

Calling this done, albeit with a hack

Mar 6 2024, 2:55 PM · SRE, SRE Observability (FY2023/2024-Q3)

fgiunchedi closed T359292: ircecho doesn't attempt to open log files created after startup, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.

Mar 6 2024, 2:53 PM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)

fgiunchedi added a comment to T359292: ircecho doesn't attempt to open log files created after startup.

Logs from ircecho.service

Mar 6 2024, 1:24 PM · SRE, SRE Observability (FY2023/2024-Q3)

fgiunchedi added a comment to T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).

Thank you @LSobanski ! Those are known, I've silenced the alerts for now, leaving the task open as a reminder

Mar 6 2024, 1:21 PM · SRE Observability, sre-alert-triage

fgiunchedi created T359292: ircecho doesn't attempt to open log files created after startup.

Mar 6 2024, 9:23 AM · SRE, SRE Observability (FY2023/2024-Q3)

fgiunchedi closed T359153: statsv metrics are both prometheus ops and ext as Resolved.

All good! Thank you @colewhite for the merge

Mar 6 2024, 9:00 AM · Observability-Metrics

Mar 5 2024

fgiunchedi added a comment to T333615: Upgrade alert* hosts to Bookworm.

Something else that didn't work well: the current version of ircecho doesn't seem to attempt reopening the files it is supposed to look for in /var/log/icinga. I have "fixed" this by creating said .log files and then restarting ircecho, which then did properly open/tail the files
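
For illustration, the missing behaviour (picking up log files that only appear after startup) could be sketched as below. This is not ircecho's code: the helper name open_available and the idea of calling it periodically from the tail loop are assumptions.

```python
import os


def open_available(paths, handles):
    """Open every path that now exists and is not tracked yet.

    handles maps path -> open file object and is updated in place.
    Returns the list of paths opened by this call.
    """
    opened = []
    for path in paths:
        if path not in handles and os.path.exists(path):
            # A file that appears late is read from its beginning.
            handles[path] = open(path, "r")
            opened.append(path)
    return opened
```

Calling such a helper on every polling cycle, rather than once at startup, would avoid having to create the .log files by hand and restart the service.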

Mar 5 2024, 5:31 PM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)

fgiunchedi added a comment to T359198: Icinga BFD check failing.

I've bandaided the issue on alert2001, we'll need a more proper fix:

Mar 5 2024, 5:28 PM · SRE Observability (FY2023/2024-Q3), Patch-For-Review, netops, SRE

fgiunchedi added a comment to T355837: Add Prometheus support to statsd.js via mw.track().

Thank you for the detailed write up on this @Krinkle ! See below for my take:

Mar 5 2024, 12:02 PM · Grafana, MediaWiki-Platform-Team (Radar), MediaWiki-extensions-WikimediaEvents, Observability-Metrics

fgiunchedi created T359153: statsv metrics are both prometheus ops and ext.

Mar 5 2024, 11:18 AM · Observability-Metrics

fgiunchedi changed the status of T359068: Not enough space on titan2001 for thanos-compact from Open to Stalled.

Stalling until thanos-compact finishes its cycle, and we can assess how much space is used too

Mar 5 2024, 10:25 AM · User-fgiunchedi, Observability-Metrics

fgiunchedi renamed T359068: Not enough space on titan2001 for thanos-compact from Not enough space on titan hosts for thanos-compact to Not enough space on titan2001 for thanos-compact.

Mar 5 2024, 10:23 AM · User-fgiunchedi, Observability-Metrics

fgiunchedi added a comment to T359068: Not enough space on titan2001 for thanos-compact.

With the new 1.6TB disk in place we have ~2.2TB of raid0, which is great. This is fine for short/medium term, not long term because it means thanos-compact is able to complete a cycle only on titan2001 now. We'll get the other hosts in line in terms of space soon though (next FY or this FY is TBD)

Mar 5 2024, 9:46 AM · User-fgiunchedi, Observability-Metrics

fgiunchedi (Filippo Giunchedi)
/* No comment */
