fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (17)

User Details

User Since
Oct 3 2014, 8:06 AM (499 w, 2 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
FGiunchedi (WMF)

Recent Activity

Wed, Apr 24

fgiunchedi committed rLPRI29b82206b3fd: add logstash_oidc.
add logstash_oidc
Wed, Apr 24, 12:17 PM
fgiunchedi added a comment to T357616: Logs from containers sometimes not visible in logstash.

The bandaid is in place: rsyslog.service is restarted every 4 hours (the 4 is a magic number and can be tweaked). Let's see how we go with this in place while a better solution is found.

Wed, Apr 24, 12:06 PM · Patch-For-Review, Observability-Logging, serviceops
fgiunchedi closed T362719: Upgrade Jaeger to 1.56.0 (latest stable) as Resolved.

This is done! We're running Jaeger collector and query 1.56.

Wed, Apr 24, 8:12 AM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
fgiunchedi closed T362719: Upgrade Jaeger to 1.56.0 (latest stable), a subtask of T320549: distributed tracing v0 [minimum viable], as Resolved.
Wed, Apr 24, 8:11 AM · Epic, Observability-Tracing

Thu, Apr 18

fgiunchedi moved T362719: Upgrade Jaeger to 1.56.0 (latest stable) from Backlog to Doing on the User-fgiunchedi board.
Thu, Apr 18, 2:44 PM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
fgiunchedi added a project to T362719: Upgrade Jaeger to 1.56.0 (latest stable): User-fgiunchedi.
Thu, Apr 18, 12:19 PM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
fgiunchedi added a comment to T355963: Gather feedback from users of the 'unmanaged' debian-12.0-nopuppet image.

That's correct yeah, I don't think there are security implications. What I'm after is the possibility to upload/configure a keypair once and reuse it across instance launches, which at the time I couldn't find a way to do. Perhaps the underlying APIs do support it: once a named keypair has been used once then it can be reused for subsequent launches? That'd be enough for me

Huh, I think I'm having the opposite problem: once I create a keypair and launch a VM with it that keypair is forever associated with my account and installed by default in all future VMs. That sounds like what you want, is that not what you're seeing?

Thu, Apr 18, 8:03 AM · cloud-services-team, Cloud-VPS

Tue, Apr 16

fgiunchedi closed T344953: Manage jaeger-* index lifecycle as Resolved.

This is done. I've gone for a simpler approach for now and just delete old jaeger indices via curator. If the need arises we can make other tweaks, such as changing replica / shard counts and so on.
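
As a rough illustration of age-based index cleanup (not the actual curator configuration, which is not shown here), the sketch below picks out daily indices older than a retention window. It assumes index names end in a dot-delimited date such as `jaeger-span-2024.04.16`, in line with the dot separator mentioned in T344954; the function name, the sample index names and the 30-day window are hypothetical.

```python
from datetime import date, datetime, timedelta


def expired_indices(names, today, keep_days, prefix="jaeger-"):
    """Return index names whose trailing date is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue  # only consider jaeger-* indices
        # Assumption: the date is the last dash-separated token, e.g. 2024.04.16
        stamp = name.rsplit("-", 1)[-1]
        try:
            day = datetime.strptime(stamp, "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index, leave it alone
        if day < cutoff:
            expired.append(name)
    return expired


# Hypothetical index list and retention window, for illustration only
indices = [
    "jaeger-span-2024.04.16",
    "jaeger-span-2024.03.01",
    "jaeger-service-2024.03.01",
    "logstash-2024.03.01",
]
print(expired_indices(indices, today=date(2024, 4, 16), keep_days=30))
```

Only the names are selected here; the deletion itself would be left to curator or the cluster API.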

Tue, Apr 16, 2:14 PM · Observability-Tracing
fgiunchedi added a comment to T362239: Reformat IRC alerts to be more useful.

My reasoning for having the alert group (i.e. the alert name) at the beginning is the following:

  • alerts.w.o is "keyed" to alert groups, not individual alerts
  • the optional number of alerts that are firing refers to the alert group as a whole, e.g. I find it confusing to have an alert count next to the individual alert (FIRING [4x] <summary> (<alert group>))
  • the alert group name already gives a broad indication of what's wrong
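
One way to picture the layout argued for above is the minimal sketch below, which puts the state first, then the alert group with its optional count, then the summary. The function name, the separator characters and the sample summary are assumptions for illustration and not the format that was eventually adopted.

```python
def format_irc_alert(state, group, summary, count=1):
    """Build one IRC line: state, then alert group with its count, then summary."""
    # The count belongs to the alert group as a whole, so it sits next to the group name
    counter = f" [{count}x]" if count > 1 else ""
    return f"{state.upper()}: {group}{counter} - {summary}"


# Hypothetical alert summary, for illustration only
print(format_irc_alert("firing", "SystemdUnitFailed", "example.service failed on host1001", count=4))
```
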
Tue, Apr 16, 9:40 AM · Patch-For-Review, Observability-Alerting

Mon, Apr 15

herron awarded T246998: Enable SSO for Kibana a Party Time token.
Mon, Apr 15, 5:06 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE
fgiunchedi closed T246998: Enable SSO for Kibana as Resolved.

I'm optimistically resolving this since logstash.w.o (nowadays opensearch dashboards) is working as expected

Mon, Apr 15, 3:12 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE
fgiunchedi added a comment to T362376: The prune_old_srv_syslog_directories.service can't delete non-empty directories on centrallog instances.

I've fixed the issue with the following on centrallog hosts:

Mon, Apr 15, 9:57 AM · SRE Observability (FY2023/2024-Q4), Observability-Logging
fgiunchedi closed T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag as Resolved.

I'm optimistically resolving this since we no longer use mod auth ldap

Mon, Apr 15, 8:55 AM · SRE Observability (FY2023/2024-Q4), Observability-Logging
fgiunchedi closed T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag, a subtask of T246998: Enable SSO for Kibana, as Resolved.
Mon, Apr 15, 8:55 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE

Fri, Apr 12

fgiunchedi added a comment to T357616: Logs from containers sometimes not visible in logstash.

I've chatted with @JMeybohm about this, and since for example a seemingly-related fix (https://github.com/rsyslog/rsyslog/pull/5012/commits/e8ac82e09f930bf99421cc323c24a9dbf215f9da) is present in the Debian testing repos (plus potentially other fixes) I've backported rsyslog to bullseye as 8.2404.0-1~bpo11+1. It is actually a straight backport (no changes needed), although there's a flaky imfile test, which I haven't verified to be equally flaky on sid.

Fri, Apr 12, 12:11 PM · Patch-For-Review, Observability-Logging, serviceops
fgiunchedi added a comment to T362239: Reformat IRC alerts to be more useful.

I'm +1 on moving resolved / firing to the beginning and seeing what the feedback is.

Fri, Apr 12, 10:48 AM · Patch-For-Review, Observability-Alerting
fgiunchedi created T362387: Clean up logstash7 consumer groups for mediawiki.httpd.accesslog.
Fri, Apr 12, 8:16 AM · Observability-Logging

Thu, Apr 11

fgiunchedi added a comment to T349521: Prometheus/Pyrra: establish backfill process for recording rules.

Following up from a chat yesterday:

Thu, Apr 11, 9:36 AM · Patch-For-Review, User-herron, Observability-Metrics
fgiunchedi committed rLPRI0794ac5576fd: add opensearch dashboards secrets.
add opensearch dashboards secrets
Thu, Apr 11, 7:18 AM

Wed, Apr 10

fgiunchedi added a project to T246998: Enable SSO for Kibana: SRE Observability (FY2023/2024-Q4).
Wed, Apr 10, 2:45 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE
fgiunchedi created T362230: Skip black color for wikibugs task updates.
Wed, Apr 10, 1:32 PM · Wikibugs
fgiunchedi updated the task description for T360414: Phase out cergen for Observability services.
Wed, Apr 10, 9:26 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE

Tue, Apr 9

fgiunchedi added a comment to T355963: Gather feedback from users of the 'unmanaged' debian-12.0-nopuppet image.

<long pause> Thanks for the feedback!

Tue, Apr 9, 9:59 AM · cloud-services-team, Cloud-VPS
fgiunchedi added a comment to T360414: Phase out cergen for Observability services.

Indeed, I think grafana_labs.certs.yaml as a whole can be ditched

Tue, Apr 9, 9:51 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE

Mon, Apr 8

fgiunchedi added a comment to T357333: SystemdUnitFailed alerts are too noisy for data-persistence.

The configuration hasn't changed, though we did upgrade to Bookworm and together with that came a new version of Alertmanager, thus it might be a regression

Mon, Apr 8, 8:26 AM · Data-Persistence, Observability-Alerting
fgiunchedi reopened T351927: Decide and tweak Thanos retention as "Open".

Yes we'll be trimming the retention more this week @MatthewVernon

Mon, Apr 8, 8:25 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

Fri, Apr 5

fgiunchedi added a comment to T361229: titan200[12] RAM/SSD upgrade coordination.

Thank you @Jhancock.wm @herron !

Fri, Apr 5, 9:28 AM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw

Thu, Apr 4

fgiunchedi added a comment to T361706: 2024-04-03 calico/typha down.

Does this need to be private?

Thu, Apr 4, 8:18 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
fgiunchedi changed the visibility for T361706: 2024-04-03 calico/typha down.
Thu, Apr 4, 8:17 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
fgiunchedi removed a project from T361706: 2024-04-03 calico/typha down: WMF-NDA.
Thu, Apr 4, 8:16 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident

Wed, Apr 3

fgiunchedi created T361706: 2024-04-03 calico/typha down.
Wed, Apr 3, 1:50 PM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
fgiunchedi removed a project from T357747: Capacity planning/estimation for Thanos: SRE Observability (FY2023/2024-Q4).

Moving off Q4 board since we have hw in capex spreadsheet and it'll be coming next FY

Wed, Apr 3, 12:59 PM · SRE-swift-storage, Observability-Metrics
fgiunchedi closed T351927: Decide and tweak Thanos retention as Resolved.

Resolving this since capacity is under control now and we have more coming next FY as per T357747: Capacity planning/estimation for Thanos

Wed, Apr 3, 12:58 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi edited projects for T349626: Migrate SRE repositories to GitLab - operations/alerts, added: Observability-Alerting; removed SRE Observability (FY2023/2024-Q4).

I'm moving this off this Q board since I believe it'll happen further down the line

Wed, Apr 3, 11:53 AM · Observability-Alerting, Patch-For-Review, GitLab (Project Migration), collaboration-services
fgiunchedi closed T354399: Prometheus @ k8s OOM loop as Resolved.

I haven't observed OOMs related to WAL when doing a Prometheus rolling restart today, so I'm optimistically resolving the task; it can be reopened if things change.

Wed, Apr 3, 10:00 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi closed T360537: Bump prometheus instances allocated space as Resolved.

This is done, available space for Prometheus data has been expanded

Wed, Apr 3, 10:00 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi closed T358647: hieradata for syslog/centralserver should use hash instead of array for env_variables as Resolved.

Done by @Fabfur, thank you!

Wed, Apr 3, 8:08 AM · SRE Observability (FY2023/2024-Q3), observability
fgiunchedi added a comment to T351698: Linting problems found for NovafullstackSustainedFailures.

Indeed, I don't see the alert at https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DAlertLintProblem

Wed, Apr 3, 8:04 AM · Cloud-VPS, cloud-services-team

Tue, Apr 2

fgiunchedi updated the task description for T350192: On-call batphone escalation configuration holidays FY2023-24.
Tue, Apr 2, 1:58 PM · SRE Observability (FY2023/2024-Q4)
fgiunchedi added a comment to T361566: Request creation of o11y VPS project to replace monitoring.

Just a note that we now have https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cloud_VPS_servers_from_the_internet#wmcloud.org_zone_delegations - depending on your use case could that replace the floating IP?

Thank you, to clarify what I need is *.o11y.wmcloud.org (HTTPS only) to be answered by an instance/backend. If the generic proxy can do also zone delegation then I'm all for it! Last I checked this wasn't possible, hence the floating IP, though things might have changed

That's indeed supported too, I clarified the docs.

Tue, Apr 2, 1:52 PM · Cloud-VPS (Project-requests)
fgiunchedi added a comment to T361566: Request creation of o11y VPS project to replace monitoring.

Thank you folks for the quick action on this! Appreciate it

Tue, Apr 2, 1:17 PM · Cloud-VPS (Project-requests)
fgiunchedi created T361566: Request creation of o11y VPS project to replace monitoring.
Tue, Apr 2, 9:22 AM · Cloud-VPS (Project-requests)

Mar 29 2024

fgiunchedi closed T344954: Configure Jaeger to follow dot-delimited daily index date convention as Resolved.

This is done, we're using . as separator

Mar 29 2024, 10:43 AM · Observability-Tracing
fgiunchedi added a project to T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver: User-fgiunchedi.
Mar 29 2024, 10:32 AM · User-fgiunchedi, Patch-For-Review, Pontoon
fgiunchedi closed T351179: LVM vg0 close to getting full on prometheus eqiad as Resolved.

This is done, we have more space for prometheus

Mar 29 2024, 10:31 AM · SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi closed T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 as Resolved.

I pushed forward with this to be in a stable/known state ASAP, i.e. alert1001 and alert2001 are both on puppet 7 now and catalogs compile successfully

Mar 29 2024, 9:42 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE
fgiunchedi closed T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.
Mar 29 2024, 9:40 AM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T350192: On-call batphone escalation configuration holidays FY2023-24.

Set batphone for today until Monday COB

Mar 29 2024, 8:54 AM · SRE Observability (FY2023/2024-Q4)

Mar 28 2024

fgiunchedi reassigned T361229: titan200[12] RAM/SSD upgrade coordination from fgiunchedi to herron.

Thank you @RobH, I've coordinated with @herron and he'll be helping with this

Mar 28 2024, 4:27 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw
fgiunchedi added a comment to T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7.

I've been working on debugging this too, here's my understanding:

  • naggen2 is used to generate icinga configuration for nagios_host and nagios_service exported resources, runs as a generator on puppet master/server
  • naggen2 reads /etc/puppet/puppetdb.conf to discover the puppetdb url
  • on puppetmaster naggen2 works because the puppetdb url points to port 8443, where ssl cert validation is optional
  • this is not the case on puppetserver, thus naggen2 can't query puppetdb
Mar 28 2024, 9:20 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE

Mar 27 2024

fgiunchedi raised the priority of T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag from Low to Medium.
Mar 27 2024, 2:28 PM · SRE Observability (FY2023/2024-Q4), Observability-Logging
fgiunchedi added a project to T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag: SRE Observability (FY2023/2024-Q4).
Mar 27 2024, 2:28 PM · SRE Observability (FY2023/2024-Q4), Observability-Logging
fgiunchedi updated subscribers of T360862: Degraded RAID on centrallog1002.

Also cc @VRiley-WMF if you could help with this? thank you!

Mar 27 2024, 10:58 AM · SRE, ops-eqiad

Mar 26 2024

fgiunchedi edited projects for T288622: All Prometheus based alerts move from Icinga to alert manager exclusively, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4)
fgiunchedi edited projects for T302373: Upgrade prometheus-statsd-exporter, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Observability-Metrics
fgiunchedi edited projects for T350694: Infrastructure Foundation Alerts to migrate, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4), Patch-For-Review, Infrastructure-Foundations, Observability-Alerting
fgiunchedi edited projects for T351710: ossl rsyslog errors post-migration, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Patch-For-Review, Cloud-VPS, SRE, observability
fgiunchedi edited projects for T349626: Migrate SRE repositories to GitLab - operations/alerts, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · Observability-Alerting, Patch-For-Review, GitLab (Project Migration), collaboration-services
fgiunchedi edited projects for T343529: Prometheus doesn't reload or alert on expired client certificates, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4), Prod-Kubernetes, Observability-Metrics, User-fgiunchedi, Kubernetes, serviceops-radar
fgiunchedi edited projects for T321808: Port most/all Icinga checks to Prometheus/Alertmanager, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4), Observability-Alerting
fgiunchedi edited projects for T351179: LVM vg0 close to getting full on prometheus eqiad, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi edited projects for T353457: Karma UI shows duplicate alerts, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4), cloud-services-team, Observability-Alerting
fgiunchedi edited projects for T356788: thanos-query probedown due to OOM of both eqiad titan frontends, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Sustainability (Incident Followup), SRE, observability
fgiunchedi edited projects for T354255: Alert in need of triage: AlertLintProblem (instance localhost:9123), added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4), sre-alert-triage
fgiunchedi edited projects for T357747: Capacity planning/estimation for Thanos, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE-swift-storage, Observability-Metrics
fgiunchedi edited projects for T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Mar 26 2024, 2:57 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics
fgiunchedi closed T359198: Icinga BFD check failing as Resolved.

This is fixed, I've undone my symlink bandaid. I've also reported the issue at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1067768

Mar 26 2024, 1:54 PM · SRE Observability (FY2023/2024-Q3), Patch-For-Review, netops, SRE
fgiunchedi closed T359198: Icinga BFD check failing, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.
Mar 26 2024, 1:52 PM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi moved T360537: Bump prometheus instances allocated space from Backlog to Doing on the User-fgiunchedi board.
Mar 26 2024, 11:14 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi moved T359633: Strategy for Envoy metrics and Prometheus from Backlog to Doing on the User-fgiunchedi board.
Mar 26 2024, 11:13 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
fgiunchedi moved T354399: Prometheus @ k8s OOM loop from Backlog to Doing on the User-fgiunchedi board.
Mar 26 2024, 11:13 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a project to T354399: Prometheus @ k8s OOM loop: User-fgiunchedi.
Mar 26 2024, 11:13 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a project to T359633: Strategy for Envoy metrics and Prometheus: User-fgiunchedi.
Mar 26 2024, 11:13 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s

Mar 25 2024

fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Good news and bad news, in the sense that I can't reproduce the OOM in prometheus k8s in codfw. I suspect my fix at https://gerrit.wikimedia.org/r/1013515 to fetch fewer Envoy metrics significantly reduced load, and thus replaying the WAL doesn't pose a memory problem anymore.

Mar 25 2024, 1:43 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Thank you @andrea.denisse for the suggestions! Let's indeed discuss further what the best options are going forward

Mar 25 2024, 1:04 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T359633: Strategy for Envoy metrics and Prometheus.

Promising results, samples/s in eqiad went from ~200k/s to ~110k/s after the change (and slightly increasing)

Mar 25 2024, 9:53 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
fgiunchedi updated subscribers of T360862: Degraded RAID on centrallog1002.

@Jclark-ctr it looks like one of the new SSDs from {T359452} isn't happy, I've located the drive so it should be blinking; could we replace it ASAP? Please ping me on IRC when you can, thank you!

Mar 25 2024, 8:20 AM · SRE, ops-eqiad

Mar 22 2024

fgiunchedi added a comment to T360703: Replace or remove Debian Buster VMs in 'monitoring' cloud-vps project.

Thank you for the heads up; for context I'm working on T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver which will enable us to rebuild the whole o11y stack with Bullseye/Bookworm VMs

Mar 22 2024, 8:52 AM · Cloud-VPS (Debian Buster Deprecation), cloud-services-team

Mar 21 2024

fgiunchedi added a project to T360537: Bump prometheus instances allocated space: User-fgiunchedi.
Mar 21 2024, 2:50 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Thank you @andrea.denisse for taking a look!

Mar 21 2024, 11:42 AM · User-fgiunchedi, Observability-Metrics

Mar 20 2024

fgiunchedi created T360537: Bump prometheus instances allocated space.
Mar 20 2024, 3:51 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi closed T359631: install (2) 1.92TB SSDs from decom into prometheus200[56] as Resolved.

This is done, thank you @Papaul

Mar 20 2024, 3:49 PM · ops-codfw, SRE
fgiunchedi added a comment to T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.

We (o11y) have brainstormed this issue a little at the offsite, and one partial solution would be to get a prometheus dedicated mw instance, to at least contain the blast radius.

We'll have to brainstorm a little more, though even with moderately-sized histograms I can see statsd-exporter per-pod not being manageable when we're talking big histograms and hundreds of pods

Mar 20 2024, 1:27 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics
fgiunchedi renamed T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag from apache2 cpu-stuck on logstash1032 causes kafka logging lag to apache2 cpu-stuck on logstash hosts causes kafka logging lag.
Mar 20 2024, 11:14 AM · SRE Observability (FY2023/2024-Q4), Observability-Logging

Mar 19 2024

fgiunchedi updated subscribers of T359632: install (2) 1.92TB SSDs from decom into prometheus100[56].

@Jclark-ctr @VRiley-WMF please ping me on irc when you get on site tomorrow and we can coordinate, I'll be around, thank you!

Mar 19 2024, 3:51 PM · ops-eqiad, SRE, procurement
fgiunchedi added a comment to T359631: install (2) 1.92TB SSDs from decom into prometheus200[56].

Thank you @Jhancock.wm ! how's tomorrow at 16 UTC for you? we'll be doing both hosts one at a time, and just to confirm: the drives are hot swap (?)

Mar 19 2024, 3:48 PM · ops-codfw, SRE
fgiunchedi created T360444: Validate thanos/prometheus rules in puppet CI.
Mar 19 2024, 3:19 PM · Patch-For-Review, Observability-Metrics
fgiunchedi added a comment to T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.

Can we disable host-level instance for MediaWiki's statsd exporter? (Or substitute with a constant?) I believe that would save 100x or 2 orders of magnitude. I can't imagine that ever being relevant for service/domain-specific stats from the MediaWiki application. I imagine of the hypothetical use cases that we don't yet have today, 99% would be covered by site="codfw", if we keep that.
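
The "100x" estimate follows from how series counts multiply across labels. A minimal sketch, assuming the worst case where every label combination occurs and using made-up label cardinalities (20 histogram buckets, 2 sites, 100 hosts), none of which are measured values:

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series: product of the distinct values of each label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities, for illustration only
labels = {"le": 20, "site": 2, "instance": 100}

with_instance = series_count(labels)
without_instance = series_count({k: v for k, v in labels.items() if k != "instance"})

# Dropping (or making constant) the per-host label divides the bound by the host count
print(with_instance, without_instance, with_instance // without_instance)
```

With these numbers the bound drops from 4000 to 40 series per metric, i.e. by the number of hosts.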

Mar 19 2024, 2:28 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics
fgiunchedi added a comment to T359497: StatsD Exporter: gracefully handle metric signature changes.

Good point re: statsd_exporter_events_conflict_total, looking at a mw-on-k8s world, I think linking the statsd-exporter lifecycle to mw seems the easiest? which also begs the question: maybe it does happen already during mw deployments as pods are cycled?

Mar 19 2024, 1:57 PM · Observability-Metrics
fgiunchedi created T360433: Thumbor statsd-exporter metrics conflicts.
Mar 19 2024, 1:57 PM · Thumbor
fgiunchedi added a comment to T355837: Add Prometheus support to statsd.js via mw.track().

For awareness, see also https://phabricator.wikimedia.org/T359178#9640223 re: statsv in the context of varnishkafka deprecation/removal.

Mar 19 2024, 1:35 PM · Grafana, MediaWiki-Platform-Team (Radar), MediaWiki-extensions-WikimediaEvents, Observability-Metrics

Mar 8 2024

fgiunchedi added a comment to T326322: Add per-output queue monitoring for Juniper network devices.

Yeah having some ballpark numbers will be a great help @cmooney, unless we're talking hundreds of thousands more metrics than we have now I think we're good to go, tens of thousands we can do without much effort/resources

Mar 8 2024, 4:26 PM · Patch-For-Review, SRE, Infrastructure-Foundations, netops
fgiunchedi created T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.
Mar 8 2024, 3:53 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics
fgiunchedi added a comment to T359633: Strategy for Envoy metrics and Prometheus.

Ah yes indeed, thank you @JMeybohm !

Mar 8 2024, 2:29 PM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
fgiunchedi created T359633: Strategy for Envoy metrics and Prometheus.
Mar 8 2024, 2:09 PM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Indeed the WAL grew quite fast (faster than I expected anyways) as the mw-on-k8s migration progressed (we're at ~50% now)

Mar 8 2024, 1:39 PM · User-fgiunchedi, Observability-Metrics

Mar 6 2024

fgiunchedi closed T359292: ircecho doesn't attempt to open log files created after startup as Resolved.

Calling this done, albeit with a hack

Mar 6 2024, 2:55 PM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi closed T359292: ircecho doesn't attempt to open log files created after startup, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.
Mar 6 2024, 2:53 PM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T359292: ircecho doesn't attempt to open log files created after startup.

Logs from ircecho.service

Mar 6 2024, 1:24 PM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).

Thank you @LSobanski ! Those are known, I've silenced the alerts for now, leaving the task open as a reminder

Mar 6 2024, 1:21 PM · SRE Observability, sre-alert-triage