fgiunchedi (Filippo Giunchedi)
/* No comment */

User Details

User Since
Oct 3 2014, 8:06 AM (497 w, 4 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
FGiunchedi (WMF)

Recent Activity

Yesterday

herron awarded T246998: Enable SSO for Kibana a Party Time token.
Mon, Apr 15, 5:06 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE
fgiunchedi closed T246998: Enable SSO for Kibana as Resolved.

I'm optimistically resolving this since logstash.w.o (nowadays opensearch dashboards) is working as expected

Mon, Apr 15, 3:12 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE
fgiunchedi added a comment to T362376: The prune_old_srv_syslog_directories.service can't delete non-empty directories on centrallog instances.

I've fixed the issue with the following on centrallog hosts:

Mon, Apr 15, 9:57 AM · SRE Observability (FY2023/2024-Q4), Observability-Logging, Patch-For-Review
fgiunchedi closed T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag as Resolved.

I'm optimistically resolving this since we no longer use mod auth ldap

Mon, Apr 15, 8:55 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging
fgiunchedi closed T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag, a subtask of T246998: Enable SSO for Kibana, as Resolved.
Mon, Apr 15, 8:55 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE

Fri, Apr 12

fgiunchedi added a comment to T357616: Logs from containers sometimes not visible in logstash.

I've chatted with @JMeybohm about this. Since a seemingly-related fix (https://github.com/rsyslog/rsyslog/pull/5012/commits/e8ac82e09f930bf99421cc323c24a9dbf215f9da), plus potentially other fixes, is present in the Debian testing repos, I've backported rsyslog to bullseye as 8.2404.0-1~bpo11+1. It is a straight backport (no changes needed), although there's a flaky imfile test; I haven't verified whether it is equally flaky on sid.

Fri, Apr 12, 12:11 PM · Patch-For-Review, Observability-Logging, serviceops
fgiunchedi added a comment to T362239: Reformat IRC alerts to be more useful.

I'm +1 on moving resolved / firing to the beginning and seeing what the feedback is.

Fri, Apr 12, 10:48 AM · Patch-For-Review, Observability-Alerting
fgiunchedi created T362387: Clean up logstash7 consumer groups for mediawiki.httpd.accesslog.
Fri, Apr 12, 8:16 AM · Observability-Logging

Thu, Apr 11

fgiunchedi added a comment to T349521: Prometheus/Pyrra: establish backfill process for recording rules.

Following up from a chat yesterday:

Thu, Apr 11, 9:36 AM · Patch-For-Review, User-herron, Observability-Metrics
fgiunchedi committed rLPRI0794ac5576fd: add opensearch dashboards secrets.
Thu, Apr 11, 7:18 AM

Wed, Apr 10

fgiunchedi added a project to T246998: Enable SSO for Kibana: SRE Observability (FY2023/2024-Q4).
Wed, Apr 10, 2:45 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE
fgiunchedi created T362230: Skip black color for wikibugs task updates.
Wed, Apr 10, 1:32 PM · Wikibugs
fgiunchedi updated the task description for T360414: Phase out cergen for Observability services.
Wed, Apr 10, 9:26 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE

Tue, Apr 9

fgiunchedi added a comment to T355963: Gather feedback from users of the 'unmanaged' debian-12.0-nopuppet image.

<long pause> Thanks for the feedback!

Tue, Apr 9, 9:59 AM · Patch-For-Review, cloud-services-team, Cloud-VPS
fgiunchedi added a comment to T360414: Phase out cergen for Observability services.

Indeed, I think grafana_labs.certs.yaml as a whole can be ditched

Tue, Apr 9, 9:51 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE

Mon, Apr 8

fgiunchedi added a comment to T357333: SystemdUnitFailed alerts are too noisy for data-persistence.

The configuration hasn't changed, though we did upgrade to Bookworm and together with that came a new version of Alertmanager, thus it might be a regression

Mon, Apr 8, 8:26 AM · Data-Persistence, Observability-Alerting
fgiunchedi reopened T351927: Decide and tweak Thanos retention as "Open".

Yes we'll be trimming the retention more this week @MatthewVernon

Mon, Apr 8, 8:25 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

Fri, Apr 5

fgiunchedi added a comment to T361229: titan200[12] RAM/SSD upgrade coordination.

Thank you @Jhancock.wm @herron !

Fri, Apr 5, 9:28 AM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw

Thu, Apr 4

fgiunchedi added a comment to T361706: 2024-04-03 calico/typha down.

Does this need to be private?

Thu, Apr 4, 8:18 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
fgiunchedi changed the visibility for T361706: 2024-04-03 calico/typha down.
Thu, Apr 4, 8:17 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
fgiunchedi removed a project from T361706: 2024-04-03 calico/typha down: WMF-NDA.
Thu, Apr 4, 8:16 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident

Wed, Apr 3

fgiunchedi created T361706: 2024-04-03 calico/typha down.
Wed, Apr 3, 1:50 PM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
fgiunchedi removed a project from T357747: Capacity planning/estimation for Thanos: SRE Observability (FY2023/2024-Q4).

Moving off Q4 board since we have hw in capex spreadsheet and it'll be coming next FY

Wed, Apr 3, 12:59 PM · SRE-swift-storage, Observability-Metrics
fgiunchedi closed T351927: Decide and tweak Thanos retention as Resolved.

Resolving this since capacity is under control now and we have more coming next FY as per T357747: Capacity planning/estimation for Thanos

Wed, Apr 3, 12:58 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi edited projects for T349626: Migrate SRE repositories to GitLab - operations/alerts, added: Observability-Alerting; removed SRE Observability (FY2023/2024-Q4).

I'm moving this off this Q board since I believe it'll happen further down the line

Wed, Apr 3, 11:53 AM · Observability-Alerting, Patch-For-Review, GitLab (Project Migration), collaboration-services
fgiunchedi closed T354399: Prometheus @ k8s OOM loop as Resolved.

I haven't observed OOMs related to the WAL when doing a Prometheus rolling restart today, so I'm optimistically resolving the task; it can be reopened if things change

Wed, Apr 3, 10:00 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi closed T360537: Bump prometheus instances allocated space as Resolved.

This is done, available space for Prometheus data has been expanded

Wed, Apr 3, 10:00 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi closed T358647: hieradata for syslog/centralserver should use hash instead of array for env_variables as Resolved.

Done by @Fabfur, thank you!

Wed, Apr 3, 8:08 AM · SRE Observability (FY2023/2024-Q3), observability
fgiunchedi added a comment to T351698: Linting problems found for NovafullstackSustainedFailures.

Indeed, I don't see the alert at https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DAlertLintProblem

Wed, Apr 3, 8:04 AM · Cloud-VPS, cloud-services-team

Tue, Apr 2

fgiunchedi updated the task description for T350192: On-call batphone escalation configuration holidays FY2023-24.
Tue, Apr 2, 1:58 PM · SRE Observability (FY2023/2024-Q4)
fgiunchedi added a comment to T361566: Request creation of o11y VPS project to replace monitoring.

Just a note that we now have https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cloud_VPS_servers_from_the_internet#wmcloud.org_zone_delegations - depending on your use case could that replace the floating IP?

Thank you, to clarify what I need is *.o11y.wmcloud.org (HTTPS only) to be answered by an instance/backend. If the generic proxy can do also zone delegation then I'm all for it! Last I checked this wasn't possible, hence the floating IP, though things might have changed

That's indeed supported too, I clarified the docs.

Tue, Apr 2, 1:52 PM · Cloud-VPS (Project-requests)
fgiunchedi added a comment to T361566: Request creation of o11y VPS project to replace monitoring.

Thank you folks for the quick action on this! Appreciate it

Tue, Apr 2, 1:17 PM · Cloud-VPS (Project-requests)
fgiunchedi created T361566: Request creation of o11y VPS project to replace monitoring.
Tue, Apr 2, 9:22 AM · Cloud-VPS (Project-requests)

Fri, Mar 29

fgiunchedi closed T344954: Configure Jaeger to follow dot-delimited daily index date convention as Resolved.

This is done, we're using . as separator

Fri, Mar 29, 10:43 AM · Observability-Tracing
fgiunchedi added a project to T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver: User-fgiunchedi.
Fri, Mar 29, 10:32 AM · User-fgiunchedi, Patch-For-Review, Pontoon
fgiunchedi closed T351179: LVM vg0 close to getting full on prometheus eqiad as Resolved.

This is done, we have more space for prometheus

Fri, Mar 29, 10:31 AM · SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi closed T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 as Resolved.

I pushed forward with this to be in a stable/known state ASAP, i.e. alert1001 and alert2001 are both on puppet 7 now and catalogs compile successfully

Fri, Mar 29, 9:42 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE
fgiunchedi closed T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.
Fri, Mar 29, 9:40 AM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T350192: On-call batphone escalation configuration holidays FY2023-24.

Set batphone for today until Monday COB

Fri, Mar 29, 8:54 AM · SRE Observability (FY2023/2024-Q4)

Thu, Mar 28

fgiunchedi reassigned T361229: titan200[12] RAM/SSD upgrade coordination from fgiunchedi to herron.

Thank you @RobH, I've coordinated with @herron and he'll be helping with this

Thu, Mar 28, 4:27 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw
fgiunchedi added a comment to T358506: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7.

I've been working on debugging this too, here's my understanding:

  • naggen2 is used to generate icinga configuration for nagios_host and nagios_service exported resources, runs as a generator on puppet master/server
  • naggen2 reads /etc/puppet/puppetdb.conf to discover the puppetdb url
  • on puppetmaster naggen2 works because the puppetdb url points to port 8443, where ssl cert validation is optional
  • this is not the case on puppetserver, thus naggen2 can't query puppetdb
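
The discovery step in the list above can be sketched roughly as follows. This is a minimal illustration, not naggen2's actual code: it assumes puppetdb.conf is an INI-style file with a `server_urls` key under `[main]`, and the function name and sample hostname are hypothetical.

```python
import configparser


def discover_puppetdb_url(conf_text: str) -> str:
    """Return the first PuppetDB URL found in puppetdb.conf-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Assumption: URLs are listed in [main] server_urls, comma-separated.
    urls = parser.get("main", "server_urls")
    return urls.split(",")[0].strip()


# Hypothetical sample content; the hostname is a placeholder.
SAMPLE_CONF = """
[main]
server_urls = https://puppetdb.example.org:8443
"""

print(discover_puppetdb_url(SAMPLE_CONF))
```

Whether a query against the discovered URL succeeds then depends on how the endpoint behind that port handles certificate validation, which is the difference between puppetmaster and puppetserver described above.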
Thu, Mar 28, 9:20 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2023/2024-Q3), SRE

Wed, Mar 27

fgiunchedi raised the priority of T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag from Low to Medium.
Wed, Mar 27, 2:28 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging
fgiunchedi added a project to T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag: SRE Observability (FY2023/2024-Q4).
Wed, Mar 27, 2:28 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging
fgiunchedi updated subscribers of T360862: Degraded RAID on centrallog1002.

Also cc @VRiley-WMF if you could help with this? thank you!

Wed, Mar 27, 10:58 AM · SRE, ops-eqiad

Tue, Mar 26

fgiunchedi edited projects for T288622: All Prometheus based alerts move from Icinga to alert manager exclusively, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4)
fgiunchedi edited projects for T302373: Upgrade prometheus-statsd-exporter, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Observability-Metrics
fgiunchedi edited projects for T350694: Infrastructure Foundation Alerts to migrate, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), Patch-For-Review, Infrastructure-Foundations, Observability-Alerting
fgiunchedi edited projects for T351710: ossl rsyslog errors post-migration, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Patch-For-Review, Cloud-VPS, SRE, observability
fgiunchedi edited projects for T349626: Migrate SRE repositories to GitLab - operations/alerts, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · Observability-Alerting, Patch-For-Review, GitLab (Project Migration), collaboration-services
fgiunchedi edited projects for T343529: Prometheus doesn't reload or alert on expired client certificates, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), Prod-Kubernetes, Observability-Metrics, User-fgiunchedi, Kubernetes, serviceops-radar
fgiunchedi edited projects for T321808: Port most/all Icinga checks to Prometheus/Alertmanager, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), Observability-Alerting
fgiunchedi edited projects for T351179: LVM vg0 close to getting full on prometheus eqiad, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi edited projects for T353457: Karma UI shows duplicate alerts, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), cloud-services-team, Observability-Alerting
fgiunchedi edited projects for T356788: thanos-query probedown due to OOM of both eqiad titan frontends, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Sustainability (Incident Followup), SRE, observability
fgiunchedi edited projects for T354255: Alert in need of triage: AlertLintProblem (instance localhost:9123), added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), sre-alert-triage
fgiunchedi edited projects for T357747: Capacity planning/estimation for Thanos, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE-swift-storage, Observability-Metrics
fgiunchedi edited projects for T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops, added: SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3).
Tue, Mar 26, 2:57 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics
fgiunchedi closed T359198: Icinga BFD check failing as Resolved.

This is fixed, I've undone my symlink bandaid. I've also reported the issue at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1067768

Tue, Mar 26, 1:54 PM · SRE Observability (FY2023/2024-Q3), Patch-For-Review, netops, SRE
fgiunchedi closed T359198: Icinga BFD check failing, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.
Tue, Mar 26, 1:52 PM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi moved T360537: Bump prometheus instances allocated space from Backlog to Doing on the User-fgiunchedi board.
Tue, Mar 26, 11:14 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi moved T359633: Strategy for Envoy metrics and Prometheus from Backlog to Doing on the User-fgiunchedi board.
Tue, Mar 26, 11:13 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
fgiunchedi moved T354399: Prometheus @ k8s OOM loop from Backlog to Doing on the User-fgiunchedi board.
Tue, Mar 26, 11:13 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a project to T354399: Prometheus @ k8s OOM loop: User-fgiunchedi.
Tue, Mar 26, 11:13 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a project to T359633: Strategy for Envoy metrics and Prometheus: User-fgiunchedi.
Tue, Mar 26, 11:13 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s

Mon, Mar 25

fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Good news and bad news, in the sense that I can't reproduce the OOM in prometheus k8s in codfw. I suspect my fix at https://gerrit.wikimedia.org/r/1013515 to fetch fewer Envoy metrics significantly reduced load, and thus replaying the WAL doesn't pose a memory problem anymore.

Mon, Mar 25, 1:43 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Thank you @andrea.denisse for the suggestions! Let's indeed discuss further what the best options are going forward

Mon, Mar 25, 1:04 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T359633: Strategy for Envoy metrics and Prometheus.

Promising results, samples/s in eqiad went from ~200k/s to ~110k/s after the change (and slightly increasing)

Mon, Mar 25, 9:53 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
fgiunchedi updated subscribers of T360862: Degraded RAID on centrallog1002.

@Jclark-ctr it looks like one of the new SSDs from {T359452} isn't happy, I've located the drive so it should be blinking; could we replace it ASAP? please ping me on IRC when you can, thank you !

Mon, Mar 25, 8:20 AM · SRE, ops-eqiad

Fri, Mar 22

fgiunchedi added a comment to T360703: Replace or remove Debian Buster VMs in 'monitoring' cloud-vps project.

Thank you for the heads up; for context I'm working on T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver which will enable us to rebuild the whole o11y stack with Bullseye/Bookworm VMs

Fri, Mar 22, 8:52 AM · Cloud-VPS (Debian Buster Deprecation), cloud-services-team

Thu, Mar 21

fgiunchedi added a project to T360537: Bump prometheus instances allocated space: User-fgiunchedi.
Thu, Mar 21, 2:50 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Thank you @andrea.denisse for taking a look!

Thu, Mar 21, 11:42 AM · User-fgiunchedi, Observability-Metrics

Wed, Mar 20

fgiunchedi created T360537: Bump prometheus instances allocated space.
Wed, Mar 20, 3:51 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
fgiunchedi closed T359631: install (2) 1.92TB SSDs from decom into prometheus200[56] as Resolved.

This is done, thank you @Papaul

Wed, Mar 20, 3:49 PM · ops-codfw, SRE
fgiunchedi added a comment to T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.

We (o11y) have brainstormed this issue a little at the offsite, and one partial solution would be to get a Prometheus instance dedicated to mw, to at least contain the blast radius.

We'll have to brainstorm a little more, though even with moderately-sized histograms I can see statsd-exporter per-pod not being manageable when we're talking big histograms and hundreds of pods

Wed, Mar 20, 1:27 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics
fgiunchedi renamed T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag from apache2 cpu-stuck on logstash1032 causes kafka logging lag to apache2 cpu-stuck on logstash hosts causes kafka logging lag.
Wed, Mar 20, 11:14 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging

Tue, Mar 19

fgiunchedi updated subscribers of T359632: install (2) 1.92TB SSDs from decom into prometheus100[56].

@Jclark-ctr @VRiley-WMF please ping me on irc when you get on site tomorrow and we can coordinate, I'll be around, thank you!

Tue, Mar 19, 3:51 PM · ops-eqiad, SRE, procurement
fgiunchedi added a comment to T359631: install (2) 1.92TB SSDs from decom into prometheus200[56].

Thank you @Jhancock.wm ! how's tomorrow at 16 UTC for you? we'll be doing both hosts one at a time, and just to confirm: the drives are hot swap (?)

Tue, Mar 19, 3:48 PM · ops-codfw, SRE
fgiunchedi created T360444: Validate thanos/prometheus rules in puppet CI.
Tue, Mar 19, 3:19 PM · Patch-For-Review, Observability-Metrics
fgiunchedi added a comment to T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.

Can we disable host-level instance for MediaWiki's statsd exporter? (Or substitute with a constant?) I believe that would save 100x or 2 orders of magnitude. I can't imagine that ever being relevant for service/domain-specific stats from the MediaWiki application. I imagine of the hypothetical use cases that we don't yet have today, 99% would be covered by site="codfw", if we keep that.

Tue, Mar 19, 2:28 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics
fgiunchedi added a comment to T359497: StatsD Exporter: gracefully handle metric signature changes.

Good point re: statsd_exporter_events_conflict_total. Looking at a mw-on-k8s world, I think linking the statsd-exporter lifecycle to mw seems the easiest? Which also raises the question: maybe it already happens during mw deployments as pods are cycled?

Tue, Mar 19, 1:57 PM · Observability-Metrics
fgiunchedi created T360433: Thumbor statsd-exporter metrics conflicts.
Tue, Mar 19, 1:57 PM · Thumbor
fgiunchedi added a comment to T355837: Add Prometheus support to statsd.js via mw.track().

For awareness, see also https://phabricator.wikimedia.org/T359178#9640223 re: statsv in the context of varnishkafka deprecation/removal.

Tue, Mar 19, 1:35 PM · Grafana, MediaWiki-Platform-Team (Radar), MediaWiki-extensions-WikimediaEvents, Observability-Metrics

Mar 8 2024

fgiunchedi added a comment to T326322: Add per-output queue monitoring for Juniper network devices.

Yeah having some ballpark numbers will be a great help @cmooney, unless we're talking hundreds of thousands more metrics than we have now I think we're good to go, tens of thousands we can do without much effort/resources

Mar 8 2024, 4:26 PM · Patch-For-Review, SRE, Infrastructure-Foundations, netops
fgiunchedi created T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.
Mar 8 2024, 3:53 PM · SRE Observability (FY2023/2024-Q4), MediaWiki-Platform-Team (Radar), Observability-Metrics
fgiunchedi added a comment to T359633: Strategy for Envoy metrics and Prometheus.

Ah yes indeed, thank you @JMeybohm !

Mar 8 2024, 2:29 PM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
fgiunchedi created T359633: Strategy for Envoy metrics and Prometheus.
Mar 8 2024, 2:09 PM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Indeed the WAL grew quite fast (faster than I expected anyways) as the mw-on-k8s migration progressed (we're at ~50% now)

Mar 8 2024, 1:39 PM · User-fgiunchedi, Observability-Metrics

Mar 6 2024

fgiunchedi closed T359292: ircecho doesn't attempt to open log files created after startup as Resolved.

Calling this done, albeit with a hack

Mar 6 2024, 2:55 PM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi closed T359292: ircecho doesn't attempt to open log files created after startup, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.
Mar 6 2024, 2:53 PM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T359292: ircecho doesn't attempt to open log files created after startup.

Logs from ircecho.service

Mar 6 2024, 1:24 PM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).

Thank you @LSobanski ! Those are known, I've silenced the alerts for now, leaving the task open as a reminder

Mar 6 2024, 1:21 PM · SRE Observability, sre-alert-triage
fgiunchedi created T359292: ircecho doesn't attempt to open log files created after startup.
Mar 6 2024, 9:23 AM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi closed T359153: statsv metrics are both prometheus ops and ext as Resolved.

All good! Thank you @colewhite for the merge

Mar 6 2024, 9:00 AM · Observability-Metrics

Mar 5 2024

fgiunchedi added a comment to T333615: Upgrade alert* hosts to Bookworm.

Something else that didn't work well: the current version of ircecho doesn't seem to attempt reopening the files it is supposed to look for in /var/log/icinga. I have "fixed" this by creating said .log files and then restarting ircecho, which then did properly open/tail the files

Mar 5 2024, 5:31 PM · Patch-For-Review, SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T359198: Icinga BFD check failing.

I've bandaided the issue on alert2001, we'll need a more proper fix:

Mar 5 2024, 5:28 PM · SRE Observability (FY2023/2024-Q3), Patch-For-Review, netops, SRE
fgiunchedi added a comment to T355837: Add Prometheus support to statsd.js via mw.track().

Thank you for the detailed write up on this @Krinkle ! See below for my take:

Mar 5 2024, 12:02 PM · Grafana, MediaWiki-Platform-Team (Radar), MediaWiki-extensions-WikimediaEvents, Observability-Metrics
fgiunchedi created T359153: statsv metrics are both prometheus ops and ext.
Mar 5 2024, 11:18 AM · Observability-Metrics
fgiunchedi changed the status of T359068: Not enough space on titan2001 for thanos-compact from Open to Stalled.

Stalling until thanos-compact finishes its cycle, and we can assess how much space is used too

Mar 5 2024, 10:25 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi renamed T359068: Not enough space on titan2001 for thanos-compact from Not enough space on titan hosts for thanos-compact to Not enough space on titan2001 for thanos-compact.
Mar 5 2024, 10:23 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T359068: Not enough space on titan2001 for thanos-compact.

With the new 1.6TB disk in place we have ~2.2TB of raid0, which is great. This is fine for short/medium term, not long term because it means thanos-compact is able to complete a cycle only on titan2001 now. We'll get the other hosts in line in terms of space soon though (next FY or this FY is TBD)

Mar 5 2024, 9:46 AM · User-fgiunchedi, Observability-Metrics