fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (17)

User Details

User Since
Oct 3 2014, 8:06 AM (493 w, 3 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
FGiunchedi (WMF) [ Global Accounts ]

Recent Activity

Fri, Mar 8

fgiunchedi added a comment to T326322: Add per-output queue monitoring for Juniper network devices.

Yeah, having some ballpark numbers will be a great help @cmooney. Unless we're talking hundreds of thousands more metrics than we have now, I think we're good to go; tens of thousands we can do without much effort/resources

Fri, Mar 8, 4:26 PM · Patch-For-Review, SRE, Infrastructure-Foundations, netops
fgiunchedi created T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.
Fri, Mar 8, 3:53 PM · MediaWiki-Platform-Team (Radar), SRE Observability (FY2023/2024-Q3), Observability-Metrics
fgiunchedi added a comment to T359633: Strategy for Envoy metrics and Prometheus.

Ah yes indeed, thank you @JMeybohm !

Fri, Mar 8, 2:29 PM · Observability-Metrics, MW-on-K8s
fgiunchedi created T359633: Strategy for Envoy metrics and Prometheus.
Fri, Mar 8, 2:09 PM · Observability-Metrics, MW-on-K8s
fgiunchedi added a comment to T354399: Prometheus @ k8s OOM loop.

Indeed the WAL grew quite fast (faster than I expected anyways) as the mw-on-k8s migration progressed (we're at ~50% now)

Fri, Mar 8, 1:39 PM · Observability-Metrics

Wed, Mar 6

fgiunchedi closed T359292: ircecho doesn't attempt to open log files created after startup as Resolved.

Calling this done, albeit with a hack

Wed, Mar 6, 2:55 PM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi closed T359292: ircecho doesn't attempt to open log files created after startup, a subtask of T333615: Upgrade alert* hosts to Bookworm, as Resolved.
Wed, Mar 6, 2:53 PM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T359292: ircecho doesn't attempt to open log files created after startup.

Logs from ircecho.service

Wed, Mar 6, 1:24 PM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).

Thank you @LSobanski ! Those are known, I've silenced the alerts for now, leaving the task open as a reminder

Wed, Mar 6, 1:21 PM · SRE Observability, sre-alert-triage
fgiunchedi created T359292: ircecho doesn't attempt to open log files created after startup.
Wed, Mar 6, 9:23 AM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi closed T359153: statsv metrics are both prometheus ops and ext as Resolved.

All good! Thank you @colewhite for the merge

Wed, Mar 6, 9:00 AM · Observability-Metrics

Tue, Mar 5

fgiunchedi added a comment to T333615: Upgrade alert* hosts to Bookworm.

Something else that didn't work well: the current version of ircecho doesn't seem to attempt reopening the files it is supposed to look for in /var/log/icinga. I have "fixed" this by creating said .log files and then restarting ircecho, which then did properly open/tail the files

Tue, Mar 5, 5:31 PM · SRE, SRE Observability (FY2023/2024-Q3)
fgiunchedi added a comment to T359198: Icinga BFD check failing.

I've bandaided the issue on alert2001, we'll need a more proper fix:

Tue, Mar 5, 5:28 PM · SRE Observability (FY2023/2024-Q3), Patch-For-Review, netops, SRE
fgiunchedi added a comment to T355837: Add Prometheus support to statsd.js via mw.track().

Thank you for the detailed write up on this @Krinkle ! See below for my take:

Tue, Mar 5, 12:02 PM · Grafana, MediaWiki-Platform-Team (Radar), MediaWiki-extensions-WikimediaEvents, Observability-Metrics
fgiunchedi created T359153: statsv metrics are both prometheus ops and ext.
Tue, Mar 5, 11:18 AM · Observability-Metrics
fgiunchedi changed the status of T359068: Not enough space on titan2001 for thanos-compact from Open to Stalled.

Stalling until thanos-compact finishes its cycle, and we can assess how much space is used too

Tue, Mar 5, 10:25 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi renamed T359068: Not enough space on titan2001 for thanos-compact from Not enough space on titan hosts for thanos-compact to Not enough space on titan2001 for thanos-compact.
Tue, Mar 5, 10:23 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T359068: Not enough space on titan2001 for thanos-compact.

With the new 1.6TB disk in place we have ~2.2TB of raid0, which is great. This is fine for short/medium term, not long term because it means thanos-compact is able to complete a cycle only on titan2001 now. We'll get the other hosts in line in terms of space soon though (next FY or this FY is TBD)

Tue, Mar 5, 9:46 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi closed T359070: Spare SSDs for titan2001 ? as Resolved.

Brilliant, thank you very much @Jhancock.wm !

Tue, Mar 5, 8:47 AM · SRE, ops-codfw
fgiunchedi closed T359070: Spare SSDs for titan2001 ?, a subtask of T359068: Not enough space on titan2001 for thanos-compact, as Resolved.
Tue, Mar 5, 8:46 AM · User-fgiunchedi, Observability-Metrics

Mon, Mar 4

fgiunchedi added a comment to T359070: Spare SSDs for titan2001 ?.

Thank you @Jhancock.wm ! I'd like to go for the 1x 1.6TB SSD please to be added to the existing SSDs in titan2001

Mon, Mar 4, 4:25 PM · SRE, ops-codfw
fgiunchedi added a subtask for T359068: Not enough space on titan2001 for thanos-compact: T359070: Spare SSDs for titan2001 ?.
Mon, Mar 4, 4:01 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a parent task for T359070: Spare SSDs for titan2001 ?: T359068: Not enough space on titan2001 for thanos-compact.
Mon, Mar 4, 4:01 PM · SRE, ops-codfw
fgiunchedi created T359070: Spare SSDs for titan2001 ?.
Mon, Mar 4, 4:01 PM · SRE, ops-codfw
fgiunchedi created T359068: Not enough space on titan2001 for thanos-compact.
Mon, Mar 4, 3:57 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi closed T263720: Notification spam from "last puppet run" upon re-enabling puppet as Resolved.

Optimistically resolving since we've moved to prometheus-based alerts for puppet failures, which do aggregate and should DTRT in this case too

Mon, Mar 4, 1:50 PM · Puppet-Infrastructure, Observability-Alerting, SRE, Puppet
fgiunchedi closed T264016: Host page did not auto-resolve in VO as Resolved.

I'm tentatively resolving this since I believe we didn't see new occurrences

Mon, Mar 4, 1:41 PM · User-fgiunchedi, Observability-Alerting

Fri, Mar 1

fgiunchedi created T358870: kafka-logging broker certs about to expire.
Fri, Mar 1, 2:32 PM · Observability-Logging
fgiunchedi closed T355758: Create "corto" Phabricator bot account for Corto as Resolved.

bot is indeed working -- thanks again @brennen

Fri, Mar 1, 1:49 PM · Release-Engineering-Team (Now this 🫠), User-brennen, Phabricator-Bot-Requests, Incident Tooling
fgiunchedi closed T355758: Create "corto" Phabricator bot account for Corto, a subtask of T356790: introducing corto internal incident response workflow automation, as Resolved.
Fri, Mar 1, 1:48 PM · Incident Tooling, SRE-OnFire
fgiunchedi created T358849: Enable Jaeger SPM functionality.
Fri, Mar 1, 11:05 AM · Observability-Tracing
fgiunchedi added a comment to T358051: Many errors from Grafana with alertname=DatasourceError team=qte 2024-02-20.

Hey @Peter, I checked the apache logs on grafana1002 and couldn't find anything relevant on the 22nd; however we (o11y) recommend turning off datasourceerror notifications for alerts, see also the full rationale and instructions at https://wikitech.wikimedia.org/wiki/Grafana#DatasourceError_notification_spam (not sure if you came across this already tho)

Fri, Mar 1, 10:18 AM · Quality-and-Test-Engineering-Team
fgiunchedi added a comment to T358648: SystemdUnitFailed alert aggregation issues.

Thank you for the report, in general I agree we should be aggregating on the unit name itself and that would make the alert more clear; to achieve this we can change the grouping logic when routing alerts, I'll take a stab at it next week

Fri, Mar 1, 9:35 AM · SRE, Observability-Alerting
fgiunchedi added a comment to T358838: prometheus-icinga-am.service Fails to Start on alert2001.

The issue rang a bell, and indeed we've fixed the issue in https://gerrit.wikimedia.org/r/c/operations/puppet/+/981407 although on the standby host the override file with the fix is never deployed because icinga-am is set to not run (and rightfully so).

Fri, Mar 1, 9:32 AM · SRE, SRE Observability (FY2023/2024-Q3)

Thu, Feb 29

fgiunchedi added a comment to T355795: Fix "requests triggering circuit breakers" Elastic alert.

For reference, the full list of search-related graphite alerts:

Thu, Feb 29, 3:10 PM · Data-Platform-SRE (2024.03.04 - 2024.03.24), Observability-Alerting

Wed, Feb 28

fgiunchedi added a comment to T358626: Review of log level settings for prometheus-blackbox-exporter and thanos-query.

Thank you @andrea.denisse for filing the task! I'm thinking of reverting the thanos debug logging in T356788: thanos-query probedown due to OOM of both eqiad titan frontends since we have a better idea of problematic queries now.

Wed, Feb 28, 2:07 PM · SRE Observability
fgiunchedi added a comment to T358626: Review of log level settings for prometheus-blackbox-exporter and thanos-query.

Thank you @andrea.denisse for filing the task! I'm thinking of reverting the thanos debug logging in T356788: thanos-query probedown due to OOM of both eqiad titan frontends since we have a better idea of problematic queries now. blackbox-exporter logs though will need to stay at debug level since they are used for debugging alerts themselves (e.g. ProbeDown has a link to the blackbox-exporter logs in logstash)

Wed, Feb 28, 2:02 PM · SRE Observability
fgiunchedi created P58067 maniphest.search vs maniphest.query.
Wed, Feb 28, 1:37 PM

Fri, Feb 23

fgiunchedi added a comment to T356788: thanos-query probedown due to OOM of both eqiad titan frontends.

This happened again today, recovery was better in the sense that titan hosts themselves remained available, the OOM kicked in and things recovered without intervention. The page still went out as probes failed though.

Fri, Feb 23, 10:58 AM · SRE Observability (FY2023/2024-Q3), Sustainability (Incident Followup), SRE, observability
fgiunchedi closed T358317: Set maxmessagesize for rsyslog-receiver as Resolved.

And we're done

Fri, Feb 23, 10:52 AM · Observability-Logging
fgiunchedi created T358317: Set maxmessagesize for rsyslog-receiver.
Fri, Feb 23, 9:55 AM · Observability-Logging
fgiunchedi claimed T355758: Create "corto" Phabricator bot account for Corto.

Fantastic, thank you @brennen ! I'll take the task and resolve once the bot is confirmed working

Fri, Feb 23, 8:15 AM · Release-Engineering-Team (Now this 🫠), User-brennen, Phabricator-Bot-Requests, Incident Tooling
fgiunchedi closed T357893: PuppetZeroResources alert spams IRC on puppetserver failures as Resolved.

I'm more confident in resolving the task now, since WidespreadPuppetFailure has been fixed and will alert on >= 3% failure rate in a given site (either agent failed, or no resources)

Fri, Feb 23, 8:14 AM · Infrastructure-Foundations, Observability-Metrics
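
The >= 3% per-site threshold mentioned in the comment above can be illustrated with a minimal sketch. This only shows the arithmetic; the actual alert is Prometheus-based, and the `HostRun` record, its field names, and the `failing_sites` helper are hypothetical names made up for this example.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 0.03  # the ">= 3%" figure from the comment above


@dataclass
class HostRun:
    """Hypothetical record of the last Puppet run on one host."""
    site: str
    agent_failed: bool
    resources: int


def failing_sites(runs, threshold=FAILURE_THRESHOLD):
    """Return {site: failure_rate} for sites at or above the threshold.

    A host counts as failed if the agent failed or it applied no resources.
    """
    totals, bad = {}, {}
    for run in runs:
        totals[run.site] = totals.get(run.site, 0) + 1
        if run.agent_failed or run.resources == 0:
            bad[run.site] = bad.get(run.site, 0) + 1
    rates = {site: bad.get(site, 0) / total for site, total in totals.items()}
    return {site: rate for site, rate in rates.items() if rate >= threshold}
```

With 100 hosts per site, five failures in one site would cross the threshold while a single failure in another would not.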

Thu, Feb 22

fgiunchedi merged T358086: Slow load times for trace.w.o into T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow .
Thu, Feb 22, 1:40 PM · Observability-Tracing
fgiunchedi merged task T358086: Slow load times for trace.w.o into T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow .
Thu, Feb 22, 1:40 PM · Observability-Tracing
fgiunchedi reopened T357893: PuppetZeroResources alert spams IRC on puppetserver failures as "Open".

Turns out I was too hasty here, WidespreadPuppetFailure should have fired and it didn't, reopening this and I'll investigate here

Thu, Feb 22, 10:17 AM · Infrastructure-Foundations, Observability-Metrics
fgiunchedi added a comment to T357747: Capacity planning/estimation for Thanos.

I think the proposed table should look like this?

# weeks   GBs      resolution
60        ~62000   0s
60        ~50000   5m
280       ~43000   1h

I.e. 60W (as per text, a bit over a year), not 50W as you currently have? My back-of-an-envelope calculation has the GBs figures about right, though, so I don't think it changes the thrust of your argument.

Thu, Feb 22, 9:49 AM · SRE-swift-storage, SRE Observability (FY2023/2024-Q3), Observability-Metrics
fgiunchedi updated the task description for T357747: Capacity planning/estimation for Thanos.
Thu, Feb 22, 9:47 AM · SRE-swift-storage, SRE Observability (FY2023/2024-Q3), Observability-Metrics
fgiunchedi closed T357893: PuppetZeroResources alert spams IRC on puppetserver failures as Resolved.

I'm optimistically calling this resolved as there won't be critical notification spam going forward

Thu, Feb 22, 9:41 AM · Infrastructure-Foundations, Observability-Metrics

Wed, Feb 21

fgiunchedi created T358086: Slow load times for trace.w.o.
Wed, Feb 21, 10:02 AM · Observability-Tracing
fgiunchedi closed T320555: cas-sso idp for jaeger-ui on k8s, a subtask of T320554: Deploy and run Jaeger on the aux cluster, as Resolved.
Wed, Feb 21, 9:16 AM · Observability-Tracing
fgiunchedi closed T320555: cas-sso idp for jaeger-ui on k8s as Resolved.

Calling this done since https://trace.wikimedia.org now is a thing, thank you all involved @akosiaris @CDanis @taavi !

Wed, Feb 21, 9:16 AM · User-fgiunchedi, Observability-Tracing
fgiunchedi updated subscribers of T357747: Capacity planning/estimation for Thanos.

cc @MatthewVernon and SRE-swift-storage for your input re: capacity planning and hardware needs for thanos-be, let me know what you think!

Wed, Feb 21, 8:54 AM · SRE-swift-storage, SRE Observability (FY2023/2024-Q3), Observability-Metrics
fgiunchedi updated the task description for T357747: Capacity planning/estimation for Thanos.
Wed, Feb 21, 8:53 AM · SRE-swift-storage, SRE Observability (FY2023/2024-Q3), Observability-Metrics
fgiunchedi updated the task description for T357747: Capacity planning/estimation for Thanos.
Wed, Feb 21, 8:51 AM · SRE-swift-storage, SRE Observability (FY2023/2024-Q3), Observability-Metrics
fgiunchedi updated the task description for T357747: Capacity planning/estimation for Thanos.
Wed, Feb 21, 8:47 AM · SRE-swift-storage, SRE Observability (FY2023/2024-Q3), Observability-Metrics
fgiunchedi updated the task description for T357747: Capacity planning/estimation for Thanos.
Wed, Feb 21, 8:43 AM · SRE-swift-storage, SRE Observability (FY2023/2024-Q3), Observability-Metrics

Tue, Feb 20

fgiunchedi added a comment to T355758: Create "corto" Phabricator bot account for Corto.

cheers @thcipriani ! @brennen could you help us with this request? thank you!

Tue, Feb 20, 4:07 PM · Release-Engineering-Team (Now this 🫠), User-brennen, Phabricator-Bot-Requests, Incident Tooling
fgiunchedi added a comment to T320555: cas-sso idp for jaeger-ui on k8s.

Thank you @taavi! That was definitely a problem, which I've verified is fixed now thanks to @CDanis' patch:

Tue, Feb 20, 1:23 PM · User-fgiunchedi, Observability-Tracing

Mon, Feb 19

fgiunchedi renamed T268233: CAS-based services (?) lose the session after an hour from Thanos and Grafana lose the session after an hour to CAS-based services (?) lose the session after an hour.
Mon, Feb 19, 2:22 PM · Infrastructure-Foundations, CAS-SSO, SRE
fgiunchedi moved T351710: ossl rsyslog errors post-migration from Doing to Up next on the User-fgiunchedi board.
Mon, Feb 19, 1:02 PM · SRE Observability (FY2023/2024-Q3), User-fgiunchedi, Patch-For-Review, Cloud-VPS, SRE, observability
fgiunchedi created T357893: PuppetZeroResources alert spams IRC on puppetserver failures.
Mon, Feb 19, 11:15 AM · Infrastructure-Foundations, Observability-Metrics

Feb 16 2024

fgiunchedi added a comment to T320555: cas-sso idp for jaeger-ui on k8s.

I am digging into ingressgateway logs and found the following upon issuing the curl above:

Feb 16 2024, 10:37 AM · User-fgiunchedi, Observability-Tracing
fgiunchedi created T357747: Capacity planning/estimation for Thanos.
Feb 16 2024, 7:54 AM · SRE-swift-storage, SRE Observability (FY2023/2024-Q3), Observability-Metrics

Feb 15 2024

fgiunchedi added a comment to T320555: cas-sso idp for jaeger-ui on k8s.

update: I've been poking at ingress/istio after the change above without any luck, current symptom is what looks like a timeout:

Feb 15 2024, 1:58 PM · User-fgiunchedi, Observability-Tracing
fgiunchedi added a comment to T356788: thanos-query probedown due to OOM of both eqiad titan frontends.

Now Thanos services run in their own slice, which should help with enforcing resource limits.

Feb 15 2024, 1:07 PM · SRE Observability (FY2023/2024-Q3), Sustainability (Incident Followup), SRE, observability

Feb 14 2024

fgiunchedi added a comment to T356994: Alertmanager IRC notifications feedback and improvements.

After a little thought I think at the very least we should do the following:

Feb 14 2024, 12:54 PM · Observability-Alerting
fgiunchedi created T357525: Upgrade Grafana to 9.5.
Feb 14 2024, 12:48 PM · Observability-Metrics
fgiunchedi added a comment to T356788: thanos-query probedown due to OOM of both eqiad titan frontends.

Happened again, albeit on titan1001 only, where query-frontend and store were both using CPU and memory, and the host became unresponsive

Feb 14 2024, 10:50 AM · SRE Observability (FY2023/2024-Q3), Sustainability (Incident Followup), SRE, observability

Feb 13 2024

fgiunchedi added a comment to T357333: SystemdUnitFailed alerts are too noisy for data-persistence.

Thank you for reaching out; I generally agree with the rationale, and I'm ok to try a larger repeat_interval for SystemdUnitFailed. I'll send a patch to implement that for any SystemdUnitFailed alert regardless of team, though we can tune it as needed.

Feb 13 2024, 3:32 PM · Data-Persistence, Observability-Alerting
lmata awarded T357384: Simplify Grafana failovers a Like token.
Feb 13 2024, 1:38 PM · SRE Observability (FY2023/2024-Q4), Observability-Metrics
fgiunchedi updated the task description for T332764: Port base host checks from Icinga to Alertmanager.
Feb 13 2024, 1:20 PM · Patch-For-Review, Observability-Alerting
fgiunchedi closed T321810: Move systemd unit status checks to Alertmanager, a subtask of T321808: Port most/all Icinga checks to Prometheus/Alertmanager, as Invalid.
Feb 13 2024, 1:18 PM · SRE Observability (FY2023/2024-Q3), Observability-Alerting
fgiunchedi closed T321810: Move systemd unit status checks to Alertmanager as Invalid.

We've implemented this in other tasks

Feb 13 2024, 1:18 PM · Observability-Alerting
fgiunchedi renamed T357373: Inbound interface errors from Inbound interface errors - asw-c-codfw to Inbound interface errors.
Feb 13 2024, 1:09 PM · SRE, ops-codfw
fgiunchedi merged task T357400: PowerSupplyFailure into T357377: PowerSupplyFailure - mw2389.
Feb 13 2024, 1:09 PM · SRE, ops-codfw
fgiunchedi updated subscribers of T357377: PowerSupplyFailure - mw2389.

@Peachey88 thank you for your help on this, however please don't retitle @phaultfinder tasks as the title is used as a search key and a new task will be created (T357400)

Feb 13 2024, 1:09 PM · SRE, ops-codfw
fgiunchedi merged T357400: PowerSupplyFailure into T357377: PowerSupplyFailure - mw2389.
Feb 13 2024, 1:08 PM · SRE, ops-codfw
fgiunchedi created T357384: Simplify Grafana failovers.
Feb 13 2024, 9:23 AM · SRE Observability (FY2023/2024-Q4), Observability-Metrics

Feb 12 2024

fgiunchedi updated the task description for T352665: Upgrade Grafana hosts to Bookworm.
Feb 12 2024, 1:56 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q3)
ayounsi awarded T356994: Alertmanager IRC notifications feedback and improvements a Love token.
Feb 12 2024, 10:26 AM · Observability-Alerting
fgiunchedi updated the task description for T332764: Port base host checks from Icinga to Alertmanager.
Feb 12 2024, 10:14 AM · Patch-For-Review, Observability-Alerting

Feb 9 2024

fgiunchedi closed T357028: Introduce a way to retry checks for SystemdUnitFailed before alerting as Resolved.

Since we're back to Icinga semantics in terms of waiting before alerting I'm resolving the task!

Feb 9 2024, 2:51 PM · Release-Engineering-Team (Radar), SRE Observability, serviceops
fgiunchedi added a comment to T357028: Introduce a way to retry checks for SystemdUnitFailed before alerting.

The patch above does essentially that, i.e. match SystemdUnitFailed semantics to what we were expecting for Icinga (3 minute leeway)

Feb 9 2024, 1:48 PM · Release-Engineering-Team (Radar), SRE Observability, serviceops
fgiunchedi updated the task description for T357099: Remove check_procs-based Icinga alerts.
Feb 9 2024, 9:47 AM · Observability-Alerting
fgiunchedi created T357099: Remove check_procs-based Icinga alerts.
Feb 9 2024, 9:44 AM · Observability-Alerting
fgiunchedi closed T332709: Multi-team Prometheus/Alertmanager alerts as Resolved.

Resolving this since we have multi-team alerts, both host-based and a pattern to get even finer-grained ownership in case of e.g. systemd units living on the same hosts and owned by different teams

Feb 9 2024, 9:39 AM · Observability-Alerting
fgiunchedi closed T332709: Multi-team Prometheus/Alertmanager alerts, a subtask of T321808: Port most/all Icinga checks to Prometheus/Alertmanager, as Resolved.
Feb 9 2024, 9:39 AM · SRE Observability (FY2023/2024-Q3), Observability-Alerting
fgiunchedi closed T320931: Progress indicator for Icinga -> Alertmanager migration as Resolved.

I'm boldly resolving this: the progress indicator is at this tab https://docs.google.com/spreadsheets/d/19nxCXldb804TJCXGy4Z2BHG_1wRksRnKcPC6sXfjQuM/edit#gid=701141702 namely the "TODO" pivot table that lists all icinga checks we have yet to migrate

Feb 9 2024, 9:37 AM · Observability-Alerting
fgiunchedi closed T320931: Progress indicator for Icinga -> Alertmanager migration, a subtask of T321808: Port most/all Icinga checks to Prometheus/Alertmanager, as Resolved.
Feb 9 2024, 9:37 AM · SRE Observability (FY2023/2024-Q3), Observability-Alerting
fgiunchedi added a comment to T348508: Curator failed to delete indices in codfw.

I've silenced the alert related to this for 60d

Feb 9 2024, 9:08 AM · SRE Observability (FY2023/2024-Q3), Patch-For-Review, Observability-Logging
fgiunchedi added a comment to T356140: blackbox alerts: add instance names to tickets created by alerting.

Yes, this is possible, albeit a bit clunky at the moment. The way we do it for e.g. dcops tasks is to group the alerts on instance and then change the title for the webhook (url-encoded) to create tasks with the instance in the title. This also means you'll be getting a ticket per host; let me know if you'd like assistance with this

Feb 9 2024, 7:46 AM · Observability-Alerting, collaboration-services
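
The url-encoded title mentioned in the comment above could look roughly like the following sketch. The base URL, the `title` query parameter, and the `webhook_url` helper are assumptions for illustration only, not the actual webhook configuration; the "alertname - instance" title shape mirrors task titles such as "PowerSupplyFailure - mw2389" seen elsewhere in this feed.

```python
from urllib.parse import quote


def webhook_url(base_url: str, alertname: str, instance: str) -> str:
    """Build a webhook URL whose title parameter includes the instance.

    base_url and the 'title' parameter name are hypothetical.
    """
    title = f"{alertname} - {instance}"
    # safe="" makes quote() percent-encode ':' and '/' as well as spaces
    return f"{base_url}?title={quote(title, safe='')}"


print(webhook_url("https://webhook.example.org/alerts",
                  "ProbeDown", "centrallog1002:6514"))
```

Because alerts are grouped on instance, each distinct instance yields its own title and therefore its own task.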
fgiunchedi added a comment to T357026: grafana-loki.service Failed on grafana2001.

Thank you for taking a look! I believe this was caused by the fact that loki started on grafana2001 before the sync and wrote its own WAL, all good

Feb 9 2024, 7:36 AM · SRE Observability

Feb 8 2024

fgiunchedi added a comment to T344202: Create VictorOps config for new Data Platform SRE team.

Could someone from the observability team rename the analytics routing key in VictorOps to data-platform please?

Feb 8 2024, 1:47 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03), observability, Observability-Alerting
fgiunchedi closed T337831: Remove specific nrpe::monitor_systemd_unit_state, a subtask of T321808: Port most/all Icinga checks to Prometheus/Alertmanager, as Resolved.
Feb 8 2024, 1:44 PM · SRE Observability (FY2023/2024-Q3), Observability-Alerting
fgiunchedi closed T337831: Remove specific nrpe::monitor_systemd_unit_state as Resolved.

This is completed!

Feb 8 2024, 1:44 PM · Observability-Alerting
fgiunchedi created T356994: Alertmanager IRC notifications feedback and improvements.
Feb 8 2024, 1:34 PM · Observability-Alerting
fgiunchedi added a comment to T356788: thanos-query probedown due to OOM of both eqiad titan frontends.

Current avenues I'm exploring:

  • Tighten the memory limits: thanos-query memory utilization jumps up very fast, and I suspect that in certain cases there isn't enough memory left for the host to still be usable. That is of course a worse scenario than thanos-query being restarted, and takes longer to recover from
  • Add debug logging to thanos-query as @CDanis pointed out
Feb 8 2024, 12:50 PM · SRE Observability (FY2023/2024-Q3), Sustainability (Incident Followup), SRE, observability
fgiunchedi added a comment to T356788: thanos-query probedown due to OOM of both eqiad titan frontends.

Current avenues I'm exploring:

  • Tighten the memory limits: thanos-query memory utilization jumps up very fast, and I suspect that in certain cases there isn't enough memory left for the host to still be usable. That is of course a worse scenario than thanos-query being restarted, and takes longer to recover from
  • Add debug logging to thanos-query as @CDanis pointed out
Feb 8 2024, 9:18 AM · SRE Observability (FY2023/2024-Q3), Sustainability (Incident Followup), SRE, observability

Feb 7 2024

fgiunchedi closed T356787: The label named state on node_systemd_service_restart_total metrics was changed to name as Declined.

Considering the points above (crashloop detection being a new feature, buster being on the way out) I'm declining the task, though feel free to reopen as you see fit!

Feb 7 2024, 2:50 PM · User-fgiunchedi, Observability-Alerting