
tappof (Tiziano Fogli)
User

Projects (12)

User Details

User Since
Jul 23 2024, 9:16 AM (71 w, 4 d)
Availability
Available
IRC Nick
tappof
LDAP User
Tiziano Fogli
MediaWiki User
Tiziano Fogli [ Global Accounts ]

Recent Activity

Yesterday

tappof closed T410745: Strengthen regex for suffix matching in Prometheus::Blackbox::Check::(Http|Icmp|Tcp) generated rules, a subtask of T400074: ProbeDown - wdqs1015, as Resolved.
Fri, Dec 5, 4:49 PM · SRE Observability, collaboration-services
tappof closed T410745: Strengthen regex for suffix matching in Prometheus::Blackbox::Check::(Http|Icmp|Tcp) generated rules as Resolved.
Fri, Dec 5, 4:49 PM · SRE Observability (FY2025/2026-Q2)

Thu, Dec 4

tappof added a comment to T410835: ErrorBudgetBurn.

The gap is related to the revert of the patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184566
and is temporary, since the data is present in the TSDB blocks but is not being served by the Thanos Querier.

Thu, Dec 4, 2:16 PM · Test Kitchen (Experiment Platform Sprint 16)
tappof added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

The trend has changed after the revert.

image.png (1×1 px, 53 KB)

Thu, Dec 4, 10:56 AM · SRE Observability (FY2025/2026-Q2)
tappof created P86396 (An Untitled Masterwork).
Thu, Dec 4, 9:23 AM
tappof added a comment to T410835: ErrorBudgetBurn.

Just adding a note about the start and end dates of the gap.

image.png (1×2 px, 124 KB)

Thu, Dec 4, 6:37 AM · Test Kitchen (Experiment Platform Sprint 16)

Tue, Dec 2

tappof reopened T349521: Prometheus/Pyrra: establish backfill process for recording rules, a subtask of T302995: Transition to Pyrra for SLO Visualization and Management, as Open.
Tue, Dec 2, 2:19 PM · Patch-For-Review, User-herron, Observability-Metrics
tappof reopened T349521: Prometheus/Pyrra: establish backfill process for recording rules as "Open".

Due to the issues described in T410152: Disk space saturation (/srv) on Titan hosts, reverting the patch https://gerrit.wikimedia.org/r/1184566
was necessary.

Tue, Dec 2, 2:19 PM · SRE-SLO, Patch-For-Review, User-herron, Observability-Metrics
tappof added a subtask for T349521: Prometheus/Pyrra: establish backfill process for recording rules: T410152: Disk space saturation (/srv) on Titan hosts.
Tue, Dec 2, 10:20 AM · SRE-SLO, Patch-For-Review, User-herron, Observability-Metrics
tappof added a parent task for T410152: Disk space saturation (/srv) on Titan hosts: T349521: Prometheus/Pyrra: establish backfill process for recording rules.
Tue, Dec 2, 10:20 AM · SRE Observability (FY2025/2026-Q2)

Mon, Dec 1

tappof added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

Another side effect:

image.png (1×951 px, 116 KB)

Mon, Dec 1, 5:33 PM · SRE Observability (FY2025/2026-Q2)

Fri, Nov 28

tappof edited P86115 (An Untitled Masterwork).
Fri, Nov 28, 4:38 PM
tappof edited P86115 (An Untitled Masterwork).
Fri, Nov 28, 4:37 PM
tappof edited P86115 (An Untitled Masterwork).
Fri, Nov 28, 4:21 PM
tappof created P86115 (An Untitled Masterwork).
Fri, Nov 28, 4:21 PM
tappof added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

To avoid a revert on Friday and to be in the driver’s seat during the weekend, 100 GB were added to the VGs on titan1001, titan1002, and titan2002.

image.png (328×1 px, 53 KB)

Fri, Nov 28, 4:10 PM · SRE Observability (FY2025/2026-Q2)
tappof updated the task description for T411273: Thanos (store|query-frontend) memcached cache in bad status.
Fri, Nov 28, 2:55 PM · SRE-SLO, SRE Observability (FY2025/2026-Q2), Observability-Metrics
tappof added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

Thanos ruler points to query-frontend as its thanos querier:

/usr/bin/thanos rule ... --query http://localhost:16902
...
/usr/bin/thanos query-frontend ... --http-address 0.0.0.0:16902 ...
Fri, Nov 28, 2:51 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
tappof created T411273: Thanos (store|query-frontend) memcached cache in bad status.
Fri, Nov 28, 2:46 PM · SRE-SLO, SRE Observability (FY2025/2026-Q2), Observability-Metrics

Thu, Nov 27

tappof closed T411167: Yubikey-SSH-FIDO for Tiziano Fogli (tappof) as Resolved.
Thu, Nov 27, 4:15 PM · SRE, SRE-Access-Requests
tappof created T411167: Yubikey-SSH-FIDO for Tiziano Fogli (tappof).
Thu, Nov 27, 10:18 AM · SRE, SRE-Access-Requests

Wed, Nov 26

tappof added a comment to T409312: Sloth: adapt default month view to quarter view (pilot).

Just updated the dashboard: https://grafana.wikimedia.org/goto/PzmXbiWvg?orgId=1

Wed, Nov 26, 10:18 AM · SRE-SLO

Fri, Nov 21

tappof created T410745: Strengthen regex for suffix matching in Prometheus::Blackbox::Check::(Http|Icmp|Tcp) generated rules.
Fri, Nov 21, 2:47 PM · SRE Observability (FY2025/2026-Q2)
tappof added a comment to T400074: ProbeDown - wdqs1015.

Found the issue: the rules configured in modules/profile/manifests/microsites/monitoring.pp:67 are generating a regex that also matches the ones generated by modules/profile/manifests/query_service/monitor/ldf.pp.
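A minimal sketch of how this kind of overlap can arise, assuming the generated rule selects targets with a loose suffix pattern. The label values and both patterns below are hypothetical and are not the ones produced by the Puppet manifests mentioned above. Prometheus regex matchers are fully anchored, which `fullmatch` mimics here.

```python
import re

# Hypothetical instance labels; the real targets are not shown in the comment.
microsite = "example.wikimedia.org:443"
other = "query-example.wikimedia.org:443"
targets = (microsite, other)

# Loose suffix pattern: nothing constrains what precedes the name.
loose = re.compile(r".*example\.wikimedia\.org:443")
# Stricter pattern: the whole label must be exactly the intended name.
strict = re.compile(r"example\.wikimedia\.org:443")

loose_hits = [t for t in targets if loose.fullmatch(t)]
strict_hits = [t for t in targets if strict.fullmatch(t)]

print("loose :", loose_hits)
print("strict:", strict_hits)
```

With the loose pattern both labels are selected, so a rule meant for one service also fires for the other; the stricter pattern selects only the intended label.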

image.png (646×1 px, 116 KB)

Fri, Nov 21, 2:32 PM · SRE Observability, collaboration-services

Thu, Nov 20

tappof created P85421 (An Untitled Masterwork).
Thu, Nov 20, 4:43 PM
tappof added a subtask for T400074: ProbeDown - wdqs1015: T305223: Clean up stale Prometheus target and rules files.
Thu, Nov 20, 4:21 PM · SRE Observability, collaboration-services
tappof added a parent task for T305223: Clean up stale Prometheus target and rules files: T400074: ProbeDown - wdqs1015.
Thu, Nov 20, 4:21 PM · Patch-For-Review, Observability-Metrics, SRE Observability (FY2025/2026-Q1)
tappof added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

I’d suggest reverting the patch, as the compactor is currently unable to do its job. This could lead to a thrashing situation that would be harder to recover from than the one we’re experiencing now. Once we’ve confirmed we’re no longer in troubled waters, we can investigate why backfilling metrics was difficult without such a cutoff and, if needed, evaluate alternatives.
I’m not entirely sure this is the root cause, but I prefer to give it a try before changing other configurations that could impact performance.

Thu, Nov 20, 3:12 PM · SRE Observability (FY2025/2026-Q2)
tappof added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

It is now affecting the compactor as well.

Thu, Nov 20, 12:04 PM · SRE Observability (FY2025/2026-Q2)

Tue, Nov 18

tappof closed T410365: SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus100[56]:9100 as Resolved.

Yes, thank you.

Tue, Nov 18, 9:11 AM · Observability-Metrics, DBA
tappof closed T410365: SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus100[56]:9100, a subtask of T409557: Productionize new clouddb* hosts (clouddb1022-1033), as Resolved.
Tue, Nov 18, 9:11 AM · Data-Services, cloud-services-team, DBA
tappof created T410365: SystemdUnitFailed: generate-mysqld-exporter-config.service on prometheus100[56]:9100.
Tue, Nov 18, 9:04 AM · Observability-Metrics, DBA
tappof added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

Sure, I think we can explore any route that will fix our scenario. That said, we’re quite happy with the current Thanos/Prometheus performance, so I’d like to better understand the real needs behind having such a short cutoff window of just one day.

Tue, Nov 18, 7:48 AM · SRE Observability (FY2025/2026-Q2)

Mon, Nov 17

tappof added a comment to T410152: Disk space saturation (/srv) on Titan hosts.
  1. We changed the --max-time parameter of Thanos Store from -15d to -1d.
  2. This effectively caused a 5x increase in the amount of data transferred from the object store.
  3. One compactor cycle takes roughly 2 weeks.
  4. Just considering point 1, we are potentially increasing the amount of data that can reside under /srv/thanos-store.
  5. Every day, the compactor creates new fresh blocks. Blocks are considered for downsampling only when they are older than 2 days ("All raw resolution metrics that are older than 40 hours are downsampled at a 5m resolution").
  6. Over time, with a cutoff of -1d, Thanos Store will constantly cache the new blocks created by the compactor (compacted and/or downsampled).
  7. In the short term, however, the blocks already present in the store have not yet been marked as deletable by the compactor, so they are effectively still valid. The store keeps using them until they are no longer valid (i.e., removable), after which it starts requesting new blocks, but this time much more frequently. Previously a block remained valid for about 2 weeks: with --max-time -15d and a compaction cycle of ~14d, blocks were, in all likelihood, replaced almost simultaneously. Today a block remains valid for about one day, because it is then compacted, downsampled, or both, and therefore effectively becomes a new block with its data merged with other blocks.
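
A rough sketch of the effect described in point 1, assuming a simplified model in which each block is represented by a single age and is served when that age is at least the --max-time offset. The block ages are hypothetical, and the model glosses over blocks that straddle the cutoff.

```python
from datetime import timedelta

def served_blocks(block_ages, max_time_offset):
    """Return the ages of blocks old enough to fall before the cutoff."""
    return [age for age in block_ages if age >= max_time_offset]

# Hypothetical block ages (time elapsed since the block's data was written).
ages = [timedelta(days=d) for d in (0.5, 2, 7, 14, 30, 90)]

before = served_blocks(ages, timedelta(days=15))  # --max-time -15d
after = served_blocks(ages, timedelta(days=1))    # --max-time -1d
newly_served = [a for a in after if a not in before]

print("served with -15d:", [a.days for a in before])
print("served with -1d :", [a.days for a in after])
print("newly served    :", [a.days for a in newly_served])
```

In this toy model the newly served blocks are the young ones, which are also the blocks the compactor is still rewriting, so the store has to fetch replacements for them more often.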
Mon, Nov 17, 4:49 PM · SRE Observability (FY2025/2026-Q2)
tappof added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

/srv was also moved to the VG on titan2002.

Mon, Nov 17, 1:58 PM · SRE Observability (FY2025/2026-Q2)
tappof added a subtask for T361229: titan200[12] RAM/SSD upgrade coordination: T410152: Disk space saturation (/srv) on Titan hosts.
Mon, Nov 17, 1:23 PM · DC-Ops, SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw
tappof added a parent task for T410152: Disk space saturation (/srv) on Titan hosts: T361229: titan200[12] RAM/SSD upgrade coordination.
Mon, Nov 17, 1:22 PM · SRE Observability (FY2025/2026-Q2)
tappof added a comment to T410152: Disk space saturation (/srv) on Titan hosts.

/srv was also moved to the VG on titan1002.

Mon, Nov 17, 9:11 AM · SRE Observability (FY2025/2026-Q2)

Fri, Nov 14

tappof created T410152: Disk space saturation (/srv) on Titan hosts.
Fri, Nov 14, 4:21 PM · SRE Observability (FY2025/2026-Q2)

Thu, Nov 13

tappof added a subtask for T281812: Audit/Assess meta monitoring strategy: Unknown Object (Task).
Thu, Nov 13, 2:44 PM · Observability-Alerting

Wed, Nov 12

tappof added a subtask for T395441: Port all Icinga checks to Prometheus/Alertmanager: preparation: T367370: Shift frack alerting to use prometheus-alertmanager instead of icinga.
Wed, Nov 12, 2:19 PM · SRE Observability (FY2025/2026-Q1), Observability-Metrics
tappof added a parent task for T367370: Shift frack alerting to use prometheus-alertmanager instead of icinga: T395441: Port all Icinga checks to Prometheus/Alertmanager: preparation.
Wed, Nov 12, 2:19 PM · Patch-For-Review, Observability-Alerting, Fundraising-Backlog, fundraising-tech-ops
tappof added a comment to T398869: Create Pyrra SLOs for xLab.

I was running some tests related to the spike we saw here: https://w.wiki/_mzMp .

Wed, Nov 12, 10:49 AM · SRE-SLO, Test Kitchen (Experiment Platform Sprint 14), OKR-Work

Nov 5 2025

tappof added a comment to T409076: Public cloud account request for moving meta monitoring off of wikitech-static.

Please hold off on working on this task until further notice.
We may have found a way to handle the Icinga meta-monitoring reliably using the same approach as the Prometheus/Thanos meta-monitor.

Nov 5 2025, 1:17 PM · Infrastructure-Foundations

Nov 3 2025

tappof added a parent task for T409076: Public cloud account request for moving meta monitoring off of wikitech-static: Unknown Object (Task).
Nov 3 2025, 3:42 PM · Infrastructure-Foundations
tappof created T409076: Public cloud account request for moving meta monitoring off of wikitech-static.
Nov 3 2025, 2:33 PM · Infrastructure-Foundations
tappof added a comment to T398869: Create Pyrra SLOs for xLab.

It seems that some of the eventgate pods were restarted between 16:00 and 17:00 (just a quick check of the metrics; I didn’t dig into the logs or anything else).

Nov 3 2025, 12:59 PM · SRE-SLO, Test Kitchen (Experiment Platform Sprint 14), OKR-Work

Oct 31 2025

tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 31 2025, 2:56 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 31 2025, 2:42 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 31 2025, 2:41 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 31 2025, 2:25 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 31 2025, 2:22 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 31 2025, 2:21 PM · Patch-For-Review, Observability-Alerting
tappof closed T407484: Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) as Resolved.
Oct 31 2025, 11:45 AM · SRE Observability (FY2025/2026-Q2), sre-alert-triage

Oct 30 2025

tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 30 2025, 3:08 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 30 2025, 2:35 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 30 2025, 10:40 AM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 30 2025, 10:23 AM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 30 2025, 10:20 AM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 30 2025, 10:16 AM · Patch-For-Review, Observability-Alerting

Oct 28 2025

tappof added a comment to T408378: Nokia OSPF alerts not working.

I saw the alerts on the ALERTS metric: https://w.wiki/FqSi .
I think there was a silence rule in place, so you didn't get any notifications.

Oct 28 2025, 2:30 PM · Observability-Alerting, netops, Infrastructure-Foundations, SRE
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 28 2025, 11:28 AM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 28 2025, 11:25 AM · Patch-For-Review, Observability-Alerting
tappof changed the subtype of T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager from "Task" to "Goal".
Oct 28 2025, 11:16 AM · Patch-For-Review, Observability-Alerting
tappof assigned T374839: Port postgresql replication check to Prometheus/Alertmanager to herron.
Oct 28 2025, 11:14 AM · PostgreSQL, Observability-Alerting
tappof assigned T370157: Port lists monitoring alerts to Alertmanager to herron.
Oct 28 2025, 11:12 AM · Observability-Alerting
tappof added a subtask for T315866: Migrate mysql icinga alerts to alert manager: T369045: Migrate mysql icinga alerts to alert manager - haproxy exporter.
Oct 28 2025, 11:11 AM · Patch-For-Review, DBA
tappof removed a subtask for T321808: Port all Icinga checks to Prometheus/Alertmanager: T369045: Migrate mysql icinga alerts to alert manager - haproxy exporter.
Oct 28 2025, 11:11 AM · SRE Observability (FY2025/2026-Q1), Observability-Alerting
tappof edited parent tasks for T369045: Migrate mysql icinga alerts to alert manager - haproxy exporter, added: T315866: Migrate mysql icinga alerts to alert manager; removed: T321808: Port all Icinga checks to Prometheus/Alertmanager.
Oct 28 2025, 11:11 AM · DBA
tappof reassigned T357099: Remove check_procs-based Icinga alerts from tappof to herron.
Oct 28 2025, 11:10 AM · Patch-For-Review, Observability-Alerting

Oct 25 2025

tappof added a member for SRE-SLO: tappof.
Oct 25 2025, 9:27 PM

Oct 21 2025

tappof claimed T375166: Port PDU checks to Prometheus/Alertmanager.
Oct 21 2025, 2:14 PM · Observability-Alerting
tappof claimed T370530: Clean up "git repo needs merge" checks.
Oct 21 2025, 2:14 PM · Puppet, MW-on-K8s, Observability-Alerting
tappof reassigned T367149: Add "file age" node textfile exporter capability from fgiunchedi to herron.
Oct 21 2025, 2:11 PM · Observability-Metrics, Observability-Alerting
tappof claimed T357099: Remove check_procs-based Icinga alerts.
Oct 21 2025, 12:55 PM · Patch-For-Review, Observability-Alerting
tappof removed a subtask for T321808: Port all Icinga checks to Prometheus/Alertmanager: T330989: Probe mr devices from Prometheus.
Oct 21 2025, 12:54 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting
tappof added a subtask for T395441: Port all Icinga checks to Prometheus/Alertmanager: preparation: T330989: Probe mr devices from Prometheus.
Oct 21 2025, 12:54 PM · SRE Observability (FY2025/2026-Q1), Observability-Metrics
tappof edited parent tasks for T330989: Probe mr devices from Prometheus, added: T395441: Port all Icinga checks to Prometheus/Alertmanager: preparation; removed: T321808: Port all Icinga checks to Prometheus/Alertmanager.
Oct 21 2025, 12:54 PM · Observability-Alerting
tappof claimed T288622: All Prometheus based alerts move from Icinga to alert manager exclusively.
Oct 21 2025, 12:49 PM · Patch-For-Review, SRE Observability (FY2025/2026-Q1), Observability-Alerting
tappof claimed T321808: Port all Icinga checks to Prometheus/Alertmanager.
Oct 21 2025, 12:49 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting
tappof claimed T395441: Port all Icinga checks to Prometheus/Alertmanager: preparation.
Oct 21 2025, 12:48 PM · SRE Observability (FY2025/2026-Q1), Observability-Metrics
tappof claimed T388138: PDUs in Active status missing IP address information in NetBox.
Oct 21 2025, 12:48 PM · DC-Ops, Observability-Metrics
tappof claimed T395446: Evaluate which solution we could adopt as a drop-in replacement for NRPE (and start prototyping).
Oct 21 2025, 12:41 PM · SRE Observability (FY2025/2026-Q1), Observability-Metrics
tappof claimed T350360: Evaluate "drop in" replacement for nrpe scripts.
Oct 21 2025, 12:41 PM · Patch-For-Review, Observability-Alerting
tappof claimed T370526: Remove load_average check for ms-be/thanos-be.
Oct 21 2025, 12:38 PM · SRE-swift-storage, Observability-Alerting
tappof updated the task description for T309012: Migrate zookeeper prometheus checks from Icinga to Alertmanager.
Oct 21 2025, 12:27 PM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Observability-Alerting
tappof closed T309012: Migrate zookeeper prometheus checks from Icinga to Alertmanager, a subtask of T288622: All Prometheus based alerts move from Icinga to alert manager exclusively, as Resolved.
Oct 21 2025, 12:27 PM · Patch-For-Review, SRE Observability (FY2025/2026-Q1), Observability-Alerting
tappof closed T309012: Migrate zookeeper prometheus checks from Icinga to Alertmanager, a subtask of T346438: [Epic] Review alerting strategy for Data Platform SRE, as Resolved.
Oct 21 2025, 12:26 PM · Epic, Data-Platform-SRE, observability
tappof closed T309012: Migrate zookeeper prometheus checks from Icinga to Alertmanager as Resolved.
Oct 21 2025, 12:26 PM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Observability-Alerting

Oct 20 2025

tappof added a comment to T407484: Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100).

I found that the certificates used by Prometheus to authenticate against Kubernetes are being renewed every hour. I believe the root cause lies in modules/profile/manifests/prometheus/k8s.pp:22, where renew_seconds is set to 365d and 23h.
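
One plausible reading of this, sketched as arithmetic: if a certificate is renewed as soon as its remaining validity drops below the renewal threshold, a threshold close to the full lifetime triggers renewal almost immediately. The 366-day lifetime and the helper `time_until_renewal` are hypothetical and do not come from the manifest; only the 365d 23h threshold is quoted from the comment above.

```python
from datetime import timedelta

def time_until_renewal(lifetime, renew_threshold):
    """Time after issuance at which remaining validity falls below the threshold."""
    return max(lifetime - renew_threshold, timedelta(0))

renew_threshold = timedelta(days=365, hours=23)  # value quoted in the comment
lifetime = timedelta(days=366)                   # hypothetical, not from the source

delay = time_until_renewal(lifetime, renew_threshold)
print("renewal due after:", delay)
```

Under these assumptions the certificate becomes due for renewal one hour after it is issued, which would match an hourly renewal pattern. With a lifetime at or below the threshold, it would be due immediately.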

Oct 20 2025, 4:00 PM · SRE Observability (FY2025/2026-Q2), sre-alert-triage
tappof claimed T407484: Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100).
Oct 20 2025, 10:42 AM · SRE Observability (FY2025/2026-Q2), sre-alert-triage

Oct 17 2025

tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 17 2025, 4:26 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 17 2025, 4:13 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 17 2025, 3:54 PM · Patch-For-Review, Observability-Alerting
tappof updated the task description for T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager.
Oct 17 2025, 3:50 PM · Patch-For-Review, Observability-Alerting

Oct 15 2025

tappof changed the subtype of T407138: Port hadoop checks to prometheus/alertmanager from "Task" to "Goal".
Oct 15 2025, 1:26 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting
tappof changed the subtype of T407137: Port haproxy checks to prometheus/alertmanager from "Task" to "Goal".
Oct 15 2025, 1:26 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting
tappof changed the subtype of T407130: Remove check_systemd_unit_status-based Icinga alerts from "Task" to "Goal".
Oct 15 2025, 1:26 PM · Observability-Alerting
tappof changed the subtype of T407120: O11y alerts to migrate from "Task" to "Goal".
Oct 15 2025, 1:26 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting
tappof changed the subtype of T407331: Port mediawiki::php checks to Prometheus/Alertmanager from "Task" to "Goal".
Oct 15 2025, 1:25 PM · SRE Observability (FY2025/2026-Q1), Observability-Alerting