fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (18)

User Details

User Since
Oct 3 2014, 8:06 AM (469 w, 2 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
FGiunchedi (WMF) [ Global Accounts ]

Recent Activity

Tue, Sep 26

fgiunchedi updated the task description for T314118: Reduce IRC flood/spam during incidents.
Tue, Sep 26, 9:47 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q1), serviceops-radar, User-fgiunchedi, SRE
fgiunchedi added a comment to T342463: Write new partman recipe for cloudelastic.

Sounds great, please reach out and/or send reviews if something is amiss with the standard recipes.

Tue, Sep 26, 7:26 AM · Data-Platform-SRE
fgiunchedi updated subscribers of T347312: Expose ethtool metrics to Prometheus.

Thank you @cmooney for the extended explanation, very much appreciated! Adding @wiki_willy for visibility as dcops is the recipient of InterfaceErrors.

Tue, Sep 26, 7:23 AM · Observability-Alerting, observability

Mon, Sep 25

fgiunchedi moved T345712: Make sure we have observability for otel-coll and jaeger from Backlog to Doing on the User-fgiunchedi board.
Mon, Sep 25, 2:52 PM · User-fgiunchedi, Observability-Tracing
fgiunchedi created T347289: Port icinga::monitor::checks to pingthing.
Mon, Sep 25, 12:32 PM · Observability-Metrics
fgiunchedi closed T347167: Temporary prometheus alert evaluation failures on host role change as Resolved.

I've bumped the "for" threshold; resolving for now, and I'll reopen if the issue comes back.

Mon, Sep 25, 10:11 AM · Observability-Metrics, Observability-Alerting
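As a rough illustration of why raising the "for" threshold mentioned in the T347167 closing comment can hide short-lived failures, the sketch below models how a Prometheus-style alert moves between inactive, pending and firing. The function name, the evaluation interval and the durations are assumptions made for this example; they are not taken from the task.

```python
from typing import List


def alert_states(results: List[bool], interval_s: int, for_s: int) -> List[str]:
    """Return the alert state after each rule evaluation.

    results[i] is True when the alert expression matched at evaluation i.
    The alert only fires once the expression has matched continuously
    for at least for_s seconds; any non-matching evaluation resets it.
    """
    states = []
    active_since = None  # time of the first evaluation in the current matching run
    for i, matched in enumerate(results):
        now = i * interval_s
        if not matched:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_s else "pending")
    return states


if __name__ == "__main__":
    # Hypothetical evaluations, one per minute, with a short recovery at index 2.
    evaluations = [True, True, False, True, True, True]
    print(alert_states(evaluations, interval_s=60, for_s=60))
    print(alert_states(evaluations, interval_s=60, for_s=120))
```

With the shorter threshold the two-evaluation blip at the start already fires; with the longer one it stays pending and only the sustained run at the end fires.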
fgiunchedi closed T346950: Prometheus rule evaluation failure as Resolved.

No recurrence, resolving

Mon, Sep 25, 10:09 AM · observability
fgiunchedi added a project to T345712: Make sure we have observability for otel-coll and jaeger: User-fgiunchedi.
Mon, Sep 25, 9:25 AM · User-fgiunchedi, Observability-Tracing
fgiunchedi updated subscribers of T342463: Write new partman recipe for cloudelastic.

Please excuse the drive-by comment, I've worked with @Muehlenhoff on standardizing our partman recipes and I'm wondering if the standard raid0 recipes (i.e. raid1 for / and raid0 for /srv) would work in this case?

Mon, Sep 25, 9:08 AM · Data-Platform-SRE

Fri, Sep 22

fgiunchedi added a comment to T347148: Determine how to monitor services in cloud-private / cloudlb.

ns-recursor VIP uses cloud VPS addressing, correct?

No, it's a service hosted by the cloudservices bare-metal nodes. These were previously using public IP addressing in the WMF production realm for all services, and thus reachable from everywhere.

Now the cloudservices nodes are connected to 10.x addressing (same as any other wmf private host basically), as well as having a leg in a new cloud-only network using 172.20.x.x. Their 10.x IPs can be polled by prometheus no problem. They also announce some public IPs in BGP, for services that need to be available from the internet, which are currently reachable from private WMF space (unless we put ACLs in to block it). They use other, 172.20.x private IPs to host services that are internal to cloud and don't need to be exposed to the outside.

Fri, Sep 22, 2:31 PM · observability, cloud-services-team, Cloud-VPS
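The addressing layout described in the T347148 comment above can be sketched as a small classifier. This is only an illustration: the comment says "10.x" and "172.20.x.x" without giving prefix lengths, so the /8 and /16 networks below are assumptions, and the realm labels are made up for this example.

```python
import ipaddress

# Assumed prefixes: the comment only mentions "10.x" and "172.20.x.x".
WMF_PRIVATE = ipaddress.ip_network("10.0.0.0/8")
CLOUD_PRIVATE = ipaddress.ip_network("172.20.0.0/16")


def classify(addr: str) -> str:
    """Map an IPv4 address to one of the realms described in the comment."""
    ip = ipaddress.ip_address(addr)
    if ip in WMF_PRIVATE:
        # Same addressing as any other WMF private host; pollable by Prometheus.
        return "wmf-private"
    if ip in CLOUD_PRIVATE:
        # Cloud-only network for services internal to cloud.
        return "cloud-private"
    # Everything else, including the public IPs announced over BGP.
    return "other"


if __name__ == "__main__":
    for example in ("10.64.0.1", "172.20.1.5", "198.51.100.7"):
        print(example, classify(example))
```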
fgiunchedi added a comment to T346950: Prometheus rule evaluation failure.

Change deployed; we'll stand by and see if thanos still laments evaluation failures. Note that prometheus itself has experienced some, although that is a completely different issue and is tracked in T347167: Temporary prometheus alert evaluation failures on host role change

Fri, Sep 22, 2:16 PM · observability
fgiunchedi created T347167: Temporary prometheus alert evaluation failures on host role change.
Fri, Sep 22, 2:13 PM · Observability-Metrics, Observability-Alerting
fgiunchedi updated the task description for T345712: Make sure we have observability for otel-coll and jaeger.
Fri, Sep 22, 1:15 PM · User-fgiunchedi, Observability-Tracing
fgiunchedi added a comment to T347148: Determine how to monitor services in cloud-private / cloudlb.

Thank you for raising this @taavi !

Fri, Sep 22, 12:32 PM · observability, cloud-services-team, Cloud-VPS
fgiunchedi added a comment to T346950: Prometheus rule evaluation failure.

It isn't thanos-rule itself reporting the error message, but the thanos-store that thanos-rule talks to; in other words, the error message is nested

Fri, Sep 22, 10:07 AM · observability
fgiunchedi updated the task description for T345712: Make sure we have observability for otel-coll and jaeger.
Fri, Sep 22, 8:40 AM · User-fgiunchedi, Observability-Tracing

Thu, Sep 21

fgiunchedi updated the task description for T345712: Make sure we have observability for otel-coll and jaeger.
Thu, Sep 21, 8:53 AM · User-fgiunchedi, Observability-Tracing

Wed, Sep 20

fgiunchedi updated the task description for T346893: Investigate swagger-exporter failures.
Wed, Sep 20, 12:33 PM · Observability-Alerting, serviceops
fgiunchedi created T346893: Investigate swagger-exporter failures.
Wed, Sep 20, 12:19 PM · Observability-Alerting, serviceops
fgiunchedi closed T346129: Puppet doesn't self-recover on build-envoy-config failure as Resolved.

I'll optimistically call this specific issue resolved, the nail in the coffin will be file-based xds for envoy

Wed, Sep 20, 10:02 AM · serviceops, envoy
fgiunchedi added a comment to T346129: Puppet doesn't self-recover on build-envoy-config failure.

I've looked into the puppet logs from the first puppet run on cumin1001:/var/log/spicerack/sre/hosts/reimage/202309130825_filippo_2981305_titan1001.out and the initial failure is because systemd::syslog can't find the envoy user when setting directory ownership:

Wed, Sep 20, 9:37 AM · serviceops, envoy
fgiunchedi closed T346871: Test benthos webrequest_live with only one host as Resolved.

The test was successful in the sense that we can keep processing the webrequest firehose from codfw (centrallog2002). This comes with a performance hit in terms of messages processed, and network bandwidth also takes a significant hit, jumping by roughly 25MB/s. I believe that in the unlikely event of one centrallog host being unavailable for multiple hours we can still get a reasonable sampled webrequest stream. If even that doesn't work for some reason, it is easy to spin up benthos on different hosts, since it is all stateless and trivially horizontally scalable.

Wed, Sep 20, 9:07 AM · Observability-Metrics
fgiunchedi updated subscribers of T346871: Test benthos webrequest_live with only one host.

For reference, the dashboards me and @elukey are looking at:

Wed, Sep 20, 8:41 AM · Observability-Metrics
fgiunchedi created T346871: Test benthos webrequest_live with only one host.
Wed, Sep 20, 8:34 AM · Observability-Metrics

Tue, Sep 19

fgiunchedi created P52536 (An Untitled Masterwork).
Tue, Sep 19, 4:53 PM
fgiunchedi closed T346606: cr*-eqsin long poll times from librenms as Resolved.

Opened T346759: Investigate and deploy 'max-repeaters = 20' to all librenms devices for followups, this is done

Tue, Sep 19, 1:29 PM · SRE, Infrastructure-Foundations, netops, Observability-Metrics
fgiunchedi created T346759: Investigate and deploy 'max-repeaters = 20' to all librenms devices.
Tue, Sep 19, 1:29 PM · SRE, Infrastructure-Foundations, Observability-Metrics, netops
fgiunchedi closed T346371: Delete MediaWiki.*.growthexperiments.taskcount.link_recommendation.* from Graphite as Resolved.

Thank you for reaching out @Urbanecm_WMF and letting us know about metric cleanup! This is done

Tue, Sep 19, 8:42 AM · Observability-Metrics, SRE, Growth-Team, Graphite
fgiunchedi placed T288622: All Prometheus based alerts move from Icinga to alert manager exclusively up for grabs.
Tue, Sep 19, 8:31 AM · SRE Observability (FY2023/2024-Q1)
fgiunchedi placed T278514: Wishlist for AlertManager alerts from Grafana up for grabs.
Tue, Sep 19, 8:31 AM · Observability-Alerting, User-fgiunchedi
fgiunchedi placed T267019: Alert design guidelines for teams are produced up for grabs.
Tue, Sep 19, 8:30 AM · Observability-Alerting

Mon, Sep 18

fgiunchedi added a project to T346606: cr*-eqsin long poll times from librenms: netops.

+ netops for visibility since this can impact network devices

Mon, Sep 18, 1:47 PM · SRE, Infrastructure-Foundations, netops, Observability-Metrics
fgiunchedi added a comment to T346606: cr*-eqsin long poll times from librenms.

Setting max-repeaters to 20 definitely had an impact on bgp peers poll time:

Mon, Sep 18, 1:10 PM · SRE, Infrastructure-Foundations, netops, Observability-Metrics
fgiunchedi added a comment to T346606: cr*-eqsin long poll times from librenms.

I tried the setting above on https://librenms.wikimedia.org/device/device=159/tab=edit/section=snmp/ though the web UI reloaded and the text field was empty, suggesting to me that the setting "didn't take"

Mon, Sep 18, 10:19 AM · SRE, Infrastructure-Foundations, netops, Observability-Metrics
fgiunchedi awarded T346319: Some device mempool graphs can't be rendered in librenms a Like token.
Mon, Sep 18, 10:13 AM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi closed T346140: Better validation of "ip" for benthos webrequest_live as Resolved.

This is done, webrequest_live is more robust against partial / unindexable requests

Mon, Sep 18, 10:10 AM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a subtask for T299640: RIPE Atlas exporter improvements: T346616: atlas-exporter fails to parse response from api.
Mon, Sep 18, 10:04 AM · observability, Infrastructure-Foundations
fgiunchedi added a parent task for T346616: atlas-exporter fails to parse response from api: T299640: RIPE Atlas exporter improvements.
Mon, Sep 18, 10:04 AM · Infrastructure-Foundations, Observability-Metrics
fgiunchedi created T346616: atlas-exporter fails to parse response from api.
Mon, Sep 18, 10:04 AM · Infrastructure-Foundations, Observability-Metrics
fgiunchedi closed T309979: Upgrade Prometheus VMs in PoPs to Bullseye as Resolved.

That's correct yes, all done

Mon, Sep 18, 9:40 AM · SRE Observability (FY2023/2024-Q1), Observability-Metrics
fgiunchedi closed T309979: Upgrade Prometheus VMs in PoPs to Bullseye, a subtask of T324725: Observability Bookworm/Bullseye upgrades, as Resolved.
Mon, Sep 18, 9:40 AM · SRE Observability (FY2023/2024-Q1)
fgiunchedi created T346606: cr*-eqsin long poll times from librenms.
Mon, Sep 18, 9:11 AM · SRE, Infrastructure-Foundations, netops, Observability-Metrics
fgiunchedi closed T346335: prometheus-snmp-exporter config assembly doesn't self recover on errors as Resolved.

This is done, we're using prometheus-assemble-config for snmp-exporter too now

Mon, Sep 18, 8:47 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi closed T346335: prometheus-snmp-exporter config assembly doesn't self recover on errors, a subtask of T344136: Upgrade LibreNMS to 23.7.0 or higher, as Resolved.
Mon, Sep 18, 8:47 AM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi moved T346335: prometheus-snmp-exporter config assembly doesn't self recover on errors from Backlog to Doing on the User-fgiunchedi board.
Mon, Sep 18, 8:29 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi added a project to T346335: prometheus-snmp-exporter config assembly doesn't self recover on errors: User-fgiunchedi.
Mon, Sep 18, 8:29 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, SRE Observability (FY2023/2024-Q1)

Fri, Sep 15

Milimetric awarded T343320: Request Access to Superset querying presto_analytics_hive datasets a Like token.
Fri, Sep 15, 8:04 PM · Product-Analytics, SRE-Access-Requests, SRE, CommRel-Specialists-Support (Jul-Sep-2023)
fgiunchedi added a comment to T346317: Alert "access port speed less 100mbit" and librenms upgrade.

Sweet, thank you @ayounsi !

Fri, Sep 15, 1:58 PM · SRE, Infrastructure-Foundations, netops, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi updated the task description for T344136: Upgrade LibreNMS to 23.7.0 or higher.
Fri, Sep 15, 6:03 AM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi placed T346335: prometheus-snmp-exporter config assembly doesn't self recover on errors up for grabs.
Fri, Sep 15, 6:03 AM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi closed T346318: Fix librenms/alertmanager integration, a subtask of T344136: Upgrade LibreNMS to 23.7.0 or higher, as Resolved.
Fri, Sep 15, 6:02 AM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi closed T346318: Fix librenms/alertmanager integration as Resolved.

This is done

Fri, Sep 15, 6:02 AM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)

Thu, Sep 14

lmata awarded T346318: Fix librenms/alertmanager integration a Like token.
Thu, Sep 14, 9:26 PM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi created T346335: prometheus-snmp-exporter config assembly doesn't self recover on errors.
Thu, Sep 14, 2:10 PM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi added a comment to T346318: Fix librenms/alertmanager integration.

The integration works again, what I did is:

Thu, Sep 14, 1:44 PM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi created T346319: Some device mempool graphs can't be rendered in librenms.
Thu, Sep 14, 10:37 AM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi updated the task description for T346318: Fix librenms/alertmanager integration.
Thu, Sep 14, 10:34 AM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi created T346318: Fix librenms/alertmanager integration.
Thu, Sep 14, 10:33 AM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi created T346317: Alert "access port speed less 100mbit" and librenms upgrade.
Thu, Sep 14, 10:23 AM · SRE, Infrastructure-Foundations, netops, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi closed T346275: Degraded RAID on netmon1003 as Invalid.

Nothing to do, host was reimaged:

Thu, Sep 14, 9:57 AM · Infrastructure-Foundations, SRE, ops-eqiad
fgiunchedi moved T346143: Remove thanos components from thanos-fe role and reimage thanos-fe hosts from Backlog to Doing on the User-fgiunchedi board.
Thu, Sep 14, 9:53 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi updated the task description for T341488: Split Thanos components from thanos-fe hosts into titan hosts.
Thu, Sep 14, 9:52 AM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi added a comment to T346140: Better validation of "ip" for benthos webrequest_live.

@elukey and I looked into this. It turns out that these records have dt set to "-" and are therefore not indexed in Druid; the current plan is to drop them at the benthos level before sampling.

Thu, Sep 14, 9:29 AM · User-fgiunchedi, Observability-Metrics
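The plan in the T346140 comment above (drop records whose dt is "-" before sampling) can be sketched in a few lines. This is not the actual benthos configuration, which would be written in benthos's own config language; it is a Python illustration, and the function names, record fields other than dt, and the sampling rate are assumptions made for the example.

```python
import random
from typing import Dict, Iterable, Iterator


def drop_unindexable(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Skip records whose dt field is missing or set to the placeholder "-"."""
    for record in records:
        if record.get("dt", "-") != "-":
            yield record


def sample(records: Iterable[Dict[str, str]], rate: float,
           rng: random.Random) -> Iterator[Dict[str, str]]:
    """Keep each record with probability `rate` (0.0 to 1.0)."""
    for record in records:
        if rng.random() < rate:
            yield record


if __name__ == "__main__":
    # Hypothetical records; only the dt field comes from the task discussion.
    incoming = [
        {"dt": "2023-09-14T09:00:00Z", "ip": "192.0.2.1"},
        {"dt": "-", "ip": "-"},
        {"dt": "2023-09-14T09:00:01Z", "ip": "192.0.2.2"},
    ]
    # Filtering first means invalid records never count towards the sample.
    for record in sample(drop_unindexable(incoming), rate=0.5, rng=random.Random()):
        print(record)
```

Filtering ahead of the sampler keeps the sampled stream made up only of records that can be indexed downstream.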
fgiunchedi moved T345637: Remove nodeport from otel-collector deployment from Doing to Backlog on the User-fgiunchedi board.
Thu, Sep 14, 8:44 AM · User-fgiunchedi, Observability-Tracing
fgiunchedi moved T320563: our various Envoys are configured to report traces to local OpenTelemetry Collector from Doing to Backlog on the User-fgiunchedi board.
Thu, Sep 14, 8:44 AM · User-fgiunchedi, Observability-Tracing
fgiunchedi moved T346140: Better validation of "ip" for benthos webrequest_live from Backlog to Doing on the User-fgiunchedi board.
Thu, Sep 14, 8:44 AM · User-fgiunchedi, Observability-Metrics

Wed, Sep 13

lmata awarded T341488: Split Thanos components from thanos-fe hosts into titan hosts a Like token.
Wed, Sep 13, 2:48 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi added a project to T346140: Better validation of "ip" for benthos webrequest_live: User-fgiunchedi.
Wed, Sep 13, 2:31 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi updated the task description for T341488: Split Thanos components from thanos-fe hosts into titan hosts.
Wed, Sep 13, 12:19 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi closed T341999: Create 'titan' role and put new hosts in service as Resolved.

Hosts reimaged with raid0, resolving

Wed, Sep 13, 12:18 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi closed T341999: Create 'titan' role and put new hosts in service, a subtask of T341488: Split Thanos components from thanos-fe hosts into titan hosts, as Resolved.
Wed, Sep 13, 12:17 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi added a comment to T345190: Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@.

@Hokwelum you can find the alertmanager onboarding documentation for new teams at https://wikitech.wikimedia.org/wiki/Alertmanager . Please add me or other members of observability to the gerrit code reviews and we'll review/merge them for you! Please reach out too if you need further assistance and/or have questions

Wed, Sep 13, 10:03 AM · Patch-For-Review, MediaWiki-Engineering-Group-onboarding, MediaWiki-Platform-Team, MediaWiki-ResourceLoader
fgiunchedi renamed T346129: Puppet doesn't self-recover on build-envoy-config failure from Puppet doesn't self-recover with a zero-byte /etc/envoy/envoy.yaml to Puppet doesn't self-recover on build-envoy-config failure.
Wed, Sep 13, 9:06 AM · serviceops, envoy
fgiunchedi added a comment to T346129: Puppet doesn't self-recover on build-envoy-config failure.

There's also a related problem, which is more of a puppet one: if the build-envoy-config exec fails (as happens on the first puppet run), it is never retried unless one of admin-config.yaml or runtime.yaml changes (and the exec is called again)

Wed, Sep 13, 9:00 AM · serviceops, envoy
fgiunchedi added a comment to T346129: Puppet doesn't self-recover on build-envoy-config failure.

This is preventing zero-touch reimage of hosts running envoy AFAICS.

Wed, Sep 13, 8:53 AM · serviceops, envoy
fgiunchedi added a comment to T346129: Puppet doesn't self-recover on build-envoy-config failure.

First puppet run does indeed create /etc/envoy/envoy.yaml if it isn't present, trying to fix its permissions

Wed, Sep 13, 8:44 AM · serviceops, envoy
fgiunchedi renamed T346129: Puppet doesn't self-recover on build-envoy-config failure from Puppet doesn't self-recover when build-envoy-config leaves behind a zero-byte envoy.yaml to Puppet doesn't self-recover with a zero-byte /etc/envoy/envoy.yaml.
Wed, Sep 13, 8:39 AM · serviceops, envoy
fgiunchedi created P52492 (An Untitled Masterwork).
Wed, Sep 13, 7:58 AM

Tue, Sep 12

fgiunchedi reopened T341999: Create 'titan' role and put new hosts in service as "Open".

I was a little too hasty here, I forgot we need raid0 on these hosts to be able to store blocks to be compacted, will need to reimage the hosts

Tue, Sep 12, 1:42 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi reopened T341999: Create 'titan' role and put new hosts in service, a subtask of T341488: Split Thanos components from thanos-fe hosts into titan hosts, as Open.
Tue, Sep 12, 1:42 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi created T346143: Remove thanos components from thanos-fe role and reimage thanos-fe hosts.
Tue, Sep 12, 1:26 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi closed T341999: Create 'titan' role and put new hosts in service as Resolved.

New hosts are in service, resolving

Tue, Sep 12, 1:25 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi closed T341999: Create 'titan' role and put new hosts in service, a subtask of T341488: Split Thanos components from thanos-fe hosts into titan hosts, as Resolved.
Tue, Sep 12, 1:24 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi added a comment to T346140: Better validation of "ip" for benthos webrequest_live.

I noticed this because it seems that during said incident, with many invalid IPs, benthos sent out more messages than I'd have expected:

Tue, Sep 12, 1:17 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi created T346140: Better validation of "ip" for benthos webrequest_live.
Tue, Sep 12, 12:40 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T341488: Split Thanos components from thanos-fe hosts into titan hosts.

re: the last point, namely cleaning up thanos components off thanos-fe (therefore leaving only swift), I initially thought of going the state => absent route, though that seems more trouble than removing the thanos profiles from the thanos::frontend role and roll-reimaging the thanos-fe hosts. What do you think @MatthewVernon ?

Tue, Sep 12, 12:24 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi updated the task description for T341488: Split Thanos components from thanos-fe hosts into titan hosts.
Tue, Sep 12, 12:12 PM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi created T346129: Puppet doesn't self-recover on build-envoy-config failure.
Tue, Sep 12, 8:41 AM · serviceops, envoy
fgiunchedi updated the task description for T341488: Split Thanos components from thanos-fe hosts into titan hosts.
Tue, Sep 12, 8:12 AM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi moved T341999: Create 'titan' role and put new hosts in service from Up next to Doing on the User-fgiunchedi board.
Tue, Sep 12, 8:07 AM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi moved T341488: Split Thanos components from thanos-fe hosts into titan hosts from Up next to Doing on the User-fgiunchedi board.
Tue, Sep 12, 8:07 AM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
fgiunchedi added a comment to T344136: Upgrade LibreNMS to 23.7.0 or higher.

@andrea.denisse. is there a task for this blocking issue? As more and more people are going to upgrade to bookworm, thanks for finding those bugs.

Tue, Sep 12, 7:47 AM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)

Mon, Sep 11

fgiunchedi created T346071: Add bookworm support/build to scap.
Mon, Sep 11, 4:46 PM · Release-Engineering-Team, Scap
fgiunchedi updated the task description for T344136: Upgrade LibreNMS to 23.7.0 or higher.
Mon, Sep 11, 3:54 PM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi updated the task description for T344136: Upgrade LibreNMS to 23.7.0 or higher.
Mon, Sep 11, 3:37 PM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi updated the task description for T344136: Upgrade LibreNMS to 23.7.0 or higher.
Mon, Sep 11, 2:24 PM · Observability-Metrics, SRE Observability (FY2023/2024-Q1)
fgiunchedi updated the task description for T346016: pg replication lag UNKNOWN for puppetdb2003.
Mon, Sep 11, 8:22 AM · Puppet
fgiunchedi created T346016: pg replication lag UNKNOWN for puppetdb2003.
Mon, Sep 11, 8:22 AM · Puppet
fgiunchedi added a comment to T345909: "Ensure hosts are not performing a change on every puppet run" alert is failing.

Looks like the alert is working as expected: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DPuppetConstantChange

Mon, Sep 11, 8:06 AM · Infrastructure-Foundations, Puppet-Infrastructure, Puppet
fgiunchedi added a comment to T320563: our various Envoys are configured to report traces to local OpenTelemetry Collector.

mesh tracing for citoid also enabled in staging now!

Mon, Sep 11, 8:05 AM · User-fgiunchedi, Observability-Tracing