
fgiunchedi (Filippo Giunchedi)
/* No comment */

User Details

User Since
Oct 3 2014, 8:06 AM (345 w, 1 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi

Recent Activity

Yesterday

fgiunchedi created T282863: Upgrade Grafana to 8.
Fri, May 14, 12:08 PM · Performance-Team (Radar), observability
fgiunchedi added a comment to T280801: Cloud VPS pre-release Debian Bullseye images.

Thank you for following up @Andrew, I'm wondering if we could locally hack something to unblock that specific bit and see what else needs fixing?

Fri, May 14, 9:09 AM · cloud-services-team (Kanban), Cloud-VPS
fgiunchedi updated subscribers of T282839: Degraded RAID on ms-be1053.

sdd is indeed busted and the host is under warranty, please replace @Cmjohnson / @Jclark-ctr, thank you!

Fri, May 14, 9:06 AM · SRE, ops-eqiad

Wed, May 12

fgiunchedi removed a project from T213933: PoC alert/notification functionality with Elastic Stack: User-fgiunchedi.
Wed, May 12, 3:30 PM · observability, Patch-For-Review, Restricted Project, Security-Team, Wikimedia-Logstash
fgiunchedi removed a project from T182759: Add Prometheus exporter to Jenkins instances: User-fgiunchedi.
Wed, May 12, 3:30 PM · Release-Engineering-Team (Seen), observability, Continuous-Integration-Infrastructure, Goal, SRE
fgiunchedi removed a project from T78135: Provide a pxe-bootable rescue image: User-fgiunchedi.
Wed, May 12, 3:29 PM · SRE

Tue, May 11

fgiunchedi updated the task description for T264291: Swift users and their usage.
Tue, May 11, 1:50 PM · SRE-swift-storage
fgiunchedi closed T281039: Splunk On-Call doing something odd with routing some wmcs alerts as Resolved.

Unfortunately I can't find an option to instruct icinga to stop sending ACK notifications on a per-contact basis. Since the issue seems benign I'll resolve, feel free to reopen though!

Tue, May 11, 8:12 AM · cloud-services-team (Kanban), observability

Mon, May 10

fgiunchedi moved T281812: Audit/Assess external monitoring strategy from Inbox to In progress on the observability board.
Mon, May 10, 3:24 PM · User-fgiunchedi, observability
fgiunchedi closed T282434: Degraded RAID on ms-be1038 as Resolved.

RAID firmware upgraded and host rebooted 2x, we're back

Mon, May 10, 12:31 PM · SRE, ops-eqiad
fgiunchedi added a comment to T282434: Degraded RAID on ms-be1038.

Message at boot up

Mon, May 10, 12:16 PM · SRE, ops-eqiad
fgiunchedi added a comment to T282434: Degraded RAID on ms-be1038.

Looks like the host is busted, I'll try a reboot

Mon, May 10, 12:12 PM · SRE, ops-eqiad

Fri, May 7

fgiunchedi moved T281095: Move paging for librenms from icinga to AM from Up next to Doing on the User-fgiunchedi board.
Fri, May 7, 7:30 AM · Patch-For-Review, SRE, User-fgiunchedi, netops, observability

Thu, May 6

fgiunchedi removed a project from T227080: Deprecate all non-Kafka logstash inputs: User-fgiunchedi.
Thu, May 6, 12:30 PM · Patch-For-Review, observability, Wikimedia-Logstash, SRE
fgiunchedi removed a project from T240667: Ingestion errors for production logs on ELK7: User-fgiunchedi.
Thu, May 6, 12:29 PM · observability, SRE, Wikimedia-Logstash
fgiunchedi removed a project from T235891: Ingest production logs with ELK7: User-fgiunchedi.
Thu, May 6, 12:29 PM · observability, SRE, Wikimedia-Logstash
fgiunchedi moved T281812: Audit/Assess external monitoring strategy from Backlog to Doing on the User-fgiunchedi board.
Thu, May 6, 8:48 AM · User-fgiunchedi, observability

Tue, May 4

fgiunchedi updated subscribers of T281812: Audit/Assess external monitoring strategy.
Tue, May 4, 10:22 AM · User-fgiunchedi, observability
fgiunchedi created T281812: Audit/Assess external monitoring strategy.
Tue, May 4, 10:22 AM · User-fgiunchedi, observability
fgiunchedi created T281810: Request increased quota for monitoring Cloud VPS project.
Tue, May 4, 10:06 AM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
fgiunchedi added a comment to T281699: swift-ring: Add support for Cinder based Cloud VPS VMs.

For a bit of context: to keep a good "emulation" of production there needs to be a block device (LVM, or other like a loop device or similar) for puppet (and the scripts) to mkfs/mount/etc.

Tue, May 4, 9:28 AM · Beta-Cluster-Infrastructure, SRE-swift-storage
fgiunchedi added a comment to T280805: Error in apifeatureusage curator "forcemerge" step.

While we're on the topic (ah!) of apifeatureusage, with mediawiki logs on kafka we don't strictly need logstash anymore to ingest kafka -> cirrussearch if the feature stays based on mw logs (as opposed to event platform).

Tue, May 4, 9:18 AM · Platform Engineering, ApiFeatureUsage, Discovery-Search (Current work), observability

Mon, May 3

fgiunchedi added a comment to T265435: codfw: Testing Out Sample PDUs.

I did some work on this last week, there are temporary patches on netmon1002 to get things going at least minimally and collect voltage/current/power/etc from the PDU's branches. I ran into trouble with conditional discovery and asked upstream about it: https://community.librenms.org/t/skipping-values-based-on-oids-in-another-table-with-yaml-discovery/15689

Mon, May 3, 12:54 PM · User-fgiunchedi, observability, ops-codfw, DC-Ops, SRE
fgiunchedi moved T281454: Onboard teams with Prometheus-based alerts to AM from Backlog to Doing on the User-fgiunchedi board.
Mon, May 3, 12:15 PM · User-fgiunchedi, observability
fgiunchedi added a comment to T281507: KaiOS app client-side errors dashboard stopped working.

from the sent payload I'm guessing the messages should end up in (eqiad|codfw).mediawiki.client.error topic

FYI, that code sets meta.stream to 'kaios_app.error', from which the Kafka topic names are created, e.g. (eqiad|codfw).kaios_app.error
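
The naming described in this comment can be sketched roughly as follows. This is only an illustration and assumes the topic name is simply the datacenter name, a dot, and the stream name, as the `(eqiad|codfw).kaios_app.error` example suggests; the function name `topics_for_stream` is hypothetical and not part of any Wikimedia codebase.

```python
# Datacenter prefixes taken from the "(eqiad|codfw)" pattern in the comment.
DATACENTERS = ("eqiad", "codfw")


def topics_for_stream(stream, datacenters=DATACENTERS):
    """Return one topic name per datacenter for a given stream name.

    Assumption: a topic is "<datacenter>.<stream>", nothing more.
    """
    return [f"{dc}.{stream}" for dc in datacenters]


print(topics_for_stream("kaios_app.error"))
```

Under that assumption, a consumer looking for these events would subscribe to the `kaios_app.error` topics rather than the `mediawiki.client.error` ones.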

Mon, May 3, 9:58 AM · observability, Wikimedia-Logstash, Inuka-Team

Fri, Apr 30

fgiunchedi added a comment to T265435: codfw: Testing Out Sample PDUs.

Thank you @Papaul, today I poked a little at librenms chatsworth support and it looks like the current support is not complete (for sure not as complete as sentry3/sentry4): we'd need to add support for inbound current and environmental monitors. I can dedicate some time this quarter to this, @wiki_willy what's the timeline for the testing phase?

Also a few questions that popped up while reading the MIB:

  • Are we going to use multiple PDUs chained together?
  • For input current the MIB has "line" and "branch" concepts, I'm not super familiar with these and it would be great to clarify what's what in the daisy-chained case too
  • Temperature is in degrees F (not a big deal, need to check if librenms can convert for us under the hood)
  • @Papaul I see the sensor connected, there is one reading for temp and one for humidity, and two other readings for temp/humidity with bogus values. Does the sensor have two probes physically or just one?

@fgiunchedi I changed the temperature to Celsius. The sensor has only 1 probe, but the PDU can take up to 2 sensors

Fri, Apr 30, 9:17 AM · User-fgiunchedi, observability, ops-codfw, DC-Ops, SRE
fgiunchedi added a comment to T281507: KaiOS app client-side errors dashboard stopped working.

I had a brief look into this to check the logstash pipeline health. I can't find events in the dashboard for the last 90d, although from the sent payload I'm guessing the messages should end up in (eqiad|codfw).mediawiki.client.error topic in the kafka "logging" cluster (?).

Fri, Apr 30, 7:32 AM · observability, Wikimedia-Logstash, Inuka-Team

Thu, Apr 29

fgiunchedi created T281454: Onboard teams with Prometheus-based alerts to AM.
Thu, Apr 29, 9:43 AM · User-fgiunchedi, observability
fgiunchedi moved T265435: codfw: Testing Out Sample PDUs from Backlog to Doing on the User-fgiunchedi board.
Thu, Apr 29, 9:29 AM · User-fgiunchedi, observability, ops-codfw, DC-Ops, SRE
fgiunchedi moved T281358: Move Performance Icinga alerts to AlertManager from Backlog to Doing on the User-fgiunchedi board.
Thu, Apr 29, 9:29 AM · Patch-For-Review, Performance-Team, observability, User-fgiunchedi
fgiunchedi moved T269272: Sign-in links from Grafana dashboards don't work when not signed into SSO from Doing to Radar on the User-fgiunchedi board.
Thu, Apr 29, 9:29 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), CAS-SSO, observability, SRE
fgiunchedi moved T281359: Onboard teams with Grafana alerts to AM from Backlog to Doing on the User-fgiunchedi board.
Thu, Apr 29, 9:29 AM · User-fgiunchedi, observability
fgiunchedi added a project to T265435: codfw: Testing Out Sample PDUs: User-fgiunchedi.
Thu, Apr 29, 7:48 AM · User-fgiunchedi, observability, ops-codfw, DC-Ops, SRE
fgiunchedi added a comment to T265435: codfw: Testing Out Sample PDUs.

Hi @fgiunchedi - Chatsworth has been pretty flexible with the amount of time we have for testing it, so I think we should be ok keeping it for a longer duration. Just let us know approximately how long we would need to keep it for, and we can pass the info along to them. Or if this PDU doesn't seem that great from a monitoring and software compatibility aspect, let us know as well. We want to make sure the PDU makes sense for everyone across the org... and we can always pass on this and try out the next manufacturer if it makes more sense. Thanks for all the help, it's definitely appreciated. ~Willy

Thu, Apr 29, 6:43 AM · User-fgiunchedi, observability, ops-codfw, DC-Ops, SRE

Wed, Apr 28

fgiunchedi updated the task description for T281359: Onboard teams with Grafana alerts to AM.
Wed, Apr 28, 2:37 PM · User-fgiunchedi, observability
fgiunchedi committed rOALEf5afb17cc2c1: Return pathlib objects from file-listing functions (authored by fgiunchedi).
Return pathlib objects from file-listing functions
Wed, Apr 28, 1:58 PM
fgiunchedi updated the task description for T281359: Onboard teams with Grafana alerts to AM.
Wed, Apr 28, 12:47 PM · User-fgiunchedi, observability
fgiunchedi added projects to T281359: Onboard teams with Grafana alerts to AM: observability, User-fgiunchedi.
Wed, Apr 28, 12:43 PM · User-fgiunchedi, observability
fgiunchedi created T281359: Onboard teams with Grafana alerts to AM.
Wed, Apr 28, 12:42 PM · User-fgiunchedi, observability
fgiunchedi created T281358: Move Performance Icinga alerts to AlertManager.
Wed, Apr 28, 12:34 PM · Patch-For-Review, Performance-Team, observability, User-fgiunchedi
fgiunchedi added a comment to T265435: codfw: Testing Out Sample PDUs.

Thank you @Papaul, today I poked a little at librenms chatsworth support and it looks like the current support is not complete (for sure not as complete as sentry3/sentry4): we'd need to add support for inbound current and environmental monitors. I can dedicate some time this quarter to this, @wiki_willy what's the timeline for the testing phase?

Wed, Apr 28, 10:21 AM · User-fgiunchedi, observability, ops-codfw, DC-Ops, SRE
fgiunchedi updated the task description for T281135: codfw: Relocate servers in 10G racks.
Wed, Apr 28, 8:27 AM · serviceops, DBA, SRE, ops-codfw

Tue, Apr 27

fgiunchedi added a comment to T281267: various weekly and daily dumps run from systemd timers are broken.

@fgiunchedi there is a requirement to forward a subset of icinga alerts to a different set of users, either by sending to an email address or something fancier like push notifications.

As a starting point it would be good to forward "Check+systemd+state" alerts relating to snapshot servers to ops-dumps@wikimedia.org. Before I start digging into puppet I thought I would ping you, as I think this may be something that's better handled in alertmanager?

Tue, Apr 27, 3:56 PM · wdwb-tech, Wikidata, SRE, observability, Dumps-Generation
fgiunchedi moved T207292: Review prometheus_nodes params from Backlog to Up next on the User-fgiunchedi board.
Tue, Apr 27, 1:47 PM · User-fgiunchedi, observability, SRE
fgiunchedi moved T225140: Icinga alerts that should open tasks instead of alerting from Backlog to Up next on the User-fgiunchedi board.
Tue, Apr 27, 1:47 PM · User-fgiunchedi, observability
fgiunchedi moved T278514: Wishlist for AlertManager alerts from Grafana from Backlog to Up next on the User-fgiunchedi board.
Tue, Apr 27, 1:47 PM · User-fgiunchedi, Performance-Team (Radar), observability
fgiunchedi moved T273716: Improve Alertmanager/LibreNMS notifications from Backlog to Up next on the User-fgiunchedi board.
Tue, Apr 27, 1:47 PM · User-fgiunchedi, Patch-For-Review, observability
fgiunchedi moved T281095: Move paging for librenms from icinga to AM from Backlog to Up next on the User-fgiunchedi board.
Tue, Apr 27, 1:47 PM · Patch-For-Review, SRE, User-fgiunchedi, netops, observability
fgiunchedi moved T269272: Sign-in links from Grafana dashboards don't work when not signed into SSO from Up next to Doing on the User-fgiunchedi board.
Tue, Apr 27, 1:47 PM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), CAS-SSO, observability, SRE
fgiunchedi moved T227080: Deprecate all non-Kafka logstash inputs from Up next to Backlog on the User-fgiunchedi board.
Tue, Apr 27, 1:46 PM · Patch-For-Review, observability, Wikimedia-Logstash, SRE
fgiunchedi moved T235891: Ingest production logs with ELK7 from Up next to Backlog on the User-fgiunchedi board.
Tue, Apr 27, 1:46 PM · observability, SRE, Wikimedia-Logstash
fgiunchedi moved T240667: Ingestion errors for production logs on ELK7 from Up next to Backlog on the User-fgiunchedi board.
Tue, Apr 27, 1:46 PM · observability, SRE, Wikimedia-Logstash
fgiunchedi moved T272836: Decom ms-be[1019-1026] from Doing to Radar on the User-fgiunchedi board.
Tue, Apr 27, 1:46 PM · SRE, ops-eqiad, User-fgiunchedi, SRE-swift-storage
fgiunchedi removed a subtask for T261633: Put ms-be2057 (Dell R740xd2) in service: T264998: Some object-replicator log lines not making it to centrallog.
Tue, Apr 27, 1:24 PM · Patch-For-Review, User-fgiunchedi, SRE, SRE-swift-storage
fgiunchedi removed a parent task for T264998: Some object-replicator log lines not making it to centrallog: T261633: Put ms-be2057 (Dell R740xd2) in service.
Tue, Apr 27, 1:24 PM · SRE, SRE-swift-storage
fgiunchedi closed T266016: Refresh and expand Swift hardware capacity as Resolved.

This is complete

Tue, Apr 27, 1:24 PM · User-fgiunchedi, SRE-swift-storage
fgiunchedi updated the task description for T266016: Refresh and expand Swift hardware capacity.
Tue, Apr 27, 1:24 PM · User-fgiunchedi, SRE-swift-storage
fgiunchedi closed T280961: Degraded RAID on ms-be1019, a subtask of T272836: Decom ms-be[1019-1026], as Declined.
Tue, Apr 27, 1:23 PM · SRE, ops-eqiad, User-fgiunchedi, SRE-swift-storage
fgiunchedi closed T280961: Degraded RAID on ms-be1019 as Declined.

Host is decommissioned

Tue, Apr 27, 1:23 PM · SRE, ops-eqiad
fgiunchedi added a project to T272836: Decom ms-be[1019-1026]: ops-eqiad.

@Cmjohnson or @Jclark-ctr all yours, hosts ready for decom

Tue, Apr 27, 1:23 PM · SRE, ops-eqiad, User-fgiunchedi, SRE-swift-storage
fgiunchedi updated the task description for T272836: Decom ms-be[1019-1026].
Tue, Apr 27, 12:48 PM · SRE, ops-eqiad, User-fgiunchedi, SRE-swift-storage
fgiunchedi renamed T272836: Decom ms-be[1019-1026] from Decom ms-be[1019-1026] from swift to Decom ms-be[1019-1026].
Tue, Apr 27, 12:47 PM · SRE, ops-eqiad, User-fgiunchedi, SRE-swift-storage
fgiunchedi updated the task description for T266016: Refresh and expand Swift hardware capacity.
Tue, Apr 27, 10:23 AM · User-fgiunchedi, SRE-swift-storage
fgiunchedi added a comment to T281055: mr1 port utilization alerts shouldn't mention hash page in their IRC logs.

Moving to AM sounds good to me. But if needed, in the interim we could change the magic string we use in check_librenms to something else instead of hash page, which I chose for simplicity but has maybe just caused more confusion.

All we'd have to do is to change the --escalation-pattern flag value and also change the names of the alert rules in LibreNMS.
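
As a rough illustration of the change proposed here, the sketch below shows how swapping the marker in both the pattern and the rule names would keep escalation working. It assumes the `--escalation-pattern` value is treated as a regular expression searched for within each LibreNMS alert rule name; how `check_librenms` actually interprets the flag is not shown in this thread, and the rule names and the `[escalate]` marker are made up.

```python
import re


def should_escalate(rule_name: str, escalation_pattern: str) -> bool:
    """Return True when the rule name contains a match for the pattern.

    Assumption: the pattern is a regular expression searched anywhere
    in the alert rule name.
    """
    return re.search(escalation_pattern, rule_name) is not None


# Hypothetical alert rule names, before and after a rename.
old_rule = "Router port utilization #page"
new_rule = "Router port utilization [escalate]"

# The replacement marker is escaped so the brackets match literally.
new_pattern = re.escape("[escalate]")

print(should_escalate(old_rule, "#page"))      # old name, old pattern
print(should_escalate(new_rule, "#page"))      # renamed rule no longer matches
print(should_escalate(new_rule, new_pattern))  # renamed rule, new pattern
```

Both the flag value and the rule names would need to change together, otherwise renamed rules would silently stop escalating.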

Tue, Apr 27, 7:52 AM · SRE, netops
fgiunchedi added a comment to T265435: codfw: Testing Out Sample PDUs.

Thank you @Papaul, could you forward the attached mib? I'll take a look, though I think a call will be best

Tue, Apr 27, 7:33 AM · User-fgiunchedi, observability, ops-codfw, DC-Ops, SRE

Mon, Apr 26

fgiunchedi updated the task description for T281135: codfw: Relocate servers in 10G racks.
Mon, Apr 26, 2:57 PM · serviceops, DBA, SRE, ops-codfw
fgiunchedi created T281128: generate-mysqld-exporter-config fails on prometheus eqiad.
Mon, Apr 26, 1:04 PM · DBA
fgiunchedi created T281107: ms-be1062 fell off the network, causing swift timeouts.
Mon, Apr 26, 9:49 AM · Wikimedia-Incident, SRE, SRE-swift-storage
fgiunchedi added a comment to T281039: Splunk On-Call doing something odd with routing some wmcs alerts.

AFAICT all of these "proto incidents" are ACKs issued by icinga (not SOC ACKs) and as such don't page folks in SOC. I think the proper action here might be to instruct icinga to stop sending ACKs to SOC, or leave things as-is since there weren't mis-pages?

Mon, Apr 26, 9:14 AM · cloud-services-team (Kanban), observability
fgiunchedi added a comment to T281055: mr1 port utilization alerts shouldn't mention hash page in their IRC logs.

@CDanis set it up, there is an Icinga check that pulls the LibreNMS API and should page where # page is present. But it should not page for management routers.
@fgiunchedi Maybe that's something now doable directly through Alert Manager instead? (and we can stop using the # page tag?)

Mon, Apr 26, 9:01 AM · SRE, netops
fgiunchedi added a project to T281095: Move paging for librenms from icinga to AM: User-fgiunchedi.
Mon, Apr 26, 9:01 AM · Patch-For-Review, SRE, User-fgiunchedi, netops, observability
fgiunchedi created T281095: Move paging for librenms from icinga to AM.
Mon, Apr 26, 9:01 AM · Patch-For-Review, SRE, User-fgiunchedi, netops, observability
fgiunchedi added a comment to T281048: mwlog1001 is running out of free space on /srv/mw-log.

FWIW +1 on lowering debug level, AFAIK mwlog1001 is indeed quite close to being replaced by mwlog1002 in T224565: Migrate mwlog/udp2log servers to Buster

Mon, Apr 26, 8:43 AM · Performance-Team, MediaWiki-Revision-backend, MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), observability, SRE
fgiunchedi added a comment to T281055: mr1 port utilization alerts shouldn't mention hash page in their IRC logs.

I agree, we should be restricting #page to alerts that page folks, not sure of an alternative tag though (or remove the tag altogether for now) cc @ayounsi

Mon, Apr 26, 8:37 AM · SRE, netops
fgiunchedi closed T280257: Thanos compaction stopped due to local filesystem space shortage as Resolved.

All thanos-fe hosts reimaged, resolving

Mon, Apr 26, 8:27 AM · User-fgiunchedi, observability
fgiunchedi added a comment to T281019: Please Upload large files to Commons.

From my tests the culprit seems to be webproxy hosts closing the transfer after ~4MB, though using urldownloader works as expected. Which proxy were you using for the tests, @Urbanecm?

Mon, Apr 26, 7:47 AM · SRE, Wikimedia-Site-requests, Internet-Archive

Fri, Apr 23

fgiunchedi added a comment to T268233: thanos u/i gives errors if left idle for a few hours.

FWIW this is still happening (namely when GET'ing a query with an SSO session in need of a refresh, the thanos UI shows Error executing query: OK; fully refreshing the page works). The UI worked fine for me during a working day, but the next day it stopped working until I refreshed. What's the current refresh time for an SSO session before the refresh in the background kicks in?

Fri, Apr 23, 12:55 PM · CAS-SSO, observability, SRE
fgiunchedi added a comment to T268233: thanos u/i gives errors if left idle for a few hours.

FWIW this is still happening (namely when GET'ing a query with an SSO session in need of a refresh, the thanos UI shows Error executing query: OK; fully refreshing the page works). The UI worked fine for me during a working day, but the next day it stopped working until I refreshed. What's the current refresh time for an SSO session before the refresh in the background kicks in?

Fri, Apr 23, 10:26 AM · CAS-SSO, observability, SRE
fgiunchedi moved T280257: Thanos compaction stopped due to local filesystem space shortage from Inbox to In progress on the observability board.
Fri, Apr 23, 10:20 AM · User-fgiunchedi, observability
fgiunchedi added a comment to T269272: Sign-in links from Grafana dashboards don't work when not signed into SSO.

As a data point, after the forcelogin change (thanks!) I haven't experienced faulty logins/redirects when moving from grafana.w.o to grafana-rw.w.o

Fri, Apr 23, 10:09 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), CAS-SSO, observability, SRE
fgiunchedi added a subtask for T272836: Decom ms-be[1019-1026]: T280961: Degraded RAID on ms-be1019.
Fri, Apr 23, 8:19 AM · SRE, ops-eqiad, User-fgiunchedi, SRE-swift-storage
fgiunchedi added a parent task for T280961: Degraded RAID on ms-be1019: T272836: Decom ms-be[1019-1026].
Fri, Apr 23, 8:19 AM · SRE, ops-eqiad
fgiunchedi closed T277163: Prometheus PoPs disk space utilization as Resolved.

Back to 90-ish percent max fs utilization

Fri, Apr 23, 8:18 AM · User-fgiunchedi, observability
fgiunchedi added a comment to T280961: Degraded RAID on ms-be1019.

Host will be ready for decom next week and filesystems are mostly empty already, no need to replace disks. Leaving the task open until decom

Fri, Apr 23, 8:14 AM · SRE, ops-eqiad

Wed, Apr 21

fgiunchedi added a comment to T280773: Swift account to store ML models.

SGTM, in practical terms the work to do involves adding the account to hieradata/common/profile/thanos/swift.yaml in puppet.git and the private bits to "public private" and the real private.git

Thanks! Quick question - what does the .admin setting imply? Being able to do anything on the cluster or something less powerful? (Just trying to figure out what to create)

Wed, Apr 21, 4:07 PM · SRE-swift-storage, Lift-Wing, Machine-Learning-Team (Active Tasks)
fgiunchedi created T280801: Cloud VPS pre-release Debian Bullseye images.
Wed, Apr 21, 3:01 PM · cloud-services-team (Kanban), Cloud-VPS
fgiunchedi added a comment to T280773: Swift account to store ML models.

SGTM, in practical terms the work to do involves adding the account to hieradata/common/profile/thanos/swift.yaml in puppet.git and the private bits to "public private" and the real private.git

Wed, Apr 21, 2:13 PM · SRE-swift-storage, Lift-Wing, Machine-Learning-Team (Active Tasks)
fgiunchedi added a comment to T276697: Implement central logging for mailman3.

Bizarre, PCC is a NOOP indeed. The patch LGTM, but I see mailman3 hasn't logged anything to journald on lists1002 since this morning?

Wed, Apr 21, 2:11 PM · Patch-For-Review, observability, SRE, Wikimedia-Mailing-lists
fgiunchedi added a project to T273716: Improve Alertmanager/LibreNMS notifications: User-fgiunchedi.
Wed, Apr 21, 2:04 PM · User-fgiunchedi, Patch-For-Review, observability
fgiunchedi moved T277163: Prometheus PoPs disk space utilization from Backlog to In progress on the observability board.
Wed, Apr 21, 2:03 PM · User-fgiunchedi, observability
fgiunchedi added a project to T277163: Prometheus PoPs disk space utilization: User-fgiunchedi.
Wed, Apr 21, 2:03 PM · User-fgiunchedi, observability
fgiunchedi added a comment to T278514: Wishlist for AlertManager alerts from Grafana.

Thank you for the feedback! Replies below

Wed, Apr 21, 12:47 PM · User-fgiunchedi, Performance-Team (Radar), observability
fgiunchedi created T280782: thanos-fe2001 machine check exception and crash/stall.
Wed, Apr 21, 12:20 PM · SRE, ops-codfw
fgiunchedi closed T267650: LibreNMS supports more than one Alertmanager address, a subtask of T267018: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations, as Resolved.
Wed, Apr 21, 9:32 AM · SRE, netops, User-fgiunchedi, observability
fgiunchedi closed T267650: LibreNMS supports more than one Alertmanager address as Resolved.

Upgraded librenms today in T266987 and added alertmanager-codfw.w.o to the AM transports.

Wed, Apr 21, 9:32 AM · Upstream, User-fgiunchedi, observability
fgiunchedi added a comment to T273064: Setup Analytics team in VO/splunk oncall.

For the specific problem I think you could also use a case switch (I think preferably using hiera variable like Andrew suggested in the review, similar to is_critical). HTH!

Wed, Apr 21, 8:24 AM · Patch-For-Review, Analytics-Kanban, Analytics-Clusters, User-fgiunchedi, observability
fgiunchedi added a comment to T276697: Implement central logging for mailman3.

For daemons that are logging to syslog/journald the tl;dr to get the logs in logstash is to add the "program name" to modules/profile/files/rsyslog/lookup_table_output.json with value kafka local (or only kafka if you are not interested in local logs). For daemons logging to local files, tl;dr similar setup plus the "input file" part of rsyslog (i.e. rsyslog::input::file). Hope that helps! Happy to review patches of course and/or provide more guidance
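
The first step in this comment (adding a program name with the value `kafka local`) could look roughly like the sketch below. It assumes the file follows rsyslog's generic lookup-table JSON layout (`version`, `nomatch`, `type`, and a `table` list of `index`/`value` pairs); the real contents of `lookup_table_output.json` are not shown here, so the example document, the `sshd` entry, and the helper name `add_program` are all hypothetical.

```python
import json


def add_program(table_doc, program, value="kafka local"):
    """Return a copy of a lookup-table document with `program` mapped to `value`.

    Assumption: entries live under "table" as {"index": ..., "value": ...}.
    An existing entry for the same program is replaced.
    """
    entries = [e for e in table_doc.get("table", []) if e["index"] != program]
    entries.append({"index": program, "value": value})
    entries.sort(key=lambda e: e["index"])  # keep the table ordered by program name
    return {**table_doc, "table": entries}


# Hypothetical starting document, not the real file contents.
example = {
    "version": 1,
    "nomatch": "local",
    "type": "string",
    "table": [{"index": "sshd", "value": "local"}],
}

updated = add_program(example, "mailman3")
print(json.dumps(updated, indent=2))
```

Passing `value="kafka"` instead would correspond to the "only kafka" case mentioned in the comment, where local logs are not wanted.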

Wed, Apr 21, 7:58 AM · Patch-For-Review, observability, SRE, Wikimedia-Mailing-lists
fgiunchedi closed T141038: implement paging for non-ops teams as Resolved.

We have implemented paging for non-ops teams in VO/splunk oncall within icinga, and alertmanager has that capability as well. I'm boldly resolving the task, but feel free to reopen!

Wed, Apr 21, 7:52 AM · observability, Icinga, SRE

Tue, Apr 20

fgiunchedi updated the task description for T279457: Multiple host down alerts from rack C2.
Tue, Apr 20, 1:57 PM · netops, SRE, ops-codfw
fgiunchedi removed a project from T267002: Rename swift.discovery.wmnet to ms-fe.discovery.wmnet: User-fgiunchedi.
Tue, Apr 20, 12:13 PM · SRE-swift-storage