fgiunchedi (Filippo Giunchedi)
/* No comment */

User Details

User Since
Oct 3 2014, 8:06 AM (300 w, 3 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi

Recent Activity

Mon, Jul 6

fgiunchedi closed T257214: Degraded RAID on ms-be2025 as Invalid.

Host came back clean; I've updated the HW RAID firmware while I was at it.

Mon, Jul 6, 2:39 PM · Operations, ops-codfw
fgiunchedi closed T148614: Icinga check for Tor as Declined.

Tor has been retired in T243288: Retire the Tor relay

Mon, Jul 6, 2:15 PM · Patch-For-Review, Icinga, Tor, observability, Operations
fgiunchedi moved T126989: MediaWiki logging & encryption from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:14 PM · MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), Patch-For-Review, observability, Wikimedia-Logstash, MediaWiki-Debug-Logger, Operations
fgiunchedi moved T171122: librenms: consider using Distributed Poller with multiple netmon servers from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:14 PM · observability, Operations
fgiunchedi closed T179078: mpt raid controller not detected as fact on maps-test2* as Declined.

The old hosts have eventually been decom'd!

Mon, Jul 6, 2:14 PM · Patch-For-Review, Operations, observability
fgiunchedi moved T173806: Icinga: evaluate stalking options for some checks from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:12 PM · observability
fgiunchedi moved T160060: Icinga check for sysctl settings from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:12 PM · User-herron, Patch-For-Review, observability, Icinga, Operations
fgiunchedi moved T84845: improve cron spam visibility from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:12 PM · observability, Operations
fgiunchedi moved T179395: Cluster puppet variable and ganglia decommission from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:12 PM · Patch-For-Review, observability, Operations
fgiunchedi moved T163996: Icinga check for ipv6 host reachability from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:12 PM · Operations, observability
fgiunchedi closed T199479: Add alerts for Logstash rates in production as Resolved.

We have Icinga alerts for MediaWiki error rates nowadays, based on Prometheus metrics (via logstash -> statsd -> prometheus).

Mon, Jul 6, 2:11 PM · Sustainability (Incident Prevention), Core Platform Team Legacy (Watching / External), Operations, observability
fgiunchedi moved T82937: CLI script for manual paging from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:10 PM · User-CDanis, Operations, observability, Icinga
fgiunchedi renamed T82937: CLI script for manual paging from re-create script for manual paging to CLI script for manual paging.
Mon, Jul 6, 2:09 PM · User-CDanis, Operations, observability, Icinga
fgiunchedi moved T203485: Revisit Grafana/Icinga notification strategy from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:08 PM · Patch-For-Review, Performance-Team (Radar), observability, Operations
fgiunchedi closed T206131: add monitoring to alert on hosts without RAID as Declined.

With the standard partman recipes being implemented essentially everywhere it also means we get (software) raid "by default". I'm going to boldly resolve the task but please reopen if needed!

Mon, Jul 6, 2:06 PM · observability, Operations
fgiunchedi moved T206939: "Workers" data from prometheus for mw app servers alternates strangely from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:04 PM · Performance-Team (Radar), observability, Operations
fgiunchedi renamed T191400: Add haproxy-exporter to dbproxy hosts from Build package, puppetize and setup prometheus haproxy exporter to Add haproxy-exporter to dbproxy hosts.
Mon, Jul 6, 2:03 PM · DBA, observability
fgiunchedi removed a project from T207938: Make actual number of servers available (like in Grafana board): Graphite.
Mon, Jul 6, 2:01 PM · observability
fgiunchedi moved T207938: Make actual number of servers available (like in Grafana board) from Inbox to Backlog on the observability board.
Mon, Jul 6, 2:01 PM · observability
fgiunchedi moved T208875: Update prometheus-node-exporter NTP metrics from Inbox to Backlog on the observability board.
Mon, Jul 6, 1:59 PM · Operations, observability
fgiunchedi moved T211459: rancid causes puppet to flap on netmon1002 from Inbox to Backlog on the observability board.
Mon, Jul 6, 1:58 PM · observability
fgiunchedi moved T202061: Implement an accurate and easy to understand status page for all wikis from Inbox to Backlog on the observability board.
Mon, Jul 6, 1:52 PM · observability, Operations
fgiunchedi added a comment to T229542: Export LibreNMS data to Prometheus.

Push Gateway implementation at T249311: Deploy Prometheus Push Gateway

Mon, Jul 6, 1:50 PM · observability
fgiunchedi closed T209709: Feature: enable prometheus-nginx-exporter for nginx metrics as Declined.

Boldly declining as we're still using nginx but it is on its way out (frontend caches already off nginx, internal usage should be replaced with envoy)

Mon, Jul 6, 12:15 PM · observability
fgiunchedi removed a project from T210993: Deprecate Diamond collectors in Cloud VPS: User-fgiunchedi.
Mon, Jul 6, 12:14 PM · cloud-services-team (Kanban), observability, Operations
fgiunchedi moved T164238: move icinga contacts file to public repo from Inbox to Backlog on the observability board.
Mon, Jul 6, 12:14 PM · observability, Icinga, Operations
fgiunchedi moved T211982: Find links to grafana.wikimedia.org and change them to use the new URL format from Inbox to Backlog on the observability board.
Mon, Jul 6, 12:13 PM · Operations, observability, User-CDanis
fgiunchedi moved T187434: Include apache_exporter in puppet module httpd (was: apache) from Inbox to Backlog on the observability board.
Mon, Jul 6, 12:13 PM · observability, User-fgiunchedi, Operations
fgiunchedi moved T215848: icinga really needs to check puppet run success of passive icinga hosts from Inbox to Backlog on the observability board.
Mon, Jul 6, 12:12 PM · observability, Icinga, Operations
fgiunchedi moved T216611: Icinga check for ircecho should check for actual activity from Inbox to Backlog on the observability board.
Mon, Jul 6, 12:12 PM · IRCecho, observability, Icinga, Operations
fgiunchedi moved T219902: Stop using public (cached) endpoints for checks on graphite from Inbox to Backlog on the observability board.
Mon, Jul 6, 12:12 PM · observability, Operations
fgiunchedi moved T221784: Puppet failing without Icinga alert in case of dependency cycle from Inbox to Backlog on the observability board.
Mon, Jul 6, 12:07 PM · Puppet, Icinga, observability, Operations
fgiunchedi renamed T222113: prometheus: upgrade to >= 2.12 from prometheus: upgrade to 2.12 to prometheus: upgrade to >= 2.12.
Mon, Jul 6, 12:05 PM · Sustainability (Incident Prevention), observability, Operations
fgiunchedi moved T222102: prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc) from Inbox to Backlog on the observability board.
Mon, Jul 6, 12:05 PM · Sustainability (Incident Prevention), observability, Operations
fgiunchedi closed T224691: labmon / prometheus - query error - monitoring artifacts - Icinga UNKNOWN as Invalid.

Looks like this is no longer an issue: I checked cloudmetrics* (ex labmon) alerts and found no UNKNOWNs.

Mon, Jul 6, 12:01 PM · observability, Cloud-Services
fgiunchedi closed T117821: Make a udp2log output plugin for Logstash as Declined.

Boldly resolving this task, with the logging pipeline in production we can either tap into the kafka log stream pre-logstash or inject messages back into kafka post-logstash after processing

Mon, Jul 6, 11:57 AM · observability, Wikimedia-Logstash, MediaWiki-Debug-Logger
fgiunchedi moved T222113: prometheus: upgrade to >= 2.12 from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:54 AM · Sustainability (Incident Prevention), observability, Operations
fgiunchedi moved T225140: Icinga alerts that should open tasks instead of alerting from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:53 AM · observability
fgiunchedi moved T193766: Ship host syslogs to ELK from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:51 AM · observability, Wikimedia-Logstash, User-herron, Patch-For-Review, Operations
fgiunchedi moved T197173: Ship MX logs to ELK from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:51 AM · observability, User-herron, Wikimedia-Logstash, Operations
fgiunchedi moved T199785: Some logstash syslog entries fail to parse from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:50 AM · observability, Wikimedia-Logstash
fgiunchedi moved T213933: PoC alert/notification functionality with Elastic Stack from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:50 AM · observability, User-fgiunchedi, Patch-For-Review, Restricted Project, Security-Team, Wikimedia-Logstash
fgiunchedi moved T215497: Move iegreview from udp2log to syslog from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:50 AM · observability, Wikimedia-Logstash, Operations
fgiunchedi moved T215499: Move wikimania-scholarships from udp2log to syslog from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:50 AM · observability, Wikimedia-Logstash, Operations
fgiunchedi closed T218691: Remove elasticsearch icinga checks from logstash collectors as Resolved.

AFAICT we've been running all elasticsearch checks in all clusters and we're OK with it, boldly resolving!

Mon, Jul 6, 11:49 AM · observability, Operations, Discovery-Search, Icinga, Elasticsearch, Wikimedia-Logstash
fgiunchedi closed T97297: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine as Invalid.

We have the logging pipeline in production now; in other words, applications send logs to either the local syslog Unix socket / journald or localhost UDP.

Mon, Jul 6, 11:46 AM · observability, Operations, Wikimedia-Logstash
fgiunchedi moved T226703: rsyslog output modules (fwd, kafka) failures should not affect each other from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:44 AM · User-fgiunchedi, observability
fgiunchedi renamed T226703: rsyslog output modules (fwd, kafka) failures should not affect each other from webrequest 5xx data in logstash stopped at ~1:10 2019/06/26 and catched up at ~6:30 2019/06/27 to rsyslog output modules (fwd, kafka) failures should not affect each other.
Mon, Jul 6, 11:44 AM · User-fgiunchedi, observability
fgiunchedi closed T234134: Graphite function sortByTotal() undefined in graphite-labs as Invalid.

The Graphite version on WMCS has caught up and sortByTotal is available (tested on grafana-labs' explore function).

Mon, Jul 6, 11:41 AM · Performance-Team (Radar), observability, Beta-Cluster-Infrastructure, Graphite
fgiunchedi moved T236379: Include #page on host alerts that page SRE from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:38 AM · Icinga, observability
fgiunchedi moved T237604: Record per-server power usage from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:37 AM · observability
fgiunchedi added a comment to T237706: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes.

AFAIK this hasn't recurred, but we might not have had any Phatality deployments since then, @mmodell?

Mon, Jul 6, 11:36 AM · Phatality, Operations, observability
fgiunchedi moved T238006: Icinga alert for hosts with no Puppet roles from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:36 AM · Operations, observability, Puppet
fgiunchedi moved T238794: dropped packets to kafkamon 9000/tcp from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:35 AM · Operations, observability
fgiunchedi moved T240560: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:35 AM · Operations, serviceops, observability
fgiunchedi moved T240571: Consider alerting if journald drops logs for certain units from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:34 AM · observability
fgiunchedi moved T249311: Deploy Prometheus Push Gateway from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:33 AM · observability
fgiunchedi moved T249607: Kibana naming convention from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:33 AM · observability
fgiunchedi moved T251155: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:33 AM · netops, observability, Operations
fgiunchedi moved T149643: Review Icinga alarms with disabled notifications from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:33 AM · observability, Operations
fgiunchedi moved T251156: add traceroute measurements to RIPE Atlas prometheus data from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:33 AM · netops, Operations, observability
fgiunchedi moved T251184: Add Grafana worldmap panel from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:33 AM · observability
fgiunchedi moved T163692: Have puppet create Prometheus LVs from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:32 AM · observability, User-fgiunchedi, Prometheus-metrics-monitoring
fgiunchedi moved T214819: Add license statement to Grafana dashboards from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:32 AM · observability, Graphite, WMF-Legal, Software-Licensing, Operations
fgiunchedi moved T177197: Export Prometheus-compatible JVM metrics from JVMs in production from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:32 AM · observability, Goal, Operations
fgiunchedi closed T224399: exim paniclog on $HOST has non-zero size, a subtask of T132324: Tracking and Reducing cron-spam to root@ , as Resolved.
Mon, Jul 6, 11:31 AM · Patch-For-Review, Operations
fgiunchedi closed T224399: exim paniclog on $HOST has non-zero size as Resolved.

Resolving in favor of T257016: Fix paniclog alert to only sent mails once

Mon, Jul 6, 11:31 AM · observability, Operations
fgiunchedi moved T252773: Move kafkamon hosts to Debian Buster from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:30 AM · Analytics-Clusters, Analytics-Radar, observability, Operations
fgiunchedi moved T257016: Fix paniclog alert to only sent mails once from Inbox to Backlog on the observability board.
Mon, Jul 6, 11:27 AM · User-MoritzMuehlenhoff, observability, Operations
fgiunchedi added a comment to T151009: Provide authenticated access to Thanos native web interface.

Taking over this issue to provide access to Thanos instead, which provides a unified query interface.

Mon, Jul 6, 11:26 AM · observability, Patch-For-Review, User-fgiunchedi, Operations, Prometheus-metrics-monitoring
fgiunchedi renamed T151009: Provide authenticated access to Thanos native web interface from Provide authenticated access to Prometheus native web interface to Provide authenticated access to Thanos native web interface.
Mon, Jul 6, 11:25 AM · observability, Patch-For-Review, User-fgiunchedi, Operations, Prometheus-metrics-monitoring
fgiunchedi closed T223483: Logstash stops processing messages if a single output becomes blocked as Resolved.

Boldly resolving, it is indeed the case that a blocked logstash output exerts backpressure on the whole pipeline. Pending items are T176335: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable and T255243: Increase logging pipeline ingestion capacity

Mon, Jul 6, 11:21 AM · observability, Operations, Wikimedia-Logstash
fgiunchedi moved T217340: Change logstash plugin deployment to use deb packaging and deployment from Up next to In progress on the observability board.
Mon, Jul 6, 11:18 AM · Patch-For-Review, observability, Operations, Discovery-Search
fgiunchedi closed T256953: ps1-c3-codfw icinga checks UNKNOWN as Resolved.

This is complete; it was indeed related to PDU upgrades.

Mon, Jul 6, 10:36 AM · Operations, ops-codfw, observability

Fri, Jul 3

fgiunchedi created P11729 (An Untitled Masterwork).
Fri, Jul 3, 1:42 PM
fgiunchedi created T257024: Buster elasticsearch-curator version not compatible with ELK7.
Fri, Jul 3, 8:51 AM · Operations, Wikimedia-Logstash

Thu, Jul 2

fgiunchedi awarded T256966: dbstore1005 s8 mariadb instance crashed a Party Time token.
Thu, Jul 2, 4:44 PM · Upstream, User-Kormat, Analytics, DBA
herron awarded T256443: move 4 new logstash VMs into production a Like token.
Thu, Jul 2, 3:18 PM · User-fgiunchedi, observability, Operations, Wikimedia-Logstash
fgiunchedi added a comment to T255072: (Due By: 2020-07-25) rack/setup/install alert1001.

@fgiunchedi icinga1001 is in rack C8, that is now a 10G rack. Do you still want this server there or can we move to another rack that is 1G only and eventually migrate icinga1001 to the same rack?

Thu, Jul 2, 3:14 PM · Operations, ops-eqiad, DC-Ops
fgiunchedi moved T213902: Implement sensitive logstash access control from Up next to Backlog on the observability board.
Thu, Jul 2, 2:02 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi moved T207860: Collect client network errors, deprecation, intervention and crash reports from Inbox to Backlog on the observability board.
Thu, Jul 2, 1:58 PM · observability, Traffic, Operations
fgiunchedi moved T256418: Evaluate alternative to Logstash StatsD outputs from Inbox to Up next on the observability board.
Thu, Jul 2, 1:58 PM · Wikimedia-Logstash, observability
fgiunchedi moved T256954: Port Prometheus dashboards to Thanos from Inbox to In progress on the observability board.
Thu, Jul 2, 1:57 PM · User-fgiunchedi, observability, Operations
fgiunchedi closed T215904: Better understanding of Logstash performance as Resolved.

Resolving in favor of T255243: Increase logging pipeline ingestion capacity

Thu, Jul 2, 1:55 PM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi moved T248884: Documentation of client side error logging capabilities on mediawiki from Backlog to Radar on the User-fgiunchedi board.
Thu, Jul 2, 1:53 PM · Analytics-Radar, Product-Infrastructure-Team-Backlog (Kanban), Documentation, Performance-Team (Radar), Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data
fgiunchedi moved T256954: Port Prometheus dashboards to Thanos from Backlog to Doing on the User-fgiunchedi board.
Thu, Jul 2, 1:53 PM · User-fgiunchedi, observability, Operations
fgiunchedi added a project to T256954: Port Prometheus dashboards to Thanos: User-fgiunchedi.
Thu, Jul 2, 1:52 PM · User-fgiunchedi, observability, Operations
fgiunchedi added a comment to T256418: Evaluate alternative to Logstash StatsD outputs.

Option 1 seems attractive to me because it is the proverbial nail in the coffin for the issue of logstash-derived metrics being unreliable in the face of kafka consumer lag. OTOH it is unclear to me how much of an effort it'd be to get there (?)

Thu, Jul 2, 11:52 AM · Wikimedia-Logstash, observability
fgiunchedi created T256954: Port Prometheus dashboards to Thanos.
Thu, Jul 2, 10:32 AM · User-fgiunchedi, observability, Operations
fgiunchedi created T256953: ps1-c3-codfw icinga checks UNKNOWN.
Thu, Jul 2, 10:08 AM · Operations, ops-codfw, observability
fgiunchedi closed T256443: move 4 new logstash VMs into production, a subtask of T255243: Increase logging pipeline ingestion capacity, as Resolved.
Thu, Jul 2, 10:03 AM · Patch-For-Review, User-fgiunchedi, Operations, observability, Wikimedia-Logstash
fgiunchedi closed T256443: move 4 new logstash VMs into production as Resolved.

This is done, thanks @herron for putting the new VMs in service

Thu, Jul 2, 10:03 AM · User-fgiunchedi, observability, Operations, Wikimedia-Logstash
fgiunchedi added a comment to T253555: Remove ganglia leftovers from ops/puppet.

@fgiunchedi: the puppetmaster module still has some ganglia-related things such as prometheus-ganglia-gen. Is that still needed?

Thu, Jul 2, 10:01 AM · Patch-For-Review, Analytics, Traffic, Operations
fgiunchedi added a comment to T205856: Retire udp2log: onboard its producers and consumers to the logging pipeline.

We are on the Kafka pipeline for the MW logs that were previously sent to logstash over the network. udp2log is still in place due to the high volume of logs, but yes, eventually we'd like to deprecate udp2log too and move everything to Kafka.

Thu, Jul 2, 10:00 AM · Analytics-Radar, Performance-Team (Radar), observability, Wikimedia-Logstash, Operations

Fri, Jun 26

fgiunchedi moved T256443: move 4 new logstash VMs into production from Backlog to Doing on the User-fgiunchedi board.
Fri, Jun 26, 2:54 PM · User-fgiunchedi, observability, Operations, Wikimedia-Logstash
fgiunchedi added a comment to T255243: Increase logging pipeline ingestion capacity.

There are 4 new ganeti VMs now, 2 in eqiad and 2 in codfw, in row D each. They are ready to be taken into production in T256443.

Fri, Jun 26, 8:50 AM · Patch-For-Review, User-fgiunchedi, Operations, observability, Wikimedia-Logstash

Thu, Jun 25

fgiunchedi moved T255568: Envoy should listen on ipv6 and ipv4 from Backlog to Radar on the User-fgiunchedi board.
Thu, Jun 25, 2:11 PM · User-fgiunchedi, observability, serviceops

Wed, Jun 24

fgiunchedi created P11651 (An Untitled Masterwork).
Wed, Jun 24, 3:05 PM
fgiunchedi added a comment to T252401: Improve VO integration with Icinga.

I've switched both SRE and WMCS contacts to the VO-specific contacts and verified they work as expected. The previous failure was due to not using address1 in place of email.

Wed, Jun 24, 2:12 PM · observability

Tue, Jun 23

fgiunchedi moved T256139: VM requests for additional Logstash capacity from Backlog to Doing on the User-fgiunchedi board.
Tue, Jun 23, 4:05 PM · serviceops, User-fgiunchedi, Operations, observability, Wikimedia-Logstash