fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (18)

User Details

User Since: Oct 3 2014, 8:06 AM (211 w, 4 d)
Availability: Available
IRC Nick: godog
LDAP User: Filippo Giunchedi
MediaWiki User: Filippo Giunchedi

Recent Activity

Today

fgiunchedi closed T207713: Degraded RAID on ms-be2017 as Invalid.

Looks like a case of the controller freaking out. I've now updated its firmware to 6.60; after a reboot the RAID is clean.

Tue, Oct 23, 8:59 AM · Operations, ops-codfw

Yesterday

fgiunchedi updated the task description for T206454: Setup Kafka cluster, producers and consumers for logging pipeline.
Mon, Oct 22, 10:52 AM · Patch-For-Review, User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi updated the task description for T206633: Setup rsyslog to be able to produce logs to Kafka.
Mon, Oct 22, 10:52 AM · Patch-For-Review, User-fgiunchedi, Wikimedia-Logstash, Operations

Fri, Oct 19

fgiunchedi moved T196484: rack/setup/install graphite1004 from Up next to Doing on the User-fgiunchedi board.
Fri, Oct 19, 3:13 PM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi added a comment to T196484: rack/setup/install graphite1004.

Tried 6MB per thread now: we're ingesting about 30MB/s of UDP traffic, so with 4 statsd-proxy threads each one should be able to buffer its share of the bandwidth (7.5MB/s) for ~1s.
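
The arithmetic behind the ~1s figure can be checked with a short sketch. The numbers come from the comment above; the function name is illustrative, and MB is treated as a plain unit (no MB/MiB distinction), which is an assumption.

```python
def buffer_seconds(total_mb_per_s: float, threads: int, buffer_mb: float) -> float:
    """Return how long one thread's buffer can absorb its share of the traffic."""
    per_thread_rate = total_mb_per_s / threads  # MB/s handled by each thread
    return buffer_mb / per_thread_rate


per_thread = 30 / 4                    # share of the 30MB/s ingest per thread
seconds = buffer_seconds(30, 4, 6)     # 6MB buffer per thread
print(f"{per_thread:.1f} MB/s per thread, buffer lasts {seconds:.1f}s")
```

With these inputs the buffer covers 0.8s of traffic per thread, which matches the rounded ~1s above.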

Fri, Oct 19, 9:41 AM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi added a comment to T196484: rack/setup/install graphite1004.

Setting a 2MB socket receive buffer has helped get errors down to ~0. Unfortunately neither statsd-proxy nor statsite supports setting the SO_RCVBUF socket option via configuration, so I did this to temporarily set the buffer to 2MB and then back to its default:
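
For context, requesting a larger receive buffer from application code looks roughly like the generic Python sketch below. It is not the command used above and not code from statsd-proxy or statsite; the function name is hypothetical and only the 2MB value is taken from the comment.

```python
import socket

REQUESTED_RCVBUF = 2 * 1024 * 1024  # 2MB, the size mentioned in the comment


def open_udp_socket(rcvbuf: int = REQUESTED_RCVBUF) -> socket.socket:
    """Create a UDP socket and ask the kernel for a larger receive buffer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # This is only a request: the kernel may clamp it to a system-wide
    # maximum, so the effective size should be read back with getsockopt.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    return sock


def effective_rcvbuf(sock: socket.socket) -> int:
    """Return the receive buffer size the kernel actually reports."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```

Reading the value back with `effective_rcvbuf(open_udp_socket())` shows whether the request was honoured or clamped.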

Fri, Oct 19, 8:38 AM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi closed T207399: Degraded RAID on ms-be1021 as Invalid.

In this case the controller freaked out; after a reboot the RAIDs are clean:

Fri, Oct 19, 7:31 AM · ops-eqiad, Operations

Thu, Oct 18

fgiunchedi added a comment to T101141: udp rcvbuferrors and inerrors on graphite1001.

We've been observing periodic elevated (>500/s) UDP inerrors / buffer errors on graphite1001 since yesterday, _after_ having switched statsd traffic to graphite1004 in T196484. The only statsd client still sending traffic to graphite1001 in this case is ores, and errors are still elevated even with modest traffic (compared to the firehose of all UDP statsd traffic).

Thu, Oct 18, 12:59 PM · monitoring, MW-1.27-release-notes, MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), Operations, Graphite
fgiunchedi added a comment to T88997: Improve graphite failover.

Since zuul doesn't seem to use/need global statsd aggregation (i.e. multiple hosts send statsd data for the same metric) I was thinking we could sidestep the problem and run statsite locally then have zuul send to localhost:8125 instead. What do you think?

Definitely, that sounds perfect :] Thank you for noticing my edit. The statsd host is mentioned in the hieradata for role::ci::master:

hieradata/role/common/ci/master.yaml
profile::zuul::server::conf:
    # ferm defaults to ACCEPT on loopback:
    gearman_server: 127.0.0.1
    config_git_branch: master
    gearman_server_start: true
    # FIXME use a lookup?
    statsd_host: statsd.eqiad.wmnet   # <--------- [ EASY CHANGE ] ------------
    url_pattern: 'https://integration.wikimedia.org/ci/job/{job.name}/{build.number}/console'
    status_url: 'https://integration.wikimedia.org/zuul/'

So probably we just need to add the statsite profile to the role modules/role/manifests/ci/master.pp, restart Zuul and call it done?
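
To make the "send to localhost:8125" idea concrete, here is a minimal sketch of a client emitting a statsd counter over UDP to a local aggregator. It is not Zuul's code; the function names and the example metric name are hypothetical, and it assumes a statsd-compatible daemon such as statsite is listening on 127.0.0.1:8125.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the statsd line format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one counter increment as a single UDP datagram (fire and forget)."""
    payload = format_counter(name, value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A call such as `send_counter("zuul.example.event")` would then reach the local daemon rather than a remote statsd host, so a graphite failover would not require changing the client.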

Thu, Oct 18, 12:49 PM · Performance-Team (Radar), Patch-For-Review, Zuul, Operations, Graphite
fgiunchedi added a comment to T207296: Rationalize default logrotate "rotated" file extensions.

I'm +1 on dateext going forward; it's likely not worth going back and changing all existing logrotate configs.

Thu, Oct 18, 8:37 AM · Wikimedia-Logstash, Operations
fgiunchedi added a comment to T88997: Improve graphite failover.

I thought statsd.eqiad.wmnet pointed to a service IP that would be moved from host to host but DNS shows it is a CNAME to the graphite hosts.

Thu, Oct 18, 8:19 AM · Performance-Team (Radar), Patch-For-Review, Zuul, Operations, Graphite
fgiunchedi updated the task description for T88997: Improve graphite failover.
Thu, Oct 18, 7:59 AM · Performance-Team (Radar), Patch-For-Review, Zuul, Operations, Graphite

Wed, Oct 17

fgiunchedi added a project to T207292: Review prometheus_nodes params: User-fgiunchedi.
Wed, Oct 17, 4:32 PM · User-fgiunchedi, monitoring, Operations
fgiunchedi added a comment to T206338: Allow directing users to PHP7 based on a cookie.

Agreed on the behaviors we want. On the behaviors that are desirable (i.e. what to do when the engine is down or unresponsive), I think we should stick to what the (absence of the) cookie instructs apache to do. The rationale is that detecting an engine that is down/unresponsive might paper over problems with the engine itself. The other side effect is that the php7 choice would be in two places: mediawiki declaratively, and apache "at runtime" depending on the state of the engines at the time of the request, which IMO will make debugging harder.

Wed, Oct 17, 7:32 AM · Core Platform Team Backlog (Watching / External), Core Platform Team (PHP7 (TEC4)), Patch-For-Review, Operations

Mon, Oct 15

fgiunchedi moved T178690: Better organization for SRE grafana dashboards from In progress to Up next on the monitoring board.
Mon, Oct 15, 3:06 PM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi created T207040: Graphite1001 disk usage at 96%.
Mon, Oct 15, 2:57 PM · Operations, monitoring
fgiunchedi moved T200209: Decom graphite2001 from Backlog to Up next on the monitoring board.
Mon, Oct 15, 2:38 PM · ops-codfw, Operations, monitoring
fgiunchedi moved T200210: Decom graphite2002 from Backlog to Up next on the monitoring board.
Mon, Oct 15, 2:38 PM · monitoring, Operations, ops-codfw
fgiunchedi moved T205852: Onboard at least 10 new non-sensitive log producers to the logging pipeline from In Dev/Progress to Up next on the Wikimedia-Logstash board.
Mon, Oct 15, 2:34 PM · Wikimedia-Logstash, Operations
fgiunchedi moved T205855: Investigate approaches to ingest sensitive log producers from In Dev/Progress to Up next on the Wikimedia-Logstash board.
Mon, Oct 15, 2:34 PM · Wikimedia-Logstash, Operations
fgiunchedi closed T184655: logstash group1 dashboard incorrectly shows testwikidatawiki as Resolved.

Checked now, indeed now testwikidatawiki is in group0 not group1, resolving.

Mon, Oct 15, 2:33 PM · Operations, Wikimedia-Logstash
fgiunchedi closed T138345: Systemd unit did not restart logstash process that died for Elasticsearch connection failures as Invalid.

I don't think we've seen a recurrence of this, so I'm resolving it as invalid; generic systemd unit monitoring should also help catch cases like this.

Mon, Oct 15, 2:30 PM · Wikimedia-Logstash
fgiunchedi moved T200706: rack/setup/install centrallog1001.eqiad.wmnet from Backlog to Up next on the Wikimedia-Logstash board.
Mon, Oct 15, 2:29 PM · User-herron, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi merged T127977: Logstash DC fail-over / per-DC setup into T205850: Procure and provision Logging pipeline hardware in multiple datacenters.
Mon, Oct 15, 2:28 PM · Wikimedia-Logstash, Operations
fgiunchedi merged task T127977: Logstash DC fail-over / per-DC setup into T205850: Procure and provision Logging pipeline hardware in multiple datacenters.
Mon, Oct 15, 2:28 PM · Wikimedia-Logstash, codfw-rollout
fgiunchedi closed T141783: Add monitoring for detecting when logstash services are down as Invalid.

I don't think we've seen a recurrence of this. Logstash also now has monitoring for UDP packet loss, which I'm assuming would also show up if logstash services are down.

Mon, Oct 15, 2:27 PM · Operations, Wikimedia-Logstash
fgiunchedi moved T203169: Logstash hardware expansion from Backlog to Externally blocked on the Wikimedia-Logstash board.
Mon, Oct 15, 2:25 PM · Wikimedia-Logstash, User-fgiunchedi, User-herron, Operations
fgiunchedi moved T206633: Setup rsyslog to be able to produce logs to Kafka from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
Mon, Oct 15, 2:25 PM · Patch-For-Review, User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi moved T205856: Deprecate >= 50% of udp2log producers from In Dev/Progress to Up next on the Wikimedia-Logstash board.
Mon, Oct 15, 2:25 PM · Wikimedia-Logstash, Operations
fgiunchedi moved T206454: Setup Kafka cluster, producers and consumers for logging pipeline from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
Mon, Oct 15, 2:25 PM · Patch-For-Review, User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi moved T205856: Deprecate >= 50% of udp2log producers from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
Mon, Oct 15, 2:25 PM · Wikimedia-Logstash, Operations
fgiunchedi moved T205855: Investigate approaches to ingest sensitive log producers from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
Mon, Oct 15, 2:25 PM · Wikimedia-Logstash, Operations
fgiunchedi moved T205852: Onboard at least 10 new non-sensitive log producers to the logging pipeline from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
Mon, Oct 15, 2:25 PM · Wikimedia-Logstash, Operations
fgiunchedi moved T205851: Migrate >=90% of existing Logstash traffic to the logging pipeline from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
Mon, Oct 15, 2:25 PM · Wikimedia-Logstash, Operations
fgiunchedi moved T205850: Procure and provision Logging pipeline hardware in multiple datacenters from Backlog to Externally blocked on the Wikimedia-Logstash board.
Mon, Oct 15, 2:25 PM · Wikimedia-Logstash, Operations
fgiunchedi moved T205849: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
Mon, Oct 15, 2:24 PM · User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi closed T160644: Eventstreams graphite disk usage as Resolved.

We're doing well space-wise now:

Mon, Oct 15, 2:22 PM · Patch-For-Review, monitoring, Operations, Analytics
fgiunchedi closed T160644: Eventstreams graphite disk usage, a subtask of T1075: Audit groups of metrics in Graphite that allocate a lot of disk space, as Resolved.
Mon, Oct 15, 2:22 PM · monitoring, User-fgiunchedi, Operations, Graphite
fgiunchedi closed T173698: Backfill librenms data in graphite with historical RRDs, a subtask of T171167: Evaluate LibreNMS' Graphite backend, as Declined.
Mon, Oct 15, 2:20 PM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
fgiunchedi closed T173698: Backfill librenms data in graphite with historical RRDs as Declined.

We already have one year of librenms data in Graphite. I'm declining this since we'll eventually reach librenms retention anyway (2yrs IIRC).

Mon, Oct 15, 2:20 PM · User-fgiunchedi, netops, monitoring, Operations
fgiunchedi moved T191400: Build package, puppetize and setup prometheus haproxy exporter from In progress to Backlog on the monitoring board.
Mon, Oct 15, 2:17 PM · monitoring
fgiunchedi moved T196484: rack/setup/install graphite1004 from In progress to Up next on the monitoring board.
Mon, Oct 15, 2:16 PM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi added a project to T200706: rack/setup/install centrallog1001.eqiad.wmnet: Wikimedia-Logstash.

Not really "logstash" but using Wikimedia-Logstash for logging-related tasks

Mon, Oct 15, 12:19 PM · User-herron, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a comment to T183454: Deprovision Diamond collectors no longer in use.

I ran @Krinkle's script to audit grafana dashboards at https://gist.github.com/Krinkle/b5ceff5156c1f4cf3568e373cc135bad to gauge where we're still querying the servers hierarchy; full results are at P7680.

There are some false positives I think (e.g. Prometheus dashboards show up too), but that should give a reasonable idea of where we can remove Diamond and what adjustments to dashboards are needed.

I'll also run an audit on graphite to see which dashboards have actually been requesting the servers hierarchy (IOW which of the above dashboards have been visited).

Mon, Oct 15, 10:45 AM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi updated subscribers of T183454: Deprovision Diamond collectors no longer in use.

I ran @Krinkle's script to audit grafana dashboards at https://gist.github.com/Krinkle/b5ceff5156c1f4cf3568e373cc135bad to gauge where we're still querying the servers hierarchy; full results are at P7680.

Mon, Oct 15, 10:16 AM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi added a project to T206963: Perform a statsd and Graphite switch: monitoring.

The most similar task is likely T88997: Improve graphite failover and related. As far as graphite goes, sending carbon line-oriented traffic is already active-active, in the sense that traffic can be sent to any graphite frontend in codfw/eqiad and it'll be mirrored to the other datacenter.

Mon, Oct 15, 9:45 AM · Performance-Team (Radar), monitoring, Patch-For-Review, Operations, Availability
fgiunchedi added a comment to T206939: "Workers" data from prometheus for mw app servers alternates strangely.

The prometheus.svc endpoint in eqiad and codfw is backed by two independent Prometheus servers scraping the same targets. What I suspect has happened is that one of the two servers "caught" workers in state closing or logging while the other didn't. This also suggests to me that the exporter doesn't report all the metrics it knows about all the time, which leads me to believe that mod_status behaves that way (i.e. when no workers are in state closing they are not reported at all).

Mon, Oct 15, 9:38 AM · Performance-Team (Radar), monitoring, Operations
fgiunchedi added a comment to T206114: Create an Icinga check to alert on packet dropped.

I did a quick audit in eqiad (for starters) to preview how we'd be affected by the alert, in this way:

Mon, Oct 15, 8:38 AM · Discovery-Search (Current work), Patch-For-Review, monitoring, Operations
fgiunchedi added a comment to T202782: upgrade icinga server to stretch and replace einsteinium.

I was looking at T206704: Enable access from icinga1001 to mgmt interfaces, and likely einsteinium/tegmen addresses will be found in other places in the router configuration too (including pfw, like @Volans pointed out) that will need updating.

Mon, Oct 15, 8:13 AM · Patch-For-Review, monitoring, Operations

Wed, Oct 10

fgiunchedi triaged T206633: Setup rsyslog to be able to produce logs to Kafka as Normal priority.
Wed, Oct 10, 2:52 PM · Patch-For-Review, User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T205522: ircecho / icinga-wm crashlooping.

With the latest patch in to log exceptions I think we're good to resolve this?

Wed, Oct 10, 1:16 PM · User-fgiunchedi, Patch-For-Review, IRCecho, Operations
fgiunchedi moved T205526: Register and identify icinga-wm from Doing to Radar on the User-fgiunchedi board.
Wed, Oct 10, 1:16 PM · User-fgiunchedi, Patch-For-Review, Operations
fgiunchedi closed T145867: Test making thumbor statsd metrics available from Prometheus as Resolved.

I'm resolving this since this work is happening as part of T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus

Wed, Oct 10, 1:12 PM · Thumbor, Prometheus-metrics-monitoring

Tue, Oct 9

fgiunchedi updated the task description for T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus.
Tue, Oct 9, 1:27 PM · Patch-For-Review, monitoring, Operations

Mon, Oct 8

fgiunchedi closed T205873: Investigate Kafka main cluster usage for logging pipeline as Resolved.

Looks like we have a way forward! Resolving in favor of T206454: Setup Kafka cluster, producers and consumers for logging pipeline to track the actual Kafka setup work.

Mon, Oct 8, 10:07 AM · Wikimedia-Logstash, Operations
fgiunchedi closed T205873: Investigate Kafka main cluster usage for logging pipeline, a subtask of T205849: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal), as Resolved.
Mon, Oct 8, 10:07 AM · User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi triaged T206454: Setup Kafka cluster, producers and consumers for logging pipeline as Normal priority.
Mon, Oct 8, 10:06 AM · Patch-For-Review, User-fgiunchedi, Wikimedia-Logstash, Operations

Fri, Oct 5

fgiunchedi added a comment to T179050: setup bast4002/WMF7218.

Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Prometheus#Sync_data_from_an_existing_Prometheus_host. Once the TTL expires bast4001 should no longer receive queries; this can be verified by looking at /var/log/apache2/other_vhosts_access.log.

Fri, Oct 5, 12:33 PM · Patch-For-Review, Traffic, Operations, ops-ulsfo
fgiunchedi placed T170817: Upgrade Thumbor servers to Stretch up for grabs.

@fgiunchedi - Are you still working on this or should it be unassigned? It looks like this is blocking T36947 (i.e. upgrading librsvg to fix SVG rendering problems), which I need fixed for T201207.

Fri, Oct 5, 8:53 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi added a comment to T179050: setup bast4002/WMF7218.

Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Prometheus#Sync_data_from_an_existing_Prometheus_host. Once the TTL expires bast4001 should no longer receive queries; this can be verified by looking at /var/log/apache2/other_vhosts_access.log.

Fri, Oct 5, 8:01 AM · Patch-For-Review, Traffic, Operations, ops-ulsfo

Thu, Oct 4

mmodell awarded T204383: Update Debian Package for Scap to 3.8.7-1 an Orange Medal token.
Thu, Oct 4, 4:26 PM · Packaging, Release, Patch-For-Review, Operations, Release-Engineering-Team (Kanban), Scap
fgiunchedi updated the task description for T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus.
Thu, Oct 4, 2:16 PM · Patch-For-Review, monitoring, Operations
fgiunchedi updated the task description for T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus.
Thu, Oct 4, 1:04 PM · Patch-For-Review, monitoring, Operations
fgiunchedi updated the task description for T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus.
Thu, Oct 4, 12:49 PM · Patch-For-Review, monitoring, Operations
fgiunchedi closed T204383: Update Debian Package for Scap to 3.8.7-1 as Resolved.

All done! 3.8.7-1 is live

Thu, Oct 4, 10:44 AM · Packaging, Release, Patch-For-Review, Operations, Release-Engineering-Team (Kanban), Scap
fgiunchedi closed T204383: Update Debian Package for Scap to 3.8.7-1, a subtask of T191921: mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5), as Resolved.
Thu, Oct 4, 10:44 AM · Patch-For-Review, Operations, Release-Engineering-Team (Kanban), Scap
fgiunchedi closed T204383: Update Debian Package for Scap to 3.8.7-1, a subtask of T121597: Implement MediaWiki pre-promote checks, as Resolved.
Thu, Oct 4, 10:44 AM · Patch-For-Review, Wikimedia-Incident, Scap (Scap3-MediaWiki-MVP), scap2
fgiunchedi added a comment to T203177: cloudvps: metrics and analytics.

The Prometheus instance running on labmon1001 is scraping data from cloudcontrol1003 but I can't find the series in https://grafana-labs.wikimedia.org.

If I connect to Prometheus directly on port 9900, the series are there but I've failed to find a suitable data source in grafana-labs that would allow me to see that.

Also, I'm confused about grafana-labs vs grafana vs tools-prometheus. It seems we'd want to store OpenStack metrics in the prod prometheus since they're one level below our tools stack.

Thu, Oct 4, 7:59 AM · Patch-For-Review, Cloud-Services

Wed, Oct 3

fgiunchedi added a comment to T185134: Prometheus 2 breaking change.

Good question @cwdent. We haven't tackled the problem in production yet, though IIRC Prometheus suggests setting up a v2 instance with remote reading from the existing v1 instance. This way data that's not present in v2 will be read from v1; when enough time has passed (e.g. the Prometheus retention period) we can decom the v1 instance.

Wed, Oct 3, 3:21 PM · Fundraising-Backlog, fundraising-tech-ops
fgiunchedi updated the task description for T183454: Deprovision Diamond collectors no longer in use.
Wed, Oct 3, 1:59 PM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi removed a parent task for T177196: Port non-deprecated Diamond collectors to Prometheus: T205862: Expand modern metrics infrastructure coverage (2018-19 Q2 goal).
Wed, Oct 3, 1:23 PM · monitoring, cloud-services-team (Kanban), User-fgiunchedi, Goal, Operations
fgiunchedi removed a subtask for T205862: Expand modern metrics infrastructure coverage (2018-19 Q2 goal): T177196: Port non-deprecated Diamond collectors to Prometheus.
Wed, Oct 3, 1:23 PM · User-fgiunchedi, monitoring, Operations
fgiunchedi moved T196484: rack/setup/install graphite1004 from Doing to Up next on the User-fgiunchedi board.
Wed, Oct 3, 1:16 PM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi moved T205849: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) from Backlog to Doing on the User-fgiunchedi board.
Wed, Oct 3, 1:16 PM · User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi moved T205862: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) from Backlog to Doing on the User-fgiunchedi board.
Wed, Oct 3, 1:16 PM · User-fgiunchedi, monitoring, Operations
fgiunchedi added a project to T205849: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal): User-fgiunchedi.
Wed, Oct 3, 1:16 PM · User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi added a project to T205862: Expand modern metrics infrastructure coverage (2018-19 Q2 goal): User-fgiunchedi.
Wed, Oct 3, 1:15 PM · User-fgiunchedi, monitoring, Operations
fgiunchedi moved T205522: ircecho / icinga-wm crashlooping from Backlog to Radar on the User-fgiunchedi board.
Wed, Oct 3, 1:15 PM · User-fgiunchedi, Patch-For-Review, IRCecho, Operations
fgiunchedi moved T205526: Register and identify icinga-wm from Backlog to Doing on the User-fgiunchedi board.
Wed, Oct 3, 1:15 PM · User-fgiunchedi, Patch-For-Review, Operations
fgiunchedi added a project to T205522: ircecho / icinga-wm crashlooping: User-fgiunchedi.
Wed, Oct 3, 12:35 PM · User-fgiunchedi, Patch-For-Review, IRCecho, Operations
fgiunchedi added a project to T205526: Register and identify icinga-wm: User-fgiunchedi.
Wed, Oct 3, 12:34 PM · User-fgiunchedi, Patch-For-Review, Operations

Tue, Oct 2

fgiunchedi created T205974: logrotate cronspam on ms-be1040.
Tue, Oct 2, 1:12 PM · Patch-For-Review, Operations
fgiunchedi added a comment to T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus.

Also worth taking into consideration: services moving to k8s have statsd_exporter listening on localhost, so for those no deployment is needed, only writing the statsd -> prometheus mapping rules for statsd_exporter to use.
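
As a rough illustration of what such mapping rules do, the sketch below emulates glob-style matching of dotted statsd names in Python. It is a simplified stand-in, not statsd_exporter's implementation or its configuration syntax; the metric names, label names and the `RULES` table are all hypothetical.

```python
import re

# Hypothetical rules: (statsd glob, Prometheus metric name, label templates).
RULES = [
    ("myservice.*.requests.*", "myservice_requests_total",
     {"host": "$1", "status": "$2"}),
]


def glob_to_regex(pattern):
    """Turn a dotted glob into a regex; each '*' captures one name component."""
    parts = [r"([^.]+)" if part == "*" else re.escape(part)
             for part in pattern.split(".")]
    return re.compile(r"\.".join(parts) + r"$")


def map_metric(statsd_name, rules=RULES):
    """Return (prometheus_name, labels) for the first matching rule, else None."""
    for pattern, prom_name, templates in rules:
        match = glob_to_regex(pattern).match(statsd_name)
        if match is None:
            continue
        # Replace $1, $2, ... in each label template with the captured component.
        labels = {
            key: re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), template)
            for key, template in templates.items()
        }
        return prom_name, labels
    return None


print(map_metric("myservice.web1.requests.200"))
# ('myservice_requests_total', {'host': 'web1', 'status': '200'})
```

The point of the mapping is that variable parts of a flat statsd name (here the host and status code) become Prometheus labels instead of being baked into the metric name.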

Tue, Oct 2, 9:19 AM · Patch-For-Review, monitoring, Operations
fgiunchedi added a comment to T191400: Build package, puppetize and setup prometheus haproxy exporter.

The packaging part has been done already as part of T204266: Backport prometheus haproxy exporter for Jessie. What's left in this case, I believe, is the puppetization to add haproxy-exporter to dbproxy hosts and the related job in Prometheus.

Tue, Oct 2, 9:17 AM · monitoring
fgiunchedi closed T200960: Logstash packet loss as Resolved.

A couple of days ago a sudden spike of syslog UDP input again caused packet loss. IOW we have mitigated the common cases, but sudden UDP surges will still cause loss.

The next option is to have something much faster, like rsyslog, listen for syslog messages, write them to a file, and instruct logstash to tail that file instead.

Tue, Oct 2, 9:11 AM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi added projects to T205040: Show SVGs in page language if available: Thumbor, media-storage.

Adding Thumbor too since I'm sure it'll be affected as well. Re: swift space concerns, I don't think it'll be a problem unless the rasterized SVGs take up a lot of space, which I don't think is the case. Thanks for the heads up!

Tue, Oct 2, 9:04 AM · MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, Community-Tech-Sprint, media-storage, Thumbor, Traffic, MediaWiki-Parser, Operations, Community-Tech
fgiunchedi closed T205863: Logstash in beta doesn't have any logs as Resolved.

I bounced logstash on deployment-logstash2 and it looks like logs are flowing again. logstash-plain.log wasn't being written to before the restart, which is a little worrying in itself and makes it hard to tell what's wrong.

Tue, Oct 2, 9:00 AM · Beta-Cluster-Infrastructure, Wikimedia-Logstash
fgiunchedi moved T177196: Port non-deprecated Diamond collectors to Prometheus from Backlog to In progress on the monitoring board.
Tue, Oct 2, 8:48 AM · monitoring, cloud-services-team (Kanban), User-fgiunchedi, Goal, Operations
fgiunchedi added a project to T177196: Port non-deprecated Diamond collectors to Prometheus: monitoring.
Tue, Oct 2, 8:48 AM · monitoring, cloud-services-team (Kanban), User-fgiunchedi, Goal, Operations
fgiunchedi moved T183454: Deprovision Diamond collectors no longer in use from Backlog to In progress on the monitoring board.
Tue, Oct 2, 8:48 AM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi moved T205862: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) from Backlog to In progress on the monitoring board.
Tue, Oct 2, 8:47 AM · User-fgiunchedi, monitoring, Operations
fgiunchedi moved T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus from Backlog to In progress on the monitoring board.
Tue, Oct 2, 8:47 AM · Patch-For-Review, monitoring, Operations
fgiunchedi moved T178690: Better organization for SRE grafana dashboards from Up next to In progress on the monitoring board.
Tue, Oct 2, 8:36 AM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi added a comment to T205873: Investigate Kafka main cluster usage for logging pipeline.

At yesterday's monitoring/logging meeting we discussed this and concluded that for good hygiene and decoupling it makes sense to spin up a new Kafka cluster for logging purposes. What's left to decide is which hardware we're going to run Kafka on, which in turn boils down to a budget question; see also T203169: Logstash hardware expansion.

Tue, Oct 2, 8:27 AM · Wikimedia-Logstash, Operations
fgiunchedi added a comment to T182759: Add Prometheus exporter to Jenkins instances.

As of this morning both Jenkins masters have the Prometheus plugin installed and enabled. The plugin allows them to be used as Prometheus targets (at {jenkins-url}/prometheus) for collecting all sorts of build, node, and Jenkins master related metrics.

However, the plugin seems to have issues when "Fetch the test results of builds" is checked in the plugin configuration. DO NOT ENABLE THIS CONFIGURATION. @thcipriani and I observed high memory usage and request timeouts when this option was selected; we eventually tried killing the request thread and even then it continued to process for over 15 minutes.

We may have to go without individual metrics for tests and test suites for now, but the plugin as it's currently configured provides a good starting point for Prometheus-based metrics collection.

Tue, Oct 2, 8:25 AM · Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure, User-Elukey, User-fgiunchedi, Goal, Operations

Mon, Oct 1

fgiunchedi updated the task description for T205851: Migrate >=90% of existing Logstash traffic to the logging pipeline.
Mon, Oct 1, 2:39 PM · Wikimedia-Logstash, Operations
fgiunchedi created T205873: Investigate Kafka main cluster usage for logging pipeline.
Mon, Oct 1, 2:37 PM · Wikimedia-Logstash, Operations
fgiunchedi created T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus.
Mon, Oct 1, 2:25 PM · Patch-For-Review, monitoring, Operations
fgiunchedi added a parent task for T178690: Better organization for SRE grafana dashboards: T205862: Expand modern metrics infrastructure coverage (2018-19 Q2 goal).
Mon, Oct 1, 1:42 PM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi added subtasks for T205862: Expand modern metrics infrastructure coverage (2018-19 Q2 goal): T178690: Better organization for SRE grafana dashboards, T183454: Deprovision Diamond collectors no longer in use, T177196: Port non-deprecated Diamond collectors to Prometheus.
Mon, Oct 1, 1:42 PM · User-fgiunchedi, monitoring, Operations