fgiunchedi (Filippo Giunchedi)


User Details

User Since
Oct 3 2014, 8:06 AM (228 w, 2 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi [ Global Accounts ]

Recent Activity

Fri, Feb 15

fgiunchedi created T216253: prometheus-pdns-rec-exporter invalid metric output.
Fri, Feb 15, 4:36 PM · monitoring, User-fgiunchedi
fgiunchedi added a comment to T126989: MediaWiki logging & encryption.

Status update: MediaWiki logs that previously went to Logstash in plaintext are now sent via localhost -> rsyslog -> Kafka -> Logstash, and the network paths are encrypted.

udp2log for MediaWiki is still present for debug / high-volume logs; however, it will be deprecated in T205856: Retire udp2log: onboard its producers and consumers to the logging pipeline, with the end state being that all MediaWiki logs are encrypted in transit on the network.

Fri, Feb 15, 4:17 PM · monitoring, Wikimedia-Logstash, MediaWiki-Debug-Logger, Operations
fgiunchedi updated subscribers of T205856: Retire udp2log: onboard its producers and consumers to the logging pipeline.

This is the outline of the plan to move MediaWiki logging off udp2log and onto the logging pipeline's Kafka (cc @bd808 @aaron @Ottomata).

Fri, Feb 15, 4:09 PM · Wikimedia-Logstash, Operations
fgiunchedi removed a project from T170817: Upgrade Thumbor servers to Stretch: User-fgiunchedi.
Fri, Feb 15, 11:24 AM · serviceops, User-jijiki, Patch-For-Review, Performance-Team (Radar), Operations, Thumbor
fgiunchedi moved T213189: Enable rsyslog mmjsonparse module fleetwide from Backlog to Up next on the User-fgiunchedi board.
Fri, Feb 15, 11:24 AM · User-fgiunchedi, Wikimedia-Logstash
fgiunchedi added a comment to T215998: ms-be1033 down and not powering up.

Thanks @Cmjohnson! Did HP provide an ETA for shipment/resolution?

Fri, Feb 15, 11:23 AM · Operations, ops-eqiad
fgiunchedi updated subscribers of T216235: cleanup reprepro configuration for elasticsearch-curator.
Fri, Feb 15, 10:52 AM · User-fgiunchedi, Discovery-Search, Elasticsearch, Operations
fgiunchedi added a project to T216235: cleanup reprepro configuration for elasticsearch-curator: User-fgiunchedi.
Fri, Feb 15, 10:52 AM · User-fgiunchedi, Discovery-Search, Elasticsearch, Operations
fgiunchedi added a comment to T215611: MediaWiki errors overloading logstash.

Apologies for the late reply -- and thanks to all who have helped in fixing the issue!

Fri, Feb 15, 10:03 AM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review, Core Platform Team Kanban (Blocked Externally), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Wikimedia-production-error, Wikimedia-Logstash, Operations, MediaWiki-Database, monitoring
fgiunchedi added a comment to T213898: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch.

On stretch we're installing elasticsearch-curator from the stretch repository by default, which is at version 4.2; instead the package needs to be at version >= 5 and thus come from stretch-wikimedia.

Fri, Feb 15, 8:06 AM · Patch-For-Review, User-herron, Operations, Wikimedia-Logstash

Thu, Feb 14

fgiunchedi added a comment to T199911: Systemd session creation fails under I/O load.

@fgiunchedi with the bandaid in place the problem doesn't show up anymore; do we want to keep this open for tracking, or is it ok to resolve it?

Thu, Feb 14, 5:36 PM · Operations, Operations-Software-Development
fgiunchedi added a comment to T187987: Serve >= 50% of production Prometheus systems with Prometheus v2.

Status update: yesterday I reimaged prometheus2003 and Prometheus 2.7.1 has been running there since. The host is still depooled but is collecting metrics similarly to its counterpart on prometheus2004 (currently ~22.5k samples/s).

Thu, Feb 14, 1:23 PM · Patch-For-Review, monitoring, Operations

Wed, Feb 13

fgiunchedi triaged T216039: jenkins / zuul backing up due to jenkins slaves down as High priority.
Wed, Feb 13, 2:29 PM · Operations, Continuous-Integration-Infrastructure
fgiunchedi created T216039: jenkins / zuul backing up due to jenkins slaves down.
Wed, Feb 13, 2:20 PM · Operations, Continuous-Integration-Infrastructure
fgiunchedi created T216011: kafka1012 power supply alerts.
Wed, Feb 13, 11:14 AM · ops-eqiad, Operations
fgiunchedi added a comment to T213708: Upgrade production prometheus-node-exporter to >= 0.16.

I've just noticed, based on a diffscan email, that the new version of prometheus-node-exporter also binds to :::9100 on IPv6 and listens to all IPv6 clients, while the old node-exporter version would only bind to a specific IPv4 interface and listen on that interface.

This means that on publicly exposed hosts we expose node-exporter to the world over IPv6 if there is no firewall rule. Some servers have no firewall by design, so node-exporter is actually reachable from the open internet.

The solution is probably to simply not listen over IPv6.

Wed, Feb 13, 11:04 AM · Patch-For-Review, Goal, monitoring, Operations
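
To make the exposure above concrete, here is a minimal sketch (not the diffscan tooling itself) that checks whether a host answers on node-exporter's default port 9100 over IPv6; the hostname is a hypothetical placeholder.

```python
#!/usr/bin/env python3
"""Minimal sketch: is node-exporter's port 9100 reachable over IPv6?"""
import socket

def reachable_v6(host, port=9100, timeout=3):
    """Return True if a TCP connection over IPv6 succeeds."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record / no IPv6 address
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue
    return False

if __name__ == "__main__":
    # hypothetical hostname, for illustration only
    print(reachable_v6("example-host.wikimedia.org"))
```
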
fgiunchedi added a comment to T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production .

@fgiunchedi thoughts on this? It looks like we are talking about 10-100 GB files, not quite terabytes.

Wed, Feb 13, 9:27 AM · Research, Operations, Discovery, Analytics
fgiunchedi added a comment to T214529: EDAC events not being reported by node-exporter?.

Thanks @fgiunchedi, that's a good thought! However I couldn't find anything in the SEL for a selection of servers that are currently reporting / have recently reported memory issues:

Wed, Feb 13, 9:23 AM · Patch-For-Review, Operations, monitoring
fgiunchedi triaged T215998: ms-be1033 down and not powering up as High priority.
Wed, Feb 13, 8:01 AM · Operations, ops-eqiad
fgiunchedi created T215998: ms-be1033 down and not powering up.
Wed, Feb 13, 8:01 AM · Operations, ops-eqiad

Tue, Feb 12

fgiunchedi added a comment to T187987: Serve >= 50% of production Prometheus systems with Prometheus v2.

A "big rsync + snapshot prometheus + final rsync" yields about ~2h30m for the final rsync to run, with the bottleneck being a gazillion files on a spinning disk for the global prometheus instance. IOW ~3h (rsync + reimage) will be our gap between new and migrated data in Prometheus v2

Tue, Feb 12, 5:18 PM · Patch-For-Review, monitoring, Operations
fgiunchedi created T215904: Better understanding of Logstash performance.
Tue, Feb 12, 2:28 PM · Wikimedia-Logstash
fgiunchedi created T215900: Introduce logging pipeline input ratelimit.
Tue, Feb 12, 2:15 PM · Wikimedia-Logstash
fgiunchedi closed T215744: Grafana shows zero EventLogging events for around 44 hours around January 15 as Resolved.

It is indeed, thanks @Volans! Tentatively resolving.

Tue, Feb 12, 9:40 AM · monitoring, Analytics

Mon, Feb 11

fgiunchedi added a comment to T213708: Upgrade production prometheus-node-exporter to >= 0.16.

Uploading -3 internally and changing puppet to install that version sounds good to me!

Mon, Feb 11, 4:05 PM · Patch-For-Review, Goal, monitoring, Operations
fgiunchedi edited projects for T215744: Grafana shows zero EventLogging events for around 44 hours around January 15, added: monitoring; removed Graphite.

I can confirm what @elukey was seeing/saying, namely that the data seems to be missing from only one Prometheus instance (hitting d and then r in Grafana reloads the dashboard). This is of course suboptimal and will be resolved once we have something like Thanos in place, which is able to merge responses from multiple Prometheus hosts. See also T213918: [stretch] Investigate distributed and long term storage solutions for Prometheus. On the specific issue at hand, I checked SAL and there was no maintenance on Prometheus eqiad at the time, so something definitely happened to only one of the hosts; leaving the task open for now.

Mon, Feb 11, 4:01 PM · monitoring, Analytics
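
As an illustration of the kind of check described in the comment above, the sketch below compares how much data two Prometheus backends return for the same query over the suspect window via the standard /api/v1/query_range API; the hostnames, port/path, and query are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Rough sketch: spot which Prometheus backend is missing data for a window."""
import time
import requests

HOSTS = ["http://prometheus1003:9900/ops", "http://prometheus1004:9900/ops"]  # hypothetical URLs
QUERY = "up"              # placeholder query
END = time.time()
START = END - 44 * 3600   # the ~44h window mentioned in the task

for host in HOSTS:
    resp = requests.get(
        f"{host}/api/v1/query_range",
        params={"query": QUERY, "start": START, "end": END, "step": "300"},
        timeout=30,
    )
    resp.raise_for_status()
    series = resp.json()["data"]["result"]
    samples = sum(len(s["values"]) for s in series)
    print(f"{host}: {len(series)} series, {samples} samples")
```
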
fgiunchedi added a comment to T214529: EDAC events not being reported by node-exporter?.

Thanks for the deep investigation, truly fascinating! With regard to what @BBlack was saying about the host displaying errors on its LCD, I'm wondering if alerting on SEL errors (i.e. T197084) would yield a better signal-to-noise ratio instead. I haven't delved into what that would mean, however, or even whether it's practical; just my two cents!

Mon, Feb 11, 3:51 PM · Patch-For-Review, Operations, monitoring
fgiunchedi added a comment to T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production .

@fgiunchedi ping on this...or can you ping someone who might know more about using Swift for this kind of thing? Should we consider this option or stick with rsync pull?

Mon, Feb 11, 3:45 PM · Research, Operations, Discovery, Analytics
fgiunchedi reassigned T209921: ms-be2047 spontaneous reboots from fgiunchedi to Papaul.

Host is in service at full weight; assigning to @Papaul for the return of the previous hardware.

Mon, Feb 11, 3:33 PM · Patch-For-Review, User-fgiunchedi, Operations, ops-codfw
fgiunchedi updated the task description for T187987: Serve >= 50% of production Prometheus systems with Prometheus v2.
Mon, Feb 11, 3:10 PM · Patch-For-Review, monitoring, Operations
fgiunchedi updated the task description for T187987: Serve >= 50% of production Prometheus systems with Prometheus v2.
Mon, Feb 11, 2:28 PM · Patch-For-Review, monitoring, Operations
fgiunchedi created T215779: Tools prometheus can't talk to kubelet running on tools-worker.
Mon, Feb 11, 10:56 AM · cloud-services-team (Kanban), Tools
fgiunchedi closed T215272: Upgrade Prometheus to 2.7 in deployment-prep and tools as Resolved.

Tools and deployment-prep are running Prometheus 2.7.1 rebuilt from unstable with k8s support and their storage has been migrated from Prometheus v1.

Mon, Feb 11, 10:54 AM · User-fgiunchedi, Patch-For-Review, cloud-services-team, monitoring, Operations
fgiunchedi closed T215272: Upgrade Prometheus to 2.7 in deployment-prep and tools, a subtask of T187987: Serve >= 50% of production Prometheus systems with Prometheus v2, as Resolved.
Mon, Feb 11, 10:54 AM · Patch-For-Review, monitoring, Operations
fgiunchedi renamed T215272: Upgrade Prometheus to 2.7 in deployment-prep and tools from Upgrade Prometheus to 2.6 in deployment-prep and tools to Upgrade Prometheus to 2.7 in deployment-prep and tools.
Mon, Feb 11, 10:49 AM · User-fgiunchedi, Patch-For-Review, cloud-services-team, monitoring, Operations
fgiunchedi added a comment to T214608: rack/setup/install logstash101[012].eqiad.wmnet.

Thanks @Cmjohnson! Please treat this as a priority this week, since we're running short on disk space on the existing logstash eqiad hosts.

Mon, Feb 11, 10:17 AM · Patch-For-Review, Operations

Fri, Feb 8

fgiunchedi closed T198041: graphite2001 crashed as Declined.

Host is going to be decommissioned -- declining.

Fri, Feb 8, 11:40 AM · Operations, ops-codfw, monitoring
fgiunchedi closed T198041: graphite2001 crashed, a subtask of T196483: rack/setup/install graphite2003, as Declined.
Fri, Feb 8, 11:39 AM · monitoring, Patch-For-Review, ops-codfw, Operations
fgiunchedi moved T210137: Handle unknown stats in rsyslog_exporter from Backlog to Up next on the monitoring board.
Fri, Feb 8, 11:38 AM · monitoring
fgiunchedi moved T213708: Upgrade production prometheus-node-exporter to >= 0.16 from Backlog to In progress on the monitoring board.
Fri, Feb 8, 11:38 AM · Patch-For-Review, Goal, monitoring, Operations
fgiunchedi moved T214594: node-exporter collector.diskstats.ignored-devices underescaped from Backlog to Up next on the monitoring board.
Fri, Feb 8, 11:36 AM · Patch-For-Review, monitoring
fgiunchedi moved T214529: EDAC events not being reported by node-exporter? from Backlog to In progress on the monitoring board.
Fri, Feb 8, 11:36 AM · Patch-For-Review, Operations, monitoring
fgiunchedi moved T215272: Upgrade Prometheus to 2.7 in deployment-prep and tools from Backlog to In progress on the monitoring board.
Fri, Feb 8, 11:36 AM · User-fgiunchedi, Patch-For-Review, cloud-services-team, monitoring, Operations

Thu, Feb 7

fgiunchedi added a comment to T215272: Upgrade Prometheus to 2.7 in deployment-prep and tools.

Conversion of tools-prometheus-02 worked as expected: I stopped v1, moved the v1 metrics out of the way, installed the v2 package, then started v2. Once v2 was running and collecting metrics into an empty storage, I ran prometheus-storage-migrator on the v1 data. That took about 5h, and once complete the migrated data could be merged back into the fresh v2 storage. Overall there was about 3h worth of missing data, from 6am to 9am (when I upgraded to v2).

Thu, Feb 7, 5:55 PM · User-fgiunchedi, Patch-For-Review, cloud-services-team, monitoring, Operations
fgiunchedi added a comment to T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.

I ran an audit of producers that sent logs through the three least-used inputs over the last 24h (sorted by increasing volume per input).

Thu, Feb 7, 3:53 PM · Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi closed T173721: Track down the source of periodic increases in requests to swift eqiad as Resolved.

Turns out the spikes are caused by the periodic restarts of the varnish upload backends, and are thus expected.

Thu, Feb 7, 1:44 PM · media-storage, User-fgiunchedi, Operations
fgiunchedi updated the task description for T177197: Export Prometheus-compatible JVM metrics from JVMs in production.
Thu, Feb 7, 1:43 PM · User-Elukey, User-fgiunchedi, Goal, Operations
fgiunchedi closed T192610: prometheus on bast3002 misbehaving as Invalid.

We haven't seen this recurring AFAIK, and we're also upgrading to Prometheus 2.6; tentatively resolving.

Thu, Feb 7, 1:41 PM · User-fgiunchedi, Operations, monitoring
fgiunchedi updated subscribers of T77110: Push messages to logstash from JS.

cc @Jhernandez as we talked about this at all-hands; possibly a duplicate/old task though.

Thu, Feb 7, 1:38 PM · MediaWiki-General-or-Unknown, JavaScript, Wikimedia-Logstash
fgiunchedi closed T126582: Log input from cassandra caused logstash process to crash repeatedly as Invalid.

AFAIK we haven't seen this bug recurring; tentatively resolving.

Thu, Feb 7, 1:36 PM · Wikimedia-Logstash
fgiunchedi added a comment to T126989: MediaWiki logging & encryption.

Status update: MediaWiki logs that previously went to Logstash in plaintext are now sent via localhost -> rsyslog -> Kafka -> Logstash, and the network paths are encrypted.

Thu, Feb 7, 1:35 PM · monitoring, Wikimedia-Logstash, MediaWiki-Debug-Logger, Operations
fgiunchedi closed T213078: Replicas set to two on logstash indices regardless of index age as Resolved.

Thanks to the patch from @herron this is fixed now:

Thu, Feb 7, 1:25 PM · Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a project to T213189: Enable rsyslog mmjsonparse module fleetwide: User-fgiunchedi.
Thu, Feb 7, 1:21 PM · User-fgiunchedi, Wikimedia-Logstash
fgiunchedi committed rMSCAe6063b0758e1: run php syntax in parallel with php -l (authored by fgiunchedi).
run php syntax in parallel with php -l
Thu, Feb 7, 12:10 PM
fgiunchedi committed rMSCAabbed8a32850: run php syntax in parallel with php -l (authored by fgiunchedi).
run php syntax in parallel with php -l
Thu, Feb 7, 12:10 PM
fgiunchedi committed rMSCAa9af8cd4e92a: run php syntax in parallel with php -l (authored by fgiunchedi).
run php syntax in parallel with php -l
Thu, Feb 7, 12:10 PM
fgiunchedi committed rMSCAf260aa336af3: run php syntax in parallel with php -l (authored by fgiunchedi).
run php syntax in parallel with php -l
Thu, Feb 7, 12:10 PM
fgiunchedi removed a subtask for T213157: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6): T86969: Log directly to logstash via syslog input.
Thu, Feb 7, 11:26 AM · User-fgiunchedi, User-herron, Operations, Wikimedia-Logstash
fgiunchedi removed a parent task for T86969: Log directly to logstash via syslog input: T213157: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6).
Thu, Feb 7, 11:26 AM · Scap
fgiunchedi triaged T215499: Move wikimania-scholarships from udp2log to syslog as Normal priority.
Thu, Feb 7, 11:18 AM · Wikimedia-Logstash, Operations
fgiunchedi triaged T215497: Move iegreview from udp2log to syslog as Normal priority.
Thu, Feb 7, 11:14 AM · Wikimedia-Logstash, Operations
fgiunchedi added a comment to T86969: Log directly to logstash via syslog input.

Update on udp2log deprecation: we've now deployed the new logging infrastructure. In other words, the recommended way is to use the system's syslog daemon (i.e. the unix socket at /dev/log) and opt programs in to have their syslog ingested into Logstash via Kafka. For JSON logging the syslog payload should be prefixed with @cee: to signal structured/JSON logging (a minimal example follows below). I'm happy to provide further guidance too!

Thu, Feb 7, 11:10 AM · Scap
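
A minimal example of the opt-in described above, assuming a Python application: emit a structured (@cee:-prefixed JSON) log line to the local syslog daemon on /dev/log, from where rsyslog can ship it onwards. The program name and fields are illustrative.

```python
#!/usr/bin/env python3
"""Minimal sketch: structured (@cee:) logging via the local syslog socket."""
import json
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(address="/dev/log")
# The syslog tag (program name) comes from the start of the message;
# "myapp:" is an illustrative placeholder.
handler.setFormatter(logging.Formatter("myapp: @cee: %(message)s"))

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# The message body is a JSON document; the @cee: prefix signals structured logging.
logger.info(json.dumps({"event": "user_login", "user": "example", "success": True}))
```
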
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Thu, Feb 7, 10:53 AM · Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi added a comment to T209110: Logging for the session storage service.

Thanks for kickstarting the discussion @Eevans @Pchelolo!

Thu, Feb 7, 10:15 AM · Patch-For-Review, User-Clarakosi, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
fgiunchedi awarded T191182: Stop using Differential for code review a Like token.
Thu, Feb 7, 9:48 AM · Release-Engineering-Team (Backlog), Phabricator, Gerrit
fgiunchedi merged T215451: Logging elasticsearch had shards unassigned and icinga didn't mention it into T212850: Create Icinga check for failed shard allocation.
Thu, Feb 7, 9:42 AM · Patch-For-Review, Discovery-Search (Current work), Operations, Elasticsearch
fgiunchedi merged task T215451: Logging elasticsearch had shards unassigned and icinga didn't mention it into T212850: Create Icinga check for failed shard allocation.
Thu, Feb 7, 9:42 AM · Wikimedia-Logstash
fgiunchedi added a comment to T211661: Automatically clean up unused thumbnails in Swift.

And indeed I share the concerns already mentioned, namely making sure we're able to have a bound on writes (deletes and/or metadata updates) as we're expiring thumbnails.

Hi Filippo! :) Thanks for humoring my kibitzing.

You could update metadata on 0.01% of requests, but cap that to a fixed maximum number of metadata updates per minute. And if your expiry times are relative (using X-Delete-After), the limit on metadata updates also gives you an upper bound on the rate of deletes. (But it also means that any time you tweak the rate of metadata updates, there is a delay of $TTL seconds before you see the effect on deletions.) I figure you'd start with a very conservative cap and then bump it up over time.

Thu, Feb 7, 9:37 AM · Patch-For-Review, Traffic, media-storage, Operations, Performance-Team
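
A sketch of the capping approach described in the comment above, under assumed parameters: refresh a thumbnail's relative expiry (X-Delete-After) on a small sample of requests, with a hard per-minute cap on metadata updates. The Swift URL, token handling, and numbers are illustrative placeholders rather than production values.

```python
#!/usr/bin/env python3
"""Sketch: probabilistic expiry refresh with a per-minute cap on Swift metadata updates."""
import random
import time
import requests

SAMPLE_RATE = 0.0001          # update metadata on ~0.01% of requests
MAX_UPDATES_PER_MINUTE = 60   # hard cap on metadata updates (and hence later deletes)
TTL_SECONDS = 30 * 24 * 3600  # relative expiry window

_window_start = time.monotonic()
_updates_in_window = 0

def maybe_refresh_expiry(object_url, token):
    """Probabilistically refresh the object's expiry, within the per-minute cap."""
    global _window_start, _updates_in_window
    if random.random() >= SAMPLE_RATE:
        return
    now = time.monotonic()
    if now - _window_start >= 60:
        _window_start, _updates_in_window = now, 0
    if _updates_in_window >= MAX_UPDATES_PER_MINUTE:
        return  # cap reached, skip this update
    _updates_in_window += 1
    # A Swift POST on an object updates its metadata; X-Delete-After sets a
    # relative expiry, so refreshing it keeps recently-requested thumbs alive.
    requests.post(
        object_url,
        headers={"X-Auth-Token": token, "X-Delete-After": str(TTL_SECONDS)},
        timeout=10,
    )
```
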
fgiunchedi awarded T212850: Create Icinga check for failed shard allocation a Like token.
Thu, Feb 7, 9:23 AM · Patch-For-Review, Discovery-Search (Current work), Operations, Elasticsearch

Wed, Feb 6

fgiunchedi created T215451: Logging elasticsearch had shards unassigned and icinga didn't mention it.
Wed, Feb 6, 7:06 PM · Wikimedia-Logstash
fgiunchedi awarded T215413: Image Classification Working Group a Yellow Medal token.
Wed, Feb 6, 4:49 PM · Reading-Admin, SDC General, Wikidata, Multimedia, Discovery-Search, Analytics, Research
fgiunchedi moved T215272: Upgrade Prometheus to 2.7 in deployment-prep and tools from Backlog to Doing on the User-fgiunchedi board.
Wed, Feb 6, 11:20 AM · User-fgiunchedi, Patch-For-Review, cloud-services-team, monitoring, Operations
fgiunchedi added a project to T215272: Upgrade Prometheus to 2.7 in deployment-prep and tools: User-fgiunchedi.
Wed, Feb 6, 11:19 AM · User-fgiunchedi, Patch-For-Review, cloud-services-team, monitoring, Operations
fgiunchedi added a comment to T211661: Automatically clean up unused thumbnails in Swift.

IIRC object expiration was considered years ago (i.e. https://wikitech.wikimedia.org/wiki/Swift/ObjectExpiration) and was at the time considered buggy for production use, though after 4+ years I'm sure the situation has changed, so it's definitely worth reconsidering as an option too.

Wed, Feb 6, 10:41 AM · Patch-For-Review, Traffic, media-storage, Operations, Performance-Team
fgiunchedi added a comment to T116011: ferm: Log dropped packets.

I have created a simple module for configuring ulogd2, available at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/486513/. The question is what we want to log and how we want to log it. With the default configuration the module will produce the following:

As a first step we should log dropped packets; that's useful e.g. to debug problems or when we fine-tune a new service. But in some cases it's certainly also useful to be able to switch to the complete log, e.g. when we work on making an existing ferm rule more restrictive. Maybe we can make the mode configurable via Hiera?

On the matter of where to log, I'd prefer Logstash, but @herron or @fgiunchedi are in a better position to comment on how best to integrate this.

Wed, Feb 6, 10:05 AM · Patch-For-Review, Operations

Tue, Feb 5

fgiunchedi added a comment to T213078: Replicas set to two on logstash indices regardless of index age.

This is causing the logstash disks to get fuller than expected; part of the root cause seems to be that the previous curator action (forcemerge) doesn't complete, and thus the "reduce replicas" action doesn't run:

Tue, Feb 5, 2:54 PM · Patch-For-Review, Wikimedia-Logstash
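
For context, the "reduce replicas" step above amounts to lowering index.number_of_replicas on older indices; the sketch below expresses that directly against the Elasticsearch index settings API (the actual fix went through a curator patch from @herron, per the Feb 7 entry earlier in this feed). Host and index name are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Illustrative sketch: lower the replica count on an old Logstash index."""
import requests

ES = "http://logstash-es.example:9200"   # hypothetical Elasticsearch endpoint
INDEX = "logstash-2019.02.01"            # hypothetical old index

resp = requests.put(
    f"{ES}/{INDEX}/_settings",
    json={"index": {"number_of_replicas": 1}},  # e.g. drop from 2 replicas to 1
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```
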
fgiunchedi moved T213918: [stretch] Investigate distributed and long term storage solutions for Prometheus from Backlog to Up next on the User-fgiunchedi board.
Tue, Feb 5, 2:44 PM · User-fgiunchedi, Goal, monitoring, Operations
fgiunchedi closed T215172: Remove graphite data for nodepool as Resolved.

Patch merged, thanks @hashar!

Tue, Feb 5, 2:29 PM · Patch-For-Review, serviceops, Continuous-Integration-Infrastructure (shipyard)
fgiunchedi closed T215172: Remove graphite data for nodepool, a subtask of T209361: Phase out Nodepool from production, as Resolved.
Tue, Feb 5, 2:29 PM · Patch-For-Review, Nodepool, Continuous-Integration-Infrastructure (shipyard), Release-Engineering-Team (Kanban)
fgiunchedi added a comment to T215172: Remove graphite data for nodepool.

Indeed the directory/metrics aren't there anymore:

Tue, Feb 5, 2:27 PM · Patch-For-Review, serviceops, Continuous-Integration-Infrastructure (shipyard)
fgiunchedi triaged T215272: Upgrade Prometheus to 2.7 in deployment-prep and tools as Normal priority.
Tue, Feb 5, 1:58 PM · User-fgiunchedi, Patch-For-Review, cloud-services-team, monitoring, Operations
fgiunchedi added a comment to T187987: Serve >= 50% of production Prometheus systems with Prometheus v2.

I ran a test conversion on graphite2001 using prometheus-storage-migrator with a snapshot of data taken from prometheus2003 and a parallelism of 10:

Tue, Feb 5, 12:23 PM · Patch-For-Review, monitoring, Operations
fgiunchedi assigned T214838: ms-be1034 crash to Cmjohnson.

Similarly to T214778: Degraded RAID on ms-be1020, this host will need a firmware/BIOS/etc upgrade; assigning to @Cmjohnson. Let me know when it's ok to take the host offline.

Tue, Feb 5, 10:37 AM · Operations, media-storage
fgiunchedi reassigned T214778: Degraded RAID on ms-be1020 from fgiunchedi to Cmjohnson.

This is an HP server; while the firmware can probably be updated remotely, it would be best if I did the update on-site with the service pack so I can update everything else at the same time.

Tue, Feb 5, 10:36 AM · ops-eqiad, Operations
fgiunchedi moved T209921: ms-be2047 spontaneous reboots from Radar to Doing on the User-fgiunchedi board.
Tue, Feb 5, 10:26 AM · Patch-For-Review, User-fgiunchedi, Operations, ops-codfw
fgiunchedi added a comment to T209921: ms-be2047 spontaneous reboots.

@fgiunchedi I replaced the problematic server with the new one Dell shipped to me. The OS is installed and the first puppet run is done. I will proceed with the disk wipe on the old server on Wednesday before shipping it back to Dell. Let me know if you have any questions.

Thanks.

Tue, Feb 5, 10:25 AM · Patch-For-Review, User-fgiunchedi, Operations, ops-codfw
fgiunchedi closed T177196: Port non-deprecated Diamond collectors to Prometheus as Resolved.
Tue, Feb 5, 8:43 AM · monitoring, cloud-services-team (Kanban), User-fgiunchedi, Goal, Operations
fgiunchedi closed T177196: Port non-deprecated Diamond collectors to Prometheus, a subtask of T177195: Reduce technical debt in metrics monitoring, as Resolved.
Tue, Feb 5, 8:43 AM · User-CDanis, User-fgiunchedi, Technical-Debt, Goal, Operations
fgiunchedi added a project to T187434: Include apache_exporter in puppet module apache: monitoring.
Tue, Feb 5, 8:43 AM · monitoring, User-fgiunchedi, Goal, Operations
fgiunchedi closed T214558: Puppet failure on deployment-prometheus01.deployment-prep.eqiad.wmflabs as Resolved.

I ran puppet on the VM until the run was successful and then disabled puppet again; resolving.

Tue, Feb 5, 8:41 AM · monitoring, Puppet, Beta-Cluster-Infrastructure

Sun, Jan 27

fgiunchedi claimed T214778: Degraded RAID on ms-be1020.

Host is back after a power cycle; it looks like the RAID controller freaked out. Leaving this open to upgrade the controller firmware.

Sun, Jan 27, 5:03 PM · ops-eqiad, Operations
fgiunchedi closed T214796: ms-be1034 icinga alers as Resolved.

The root cause looks like it was additional load from ms-be1020 being extremely slow (but reachable) combined with traffic ramping up in eqiad; resolving this in favor of T214778: Degraded RAID on ms-be1020.

Sun, Jan 27, 5:02 PM · Operations

Fri, Jan 25

fgiunchedi added a comment to T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus.

Status update: 4 out of 40 services (swift / ores / thumbor / logstash) have their metrics collected by Prometheus by virtue of using statsd_exporter, so 10% at the moment. mathoid on k8s is on its way to having Prometheus metrics too.

Fri, Jan 25, 11:04 AM · Performance-Team (Radar), Patch-For-Review, monitoring, Operations
fgiunchedi added a comment to T205851: Migrate >=90% of existing Logstash traffic to the logging pipeline.

Expanding from the graph above with this expression: sum by (plugin_id) (rate(logstash_node_plugin_events_out_total{plugin_id=~"input/.*"}[5m])) / scalar(sum(rate(logstash_node_plugin_events_out_total{plugin_id=~"input/.*"}[5m]))) -- over the last week we've averaged 87% of logs onto the new logging pipeline (86% MediaWiki + 1% from onboarded apps). A sketch of running this query via the Prometheus API follows below.

Fri, Jan 25, 9:57 AM · Patch-For-Review, Wikimedia-Logstash, Operations
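
A small sketch of evaluating the expression quoted above through the Prometheus HTTP API (/api/v1/query) to get the per-input share of Logstash traffic; the Prometheus base URL is a hypothetical placeholder.

```python
#!/usr/bin/env python3
"""Sketch: per-input share of Logstash traffic via the Prometheus HTTP API."""
import requests

PROM = "http://prometheus.example:9090"  # hypothetical Prometheus endpoint
EXPR = (
    'sum by (plugin_id) (rate(logstash_node_plugin_events_out_total{plugin_id=~"input/.*"}[5m]))'
    ' / scalar(sum(rate(logstash_node_plugin_events_out_total{plugin_id=~"input/.*"}[5m])))'
)

resp = requests.get(f"{PROM}/api/v1/query", params={"query": EXPR}, timeout=30)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    plugin = result["metric"].get("plugin_id", "unknown")
    share = float(result["value"][1])
    print(f"{plugin}: {share:.1%}")
```
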

Thu, Jan 24

fgiunchedi created T214594: node-exporter collector.diskstats.ignored-devices underescaped.
Thu, Jan 24, 4:02 PM · Patch-For-Review, monitoring
fgiunchedi updated subscribers of T214558: Puppet failure on deployment-prometheus01.deployment-prep.eqiad.wmflabs.

Indeed, these failures are part of testing for T213708: Upgrade production prometheus-node-exporter to >= 0.16; is there a way we can temporarily mute the emails or something like that?

Thu, Jan 24, 10:07 AM · monitoring, Puppet, Beta-Cluster-Infrastructure
fgiunchedi moved T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring from Doing to Up next on the User-fgiunchedi board.
Thu, Jan 24, 10:06 AM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, monitoring, Operations

Wed, Jan 23

fgiunchedi added a comment to T208566: puppet.git rake fails with ruby 2.5.

In case it is useful, on a buster system I'm using this to run rake locally: PUPPET_GEM_VERSION=4.10.12 bundle exec rake test

Wed, Jan 23, 1:39 PM · Continuous-Integration-Config, Operations, Puppet

Tue, Jan 22

fgiunchedi renamed T214309: kafka / logstash / elasticsearch lag monitoring and alerting from logstash / elasticsearch indexing lag to kafka / logstash / elasticsearch lag monitoring and alerting.
Tue, Jan 22, 4:52 PM · Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a comment to T214309: kafka / logstash / elasticsearch lag monitoring and alerting.

Messages are now spread amongst brokers as expected after https://gerrit.wikimedia.org/r/485833 and I believe the immediate issue (i.e. lag) has been resolved; however, I'm leaving this task open for the general issue of monitoring / alerting on both Kafka consumer lag and ingestion/indexing lag.

Tue, Jan 22, 4:51 PM · Patch-For-Review, Wikimedia-Logstash