fgiunchedi (Filippo Giunchedi)
/* No comment */

User Details

User Since
Oct 3 2014, 8:06 AM (263 w, 4 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi

Recent Activity

Today

fgiunchedi closed T233638: rack/setup/install ms-be205[1-6].codfw.wmnet as Resolved.

This is completed, hosts are fully in service now.

Tue, Oct 22, 4:40 PM · User-fgiunchedi, Operations, SRE-swift-storage, ops-codfw
fgiunchedi added a comment to T234564: Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array.

I tested that on mwdebug1001, triggered the case that should log stuff, and did not see anything being logged in Kibana. I think it still doesn't work.

Tue, Oct 22, 10:50 AM · MW-1.35-notes (1.35.0-wmf.4; 2019-10-29), Patch-For-Review, Wikimedia-production-error, Release-Engineering-Team, Performance-Team (Radar), Deployments, Wikimedia-Logstash, VisualEditor
fgiunchedi added a comment to T234564: Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array.

FTR: as of indices starting on Oct 22nd the limit is now 2048 fields instead of 1000 previously

Tue, Oct 22, 10:34 AM · MW-1.35-notes (1.35.0-wmf.4; 2019-10-29), Patch-For-Review, Wikimedia-production-error, Release-Engineering-Team, Performance-Team (Radar), Deployments, Wikimedia-Logstash, VisualEditor
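The field limit mentioned in the comment above presumably corresponds to Elasticsearch's `index.mapping.total_fields.limit` index setting (default 1000). That mapping is an assumption, since the comment does not name the setting. A minimal sketch of a settings body that raises the cap to 2048; the helper name `field_limit_settings` is hypothetical:

```python
import json


def field_limit_settings(limit):
    """Build an index settings body capping the number of mapped fields.

    Assumption: the limit discussed on the task is the Elasticsearch
    index setting index.mapping.total_fields.limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return {"index": {"mapping": {"total_fields": {"limit": limit}}}}


if __name__ == "__main__":
    # Print the body that would be applied to newly created indices.
    print(json.dumps(field_limit_settings(2048), indent=2))
```

Since the comment says the new limit applies to indices starting on Oct 22nd, a body like this would most likely live in an index template rather than be applied to existing indices.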
fgiunchedi created T236130: Elevated 502s observed in ulsfo.
Tue, Oct 22, 9:16 AM · Operations, Traffic

Yesterday

fgiunchedi created T236075: Evaluate, suggest and choose an alert escalation solution.
Mon, Oct 21, 3:02 PM · User-fgiunchedi, observability
fgiunchedi closed T231110: bring swiftrepl back to life, a subtask of T231086: Picture from Commons not found from Singapore, as Resolved.
Mon, Oct 21, 2:49 PM · User-fgiunchedi, Structured-Data-Backlog, Structured Data Engineering, Multimedia, MW-1.34-notes (1.34.0-wmf.21; 2019-09-03), Patch-For-Review, Commons, MediaWiki-File-management, SRE-swift-storage, Traffic, Operations
fgiunchedi closed T231110: bring swiftrepl back to life as Resolved.

This is effectively done (i.e. swiftrepl is back), following up in T162123: Running swiftrepl is not puppetized

Mon, Oct 21, 2:49 PM · User-fgiunchedi, Commons, MediaWiki-File-management, SRE-swift-storage, Operations
fgiunchedi added a comment to T162123: Running swiftrepl is not puppetized.

swiftrepl is now running puppetized on both codfw and eqiad and running as a timer once a week per site.

Mon, Oct 21, 1:54 PM · User-fgiunchedi, Operations, SRE-swift-storage

Fri, Oct 18

fgiunchedi created T235891: Ingest production logs with ELK7.
Fri, Oct 18, 3:57 PM · Patch-For-Review, Operations, Wikimedia-Logstash
fgiunchedi moved T227668: Per-backend ATS Prometheus metrics from Doing to Backlog on the User-fgiunchedi board.
Fri, Oct 18, 1:40 PM · User-fgiunchedi, observability, Traffic, Operations
fgiunchedi moved T228380: Tech debt: sunsetting of Graphite (part 1) (Q1 goal FY19-20) from Doing to Up next on the User-fgiunchedi board.
Fri, Oct 18, 1:40 PM · User-fgiunchedi, Goal, observability
fgiunchedi moved T215904: Better understanding of Logstash performance from Up next to Doing on the User-fgiunchedi board.
Fri, Oct 18, 1:40 PM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi closed T234232: Hosts in puppet with $cluster missing from wikimedia_clusters as Resolved.

Excellent @aborrero ! All looking good, boldly resolving.

Fri, Oct 18, 10:58 AM · Operations, observability
fgiunchedi added a comment to T235804: Capture metrics for postfix queue depth both overall and for specific queues and domains.

FWIW yes cron'd scripts to export gauges would work (I haven't seen the script though). We tend to limit metrics exported by node-exporter to system-level things, although not a strict requirement. Another way to achieve a similar result (what SRE does in production) is to deploy mtail to parse postfix logs and get counters out of its logs, HTH!

Fri, Oct 18, 10:18 AM · Fundraising-Backlog, Epic, observability, fundraising-tech-ops
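The "cron'd script exporting gauges" approach from the comment above could look roughly like the sketch below. It only renders queue depths in the Prometheus text exposition format; collecting the depths and writing the output to a file read by node-exporter's textfile collector are left out. The metric name `postfix_queue_depth`, the `queue` label and the function name are hypothetical, not taken from the task.

```python
def render_queue_gauges(depths):
    """Render per-queue message counts as Prometheus gauge samples.

    `depths` maps a queue name (e.g. "active", "deferred") to the
    number of messages currently in it. Metric and label names are
    illustrative assumptions.
    """
    lines = [
        "# HELP postfix_queue_depth Number of messages in each postfix queue.",
        "# TYPE postfix_queue_depth gauge",
    ]
    # Sort for stable output so successive runs diff cleanly.
    for queue in sorted(depths):
        lines.append('postfix_queue_depth{queue="%s"} %d' % (queue, depths[queue]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only; a real script would count queued messages.
    print(render_queue_gauges({"active": 1, "deferred": 3}), end="")
```

A gauge fits here because queue depth can go up and down, whereas the mtail alternative mentioned in the comment derives counters from log lines.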

Thu, Oct 17

fgiunchedi moved T224564: Reimage wezen to Stretch or Buster (and rename to centrallog2001) from Backlog to Up next on the User-fgiunchedi board.
Thu, Oct 17, 3:10 PM · User-fgiunchedi, observability, Operations
fgiunchedi claimed T224564: Reimage wezen to Stretch or Buster (and rename to centrallog2001).
Thu, Oct 17, 10:16 AM · User-fgiunchedi, observability, Operations
fgiunchedi added a project to T224564: Reimage wezen to Stretch or Buster (and rename to centrallog2001): User-fgiunchedi.
Thu, Oct 17, 10:10 AM · User-fgiunchedi, observability, Operations
fgiunchedi renamed T224564: Reimage wezen to Stretch or Buster (and rename to centrallog2001) from Reimage wezen to Stretch (and rename to centrallog2001) to Reimage wezen to Stretch or Buster (and rename to centrallog2001).
Thu, Oct 17, 10:10 AM · User-fgiunchedi, observability, Operations

Fri, Oct 11

fgiunchedi placed T213933: PoC alert/notification functionality with Elastic Stack up for grabs.

Hi @sbassett, apologies for the delayed reply! I'm not sure if deployment-prep access is all-or-nothing for services or shell access, in the sense that access to https://logstash-beta.wmflabs.org uses one shared user/password and the credentials are stored in a file on one of the deployment-prep hosts.

Fri, Oct 11, 5:52 PM · observability, User-fgiunchedi, Patch-For-Review, Restricted Project, Security-Team, Wikimedia-Logstash

Thu, Oct 10

fgiunchedi moved T222366: Test swift object server deployment with one disk per tcp port from Backlog to Doing on the User-fgiunchedi board.
Thu, Oct 10, 8:26 PM · User-fgiunchedi, SRE-swift-storage
fgiunchedi added a comment to T234567: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus.

+1 on at least a week's worth of data

Thu, Oct 10, 8:16 PM · Traffic, observability, Operations
fgiunchedi added a comment to T234900: Setup bacula backup monitoring.

Thanks for reaching out @jcrespo, happy to help brainstorm on monitoring and which metrics make sense for this use case. We can do that either on the task or in a hangout for higher bandwidth.

Thu, Oct 10, 8:15 PM · Patch-For-Review, Availability, observability, Goal, Operations

Wed, Oct 9

fgiunchedi moved T215904: Better understanding of Logstash performance from Backlog to Up next on the User-fgiunchedi board.
Wed, Oct 9, 11:31 PM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi moved T234698: ms-be1020 - firmware upgrade: (was: host went down) from Up next to Backlog on the User-fgiunchedi board.
Wed, Oct 9, 11:31 PM · ops-eqiad, User-fgiunchedi, SRE-swift-storage, Operations
fgiunchedi moved T207292: Review prometheus_nodes params from Up next to Backlog on the User-fgiunchedi board.
Wed, Oct 9, 11:31 PM · User-fgiunchedi, observability, Operations
fgiunchedi moved T86552: Monitor and alarm on SMART attributes from Up next to Backlog on the User-fgiunchedi board.
Wed, Oct 9, 11:31 PM · Patch-For-Review, User-fgiunchedi, Operations, observability
fgiunchedi moved T187708: Monitor prometheus exporters "up" status from Up next to Backlog on the User-fgiunchedi board.
Wed, Oct 9, 11:31 PM · User-fgiunchedi, observability
fgiunchedi moved T151009: Provide authenticated access to Prometheus native web interface from Up next to Backlog on the User-fgiunchedi board.
Wed, Oct 9, 11:31 PM · observability, Patch-For-Review, User-fgiunchedi, Operations, Prometheus-metrics-monitoring
fgiunchedi moved T171482: Programmatic generation of grafana dashboards from Up next to Backlog on the User-fgiunchedi board.
Wed, Oct 9, 11:31 PM · Patch-For-Review, Graphite, User-fgiunchedi, observability, Operations
fgiunchedi moved T178690: Better organization for SRE grafana dashboards from Up next to Backlog on the User-fgiunchedi board.
Wed, Oct 9, 11:31 PM · User-CDanis, Patch-For-Review, User-fgiunchedi, observability, Operations
fgiunchedi moved T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet from Backlog to Doing on the User-fgiunchedi board.
Wed, Oct 9, 11:31 PM · User-fgiunchedi, Operations
fgiunchedi added a project to T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet: User-fgiunchedi.
Wed, Oct 9, 11:30 PM · User-fgiunchedi, Operations
fgiunchedi moved T233638: rack/setup/install ms-be205[1-6].codfw.wmnet from Radar to Doing on the User-fgiunchedi board.
Wed, Oct 9, 11:30 PM · User-fgiunchedi, Operations, SRE-swift-storage, ops-codfw
fgiunchedi moved T233638: rack/setup/install ms-be205[1-6].codfw.wmnet from Blocked to Radar on the User-fgiunchedi board.
Wed, Oct 9, 11:30 PM · User-fgiunchedi, Operations, SRE-swift-storage, ops-codfw
fgiunchedi moved T234698: ms-be1020 - firmware upgrade: (was: host went down) from Backlog to Up next on the User-fgiunchedi board.
Wed, Oct 9, 11:30 PM · ops-eqiad, User-fgiunchedi, SRE-swift-storage, Operations

Tue, Oct 8

fgiunchedi added a project to T234698: ms-be1020 - firmware upgrade: (was: host went down): User-fgiunchedi.

Indeed we'd need to upgrade its firmware as per T141756: audit / test / upgrade hp smartarray P840 firmware, holding off until we have new swift hw in place in eqiad to not "jinx it" if we possibly can

Tue, Oct 8, 6:59 PM · ops-eqiad, User-fgiunchedi, SRE-swift-storage, Operations

Mon, Oct 7

fgiunchedi added a comment to T231870: Add new Graphite instance in Grafana.

I like the idea of effectively proxying per datasource (Grafana upstream issue) as opposed to HTTP_PROXY + NO_PROXY in Grafana's environment.

Mon, Oct 7, 10:49 PM · Patch-For-Review, observability, Performance-Team
fgiunchedi updated subscribers of T234565: Standardize the logging format.

Thanks @colewhite for starting this! I'm cc'ing @Eevans as I know he's interested in a standardized logging schema too and we've chatted about it in the past as well.

Mon, Oct 7, 10:22 PM · Wikimedia-Logstash, observability, Operations
fgiunchedi added a comment to T187709: Cumin feature idea: Prometheus backend.

[agreed on the rest]

Mon, Oct 7, 10:21 PM · SRE-tools
fgiunchedi updated subscribers of T224585: Migrate labmon* to Stretch (or Buster, better yet!).

@bd808 I'm echoing what @MoritzMuehlenhoff said (thanks!) and going with Buster seems worthwhile to me. Specifically Grafana 6 is a safe upgrade AFAIK (cc @CDanis) and ditto for graphite. @Phamhi I'd be happy to help reviewing patches for Buster support!

Mon, Oct 7, 9:48 PM · cloud-services-team (Kanban), Operations
fgiunchedi added a comment to T209110: Logging for the session storage service.

@Eevans Logs from kubernetes make it to logstash now, albeit we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.

Great!

We unfortunately had to roll that back. An unrelated service on kubernetes was logging so badly that it caused a big lag in the pipeline

Are the logs searchable now?

Yes, for a definition of "searchable". A look at it would be https://logstash.wikimedia.org/goto/90ab9dc9f656b65af8328361ecf1dc0a. As you can tell, that JSON in JSON parsing is really needed.

If so, how do we need to search given this JSON-in-JSON format?

https://gerrit.wikimedia.org/r/539978 should resolve this

Mon, Oct 7, 9:28 PM · CPT Initiatives (Session Management Service (CDP2)), Patch-For-Review, User-Clarakosi, User-Eevans
fgiunchedi moved T233638: rack/setup/install ms-be205[1-6].codfw.wmnet from Backlog to Blocked on the User-fgiunchedi board.
Mon, Oct 7, 9:14 PM · User-fgiunchedi, Operations, SRE-swift-storage, ops-codfw
fgiunchedi added a project to T233638: rack/setup/install ms-be205[1-6].codfw.wmnet: User-fgiunchedi.
Mon, Oct 7, 9:14 PM · User-fgiunchedi, Operations, SRE-swift-storage, ops-codfw
fgiunchedi awarded T234567: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus a Love token.
Mon, Oct 7, 8:24 PM · Traffic, observability, Operations
fgiunchedi added a comment to T224033: Fix operations/puppet.git "rebase hell".

I'm +1 on turning on rebase if necessary and seeing how things play out; if they don't work out for some reason, it is an easy revert

Mon, Oct 7, 6:04 PM · Release-Engineering-Team (Development services), Gerrit, Release-Engineering-Team-TODO, Continuous-Integration-Config, Operations
fgiunchedi awarded T224033: Fix operations/puppet.git "rebase hell" a Mountain of Wealth token.
Mon, Oct 7, 6:03 PM · Release-Engineering-Team (Development services), Gerrit, Release-Engineering-Team-TODO, Continuous-Integration-Config, Operations

Wed, Oct 2

fgiunchedi added a comment to T227867: mw1239 memory errors .

self-healing??
<+icinga-wm> RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1

Wed, Oct 2, 11:29 PM · ops-eqiad, DC-Ops, Operations, serviceops
fgiunchedi added a comment to T187709: Cumin feature idea: Prometheus backend.

@TheAnarcat thanks indeed for taking the time to look into this!

Wed, Oct 2, 9:46 PM · SRE-tools
fgiunchedi added a comment to T234358: wmf-auto-reimage-host on HP gen10 WARNING: unable to verify that BIOS boot parameters are back to normal, got:.

Didn't realize this was normal and thought it was hp gen10-specific! Since it happens on other hosts too I wouldn't spend too much time on it, I'm ok to even resolve/decline the task

Wed, Oct 2, 9:18 PM · SRE-tools
fgiunchedi added a project to T215904: Better understanding of Logstash performance: User-fgiunchedi.
Wed, Oct 2, 9:08 PM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi created T234459: Allow variables without hiera calls as lookup() default parameters.
Wed, Oct 2, 5:44 PM · Puppet
fgiunchedi moved T199406: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts from Up next to Backlog on the User-fgiunchedi board.
Wed, Oct 2, 5:10 PM · Patch-For-Review, User-fgiunchedi, Operations
fgiunchedi changed the status of T199406: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts from Open to Stalled.

Setting as stalled for now, the immediate issue has been bandaided

Wed, Oct 2, 5:10 PM · Patch-For-Review, User-fgiunchedi, Operations
fgiunchedi raised the priority of T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet from Normal to High.

Please note that putting these systems in production is becoming urgent, is there a status update and/or ETA?

Wed, Oct 2, 3:43 PM · User-fgiunchedi, Operations

Tue, Oct 1

fgiunchedi renamed T234232: Hosts in puppet with $cluster missing from wikimedia_clusters from acme-chief hosts not in Prometheus to Hosts in puppet with $cluster missing from wikimedia_clusters.
Tue, Oct 1, 6:57 PM · Operations, observability
fgiunchedi added a comment to T162123: Running swiftrepl is not puppetized.

Change 539535 merged by Filippo Giunchedi:
[operations/puppet@production] swift: open per-port object server ports
https://gerrit.wikimedia.org/r/539535

Tue, Oct 1, 6:33 PM · User-fgiunchedi, Operations, SRE-swift-storage
fgiunchedi created T234358: wmf-auto-reimage-host on HP gen10 WARNING: unable to verify that BIOS boot parameters are back to normal, got:.
Tue, Oct 1, 6:16 PM · SRE-tools

Mon, Sep 30

fgiunchedi added a comment to T233638: rack/setup/install ms-be205[1-6].codfw.wmnet.

Sorry, I forgot to mention that on the task. It was the faster and easier way to rack those servers. We can still move 1 host from D to C, but it will take a while for me to get it into C2, if you are fine with that.
If you want to move 1 host from D to C, please change the IP address of ms-be2055 from 10.192.48.X/22 to 10.192.32.X/22 and power it down

Mon, Sep 30, 10:16 PM · User-fgiunchedi, Operations, SRE-swift-storage, ops-codfw
fgiunchedi added a comment to T215904: Better understanding of Logstash performance.

I think we should bump Kafka partitions for a subset of topics to, say, 32 or 64. This way we'll guarantee that each logstash host gets multiple partitions, and in case of overload we can add more logstash hosts and those will also receive multiple partitions.

Mon, Sep 30, 6:15 PM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi closed T221202: kafka-logging __consumer_offsets topic traffic increased as Resolved.

Resolving again as this seems to have gone away on Sept 13th, cause still unclear to me though

Mon, Sep 30, 6:03 PM · observability, Wikimedia-Logstash
fgiunchedi added a comment to T215904: Better understanding of Logstash performance.

It has been observed that during times of high Kafka traffic (i.e. when a backlog develops because Logstash can't keep up) the CPU on logstash hosts isn't maxed out, with only one thread typically using close to one core and the rest being mostly idle. That suggests to me one limiting factor at the moment might be Kafka consumer parallelism, specifically we have three partitions per topic by default and three logstash ingester hosts per site, so under normal circumstances there's one partition per host being consumed.

Mon, Sep 30, 5:58 PM · User-fgiunchedi, observability, Wikimedia-Logstash
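The partition arithmetic behind the two T215904 comments above can be sketched as follows. This is a simplified model, not Kafka's actual assignor: it only assumes that partitions within one consumer group are spread as evenly as possible and that a partition is consumed by at most one group member. The function name is hypothetical.

```python
from typing import List


def partitions_per_consumer(num_partitions: int, num_consumers: int) -> List[int]:
    """Return how many partitions each consumer gets under an even split.

    Simplified model: real Kafka assignors differ in which partitions
    go where, but the per-consumer counts stay within one of each other.
    """
    if num_consumers <= 0:
        raise ValueError("need at least one consumer")
    base, extra = divmod(num_partitions, num_consumers)
    # The first `extra` consumers take one additional partition each.
    return [base + (1 if i < extra else 0) for i in range(num_consumers)]


if __name__ == "__main__":
    for partitions, hosts in [(3, 3), (3, 4), (32, 3), (32, 4), (64, 4)]:
        print(partitions, "partitions,", hosts, "hosts ->",
              partitions_per_consumer(partitions, hosts))
```

With three partitions and three hosts each host consumes exactly one partition, and a fourth host would sit idle. With 32 or 64 partitions every host, including any added later, receives several.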
fgiunchedi updated subscribers of T234223: Single row view in Logstash hidden by Phatality.

cc @mmodell

Mon, Sep 30, 3:50 PM · Release-Engineering-Team, Wikimedia-Logstash
fgiunchedi closed T230012: Add (a subset of) graphite logs to logstash, a subtask of T63779: Add system logs to logstash (tracking), as Declined.
Mon, Sep 30, 3:50 PM · observability, Tracking-Neverending, Wikimedia-Logstash
fgiunchedi closed T230012: Add (a subset of) graphite logs to logstash as Declined.

Graphite is on its way out eventually, declining

Mon, Sep 30, 3:50 PM · observability, Tracking-Neverending, Wikimedia-Logstash
fgiunchedi closed T154732: Exception in thread "Ruby-0-Thread-18: /opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.20/lib/stud/buffer.rb:92" java.lang.UnsupportedOperationException as Declined.

I don't recall seeing this error anytime recently, boldly declining and we shall reopen

Mon, Sep 30, 3:45 PM · observability, Wikimedia-Logstash
fgiunchedi added a comment to T70820: Automatically log fatals/exceptions into Phabricator with stack traces.

@mmodell it seems to me with Phatality deployed to production we can resolve this task ?

Mon, Sep 30, 3:42 PM · observability, Wikimedia-Logstash
fgiunchedi created T234232: Hosts in puppet with $cluster missing from wikimedia_clusters.
Mon, Sep 30, 3:35 PM · Operations, observability
fgiunchedi moved T233828: Errors managed by php-wmerrors (like OOMs) lack normalized_message on logstash from Backlog to In progress on the observability board.
Mon, Sep 30, 3:19 PM · Patch-For-Review, Wikimedia-Logstash, serviceops, Operations, observability
fgiunchedi added a comment to T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet.

@fgiunchedi 3 backend systems to replace ms-be101[6-8]

Mon, Sep 30, 1:30 PM · User-fgiunchedi, Operations
fgiunchedi reassigned T233638: rack/setup/install ms-be205[1-6].codfw.wmnet from fgiunchedi to Papaul.

@fgiunchedi all yours

Mon, Sep 30, 1:28 PM · User-fgiunchedi, Operations, SRE-swift-storage, ops-codfw

Fri, Sep 27

fgiunchedi closed T228379: Improve our alerting capabilities (Q1 goal FY19-20) as Resolved.

Completed! See T228878 for subtask status

Fri, Sep 27, 3:18 PM · User-herron, User-fgiunchedi, Goal, observability
fgiunchedi updated the task description for T228379: Improve our alerting capabilities (Q1 goal FY19-20).
Fri, Sep 27, 3:18 PM · User-herron, User-fgiunchedi, Goal, observability
fgiunchedi closed T228878: Reduce Icinga alert noise, a subtask of T228379: Improve our alerting capabilities (Q1 goal FY19-20), as Resolved.
Fri, Sep 27, 3:17 PM · User-herron, User-fgiunchedi, Goal, observability
fgiunchedi closed T228878: Reduce Icinga alert noise as Resolved.

Resolving as this is complete, the ipsec alerts subtask is still open pending a firing of legacy/spammy alerts to compare to the new ones but otherwise done. systemd alerts have been stalled pending better aggregation/grouping capabilities.

Fri, Sep 27, 3:17 PM · User-fgiunchedi, Goal, observability
fgiunchedi closed T214838: ms-be1034 crash as Declined.

Will be done as part of T141756: audit / test / upgrade hp smartarray P840 firmware, resolving

Fri, Sep 27, 1:08 PM · Operations, SRE-swift-storage
fgiunchedi closed T216325: Thumbnails missing for uploaded file on shnwiki, says "Unauthorized" error as Resolved.

Thumbnails work for me on that wiki now, resolving

Fri, Sep 27, 1:05 PM · Thumbor, SRE-swift-storage, MediaWiki-Uploading, Multimedia
fgiunchedi closed T156143: High CPU usage from swift-proxy on frontend machines as Declined.

Hasn't reoccurred through multiple depool cycles, in the meantime swift has been upgraded too, declining

Fri, Sep 27, 12:55 PM · Operations, SRE-swift-storage
fgiunchedi closed T138496: bring swift eqiad to one zone per row as Resolved.

Row balancing has occurred naturally as we've cycled through hardware

Fri, Sep 27, 12:52 PM · SRE-swift-storage, Operations
fgiunchedi added a comment to T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet.

I see this task is for 6x hosts and parent T228461 is for 9x, wanted to make sure that's expected/wanted ?

Fri, Sep 27, 7:54 AM · User-fgiunchedi, Operations
fgiunchedi added a comment to T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet.

@fgiunchedi I see you said for Partitioning/Raid: "use existing ms-be setup". Unfortunately my memory is not that great anymore; can you please remind me what the existing raid setup is?

Fri, Sep 27, 7:49 AM · User-fgiunchedi, Operations
fgiunchedi updated the task description for T233289: Decommission ms-be1027.
Fri, Sep 27, 7:48 AM · decommission, Operations, ops-eqiad
fgiunchedi assigned T233289: Decommission ms-be1027 to Cmjohnson.

@Cmjohnson host is ready for decom! thanks

Fri, Sep 27, 7:45 AM · decommission, Operations, ops-eqiad
fgiunchedi renamed T233289: Decommission ms-be1027 from Unable to power on ms-be1027 to Decommission ms-be1027.
Fri, Sep 27, 7:44 AM · decommission, Operations, ops-eqiad

Thu, Sep 26

fgiunchedi added a comment to T225601: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org>.

@fgiunchedi I understand what you are saying, but I'm very new and not skilled with installations yet. Do you know how to exclude that certain package from the installation process?

Thu, Sep 26, 3:33 PM · Parsoid, Operations, MediaWiki-Releasing
fgiunchedi closed T150486: Deploy federation for Prometheus as Resolved.

This has happened in the meantime!

Thu, Sep 26, 2:31 PM · Patch-For-Review, Prometheus-metrics-monitoring, Operations
fgiunchedi updated the task description for T150486: Deploy federation for Prometheus.
Thu, Sep 26, 2:31 PM · Patch-For-Review, Prometheus-metrics-monitoring, Operations
fgiunchedi created T233956: Deploy Thanos (long-term storage) stateless components: sidecar and query.
Thu, Sep 26, 2:11 PM · User-fgiunchedi, Goal, observability
fgiunchedi added a comment to T130329: Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts.

Still ongoing from time to time (e.g. in September)

Thu, Sep 26, 1:16 PM · Discovery-Search (Current work), Patch-For-Review, Operations, Discovery, Elasticsearch
fgiunchedi added a comment to T233134: logstash-beta.wmflabs.org does not receive any mediawiki events.

My two cents: since deployment-logstash03 has been set up in T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster as the stretch replacement for 02, my suggestion would be to work on bringing up 03 and ditch 02 in the process

Thu, Sep 26, 1:10 PM · Release-Engineering-Team-TODO, observability, Wikimedia-Logstash, Beta-Cluster-Infrastructure
fgiunchedi added a comment to T233739: MediaWiki log spam during row D blip / rack D2 unavailable.

Another type of spam that we have observed, php-generated but comes through by means of syslog + apache: https://logstash.wikimedia.org/goto/c5afb82f07a0249524d66b74c82e55c4

Thu, Sep 26, 9:10 AM · MediaWiki-General, Wikimedia-Logstash, Wikimedia-Incident

Wed, Sep 25

fgiunchedi added a comment to T226986: Client side error logging production launch.

@fgiunchedi: Sorry for the belated ping but has the

Write minimal client to send errors without attempting normalization for MVP

step got an owner? I see from the minutes of the meeting on 04/09/2019 that we discussed it but mightn't have concluded anything:

  • Filippo: owner for component given it seems there will be more work on it (deduplication/aggregation/rate limiting)
Wed, Sep 25, 3:43 PM · Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic, Analytics
fgiunchedi updated subscribers of T232820: Security Concept Review For client side error logging js client.

Thanks @sbassett ! Will do, I'm cc'ing @Tgr too

Wed, Sep 25, 2:47 PM · Security-Team-Reviews
fgiunchedi added a comment to T231066: Host decommission improvements.

I tested the cookbook on ms-be1027 in T233289. The host is powered down and not coming back (faulty hw), and the cookbook stopped when trying to reach the host, whereas IMHO it should have continued (and/or prompted) with the remaining steps

Wed, Sep 25, 2:39 PM · Operations, DC-Ops, SRE-tools
fgiunchedi added a comment to T233289: Decommission ms-be1027.

Indeed the decom script failed on this host that's powered down already, the full trace is

Wed, Sep 25, 2:36 PM · decommission, Operations, ops-eqiad
fgiunchedi reassigned T221068: decom ms-be201[345] from RobH to Papaul.

This is ready for you to take over @Papaul, thanks!

Wed, Sep 25, 2:19 PM · decommission, ops-codfw, SRE-swift-storage, User-fgiunchedi, Operations
fgiunchedi updated the task description for T221068: decom ms-be201[345].
Wed, Sep 25, 2:18 PM · decommission, ops-codfw, SRE-swift-storage, User-fgiunchedi, Operations
fgiunchedi updated the task description for T221068: decom ms-be201[345].
Wed, Sep 25, 1:54 PM · decommission, ops-codfw, SRE-swift-storage, User-fgiunchedi, Operations
fgiunchedi added a comment to T230752: Deploy phatality into kibana.

Should be all deployed now, ready for another round of plugin-install

Wed, Sep 25, 1:48 PM · Release-Engineering-Team-TODO (201909), User-MModell, observability, Wikimedia-Logstash, Phabricator
phuedx awarded T226986: Client side error logging production launch a Mountain of Wealth token.
Wed, Sep 25, 9:22 AM · Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic, Analytics

Tue, Sep 24

fgiunchedi created T233739: MediaWiki log spam during row D blip / rack D2 unavailable.
Tue, Sep 24, 4:40 PM · MediaWiki-General, Wikimedia-Logstash, Wikimedia-Incident