
fgiunchedi (Filippo Giunchedi)
/* No comment */

User Details

User Since
Oct 3 2014, 8:06 AM (280 w, 2 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi

Recent Activity

Fri, Feb 14

fgiunchedi created T245280: logstash_formatter_key_conflict in mediawiki logs.
Fri, Feb 14, 4:25 PM · MW-1.35-notes (1.35.0-wmf.20; 2020-02-18), Wikimedia-production-error, MediaWiki-extensions-LoginNotify, Community-Tech, MediaWiki-Authentication-and-authorization, MediaWiki-Debug-Logger
fgiunchedi awarded T245242: Allow !log in #wikimedia-sre a Like token.
Fri, Feb 14, 3:48 PM · Stashbot
fgiunchedi created T245242: Allow !log in #wikimedia-sre.
Fri, Feb 14, 10:11 AM · Stashbot
fgiunchedi committed rLPRIc33a588c2712: hieradata: add dummy performance_arclamp key (authored by fgiunchedi).
hieradata: add dummy performance_arclamp key
Fri, Feb 14, 9:52 AM

Thu, Feb 13

fgiunchedi added a comment to T244776: Swift container for performance flame graphs (ArcLamp).

Looking at yesterday's (2020-02-11) output, it was about 8 GB of (uncompressed) logs and 14 MB of SVGs, and about 800 files total. We can control the sampling interval to regulate how big these get, so let's assume it's relatively constant. I'll have to check if there's a reason we don't compress the logs; I feel like we should, which would dramatically reduce this. (I just now tried gzip -1 on one set of logs, and they went from 4 GB to 479 MB.)

Thu, Feb 13, 9:48 AM · Performance-Team, Patch-For-Review, Arc-Lamp, SRE-swift-storage
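
As a rough check of the figures in the comment above, the sketch below applies the observed gzip -1 ratio (4 GB to 479 MB) to the 8 GB daily log volume. It assumes the ratio holds for the full set of logs and treats 1 GB as 1000 MB; neither assumption comes from the comment, and the function name is made up for illustration.

```python
def estimate_compressed_mb(total_mb: float, sample_before_mb: float, sample_after_mb: float) -> float:
    """Scale a total size by the compression ratio seen on one sample."""
    ratio = sample_after_mb / sample_before_mb
    return total_mb * ratio


# 8 GB of daily logs, ratio taken from the 4 GB -> 479 MB gzip -1 test.
daily_mb = estimate_compressed_mb(8000, 4000, 479)
print(f"estimated compressed daily logs: {daily_mb:.0f} MB")
```

Under those assumptions the daily logs would come to a little under 1 GB compressed.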

Tue, Feb 11

fgiunchedi added a comment to T244776: Swift container for performance flame graphs (ArcLamp).

Great to see this work! Re: authentication and permissions, it is indeed as @aaron outlined: we'd be creating a user that can create containers and upload files at will.

Tue, Feb 11, 4:54 PM · Performance-Team, Patch-For-Review, Arc-Lamp, SRE-swift-storage
fgiunchedi added a comment to T242250: rack/setup/install ps[12]-60[34]-eqsin.

@RobH please let me know once the PDUs should be snmp-accessible, they'll need to be added to puppet/monitoring

Tue, Feb 11, 9:48 AM · Operations, ops-eqsin
fgiunchedi added a comment to T244761: Script to point SRE local machine traffic to another LB.

+1 to /etc/hosts, I've done similar in the past and it has worked as expected. As a side note, the script could even take the form of a puppet manifest that we can then puppet apply locally.

Tue, Feb 11, 9:41 AM · Operations
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Tue, Feb 11, 9:22 AM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi closed T242585: Move cassandra logging to logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Tue, Feb 11, 9:21 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi closed T242585: Move cassandra logging to logging pipeline as Resolved.

This is complete! All cassandra production clusters now log through the logging pipeline.

Tue, Feb 11, 9:21 AM · Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Operations

Mon, Feb 10

fgiunchedi moved T240667: Ingestion errors for production logs on ELK7 from Backlog to Doing on the User-fgiunchedi board.
Mon, Feb 10, 10:25 AM · User-fgiunchedi, Operations, Wikimedia-Logstash

Fri, Feb 7

fgiunchedi added a comment to T244357: Provision grafana VM in codfw.

added vm-requests tag and pasted vm-request form. please add the missing data above.

Fri, Feb 7, 9:32 AM · serviceops, vm-requests, observability, Operations
fgiunchedi updated the task description for T244357: Provision grafana VM in codfw.
Fri, Feb 7, 9:31 AM · serviceops, vm-requests, observability, Operations

Thu, Feb 6

fgiunchedi added a comment to T225125: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline.

Status update: out of the box json logging support has been introduced in elasticsearch 7 (https://github.com/elastic/elasticsearch/issues/8786). Whereas for previous versions we'd need to bring in jackson-databind, which comes with its own set of challenges (e.g. https://github.com/elastic/elasticsearch/issues/22103). Thus I'm of the opinion that waiting for the elasticsearch 7 upgrade on cirrus/relforge/cloudelastic will be easier.

Thu, Feb 6, 2:23 PM · Patch-For-Review, Discovery-Search (Current work), observability, Elasticsearch, Operations, Wikimedia-Logstash

Wed, Feb 5

fgiunchedi renamed T244208: Upgrade Grafana to 6.6 from Upgrade Grafana to 6.4 to Upgrade Grafana to 6.6.
Wed, Feb 5, 5:02 PM · Operations, observability
fgiunchedi created T244357: Provision grafana VM in codfw.
Wed, Feb 5, 2:02 PM · serviceops, vm-requests, observability, Operations

Tue, Feb 4

fgiunchedi created T244208: Upgrade Grafana to 6.6.
Tue, Feb 4, 9:23 AM · Operations, observability
fgiunchedi moved T242585: Move cassandra logging to logging pipeline from Backlog to Doing on the User-fgiunchedi board.
Tue, Feb 4, 9:18 AM · Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a comment to T227108: Port varnishlog consumers to log to syslog / logging infra.

Had to revert in https://gerrit.wikimedia.org/r/c/operations/puppet/+/569529, at least two issues found:

  1. journald < buster has a maximum line length of 2k, thus long lines get broken into multiple lines, in turn breaking json parsing.
Tue, Feb 4, 9:01 AM · Patch-For-Review, Traffic, observability, Wikimedia-Logstash, User-fgiunchedi, Operations
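
The first issue in the comment above can be illustrated with a small simulation. This is only a sketch: it splits a string at a fixed 2048-character limit (an assumption based on the "2k" figure) and does not reproduce journald's actual behaviour; the helper names are hypothetical.

```python
import json

MAX_LINE = 2048  # assumed limit, from the "2k" figure in the comment


def split_like_journald(line: str, limit: int = MAX_LINE) -> list[str]:
    """Break one long line into consecutive chunks of at most `limit` characters."""
    return [line[i:i + limit] for i in range(0, len(line), limit)]


def parses_as_json(chunk: str) -> bool:
    """Return True if the chunk is a complete JSON document on its own."""
    try:
        json.loads(chunk)
        return True
    except json.JSONDecodeError:
        return False


# A single JSON log record that is longer than the limit.
record = json.dumps({"msg": "x" * 5000})
chunks = split_like_journald(record)
print(len(chunks), [parses_as_json(c) for c in chunks])
```

Each chunk is only a fragment of the original record, so a consumer that parses line by line rejects all of them.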

Mon, Feb 3

fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Mon, Feb 3, 3:34 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi added a comment to T227108: Port varnishlog consumers to log to syslog / logging infra.

Had to revert in https://gerrit.wikimedia.org/r/c/operations/puppet/+/569529, at least two issues found:

Mon, Feb 3, 10:53 AM · Patch-For-Review, Traffic, observability, Wikimedia-Logstash, User-fgiunchedi, Operations

Mon, Jan 27

fgiunchedi closed T242511: Degraded RAID on ms-be1039 as Resolved.

@godog I replaced the disk, please see what you need to do to add it back to the raid. Thanks!

Mon, Jan 27, 11:13 PM · SRE-swift-storage, ops-eqiad, Operations

Tue, Jan 21

fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Tue, Jan 21, 11:40 AM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi closed T242609: Move thumbor to the logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Tue, Jan 21, 11:39 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi closed T242609: Move thumbor to the logging pipeline as Resolved.

Thumbor is now logging to localhost:11514; from there logs are shipped to the logging pipeline. Resolving!

Tue, Jan 21, 11:39 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations

Mon, Jan 20

fgiunchedi added a comment to T242826: Increase retention time of graphite stats for CI.

No idea right off the bat, @greg, although I see the Gate and Submit "Resident Time" panel at the top left is working as expected, so I take it this is possible in some fashion.

Mon, Jan 20, 9:34 AM · User-greg, Release-Engineering-Team (CI & Testing services), Graphite, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3))

Jan 17 2020

fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Jan 17 2020, 2:00 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Jan 17 2020, 1:54 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Jan 17 2020, 1:46 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Jan 17 2020, 1:40 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Jan 17 2020, 1:33 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi created T243065: Provision plaintext syslog collectors in esams/ulsfo/eqsin.
Jan 17 2020, 11:03 AM · netops, observability, Operations
fgiunchedi created T243057: Move Prometheus off eqsin/ulsfo/esams bastions.
Jan 17 2020, 10:07 AM · Operations, observability
fgiunchedi added a comment to T125408: Regularly & Automatically backup WMDE metrics stored in graphite.

@fgiunchedi Any idea if there is any sort of regular / scheduled backups of the disks for graphite nodes?
If not I'll try to put something in place for the few metrics we would like to not lose for now :)

Jan 17 2020, 9:26 AM · Graphite, User-Addshore, Operations, WMDE-Analytics-Engineering
fgiunchedi closed T241790: (No Need By Date Provided) rack/setup/install restbase202[123] as Resolved.

All done, service is being implemented in T243000

Jan 17 2020, 9:17 AM · Core Platform Team Workboards (Clinic Duty Team), ops-codfw, Operations
fgiunchedi updated the task description for T241790: (No Need By Date Provided) rack/setup/install restbase202[123].
Jan 17 2020, 9:16 AM · Core Platform Team Workboards (Clinic Duty Team), ops-codfw, Operations
fgiunchedi reassigned T241790: (No Need By Date Provided) rack/setup/install restbase202[123] from fgiunchedi to Eevans.
Jan 17 2020, 9:15 AM · Core Platform Team Workboards (Clinic Duty Team), ops-codfw, Operations

Jan 15 2020

fgiunchedi reassigned T240882: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet from fgiunchedi to herron.
Jan 15 2020, 2:21 PM · Patch-For-Review, Operations, ops-codfw, Wikimedia-Logstash
fgiunchedi added a comment to T240882: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet.

Thank you @Papaul and @Dzahn !

Jan 15 2020, 1:51 PM · Patch-For-Review, Operations, ops-codfw, Wikimedia-Logstash
fgiunchedi added a comment to T242826: Increase retention time of graphite stats for CI.

In other panels I see data going back to Aug 2017 for core's gate-and-submit, e.g. https://grafana.wikimedia.org/d/000000108/releng-kpis?orgId=1&from=1496046004105&to=1579080951474&fullscreen&edit&panelId=2 so maybe it has to do with the panel's query?

Jan 15 2020, 9:38 AM · User-greg, Release-Engineering-Team (CI & Testing services), Graphite, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3))

Jan 14 2020

fgiunchedi added a comment to T235891: Ingest production logs with ELK7.

Something else I realized today: with ELK7 we dropped our custom logstash template in favor of logstash's default, although we'll need to bump the default number of fields per index (1000 -> 4096) at least

Jan 14 2020, 5:04 PM · User-fgiunchedi, Patch-For-Review, Operations, Wikimedia-Logstash
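
For reference, the per-index field limit mentioned in the comment above is controlled in Elasticsearch by the `index.mapping.total_fields.limit` setting, whose default is 1000. The fragment below is a hypothetical template override, not the configuration actually deployed: the `logstash-*` index pattern is an assumption, and only the 4096 value comes from the comment.

```python
import json

# Hypothetical template override; only the setting name and the 4096 value
# are grounded (the value comes from the comment above).
template_override = {
    "index_patterns": ["logstash-*"],  # assumed index naming
    "settings": {
        # Elasticsearch's default for this setting is 1000.
        "index.mapping.total_fields.limit": 4096,
    },
}

print(json.dumps(template_override, indent=2))
```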
jijiki awarded T242609: Move thumbor to the logging pipeline a Love token.
Jan 14 2020, 10:33 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a comment to T232820: Security Concept Review For client side error logging js client.

AFAICT the code is basically ready to be merged; when does security review need to happen?

@fgiunchedi - a few points:

  1. Given that the patch set is a config file change and some fairly minimal JavaScript for error-tracking, this probably wouldn't warrant a full security readiness review. Those are typically performed for new (or substantially rewritten) MediaWiki extensions, services, etc. I know our documentation isn't very clear on various review thresholds - hopefully we can improve that at some point.
  2. I'd imagine the Performance-Team might be interested in having a look at this if they haven't already.
  3. Prior to any patch review, we'd probably require that all of the relevant CI checks are passing.
  4. For the patch review itself, @Reedy or I would most likely focus on ensuring that the regexWebkit and regexGecko patterns are sufficiently hardened and on whether any potential attack vectors against EventGate are being introduced.
Jan 14 2020, 10:22 AM · Security Concept Review, Security-Team

Jan 13 2020

fgiunchedi added a subtask for T227080: Deprecate all non-Kafka logstash inputs: T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline.
Jan 13 2020, 2:30 PM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a parent task for T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline: T227080: Deprecate all non-Kafka logstash inputs.
Jan 13 2020, 2:30 PM · observability, Operations, Wikimedia-Logstash
fgiunchedi created T242609: Move thumbor to the logging pipeline.
Jan 13 2020, 2:18 PM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a comment to T212946: Stream Thumbor logs to logstash.

Reviving this as part of this Q's OKRs to move services off logstash non-kafka inputs. I'll follow up with patches to move to the localhost-udp compatibility endpoint! (See T242609)

Jan 13 2020, 2:14 PM · observability, Wikimedia-Logstash, User-jijiki, serviceops, Operations, Thumbor
fgiunchedi added a comment to T242579: Setup netconsole on upload@esams hosts.

In case it is helpful: we can reuse the centrallog hosts in codfw/eqiad. For site-local netconsole instead we'll need to setup local syslog collectors anyways (on ganeti VMs) for network devices syslog, so we could piggyback on those.

Jan 13 2020, 12:03 PM · Traffic, Operations
fgiunchedi created T242585: Move cassandra logging to logging pipeline.
Jan 13 2020, 10:53 AM · Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a comment to T232820: Security Concept Review For client side error logging js client.

@fgiunchedi - any update or progress on this? If not, we might just want to decline this task for now until there's actually some code ready for review. Thanks.

Yes, progress in the form of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/553376 which is the client-side JavaScript part. The code is being reviewed and worked on, although I don't think it is yet at the security-review stage. Having said that, feel free to close this task and we'll reopen/follow up when needed.

Jan 13 2020, 10:44 AM · Security Concept Review, Security-Team
fgiunchedi moved T178839: New upstream jvm-tools from Backlog to Radar on the User-fgiunchedi board.
Jan 13 2020, 10:08 AM · Core Platform Team, User-Eevans, User-fgiunchedi, Operations
fgiunchedi moved T227080: Deprecate all non-Kafka logstash inputs from Backlog to Doing on the User-fgiunchedi board.
Jan 13 2020, 10:08 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi moved T227108: Port varnishlog consumers to log to syslog / logging infra from Backlog to Doing on the User-fgiunchedi board.
Jan 13 2020, 10:08 AM · Patch-For-Review, Traffic, observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi closed T242471: Degraded RAID on ms-be1035 as Resolved.

Looks like the controller freaked out (T141756), firmware upgraded and rebooted.

Jan 13 2020, 9:27 AM · SRE-swift-storage, ops-eqiad, Operations
fgiunchedi assigned T242511: Degraded RAID on ms-be1039 to Jclark-ctr.

@Jclark-ctr @Cmjohnson host is under warranty for another month according to netbox, please order a replacement for the failed 4TB disk (led is blinking), thanks!

Jan 13 2020, 9:09 AM · SRE-swift-storage, ops-eqiad, Operations

Jan 9 2020

fgiunchedi closed T241534: Degraded RAID on ms-be2035 as Resolved.

Thanks @Papaul! Upon reboot the host booted into PXE, I am assuming because the first disk was present but unbootable and the host didn't fall back to booting from the second disk. Anyway, all good after a reimage, resolving.

Jan 9 2020, 12:55 PM · SRE-swift-storage, Operations, ops-codfw

Jan 7 2020

kostajh awarded Blog Post: The journey to Prometheus 2 a Yellow Medal token.
Jan 7 2020, 12:23 PM
fgiunchedi added a comment to Blog Post: The journey to Prometheus 2.
In J184#2651, @mmodell wrote:

Thank you @fgiunchedi for sharing this story. Congratulations for working through some difficult bugs to roll out this important upgrade.

Jan 7 2020, 10:51 AM
fgiunchedi added a comment to T240520: Produce dumps of commons thumbnail URLs.

How long does it take to list one of these swift containers, say the one for en wiki thumbs, which is probably among the largest?

This seems to get urls from swift at about 20k/sec, for the 1.3B commons thumbs that works out to about 18 hours. I didn't check enwiki, assuming commons would be an order of magnitude more than the others, but could look into it. If we want things to take less time that could be parallelized over the list of containers to dump (255), probably we could do 4 at a time or some such.

The script as written will also produce a listing for commonswiki, do we want that? How long would those containers take to list?

Commonswiki was the primary goal; as noted above, it takes around 18 hours. Compressed, the output is around 7 GB.

How does swift hold up under this? Can we run one process doing commons and another/others doing the rest, or is that going to be a noticeable load on the servers? I'm going to add @fgiunchedi for comments on this.

Jan 7 2020, 10:50 AM · Patch-For-Review, Dumps-Generation, Internet-Archive, Datasets-Archiving
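
The 18-hour figure quoted in the comment above follows directly from the listing rate and the object count. The sketch below reproduces that arithmetic; the four-worker case is an idealized lower bound that assumes the work splits evenly and the rate scales linearly, which the thread does not claim. The function name is hypothetical.

```python
def listing_hours(objects: float, rate_per_sec: float, workers: int = 1) -> float:
    """Hours needed to list `objects` entries at `rate_per_sec` per worker."""
    seconds = objects / (rate_per_sec * workers)
    return seconds / 3600


# Figures from the comment: 1.3B commons thumbs at about 20k URLs/sec.
serial = listing_hours(1.3e9, 20_000)
# Idealized: four containers listed at once with perfect scaling.
parallel = listing_hours(1.3e9, 20_000, workers=4)
print(f"serial: {serial:.1f} h, 4 workers (idealized): {parallel:.1f} h")
```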

Jan 4 2020

mmodell awarded Blog Post: The journey to Prometheus 2 a Love token.
Jan 4 2020, 8:06 PM

Jan 2 2020

fgiunchedi changed the header image for post Blog Post: The journey to Prometheus 2.
Jan 2 2020, 12:53 PM
fgiunchedi changed the header image for post Blog Post: The journey to Prometheus 2.
Jan 2 2020, 12:52 PM
fgiunchedi changed the header image for post Blog Post: The journey to Prometheus 2.
Jan 2 2020, 12:51 PM
fgiunchedi updated the post content for Blog Post: The journey to Prometheus 2.
Jan 2 2020, 12:46 PM
fgiunchedi updated the post content for Blog Post: The journey to Prometheus 2.
Jan 2 2020, 12:41 PM
fgiunchedi removed a project from T234698: ms-be1020 - firmware upgrade: (was: host went down): User-fgiunchedi.
Jan 2 2020, 11:18 AM · ops-eqiad, SRE-swift-storage, Operations
fgiunchedi closed T239805: ms-fe2007 NIC failure as Resolved.

Host is back in service!

Jan 2 2020, 11:10 AM · User-fgiunchedi, Operations, ops-codfw
fgiunchedi added a comment to T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory.

I checked all "memory free" metrics as reported by node-exporter for the varnish case and indeed the numbers match, i.e. the kernel was reporting multiple GBs of memory free at the time of the crashes:

Jan 2 2020, 10:58 AM · observability, Operations, Traffic
fgiunchedi added a project to T239805: ms-fe2007 NIC failure: User-fgiunchedi.
Jan 2 2020, 10:38 AM · User-fgiunchedi, Operations, ops-codfw
fgiunchedi lowered the priority of T226373: Swift object servers become briefly unresponsive on a regular basis from High to Medium.

What is the right followup after a month? "I don't know" is an ok answer, I just want to clarify the status of the ticket, e.g. do ATS people need to be involved? Or just "being handled, just needs time"?

Jan 2 2020, 10:35 AM · Performance-Team (Radar), User-jijiki, serviceops, Patch-For-Review, SRE-swift-storage, Operations
fgiunchedi merged T241714: Degraded RAID on ms-be2035 into T241534: Degraded RAID on ms-be2035.
Jan 2 2020, 10:09 AM · SRE-swift-storage, Operations, ops-codfw
fgiunchedi merged task T241714: Degraded RAID on ms-be2035 into T241534: Degraded RAID on ms-be2035.
Jan 2 2020, 10:09 AM · Operations, ops-codfw
fgiunchedi added a comment to T241534: Degraded RAID on ms-be2035.

@Papaul host is in warranty and looks like an SSD failed, could we get that replaced (led is blinking), thanks!

Jan 2 2020, 10:08 AM · SRE-swift-storage, Operations, ops-codfw
fgiunchedi merged task T241535: Degraded RAID on ms-be2035 into T241534: Degraded RAID on ms-be2035.
Jan 2 2020, 10:07 AM · SRE-swift-storage, Operations, ops-codfw
fgiunchedi merged T241535: Degraded RAID on ms-be2035 into T241534: Degraded RAID on ms-be2035.
Jan 2 2020, 10:07 AM · SRE-swift-storage, Operations, ops-codfw
fgiunchedi added a comment to T239805: ms-fe2007 NIC failure.

Configured both ports to use PXE when booting, now the host is running the reimage correctly:

Jan 2 2020, 9:46 AM · User-fgiunchedi, Operations, ops-codfw
fgiunchedi added a comment to T239805: ms-fe2007 NIC failure.

@Papaul thanks! I tried to reimage the host but the NIC shows its link as offline from bios:

Link Status                                           <Disconnected>         *
Jan 2 2020, 9:36 AM · User-fgiunchedi, Operations, ops-codfw
fgiunchedi added a comment to T239805: ms-fe2007 NIC failure.

@Papaul thanks! I tried to reimage the host but the NIC shows its link as offline from bios:

Jan 2 2020, 9:32 AM · User-fgiunchedi, Operations, ops-codfw
fgiunchedi added a comment to T211661: Automatically clean up unused thumbnails in Swift.

Analytics now publishes media access stats, might be useful to drive some/all thumbnail cleanup: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Mediarequests

Jan 2 2020, 9:09 AM · User-jijiki, Patch-For-Review, Traffic, SRE-swift-storage, Performance-Team, Operations

Dec 20 2019

fgiunchedi awarded T226986: Client side error logging production launch a Love token.
Dec 20 2019, 8:55 AM · Performance-Team (Radar), Desktop Improvements, Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic, Analytics

Dec 19 2019

ema awarded Blog Post: The journey to Prometheus 2 a Burninate token.
Dec 19 2019, 3:26 PM
fgiunchedi published Blog Post: The journey to Prometheus 2.
Dec 19 2019, 2:18 PM

Dec 18 2019

fgiunchedi added a comment to T240685: MediaWiki Prometheus support.

My two cents re: prometheus_client_php current adapters, apcu and redis, since AIUI none are optimal/desirable. There might be another way inspired by what the ruby client does (see also this PromCon 2019 talk and video). Very broadly, the idea IIRC is for each process to mmap a separate file with its own metrics, then at collection time metrics are read from the files and merged.

Dec 18 2019, 9:55 AM · Operations, MediaWiki-General, observability
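
The idea described in the comment above (one metrics file per process, merged at collection time) can be sketched roughly as follows. This is an illustration only: it uses plain JSON files instead of mmap for brevity, handles only counters, and every name in it is hypothetical rather than taken from prometheus_client_php or the ruby client.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def write_process_metrics(directory: Path, pid: int, counters: dict[str, int]) -> None:
    """Each worker process writes its own file, so no cross-process locking is needed."""
    (directory / f"metrics-{pid}.json").write_text(json.dumps(counters))


def collect(directory: Path) -> dict[str, int]:
    """At scrape time, read every per-process file and sum the counters."""
    merged: Counter = Counter()
    for path in sorted(directory.glob("metrics-*.json")):
        merged.update(json.loads(path.read_text()))
    return dict(merged)


with tempfile.TemporaryDirectory() as tmp:
    metrics_dir = Path(tmp)
    # Two simulated worker processes with made-up PIDs and values.
    write_process_metrics(metrics_dir, 101, {"requests_total": 3, "errors_total": 1})
    write_process_metrics(metrics_dir, 102, {"requests_total": 5})
    merged = collect(metrics_dir)

print(merged)
```

Summing works for counters; gauges and histograms would need their own merge rules, which this sketch leaves out.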

Dec 17 2019

fgiunchedi added a comment to T240789: Return traffic to eqiad WMCS triggering FNM.

FWIW I think if the current thresholds are good at detecting DDoS we should explicitly whitelist WMCS ranges with say 1.5x the current thresholds and see how far that gets us.

Dec 17 2019, 9:48 AM · Patch-For-Review, cloud-services-team (Kanban), Operations, netops

Dec 16 2019

fgiunchedi added a comment to T232820: Security Concept Review For client side error logging js client.

@fgiunchedi - any update or progress on this? If not, we might just want to decline this task for now until there's actually some code ready for review. Thanks.

Dec 16 2019, 3:17 PM · Security Concept Review, Security-Team
fgiunchedi updated the task description for T236075: Evaluate, suggest and choose an alert escalation solution.
Dec 16 2019, 3:08 PM · User-fgiunchedi, observability
fgiunchedi moved T231086: Picture from Commons not found from Singapore from Doing to Radar on the User-fgiunchedi board.
Dec 16 2019, 3:07 PM · Performance-Team (Radar), User-fgiunchedi, Structured-Data-Backlog, Structured Data Engineering, Multimedia, MW-1.34-notes (1.34.0-wmf.21; 2019-09-03), Patch-For-Review, MediaWiki-File-management, Commons, SRE-swift-storage, Traffic, Operations
fgiunchedi moved T156955: Standardizing our partman recipes from Backlog to Doing on the User-fgiunchedi board.
Dec 16 2019, 3:07 PM · Patch-For-Review, User-fgiunchedi, Operations
fgiunchedi edited projects for T156955: Standardizing our partman recipes, added: User-fgiunchedi; removed Patch-For-Review.
Dec 16 2019, 3:04 PM · Patch-For-Review, User-fgiunchedi, Operations
fgiunchedi closed T240798: Degraded RAID on ms-be2016 as Resolved.

hw raid firmware upgraded, resolving

Dec 16 2019, 9:25 AM · Operations, ops-codfw

Dec 13 2019

fgiunchedi updated the description for Observing the observable.
Dec 13 2019, 3:24 PM
fgiunchedi changed the profile image for blog Observing the observable.
Dec 13 2019, 3:23 PM
fgiunchedi changed the profile image for blog Observing the observable.
Dec 13 2019, 3:23 PM
fgiunchedi created Observing the observable.
Dec 13 2019, 3:11 PM
fgiunchedi updated the task description for T240667: Ingestion errors for production logs on ELK7.
Dec 13 2019, 2:14 PM · User-fgiunchedi, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T240667: Ingestion errors for production logs on ELK7.
Dec 13 2019, 2:11 PM · User-fgiunchedi, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T240667: Ingestion errors for production logs on ELK7.
Dec 13 2019, 2:09 PM · User-fgiunchedi, Operations, Wikimedia-Logstash
fgiunchedi renamed T239090: Restbase logging indexing conflict on 'res' and 'body' logging fields from Restbase logging indexing conflict to Restbase logging indexing conflict on 'res' and 'body' logging fields.
Dec 13 2019, 2:07 PM · Core Platform Team Workboards (Clinic Duty Team), User-fgiunchedi, Wikimedia-Logstash, RESTBase
fgiunchedi added a comment to T239090: Restbase logging indexing conflict on 'res' and 'body' logging fields.

After getting a little more perspective in T240667 it seems that indeed res and body are sometimes sent as strings and sometimes as nested objects.

Dec 13 2019, 2:05 PM · Core Platform Team Workboards (Clinic Duty Team), User-fgiunchedi, Wikimedia-Logstash, RESTBase
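
To make the conflict described above concrete: an index that has mapped a field as an object cannot also accept a bare string for it, and vice versa. One possible way to avoid this, shown here purely as a hypothetical sketch and not as the fix adopted in the task, is to wrap string values so the field always has the same shape. The `message` sub-key is an invented name.

```python
from typing import Any

CONFLICTING_FIELDS = ("res", "body")  # field names from the task title


def normalize(event: dict[str, Any]) -> dict[str, Any]:
    """Return a copy in which res/body are always objects, never bare strings."""
    fixed = dict(event)
    for field in CONFLICTING_FIELDS:
        value = fixed.get(field)
        if isinstance(value, str):
            # "message" is a hypothetical sub-key chosen for this sketch.
            fixed[field] = {"message": value}
    return fixed


print(normalize({"res": "timeout", "body": {"status": 500}}))
```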