fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (18)

User Details

User Since
Oct 3 2014, 8:06 AM (201 w, 6 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi [ Global Accounts ]

Recent Activity

Tue, Aug 14

fgiunchedi renamed T201971: Shorten logstash retention temporarily from Shorten logstash retention to Shorten logstash retention temporarily.
Tue, Aug 14, 11:38 PM · Wikimedia-Logstash, User-fgiunchedi, User-herron, Goal, Operations
fgiunchedi added a comment to T201971: Shorten logstash retention temporarily.

We can't easily delete inside indices, no. Dropping old indices is cheap compared to actually looking inside them and deleting only specific data. I'll clarify in the task description that this is a temporary band-aid until we get more logstash hardware.

Tue, Aug 14, 11:36 PM · Wikimedia-Logstash, User-fgiunchedi, User-herron, Goal, Operations
fgiunchedi triaged T201971: Shorten logstash retention temporarily as Normal priority.
Tue, Aug 14, 11:04 PM · Wikimedia-Logstash, User-fgiunchedi, User-herron, Goal, Operations
fgiunchedi added a comment to T200362: Send logstash service metrics to prometheus.

Change 451018 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: add jmx_exporter

https://gerrit.wikimedia.org/r/451018

Tue, Aug 14, 7:51 PM · Patch-For-Review, User-herron, monitoring, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T155705: confctl: log to SAL even if the selection doesn't match any host.

Also, nothing is logged on stdout for a non-existent host, and conftool exits 0. Ditto for a non-existent service.

Tue, Aug 14, 7:41 PM · Operations, Operations-Software-Development
fgiunchedi added a comment to T201858: Investigate having thumbor push thumbnails to both data centers, and not just eqiad.

Not having pre-generated thumbnails at switchover time will have a significant impact, in the sense that thumbor in codfw can get overloaded with requests for thumbnails missing from both varnish and swift. Additionally, it'd mean we'd have to keep relying on swiftrepl as more than the safety net it was originally designed to be. Hope that helps!

Tue, Aug 14, 5:02 PM · Thumbor, Performance-Team
fgiunchedi created T201938: Request for swift ms-be refresh.
Tue, Aug 14, 4:41 PM · hardware-requests, Operations
fgiunchedi created T201937: Request for swift ms-be expansion.
Tue, Aug 14, 4:35 PM · hardware-requests, Operations

Mon, Aug 13

fgiunchedi closed T201863: RESTBase dev environment (Cassandra) SSL certificates expired as Resolved.

Nice, thanks!

Mon, Aug 13, 11:37 PM · Services (watching), Operations
fgiunchedi added a comment to T201863: RESTBase dev environment (Cassandra) SSL certificates expired.

I've renewed the certs on restbase-dev* and ran puppet. Next up is cassandra roll restart to pick up the certs.

Mon, Aug 13, 10:21 PM · Services (watching), Operations
fgiunchedi added a comment to T201694: Move servers off asw2-a-eqiad.

re: ms-be1040 it can be moved back to the old switch any time

Mon, Aug 13, 5:33 PM · Patch-For-Review, Operations, netops

Wed, Aug 8

fgiunchedi added a comment to T200960: Logstash packet loss.

Took thread dumps from 1008 and 1007: https://phabricator.wikimedia.org/P7437 and https://phabricator.wikimedia.org/P7438

Wed, Aug 8, 3:49 PM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a comment to T200960: Logstash packet loss.

I've added java threads and heap bytes to the dashboard, looks like there's a thread leak on 2 out of 3 hosts (unclear though if that's involved in packet loss)

Wed, Aug 8, 3:47 PM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a comment to T201473: prometheus-pdns-exporter for Jessie?.

For both questions the answer is "yes", it is something we wrote and should be trivial to backport to jessie/stretch. I believe a simple rebuild for stretch/jessie will do that.

Wed, Aug 8, 1:18 PM · Patch-For-Review, monitoring, cloud-services-team
fgiunchedi renamed T200960: Logstash packet loss from Logstash has ~90% packet loss since June 29 to Logstash packet loss.
Wed, Aug 8, 11:02 AM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi reopened T200960: Logstash packet loss as "Open".

Reopening since I just saw a brief 40 packets/s loss on logstash1008

Wed, Aug 8, 11:02 AM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a comment to T201344: rack/setup/install icinga1001.wikimedia.org.

[looking at the bikeshed] to me monitor is a bit too generic, also this is likely to be a single-use box (i.e. only icinga) so icinga1001 would work better IMO

Wed, Aug 8, 9:22 AM · ops-eqiad, monitoring, Operations
fgiunchedi closed T200960: Logstash packet loss as Resolved.
Wed, Aug 8, 9:17 AM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi triaged T200960: Logstash packet loss as Normal priority.

I am not seeing packet loss anymore after moving to persisted queues, so I'm resolving this, though feel free to reopen. There is still the issue of slow pipelines of course, into which we'll hopefully get more insight once logstash metrics are exported into prometheus.

Wed, Aug 8, 9:16 AM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi reassigned T199967: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group from RStallman-legalteam to Lea_WMDE.

@Lea_WMDE what's your LDAP username to be added to wmde group?

Wed, Aug 8, 9:15 AM · Patch-For-Review, Operations, LDAP-Access-Requests
fgiunchedi added a comment to T201468: LDAP access for HWalls and GVarnum.

I see hwalls is already in wmf ldap group:

Wed, Aug 8, 8:53 AM · Analytics, LDAP-Access-Requests
fgiunchedi created T201490: ldap tool 500s on group wmf.
Wed, Aug 8, 8:43 AM · Toolforge, Tools
fgiunchedi added a comment to T201196: analytics-privatedata-users access for Dario Rossi (username drossi).

To clarify, the procedure to request a developer account is here: https://www.mediawiki.org/wiki/Developer_account

Wed, Aug 8, 7:54 AM · SRE-Access-Requests, Operations
fgiunchedi added a comment to T201196: analytics-privatedata-users access for Dario Rossi (username drossi).

Hi @Rossi.dario.g,
this request in particular is to create your user on the WMF cluster. Wikitech user creation is self-service, so please create an account there too and let us know your username. The wikitech and wikipedia accounts AFAIK are not related though.

Wed, Aug 8, 7:51 AM · SRE-Access-Requests, Operations
fgiunchedi triaged T201470: Add contint-roots to releases{1,2}001 as Normal priority.
Wed, Aug 8, 7:44 AM · Release-Engineering-Team (Watching / External), SRE-Access-Requests, Operations
fgiunchedi added a comment to T201470: Add contint-roots to releases{1,2}001.

Since this request is expanding root scope to other boxes I believe it'll need to be put up at the next SRE meeting on Monday

Wed, Aug 8, 7:44 AM · Release-Engineering-Team (Watching / External), SRE-Access-Requests, Operations
fgiunchedi moved T199198: Some swift filesystems reporting negative disk usage from Doing to Blocked on the User-fgiunchedi board.
Wed, Aug 8, 7:41 AM · User-fgiunchedi, media-storage, Operations
fgiunchedi moved T86552: Monitor and alarm on SMART attributes from Doing to Up next on the User-fgiunchedi board.
Wed, Aug 8, 7:41 AM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
fgiunchedi closed T201454: update ssh keys for amire80 - August 2018 as Resolved.

Public key swapped

Wed, Aug 8, 7:16 AM · SRE-Access-Requests, Patch-For-Review, Operations

Tue, Aug 7

fgiunchedi added a comment to T201133: db1069 (x1 master) memory errors.

Indeed, it can happen: the alert counts errors over four days, so if no new errors come in the alert will recover.
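
A minimal sketch of that recovery behaviour, assuming the alert fires whenever at least one error falls inside a trailing four-day window. The real check's threshold and implementation are not shown in this comment; the function name and timestamps below are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the alert looks back over a fixed four-day window.
WINDOW = timedelta(days=4)


def alert_firing(error_times, now, window=WINDOW):
    """Return True if any error timestamp lies within `window` before `now`."""
    return any(now - window < t <= now for t in error_times)


# Hypothetical data: a single memory error logged on Aug 1.
errors = [datetime(2018, 8, 1, 12, 0)]

# Two days later the error is still inside the window, so the alert fires.
firing_after_two_days = alert_firing(errors, datetime(2018, 8, 3, 12, 0))

# Five days later the error has aged out, so the alert recovers by itself.
firing_after_five_days = alert_firing(errors, datetime(2018, 8, 6, 12, 0))
```

With no new errors, the only thing that changes between the two evaluations is the position of the window, which is why the alert can clear without any hardware fix.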

Tue, Aug 7, 2:07 PM · ops-eqiad, Operations, DBA
fgiunchedi added a comment to T200960: Logstash packet loss.

I've enabled disk-persisted queues in logstash. It's too early to tell, but it looks like that "fixed" (papered over) the issue, so slow pipelines and outputs don't affect inputs anymore. I'll remove the daily restart later today if things look in order.

Tue, Aug 7, 1:28 PM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a comment to T201411: Fatal error on Special:Log caused by date input (TimestampException error).

This indeed persists and triggers "mediawiki exceptions" alerts due to high volume of attempts, eg.

Tue, Aug 7, 1:27 PM · Security, MediaWiki-Logging, Wikimedia-log-errors
fgiunchedi triaged T201355: bast1002 - hardware (memory) issue as Normal priority.
Tue, Aug 7, 9:52 AM · Patch-For-Review, Operations, ops-eqiad
fgiunchedi triaged T201350: Access to dumps servers as Normal priority.
Tue, Aug 7, 9:51 AM · Patch-For-Review, Data-Services, Operations, SRE-Access-Requests
fgiunchedi reopened T163286: Tegmen: process spawn loop + failed icinga + failing puppet as "Open".

Reopening, looks like tegmen is suffering lots of nsca processes again :(

Tue, Aug 7, 9:45 AM · Patch-For-Review, monitoring, Operations

Mon, Aug 6

fgiunchedi added a comment to T201187: Thumbnails don't seem to be being created/saved for id_internalwikimedia.

Can you try again?

Mon, Aug 6, 2:32 PM · Patch-For-Review, Operations, Thumbor
fgiunchedi added a project to T196484: rack/setup/install graphite1004: User-fgiunchedi.

Thanks @RobH ! Yeah role spare makes sense in this case.

Mon, Aug 6, 2:13 PM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi added a comment to T200820: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0"..

Ack, thanks for the report @revi ! I'll defer to some people more intimate with uploadstash

Mon, Aug 6, 2:12 PM · media-storage, MediaWiki-Uploading, Multimedia, User-revi
fgiunchedi added a comment to T201187: Thumbnails don't seem to be being created/saved for id_internalwikimedia.

No script no, just a review like the above. I'll deploy that later today.

Mon, Aug 6, 1:56 PM · Patch-For-Review, Operations, Thumbor
fgiunchedi merged task T188835: Logstash should log varnish XID into T176065: Include Varnish XID in Logstash.
Mon, Aug 6, 1:16 PM · Thumbor, Wikimedia-Logstash, MediaWiki-Debug-Logger
fgiunchedi merged T188835: Logstash should log varnish XID into T176065: Include Varnish XID in Logstash.
Mon, Aug 6, 1:16 PM · MediaWiki-Debug-Logger, Wikimedia-Logstash
fgiunchedi closed T166608: Some mw-ocg-service logs fail to index and are being dropped as Declined.

OCG isn't in service anymore

Mon, Aug 6, 1:15 PM · Wikimedia-Logstash, OCG-General
fgiunchedi moved T63785: Improve HA for logstash cluster from Backlog to Up next on the Wikimedia-Logstash board.
Mon, Aug 6, 1:13 PM · Wikimedia-Logstash
fgiunchedi moved T97297: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine from Backlog to Up next on the Wikimedia-Logstash board.
Mon, Aug 6, 1:13 PM · Operations, Wikimedia-Logstash
fgiunchedi moved T127977: Logstash DC fail-over / per-DC setup from Backlog to Up next on the Wikimedia-Logstash board.
Mon, Aug 6, 1:13 PM · Wikimedia-Logstash, codfw-rollout
fgiunchedi moved T169586: Send celery logs and events to logstash from Backlog to Service Integration on the Wikimedia-Logstash board.
Mon, Aug 6, 1:12 PM · TestMe, Patch-For-Review, Wikimedia-Logstash, ORES, Scoring-platform-team
fgiunchedi added a comment to T117821: Make a udp2log output plugin for Logstash.

Nowadays we're deprecating udp2log, though the issue of sending logs from logstash out elsewhere still stands. cc Parsing-Team for their opinion on what sort of log export they would like to see

Mon, Aug 6, 1:09 PM · Wikimedia-Logstash, MediaWiki-Debug-Logger
fgiunchedi moved T176335: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable from Backlog to Up next on the Wikimedia-Logstash board.
Mon, Aug 6, 1:07 PM · Discovery-Search, Services (watching), Operations, Elasticsearch, Wikimedia-Logstash, Discovery
fgiunchedi moved T178442: ssl terminators on elasticsearch servers (nginx) don't send their logs to logstash from Backlog to Service Integration on the Wikimedia-Logstash board.
Mon, Aug 6, 1:06 PM · Discovery-Search, Wikimedia-Logstash
fgiunchedi moved T181630: Send celery and wsgi service logs to logstash from Backlog to Service Integration on the Wikimedia-Logstash board.
Mon, Aug 6, 1:06 PM · Wikimedia-Logstash, monitoring, Wikimedia-Incident, Operations, Scoring-platform-team
fgiunchedi moved T200362: Send logstash service metrics to prometheus from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
Mon, Aug 6, 1:06 PM · Patch-For-Review, User-herron, monitoring, Wikimedia-Logstash, Operations
fgiunchedi moved T193766: Ship host syslogs to ELK from Backlog to Service Integration on the Wikimedia-Logstash board.
Mon, Aug 6, 1:05 PM · Wikimedia-Logstash, User-herron, Patch-For-Review, Operations
fgiunchedi moved T150106: Type collisions in log events causing indexing failures in ELK Elasticsearch from In Dev/Progress to Up next on the Wikimedia-Logstash board.
Mon, Aug 6, 1:05 PM · MW-1.31-release-notes (WMF-deploy-2017-10-31 (1.31.0-wmf.6)), Analytics, Patch-For-Review, MW-1.29-release-notes, EventBus, Wikimedia-Logstash
fgiunchedi moved T200960: Logstash packet loss from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
Mon, Aug 6, 1:05 PM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a comment to T200960: Logstash packet loss.

Daily restart is in place now and a packet loss alert too. Unfortunately packet loss shows up some hours after a restart too, with not only the syslog receive buffer filling up but also the gelf port (12201 udp). In addition to that logstash laments errors while receiving gelf, so possibly related:

Mon, Aug 6, 12:39 PM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a project to T200706: rack/setup/install centrallog1001.eqiad.wmnet: User-fgiunchedi.

Thanks @RobH !

Mon, Aug 6, 12:36 PM · User-fgiunchedi, Operations
fgiunchedi added a comment to T201187: Thumbnails don't seem to be being created/saved for id_internalwikimedia.

I believe that's because thumbor has to know about private containers, I've proposed https://gerrit.wikimedia.org/r/c/operations/puppet/+/450539 and we should update the new wiki creation checklist to include this step too.

Mon, Aug 6, 11:03 AM · Patch-For-Review, Operations, Thumbor

Fri, Aug 3

fgiunchedi added a comment to T200362: Send logstash service metrics to prometheus.

I took a look at both metrics and it seems https://github.com/BonnierNews/logstash_exporter metrics are more Prometheus-idiomatic (e.g. metric naming, usage of tags) so I think we should go for that.

There are some metrics missing, though, that I think would be nice (but not a blocker) to have:

Fri, Aug 3, 3:38 PM · Patch-For-Review, User-herron, monitoring, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T200362: Send logstash service metrics to prometheus.

We'll need to add jmx_exporter to Logstash too, to get JVM stats like most other JVMs on the fleet.

Fri, Aug 3, 2:21 PM · Patch-For-Review, User-herron, monitoring, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T200960: Logstash packet loss.

Slightly related, I asked syslog logstash input upstream to add settings for receive buffers: https://github.com/logstash-plugins/logstash-input-syslog/issues/50

Fri, Aug 3, 9:44 AM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a comment to T200960: Logstash packet loss.

UDP loss has been minimized now, though even with the current settings I've seen the receive buffer spike to ~1.5MB before getting drained. Short term what we could do is spool syslog traffic to disk (via logstash itself and persistent queues or rsyslog) instead of relying on being fast enough to drain the receive buffer.

Fri, Aug 3, 8:56 AM · Operations, Patch-For-Review, Wikimedia-Logstash

Thu, Aug 2

fgiunchedi changed the status of T199198: Some swift filesystems reporting negative disk usage from Open to Stalled.

Stalling this, might happen again and upstream likely will have mitigations in linux 4.19

Thu, Aug 2, 4:17 PM · User-fgiunchedi, media-storage, Operations
fgiunchedi added a comment to T200960: Logstash packet loss.

I poked at logstash a little more and packet loss has gone away after these changes (not yet in puppet):

  1. pipeline.workers: 1 was explicit in the logstash configuration, I commented it out to get #workers == #CPUs working on the pipeline
  2. the multiline filter isn't thread safe so even with the setting above I had to remove multiline filter for now (used only by hhvm-fatal)
  3. increased the default receive buffer to 4MB on all logstash hosts, that's enough headroom for logstash to catch up
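
As a rough illustration of the receive-buffer point in item 3, here is a sketch of how a single UDP socket can request a larger buffer from the kernel. This is not how the change above was made (that raised the host-wide default, and logstash is not a Python program); the function name is hypothetical, and 4 MB is simply the figure quoted in the list.

```python
import socket

# 4 MB, matching the figure quoted in the comment.
REQUESTED_RCVBUF = 4 * 1024 * 1024


def open_udp_socket(rcvbuf=REQUESTED_RCVBUF):
    """Create a UDP socket and ask the kernel for a larger receive buffer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # The kernel may clamp this request to its configured maximum.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    return sock


sock = open_udp_socket()
# Read back what was actually granted; it can differ from the request.
effective_rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sock.close()
```

On Linux the granted size is capped by `net.core.rmem_max`, so reading the value back is the only way to know how much headroom the socket really has.
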
Thu, Aug 2, 1:11 PM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi created T200984: Stop introducing new code expanded from erb templates.
Thu, Aug 2, 8:58 AM · Operations, Puppet
fgiunchedi added a comment to T200960: Logstash packet loss.

A ton of these messages (from tcpdump), likely spamming syslog

Thu, Aug 2, 7:54 AM · Operations, Patch-For-Review, Wikimedia-Logstash
fgiunchedi added a comment to T200960: Logstash packet loss.

After the restart on logstash1007, syslog shows up as taking about 90% of the time:

Thu, Aug 2, 7:44 AM · Operations, Patch-For-Review, Wikimedia-Logstash

Wed, Aug 1

fgiunchedi added a comment to T200866: Timeout and HTTP 500 error on 5000px thumbnail of large SVG image.

Indeed Thumbor rate-limits failures to render a certain original after some tries, https://wikitech.wikimedia.org/wiki/Thumbor#Throttling
My guess would be that 5000px exceeds thumbor's memory limit for rendering

Wed, Aug 1, 8:39 AM · Thumbor
fgiunchedi added a comment to T200820: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0"..

Odd UploadStash would fail, can you try again? Does the 503 report any further error?

Wed, Aug 1, 8:38 AM · media-storage, MediaWiki-Uploading, Multimedia, User-revi

Tue, Jul 31

fgiunchedi added a comment to T200362: Send logstash service metrics to prometheus.

I took a look at both metrics and it seems https://github.com/BonnierNews/logstash_exporter metrics are more Prometheus-idiomatic (e.g. metric naming, usage of tags) so I think we should go for that.

Tue, Jul 31, 8:36 AM · Patch-For-Review, User-herron, monitoring, Wikimedia-Logstash, Operations

Mon, Jul 30

fgiunchedi changed the status of T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests from Stalled to Open.

Sorry for the delay! I've merged the patches so haproxy is now running alongside nginx on thumbor instances.
Things still missing, off the top of my head:

Mon, Jul 30, 1:44 PM · Performance-Team, Thumbor
fgiunchedi created T200678: wtp2011 memory correctable errors.
Mon, Jul 30, 10:40 AM · Operations, ops-codfw
fgiunchedi added a comment to T199198: Some swift filesystems reporting negative disk usage.

I've inquired upstream, and one of the suggested approaches is to run with page poisoning. I'll do that on one host in codfw. This issue will also likely be checked for and fixed in linux 4.19.

Mon, Jul 30, 10:12 AM · User-fgiunchedi, media-storage, Operations
fgiunchedi awarded T200673: varnish-http-requests false positives when a DC is depooled a Like token.
Mon, Jul 30, 10:05 AM · Traffic, Operations
fgiunchedi added a comment to T199198: Some swift filesystems reporting negative disk usage.

Indeed still another case of the 32nd bit flipping, interestingly on a codfw host where we haven't been seeing this yet:

Mon, Jul 30, 8:32 AM · User-fgiunchedi, media-storage, Operations

Fri, Jul 27

fgiunchedi moved T198753: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) from Backlog to Doing on the User-fgiunchedi board.
Fri, Jul 27, 9:21 AM · Wikimedia-Logstash, User-fgiunchedi, User-herron, Goal, Operations
fgiunchedi closed T200317: grafana fails to load dashboards from disk as Resolved.

I checked grafana 5.2 and it correctly skips invalid dashboards from disk, mentioning which dashboards are failing to load, resolving.

Fri, Jul 27, 9:18 AM · Patch-For-Review, monitoring, Operations
fgiunchedi added a comment to T199968: Investigate odd aggregation of MediaWiki.edit.failures.conflict.sum metric in graphite.

I believe the discrepancy comes from the fact that the whisper file for that metric uses average as aggregation method, not sum. Likely because it is a file created a long time ago, before we fixed the issue in modules/role/manifests/graphite/base.pp to have .sum metric files aggregate with sum.

Got it, I figured that might be the case but had no way of checking.
Is there a way to alter this for future aggregations of this method?

Fri, Jul 27, 8:11 AM · Performance-Team, User-Addshore, WMDE-Analytics-Engineering, Graphite

Thu, Jul 26

fgiunchedi closed T134458: status.wikimedia.org should use some Wikimedia favicon if possible, a subtask of T199816: Sunset Watchmouse's status.wikimedia.org, as Invalid.
Thu, Jul 26, 10:43 AM · User-fgiunchedi, monitoring, Patch-For-Review, Operations
fgiunchedi closed T134458: status.wikimedia.org should use some Wikimedia favicon if possible as Invalid.

Parent task resolved!

Thu, Jul 26, 10:43 AM · Operations
fgiunchedi closed T195530: status.wikimedia.org showing all lights green during major outage, a subtask of T195520: Multiple projects reporting Cannot access the database: No working replica DB server, as Invalid.
Thu, Jul 26, 10:43 AM · User-Addshore, Wikidata-Campsite, MW-1.32-release-notes (WMF-deploy-2018-05-29 (1.32.0-wmf.6)), Wikidata-Ministry-Of-Magic, Wikimedia-Incident, Wikidata, Patch-For-Review, Wikimedia-General-or-Unknown, Wikimedia-log-errors, Operations
fgiunchedi closed T195530: status.wikimedia.org showing all lights green during major outage, a subtask of T199816: Sunset Watchmouse's status.wikimedia.org, as Invalid.
Thu, Jul 26, 10:43 AM · User-fgiunchedi, monitoring, Patch-For-Review, Operations
fgiunchedi closed T195530: status.wikimedia.org showing all lights green during major outage as Invalid.

Parent task resolved!

Thu, Jul 26, 10:43 AM · monitoring, Operations
fgiunchedi closed T189763: status.wikimedia.org should have an alternative privacy policy, a subtask of T199816: Sunset Watchmouse's status.wikimedia.org, as Invalid.
Thu, Jul 26, 10:43 AM · User-fgiunchedi, monitoring, Patch-For-Review, Operations
fgiunchedi closed T189763: status.wikimedia.org should have an alternative privacy policy as Invalid.

Parent task resolved!

Thu, Jul 26, 10:43 AM · monitoring, Operations, Privacy, Security
fgiunchedi closed T115945: status.wikimedia.org should not load Google Analytics as Invalid.

Parent task resolved!

Thu, Jul 26, 10:43 AM · Security-Core, Operations, Privacy, monitoring
fgiunchedi closed T115945: status.wikimedia.org should not load Google Analytics, a subtask of T199816: Sunset Watchmouse's status.wikimedia.org, as Invalid.
Thu, Jul 26, 10:43 AM · User-fgiunchedi, monitoring, Patch-For-Review, Operations
fgiunchedi closed T199816: Sunset Watchmouse's status.wikimedia.org as Resolved.

It has been decided at the SRE weekly meeting to leave the deprecation page up indefinitely instead of removing the DNS name. I've updated enwiki pages as well, resolving this task and subtasks.

Thu, Jul 26, 10:42 AM · User-fgiunchedi, monitoring, Patch-For-Review, Operations

Wed, Jul 25

fgiunchedi added a comment to T198754: Logstash/Kibana architecture review.

Non exhaustive list of things that we'll need to address:

Wed, Jul 25, 2:13 PM · Wikimedia-Logstash, Goal, Operations
fgiunchedi added a comment to T199968: Investigate odd aggregation of MediaWiki.edit.failures.conflict.sum metric in graphite.

I believe the discrepancy comes from the fact that the whisper file for that metric uses average as aggregation method, not sum. Likely because it is a file created a long time ago, before we fixed the issue in modules/role/manifests/graphite/base.pp to have .sum metric files aggregate with sum.
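
To make the difference concrete, here is a small sketch of what happens when fine-grained datapoints are rolled up into a coarser archive with each method. It only mimics the idea and does not use the whisper library; the function name and sample counts are hypothetical.

```python
from statistics import mean


def downsample(points, factor, method):
    """Roll up each run of `factor` consecutive points using `method`."""
    if method == "sum":
        aggregate = sum
    elif method == "average":
        aggregate = mean
    else:
        raise ValueError(f"unsupported aggregation method: {method}")
    return [aggregate(points[i:i + factor]) for i in range(0, len(points), factor)]


# Hypothetical per-minute counts of edit conflicts.
per_minute = [2, 0, 1, 3, 0, 0, 4, 2, 1, 0]

# Rolling up to 5-minute buckets: sum preserves the total count...
by_sum = downsample(per_minute, 5, "sum")          # [6, 7]
# ...while average shrinks each bucket by the rollup factor.
by_average = downsample(per_minute, 5, "average")  # [1.2, 1.4]
```

For a counter-style `.sum` metric, the averaged rollup understates each bucket by the rollup factor, which matches the kind of discrepancy described above.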

Wed, Jul 25, 10:20 AM · Performance-Team, User-Addshore, WMDE-Analytics-Engineering, Graphite
fgiunchedi created T200317: grafana fails to load dashboards from disk.
Wed, Jul 25, 10:00 AM · Patch-For-Review, monitoring, Operations
fgiunchedi renamed T200313: Resource exhausted while scaling File:Voando_num_mar_de_areia.jpg from Thumbnails not created: Error: 429, Too Many Requests to cache resources exhausted while scaling File:Voando_num_mar_de_areia.jpg.
Wed, Jul 25, 9:40 AM · Performance-Team, Thumbor, Commons
fgiunchedi added a comment to T200313: Resource exhausted while scaling File:Voando_num_mar_de_areia.jpg.

Looks like thumbor/imagemagick are running into resource exhaustion when trying to scale this image (error below), resulting in 500s. Poolcounter then kicks in for this original due to the repeated 500s while scaling, and 429s are returned instead.

Wed, Jul 25, 9:39 AM · Performance-Team, Thumbor, Commons

Tue, Jul 24

fgiunchedi closed T196873: ms-be1036 in power off status, not responsive to power on commands as Resolved.

Host is back in service

Tue, Jul 24, 4:45 PM · User-fgiunchedi, ops-eqiad, Operations
Dzahn awarded T169035: bast3002 sdb broken a Like token.
Tue, Jul 24, 4:39 PM · Operations, ops-esams
fgiunchedi added a comment to T196484: rack/setup/install graphite1004.

Thanks for the update @Cmjohnson, not particularly urgent but it would be nice to have graphite1004 before the end of the quarter

Tue, Jul 24, 8:47 AM · Patch-For-Review, User-fgiunchedi, monitoring, Operations

Mon, Jul 23

fgiunchedi added a comment to T196483: rack/setup/install graphite2003.

Decom tasks: T200210: Decom graphite2002 T200209: Decom graphite2001

Mon, Jul 23, 3:44 PM · monitoring, Patch-For-Review, ops-codfw, Operations
fgiunchedi created T200210: Decom graphite2002.
Mon, Jul 23, 3:43 PM · monitoring, Operations, ops-codfw
fgiunchedi created T200209: Decom graphite2001.
Mon, Jul 23, 3:42 PM · ops-codfw, Operations, monitoring
fgiunchedi added a comment to T196484: rack/setup/install graphite1004.

@Cmjohnson what's the status for graphite1004 ?

Mon, Jul 23, 3:41 PM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi closed T196483: rack/setup/install graphite2003 as Resolved.

Resolving, I'll file decom tasks for graphite200[12]

Mon, Jul 23, 3:38 PM · monitoring, Patch-For-Review, ops-codfw, Operations