fgiunchedi (Filippo Giunchedi)

User Details

User Since
Oct 3 2014, 8:06 AM (150 w, 5 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi

Recent Activity

Today

fgiunchedi added a comment to T86552: monitor and alarm on SMART attributes.

I'll take this on as part of T86556: monitor SSD wear levels since this task is essentially a superset

Wed, Aug 23, 4:23 PM · User-fgiunchedi, Patch-For-Review, Operations, monitoring
fgiunchedi claimed T86552: monitor and alarm on SMART attributes.
Wed, Aug 23, 4:22 PM · User-fgiunchedi, Patch-For-Review, Operations, monitoring
fgiunchedi added a comment to T173422: Investigate the increase in the number of requests to Swift after the Page Previews deploy.

I think we can resolve this task, for swift I got T173721 going. The increase doesn't seem related to the PP deploy and happens periodically anyway, it will need separate investigation. Thoughts?

Wed, Aug 23, 3:35 PM · Readers-Web-Backlog (Tracking), Traffic, Operations, Page-Previews
fgiunchedi moved T173436: Delete graphite metrics for old CFs from Doing to Backlog on the User-fgiunchedi board.
Wed, Aug 23, 3:25 PM · User-fgiunchedi, Cassandra, Services (doing), Operations, Goal, Epic
fgiunchedi awarded T170839: Migrate dropwizard/metrics to scap3 a Burninate token.
Wed, Aug 23, 3:20 PM · Scap (Scap3-Adoption-Phase1)
fgiunchedi added a comment to T170839: Migrate dropwizard/metrics to scap3.

We can nuke the repo, it was deprecated in T104208: alternative Cassandra metrics reporting and https://gerrit.wikimedia.org/r/#/c/223041/

Wed, Aug 23, 3:20 PM · Scap (Scap3-Adoption-Phase1)
fgiunchedi added a comment to T171167: Evaluate LibreNMS' Graphite backend.

Looks like librenms polls every 5 minutes, so the gaps are there because no data has actually been sent.

Wed, Aug 23, 2:00 PM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
fgiunchedi added a comment to T169860: Investigate/setup prometheus blackbox_exporter.

I've put a sample dashboard at https://grafana.wikimedia.org/dashboard/db/network-probes showing for a given "target" (i.e. a bastion at the moment) its maximum latency from all sites and the number of times the probe has flapped.

Wed, Aug 23, 1:32 PM · monitoring, User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring
fgiunchedi moved T170817: Upgrade Thumbor servers to Stretch from Doing to Radar on the User-fgiunchedi board.
Wed, Aug 23, 9:42 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi added a comment to T161012: Implement http-less file-copy functionality.

I didn't fully read the code, though I'm curious what happens to the files on the FileBackend side of things and specifically to swift in production/beta. Thanks!

Wed, Aug 23, 7:58 AM · WMDE-QWERTY-Sprint-2017-08-22, WMDE-QWERTY-Sprint-2017-07-25, Patch-For-Review, User-Addshore, Move-Files-To-Commons, WMDE-QWERTY-Team-Board, TCB-Team

Yesterday

fgiunchedi closed T150108: fix partition scheme for logstash ingester hosts as Resolved.

Resolving, logstash ingestion is moving to ganeti

Tue, Aug 22, 4:12 PM · Patch-For-Review, Wikimedia-Logstash, Operations
fgiunchedi closed T150108: fix partition scheme for logstash ingester hosts, a subtask of T151971: Move logstash ingestion behind LVS, as Resolved.
Tue, Aug 22, 4:12 PM · Patch-For-Review, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T173829: Deactivate list "deployment-systems".

+1, we've never used the list

Tue, Aug 22, 3:05 PM · Wikimedia-Mailing-lists, User-MarcoAurelio
fgiunchedi moved T163673: Some swift disks wrongly mounted on 5 ms-be hosts from Doing to Backlog on the User-fgiunchedi board.
Tue, Aug 22, 1:06 PM · Patch-For-Review, User-fgiunchedi, Operations

Mon, Aug 21

fgiunchedi added a comment to T173422: Investigate the increase in the number of requests to Swift after the Page Previews deploy.

I'm +1 on the swift side to resume rollout everywhere but en/de

Mon, Aug 21, 5:40 PM · Readers-Web-Backlog (Tracking), Traffic, Operations, Page-Previews
fgiunchedi moved T173731: Reduce swift frontend conntrack usage from Backlog to Doing on the User-fgiunchedi board.
Mon, Aug 21, 4:45 PM · Patch-For-Review, User-fgiunchedi, Operations, media-storage
fgiunchedi added a project to T173731: Reduce swift frontend conntrack usage: User-fgiunchedi.
Mon, Aug 21, 4:44 PM · Patch-For-Review, User-fgiunchedi, Operations, media-storage
fgiunchedi added a comment to T173731: Reduce swift frontend conntrack usage.

Note that statsd and swift account for the majority of entries in conntrack.

Mon, Aug 21, 4:00 PM · Patch-For-Review, User-fgiunchedi, Operations, media-storage
fgiunchedi created T173731: Reduce swift frontend conntrack usage.
Mon, Aug 21, 3:36 PM · Patch-For-Review, User-fgiunchedi, Operations, media-storage
fgiunchedi moved T86556: monitor SSD wear levels from Backlog to Doing on the User-fgiunchedi board.
Mon, Aug 21, 3:21 PM · User-fgiunchedi, Operations-Software-Development, Operations, monitoring
fgiunchedi added a project to T86556: monitor SSD wear levels: User-fgiunchedi.
Mon, Aug 21, 3:21 PM · User-fgiunchedi, Operations-Software-Development, Operations, monitoring
fgiunchedi added a watcher for monitoring: fgiunchedi.
Mon, Aug 21, 3:17 PM
fgiunchedi added a comment to T173422: Investigate the increase in the number of requests to Swift after the Page Previews deploy.

FYI the periodic increase in swift requests is now tracked separately at T173721: Track down the source of periodic increases in requests to swift eqiad

Mon, Aug 21, 2:07 PM · Readers-Web-Backlog (Tracking), Traffic, Operations, Page-Previews
fgiunchedi created T173721: Track down the source of periodic increases in requests to swift eqiad.
Mon, Aug 21, 2:06 PM · media-storage, User-fgiunchedi, Operations
fgiunchedi moved T173490: Provision prometheus instance for cassandra/services metrics collection from Backlog to Doing on the User-fgiunchedi board.
Mon, Aug 21, 1:52 PM · Patch-For-Review, User-fgiunchedi, Services (doing), Cassandra
fgiunchedi added a comment to T173490: Provision prometheus instance for cassandra/services metrics collection.

I can't seem to get the following to work to extract all hosts that have prometheus::jmx_exporter_instance defined:

Mon, Aug 21, 1:47 PM · Patch-For-Review, User-fgiunchedi, Services (doing), Cassandra
fgiunchedi added a comment to T173490: Provision prometheus instance for cassandra/services metrics collection.

Prometheus instance is up and running, still missing the "targets" generation, i.e. the cassandra instances that are currently running jmx_exporter.

Mon, Aug 21, 1:27 PM · Patch-For-Review, User-fgiunchedi, Services (doing), Cassandra
fgiunchedi updated subscribers of T173710: Job queue is increasing non-stop.

cc @aaron and @Krinkle in case this behaviour rings a bell with the work that was done in T171371: Investigate 30x increase in Jobrunner errors around the same time the increase started

Mon, Aug 21, 12:34 PM · MW-1.30-release-notes (WMF-deploy-2017-08-29 (1.30.0-wmf.16)), Performance-Team (Radar), Patch-For-Review, Discovery-Search, Discovery, CirrusSearch, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
fgiunchedi added a comment to T151554: Track incoming HTTP request count on the Thumbor boxes.

Patches are merged and stats are being polled by prometheus in codfw and eqiad, I've added basic request rates by status to https://grafana.wikimedia.org/dashboard/db/thumbor

Mon, Aug 21, 12:33 PM · Patch-For-Review, User-fgiunchedi, Operations, Performance-Team, Thumbor
fgiunchedi created T173698: Backfill librenms data in graphite with historical RRDs.
Mon, Aug 21, 10:39 AM · User-fgiunchedi, netops, monitoring, Operations
fgiunchedi moved T136312: Encrypt syslog traffic from Backlog to Up next on the monitoring board.
Mon, Aug 21, 10:01 AM · Patch-For-Review, monitoring, User-fgiunchedi, Operations
fgiunchedi moved T171167: Evaluate LibreNMS' Graphite backend from Doing to Radar on the User-fgiunchedi board.
Mon, Aug 21, 10:00 AM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
fgiunchedi placed T171167: Evaluate LibreNMS' Graphite backend up for grabs.

Unassigned from me since the deployment part is pending

Mon, Aug 21, 9:59 AM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
fgiunchedi added a comment to T173628: deployment-imagescaler02 is not responding to salt.

Interesting, I seem to remember seeing something like this in production too but it self healed once puppet was running on the box

Mon, Aug 21, 8:41 AM · Salt, Beta-Cluster-Infrastructure

Fri, Aug 18

fgiunchedi added a comment to T169939: End of August milestone: Cassandra 3 cluster in production.

Yeah, sadly commitlogs are for all intents and purposes write-only (they're only read on crash-recovery), so it seems like the wrong trade-off from this perspective. I'm just not sure what would be better.

Fri, Aug 18, 4:27 PM · Cassandra, Patch-For-Review, Services (doing), Operations, Goal, Epic
fgiunchedi renamed T136312: Encrypt syslog traffic from encrypt syslog traffic to Encrypt syslog traffic.
Fri, Aug 18, 3:32 PM · Patch-For-Review, monitoring, User-fgiunchedi, Operations
fgiunchedi created T173571: Disk full on deployment-jobrunner02.
Fri, Aug 18, 12:53 PM · Patch-For-Review, Beta-Cluster-Infrastructure
fgiunchedi added a comment to T169939: End of August milestone: Cassandra 3 cluster in production.

restbase2001.codfw.wmnet has been re-imaged, but there are a couple of issues yet to resolve:

[ ... ]

Secondly, the agreed upon disk/mount layout doesn't provide for a common location to store commitlogs, hints, saved caches and heapdumps (can't believe I missed this).

Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs            26G   19M   26G   1% /run
/dev/md0         28G  2.0G   25G   8% /
tmpfs            63G     0   63G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda3       911G   72M  911G   1% /srv/sda
/dev/sdb3       911G   72M  911G   1% /srv/sdb
/dev/sdc3       911G   72M  911G   1% /srv/sdc
/dev/sde3       911G   72M  911G   1% /srv/sde
/dev/sdd3       911G   72M  911G   1% /srv/sdd

[ ... ]

... I'm not certain what the best course of action is. Trade-offs abound...

Regarding trade-offs, it seems like the choices here can be distilled down to:

  1. A single location for all commitlogs
    1. Create an additional partition on every disk, combine them in a RAID-0
      • PROS: performance
      • CONS: large blast radius (one disk failure takes out all commitlogs)
    2. Create an additional partition on every disk, combine them in a RAID-1(e)
      • PROS: fault tolerance
      • CONS: performance(?)
    3. Store commitlogs on the same device as the OS (a RAID-1, which is not currently large enough)
      • PROS: fault tolerance, fewer partitions
      • CONS: ugly, performance(?)
  2. Per instance storage of commitlogs
    1. Allocate one device per instance
      • PROS: fault tolerant(ish), performant(ish)
      • CONS: ugly, confusing, poor distribution of load (number of instances != number of disks)

I'm inclined to think that the performance of a single RAID-1 array might be Good Enough for commitlogs, at which point 1B (the RAID-1(e) option, 1.2 in the list above) seems most attractive.
Fri, Aug 18, 12:13 PM · Cassandra, Patch-For-Review, Services (doing), Operations, Goal, Epic
fgiunchedi added a comment to T144479: Ensure thumbor container access is preserved by mw filebackend setzoneaccess.

Ping? Not granting thumbor access for newly created wikis means files uploaded there won't get thumbnails.

Fri, Aug 18, 9:11 AM · MediaWiki-Maintenance-scripts, Operations, Performance-Team, Thumbor

Thu, Aug 17

fgiunchedi added a comment to T171167: Evaluate LibreNMS' Graphite backend.

Both issues have been fixed upstream! Pending deployment of latest version of librenms to production.

Thu, Aug 17, 4:48 PM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
fgiunchedi created T173518: Errors dealing with non-ascii characters in output.
Thu, Aug 17, 4:28 PM · puppet-compiler
fgiunchedi added a project to T173490: Provision prometheus instance for cassandra/services metrics collection: User-fgiunchedi.
Thu, Aug 17, 10:53 AM · Patch-For-Review, User-fgiunchedi, Services (doing), Cassandra
fgiunchedi moved T173436: Delete graphite metrics for old CFs from Backlog to Doing on the User-fgiunchedi board.
Thu, Aug 17, 10:53 AM · User-fgiunchedi, Cassandra, Services (doing), Operations, Goal, Epic
fgiunchedi added a comment to T172930: Long running thumbnail requests locking up Thumbor instances.

Which is a lot better than before where 502s were the most common response.

Thu, Aug 17, 10:51 AM · Patch-For-Review, Performance-Team, User-fgiunchedi, Thumbor, Operations
fgiunchedi created T173490: Provision prometheus instance for cassandra/services metrics collection.
Thu, Aug 17, 9:48 AM · Patch-For-Review, User-fgiunchedi, Services (doing), Cassandra

Wed, Aug 16

fgiunchedi created T173436: Delete graphite metrics for old CFs.
Wed, Aug 16, 5:37 PM · User-fgiunchedi, Cassandra, Services (doing), Operations, Goal, Epic
fgiunchedi added a comment to T173422: Investigate the increase in the number of requests to Swift after the Page Previews deploy.

So the increase in swift requests seems to be cyclic (daily) and corresponds to dips in cache_upload hitrate as per the graph below.

And an equivalent spike in swift requests (zoomed in on a given day)

Wed, Aug 16, 5:12 PM · Readers-Web-Backlog (Tracking), Traffic, Operations, Page-Previews
fgiunchedi added a project to T151554: Track incoming HTTP request count on the Thumbor boxes: User-fgiunchedi.
Wed, Aug 16, 4:52 PM · Patch-For-Review, User-fgiunchedi, Operations, Performance-Team, Thumbor
fgiunchedi added a comment to T173374: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite".".

Is there an exception id or anything like that attached to the error? I can't find anything related to that in logstash ATM

Wed, Aug 16, 2:09 PM · Operations, media-storage
fgiunchedi closed T173401: graphite2001 disk space alarms for big log files in /var/log/carbon as Resolved.

We're now deleting carbon-cache logs older than 15d with the above patch, resolving. The librenms invalid lines are tracked at T171167: Evaluate LibreNMS' Graphite backend. Thanks @elukey !
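
The patch itself is not reproduced in this feed. As a rough illustration only, age-based log cleanup of this kind could be sketched as below; the function names are hypothetical, only the 15-day threshold comes from the comment above, and the production change may well use a different mechanism entirely.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_old_logs(log_dir, max_age_days=15, now=None):
    """Return files directly under log_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in Path(log_dir).iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def delete_old_logs(log_dir, max_age_days=15, now=None):
    """Delete the files selected by find_old_logs and return their names."""
    removed = []
    for path in find_old_logs(log_dir, max_age_days, now):
        path.unlink()
        removed.append(path.name)
    return removed
```

Calling `find_old_logs` on its own first gives a dry run of what would be removed.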

Wed, Aug 16, 2:05 PM · Patch-For-Review, User-fgiunchedi, Operations, Graphite, monitoring
fgiunchedi created T173415: puppet-compiler should display newly introduced resources entirely.
Wed, Aug 16, 1:08 PM · puppet-compiler
fgiunchedi reopened T171167: Evaluate LibreNMS' Graphite backend as "Open".

Reported upstream at https://github.com/librenms/librenms/issues/7167 and https://github.com/librenms/librenms/issues/7166

Wed, Aug 16, 11:02 AM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
fgiunchedi added a comment to T171167: Evaluate LibreNMS' Graphite backend.

Indeed it looks like librenms sends both metrics with whitespace in the name and metrics without values:
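
The sample lines from the original comment are not reproduced in this feed. As an illustration only, the two failure modes described (whitespace in the metric name, and a missing value) can be caught by a small validator for Graphite's plaintext format of `<metric path> <value> <timestamp>`. The function names and the metric names in the usage note are made up for the sketch.

```python
def parse_graphite_line(line):
    """Parse one plaintext-protocol line into (path, value, timestamp).

    Raises ValueError when the line does not have exactly three
    whitespace-separated fields or the numeric fields do not parse.
    """
    fields = line.strip().split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    path, raw_value, raw_timestamp = fields
    value = float(raw_value)              # ValueError if not numeric
    timestamp = int(float(raw_timestamp)) # ValueError if not numeric
    return path, value, timestamp


def is_valid_graphite_line(line):
    """Return True when the line parses, False otherwise."""
    try:
        parse_graphite_line(line)
    except ValueError:
        return False
    return True
```

With hypothetical metric names, `is_valid_graphite_line("librenms.host.Gigabit Ethernet.in 42 1502879400")` is rejected because the embedded space yields four fields, and `is_valid_graphite_line("librenms.host.eth0.in 1502879400")` is rejected because the value is missing.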

Wed, Aug 16, 10:30 AM · Patch-For-Review, User-fgiunchedi, netops, monitoring, Operations
fgiunchedi moved T136312: Encrypt syslog traffic from Backlog to Doing on the User-fgiunchedi board.
Wed, Aug 16, 9:28 AM · Patch-For-Review, monitoring, User-fgiunchedi, Operations
fgiunchedi moved T170817: Upgrade Thumbor servers to Stretch from Backlog to Doing on the User-fgiunchedi board.
Wed, Aug 16, 9:28 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi moved T172930: Long running thumbnail requests locking up Thumbor instances from Backlog to Radar on the User-fgiunchedi board.
Wed, Aug 16, 9:28 AM · Patch-For-Review, Performance-Team, User-fgiunchedi, Thumbor, Operations
fgiunchedi moved T173401: graphite2001 disk space alarms for big log files in /var/log/carbon from Backlog to Doing on the User-fgiunchedi board.
Wed, Aug 16, 9:28 AM · Patch-For-Review, User-fgiunchedi, Operations, Graphite, monitoring
fgiunchedi added a project to T173401: graphite2001 disk space alarms for big log files in /var/log/carbon: User-fgiunchedi.
Wed, Aug 16, 9:28 AM · Patch-For-Review, User-fgiunchedi, Operations, Graphite, monitoring
fgiunchedi added a comment to T170817: Upgrade Thumbor servers to Stretch.

It's a text rendering difference. Not that it's great in the original, but it definitely gets worse on Stretch. Does the machine you're building Thumbor on and deployment-imagescaler02 have all the fonts we normally install on imagescalers?

I'm building on copper, so gsfonts definitely needs to be in thumbor's build deps

Wed, Aug 16, 9:25 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi added a comment to T170817: Upgrade Thumbor servers to Stretch.

It's probably a minor difference in rsvg rendering. 98.8% is very good similarity. Let's double check if the rendering difference is significant.

This is the reference thumbnail, identical to what Thumbor generates on Jessie at the moment:

This is what rsvg-convert generates on Stretch (ran on deployment-imagescaler02):

It's a text rendering difference. Not that it's great in the original, but it definitely gets worse on Stretch. Does the machine you're building Thumbor on and deployment-imagescaler02 have all the fonts we normally install on imagescalers?

Wed, Aug 16, 8:34 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi added a comment to T170817: Upgrade Thumbor servers to Stretch.

Right off the bat, the first one with major differences, Century Schoolbook L, comes from the "gsfonts" package, which is found on thumbor1001, deployment-imagescaler01, but not on deployment-imagescaler02. @fgiunchedi is a role missing from deployment-imagescaler02 or something?

Wed, Aug 16, 8:33 AM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi added a comment to T173056: Import Wiki Loves Monuments photos from Flickr to Commons.

@fgiunchedi Shame we missed each other at Wikimania! We'll keep in touch about this.

Wed, Aug 16, 8:19 AM · Operations, Wiki-Loves-Monuments (2017)
fgiunchedi added a comment to T173276: Specific JPEG file on upload.wikimedia.org returns Content-Type: application/x-www-form-urlencoded.

I couldn't find the corresponding File: page for that file right away, anyways IIRC C-T is set by mediawiki at upload time. Something went wrong there on the last upload I presume? An interesting audit would be to check what C-T we're sending back for upload.w.o, for sure some types like that shouldn't be sent at all

Wed, Aug 16, 8:12 AM · media-storage, Multimedia, MediaWiki-File-management

Sun, Aug 13

fgiunchedi added a comment to T172930: Long running thumbnail requests locking up Thumbor instances.

FTR this happened again last night (UTC). I'm currently working on having thumbor run on stretch in T170817, which will also bring a newer gs. In my quick experiments I couldn't reproduce the lockup we've seen with the files above. This, together with the per-filetype throttling that @Gilles mentioned, should help with mitigation.

Sun, Aug 13, 7:44 PM · Patch-For-Review, Performance-Team, User-fgiunchedi, Thumbor, Operations
fgiunchedi added a comment to T170817: Upgrade Thumbor servers to Stretch.

@Gilles I can reproduce at will the test failure above on stretch, thoughts?

Sun, Aug 13, 7:41 PM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi edited projects for T170817: Upgrade Thumbor servers to Stretch, added: User-fgiunchedi; removed Patch-For-Review.
Sun, Aug 13, 7:24 PM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi added a project to T172930: Long running thumbnail requests locking up Thumbor instances: User-fgiunchedi.
Sun, Aug 13, 7:23 PM · Patch-For-Review, Performance-Team, User-fgiunchedi, Thumbor, Operations
fgiunchedi added a comment to T173056: Import Wiki Loves Monuments photos from Flickr to Commons.

@fgiunchedi what do you think are the risks? Number of incoming images maybe? Haven't seen any issues in that area for a long time. Maybe something else?

Sun, Aug 13, 7:21 PM · Operations, Wiki-Loves-Monuments (2017)
fgiunchedi added a comment to T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003.

This just happened again, any thoughts on what I wrote in T159922#3492238? Namely that xpra might not necessarily be the root cause

Sun, Aug 13, 7:05 PM · Services (blocked), Readers-Web-Backlog (Tracking), Patch-For-Review, Operations, Electron-PDFs

Sat, Aug 12

fgiunchedi awarded T150456: puppet compiler fails with modules using puppetdb a Like token.
Sat, Aug 12, 9:07 PM · Patch-For-Review, User-Joe, puppet-compiler, Operations

Thu, Aug 10

fgiunchedi accepted D743: `require_valid_service` to check service mask.
Thu, Aug 10, 6:23 PM · Release-Engineering-Team
fgiunchedi added a comment to T170817: Upgrade Thumbor servers to Stretch.

TODO:

  • Update thumbor package to latest upstream (fixes pillow dep and all fixes from @Gilles have been merged upstream)
Thu, Aug 10, 4:00 PM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi added a comment to T170817: Upgrade Thumbor servers to Stretch.

I started playing with thumbor on stretch and building the package on copper yields an error with pillow 4 whereas thumbor wants pillow 3 out of the box.

Thu, Aug 10, 3:10 PM · Patch-For-Review, User-fgiunchedi, Performance-Team (Radar), Operations, Thumbor
fgiunchedi added a comment to T172930: Long running thumbnail requests locking up Thumbor instances.

Do you have a list of pages with times? How often does it happen organically?

Thu, Aug 10, 2:45 PM · Patch-For-Review, Performance-Team, User-fgiunchedi, Thumbor, Operations
fgiunchedi updated subscribers of T172997: graphite cassandra metrics disk usage.
Thu, Aug 10, 2:07 PM · Cassandra, monitoring
fgiunchedi created T172997: graphite cassandra metrics disk usage.
Thu, Aug 10, 2:07 PM · Cassandra, monitoring

Wed, Aug 9

fgiunchedi added a comment to T172930: Long running thumbnail requests locking up Thumbor instances.

Below there's a list of top 20 files that failed to get converted today, unsurprisingly lots of pdfs there.
Checking the first file, it seems ghostscript hangs even though the pdf file is only 45MB. Next I'll try the conversion with stretch's ghostscript and see if the behaviour is the same.

Wed, Aug 9, 8:37 PM · Patch-For-Review, Performance-Team, User-fgiunchedi, Thumbor, Operations
fgiunchedi created T172939: Thumbor webp handling.
Wed, Aug 9, 7:07 PM · Patch-For-Review, Performance-Team, Thumbor
fgiunchedi added a comment to T172930: Long running thumbnail requests locking up Thumbor instances.

AFAICT from the thumbor dashboard at the time of the outage it is the ghostscript engine (and thus PDF processing) spiking up in its request time

Wed, Aug 9, 7:00 PM · Patch-For-Review, Performance-Team, User-fgiunchedi, Thumbor, Operations
fgiunchedi updated the task description for T172930: Long running thumbnail requests locking up Thumbor instances.
Wed, Aug 9, 6:55 PM · Patch-For-Review, Performance-Team, User-fgiunchedi, Thumbor, Operations
fgiunchedi created T172930: Long running thumbnail requests locking up Thumbor instances.
Wed, Aug 9, 6:33 PM · Patch-For-Review, Performance-Team, User-fgiunchedi, Thumbor, Operations
fgiunchedi added a comment to D743: `require_valid_service` to check service mask.

LGTM, just a nit

Wed, Aug 9, 5:57 PM · Release-Engineering-Team
fgiunchedi added a comment to T172921: Nrpe command_timeout and "Service Check Timed Out" errors.

Thanks @herron ! Indeed the check is slow when the raid controller is busy and the machines have lots of traffic

Wed, Aug 9, 5:32 PM · Operations, monitoring

Mon, Aug 7

fgiunchedi added a comment to T156955: Standardizing our partman recipes.

RAID/disk layer:

  • either software or hardware raid
  • in any case one block device is exposed (including the single-disk case, e.g. for VMs)


On top of this device we ought to have MBR/DOS or GPT partitions:

  • One /boot partition (or EFI system partition in the EFI case, T93208)
Mon, Aug 7, 8:26 PM · Patch-For-Review, Operations

Sun, Aug 6

fgiunchedi renamed T144479: Ensure thumbor container access is preserved by mw filebackend setzoneaccess from ensure thumbor container access is preserved by mw filebackend setzoneaccess to Ensure thumbor container access is preserved by mw filebackend setzoneaccess.
Sun, Aug 6, 5:46 PM · MediaWiki-Maintenance-scripts, Operations, Performance-Team, Thumbor
fgiunchedi raised the priority of T144479: Ensure thumbor container access is preserved by mw filebackend setzoneaccess from Low to High.

Since thumbor is in production now I'm bumping the priority because container perms need to be correct for new wikis

Sun, Aug 6, 5:46 PM · MediaWiki-Maintenance-scripts, Operations, Performance-Team, Thumbor

Thu, Aug 3

fgiunchedi added a comment to T158837: Consolidate performance website and related software.

(Draft / brain dump)

  • performance.wikimedia.org: simple frontend, low-priority, can go on a VM.
  • coal: high-throughput, high-priority, high-risk (hard to reproduce in case of failure); avoid contention with unrelated services.
  • coal-web: simple frontend, low-priority, but needs access to coal, so on same server.
  • xenon: Redis service receiving data from app servers is high-prio and part of MediaWiki request lifecycle. Best kept on mwlog1001. The rest is lower priority, and also deterministic and easy to reproduce in case of failure (it can just go back and replay the same feed and re-create the files). Could be moved to a VM maybe.
  • xhgui: Entirely standalone service for debugging MediaWiki requests, can go on a VM. Only requirement is that MediaWiki can connect to it. (Low-traffic and low-risk as debug is naturally opt-in through X-Wikimedia-Debug).
  • webperf: high-throughput, high-priority, high-risk (hard to reproduce in case of failure). Keep on dedicated hardware. Doesn't need as much CPU/RAM as hafnium currently does.

Possible outcome:

  • hafnium (decom) – formerly webperf
  • tungsten (decom) – formerly xhgui
  • osmium (decom)
  • graphite1001: remove performance::site, coal, and coal-web.
  • mwlog1001: remove most xenon stuff.
  • new vm-perf-web: static site with Apache proxying to vm-perf-xenon and vm-perf-xhgui.
  • new vm-perf-xenon: Somehow gets files from mwlog1001 (pro-active replica push, or scp pull). Creates flame graphs and runs local Apache to expose them through vm-perf-web.
  • new vm-perf-xhgui: Standalone PHP web service exposed through vm-perf-web with local Apache/MongoDB.
  • new perfcruncher1001 (hafnium replacement): webperf.


@faidon @fgiunchedi Does the above seem reasonable? Thoughts?

Thu, Aug 3, 2:34 PM · monitoring, Performance-Team, Operations
fgiunchedi added a comment to T172148: Determine URL paths for Zim files.

@fgiunchedi We can keep a separate database / list of all the Zim files outside of Swift. If you think that using Swift as the source of truth is not a good idea, then we can definitely just maintain our own list.

My main concern there was such a database getting out of sync with what is actually stored there. But really that's probably me being a little paranoid and pre-optimizing.

Thu, Aug 3, 1:34 PM · Reading-Infrastructure-Team-Backlog (Kanban), Operations, Traffic, Wikipedia-Android-App-Backlog, Android-app-feature-Compilations
fgiunchedi added a comment to T172123: Determine how to upload Zim files to Swift infrastructure.

Hey @fgiunchedi,

  1. For production swift access isn't permitted from cloud vps, does the zim generation and upload need to happen in cloud vps?

In principle it doesn't have to happen in Cloud VPS, that's just where I'm prototyping right now, though we were thinking (or hoping) it would be an appropriate place for an occasional (~1x/month) zim generation job, if possible to just keep it running there after the prototyping is finished. If that's out of the question, then we'd have to talk to someone about finding some appropriate production cluster hardware...

Thu, Aug 3, 1:22 PM · Patch-For-Review, Reading-Infrastructure-Team-Backlog, Operations, Traffic, Wikipedia-Android-App-Backlog, Android-app-feature-Compilations
fgiunchedi moved T106937: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails from Backlog to Radar on the User-fgiunchedi board.
Thu, Aug 3, 8:13 AM · User-fgiunchedi, media-storage, Commons, Operations, monitoring
fgiunchedi created T172362: New puppet compiler differ html escape.
Thu, Aug 3, 7:57 AM · Patch-For-Review, User-Joe, puppet-compiler
fgiunchedi added a project to T171758: Simplify git-fat support for pulling from both production and labs: Operations.

CC'ing Operations here too for wider distribution

Thu, Aug 3, 7:44 AM · Release-Engineering-Team (Next), Gerrit, Operations, Scoring-platform-team, Scap, ORES

Wed, Aug 2

fgiunchedi updated the task description for T171926: Degraded RAID on ms-be1017.
Wed, Aug 2, 4:29 PM · ops-eqiad, Operations
fgiunchedi added a comment to T171926: Degraded RAID on ms-be1017.

looks like we're back, thanks @Cmjohnson !

Wed, Aug 2, 4:28 PM · ops-eqiad, Operations
fgiunchedi added a comment to T106937: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails.

Chatted with @chasemp about this today, the easiest way forward seems to be setting up an emulated check with thresholds for failure to load content. https://commons.wikimedia.org/wiki/Special:NewFiles is the easiest target as it is full of recent thumbnails that should just work.

Wed, Aug 2, 4:03 PM · User-fgiunchedi, media-storage, Commons, Operations, monitoring
fgiunchedi updated the task description for T171183: Degraded RAID on ms-be1016.
Wed, Aug 2, 3:50 PM · User-fgiunchedi, ops-eqiad, Operations
fgiunchedi added a comment to T171183: Degraded RAID on ms-be1016.

Swapping by @Cmjohnson worked!

Wed, Aug 2, 3:42 PM · User-fgiunchedi, ops-eqiad, Operations
fgiunchedi closed T171454: deployment-ms-beXX Duplicate declaration: Exec[swift_udev_reload] as Resolved.
Wed, Aug 2, 2:45 PM · Patch-For-Review, User-fgiunchedi, media-storage, Beta-Cluster-Infrastructure
fgiunchedi closed T171454: deployment-ms-beXX Duplicate declaration: Exec[swift_udev_reload], a subtask of T171174: a lot of beta cluster instances are not reachable over SSH, as Resolved.
Wed, Aug 2, 2:45 PM · Services (watching), Wikimedia-Incident, VPS-Projects, Operations, Release-Engineering-Team (Kanban), Beta-Cluster-Infrastructure
fgiunchedi closed T172254: Double quotes in nutcracker config make json stats invalid json as Resolved.

Resolving now as the nutcracker collector works again on scb, will reopen depending on what upstream decides re: https://github.com/twitter/twemproxy/issues/532
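
As a sketch of the general failure mode only (not of how twemproxy actually builds its stats output, which this feed does not show): interpolating a name that itself contains double quotes into a hand-built JSON string produces invalid JSON, whereas a serializer escapes it. The pool name, the field name, and the helper functions below are assumptions for illustration.

```python
import json


def render_stats_naive(pool_name, connections):
    """Hypothetical formatter that interpolates the name without escaping."""
    return '{"%s": {"client_connections": %d}}' % (pool_name, connections)


def render_stats_escaped(pool_name, connections):
    """Let the JSON serializer escape any quotes inside the name."""
    return json.dumps({pool_name: {"client_connections": connections}})


def is_valid_json(text):
    """Return True when text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

A name taken verbatim from a config file with its surrounding quotes, such as `'"redis_eqiad"'`, fails `is_valid_json(render_stats_naive(...))` but passes with `render_stats_escaped`.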

Wed, Aug 2, 2:38 PM · Patch-For-Review, monitoring, Operations