I'll take this on as part of T86556: monitor SSD wear levels, since that task is essentially a superset of this one.
I think we can resolve this task; for swift I got T173721 going. The increase doesn't seem related to the PP deploy and happens periodically anyway, so it will need separate investigation. Thoughts?
Looks like librenms polls every 5 minutes, so the gaps are there because no data has actually been sent.
I've put a sample dashboard at https://grafana.wikimedia.org/dashboard/db/network-probes showing for a given "target" (i.e. a bastion at the moment) its maximum latency from all sites and the number of times the probe has flapped.
I didn't fully read the code, though I'm curious what happens to the files on the FileBackend side of things and specifically to swift in production/beta. Thanks!
Resolving; logstash ingestion is moving to ganeti.
+1, we've never used the list
Mon, Aug 21
I'm +1 on the swift side to resume rollout everywhere but en/de
Note that statsd and swift account for the majority of entries in conntrack.
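For reference, a quick way to confirm this is to tally conntrack entries by destination port; a minimal sketch (assuming /proc/net/nf_conntrack is readable, i.e. run as root with the nf_conntrack module loaded):

```python
# Sketch: tally conntrack entries by destination port to confirm which
# services dominate the table. Assumes /proc/net/nf_conntrack is
# readable (i.e. run as root with the nf_conntrack module loaded).
from collections import Counter

counts = Counter()
with open('/proc/net/nf_conntrack') as f:
    for line in f:
        for field in line.split():
            if field.startswith('dport='):
                counts[field.split('=', 1)[1]] += 1
                break  # count only the original direction

for port, n in counts.most_common(10):
    print(port, n)
```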
FYI the periodic increase in swift requests is now tracked separately at T173721: Track down the source of periodic increases in requests to swift eqiad
I can't seem to get the following to work to extract all hosts that have prometheus::jmx_exporter_instance defined:
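For comparison, querying PuppetDB's v4 resources API directly ought to return those hosts; a rough sketch (the PuppetDB URL is a placeholder, and note that PuppetDB stores resource type names capitalized):

```python
# Sketch: list all hosts that have the defined type via PuppetDB's v4
# resources endpoint. The PuppetDB URL is a placeholder; note that
# PuppetDB stores resource type names capitalized.
import json
import requests

PUPPETDB = 'http://puppetdb.example.org:8080'  # placeholder endpoint
query = ['=', 'type', 'Prometheus::Jmx_exporter_instance']
r = requests.get(PUPPETDB + '/pdb/query/v4/resources',
                 params={'query': json.dumps(query)})
r.raise_for_status()
for host in sorted({res['certname'] for res in r.json()}):
    print(host)
```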
Prometheus instance is up and running, still missing the "targets" generation, i.e. the cassandra instances that are currently running jmx_exporter.
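The targets generation could be done along these lines with Prometheus' file_sd; a rough sketch with hypothetical hosts and port (in practice the list would be driven from puppet rather than hardcoded):

```python
# Sketch: generate a Prometheus file_sd targets file for the cassandra
# jmx_exporter instances. Hosts and port are hypothetical placeholders;
# the real list would be derived from puppet rather than hardcoded.
import json

JMX_EXPORTER_PORT = 7800  # hypothetical port
hosts = ['restbase1001.example.org', 'restbase1002.example.org']

targets = [{
    'targets': ['%s:%d' % (h, JMX_EXPORTER_PORT) for h in hosts],
    'labels': {'cluster': 'cassandra'},
}]

with open('/etc/prometheus/targets/cassandra.json', 'w') as f:
    json.dump(targets, f, indent=2)
```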
cc @aaron and @Krinkle in case this behaviour rings a bell with the work that was done in T171371: Investigate 30x increase in Jobrunner errors around the same time the increase started
Patches are merged and stats are being polled by prometheus in codfw and eqiad, I've added basic request rates by status to https://grafana.wikimedia.org/dashboard/db/thumbor
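For pulling the same numbers outside grafana, something like this against the Prometheus HTTP API should work; both the Prometheus URL and the metric name here are hypothetical:

```python
# Sketch: fetch per-status request rates from the Prometheus HTTP API.
# Both the Prometheus URL and the metric name are hypothetical.
import requests

PROM = 'http://prometheus.example.org/api/v1/query'  # placeholder
q = 'sum by (status) (rate(thumbor_response_count[5m]))'
data = requests.get(PROM, params={'query': q}).json()
for result in data['data']['result']:
    print(result['metric'].get('status'), result['value'][1])
```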
Unassigned from me since the deployment part is pending
Interesting, I seem to remember seeing something like this in production too but it self healed once puppet was running on the box
Fri, Aug 18
Ping? Not granting thumbor access for newly created wikis means files uploaded there won't get thumbnails.
Thu, Aug 17
Both issues have been fixed upstream! Pending deployment of latest version of librenms to production.
Wed, Aug 16
So the increase in swift requests seems to be cyclic (daily) and corresponds to dips in cache_upload hitrate, as per the graph below, and an equivalent spike in swift requests (zoomed in on a given day).
Is there an exception id or anything like that attached to the error? I can't find anything related to that in logstash ATM
Indeed it looks like librenms sends both metrics with whitespace in the name and metrics without values:
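A simple validity check against the graphite plaintext format ("name value timestamp") would catch both classes; a sketch:

```python
# Sketch: validate graphite plaintext lines ("name value timestamp").
# Whitespace in the metric name shows up as >3 fields, a missing value
# as <3, so a field count plus numeric checks catches both.
def valid_graphite_line(line):
    fields = line.split()
    if len(fields) != 3:
        return False
    name, value, timestamp = fields
    try:
        float(value)
        float(timestamp)
    except ValueError:
        return False
    return True

assert not valid_graphite_line('ports.Gi0/1 bits in 1234 1503000000')  # whitespace in name
assert not valid_graphite_line('ports.Gi0/1.bits_in 1503000000')       # no value
assert valid_graphite_line('ports.Gi0/1.bits_in 1234 1503000000')
```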
I couldn't find the corresponding File: page for that file right away; anyway, IIRC C-T is set by mediawiki at upload time, so I presume something went wrong there on the last upload. An interesting audit would be to check what C-T we're sending back for upload.w.o; some types like that definitely shouldn't be sent at all.
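Spot-checking a single original is trivial; a sketch, with a placeholder URL rather than the file from this task:

```python
# Sketch: spot-check the Content-Type served for a single original.
# The URL is a placeholder, not the file from this task.
import requests

url = 'https://upload.wikimedia.org/wikipedia/commons/a/ab/Example.jpg'
r = requests.head(url, timeout=10)
print(r.status_code, r.headers.get('Content-Type'))
```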
Sun, Aug 13
FTR this happened again last night (UTC). I'm currently working on having thumbor run on stretch in T170817, which will also bring a newer gs; in my quick experiments I couldn't reproduce the lockup we've seen with the files above. This, together with the per-filetype throttling that @Gilles mentioned, should help with mitigation.
@Gilles I can reproduce at will the test failure above on stretch, thoughts?
This just happened again, any thoughts on what I wrote in T159922#3492238? Namely that xpra might not necessarily be the root cause.
Sat, Aug 12
Thu, Aug 10
- Update thumbor package to latest upstream (fixes the pillow dep; all fixes from @Gilles have been merged upstream)
I started playing with thumbor on stretch, and building the package on copper yields an error with pillow 4, whereas thumbor wants pillow 3 out of the box.
Wed, Aug 9
Below is a list of the top 20 files that failed to get converted today; unsurprisingly, lots of pdfs there.
Checking the first file, it seems ghostscript hangs even though the pdf is only 45MB. Next I'll try the conversion with stretch's ghostscript and see if the behaviour is the same.
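To keep the testing from wedging, I'd run the conversion under a hard timeout; a sketch with generic gs flags (not necessarily what thumbor passes):

```python
# Sketch: rasterize the first page of a suspect pdf under a hard
# timeout so a hanging gs gets killed instead of piling up. The gs
# flags are generic, not necessarily what thumbor passes.
import subprocess

def convert_with_timeout(pdf_path, timeout=60):
    cmd = ['gs', '-dNOPAUSE', '-dBATCH', '-dSAFER',
           '-sDEVICE=png16m', '-r150',
           '-dFirstPage=1', '-dLastPage=1',
           '-sOutputFile=/tmp/page-%d.png', pdf_path]
    try:
        subprocess.run(cmd, timeout=timeout, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return True
    except subprocess.TimeoutExpired:
        return False  # gs ran past the deadline and was killed

print(convert_with_timeout('suspect.pdf'))
```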
AFAICT from the thumbor dashboard, at the time of the outage it was the ghostscript engine (and thus PDF processing) spiking up in its request time.
LGTM, just a nit
Thanks @herron! Indeed the check is slow when the raid controller is busy and the machines have lots of traffic
Mon, Aug 7
Sun, Aug 6
Since thumbor is in production now, I'm bumping the priority: container perms need to be correct for new wikis.
Thu, Aug 3
CC'ing Operations here too for wider distribution
Wed, Aug 2
Looks like we're back, thanks @Cmjohnson!
Chatted with @chasemp about this today; the easiest way forward seems to be setting up an emulated check with thresholds for failures to load content. https://commons.wikimedia.org/wiki/Special:NewFiles is the easiest target, as it is full of recent thumbnails that should just work.
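A rough sketch of what such a check could look like (the regex, threshold, and nagios-style exit codes are illustrative only):

```python
# Sketch of the emulated check: fetch Special:NewFiles, extract the
# thumbnail URLs, and alert when too many fail to load. The regex,
# threshold and nagios-style exit codes are illustrative only.
import re
import sys
import requests

PAGE = 'https://commons.wikimedia.org/wiki/Special:NewFiles'
THRESHOLD = 0.1  # hypothetical: tolerate up to 10% broken thumbs

html = requests.get(PAGE, timeout=10).text
urls = re.findall(r'(?:https:)?//upload\.wikimedia\.org/[^"\s]+/thumb/[^"\s]+', html)
thumbs = {u if u.startswith('https:') else 'https:' + u for u in urls}

failed = sum(1 for u in thumbs
             if requests.get(u, timeout=10).status_code != 200)

if thumbs and failed / len(thumbs) > THRESHOLD:
    print('CRITICAL: %d/%d thumbnails failed to load' % (failed, len(thumbs)))
    sys.exit(2)
print('OK: %d/%d thumbnails failed to load' % (failed, len(thumbs)))
```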
The swap by @Cmjohnson worked!
Resolving now as the nutcracker collector works again on scb, will reopen depending on what upstream decides re: https://github.com/twitter/twemproxy/issues/532