Tue, Aug 14
We can't easily delete inside indices, no. Dropping old indices is cheap compared to actually looking inside and deleting only specific data. I'll clarify in the task description that this is a temporary band-aid though, until we get more logstash hardware.
Also, nothing is logged on stdout for a non-existent host, and conftool exits 0. Ditto for a non-existent service.
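A minimal repro sketch, assuming the usual confctl select/get invocation (the hostname below is made up):

  # nothing on stdout and exit code 0, even though the host doesn't exist
  confctl select 'name=doesnotexist1001.eqiad.wmnet' get
  echo $?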
Not having pre-generated thumbnails at switchover time will have a significant impact, in the sense that thumbor in codfw can get overloaded with thumbnails missing from both varnish and swift. Additionally, it'd mean we'd have to keep relying on swiftrepl as more than the safety net it was originally designed to be. Hope that helps!
Mon, Aug 13
I've renewed the certs on restbase-dev* and ran puppet. Next up is a cassandra rolling restart to pick up the certs.
re: ms-be1040 it can be moved back to the old switch any time
Wed, Aug 8
I've added java threads and heap bytes to the dashboard, looks like there's a thread leak on 2 out of 3 hosts (unclear though if that's involved in packet loss)
For both questions the answer is "yes": it is something we wrote and should be trivial to backport to jessie/stretch. I believe a simple rebuild for stretch/jessie will do.
Reopening since I just saw a brief 40 packets/s loss on logstash1008
[looking at the bikeshed] to me "monitor" is a bit too generic; also this is likely to be a single-use box (i.e. only icinga), so icinga1001 would work better IMO
I am not seeing packet loss anymore after moving to persisted queues, so I'm resolving this, though feel free to reopen. There is still the issue of slow pipelines of course, which hopefully we'll get more insight into once logstash metrics are exported into prometheus.
@Lea_WMDE what's your LDAP username to be added to wmde group?
I see hwalls is already in wmf ldap group:
To clarify, the procedure to request a developer account is here: https://www.mediawiki.org/wiki/Developer_account
This request in particular is to create your user on the WMF cluster. Wikitech user creation is self-service; please create an account there too and let us know your username. The wikitech and wikipedia accounts are not related though, AFAIK.
Since this request is expanding root scope to other boxes I believe it'll need to be put up at the next SRE meeting on Monday
Public key swapped
Tue, Aug 7
Indeed it can happen, since the alert looks at errors over four days; if no new errors come in, the alert will recover.
I've enabled disk-persisted queues in logstash. It's early to tell, but it looks like that "fixed" (papered over) the issue, so slow pipelines and outputs don't affect inputs anymore. I'll remove the daily restart later today if things look in order.
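For reference, this boils down to a few logstash.yml settings along these lines (values are illustrative, not necessarily what ends up in puppet):

  # disk-persisted queue between inputs and filters/outputs
  queue.type: persisted
  queue.max_bytes: 1gb        # cap on disk space used by the queue
  path.queue: /var/lib/logstash/queue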
This indeed persists and triggers "mediawiki exceptions" alerts due to the high volume of attempts, e.g.:
Reopening, looks like tegmen is suffering from lots of nsca processes again :(
Mon, Aug 6
Can you try again?
Thanks @RobH ! Yeah role spare makes sense in this case.
Ack, thanks for the report @revi ! I'll defer to people more familiar with uploadstash.
No script, no; just a review like the one above. I'll deploy that later today.
OCG isn't in service anymore
Nowadays we're deprecating udp2log, though the issue of sending logs from logstash out elsewhere still stands. cc Parsing-Team for their opinion on what sort of log export they would like to see
The daily restart is in place now, and a packet loss alert too. Unfortunately packet loss shows up some hours after a restart as well, with not only the syslog receive buffer filling up but also the gelf port's (12201 udp). In addition to that, logstash complains about errors while receiving gelf, so possibly related:
Thanks @RobH !
I believe that's because thumbor has to know about private containers, I've proposed https://gerrit.wikimedia.org/r/c/operations/puppet/+/450539 and we should update the new wiki creation checklist to include this step too.
Fri, Aug 3
We'll need to add jmx_exporter to Logstash too, to get JVM stats as we do for most other JVMs on the fleet.
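In practice that means loading the exporter agent into the logstash JVM, roughly like this (jar path, port and config location are assumptions, not what puppet will end up using):

  # /etc/default/logstash: extra JVM options picked up by the logstash service
  LS_JAVA_OPTS="-javaagent:/usr/share/java/jmx_prometheus_javaagent.jar=9299:/etc/jmx_exporter/logstash.yaml"

  # minimal /etc/jmx_exporter/logstash.yaml: default JVM metrics, lowercase metric names
  lowercaseOutputName: true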
Slightly related, I asked syslog logstash input upstream to add settings for receive buffers: https://github.com/logstash-plugins/logstash-input-syslog/issues/50
UDP loss has been minimized now, though even with the current settings I've seen the receive buffer spike to ~1.5MB before getting drained. Short term, what we could do is spool syslog traffic to disk (via logstash itself with persistent queues, or via rsyslog) instead of relying on being fast enough to drain the receive buffer.
Thu, Aug 2
Stalling this; it might happen again, and upstream will likely have mitigations in linux 4.19.
I poked at logstash a little more and packet loss has gone away after these changes (not yet in puppet):
- pipeline.workers: 1 was explicit in the logstash configuration; I commented it out so that #workers == #CPUs work on the pipeline
- the multiline filter isn't thread safe, so even with the setting above I had to remove the multiline filter for now (it is used only by hhvm-fatal)
- increased the default receive buffer to 4MB on all logstash hosts, which is enough headroom for logstash to catch up (sketch below)
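The buffer bump is essentially the following (done by hand for now; the puppet version may use a different mechanism):

  # raise the default/max socket receive buffers to 4MB
  sysctl -w net.core.rmem_default=4194304
  sysctl -w net.core.rmem_max=4194304
  # plus leaving pipeline.workers unset in logstash.yml so it defaults to the CPU count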
A ton of these messages (from tcpdump), likely spamming syslog
After the restart on logstash1007, syslog shows up as taking about 90% of the time:
Wed, Aug 1
Indeed, Thumbor rate-limits rendering of a given original after a number of failed attempts, see https://wikitech.wikimedia.org/wiki/Thumbor#Throttling
My guess would be that 5000px exceeds thumbor's memory limit for rendering
Odd that UploadStash would fail; can you try again? Does the 503 report any further error?
Tue, Jul 31
I took a look at both exporters and it seems the https://github.com/BonnierNews/logstash_exporter metrics are more Prometheus-idiomatic (e.g. metric naming, usage of tags), so I think we should go for that one.
Mon, Jul 30
Sorry for the delay! I've merged the patches so haproxy is now running alongside nginx on thumbor instances.
Things still missing off the top of my head:
- Prometheus stats (via https://github.com/prometheus/haproxy_exporter, see the sketch after this list)
- Firewall rules
- Queueing behaviour testing
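For the Prometheus stats item, the exporter gets pointed at haproxy's stats CSV endpoint, something along these lines (address, port and stats URI are assumptions):

  # scrape haproxy's stats page in CSV form and expose it as Prometheus metrics
  haproxy_exporter --haproxy.scrape-uri='http://127.0.0.1:8404/stats;csv'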
I've inquired upstream; one of the suggested approaches is to run with page poisoning. I'll do that on one host in codfw. Also, this issue will likely be checked for and fixed in linux 4.19.
Indeed, yet another case of the 32nd bit flipping, interestingly on a codfw host, where we haven't been seeing this yet:
Fri, Jul 27
I checked grafana 5.2 and it correctly skips invalid dashboards loaded from disk, mentioning which dashboards are failing to load. Resolving.
Thu, Jul 26
Parent task resolved!
Parent task resolved!
Parent task resolved!
Parent task resolved!
It has been decided at the SRE weekly meeting to leave the deprecation page up indefinitely instead of removing the DNS name. I've updated enwiki pages as well, resolving this task and subtasks.
Wed, Jul 25
I believe the discrepancy comes from the fact that the whisper file for that metric uses average as its aggregation method, not sum. That's likely because the file was created a long time ago, before we fixed modules/role/manifests/graphite/base.pp so that .sum metric files aggregate with sum.
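If that's the case, the existing file can be fixed with the whisper utilities, e.g. (the metric path below is a placeholder):

  # check the current aggregation method
  whisper-info.py /var/lib/carbon/whisper/path/to/metric/sum.wsp aggregationMethod
  # switch it to sum
  whisper-set-aggregation-method.py /var/lib/carbon/whisper/path/to/metric/sum.wsp sum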
Looks like thumbor/imagemagick are running into resource exhaustion when trying to scale this image (error below), resulting in 500s. Then poolcounter kicks in for this original due to the repeated 500s while scaling, and 429s are returned instead.
Tue, Jul 24
Host is back in service
Thanks for the update @Cmjohnson, not particularly urgent but it would be nice to have graphite1004 before the end of the quarter
Mon, Jul 23
@Cmjohnson what's the status for graphite1004?