Fri, Oct 11
Hi @sbassett, apologies for the delayed reply! I'm not sure whether deployment-prep access is all-or-nothing for services or shell access, in the sense that access to https://logstash-beta.wmflabs.org uses a single shared user/password, with the credentials stored in a file on one of the deployment-prep hosts.
Thu, Oct 10
+1 on at least a week's worth of data
Thanks for reaching out @jcrespo, happy to help brainstorm monitoring and which metrics make sense for this use case. We can do it either on-task or over a hangout for higher bandwidth
Wed, Oct 9
Tue, Oct 8
Indeed we'd need to upgrade its firmware as per T141756: audit / test / upgrade hp smartarray P840 firmware; holding off until we have new swift hw in place in eqiad, to not "jinx it" if we possibly can
Mon, Oct 7
I like the idea of effectively proxying per datasource (Grafana upstream issue) as opposed to setting HTTP_PROXY + NO_PROXY in Grafana's environment; see the sketch below.
[agreed on the rest]
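To illustrate the difference, a minimal sketch (plain Python with requests, not Grafana code; the proxy and datasource URLs are made up) of why a per-datasource proxy setting is nicer than a process-wide HTTP_PROXY + NO_PROXY:

```
import os

import requests

# Process-wide approach: every outbound request from the process inherits
# these variables, and every exemption has to be enumerated in NO_PROXY.
os.environ["HTTP_PROXY"] = "http://webproxy.example.org:8080"  # made-up proxy
os.environ["NO_PROXY"] = "localhost,.example.org"              # made-up exemptions

# Per-datasource approach: the proxy is a property of the datasource, so
# only the datasources that actually need it are routed through the proxy.
DATASOURCES = {
    "prometheus-local": {
        "url": "http://localhost:9090/api/v1/query",
        "proxies": None,
    },
    "prometheus-remote": {
        "url": "https://prometheus.remote.example/api/v1/query",  # made up
        "proxies": {"https": "http://webproxy.example.org:8080"},
    },
}

def query(datasource: str, expr: str) -> dict:
    ds = DATASOURCES[datasource]
    resp = requests.get(ds["url"], params={"query": expr}, proxies=ds["proxies"])
    resp.raise_for_status()
    return resp.json()
```

With the environment approach a single NO_PROXY list has to cover all datasources at once; with per-datasource settings each one is self-contained.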
@bd808 I'm echoing what @MoritzMuehlenhoff said (thanks!) and going with Buster seems worthwhile to me. Specifically, Grafana 6 is a safe upgrade AFAIK (cc @CDanis) and ditto for graphite. @Phamhi I'd be happy to help review patches for Buster support!
I'm +1 on turning on rebase-if-necessary and seeing how things play out; if they don't work out for some reason, it's an easy revert
Wed, Oct 2
@TheAnarcat thanks indeed for taking the time to look into this!
Didn't realize this was normal and thought it was hp gen10-specific! Since it happens on other hosts too, I wouldn't spend too much time on it; I'm OK with even resolving/declining the task
Setting as stalled for now, the immediate issue has been band-aided
Please note that putting these systems in production is becoming urgent; is there a status update and/or ETA?
Tue, Oct 1
Mon, Sep 30
Resolving again as this seems to have gone away on Sept 13th, though the cause is still unclear to me
It has been observed that during times of high Kafka traffic (i.e. when a backlog develops because Logstash can't keep up), the CPU on the logstash hosts isn't maxed out: typically only one thread uses close to one core while the rest are mostly idle. That suggests to me that one limiting factor at the moment might be Kafka consumer parallelism; specifically, we have three partitions per topic by default and three logstash ingester hosts per site, so under normal circumstances there's one partition per host being consumed.
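For reference, a minimal sketch (assuming the kafka-python admin client; the broker address and topic name below are made up) of what raising the partition count would look like, so that each ingester host could consume more than one partition:

```
from kafka.admin import KafkaAdminClient, NewPartitions

# Hypothetical broker address; in practice this would point at the main cluster
admin = KafkaAdminClient(bootstrap_servers="kafka1001.example.org:9092")

# With 3 partitions and 3 logstash hosts there's one partition per host;
# going to e.g. 6 partitions would allow two consumer threads per host.
admin.create_partitions({"example-logging-topic": NewPartitions(total_count=6)})
```

One caveat: increasing the partition count changes the key-to-partition mapping for keyed messages, though for log traffic that's usually not a concern.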
Graphite is on its way out eventually, declining
I don't recall seeing this error anytime recently; boldly declining, and we shall reopen if it comes back
@mmodell it seems to me that with Phatality deployed to production we can resolve this task?
Fri, Sep 27
Completed! See T228878 for subtask status
Resolving as this is complete. The ipsec alerts subtask is still open, pending a firing of the legacy/spammy alerts to compare against the new ones, but is otherwise done; systemd alerts have been stalled pending better aggregation/grouping capabilities.
Will be done as part of T141756: audit / test / upgrade hp smartarray P840 firmware, resolving
Thumbnails work for me on that wiki now, resolving
Hasn't recurred through multiple depool cycles, and in the meantime swift has been upgraded too, declining
Row balancing has occurred naturally as we've cycled through hardware
I see this task is for 6x hosts while parent T228461 is for 9x; wanted to make sure that's expected/wanted?
@Cmjohnson the host is ready for decom, thanks!
Thu, Sep 26
This has happened in the meantime!
Still ongoing from time to time (e.g. in September)
My two cents: since deployment-logstash03 has been set up in T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster as the 02 stretch replacement, my suggestion would be to bring up 03 and ditch 02 in the process
Another type of spam that we have observed is PHP-generated but comes through via syslog + apache: https://logstash.wikimedia.org/goto/c5afb82f07a0249524d66b74c82e55c4
Wed, Sep 25
I tested the cookbook on ms-be1027 in T233289: the host is powered down and not coming back (faulty hw), and the cookbook stopped when trying to reach the host, whereas IMHO it should have continued (and/or prompted) with the remaining steps
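Something like this control flow is what I'd have expected; a minimal sketch only (not the actual cookbook code; the step objects and their needs_host attribute are hypothetical):

```
import socket

def host_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    # Cheap reachability probe: can we open a TCP connection to ssh?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def decommission(host: str, steps: list) -> None:
    if not host_reachable(host):
        answer = input(f"{host} unreachable (powered down?); continue with remaining steps? [y/N] ")
        if answer.strip().lower() != "y":
            return
        # Skip only the steps that must run on the host itself
        steps = [s for s in steps if not s.needs_host]
    for step in steps:
        step.run()
```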
Indeed, the decom script failed on this host, which is powered down already; the full trace is:
This is ready for you to take over @Papaul, thanks!
Should be all deployed now, ready for another round of plugin-install
Tue, Sep 24
Other Graphite producers found while auditing metrics changed in the last 7d
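For reference, one way such an audit can be done (a sketch only, assuming whisper files on local disk; the storage path below is an assumption):

```
import os
import time

WHISPER_ROOT = "/var/lib/graphite/whisper"  # assumed whisper storage path
CUTOFF = time.time() - 7 * 24 * 3600        # metrics written in the last 7 days

for dirpath, _dirnames, filenames in os.walk(WHISPER_ROOT):
    for name in filenames:
        if not name.endswith(".wsp"):
            continue
        path = os.path.join(dirpath, name)
        if os.path.getmtime(path) > CUTOFF:
            # Map the on-disk path back to the dotted metric name
            metric = os.path.relpath(path, WHISPER_ROOT)[: -len(".wsp")]
            print(metric.replace(os.sep, "."))
```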
Reverted the above at https://gerrit.wikimedia.org/r/c/operations/puppet/+/538848 because a couple of things were not working yet, see comments in the review
Thanks for the report! The disk throughput panel in the cluster overview is now fixed and showing bytes; could you try again?
I'm done with both graphite200 hosts, good to go on my end
Mon, Sep 23
Sounds great! Adding ssh + ping for starters should be quite easy in puppet
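For illustration, the checks themselves boil down to something like this sketch (in production they'd be Icinga checks managed via puppet; the target hostnames are made up):

```
import socket
import subprocess

def check_ping(host: str) -> bool:
    # One ICMP echo request with a 2s deadline, using the system ping binary
    result = subprocess.run(["ping", "-c", "1", "-w", "2", host],
                            capture_output=True)
    return result.returncode == 0

def check_ssh(host: str, timeout: float = 5.0) -> bool:
    # Only verifies that something is listening on tcp/22
    try:
        with socket.create_connection((host, 22), timeout=timeout):
            return True
    except OSError:
        return False

for host in ["host1.example.org", "host2.example.org"]:  # made-up targets
    print(host, "ping:", check_ping(host), "ssh:", check_ssh(host))
```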
Completed! I've updated the RIPE Atlas documentation at https://wikitech.wikimedia.org/wiki/RIPE_Atlas#Run_tests_from_the_command_line