Sorry I missed that, thanks for pinging me on T234900.
Tue, Oct 15
- A certain configured backup is not active (as far as I can see, configurations are not cleaned up on decommission; something to look at)
There are a number of ready-made icinga plugins at https://exchange.nagios.org/directory/Plugins/Backup-and-Recovery/Bacula
Same as eqiad, LGTM
Mon, Oct 14
In the interest of splitting off from this task what is probably going to be something of a discussion, I've created subtask T235437 for the rate limiting functionality of RESTBase/RESTrouter.
For what it's worth, the poolcounter approach is probably the saner one long term. And per https://www.mediawiki.org/wiki/PoolCounter, the protocol is simple enough that putting together a PoC to gauge whether it is a valid replacement shouldn't take too much work.
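To make that a bit more concrete, here is a rough sketch of what such a PoC client could look like in Python. The command syntax and the default port (7531) come from the wiki page above; the rate-limit key, thresholds, and class name are made up for illustration and are not actual values we'd use:

```
# Rough PoC sketch of a PoolCounter client, based on the line-oriented TCP
# protocol documented at https://www.mediawiki.org/wiki/PoolCounter.
# Command syntax and the default port (7531) come from that page; the
# rate-limit key and thresholds below are made up for illustration.
import socket

class PoolCounterClient:
    def __init__(self, host='localhost', port=7531):
        self.sock = socket.create_connection((host, port))
        self.io = self.sock.makefile('rw', newline='\n')

    def _cmd(self, line):
        self.io.write(line + '\n')
        self.io.flush()
        return self.io.readline().strip()

    def acquire(self, key, workers=4, maxqueue=10, timeout=1):
        # ACQ4ME waits for a worker slot on `key`, or times out.
        return self._cmd(f'ACQ4ME {key} {workers} {maxqueue} {timeout}')

    def release(self):
        return self._cmd('RELEASE')

# Hypothetical usage: allow at most 4 concurrent requests per client IP.
pc = PoolCounterClient()
if pc.acquire('ratelimit:10.0.0.1') == 'LOCKED':
    try:
        ...  # proxy the request
    finally:
        pc.release()
else:
    ...  # QUEUED/TIMEOUT/ERROR: reject with a 429
```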
Fri, Oct 11
The last issue we had with the bacula host itself was some sort of storage degradation/failure, no?
Thu, Oct 10
Wed, Oct 9
Tue, Oct 8
We've been calling this out as a blocker to moving session storage to production, so I guess what I'm trying to determine is: Are we still blocked?
Mon, Oct 7
Thu, Oct 3
Tue, Oct 1
I'll resolve this. All workers now have the check, with 91 children each; we will be alerted if this deviates too much from the configured thresholds.
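For the record, a minimal sketch of what a check along those lines could look like. This is not the deployed check; the master process name and the allowed deviations are hypothetical placeholders:

```
#!/usr/bin/env python3
# Illustrative nagios/icinga-style check: count a master process's children
# and alert when the count deviates from the expected 91. The process name
# and the warning/critical deviations are hypothetical.
import sys
import psutil

EXPECTED = 91
WARN, CRIT = 5, 10  # hypothetical allowed deviations

def child_count(master_name):
    for proc in psutil.process_iter(['name']):
        if proc.info['name'] == master_name:
            return len(proc.children())
    return None

count = child_count('worker-master')  # hypothetical process name
if count is None:
    print('UNKNOWN: master process not found')
    sys.exit(3)
delta = abs(count - EXPECTED)
if delta >= CRIT:
    print(f'CRITICAL: {count} children, expected {EXPECTED}')
    sys.exit(2)
if delta >= WARN:
    print(f'WARNING: {count} children, expected {EXPECTED}')
    sys.exit(1)
print(f'OK: {count} children')
sys.exit(0)
```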
Mon, Sep 30
@Eevans Logs from kubernetes make it to logstash now, although we lack one last change in logstash to correctly parse the JSON fields (for container runtime engine reasons they are JSON-in-JSON). We'll get on that soon.
Logs are now making it to logstash, so I am going to boldly resolve this. That being said, there is a minor straggler that needs to be resolved, namely the JSON-in-JSON parsing of logs: most services ship logs in JSON format, which then gets wrapped in docker's JSON. Discussion is ongoing in https://gerrit.wikimedia.org/r/539519, although the approach in that patch will probably not be chosen and the JSON-in-JSON parsing will instead be done in logstash.
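To illustrate the JSON-in-JSON issue: the container runtime wraps every service log line (which is itself JSON) in its own JSON envelope, so a second parsing pass over the inner field is needed. A minimal sketch, assuming docker's json-file log driver envelope (the inner fields are invented for the example):

```
import json

# What arrives at logstash: docker's JSON envelope around the service's JSON line.
raw = ('{"log": "{\\"msg\\": \\"request served\\", \\"level\\": \\"info\\"}\\n", '
       '"stream": "stdout", "time": "2019-09-30T12:00:00.000000000Z"}')

envelope = json.loads(raw)           # first pass: docker's envelope
event = json.loads(envelope['log'])  # second pass: the service's own JSON
print(event['level'], event['msg'])
```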
Fri, Sep 27
We've sidestepped the problem for now by disabling IPv6 mapped addresses for ganeti hosts. This solves the chicken-and-egg problem, although we should arguably find a better way to configure IPv4 and IPv6 addresses on our hosts instead of relying on tricks like setting the token. I'll resolve this for now.
Thu, Sep 26
Changing priority to normal since the host is now up and running, but we have a chicken-and-egg problem to solve here.
Found it. I had to comment out a line in /etc/network/interfaces.
I don't think this is hardware related.
- set up the rate-limiting DHT inside k8s for RESTRouter (this is currently disabled, and not having rate-limiting is not acceptable)
@Halfak, this is pretty much blog post material. Anyway, I'll try to summarize below what I think happened and what we can do.
Wed, Sep 25
restrouter is up and running, LVS is set up, and discovery records have been merged. I think the migration can start. A draft dashboard is at https://grafana.wikimedia.org/d/ZA_JiypZk/restrouter; however, restrouter differs enough from the other service-runner based services in the statsd metrics it emits that I don't feel qualified to delve deeper into this. Feel free to amend it to your needs.
The service has long been deployed and even has nice dashboards in grafana; resolving.
Tue, Sep 24
And the config (sanitized)
Sat, Sep 21
Fri, Sep 20
After merging and deploying https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/538242/ and https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/538241/, it's more or less the same.
Thu, Sep 19
The previous one was with v1.0.0-RC2.
For starters, let me say that the service owners should be the ones setting the SLIs/SLOs, and those should be ones the team can commit to. They are also not set in stone; they can be amended to better reflect present reality (e.g. if the SLOs were set so optimistically that it's impossible to meet them, or so pessimistically that they are easily met despite prolonged outages of the service), as long as the changes are clearly communicated and advertised (updating the wiki page and sending an email should suffice).
I like this approach; it has the benefit of removing toil from releng and abstracting CI away from repo owners, as they only have to care about a well-documented and well-defined contract point.
Wed, Sep 18
Going forward with Plan #1 (which I also find better)
https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?refresh=1m&orgId=1 has a tentative dashboard. Feel free to augment it.