Thu, Apr 19
With cache=none now set across the whole cluster for unrelated reasons, this is unblocked. In the meantime, jessie-backports has been upgraded to 2.8. Fortunately the changelog does not have any worrying items in it. The upgrade will require a round of VM reboots, but otherwise looks OK. I'll empty an eqiad ganeti host, upgrade it to 2.8 and move a few VMs to it for testing.
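For the record, a minimal sketch of the drain-and-test plan, assuming standard Ganeti tooling (node and instance names are illustrative):

    # Move all primary instances off the node so it can be upgraded
    gnt-node migrate -f ganeti1005.eqiad.wmnet
    # Once the node is on 2.8, live-migrate a test VM; for DRBD instances
    # this moves them over to their secondary node
    gnt-instance migrate -f testvm1001.eqiad.wmnet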
All VMs have been migrated to cache=none. I'll resolve this; hopefully we will not run into this issue again.
1 month with no incidents. I'll proceed with rebooting all ganeti VMs in row C and then move on to codfw.
Wed, Apr 18
Since this was a [Discuss] task, resolved was conceptually correct.
Tue, Apr 17
Mon, Apr 16
https://github.com/ether/etherpad-lite/commit/9daade0b95bbc5443637977652d3cd0dbc44e112 fixes this, but it's not yet in a release. I've imported it locally and have been testing it, but I'll hold off on the upgrade for a bit longer.
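For context, importing such a fix locally amounts to something like the following sketch, assuming a checkout of our etherpad-lite tree with upstream added as a remote (the remote name is illustrative):

    # Pick the upstream fix onto our tree; the hash is the commit linked above
    git fetch upstream
    git cherry-pick 9daade0b95bbc5443637977652d3cd0dbc44e112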
Fri, Apr 13
Oh! That's great. Now I can do the following locally on my minikube instance
I went ahead and pulled 1.6.4 and 1.6.5, but they suffer from https://github.com/ether/etherpad-lite/issues/3378, so given that we are not currently vulnerable I'll refrain from upgrading.
Thu, Apr 12
Wed, Apr 11
Wed, Apr 4
I am gonna close this as declined. Feel free to reopen though.
For what it's worth, I don't like the idea of adding anything like that to network::constants. I don't even like the current $special_hosts construct (it has gotten out of hand), and I am the one who started it. ferm rules should not be defined using the macro approach, since it is not immediately clear how a macro is constructed, and thus it is difficult to reason about. The macro is only populated on the hosts, using ERB; it's uppercase, and git grepping for it in our repo only reveals the uses, not the definition. Forcing ourselves to make the mental jump from that to network::constants is something we should avoid. Instead we should be using role-specific hiera lookups.
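To make the discoverability point concrete (the macro name here is hypothetical):

    # Grepping for a ferm macro only turns up its uses; the definition is
    # assembled on the hosts from network::constants via ERB, so it never
    # appears in the repo as such
    git grep -n 'SOME_SPECIAL_HOSTS'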
Mon, Apr 2
Fri, Mar 30
Thu, Mar 29
"Wasted" is a little strong here. It provides a real and immediate current benefit for however long it's running. And, Zotero is a project that is used in many other contexts so worse comes to worse you're helping the open source community at large ;).
Wed, Mar 28
As an FYI, T187194 was filed in February in the context of https://www.mediawiki.org/wiki/Code_stewardship_reviews. At this point in time it remains unclear if and when an upgrade can/will happen.
Maybe I can help, but I'll need a bit more information as to what the problem is. Which tool fails, with what invocation, and what is the error?
Tue, Mar 27
- Network policy has been validated
- statsd_prometheus_exporter has been validated and prometheus is scraping each pod and collecting data
- The logging approach has not been validated; it turns out we need to upgrade components for this to work and reevaluate the sidecar approach. However, logging works just fine for the mathoid service, with each pod sending logs directly to logstash in GELF format as well as logging to stdout, making the logs accessible via kubectl logs (see the sketch below).
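A quick way to eyeball the stdout side of this, assuming the mathoid pods run in a namespace of the same name (names are illustrative):

    # List the pods, then tail the stdout logs of one of them
    kubectl -n mathoid get pods
    kubectl -n mathoid logs --tail=20 <mathoid-pod-name>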
Mon, Mar 26
~2 weeks with no incidents yet. That's very encouraging, but we've been in this position before, around the New Year holidays. Given that the Easter holidays are approaching, I am reluctant to make any changes, so I think this should stay in the waiting state for another ~2 weeks.
https://gerrit.wikimedia.org/r/#/c/421935/ for allowing access to staging-related clients
This is old enough, and I've recently upgraded to 1.7.10 following the documentation at https://wikitech.wikimedia.org/wiki/Tools_Kubernetes#Building_debian_packages. The comment above was using the wrong version of the tarball from the kubernetes project, IIRC. I'll resolve this, feel free to reopen
Package built and uploaded to stretch-wikimedia and jessie-wikimedia. Resolving this, feel free to reopen.
I've just added minikube to the thirdparty/ci component for stretch-wikimedia. I'll resolve this, feel free to reopen
I've just added docker-ce under the thirdparty/ci component for stretch as well. Resolving, feel free to reopen
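For anyone wanting to consume these, a minimal sketch assuming the standard apt.wikimedia.org layout (the sources.list filename is illustrative):

    # Enable the thirdparty/ci component on a stretch host
    echo 'deb http://apt.wikimedia.org/wikimedia stretch-wikimedia thirdparty/ci' \
        > /etc/apt/sources.list.d/thirdparty-ci.list
    apt-get update
    apt-get install minikube docker-ce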
This has been achieved, and even surpassed the goal by reaching 100%. I'll happily resolve this
- Backup servers (heze/helium in the current incarnation) will definitely have 10G (we've already budgeted for it).
- Ganeti hosts are not so clear-cut. Per grafana eqiad and grafana codfw, we still don't need 10G there. codfw's traffic is the representative one, since the latest large spikes/plateaus in eqiad are probably due to me doing many very heavy IO tests for T181121. Since this is long-term planning and T181121 is probably resolved, we should wait a few weeks and see if that theory holds. Of course we can only do simple projections and can't really predict the future, so it's difficult to say for sure. My hunch is that we don't need 10G now and probably won't need it on ganeti hosts for another 1-2 years. After that, I don't know.
- Kubernetes hosts have just gone into production, are handling very minimal traffic, and the entire idea of that infrastructure is to scale out, not up. So even if we end up running kafka stream processing (or anything, for that matter) in kubernetes, 10G seems to me a waste of money, and I agree on the "pretty sure we won't need 10G".
Sat, Mar 24
I had a deeper look. The original build process uses versioning.mk to figure out the version. We don't use that in the debian package we ship, so our package has version v2.8+unreleased. I am not sure we need to go down the road of actually using versioning.mk during our build, or whether that adds value, given that we intend to use our own compatible tiller image
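For context, what versioning.mk effectively does is stamp the version into the binary at link time; a hedged sketch, with the variable path taken from helm 2's pkg/version as far as I know:

    # Derive the version from the latest git tag and bake it into the binary
    VERSION=$(git describe --tags --abbrev=0)
    go build -ldflags "-X k8s.io/helm/pkg/version.Version=${VERSION}" ./cmd/helm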
Fri, Mar 23
Just pass --tiller-image=docker-registry.discovery.wmnet/tiller:latest to helm init
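That is, in full:

    helm init --tiller-image=docker-registry.discovery.wmnet/tiller:latest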
Mar 22 2018
Unfortunately wmf4750 will not do after all. After we powered off and unracked helium, we discovered the RAID card was too big for the space available in the R430. We need either a different server from the spares or a new one :-(
This caught my eye and I merged it just now. With the number of nodes at 4401630061 (~4.4 billion), 40GB should indeed be enough to cache old node positions for a full planet import. As a side note, maps boxes have 128GB, so we are well below the 75% mark the osm devs suggest. maps-test boxes vary between 64GB and 92GB, but those are to be reclaimed and put out of service anyway.
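Back-of-the-envelope check, assuming the cache stores a fixed-size position (2 x 4-byte ints, so ~8 bytes) per node:

    echo $(( 4401630061 * 8 / 1024**3 ))   # => 32 (GiB), comfortably under 40GB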
The metrics part has been validated. In fact, https://grafana.wikimedia.org/dashboard/db/service-mathoid?orgId=1 currently has graphs generated from prometheus showing the same data as the ones generated directly from statsd
Mar 21 2018
Indeed. Here it is https://wikitech.wikimedia.org/wiki/Incident_documentation/20180314-ORES.
Mar 20 2018
Aside from having to tag the release locally with v0.25.0 so that gbp could generate the source, and having to use buster to build it, everything else worked out fine. Being Go, it even worked on jessie, so I've already uploaded it to thirdparty/ci. I'll resolve this, feel free to reopen
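A sketch of the tagging workaround, assuming a git-buildpackage tree that expects v-prefixed upstream tags:

    # Tag the upstream source locally so gbp can generate the orig tarball
    git tag v0.25.0
    gbp buildpackage --git-upstream-tag='v%(version)s'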
This has been done. Resolving
Mar 19 2018
Yes, that's fine.
Mar 16 2018
I am guessing this was resolved and I am no longer needed.
Mar 15 2018
Mar 13 2018
All row A eqiad VMs have been rebooted with cache=none. We are now in a waiting period again.
Mar 12 2018
The scap targets that would benefit from this (namely the ores* boxes) now have git-lfs installed. @mmodell, do we also need this on the scap masters? I am not fully clear about the workflow that is going to be used here and where the git-lfs-managed files are going to be fetched from.
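For my own understanding, this is what I assume the target-side fetch looks like with stock git-lfs (the actual endpoint would come from the repo's .lfsconfig or the origin remote):

    # One-time setup of the smudge/clean filters, then fetch the large files
    git lfs install
    git lfs pull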
No complaints in 6 days; I consider the problem resolved. I'll keep this open for a few more days so that any reported problems find their way into it, and then I'll resolve the task as well.
cache=none tests over the weekend showed no problems. I'll find a quiet point in time during the day and restart all VMs in the cluster with that setting. Then we are in a waiting period for a while.
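For reference, a minimal sketch of the per-VM change, assuming it's Ganeti's KVM disk_cache hypervisor parameter we are flipping (the instance name is illustrative):

    # Set the disk cache mode, then reboot the instance for it to take effect
    gnt-instance modify -H disk_cache=none testvm1001.eqiad.wmnet
    gnt-instance reboot testvm1001.eqiad.wmnet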