relforge1001 can also be cleanly shut down and restarted. It will crash the relforge cluster, but that cluster is not expected to be highly available. I'll warn the search platform team about it; they are the only users of that cluster.
For elastic103[0-5], we should be fine just shutting them down. The theory is that we should be able to lose a full row and not worry too much about it.
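For that theory to hold, shard allocation needs to be row-aware so that no shard has all its copies in the same row. A minimal elasticsearch.yml sketch, assuming the cluster tags nodes with a "row" attribute (the attribute name and value are assumptions, not taken from this task):

```yaml
# Set per node, e.g. from the physical row the host is racked in:
node.attr.row: D
# Tell the allocator to spread shard copies across rows:
cluster.routing.allocation.awareness.attributes: row
```

With this in place, losing a full row should still leave at least one copy of every shard allocated elsewhere.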
I can confirm that prometheus-elasticsearch-exporter.service is masked on logstash nodes. I have not rebooted one of the nodes, but I think this is good enough to close this task.
my mistake, this ticket is still ongoing, the CI part isn't done yet.
This has been done for some time already
It looks like @RobH still needs to track this.
Since @Papaul says it is all done, I'll close this. No more need to track it on our side.
Mon, Jan 14
Validation was done by @Mathew.onipe.
Fri, Jan 11
The new .deb is available at https://people.wikimedia.org/~gehel/prometheus-elasticsearch-exporter/
Thu, Jan 10
new version deployed on all wdqs servers
Wed, Jan 9
Tue, Jan 8
the problem just struck again. Restarting tilerator did not immediately fix the issue. It should recover once populate_admin() has completed.
Dec 6 2018
elastic2001-2024 are ready for decommission. They are taken out of the cluster and can be shut down whenever you want (cc @Papaul)
We need to reassign some nodes between the psi and omega clusters, as removing old nodes would leave the clusters unbalanced between rows.
All servers configured.
Dec 5 2018
@Papaul: thanks! We'll take it from here, and notify you as soon as the old servers are ready for decommission.
Dec 3 2018
What we should probably do in this case is define default values for the hiera calls in profile::wdqs, and override only what needs to be different. At least for parameters where a default would make sense.
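Something like the following sketch, where the key name and default value are purely hypothetical examples:

```puppet
class profile::wdqs {
  # Explicit lookup with an in-code default; hiera only needs a key for
  # the clusters that differ. Key name and value are hypothetical.
  $heap_size = lookup('profile::wdqs::heap_size', String, undef, '16g')
}
```

That keeps the common case out of hiera entirely, at the cost of the default being less visible to people who only read the hiera data.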
We already have the git-commit-id plugin configured, which creates a properties file and adds it to the jars. So we should be able to load it and output whatever we need. There is probably a jar somewhere with the logic required to parse that properties file, but it's trivial enough that we should not add another dependency just for that.
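As a sketch of how trivial the parsing is: git.properties and the git.commit.id key are the plugin's defaults, and the local file below stands in for the one we'd extract from the built jar (e.g. via `unzip -p app.jar git.properties`; jar name hypothetical).

```shell
# Stand-in for the git-commit-id plugin output shipped inside the jar:
cat > /tmp/git.properties <<'EOF'
git.commit.id=abc123def456
git.build.version=0.3.2
EOF

# Pull out just the commit id:
sed -n 's/^git\.commit\.id=//p' /tmp/git.properties
```

In the Java code itself the same file can be read off the classpath with java.util.Properties, so no extra dependency is needed.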
Note that the maps-root group does not exist yet. It is obvious from the name that we want members of that group to have full root access on the maps servers, but the task description should state explicitly that this is a new group.
@Papaul it looks like elastic2037-39 already have entries as role(elasticsearch::cirrus) in site.pp and elastic2040-44 don't have any entry. I'll create them all as spares so that we can apply the elastic role with some level of control.
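A site.pp sketch for creating those entries as spares; the host regex covers elastic2037-2044 from this task, and the role name follows the usual convention for unprovisioned hosts (worth double-checking against the repo before merging):

```puppet
node /^elastic20(3[7-9]|4[0-4])\.codfw\.wmnet$/ {
    role(spare::system)
}
```

Hosts can then be moved to the elasticsearch role one by one, with some level of control.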
@Mathew.onipe patch deployed, can you validate that it works before moving the task to done?
Nov 30 2018
Nov 29 2018
https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/474266 has been merged and deployed, but the test queries are still not available on the wdqs servers. There is something I don't understand about the packaging. @Smalyshev could you have a look and point me to what I missed?
The new racking proposal looks good to me (new servers are still in the same row as the previous proposal, which is all I care about).
Nov 28 2018
Nov 27 2018
@Gehel for wdqs2006
The numbers above seem to indicate that we don't have a good signal-to-noise ratio, so an icinga check does not make much sense.
Nov 26 2018
The immediate issue is solved. I'd like a review from @Smalyshev to see if we have a good way to prevent this from happening again without increasing complexity too much.
Nov 23 2018
Reducing batch size seems to work, updates are processed again.
increasing heap stops the updater from crashing, but blazegraph refuses updates > 200M (we probably don't want to increase this limit)
Out of memory error on a bind:
Nov 22 2018
Nov 20 2018
Done, deployed, and tested
Nov 19 2018
Nov 15 2018
Looks like the new metric is flowing to prometheus
Nov 13 2018
this has been fixed already
Nov 9 2018
Nov 8 2018
The build succeeded. The new docs are published at https://doc.wikimedia.org/search-highlighter/experimental/, the publication date has been updated, and the site looks fine (minus the usual broken links).
Nov 7 2018
@Smalyshev we're good on this from my point of view. Could you check that running the updater manually (with the -S option to output to console, or with -v for verbose) works as you expect?
Nov 6 2018
Nov 5 2018
Approved in weekly SRE meeting for full root access to wdqs servers
Why not drop the wrapper completely and just allow running an arbitrary command in the project directory? That way, we'll be able to configure ./mvnw clean install && ./mvnw site site:stage (or whatever we need), and we won't have to care about transient state being propagated between containers. I'm not sure what value that wrapper brings, so I might be missing something.