Thu, Mar 15
Ok, let's go with wdqs-internal. The actual configuration is done in another ticket. Let's close this.
Preliminary decommissioning steps are done (pending the merge of https://gerrit.wikimedia.org/r/#/c/419702/). A few notes:
Wed, Mar 14
@Etonkovidova your test is actually hitting the production maps cluster. The Kartographer frontend is on betalabs, but it depends on the production tiles / images. It looks like this is using the snapshot service (which I don't know much about). So it looks like the snapshot service does not follow the same rules for headers. That's surprising, and probably a mistake, but I'm not familiar enough with the use case to know if there is a good reason to have a lower cache expiration on snapshots.
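For reference, a minimal sketch (Java, standard library only) of how such a header check can be scripted; the tile URL below is just an example, not necessarily the URL exercised by the test:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CacheHeaderCheck {
        public static void main(String[] args) throws Exception {
            // Example URL only; substitute the tile / snapshot URL under test.
            String url = args.length > 0 ? args[0] : "https://maps.wikimedia.org/osm-intl/0/0/0.png";
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            System.out.println("HTTP " + conn.getResponseCode());
            System.out.println("Cache-Control: " + conn.getHeaderField("Cache-Control"));
            System.out.println("Age: " + conn.getHeaderField("Age"));
            conn.disconnect();
        }
    }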
Tue, Mar 13
Check of the Cache-Control header:
Data import for wdqs200 completed.
Nodes are repooled, all seems good. This can be closed!
The decision is to not replace this out-of-warranty RAM. We'll run with 3% less capacity until this batch of servers is renewed (in ~1 year).
Data transfer completed from wdqs2001 to wdqs1004. Procedure is documented on wiki. The updater is catching up on a few hours of changes. Things look stable.
After experimenting a bit, I removed gzip from the pipeline. It looks like gzip is CPU bound (and not multi-threaded). Even with gzip -1, the transfer rate is slower than with no compression.
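For illustration, a tiny single-core throughput check along these lines shows the ceiling that deflate puts on the pipe (payload size and contents are arbitrary; this is not the actual transfer code):

    import java.io.OutputStream;
    import java.util.Random;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    public class CompressThroughput {
        public static void main(String[] args) throws Exception {
            byte[] payload = new byte[64 * 1024 * 1024];
            new Random(42).nextBytes(payload); // random bytes as a stand-in payload

            // Discard the output, we only care how fast one core can compress.
            OutputStream devNull = new OutputStream() {
                @Override public void write(int b) {}
                @Override public void write(byte[] b, int off, int len) {}
            };

            long start = System.nanoTime();
            try (DeflaterOutputStream gz =
                     new DeflaterOutputStream(devNull, new Deflater(Deflater.BEST_SPEED))) {
                gz.write(payload); // single-threaded, CPU bound
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("level 1 deflate: %.0f MB/s on one core%n",
                    payload.length / (1024.0 * 1024.0) / seconds);
        }
    }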
Data transfer done with:
@Smalyshev yes, there is a way to copy the data between wdqs nodes, I'll take care of it and document it here. The new wdqs cluster is not yet done reloading, so I'll take the data from wdqs2001 (I'd rather shut down a node in codfw than run on a single node in eqiad, even if that should be fine).
Yay! Thanks @faidon for finding the issue!
Mon, Mar 12
Last cleanup has been done, this can be closed.
@Papaul : the initial data download on that server is completed. Feel free to restart and / or take it down any time you want.
Migration to prometheus is completed, dashboards have been updated and diamond / graphite code has been removed.
Sun, Mar 11
This interface has been flapping up and down according to icinga. I've put some downtime until we can determine what the issue is.
Fri, Mar 9
@Cmjohnson I'll be on vacation starting March 18, and I would be more relaxed if I knew our wdqs eqiad cluster has its usual 3 nodes. Could you already rack one of the new cluster nodes (T188432) so that we can cannibalize one until wdqs1004 is healthy again?
@Cmjohnson did message me about changing the port as well, so that's probably not it.
I have performed the release, the artifacts are uploaded to maven central (it will take some time for them to actually be available...)
Note that Icinga also has random failed checks (size of conntrack table, ferm, dpkg, ...) all with "Return code of 255 is out of bounds". The service (blazegraph) related checks are also failing, but that's expected, since the data isn't loaded yet.
Bad news... I just finished reimaging wdqs1004, and I still have trouble. SSH sessions suddenly / randomly freeze. I can't see the same link flapping that @Dzahn saw in dmesg (at least not yet).
Thu, Mar 8
Wed, Mar 7
Tue, Mar 6
Jolokia has been removed from the deployments, so now the updater should crash on DNS errors and be restarted by systemd. It is non-trivial to test, but I might be able to reproduce the issue on wdqs-test by playing with iptables...
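For illustration, the fail-fast behaviour described above looks roughly like this (class name, hostname and sleep interval are made up; this is not the actual updater code). systemd with Restart=on-failure then brings the service back up:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class FailFastUpdater {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                try {
                    // stand-in for one polling cycle of the real updater
                    InetAddress.getByName("www.wikidata.org");
                    Thread.sleep(1_000);
                } catch (UnknownHostException e) {
                    // exit non-zero instead of retrying forever on a poisoned DNS cache;
                    // systemd restarts the process, which also resets the JVM resolver cache
                    System.err.println("DNS lookup failed, exiting: " + e.getMessage());
                    System.exit(1);
                }
            }
        }
    }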
Mon, Mar 5
approved in Ops meeting
Note that Kartotherian / Tilerator have a life outside of WMF. At least @Yurik is still working on them. At the moment, we are in a situation where Kartotherian / Tilerator are managed mostly like an external project to which we contribute, rather than a WMF project with external contributors. Bringing them back to gerrit might create a WMF fork of those projects. This might or might not be what we want to do, but we should be clear on the implications before making this move.
Note that in addition to deploying, @SBisson should also have the rights to restart the various services (it does not really make sense to have one permission and not the other, which means that we might want to merge those groups, but that is for another task).
Fri, Mar 2
The failed DNS resolution being sticky is probably an issue with negative caching in the JVM being either too high or infinite. We should be able to configure it via https://docs.oracle.com/javase/8/docs/technotes/guides/net/properties.html#ncnt . But that definitely requires a good integration test. We could steal some ideas from https://github.com/schuch/dnslookup-integration-test
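A minimal sketch of what tuning that property could look like (the property name comes from the page above; the TTL value and hostname here are arbitrary, and the property has to be set before the JVM does its first lookup):

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.security.Security;

    public class NegativeDnsCacheTtl {
        public static void main(String[] args) throws InterruptedException {
            // Cap how long a failed lookup is remembered; -1 would mean "cache forever".
            Security.setProperty("networkaddress.cache.negative.ttl", "5");

            for (int i = 0; i < 3; i++) {
                try {
                    System.out.println(InetAddress.getByName("www.wikidata.org").getHostAddress());
                } catch (UnknownHostException e) {
                    System.out.println("lookup failed (possibly answered from the negative cache)");
                }
                Thread.sleep(10_000); // longer than the negative TTL, so a transient failure gets retried
            }
        }
    }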
the maps production clusters (maps00[1-4]) and test cluster (maps-test200[1-4]) are managed by the same group (maps-admins). This is actually what we want. Stephane only needs to deploy to test right now, but he should also be able to deploy to production.
Thu, Mar 1
A thread dump of the stuck process: https://phabricator.wikimedia.org/P6771
Wed, Feb 28
Note: wdqs1006 does not exist (and has never existed to my knowledge). We could name those servers wdqs100[6-8] instead of wdqs100[7-9].
Strange... according to T188432 the new wdqs servers are wdqs100[7-9]. The current wdqs cluster in eqiad is [[ https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L2115-L2117 | wdqs100[3-5] ]]. So what is this wdqs1006?
Note: this is one of the new wdqs servers, not in service yet.
@Cmjohnson this looks like an issue with the physical connection. Could you try moving the cable to another port on the switch so that we can eliminate that possibility?
A few things to check (thanks for the pointers from my fellow ops):
Tue, Feb 27
There isn't much impact on response time or even load on the cluster at this point. So I would not worry too much yet. If we lose another node, this is going to be an issue, but let's not borrow trouble yet. Note that we have already received new servers in codfw, but wdqs1004 is in eqiad, so that does not help. We can re-discuss that once the new servers are available (they are configured exactly the same, so moving them between clusters should be trivial).
wdqs1004 still does not look well. The logs show a number of UnknownHostException (T188413) and updates are not processed. SSH connections sometimes freeze but do recover, or end up in "Connection refused" but work again a few seconds later.
Hardware diagnostics have completed with no errors. The server is now up again, catching up on updates, but still depooled. I'll keep an eye on it, and if it looks stable and has caught up on updates, I'll repool it (and still keep an eye on it).
Mon, Feb 26
Yep, we can close it.
Hardware diagnostic is running, I'll report back with the results when completed.
Those alerts are now available on Icinga and passing. I'll keep an eye on them for the next few days to make sure we don't have false positives, but that should be all good.
Fri, Feb 23
The recent crash of wdqs1004 (T188045) had an impact on the LDF service, which was hosted on wdqs1004 at the time of the crash. The LDF service has been manually routed to wdqs1005, but this again raises concerns about the stability of this service.
Thu, Feb 22
Those new systems are for a new cluster, independent of the current one (T178492). So we don't have any need to spread the failure domains across both the old and the new servers. So yes, the current racking locations are fine, but if you need to colocate them with the current wdqs nodes, that would be fine as well.
I need to check, but I think the metrics needed for OSM replication lag are still good. I'll just need to cherry-pick this patch and deploy it.
metrics are now collected. @EBernhardson if you could have a look and validate that this is what you expected...
Tue, Feb 20
config.yaml and sources.yaml are now managed by puppet in a coherent way. This can be closed...
Mon, Feb 19
@jmatazzoni I have no idea what this is (except that I can reproduce the issue). This looks more Kartographer-related than Kartotherian / Tilerator. Can we prioritize this as high (or higher) in the collab team's work?