Fri, Jun 15
Resolving, as the config is managed from scap3.
Yes, resolved. I tracked that on another ticket as well...
Thu, Jun 14
Since it was asked, and for the record, here is a list of the different index sizes: P7257
Looking on deploy1001, I see that /srv/deployment/cassandra/metrics-collector/.git/DEPLOY_HEAD also has a reference to tin.eqiad.wmnet. I suppose I should correct it there. Could anyone confirm that editing that file is safe?
Wed, Jun 13
Reimage of maps-test2004.codfw.wmnet is complete. There is an open point (T197159) that will be fixed separately.
Editing /srv/deployment/cassandra/metrics-collector-cache/.config to replace the reference to tin with a ref to deploy1001 seems to fix the issue. But since this is a fresh reimage, that wrong config came from somewhere else, which also needs to be fixed.
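For reference, a rough sketch of how I would scan the deployment tree for leftover references to the old deploy host; the root path and hostnames below are assumptions based on the comments above, not an authoritative list of where scap keeps its state:

```
#!/usr/bin/env python3
"""Find leftover references to the old deploy host under a deployment tree."""
import pathlib

OLD_HOST = 'tin.eqiad.wmnet'          # decommissioned deploy host
NEW_HOST = 'deploy1001.eqiad.wmnet'   # its replacement
ROOT = pathlib.Path('/srv/deployment/cassandra')  # assumed starting point

for path in ROOT.rglob('*'):
    if not path.is_file():
        continue
    try:
        text = path.read_text(errors='ignore')
    except OSError:
        continue
    if OLD_HOST in text:
        print(f'{path}: still references {OLD_HOST} (should probably be {NEW_HOST})')
```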
Reimage is failing: scap has a reference to tin. See T197159 for details.
@Pnorman: with the current templating we have with scap3, I think this can be closed. Can you confirm?
2.2.6-wmf5 uploaded to reprepro; we can close this task.
Tue, Jun 12
Looking at 14 hours of GC logs on elastic1020, I can see that the max pause time was ~400 ms, not bad!
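For the record, this is roughly how the max pause can be pulled out of the logs. It assumes the Java 8 style safepoint lines written by -XX:+PrintGCApplicationStoppedTime; if the actual logs use a different format, the regex needs adjusting:

```
#!/usr/bin/env python3
"""Extract the maximum safepoint pause from a JVM GC log."""
import re
import sys

# Matches lines like:
#   Total time for which application threads were stopped: 0.4012345 seconds, ...
PAUSE_RE = re.compile(
    r'Total time for which application threads were stopped: ([0-9.]+) seconds')

def max_pause_ms(log_path):
    pauses = []
    with open(log_path) as fh:
        for line in fh:
            m = PAUSE_RE.search(line)
            if m:
                pauses.append(float(m.group(1)) * 1000.0)
    return max(pauses) if pauses else 0.0

if __name__ == '__main__':
    print(f'max pause: {max_pause_ms(sys.argv[1]):.0f} ms')
```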
cassandra-2.2.6-wmf5 deployed on maps-test2004; it seems to work just fine.
Mon, Jun 11
Fri, Jun 8
Wed, Jun 6
Looking at GC logs with G1 on elastic2001: It looks like pool sizes are fairly unstable, with the GC trading Eden for Tenured (and vice versa) at a fairly high rate.
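A sketch of how that resizing churn could be quantified, assuming the Java 8 -XX:+PrintGCDetails line format for G1 (the regex is a guess at that format, not a tested parser of these exact logs):

```
#!/usr/bin/env python3
"""Count how often G1 picks a new Eden target between consecutive collections."""
import re
import sys

# Matches the post-GC Eden target in lines such as:
#   [Eden: 1024.0M(1024.0M)->0.0B(968.0M) Survivors: 40.0M->56.0M Heap: ...]
EDEN_RE = re.compile(r'\[Eden: [^(]*\([^)]*\)->[^(]*\(([0-9.]+)([BKMG])\)')
UNIT = {'B': 1, 'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3}

def eden_resizes(log_path):
    targets = []
    with open(log_path) as fh:
        for line in fh:
            m = EDEN_RE.search(line)
            if m:
                targets.append(float(m.group(1)) * UNIT[m.group(2)])
    changes = sum(1 for a, b in zip(targets, targets[1:]) if a != b)
    return len(targets), changes

if __name__ == '__main__':
    collections, changes = eden_resizes(sys.argv[1])
    print(f'{collections} young GCs, Eden target changed {changes} times')
```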
Deployed and seems to be working
Tue, Jun 5
We have a "sleeping" task to order new disks: T186526
Mon, Jun 4
@RobH : thanks!
Since the bugs seen with G1 were for early versions of Java 8, and we have since upgraded to a much newer JVM everywhere, let's try G1 and see if we gain anything.
Data load is complete, this can be closed.
Thu, May 31
It looks like this worked, elastic2018 looks good again.
@Papaul could you have a look at elastic2018 and see if you can make sense of what's going on? The server is powered off; do anything you'd like with it...
Wed, May 30
cassandra and cassandra-tools-wmf are manually installed from https://people.wikimedia.org/~eevans/debian/ on maps-test2004; after testing I'll upload them to reprepro.
Disk usage has been stable over the last week, so it looks like the tuning we did, while not recovering much space, helped stabilize things. I'll keep an eye on it for a while. The long-term solution is probably to increase storage, or to move the object store (Cassandra) off the maps servers.
Tue, May 29
The last action based on this issue is tracked in T193605, so we can close this.
@Papaul I thought I did resolve it already, but I think we had a duplicate. So yes, resolving!
Looks good! @Papaul thanks!
osmborder rebuilt and uploaded to reprepro. Only Cassandra left.
jvm-tools tested and copied from jessie-wikimedia to stretch-wikimedia
Mon, May 28
It looks like we have a few missing dependencies in Stretch:
Thu, May 24
Looking at the kartotherian configuration, I can't find a reference to wdqs, so I presume it is hardcoded.
Wed, May 23
I see no significant drop in disk usage since enabling unchecked_tombstone_compaction. But disk usage remains stable, which means we are probably OK, at least for the short term.
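For context, unchecked_tombstone_compaction is a per-table compaction sub-option, so enabling it boils down to an ALTER TABLE. A minimal illustration via the DataStax Python driver, with a placeholder contact point, keyspace/table names and compaction class (in practice this would just be a cqlsh one-liner):

```
#!/usr/bin/env python3
"""Enable unchecked_tombstone_compaction on one table (illustration only)."""
from cassandra.cluster import Cluster

cluster = Cluster(['maps-test2004.codfw.wmnet'])  # assumed contact point
session = cluster.connect()

# Allows single-SSTable tombstone compactions to run even when the usual
# overlap checks would skip them; note that ALTER replaces all compaction
# options, so the class has to be restated (LCS here is just an example).
session.execute("""
    ALTER TABLE example_ks.example_table
    WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'unchecked_tombstone_compaction': 'true'
    }
""")
cluster.shutdown()
```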
Tue, May 22
Note that Cassandra metrics are already tracked in Graphite; we just don't have them in a consolidated dashboard. @elukey has already done significant work to align the different Cassandra dashboards (see T193017). We can probably build on that.
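As a starting point for a consolidated view, here is a small sketch of pulling a Cassandra series out of the Graphite render API; the endpoint and the metric path are placeholders, not the real WMF ones:

```
#!/usr/bin/env python3
"""Fetch one Cassandra metric series from Graphite and print its 24h max."""
import requests

GRAPHITE = 'https://graphite.example.org'                 # placeholder endpoint
TARGET = 'cassandra.maps-test2004.jvm.memory.heap_used'   # hypothetical metric path

resp = requests.get(f'{GRAPHITE}/render',
                    params={'target': TARGET, 'from': '-24h', 'format': 'json'},
                    timeout=10)
resp.raise_for_status()

for series in resp.json():
    points = [value for value, _ts in series['datapoints'] if value is not None]
    if points:
        print(f"{series['target']}: max {max(points):.0f} over the last 24h")
```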
From a conversation with @Eevans:
May 18 2018
Cassandra might have been running into compaction issues while we had both the v3 and v4 keyspaces and not enough space to run compactions. Though I don't see any errors in the Cassandra logs...
Investigation on T192759 led to some interesting discoveries.
May 17 2018
The profile::maps::osm_master profile provides the disable_replication_cron variable, which we can use to disable replication.