Hey sorry I was on vacation last week.
Tue, Jun 20
So what I did:
Caused by: java.nio.file.FileSystemException: .../_1ut5m.nvm: Too many open files
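For whoever hits this later: the generic way to confirm a file-descriptor leak is to compare the process's open-handle count against its limit (the PID and the bumped value below are placeholders):

    # count open file descriptors for the suspect process
    ls /proc/<PID>/fd | wc -l
    # compare against its per-process limit
    grep 'open files' /proc/<PID>/limits
    # temporary relief until the leak itself is fixed
    prlimit --pid <PID> --nofile=65536:65536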
Mon, Jun 19
@Cmjohnson please proceed to decom/derack these servers and rack new ones in their place.
Fri, Jun 9
both maps and restbase are now monitored at the load-balancers of the SSL terminators in all datacenters. Resolving.
Thu, Jun 8
@faidon at first I was thinking of implementing the checks on the LVS host (in the end, the puppetization is mostly the same), but I thought the nrpe checks on the caches would be better, just because they would monitor each cache host individually rather than round-robin across every host in a pool. It might also help with spotting problems on individual caches.
Wed, Jun 7
In order to do that, I want to do a local nrpe check on the cache edge servers, calling the SSL terminator, so that we cover as many logical layers as possible. Sadly, there is a bug in service-checker that I need to fix before this can go live, but apart from that it should be pretty straightforward.
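To make that concrete, the check would boil down to an nrpe command definition along these lines (the command name, path, and arguments here are hypothetical; the real ones depend on the service-checker fix):

    # hypothetical nrpe.cfg entry on a cache host, probing through the local SSL terminator
    command[check_maps_via_ssl]=/usr/bin/service-checker-swagger -t 5 127.0.0.1 https://localhost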
Tue, Jun 6
I would start monitoring restbase on text-lb and maps on text-upload.
Mon, Jun 5
Thu, Jun 1
Wed, May 31
So I have quite a few questions regarding this:
All done; the play-by-play below is how I executed the switchover. I'll write up some more documentation and close the ticket as resolved.
- Merge https://gerrit.wikimedia.org/r/356138
- sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent' (begins read-only)
- sudo cumin 'R:class = role::configcluster' 'disable-puppet "etcd replication switchover"'
- Merge https://gerrit.wikimedia.org/r/#/c/356139/
- sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (stops replica in eqiad)
- Merge https://gerrit.wikimedia.org/r/#/c/356136/ and update dns
- sudo cumin 'conf2002.codfw.wmnet' 'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool' (sets the replication index in codfw)
- sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (starts replica in codfw)
- Merge https://gerrit.wikimedia.org/r/356341
- sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent' (ends read-only)
- Merge and deploy https://gerrit.wikimedia.org/r/#/c/356137/
The simple script to set the replication index in codfw before starting replication:
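In essence it just reads the master's current etcd index and seeds it where the replication daemon will pick it up; a sketch of that logic (the key path and ports are assumptions, not the actual script):

    # sketch: seed the replica's starting index from the master's current state
    src=conf1001.eqiad.wmnet; prefix=conftool
    # etcd v2 reports the cluster-wide index in the X-Etcd-Index response header
    idx=$(curl -si "https://${src}:2379/v2/keys/" | awk -F': ' '/X-Etcd-Index/ {print $2}' | tr -d '\r')
    # hypothetical key where the replication daemon reads its starting index
    curl -s -XPUT "http://localhost:2379/v2/keys/__replication/${prefix}" -d "value=${idx}"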
The only thing we have added to 0.4.1 is https://github.com/wikimedia/operations-debs-nutcracker/commit/37fb9a2b939821c6d704ba09b7d80bcc88961224, which is useful if we raise the log verbosity but don't want details on every connection.
Tue, May 30
Since the problem presented itself only ~15 minutes after the deploy, it could be that something we were able to cache in WANCache before is now somehow uncacheable and thus very expensive to compute.
In fact, I suspect the problem is that our proxy IP in eqiad has been banned. From the proxy machine:
Actually it was a dumb comment - the log I pasted clearly reported TCP_MISS/302, so I'm not sure what's happening. Investigating further.
So the problem is the eqiad proxy cached a redirect to localhost, likely sent by the remote host during an outage
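An easy way to verify (and then evict) such a poisoned entry is to query through the proxy directly; the proxy address and URL below are placeholders:

    # see what the proxy serves for the URL from its cache
    curl -sI -x http://webproxy.eqiad.wmnet:8080 http://remote.example.org/ | head -5
    # force the proxy to revalidate with the origin instead of serving the cached copy
    curl -sI -x http://webproxy.eqiad.wmnet:8080 -H 'Cache-Control: no-cache' http://remote.example.org/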
Mon, May 29
The problem (pdfrender hanging at startup) just showed up again on scb1002, and it seems there is no way to get around that race condition at the moment (no amount of waiting helps).
I've seen there hasn't been much going on in this task, but I want to take the opportunity to say that I don't think it's a good idea to use software created for other purposes (HS) to serve our sessions.
Fri, May 26
To be more clear: there is a 0% probability this was caused by anything other than the release of -wmf2 to the wikipedias. The issue started at 19:20 UTC and ended the instant the train was rolled back.
So, to summarize this succinctly, I'll post the list of requests on a random appserver that took more than 6 seconds to render:
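For reproducibility, a list like this can be extracted from the Apache access logs with something along these lines, assuming the request duration in microseconds (%D) is the last field of the LogFormat, which may not match ours exactly:

    # requests slower than 6 seconds, if %D (microseconds) is the last log field
    awk '$NF > 6000000' /var/log/apache2/other_vhosts_access.log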
Wed, May 24
@Cmjohnson I suggest we do the following:
Another option is not to care much about how the current distribution looks, but just to distribute servers evenly across rows, and then go on and rebalance the whole cluster.
Here is my proposal regarding these systems:
May 23 2017
The racking request is just that these new machines go in different rows. They can even go in the racks of the other conf* systems, as those old systems will eventually be decommissioned.
May 22 2017
May 21 2017
Without going deeper into the requirements this ticket assumes to be true (I'm not sure all of them are justified, but that's another topic), I would say the "application-level TTLs" option seems the best way to go, for a few reasons:
May 15 2017
May 12 2017
May 11 2017
I am re-doing our calico-containers repository from scratch, importing a version from upstream and managing the now-minimal changes to the Dockerfiles with quilt. This will make it easier to build calicoctl (the debian package) properly.
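For the record, the quilt workflow for carrying those changes is roughly the following (file and patch names are illustrative):

    quilt new dockerfile-changes.patch   # start a new patch on top of the imported tree
    quilt add Dockerfile                 # snapshot the file before editing it
    $EDITOR Dockerfile
    quilt refresh                        # record the edits into the patch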
I think the basic idea of the patch is good, but the implementation can be improved, as it's not currently doing what it's intended to do.
May 9 2017
So, after some digging, I found out that conf2002.codfw.wmnet had, for some reason, auth enabled on etcd (while we now just proxy through nginx), and moreover only had the root user available. The most probable cause is me doing something wrong when disabling auth in eqiad during the conversion of that cluster.
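Verifying and cleaning this up should only need etcdctl's v2 auth commands, something like (the endpoint is an assumption):

    etcdctl --endpoint https://conf2002.codfw.wmnet:2379 user list
    # disable v2 auth entirely, since nginx now fronts authentication
    etcdctl --endpoint https://conf2002.codfw.wmnet:2379 -u root auth disable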
May 3 2017
Yes, my only doubt with this proposal is exactly that: we want to be active/active, but still be able to serve all the traffic from a single datacenter.
I just recreated the RAID arrays and rebooted the system with the new disk in place. @Eevans I'll let you re-start puppet and attend to Cassandra. Of course, the data in /srv are gone for good.
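For reference, the recreation was the usual mdadm dance; the device names and layout below are illustrative, not the exact ones used:

    # illustrative only: recreate the array and the filesystem for /srv
    mdadm --create /dev/md2 --level=10 --raid-devices=4 /dev/sd[abcd]3
    mkfs.ext4 /dev/md2
    mount /dev/md2 /srv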
May 2 2017
I converted the etcd cluster in eqiad to use nginx for auth/TLS, moved to ecdsa certs with the correct SANs, and started replication codfw => eqiad.
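A quick way to double-check that the new certs present the right SANs (hostname and port are examples):

    # inspect the SANs on the certificate served on the etcd client port
    echo | openssl s_client -connect conf1001.eqiad.wmnet:2379 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'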
May 1 2017
Apr 29 2017
An additional case I'm going to study in more detail:
Apr 28 2017
Apr 27 2017
Just to err on the side of caution, I reviewed all the code of JobQueueRedis and JobChron, and I found no obvious parts of our Lua scripts that could cause replication to break, such as non-deterministic statements.
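As an example of the kind of thing I was looking for: a script that writes a value derived from a non-deterministic command is exactly what breaks replication, and Redis itself rejects the pattern (illustrative snippet, not from our code):

    # fails with 'Write commands not allowed after non deterministic commands'
    redis-cli EVAL 'local t = redis.call("TIME") return redis.call("SET", KEYS[1], t[1])' 1 demo:key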
Also let me add a few remarks on the redis replication: