Wed, May 24
@Cmjohnson I suggest we do the following:
Another option is not to worry much about the current distribution, but simply to spread the servers evenly across rows and then rebalance the whole cluster.
Here is my proposal regarding these systems:
Tue, May 23
The racking request is simply that these new machines go in different rows. They can even go in the racks of the other conf* systems, as those old systems will eventually be decommissioned.
Mon, May 22
Sun, May 21
Without going deeper into the requirements this ticket assumes to be true (I'm not sure all of them are justified, but that's another topic), I would say that the "application-level TTLs" option seems the best way to go, for a few reasons:
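To make the option concrete: an application-level TTL means the application stores an expiry timestamp alongside the value and enforces it on read, instead of relying on the datastore's own expiry. A minimal sketch, using redis-cli purely for illustration (key names are made up):

```
# Write the value together with an expiry timestamp the application controls.
redis-cli SET demo:value "payload"
redis-cli SET demo:expires_at "$(date -d '+1 hour' +%s)"

# On read, the application enforces the TTL itself.
now=$(date +%s)
expires_at=$(redis-cli --raw GET demo:expires_at)
if [ "$now" -lt "$expires_at" ]; then
    redis-cli GET demo:value
else
    echo "demo:value has expired" >&2
fi
```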
Mon, May 15
Fri, May 12
Thu, May 11
I am re-doing our calico-containers repository from scratch, importing a version from upstream and managing the now-minimal changes to the Dockerfiles with quilt. This will make it easier to build calicoctl (the debian package) properly.
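For context, a plausible quilt workflow for carrying such minimal Dockerfile changes on top of the imported upstream tree (patch and file names are illustrative):

```
cd calico-containers

# Record a local change as a quilt patch on top of the upstream import.
quilt new local-dockerfile-changes.patch
quilt add calicoctl/Dockerfile     # track the file before editing it
$EDITOR calicoctl/Dockerfile       # make the local modification
quilt refresh                      # write the diff into the patch file

# After importing a new upstream version, re-apply the whole patch series.
quilt push -a
```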
I think the basic idea of the patch is good, but the implementation needs work, as it is not currently doing what it's intended to do.
Tue, May 9
So, after some digging, I found out that conf2002.codfw.wmnet had auth enabled on etcd (while we now just proxy through nginx) and, oddly, only had the root user available. The most probable cause is that I did something wrong when disabling auth in eqiad during the conversion of that cluster.
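The state can be inspected and fixed along these lines with etcdctl (etcd v2 API; endpoint and credentials are illustrative):

```
# List the configured users, authenticating as root since auth is enabled.
etcdctl --endpoints https://conf2002.codfw.wmnet:2379 -u root user list

# Auth was enabled unintentionally (nginx handles auth now), so turn it off.
etcdctl --endpoints https://conf2002.codfw.wmnet:2379 -u root auth disable
```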
Wed, May 3
Yes, my only doubt with this proposal is exactly that: we want to be active/active, but still able to serve all the traffic from a single datacenter.
I just recreated the RAID arrays and rebooted the system with the new disk in place. @Eevans I'll let you restart puppet and attend to cassandra. Of course, the data in /srv are gone for good.
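For the record, recreating a software RAID array from scratch looks roughly like this (device names and RAID level are illustrative, not the actual layout of this host):

```
# Zero the old superblocks on the member partitions, then build a fresh array.
mdadm --zero-superblock /dev/sda3 /dev/sdb3
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3

# New filesystem for /srv; the old data is not recoverable after this.
mkfs.ext4 /dev/md2
mount /dev/md2 /srv
```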
Tue, May 2
I converted the etcd cluster in eqiad to use nginx for auth/TLS, moved to ecdsa certs with the correct SANs, and started replication codfw => eqiad.
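A sketch of how ECDSA keys and SAN-bearing CSRs can be generated with openssl (hostnames and the SAN list are illustrative):

```
# ECDSA private key on the P-256 curve.
openssl ecparam -name prime256v1 -genkey -noout -out conf1001.key

# CSR carrying the SANs, appended to the stock config via process substitution.
openssl req -new -key conf1001.key -out conf1001.csr \
    -subj "/CN=conf1001.eqiad.wmnet" \
    -reqexts SAN \
    -config <(cat /etc/ssl/openssl.cnf; \
              printf "[SAN]\nsubjectAltName=DNS:conf1001.eqiad.wmnet,DNS:etcd.eqiad.wmnet\n")
```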
Mon, May 1
Sat, Apr 29
An additional case I'm going to study in more detail:
Fri, Apr 28
Thu, Apr 27
Just to err on the side of caution, I reviewed all the code of JobQueueRedis and of JobChron, and I found no obvious parts of our Lua scripts that could cause replication to break, such as non-deterministic statements.
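For illustration, the most blatant kind of non-determinism is caught by Redis itself on versions that replicate scripts verbatim (the pre-5 default), so the review mostly had to look for subtler logic (key name is made up):

```
# A write after a non-deterministic command like TIME, SRANDMEMBER or
# RANDOMKEY fails inside a script on verbatim-replicating Redis versions:
redis-cli EVAL "local t = redis.call('TIME'); return redis.call('SET', KEYS[1], t[1])" 1 demo:key
# -> ERR ... Write commands not allowed after non deterministic commands
```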
Also let me add a few remarks on the redis replication:
So, I just re-ran showJobs in eqiad and codfw and there were big discrepancies, as expected:

--- wasat.before	2017-04-27 11:45:20.725345007 +0200
+++ terbium.before	2017-04-27 11:45:31.189344455 +0200
@@ -1,15 +1,14 @@
-categoryMembershipChange: 3 queued; 52 claimed (3 active, 49 abandoned); 0 delayed
-cdnPurge: 0 queued; 0 claimed (0 active, 0 abandoned); 32 delayed
+categoryMembershipChange: 3398 queued; 30062 claimed (406 active, 29656 abandoned); 0 delayed
+cdnPurge: 32648 queued; 57664 claimed (959 active, 56705 abandoned); 21 delayed
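A diff like the above can be produced by running the grouped job report on the maintenance host of each datacenter and diffing the outputs (wiki chosen for illustration):

```
# On wasat (codfw) and terbium (eqiad) respectively:
mwscript showJobs.php --wiki=enwiki --group > wasat.before
mwscript showJobs.php --wiki=enwiki --group > terbium.before

diff -u wasat.before terbium.before
```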
I finally managed to find some differences in one redis replication set, and those are all related to big data structures like enwiki:jobqueue:refreshLinks:l-unclaimed and enwiki:jobqueue:refreshLinks:z-claimed or enwiki:jobqueue:refreshLinks:h-idBySha1.
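One quick way to quantify such discrepancies is to compare the cardinality of those structures on both ends of the replication link (hostnames are illustrative):

```
for host in rdb1001.eqiad.wmnet rdb2001.codfw.wmnet; do
    echo "== $host"
    redis-cli -h "$host" LLEN  enwiki:jobqueue:refreshLinks:l-unclaimed
    redis-cli -h "$host" ZCARD enwiki:jobqueue:refreshLinks:z-claimed
    redis-cli -h "$host" HLEN  enwiki:jobqueue:refreshLinks:h-idBySha1
done
```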
Ok so, I started using a (slightly modified) version of the script presented here:
Wed, Apr 26
For the record, @akosiaris and I switched etcd client traffic to codfw to allow relocating conf1003 with ample time.
All clients have been successfully switched to codfw, and replication has been stopped; I tested depooling and repooling a client (to test again that nginx-based auth works) and everything seems to be working flawlessly for now.
I don't think scap should interact with conftool by itself, unless it reproduces what the restart-<service> scripts are doing right now, which is:
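Those scripts essentially perform a depool/restart/repool cycle; a minimal sketch with confctl, where the selector, drain time, and service name are illustrative, not the actual scripts:

```
confctl select "name=$(hostname -f)" set/pooled=no    # depool from the load balancer
sleep 15                                              # let in-flight requests drain
systemctl restart some-service                        # restart the service
confctl select "name=$(hostname -f)" set/pooled=yes   # pool the host back
```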
Apr 21 2017
I've set up the replica and prepared changes for most of the next steps. When I'm back on Wednesday morning, we can decide whether to fail over to the new cluster directly, or to only do so if something bad happens to the eqiad cluster during the network maintenance, and perform the switchover at a later date.
To summarize, I think it is possible to test EtcdConfig in beta at this point with the limited deployment I made.
I managed to get a bare-bones working installation of conftool in deployment-prep.
@GWicke I have seen the same job being re-executed multiple times (after succeeding) when I ran runJobs.php from the command-line to remove some pressure; I'm sure more cases can be found in the logs.
Apr 20 2017
The queue is down to 250K jobs, and I am confident all the old refreshLinks jobs have been removed. I'm leaving the ticket open at lower priority as I still need to take a look at this.
FTR, the queue is dropping fast, as is the number of processed jobs. I'll de-deploy my hack as soon as I'm confident I killed all the rogue refreshLinks jobs.
I'm testing this live hack:
Raised the priority as this keeps biting us again and again, see T163418
Apr 19 2017
One possible explanation is that this is due to jobs that were already running in eqiad before the switchover and got killed before acknowledging, due to the way we stop the jobrunners in the switchover procedure.
@Ladsgroup is this happening for new jobs enqueued now?
terbium will be upgraded to jessie as soon as we've switched over, for the record.
Apr 12 2017
Apr 7 2017
Our production puppetmasters run on 3.8, several clients have been tested, and the agent should have minimal differences.
@MoritzMuehlenhoff I'm ok with a wider but limited rollout, but at least until the dc switchover and rollback are done, I'd prefer to stick with the known evil of 3.12 on most systems.
Apr 3 2017
@stjn I cannot reproduce your case and we should've fixed the largest underlying reason for such problems. Do you still experience the same issue?
This has been practically superseded by so many specific tickets that it doesn't really make much sense anymore.
Status as of now: