Thanks to @Paladox's work on this, the aphlict service unit now handles the software correctly.
If you want to better understand what puppet_ca does on an agent, and why removing it afterwards "doesn't break anything", there are good reads in the puppet docs:
Thu, Sep 21
Tue, Sep 19
FWIW we're seeing another almost-uncontrollable growth of jobs on Commons and probably other wikis. I might decide to raise the concurrency of those jobs.
This was caused by https://gerrit.wikimedia.org/r/#/c/365891/, yet another case of a labs-specific fix breaking production.
Fri, Sep 15
After some thought, I decided to proceed as follows:
Thu, Sep 14
After the discussion the other day at the containers cabal meeting, I promised to come up with a proposal for helm chart development/management. So here it is.
Wed, Sep 13
I would advise AGAINST giving access to all logs. We should have a tail-ores command that specifically tails the ORES logs, like we do with other services.
This is very promising: I was in the process of writing down my own requirements, and it seems most things are already covered, although it's not clear from your post whether we can have per-wiki stats as well as per-job stats.
Tue, Sep 12
We did a lot of work today on this, and I am thus running a new full puppet compiler run, which can be found here
dh-make-golang is what I'd use for creating a Debian package from scratch, as it will also prepare packages for any dependency (read: any library dependency that still isn't in Debian).
Mon, Sep 11
Fri, Sep 8
It appears I was too optimistic in declaring victory. The new resync at reduced speed still triggered consensus issues. It seems the version of etcd we're using is particularly sensitive to I/O latency spikes. So, while I'm inclined to either disable the RAID resyncs altogether or stagger them between the servers of each cluster, I will also consider upgrading etcd to a newer version (still in the 2.x series) as an option.
I did a review of how helm works and what it offers in relation to our environment.
Reducing the sync speed manually did the job, so we can just puppetize this.
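For reference, a minimal sketch of the manual change being puppetized here, assuming the cap is applied through the kernel-wide md sysctl knob (the specific limit value is an assumption, not the one we settled on):

```python
#!/usr/bin/env python3
"""Cap the md RAID resync speed so background resyncs can't saturate disk I/O.

This only mirrors the manual change; in production the same value would be
managed by Puppet (e.g. a sysctl resource) rather than by running a script.
"""

# Kernel-wide ceiling for RAID resync speed, in KiB/s per device.
# 20000 (~20 MB/s) is an illustrative value only.
SPEED_LIMIT_MAX_KIB = 20000

def cap_resync_speed(limit_kib: int = SPEED_LIMIT_MAX_KIB) -> None:
    # Equivalent to: sysctl -w dev.raid.speed_limit_max=<limit_kib>
    with open("/proc/sys/dev/raid/speed_limit_max", "w") as fh:
        fh.write(str(limit_kib))

if __name__ == "__main__":
    cap_resync_speed()
```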
I am a bit more concerned about the performance and reliability implications of adding indirections in the data path itself. TLS is supported by all major platforms we use, so we should be able to avoid indirections for that. The main requirement to enable this is centralized certificate management, and exposing certs to services in a standardized manner, often via env vars.
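As an illustration of that requirement, a minimal sketch of a service building its TLS context from certificate paths exposed via environment variables (the variable names are hypothetical, not an agreed convention):

```python
import os
import ssl

def build_ssl_context() -> ssl.SSLContext:
    """Build a server-side TLS context from cert paths injected via env vars.

    TLS_CERT_FILE / TLS_KEY_FILE / TLS_CA_FILE are placeholder names; the point
    is only that the platform hands the paths to the service, which terminates
    TLS itself instead of relying on an extra hop in the data path.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(
        certfile=os.environ["TLS_CERT_FILE"],
        keyfile=os.environ["TLS_KEY_FILE"],
    )
    ca_file = os.environ.get("TLS_CA_FILE")
    if ca_file:
        # Optionally require and verify client certificates against the CA bundle.
        ctx.load_verify_locations(cafile=ca_file)
        ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx
```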
Result of the latest experiment:
Running the mdadm command on one host caused a re-election to happen. It seems likely we found the culprit, so now I'm going to run the command at the same time first on two hosts, then on all three, to verify we found the origin of the issues.
Thu, Sep 7
Everything is set up and you can reach the correct LVS endpoint via the discovery DNS system, at jobrunner.discovery.wmnet, over HTTPS.
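For instance, a quick connectivity check from a client host might look like the following (the URL path, and whether the internal CA is already in the client's trust store, are assumptions; only the hostname and scheme come from the above):

```python
import urllib.error
import urllib.request

# jobrunner.discovery.wmnet resolves, via the discovery DNS system, to the LVS
# endpoint of the currently active datacenter; here we just probe it over HTTPS.
URL = "https://jobrunner.discovery.wmnet/"

try:
    with urllib.request.urlopen(URL, timeout=5) as resp:
        print(resp.status, resp.reason)
except urllib.error.HTTPError as err:
    # Even an HTTP error response proves DNS, LVS and TLS are all working.
    print(err.code, err.reason)
```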
I did some more number crunching on the instances of runJob.php I'm running on terbium, and found the following:
Wed, Sep 6
Tue, Sep 5
We still have around 1.4 million items in queue for commons, evenly divided between htmlCacheUpdate jobs and refreshLinks jobs.
Mon, Sep 4
Fri, Sep 1
To quickly recap the plan:
After my series of changes the situation looks much better:
A containerized microservice environment should make developing and deploying applications as easy as possible.
For metrics collection, my proposal (after a chat with @fgiunchedi) is to have another sidecar running prometheus-statsd-exporter, in the modified version we maintain.
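To make the sidecar pattern concrete, a small sketch of the application side: the service keeps emitting plain statsd over UDP to localhost, and the prometheus-statsd-exporter sidecar turns that into a Prometheus scrape target. The port and metric names below are placeholders, not agreed values:

```python
import socket
import time

# The application sends plain statsd to the sidecar on localhost;
# prometheus-statsd-exporter then exposes the metrics for Prometheus to scrape.
STATSD_ADDR = ("127.0.0.1", 9125)  # illustrative statsd listen port of the sidecar

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(metric: str, value: int = 1) -> None:
    """Send a statsd counter increment, e.g. 'myservice.requests:1|c'."""
    sock.sendto(f"{metric}:{value}|c".encode(), STATSD_ADDR)

def timing(metric: str, ms: float) -> None:
    """Send a statsd timing sample in milliseconds."""
    sock.sendto(f"{metric}:{ms}|ms".encode(), STATSD_ADDR)

start = time.monotonic()
# ... handle a request ...
incr("myservice.requests")
timing("myservice.request_time", (time.monotonic() - start) * 1000)
```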
As far as logging goes, we have basically two big options:
Thu, Aug 31
Correcting myself after a discussion with @ema: since we have at most 4 cache layers, we should only process jobs whose root timestamp is newer than 4 times the cache TTL cap. So anything older than 4 days should be safely discardable.
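A trivial sketch of the resulting check, assuming a per-layer cache TTL cap of one day (which is what the 4-day figure implies); the names are placeholders, not the actual jobqueue code:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CACHE_TTL_CAP = timedelta(days=1)   # assumed per-layer TTL cap
CACHE_LAYERS = 4                    # we have at most 4 cache layers

def is_discardable(root_timestamp: datetime, now: Optional[datetime] = None) -> bool:
    """A cache-purging job whose root timestamp is older than
    CACHE_LAYERS * CACHE_TTL_CAP can be dropped: by then every cache layer
    has already expired the object anyway."""
    now = now or datetime.now(timezone.utc)
    return now - root_timestamp > CACHE_LAYERS * CACHE_TTL_CAP

# Example: a job whose root timestamp is 5 days old is safely discardable.
print(is_discardable(datetime.now(timezone.utc) - timedelta(days=5)))  # True
```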
@aaron so you're saying that when someone edits a lot of pages with a lot of backlinks, we will see the jobqueue keep growing for quite a long time: the divided jobs will be executed at a later time, and as long as the queue is long enough, we'll keep seeing jobs divided and inserted into the queue whenever the division jobs are executed.
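To restate the mechanism with a toy sketch (illustrative only, not MediaWiki's actual JobQueue implementation): a "root" job covering many backlinks does not process them directly, but re-enqueues smaller partitioned jobs when it is executed, so the queue length peaks well after the original edit:

```python
from collections import deque

BATCH_SIZE = 100  # illustrative leaf-job size

def run_job(job: dict, queue: deque) -> int:
    """If the job covers more pages than BATCH_SIZE, split it into child jobs
    and push them back onto the queue instead of doing the work now."""
    pages = job["pages"]
    if len(pages) > BATCH_SIZE:
        for i in range(0, len(pages), BATCH_SIZE):
            queue.append({"pages": pages[i:i + BATCH_SIZE]})
        return 0
    return len(pages)  # leaf job: actually refresh these pages

queue = deque([{"pages": list(range(10_000))}])  # one edit touching 10k backlinks
processed = 0
while queue:
    processed += run_job(queue.popleft(), queue)
# The queue only drains once all the divided leaf jobs have been executed,
# long after the root job that spawned them was inserted.
```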
Wed, Aug 30
Also, going through the remainder of the design document and the implementation PoC, I could summarize the flow as follows:
Hi, I took a look at your current proposal and I see a series of issues with it. I might still not have fully understood what you're proposing; if that's the case, please let me know!
This task was about pdfrender failing to start, and that problem has been "hotfixed".
Mon, Aug 28
https://puppet-compiler.wmflabs.org/compiler02/7622/index-future.html has a list with most spurious differences removed.
Beware that, as the HHVM developers themselves have declared, the PHP 7 implementation in HHVM will never be 100% compatible with the PHP one.
Any news on this? We do need to rack the new appservers, as putting them in production is required in order to proceed with the eqiad row D switch upgrade (T172459).
Aug 23 2017
Aug 11 2017
I don't think it's safe to do this maintenance until we rack all the new MediaWiki machines. Almost half of our MediaWiki capacity is in row D. We have plans to remediate that once the new MediaWiki servers are racked (see T165519), but I'd say racking and setting up those servers should be a hard blocker for this maintenance at the moment.
Aug 10 2017
Full list of hosts using the future parser:
https://puppet-compiler.wmflabs.org/compiler02/7383/cp2001.codfw.wmnet/ shows the correct behaviour.
Aug 9 2017
So, one additional complication: we need to refresh the facts timestamps for every pcc run, as we don't want to run into a case of https://tickets.puppetlabs.com/browse/PUP-5441
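A rough sketch of what such a refresh could look like (the facts path, file layout and field names below are assumptions about the compiler setup, not the actual implementation):

```python
import datetime
import glob
import re

# Hypothetical location of the cached facts used by the puppet compiler.
FACTS_GLOB = "/var/lib/catalog-differ/puppet/yaml/facts/*.yaml"

def refresh_facts_timestamps() -> None:
    """Rewrite the timestamp/expiration fields of every cached facts file to
    'now', so the compiler never treats the facts as expired (cf. PUP-5441)."""
    now = datetime.datetime.utcnow()
    stamp = now.strftime("%Y-%m-%d %H:%M:%S.%f +00:00")
    expiry = (now + datetime.timedelta(hours=12)).strftime("%Y-%m-%d %H:%M:%S.%f +00:00")
    for path in glob.glob(FACTS_GLOB):
        with open(path) as fh:
            text = fh.read()
        text = re.sub(r"^(\s*timestamp:).*$", rf"\1 {stamp}", text, flags=re.M)
        text = re.sub(r"^(\s*expiration:).*$", rf"\1 {expiry}", text, flags=re.M)
        with open(path, "w") as fh:
            fh.write(text)

if __name__ == "__main__":
    refresh_facts_timestamps()
```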
This is resolved now that we use our own differ.
Yes, this is a duplicate of T150456
As can be seen here
I wrote a first version of the script that can be used to populate puppetdb; I'll upload it via puppet to all the compiler machines for now so that we can populate the db easily.
Aug 8 2017
Aug 7 2017
Aug 4 2017
This should be resolved with the new home-brewed differ:
I rewrote the Rakefile according to @faidon's suggestions and tweaked the dockerfile/run environment a bit; now an average job takes less than 20 seconds to execute, with simple changes taking less than 10 seconds. All this while running specs, if needed.
Aug 3 2017
I think the proposal is pretty sound, with a couple of suggestions to keep things more "familiar" for ops and more in line with what we use in our production environment: