I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
I see, that makes sense.
That fixed it, as was intended when I submitted the patch. Sorry I forgot to add the hiera.
I forgot to add profile::wmcs::kubeadm::docker_vol: false on ingress nodes. Please do not increase disk for them.
It seems to me that adding something in the template to specify the port for the ingress is the simplest and safest solution. That should just mean making it a profile-based template and adding a parameter with a sane default.
Fri, May 7
Ok, I updated the docs at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying for ingress and standard worker nodes. The puppet patch is deployed, so this should be good to go. We just need to use the correct flavors, etc. No actual cinder storage is required.
The last couple of outages for this were caused by a need to restart acme-chief due to a known issue that is more or less "wontfix". While a paging-type alert is not a bad idea, a simple systemd timer that checks whether the cert is coming due and restarts acme-chief if it isn't doing its job would shore that up, no?
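Something like this is what I have in mind for the check the timer would run, just as a sketch (the cert path and the 7-day threshold are placeholders, not the real acme-chief layout):

  # Restart acme-chief if the cert is close to expiry and hasn't been renewed on its own.
  # Path and threshold are placeholders only.
  CERT=/etc/acmecerts/example/live/rsa-2048.crt
  if ! openssl x509 -checkend $((7*24*3600)) -noout -in "$CERT"; then
      systemctl restart acme-chief.service
  fi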
This works and should be good to go. I set the TTL on k8s.svc.tools.eqiad1.wikimedia.cloud to 300 because I know you are going to change it to a new cluster soon. Feel free to change that to 3600 (the default) after you've rebuilt the haproxy nodes. @Majavah
Oh! There's no stale cache on the control nodes.
For now, I cannot get the host to forget the cached A record. I tried restarting nscd, and that did not help. Turning the old proxies off just brought everything down. My thought is to leave the old proxies up until the cache drops off. That would also be the best process for tools anyway, and potentially a zero-downtime method.
The caching is frustratingly strong here. The old A record is still seen by the host somehow (if not by dig).
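For reference, this is roughly how I'm comparing what the host sees with what DNS returns (the name here is just an example); invalidating nscd's hosts cache directly would be the next thing to try beyond restarting it:

  # What processes on the host resolve vs. what DNS actually serves
  getent hosts k8s.toolsbeta.eqiad1.wikimedia.cloud
  dig +short k8s.toolsbeta.eqiad1.wikimedia.cloud
  # Invalidate nscd's hosts cache in addition to restarting the daemon
  nscd -i hosts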
It looks like I have to delete it and recreate it as a CNAME. That means that it will briefly cause some chaos on Toolforge when we do it in tools. We might want to do it really fast via CLI there. In toolsbeta, I can do it now.
Wait... I have the sense of this reversed... The current name becomes a CNAME to the svc name... that would Just Work™
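To spell out the direction (the svc name is the real one from above; the current/old name and IP here are just illustrative):

  ; before: the name in everyone's kubeconfig is a plain A record
  k8s.tools.eqiad1.wikimedia.cloud.  300  IN  A      <old proxy IP>
  ; after: the same name becomes a CNAME to the svc name, so kubeconfigs keep working
  k8s.tools.eqiad1.wikimedia.cloud.  300  IN  CNAME  k8s.svc.tools.eqiad1.wikimedia.cloud.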
The name itself can't "change" unless all the kubeconfigs do as well, which is the important part.
If we try it in toolsbeta and the certs all still validate, sure. My worry is that cert validation will collapse unless we make sure it's a valid altname for the k8s cluster as well.
Thu, May 6
We'll likely try that in toolsbeta first anyway.
That patch should sort this out, and new nodes should be easy to deploy when required after it is merged. When we upgrade the ingress nodes, it'd make sense to set profile::wmcs::kubeadm::docker_vol: false on the appropriate prefix. I don't think that will retroactively unmount it on existing ingress nodes, since it is usually hard to make puppet unmount things, right @Andrew?
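Concretely, the hiera bit on the ingress prefix would just be something like this (the key is the real one; which prefix it lands on is whatever we use for the ingress nodes):

  # hiera on the ingress node prefix (prefix choice illustrative)
  profile::wmcs::kubeadm::docker_vol: false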
I see the ingress nodes are using the role::wmcs::toolforge::k8s::worker class in puppet. I can factor out the docker volume and set a different ingress controller class for those that just uses a non-cinder setup. I don't think ingress and control nodes need the extra disk.
We could, but there isn't much need to. It doesn't actually consume that space on the ceph cluster. This is all thin-provisioned. The docker images we currently use are extremely large (for docker images). I'd rather keep plenty of room.
Got it. This flavor is made for k8s nodes: g3.cores8.ram16.disk20.ephem140. That roughly matches the existing larger nodes. Using it with the cinder class should Just Work.
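i.e. something along these lines when building a node (image, network, and instance names are placeholders; in practice this goes through our usual VM creation flow rather than raw CLI):

  openstack server create \
      --flavor g3.cores8.ram16.disk20.ephem140 \
      --image <base-image> \
      --network <project-network> \
      <k8s-worker-instance-name>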
It looks like the grid nodes use the flavor g3.cores4.ram8.disk20.swap24.ephem20. K8s nodes actually cannot have swap, so that isn't quite right, for sure.
So if we have an appropriate flavor to use (and the grid flavor might work), we can use the same approach as https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456/
Does that need to be cinder, or should it be like the grid nodes with ephemeral storage? I personally think the latter; otherwise, rebuilding nodes will be a serious pain.
It does not currently seem to be leaking temp files, and I'm not sure why the temp files filled the disk. Closing this for now.
bstorm@quarry-web-01:/tmp$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda3        19G  3.5G   15G  20% /
All files are owned by www-data, so this is a quarry bug.
It's being caused by /tmp filling up.
Yep, that theory was it. Works now. Sorry about that quirk.
@MBH the old wikireplica names were forced over to the new wikireplicas this week. I just ran a test and found that SELECT * FROM actor WHERE actor_user IS NULL has loads of results on enwiki there, but not on ruwiki. That seems very strange. I'll dig a bit deeper. ruwiki is on a host that was down for maintenance when some of these fixes were deployed, so it is likely that host needs a re-run of the definitions.
That's very strange... I'm not aware of any changes to the views any time recently. Which wiki? @MBH do you have a quarry link I can try or something?
Wed, May 5
Server: cloudbackup2001.codfw.wmnet, directory: /srv/security-temp. There are 7 TB available, and it is a much more suitable temporary location.
@Reedy new location for this is set up. I'll get on IRC and see how you want to move it.
@Reedy I have found a place in the cloud universe where I could put it in CODFW on a temporary basis. That would get it off systems that have public uses and ease the disk space problem. A new setup for storing this data really needs to be purchased outside of cloud systems, though.
That helps a lot. I hope that isn't the second time you've had to explain it to me 😆
I'll close this one.
That would make more sense :)
Tue, May 4
Any thoughts on that @Marostegui ?
Since the script is not very amenable to creating cross-zone CNAMEs, I just made them by hand for these three until we can make the eqiad.wmflabs A records into wikimedia.cloud A records in the future. It works now.
Mon, May 3
I've ack'd the alert. We could try tuning the buffer parameter thing, or we could just change the alert. I'm honestly surprised that the web side threw the alert first.
Fri, Apr 30
For all I know, I'm just trying the wrong port. I just figured the next port in line seemed safe.
Got paged for this again this morning.
Thu, Apr 29
Hopefully that'll speed some responses up. It's still running awfully high iowait.
Wed, Apr 28
Nope, I'd probably google around for what "seems good", unfortunately.
I recommend we look at erasure coding the pools for Swift; I don't think I've said that directly on this task yet: https://ceph.io/planet/erasure-coding-in-ceph/
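For a concrete idea of what that would look like (k/m values, failure domain, and pool names here are purely illustrative, not a proposal for our actual layout):

  # Define an erasure-code profile and back a Swift/radosgw data pool with it
  ceph osd erasure-code-profile set swift-ec k=4 m=2 crush-failure-domain=host
  ceph osd pool create swift-data-ec 128 128 erasure swift-ec
  ceph osd pool application enable swift-data-ec rgw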
Tue, Apr 27
Interesting question. When it comes down to it, the data is probably some of the least secret on the web, so I'm not sure why we would care where it goes after it is accessed. My understanding of the enterprise offering is that it provides a high-performance way of gathering lots of information quickly from our systems rather than a particular set of information. WMCS would not really be an enterprise-class, highly scaled client from that perspective.
That alert may actually be actionable at least. It would seem surprising for a failing disk to crash a server, but I've seen it before if it is resetting like mad. Let's see if we can get that replaced.
Ah yeah, that's still there. I didn't see it on a very quick check with tab completion (because of permissions) and dreamed it was gone :) @Reedy and @JFishback_WMF this is a public ticket, so I don't know if we need another one to discuss that material, but please note the description. This chunk of security data is a bit much for the dumps server to host if it keeps climbing as it is, and the server has become more unreliable as well. I'm not sure you want potentially-unique data that matters to you stored on that storage controller.
Mon, Apr 26
That's my guess here. It's not a very sustainable practice across multiple servers and rebuilds. I think it's likely that some may even have been overwritten over time and there were once more than 5.
This is the space on labstore1006 (red) and labstore1007 (green) as a percentage of space used. It was pretty level for a little while, but it has been climbing steadily since then at a rate that won't work for long (especially given labstore1007's mysteriously higher usage).
Unfortunately I'm not seeing anything in there.
That apt sources.list bug can't die quickly enough.
Sat, Apr 24
Ok, dumps is going to be served from labstore1006 for a while. I've downloaded an "adu" report with the hpssaducli tools (hpssaducli -f adu-report.zip as root). I'm pleased that it's readable, and I will go over it Monday to see if there's a useful report in there on what happened, since this looks like hardware.
Going to fail over to labstore1006 because this looks suspect.
Ok, my shell just hung for a bit, and I see this in dmesg :(
This dmesg line: "Apr 24 22:05:36 labstore1007 kernel: [35172747.658395] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap- En- Exp=1" would then refer to the larger data volume (because that's RAID 10) from the paste (logical volume 2).
Ok, the PCI address logging the errors was 0000:05:00.0, which corresponds to the controller in slot 1 from the controller details: