Creating this based on what I found investigating https://lists.wikimedia.org/pipermail/cloud/2020-January/000941.html - key points in bold
We used to have 5 worker nodes in this new k8s cluster. Today, when I looked into the issue of pods stuck in `ContainerCreating`, I found events saying `/data/project` was missing from the host. The host turned out to be a new one, tools-k8s-worker-6. I quickly determined that profile::wmcs::nfsclient should have created the `/data/project` symlink to the NFS mount (which did exist), but that puppet had the kind of cert issue you see on new hosts in projects that run their own puppetmaster.
I also found that there was not just one new worker instance: workers 6 through 14 had all been created. I checked the first few and they **were all created by novaadmin**. I'm not sure it's ever valid for novaadmin to be creating instances.
I went through the instances missing the `/data/project` symlink (-6, -7, -8, -13, -14) and fixed their puppet connection to tools-puppetmaster. The four other instances (-9, -10, -11, -12) somehow already had the symlink but still had broken puppet - **I left these alone and don't know whether we consider them to be working or not**.
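For reference, the end state puppet should converge to on each worker is easy to check by hand. A minimal sketch of that check, with paths under `/tmp` standing in for the real mount so it runs anywhere (the actual target path is whatever profile::wmcs::nfsclient configures, not the placeholder used here):

```shell
# Placeholder layout: /data/project should be a symlink to the NFS mount.
# /tmp paths stand in for the real ones so this sketch is runnable anywhere.
mkdir -p /tmp/nfsdemo/mnt/nfs/tools-project
ln -sfn /tmp/nfsdemo/mnt/nfs/tools-project /tmp/nfsdemo/data-project

# The check the broken workers were failing: the path must be a symlink
# that resolves to an existing mount point.
test -L /tmp/nfsdemo/data-project && readlink /tmp/nfsdemo/data-project
```

On a healthy worker, the equivalent of the last line is `test -L /data/project && readlink /data/project`, and the target should be an NFS mount that actually exists.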
While fixing the instances' connection to puppet, I saw that **`sudo puppet cert list` on tools-puppetmaster-01 is a mess** and contains names that should not even exist: entries ending in `{`, entries beginning with `host-172-16`, one that is literally just `.tools.eqiad.wmflabs`, and one with a double full stop, `tools-worker-1005..eqiad.wmflabs`.
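A quick way to enumerate the obviously bogus entries would be to filter the cert names against the broken patterns above. A hedged sketch over a sample list (in reality the input would be `sudo puppet cert list --all` output; the `tools-sgebastion-07{` name is a made-up example of the trailing-`{` pattern, not a real entry):

```shell
# Sample cert names modeled on the broken patterns seen on
# tools-puppetmaster-01, plus one legitimate-looking name for contrast.
cat > /tmp/certnames.txt <<'EOF'
tools-worker-1005..eqiad.wmflabs
host-172-16-1-2.tools.eqiad.wmflabs
.tools.eqiad.wmflabs
tools-sgebastion-07{
tools-k8s-worker-6.tools.eqiad.wmflabs
EOF

# Flag names that start with a dot or host-172-16, contain '..', or end in '{'.
grep -E '^\.|^host-172-16|\.\.|\{$' /tmp/certnames.txt
```

Each confirmed-bogus name could then be removed with `sudo puppet cert clean <name>`, though every candidate should be checked against real hosts before deleting anything.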
We should also determine **how it was possible for an instance to register itself as a node in k8s, and be considered healthy, without even having working puppet**.