I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
Puppet runs ok again.
Got it. Prod is using the hiera() function, while lookup() has different syntax. Fixing.
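For reference, a minimal sketch of the syntax difference; the key name and default value here are made up, not the ones from the manifest:

# old style: hiera('profile::example::setting', 'fallback')
# new style: lookup('profile::example::setting', String, 'first', 'fallback')
# quick way to check how a key resolves on a puppetmaster:
sudo puppet lookup --explain profile::example::setting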
Why doesn't this bite production, though? Do they have a custom lookup function?
Our puppetmaster versions are the same, and that material was pulled from modules/standard/manifests/diamond.pp, line 22. I wonder why it would be a problem on the VMs in our module?
I bet that's from T218365
At this point, we have good evidence that pages are working on the "secondary" cluster. I also checked the resulting configs, and I think this confirms that the dumps servers have it enabled as well (see below).
If we started right now, it seems that even Queens would be a risky release, since Extended Maintenance does not seem to guarantee security fixes (only that community members decided to keep it up on a best-effort basis). With our current capacity, it seems that Rocky or even Stein would yield more future-proof results.
Wed, Mar 20
Are we seeing "[21075.544262] nfsd: last server has exited, flushing export cache" coming back? I haven't seen that one.
[21835.538290] rpc-srv/tcp: nfsd: sent only 256984 when sending 1048640 bytes - shutting down socket
This error predates my time at the foundation. It may relate to the usage of tc, which relies on packet dropping.
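For context, a shaping qdisc of the kind I mean enforces its rate limit by queueing and then dropping packets; a rough sketch (interface name and numbers are hypothetical, not what the NFS servers actually run):

sudo tc qdisc add dev eth0 root tbf rate 500mbit burst 64kb latency 50ms
sudo tc -s qdisc show dev eth0    # the "dropped" counter here shows discarded packets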
The patch is deployed across all replicas.
Tue, Mar 19
On looking, it doesn't seem possible through this mechanism. Another one will be needed for this.
Except that NFS may not be manageable via blkio restrictions. I will verify whether that is true and close the ticket if so.
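The doubt is about how blkio throttling is keyed; a sketch of the mechanism (cgroup name and numbers are made up):

# cgroup v1 blkio limits are set per local block device by major:minor, e.g. 8:16 for /dev/sdb:
echo "8:16 10485760" > /sys/fs/cgroup/blkio/nfs-limited/blkio.throttle.write_bps_device
# An NFS mount has no local block device on the client, and nfsd on the server runs as kernel
# threads rather than cgroup-managed processes, so it is unclear where such a limit could attach.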
This happened over the weekend on Stretch hosts as well, and then the next puppet run resolved it. It was weird.
I'm so glad it makes more sense now. :)
Here, we see the load rise from the puppet run, but it never really gets bad (this is a 64-core system; over 65 would be bad). I don't think load is the issue there--at least not load average; actual network congestion and the like could be a problem. Network was lower than it often is, of course. There was a jump in /dev/sdb's wait, which is the tools volume (/dev/sdb tools lvm2 a-- 9.09t 1.09t). However, the problem was still localized to some servers and only measurable in milliseconds. That shouldn't explain the problem unless the stretch kernel is insanely sensitive--and only selectively so, since not all stretch clients experienced the brief outage.
In all fairness, not all stretch hosts show the NFS issue. It's just some of them.
I see no NFS errors on Trusty, and LDAP is faring better there too.
I checked the NFS server; the old error I cleared up yesterday didn't occur at the time this server lost the mount.
This happened last week too, but I only noticed it then because of the load numbers on tools-sgecron-01 (only a few crons were impacted). Almost all NFS was working well except for a couple of directories (which is bizarre). It required a reboot. If the affected directory includes one of the grid directories used by gridengine, the grid engine client dies--which has mostly affected exec nodes rather than submit nodes.
Mon, Mar 18
I've removed the folder contents and disabled that tool. We will be tracking NFS issues in general. The LDAP problems we are experiencing currently could be part of the problem as well, but I'm hoping that clearing up that NFS error will clear up some more things.
Sun, Mar 17
After a hard reboot, I was able to get it to run puppet, but I was surprised at how many files it thought needed changes (for the most part the changes aren't actual content changes, for that matter). P8212
tools-worker-1018:~$ sudo dpkg --configure -a
Setting up initramfs-tools (0.120+deb8u3) ...
update-initramfs: deferring update (trigger activated)
Processing triggers for initramfs-tools (0.120+deb8u3) ...
update-initramfs: Generating /boot/initrd.img-4.9.0-0.bpo.6-amd64
Emailed the tool owner. There are a lot of crons and at least one service. I'll disable it all, delete the folder, and recreate it tomorrow, but I'm hoping the maintainer will be able to fix it in a more sensible way, since I cannot create a reasonable backup of the data there.
Tar succeeds in building a file when run from the server, but I suspect the directory contents keep changing, since it has been running for hours and hasn't finished. I'm stopping the tar.
Sat, Mar 16
Trying to tar up that directory then recreate it.
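Roughly the plan, as a sketch (the archive path is arbitrary, and the mode and ownership are taken from the stat output further down; not necessarily the exact commands run):

cd /data/project/liangent-php
sudo tar -czf /root/liangent-php-mw-log.tar.gz mw-log    # archive the directory first
sudo rm -rf mw-log                                       # then remove it
sudo install -d -m 2755 -o tools.liangent-php -g tools.liangent-php mw-log    # recreate with the same owner/group/mode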
It looks like the php_cleanupILH_DOM_arwiki job in that project has been stuck since September. I'm not sure if that's the problem...
  File: ‘mw-log’
  Size: 596557824   Blocks: 1165256   IO Block: 1048576   directory
Device: 26h/38d   Inode: 236585023   Links: 2
Access: (2755/drwxr-sr-x)  Uid: (51117/tools.liangent-php)   Gid: (51117/tools.liangent-php)
Access: 2016-11-14 19:29:20.165239239 +0000
Modify: 2019-03-16 22:40:21.263693712 +0000
Change: 2019-03-16 22:40:21.263693712 +0000
 Birth: -
@liangent /data/project/liangent-php/mw-log appears to have too many subdirectories. I am seeing repeated filesystem errors on a fairly regular schedule. Something is creating these, potentially a cron job?
That inode is /srv/tools/shared/tools/project/liangent-php/mw-log
Running a find for that inode number. It's always the same one and is likely a directory...
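Something along these lines, using the inode number from the stat output in this thread (-xdev keeps the search on one filesystem; -inum matches by inode number):

sudo find /srv/tools/shared/tools/project -xdev -inum 236585023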
I am seeing repeats of this error:
We found a bunch of these last week after the NFS filesystem trouble. Looks like there was another blip :(
Fri, Mar 15
Ok, now they stay submit hosts.
That bit looks configurable (I haven't found @aborrero's test machine yet to play with it): https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sssd-user-ids
Ok, it looks like making that patch work correctly on our heavily pinned Jessie-Mitaka setup requires the python3 packages to be pinned properly in modules/openstack/manifests/clientpackages/mitaka/jessie.pp.
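For illustration, the sort of pin I mean, shown here by hand rather than via puppet; the package list, release target, and priority are examples only, not the actual values:

cat <<'EOF' | sudo tee /etc/apt/preferences.d/mitaka-python3
Package: python3-keystoneclient python3-novaclient
Pin: release n=jessie-backports
Pin-Priority: 1001
EOF
sudo apt-get update && apt-cache policy python3-novaclient    # confirm the pin took effect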
Never mind, I see that this just requires a pin to make it right. I'll note that on the other ticket.
Had to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/496863 because of a dependency issue on the baremetal servers running this kind of mixed setup.
I've exported pintoch.bz2, tobias47n9e.bz2, u_halfak.bz2, u_shiladsen.bz2, and wikimaps_atlas.bz2 from labsdb1004. Only u_shiladsen has a lot of data in it. I figure that way I should be able to copy them over and import them if you want. Otherwise, perhaps they are just things to archive somewhere?
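For the record, the shape of the round trip I have in mind; the flags here are assumptions, not the exact invocation used on labsdb1004:

mysqldump --single-transaction u_shiladsen | bzip2 -c > u_shiladsen.bz2    # export one database
bzcat u_shiladsen.bz2 | mysql u_shiladsen    # import on the destination after creating the empty database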
Thu, Mar 14
Ah, it's already acked :)
Yeah, it's out of service for T217473
Wed, Mar 13
Tue, Mar 12
Linking in recent pages (alerts) and even outages caused by log files in excess of 1 TB. There are far more of those.
It's just that and any activity remaining for T216441.
For record-keeping purposes: we've noticed through this task and T216988 that the stretch grid is especially sensitive to NFS issues, whereas the Trusty grid is more prone to a brief hang that goes almost unnoticed before recovering. The jump in kernel versions, which changes both the default mount options and how they are interpreted, is likely to blame, but nothing has been implemented yet to reduce this sensitivity.
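If we do try to reduce that sensitivity, the knobs are the NFS mount options; a hypothetical example of the sort of settings to compare between the Trusty and Stretch defaults (the option values, export path, and mount point are made up):

sudo mount -t nfs -o vers=4,hard,timeo=600,retrans=3 nfs-tools-project.svc.eqiad.wmnet:/project/tools /mnt/nfs-tools
nfsstat -m    # shows the options each existing mount actually resolved to, defaults included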
Mon, Mar 11
Mar 11 14:53:01 tools-sgegrid-master kernel: [1224330.183189] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532427] sge_qmaster D 0 2938 1 0x00000000
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532434] ffff88eeb5504580 0000000000000000 ffff88eeb63ea200 ffff88eebfd98980
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532438] ffff88eeb637a380 ffff9848c386bb90 ffffffff820144b9 ffffffff81aac9e2
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532441] 00000000000bd815 ffff88eebfd98980 ffffffff82019364 ffff88eeb63ea200
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532445] Call Trace:
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532457] [<ffffffff820144b9>] ? __schedule+0x239/0x6f0
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532464] [<ffffffff81aac9e2>] ? update_cfs_rq_load_avg+0x212/0x490
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532469] [<ffffffff82019364>] ? __switch_to_asm+0x34/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532472] [<ffffffff82015170>] ? bit_wait+0x50/0x50
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532476] [<ffffffff820149a2>] ? schedule+0x32/0x80
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532478] [<ffffffff82017d4d>] ? schedule_timeout+0x1dd/0x380
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532481] [<ffffffff82019364>] ? __switch_to_asm+0x34/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532483] [<ffffffff82019370>] ? __switch_to_asm+0x40/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532485] [<ffffffff82019364>] ? __switch_to_asm+0x34/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532487] [<ffffffff82019370>] ? __switch_to_asm+0x40/0x70
Sun, Mar 10
If things still seem high on Tuesday, we could, perhaps, reopen and create more subtasks to clean up tools.
That leaves tools looking pretty good.
Since essentially all of the space appears to be taken up by /data/project/bookworm/bookworm.out, I'm truncating that.
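That is, something like the following, so the file shrinks in place and anything still holding it open keeps a valid handle (from memory, not the exact invocation):

sudo truncate -s 0 /data/project/bookworm/bookworm.out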