Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (60 w, 3 d)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Yesterday

GTirloni awarded T218959: Puppet runs failing on tools-sgebastion-07.tools.eqiad.wmflabs a Love token.
Thu, Mar 21, 11:39 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm moved T216753: Document ToolsDB failover process for Clouddb Admins from Doing to Important on the cloud-services-team (Kanban) board.
Thu, Mar 21, 11:17 PM · Data-Services, cloud-services-team (Kanban)
Bstorm closed T218959: Puppet runs failing on tools-sgebastion-07.tools.eqiad.wmflabs as Resolved.

Puppet runs ok again.

Thu, Mar 21, 10:46 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218959: Puppet runs failing on tools-sgebastion-07.tools.eqiad.wmflabs.

Got it. Prod is using the hiera function. Lookup has different syntax. Fixing.

Thu, Mar 21, 10:22 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218959: Puppet runs failing on tools-sgebastion-07.tools.eqiad.wmflabs.

Why doesn't this bite production, though? Do they have a custom lookup function?

Thu, Mar 21, 10:20 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218959: Puppet runs failing on tools-sgebastion-07.tools.eqiad.wmflabs.

https://stackoverflow.com/questions/54133388/difference-between-puppet-hiera-and-lookup-in-puppet-module-code ?

Thu, Mar 21, 10:19 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218959: Puppet runs failing on tools-sgebastion-07.tools.eqiad.wmflabs.

Our puppetmaster versions are the same, and that material was pulled from modules/standard/manifests/diamond.pp, line 22. Wonder why it would be a problem on the VMs in our module?

Thu, Mar 21, 10:18 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218959: Puppet runs failing on tools-sgebastion-07.tools.eqiad.wmflabs.

I bet that's from T218365

Thu, Mar 21, 10:11 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm closed T217068: Investigate why Cloud Services was not paged when the tools filesystem on NFS went critical as Resolved.
Thu, Mar 21, 6:26 PM · Patch-For-Review, monitoring, cloud-services-team (Kanban)
Bstorm triaged T218925: Investigate or create alerting/messaging around when an NFS filesystem is ready for a cleanup as Normal priority.
Thu, Mar 21, 6:26 PM · monitoring, cloud-services-team (Kanban)
Bstorm added a comment to T217068: Investigate why Cloud Services was not paged when the tools filesystem on NFS went critical.

At this point, we've had good evidence that pages are working on the "secondary" cluster. I also checked the configs resulting from things and think this confirms that the dumps servers have the stuff enabled as well (see below).

Thu, Mar 21, 6:22 PM · Patch-For-Review, monitoring, cloud-services-team (Kanban)
Bstorm added a comment to T217086: Investigate why the new Son of Grid Engine grid landed in a worse state when NFS was filled than the old Sun Grid Engine grid did.

More mentions for linking things T169290, T169281, T203254 -- so far, the story runs like: all stretch kernels (acquired through backport or upgrade to stretch) seem to hate our NFS setup.

Thu, Mar 21, 5:50 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
Bstorm added a parent task for T203254: labstore1004 and labstore1005 high load issues following upgrades: T169290: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS.
Thu, Mar 21, 5:48 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a subtask for T169290: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS: T203254: labstore1004 and labstore1005 high load issues following upgrades.
Thu, Mar 21, 5:48 PM · Upstream, cloud-services-team, Operations
Bstorm lowered the priority of T209396: postgresql on clouddb1002 needs some kind of puppet management of pg_hba.conf from Normal to Low.
Thu, Mar 21, 5:33 PM · Scoring-platform-team, Data-Services, cloud-services-team (Kanban), Wikilabels
Bstorm renamed T209396: postgresql on clouddb1002 needs some kind of puppet management of pg_hba.conf from postgresql on labsdb1004 needs some kind of puppet management of pg_hba.conf to postgresql on clouddb1002 needs some kind of puppet management of pg_hba.conf.
Thu, Mar 21, 5:33 PM · Scoring-platform-team, Data-Services, cloud-services-team (Kanban), Wikilabels
Bstorm added a comment to T212302: CloudVPS: upgrade: jessie -> stretch & mitaka -> newton.

If we started right now, it seems that even Queens would be a risky release, since Extended Maintenance does not seem to ensure security fixes (only that community members decided to keep it going on a best-effort basis). With our current capacity, it seems that Rocky or even Stein would yield more future-proof results.

Thu, Mar 21, 4:51 PM · Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban)
Bstorm awarded T214512: tools-prometheus can't connect to tools-worker-* on port 10255 for Kubernetes metrics a Party Time token.
Thu, Mar 21, 3:46 AM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)

Wed, Mar 20

Bstorm added a comment to T169281: Labstore nfsd processes report "sent only x when sending y bytes - shutting down socket".

Are we seeing [21075.544262] nfsd: last server has exited, flushing export cache coming back? I haven't seen that one.

Wed, Mar 20, 2:39 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T169281: Labstore nfsd processes report "sent only x when sending y bytes - shutting down socket".

[21835.538290] rpc-srv/tcp: nfsd: sent only 256984 when sending 1048640 bytes - shutting down socket
This error predates my time at the foundation. It may relate to the usage of tc, which relies on packet dropping.

Wed, Mar 20, 2:37 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

The patch is deployed across all replicas.

Wed, Mar 20, 4:44 AM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)

Tue, Mar 19

Bstorm updated subscribers of T216749: Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready.
Tue, Mar 19, 10:25 PM · Operations, decommission, Data-Services, cloud-services-team (Kanban)
Bstorm updated subscribers of T216749: Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready.

This should only be stalled on T216441 at this point. @jcrespo and @Marostegui, if that is ready to close, then we can begin this ticket.

Tue, Mar 19, 10:23 PM · Operations, decommission, Data-Services, cloud-services-team (Kanban)
Bstorm closed T218720: Add a blkio restriction to bastion cgroups as Invalid.

On looking, it doesn't seem possible through this mechanism. Another one will be needed for this.

Tue, Mar 19, 10:11 PM · Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm closed T218720: Add a blkio restriction to bastion cgroups, a subtask of T210098: Port cgroup restrictions and definitions to systemd/stretch for bastions, as Invalid.
Tue, Mar 19, 10:11 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T218720: Add a blkio restriction to bastion cgroups.

Except that NFS may not be maintainable via blkio restrictions. Will verify if that is true and close the ticket if so.

Tue, Mar 19, 7:00 PM · Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm triaged T218720: Add a blkio restriction to bastion cgroups as Normal priority.
Tue, Mar 19, 6:59 PM · Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T218494: `become`, `crontab` et al missing from Trusty hosts.

This happened over the weekend on Stretch hosts as well, and then the next puppet run resolved it. It was weird.
I'm so glad it makes more sense now. :)

Tue, Mar 19, 4:03 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218649: Grid jobs don't run as of Tue, March 19, 2019.

Here, we see the load rise from the puppet run, but it never really gets bad (this is a 64 core system, over 65 = bad). I don't think load (at least not load average--actual network congestion and such could be a problem) is the issue from that. Network was lower than it often is, of course. There was a jump on /dev/sdb's wait, which is the tools volume ( /dev/sdb tools lvm2 a-- 9.09t 1.09t). However, the problem was still localized to some servers and only measurable in milliseconds. That shouldn't explain the problem unless the stretch kernel is insanely sensitive--and only selectively, since not all stretch clients experienced the brief outage.

Tue, Mar 19, 3:51 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218649: Grid jobs don't run as of Tue, March 19, 2019.

In all fairness, not all stretch hosts show the NFS issue. It's just some of them.

Tue, Mar 19, 3:42 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218649: Grid jobs don't run as of Tue, March 19, 2019.

I see no NFS errors on Trusty, and LDAP is proceeding more easily there.

Tue, Mar 19, 3:39 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218649: Grid jobs don't run as of Tue, March 19, 2019.

Checked the nfs server, the old error I cleared up yesterday didn't happen at the time this server lost mount.

Tue, Mar 19, 3:37 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218649: Grid jobs don't run as of Tue, March 19, 2019.

This happened last week, but I noticed it then because of the load numbers on tools-sgecron-01 alone (few crons were impacted). Almost all NFS was working well except a couple directories (which is bizarre). It required a reboot. If the directory affected includes one of the grid directories used by gridengine, the grid engine client dies--which has mostly affected exec nodes and not submit nodes.

Tue, Mar 19, 3:18 PM · cloud-services-team (Kanban), Toolforge

Mon, Mar 18

Bstorm added a comment to T218038: NFS issue affecting Toolforge SGE master.

Also for record keeping purposes, it is possible that NFS issues are related to T217280. It could also be related to what I took care of here: T218486#5030023

Mon, Mar 18, 10:34 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218486: Grid jobs stuck on host.

I've removed the folder contents and disabled that tool. We will be tracking NFS issues in general. The LDAP problems we are experiencing currently could be part of the problem as well, but I'm hoping that clearing up that NFS error will clear up some more things.

Mon, Mar 18, 10:32 PM · Toolforge
Bstorm added a subtask for T218038: NFS issue affecting Toolforge SGE master: T218486: Grid jobs stuck on host.
Mon, Mar 18, 10:31 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a parent task for T218486: Grid jobs stuck on host: T218038: NFS issue affecting Toolforge SGE master.
Mon, Mar 18, 10:31 PM · Toolforge

Sun, Mar 17

Bstorm added a comment to T218514: puppet breakage on Jessie tools nodes (and probably on Jessie VMs everywhere).

After a hard reboot, I was able to get it to run puppet, but I was surprised at how many files it thought needed changes (for the most part the changes aren't actual content for that matter). P8212

Sun, Mar 17, 5:31 PM · cloud-services-team (Kanban)
Bstorm created P8212 (An Untitled Masterwork).
Sun, Mar 17, 5:30 PM
Bstorm added a comment to T218514: puppet breakage on Jessie tools nodes (and probably on Jessie VMs everywhere).

tools-worker-1018:~$ sudo dpkg --configure -a
Setting up initramfs-tools (0.120+deb8u3) ...
update-initramfs: deferring update (trigger activated)
Processing triggers for initramfs-tools (0.120+deb8u3) ...
update-initramfs: Generating /boot/initrd.img-4.9.0-0.bpo.6-amd64

Sun, Mar 17, 5:07 PM · cloud-services-team (Kanban)
Bstorm added a comment to T218486: Grid jobs stuck on host.

Emailed the tool owner. There's a lot of crons and at least one service. I'll disable it all, delete the folder and recreate it tomorrow, but I'm hoping the maintainer will be able to fix it in a more sensible way since I cannot create a reasonable backup of the data there.

Sun, Mar 17, 4:47 PM · Toolforge
Bstorm added a comment to T218486: Grid jobs stuck on host.

Tar is succeeding in building a file when running from the server, but I suspect the directory contents are changing since it has been running for hours and hasn't finished. I'm stopping the tar.

Sun, Mar 17, 2:52 AM · Toolforge

Sat, Mar 16

Bstorm added a comment to T218486: Grid jobs stuck on host.

Trying to tar up that directory then recreate it.

Sat, Mar 16, 11:35 PM · Toolforge
Bstorm added a comment to T218486: Grid jobs stuck on host.

It looks like the php_cleanupILH_DOM_arwiki job in that project has been stuck since Sept. I'm not sure if that's the problem...

Sat, Mar 16, 11:30 PM · Toolforge
Bstorm removed a project from T216747: Decommission outdated and risky hardware: Tracking.
Sat, Mar 16, 11:13 PM · cloud-services-team (Kanban), Epic
Bstorm removed a project from T216218: Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure: Tracking.
Sat, Mar 16, 11:12 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T218486: Grid jobs stuck on host.
  File: ‘mw-log’
  Size: 596557824       Blocks: 1165256    IO Block: 1048576 directory
Device: 26h/38d Inode: 236585023   Links: 2
Access: (2755/drwxr-sr-x)  Uid: (51117/tools.liangent-php)   Gid: (51117/tools.liangent-php)
Access: 2016-11-14 19:29:20.165239239 +0000
Modify: 2019-03-16 22:40:21.263693712 +0000
Change: 2019-03-16 22:40:21.263693712 +0000
 Birth: -
Sat, Mar 16, 11:01 PM · Toolforge
Bstorm updated subscribers of T218486: Grid jobs stuck on host.

@liangent /data/project/liangent-php/mw-log appears to have too many subdirectories. I am seeing repeated filesystem errors on a fairly regular schedule. Something is creating these, potentially a cron job?

Sat, Mar 16, 10:33 PM · Toolforge
Bstorm added a comment to T218486: Grid jobs stuck on host.

That inode is /srv/tools/shared/tools/project/liangent-php/mw-log

Sat, Mar 16, 10:11 PM · Toolforge
Bstorm added a comment to T218486: Grid jobs stuck on host.

Running a find for that inode number. It's always the same one and is likely a directory...
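
For reference, a search like this can be sketched in Python. This is a minimal illustration only, not the command that was actually run; the function name find_by_inode and its arguments are hypothetical. Inode numbers are only unique within one filesystem, so the walk should start inside the filesystem that reported the error.

```python
import os

def find_by_inode(root, inode):
    """Return every path below root whose inode number equals inode."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat inspects the entry itself and does not follow symlinks.
                if os.lstat(path).st_ino == inode:
                    matches.append(path)
            except OSError:
                # The entry vanished or is unreadable; skip it.
                continue
    return matches
```

On a very large tree this walk can take a long time, which is consistent with the search being left to run in the background.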

Sat, Mar 16, 10:03 PM · Toolforge
Bstorm added a comment to T218486: Grid jobs stuck on host.

I am seeing repeats of this error:

Sat, Mar 16, 9:57 PM · Toolforge
Bstorm added a comment to T218486: Grid jobs stuck on host.

We found a bunch of these last week after the NFS filesystem filled up. Looks like there was another blip :(

Sat, Mar 16, 9:51 PM · Toolforge

Fri, Mar 15

Bstorm closed T216992: Depool procedure doesn't work in SGE cluster as Resolved.

Ok, now they stay submit hosts.

Fri, Mar 15, 11:01 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T218126: LDAP: try how sssd works with our servers.

That worked:

Fri, Mar 15, 10:19 PM · cloud-services-team (Kanban)
Bstorm awarded T218126: LDAP: try how sssd works with our servers a Love token.
Fri, Mar 15, 9:40 PM · cloud-services-team (Kanban)
Bstorm added a comment to T218126: LDAP: try how sssd works with our servers.

That bit looks configurable (haven't found @aborrero 's test machine yet to play with it) https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sssd-user-ids

Fri, Mar 15, 9:40 PM · cloud-services-team (Kanban)
Bstorm added a comment to T218423: Add python 3 packages to openstack::clientpackages::common.

Ok, it looks like to make that patch work correctly on our heavily-pinned Jessie-Mitaka setup, the python3 packages need to be pinned properly in modules/openstack/manifests/clientpackages/mitaka/jessie.pp

Fri, Mar 15, 9:28 PM · cloud-services-team (Kanban), Cloud-VPS
Bstorm closed T216497: CloudVPS: workaround archival of jessie-backports repo as Resolved.

Never mind, I see that this just requires a pin to make it right. I'll note that on the other ticket.

Fri, Mar 15, 9:26 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
Bstorm closed T216497: CloudVPS: workaround archival of jessie-backports repo, a subtask of T212302: CloudVPS: upgrade: jessie -> stretch & mitaka -> newton, as Resolved.
Fri, Mar 15, 9:25 PM · Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban)
Bstorm reopened T216497: CloudVPS: workaround archival of jessie-backports repo as "Open".
Fri, Mar 15, 9:25 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
Bstorm reopened T216497: CloudVPS: workaround archival of jessie-backports repo, a subtask of T212302: CloudVPS: upgrade: jessie -> stretch & mitaka -> newton, as Open.
Fri, Mar 15, 9:24 PM · Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban)
Bstorm reopened T216497: CloudVPS: workaround archival of jessie-backports repo, a subtask of T216711: Audit our puppet tree for uses of jessie-backports, as Open.
Fri, Mar 15, 9:24 PM · Patch-For-Review, Operations
Bstorm added a comment to T216497: CloudVPS: workaround archival of jessie-backports repo.

Had to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/496863 because of a dependency issue on the baremetal servers running this kind of mixed setup.

Fri, Mar 15, 9:24 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T217922: Migrate Wikilabels from labsdb1004 to clouddb1002.

I've exported pintoch.bz2, tobias47n9e.bz2, u_halfak.bz2, u_shiladsen.bz2 and wikimaps_atlas.bz2 from labsdb1004. Only u_shiladsen has a lot of data in it. I figure that way I should be able to copy them over and import if you want. Otherwise, perhaps they are just things to archive somewhere?

Fri, Mar 15, 5:40 PM · Scoring-platform-team (Current), Wikilabels, Cloud-VPS

Thu, Mar 14

Bstorm closed T122508: Prevent overly-large log files as Resolved.
Thu, Mar 14, 6:06 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, Toolforge
Bstorm closed T122508: Prevent overly-large log files, a subtask of T126083: overhaul labstore setup [tracking], as Resolved.
Thu, Mar 14, 6:06 PM · Data-Services, Tracking, Operations
Bstorm closed T122508: Prevent overly-large log files, a subtask of T216988: labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%):, as Resolved.
Thu, Mar 14, 6:06 PM · cloud-services-team (Kanban), Data-Services
Bstorm closed T122508: Prevent overly-large log files, a subtask of T217993: 2019-03-10: tools and NFS share cleanup (high usage), as Resolved.
Thu, Mar 14, 6:06 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T217473: labstore1006 spontaneous reboot.

Thanks!

Thu, Mar 14, 4:48 PM · Patch-For-Review, Operations, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T217474: labstore1006 nfsd not started after reboot.

Ah, it's already acked :)

Thu, Mar 14, 4:39 PM · monitoring, cloud-services-team (Kanban)
Bstorm added a comment to T217474: labstore1006 nfsd not started after reboot.

Yeah, it's out of service for T217473

Thu, Mar 14, 4:38 PM · monitoring, cloud-services-team (Kanban)

Wed, Mar 13

Bstorm awarded T218186: sgebastion-07: /usr/local/bin/prometheus-puppet-agent-stats cannot fork a Burninate token.
Wed, Mar 13, 6:15 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)

Tue, Mar 12

Bstorm moved T122508: Prevent overly-large log files from Inbox to Important on the cloud-services-team (Kanban) board.
Tue, Mar 12, 7:04 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, Toolforge
Bstorm added a comment to T122508: Prevent overly-large log files.

Linking in recent pages and even outages due to log files in excess of 1TB. There are far more of those.

Tue, Mar 12, 7:03 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, Toolforge
Bstorm claimed T122508: Prevent overly-large log files.
Tue, Mar 12, 7:02 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, Toolforge
Bstorm added a parent task for T122508: Prevent overly-large log files: T217993: 2019-03-10: tools and NFS share cleanup (high usage).
Tue, Mar 12, 7:00 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, Toolforge
Bstorm added a subtask for T217993: 2019-03-10: tools and NFS share cleanup (high usage): T122508: Prevent overly-large log files.
Tue, Mar 12, 7:00 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a subtask for T216988: labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%):: T122508: Prevent overly-large log files.
Tue, Mar 12, 7:00 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a parent task for T122508: Prevent overly-large log files: T216988: labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%):.
Tue, Mar 12, 7:00 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, Toolforge
Bstorm placed T218141: Spin up virtualized NFS server strictly for Grid Engine database and management up for grabs.
Tue, Mar 12, 6:50 PM · cloud-services-team (Kanban), Toolforge
aborrero awarded T218139: Develop or expand grid troubleshooting playbook a Love token.
Tue, Mar 12, 6:48 PM · cloud-services-team (Kanban), Toolforge
Bstorm triaged T218141: Spin up virtualized NFS server strictly for Grid Engine database and management as High priority.
Tue, Mar 12, 6:38 PM · cloud-services-team (Kanban), Toolforge
Bstorm lowered the priority of T218139: Develop or expand grid troubleshooting playbook from High to Normal.
Tue, Mar 12, 6:18 PM · cloud-services-team (Kanban), Toolforge
Bstorm triaged T218139: Develop or expand grid troubleshooting playbook as High priority.
Tue, Mar 12, 6:18 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020.

It's just that and any activity remaining for T216441.

Tue, Mar 12, 3:36 PM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Patch-For-Review, Epic, Cloud-VPS
Bstorm added a comment to T218038: NFS issue affecting Toolforge SGE master.

For record-keeping purposes, we've noticed through this task and T216988 that the stretch grid is especially sensitive to NFS issues, whereas the Trusty grid is more prone to a brief hang that goes almost unnoticed and then recovers. The jump in kernel versions, which changes the interpretation of mount options and their defaults, is likely to blame, but nothing has been implemented yet that reduces this sensitivity.

Tue, Mar 12, 3:33 PM · cloud-services-team (Kanban), Toolforge
Bstorm moved T216992: Depool procedure doesn't work in SGE cluster from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, Mar 12, 3:27 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T217838: Toolforge Stretch - Increased LDAP utilization.

As a side note, for the near future I think we should consider using sssd rather than nslcd/nscd. https://en.wikipedia.org/wiki/System_Security_Services_Daemon

Tue, Mar 12, 3:18 PM · LDAP, Toolforge

Mon, Mar 11

Bstorm added a comment to T218038: NFS issue affecting Toolforge SGE master.

On labstore1004:

Mon, Mar 11, 4:07 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T218038: NFS issue affecting Toolforge SGE master.
Mar 11 14:53:01 tools-sgegrid-master kernel: [1224330.183189] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532427] sge_qmaster     D    0  2938      1 0x00000000
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532434]  ffff88eeb5504580 0000000000000000 ffff88eeb63ea200 ffff88eebfd98980
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532438]  ffff88eeb637a380 ffff9848c386bb90 ffffffff820144b9 ffffffff81aac9e2
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532441]  00000000000bd815 ffff88eebfd98980 ffffffff82019364 ffff88eeb63ea200
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532445] Call Trace:
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532457]  [<ffffffff820144b9>] ? __schedule+0x239/0x6f0
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532464]  [<ffffffff81aac9e2>] ? update_cfs_rq_load_avg+0x212/0x490
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532469]  [<ffffffff82019364>] ? __switch_to_asm+0x34/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532472]  [<ffffffff82015170>] ? bit_wait+0x50/0x50
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532476]  [<ffffffff820149a2>] ? schedule+0x32/0x80
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532478]  [<ffffffff82017d4d>] ? schedule_timeout+0x1dd/0x380
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532481]  [<ffffffff82019364>] ? __switch_to_asm+0x34/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532483]  [<ffffffff82019370>] ? __switch_to_asm+0x40/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532485]  [<ffffffff82019364>] ? __switch_to_asm+0x34/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532487]  [<ffffffff82019370>] ? __switch_to_asm+0x40/0x70
Mon, Mar 11, 4:05 PM · cloud-services-team (Kanban), Toolforge

Sun, Mar 10

Bstorm updated the task description for T217999: Prevent iabot from filling the tools project NFS filesystems.
Sun, Mar 10, 11:21 PM · InternetArchiveBot (v2.0), Data-Services, cloud-services-team (Kanban)
Bstorm closed T217993: 2019-03-10: tools and NFS share cleanup (high usage) as Resolved.

If things still seem high on Tuesday, we could, perhaps, reopen and create more subtasks to clean up tools.

Sun, Mar 10, 11:16 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T217993: 2019-03-10: tools and NFS share cleanup (high usage).

That leaves tools looking pretty good.

Sun, Mar 10, 11:15 PM · Data-Services, cloud-services-team (Kanban)
Bstorm closed T208466: bookworm is using 254GB of space as Resolved.
Sun, Mar 10, 11:13 PM · Tools
Bstorm closed T208466: bookworm is using 254GB of space, a subtask of T206239: 2018-10-04: tools and NFS share cleanup (high usage), as Resolved.
Sun, Mar 10, 11:13 PM · cloud-services-team (Kanban)
Bstorm closed T208466: bookworm is using 254GB of space, a subtask of T217993: 2019-03-10: tools and NFS share cleanup (high usage), as Resolved.
Sun, Mar 10, 11:13 PM · Data-Services, cloud-services-team (Kanban)
Bstorm changed the status of T208466: bookworm is using 254GB of space from Stalled to Open.

Since the entire amount of space appears to be from /data/project/bookworm/bookworm.out, truncating that.
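
A truncation like this can be sketched in Python as follows. This is an illustration under assumptions, not the command actually used; the helper name truncate_in_place is hypothetical. Truncating keeps the same inode, so a process that still has the file open can keep writing to it. If that process did not open the file in append mode, its next write lands at the old offset and leaves a sparse file.

```python
import os

def truncate_in_place(path):
    """Cut a file to zero bytes without unlinking it; return the old size in bytes."""
    size = os.path.getsize(path)
    # os.truncate shortens the file in place, so the path and inode stay the same.
    os.truncate(path, 0)
    return size
```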

Sun, Mar 10, 11:13 PM · Tools
Bstorm changed the status of T208466: bookworm is using 254GB of space, a subtask of T206239: 2018-10-04: tools and NFS share cleanup (high usage), from Stalled to Open.
Sun, Mar 10, 11:13 PM · cloud-services-team (Kanban)
Bstorm renamed T208466: bookworm is using 254GB of space from bookworm is using 88GB to bookworm is using 254GB of space.
Sun, Mar 10, 11:12 PM · Tools
Bstorm added a comment to T208466: bookworm is using 254GB of space.

This tool doesn't seem to be maintained. It is currently sporting a file using 254G of space.

Sun, Mar 10, 11:11 PM · Tools