Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (52 w, 1 d)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF)

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Sat, Jan 19

Bstorm closed T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72 as Resolved.

PHP 7.0 has been properly replaced at this point, I think, with 7.2 in the new grid.

Sat, Jan 19, 1:12 AM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm closed T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72, a subtask of T195689: Support PHP 7.x webservices on Toolforge, as Resolved.
Sat, Jan 19, 1:11 AM · Toolforge

Fri, Jan 18

Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

Will do

Fri, Jan 18, 9:35 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

A few need a reinstall, which I'm going to do, but phpunit re-installs 7.0. There's no phpunit in the thirdparty repo, and the stretch version is old and unsupported by the upstream.

Fri, Jan 18, 9:26 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

In fact, I think it better to strip them from the environment and add them deliberately if there is a reason to. I'll do that.

Fri, Jan 18, 9:05 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

I could uninstall php7 libraries where they exist on the bastion at least to prevent issues.

Fri, Jan 18, 9:04 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

The stretch grid nodes now have both php7.0 and php7.2 installed. It is possible this may cause future confusion when adding new nodes, but it looks good for now.

Fri, Jan 18, 9:04 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

Added that to the patch. Now it's only missing a deprecated package, which I'm more ok with.

Fri, Jan 18, 5:43 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

I totally missed the xdebug package! Thanks

Fri, Jan 18, 5:39 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

xdebug is not in the upstream repo either. If we want that, we may have to build it ourselves and put it in the tools repo or something, then?

Fri, Jan 18, 5:18 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

Ahh, that's because mcrypt is deprecated in 7.1. xdebug might still be a thing, though...

Fri, Jan 18, 5:11 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.

Looks like the prod repo is missing:
php-mcrypt
php-xdebug

Fri, Jan 18, 5:09 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm claimed T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72.
Fri, Jan 18, 4:14 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a project to T213666: Upgrade Toolforge grid engine PHP to 7.2 using packages from thirdparty/php72: cloud-services-team (Kanban).
Fri, Jan 18, 1:18 AM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm closed T123270: Make gridengine exec hosts also submit hosts as Resolved.

All new grid exec hosts (stretch) are now submit hosts.

Fri, Jan 18, 1:15 AM · Patch-For-Review, Toolforge
Bstorm closed T123270: Make gridengine exec hosts also submit hosts, a subtask of T199271: Upgrade the tools gridengine system, as Resolved.
Fri, Jan 18, 1:15 AM · Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a comment to T123270: Make gridengine exec hosts also submit hosts.

Since concurrent jobs are now limited by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/toolforge/grid-scheduler-config#6 and by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/toolforge/grid-global-config#43

Fri, Jan 18, 12:09 AM · Patch-For-Review, Toolforge

Thu, Jan 17

Bstorm closed T213183: Set up puppet to handle the global and scheduler configuration of gridengine as Resolved.
Thu, Jan 17, 11:53 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm closed T213183: Set up puppet to handle the global and scheduler configuration of gridengine, a subtask of T67777: Limit number of jobs users can execute in parallel, as Resolved.
Thu, Jan 17, 11:53 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213656: Update documentation or config for new grid around adminhost-only commands.

Added some notes in a few places. Also moved exec-manage to the master on SGE (and fixed it). I think that's it for this one?

Thu, Jan 17, 11:50 PM · cloud-services-team (Kanban)
Bstorm triaged T214106: Creating projects in openstack hangs on CLI as High priority.
Thu, Jan 17, 10:52 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm closed T212573: Request creation of indico VPS project as Resolved.

I've created the project. @JeanFred should see it in https://horizon.wikimedia.org/ now.

Thu, Jan 17, 12:00 AM · Cloud-VPS (Project-requests)
Bstorm closed T212573: Request creation of indico VPS project, a subtask of T210952: Investigate deployment of an Indico instance on the Wikimedia Cloud, as Resolved.
Thu, Jan 17, 12:00 AM · Cloud-VPS

Wed, Jan 16

Bstorm moved T213283: Request creation of wikidata-history-query-service VPS project from Inbox to Approved on the Cloud-VPS (Project-requests) board.
Wed, Jan 16, 11:16 PM · Cloud-VPS (Project-requests)
Bstorm added a comment to T213283: Request creation of wikidata-history-query-service VPS project.

Ok, I set you up with a project in horizon. You should be able to log in and set up an instance. I increased the project RAM so that a bigdisk2 will work.

Wed, Jan 16, 11:15 PM · Cloud-VPS (Project-requests)
Bstorm moved T213623: Request creation of StrangerBot VPS project from Inbox to Discussion needed on the Cloud-VPS (Project-requests) board.
Wed, Jan 16, 10:39 PM · Cloud-VPS (Project-requests)
Bstorm added a comment to T213951: update/fix/something exec-manage for the new sge grid.

I might just make the master and shadow submit hosts. That seems like it might be the best compromise unless I can find a way around how this acts.

Wed, Jan 16, 6:05 PM · Patch-For-Review, cloud-services-team (Kanban)
Legoktm awarded T213711: move tools proxy nodes to eqiad1 an Evil Spooky Haunted Tree token.
Wed, Jan 16, 3:40 AM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), cloud-services-team (Kanban), Goal

Tue, Jan 15

Krenair awarded T213711: move tools proxy nodes to eqiad1 an Evil Spooky Haunted Tree token.
Tue, Jan 15, 11:51 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), cloud-services-team (Kanban), Goal
Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

That said, I've built nfsd-ldap on stretch and have it on my home dir on install1002.
@GTirloni I'm waiting until tomorrow morning to add it to reprepro when I can make sure ops people are around to stop me from doing something stupid. Feel free to take over and grab it from my home dir there before I'm on! I put all the needed build files in my home dir and basically nothing else is in there on that server.

Tue, Jan 15, 11:24 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

Couldn't we just remove these users from ldap? Hmm, that's a different matter though because that's not shell access, that's cloud access.

I think it goes deeper than that, insofar as we don't want any production machines looking to LDAP as authoritative for anything.

Tue, Jan 15, 10:29 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

Couldn't we just remove these users from ldap? Hmm, that's a different matter though because that's not shell access, that's cloud access.

Tue, Jan 15, 10:25 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

That makes sense. nfsd-ldap only works for jessie at the moment, though :-p. Might have to rebuild the package.

Tue, Jan 15, 10:23 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

Most of this seems to have been in the admin module for like four years, though....

Tue, Jan 15, 10:18 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

So the obvious problem is with modules/admin/data/data.yaml

Tue, Jan 15, 10:17 PM · Patch-For-Review, cloud-services-team (Kanban)
GTirloni awarded T213711: move tools proxy nodes to eqiad1 a Yellow Medal token.
Tue, Jan 15, 8:10 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), cloud-services-team (Kanban), Goal

Mon, Jan 14

Bstorm added a comment to T213711: move tools proxy nodes to eqiad1.

Restarted flannel on proxy-04, and it got its flanneld interface up.

Mon, Jan 14, 10:16 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), cloud-services-team (Kanban), Goal
Bstorm added a comment to T213711: move tools proxy nodes to eqiad1.

Set my hosts file to point at 03, and it works like a charm. This is ready to go.

Mon, Jan 14, 10:14 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), cloud-services-team (Kanban), Goal
Bstorm added a comment to T213711: move tools proxy nodes to eqiad1.

Ok, tools-proxy-03 can now reach flannel network nodes in eqiad!
I had to add UDP port 8472 to both ends of the setup.

Mon, Jan 14, 10:01 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), cloud-services-team (Kanban), Goal
Bstorm added a comment to T213711: move tools proxy nodes to eqiad1.

Flannel and kube-proxy both work now, but the connection is still not possible. Checking kube workers.

Mon, Jan 14, 5:12 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), cloud-services-team (Kanban), Goal
Bstorm added a comment to T213711: move tools proxy nodes to eqiad1.

Flannel interface is now there. ferm needed a restart on the flannel etcd servers (and I added a sec group to allow the port across regions)

Mon, Jan 14, 5:03 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), cloud-services-team (Kanban), Goal
Bstorm added a comment to T213711: move tools proxy nodes to eqiad1.

Flannel interface isn't showing up right now:
on tools-proxy-01:

Mon, Jan 14, 4:10 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), cloud-services-team (Kanban), Goal
Bstorm added a comment to T213656: Update documentation or config for new grid around adminhost-only commands.

Leaving a note here that qconf -sql does not work from the new bastion either because of this submit/admin split. As far as I know, qconf -sql is the only way to find out what queues are available for job submission without finding and reading the configuration files.

Mon, Jan 14, 1:30 AM · cloud-services-team (Kanban)
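
The `qconf -sql` call mentioned in the T213656 comment above prints one queue name per line. A minimal Python sketch of wrapping it follows. The helper names and the sample queue names in the usage note are illustrative assumptions, and the call only works on a host where `qconf` is permitted to run.

```python
import subprocess
from typing import List


def parse_queue_list(output: str) -> List[str]:
    """Turn `qconf -sql` output (one queue name per line) into a list."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def list_queues() -> List[str]:
    """Run `qconf -sql` and return the queue names it reports.

    Raises CalledProcessError if qconf refuses the request, which is what
    happens on a host affected by the submit/admin split.
    """
    result = subprocess.run(
        ["qconf", "-sql"], capture_output=True, text=True, check=True
    )
    return parse_queue_list(result.stdout)
```

For example, `parse_queue_list("webgrid-generic\nwebgrid-lighttpd\n")` gives `["webgrid-generic", "webgrid-lighttpd"]`.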

Sun, Jan 13

Bstorm moved T213656: Update documentation or config for new grid around adminhost-only commands from Inbox to Doing on the cloud-services-team (Kanban) board.
Sun, Jan 13, 4:25 PM · cloud-services-team (Kanban)
Bstorm added a parent task for T213656: Update documentation or config for new grid around adminhost-only commands: Unknown Object (Task).
Sun, Jan 13, 4:24 PM · cloud-services-team (Kanban)
Bstorm created T213656: Update documentation or config for new grid around adminhost-only commands.
Sun, Jan 13, 4:23 PM · cloud-services-team (Kanban)

Fri, Jan 11

Bstorm renamed T213183: Set up puppet to handle the global and scheduler configuration of gridengine from Extend grid_configurator.py and puppet to handle the global and scheduler configuration of gridengine to Set up puppet to handle the global and scheduler configuration of gridengine.
Fri, Jan 11, 9:54 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213416: Toolforge outbound root email in eqiad1.

If other root emails come through at this point, we'll be good with things as they are, and that'll be nice (some parts will be baffling, but it'll be nice).

Fri, Jan 11, 5:00 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T213416: Toolforge outbound root email in eqiad1.

For gridengine Toolforge root emails specifically, I may have found a small part of the problem: an encoding issue that might be getting things flagged. If I can fail some jobs badly enough, I should be able to produce some related emails. This won't reflect on other VPSs and other types of Toolforge emails from root that might be affected by issues.

Fri, Jan 11, 4:51 PM · Patch-For-Review, cloud-services-team (Kanban)

Thu, Jan 10

Bstorm moved T203254: labstore1004 and labstore1005 high load issues following upgrades from Doing to Important on the cloud-services-team (Kanban) board.
Thu, Jan 10, 11:33 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T173996: Add code snippets with integrity hashes to CDNJS interface.

Since we now actually do proxy the source instead of hosting the repo, should we close this @bd808 and @Samwilson?

Thu, Jan 10, 11:20 PM · Tools
Bstorm closed T213357: Scale out Son of Grid Engine/Stretch webgrid-lighttpd nodes as Resolved.

Looks good now :)

Thu, Jan 10, 10:48 PM · cloud-services-team (Kanban), Toolforge
Bstorm closed T213357: Scale out Son of Grid Engine/Stretch webgrid-lighttpd nodes, a subtask of T187219: Remove support for Trusty Grid Engine exec hosts, as Resolved.
Thu, Jan 10, 10:48 PM · Cloud-VPS, Epic
Bstorm closed T213355: Scale out Son of Grid Engine/Stretch webgrid-generic nodes as Resolved.

All set for generics

Thu, Jan 10, 6:59 PM · cloud-services-team (Kanban), Toolforge
Bstorm closed T213355: Scale out Son of Grid Engine/Stretch webgrid-generic nodes, a subtask of T187219: Remove support for Trusty Grid Engine exec hosts, as Resolved.
Thu, Jan 10, 6:59 PM · Cloud-VPS, Epic
Bstorm added a comment to T143639: Write a simple script that handles failovering proxies.

https://wikitech.wikimedia.org/w/index.php?title=Portal%3ACloud_VPS%2FAdmin%2FNova-manage&type=revision&diff=1812984&oldid=1812979 😅

Thu, Jan 10, 6:05 PM · cloud-services-team (Kanban), Wikimedia-Incident, Cloud-Services
Bstorm added a comment to T143639: Write a simple script that handles failovering proxies.

In the past, I've found that not removing beforehand can break things. It didn't always work well. My question here also is whether neutron handles that better. I think we should reconsider that edit just slightly to include the remove, just in case. I believe I discovered this on the static server failovers.

Thu, Jan 10, 5:51 PM · cloud-services-team (Kanban), Wikimedia-Incident, Cloud-Services
Bstorm added a comment to T213416: Toolforge outbound root email in eqiad1.

The issue is an email that isn't well formatted. An email from root for cron, for instance.

Thu, Jan 10, 5:41 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T213416: Toolforge outbound root email in eqiad1.

I'll also point out that toolforge is currently whitelisted at the MX. We are discussing the new cluster of toolforge only.

Thu, Jan 10, 4:02 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T213416: Toolforge outbound root email in eqiad1.

Just connecting that change to this task, since this is really what it was hoping to resolve.

Thu, Jan 10, 4:01 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T167412: host-vmem.erb is doing operations that make no sense.

Note that this will have no effect on nodes in the old grid. That would need to be done manually, or perhaps with a one-liner script for a lot of qconfs in the write dir.

Thu, Jan 10, 12:34 AM · cloud-services-team (Kanban), Toolforge, Patch-For-Review
Bstorm added a comment to T213357: Scale out Son of Grid Engine/Stretch webgrid-lighttpd nodes.

Added 24 more nodes. They still need the puppet dance and such.

Thu, Jan 10, 12:26 AM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213355: Scale out Son of Grid Engine/Stretch webgrid-generic nodes.

Started up two servers. Beginning the dance for puppet.

Thu, Jan 10, 12:19 AM · cloud-services-team (Kanban), Toolforge
Bstorm closed T213353: Scale out Son of Grid Engine/Stretch exec nodes as Resolved.

They all have clear, ready status now.

Thu, Jan 10, 12:13 AM · cloud-services-team (Kanban), Toolforge
Bstorm closed T213353: Scale out Son of Grid Engine/Stretch exec nodes, a subtask of T187219: Remove support for Trusty Grid Engine exec hosts, as Resolved.
Thu, Jan 10, 12:13 AM · Cloud-VPS, Epic

Wed, Jan 9

Bstorm added a comment to T213353: Scale out Son of Grid Engine/Stretch exec nodes.

42 stretch exec nodes are up. Working through the puppet cert dances.

Wed, Jan 9, 11:41 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213252: toolscheckerctl fails to stop/start checks.

Yeah. There are a lot of checkers that should never be running.

Wed, Jan 9, 4:17 PM · Toolforge, cloud-services-team (Kanban)
Bstorm closed T212333: Drop several views from ptwiki as Resolved.

Got the lock. This is done on all replica hosts.

Wed, Jan 9, 3:53 PM · User-Banyek, cloud-services-team (Kanban), Data-Services, User-Zoranzoki21
Bstorm closed T212333: Drop several views from ptwiki, a subtask of T211544: Drop FlaggedRevs tables in database for ptwikipedia, as Resolved.
Wed, Jan 9, 3:53 PM · User-Banyek, DBA, User-Zoranzoki21
Bstorm claimed T212333: Drop several views from ptwiki.
Wed, Jan 9, 3:51 PM · User-Banyek, cloud-services-team (Kanban), Data-Services, User-Zoranzoki21
Bstorm added a comment to T212333: Drop several views from ptwiki.

Running with the --clean option

Wed, Jan 9, 3:50 PM · User-Banyek, cloud-services-team (Kanban), Data-Services, User-Zoranzoki21
Bstorm added a comment to T212333: Drop several views from ptwiki.

that's the view updates he means. Will run the update today.

Wed, Jan 9, 3:36 PM · User-Banyek, cloud-services-team (Kanban), Data-Services, User-Zoranzoki21

Tue, Jan 8

Bstorm added a comment to T210693: Create materialized views on Wiki Replica hosts for better query performance.

Well, my end can't use 'em very effectively. It all depends on whether we are going to build it around analytics, then. I seem to recall there's another solution there for them as well? Dropping them would make some of our other tickets a bit easier (but we can also make that script skip around tables if we aren't dropping them).

Tue, Jan 8, 4:07 PM · Patch-For-Review, User-Banyek, Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
Bstorm moved T213183: Set up puppet to handle the global and scheduler configuration of gridengine from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, Jan 8, 3:51 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm triaged T213183: Set up puppet to handle the global and scheduler configuration of gridengine as Normal priority.
Tue, Jan 8, 3:51 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T67777: Limit number of jobs users can execute in parallel.

The grid script that uses puppet's files in NFS only configures most of the functional grid environment outside global and scheduler stuff. Since qconf can take input from files for global and scheduler, it is possible to build files from templates and then smuggle them in with python, like I did for the rest of the grid. Should be pretty easy to extend it like that.

Tue, Jan 8, 3:40 PM · cloud-services-team (Kanban), Toolforge
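
The template-then-`qconf` approach described in the T67777 comment above can be sketched roughly as follows. This is not the actual grid script. The template is abbreviated and hypothetical (a real one would carry every parameter that `qconf -ssconf` prints), the function names are made up for illustration, and the only qconf option used is `-Msconf`, which modifies the scheduler configuration from a file.

```python
import subprocess
import tempfile
from string import Template

# Abbreviated, hypothetical template. A real file would list the full
# scheduler parameter set, not just these two lines.
SCHEDULER_TEMPLATE = Template(
    "schedule_interval  $schedule_interval\n"
    "maxujobs           $maxujobs\n"
)


def render_scheduler_config(values):
    """Fill the template; raises KeyError if a placeholder is missing."""
    return SCHEDULER_TEMPLATE.substitute(values)


def apply_scheduler_config(text, dry_run=True):
    """Write the rendered config to a temp file and hand it to qconf.

    With dry_run=True the command is only returned, not executed. The temp
    file is left on disk so it can be inspected; removing it is up to the
    caller.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as handle:
        handle.write(text)
        path = handle.name
    command = ["qconf", "-Msconf", path]
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

Rendering with `{"schedule_interval": "0:0:15", "maxujobs": 16}` and calling `apply_scheduler_config(...)` with the default `dry_run=True` shows the command that would run without touching the grid.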

Mon, Jan 7

Bstorm added a comment to T67777: Limit number of jobs users can execute in parallel.

I do believe we have decided to leave these limits in place only on the new grid.

Mon, Jan 7, 6:02 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T67777: Limit number of jobs users can execute in parallel.

Ok, so perhaps I can set that to 50 and maxujobs to 16.

Mon, Jan 7, 5:18 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T67777: Limit number of jobs users can execute in parallel.

I am curious if the scheduler will simply dump user jobs into a long tail of qw state if we restrict it a lot. So what I'm thinking of trying is reducing the main grid to 50 first to see what happens and then tighten to 16.

Mon, Jan 7, 3:59 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T67777: Limit number of jobs users can execute in parallel.

I figure the new grid can start with 16 to see how and where that creates problems at least.

Mon, Jan 7, 3:55 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T213039: Using a "_test" database that gets created and torn down for every test.

The thing here is just that you have to keep your db names really clear and separate, obviously :)

Mon, Jan 7, 3:33 PM · cloud-services-team
Bstorm added a comment to T213039: Using a "_test" database that gets created and torn down for every test.

It seems like you could do it using a sensible name for the DB on ToolsDB. The tables there are replicated, which isn't ideal for this purpose, but it depends on how this is used. A db that pops up and is then torn down once in a while seems like it would be fine.

Mon, Jan 7, 3:27 PM · cloud-services-team
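
The create-and-tear-down pattern discussed in the T213039 comments above could look something like this sketch. It assumes a DB-API style connection to a MySQL-compatible server whose `cursor().execute()` accepts plain SQL strings. The `_test` suffix check and the function name are illustrative assumptions meant to keep test databases clearly separate from real ones.

```python
import re
from contextlib import contextmanager

# Only letters, digits and underscores, and the name must end in "_test".
SAFE_TEST_NAME = re.compile(r"^[A-Za-z0-9_]+_test$")


@contextmanager
def temporary_database(connection, name):
    """Create database `name`, yield it, and drop it again afterwards."""
    if not SAFE_TEST_NAME.match(name):
        raise ValueError("refusing to manage a database not ending in _test: %r" % name)
    cursor = connection.cursor()
    cursor.execute("CREATE DATABASE `%s`" % name)
    try:
        yield name
    finally:
        # Runs even when the test body raises, so the database never lingers.
        cursor.execute("DROP DATABASE `%s`" % name)
        cursor.close()
```

Because the name is validated before anything is executed, a typo that points at a non-test database fails with `ValueError` instead of reaching the `DROP DATABASE` statement.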

Sat, Jan 5

Bstorm added a comment to T212981: npm is missing from Stretch image.

🎉 That was something I was sure we were going to hear about. Great!

Sat, Jan 5, 1:05 AM · Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)

Fri, Jan 4

Bstorm added a comment to T203469: Test different NFS mount, export options and methods for failing over and coping with loss of a server.

To put it another way, if you find something cool to fix this using some of the test environment we set up, it'd be great (by simplifying the mounts, finding some hidden options or trying a different DRBD module or some such nonsense), but if CephFS does it better in the course of that testing, let's put the effort there instead. That would be a subtask of this no matter what, just in case it fixed this along with other issues.

Fri, Jan 4, 4:58 AM · cloud-services-team (Kanban)
Bstorm added a comment to T203469: Test different NFS mount, export options and methods for failing over and coping with loss of a server.

The general notion of this ticket (although we've been distracted with the load problems on the tools/project NFS cluster) is that our current failover scheme is fundamentally broken. If you fail over, you have to reboot things just to get them to let go of the bind mounts. On the dumps cluster, failing over happens via symlinks, but the mounts end up broken so badly anyway if something goes down that it's really hard to recover Toolforge.

Fri, Jan 4, 4:48 AM · cloud-services-team (Kanban)
Bstorm added a comment to T210693: Create materialized views on Wiki Replica hosts for better query performance.

I don't think that there honestly is a Cloud wide use case for these tables until we have a solution that keeps them in close sync with the live data. Trying to explain to our end users that there are parallel tables with different names that are up to a month behind the live data will be a customer relations headache.

Fri, Jan 4, 4:19 AM · Patch-For-Review, User-Banyek, Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
Bstorm added a comment to T212333: Drop several views from ptwiki.

@Banyek The script has no real way to work with tables as written. It runs drop view and intentionally avoids changing any tables. Having the experimental materialized table there does break the logic there. This can be worked around if we choose to keep that materialized table around. I imagine we could also just remove the experimental table for now, though, right?

Fri, Jan 4, 3:54 AM · User-Banyek, cloud-services-team (Kanban), Data-Services, User-Zoranzoki21

Thu, Dec 27

Bstorm added a comment to T212360: Create hostnames for old and new Toolforge bastions that make sense.

+1

Thu, Dec 27, 1:14 AM · cloud-services-team (Kanban), Toolforge

Dec 21 2018

Bstorm closed T212390: Basic lighttpd+php webservice fails to run on Stretch grid as Resolved.

Added this rule Ingress IPv4 TCP 1024 - 65535 10.68.16.0/21 to the webserver group which is used for all webgrid nodes (and I think k8s as well). It will be needed during moves of servers to the new region no matter what until we move the proxy server and we have to reverse it.

Dec 21 2018, 5:30 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
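
To make the scope of the ingress rule quoted in the T212390 closing comment above concrete, here is a small sketch using Python's standard `ipaddress` module. The CIDR and port range come from the comment. The function name and the sample addresses in the usage note are illustrative assumptions.

```python
import ipaddress

# Values taken from the rule: Ingress IPv4 TCP 1024 - 65535 10.68.16.0/21
RULE_NETWORK = ipaddress.ip_network("10.68.16.0/21")
RULE_PORTS = range(1024, 65536)  # upper bound is exclusive, so 65535 is included


def rule_allows(source_ip, dest_port):
    """Return True if a TCP connection from source_ip to dest_port matches the rule."""
    return ipaddress.ip_address(source_ip) in RULE_NETWORK and dest_port in RULE_PORTS
```

A /21 spans 2048 addresses, here 10.68.16.0 through 10.68.23.255, so `rule_allows("10.68.17.5", 8080)` is true, while `rule_allows("10.68.24.1", 8080)` and `rule_allows("10.68.17.5", 80)` are both false.
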
Bstorm closed T212390: Basic lighttpd+php webservice fails to run on Stretch grid, a subtask of T212153: Stand up the new sonofgridengine and stretch grid for testing in toolforge, as Resolved.
Dec 21 2018, 5:30 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T212390: Basic lighttpd+php webservice fails to run on Stretch grid.

From eqiad1-r:

Dec 21 2018, 5:20 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T212390: Basic lighttpd+php webservice fails to run on Stretch grid.

We had the wrong proxy node set for the new grid in hiera. Fixed that. Looks MUCH better.

Dec 21 2018, 5:08 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T212390: Basic lighttpd+php webservice fails to run on Stretch grid.

identd seems to work from the proxy to the exec node...

Dec 21 2018, 4:17 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T212390: Basic lighttpd+php webservice fails to run on Stretch grid.

Found the python script that is supposed to record that information in redis then reply with 'ok'. Looking for some kind of log or something from it.

Dec 21 2018, 3:45 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T212390: Basic lighttpd+php webservice fails to run on Stretch grid.

Inserted some debugging print statements on the exec node and found that the job fails after connecting and sending the port info.

Dec 21 2018, 3:24 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T212390: Basic lighttpd+php webservice fails to run on Stretch grid.

Also note that once the queues have errored, they won't re-run the job. They acquire an E status that won't clear even if you reboot the instance, so we can get faked out by that blocking a run when we fix things. qmod -c '*' will clear all errors--that caused me grief last night.

Dec 21 2018, 3:03 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)
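
As a more targeted alternative to the `qmod -c '*'` mentioned in the comment above, the sketch below picks out only the queue instances that show an E state. It assumes the usual `qstat -f` layout in which queue instance lines start with `queue@host` and carry their states in the sixth column. The sample output and host names are made up for illustration, and the commands are only built, not executed.

```python
def errored_queue_instances(qstat_full_output):
    """Return queue instance names whose states column contains 'E'."""
    errored = []
    for line in qstat_full_output.splitlines():
        fields = line.split()
        # Queue instance lines look like "queue@host ..."; the states column
        # is absent (five fields) when the instance has no flags set.
        if len(fields) >= 6 and "@" in fields[0] and "E" in fields[5]:
            errored.append(fields[0])
    return errored


def clear_commands(instances):
    """Build one `qmod -c` command per errored queue instance."""
    return [["qmod", "-c", name] for name in instances]


# Hypothetical sample in the assumed `qstat -f` layout.
SAMPLE = """\
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
task@exec-01.example           BI    0/0/50         0.10     lx-amd64      E
---------------------------------------------------------------------------------
task@exec-02.example           BI    0/3/50         0.42     lx-amd64
"""
```

With the sample above, `errored_queue_instances(SAMPLE)` picks out only `task@exec-01.example`, and `clear_commands` turns that into a single `qmod -c` invocation.
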
Bstorm added a comment to T212390: Basic lighttpd+php webservice fails to run on Stretch grid.

Added 172.16.0.0/21 on port 5669 to the proxy security group for project-proxy. I figure basically anywhere in the proxy mess that is specifically opened for the 10. network is going to need the new region eventually, even if that likely isn't the problem here.

Dec 21 2018, 3:02 PM · Patch-For-Review, Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, cloud-services-team (Kanban)

Dec 19 2018

Bstorm added a comment to T212302: CloudVPS: upgrade: jessie -> stretch & mitaka -> newton.

From the Openstack docs, it looks like we are most likely to break designate, but that shouldn't fail if we do what @chasemp said https://docs.openstack.org/designate/pike/admin/upgrades/newton.html

Dec 19 2018, 4:23 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T212302: CloudVPS: upgrade: jessie -> stretch & mitaka -> newton.

The easy solution for now seems to me to be stretch+newton as the next step (especially as we upgrade all the clients to that anyway), but a nicely HA openstack cluster on k8s using helm could track releases that are actually actively supported upstream for The Future. http://superuser.openstack.org/articles/build-openstack-kubernetes/

Dec 19 2018, 4:17 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T212308: Rerun maintain-views for all tables to drop valid_tag and tag_summary tables.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/480590 is merged and this is ready.

Dec 19 2018, 3:45 PM · cloud-services-team (Kanban)