Yesterday

Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

From the convo here: https://github.com/kubernetes/kubernetes/issues/47695, the issue is that it is recognizing it as a user not as a node, which is strange.
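
For reference, the Kubernetes Node authorizer only treats a credential as a kubelet when it is in the `system:nodes` group and its username has the form `system:node:<nodeName>`. A minimal sketch of that rule follows; the function name is hypothetical and this is not how the API server is implemented, only an illustration of why a bare hostname is handled as an ordinary user.

```python
from typing import Iterable

# Identity conventions documented for the Kubernetes Node authorizer.
NODE_GROUP = "system:nodes"
NODE_USER_PREFIX = "system:node:"


def is_node_identity(username: str, groups: Iterable[str]) -> bool:
    """Return True if the username/groups pair looks like a kubelet identity.

    Hypothetical helper: for x509 client certs, the username comes from the
    certificate CN and the groups from its O fields.
    """
    has_node_name = (
        username.startswith(NODE_USER_PREFIX)
        and len(username) > len(NODE_USER_PREFIX)
    )
    return has_node_name and NODE_GROUP in set(groups)


if __name__ == "__main__":
    host = "toolsbeta-arturo-k8s-worker-1.toolsbeta.eqiad.wmflabs"
    # CN is the bare hostname, no group: treated as a plain user.
    print(is_node_identity(host, []))
    # CN and O follow the node convention.
    print(is_node_identity(NODE_USER_PREFIX + host, [NODE_GROUP]))
```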

Wed, Jun 26, 9:01 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Also have it here: Attempting to register node toolsbeta-arturo-k8s-worker-1.toolsbeta.eqiad.wmflabs
Jun 26 20:27:38 toolsbeta-arturo-k8s-worker-1 kubelet[5431]: E0626 20:27:38.134196 5431 kubelet_node_status.go:92] Unable to register node "toolsbeta-arturo-k8s-worker-1.toolsbeta.eqiad.wmflabs" with API server: nodes is forbidden: User "toolsbeta-arturo-k8s-worker-1.toolsbeta.eqiad.wmflabs" cannot create resource "nodes" in API group "" at the cluster scope

Wed, Jun 26, 8:30 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Much more progress. It's not trying as anonymous anymore because I missed a spot in the config:

Wed, Jun 26, 8:29 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

I manually edited /etc/default/kube-apiserver to add the needed sections, and had to change /lib/systemd/system/kube-apiserver.service because the admission control and admission plugins options are mutually exclusive. Overall, it still seems to reject requests because at least some of them are coming through as system:anonymous, which is still somewhat mysterious.
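
A rough way to catch the flag conflict before restarting the service is to scan the argument list for both styles at once. The sketch below assumes the options in question are `--admission-control` and `--enable-admission-plugins` / `--disable-admission-plugins`; the function name and the sample command line are hypothetical, not the contents of the real unit file.

```python
import shlex
from typing import List

LEGACY_FLAG = "--admission-control"
PLUGIN_FLAGS = ("--enable-admission-plugins", "--disable-admission-plugins")


def conflicting_admission_flags(args: List[str]) -> bool:
    """Return True if both the legacy and the plugin-style flags are present."""
    # Compare on the flag name only, so that a different flag such as
    # --admission-control-config-file is not mistaken for the legacy one.
    names = [arg.split("=", 1)[0] for arg in args]
    return LEGACY_FLAG in names and any(flag in names for flag in PLUGIN_FLAGS)


if __name__ == "__main__":
    # Hypothetical command line, for illustration only.
    cmdline = (
        "kube-apiserver --authorization-mode=Node,RBAC "
        "--admission-control=NodeRestriction "
        "--enable-admission-plugins=NodeRestriction"
    )
    print(conflicting_admission_flags(shlex.split(cmdline)))
```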

Wed, Jun 26, 7:07 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Apparently the api-server needs to be started with --authorization-mode=Node to do this.

Wed, Jun 26, 6:47 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215529: Puppetize/stand up a load balancer for K8s API servers.

I have it up on haproxy :)

Wed, Jun 26, 6:46 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

It may also be the wrong namespace. The kubeconfig is using default rather than something like "kube-system".

Wed, Jun 26, 6:46 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

So I went ahead and created the node by hand running kubectl create -f node1.yaml where the file on the master has the following content:

Wed, Jun 26, 6:45 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.
Jun 26 18:31:31 toolsbeta-arturo-k8s-worker-1 kubelet[860]: E0626 18:31:31.427856     860 kubelet.go:2236] node "toolsbeta-arturo-k8s-worker-1.toolsbeta.eqiad.wmflabs" not found
Wed, Jun 26, 6:32 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Updating the patch with (hopefully) the right username :)

Wed, Jun 26, 6:28 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

That isn't quite correct, what I did there. It's based on prod and correctly authenticates with a cert. However, I think the way we are really trying to do this is more like https://kubernetes.io/docs/reference/access-authn-authz/node/

Wed, Jun 26, 6:24 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.
Jun 26 16:58:17 toolsbeta-arturo-k8s-worker-1 kubelet[30037]: E0626 16:58:17.629217   30037 kubelet_node_status.go:92] Unable to register node "toolsbeta-arturo-k8s-worker-1.toolsbeta.eqiad.wmflabs" with API server: nodes is forbidden: User "toolsbeta-arturo-k8s-worker-1.toolsbeta.eqiad.wmflabs" cannot create resource "nodes" in API group "" at the cluster scope
Wed, Jun 26, 4:59 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Set the file to look like:

Wed, Jun 26, 4:55 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

If we let Kubernetes do it, it would be like https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/

Wed, Jun 26, 4:45 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T178510: Upgrade puppetmaster on toolsbeta and test, a subtask of T178508: Update VPS puppetmasters to 4.8 or newer, as Resolved.
Wed, Jun 26, 4:39 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm closed T178510: Upgrade puppetmaster on toolsbeta and test as Resolved.

The toolsbeta-puppetmaster-02 is certainly upgraded (is on stretch) and uses puppetdb as well.
I believe this is closeable.

Wed, Jun 26, 4:39 PM · cloud-services-team
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

@aborrero I found the problem we are having getting the first worker node up. https://kubernetes.io/docs/concepts/architecture/nodes/#self-registration-of-nodes

Wed, Jun 26, 4:37 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes

Tue, Jun 25

Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

The above question is aimed at @ayounsi and @faidon.

Tue, Jun 25, 5:34 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

(verified on tools-k8s-master-01, since I forgot to say how I know)

Tue, Jun 25, 5:30 PM · cloud-services-team (Kanban), Toolforge
Bstorm closed T226480: toolforge: puppet issue probably related to puppet-enc as Resolved.

Puppet is working again

Tue, Jun 25, 5:29 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

I pushed out the newer package on all Jessie vms.

Tue, Jun 25, 5:28 PM · cloud-services-team (Kanban), Toolforge
Bstorm closed T199271: Upgrade the tools gridengine system as Resolved.
Tue, Jun 25, 4:50 PM · Cloud-VPS (Ubuntu Trusty Deprecation), Toolforge, Epic, cloud-services-team (Kanban)
Bstorm moved T225320: Ceph Proof of Concept Build and Testing from Doing to Epics on the cloud-services-team (Kanban) board.
Tue, Jun 25, 4:50 PM · Epic, cloud-services-team (Kanban)
Bstorm added a project to T225320: Ceph Proof of Concept Build and Testing: Epic.
Tue, Jun 25, 4:49 PM · Epic, cloud-services-team (Kanban)
Bstorm added a parent task for T207590: Research CephFS as a replacement for NFS: T225320: Ceph Proof of Concept Build and Testing.
Tue, Jun 25, 4:49 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a subtask for T225320: Ceph Proof of Concept Build and Testing: T207590: Research CephFS as a replacement for NFS.
Tue, Jun 25, 4:49 PM · Epic, cloud-services-team (Kanban)
Bstorm added a subtask for T225320: Ceph Proof of Concept Build and Testing: T90364: Test Ceph for instance storage.
Tue, Jun 25, 4:47 PM · Epic, cloud-services-team (Kanban)
Bstorm added a parent task for T90364: Test Ceph for instance storage: T225320: Ceph Proof of Concept Build and Testing.
Tue, Jun 25, 4:47 PM · Wikimedia-Incident, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

So the fun part is: where to put that pin in puppet...

Tue, Jun 25, 2:19 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

Pin to 3.4.2-1+deb8u2 that is.

Tue, Jun 25, 2:14 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

So that package is entirely broken, in other words. We probably will want to pin jessie to using the version before it.

Tue, Jun 25, 2:11 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.
Tue, Jun 25, 2:09 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

Oh that's lovely.

Tue, Jun 25, 2:07 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

Looks like the most likely place for this to fail is a urlopen on http://labs-puppetmaster.wikimedia.org:8100/v1/tools/node/tools-k8s-master-01.tools.eqiad.wmflabs (as an example), or if that comes back with invalid yaml. What's weird is that it currently seems to return valid yaml, but it isn't working nonetheless.

Tue, Jun 25, 1:44 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

Notably puppet-enc will fail if the hostname isn't right. That needs to be noted when we look at updating the DNS names of our VMs.
It'll fail if the name doesn't end in wmflabs or labtests :)
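
To illustrate the constraint, here is a minimal sketch of a suffix check like the one described. The actual puppet-enc code is not shown here, so the function name and the exact matching rule are assumptions; only the two suffixes come from the comment above.

```python
# Suffixes taken from the comment above; everything else is an assumption.
ALLOWED_SUFFIXES = ("wmflabs", "labtests")


def enc_accepts_hostname(fqdn: str) -> bool:
    """Return True if the FQDN ends in one of the suffixes puppet-enc expects."""
    # Tolerate a trailing dot, as in a fully-qualified DNS name.
    return fqdn.rstrip(".").endswith(ALLOWED_SUFFIXES)


if __name__ == "__main__":
    print(enc_accepts_hostname("tools-k8s-master-01.tools.eqiad.wmflabs"))
    # Hypothetical name under a different domain, which would be rejected.
    print(enc_accepts_hostname("tools-k8s-master-01.example.org"))
```

Any rename of VM DNS names would need either names that still pass a check like this or a matching change in puppet-enc.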

Tue, Jun 25, 1:37 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

Something restarted it then: [Tue Jun 25 06:31:52.624395 2019] [mpm_prefork:notice] [pid 17003] AH00171: Graceful restart requested, doing restart

Tue, Jun 25, 1:32 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

Looks like the tools puppetmaster ended up with a bit of a problem this morning

Tue, Jun 25, 1:26 PM · cloud-services-team (Kanban), Toolforge

Mon, Jun 24

Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

I have a mind to experiment with a copy of maintain-kubeusers that speaks x509/RBAC instead of token/ABAC. There are other organizations that use multiple CAs, one for infra and one for users, and the certificates api makes this entirely doable from python. I'll kick that piece a little and see if it makes sense.
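
As a rough illustration of the certificates API side, the sketch below builds a CertificateSigningRequest object as a plain Python dict. It assumes the PEM-encoded CSR was already generated elsewhere, with the tool's username as the CN, since x509 authentication takes the username from the CN and the groups from the O fields. The function name and object name are hypothetical, and the `certificates.k8s.io/v1` version with `signerName` shown here depends on the cluster version (older clusters serve `v1beta1`).

```python
import base64


def build_csr_manifest(name: str, csr_pem: bytes) -> dict:
    """Build a CertificateSigningRequest object for a client certificate.

    Hypothetical helper: csr_pem is a PEM-encoded PKCS#10 request created
    elsewhere; the API expects it base64-encoded in spec.request.
    """
    return {
        "apiVersion": "certificates.k8s.io/v1",  # assumption: depends on cluster version
        "kind": "CertificateSigningRequest",
        "metadata": {"name": name},
        "spec": {
            "request": base64.b64encode(csr_pem).decode("ascii"),
            "signerName": "kubernetes.io/kube-apiserver-client",
            "usages": ["client auth"],
        },
    }


if __name__ == "__main__":
    # Placeholder bytes, not a real CSR.
    placeholder = b"-----BEGIN CERTIFICATE REQUEST-----\n(placeholder)\n-----END CERTIFICATE REQUEST-----\n"
    manifest = build_csr_manifest("tool-example", placeholder)
    print(manifest["kind"], manifest["metadata"]["name"])
```

The resulting dict would still need to be submitted to the API and approved before a signed certificate is available; that part is left out here.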

Mon, Jun 24, 9:47 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

So, based on that data: while it may not be impossible to fill the link, it's extremely unlikely that we will (and we would still love to use jumbo frames). Can we put this on other rows?

Mon, Jun 24, 8:13 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

Ok, that said, I did write that misreading Mbps for Gbps...but what I said is still true! The PoC won't be anywhere near all that, and our full build out is a trickle compared to theoretical limits.

Mon, Jun 24, 7:52 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

I should point out that the PoC will not be capable of doing anywhere near that much IO. That would be what it would look like if we managed to convert the entire cluster to Ceph with a full build out. We would not handle the full buildout with three OSDs because of those numbers above.

Mon, Jun 24, 7:41 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T220836: Guidelines for Rust/Go tools deployment.

+1 on using modules. Since go mod vendor is even an option, and it does some basic hash checking, it seems sensible. I agree it requires recent golang, but previous versions were packaging chaos in general. Since it's usually producing a statically-linked binary, it could be deployed from either a "scratch" docker image (in k8s), which has literally nothing else in it, or a deb package that wouldn't be terribly hard to generate compared to an interpreted executable or dynamically linked binary.

Mon, Jun 24, 5:28 PM · serviceops-radar, Packaging
Bstorm added a comment to T201247: Sporadic puppet failures.

@Andrew this seems...fixed does it not?

Mon, Jun 24, 3:56 PM · cloud-services-team (Kanban), Operations
Bstorm moved T180513: Document wiki-replicas architecture for future automation from Important to Graveyard on the cloud-services-team (Kanban) board.
Mon, Jun 24, 3:54 PM · Documentation, Data-Services, cloud-services-team (Kanban), Cloud-VPS
Bstorm added a comment to T180513: Document wiki-replicas architecture for future automation.

The bug in mariadb is fixed in a future version of mariadb we don't have. I'm going to shuffle this to the graveyard for now. Overall, there isn't much interest in additional automation, yet, because the manual checks are still considered essential for now.

Mon, Jun 24, 3:54 PM · Documentation, Data-Services, cloud-services-team (Kanban), Cloud-VPS
Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

Had a huddle with @JHedden, actually. He'll add his thoughts soon (with some info from our existing monitoring).

Mon, Jun 24, 3:51 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm placed T225833: Get jobutils in the aptly repo for stretch-toolsbeta up for grabs.
Mon, Jun 24, 3:33 PM · cloud-services-team (Kanban)
Bstorm updated the task description for T225320: Ceph Proof of Concept Build and Testing.
Mon, Jun 24, 3:30 PM · Epic, cloud-services-team (Kanban)
Bstorm updated subscribers of T224188: rack/setup/install (3) new osd ceph nodes.
Mon, Jun 24, 2:20 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

Note: there are rate limits that can be set within openstack for this as well...but in some versions, they don't work right at all (they get ignored in some cases https://bugzilla.redhat.com/show_bug.cgi?id=1476830), so this is also something we want to be testing. That won't help back-end stuff, etc. either. It's just a note.

Mon, Jun 24, 3:37 AM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

100% agree with you @faidon, and I appreciate the reply. I'm aiming to avoid any sugar-coating in my assessments of risks until I have more data (especially with 40G uplinks that are widely shared), partly to open conversations and make sure we design carefully. After a bit of time to think about this, I have some more thoughts.

Mon, Jun 24, 3:16 AM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services

Thu, Jun 20

Bstorm closed T225265: Fix labstore checks on cloudstore1008/9 as Resolved.

And when that runs on puppet, I see we are green.

Thu, Jun 20, 5:54 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup.

Tentatively crossing off the registry validation bit because the webhook is deployable. I fully expect to find ways it isn't finished when we are rolling things out in toolsbeta.

Thu, Jun 20, 4:09 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm updated the task description for T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup.
Thu, Jun 20, 4:08 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

@ayounsi Ceph docs are vague at best or tend to ask you to read dissertations eventually. Overall, everything comes back to "test it in your cluster and see". Ceph is capable of saturating 10G links under heavy load (and the private link would be able to saturate during node failures for rebuilds). A 40G link would be harder to saturate, but it is theoretically possible. This is a PoC, so my intent is to break it every which way and put it under test loads. We would certainly want to keep an eye on those links during tests (are you able to point me to where I could do that?).

Thu, Jun 20, 3:37 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services

Wed, Jun 19

Bstorm closed T225038: Add new schema tables as public views in labs as Resolved.

All set!

Wed, Jun 19, 10:00 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Wikidata wb_terms Trailblazing
Bstorm closed T225038: Add new schema tables as public views in labs, a subtask of T224403: Update Labs replication and views to include new schema tables, as Resolved.
Wed, Jun 19, 10:00 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, Wikidata wb_terms Trailblazing
Bstorm added a comment to T225038: Add new schema tables as public views in labs.

It still needs the script run against the DBs before you'll see them in cloud services.

Wed, Jun 19, 9:38 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Wikidata wb_terms Trailblazing
Bstorm added a comment to T225038: Add new schema tables as public views in labs.

Oh! Ok. I see it there on those dbs. So disregard what I said. We can get that out shortly.

Wed, Jun 19, 9:32 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Wikidata wb_terms Trailblazing
Bstorm added a comment to T225038: Add new schema tables as public views in labs.

I mean I just checked and don't see the underlying tables on labsdb1009 (rather specifically). Are they in all wiki dbs or just a couple that I should verify?

Wed, Jun 19, 9:18 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Wikidata wb_terms Trailblazing
Bstorm added a comment to T226088: DNS lookups for nih.gov hosts failing from Cloud VPS/Toolforge.

It works ok from the Foundation's public network servers. Just for notes.

Wed, Jun 19, 8:46 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T225038: Add new schema tables as public views in labs.

So these tables are not yet on the replicas, however the scripts we use to expose views will just skip the tables that aren't there until they are. Validating the patch and all locally, and then I'll merge it. Once merged, it won't do anything until the tables are on the replicas and then WMCS runs the script manually.

Wed, Jun 19, 6:44 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Wikidata wb_terms Trailblazing
Bstorm added a comment to T225038: Add new schema tables as public views in labs.

Ahah! Thank you. It makes sense now. We'll get on it!

Wed, Jun 19, 5:45 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Wikidata wb_terms Trailblazing
Bstorm updated subscribers of T225038: Add new schema tables as public views in labs.
Wed, Jun 19, 5:33 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Wikidata wb_terms Trailblazing
Bstorm updated subscribers of T225038: Add new schema tables as public views in labs.

I have zero context regarding what these tables are from reading back in the tickets. @Bawolff (or someone else? I've been asking you on a lot of these tickets), can I get confirmation that these tables are entirely public without filters?

Wed, Jun 19, 5:33 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Patch-For-Review, Wikidata wb_terms Trailblazing
Bstorm awarded T216132: CloudVPS: create wmcs-vm-fsck script a Burninate token.
Wed, Jun 19, 4:46 PM · Wikimedia-Incident, cloud-services-team (Kanban)
Bstorm closed T217474: labstore1006 nfsd not started after reboot as Resolved.
Wed, Jun 19, 2:54 PM · Patch-For-Review, Data-Services, observability, cloud-services-team (Kanban)
Bstorm added a comment to T217474: labstore1006 nfsd not started after reboot.

Ok, all it did was enable the service, which was the idea. On next reboot it will hopefully not go poorly :)

Wed, Jun 19, 2:54 PM · Patch-For-Review, Data-Services, observability, cloud-services-team (Kanban)

Tue, Jun 18

Bstorm added a comment to T208690: create revision_commentindex.

Can we get anywhere by using the comment_revision view instead of comment? https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas#The_actor_table_seems_really_slow--so_does_comment

Tue, Jun 18, 9:25 PM · Data-Services
Bstorm added a comment to T217474: labstore1006 nfsd not started after reboot.

Just to double-check: IIRC, back in the day we avoided this because we had multiple controllers attached to a shared shelf and if two controllers ran at the same time then terrible, terrible things happened. Is it safe to say that there's no current situation where having 'too many' nfs services running at once causes harm?

Tue, Jun 18, 9:18 PM · Patch-For-Review, Data-Services, observability, cloud-services-team (Kanban)
Bstorm added a comment to T217474: labstore1006 nfsd not started after reboot.

I thought I'd set them on the role rather than the profile? Checking that, it appears that they aren't set in the right place. I'd rather it be on the role. Lemme check if that will work right. Thanks for looking!

Tue, Jun 18, 5:42 PM · Patch-For-Review, Data-Services, observability, cloud-services-team (Kanban)
Man77 awarded T220164: osm4wiki generating around 300 perl processes wherever it runs, which overloads the server for purposes of gridengine a Stroopwafel token.
Tue, Jun 18, 5:25 PM · Tools
Bstorm added a comment to T217474: labstore1006 nfsd not started after reboot.

That I believe is done. We can always do something that will make it page to be sure....
But the hiera works on the other NFS servers. It unfortunately tested itself.

Tue, Jun 18, 3:08 PM · Patch-For-Review, Data-Services, observability, cloud-services-team (Kanban)

Mon, Jun 17

Bstorm committed rLTRA744f9bda3c92: webhook: Add first run of code (authored by Bstorm).
webhook: Add first run of code
Mon, Jun 17, 6:27 PM
Bstorm added a comment to T225265: Fix labstore checks on cloudstore1008/9.

So this needs to connect to port 111 over UDP from the public network, which it currently is not allowed to do.

Mon, Jun 17, 4:55 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Keeping an eye on trends a bit more

Mon, Jun 17, 4:09 PM · Data-Services, Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

The change doesn't seem to have hurt or helped since I made it. Stopping the client-side monitoring has done far more.

Mon, Jun 17, 4:08 PM · Data-Services, Patch-For-Review, cloud-services-team (Kanban)
Bstorm placed T225947: toolsbeta-docker-registry-01 doesn't work yet up for grabs.
Mon, Jun 17, 3:58 PM · Toolforge, cloud-services-team (Kanban)
Bstorm triaged T225947: toolsbeta-docker-registry-01 doesn't work yet as Low priority.
Mon, Jun 17, 3:58 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a project to T225947: toolsbeta-docker-registry-01 doesn't work yet: Toolforge.
Mon, Jun 17, 3:58 PM · Toolforge, cloud-services-team (Kanban)
Bstorm created T225947: toolsbeta-docker-registry-01 doesn't work yet.
Mon, Jun 17, 3:57 PM · Toolforge, cloud-services-team (Kanban)
Bstorm closed T221721: Puppet broken on several vms in toolsbeta as Resolved.
Mon, Jun 17, 3:54 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm closed T221721: Puppet broken on several vms in toolsbeta, a subtask of T221183: Rename and re-assign cloud dns servers, as Resolved.
Mon, Jun 17, 3:54 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm updated the task description for T221721: Puppet broken on several vms in toolsbeta.
Mon, Jun 17, 3:52 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T221721: Puppet broken on several vms in toolsbeta.

Puppet now runs on the registry node, but it doesn't work because it needs the SSL cert placed in the private repo (like in tools) and some odd prometheus error. However, puppet now functions, so this ticket is done. Whoever was working on making a beta registry can continue now.

Mon, Jun 17, 3:52 PM · Patch-For-Review, cloud-services-team (Kanban)

Fri, Jun 14

Bstorm updated the task description for T225320: Ceph Proof of Concept Build and Testing.
Fri, Jun 14, 11:51 PM · Epic, cloud-services-team (Kanban)
Bstorm added a comment to T225320: Ceph Proof of Concept Build and Testing.

All that said, instead of CephFS, RBD can be exported as iSCSI. If that is set up with appropriate multipathing and a clusterFS (or using GaneshaNFS so that userland locking takes place), we could build an actually HA Linux NFS server with Ceph backing it in the exact same way as it would be for cloudvirts, with all the same quirks. That would mean that NFS could still do immutable bits because, well, it can do that. While there may be useful cases where CephFS makes sense, we may get much more use out of Ceph by simply using it for block devices in nearly every case. That sort of consistent use might make managing a ceph cluster "easier" as well, even if it might complicate NFS a bit (but not really much more than it already is). The topic of shipping ceph rbds to iscsi targets is not a small one, so I'm putting that away for now (but it may be worth testing locally while kicking this around).

Fri, Jun 14, 11:51 PM · Epic, cloud-services-team (Kanban)
Bstorm added a comment to T225320: Ceph Proof of Concept Build and Testing.

Note: I just confirmed locally that CephFS cannot set extended attributes in Luminous.
The feature has a tracker here: http://tracker.ceph.com/issues/10679
As this was last updated 3 years ago, I expect them to implement that as soon as one of us writes the patch basically 😛

Fri, Jun 14, 11:33 PM · Epic, cloud-services-team (Kanban)
Bstorm updated the task description for T221721: Puppet broken on several vms in toolsbeta.
Fri, Jun 14, 4:27 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T221721: Puppet broken on several vms in toolsbeta.

Manually changed the /data/project/.system_sge/gridengine/default/common/act_qmaster file and got the process going. The beta grid is healthy again.

Fri, Jun 14, 4:27 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T221721: Puppet broken on several vms in toolsbeta.

The gridmaster service isn't running because apparently the shadow is! Jun 14 16:09:07 toolsbeta-sgegrid-master sge_qmaster[16056]: critical error: qmaster on host "toolsbeta-sgegrid-shadow.toolsbeta.eqiad.wmflabs" is still running - terminating

Fri, Jun 14, 4:10 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T221721: Puppet broken on several vms in toolsbeta.

Puppet is now able to run cleanly on the grid master.

Fri, Jun 14, 4:08 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm triaged T225833: Get jobutils in the aptly repo for stretch-toolsbeta as Low priority.
Fri, Jun 14, 4:08 PM · cloud-services-team (Kanban)
Bstorm created T225833: Get jobutils in the aptly repo for stretch-toolsbeta.
Fri, Jun 14, 4:07 PM · cloud-services-team (Kanban)
Bstorm added a comment to T221721: Puppet broken on several vms in toolsbeta.

Sigh, toolsbeta-sgegrid-master cannot install jobutils, most likely because of weirdness in aptly. I think we ended up making the stretch toolsbeta repo there an actual repo, so jobutils would need to actually be there for it to work: E: Unable to locate package jobutils

Fri, Jun 14, 4:05 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm updated the task description for T221721: Puppet broken on several vms in toolsbeta.
Fri, Jun 14, 3:55 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T221721: Puppet broken on several vms in toolsbeta.

toolsbeta-proxy-01 seems kind of half-baked. I think it was someone's work toward making toolsbeta more like tools to test things against the proxy.
Function lookup() did not find a value for the name 'profile::toolforge::toolviews::mysql_password' at /etc/puppet/modules/profile/manifests/toolforge/toolviews.pp:4
I imagine this is a distant reflection of T101651.

Fri, Jun 14, 3:54 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm updated the task description for T221721: Puppet broken on several vms in toolsbeta.
Fri, Jun 14, 3:42 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm updated the task description for T221721: Puppet broken on several vms in toolsbeta.
Fri, Jun 14, 3:37 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T221721: Puppet broken on several vms in toolsbeta.

toolsbeta-k8s-lb-01 isn't working because I didn't finish it: profile::toolforge::k8s::api_servers is empty and needs values. I may just delete that instance.

Fri, Jun 14, 3:36 PM · Patch-For-Review, cloud-services-team (Kanban)