Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (172 w, 16 h)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Yesterday

Bstorm added a comment to T264221: Upgrade the nginx ingress controller in Toolforge (and likely PAWS).

This is now live on toolsbeta and ready for testing. I'm not particularly happy that the ports k8s and haproxy listen on are different (30000 vs 30002), but I can live with it.
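
A quick way to double-check the two ports, roughly; the haproxy config path is an assumption, not taken from the task:

kubectl get svc -A | grep -i ingress        # NodePort shows up in PORT(S), e.g. 80:30002/TCP
sudo grep -rn '3000[02]' /etc/haproxy/      # where the haproxy backends actually point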

Mon, May 10, 10:10 PM · User-Majavah, Patch-For-Review, PAWS, Kubernetes, Toolforge, cloud-services-team (Kanban)
Bstorm moved T282474: tools.topicmatcher update_items_from_sparql.php frequently running Toolforge nodes out of resources from Inbox to Watching on the cloud-services-team (Kanban) board.
Mon, May 10, 9:31 PM · cloud-services-team (Kanban), Tools, Toolforge
Bstorm added a comment to T281039: Splunk On-Call doing something odd with routing some wmcs alerts.

I see, that makes sense.

Mon, May 10, 9:24 PM · cloud-services-team (Kanban), observability
Bstorm closed T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes as Resolved.

That fixed it, as was intended when I submitted the patch. Sorry I forgot to add the hiera.

Mon, May 10, 7:46 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm closed T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes, a subtask of T264221: Upgrade the nginx ingress controller in Toolforge (and likely PAWS), as Resolved.
Mon, May 10, 7:45 PM · User-Majavah, Patch-For-Review, PAWS, Kubernetes, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.

I forgot to add profile::wmcs::kubeadm::docker_vol: false on ingress nodes. Please do not increase disk for them.

Mon, May 10, 7:43 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T264221: Upgrade the nginx ingress controller in Toolforge (and likely PAWS).

It seems to me that adding something in the template to specify the port for the ingress is the simplest and safest solution. That should just be a matter of making it a profile-based template and adding a parameter with a sane default.

Mon, May 10, 5:01 PM · User-Majavah, Patch-For-Review, PAWS, Kubernetes, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T281810: Request increased quota for monitoring Cloud VPS project.

+1

Mon, May 10, 3:02 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)

Fri, May 7

Bstorm triaged T282264: Monitor certificate validity for Cloud VPS as High priority.
Fri, May 7, 7:41 PM · cloud-services-team (Kanban), Cloud-VPS
Bstorm closed T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes as Resolved.

OK, I updated the docs at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying for ingress and standard worker nodes. The puppet patch is deployed. This should be good to go; we just need to use the correct flavors, etc. No actual cinder storage is required.

Fri, May 7, 7:26 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm closed T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes, a subtask of T264221: Upgrade the nginx ingress controller in Toolforge (and likely PAWS), as Resolved.
Fri, May 7, 7:25 PM · User-Majavah, Patch-For-Review, PAWS, Kubernetes, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T282264: Monitor certificate validity for Cloud VPS.

Probably related: T271778: Issues with acme-chief cert rotation on deployment-prep, 2021-01-12

Fri, May 7, 6:35 PM · cloud-services-team (Kanban), Cloud-VPS
Bstorm added a comment to T282264: Monitor certificate validity for Cloud VPS.

I think it's this T273956: acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP

Fri, May 7, 6:34 PM · cloud-services-team (Kanban), Cloud-VPS
Bstorm added a comment to T282264: Monitor certificate validity for Cloud VPS.

The last couple outages for this were caused by a need to restart acme-chief due to a known issue and more or less "wontfix". While a paging-type alert is not a bad idea, a simple systemd-timer that checks if the cert is coming due and restarts acme-chief if it isn't doing its job would shore that up, no?
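
A minimal sketch of the check such a timer could run; the cert path and the acme-chief unit name are placeholders, not the real layout:

CERT=/etc/acmecerts/example/live/rsa-2048.crt        # placeholder path
# If the cert expires within 14 days, assume acme-chief is stuck and kick it.
if ! openssl x509 -checkend $((14 * 86400)) -noout -in "$CERT"; then
    sudo systemctl restart acme-chief.service        # assumed unit name
fi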

Fri, May 7, 5:57 PM · cloud-services-team (Kanban), Cloud-VPS
Bstorm closed T282227: Make toolforge k8s service names a CNAME to .svc. to allow editing without cloudinfra access as Resolved.

This works and should be good to go. I set the TTL on the k8s.svc.tools.eqiad1.wikimedia.cloud record to 300 because I know you are going to change it to a new cluster soon. Feel free to change that to 3600 (the default) after you've rebuilt the haproxy nodes. @Majavah

Fri, May 7, 5:17 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282227: Make toolforge k8s service names a CNAME to .svc. to allow editing without cloudinfra access.

Oh! There's no stale cache on the control nodes.

Fri, May 7, 4:55 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282227: Make toolforge k8s service names a CNAME to .svc. to allow editing without cloudinfra access.

For now, I cannot get the host to drop the cached A record. I tried restarting nscd, and that did not help. Turning the old proxies off just brought everything down. My thought is to leave the old proxies up until the cache drops off. That would also be the best process for tools anyway, and potentially a zero-downtime method.
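
For reference, one way to compare what DNS serves with what the host resolver still has cached, and to ask nscd to drop its hosts cache (the record name here is just an example):

dig +short k8s.toolsbeta.eqiad1.wikimedia.cloud      # what DNS currently returns
getent hosts k8s.toolsbeta.eqiad1.wikimedia.cloud    # what the host resolver (nscd/nsswitch) returns
sudo nscd --invalidate=hosts                         # flush nscd's hosts cache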

Fri, May 7, 4:53 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282227: Make toolforge k8s service names a CNAME to .svc. to allow editing without cloudinfra access.

The caching is frustratingly strong here. The old A record is still seen by the host somehow (though dig no longer returns it).

Fri, May 7, 4:41 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282227: Make toolforge k8s service names a CNAME to .svc. to allow editing without cloudinfra access.

It looks like I have to delete it and recreate it as a CNAME. That means that it will briefly cause some chaos on Toolforge when we do it in tools. We might want to do it really fast via CLI there. In toolsbeta, I can do it now.
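
Roughly what the delete-and-recreate looks like with the designate CLI; the zone, record names, and target here are illustrative, not the exact records:

openstack recordset delete toolsbeta.eqiad1.wikimedia.cloud. k8s.toolsbeta.eqiad1.wikimedia.cloud.
openstack recordset create --type CNAME --ttl 300 \
    --record k8s.svc.toolsbeta.eqiad1.wikimedia.cloud. \
    toolsbeta.eqiad1.wikimedia.cloud. k8s.toolsbeta.eqiad1.wikimedia.cloud.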

Fri, May 7, 4:26 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282227: Make toolforge k8s service names a CNAME to .svc. to allow editing without cloudinfra access.

Wait... I have the sense of this reversed. The current name becomes a CNAME to the svc name... that would Just Work™.

Fri, May 7, 4:14 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282227: Make toolforge k8s service names a CNAME to .svc. to allow editing without cloudinfra access.

The name target doesn't "change" unless all kubeconfigs do as well, which is important.

Fri, May 7, 4:12 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282227: Make toolforge k8s service names a CNAME to .svc. to allow editing without cloudinfra access.

If we try it in toolsbeta and the certs all still validate, sure. My worry is that cert validation will collapse unless we make sure it's a valid altname for the k8s cluster as well.
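
One way to check the API server certificate's SANs before switching anything over (host and port here are examples):

echo | openssl s_client -connect k8s.toolsbeta.eqiad1.wikimedia.cloud:6443 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'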

Fri, May 7, 4:11 PM · cloud-services-team (Kanban), Toolforge

Thu, May 6

Bstorm edited projects for T282191: HarvestTemplates does not return a result list anymore, after clicking "Load" nothing happens anymore, added: Tools; removed: Toolforge.
Thu, May 6, 10:50 PM · Tools
Bstorm added a comment to T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.

We'll likely try that in toolsbeta first anyway.

Thu, May 6, 10:11 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.

That patch should sort this out, and new nodes should be easy to deploy when required after it is merged. When we upgrade the ingress nodes, it'd make sense to set profile::wmcs::kubeadm::docker_vol: false on the appropriate prefix. I don't think that will unmount the volume retroactively on existing ingress nodes, since it is usually hard to make puppet unmount things, right @Andrew?
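
A read-only way to check whether an existing node still has the separate docker volume mounted; treating /var/lib/docker as the mount point is an assumption about what the docker_vol profile manages:

findmnt /var/lib/docker      # present => the separate docker volume is still mounted
grep -i docker /etc/fstab    # whether an fstab entry was left behind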

Thu, May 6, 10:10 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.

I see ingress nodes are using the role::wmcs::toolforge::k8s::worker class in puppet. I can factor out the docker volume and set a different ingress controller class for those that just uses a non-cinder setup. I don't think ingress and control nodes need the extra disk.

Thu, May 6, 9:20 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.

We could, but there isn't much need to. It doesn't actually consume that space on the ceph cluster. This is all thin-provisioned. The docker images we currently use are extremely large (for docker images). I'd rather keep plenty of room.

Thu, May 6, 9:01 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.

Got it. This flavor is made for k8s nodes: g3.cores8.ram16.disk20.ephem140. That roughly matches the existing larger nodes. Using those with the cinder class should Just Work.
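
To sanity-check the flavor's specs before building nodes with it (assuming the flavor is already visible to the project):

openstack flavor show g3.cores8.ram16.disk20.ephem140 | grep -iE 'vcpus|ram|disk|ephemeral'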

Thu, May 6, 6:12 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.

It looks like the grid nodes use the flavor g3.cores4.ram8.disk20.swap24.ephem20. K8s nodes actually cannot have swap, so that isn't quite right, for sure.

Thu, May 6, 6:09 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm renamed T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes from Support Cinder worker nodes on Toolforge Kubernetes to Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.
Thu, May 6, 6:04 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.

So if we have an appropriate flavor to use (and the grid flavor might work), we can use the same approach as https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456/

Thu, May 6, 6:03 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T282087: Support cinder or expanded ephemeral disk worker nodes on Toolforge Kubernetes.

Does that need to be cinder, or should it be like the grid nodes with ephemeral storage? I personally think the latter; otherwise, rebuilding nodes will be a serious pain.

Thu, May 6, 6:00 PM · Kubernetes, cloud-services-team (Kanban), Toolforge
Bstorm closed T282171: quarry-web-01 out of disk space as Resolved.

It does not seem to be currently leaking temp files. I'm not sure why the temp files filled the disk. Closing this for now.

Thu, May 6, 5:58 PM · cloud-services-team (Kanban), Quarry
Bstorm lowered the priority of T282171: quarry-web-01 out of disk space from High to Medium.
Thu, May 6, 5:53 PM · cloud-services-team (Kanban), Quarry
Bstorm added a comment to T282171: quarry-web-01 out of disk space.
bstorm@quarry-web-01:/tmp$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda3        19G  3.5G   15G  20% /
Thu, May 6, 5:50 PM · cloud-services-team (Kanban), Quarry
Bstorm added a comment to T282171: quarry-web-01 out of disk space.

All files are owned by www-data, so this is a quarry bug.

Thu, May 6, 5:38 PM · cloud-services-team (Kanban), Quarry
Bstorm added a comment to T282171: quarry-web-01 out of disk space.

It's being caused by /tmp filling up.
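
A quick way to see what is eating /tmp and whether the big files belong to the web app (run as root so permissions don't hide anything):

sudo du -xsh /tmp/* 2>/dev/null | sort -h | tail -n 20    # biggest items in /tmp
sudo find /tmp -xdev -user www-data -size +100M -ls       # large files owned by www-data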

Thu, May 6, 5:36 PM · cloud-services-team (Kanban), Quarry
Bstorm moved T282171: quarry-web-01 out of disk space from Inbox to Doing on the cloud-services-team (Kanban) board.
Thu, May 6, 5:33 PM · cloud-services-team (Kanban), Quarry
Bstorm moved T282171: quarry-web-01 out of disk space from Backlog to Bugs on the Quarry board.
Thu, May 6, 5:33 PM · cloud-services-team (Kanban), Quarry
Bstorm triaged T282171: quarry-web-01 out of disk space as High priority.
Thu, May 6, 5:33 PM · cloud-services-team (Kanban), Quarry
Bstorm closed T276698: actor_user = NULL (IPs) missing from Wiki Replicas as Resolved.

Yep, that theory was it. Works now. Sorry about that quirk.

Thu, May 6, 5:10 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T276698: actor_user = NULL (IPs) missing from Wiki Replicas.

@MBH the old wikireplica names were forced over to the new wikireplicas this week. I just ran a test and found that SELECT * FROM actor WHERE actor_user IS NULL has loads of results on enwiki there, but not on ruwiki. That seems very strange. I'll dig a bit deeper. ruwiki is on a host that was down for maintenance when some of these fixes were deployed, so it is likely that host needs a re-run of the definitions.
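
Roughly how that comparison can be reproduced from a Toolforge bastion, assuming the multiinstance replica service names; counting rows keeps it cheap:

mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.web.db.svc.wikimedia.cloud enwiki_p \
    -e 'SELECT COUNT(*) FROM actor WHERE actor_user IS NULL;'
mysql --defaults-file="$HOME/replica.my.cnf" -h ruwiki.web.db.svc.wikimedia.cloud ruwiki_p \
    -e 'SELECT COUNT(*) FROM actor WHERE actor_user IS NULL;'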

Thu, May 6, 5:08 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T276698: actor_user = NULL (IPs) missing from Wiki Replicas.

That's very strange... I'm not aware of any changes to the views any time recently. Which wiki? @MBH do you have a quarry link I can try or something?

Thu, May 6, 4:36 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services

Wed, May 5

Bstorm awarded T274566: [ceph] Test and upgrade to Octopus a Party Time token.
Wed, May 5, 6:06 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T281203: dumps distribution servers space issues.

The server is cloudbackup2001.codfw.wmnet and the directory is /srv/security-temp. There are 7 TB available, and it is a much more suitable temporary location.

Wed, May 5, 5:45 PM · Security-Team, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T281203: dumps distribution servers space issues.

@Reedy new location for this is set up. I'll get on IRC and see how you want to move it.

Wed, May 5, 5:24 PM · Security-Team, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T281203: dumps distribution servers space issues.

@Reedy I have found a place in the cloud universe where I could put it in codfw on a temporary basis. That would get it off systems that have public uses and save us some of the disk space problem. A new setup for storing this data really needs to be purchased outside of cloud systems, though.

Wed, May 5, 5:11 PM · Security-Team, Data-Services, cloud-services-team (Kanban)
Bstorm closed T281937: heartbeat_p perhaps not showing correct replag on clouddb1013 and clouddb1017 s1 as Invalid.

That helps a lot. I hope that isn't the second time you've had to explain it to me 😆
I'll close this one.

Wed, May 5, 4:18 PM · Data-Persistence (Consultation), cloud-services-team (Kanban), Data-Services
Jhernandez awarded T278252: Make alias for tools.db.svc.wikimedia.cloud a Like token.
Wed, May 5, 2:52 PM · Data-Services
Bstorm added a comment to T281937: heartbeat_p perhaps not showing correct replag on clouddb1013 and clouddb1017 s1.

That would make more sense :)

Wed, May 5, 12:27 AM · Data-Persistence (Consultation), cloud-services-team (Kanban), Data-Services

Tue, May 4

Bstorm updated subscribers of T281937: heartbeat_p perhaps not showing correct replag on clouddb1013 and clouddb1017 s1.

Any thoughts on that @Marostegui ?

Tue, May 4, 11:29 PM · Data-Persistence (Consultation), cloud-services-team (Kanban), Data-Services
Bstorm moved T281937: heartbeat_p perhaps not showing correct replag on clouddb1013 and clouddb1017 s1 from Backlog to Wiki replicas on the Data-Services board.
Tue, May 4, 11:28 PM · Data-Persistence (Consultation), cloud-services-team (Kanban), Data-Services
Bstorm created T281937: heartbeat_p perhaps not showing correct replag on clouddb1013 and clouddb1017 s1.
Tue, May 4, 11:28 PM · Data-Persistence (Consultation), cloud-services-team (Kanban), Data-Services
Bstorm closed T278252: Make alias for tools.db.svc.wikimedia.cloud, a subtask of T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture, as Resolved.
Tue, May 4, 10:59 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm closed T278252: Make alias for tools.db.svc.wikimedia.cloud as Resolved.

Since the script is not very amenable to creating cross-zone CNAMEs, I just made them by hand for these three until we can turn the eqiad.wmflabs A records into wikimedia.cloud A records in the future. It works now.

Tue, May 4, 10:59 PM · Data-Services
Bstorm closed T281287: Now that labsdb1012 is clouddb1021, connect it to maintain-dbusers, a subtask of T269211: Convert labsdb1012 from multi-source to multi-instance, as Resolved.
Tue, May 4, 2:41 PM · Analytics-Kanban, cloud-services-team (Kanban), Data-Services, DBA, Patch-For-Review, Analytics-Clusters
Bstorm closed T281287: Now that labsdb1012 is clouddb1021, connect it to maintain-dbusers as Resolved.
Tue, May 4, 2:41 PM · cloud-services-team (Kanban), Data-Services

Mon, May 3

Bstorm added a comment to T281732: Check into the configuration, cause and usefulness of memory alerts for multiinstance replicas.

I've ack'd the alert. We could try tuning the buffer parameter thing, or we could just change the alert. I'm honestly surprised that the web side threw the alert first.

Mon, May 3, 10:17 PM · Data-Services, cloud-services-team (Kanban)
Bstorm updated subscribers of T281732: Check into the configuration, cause and usefulness of memory alerts for multiinstance replicas.
Mon, May 3, 10:16 PM · Data-Services, cloud-services-team (Kanban)
Bstorm triaged T281287: Now that labsdb1012 is clouddb1021, connect it to maintain-dbusers as Medium priority.
Mon, May 3, 6:34 PM · cloud-services-team (Kanban), Data-Services
Bstorm claimed T281287: Now that labsdb1012 is clouddb1021, connect it to maintain-dbusers.
Mon, May 3, 6:33 PM · cloud-services-team (Kanban), Data-Services
Bstorm moved T281732: Check into the configuration, cause and usefulness of memory alerts for multiinstance replicas from Backlog to Wiki replicas on the Data-Services board.
Mon, May 3, 4:14 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a project to T281732: Check into the configuration, cause and usefulness of memory alerts for multiinstance replicas: Data-Services.
Mon, May 3, 4:14 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T269211: Convert labsdb1012 from multi-source to multi-instance.

@Marostegui do you have any advice on how to configure clouddb1021 memory / memory alerts? Would it be worth raising the thresholds from warning at 90% memory and critical at 95% to warning at 95% and critical at 98%?

Mon, May 3, 4:13 PM · Analytics-Kanban, cloud-services-team (Kanban), Data-Services, DBA, Patch-For-Review, Analytics-Clusters
Bstorm added a subtask for T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture: T281732: Check into the configuration, cause and usefulness of memory alerts for multiinstance replicas.
Mon, May 3, 4:13 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a parent task for T281732: Check into the configuration, cause and usefulness of memory alerts for multiinstance replicas: T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture.
Mon, May 3, 4:13 PM · Data-Services, cloud-services-team (Kanban)
Bstorm triaged T281732: Check into the configuration, cause and usefulness of memory alerts for multiinstance replicas as Medium priority.
Mon, May 3, 4:10 PM · Data-Services, cloud-services-team (Kanban)

Fri, Apr 30

Bstorm added a comment to T224747: Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5.

For all I know, I'm just trying the wrong port? I figured the next port in line seemed a safe bet.

Fri, Apr 30, 10:18 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Kanban)
Bstorm added a project to T224747: Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5: ops-eqiad.
Fri, Apr 30, 10:08 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Kanban)
Bstorm updated subscribers of T224747: Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5.

@Cmjohnson I'm trying to bring up the interface for the cable added on T266192, and I'm just getting no link at all. ethtool reports Link detected: no and ip link shows NO-CARRIER.
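
The checks involved, roughly (the interface name is a placeholder):

IFACE=enp175s0f1                                # placeholder; use the actual second interface
sudo ip link set dev "$IFACE" up
ip -br link show "$IFACE"                       # NO-CARRIER here means no layer-1 link
sudo ethtool "$IFACE" | grep 'Link detected'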

Fri, Apr 30, 10:06 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T275605: cloudmetrics1002: mysterious issue.

I got paged again for this this morning.

Fri, Apr 30, 3:20 PM · cloud-services-team (Hardware)

Thu, Apr 29

Bstorm closed T279723: Remove 2 nodes from the tools-k8s-etcd cluster, a subtask of T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity), as Resolved.
Thu, Apr 29, 6:47 PM · Toolforge, cloud-services-team (Kanban)
Bstorm closed T279723: Remove 2 nodes from the tools-k8s-etcd cluster as Resolved.

Hopefully that'll speed some responses up. It's still running with awfully high iowait.

Thu, Apr 29, 6:47 PM · cloud-services-team (Kanban)
Bstorm closed T279723: Remove 2 nodes from the tools-k8s-etcd cluster, a subtask of T267966: Try to squeeze better performance out of k8s-etcd nodes, as Resolved.
Thu, Apr 29, 6:47 PM · cloud-services-team (Kanban), Toolforge

Wed, Apr 28

Bstorm added a comment to T276961: Support Openstack Swift APIs via the radosgw.

Nope, I'd probably google around for what "seems good", unfortunately.

Wed, Apr 28, 11:33 PM · cloud-services-team (Kanban), Data-Services, Cloud-VPS, User-Marostegui
Bstorm committed rLTRAc8373d78f9a5: toolsbeta: For testing, I'd rather not upload tekton to the internal reg (authored by Bstorm).
toolsbeta: For testing, I'd rather not upload tekton to the internal reg
Wed, Apr 28, 6:40 PM
Bstorm added a comment to T276961: Support Openstack Swift APIs via the radosgw.

I recommend we look at erasure coding the pools for Swift; I don't think I've said that directly on this task yet: https://ceph.io/planet/erasure-coding-in-ceph/
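
A very rough sketch of what erasure coding the radosgw data pool could look like; the profile values, PG counts, and pool name are placeholders rather than a worked-out plan:

ceph osd erasure-code-profile set wmcs-rgw-ec k=4 m=2 crush-failure-domain=host
ceph osd pool create default.rgw.buckets.data 512 512 erasure wmcs-rgw-ec
ceph osd pool application enable default.rgw.buckets.data rgw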

Wed, Apr 28, 4:55 PM · cloud-services-team (Kanban), Data-Services, Cloud-VPS, User-Marostegui
Bstorm added a comment to T280631: Make Wikimedia Enterprise Daily Exports and Diffs available to WMCS community.

So, in the Enterprise-WMCS ToU, along with a restriction on direct commercial use (and a restriction on on-selling), we should also include a restriction on the "systematic" provision of the feed to others. It would be important not to restrict "fair" sharing of the content (especially since the fortnightly dumps are available anyway). What we'd need to restrict is the systematic sharing: the bootleg recreation of the API.

Wed, Apr 28, 4:28 PM · Dumps-Generation, Okapi [Wikimedia Enterprise]

Tue, Apr 27

Bstorm added a comment to T280631: Make Wikimedia Enterprise Daily Exports and Diffs available to WMCS community.

Interesting question. When it all comes down to it, the data is probably some of the least secret on the web, so I'm not sure why we would care where it goes after it is accessed. My understanding of the enterprise offering is that it provides a high-performance way of gathering lots of information quickly from our systems, rather than a particular set of information. WMCS would not really be an enterprise-class, highly scaled client from that perspective.

Tue, Apr 27, 7:01 PM · Dumps-Generation, Okapi [Wikimedia Enterprise]
Bstorm moved T281287: Now that labsdb1012 is clouddb1021, connect it to maintain-dbusers from Backlog to Wiki replicas on the Data-Services board.
Tue, Apr 27, 5:04 PM · cloud-services-team (Kanban), Data-Services
Bstorm created T281287: Now that labsdb1012 is clouddb1021, connect it to maintain-dbusers.
Tue, Apr 27, 5:04 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a project to T281045: labstore1007 crashed after storage controller errors--replace disk?: ops-eqiad.
Tue, Apr 27, 3:18 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm renamed T281045: labstore1007 crashed after storage controller errors--replace disk? from labstore1007 crashed after storage controller errors to labstore1007 crashed after storage controller errors--replace disk?.
Tue, Apr 27, 3:17 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm added a comment to T281045: labstore1007 crashed after storage controller errors--replace disk?.

That alert may actually be actionable at least. It would seem surprising for a failing disk to crash a server, but I've seen it before if it is resetting like mad. Let's see if we can get that replaced.

Tue, Apr 27, 3:14 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm updated subscribers of T281203: dumps distribution servers space issues.

Ah yeah, that's still there. I didn't see it on a very quick check with tab completion (because of permissions) and dreamed it was gone :) @Reedy and @JFishback_WMF this is a public ticket, so I don't know if we need another one to discuss that material, but please note the description. This chunk of security data is a bit much for the dumps server to host if it keeps growing as it is, and the server has become more unreliable as well. I'm not sure you want potentially unique data that matters to you stored on that storage controller.

Tue, Apr 27, 3:06 PM · Security-Team, Data-Services, cloud-services-team (Kanban)

Mon, Apr 26

Bstorm added a comment to T278195: Sort out the 5 tool accounts with strange mysql usernames.

That's my guess here. It's not a very sustainable practice across multiple servers and rebuilds. I think it's likely that some may even have been overwritten over time and there were once more than 5.

Mon, Apr 26, 11:51 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm triaged T281203: dumps distribution servers space issues as High priority.
Mon, Apr 26, 11:29 PM · Security-Team, Data-Services, cloud-services-team (Kanban)
Bstorm updated subscribers of T281203: dumps distribution servers space issues.
Mon, Apr 26, 11:28 PM · Security-Team, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T281203: dumps distribution servers space issues.


This is the space used on labstore1006 (red) and labstore1007 (green) as a percentage. It was fairly level for a little while, but it has been climbing steadily since then, at a rate that won't work for long (especially given labstore1007's mysteriously higher usage).

Mon, Apr 26, 11:27 PM · Security-Team, Data-Services, cloud-services-team (Kanban)
Bstorm created T281203: dumps distribution servers space issues.
Mon, Apr 26, 11:26 PM · Security-Team, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T281045: labstore1007 crashed after storage controller errors--replace disk?.

Unfortunately I'm not seeing anything in there.

Mon, Apr 26, 11:15 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm added a comment to T281176: Puppet broken on restricted.bastion.wmcloud.org.

That apt sources.list bug can't die quickly enough.

Mon, Apr 26, 6:19 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm awarded T281176: Puppet broken on restricted.bastion.wmcloud.org a Yellow Medal token.
Mon, Apr 26, 6:18 PM · Cloud-VPS, cloud-services-team (Kanban)

Sat, Apr 24

Bstorm added a comment to T281045: labstore1007 crashed after storage controller errors--replace disk?.

OK, dumps is going to be served from labstore1006 for a while. I've downloaded an "adu" report with the hpssaducli tool (hpssaducli -f adu-report.zip as root). I'm pleased that it's readable, and I will go over it Monday to see if there's a useful record of what happened, since this looks like hardware.

Sat, Apr 24, 11:23 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm renamed T281045: labstore1007 crashed after storage controller errors--replace disk? from labstore1007 not reachable on main interface (ping failure) to labstore1007 crashed after storage controller errors.
Sat, Apr 24, 11:18 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm added a comment to T281045: labstore1007 crashed after storage controller errors--replace disk?.

Going to fail over to labstore1006 because this looks suspect.

Sat, Apr 24, 11:07 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm added a comment to T281045: labstore1007 crashed after storage controller errors--replace disk?.

Ok, my shell just hung for a bit, and I see this in dmesg :(

Sat, Apr 24, 11:04 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm added a comment to T281045: labstore1007 crashed after storage controller errors--replace disk?.

This line:

Apr 24 22:05:36 labstore1007 kernel: [35172747.658395] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap- En- Exp=1

would then refer to the larger data volume from the paste (logical volume 2), since that one is RAID 1+0.

Sat, Apr 24, 10:59 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm added a comment to T281045: labstore1007 crashed after storage controller errors--replace disk?.

Ok, the PCI address logging the errors was 0000:05:00.0, which corresponds to the controller in slot 1 per the controller details.
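
Roughly how that mapping can be confirmed; the exact field names in the hpssacli output may vary by firmware:

lspci -s 05:00.0                                      # which device sits at that PCI address
hpssacli ctrl all show detail | grep -iE 'slot|pci'   # controller slot <-> PCI address mapping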

Sat, Apr 24, 10:52 PM · SRE, ops-eqiad, cloud-services-team (Hardware)
Bstorm added a comment to T281045: labstore1007 crashed after storage controller errors--replace disk?.

All the disks report they are fine (of course).

root@labstore1007:~# hpssacli ctrl slot=3 pd all show status

   physicaldrive 1E:1:1 (port 1E:box 1:bay 1, 6001.1 GB): OK
   physicaldrive 1E:1:2 (port 1E:box 1:bay 2, 6001.1 GB): OK
   physicaldrive 1E:1:3 (port 1E:box 1:bay 3, 6001.1 GB): OK
   physicaldrive 1E:1:4 (port 1E:box 1:bay 4, 6001.1 GB): OK
   physicaldrive 1E:1:5 (port 1E:box 1:bay 5, 6001.1 GB): OK
   physicaldrive 1E:1:6 (port 1E:box 1:bay 6, 6001.1 GB): OK
   physicaldrive 1E:1:7 (port 1E:box 1:bay 7, 6001.1 GB): OK
   physicaldrive 1E:1:8 (port 1E:box 1:bay 8, 6001.1 GB): OK
   physicaldrive 1E:1:9 (port 1E:box 1:bay 9, 6001.1 GB): OK
   physicaldrive 1E:1:10 (port 1E:box 1:bay 10, 6001.1 GB): OK
   physicaldrive 1E:1:11 (port 1E:box 1:bay 11, 6001.1 GB): OK
   physicaldrive 1E:1:12 (port 1E:box 1:bay 12, 6001.1 GB): OK
   physicaldrive 1E:2:1 (port 1E:box 2:bay 1, 6001.1 GB): OK
   physicaldrive 1E:2:2 (port 1E:box 2:bay 2, 6001.1 GB): OK
   physicaldrive 1E:2:3 (port 1E:box 2:bay 3, 6001.1 GB): OK
   physicaldrive 1E:2:4 (port 1E:box 2:bay 4, 6001.1 GB): OK
   physicaldrive 1E:2:5 (port 1E:box 2:bay 5, 6001.1 GB): OK
   physicaldrive 1E:2:6 (port 1E:box 2:bay 6, 6001.1 GB): OK
   physicaldrive 1E:2:7 (port 1E:box 2:bay 7, 6001.1 GB): OK
   physicaldrive 1E:2:8 (port 1E:box 2:bay 8, 6001.1 GB): OK
   physicaldrive 1E:2:9 (port 1E:box 2:bay 9, 6001.1 GB): OK
   physicaldrive 1E:2:10 (port 1E:box 2:bay 10, 6001.1 GB): OK
   physicaldrive 1E:2:11 (port 1E:box 2:bay 11, 6001.1 GB): OK
   physicaldrive 1E:2:12 (port 1E:box 2:bay 12, 6001.1 GB): OK

root@labstore1007:~# hpssacli ctrl slot=3 ld all show status

   logicaldrive 1 (32.7 TB, 1+0): OK
   logicaldrive 2 (32.7 TB, 1+0): OK

root@labstore1007:~# hpssacli ctrl slot=1 pd all show status

   physicaldrive 2I:4:1 (port 2I:box 4:bay 1, 1 TB): OK
   physicaldrive 2I:4:2 (port 2I:box 4:bay 2, 1 TB): OK
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 6001.1 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 6001.1 GB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 6001.1 GB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 6001.1 GB): OK
   physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 6001.1 GB): OK
   physicaldrive 1I:1:6 (port 1I:box 1:bay 6, 6001.1 GB): OK
   physicaldrive 1I:1:7 (port 1I:box 1:bay 7, 6001.1 GB): OK
   physicaldrive 1I:1:8 (port 1I:box 1:bay 8, 6001.1 GB): OK
   physicaldrive 2I:2:1 (port 2I:box 2:bay 1, 6001.1 GB): OK
   physicaldrive 2I:2:2 (port 2I:box 2:bay 2, 6001.1 GB): OK
   physicaldrive 2I:2:3 (port 2I:box 2:bay 3, 6001.1 GB): OK
   physicaldrive 2I:2:4 (port 2I:box 2:bay 4, 6001.1 GB): OK

root@labstore1007:~# hpssacli ctrl slot=1 ld all show status

   logicaldrive 1 (931.5 GB, 1): OK
   logicaldrive 2 (32.7 TB, 1+0): OK

Sat, Apr 24, 10:47 PM · SRE, ops-eqiad, cloud-services-team (Hardware)