
Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (69 w, 3 d)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis I'm BStorm (WMF), on IRC I'm bstorm_, and on Gerrit and Wikitech I'm Bstorm.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Yesterday

bd808 awarded T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 a Party Time token.
Thu, May 23, 8:46 PM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Patch-For-Review, Epic, Cloud-VPS
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

With that merged, I'll coordinate with the DBAs to find a time to deploy it.

Thu, May 23, 5:17 PM · Patch-For-Review, Data-Services
Bstorm reopened T221339: Missing index on revision_userindex.rev_actor as "Open".

Since we're doing things here again, I should probably re-open it.

Thu, May 23, 5:06 PM · Patch-For-Review, Data-Services
Bstorm reopened T221339: Missing index on revision_userindex.rev_actor, a subtask of T219324: Update tools to use new actor storage, as Open.
Thu, May 23, 5:06 PM · Community-Tech-Sprint, Community-Tech
Bstorm reopened T221339: Missing index on revision_userindex.rev_actor, a subtask of T223667: Update to use new actor storage, as Open.
Thu, May 23, 5:06 PM · Community-Tech-Sprint, Community-Tech, XTools
Bstorm added a comment to T169287: etcd config depends on puppet certs, but puppet doesn't know.

But puppet doesn't run the agent like normal when it modifies a cert. It waits for signature, etc.?
However, now that you mention it, I think I was thinking about it wrong. Why not just subscribe to: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcd/manifests/ssl.pp#32 ?

Thu, May 23, 5:01 PM · Kubernetes, Cloud-Services, Operations
Bstorm added a comment to T223902: cloudcontrol: decide on FQDN for service endpoints.

I do think TLS should be on OpenStack service endpoints in general for a lot of reasons. Independent of the FQDN considerations, I strongly think that should factor in, if we can do it. A caching layer would benefit some read-only stuff, but I tend to imagine we'd want openstack stuff to dodge caching anyway since making that kind of API cache-friendly required quite a bit of tweaking and cache tuning the last time I did it elsewhere (and required me compiling in some varnish stuff to make auth work better through it). I generally have to imagine that OpenStack api caching won't look quite like MediaWiki api caching needs--but you never know. This all makes me think avoiding the prod caching layer might save us trouble at the outset.

Thu, May 23, 4:53 PM · Operations, Traffic, Cloud-VPS, cloud-services-team (Kanban)
Bstorm awarded T224155: Reduce size of referee db on toolsdb if at all possible a Party Time token.
Thu, May 23, 4:38 PM · Data-Services

Wed, May 22

Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.
  • I vote to add cloudosd1xxx to the naming conventions unless my team rebels against that. The related monitor nodes would end up cloudmon1xxx. They are possibly more multi-purpose, but they'll be primary monitors for Ceph. Since this is the PoC, we can always revisit that in the future.
  • I'll get back to you soon on the network placement and so forth.
  • They will be horizontally redundant between the three and should have different racks.
Wed, May 22, 9:27 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T224154: Reduce size of linkwatcher db on toolsdb if at all possible.

Yeah, if you open the request, it will make it a bit easier. I can help flesh it out.

Wed, May 22, 8:26 PM · Data-Services
Bstorm added a comment to T224154: Reduce size of linkwatcher db on toolsdb if at all possible.

You can communicate from Toolforge to another VPS, if that is useful, btw. If some components/tools need to stay on Toolforge, it should be possible to use VPS resources with the right security groups and such.

Wed, May 22, 7:37 PM · Data-Services
Bstorm added a comment to T224154: Reduce size of linkwatcher db on toolsdb if at all possible.

Would you start a request from that link I posted above with some specs to get the ball rolling? I can help with some notes as well.

Wed, May 22, 7:36 PM · Data-Services
Bstorm added a comment to T224154: Reduce size of linkwatcher db on toolsdb if at all possible.

Yeah, sounds like you'd benefit quite a bit from moving to Cloud VPS, really.

Wed, May 22, 7:35 PM · Data-Services
Bstorm added a comment to T224154: Reduce size of linkwatcher db on toolsdb if at all possible.

By any chance, there's no way that's quite in the wiki replicas? It does have link data in it. It would require queries across dbs, but people do that. I am imagining there's a performance barrier to doing that or similar for this tool, though.

Wed, May 22, 7:13 PM · Data-Services
Bstorm added a comment to T224154: Reduce size of linkwatcher db on toolsdb if at all possible.

How about spinning out into a VPS project that runs your own DB server? See https://phabricator.wikimedia.org/project/view/2875/
That would give it more room to grow on its own without impact to the ToolsDB service. It would need a fairly large instance size, but with a good reason backing it, we grant those.
If you link the project request here, we already have a pretty good idea of the size of DB we are talking about. You'd need help from me or another admin to get the DB dumped out because of the file size limits on toolforge, but that could be arranged.

Wed, May 22, 6:59 PM · Data-Services
Bstorm added a comment to T224163: Migrate Wikilabels to new DB server.

Yup! I'm trying other avenues to keep toolsdb from running out of space on the replica as well. If it goes nuclear for some reason before then, I'll try to reach you before doing an emergency switch--and the issue is on the replica not the primary so I can always rebuild if I have to.

Wed, May 22, 6:12 PM · Scoring-platform-team (Current), Wikilabels
Bstorm added a comment to T224154: Reduce size of linkwatcher db on toolsdb if at all possible.

Can you provide an example, @Beetstra? I might be able to help find something if there is one.

Wed, May 22, 6:07 PM · Data-Services
Bstorm added a comment to T224163: Migrate Wikilabels to new DB server.

Ideally, there should be no config change needed if you are using the wikilabels.db.svc.eqiad.wmflabs address now.

Wed, May 22, 6:03 PM · Scoring-platform-team (Current), Wikilabels
Bstorm created T224155: Reduce size of referee db on toolsdb if at all possible.
Wed, May 22, 5:38 PM · Data-Services
Bstorm created T224154: Reduce size of linkwatcher db on toolsdb if at all possible.
Wed, May 22, 5:36 PM · Data-Services
Bstorm created T224152: toolsdb replica is running low on space -- cleanup large tables if possible.
Wed, May 22, 5:26 PM · cloud-services-team (Kanban), Data-Services
Bstorm merged T221673: Move wikilabels DB to it's own instance with replica into T224062: clouddb1002 low on space -- move wikilabelsdb.
Wed, May 22, 5:13 PM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Data-Services
Bstorm closed T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 as Resolved.

This is done!

Wed, May 22, 5:11 PM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Patch-For-Review, Epic, Cloud-VPS
Bstorm closed T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020, a subtask of T172538: rack/setup/install labvirt10(19|20).eqiad.wmnet, as Resolved.
Wed, May 22, 5:11 PM · cloud-services-team (Kanban), Operations, Cloud-Services
Bstorm added a comment to T224062: clouddb1002 low on space -- move wikilabelsdb.

The new server, FYI is clouddb-wikilabels-01.clouddb-services.eqiad.wmflabs, so you can validate that you can make a read-only connection at any time if you are so inclined. I'll move DNS to that and then promote it to primary.

Wed, May 22, 5:09 PM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T224062: clouddb1002 low on space -- move wikilabelsdb.

@Halfak if you are cool with deploy-on-friday, I'll do it! :)
The replica is replicating now, so it should be pretty smooth. You'll be read-only for a few when DNS is shifted over to the new server, and then it'll go r/w as soon as I've got the promotion command in there. Since I did the procedure for OSMdb to migrate that much bigger DB, this should be easy :)

Wed, May 22, 5:07 PM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

Scheduled scratch migration for 2019-05-28@1800 UTC

Wed, May 22, 4:42 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

Ah! Also, that patch might help. I'll get testing it in review and hopefully the maintenance is over or will be soon.

Wed, May 22, 4:18 PM · Patch-For-Review, Data-Services
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

There are more fixes that @Anomie baked into the next patch set that will be deployed with the field drops. I tried not to be too ambitious in what I deployed for this ticket because I know the DB maintenance would block quick rollbacks. It is possible that will be improved by some of that.

Wed, May 22, 4:11 PM · Patch-For-Review, Data-Services
Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

Scratch has now had one successful sync. Setting the patch to review and finding a reasonable date for it. Theoretically, since scratch shouldn't have a lot of open filehandles, it shouldn't be too bad as long as everything is working right.

Wed, May 22, 4:02 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm closed T217473: labstore1006 spontaneous reboot as Resolved.

This seems ok for now following the firmware upgrades. I'm going to close it.

Wed, May 22, 3:27 PM · Operations, Data-Services, cloud-services-team (Kanban)
Bstorm awarded T223906: Active/active rabbitMQ servers on wmcs controller nodes a Burninate token.
Wed, May 22, 2:30 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T224062: clouddb1002 low on space -- move wikilabelsdb.

It should be safe to swap it over at any time now. The only quirk is that it would be read-only for a bit after DNS jumps over to the secondary with a possible blip when promoting that to primary. @Halfak I just need to know when that might be ok (like when someone is around to catch the service if it falls). Hopefully that's tomorrow because I really want to reclaim this space for toolsdb soon (not that wikilabels is big or anything, rather I need the volume it's on for the big database next door). This will turn the wikilabels db into something on a separate, replicated pair of its own.

Wed, May 22, 12:19 AM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Data-Services

Tue, May 21

Bstorm added a project to T224062: clouddb1002 low on space -- move wikilabelsdb: Wikilabels.

Adding the wikilabels tag just as a heads up. When we have a location set up, I'll start this replicating to the new location and eventually change the primary. The actual changes should have very little service impact since I'll be moving the DNS alias, but I'll make sure to coordinate with @Halfak so things can be restarted if needed, etc. when the time comes.

Tue, May 21, 9:15 PM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Data-Services
Bstorm moved T224062: clouddb1002 low on space -- move wikilabelsdb from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, May 21, 9:12 PM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Data-Services
Bstorm triaged T224062: clouddb1002 low on space -- move wikilabelsdb as High priority.
Tue, May 21, 9:09 PM · Scoring-platform-team, Wikilabels, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T215529: Puppetize/stand up a load balancer for K8s API servers.

It looks like production is using LVS for this. Despite the starter work on using haproxy, this really should use that for consistency and familiarity within the foundation. kube-api is just an api, so there should be no real barrier to applying that.

Tue, May 21, 6:32 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T223902: cloudcontrol: decide on FQDN for service endpoints.

Following this for the NFS maps IP in T209527

Tue, May 21, 4:44 PM · Operations, Traffic, Cloud-VPS, cloud-services-team (Kanban)
Bstorm updated subscribers of T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup.
Tue, May 21, 3:06 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup.

The webhook could be done in flask/wsgi for the team to find it easier to maintain, but it also might be more flexible and quick to deploy if done in Go (where the actual objects from the k8s source can be imported to interpret the API objects).

Tue, May 21, 3:05 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
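
For illustration, the flask/wsgi option mentioned above might look roughly like the following. This is a minimal sketch, not the team's implementation: the registry prefix, the function names and the policy itself are hypothetical, and it assumes the admission.k8s.io/v1beta1 AdmissionReview shape that Kubernetes 1.12 sends to validating webhooks. It only inspects spec.containers and ignores init containers.

```python
import json

# Hypothetical policy: only images from this registry prefix are allowed.
ALLOWED_REGISTRY = "registry.example.org/"


def review_pod(admission_review):
    """Build an AdmissionReview response for a Pod create/update request."""
    request = admission_review["request"]
    pod = request["object"]
    containers = pod.get("spec", {}).get("containers", [])
    # Collect every image that does not come from the allowed registry.
    rejected = [
        c.get("image", "")
        for c in containers
        if not c.get("image", "").startswith(ALLOWED_REGISTRY)
    ]
    response = {"uid": request["uid"], "allowed": not rejected}
    if rejected:
        response["status"] = {
            "message": "images not allowed: " + ", ".join(rejected)
        }
    return {
        "apiVersion": admission_review.get(
            "apiVersion", "admission.k8s.io/v1beta1"
        ),
        "kind": "AdmissionReview",
        "response": response,
    }


def application(environ, start_response):
    """Plain WSGI entry point; any WSGI server could host this."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = json.loads(environ["wsgi.input"].read(length))
    payload = json.dumps(review_pod(body)).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(payload))),
        ],
    )
    return [payload]
```

The decision logic lives in review_pod so it can be tested without a running API server. A Go version would swap the dict handling for the typed objects imported from the k8s source, which is the flexibility the comment refers to.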
Bstorm added a comment to T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup.
NOTE: Pod security policies are available in 1.12, which is used in production. We don't need a newer version to use that functionality to replace 3 of 4 controllers.
Tue, May 21, 3:03 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm moved T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, May 21, 3:01 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm merged task T211354: Toolforge: Find replacement solutions for custom controllers into T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup.
Tue, May 21, 3:01 PM · Kubernetes, Toolforge, cloud-services-team (Kanban)
Bstorm closed T217086: Investigate why the new Son of Grid Engine grid landed in a worse state when NFS was filled than the old Sun Grid Engine grid did as Resolved.

Overall, NFS is doing well now with regard to grid nodes going offline. This appears to have been a result of a variety of monitoring tools monitoring on the client side in Stretch.

Tue, May 21, 2:59 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
Bstorm closed T221339: Missing index on revision_userindex.rev_actor as Resolved.

I'll close this one for now.

Tue, May 21, 2:44 PM · Patch-For-Review, Data-Services
Bstorm closed T221339: Missing index on revision_userindex.rev_actor, a subtask of T219324: Update tools to use new actor storage, as Resolved.
Tue, May 21, 2:44 PM · Community-Tech-Sprint, Community-Tech

Mon, May 20

Bstorm added a comment to T223406: Remove reference to fields replaced by the actor table from WMCS views.

Announced that this is now going to be scheduled for June 3rd after more feedback and finding at least one issue that I hope is fixed over at T221339.

Mon, May 20, 6:39 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

This should be all set, and I think you should be unblocked now @Anomie

Mon, May 20, 6:33 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

This is deployed across the replicas. Please test things and let me know if it is all good.

Mon, May 20, 6:33 PM · Patch-For-Review, Data-Services
Bstorm added a comment to T223902: cloudcontrol: decide on FQDN for service endpoints.

Once we are capable of being multidomain within neutron, we wouldn't want the domain in keystone stuff, would we? Possibly in certain other things? We needed perhaps more separation than in the future because novanetwork wasn't even a thing in the new OpenStack, but plenty is segregated by region, I'm sure. I'm asking because I really don't know and am curious :-D

Mon, May 20, 1:59 PM · Operations, Traffic, Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a parent task for T219324: Update tools to use new actor storage: T223406: Remove reference to fields replaced by the actor table from WMCS views.
Mon, May 20, 1:51 PM · Community-Tech-Sprint, Community-Tech
Bstorm added a subtask for T223406: Remove reference to fields replaced by the actor table from WMCS views: T219324: Update tools to use new actor storage.
Mon, May 20, 1:51 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T219324: Update tools to use new actor storage.

Working on the change to views in cloud/toolforge here: T223406

Mon, May 20, 1:50 PM · Community-Tech-Sprint, Community-Tech
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

I should be deploying the changes above to all replicas starting tomorrow if all goes well.

Mon, May 20, 1:17 AM · Patch-For-Review, Data-Services

Fri, May 17

Bstorm added a comment to T223406: Remove reference to fields replaced by the actor table from WMCS views.

Announced a deployment of 2019-05-27. Hopefully that will work pretty well for everyone.

Fri, May 17, 10:30 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm closed T222255: tools-sgecron-01 virtual memory allocation error at midnight and noon UTC as Resolved.

I'm not finding any on my end. I also see no logs or dmesg entries since the week prior to the change. I'm going to call this one fixed by freeing overcommit on the cron host.

Fri, May 17, 6:06 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T222255: tools-sgecron-01 virtual memory allocation error at midnight and noon UTC.

That's kind of what I'm looking for. I was leaving this open for a while to see if we are still getting the issue. I deployed the changes last Friday, so it's fixed, I think, if there have been no messages since then. I'll poke around in the depths of my email filters to see if I have any.

Fri, May 17, 5:23 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

Aaaand after more checking, the temp table was a very old artifact of the errors I had in some view definitions, just like you suggested @Anomie. That is confirmed fixed on the replicas I've been able to deploy on and will be everywhere as soon as I can get it deployed past some DB maintenance.

Fri, May 17, 4:10 PM · Patch-For-Review, Data-Services
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

That would be good, but note the temp table isn't currently on the replicas (possibly related to T212972#5171045).

Fri, May 17, 3:54 PM · Patch-For-Review, Data-Services
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

I've got a commitment to finish the last two replicas on Monday.

Fri, May 17, 3:29 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)

Thu, May 16

Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

@TheDJ ssh sessions and possibly processes that run out of home directories or the project directory on NFS. Because it has NFS home directories, you'd want to make sure you re-opened your home directory after the symlink to /home is changed to the new mounts.

Thu, May 16, 5:07 PM · Patch-For-Review, cloud-services-team (Kanban)

Wed, May 15

Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

There is maintenance going on re: the replicas that is keeping some depooled and preventing me from proceeding for a while. @Anomie I'll need to know what level of urgency this has for you to know if we should adapt the maintenance activities around this or just wait it out. Compression is involved, so it'll be a while, I'm sure.

Wed, May 15, 4:38 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

Starting the run on the rest of them

Wed, May 15, 2:45 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

For validation purposes, this is the first run of the fixed layout on a live wiki on protected_titles_compat

Wed, May 15, 2:44 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)

Tue, May 14

Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

Aaand, I was out Monday and am catching up still. This may not get deployed until tomorrow.

Tue, May 14, 10:15 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm updated subscribers of T209527: Set up scratch and maps NFS services on cloudstore1008/9.

@aude @Awjrichards @Chippyy @cmarqu @coren @dschwen @jeremyb @Kolossos @MaxSem @Multichill @Nosy @TheDJ -- Just a heads up that I'm looking to begin data migration now for maps /home and /project. To do the final cutover, things logged into the maps project servers will probably need to close sessions, so I wanted to be in touch for that step as well since that needs to be scheduled.

Tue, May 14, 4:45 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T223272: CloudVPS: evaluate if we can make rsync use 10G in cloudvirts.

Not that it should matter to the virts themselves, but it would matter to the VMs. Just curious.

Tue, May 14, 3:44 PM · Patch-For-Review, Operations, netops, cloud-services-team (Kanban)
Bstorm added a comment to T223272: CloudVPS: evaluate if we can make rsync use 10G in cloudvirts.

Is the Neutron hardware on 10G?

Tue, May 14, 3:43 PM · Patch-For-Review, Operations, netops, cloud-services-team (Kanban)

Fri, May 10

Bstorm added a comment to T222255: tools-sgecron-01 virtual memory allocation error at midnight and noon UTC.

The one danger I see in this is if it actually hits OOM and kills the webservice monitor. Then webservices on the grid will not restart. I do not *think* that will happen because real memory usage isn't spiking from what I can tell, though if there was no resource contention, this wouldn't happen.

Fri, May 10, 9:15 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

I'm going to hold off the actual deploy until Monday in case this breaks a bunch of tools.

Fri, May 10, 8:11 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)

Thu, May 9

Bstorm added a comment to T101631: rev_len should be available also for deleted revisions in database replicas.

If I can get a thumbs up from @Bawolff, perhaps?

Thu, May 9, 11:04 PM · cloud-services-team (Kanban), Data-Services, Cloud-VPS
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

The upstream table doesn't seem to have an index on that field in general. That seems to suggest that running a WHERE on that isn't a good idea? However, there is a usertext_timestamp index. The reason it is fast on the archive_userindex table is that it is hitting an index that I've added to the replicas named ar_actor_deleted. It is hit because the query checks the deleted field, which is used by the userindex view by default. That appears to be the pattern on these userindex views.

Thu, May 9, 9:58 PM · Patch-For-Review, Data-Services
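
The pattern described in the comment above (a userindex view is only fast when the underlying table has an index covering the actor and deleted columns) can be reproduced in miniature. The sketch below uses SQLite from the Python standard library as a stand-in for the MariaDB replicas, so the planner output differs from production. The table is cut down to four columns, the view's deleted check is simplified to rev_deleted = 0, and the index name rev_actor_deleted is an assumption modeled on ar_actor_deleted.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the EXPLAIN QUERY PLAN detail strings joined into one line."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each row holds the human-readable plan step.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for the real revision table.
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_actor INTEGER NOT NULL,
        rev_deleted INTEGER NOT NULL DEFAULT 0,
        rev_timestamp TEXT NOT NULL
    );
    -- Simplified stand-in for the userindex view: hide deleted rows.
    CREATE VIEW revision_userindex AS
        SELECT rev_id, rev_actor, rev_timestamp
        FROM revision
        WHERE rev_deleted = 0;
""")

lookup = "SELECT rev_id FROM revision_userindex WHERE rev_actor = 42"

# No usable index yet: the planner has to scan the whole table.
plan_without_index = query_plan(conn, lookup)

# Hypothetical index name, modeled on ar_actor_deleted.
conn.execute(
    "CREATE INDEX rev_actor_deleted ON revision (rev_actor, rev_deleted)"
)
plan_with_index = query_plan(conn, lookup)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The view definition is identical in both runs; only the index on the underlying table changes the plan, which is why rewriting the view alone (for example with coalesce) would not be expected to help.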
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

I don't think using coalesce in the view changes anything, btw. The issue is the underlying table indexes or not. This is scanning the entire underlying revision table, which is bound to time out no matter what.

Thu, May 9, 9:48 PM · Patch-For-Review, Data-Services
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

So, apparently, the revision query scans the entire revision table first.

Thu, May 9, 9:42 PM · Patch-For-Review, Data-Services
Bstorm added a comment to T221339: Missing index on revision_userindex.rev_actor.

Revision has to go through a temp table (revision_actor_temp), which does change things a fair bit. That is intended to be a temporary situation (T215466). Let me see what indexes that table has.

Thu, May 9, 9:26 PM · Patch-For-Review, Data-Services
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

That should fix things as long as it looks ok. I tested on my local mediawiki vagrant.

Thu, May 9, 9:21 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

Ok, the general problem is that the script cannot handle a view that has both an inner and a left join (not without writing a parser, at which point this would need to be packaged separately from puppet).

Thu, May 9, 8:54 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

"Failed to find table image, comment in database frwiki as a source for view image_compat" is definitely the bug.

Thu, May 9, 6:45 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

There is a bug in the script (or somewhere) that is causing it to skip rebuilding some (_compat) tables. I think I know what it is. Since we changed to inner joins, there's a bug in there I suspected early on that it might miss tables that have them.

Thu, May 9, 6:36 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T212972: Remove reference to text fields replaced by the comment table from WMCS views.

I see it somewhere else as well. Let me try to figure out what's up there.

Thu, May 9, 6:28 PM · Patch-For-Review, Data-Services, Core Platform Team Backlog (Watching / External)
Bstorm added a comment to T222768: Odd performance issue on a toolforge DB replica.

I suspect that may have been related to how data transfer plays into it, perhaps. Running on the server itself, it doesn't need to. 🤔
If it's fixed, I'm happy, though!
Thanks @Marostegui !

Thu, May 9, 3:36 PM · cloud-services-team (Kanban), Data-Services, Tool-Database-Queries, Toolforge

Wed, May 8

Bstorm added a comment to T222255: tools-sgecron-01 virtual memory allocation error at midnight and noon UTC.

I'm curious about what is happening here that is new, though

I think it's just more and more people using cron. We have around 467 jobs at midnight UTC (T222255#5158245)!

Wed, May 8, 12:57 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T222768: Odd performance issue on a toolforge DB replica.

@bd808 I got the same result running explains on the server. The difference was that, while the query plan differs between the two for some reason, both were nearly instant in response. That part is what made me really scratch my head.

Wed, May 8, 1:40 AM · cloud-services-team (Kanban), Data-Services, Tool-Database-Queries, Toolforge

Tue, May 7

Bstorm added a comment to T222768: Odd performance issue on a toolforge DB replica.

It looks as if the query I ran second (which is the fast one always--it's the one listed first in the description of the ticket), has even more relations? Odd.

Tue, May 7, 11:41 PM · cloud-services-team (Kanban), Data-Services, Tool-Database-Queries, Toolforge
Bstorm updated subscribers of T222768: Odd performance issue on a toolforge DB replica.

These would be hitting labsdb1010 and labsdb1011. On both servers, the queries were blazing fast and used indexes. Nothing seems wrong at the proxy level either for the servers. At the same time, I could easily confirm that the latter query took me 24 seconds from a Toolforge server.

Tue, May 7, 11:40 PM · cloud-services-team (Kanban), Data-Services, Tool-Database-Queries, Toolforge
Bstorm triaged T222768: Odd performance issue on a toolforge DB replica as Normal priority.
Tue, May 7, 11:26 PM · cloud-services-team (Kanban), Data-Services, Tool-Database-Queries, Toolforge
Bstorm added projects to T222768: Odd performance issue on a toolforge DB replica: Data-Services, cloud-services-team (Kanban).
Tue, May 7, 11:25 PM · cloud-services-team (Kanban), Data-Services, Tool-Database-Queries, Toolforge
Bstorm added a comment to T222255: tools-sgecron-01 virtual memory allocation error at midnight and noon UTC.

This is interesting. There is a series of qstat processes segfaulting at 23:59:<seconds> many (not all) nights (according to dmesg). Occasionally it also affects sendmail, exim and qsub. Rarely at other times (11:59:45 was one set). Frustrating problem to track down.

Tue, May 7, 2:29 AM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T222255: tools-sgecron-01 virtual memory allocation error at midnight and noon UTC.

Yeah, there isn't anything there.

Tue, May 7, 1:55 AM · Patch-For-Review, cloud-services-team (Kanban), Toolforge

Mon, May 6

Bstorm added a comment to T222255: tools-sgecron-01 virtual memory allocation error at midnight and noon UTC.

I'm out-of-office, but I've been trying to poke at this problem while away.

Mon, May 6, 11:53 AM · Patch-For-Review, cloud-services-team (Kanban), Toolforge

Sat, May 4

Bstorm added a subtask for T220650: tools-manifest - webservicemonitor needs a longer timeout: T221301: Toolschecker webservice checks get out of sync likely from timeouts.
Sat, May 4, 9:41 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a parent task for T221301: Toolschecker webservice checks get out of sync likely from timeouts: T220650: tools-manifest - webservicemonitor needs a longer timeout.
Sat, May 4, 9:41 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

Just found that there are regular stack traces related to this -- we need a longer timeout:
May 3 06:25:12 tools-sgecron-01 collector-runner[3819]: 2019-05-03 06:25:12,728 Starting webservice for tool dplbot
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: 2019-05-03 06:25:35,477 Timed out attempting to start webservice for tool dplbot
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: Traceback (most recent call last):
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: File "/usr/lib/python3.5/subprocess.py", line 385, in run
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: stdout, stderr = process.communicate(input, timeout=timeout)
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: File "/usr/lib/python3.5/subprocess.py", line 801, in communicate
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: stdout, stderr = self._communicate(input, endtime, timeout)
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: File "/usr/lib/python3.5/subprocess.py", line 1447, in _communicate
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: self._check_timeout(endtime, orig_timeout)
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: File "/usr/lib/python3.5/subprocess.py", line 829, in _check_timeout
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: raise TimeoutExpired(self.args, orig_timeout)
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: subprocess.TimeoutExpired: Command '['/usr/bin/sudo', '-i', '-u', 'tools.dplbot', '/usr/bin/webservice', 'restart']' timed out after 15 seconds
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: During handling of the above exception, another exception occurred:
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: Traceback (most recent call last):
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 183, in _start_webservice
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: subprocess.check_output(command, timeout=15) # 15 second timeout!
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: **kwargs).stdout
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: File "/usr/lib/python3.5/subprocess.py", line 390, in run
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: stderr=stderr)
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: subprocess.TimeoutExpired: Command '['/usr/bin/sudo', '-i', '-u', 'tools.dplbot', '/usr/bin/webservice', 'restart']' timed out after 15 seconds
May 3 06:25:35 tools-sgecron-01 collector-runner[3819]: 2019-05-03 06:25:35,544 Service monitor run completed, 0 webservices restarted
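
For illustration only, here is a minimal sketch of how a longer, configurable timeout could be passed to the same kind of `subprocess.check_output` call that appears in the traceback. The function name `start_webservice`, the constant `DEFAULT_TIMEOUT_SECONDS` and the value of 60 seconds are assumptions made for this example. They are not taken from the tools-manifest code, and the sketch is not the actual fix.

```python
import subprocess
import sys

# Hypothetical default; the traceback above shows a hard-coded 15 seconds.
DEFAULT_TIMEOUT_SECONDS = 60


def start_webservice(command, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Run a restart command with a configurable timeout.

    Returns True if the command exits with status 0, and False if it
    times out or exits non-zero, so the caller can log the failure and
    move on to the next tool instead of handling a raw exception.
    """
    try:
        subprocess.check_output(
            command, timeout=timeout, stderr=subprocess.STDOUT
        )
    except subprocess.TimeoutExpired:
        # check_output kills the child before raising on timeout.
        return False
    except subprocess.CalledProcessError:
        return False
    return True


if __name__ == "__main__":
    # Stand-in command for demonstration; a real monitor would pass
    # something like the sudo/webservice command seen in the log.
    ok = start_webservice([sys.executable, "-c", "pass"], timeout=5)
    print("started" if ok else "failed")
```

Making the timeout a parameter rather than a literal would let it be raised for slow restarts without editing the call site.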

Sat, May 4, 9:40 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a project to T140415: `webservice restart` does not always wait for service to stop before trying to start again: cloud-services-team (Kanban).
Sat, May 4, 9:37 PM · cloud-services-team (Kanban), Kubernetes, Toolforge, Tools-Kubernetes
Bstorm added a project to T222430: MemoryError on submitting a job through Toolforge's cron: cloud-services-team (Kanban).
Sat, May 4, 9:32 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a project to T222429: xml.etree.ElementTree.ParseError: no element found: line 1, column 0 on job submission through Cron: cloud-services-team (Kanban).
Sat, May 4, 9:30 PM · cloud-services-team (Kanban), Toolforge

Tue, Apr 30

Bstorm added a comment to T209527: Set up scratch and maps NFS services on cloudstore1008/9.

Ok. This all seems to work now. I'm prepared to set up a patch to change the client mounts and start sync jobs to migrate the data. That will wait until I get back, I imagine.

Tue, Apr 30, 12:37 AM · Patch-For-Review, cloud-services-team (Kanban)

Fri, Apr 26

Bstorm awarded T221985: puppet-merge shouldn't fail if `tput` doesn't grok your terminal an Evil Spooky Haunted Tree token.
Fri, Apr 26, 7:28 PM · Puppet, Operations