Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (25 w, 2 m)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis, I'm BStorm (WMF); I'm bstorm_ on IRC and Bstorm on Gerrit and Wikitech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Today

Bstorm added a comment to T198420: Improve unmount/relink setup for dumps (labstore1006/1007) failovers.

On the topic of NFS mounts, as it relates to issues we've seen unmounting things, for reference: https://access.redhat.com/solutions/157873

Mon, Jul 16, 9:30 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm removed a project from T88711: Fully puppetize Grid Engine: Cloud-Services.
Mon, Jul 16, 6:22 PM · Goal, Puppet, Toolforge
Bstorm added a comment to T88711: Fully puppetize Grid Engine.

I believe T199276#4420812 was possibly due to NFS mount happening after package installation (as long as the setup from the package runs when puppet installs it, which I haven't confirmed for sure yet) since the package install should have created those dirs AND run the initiation script. This becomes a non-issue if puppetdb is used to get around NFS.

Mon, Jul 16, 5:55 PM · Goal, Puppet, Toolforge
Bstorm added a comment to T199276: Test running a stretch exec node in the existing system on toolsbeta.

Getting puppet to run on a new Trusty tools node requires a downgrade of libgdal-dev (just as a note). This is because Trusty isn't really supported anymore here, I presume. libgdal was upgraded beyond what the libraries needed for the grid at WMF support.

Mon, Jul 16, 5:53 PM · Toolforge, Epic, cloud-services-team (Kanban)

Thu, Jul 12

Bstorm added a comment to T199276: Test running a stretch exec node in the existing system on toolsbeta.

Excellent, apparently now that the cluster is running in toolsbeta, puppet succeeds correctly.

Thu, Jul 12, 10:13 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a comment to T199276: Test running a stretch exec node in the existing system on toolsbeta.

It does not. Very interesting.

Thu, Jul 12, 10:12 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a comment to T199276: Test running a stretch exec node in the existing system on toolsbeta.

The service survives a puppet run, but puppet invariably complains about some things:

Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for toolsbeta-grid-master.toolsbeta.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1531433278'
Notice: /Stage[main]/Gridengine::Master/Service[gridengine-master]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Gridengine::Master/Service[gridengine-master]: Unscheduling refresh on Service[gridengine-master]
error: commlib error: got select error (Connection refused)
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]: Could not evaluate: Field 'shortcut' is required
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[release]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[release]: Could not evaluate: Field 'shortcut' is required
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]: Could not evaluate: Field 'shortcut' is required
Thu, Jul 12, 10:10 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a comment to T199276: Test running a stretch exec node in the existing system on toolsbeta.

I think much of this was a chicken or egg thing? All of the above should have been created by the package on install. This suggests that maybe puppet mounts NFS over the package install locations (which can be resolved). It also suggests that getting rid of the NFS config would be good, yet again.

Thu, Jul 12, 10:07 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a comment to T199276: Test running a stretch exec node in the existing system on toolsbeta.

After manual creation (for now), we are brought to:

Thu, Jul 12, 7:24 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a comment to T199276: Test running a stretch exec node in the existing system on toolsbeta.

And another

07/12/2018 19:20:25|  main|toolsbeta-grid-master|E|database directory /var/spool/gridengine/spooldb doesn't exist
07/12/2018 19:20:25|  main|toolsbeta-grid-master|E|startup of rule "default rule" in context "berkeleydb spooling" failed
07/12/2018 19:20:25|  main|toolsbeta-grid-master|C|setup failed
Thu, Jul 12, 7:21 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a comment to T88711: Fully puppetize Grid Engine.

T199276#4420812 is one thing needed for this.

Thu, Jul 12, 7:19 PM · Goal, Puppet, Toolforge
Bstorm added a comment to T199276: Test running a stretch exec node in the existing system on toolsbeta.

Apparently, there is a dependency on a particular directory that isn't in puppet:
06/06/2018 18:53:57| main|toolsbeta-grid-master|C|can't change to directory "/var/spool/gridengine/qmaster"
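
As an illustration only, a pre-flight check for the spool directories the grid master complains about could look like the sketch below. The two paths come from the log lines in this feed; the helper name `missing_dirs` and the `root` parameter are hypothetical and are not part of the puppet module or of Grid Engine.

```python
import os
import tempfile

# Directories named in the gridengine master log lines in this feed.
REQUIRED_DIRS = [
    "/var/spool/gridengine/qmaster",
    "/var/spool/gridengine/spooldb",
]


def missing_dirs(paths, root="/"):
    """Return the entries of `paths` that are not directories under `root`.

    `root` is a hypothetical knob that lets the check run against a
    scratch tree instead of the real filesystem.
    """
    missing = []
    for path in paths:
        candidate = os.path.join(root, path.lstrip("/"))
        if not os.path.isdir(candidate):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Demonstrate against a throwaway tree where only qmaster exists.
    with tempfile.TemporaryDirectory() as fake_root:
        os.makedirs(os.path.join(fake_root, "var/spool/gridengine/qmaster"))
        print(missing_dirs(REQUIRED_DIRS, root=fake_root))
```

Running a check like this before starting the service would make the ordering question (package install versus NFS mount) easier to see, since it reports exactly which directories are absent at that moment.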

Thu, Jul 12, 7:17 PM · Toolforge, Epic, cloud-services-team (Kanban)

Wed, Jul 11

Bstorm added a comment to T199276: Test running a stretch exec node in the existing system on toolsbeta.

Interestingly, the grid master is down in tools beta. It also fails on puppet runs. Poking at that.

Wed, Jul 11, 8:41 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a parent task for T177850: Page if the grid engine master is unreachable: T199271: Upgrade the tools gridengine system.
Wed, Jul 11, 4:51 PM · Patch-For-Review, monitoring, Toolforge, cloud-services-team (Kanban)
Bstorm added a subtask for T199271: Upgrade the tools gridengine system: T177850: Page if the grid engine master is unreachable.
Wed, Jul 11, 4:51 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm closed T197244: Move analytics wiki replica cluster for switch and data center reconfigure as Resolved.

All set

Wed, Jul 11, 3:48 PM · cloud-services-team (Kanban)
Bstorm closed T197245: Move toolsdb and wikilabels cluster servers for datacenter reconfiguration as Resolved.
Wed, Jul 11, 3:47 PM · cloud-services-team (Kanban), Toolforge, Scoring-platform-team, Wikilabels
Bstorm added a comment to T197245: Move toolsdb and wikilabels cluster servers for datacenter reconfiguration.

This is done.

Wed, Jul 11, 3:47 PM · cloud-services-team (Kanban), Toolforge, Scoring-platform-team, Wikilabels

Tue, Jul 10

Bstorm triaged T199276: Test running a stretch exec node in the existing system on toolsbeta as Normal priority.
Tue, Jul 10, 8:53 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a parent task for T88711: Fully puppetize Grid Engine: T199271: Upgrade the tools gridengine system.
Tue, Jul 10, 8:10 PM · Goal, Puppet, Toolforge
Bstorm added a subtask for T199271: Upgrade the tools gridengine system: T88711: Fully puppetize Grid Engine.
Tue, Jul 10, 8:10 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a parent task for T88733: Document our GridEngine set up: T199271: Upgrade the tools gridengine system.
Tue, Jul 10, 8:10 PM · Documentation, Cloud-Services, Toolforge, Puppet
Bstorm added a subtask for T199271: Upgrade the tools gridengine system: T88733: Document our GridEngine set up.
Tue, Jul 10, 8:10 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a parent task for T88237: Track and alert based on gridengine error states: T199271: Upgrade the tools gridengine system.
Tue, Jul 10, 8:09 PM · Cloud-Services, monitoring, Toolforge
Bstorm added a subtask for T199271: Upgrade the tools gridengine system: T88237: Track and alert based on gridengine error states.
Tue, Jul 10, 8:09 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a parent task for T161898: Tools instances flapping puppet failure alerts: T199271: Upgrade the tools gridengine system.
Tue, Jul 10, 8:08 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm added a subtask for T199271: Upgrade the tools gridengine system: T161898: Tools instances flapping puppet failure alerts.
Tue, Jul 10, 8:08 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a parent task for T162955: rebuild tools-grid-master as a large instance: T199271: Upgrade the tools gridengine system.
Tue, Jul 10, 8:03 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, Cloud-Services
Bstorm added subtasks for T199271: Upgrade the tools gridengine system: Restricted Task, T162955: rebuild tools-grid-master as a large instance.
Tue, Jul 10, 8:03 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm created T199271: Upgrade the tools gridengine system.
Tue, Jul 10, 8:01 PM · Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a comment to T197246: Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration.

labsdb1006 is now also moved. Asked @akosiaris for help getting it up correctly as a master.

Tue, Jul 10, 7:39 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T197245: Move toolsdb and wikilabels cluster servers for datacenter reconfiguration.

labsdb1004 is moved, tomorrow will be 1005.

Tue, Jul 10, 7:38 PM · cloud-services-team (Kanban), Toolforge, Scoring-platform-team, Wikilabels
Bstorm added a comment to T197244: Move analytics wiki replica cluster for switch and data center reconfigure.

labsdb1010 is moved.

Tue, Jul 10, 7:37 PM · cloud-services-team (Kanban)
Bstorm added a parent task for T199248: Smart alert on labstore1006: T196651: rack upgraded storage capacity in labstore100[67].eqiad.wmnet.
Tue, Jul 10, 4:48 PM · cloud-services-team (Kanban)
Bstorm added a subtask for T196651: rack upgraded storage capacity in labstore100[67].eqiad.wmnet: T199248: Smart alert on labstore1006.
Tue, Jul 10, 4:48 PM · Datasets-General-or-Unknown, ops-eqiad, Cloud-VPS, Operations
Bstorm added a project to T199248: Smart alert on labstore1006: cloud-services-team (Kanban).
Tue, Jul 10, 4:47 PM · cloud-services-team (Kanban)
Bstorm added a parent task for T199236: Handle SMART for multiple shelves and controllers: T199248: Smart alert on labstore1006.
Tue, Jul 10, 4:47 PM · User-fgiunchedi, Operations, monitoring
Bstorm added a subtask for T199248: Smart alert on labstore1006: T199236: Handle SMART for multiple shelves and controllers.
Tue, Jul 10, 4:47 PM · cloud-services-team (Kanban)
Bstorm added a comment to T199248: Smart alert on labstore1006.

This appears to be a problem with the monitor more than the array.

Tue, Jul 10, 4:46 PM · cloud-services-team (Kanban)
Bstorm created T199248: Smart alert on labstore1006.
Tue, Jul 10, 4:46 PM · cloud-services-team (Kanban)
Bstorm added a comment to T198420: Improve unmount/relink setup for dumps (labstore1006/1007) failovers.
  • also, action item for me: check into mount option possibilities to make this work better
Tue, Jul 10, 4:40 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T197246: Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration.

Well, that was silly of me. Of course there are a bunch of roles and things not created on the re-imaged server. It'll probably need some kind of dump and restore to make this easy unless there's a doc around.

Tue, Jul 10, 5:13 AM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T197246: Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration.

Re-imaged labsdb1006 to stretch. In the process, I found that the storage is a bit odd. One of the LVs is named "_placeholder", which prevents puppet from working and it isn't mounted. This could be by design. I renamed the _placeholder to the correct volume name, similar to the current master and apparently had to create a filesystem on it. Puppet created the directory tree there once I mounted it, and I think the cron job that syncs over files from OSM should run in a few minutes (checking var/spool). If that finishes by morning, it should actually be ready then. This was a bit heavier than I expected, but it might work out.

Tue, Jul 10, 1:08 AM · Patch-For-Review, cloud-services-team (Kanban)

Mon, Jul 9

Bstorm added a comment to T196651: rack upgraded storage capacity in labstore100[67].eqiad.wmnet.

New shelf is now live and part of the /srv/dumps filesystem on labstore1006. It isn't fully restored to service yet, but everything looks good to do so.

Mon, Jul 9, 10:59 PM · Datasets-General-or-Unknown, ops-eqiad, Cloud-VPS, Operations

Thu, Jul 5

Bstorm added a comment to T194855: Degraded RAID on labvirt1020.

Disabled unused raid controller in the BIOS, which is at least half of this alert. However, this also is missing a battery, which HP considers an optional purchase that we should have.

Thu, Jul 5, 9:50 PM · ops-eqiad, Operations
Bstorm added a comment to T196507: Degraded RAID on labvirt1019.

Now that the spam is done from the last vandalism, @RobH, I am curious what can be done about the battery. There is some quirky history regarding the array here, but I figure we probably need to buy the battery either way. The raid card for this server and its partner were shipped without the "optional" cache backup battery. This is at least one reason there are degraded RAID alerts for them.

Thu, Jul 5, 9:21 PM · ops-eqiad, Operations
Bstorm added a comment to T196507: Degraded RAID on labvirt1019.

I went ahead and disabled the unused RAID controller in the BIOS. I have confirmed that this is not enough to clear the monitor. The lack of battery still reads as "critical".

Thu, Jul 5, 9:18 PM · ops-eqiad, Operations
Bstorm moved T171394: Better monitoring for labstore backup crons from Inbox to Doing on the cloud-services-team (Kanban) board.
Thu, Jul 5, 6:50 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
Bstorm moved T198420: Improve unmount/relink setup for dumps (labstore1006/1007) failovers from Inbox to To-Do on the cloud-services-team (Kanban) board.
Thu, Jul 5, 6:48 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm moved T197244: Move analytics wiki replica cluster for switch and data center reconfigure from Inbox to Doing on the cloud-services-team (Kanban) board.
Thu, Jul 5, 6:48 PM · cloud-services-team (Kanban)
Bstorm moved T197246: Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration from Inbox to Doing on the cloud-services-team (Kanban) board.
Thu, Jul 5, 6:48 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm moved T197245: Move toolsdb and wikilabels cluster servers for datacenter reconfiguration from Inbox to Doing on the cloud-services-team (Kanban) board.
Thu, Jul 5, 6:47 PM · cloud-services-team (Kanban), Toolforge, Scoring-platform-team, Wikilabels
Bstorm added a comment to T197246: Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration.

That seems fair.

Thu, Jul 5, 6:27 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T197245: Move toolsdb and wikilabels cluster servers for datacenter reconfiguration.

@jcrespo -- With some issues around the RAID still giving me trouble, we could perhaps do that stretch upgrade when we move to VMs. Otherwise, would that draw out the service impact a lot? Databases would need to come down during it, I presume, and perhaps we can do that.

Thu, Jul 5, 6:26 PM · cloud-services-team (Kanban), Toolforge, Scoring-platform-team, Wikilabels
Bstorm closed T198700: Request creation of WikiCiteVis VPS project as Resolved.
Thu, Jul 5, 5:24 PM · Cloud-VPS (Project-requests)
Bstorm added a watcher for cloud-services-team (Kanban): Bstorm.
Thu, Jul 5, 4:55 PM
Bstorm added a comment to T161898: Tools instances flapping puppet failure alerts.

In discussions at today's retrospective, one proposal discussed is to nice the puppet agent process on worker and exec nodes in toolforge, prioritizing user code, which could help with some things (or cause additional staleness alerts).

Thu, Jul 5, 4:52 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T161898: Tools instances flapping puppet failure alerts.

One of the things I've found causing these failures is that there is a condition (one that is supposedly solved on the Red Hat Network that I cannot access, I hear) that causes Puppet to attempt to mount already mounted NFS mounts. This throws exit code 32 (already mounted), which puppet treats as a failure and kills the whole run.
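
A rough sketch of the guard this implies: look at the kernel's mount table first and only attempt the mount when the mount point is absent. This is not how Puppet's mount type is implemented; the function names and the sample NFS server and paths are made up for illustration, and the text is assumed to follow the /proc/mounts layout (device, mount point, filesystem type, options, and so on).

```python
def mounted_points(mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text.

    Note: /proc/mounts escapes spaces in paths as \\040; this sketch
    does not decode those escapes.
    """
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def needs_mount(mount_point, mounts_text):
    """True when `mount_point` does not appear in the mount table text."""
    normalized = mount_point.rstrip("/") or "/"
    return normalized not in mounted_points(mounts_text)


# Hypothetical sample; the server name and paths are invented.
SAMPLE = (
    "/dev/vda1 / ext4 rw,relatime 0 0\n"
    "nfs.example.invalid:/project /data/project nfs4 rw,noatime 0 0\n"
)

if __name__ == "__main__":
    print(needs_mount("/data/project", SAMPLE))
    print(needs_mount("/data/scratch", SAMPLE))
```

On a live host the text would come from reading /proc/mounts; keeping the parser separate from the file read makes it easy to test without touching real mounts.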

Thu, Jul 5, 4:50 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm edited projects for T161898: Tools instances flapping puppet failure alerts, added: cloud-services-team (Kanban); removed Patch-For-Review.
Thu, Jul 5, 4:48 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm closed T198877: Minimize or eliminate flapping and erroneous puppet alerts from shinken as Invalid.

Removing this task in favor of the previous. There's a lot of work and context in that one.

Thu, Jul 5, 4:48 PM · cloud-services-team (Kanban)
Bstorm removed a subtask for T198877: Minimize or eliminate flapping and erroneous puppet alerts from shinken: T161898: Tools instances flapping puppet failure alerts.
Thu, Jul 5, 4:47 PM · cloud-services-team (Kanban)
Bstorm removed a parent task for T161898: Tools instances flapping puppet failure alerts: T198877: Minimize or eliminate flapping and erroneous puppet alerts from shinken.
Thu, Jul 5, 4:47 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T198877: Minimize or eliminate flapping and erroneous puppet alerts from shinken.

Found the previous ticket for this. Adding it as a subtask in order to preserve that context, at very least.

Thu, Jul 5, 4:41 PM · cloud-services-team (Kanban)
Bstorm added a parent task for T161898: Tools instances flapping puppet failure alerts: T198877: Minimize or eliminate flapping and erroneous puppet alerts from shinken.
Thu, Jul 5, 4:41 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm added a subtask for T198877: Minimize or eliminate flapping and erroneous puppet alerts from shinken: T161898: Tools instances flapping puppet failure alerts.
Thu, Jul 5, 4:41 PM · cloud-services-team (Kanban)
Bstorm created T198877: Minimize or eliminate flapping and erroneous puppet alerts from shinken.
Thu, Jul 5, 4:30 PM · cloud-services-team (Kanban)
Bstorm added a comment to T198700: Request creation of WikiCiteVis VPS project.

Sorry about that!

Thu, Jul 5, 3:05 PM · Cloud-VPS (Project-requests)

Wed, Jul 4

Bstorm added a comment to T197985: Cleanup docker images in PAWS kubernetes cluster.

So I've added the extra args and restarted the kubelet process. However, not only did this not clean up space, but running docker container prune and docker image prune also didn't help as much as expected. A bit strange.

Wed, Jul 4, 12:04 AM · PAWS

Tue, Jul 3

Bstorm added a comment to T198700: Request creation of WikiCiteVis VPS project.

You should be good to go for now. I've created your project and added you to it. You should be able to access things via Horizon, now, to set up.

Tue, Jul 3, 9:08 PM · Cloud-VPS (Project-requests)

Mon, Jul 2

Bstorm added a comment to T195515: GUC query performance regressed 100x from <3s to 80-300s.

Oh no, the timing lines up the other way around: the page join removal would have put things back to the way they were before you started to see problems, and the addition of the joins is likely to still be a problem. If I put the joins back, it would crush performance for basically anything querying the page table. I'm quite sure that has no effect here. The MCR revisions (which added lots more joins, including to the revision table), however, introduced many joins to the tables you are querying. Removing those joins might fix things, but it would also break backward compatibility with MCR. I am concerned about the overall health of the database system on that server, though, because of how much I see things in an "opening tables" state. That seems weird. It could be connected to the MCR changes, or it might be something else.

Mon, Jul 2, 5:43 PM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
Bstorm updated subscribers of T195515: GUC query performance regressed 100x from <3s to 80-300s.
Mon, Jul 2, 4:37 PM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
Bstorm added a comment to T195747: Create views for the schema change for refactored actor storage.

I've rebased and set up the patch for the replicas. Is the actor table ready to go with img_actor and all that now?
Also, any comments on the patch?

Mon, Jul 2, 4:34 PM · Core-Platform-Team, Patch-For-Review, Data-Services
Bstorm claimed T193655: rack/setup/install labstore1008 & labstore1009.
Mon, Jul 2, 2:53 PM · cloud-services-team (Kanban), Patch-For-Review, ops-eqiad, Cloud-VPS, Operations
Bstorm claimed T195747: Create views for the schema change for refactored actor storage.
Mon, Jul 2, 2:38 PM · Core-Platform-Team, Patch-For-Review, Data-Services

Fri, Jun 29

Bstorm added a comment to T197246: Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration.

No that's not needed in this case, cause it's a peculiar one. The servers are a mirror of openstreetmap so we can just resync fully from upstream (plus some minor pg_dump+import right before the switch). But

Fri, Jun 29, 4:23 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T197246: Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration.

Bump up the switch move or the stretch upgrade? The switch move is scheduled for the 10th and 11th to avoid conflicts with holiday plans and so forth as well as to coincide with two other database clusters moving on the same days.

Fri, Jun 29, 4:22 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm closed T188681: Maintain-dbusers should handle failures due to replicas being in maintenance as Resolved.
Fri, Jun 29, 4:12 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
Bstorm closed T192098: Add tmpreaper to all tools execute nodes, if appropriate as Declined.
Fri, Jun 29, 4:10 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T197246: Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration.

A timeline for upgrading to stretch or this move event? The basic datacenter reconfig is just scheduled to happen on the dates in the description. To make 1006 the master, we should be mindful that replication needs to be fully set up (it is non-functional at the moment).

Fri, Jun 29, 3:59 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T198479: labvirt1009 HP Raid alert.

Should be. It's an HP Smart P420i in a RAID 10 logical disk and is the only failure. Unless the disk itself isn't a hot swap form factor, it should be good, right? I'm, of course, presuming that it is a hot swap form factor, which might be silly.

Fri, Jun 29, 3:38 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team

Thu, Jun 28

Bstorm created T198420: Improve unmount/relink setup for dumps (labstore1006/1007) failovers.
Thu, Jun 28, 8:22 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T196651: rack upgraded storage capacity in labstore100[67].eqiad.wmnet.

Cabling information grabbed from these two documents: D3600 manual: http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c04219600-1.pdf
D3000 series wiring guide: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c05252635

Thu, Jun 28, 8:07 PM · Datasets-General-or-Unknown, ops-eqiad, Cloud-VPS, Operations

Wed, Jun 27

Bstorm updated subscribers of T195515: GUC query performance regressed 100x from <3s to 80-300s.

@jcrespo I was going through this again today, and I noticed that the primary replica server for the web has an awful lot of this "opening tables" going on--even from 'system user' processes. I also noticed this is not happening on the analytics servers.

Wed, Jun 27, 6:49 PM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
Bstorm added a comment to T153163: Set up and use exported resources for Tool Labs's shared knowledge.

puppetdb-terminus appears to be installed. I don't see it configured, though.

Wed, Jun 27, 6:17 PM · Patch-For-Review, Toolforge, Cloud-Services
Bstorm added a comment to T153163: Set up and use exported resources for Tool Labs's shared knowledge.

It would appear that this was not set up on the tools puppetmaster at this time:
Jun 27 18:01:06 tools-puppetmaster-01 puppet-master[19573]: You cannot collect exported resources without storeconfigs being set; the export is ignored at /etc/puppet/modules/monitoring/

Wed, Jun 27, 6:02 PM · Patch-For-Review, Toolforge, Cloud-Services
Bstorm updated subscribers of T196507: Degraded RAID on labvirt1019.

@RobH Any thoughts on that battery issue above? I'm going to see if the first controller that isn't being used can be disabled in the BIOS or something.

Wed, Jun 27, 4:46 PM · ops-eqiad, Operations
Bstorm added a comment to T196507: Degraded RAID on labvirt1019.

From https://h20195.www2.hpe.com/v2/getpdf.aspx/c04346301.pdf?ver=2

Wed, Jun 27, 4:17 PM · ops-eqiad, Operations
Bstorm added a comment to T196507: Degraded RAID on labvirt1019.

This server appears to be fully functional from all views I can see. However, the monitor for RAID would disagree and think it is critical. I believe it reports that there are no drives on one controller (which is correct!) and no batteries on the live controller (which I'm not so sure of). If the actual live controller actually doesn't have a battery and isn't supposed to, that's probably fine. If it should be reporting a battery, then we might still have something to fix. I'll dig around a little regarding that.

Wed, Jun 27, 4:12 PM · ops-eqiad, Operations
Bstorm closed T194964: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 as Resolved.

Looking good! The VM is doing a puppet run. I think the network is working on these things now.

Wed, Jun 27, 4:09 PM · Cloud-Services, Patch-For-Review, ops-eqiad, Operations
Bstorm added a comment to T194855: Degraded RAID on labvirt1020.

This is currently still some kind of an issue on both servers. The thing is that I'm not sure if it is a problem or just describing reality (embedded controller has no disk and installed controller doesn't report a battery).

Wed, Jun 27, 3:53 PM · ops-eqiad, Operations

Tue, Jun 26

Bstorm added a comment to T194964: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020.

The bad: for some reason, even though eth1 shows as up, the VM on there has no access to the network and is failing at DHCP. That seems more fixable in this state than it was before, though!

eth1 is now working fine.

  1. Switch was configured for ge-4/0/33 and not xe-4/0/33
  2. The test IP was configured on eth1.1102 instead of br1102. Moved it and it can now ping other IPs on the same subnet.
Tue, Jun 26, 10:35 PM · Cloud-Services, Patch-For-Review, ops-eqiad, Operations
Bstorm added a comment to T194964: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020.

Thanks!

Tue, Jun 26, 3:01 PM · Cloud-Services, Patch-For-Review, ops-eqiad, Operations

Fri, Jun 22

Bstorm closed T197977: https://tools-prometheus.wmflabs.org/tools responds with 503, a subtask of T53434: Implement a system to monitor tools on tool-labs, as Resolved.
Fri, Jun 22, 7:52 PM · User-Matthewrbowker, community-labs-monitoring, Toolforge
Bstorm closed T197977: https://tools-prometheus.wmflabs.org/tools responds with 503 as Resolved.
Fri, Jun 22, 7:52 PM · cloud-services-team (Kanban), monitoring, Toolforge
Bstorm added a comment to T197977: https://tools-prometheus.wmflabs.org/tools responds with 503.
$ systemctl status prometheus@tools.service
● prometheus@tools.service - prometheus server (instance tools)
   Loaded: loaded (/lib/systemd/system/prometheus@tools.service; static)
   Active: active (running) since Fri 2018-06-22 19:44:36 UTC; 7min ago
 Main PID: 12729 (prometheus)
   CGroup: /system.slice/system-prometheus.slice/prometheus@tools.service
           └─12729 /usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-a.
Fri, Jun 22, 7:52 PM · cloud-services-team (Kanban), monitoring, Toolforge
Bstorm added a comment to T197977: https://tools-prometheus.wmflabs.org/tools responds with 503.

Dude! The answer was right in front of me. On a *nix system, prometheus tries to open the file for reading: https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock_unix.go#L43
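
To make the failure mode concrete, here is a small Python analogue of "open the file, then take an exclusive lock on it". It is not Prometheus's Go code, and the name `try_lock` is made up for this sketch. The point it illustrates is that the open happens before the lock, so a file that cannot be opened (ownership, permissions, missing directory) fails before any locking is attempted.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Open `path` read-only (creating it if absent) and take an
    exclusive, non-blocking flock on it.

    Returns the open file descriptor on success. Raises OSError if the
    file cannot be opened or if the lock is already held elsewhere.
    """
    fd = os.open(path, os.O_RDONLY | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "LOCK")
        held = try_lock(lock_path)
        try:
            try_lock(lock_path)
        except OSError as exc:
            print("second lock attempt failed:", type(exc).__name__)
        finally:
            os.close(held)
```

Because the two failure paths raise different exception types (open errors versus a lock that is already held), logging the exception type is usually enough to tell which of the two situations a service is in.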

Fri, Jun 22, 7:42 PM · cloud-services-team (Kanban), monitoring, Toolforge
Bstorm added a comment to T197977: https://tools-prometheus.wmflabs.org/tools responds with 503.

Apparently, the file's existence shouldn't matter (at least in current Prometheus). It should be able to lock it, but it cannot https://gitlab.cncf.ci/prometheus/prometheus/blob/934d86b936f3e6e35581c44e6ec22236f4c69de3/util/flock/flock.go#L31

Fri, Jun 22, 7:39 PM · cloud-services-team (Kanban), monitoring, Toolforge
Bstorm added a comment to T197977: https://tools-prometheus.wmflabs.org/tools responds with 503.
$ systemctl status prometheus@tools.service
● prometheus@tools.service - prometheus server (instance tools)
   Loaded: loaded (/lib/systemd/system/prometheus@tools.service; static)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2018-06-22 19:02:37 UTC; 1s ago
  Process: 30784 ExecStart=/usr/bin/prometheus -storage.local.max-chunks-to-persist 524288 -storage.local.memory-chunks 1048576 -storage.local.path /srv/prometheus/tools/metrics -web.listen-address 127.0.0.1:9902 -web.external-url https://tools-prometheus.wmflabs.org/tools -storage.local.retention 730h0m0s -config.file /srv/prometheus/tools/prometheus.yml -storage.local.chunk-encoding-version 2 (code=exited, status=1/FAILURE)
 Main PID: 30784 (code=exited, status=1/FAILURE)
Fri, Jun 22, 7:09 PM · cloud-services-team (Kanban), monitoring, Toolforge
Bstorm closed T183920: 2018-01-02: labstore Tools and Misc share very full as Resolved.

This seems pretty good at this point, so I'll close this task for now.

Fri, Jun 22, 5:53 PM · cloud-services-team (Kanban), Operations, Cloud-VPS
Bstorm updated subscribers of T184126: Templatetiger-Updating: Lost connection to MySQL server during query.

Taking a look at where I think the query killer lives, it seems like the comment won't have any effect. However, it could be that some exception is needed for LOAD DATA LOCAL type statements? @jcrespo Am I close to the mark here?

Fri, Jun 22, 5:40 PM · Data-Services, Toolforge