Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (35 w, 9 h)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF)

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Yesterday

Bstorm added a comment to T202558: Ban spam arriving to my tools email.

It's not a huge load, though, to be clear.

Mon, Sep 24, 5:03 PM · Patch-For-Review, Puppet, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T202558: Ban spam arriving to my tools email.

We are now generating additional frozen messages due to the reply this sends, lol. It's interesting.

Mon, Sep 24, 5:02 PM · Patch-For-Review, Puppet, cloud-services-team (Kanban), Cloud-Services
Bstorm awarded T196137: toolforge: prometheus issue is filling up email queue a Party Time token.
Mon, Sep 24, 4:40 PM · cloud-services-team (Kanban), Patch-For-Review, Toolforge
Bstorm added a comment to T202558: Ban spam arriving to my tools email.

Noticed on deploying this that the mail queue was clogged up with frozen messages rejected from qq.com servers. I cleaned them with:
exim -bpu | grep '*** frozen ***' | awk '{print $3}' | xargs -i exim -Mrm {}
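The one-liner above picks the third whitespace-separated field (the message ID) from every queue line marked as frozen. As a rough sketch of the same selection step, assuming the usual `exim -bpu` listing layout of age, size, message ID, sender and an optional frozen marker (the sample listing and the function name are made up for illustration):

```python
from typing import List


def frozen_message_ids(queue_listing: str) -> List[str]:
    """Return the IDs of frozen messages found in `exim -bpu`-style output."""
    ids = []
    for line in queue_listing.splitlines():
        # Only the per-message header line carries the frozen marker.
        if "*** frozen ***" not in line:
            continue
        fields = line.split()
        # Assumed layout: <age> <size> <message-id> <sender> *** frozen ***
        if len(fields) >= 3:
            ids.append(fields[2])
    return ids


# Hypothetical sample listing; real output comes from `exim -bpu`.
SAMPLE = """\
 25m  2.9K 1g4abc-0001Xy-7Z <> *** frozen ***
          someone@example.org

  3h  1.1K 1g4def-0002Ab-9Q <tool@example.org>
          other@example.org
"""

print(frozen_message_ids(SAMPLE))
```

Listing the IDs first makes it possible to review what would be removed before handing them to `exim -Mrm`.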

Mon, Sep 24, 4:35 PM · Patch-For-Review, Puppet, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T153281: webgrid-lighttpd queues kill OOM jobs with SIGKILL leaving php-cgi processes behind.

Adding a note because I just found it: this is where it needs to change if it were all in puppet https://github.com/wikimedia/puppet/blob/production/modules/toollabs/templates/gridengine/queue-webgrid.erb#L14

Mon, Sep 24, 4:14 PM · Toolforge
Bstorm added a comment to T202558: Ban spam arriving to my tools email.

Thanks @herron and @GTirloni! The spamhaus list certainly can't hurt. I don't think we get so much email as to run afoul of their policies?

Mon, Sep 24, 3:29 PM · Patch-For-Review, Puppet, cloud-services-team (Kanban), Cloud-Services

Fri, Sep 21

Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Ok, now the test environment is working correctly and not randomly collapsing. Presuming this stays true over the weekend, the procedure is as follows:

Fri, Sep 21, 7:05 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

It did. Killing rpc-statd fixed the problem. This doesn't need statd for basically anything. It was only started due to the nfs-common package which doesn't run in this version. Rebooting the server doesn't restart it, and nfs continues working. So far so good. Running more bonnie++ to stress-test it.

Fri, Sep 21, 4:40 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T205112: Queriing enwiki_p.externallinks access is denied.

Odd... Could have sworn I did those. Hitting it again. There may have been a replication lag or something that made those not get the change right away?

Fri, Sep 21, 4:10 PM · Data-Services
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

The correct procedure for installing the backports:

Fri, Sep 21, 3:05 PM · Patch-For-Review, cloud-services-team (Kanban)

Thu, Sep 20

Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Thu, Sep 20, 6:47 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Not sure it is a new situation, but I just became aware that the RPS setting intended to balance IRQs over CPUs for network receive queues isn't working on labstore1004. All receive and tx queues for the interface are clearly going over CPU0 only...and RPS settings are looking like they are not really set in general in sysfs. This also has two numa nodes...what a lovely rabbit hole. (copied from IRC)
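For context, RPS is configured per receive queue by writing a hexadecimal CPU bitmap to `/sys/class/net/<iface>/queues/rx-<n>/rps_cpus`. The following is only a sketch of how such a bitmap could be computed. The helper name is hypothetical, and splitting wide masks into comma-separated 32-bit groups is an assumption based on how the kernel usually prints CPU masks:

```python
def rps_cpu_mask(cpus):
    """Build a hex CPU bitmap string (one bit per CPU number)."""
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu  # CPU n corresponds to bit n
    hex_digits = format(bits, "x")
    # Split into 32-bit (8 hex digit) groups from the right,
    # joined with commas (assumed format for masks wider than 32 CPUs).
    groups = []
    while hex_digits:
        groups.append(hex_digits[-8:])
        hex_digits = hex_digits[:-8]
    groups.reverse()
    return ",".join(groups)


print(rps_cpu_mask([0]))           # only CPU0, the situation described above
print(rps_cpu_mask([0, 1, 2, 3]))  # spread over the first four CPUs
```

On a machine with two NUMA nodes, the CPU list would presumably be limited to the node local to the network card, but that is a tuning decision this sketch does not make.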

Thu, Sep 20, 6:36 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm updated subscribers of T200557: Create a stretch and Son of Grid Engine grid in toolsbeta.

In installing toolsbeta's puppetdb, I created a document to try and bring some structure to the process before this goes to tools, and @Paladox found that the process makes some interesting, prod-focused changes: P7573

Thu, Sep 20, 2:23 PM · Patch-For-Review, Toolforge, Epic, cloud-services-team (Kanban)

Wed, Sep 19

Krenair awarded T200557: Create a stretch and Son of Grid Engine grid in toolsbeta a Party Time token.
Wed, Sep 19, 8:31 PM · Patch-For-Review, Toolforge, Epic, cloud-services-team (Kanban)
Bstorm updated subscribers of T204857: notebook1003 failed network mount on boot.

Guessing some folks who might be interested in this task.

Wed, Sep 19, 6:15 PM · Cloud-Services, Analytics-Cluster, Analytics, Operations
Bstorm added a comment to T204857: notebook1003 failed network mount on boot.

Seems like there was a dependency during boot where it gets held up waiting for the mount of NFS in systemd. Perhaps the mount task needs to wait for network or something.

Wed, Sep 19, 6:14 PM · Cloud-Services, Analytics-Cluster, Analytics, Operations
Bstorm added a comment to T204857: notebook1003 failed network mount on boot.

[bstorm@notebook1003]:dumps-labstore1006.wikimedia.org $ df .
Filesystem                           1K-blocks        Used   Available Use% Mounted on
labstore1006.wikimedia.org:/dumps 105068749584 46516000048 53278293472  47% /mnt/nfs/dumps-labstore1006.wikimedia.org

Wed, Sep 19, 6:10 PM · Cloud-Services, Analytics-Cluster, Analytics, Operations

Tue, Sep 18

Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

@GTirloni has built a VPS project for us to test this in. Clearly, we need a bit more testing before releasing this again. So far, the secondary NFS server, labstore1005, still has the backported packages installed. If we cannot make this work in testing, I'll roll back the secondary as well.

Tue, Sep 18, 2:06 PM · Patch-For-Review, cloud-services-team (Kanban)

Mon, Sep 17

Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

I suspect that I may also need to restart some services such as rpcbind, portmap, etc, that may not have been captured by the service I did restart. The result was that the server was advertising exports, but it was not giving anybody permission to use them (despite the exports file being correct).

Mon, Sep 17, 6:49 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Well, that went terribly! Stopping the nfs-kernel-server and installing nfs-common and nfs-kernel-server backported packages did not result in a working server. Rolled back.

Mon, Sep 17, 6:48 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm updated the task description for T204530: cloudvps: tools and toolsbeta trusty deprecation.
Mon, Sep 17, 5:47 PM · cloud-services-team (FY2018-19), Goal, Cloud-VPS
Bstorm updated the task description for T204530: cloudvps: tools and toolsbeta trusty deprecation.
Mon, Sep 17, 5:46 PM · cloud-services-team (FY2018-19), Goal, Cloud-VPS
Bstorm added a comment to T195747: Create views for the schema change for refactored actor storage.

Ok, I believe, from everything I'm seeing here and on the replicas that I can push out the patch and rebuild views (with depools) as in the current patch set https://gerrit.wikimedia.org/r/c/operations/puppet/+/431823. I've tested it on my local setup.

Mon, Sep 17, 5:40 PM · cloud-services-team (Kanban), Core-Platform-Team, Patch-For-Review, Data-Services
Bstorm added a comment to T114117: Drop externallinks.el_from_namespace on wmf databases.

Done re-running that view.

Mon, Sep 17, 4:47 PM · DBA, Schema-change
zhuyifei1999 awarded T204530: cloudvps: tools and toolsbeta trusty deprecation a Love token.
Mon, Sep 17, 4:06 PM · cloud-services-team (FY2018-19), Goal, Cloud-VPS
Bstorm added a subtask for T204530: cloudvps: tools and toolsbeta trusty deprecation: T200557: Create a stretch and Son of Grid Engine grid in toolsbeta.
Mon, Sep 17, 1:47 PM · cloud-services-team (FY2018-19), Goal, Cloud-VPS
Bstorm added a parent task for T200557: Create a stretch and Son of Grid Engine grid in toolsbeta: T204530: cloudvps: tools and toolsbeta trusty deprecation.
Mon, Sep 17, 1:47 PM · Patch-For-Review, Toolforge, Epic, cloud-services-team (Kanban)
Bstorm claimed T204530: cloudvps: tools and toolsbeta trusty deprecation.
Mon, Sep 17, 1:46 PM · cloud-services-team (FY2018-19), Goal, Cloud-VPS
Bstorm triaged T204530: cloudvps: tools and toolsbeta trusty deprecation as Normal priority.
Mon, Sep 17, 1:45 PM · cloud-services-team (FY2018-19), Goal, Cloud-VPS

Fri, Sep 14

Bstorm added a comment to T204359: Investigate and/or deploy LACP to NFS servers for Cloud Services.

This is a separate issue from T203254 because it is a long-standing problem.

Fri, Sep 14, 3:34 PM · cloud-services-team (Kanban)
Bstorm created T204359: Investigate and/or deploy LACP to NFS servers for Cloud Services.
Fri, Sep 14, 3:33 PM · cloud-services-team (Kanban)
Bstorm awarded T180318: Add CI to all labs/tools/* repositories and archive obsolete ones a Love token.
Fri, Sep 14, 1:56 PM · Tools, Patch-For-Review, Gerrit, Continuous-Integration-Config
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Also note: it depends on keyutils.

Fri, Sep 14, 1:44 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Ok, simply starting and stopping the nfs server resolved that. Note: in this version the service is called nfs-server, not nfs-kernel-server.

Fri, Sep 14, 1:43 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Tried it on the inactive server first and found that it does something a bit odd with an inactive server. Without nfs started, the exportfs command fails, which is no surprise in some ways, but apparently it did work before. May need to make the script not run the exportfs command unless it is active.

Fri, Sep 14, 1:24 PM · Patch-For-Review, cloud-services-team (Kanban)

Thu, Sep 13

Bstorm added a comment to T161898: Tools instances flapping puppet failure alerts.

See my comments on the change. I'm suddenly wondering why we are using ls as our test. There might be some context in git. I'll check.

Thu, Sep 13, 5:48 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-Services
Bstorm reopened T181650: Change views for the new columns of the refactored comment storage as "Open".

@Anomie It is clear that the joins are impacting performance badly for some tables for this as well (especially some of the userindex views and friends, which are only there for better performance), so I am thinking it might be good to move the comment stuff over to compat views. It seems we are already moving to the "write new" phase of this, so I'm not sure exactly how much impact it'll be if I do that. Are we presently at "write both" or "write new" @Marostegui ?

Thu, Sep 13, 4:36 PM · cloud-services-team (Kanban), Data-Services
Bstorm reopened T181650: Change views for the new columns of the refactored comment storage, a subtask of T166733: Deploy refactored comment storage, as Open.
Thu, Sep 13, 4:36 PM · Core-Platform-Team (CPT-Q1-Jul-Sep-2018), User-notice, Epic, Release-Engineering-Team (Watching / External)
Bstorm reopened T181650: Change views for the new columns of the refactored comment storage, a subtask of T174569: Schema change for refactored comment storage, as Open.
Thu, Sep 13, 4:36 PM · MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Dumps-Generation, Data-Services, Blocked-on-schema-change, DBA
Bstorm added a comment to T195515: GUC query performance regressed 100x from <3s to 80-300s.

Reading back, revision_userindex does still have two joins in it. It joins on revision_comment_temp and comment. I'm going to see if the comment joins can be feasibly removed from at least the *_userindex tables.

Thu, Sep 13, 4:06 PM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
Bstorm added a comment to T195515: GUC query performance regressed 100x from <3s to 80-300s.

I am going to look at applying the same logic in one more place, in case that's doable.

Thu, Sep 13, 2:16 PM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
Bstorm added a comment to T195515: GUC query performance regressed 100x from <3s to 80-300s.

It's more generally applied to the web replicas now. There is still one analytics replica I need to apply this change to, but that is promising.

Thu, Sep 13, 2:06 PM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
Bstorm added a comment to T161898: Tools instances flapping puppet failure alerts.

Good ideas, but this all far predates the kernel upgrades. The NFS performance from the servers also hasn't significantly degraded for clients (just for the servers, from all the metrics I can see). The NFS errors that cause this are that our puppetization often tries to mount NFS volumes that are already mounted (for some reason), which returns status 32 (in puppet, that's a failure). 32 in NFS means that it's already mounted. On the next run, the puppet setup notes the mount. The question is: why didn't it see it in the first place? Also, can we make it recognize 32 as success?
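One way to sidestep the question is to check whether the target is already mounted before attempting the mount, since the mount(8) man page lists exit status 32 simply as "mount failure", which covers more than the already-mounted case. The following is a minimal sketch of such a check, assuming the `/proc/mounts` layout (device, mount point, type, options, ...). The function name and sample paths are hypothetical, and this is not how the puppet code actually does it:

```python
def is_mounted(mounts_text, mount_point):
    """Return True if mount_point appears as a mount target in /proc/mounts text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 0 is the device or export, field 1 is the mount point.
        # Note: /proc/mounts escapes spaces in paths as \040; not handled here.
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


# Hypothetical sample; on a real host, read the text from /proc/mounts.
SAMPLE_MOUNTS = """\
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
nfs.example.org:/project /data/project nfs4 rw,relatime,vers=4.0 0 0
"""

print(is_mounted(SAMPLE_MOUNTS, "/data/project"))
print(is_mounted(SAMPLE_MOUNTS, "/data/scratch"))
```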

Thu, Sep 13, 12:41 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-Services

Wed, Sep 12

Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

It's early to tell, but this may have reduced overall load a small amount. If so, some changes will be needed to make the tuning persistent. It's hard to tell because it is not a big difference. That said, it would not have paged last night.

Wed, Sep 12, 4:03 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T201544: My database tables on Postgres -OSM Server are away.

Ok, the wp_coords_red0 table is restored from that dump in the gis DB at this point. How far does that get us? @Kolossos

Wed, Sep 12, 3:24 PM · Maps, Cloud-Services
Bstorm closed T204071: tools-webgrid-generic-1402 filling up from pacct on man tasks as Resolved.
Wed, Sep 12, 2:59 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T195515: GUC query performance regressed 100x from <3s to 80-300s.

I'm hoping this fixes some things when fully released, because some tables where I've added joins cannot be stripped of them. The relationships involved kind of dictate the data scrubbing and so forth.

Wed, Sep 12, 11:55 AM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
Bstorm added a comment to T195515: GUC query performance regressed 100x from <3s to 80-300s.

You've got the right idea. I'm not likely to get this out to everything in one day, but the nice thing is that the serious breaking stuff that needs the compat isn't coming for a while to my knowledge. That will be announced with more detail soon. I've been trying to get ahead of production schema changes with the views by adding all those joins and things before. Since that caused problems, we're changing direction so that there will be some _compat tables available, but things will be fast by default. If you follow changes in production, you should be fine in general anyway.

Wed, Sep 12, 11:53 AM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions

Tue, Sep 11

Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Applied settings here https://docs.linbit.com/docs/users-guide-8.4/#s-prepare-storage
regarding io queue scheduler. Just checking to see what impact, if any, that has here.

Tue, Sep 11, 5:44 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T204071: tools-webgrid-generic-1402 filling up from pacct on man tasks.

This appears resolved as long as this relates ultimately to the failover for NFS server reboots, not to T203254

Tue, Sep 11, 5:35 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T204071: tools-webgrid-generic-1402 filling up from pacct on man tasks.

tools-exec-1418 was in the same condition. This may have been around for a while due to NFS changes. I'm not sure if it wasn't triggered very recently, though.

Tue, Sep 11, 5:20 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T204071: tools-webgrid-generic-1402 filling up from pacct on man tasks.

Seems to have done it after truncating the offending files.

Tue, Sep 11, 5:07 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T204071: tools-webgrid-generic-1402 filling up from pacct on man tasks.

Per the historical task, remounted NFS.

Tue, Sep 11, 4:58 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a project to T204071: tools-webgrid-generic-1402 filling up from pacct on man tasks: Toolforge.
Tue, Sep 11, 4:49 PM · Toolforge, cloud-services-team (Kanban)
Bstorm created T204071: tools-webgrid-generic-1402 filling up from pacct on man tasks.
Tue, Sep 11, 4:49 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T195515: GUC query performance regressed 100x from <3s to 80-300s.

scwiki should now have more performant views. If all is well with it by tomorrow, I'll start deploying changes I mentioned here T174047 to the rest of the replicas. This should make compatibility joins in the views an "opt in" thing if you need them vs refactoring the tool.

Tue, Sep 11, 3:01 PM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
Bstorm added a comment to T174047: Hide deprecated/unused fields on toolforge replica [MCR].

Ok, scwiki now has the compatibility whatnot moved to separate tables. If that doesn't break a lot of tools for some reason by tomorrow, I probably should just go ahead and put up patches to depool servers to do the whole batch instead of waiting. Some tools are really suffering from the performance constraints.

Tue, Sep 11, 2:57 PM · Patch-For-Review, Cloud-Services, Multi-Content-Revisions (MCR Deployment), Wikidata
Bstorm added a comment to T174047: Hide deprecated/unused fields on toolforge replica [MCR].

@jcrespo, looks great! I think that, since the work is already done anyway, with the slow development possible on some tools, having some compatibility views available should be helpful. I'm eager to remove some of the compatibility joins from the replicas (which is part of what this patch does), but it can wait until after the datacenter switchover. I'll merge and deploy it to a smaller wiki to see how that goes. scwiki seems ok to me for now.

Tue, Sep 11, 2:20 PM · Patch-For-Review, Cloud-Services, Multi-Content-Revisions (MCR Deployment), Wikidata
Bstorm added a comment to T174047: Hide deprecated/unused fields on toolforge replica [MCR].

(we should test the deploy on a smaller db first).

Tue, Sep 11, 1:55 PM · Patch-For-Review, Cloud-Services, Multi-Content-Revisions (MCR Deployment), Wikidata
Bstorm added a comment to T174047: Hide deprecated/unused fields on toolforge replica [MCR].

That sounds like a reasonable edit for the docs.

Tue, Sep 11, 1:48 PM · Patch-For-Review, Cloud-Services, Multi-Content-Revisions (MCR Deployment), Wikidata
Bstorm added a comment to T200557: Create a stretch and Son of Grid Engine grid in toolsbeta.

Stood up a puppetdb server for toolsbeta and found a bunch of hiera values it needs. Figuring out where those *should* be. I believe wikitech is basically deprecated, so they should be in horizon or ops/puppet, I suppose.

Tue, Sep 11, 1:46 PM · Patch-For-Review, Toolforge, Epic, cloud-services-team (Kanban)
Bstorm added a comment to T174047: Hide deprecated/unused fields on toolforge replica [MCR].

For that matter, will anything break if I deploy that now? Like, are the compatibility joins in the current tables actually needed at this point?

Tue, Sep 11, 1:44 PM · Patch-For-Review, Cloud-Services, Multi-Content-Revisions (MCR Deployment), Wikidata
Bstorm added a comment to T174047: Hide deprecated/unused fields on toolforge replica [MCR].

@jcrespo I am prepared to push the patch I put up with the "_compat" tables to provide an easy fix for people with not-so-maintained but still important tools. I'm not sure I should NULL the fields on this ticket just yet?

Tue, Sep 11, 1:38 PM · Patch-For-Review, Cloud-Services, Multi-Content-Revisions (MCR Deployment), Wikidata

Fri, Sep 7

Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

I may still try to enable debugging during a particularly high load period this weekend to see if I can get something more useful.

Fri, Sep 7, 8:49 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

I do see that we have not tweaked the IO scheduler characteristics in line with the DRBD documentation. We are using the deadline scheduler for IO (which seems to be the default on server installs here), but the other recommendations are not in place. At this point, IO is the only place I can find where improvements are very likely (since this is under "high load" when it is sort-of-kind-of-quiet for this cluster). It is entirely possible that some characteristics of the scheduler changed after the upgrade to become more expensive than previously. IOwait is very low, but it isn't always low per processor, which could correspond to the uninterruptible sleep state of the nfsd threads. That does suggest it is worth it to tweak the scheduler a bit. Sadly, Friday evening is not the time to attempt that.
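For reference, the active scheduler of a block device can be read from `/sys/block/<device>/queue/scheduler`, where the kernel marks the one in use with square brackets. The following is a small sketch of parsing that line (the function name and sample strings are illustrative only):

```python
import re


def active_scheduler(scheduler_line):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", scheduler_line)
    return match.group(1) if match else None


# Example contents as they might appear in /sys/block/<device>/queue/scheduler.
print(active_scheduler("noop [deadline] cfq"))
print(active_scheduler("[noop] deadline cfq"))
```

This only confirms which scheduler is selected. The per-scheduler tunables recommended by the DRBD guide live in other files under the same `queue` directory and are not covered here.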

Fri, Sep 7, 8:48 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

How fun is that? I don't see anything wrong in the debug output.

Fri, Sep 7, 8:13 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Dug around very deeply. Overall, the problem is that the NFSD threads so commonly end up in uninterruptible sleep (which increases load and can't possibly help performance). Opening the firehose by turning on all NFS debugging flags.
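Threads in uninterruptible sleep show up with state `D` in `ps` and count toward the load average even when they use no CPU. As a rough illustration, the sketch below counts such threads for one command name from two-column output like that of `ps -eo state,comm` (the invocation, function name and sample are assumptions, not what was actually run here):

```python
def count_uninterruptible(ps_output, command="nfsd"):
    """Count processes named `command` whose state starts with D."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 1)  # state, then the rest as command name
        if len(fields) != 2:
            continue
        state, comm = fields
        if comm.strip() == command and state.startswith("D"):
            count += 1
    return count


# Hypothetical sample in `ps -eo state,comm` layout.
SAMPLE_PS = """\
S COMMAND
D nfsd
S nfsd
D nfsd
R bonnie++
"""

print(count_uninterruptible(SAMPLE_PS))
```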

Fri, Sep 7, 7:23 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T196137: toolforge: prometheus issue is filling up email queue.

Wait. I wonder if all of these relay email through the same host (I think they do).

Fri, Sep 7, 12:18 PM · cloud-services-team (Kanban), Patch-For-Review, Toolforge
Bstorm added a comment to T196137: toolforge: prometheus issue is filling up email queue.

The crontab has been pretty good for months now. The errors from exim continue to come in day after day.

Fri, Sep 7, 12:00 PM · cloud-services-team (Kanban), Patch-For-Review, Toolforge

Wed, Sep 5

Bstorm added a comment to T201544: My database tables on Postgres -OSM Server are away.

@akosiaris Do you have any thoughts on the above? I might just try downloading it again and trying again.

Wed, Sep 5, 9:52 PM · Maps, Cloud-Services
Bstorm closed T171394: Better monitoring for labstore backup crons as Resolved.

This should actually be working now.

Wed, Sep 5, 8:05 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
Bstorm closed T171394: Better monitoring for labstore backup crons, a subtask of T126083: overhaul labstore setup [tracking], as Resolved.
Wed, Sep 5, 8:05 PM · Data-Services, Tracking, Operations
Bstorm added a comment to T114117: Drop externallinks.el_from_namespace on wmf databases.

Done on my end.

Wed, Sep 5, 12:20 PM · DBA, Schema-change

Tue, Sep 4

Bstorm closed T202820: Prepare and check storage layer for fixcopyright.wikimedia.org as Resolved.

Updated docs to get around the above problem. All steps are now done for cloud services.
I am able to connect from tools.

Tue, Sep 4, 5:39 PM · fixcopyright.wikimedia.org, Data-Services, DBA
Bstorm closed T202820: Prepare and check storage layer for fixcopyright.wikimedia.org, a subtask of T202819: Create production wiki: fixcopyright.wikimedia.org, as Resolved.
Tue, Sep 4, 5:39 PM · fixcopyright.wikimedia.org, Security-Team, Patch-For-Review, User-Urbanecm, Wiki-Setup (Create)
Bstorm added a comment to T202820: Prepare and check storage layer for fixcopyright.wikimedia.org.

Noted for my reference:

Tue, Sep 4, 5:13 PM · fixcopyright.wikimedia.org, Data-Services, DBA
Bstorm added a comment to T202820: Prepare and check storage layer for fixcopyright.wikimedia.org.

Works! views and indexes are in place. Setting up DNS and so forth.

Tue, Sep 4, 4:33 PM · fixcopyright.wikimedia.org, Data-Services, DBA
Bstorm added a comment to T202820: Prepare and check storage layer for fixcopyright.wikimedia.org.

SHOW GRANTS doesn't show the SUPER option on that user, though. Huh.

Tue, Sep 4, 4:29 PM · fixcopyright.wikimedia.org, Data-Services, DBA
Bstorm added a comment to T202820: Prepare and check storage layer for fixcopyright.wikimedia.org.

Nope. It could have just been missed in the blur of GRANT changes? I thought we'd done one of these since, though.

Tue, Sep 4, 4:27 PM · fixcopyright.wikimedia.org, Data-Services, DBA
Bstorm added a comment to T202820: Prepare and check storage layer for fixcopyright.wikimedia.org.

@Marostegui, on all three replicas, I just got:

Tue, Sep 4, 4:25 PM · fixcopyright.wikimedia.org, Data-Services, DBA
Bstorm lowered the priority of T203469: Test different NFS mount, export options and methods for failing over and coping with loss of a server from High to Normal.
Tue, Sep 4, 3:14 PM · cloud-services-team (Kanban)
Bstorm added a comment to T203469: Test different NFS mount, export options and methods for failing over and coping with loss of a server.

I'm imagining this being done in a Cloud VPS, so it should touch a lot of things.

Tue, Sep 4, 3:12 PM · cloud-services-team (Kanban)
Bstorm triaged T203469: Test different NFS mount, export options and methods for failing over and coping with loss of a server as High priority.
Tue, Sep 4, 3:12 PM · cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

subtree_check was not the magic fix. There is still a share doing subtree checking on there, but the proportional fix I'm hoping for isn't quite what I'm seeing. However, dmesg is filled with

Tue, Sep 4, 2:52 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T114117: Drop externallinks.el_from_namespace on wmf databases.

Done!

Tue, Sep 4, 12:27 PM · DBA, Schema-change

Fri, Aug 31

Bstorm claimed T180513: Document wiki-replicas architecture for future automation.

It may be that this will be closed soon. The documentation we have is not bad; however, a bug in MariaDB prevents additional automation (and even easy use of the script for new wikis).

Fri, Aug 31, 4:59 PM · Documentation, Data-Services, cloud-services-team (Kanban), Cloud-VPS
Bstorm added a comment to T198479: labvirt1009 HP Raid alert.

Please @Cmjohnson! :)

Fri, Aug 31, 3:32 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team
Bstorm edited projects for T169570: nfs-manage failover script needs to be tested with real load and fixed, added: cloud-services-team (Kanban); removed Patch-For-Review, Operations, Cloud-Services.
Fri, Aug 31, 3:32 PM · cloud-services-team (Kanban)
Bstorm claimed T169570: nfs-manage failover script needs to be tested with real load and fixed.

Just to put an assignee on this one, since I'm thinking about it a lot.

Fri, Aug 31, 3:31 PM · cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

Also, it is worth noting that the load here is much higher than on labstore1006/7 with a similar kernel. From labstore1006 no_subtree_check ;-)

Fri, Aug 31, 2:40 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

I've been looking at lizardfs and similar notions, so I don't consider that at all too far outside the box, lol. I'll try small changes first, but things like that might be the way forward.

Fri, Aug 31, 2:36 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T171394: Better monitoring for labstore backup crons.

Re-opening this task because during kernel upgrades a backup failed (Wed is the misc backup). I can confirm that it did NOT set off an alarm because it didn't exit with anything that the systemd service would be bothered by. It threw some information into the log, though,

Fri, Aug 31, 2:28 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
Bstorm reopened T171394: Better monitoring for labstore backup crons as "Open".
Fri, Aug 31, 2:26 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
Bstorm reopened T171394: Better monitoring for labstore backup crons, a subtask of T126083: overhaul labstore setup [tracking], as Open.
Fri, Aug 31, 2:26 PM · Data-Services, Tracking, Operations
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

For reference:

Fri, Aug 31, 2:22 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T203254: labstore1004 and labstore1005 high load issues following upgrades.

One of the first things I'd like to try is removing subtree checking from the exports. I want to have test cases ready for when that happens, so I'm putting that off until Tuesday.

Fri, Aug 31, 2:18 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm triaged T203254: labstore1004 and labstore1005 high load issues following upgrades as Normal priority.
Fri, Aug 31, 2:17 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T196507: Degraded RAID on cloudvirt1019.
Fri, Aug 31, 1:00 PM · ops-eqiad, Operations

Thu, Aug 30

Bstorm added a comment to T201544: My database tables on Postgres -OSM Server are away.

@Kolossos I get

ERROR:  value too long for type character varying(40)
CONTEXT:  COPY wp_coords_red0, line 3225, column instance: "Coat of Arms of Tumanny (Murmansk oblast) proposal - 2.png"
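The error means a value in the dump is longer than the 40 characters the target column allows. A quick way to find every offending row before retrying the restore might look like the sketch below. It assumes tab-separated rows as in PostgreSQL's default COPY text format, and the column position, function name and sample rows are made up for illustration:

```python
def overlong_values(rows, column_index, max_length=40):
    """Return (line_number, value) pairs whose column exceeds max_length characters."""
    problems = []
    for line_number, row in enumerate(rows, start=1):
        fields = row.rstrip("\n").split("\t")
        if column_index < len(fields) and len(fields[column_index]) > max_length:
            problems.append((line_number, fields[column_index]))
    return problems


# Hypothetical two-row sample; the second value is the one from the error above.
SAMPLE_ROWS = [
    "1\tShort name.png",
    "2\tCoat of Arms of Tumanny (Murmansk oblast) proposal - 2.png",
]

for line_number, value in overlong_values(SAMPLE_ROWS, column_index=1):
    print(line_number, len(value), value)
```

Whether to widen the column or trim the data is a separate decision. The scan only shows how many rows are affected.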
Thu, Aug 30, 4:00 PM · Maps, Cloud-Services
Bstorm added a comment to T202820: Prepare and check storage layer for fixcopyright.wikimedia.org.

👍

Thu, Aug 30, 3:33 PM · fixcopyright.wikimedia.org, Data-Services, DBA