Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (149 w, 4 d)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Yesterday

Bstorm added a comment to T267376: Set up IP addresses for the new wiki replicas setup.

It's not only about learning entirely. It is a lot of internal infrastructure about the Foundation to learn for a student programmer, and I do not understand for a moment how that would be controversial (after spending a couple years supporting them). The idea behind our frontend setup is that nobody has to look up anything other than the wiki database they are connecting to (which has already made our customers unhappy). Most of the people conversant in the sections work for WMDE or the Foundation, so we actually have kept it mostly out of the communication plan.

Fri, Dec 4, 5:25 PM · Data-Services, cloud-services-team (Kanban)

Thu, Dec 3

Bstorm created T269399: Set up a way to persist non-default number of wikireplicas connections across all instances.
Thu, Dec 3, 10:28 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

So all the existing replicas will also now have the Toolforge user accounts. When we set up clouddb1020, we just need to run the harvest-replica bit again.

Thu, Dec 3, 9:46 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T267376: Set up IP addresses for the new wiki replicas setup.

Removed the reference to clouddb1020. Testing with that server would make the networking much more problematic. We will just have to test with one of the proxies by taking it out of the pool once this is sorted out.

Thu, Dec 3, 4:16 PM · Data-Services, cloud-services-team (Kanban)
Bstorm triaged T267376: Set up IP addresses for the new wiki replicas setup as High priority.
Thu, Dec 3, 4:15 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

Yep, marking them done. Also I think I worked out the kinks in the maintain-dbusers process yesterday, so I should be able to get the user accounts syncing soon.

Thu, Dec 3, 3:15 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm updated the task description for T268312: Deploy labsdbuser and views to new clouddb hosts.
Thu, Dec 3, 3:13 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T211750: Introduce Python code formatters usage.

FWIW, WMCS uses black with line-length set to 80 for all python and has for a little while. In non-puppet repos, we have tox check for it (any line length other than 80 generates failures). https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Python_coding
Our team doesn't agree on text editors, so I can't say we have anything to offer on the setup part of the equation.

Thu, Dec 3, 1:53 AM · User-Kormat, tox-wikimedia, Patch-For-Review, Operations, SRE-tools

Wed, Dec 2

Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

So I'm glad I noticed that warning. Several things have come out of it:

  1. I've fixed the script to be much better around the multi-instance settings.
  2. It was creating _p databases for wikis that aren't on the replicas and should not be (so I am cleaning them up).
  3. It was only running half the loop in many cases (fixed).
  4. The script can do the CREATE DB and role stuff just fine on its own now (based on 2 above).
Wed, Dec 2, 10:39 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

Yeah, I think I'll update the task over there today to take clouddb1020 off it. It just makes it more confusing anyway.

Wed, Dec 2, 6:02 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T267374: Set up a Toolforge buildpack CD pipeline as a POC.

Really nice work!

I think argo is appealing because it's intended to be deployed via k8s, allowing us to reuse that infra. And we would control it, allowing us to manage upgrades, etc.

I'm curious, is there anything preventing us from running any of the other solutions on our k8s? (Jenkins/gitlab/...)

Wed, Dec 2, 3:52 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T267376: Set up IP addresses for the new wiki replicas setup.

I would assume though, that the logic behind all this allows us to simply pool/depool hosts directly on haproxy config (like we do now) and we don't have to touch anything DNS based?

Yup! That's the idea.

Wed, Dec 2, 3:48 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

I'll get more info today by trying again on servers that have possible issues with debug logging (s7, s6, s5...I think all the others ran fine). Issues could be in the script, the dblists, the dbs. I'll also do s1 again now that they are back.

Wed, Dec 2, 3:29 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

I am not fully sure how the script works, but given that each instance has its own wikis, I would expect that is normal?
For instance if you connect to the 3311 port, then only enwiki would be there. So it is expected that, for instance jawiki isn't there, as jawiki is part of s6 (3316).
So I would expect that if we read ./usr/local/lib/mediawiki-config/dblists/s1.dblist to create the views on hosts with 3311, only the wikis showing up on those will succeed.
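The section-to-port relationship described above (s1 on 3311, s6 on 3316) can be sketched roughly as follows. This is only an illustration: the rule "3310 plus the section number" is an assumption extrapolated from those two examples, and `port_for_section` and `wikis_present` are hypothetical names, not functions from the actual view-creation script.

```python
import re

# Assumption: s1 -> 3311 and s6 -> 3316 suggest port = 3310 + section number.
BASE_PORT = 3310


def port_for_section(section):
    """Return the assumed MariaDB port for a section name such as 's1'."""
    match = re.fullmatch(r"s(\d+)", section)
    if match is None:
        raise ValueError(f"not a numbered section: {section!r}")
    return BASE_PORT + int(match.group(1))


def wikis_present(dblist, databases_on_instance):
    """Keep only the wikis from a dblist that exist on the instance.

    Creating views would only be expected to succeed for these wikis.
    """
    return [wiki for wiki in dblist if wiki in databases_on_instance]
```

With this sketch, reading a list that contains both enwiki and jawiki against an instance that only holds enwiki would leave just enwiki as a candidate for view creation.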

Wed, Dec 2, 3:19 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services

Tue, Dec 1

Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

Running create views across all the hosts except clouddb1013 and clouddb1017, I got this anomaly:

Tue, Dec 1, 9:38 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

@Bstorm you can go ahead from your side, and let's mark the hosts as done once everything is done from your end

Tue, Dec 1, 8:00 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T233335: Change maintain-meta_p script to use the sitematrix API.

Ah no, never mind, it uses general site info queries.

Tue, Dec 1, 3:36 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T233335: Change maintain-meta_p script to use the sitematrix API.

Interestingly, it currently does use that API...
It just doesn't use it for everything.

Tue, Dec 1, 3:23 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

PS - I am aware of the wmf-pt-killer script setup causing puppet to fail. I'll get that tomorrow.

Tue, Dec 1, 12:31 AM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

I haven't created all the users yet either. I'm going to need to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/642570, and see how that goes before it will even work. That said, the indexes are done as well. So we've got views and indexes on that instance :) The settings there are sufficient for that.

Tue, Dec 1, 12:30 AM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services

Mon, Nov 30

Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

@Marostegui The views are created on s1@clouddb1013. That was nice and smooth.
The indexes are in process. It's taking a little time for that part...and I already mixed up one because I made the mistake of not starting a screen session and will have to redo that index, but that's ok.

Mon, Nov 30, 10:50 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm closed T268669: Upgrade PAWS k8s to 1.17, a subtask of T263284: Upgrade Toolforge K8s to 1.17, as Resolved.
Mon, Nov 30, 6:42 PM · Patch-For-Review, Kubernetes, Toolforge, cloud-services-team (Kanban)
Bstorm closed T268669: Upgrade PAWS k8s to 1.17 as Resolved.
Mon, Nov 30, 6:42 PM · PAWS, Kubernetes, cloud-services-team (Kanban)
Bstorm added a comment to T268893: [tools-sgecron-01] The server is getting out of space, daemon.log is growing a lot.

We had a broken LDAP issue last week. LDAP was hard down. I can get the datetime later. It may have been broken since then since I did not check it.

Mon, Nov 30, 3:34 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

Sorry, I haven't had a chance to test. I plan to today.

Mon, Nov 30, 3:04 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services

Wed, Nov 25

Bstorm closed T268786: ceph pg 6.91 inconsistent as Resolved.

After a quick review, since I don't see anything, I'm closing this. Please reopen and update as needed.

Wed, Nov 25, 8:07 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T268786: ceph pg 6.91 inconsistent.

I don't see anything on cloudcephosd1015 to indicate a hardware error just now.

Wed, Nov 25, 7:50 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T268786: ceph pg 6.91 inconsistent.
[bstorm@cloudcephmon1002]:~ $ sudo ceph pg repair 6.91
instructing pg 6.91 on osd.117 to repair
Wed, Nov 25, 7:48 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm triaged T268786: ceph pg 6.91 inconsistent as High priority.
Wed, Nov 25, 7:47 PM · Cloud-VPS, cloud-services-team (Kanban)

Tue, Nov 24

Bstorm added a comment to T266300: Establish a systemd timer to remove long-running processes on the bastion in a random and somewhat friendly way.

Coming back around to this, mysql is one that could definitely be used inappropriately because you could effectively run a bot as mysql. However, it wouldn't daemonize at least. We could maybe add it since we already monitor for crons and there is a query/session killer.

Tue, Nov 24, 7:30 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
Bstorm updated the task description for T268669: Upgrade PAWS k8s to 1.17.
Tue, Nov 24, 7:11 PM · PAWS, Kubernetes, cloud-services-team (Kanban)
Chicocvenancio awarded T268669: Upgrade PAWS k8s to 1.17 a Like token.
Tue, Nov 24, 7:09 PM · PAWS, Kubernetes, cloud-services-team (Kanban)
Bstorm created T268669: Upgrade PAWS k8s to 1.17.
Tue, Nov 24, 7:08 PM · PAWS, Kubernetes, cloud-services-team (Kanban)
Bstorm added a comment to T267374: Set up a Toolforge buildpack CD pipeline as a POC.

Also some of the argocd functionality like rollbacks looks like it requires proper image tagging, which would force us to stop using :latest.

Tue, Nov 24, 7:02 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

I should be able to make the indexes without either of those when the code is merged, I think.

Tue, Nov 24, 12:01 AM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services

Mon, Nov 23

Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

I'm aiming to set up https://gerrit.wikimedia.org/r/c/operations/puppet/+/642503 to be a sorta noop on the existing labsdb* things and to set up stuff correctly on the new ones. Since it runs things in order, I can manually edit the config when it's deployed to only run against s1 and just try it there.

Mon, Nov 23, 11:42 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T268438: Toolforge bastion to support python 3.6/3.7/3.8.

People are encouraged to use the kubernetes system and containers, but not for jobs yet. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Python#Virtual_Environments_and_Packages.

Mon, Nov 23, 7:57 PM · Toolforge (Software install/update)
Bstorm added a comment to T267989: Do some checks of how many Quarry queries will break in a multiinstance environment.

As an aside, if you use the webservice shell command, you can get it working (https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Python#Virtual_Environments_and_Packages). That'll launch you a container where you can generate the venv for 3.7. The 3.5 thing is only a concern if you try to launch on the gridengine (which I don't recommend).

Mon, Nov 23, 7:31 PM · Quarry, cloud-services-team (Kanban)
Bstorm added a comment to T266587: ToolsDB replication is broken.

Thank you again! I'll poke around and see.

Mon, Nov 23, 6:35 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T266587: ToolsDB replication is broken.

Sounds good. I'll set another time for it. Based on the rsync speeds I saw last time with the dump, I am hopeful that this will be a comparatively short read-only period.

Mon, Nov 23, 2:51 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)

Sat, Nov 21

Bstorm added a comment to T266587: ToolsDB replication is broken.

The database load died again. This time it was:
ERROR 1030 (HY000) at line 663214: Got error 175 "File too short; Expected more data in file" from storage engine Aria
@Marostegui any ideas? Is my dump bad somehow? I'll google around a bit in case there's something I can do. I'd rather not spend another 14 hours in read-only mode. I also wonder if I can get away with throwing this in read-only and copying the data directory. I have a solid rsync setup.

Sat, Nov 21, 1:11 AM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)

Fri, Nov 20

Bstorm added a comment to T268355: cronspam from prometheus-directory-size (on labstore1004).

That actually makes sense since that path was removed, and I bet it was coded in somewhere.

Fri, Nov 20, 10:04 PM · cloud-services-team (Kanban), observability, Operations
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

@Marostegui Random question: where does centralauth live in this setup? We are so far planning on keeping meta_p on s7 for historical reasons (or possibly on all sections if meta_p becomes a much better thing with tooling assuming it ends up on s7).

Fri, Nov 20, 5:58 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a subtask for T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture: T268312: Deploy labsdbuser and views to new clouddb hosts.
Fri, Nov 20, 5:57 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a parent task for T268312: Deploy labsdbuser and views to new clouddb hosts: T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture.
Fri, Nov 20, 5:57 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T268312: Deploy labsdbuser and views to new clouddb hosts.

@Bstorm we should start with this "early" (meaning: before all the hosts are ready, in case we find issues).
Next week I will deploy the user, role and _p databases to clouddb1015:3316 and clouddb1019:3316. I will ping you once that is done so you can try to create the views and do all the magic behind them.

Fri, Nov 20, 2:45 PM · Patch-For-Review, DBA, cloud-services-team (Kanban), Data-Services

Thu, Nov 19

Bstorm added a parent task for T217473: labstore1006 spontaneous reboot: Unknown Object (Task).
Thu, Nov 19, 9:54 PM · Operations, Data-Services, cloud-services-team (Kanban)
Bstorm added a parent task for T268280: labstore1006 spontaneous reboot: T217473: labstore1006 spontaneous reboot.
Thu, Nov 19, 9:51 PM · cloud-services-team (Hardware)
Bstorm added a subtask for T217473: labstore1006 spontaneous reboot: T268280: labstore1006 spontaneous reboot.
Thu, Nov 19, 9:51 PM · Operations, Data-Services, cloud-services-team (Kanban)
Bstorm added a project to T268281: Degraded RAID on labstore1006: cloud-services-team (Hardware).
Thu, Nov 19, 9:49 PM · cloud-services-team (Hardware), ops-eqiad, Operations
Bstorm added a subtask for T268280: labstore1006 spontaneous reboot: T268281: Degraded RAID on labstore1006.
Thu, Nov 19, 9:48 PM · cloud-services-team (Hardware)
Bstorm added a parent task for T268281: Degraded RAID on labstore1006: T268280: labstore1006 spontaneous reboot.
Thu, Nov 19, 9:48 PM · cloud-services-team (Hardware), ops-eqiad, Operations
Bstorm added a comment to T268280: labstore1006 spontaneous reboot.

Thu, Nov 19, 9:43 PM · cloud-services-team (Hardware)
Bstorm added a comment to T268280: labstore1006 spontaneous reboot.

In case the ticket wasn't auto-created for it: the failed drive is Port: 1E, box:2, bay: 10 (SAS) according to ILO.

Thu, Nov 19, 9:42 PM · cloud-services-team (Hardware)
Bstorm created T268280: labstore1006 spontaneous reboot.
Thu, Nov 19, 9:28 PM · cloud-services-team (Hardware)
Bstorm added a comment to T267376: Set up IP addresses for the new wiki replicas setup.

Another thing to consider is whether we need CloudVPS VM private addresses leaking into prod at all (be it LVS or the dbproxy layer). So here is my question for @Bstorm and @Marostegui: Could the DBproxy handle seeing all incoming mysql connections as coming from a single public IPv4 address (nat.openstack.eqiad1.wikimediacloud.org, 185.15.56.1), and is that acceptable or desirable? Do we need to know which client is using the proxy (for ratelimiting, access control, etc.)?

Thu, Nov 19, 7:32 PM · Data-Services, cloud-services-team (Kanban)
Bstorm updated subscribers of T267989: Do some checks of how many Quarry queries will break in a multiinstance environment.

Adding @dcaro in case he has time or interest to help dig in that database. It's in the quarry Cloud VPS project. Local root can access the database (and do anything).

Thu, Nov 19, 5:36 PM · Quarry, cloud-services-team (Kanban)

Wed, Nov 18

Bstorm added a comment to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

So to reduce the dataset for time, it would have to go either or both ways. Two queries, one on changes to commons and one on changes to enwiki that both go back and check for existing images on the other. That could end up more efficient, but it would also introduce complications etc. Worth experimenting with maybe.

Wed, Nov 18, 7:55 PM · cloud-services-team (Kanban), Tools
Bstorm added a comment to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

@Green_Cardamom Now I get what you mean. Thank you for explaining it.

Wed, Nov 18, 7:51 PM · cloud-services-team (Kanban), Tools
Bstorm added a comment to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

@Green_Cardamom I was also thinking that the enwiki query is "dominant" here because we are tagging things there and don't need to act unless there is a file there. That would suggest you could search enwiki's recent changes to such files and then only even check commons for that particular file (which is slow when searching the entire set, but it wouldn't be for a recent subset). I may be misunderstanding the ultimate goal, though.

Wed, Nov 18, 7:47 PM · cloud-services-team (Kanban), Tools
Bstorm added a comment to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

@Green_Cardamom I was thinking of tracking recent_changes on the enwiki side and searching for specific items on the commons side as they are found from recent_changes in enwiki. That doesn't require loading the entire corpus of either into memory at any point and it would reduce the number of queries to commons (because going through the entire list of enwiki candidates and checking commons for all of them is slow). The inner joins are why I think that might be valid, and we are only tagging things in enwiki, right? If we also are looking for things to tag in commons, then the approach would not be valid.

Wed, Nov 18, 7:42 PM · cloud-services-team (Kanban), Tools
Bstorm added a comment to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

@Green_Cardamom Is it required to examine the entire set of records or just recent ones? That's why I was asking if there was some way to leverage the recentchanges views. I don't know if the api can do the same things. It seems reasonable that only things changed in the past month or so need to be checked, right? I don't know how to filter for that yet, but if I can find time, I'll try it, if that isn't a bad idea.

Wed, Nov 18, 5:55 PM · cloud-services-team (Kanban), Tools
Bstorm added a comment to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

I wonder if we could leverage the recent_changes views to limit the initial dataset. The query would not need to review all history and records that haven't been touched, right?

Wed, Nov 18, 12:03 AM · cloud-services-team (Kanban), Tools

Tue, Nov 17

Bstorm updated the task description for T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.
Tue, Nov 17, 11:25 PM · cloud-services-team (Kanban), Tools
Bstorm updated the task description for T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.
Tue, Nov 17, 11:25 PM · cloud-services-team (Kanban), Tools
Bstorm added a comment to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

That's basically what that notebook is doing in the paged subquery version. I'm just trying to reason through if there's anything I could add to that.

Tue, Nov 17, 11:01 PM · cloud-services-team (Kanban), Tools
Bstorm added a comment to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

From that notebook, it seems the way to do this is not to query both tables and combine but to make a query against one and then nest short queries in a loop from the results or something like that, maybe? This won't work well if we try to use python as a join directly since we aren't going to have the RAM for the combined datasets most likely even if we get it working. I'll see if I can find a way to demonstrate something like that to see how feasible it is.
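A rough sketch of that "query one side, then nest short queries in a loop" idea is below. It is not the notebook's code: it uses two in-memory SQLite connections as stand-ins for two separate database instances, and the function names, the single-column `image` table, and the batch size are all assumptions chosen for illustration. Only one page of names is held in memory at a time.

```python
import sqlite3


def make_demo_db(names):
    """Hypothetical helper: build a stand-in database with an image table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (img_name TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO image VALUES (?)", [(n,) for n in names])
    return conn


def shadowed_names(local, commons, batch_size=500):
    """Find file names present on both connections without a SQL join.

    Pages through the local table, then runs one short IN (...) query
    against the other connection for each page.
    """
    clashes = []
    cursor = local.execute("SELECT img_name FROM image ORDER BY img_name")
    while True:
        batch = [row[0] for row in cursor.fetchmany(batch_size)]
        if not batch:
            break
        placeholders = ",".join("?" for _ in batch)
        query = f"SELECT img_name FROM image WHERE img_name IN ({placeholders})"
        clashes.extend(row[0] for row in commons.execute(query, batch))
    return sorted(clashes)
```

The trade-off is more round trips in exchange for bounded memory; the batch size would need tuning against real query latency.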

Tue, Nov 17, 10:57 PM · cloud-services-team (Kanban), Tools
Bstorm added a comment to T267374: Set up a Toolforge buildpack CD pipeline as a POC.

Sounds good for now. On the other hand, it isn't crazy to try things like https://github.com/rootless-containers/rootlesskit to see if we can make a docker socket to point at. We only need to be able to build and the equivalent of push, right (I say naively)? I wonder how hard it would be to test using something like that after you get it working with a "real" socket?

Tue, Nov 17, 9:14 PM · cloud-services-team (Kanban), Toolforge
Bstorm updated subscribers of T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

@AntiCompositeNumber started work on this a bit using PAWS (I believe this is the same as this ticket): https://public.paws.wmcloud.org/User:AntiCompositeBot/ShadowsCommonsQuery.ipynb

Tue, Nov 17, 7:17 PM · cloud-services-team (Kanban), Tools
Bstorm updated the task description for T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity).
Tue, Nov 17, 4:14 PM · cloud-services-team (Kanban)
Jhernandez awarded T267989: Do some checks of how many Quarry queries will break in a multiinstance environment a Doubloon token.
Tue, Nov 17, 12:02 PM · Quarry, cloud-services-team (Kanban)
Bstorm removed a subtask for T211096: PAWS: Rebuild and upgrade Kubernetes: T253134: Find an alternative solution for the mysql-proxy in PAWS.
Tue, Nov 17, 12:23 AM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
Bstorm removed a subtask for T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture: T253134: Find an alternative solution for the mysql-proxy in PAWS.
Tue, Nov 17, 12:23 AM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm removed parent tasks for T253134: Find an alternative solution for the mysql-proxy in PAWS: T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture, T211096: PAWS: Rebuild and upgrade Kubernetes.
Tue, Nov 17, 12:23 AM · PAWS
Bstorm removed a parent task for T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture: T180513: Document wiki-replicas architecture for future automation.
Tue, Nov 17, 12:22 AM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm removed a subtask for T180513: Document wiki-replicas architecture for future automation: T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture.
Tue, Nov 17, 12:22 AM · Documentation, Data-Services, cloud-services-team (Kanban), Cloud-VPS
Bstorm removed a parent task for T180513: Document wiki-replicas architecture for future automation: T101659: Run a documentation sprint for Cloud VPS and Toolforge.
Tue, Nov 17, 12:20 AM · Documentation, Data-Services, cloud-services-team (Kanban), Cloud-VPS
Bstorm removed a subtask for T101659: Run a documentation sprint for Cloud VPS and Toolforge: T180513: Document wiki-replicas architecture for future automation.
Tue, Nov 17, 12:20 AM · Developer-Wishlist (2017), Developer-Advocacy, Documentation, Toolforge
Bstorm added a parent task for T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture: Unknown Object (Task).
Tue, Nov 17, 12:19 AM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a subtask for T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture: T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.
Tue, Nov 17, 12:17 AM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a parent task for T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's: T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture.
Tue, Nov 17, 12:17 AM · cloud-services-team (Kanban), Tools
Bstorm triaged T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's as Medium priority.
Tue, Nov 17, 12:16 AM · cloud-services-team (Kanban), Tools
Bstorm added a comment to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

Adding this to the WMCS workboard because this may be a good opportunity to generate a clear example of how to do this for a common use case of cross-database joins.

Tue, Nov 17, 12:15 AM · cloud-services-team (Kanban), Tools
Bstorm added a project to T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's: cloud-services-team (Kanban).
Tue, Nov 17, 12:14 AM · cloud-services-team (Kanban), Tools
Bstorm updated subscribers of T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.
Tue, Nov 17, 12:13 AM · cloud-services-team (Kanban), Tools
Bstorm renamed T267989: Do some checks of how many Quarry queries will break in a multiinstance environment from Do some checks of how many queries will break in a multiinstance environment to Do some checks of how many Quarry queries will break in a multiinstance environment.
Tue, Nov 17, 12:12 AM · Quarry, cloud-services-team (Kanban)
Bstorm triaged T267989: Do some checks of how many Quarry queries will break in a multiinstance environment as Medium priority.
Tue, Nov 17, 12:11 AM · Quarry, cloud-services-team (Kanban)

Mon, Nov 16

Bstorm created T267989: Do some checks of how many Quarry queries will break in a multiinstance environment.
Mon, Nov 16, 11:48 PM · Quarry, cloud-services-team (Kanban)
Bstorm added a comment to T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture.

Ok, apparently mysqlproxy is smarter than I thought, and it can tell that I'm not pointing it at different IP addresses when I give it a list of different names that point at the same IP. As of now, the otherwise working code to connect in the new way is commented out. I may need to add another switch that generates the set of proxies when there are multiple proxy addresses to connect to. At the very least, the code is all in place to do this now.

Mon, Nov 16, 10:49 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm committed rPAWS22e54afa5db1: emergency change: temporarily rolling back multiinstance proxy (authored by Bstorm).
Mon, Nov 16, 10:43 PM
Bstorm added a comment to T266506: Getting "502 Bad Gateway" on Toolforge tools in clusters, including tools ordia and scholia.

Looks like that might be related to some issues at least, if not this one.

Mon, Nov 16, 8:16 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T266506: Getting "502 Bad Gateway" on Toolforge tools in clusters, including tools ordia and scholia.

I might have found something. Just before the issue happened, prometheus reported the number of namespaces with >=1 pod dropped from 919 to 5 briefly. That suggests that prometheus suddenly didn't see any pods in *any* of the tool namespaces for a moment. That could be a metrics hiccup, but it might be significant because the timing is not long before the bunch of 502s (that don't seem to appear in the proxy logs):

Mon, Nov 16, 7:53 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T266506: Getting "502 Bad Gateway" on Toolforge tools in clusters, including tools ordia and scholia.

I also have now managed to be online when this would have happened and could find no actual record of it happening except in the dashboard. That makes me really wonder what the dashboard was recording. The front proxy had not logged a bunch of 502s.

Mon, Nov 16, 7:39 PM · cloud-services-team (Kanban), Toolforge
Bstorm placed T267966: Add more k8s-etcd nodes to the cluster on tools project up for grabs.

Adding @dcaro as subscriber because this could affect T267082

Mon, Nov 16, 7:23 PM · cloud-services-team (Kanban), Toolforge
Bstorm created T267966: Add more k8s-etcd nodes to the cluster on tools project.
Mon, Nov 16, 7:22 PM · cloud-services-team (Kanban), Toolforge

Sun, Nov 15

Bstorm added a comment to T266587: ToolsDB replication is broken.

Since it seems very likely that is why mariadb hung up on me, setting the session variable to the maximum to try to prevent this from happening again.

Sun, Nov 15, 7:47 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T266587: ToolsDB replication is broken.

That increases the chances that we will fall too far behind to start replication. Really hoping not :(

Sun, Nov 15, 7:39 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T266587: ToolsDB replication is broken.

Removed comment on replication since that was actually just old logs. Nothing current about that. It is entirely possible that there are larger-than-default inserts in there that used a larger max_allowed_packet (not package...sorry for the typo in the log) in the session variables. By doubling what they were (32MB), it might succeed.

Sun, Nov 15, 7:38 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T266587: ToolsDB replication is broken.

I've restarted the import while I look for some reason that might have happened.

Sun, Nov 15, 7:27 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T266587: ToolsDB replication is broken.
Sun, Nov 15, 7:23 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)