User Details
- User Since
- Nov 2 2014, 11:35 PM (448 w, 4 d)
- Availability
- Available
- IRC Nick
- andrewbogott
- LDAP User
- Unknown
- MediaWiki User
- Andrewbogott [ Global Accounts ]
Today
There are a ton of files in /srv/mediawiki/images/wikitech/archive, but deleteArchivedFiles.php --delete says there's nothing to delete. It's tempting to just rm that directory anyway, but it would be nice to know what's happening first...
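For anyone following along, a typical invocation looks something like this (run from the MediaWiki install root; exact paths are an assumption):
# the script works from the filearchive table, so files that sit on disk without a
# matching filearchive row wouldn't be touched -- which might explain the mismatch
php maintenance/deleteArchivedFiles.php --delete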
I will see if I can get an ssh connection. Worst case we can resize the instance and increase our monthly bill by a few bucks. Thanks for noticing!
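If it does come to a resize, it's only a couple of CLI calls; a sketch with placeholder names:
openstack server resize --flavor <bigger-flavor> <instance>
# once the instance is back up and looks healthy:
openstack server resize confirm <instance>   # older clients use 'openstack server resize --confirm' instead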
Yesterday
Thank you! I've reverted the quota change.
Ah, sorry, I should've read back further in the task! Yes, that host can+should be deleted.
FYI, the thing with docker not starting is upstream bug https://storyboard.openstack.org/#!/story/2010599, which could use a comment or two in support
The proposed fix works! I've submitted it upstream
Wed, Jun 7
I reduced the rbd chunk size in glance-api.conf but that didn't resolve the issue... now I see
While creating the snapshot, I see these errors:
I downloaded a variety of images ('openstack image save') and it's only the recent snapshot of a VM that seems broken:
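For reference, the chunk-size setting mentioned above is rbd_store_chunk_size in the [glance_store] section of glance-api.conf; a sketch (the value shown is only an example):
# /etc/glance/glance-api.conf
#   [glance_store]
#   rbd_store_chunk_size = 4
# or, if crudini is installed:
crudini --set /etc/glance/glance-api.conf glance_store rbd_store_chunk_size 4
# restart glance-api afterwards so the new value is picked up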
Tue, Jun 6
'failed to create' generally signifies a quota issue. Indeed, the 'trove' project is out of security groups. I've increased the quota from 40 to 100.
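For the record, the bump is a one-liner (project name as in the task):
openstack quota show trove                 # confirm the current security group quota
openstack quota set --secgroups 100 trove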
Getting things truly detached and ready for attachment required me to remove things from the database as well as detach in the CLI.
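For anyone hitting the same thing, a rough sketch of that kind of cleanup (placeholder names; exact steps will vary):
openstack server remove volume <server> <volume>
# if the volume stays stuck in 'detaching'/'in-use', reset its state:
openstack volume set --state available <volume>
# (the cinder CLI equivalent is 'cinder reset-state --attach-status detached')
# worst case, a stale row in cinder's volume_attachment table has to be removed by hand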
This is probably https://bugs.launchpad.net/charm-nova-compute/+bug/2019888. It doesn't just affect wikiwho volumes, I was able to reproduce in testlabs.
I think this is a missing dependency in the package.
I now have the proper version installing via cloud-init () but now when puppet is invoked it says:
Thanks @Dzahn ! The challenge is to encode that in cloud-init yaml (which may or may not be possible)
I think the above patch is adequate for this.
This is apparently known behavior:
Mon, Jun 5
I've tried to describe the best practices here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_cli
This is happening again on tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud
Fri, Jun 2
topranks> Cathal Mooney Pings are being blocked by 185.15.57.5 itself it seems:
1:39 PM https://www.irccloud.com/pastebin/TZm8TF4e/
1:40 PM i.e. they are getting there but it's sending unreachable messages back
1:40 PM traffic does seem to get beyond the cloudgw
1:41 PM https://www.irccloud.com/pastebin/Dki06Xhv/
1:46 PM They seem to be making it to cloudnet/neutron, which is generating the rejects:
1:46 PM https://www.irccloud.com/pastebin/PZTwUGW0/
1:47 PM Not sure if that helps. What I can say is that nothing here is using the 172.20.x addressing, nor is this being affected by the new cloud-private networking.
1:47 PM cloudweb, cloudgw and cloudnet are on the same addresses they were using prior to starting any of this
1:49 PM Seems there is a NAT rule to forward this traffic to/from VM IP 172.16.128.97
1:49 PM But that IP is unreachable from the cloudnet for some reason
1:50 PM root@cloudnet2005-dev:/home/cmooney# ip neigh show 172.16.128.97
1:50 PM 172.16.128.97 dev qr-21e10025-d4 FAILED
1:51 PM It can ping other VMs so I think the issue isn't with cloudnet2005's connection to the instance vlan
1:51 PM https://www.irccloud.com/pastebin/vSa9SoOM/
1:53 PM TL;DR - I don't think this is a physical network issue, and it's not using any of the new components
1:57 PM cloudnet2005-dev can't reach VM tools-codfw1dev-bastion-2 for some reason
@Ladsgroup can I assign this to you?
Thu, Jun 1
I haven't dug much, but designate is currently failing on cloudservices200[45]-dev because the services on that host are unable to contact mysql on cloudcontrols:
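A quick way to confirm that kind of failure is a raw TCP check from the affected host; a sketch (hostname and port are assumptions):
# from cloudservices2004-dev:
nc -vz cloudcontrol2001-dev.wikimedia.org 3306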
Wed, May 31
sgtm!
To expand on @Aklapper's comment -- the distinction (for me) is whether the application involves community collaboration, or if the project is essentially a 'laptop in the cloud'. If you're doing things that you could easily do on a local box then we're unlikely to approve. If, on the other hand, you need public-facing things or persistent services (a web service, a subscription to an external event stream, a project that has more than one person working on the same host) then we would be more likely to consider it as a cloud-vps candidate.
Option 2 seems like the right call to me. I'm curious about the non-free concern... would we be limiting install to particular repos, or would users also be able to inject non-free repos before installing packages?
Out of an abundance of caution, I fixed these by hand. Everything seems OK now but the issue will likely recur if a grub update is pushed out for Buster again.
Tue, May 30
This seems to only happen to hosts with /dev/vda rather than /dev/sda. But it doesn't happen on ALL hosts with /dev/vda.
The list of affected instances (via sudo cumin --force A:all 'dpkg --list | grep grub-pc | grep iF'):
Once the grub failure is dealt with, running 'apt install libsss-certmap0' fixes puppet and ssh
This is somehow related to grub. If I run 'apt install libnss-sss' it complains about a failure to install grub on /dev/vda3. Bypassing that failure by hand seems to get things unstuck but I'm not sure that unattended upgrades can do that.
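One way to bypass that prompt by hand is to preseed the grub-pc install device and re-run the configure step; a sketch (device path is an assumption -- check with lsblk first):
echo 'grub-pc grub-pc/install_devices multiselect /dev/vda' | sudo debconf-set-selections
sudo dpkg --configure -a
sudo apt install libnss-sss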
Fri, May 26
The openstack rbac work that I've been doing[0] has hit some serious roadbumps, but I'm swiftly approaching a stopping point. Y'all are long overdue for an update, so here's a summary of where I'm at.
Thu, May 25
The backup nodes now have postgres installed with a 'back2' user and a 'backy2' table and a backy-generated schema. To actually switch backy2 over from sqlite to postgres, merge the following patch:
Great, quota reverted.
Wed, May 24
I temporarily increased your storage quota by 100G. That should let you create a new 100G volume, get whatever data you want onto that, resize wikidb, and then delete the stray volume left over.
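In CLI terms the shuffle looks roughly like this (names and sizes are placeholders):
openstack volume create --size 100 scratch-copy
openstack server add volume <server> scratch-copy
# ...copy whatever you want to keep onto the new volume...
openstack volume set --size <new-size> wikidb   # may require the volume to be detached, depending on the cinder version
# (grow the filesystem inside the VM afterwards)
openstack server remove volume <server> scratch-copy
openstack volume delete scratch-copy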
@Chicocvenancio you should now have access to a new 'Lutz' project with 80GB of database storage quota. You're among the first people to follow this workflow so please follow up here if you find things that are broken.
@Chicocvenancio I'm wrong, you're right... there's a different workflow for toolforge+trove. I may need to do some coding but we'll get this together.
+1 approved. Please ping on this ticket when the old VM is removed so we can revert (some of) the quota increase.
+1 approved, even though we'd rather use you as a test case in toolforge :)
Then wikitech is wrong :( Where is the page that sent you here?
Hello again! I'm just checking in that someone is still tasked with resolving this after our recent team shuffles.
Tue, May 23
This is running on cloudbackup100[34].
I worked on this some today! There are still some blanks to fill in. I welcome feedback on the three-star support level column that I added.
I think we're now down to the minimum -- just dumps (which are huge) are on metal, and everything else is on VMs.
Now we have etcd running on cloudvirtlocal100[1-3] and things seem to be working fine.
root@cloudcontrol1005:~/foo# time ./bar.py

real 0m14.480s
user 0m1.966s
sys 0m0.176s
root@cloudcontrol1005:~/foo#
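As a sanity check on the new etcd cluster on cloudvirtlocal100[1-3], something like this can be run on one of the members (endpoint URL is an assumption; add --cacert/--cert/--key if client TLS is enforced):
ETCDCTL_API=3 etcdctl --endpoints=https://localhost:2379 endpoint health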
Mon, May 22
Two things:
https://governance.openstack.org/tc/goals/selected/consistent-and-secure-rbac.html#the-issues-we-are-facing-with-scope-concept <- implies that enabling scope now is premature and may never be necessary. I'm torn between mourning the lost effort and cheering the simpler model
Thu, May 18
Wed, May 17
I do not think this is a result of my auth refactors, since nothing on the backend has been updated in eqiad1 yet. I suspect instead that this is something that's changed in the standalone nova CLI.
cloudlb2001-dev seems unable to reach designate. "telnet 208.80.153.43 9001" and "telnet 208.80.153.44 9001" both fail from cloudlb2001-dev. That likely means that haproxy is not pooling designate on cloudlb2001.
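To confirm that from the haproxy side, the stats socket can be queried directly; a sketch (socket path is an assumption):
echo 'show stat' | socat unix-connect:/run/haproxy/haproxy.sock stdio | grep -i designate
# and on the designate hosts, confirm something is actually listening:
ss -tlnp | grep 9001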
Removing '185.15.57.24 openstack.codfw1dev.wikimediacloud.org' from /etc/hosts on cloudcontrol2001-dev fixed the 503 problem. The intermittent timeouts are still happening.
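To double-check that the name now resolves via DNS rather than the hosts file:
getent hosts openstack.codfw1dev.wikimediacloud.org   # honors /etc/hosts
dig +short openstack.codfw1dev.wikimediacloud.org     # queries DNS directly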
Tue, May 16
"wget https://openstack.codfw1dev.wikimediacloud.org:29001" returns 503 no matter whether haproxy is or isn't running on cloudcontrol2005-dev. This surprises me since openstack.codfw1dev.wikimediacloud.org is a CNAME for cloudcontrol2005-dev.wikimedia.org. Some routing thing is happening that I don't understand.
telnet cloudservices2005-dev.wikimedia.org 9001
Should the CIDR for these hosts get its own network::constants entry? Otherwise I can pass around the list of lb nodes as parameters, but I fear we'll wind up doing that a lot.
Mon, May 15
Confusing because the openstacksdk docs (https://docs.openstack.org/openstacksdk/latest/user/guides/connect_from_config.html) say: