I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
It's not only about learning, either. There is a lot of the Foundation's internal infrastructure for a student programmer to learn, and I do not understand for a moment how that would be controversial (after spending a couple of years supporting them). The idea behind our frontend setup is that nobody has to look up anything other than the wiki database they are connecting to (which has already made our customers unhappy). Most of the people conversant in the sections work for WMDE or the Foundation, so we have actually kept it mostly out of the communication plan.
Thu, Dec 3
So all the existing replicas will also now have the Toolforge user accounts. When we set up clouddb1020, we just need to run the harvest-replica bit again.
Removed the reference to clouddb1020. Testing with that server would make the networking much more problematic. We will just have to test with one of the proxies by taking it out of the pool once this is sorted out.
Yep, marking them done. Also I think I worked out the kinks in the maintain-dbusers process yesterday, so I should be able to get the user accounts syncing soon.
FWIW, WMCS uses black with line-length set to 80 for all Python and has for a while now. In non-puppet repos, we have tox check for it (lines longer than 80 characters generate failures). https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Python_coding
Our team doesn't agree on text editors, so I can't say we have anything to offer on the setup part of the equation.
Wed, Dec 2
So I'm glad I noticed that warning. A few things have come out of it:
1. I've fixed the script to be much better around the multi-instance settings.
2. It was creating _p databases for wikis that aren't on the replicas and should not be, so I am cleaning them up (a rough sketch of that check is below).
3. It was only running half the loop in many cases (fixed).
4. The script can do the CREATE DB and role stuff just fine on its own now (based on 2 above).
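For the cleanup in item 2, roughly this kind of check is what I mean, just as an untested sketch (the dblist path, the credentials file, and the "every legitimate wiki_p schema matches a dbname in the dblist" rule are assumptions here, not how the maintain-* scripts actually do it):

```
#!/usr/bin/env python3
"""Sketch: list _p databases on a replica host with no matching dblist entry."""
import pymysql

DBLIST = "/srv/mediawiki-config/dblists/all.dblist"  # assumed path, one dbname per line

with open(DBLIST) as f:
    expected = {line.strip() for line in f if line.strip() and not line.startswith("#")}

# Assumed local credentials file; any account that can read information_schema works.
conn = pymysql.connect(host="localhost", read_default_file="/root/.my.cnf")
with conn.cursor() as cur:
    # The underscore is escaped so LIKE matches a literal "_p" suffix.
    cur.execute(
        "SELECT schema_name FROM information_schema.schemata WHERE schema_name LIKE '%\\_p'"
    )
    for (schema,) in cur.fetchall():
        if schema[:-2] not in expected:  # strip the trailing "_p"
            print("stray database:", schema)
```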
Yeah, I think I'll update the task over there today to take clouddb1020 off it. It just makes it more confusing anyway.
Yup! That's the idea.
I'll get more info today by trying again, with debug logging, on the servers that have possible issues (s7, s6, s5... I think all the others ran fine). Issues could be in the script, the dblists, or the dbs. I'll also do s1 again now that they are back.
Tue, Dec 1
Running create views across all the hosts except clouddb1013 and clouddb1017, I got this anomaly:
Ah no, never mind, it uses general site info queries.
Interestingly, it currently does use that API...
It just doesn't use it for everything.
PS - I am aware of the wmf-pt-killer script setup causing puppet to fail. I'll get that tomorrow.
I haven't created all the users yet either. I'm going to need to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/642570, and see how that goes before it will even work. That said, the indexes are done as well. So we've got views and indexes on that instance :) The settings there are sufficient for that.
Mon, Nov 30
@Marostegui The views are created on s1@clouddb1013. That was nice and smooth.
The indexes are in progress. It's taking a little time for that part... and I already mixed up one index because I made the mistake of not starting a screen session, so I will have to redo it, but that's ok.
We had a broken-LDAP issue last week; LDAP was hard down. I can get the exact datetime later. It may have been broken since then, as I did not check it.
Sorry, I haven't had a chance to test. I plan to today.
Wed, Nov 25
After a quick review, since I don't see anything, I'm closing this. Please reopen and update as needed.
I don't see anything on cloudcephosd1015 to indicate a hardware error just now.
[bstorm@cloudcephmon1002]:~ $ sudo ceph pg repair 6.91
instructing pg 6.91 on osd.117 to repair
Tue, Nov 24
Coming back around to this, mysql is one that could definitely be used inappropriately because you could effectively run a bot as mysql. However, it wouldn't daemonize at least. We could maybe add it since we already monitor for crons and there is a query/session killer.
Mon, Nov 23
I'm aiming to set up https://gerrit.wikimedia.org/r/c/operations/puppet/+/642503 to be a sorta noop on the existing labsdb* things and to set up stuff correctly on the new ones. Since it runs things in order, I can manually edit the config when it's deployed to only run against s1 and just try it there.
People are encouraged to use the kubernetes system and containers, but not for jobs yet. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Python#Virtual_Environments_and_Packages.
As an aside, if you use the webservice shell command, you can get it working (https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Python#Virtual_Environments_and_Packages). That'll launch a container for you where you can generate the venv for Python 3.7. The 3.5 thing is only a concern if you try to launch on the grid engine (which I don't recommend).
Thank you again! I'll poke around and see.
Sounds good. I'll set another time for it. Based on the rsync speeds I saw last time with the dump, I am hopeful that this will be a comparatively short read-only period.
Sat, Nov 21
The database load died again. This time it was:
ERROR 1030 (HY000) at line 663214: Got error 175 "File too short; Expected more data in file" from storage engine Aria
@Marostegui any ideas? Is my dump bad somehow? I'll google around a bit in case there's something I can do. I'd rather not spend another 14 hours in read-only mode. I also wonder if I can get away with throwing this in read-only and copying the data directory. I have a solid rsync setup.
Fri, Nov 20
That actually makes sense since that path was removed, and I bet it was coded in somewhere.
@Marostegui Random question: where does centralauth live in this setup? We are so far planning on keeping meta_p on s7 for historical reasons (or possibly on all sections if meta_p becomes a much better thing with tooling, assuming it ends up on s7).
Thu, Nov 19
In case the ticket wasn't auto-created for it: the failed drive is Port: 1E, Box: 2, Bay: 10 (SAS) according to iLO.
Adding @dcaro in case he has time or interest to help dig into that database. It's in the quarry Cloud VPS project. Local root can access the database (and do anything).
Wed, Nov 18
So to reduce the dataset by time, it would have to go either or both ways: two queries, one on changes to commons and one on changes to enwiki, each of which goes back and checks for existing images on the other. That could end up more efficient, but it would also introduce complications, etc. Worth experimenting with, maybe.
@Green_Cardamom Now I get what you mean. Thank you for explaining it.
@Green_Cardamom I was also thinking that the enwiki query is "dominant" here because we are tagging things there and don't need to act unless there is a file there. That would suggest you could search enwiki's recent changes for such files and then only check commons for each particular file (which is slow when searching the entire set, but it wouldn't be for a recent subset). I may be misunderstanding the ultimate goal, though.
@Green_Cardamom I was thinking of tracking recent_changes on the enwiki side and searching for specific items on the commons side as they are found from recent_changes in enwiki. That doesn't require loading the entire corpus of either into memory at any point and it would reduce the number of queries to commons (because going through the entire list of enwiki candidates and checking commons for all of them is slow). The inner joins are why I think that might be valid, and we are only tagging things in enwiki, right? If we also are looking for things to tag in commons, then the approach would not be valid.
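Roughly what I have in mind, as an untested sketch (the replica hostnames, credentials path, and the namespace/time filters are just illustrative, not the bot's actual code):

```
#!/usr/bin/env python3
"""Sketch: drive off enwiki's recentchanges, then do small per-title lookups on commons."""
import os.path
import pymysql


def connect(db):
    return pymysql.connect(
        host=f"{db}.analytics.db.svc.wikimedia.cloud",  # assumed replica host pattern
        database=f"{db}_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        charset="utf8mb4",
    )


enwiki = connect("enwiki")
commons = connect("commonswiki")

# 1) Small driving query: File-namespace pages touched on enwiki in the last month.
with enwiki.cursor() as cur:
    cur.execute(
        """SELECT DISTINCT rc_title
           FROM recentchanges
           WHERE rc_namespace = 6
             AND rc_timestamp > DATE_FORMAT(NOW() - INTERVAL 30 DAY, '%Y%m%d%H%i%S')"""
    )
    candidates = [row[0] for row in cur.fetchall()]

# 2) Nested short lookups on the commons side, one title at a time, instead of
#    pulling both corpora into memory and joining them in Python.
shadows = []
with commons.cursor() as cur:
    for title in candidates:
        cur.execute("SELECT 1 FROM image WHERE img_name = %s LIMIT 1", (title,))
        if cur.fetchone():
            shadows.append(title)

print(len(shadows), "recently-touched enwiki files that also exist on Commons")
```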
@Green_Cardamom Is it required to examine the entire set of records, or just recent ones? That's why I was asking if there was some way to leverage the recentchanges views. I don't know if the API can do the same things. It seems reasonable that only things changed in the past month or so need to be checked, right? I don't know how to filter for that yet, but if I can find time, I'll try it, if that isn't a bad idea.
I wonder if we could leverage the recent_changes views to limit the initial dataset. The query would not need to review all history and records that haven't been touched, right?
Tue, Nov 17
That's basically what that notebook is doing in the paged subquery version. I'm just trying to reason through if there's anything I could add to that.
From that notebook, it seems the way to do this is not to query both tables and combine them, but to make a query against one and then nest short queries in a loop over the results, or something like that, maybe? This won't work well if we try to use Python as a join directly, since we most likely aren't going to have the RAM for the combined datasets even if we get it working. I'll see if I can find a way to demonstrate something like that to see how feasible it is.
Sounds good for now. On the other hand, it isn't crazy to try things like https://github.com/rootless-containers/rootlesskit to see if we can make a docker socket to point at. We only need to be able to build and the equivalent of push, right (I say naively)? I wonder how hard it would be to test using something like that after you get it working with a "real" socket?
@AntiCompositeNumber started work on this a bit using PAWS (I believe this is the same as this ticket): https://public.paws.wmcloud.org/User:AntiCompositeBot/ShadowsCommonsQuery.ipynb
Adding this to the WMCS workboard because this may be a good opportunity to generate a clear example of how to do this for a common use case of cross-database joins.
Mon, Nov 16
Ok, apparently mysqlproxy is smarter than I thought, and it can tell that I'm not pointing it at different IP addresses when I give it a list of different names that point at the same IP. As of now, the otherwise-working code to connect in the new way is commented out. I may need to add another switch that generates the set of proxies when there are multiple proxy addresses to connect to. At the very least, the code is all in place to do this now.
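Roughly the kind of "generate the set of proxies" step I mean, just as a sketch (hostnames are placeholders, and the real connection logic is elsewhere):

```
import socket

# Hypothetical helper: collapse a list of proxy hostnames down to the distinct
# backends they resolve to, so two names pointing at the same IP only count once.
def distinct_proxies(hostnames):
    by_ip = {}
    for name in hostnames:
        ip = socket.gethostbyname(name)
        by_ip.setdefault(ip, name)  # keep the first name seen for each address
    return sorted(by_ip.values())

# Trivial demo; the real input would be the proxy FQDNs.
print(distinct_proxies(["localhost", "localhost"]))
```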
Looks like that might be related to some issues at least, if not this one.
I might have found something. Just before the issue happened, Prometheus reported that the number of namespaces with >=1 pod briefly dropped from 919 to 5. That suggests that Prometheus suddenly didn't see any pods in *any* of the tool namespaces for a moment. That could be a metrics hiccup, but it might be significant because the timing is not long before the bunch of 502s (which don't seem to appear in the proxy logs):
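For reference, this is roughly how one could pull that "namespaces with at least one pod" number out of Prometheus (the endpoint URL and metric name here are assumptions, not necessarily what the dashboard actually queries):

```
import requests

# Assumed Prometheus endpoint and kube-state-metrics metric name; both are
# placeholders for whatever the Toolforge cluster actually exposes.
PROM_URL = "http://prometheus.example.org/api/v1/query"
QUERY = "count(count by (namespace) (kube_pod_info))"  # namespaces with >= 1 pod

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
print(result[0]["value"] if result else "no data")
```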
I also have now managed to be online when this would have happened and could find no actual record of it happening except in the dashboard. That makes me really wonder what the dashboard was recording. The front proxy had not logged a bunch of 502s.
Sun, Nov 15
Since it seems very likely that this is why MariaDB hung up on me, I'm setting the session variable to the maximum to try to prevent it from happening again.
That increases the chances that we will fall too far behind to start replication. Really hoping not :(
Removed comment on replication since that was actually just old logs. Nothing current about that. It is entirely possible that there are larger-than-default inserts in there that used a larger max_allowed_packet (not package...sorry for the typo in the log) in the session variables. By doubling what they were (32MB), it might succeed.
I've restarted the import while I look for some reason that might have happened.