@Ladsgroup thanks! Do you have any more details about the ones still happening, like an example request?
Thu, Apr 17
Tue, Apr 15
@MusikAnimal thanks! I just pulled in the change and restarted all the services, so it should be live now.
Mon, Apr 14
I've implemented a fix here: https://github.com/wikimedia/wikiwho_api/pull/20/files
I think it's common for users who have lost/forgotten their password to attempt enough tries with password guesses to trigger the 'failed attempts' email, so some path that leads to a password reset makes sense there. The 'change password' link in the email is easily interpreted to be a synonym for 'reset password'.
I mean, in the case of the user not seeing the password reset link from https://auth.wikimedia.org/enwiki/wiki/Special:UserLogin (where it appears for me) I assume it's because of the block.
I guess the 'Forgot password' link is not appearing on the login page for that reason as well, but it's very confusing for an affected user because there's no indication of why the link does not appear, if they are looking for it. (So, this is probably not SUL3.)
Further info from the user: When navigating directly to https://auth.wikimedia.org/enwiki/wiki/Special:PasswordReset, the user ended up reaching an error message indicating an IP block:
@Tgr is this related to SUL3?
Tue, Apr 1
The requests definitely include an email address (and no password). The account creation flow from the Dashboard was working at least as recently as March 28, so I assume it's something that changed in the April 1 deployment.
Hmm... I am getting this error with a test account request as well. My first guess was that it's related to T390437, which just deployed.
Mon, Mar 31
Thanks Esther! Great work!
Thanks Formasit! Great work!
Thu, Mar 27
I've just deployed this. Thanks to Betty Alagwu for implementing it!
Okay, I think I've finally found the major cause of the downtime that continued happening after the data rearchitecture deployment! I deployed a fix today, and I'm very optimistic that it will be much more stable going forward.
Mar 18 2025
We've gotten through nearly all of the first updates now, and I've re-initiated update processing for the already-processed ones. The last few are still working through their first updates, but we're cycling through the rest now with about 1 or 2 days of latency.
Mar 13 2025
We continue chewing through the backlog, down to 57 events remaining for their first update. We're on the last 20 of the medium queue (up to 10k edits). Once those are complete, I will start steady-state updates for the rest of the already-processed events.
Mar 11 2025
Working through the initial backlog continues apace. We're down to 204 courses remaining, but that still includes the majority of the very large courses.
Mar 10 2025
It looks like things are working smoothly so far. I have added extra workers, since the memory footprint of the update cycle is so much lower now, so there are a bunch of processes working through the update backlog. We identified a number of inefficiencies that we can fix to speed up the update cycles, which I'll be deploying as we have them ready, but meanwhile we're down to 515 events in the queues awaiting their first update with the new system. The bigger ones that will take a long time to process are at the end, and may take a while to get through, but I will start steady-state processing for the ones that aren't going to take too long once we've done all the initial processing of the events with fewer than 10k edits.
The deployment is complete, and I've enqueued all the current events for their first pass through the new update cycle. There are 804 current ones, and the low-revision-count ones should get processed first. The others will take a while; I'll keep this issue updated as it progresses.
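As a rough illustration of the ordering described here (not the Dashboard's actual code), a minimal sketch might route each event to a queue by edit count and process the smallest first. The 10k upper bound for the medium queue comes from these updates; the `Event` class, the function names, the queue names other than "medium", and the 1,000-edit cutoff for the small queue are all assumptions made for the example.

```python
from dataclasses import dataclass

# The 10k bound for the medium queue is mentioned in the updates;
# the small-queue cutoff is a made-up value for illustration only.
SMALL_MAX_EDITS = 1_000    # assumption
MEDIUM_MAX_EDITS = 10_000  # "medium queue (up to 10k edits)"


@dataclass
class Event:
    """Hypothetical stand-in for a course/event awaiting an update."""
    slug: str
    edit_count: int


def queue_for(event: Event) -> str:
    """Pick a queue name based on how many edits the event has."""
    if event.edit_count <= SMALL_MAX_EDITS:
        return "small"
    if event.edit_count <= MEDIUM_MAX_EDITS:
        return "medium"
    return "large"


def first_update_order(events: list[Event]) -> list[Event]:
    """Return events sorted so low-revision-count ones are processed first."""
    return sorted(events, key=lambda e: e.edit_count)


if __name__ == "__main__":
    backlog = [Event("big", 50_000), Event("tiny", 200), Event("mid", 5_000)]
    for event in first_update_order(backlog):
        print(event.slug, queue_for(event))
```

Sorting by edit count keeps the quick events from waiting behind the very large ones, which matches the plan of starting steady-state updates for smaller events before the biggest ones finish.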
I'm deploying the major update to the statistics generation system today. I've paused all the data update workers, and I'm currently generating a database dump. After that, I'll deploy and then the system will begin processing all current courses using the new system; it will take quite a while to do the initial round of processing, as it will be pulling in data for all edits since the beginning of each still-active event. Once the initial round of processing is complete, updates should eventually reach a steady state with a better rate than currently.
Mar 4 2025
Thanks! This is a good idea. I will write up an issue for it.
Feb 24 2025
@Daimona the goal will be to get it pretty much completely stable. The Wiki Education instance (dashboard.wikiedu.org) has had ~100% uptime for more than a year, and I think it's realistic to get very close to that for Programs & Events Dashboard. (The setup on labs is a little more complicated with system components spread across multiple instances, and also subject to occasional WM Cloud-wide issues, so it will probably not be quite 100%, but I want to get very close.) This update addresses the root cause of the current instability situation (which is that updates for very large events are too demanding of the database), but there are likely to be secondary bottlenecks that reveal themselves once we get it into production, so there might be some further work to do to get it really stable. The update will completely eliminate the bottleneck queries against the Dashboard's revisions database, which are the main culprits, so if all goes well, even if large events still cause problems in terms of the update cycle, it should stop taking down the web server along with it.
Feb 14 2025
Yes, I'm very close to deploying a major update intended to get the system into a stable state. In the meantime, I will continue restarting the instance that runs the web server when it goes down (which typically happens because the database server is overloaded because of non-web requests from data updates).
Jan 15 2025
@Samwalton9-WMF this time, it was a good old-fashioned unresponsive server (so the homepage also was not loading). I believe this happens when the combination of web requests and the background workers use up all the database compute; without intervention, it typically comes back online within an hour, and restarting the peony-web instance via Horizon fixes it quickly.
Jan 8 2025
Yes, I think it was related. It was hanging on fetching two Vega libraries that are hosted on Toolforge (instead of a commercial CDN). I stopped and started the webservice for the wikiedudashboard tool, and that fixed it.
Dec 17 2024
Thanks for your contributions @Tunrayono
Dec 12 2024
One case of it happened in Safari 16.4.1.
Dec 6 2024
In T381673#10387136, @matmarex wrote: I guess the error message was 'deflate-invaliddeflate': "Content provided is not properly deflated"?
Nov 19 2024
I've restarted the webserver and it's back up.
Nov 18 2024
I received a screenshot of a console error from one of the affected users. It looks like it's "Remove does not match insert" from TransactionSquasher that is being thrown, but there's no user-facing indication of the problem or how to solve it.
Nov 14 2024
We've deployed an update that skips the requirement to remove the campaigns on P&E Dashboard.
Nov 12 2024
I think this would make a great task to roll into the (hopefully) upcoming Outreachy project: https://phabricator.wikimedia.org/T374390
Either is fine. The system has some database bottlenecks related to running updates on courses/events with large numbers of edits, so when a new high-activity event (or a surge of activity for an ongoing one) hits the system, it sometimes puts enough load on the database that the web service gets backed up and becomes unresponsive. It typically comes back online within a few hours at most (or sooner if I reboot the VM that runs the webserver), but it's down enough that I know it causes problems for things like sharing metrics with external partners.
Nov 4 2024
I just fielded a question about this today, telling the confused user pretty much exactly what the new message says.
Oct 28 2024
@MusikAnimal @TheresNoTime any chance either of you could squeeze this in soonish? It's a blocker to rolling out support for more languages for my Impact Visualizer (https://impact-visualizer.wmcloud.org/) project, and I'm eager to get some of the folks from those languages using it before I run out of time/money for active development.
I've filed an issue for this. The requirement to remove the campaigns before deleting a course is the desired behavior on Wiki Education Dashboard, but I think it does make sense to remove the requirement on Programs & Events Dashboard.
Oct 25 2024
This looks like a solid timeline.
Oct 23 2024
This timeline looks pretty reasonable.
Oct 15 2024
I restarted the webserver.
Oct 7 2024
In T374390#10204387, @vaYshaleee wrote: Hello are these tasks ready to be taken by the participants?
Sep 11 2024
I restarted the web server. I'm not sure what caused this particular outage, but we've got plans to improve the monitoring so that it's easier to pinpoint performance bugs.
Sep 9 2024
Sep 3 2024
I have not had any more reports of this bug.
Aug 14 2024
Aug 12 2024
Aug 9 2024
Jul 23 2024
I've just deployed an update that prevents changing those fields for a Dashboard event that is synced to an on-wiki Event.
Jul 22 2024
I do not have request logs for it. Someone on Telegram noted that registration is closed; can that be related?
Jul 2 2024
Done!
I'm most of the way there, and hope to finish up tomorrow.
Jul 1 2024
@ifried it's still on my todo list, but I have not fixed it yet. I can probably prioritize this in late July.
Jun 17 2024
Thanks! I plan to work on this the first week of July.
Jun 12 2024
Jun 3 2024
@EUwandu-WMF I think the change will be to disable changes to the 'Title', 'Institution', and 'When' text fields (which are the components of the 'slug' identifier that Event Registration uses to connect). You'll still be able to change start/end dates and wikis.
@Daimona yes, that sounds correct. The Dashboard should not allow changing the name (including the 'When' component) of a course after it has been connected to Event Registration. I will file an issue to make sure we get that fixed.
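To make the intended behavior concrete, here is a minimal sketch of the kind of check I have in mind, assuming a course is represented as a plain dictionary. The three locked fields are the ones named above; the function name, the dictionary keys, and the `synced_to_event` flag are hypothetical and not taken from the Dashboard codebase.

```python
# Fields that make up the slug Event Registration uses to connect.
# The lowercase key names are assumptions for this example.
LOCKED_FIELDS = ("title", "institution", "when")


def rejected_changes(current: dict, proposed: dict, synced_to_event: bool) -> list[str]:
    """Return the locked fields that a proposed update tries to change.

    An empty list means the update is allowed. Courses that are not
    connected to Event Registration are never restricted.
    """
    if not synced_to_event:
        return []
    return [
        field
        for field in LOCKED_FIELDS
        if field in proposed and proposed[field] != current.get(field)
    ]


if __name__ == "__main__":
    course = {"title": "Edit-a-thon", "institution": "Example Org",
              "when": "2024", "start": "2024-06-01"}
    update = {"when": "2025", "start": "2024-07-01"}
    # Changing 'when' is blocked for a synced course; 'start' is not.
    print(rejected_changes(course, update, synced_to_event=True))
```

Start/end dates and wikis are deliberately left out of the locked list, so they stay editable after the course is connected.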
Apr 16 2024
Mar 19 2024
Either would be fine.
I like this. I think keeping the non-JSON pages and translations on-wiki (at least, the ones for individual TrainingSlides), but making the database the only source for the Library and Module configurations that are currently implemented as .json pages makes sense.
Sure. I think it needs more explanation of how you'd tackle the problem of having this new system alongside the existing mode of training content. How will new features be rolled out in a way that will be backwards compatible? Also, how will removing content work?
I worked with Esther on her Outreachy project to improve the docs for Programs & Events Dashboard, and I can say with confidence that she would do a fantastic job on this project. She's really thoughtful and thorough when it comes to reviewing existing docs and identifying the right changes to make. She picked up on all the technical things necessary for working with complex wikitext pages quickly, she's self-driven when it comes to diving into code to figure out how things work, and she's great with finding the right people to ask when she needs input. Working with Esther was a pleasure, and I strongly endorse this proposal.
@Ujjwalpathaak I'm not going to do full reviews of any proposals before the final submissions, but if you'd like me to look at a specific aspect, I can do that.
I've just switched it back. Hopefully it's working again now.
Ah yes, today I switched it out of P&E mode into Wiki Education mode, so that's probably what broke it.
Mar 11 2024
I've just sent the data to @Rmaung. It includes every record in the database, which should stretch all the way back to when the first training module was added.