I work on software to help connect Wikipedia and academia. I used to work for the Wikimedia Foundation. Now I work for a spin-off nonprofit, Wiki Education: https://wikiedu.org
sage@wikiedu.org
Sounds like a good idea. I've created an issue for it.
It's worth noting that we could enable additional languages without an initial processing run; in that case, every new request would trigger processing of that particular article (and result in slower, but still usable, performance for the first request for each article for Who Wrote That and Dashboard end users). I think that strategy would be fine for enabling all additional languages; the only blocker would be making sure we have appropriate storage for handling the gradual accumulation of processed article files for the additional languages (and clearing some headroom for continued growth of storage requirements for the already-enabled ones).
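As a rough illustration of the on-demand strategy described above, here is a minimal sketch in Python. It is not the actual WikiWho code: the class name `OnDemandArticleStore`, the `fake_process` stand-in, and the in-memory dictionary (standing in for the processed article files on disk) are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class OnDemandArticleStore:
    """Hypothetical store that processes an article only on its first request."""

    def __init__(self, process: Callable[[str, str], dict]) -> None:
        self._process = process  # the slow step (assumed to build the processed data)
        self._store: Dict[Tuple[str, str], dict] = {}  # stands in for files on disk
        self.processing_runs = 0

    def get(self, language: str, title: str) -> dict:
        key = (language, title)
        if key not in self._store:
            # First request for this article: pay the processing cost now.
            self._store[key] = self._process(language, title)
            self.processing_runs += 1
        # Later requests are served from what was already stored.
        return self._store[key]

    def stored_count(self) -> int:
        # Grows by one per distinct article requested, which is the
        # gradual storage accumulation mentioned above.
        return len(self._store)


def fake_process(language: str, title: str) -> dict:
    # Placeholder for the real, expensive processing step.
    return {"language": language, "title": title, "tokens": []}
```

With this shape, only the first `get` for a given article is slow, and storage grows with the number of distinct articles requested rather than with the size of the whole wiki.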
When will we start listing projects?
I think I have a good guess at what happened, a race condition of sorts.
In T401873#11095223, @Daimona wrote: @Ragesoss Hi! This course was associated to this event on-wiki on July 28th. However, it looks like it is no longer linked. This did not happen on our side AFAICT. Do you have an idea of how it might have happened, and if it could be prevented going forwards? Thanks!
Here's a nice visualization of the progress we made on performance problems since the start of the internship, in the form of the downtime monitor for Programs & Events Dashboard.
Thanks! I was unable to implement this last week, as I was travelling.
When does your edit-a-thon start?
Yes, that one seems like the same thing to me.
Thanks! I just processed the backlog of account requests on Programs & Events Dashboard, and things are working normally again.
Any update on this? We still have account requests coming in regularly on Programs & Events Dashboard for editathons and such, and can't process them.
Wiki Education continues to get reports of this error from new student editors. This is the most common error we hear about from users lately.
@Ladsgroup thanks! Do you have any more details about the ones still happening, like an example request?
@MusikAnimal thanks! I just pulled in the change and restarted all the services, so it should be live now.
I've implemented a fix here: https://github.com/wikimedia/wikiwho_api/pull/20/files
I think it's common for users who have lost/forgotten their password to attempt enough tries with password guesses to trigger the 'failed attempts' email, so some path that leads to a password reset makes sense there. The 'change password' link in the email is easily interpreted to be a synonym for 'reset password'.
I mean, in the case of the user not seeing the password reset link from https://auth.wikimedia.org/enwiki/wiki/Special:UserLogin (where it appears for me) I assume it's because of the block.
I guess the 'Forgot password' link is not appearing on the login page for that reason as well, but it's very confusing for an affected user because there's no indication of why the link does not appear, if they are looking for it.
Further info from the user: When navigating directly to https://auth.wikimedia.org/enwiki/wiki/Special:PasswordReset, the user ended up reaching an error message indicating an IP block:
@Tgr is this related to SUL3?
The requests definitely include an email address (and no password). The account creation flow from the Dashboard was working at least as recently as March 28, so I assume it's something that changed in the April 1 deployment.
Hmm... I am getting this error with a test account request as well. My first guess was that it's related to T390437, which just deployed.
Thanks Esther! Great work!
Thanks Formasit! Great work!
I've just deployed this. Thanks to Betty Alagwu for implementing it!
Okay, I think I've finally found the major cause of the downtime that continued happening after the data rearchitecture deployment! I deployed a fix today, and I'm very optimistic that it will be much more stable going forward.
We've gotten through nearly all of the first updates now, and I've re-initiated update processing for the already-processed ones. The last few are still working through their first updates, but we're cycling through now with about 1 or 2 days of latency for the rest.
We continue chewing through the backlog, down to 57 events remaining for their first update. We're on the last 20 of the medium queue (up to 10k edits). Once those are complete, I will start steady-state updates for the rest of the already-processed events.
Working through the initial backlog continues apace. We're down to 204 courses remaining, but that still includes the majority of the very large courses.
It looks like things are working smoothly so far. I have added extra workers, since the memory footprint of the update cycle is so much lower now, so there are a bunch of processes working through the update backlog. We identified a number of inefficiencies that we can fix to speed up the update cycles, which I'll be deploying as we have them ready, but meanwhile we're down to 515 events in the queues awaiting their first update with the new system. The bigger ones that will take a long time to process are at the end, and may take a while to get through, but I will start steady-state processing for the ones that aren't going to take too long once we've done all the initial processing of the events with fewer than 10k edits.
The deployment is complete, and I've enqueued all the current events for their first cycle through the new update cycle. There are 804 current ones, and the low-revision-count ones should get processed first. The others will take a while; I'll keep this issue updated as it progresses.
I'm deploying the major update to the statistics generation system today. I've paused all the data update workers, and I'm currently generating a database dump. After that, I'll deploy and then the system will begin processing all current courses using the new system; it will take quite a while to do the initial round of processing, as it will be pulling in data for all edits since the beginning of each still-active event. Once the initial round of processing is complete, updates should eventually reach a steady state with a better rate than currently.
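To make the "smallest events first" ordering concrete, here is a minimal sketch in Python. It is only an illustration, not the Dashboard's actual queueing code: the `Event` type, the queue names, and the `SMALL_LIMIT` threshold are assumptions, and only the 10k cutoff for the medium queue comes from the updates in this thread.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    slug: str
    edit_count: int


SMALL_LIMIT = 1_000    # assumed threshold, not from the Dashboard
MEDIUM_LIMIT = 10_000  # the medium queue is described as "up to 10k edits"


def queue_for(event: Event) -> str:
    """Pick a (hypothetical) queue name based on how many edits an event has."""
    if event.edit_count <= SMALL_LIMIT:
        return "short"
    if event.edit_count <= MEDIUM_LIMIT:
        return "medium"
    return "long"


def processing_order(events: List[Event]) -> List[Event]:
    """Order events so the ones with the fewest edits get their first update first."""
    return sorted(events, key=lambda e: e.edit_count)
```

Sorting by edit count means the many small events finish quickly and the few very large ones are left for the end, which matches how the backlog has been draining.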
Thanks! This is a good idea. I will write up an issue for it.
@Daimona the goal will be to get it pretty much completely stable. The Wiki Education instance (dashboard.wikiedu.org) has had ~100% uptime for more than a year, and I think it's realistic to get very close to that for Programs & Events Dashboard. (The setup on labs is a little more complicated with system components spread across multiple instances, and also subject to occasional WM Cloud-wide issues, so it will probably not be quite 100%, but I want to get very close.) This update addresses the root cause of the current instability situation (which is that updates for very large events are too demanding of the database), but there are likely to be secondary bottlenecks that reveal themselves once we get it into production, so there might be some further work to do to get it really stable. The update will completely eliminate the bottleneck queries against the Dashboard's revisions database, which are the main culprits, so if all goes well, even if large events still cause problems in terms of the update cycle, it should stop taking down the web server along with it.
Yes, I'm very close to deploying a major update intended to get the system into a stable state. In the meantime, I will continue restarting the instance that runs the web server when it goes down (which typically happens because the database server is overloaded because of non-web requests from data updates).
@Samwalton9-WMF this time, it was a good old-fashioned unresponsive server (so the homepage also was not loading). I believe this happens when the combination of web requests and the background workers use up all the database compute; without intervention, it typically comes back online within an hour, and restarting the peony-web instance via Horizon fixes it quickly.
Yes, I think it was related. It was hanging on fetching two Vega libraries that are hosted on toolforge (instead of a commercial CDN). I stopped and started the webservice for wikiedudashboard tool, and that fixed it.
Thanks for your contributions @Tunrayono
One case of it happened in Safari 16.4.1.
In T381673#10387136, @matmarex wrote: I guess the error message was 'deflate-invaliddeflate': "Content provided is not properly deflated"?
I've restarted the webserver and it's back up.
I received a screenshot of a console error from one of the affected users. It looks like it's "Remove does not match insert" from TransactionSquasher that is being thrown, but there's no user-facing indication of the problem or how to solve it.
We've deployed an update that skips the requirement to remove the campaigns on P&E Dashboard.
I think this would make a great task to roll into the (hopefully) upcoming Outreachy project: https://phabricator.wikimedia.org/T374390
Either is fine. The system has some database bottlenecks related to running updates on courses/events with large numbers of edits, so when a new high-activity event (or a surge of activity for an ongoing one) hits the system, it sometimes puts enough load on the database that the web service gets backed up and becomes unresponsive. It typically comes back online within a few hours at most (or sooner if I reboot the VM that runs the webserver), but it's down enough that I know it causes problems for things like sharing metrics with external partners.
I just fielded a question about this today, telling the confused user pretty much exactly what the new message says.
@MusikAnimal @TheresNoTime any chance either of you could squeeze this in soonish? It's a blocker to rolling out support for more languages for my Impact Visualizer (https://impact-visualizer.wmcloud.org/) project, and I'm eager to get some of the folks from those languages using it before I run out of time/money for active development.
I've filed an issue for this. The requirement to remove the campaigns before deleting a course is the desired behavior on Wiki Education Dashboard, but I think it does make sense to remove the requirement on Programs & Events Dashboard.
This looks like a solid timeline.
This timeline looks pretty reasonable.
I restarted the webserver.
In T374390#10204387, @vaYshaleee wrote: Hello are these tasks ready to be taken by the participants?
I restarted the web server. I'm not sure what caused this particular outage, but we've got plans to improve the monitoring so that it's easier to pinpoint performance bugs.
I have not had any more reports of this bug.
I've just deployed an update that prevents changing those fields for a Dashboard event that is synced to an on-wiki Event.
I do not have request logs for it. Someone on Telegram noted that registration is closed; can that be related?
Done!
I'm most of the way there, and hope to finish up tomorrow.
@ifried it's still on my todo list, but I have not fixed it yet. I can probably prioritize this in late July.