
Ragesoss (Sage Ross)
Chief Technology Officer, Wiki Education

User Details

User Since: Oct 7 2014, 7:26 PM (363 w, 1 d)
Availability: Available
IRC Nick: ragesoss
LDAP User: Ragesoss
MediaWiki User: Ragesoss

I work on software to help connect Wikipedia and academia. I used to work for Wikimedia Foundation. Now I work for a spin-off nonprofit, Wiki Education: https://wikiedu.org

sage@wikiedu.org

Recent Activity

Fri, Sep 10

Ragesoss awarded T288840: Migrate WikiWho service to production or VPS a Love token.
Fri, Sep 10, 7:02 PM · Education-Program-Dashboard, XTools, Who-Wrote-That, Community-Tech

Aug 11 2021

Ragesoss added a comment to T284119: Growth: options for disabling/enabling Growth features for groups of users.

@Ragesoss -- thanks for the update! Did you end up implementing it such that all users have the features on? Or so that all users have the features off?

Aug 11 2021, 3:33 PM · User-Urbanecm_WMF (Engineering), Growth-Team (Current Sprint)

Aug 10 2021

Ragesoss added a comment to T284119: Growth: options for disabling/enabling Growth features for groups of users.

It would work for the account creation API as well, I think, although you'd get an invalid parameter warning as the parameter is not formally registered.

It's definitely not a behavior we tried to test -- so even if it works, I don't think it should be relied upon.

Aug 10 2021, 5:37 PM · User-Urbanecm_WMF (Engineering), Growth-Team (Current Sprint)
Ragesoss added a comment to T284119: Growth: options for disabling/enabling Growth features for groups of users.

@kostajh thanks much!

Aug 10 2021, 4:40 PM · User-Urbanecm_WMF (Engineering), Growth-Team (Current Sprint)
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Yesterday the system was broken for several hours because of a full disk on p-and-e-dashboard-database. This is because I hadn't configured the database to use Cinder volume for the tmpdir, so temporary tables filled up the small root partition.

Aug 10 2021, 4:36 PM · VPS-Projects, Education-Program-Dashboard

Aug 9 2021

Ragesoss added a comment to T284119: Growth: options for disabling/enabling Growth features for groups of users.

I'm working on this now, and I have a few questions.

Aug 9 2021, 6:22 PM · User-Urbanecm_WMF (Engineering), Growth-Team (Current Sprint)
Ragesoss added a comment to T284119: Growth: options for disabling/enabling Growth features for groups of users.

@MMiller_WMF : I haven't implemented it yet — it's near the top of my TODO list — but I think it should work fine. I'll report back once I do implement and deploy it, likely this week. Thanks!

Aug 9 2021, 3:54 PM · User-Urbanecm_WMF (Engineering), Growth-Team (Current Sprint)

Jul 29 2021

Ragesoss created T287693: Programs & Events Dashboard 2021 user survey.
Jul 29 2021, 4:58 PM · Education-Program-Dashboard, Surveys

Jun 16 2021

Ragesoss added a comment to T285063: "editorial guidelines" link from Diff blog author Dashboard does not work.

Note that the same thing (with a slightly different href, https://#) also affects the New Post interface: https://diff.wikimedia.org/wp-admin/post-new.php

Jun 16 2021, 5:18 PM · Diff-blog
Ragesoss created T285063: "editorial guidelines" link from Diff blog author Dashboard does not work.
Jun 16 2021, 5:14 PM · Diff-blog

Jun 14 2021

Ragesoss closed T284756: Outreach Dashboard's export for Article scoped program includes pages outside of the scope as Resolved.

This is live now.

Jun 14 2021, 6:51 PM · Education-Program-Dashboard

Jun 11 2021

Ragesoss added a comment to T284119: Growth: options for disabling/enabling Growth features for groups of users.

Is there a method available right now that I could implement to mark Wiki Education users?

I think that mostly depends on "do we want to force-enable the features, or force-disable". From reading this task, it's still not clear to me which one of those two things is desired here.

Jun 11 2021, 5:46 PM · User-Urbanecm_WMF (Engineering), Growth-Team (Current Sprint)

Jun 10 2021

Ragesoss added a comment to T284752: Setting explicit time zone in Outreach Dashboard.

Thanks, this is helpful feedback.

Jun 10 2021, 9:13 PM · Education-Program-Dashboard
Ragesoss added a comment to T284119: Growth: options for disabling/enabling Growth features for groups of users.

Is there a method available right now that I could implement to mark Wiki Education users? Some of them are new accounts where we could set an additional param in the account creation URL, but some create their accounts on their own before joining a course on the Dashboard. But we do attempt to make an options edit immediately after a user self-enrolls in a course, so rolling some kind of identification setting into that will probably be the most reliable way to handle it. (This does not happen for cases where the instructor adds the username to the course, rather than the user doing it themselves while signed in to the Dashboard, but that's relatively rare... and those accounts won't get opted out of any experimental features anyway.)

Jun 10 2021, 4:17 PM · User-Urbanecm_WMF (Engineering), Growth-Team (Current Sprint)
Ragesoss added a comment to T284756: Outreach Dashboard's export for Article scoped program includes pages outside of the scope.

This happened to be a very simple change: https://github.com/WikiEducationFoundation/WikiEduDashboard/commit/1cbb98730efc2592ba9eaa5cc336b45c75bd08a4

Jun 10 2021, 3:56 PM · Education-Program-Dashboard
Ragesoss added a comment to T284760: Batch assignment of articles to editor in Outreach Dashboard.

This makes sense. I've added an issue on GitHub pointing here.

Jun 10 2021, 3:55 PM · Education-Program-Dashboard
Ragesoss added a comment to T284752: Setting explicit time zone in Outreach Dashboard.

This is worth considering, but I'm not sure it's worth it to increase the complexity of the UI to add a timezone picker. We're currently storing all times as UTC, but converting to the user's local timezone on the frontend, which is a good solution for most users. But perhaps it could be done in a way that's unobtrusive enough to be worth it.
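The store-as-UTC, convert-for-display approach mentioned in this comment can be sketched roughly as follows. This is an illustration in Python, not the Dashboard's frontend code; the function name `to_local` and the fixed UTC-7 offset are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_local(utc_dt: datetime, local_tz: tzinfo) -> datetime:
    """Convert a timezone-aware UTC datetime to the viewer's timezone."""
    if utc_dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return utc_dt.astimezone(local_tz)


# Stored value: always UTC.
stored = datetime(2021, 6, 10, 15, 53, tzinfo=timezone.utc)

# Hypothetical viewer timezone: a fixed UTC-7 offset, for illustration only.
viewer_tz = timezone(timedelta(hours=-7))
local = to_local(stored, viewer_tz)
```

The converted value represents the same instant as the stored one; only the displayed wall-clock time changes. An explicit timezone picker would amount to letting the user choose `viewer_tz` instead of taking it from the browser.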

Jun 10 2021, 3:53 PM · Education-Program-Dashboard
Ragesoss added a comment to T284756: Outreach Dashboard's export for Article scoped program includes pages outside of the scope.

I've added a fix for this, and it should be live after the next deployment... likely next week sometime.

Jun 10 2021, 3:48 PM · Education-Program-Dashboard
Ragesoss added a comment to T284757: In Outreach Dashboard limit wikis in "Add available articles" form to only Tracked wikis.

This is a good idea. I've added an issue for it: https://github.com/WikiEducationFoundation/WikiEduDashboard/issues/4526

Jun 10 2021, 3:47 PM · Education-Program-Dashboard

Jun 3 2021

Ragesoss added a comment to T284119: Growth: options for disabling/enabling Growth features for groups of users.

@Urbanecm_WMF thank you! This is very helpful.

Jun 3 2021, 5:39 PM · User-Urbanecm_WMF (Engineering), Growth-Team (Current Sprint)

Jun 2 2021

Ragesoss added a comment to T284119: Growth: options for disabling/enabling Growth features for groups of users.

I was thinking we could use a query parameter at registration time, something like geOptOut=1, that would be checked in the various LocalUserCreated implementations (help panel, homepage, mentor), and that would basically prevent the code in those hooks running for users created via that path. @Ragesoss would that work for your use case?
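As a rough sketch of what the proposal in this comment would look like from the caller's side, the snippet below builds an account creation URL carrying the suggested parameter. `geOptOut` is only the name floated in the comment, not a confirmed or registered parameter, and the base URL and the `signup_url` helper are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the usual index.php entry point on English Wikipedia.
BASE_URL = "https://en.wikipedia.org/w/index.php"


def signup_url(opt_out: bool) -> str:
    """Build an account creation URL, optionally with the proposed opt-out flag."""
    params = {"title": "Special:CreateAccount"}
    if opt_out:
        # Proposed parameter name from the comment; hypothetical until implemented.
        params["geOptOut"] = "1"
    return f"{BASE_URL}?{urlencode(params)}"


url = signup_url(opt_out=True)
query = parse_qs(urlsplit(url).query)
```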

Jun 2 2021, 3:49 PM · User-Urbanecm_WMF (Engineering), Growth-Team (Current Sprint)

May 27 2021

Legoktm awarded T283776: Request increased quota for globaleducation Cloud VPS project a Like token.
May 27 2021, 5:47 PM · Cloud-VPS (Quota-requests)
Ragesoss created T283776: Request increased quota for globaleducation Cloud VPS project.
May 27 2021, 12:21 AM · Cloud-VPS (Quota-requests)

May 18 2021

Ragesoss added a comment to T283096: Update IP addresses for Wiki Education Dashboard exemptions to rate-limiting and global block.

Thanks all! Martin helped me get it sorted out: I disabled the global block locally for en.wikipedia, and updated the local version of it to allow logged-in edits. Dashboard edits are working again now.

May 18 2021, 9:05 PM · User-Zabe, Wikimedia-Site-requests
Ragesoss added a comment to T283096: Update IP addresses for Wiki Education Dashboard exemptions to rate-limiting and global block.

Thanks. I thought there was something else letting the dashboard IP get around a global block, because it looks like the old IP is also in a blocked range: https://meta.wikimedia.org/wiki/Special:Contributions/45.56.98.206

May 18 2021, 8:02 PM · User-Zabe, Wikimedia-Site-requests
Ragesoss added a comment to T283096: Update IP addresses for Wiki Education Dashboard exemptions to rate-limiting and global block.

Thanks @Zabe! Is this live already? We're still having all our edits blocked like this:

May 18 2021, 7:51 PM · User-Zabe, Wikimedia-Site-requests
Ragesoss added a comment to T283096: Update IP addresses for Wiki Education Dashboard exemptions to rate-limiting and global block.

2600:3c03::f03c:91ff:fe08:7973/128 is the IPv6 range that went with the 45.56.98.206 server, so I believe the IPv6 line from T151823 can be replaced as well.
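For readers unfamiliar with the notation, a /128 IPv6 range covers exactly one address. A quick check with Python's standard `ipaddress` module, using the values from this comment:

```python
import ipaddress

# The /128 range and the IPv4 address quoted in the comment above.
old_v6_range = ipaddress.ip_network("2600:3c03::f03c:91ff:fe08:7973/128")
old_v4 = ipaddress.ip_address("45.56.98.206")

# A /128 prefix leaves no host bits, so the "range" is a single address.
single_address = old_v6_range.num_addresses == 1
contains_host = ipaddress.ip_address("2600:3c03::f03c:91ff:fe08:7973") in old_v6_range
```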

May 18 2021, 5:55 PM · User-Zabe, Wikimedia-Site-requests
Ragesoss created T283096: Update IP addresses for Wiki Education Dashboard exemptions to rate-limiting and global block.
May 18 2021, 5:35 PM · User-Zabe, Wikimedia-Site-requests

May 11 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Today I set up an additional node to process the long program update queue. My best guess is that running out of memory is the source of the problems we've had lately (i.e., the server going down until an Apache restart), so moving that most resource-intensive process to a separate server should help.

May 11 2021, 6:13 PM · VPS-Projects, Education-Program-Dashboard

May 10 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I set up and switched over to a fresh VM today; with the separate database server, this was completely seamless as all I needed to do once it was set up was to change the Web Proxy on Horizon to point to the new VM.

May 10 2021, 9:58 PM · VPS-Projects, Education-Program-Dashboard

Apr 29 2021

Ragesoss added a comment to T281334: Intermittent 500s when loading course.json.

My guess is that this is the same issue that affects users of the frontend when Apache gets into a state where a portion of the requests immediately return a 500. I still don't know exactly what causes it — likely a slow route using up most of the application threads. The system has been stable since we restarted the server yesterday. Have you had any more errors in the last ~21 hours?

Apr 29 2021, 1:30 PM · Education-Program-Dashboard

Apr 28 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I'm back into the system now after some help with restarting the server and fixing LDAP integration. My work preparing for a distributed architecture is still in progress, but it's looking good; I hope to have it much more stable and resilient a few months from now.

Apr 28 2021, 2:39 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I'm not sure what is wrong, but it might be related to the recent NFS outage on cloud services... I can't currently log in to the server, as I'm getting Permission denied (publickey)

Apr 28 2021, 1:58 PM · VPS-Projects, Education-Program-Dashboard

Apr 23 2021

Ragesoss closed T280922: outreachdashboard failing with 500 Internal Server Error, a subtask of T273067: Outreach Dashboard has been having recurring outages, as Resolved.
Apr 23 2021, 1:21 PM · Cloud-VPS, Education-Program-Dashboard
Ragesoss closed T280922: outreachdashboard failing with 500 Internal Server Error as Resolved.

Couldn't find the immediate cause (it didn't seem to be a problem at the application layer), but it's been up since the restart. We're to switch from Apache to nginx as part of the larger infrastructure update that's in the works in the coming months.

Apr 23 2021, 1:21 PM · Education-Program-Dashboard

Apr 22 2021

Ragesoss added a comment to T280922: outreachdashboard failing with 500 Internal Server Error.

I restarted Apache and that brought it back up. Will investigate the cause.

Apr 22 2021, 1:47 PM · Education-Program-Dashboard

Apr 15 2021

Ragesoss added a comment to T279956: Request increased quota for globaleducation Cloud VPS project (multi-node).

Will do.

Apr 15 2021, 1:48 PM · Cloud-VPS (Quota-requests)

Apr 14 2021

Ragesoss added a comment to T279956: Request increased quota for globaleducation Cloud VPS project (multi-node).

Could you please detail how much RAM and CPU you would need in your quota?

Apr 14 2021, 3:45 PM · Cloud-VPS (Quota-requests)

Apr 12 2021

Ragesoss created T279956: Request increased quota for globaleducation Cloud VPS project (multi-node).
Apr 12 2021, 8:10 PM · Cloud-VPS (Quota-requests)

Apr 8 2021

Ragesoss closed T279320: Enable automated notifications from Outreach Dashboard for pt.wikiversity as Resolved.
Apr 8 2021, 8:39 PM · Education-Program-Dashboard
Ragesoss added a comment to T279320: Enable automated notifications from Outreach Dashboard for pt.wikiversity.

@Ederporto I've enabled pt.wikiversity edits. Please test it out and confirm that it's working properly.

Apr 8 2021, 5:46 PM · Education-Program-Dashboard

Apr 6 2021

Ragesoss added a comment to T279320: Enable automated notifications from Outreach Dashboard for pt.wikiversity.

I've added the configuration for this: https://github.com/WikiEducationFoundation/WikiEduDashboard/commit/5130e8ae4033114259074a05ca61ad12d50c772d

Apr 6 2021, 8:50 PM · Education-Program-Dashboard

Apr 5 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I fixed the too-long-SQL-queries problem! After the switch to a separate database server, the Rails app was still connecting to the database using the mysql2 gem built against MariaDB 10.1, but the database server is now running MariaDB 10.3. I upgraded libmariadb-dev to 10.3 on the Rails server, rebuilt the gem, and the problem went away. :-D

Apr 5 2021, 6:18 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss closed T263159: Analyze 'need help' requests from Wiki Education Dashboard as Resolved.

This project hasn't been done; it might make a good project for a future round, although it's not a top priority.

Apr 5 2021, 4:18 PM · Outreach-Programs-Projects, Education-Program-Dashboard, Outreachy (Round 21)

Apr 2 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The MySQL docs point to this possible problem:

Apr 2 2021, 8:01 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

There's also one more remaining case of too-long-SQL-queries that I can't work around as easily as the others; for an ArticleScopedProgram that is tracking based on a large category or PetScan query, we use the array of article titles for that category/query to find which edited articles are in scope, and if there are thousands of titles it makes the SQL too long and the database connection cuts off. Here's an example of a program being affected by this problem: https://outreachdashboard.wmflabs.org/courses/wikidata/Museum_Day_2021_-_Museums_(2021)/articles/edited
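One generic way to keep a query string short when matching against thousands of titles is to split the list into batches and merge the results. The sketch below shows the idea with an in-memory SQLite database; it is not the Dashboard's code, and the table name `edited_articles`, the helper names, and the batch size of 500 are all assumptions.

```python
import sqlite3
from typing import Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def edited_titles_in_scope(conn: sqlite3.Connection,
                           scope_titles: Sequence[str],
                           batch_size: int = 500) -> set[str]:
    """Find edited articles whose title is in scope, one short query per batch."""
    found: set[str] = set()
    for batch in batched(scope_titles, batch_size):
        placeholders = ",".join("?" for _ in batch)
        rows = conn.execute(
            f"SELECT title FROM edited_articles WHERE title IN ({placeholders})",
            list(batch),
        )
        found.update(title for (title,) in rows)
    return found


# Hypothetical data: three edited articles, two of which are in scope.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edited_articles (title TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO edited_articles VALUES (?)",
                 [("Louvre",), ("Prado",), ("Unrelated",)])
scope = [f"Museum_{i}" for i in range(2000)] + ["Louvre", "Prado"]
in_scope = edited_titles_in_scope(conn, scope)
```

Each query carries at most `batch_size` titles, so its length stays bounded regardless of how large the category or PetScan result is, at the cost of more round trips to the database.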

Apr 2 2021, 7:37 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I've now re-enabled updates for all programs except the following ones, which are disproportionately responsible for high system load. I need to study them further to find possible bottlenecks before I'll feel safe re-enabling them. Many of them are ones that track very large numbers of articles over long time periods; I may need to add some limitations such as not allowing programs that track activity for more than 1 or 2 years.

Apr 2 2021, 7:29 PM · VPS-Projects, Education-Program-Dashboard

Mar 31 2021

Ragesoss closed T278983: New Replica returns different data than old Replica for query using `change_tag` table, a subtask of T272523: Early testing of the new Wiki Replicas multi-instance architecture, as Resolved.
Mar 31 2021, 8:14 PM · Developer-Advocacy (Apr-Jun 2021), Data-Services, cloud-services-team (Kanban)
Ragesoss closed T278983: New Replica returns different data than old Replica for query using `change_tag` table as Resolved.

Thanks for getting to the bottom of this!

Mar 31 2021, 8:14 PM · Data-Services
Ragesoss added a comment to T278983: New Replica returns different data than old Replica for query using `change_tag` table.

Oh, I see! I never noticed before that this example query was returning duplicates of the same revision; the intent was to return just one entry for a given revision, and have it report system: true if it was from any of the specified change tags. I'll update my test accordingly. Wiki Education Dashboard has been fetching duplicate revisions for a while now, which might explain why the system indication has been inconsistent... it depended on which copy of the revision got written to the database first. (Programs & Events Dashboard just has one change tag specified in its config, so this won't have made an operational difference there.)
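The intended behavior described here (one entry per revision, with system set to true if any duplicate carried a matching change tag) can be sketched on the client side as below. The row shape and the `rev_id` key are assumptions for illustration; the actual fix may equally well be made in the SQL query itself.

```python
from typing import Iterable


def merge_revisions(rows: Iterable[dict]) -> list[dict]:
    """Collapse duplicate rows per revision; system is true if any copy was true."""
    merged: dict[int, dict] = {}
    for row in rows:
        rev_id = row["rev_id"]
        entry = merged.setdefault(rev_id, {"rev_id": rev_id, "system": False})
        entry["system"] = entry["system"] or bool(row["system"])
    return list(merged.values())


# Hypothetical result set: revision 1 appears twice, once per change tag join.
rows = [
    {"rev_id": 1, "system": False},
    {"rev_id": 1, "system": True},
    {"rev_id": 2, "system": False},
]
deduped = merge_revisions(rows)
```

Because the flags are combined with a logical OR, the result no longer depends on which copy of a revision happens to arrive first.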

Mar 31 2021, 7:50 PM · Data-Services
Ragesoss added a comment to T278983: New Replica returns different data than old Replica for query using `change_tag` table.

No, my code doesn't rely on the order of the results.

Mar 31 2021, 6:48 PM · Data-Services
Ragesoss added a comment to T278983: New Replica returns different data than old Replica for query using `change_tag` table.

Here's a query:

Mar 31 2021, 5:10 PM · Data-Services
Ragesoss created T278983: New Replica returns different data than old Replica for query using `change_tag` table.
Mar 31 2021, 5:03 PM · Data-Services
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Sage, have you checked if a memory module is failing in the respective host server?

Could you update the MariaDB server to the latest minor version?

Mar 31 2021, 3:33 PM · VPS-Projects, Education-Program-Dashboard

Mar 30 2021

Ragesoss renamed T277651: Outreachdashboard.wmflabs.org is down frequently with database problems from Outreachdashboard.wmflabs.org is down with "queue full" to Outreachdashboard.wmflabs.org is down frequently with database problems.
Mar 30 2021, 5:29 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I've now disabled updates for all the 'long' programs, which are ones that have taken longer than 10 minutes for a data update. Many of these which take between 10 and 30 minutes are probably fine, and I'll start re-enabling updates for such pages. I'll be adding a message shortly to indicate when a particular program has updates disabled. For the longest ones, I plan to do more work to limit the time-consuming parts of the process (likely by disabling some features) so that updates can eventually be re-enabled.

Mar 30 2021, 5:20 PM · VPS-Projects, Education-Program-Dashboard

Mar 29 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I've finished reloading the revisions table, brought the web server back up, and turned on the background processes for short and medium events and 'constant' updates. I'll turn on the 'daily' updates process in a little while.

Mar 29 2021, 5:53 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I've completed the revisions.sql dump (33G), dropped the revisions table, taken it out of recovery mode, and started restoring the table (mysql dashboard < revisions.sql).

Mar 29 2021, 2:14 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I've taken the Dashboard offline, put the database in innodb_force_recovery=1 mode, and started mysqldump dashboard revisions > revisions.sql. It'll take perhaps a couple of hours to drop and reload the revisions table, after which I'll switch out of recovery and bring the frontend back online while I work on isolating the events that take too long.

Mar 29 2021, 1:47 PM · VPS-Projects, Education-Program-Dashboard

Mar 28 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

This error at 11:31 UTC March 28 happened right before the index corruption:

Mar 28 2021, 4:39 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The system had looked quite stable, but about 4 hours ago, all the update processes once again started erroring due to a corrupted revisions index. So far, I haven't found any hints at what the acute cause is. Given that this happened again after moving the database to a new server and a Cinder volume, it seems like a hardware problem is a very unlikely reason.

Mar 28 2021, 3:36 PM · VPS-Projects, Education-Program-Dashboard

Mar 25 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The adjustments I made to avoid long query strings are working so far, and all update processes are currently back online. I'll continue monitoring closely for additional problems.

Mar 25 2021, 8:43 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I'm still stumped on why larger query strings are causing the app to lose its database connection. I've started putting in place some workarounds to avoid needing such long query strings in the places that have failed so far.

Mar 25 2021, 7:32 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I already tried increasing net_write_timeout as well, but that didn't help.

Mar 25 2021, 5:45 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Yeah, I've been looking at that. It points to the problem most likely being network related, rather than server related. I ran tcpdump through a disconnection and then used strings to see what the traffic looked like, but it offers no useful hints that I can glean.

Mar 25 2021, 5:21 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Also tried:
innodb_buffer_pool_size=12GB

Mar 25 2021, 4:31 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I tried these settings as well, to no avail:

Mar 25 2021, 4:09 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I've set log_warnings=4 to surface logs from these connection errors on the mysqld side. They look like this: 2021-03-25 15:47:13 10 [Warning] Aborted connection 10 to db: 'dashboard' user: 'outreachdashboard' host: 'programs-and-events-dashboard.globaleducation.eqiad1.wikimed' (Got an error reading communication packets)

Mar 25 2021, 3:57 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The web app is largely working now, but there seems to be a problem with the database configuration when the system tries to run some of the larger queries, which trigger a connection error. Investigating it today.

Mar 25 2021, 3:25 PM · VPS-Projects, Education-Program-Dashboard
ToniSant awarded T277651: Outreachdashboard.wmflabs.org is down frequently with database problems a Love token.
Mar 25 2021, 12:32 PM · VPS-Projects, Education-Program-Dashboard

Mar 24 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The database population step is complete (using up 61G out of 177G on the Cinder volume). I've turned the Dashboard web server back on, and I'll gradually enable related services.

Mar 24 2021, 9:24 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Database population seems to have sped up some, and is now estimated to complete in 2.5 more hours (which means I might be able to get the Dashboard up and running this evening).

Mar 24 2021, 6:00 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Meanwhile, I'm setting up a pair of test VMs to try out running the system in this two-server configuration to identify any extra hurdles to expect once the database finishes loading.

Mar 24 2021, 4:40 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Hi, can we have more details on when the dashboard problem will be resolved? Just to know if it will be operational this weekend.

Mar 24 2021, 4:23 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I've got the backup file transferred to the new VM, I've configured mariadb to use /srv/mysql, and I've started loading the data (pv 2021-03-23_22h10m_12.sql.gz | gunzip | mysql dashboard). It looks like it's going to take all day to complete the load, so it will probably be tomorrow when I can try to bring the Dashboard back online with the new database server.

Mar 24 2021, 3:53 PM · VPS-Projects, Education-Program-Dashboard

Mar 23 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The backup is complete. I've spun up a new VM to put the database on. Setting up the Cinder volume was a breeze; I used 180 of the 200GiB quota for this volume (an uncompressed backup is huge, so this doesn't actually provide a huge amount of headroom, but it should be fine for now).

Mar 23 2021, 11:56 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

As before, running a dump errors out in the revisions table, so I've edited /etc/mysql/mariadb.conf.d/50-server.cnf again to set InnoDB recovery mode 1. This time I'll do a complete backup and then use that .sql file to populate a new database on a new VM (with MariaDB 10.3, instead of 10.1 which it's currently using).

Mar 23 2021, 10:17 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T278132: Request increased quota for globaleducation Cloud VPS project.

+1 from me for granting 200GiB of Cinder quota today.

One of the very nice things about Cinder volumes is that they are resizable, so if this space becomes well used additional quota can be granted to grow the volume(s) in the future with minimal disruption to the project.

Mar 23 2021, 10:01 PM · User-bd808, cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
Ragesoss added a comment to T278132: Request increased quota for globaleducation Cloud VPS project.

Okay.

Mar 23 2021, 9:57 PM · User-bd808, cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Thanks @Legoktm. I just took the system down to run a dump. Not sure exactly how to hook up the database to a separate VM, but I think I can figure it out. I'll attempt to go with the move @bd808 suggests.

Mar 23 2021, 9:49 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

I'm currently unsure what the cause of the corruption is, but I've read this: "Most InnoDB corruptions are hardware-related. Corrupted page writes can be caused by power failures or bad memory. The issue also can be caused by using network-attached storage (NAS) and allocating InnoDB databases on it." (source)

Mar 23 2021, 8:34 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss reopened T277651: Outreachdashboard.wmflabs.org is down frequently with database problems as "Open".

Although it's mostly only affecting the update processes for slow-updating programs, it appears that problems with the Revisions index are back. I'm not sure what to do about this in the short term. I'm investigating the idea of moving more processing outside of the Dashboard, but that's not something that could be done quickly.

Mar 23 2021, 3:08 PM · VPS-Projects, Education-Program-Dashboard

Mar 22 2021

Ragesoss closed T277651: Outreachdashboard.wmflabs.org is down frequently with database problems as Resolved.
Mar 22 2021, 5:23 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T278060: Not able to track a Commons category at Dashboard.

Can you provide some additional details? What category did you try to track, and what did you enter as the name of it? What project are you doing this for, and what do you want to be able to selectively track?

Mar 22 2021, 3:00 PM · Education-Program-Dashboard
Ragesoss created T278132: Request increased quota for globaleducation Cloud VPS project.
Mar 22 2021, 2:28 PM · User-bd808, cloud-services-team (Kanban), Cloud-VPS (Quota-requests)

Mar 21 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The disk was full due to a large leftover file from the previous repair effort. Looks like things went back to normal as soon as I freed up space.

Mar 21 2021, 2:04 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss closed T278042: 500 Internal Service Error for Outreach Dashboard as Resolved.

This was caused by a full disk.

Mar 21 2021, 2:03 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss closed T278042: 500 Internal Service Error for Outreach Dashboard, a subtask of T277651: Outreachdashboard.wmflabs.org is down frequently with database problems, as Resolved.
Mar 21 2021, 2:02 PM · VPS-Projects, Education-Program-Dashboard

Mar 18 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Note that there's nothing stopping you from querying the wikireplicas directly from a VPS project, you just need to create a Toolforge tool to get credentials (replica.my.cnf), which you can then stick in your VM.

Mar 18 2021, 5:14 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T273067: Outreach Dashboard has been having recurring outages.

I've been working with Rails performance expert Nate Berkopec this month, and we've come to a clearer understanding of the usual cause of recurring outages. In most cases (except for the most recent one), the system gets overloaded through the combination of the background processes (which are responsible for updating stats for courses and tie up a significant portion of system resources on essentially a continual basis) and multiple users requesting particularly slow pages (i.e., ones that load a lot of data). The web server uses up to 6 "Passenger" processes that handle a web request by sending it to the Ruby on Rails application. If all 6 of these are tied up with very slow requests, then the system can't process any more requests in the meantime.
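The saturation effect described in this comment can be illustrated with a toy model of a fixed worker pool. This is not how Passenger is implemented; the first-come, first-served dispatch, the `start_times` helper, and the request durations are assumptions made for the sketch.

```python
import heapq


def start_times(requests: list[tuple[float, float]], workers: int = 6) -> list[float]:
    """Given (arrival, duration) pairs sorted by arrival, return when each starts.

    Each request is handed, in order, to whichever worker becomes free first.
    """
    free_at = [0.0] * workers  # time at which each worker is next available
    heapq.heapify(free_at)
    starts = []
    for arrival, duration in requests:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        starts.append(start)
        heapq.heappush(free_at, start + duration)
    return starts


# Six slow page loads (60 s each) arrive together, then one fast request (0.1 s).
slow = [(0.0, 60.0)] * 6
fast = [(1.0, 0.1)]
saturated = start_times(slow + fast, workers=6)
with_spare = start_times(slow + fast, workers=7)
```

With 6 workers the fast request cannot start until a slow one finishes, even though it needs only a fraction of a second of work; with one spare worker it starts as soon as it arrives.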

Mar 18 2021, 4:30 PM · Cloud-VPS, Education-Program-Dashboard
Ragesoss closed T277651: Outreachdashboard.wmflabs.org is down frequently with database problems as Resolved.
Mar 18 2021, 4:01 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

To wrap up, my best guess at what happened is:

  • Sometime on March 16, the index of the revisions table became corrupted.
  • Whenever the database looked at the affected index entry, the database server would crash, interrupting any web requests and background stats jobs that were running at the time.
  • By the time the database could restart, the web server had a full queue of unprocessed requests waiting.
  • For reasons I don't understand yet, this also tied up the system such that it stopped processing web requests altogether and also became unresponsive via SSH.
  • Upon restarting the server, it would run fine for a few minutes until once again encountering the corrupted database index entry.
Mar 18 2021, 4:01 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The system is still stable, so I've just enabled the final Sidekiq processes for long stats updates and the daily job. I've also re-enabled the weekly database backup script (which runs on Sunday), and deployed a minor change to help with logging slower courses.

Mar 18 2021, 3:45 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Still going fine, so I've enabled the short and medium Sidekiq processes. The slowest programs, and the daily_update job, are still disabled.

Mar 18 2021, 12:36 AM · VPS-Projects, Education-Program-Dashboard

Mar 17 2021

Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Things are going fine so far. I've just re-enabled the constant_update and schedule_course_updates cron jobs, and the constant_update Sidekiq process that handles them.

Mar 17 2021, 10:32 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The Dashboard is back up and running. I haven't enabled or restarted any of the background processes yet except for the 'default' one that handles transactional background jobs (like wiki edits and error reporting), so it hasn't started pulling in new data yet.

Mar 17 2021, 9:55 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

Reloading the revisions table has completed. I'm going to restart the frontend.

Mar 17 2021, 9:47 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

@Theklan done. (I added .htaccess, enabled mod_rewrite, and set a Directory stanza with AllowOverride All in the site config.)

Mar 17 2021, 8:00 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

After dumping the revisions table, I've done drop table revisions;. Then I took mysqld out of recovery mode and restarted it. Now I'm running mysql dashboard < revisions.sql to restore the revisions table from the dump.
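
The dump, drop, and reload sequence can be summarized with a short sketch that assembles the shell commands in order without running them. The helper name is hypothetical, and the `mysql -e` form of the drop step is an assumption; the comment only says the statement was run, not how.

```python
import shlex


def table_reload_commands(database, table, dump_file):
    """Return the shell commands for dumping, dropping, and reloading a table.

    The commands are only built here, not executed, so they can be
    reviewed before anything destructive runs.
    """
    db = shlex.quote(database)
    tbl = shlex.quote(table)
    dump = shlex.quote(dump_file)
    drop_sql = shlex.quote(f"DROP TABLE {table};")
    return [
        # 1. Dump just the affected table.
        f"mysqldump {db} {tbl} > {dump}",
        # 2. Drop the table (assumed here to be issued through mysql -e).
        f"mysql -e {drop_sql} {db}",
        # In the incident above, mysqld was taken out of recovery mode
        # and restarted between the drop and the reload.
        # 3. Reload the table from the dump.
        f"mysql {db} < {dump}",
    ]


for command in table_reload_commands("dashboard", "revisions", "revisions.sql"):
    print(command)
```

Reloading from a dump recreates the table and its indexes from scratch, rather than reusing the existing index data.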

Mar 17 2021, 7:16 PM · VPS-Projects, Education-Program-Dashboard
Ragesoss added a comment to T277651: Outreachdashboard.wmflabs.org is down frequently with database problems.

The backup script ran without issue in recovery mode, so I have an up-to-date, possibly valid backup of the whole DB now. I'm currently running a dump of just the revisions table (mysqldump dashboard revisions > revisions.sql) in preparation for dropping and importing it.

Mar 17 2021, 5:41 PM · VPS-Projects, Education-Program-Dashboard