The database server for Programs & Events Dashboard crashed early on 2025-12-04. This task will track related problems and the work to restore full functionality.
Description
Event Timeline
Day 1: 2025-12-04
On the morning of 2025-12-04, I logged on to find that Programs & Events Dashboard was down, with the web server unable to connect to the database.
I restarted the peony-database instance (which runs the mariadb server, with data stored on an attached cinder volume), but it would not come back up, reporting an error mounting the filesystem each time. I'm not sure what caused that filesystem to fail.
I detached the cinder volume, then deleted the peony-database server instance and spun up a new one. After I set up mariadb (a somewhat newer version than was running previously, but still 10.x) and configured it to use the cinder volume, it failed to start because of index corruption errors, and there were also transactions that were scheduled for rollback.
Since we had enough space on the cinder volume, at this point I made a complete copy of the data directory before proceeding: /srv/mysql.bak.
I proceeded to dump as much of the data as I could into sql files, using mysqldump. The following tables had index corruption and could not be completely extracted: articles_courses, article_course_timeslices, articles, course_wiki_timeslices, users. Most of these represent data that can be regenerated, except for users, so I attempted to restore as much of the users table as possible. While it seemed that only two rows could not be extracted because of the index corruption (which I worked around by dumping up to the affected rows, and then separately dumping after the affected rows), I believe that additional user records were lost and are not recoverable. The latest creation date for the recovered user records was November 14 2025, so I think we've essentially lost all the records of new users added for tracking since that date. (For the other corrupted tables, I dumped all records up to the first row with index corruption, but did not get the later records, since these should all be possible to regenerate through the Dashboard's update process.) I completed the data dumps by the end of the day on 2025-12-04.
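The workaround described above for the users table (dump up to the affected rows, then dump separately after them) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact commands that were run: it assumes the unreadable rows can be bracketed by a numeric primary key named id, the boundary ids are made-up placeholders, and the hypothetical helper dump_around_cmds only prints the two mysqldump commands so they can be reviewed before running.

```shell
#!/bin/sh
# Print (not run) two mysqldump commands that dump a table on either side
# of a range of unreadable rows, bracketed by primary key.
# Assumptions: the table has a numeric primary key named `id`, and
# first_bad/last_bad are hypothetical ids located by trial and error.
dump_around_cmds() {
  db=$1; table=$2; first_bad=$3; last_bad=$4
  # Rows before the damaged range. This dump keeps the CREATE TABLE statement.
  printf 'mysqldump --single-transaction --where="id < %s" %s %s > %s-before.sql\n' \
    "$first_bad" "$db" "$table" "$table"
  # Rows after the damaged range. --no-create-info omits CREATE TABLE,
  # so this file can be loaded on top of the first one.
  printf 'mysqldump --single-transaction --no-create-info --where="id > %s" %s %s > %s-after.sql\n' \
    "$last_bad" "$db" "$table" "$table"
}

# Example with made-up boundary ids:
dump_around_cmds dashboard users 1000 1001
```

Loading the "before" file first and the "after" file second would then rebuild the table without the rows in between. Whether a range condition like this can be read at all depends on which index is damaged.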
Day 2: 2025-12-05
This morning I rebuilt the database using the data dumps. One file contained all the tables that didn't have index corruption:
sudo mysqldump --single-transaction --routines --triggers \
  --databases dashboard \
  --ignore-table=dashboard.articles_courses \
  --ignore-table=dashboard.article_course_timeslices \
  --ignore-table=dashboard.articles \
  --ignore-table=dashboard.course_wiki_timeslices \
  --ignore-table=dashboard.revisions \
  --ignore-table=dashboard.users \
  > /srv/database-dump/dashboard-no-bad-tables.sql
(The revisions table was not corrupted, but is large and no longer used by the system.)
After loading the good tables, I loaded the partially recovered data from each of the corrupted tables. Loading the course_wiki_timeslices file resulted in a duplicate key error before the end, but the others loaded without error. Since that table holds data that can be regenerated, I proceeded without trying to address the error.
At this point, with at least partial data restored to all the required tables, I reconnected the Rails web server, bringing the Dashboard site back online.
Exploring the system, I found errors occurring related to orphaned records: courses_users records with missing user records, and articles_courses records with missing article records. The courses_users records represent enrollments in courses. 1823 of these were orphaned, so I deleted these. Basically this represents 1823 times when a user was added to a course — I think exclusively since November 14 — that were lost. So, courses that have added users since then are probably missing users. There were also 5560957 missing article records referenced by 9473774 existing articles_courses records; these articles_courses records also need to be deleted. These represent an article edited during a particular course, so the Articles tabs of courses and campaigns will be incomplete until the data for the affected courses is regenerated. (The deletion is currently in progress.)
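The orphan checks and cleanup described above can be sketched as SQL generated by two small helpers. This is a sketch under assumptions rather than the queries actually used: the foreign key column names (user_id, article_id) follow the usual Rails convention and are assumed here, the batch size is arbitrary, and the helper names are hypothetical. The helpers only print SQL, so nothing is deleted until the output is reviewed and passed to the mysql client.

```shell
#!/bin/sh
# Print SQL that counts child rows whose parent row is missing.
# Assumption: Rails-style schema where child.<fk> references parent.id.
orphan_count_sql() {
  child=$1; fk=$2; parent=$3
  printf 'SELECT COUNT(*) FROM %s c LEFT JOIN %s p ON p.id = c.%s WHERE p.id IS NULL;\n' \
    "$child" "$parent" "$fk"
}

# Print SQL that deletes one batch of orphaned child rows. Deleting in
# batches keeps each transaction small; rerun until no rows are affected.
orphan_delete_batch_sql() {
  child=$1; fk=$2; parent=$3; batch=$4
  printf 'DELETE FROM %s WHERE NOT EXISTS (SELECT 1 FROM %s WHERE %s.id = %s.%s) LIMIT %s;\n' \
    "$child" "$parent" "$parent" "$child" "$fk" "$batch"
}

# Example: count orphaned enrollments, then one delete batch for articles_courses.
orphan_count_sql courses_users user_id users
orphan_delete_batch_sql articles_courses article_id articles 10000
```

With millions of orphaned articles_courses rows, a batched delete like this would need to be repeated many times, which is one reason a cleanup of this size can take days.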
I'll continue identifying errors related to missing database records and clean up orphaned records to restore the system to a consistent state, and once that is complete I will turn the update system on so that current courses will (hopefully) generate their missing data — except for the missing enrollment records.
Thank you very much, Ragesoss, for your restoration work. Which data records or which time period have actually been lost and cannot be restored? Background: We have a challenge that started on December 1. Best regards
If the list of participants includes everyone you expected, then the rest of the data should be regenerated once updates are back online. If the list of participants is empty or incomplete, then that data may be lost and you'd need to re-add usernames if you know them.
Day 5: 2025-12-08
I finished deleting the orphaned articles_courses records over the weekend, and it looks like that has stopped the frontend errors, so the database is hopefully in a consistent state now. I'm going to begin turning on the background queues.
Update: I've now enabled all the background jobs, so events are getting data updates. It is likely to take a while to catch up with processing them.