
Reduce likelihood of Wikipedia Library data loss
Closed, Resolved, Public

Description

Today we had a brief downtime of the library (T330083). When it came back up, we had lost the database and had to restore from backup. We should take steps to reduce the likelihood and impact of data loss.

Event Timeline

jsn.sherman subscribed.

I verified that the system can withstand a deploy now:

  • first, I tested on staging
  • then I took a backup on prod
  • then I forced a deploy on prod, which restarted the database since it had an image update

I started the system back up within minutes of the outage ending, and I wonder whether the storage was not yet in an entirely good state when I did so. Everything looks fine now, and there are no obvious red flags on the running system. Deeper inspection would require unmounting the filesystem, which would mean more downtime for the site.

Rather than spending more time investigating, I propose that I:

  • finish some instance cleanup that we started to make more room in the project
  • increase backups to 2x daily
  • create a fresh prod instance with the same dedicated docker data volume setup that I used for wikilink
  • move to the fresh instance

jsn.sherman renamed this task from "Investigate Wikipedia Library database loss after downtime" to "Reduce likelihood of Wikipedia Library data loss". Feb 21 2023, 6:24 PM
jsn.sherman updated the task description.

PR merged; now I need to remove Matomo and create a new production instance in Horizon.