Today the library had a brief outage (T330083). When it came back up, we had lost the database and had to restore from backup. We should take steps to reduce both the likelihood and the impact of data loss.
Event Timeline
I verified that the system can now withstand a deploy:
- first, I tested on staging
- then I took a backup on prod
- then I forced a deploy on prod, which restarted the database because it had an image update
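The backup-then-deploy order above could be enforced with a small guard that refuses to proceed unless a recent backup exists. This is only a minimal sketch, not what the TWLight deploy tooling actually does: the function name `backup_is_fresh`, the one-hour threshold, and the example path are all hypothetical.

```python
import os
import time


def backup_is_fresh(path, max_age_seconds=3600, now=None):
    """Return True if `path` exists, is non-empty, and was modified recently.

    `max_age_seconds` is an assumed threshold, not a project setting.
    """
    if now is None:
        now = time.time()
    try:
        stat = os.stat(path)
    except FileNotFoundError:
        # No backup file at all.
        return False
    if stat.st_size == 0:
        # An empty dump is treated as a failed backup.
        return False
    return (now - stat.st_mtime) <= max_age_seconds


if __name__ == "__main__":
    # Hypothetical backup location; substitute the real dump path.
    backup_path = "/tmp/example-backup.sql.gz"
    if backup_is_fresh(backup_path):
        print("recent backup found; safe to deploy")
    else:
        print("no recent backup; take one before deploying")
```

A check like this only confirms that a file exists and is recent; it does not prove the dump can be restored, so a periodic test restore on staging would still be worthwhile.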
I started this system back up within minutes of the outage ending, and I wonder whether the storage was not yet in a fully healthy state when I did so. Everything looks fine now, and the running system shows no obvious red flags. Deeper inspection would require unmounting the filesystem, which would mean more downtime for the site.
Rather than spending more time investigating, I propose that I:
- finish some instance cleanup that we started to make more room in the project
- increase backups to 2x daily
- create a fresh prod instance with the same dedicated Docker data volume setup that I used for wikilink
- move to the fresh instance
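To put a number on the backup proposal: if backups are evenly spaced and each one succeeds (both assumptions), the worst-case window of lost data equals the interval between backups. The sketch below computes that interval and the upcoming backup times; the function names and the start time are illustrative and not taken from the TWLight configuration.

```python
from datetime import datetime, timedelta


def backup_interval(backups_per_day):
    """Interval between evenly spaced backups; also the worst-case loss window."""
    if backups_per_day < 1:
        raise ValueError("backups_per_day must be at least 1")
    return timedelta(hours=24) / backups_per_day


def next_backups(start, backups_per_day, count):
    """Return the next `count` backup times after `start`."""
    step = backup_interval(backups_per_day)
    return [start + step * i for i in range(1, count + 1)]


if __name__ == "__main__":
    # Illustrative start time only.
    start = datetime(2023, 3, 1, 0, 0)
    print("worst-case loss window:", backup_interval(2))
    for when in next_backups(start, 2, 4):
        print("backup at", when.isoformat())
```

At 2x daily the interval is 12 hours, so a failure just before a scheduled backup would lose at most about 12 hours of changes.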
Pull request to drop syslog and back up every 12 hours:
https://github.com/WikipediaLibrary/TWLight/pull/1132
PR merged; now I need to remove Matomo and create a new production instance in Horizon.
Hotfixed some of our ops that aged poorly:
https://github.com/WikipediaLibrary/TWLight/compare/eb7b1a246a1997fdc75b9fe052bed864ef1ce4cf..918ddb440220395ccbbdfce68a564eec53f5f7ac