Reduce likelihood of Wikipedia Library data loss
Closed, ResolvedPublic
Assigned To

Authored By

Samwalton9-WMF
Feb 20 2023, 2:51 PM

Description

Today the library had a brief period of downtime (T330083). When it came back up, the database had been lost and we had to restore from backup. We should take steps to reduce both the likelihood and the impact of data loss.

Related Objects

Mentioned Here: T330083: The Wikipedia Library Server Error: Can't connect to MySQL server on 'tasks.production_db'

Event Timeline

Samwalton9-WMF created this task. Feb 20 2023, 2:51 PM

Restricted Application added a subscriber: Aklapper. Feb 20 2023, 2:51 PM

I verified that the system can now withstand a deploy:
first, I tested on staging;
then I took a backup on prod;
then I forced a deploy on prod, which restarted the database because it had an image update.
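
The order of those checks (staging test, then a prod backup, then the forced prod deploy) can be sketched as a small runner that stops at the first failing step, so a deploy is never attempted without a fresh backup. This is only an illustrative sketch: `run_in_order` and the step names are hypothetical and are not part of TWLight or its deployment tooling.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_in_order(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run each (name, action) pair in order and stop at the first failure.

    Returns the names of the steps that succeeded and the name of the
    step that failed (None if every step succeeded).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            # Abort here: later steps (e.g. the prod deploy) must not run.
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical placeholders; real actions would call the actual tooling.
EXAMPLE_STEPS: List[Step] = [
    ("test deploy on staging", lambda: True),
    ("take backup on prod", lambda: True),
    ("force deploy on prod", lambda: True),
]
```

Putting the backup ahead of the deploy in the list is the point of the design: if the backup step reports failure, the runner returns before the database container is restarted.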

I started this system back up within minutes of the outage being over, and I wonder if the storage was not entirely in a good state when I did so. Everything looks fine now, and there are no obvious red flags on the running system. Deeper inspection would require unmounting the filesystem, which would mean more downtime for the site.

Rather than spending more time investigating, I propose that I:

finish the instance cleanup that we started, to make more room in the project
increase backup frequency to twice daily
create a fresh prod instance with the same dedicated docker data volume setup that I used for wikilink
move to the fresh instance
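
To illustrate why the backup frequency matters: with backups every 12 hours, the data at risk in a restore is bounded by roughly 12 hours instead of 24. The sketch below shows a staleness check built on that interval. The function and constant names are hypothetical, and it assumes the timestamp of the last successful backup is available from somewhere; it does not describe how TWLight actually schedules or monitors backups.

```python
from datetime import datetime, timedelta

# Assumed interval, matching the proposed twice-daily schedule.
BACKUP_INTERVAL = timedelta(hours=12)


def backup_is_stale(
    last_backup: datetime,
    now: datetime,
    interval: timedelta = BACKUP_INTERVAL,
) -> bool:
    """Return True if more than one full interval has passed since the last backup."""
    return now - last_backup > interval


def worst_case_loss(backups_per_day: int) -> timedelta:
    """Upper bound on lost data if a failure hits just before the next backup."""
    if backups_per_day < 1:
        raise ValueError("backups_per_day must be at least 1")
    return timedelta(hours=24) / backups_per_day
```

For example, `worst_case_loss(1)` is 24 hours and `worst_case_loss(2)` is 12 hours, which is the reduction the proposal is after.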

jsn.sherman renamed this task from Investigate Wikipedia Library database loss after downtime to Reduce likelihood of Wikipedia Library data loss. Feb 21 2023, 6:24 PM

jsn.sherman updated the task description.

Pull request to drop syslog and back up every 12 hours:
https://github.com/WikipediaLibrary/TWLight/pull/1132

PR merged; now I need to remove Matomo and create a new production instance in Horizon.

Hotfixed some of our ops that aged poorly:
https://github.com/WikipediaLibrary/TWLight/compare/eb7b1a246a1997fdc75b9fe052bed864ef1ce4cf..918ddb440220395ccbbdfce68a564eec53f5f7ac

Samwalton9-WMF closed this task as Resolved. Mar 8 2023, 2:32 PM
