
etherpad database issues
Closed, Resolved · Public

Description

Etherpad, our real-time collaborative editing tool, suffered an outage starting at 14:27 UTC. All we know for now is that the cause was database corruption. The problem was detected shortly after it happened, and we (the ops in charge of the service and the database) worked to reestablish the service.

As the service kept crashing despite our efforts, we decided to restore a database backup from 2016-06-22 01:00:01 UTC. The service is now back up and working (since 16:11), but this means that you may have lost a day and a half of edits.

I understand that this may cause a lot of inconvenience, especially for the people at Wikimania. We are now trying to recover more than that, but since the crashes could come back, not everything may be recoverable, and the service itself is needed, the plan is the following:

  • Keep the current pads as they are, and do not delete any edit made from now on.
  • If possible, recover the last days of edits at a separate address.

This ticket will be updated with progress.

Edit: The previous version was recovered and is available at https://etherpad-restore.wikimedia.org. Make sure you recover all lost pads; it will only be available for a week.

Event Timeline

jcrespo created this task. · Jun 23 2016, 4:51 PM
Restricted Application added subscribers: Zppix, Aklapper. · Jun 23 2016, 4:51 PM
jcrespo moved this task from Triage to In progress on the DBA board.
jcrespo triaged this task as High priority. · Jun 23 2016, 5:02 PM
jcrespo updated the task description. · Jun 23 2016, 5:55 PM
jcrespo added a comment. · Edited · Jun 23 2016, 8:31 PM

So, good news: we have been able to recover data up to just a few minutes before the crash (which means virtually no data loss).

The problem is that we have yet to reimport it and, most importantly, check that those versions are usable (that they do not crash the software).

Can you please make sure not to overwrite the things added later? I re-did a bunch of the work I did this afternoon in preparation for the discussions tomorrow, under the same URL. I think you were planning this already, just making sure. Thanks!

@Effeietsanders as I said in my email: no data will be added, deleted or overwritten on the current etherpad. I promised that and I will keep that promise. We thought about that in advance.

The idea is to make one restore available on a separate URL and let everybody self-serve their own recovery.

Clarification: no data will be added, deleted or overwritten on the current etherpad by us (operators); you are expected to keep doing that yourselves as usual (use the current https://etherpad.wikimedia.org/).

jcrespo added a comment. · Edited · Jun 23 2016, 8:52 PM

@akosiaris I managed to reimport the tables, with two different timestamps. They are on the same host (m1-master), and I have granted permission to the same users as the regular etherpad:

  • etherpadlite_restore database contains a snapshot of the table (store) from this morning
  • etherpadlite_restore2 database contains a snapshot of the table (store) from this afternoon

We want to try etherpadlite_restore2 first; if that doesn't work (crashes) we try etherpadlite_restore.

Hope it is clear.
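The fallback order described above (newest snapshot first, older one only if the newest crashes the software) can be sketched roughly as follows. This is only an illustration: `pick_restore_database` and the `is_usable` callback are hypothetical names, and the real usability check (starting etherpad against the database and watching for crashes) is not shown here.

```python
from typing import Callable, Optional, Sequence


def pick_restore_database(
    candidates: Sequence[str],
    is_usable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate database that passes the usability check.

    `candidates` must be ordered from most to least preferred.
    Returns None if no candidate is usable.
    """
    for name in candidates:
        if is_usable(name):
            return name
    return None


# Newest snapshot first, as proposed in the comment above.
CANDIDATES = ["etherpadlite_restore2", "etherpadlite_restore"]


def demo_check(name: str) -> bool:
    # Hypothetical stand-in for "etherpad runs against this database
    # without crashing"; here we pretend the afternoon snapshot fails.
    return name == "etherpadlite_restore"


chosen = pick_restore_database(CANDIDATES, demo_check)
print(chosen)
```

Keeping the check as a callback means the ordering logic stays the same whatever the actual test turns out to be.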

Change 295757 had a related patch set uploaded (by Alexandros Kosiaris):
cache::misc: Set up a temporary etherpad host

https://gerrit.wikimedia.org/r/295757

Change 295757 merged by Alexandros Kosiaris:
cache::misc: Set up a temporary etherpad host

https://gerrit.wikimedia.org/r/295757

Change 295760 had a related patch set uploaded (by Alexandros Kosiaris):
Set up etherpad-restore.wikimedia.org

https://gerrit.wikimedia.org/r/295760

Change 295760 merged by Alexandros Kosiaris:
Set up etherpad-restore.wikimedia.org

https://gerrit.wikimedia.org/r/295760

Thanks to @jcrespo's efforts, and using the etherpadlite_restore2 database, we now have http://etherpad-restore.wikimedia.org. This is a temporary host that we will keep around for a few days so that users can access the pads created or modified during the 2016-06-22 01:00:01 UTC - 2016-06-23 14:27 UTC period (~1.5 days) and restore their content.

Please DO use it for any pad whose content you care about. It will not stay around for long.
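For reference, the length of the affected window can be checked with a few lines of Python. The two timestamps are the ones quoted above; the variable names are only illustrative.

```python
from datetime import datetime, timezone

# Timestamps taken from this ticket (both UTC).
backup_taken = datetime(2016, 6, 22, 1, 0, 1, tzinfo=timezone.utc)
outage_start = datetime(2016, 6, 23, 14, 27, tzinfo=timezone.utc)

# Edits made in this window are missing from the restored backup.
window = outage_start - backup_taken
window_days = window.total_seconds() / 86400

print(window, round(window_days, 2))
```

This comes out at a little over one and a half days, which matches the "~1.5 days" estimate.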

Change 295763 had a related patch set uploaded (by Alexandros Kosiaris):
Introduce etherpad100b.eqiad.wmnet

https://gerrit.wikimedia.org/r/295763

Change 295763 merged by Alexandros Kosiaris:
Introduce etherpad100b.eqiad.wmnet

https://gerrit.wikimedia.org/r/295763

Change 295764 had a related patch set uploaded (by Alexandros Kosiaris):
etherpad-restore: Use etherpad1001b

https://gerrit.wikimedia.org/r/295764

Change 295764 merged by Alexandros Kosiaris:
etherpad-restore: Use etherpad1001b

https://gerrit.wikimedia.org/r/295764

It is finally working: https://etherpad-restore.wikimedia.org (if it does not, wait for your DNS cache to update).

Please recover anything you want and a) copy it to https://etherpad.wikimedia.org, then b) make a personal copy if you do not want to risk losing it again. The above URL will only be available for a week before we delete it.

jcrespo updated the task description. · Jun 23 2016, 10:46 PM

Thanks a lot for recovering the data, you're saving me some bitter teardrops.

jcrespo lowered the priority of this task from High to Low. · Jun 24 2016, 11:15 AM

There are no pending recovery actions; we are just waiting to remove etherpad-restore and to check that backups continue as usual once Wikimania is over.

Base added a subscriber: Base. · Jun 25 2016, 1:22 PM

Mentioned in SAL [2016-07-05T07:29:51Z] <akosiaris> T138516 stop the secondary etherpad instance on etherpad1001. etherpad-restore.wikimedia.org has served its purpose, killing it

Change 297352 had a related patch set uploaded (by Alexandros Kosiaris):
Revert "cache::misc: Set up a temporary etherpad host"

https://gerrit.wikimedia.org/r/297352

Change 297353 had a related patch set uploaded (by Alexandros Kosiaris):
Revert "Set up etherpad-restore.wikimedia.org"

https://gerrit.wikimedia.org/r/297353

Change 297353 merged by Alexandros Kosiaris:
Revert "Set up etherpad-restore.wikimedia.org"

https://gerrit.wikimedia.org/r/297353

Change 297352 merged by Alexandros Kosiaris:
Revert "cache::misc: Set up a temporary etherpad host"

https://gerrit.wikimedia.org/r/297352

Mentioned in SAL [2016-07-05T07:40:05Z] <akosiaris> T138516 forcing a puppet run on cache::misc hosts after merging https://gerrit.wikimedia.org/r/297352

Mentioned in SAL [2016-07-05T07:55:04Z] <jynus> dropping etherpad_restore2 database from m1 T138516

akosiaris closed this task as Resolved. · Jul 5 2016, 7:55 AM
akosiaris claimed this task.

etherpad-restore.wikimedia.org has been removed today. I am considering this resolved now since tracking that was the only reason this was left open.

In retrospect, it would have been nice to make a list of etherpad titles which existed in etherpad-restore but not etherpad. (Given it was a Wikimania day, they'd mostly be public notekeeping pads.)

"it would have been nice to make a list of etherpad titles"

Have you ever looked at the etherpad MySQL schema? Hint: it has one single table with 2 columns, and a row looks like this:

globalAuthor:a.ERreterERtreDFert | {"colorId":24,"name":null,"timestamp":1925327486894,"padIDs":{"fegdfgdsrtewsrtewt":1}}

Sadly the above is the best piece of open-source software available for real-time collaborative editing. Sigh
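For what it is worth, a rough way to build such a list from that two-column table might look like the sketch below. It is a minimal illustration, not something that was run against the real databases: it assumes the rows of each database have already been dumped as (key, value) pairs, and it only finds pads that show up in the `padIDs` map of some `globalAuthor:` record, like the row shown above. The function names and the sample rows are made up.

```python
import json
from typing import Iterable, List, Set, Tuple

Row = Tuple[str, str]  # (key, value) pair from the single two-column table


def pad_ids(rows: Iterable[Row]) -> Set[str]:
    """Collect pad IDs referenced by globalAuthor records."""
    found: Set[str] = set()
    for key, value in rows:
        if not key.startswith("globalAuthor:"):
            continue
        try:
            record = json.loads(value)
        except json.JSONDecodeError:
            continue  # skip rows whose value is not valid JSON
        pads = record.get("padIDs") if isinstance(record, dict) else None
        if isinstance(pads, dict):
            found.update(pads)  # the keys of padIDs are the pad IDs
    return found


def only_in_restore(restore_rows: Iterable[Row], live_rows: Iterable[Row]) -> List[str]:
    """Pad IDs present in the restored snapshot but not in the live store."""
    return sorted(pad_ids(restore_rows) - pad_ids(live_rows))


# Made-up sample rows in the same shape as the example above.
restore_rows = [
    ("globalAuthor:a.one", '{"colorId":24,"name":null,"padIDs":{"notes-day1":1,"notes-day2":1}}'),
]
live_rows = [
    ("globalAuthor:a.one", '{"colorId":24,"name":null,"padIDs":{"notes-day1":1}}'),
]

print(only_in_restore(restore_rows, live_rows))
```

Comparing sets of IDs only shows which pads are missing entirely; a pad that exists in both stores but with different content would not be flagged by this approach.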